Azure Databricks: End-to-end web-based analytics platform
With Azure Databricks it is easy to analyze very large volumes of data using the enormous computing power available in the cloud environment.
What is Databricks and Azure Databricks?
Databricks was developed by the founders of Apache Spark. It is an end-to-end (from development to production) web-based analytics platform that makes it easy to combine Big Data, Data Science and Apache Spark.
In 2017, Microsoft and Databricks entered into a collaboration, under the name Azure Databricks, that made it possible to fully integrate the Databricks platform into an Azure environment.
This collaboration between Azure as the cloud provider and Databricks as the Apache Spark platform brings the huge computing power of Databricks into a fully integrated cloud environment where the services speak the same language, now including the Databricks framework.
Databricks and Azure Databricks features
Databricks was designed to be easy to work with, so the solution contains a number of user-friendly features:
- Real-time visualizations in the Notebook
- Integration with GitHub and Bitbucket
- Databricks' own Notebook, which acts as its Interactive Development Environment (IDE)
- An internal dashboard connected to your Notebook
- Direct scheduling of each Notebook
- Notebooks that support multiple languages: R, Python, SQL, and Scala
Azure Databricks and Apache Spark
One of the great strengths of the collaboration between Azure and Databricks is that you have an Apache Spark platform that is fully integrated with well-known Azure components such as Azure Data Factory and Azure Blob Storage, allowing for continuous pipelines in each project.
A related and important aspect of Databricks is that the same environment can be shared by different profiles, which makes it easier and more secure for profiles such as Data Engineers and Data Scientists to work together on individual projects.
Databricks also has the option of auto-scaling your resources. This means that you can have a cluster that automatically adapts to what you need at any given time. In general, the entire clustering aspect is handled by Databricks, which makes it easy to get started with cluster computing, even for beginners. When you reach a more advanced level with Spark, you can also monitor your programs directly from Databricks in order to optimize them.
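To make the auto-scaling idea more concrete, here is a minimal sketch of what an auto-scaling cluster specification might look like, written as a Python dictionary. The field names follow the Databricks Clusters API as we understand it, but the cluster name, runtime label, VM size and worker counts are purely illustrative assumptions, and the helper function is hypothetical. Check the current Databricks documentation before using anything like this.

```python
import json

# Illustrative cluster specification. All values below are placeholders
# chosen for this example, not recommendations.
cluster_spec = {
    "cluster_name": "example-autoscaling-cluster",  # hypothetical name
    "spark_version": "13.3.x-scala2.12",            # example runtime label
    "node_type_id": "Standard_DS3_v2",              # example Azure VM size
    "autoscale": {
        "min_workers": 2,   # the cluster never shrinks below this
        "max_workers": 8,   # the cluster never grows beyond this
    },
    "autotermination_minutes": 30,  # shut down after 30 idle minutes
}


def has_valid_autoscale_range(spec):
    """Hypothetical helper: check that the worker range is sensible."""
    autoscale = spec.get("autoscale", {})
    low = autoscale.get("min_workers")
    high = autoscale.get("max_workers")
    return isinstance(low, int) and isinstance(high, int) and 1 <= low <= high


if has_valid_autoscale_range(cluster_spec):
    # Serialize to JSON, the format such a specification is usually sent in.
    print(json.dumps(cluster_spec, indent=2))
```

The point of the sketch is the "autoscale" block: instead of a fixed number of workers, you give a lower and an upper bound and leave the choice within that range to the platform.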
How Databricks can be used
In addition to the general benefits of an easy overview of your pipelines and the advantage of cluster computing when processing very large volumes of data, here are two examples where Databricks makes particular sense:
Tasks that include using Machine Learning or Deep Learning
In addition to easy and clear integration with all the major Machine Learning libraries, as well as with the necessary steps that come before and after Machine Learning development, Databricks has the advantage that the very large computations or data volumes commonly associated with Machine Learning and Deep Learning can easily be orchestrated on your automatically started cluster.
Tasks that include using real time streaming data such as IoT data from machines
Here you can take advantage of Databricks' Apache Spark foundation, more specifically the extension to Apache Spark's core API: Spark Streaming, which is set up to work with frequently used data sources such as Kafka, HDFS and Twitter. Spark Streaming is also fully integrated into Databricks and is very easy to set up and use.
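Spark Streaming handles a live stream by cutting it into small time-based batches and processing each batch in turn. The sketch below illustrates that micro-batch idea in plain Python. It does not use Spark itself, and the machine IDs, temperature readings, function names and the 5-second window are all made-up assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean


def micro_batches(events, batch_seconds):
    """Group (timestamp, machine_id, value) events into fixed-length time windows."""
    batches = defaultdict(list)
    for timestamp, machine_id, value in events:
        # Every event belongs to the window that starts at a multiple of batch_seconds.
        window_start = int(timestamp // batch_seconds) * batch_seconds
        batches[window_start].append((machine_id, value))
    # Return the windows in time order.
    return sorted(batches.items())


def average_per_machine(batch):
    """Aggregate one micro-batch: the mean reading for each machine."""
    readings = defaultdict(list)
    for machine_id, value in batch:
        readings[machine_id].append(value)
    return {machine_id: mean(values) for machine_id, values in readings.items()}


# Hypothetical IoT readings: (seconds since start, machine ID, temperature)
events = [
    (0.5, "m1", 70.0),
    (1.2, "m2", 65.0),
    (4.9, "m1", 72.0),
    (5.1, "m1", 80.0),
    (7.3, "m2", 66.0),
]

for window_start, batch in micro_batches(events, batch_seconds=5):
    print(window_start, average_per_machine(batch))
```

In a real streaming job the events would arrive continuously from a source such as Kafka, and the cluster would process each batch in parallel. Here the whole list is known up front so that the example stays small.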
Is Azure Databricks for you?
Many different profiles can benefit from Databricks, but overall it makes sense if you:
- Have many different profiles working together
- Need to be able to handle a varying query load
- Work with very large amounts of data and/or very heavy calculations
- Would like everything in one place with a user-friendly design and set-up
- Want direct scheduling of each Notebook
- Want the ability to switch freely between the languages R, Python, SQL, and Scala, even in the same Notebook
What is Apache Spark?
Apache Spark is an open-source distributed cluster computing framework that started as a research project at UC Berkeley in 2009 and was donated to the Apache Software Foundation in 2013. In recent years it has become one of the preferred platforms for Artificial Intelligence and real-time applications. But what does it really mean to be an “open source distributed cluster computing framework”?
Open source means software whose source code is freely available for use and contribution. In this case, that code is written primarily in Scala, on which Apache Spark is built.
Distributed cluster computing means that the programs you execute are processed (distributed) on a group of computers (a cluster).
You could say that you break down a bigger problem into smaller problems and let each node (computer) in the cluster handle a smaller chunk of the task at the same time, so that you reach the result much faster.
This distribution (mapping) of the tasks to the available nodes, as well as the aggregation of the individual results (reducing), occurs automatically and in most cases saves time when working with very large data volumes.
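The map and reduce steps described above can be sketched in a few lines of plain Python. This is only an illustration of the principle, not how Spark is implemented: a local thread pool stands in for the nodes of a cluster, and the function names and the choice of four workers are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_into_chunks(values, n_chunks):
    """Split a list into at most n_chunks roughly equal parts, one per worker."""
    size = max(1, -(-len(values) // n_chunks))  # ceiling division, at least 1
    return [values[i:i + size] for i in range(0, len(values), size)]


def partial_sum(chunk):
    """The 'map' step: each worker computes the result for its own chunk."""
    return sum(chunk)


def distributed_sum(values, n_workers=4):
    """Sum a list by mapping chunks to workers and reducing the partial results."""
    chunks = split_into_chunks(values, n_workers)
    # The thread pool plays the role of the cluster nodes in this sketch.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(partial_sum, chunks))
    # The 'reduce' step: combine the partial results into one final answer.
    return reduce(lambda a, b: a + b, partials, 0)


total = distributed_sum(list(range(1, 1001)))
print(total)
```

On a real cluster the chunks would live on different machines and the framework would decide where each piece of work runs. The structure is the same, though: split, compute in parallel, combine.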