What is Databricks and Azure Databricks?
Databricks was founded by the creators of Apache Spark. It is an end-to-end, web-based analytics platform that covers the path from development to production and makes it easy to combine Big Data, Data Science, and Apache Spark.
In 2017, Microsoft and Databricks entered into a collaboration, under the name Azure Databricks, that made it possible to fully integrate the Databricks platform into an Azure environment.
This collaboration between Azure as the cloud provider and Databricks as the Apache Spark platform brings the computing power of Databricks into a cloud environment where the services speak the same language, now including the Databricks framework.
Azure Databricks and Apache Spark
One of the great strengths of the collaboration between Azure and Databricks is that you get an Apache Spark platform that is fully integrated with familiar Azure components such as Azure Data Factory and Azure Blob Storage, allowing for continuous pipelines in each project.
A related and important aspect of Databricks is the ability to share work across different profiles, making it easier and more secure for roles such as Data Engineers and Data Scientists to work together on individual projects in the Databricks environment.
Databricks also offers auto-scaling of your resources. This means that you can have a cluster that automatically adapts to what you need at any given time. In general, the entire clustering aspect is handled by Databricks, which makes it easy to get started with cluster computing, even for beginners. When you reach a more advanced level with Spark, you can also monitor your programs directly from Databricks in order to optimize them.
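To make the auto-scaling idea concrete, here is a minimal sketch of what a cluster definition with a worker range might look like. The field names (`cluster_name`, `spark_version`, `node_type_id`, `autoscale`, `min_workers`, `max_workers`) follow the Databricks Clusters REST API as we understand it, so check them against the current documentation. The helper function, the cluster name, and the placeholder values are hypothetical and exist only for illustration.

```python
import json


def build_autoscaling_cluster_spec(name, min_workers, max_workers):
    """Return a cluster definition with an autoscaling worker range.

    Hypothetical helper: it only builds the payload, it does not call Databricks.
    """
    if not 0 < min_workers <= max_workers:
        raise ValueError("expected 0 < min_workers <= max_workers")
    return {
        "cluster_name": name,
        # Placeholders: pick a runtime and VM size available in your workspace.
        "spark_version": "<runtime-version>",
        "node_type_id": "<azure-vm-size>",
        # The cluster may grow or shrink between these two bounds.
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
    }


if __name__ == "__main__":
    spec = build_autoscaling_cluster_spec("shared-analytics", 2, 8)
    print(json.dumps(spec, indent=2))
```

With a range like 2 to 8 workers, the cluster can stay small when it is quiet and add nodes when the workload grows, instead of you resizing it by hand.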
Is Azure Databricks for you?
Many different profiles can benefit from Databricks, but overall it makes sense if you:
- Have many different profiles working together
- Need to handle a varying volume of queries
- Work with very large amounts of data and/or very heavy calculations
- Would like everything in one place with a user-friendly design and set-up
- Want direct scheduling of each Notebook
- Want the ability to switch freely between languages R, Python, SQL, and Scala – even in the same Notebook
What is Apache Spark?
Apache Spark is an open-source distributed cluster computing framework that began as a research project at UC Berkeley in 2009 and was donated to the Apache Software Foundation in 2013. In recent years it has become one of the preferred platforms for Artificial Intelligence and real-time applications. But what does it really mean to be an “open-source distributed cluster computing framework”?
Open source means that the source code is freely available for anyone to use and contribute to. In this case, that code is written mainly in Scala, the language on which Apache Spark is built.
Distributed cluster computing means that the programs you execute are processed (distributed) on a group of computers (a cluster).
You could say that you break down a bigger problem into smaller problems and let each node (computer) in the cluster handle a smaller chunk of the task at the same time, so that you reach the result much faster.
The distribution (mapping) of the tasks to the available nodes and the aggregation of the individual results (reducing) happen automatically, and in most cases this saves time when working with very large data volumes.
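The mapping and reducing described above can be sketched in a few lines of plain Python. This is not Spark code: it is a simplified illustration in which worker threads stand in for the nodes of a cluster, and all function names are made up for this example. In Spark, the splitting, scheduling, and combining are handled for you across real machines.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_into_chunks(data, n_chunks):
    """Split data into at most n_chunks contiguous, roughly equal parts."""
    size = max(1, -(-len(data) // n_chunks))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def sum_of_squares(chunk):
    """The 'map' step: each worker only looks at its own chunk."""
    return sum(x * x for x in chunk)


def distributed_sum_of_squares(data, n_workers=4):
    """Break the problem up, process the chunks concurrently, then combine."""
    chunks = split_into_chunks(data, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_results = list(pool.map(sum_of_squares, chunks))
    # The 'reduce' step: aggregate the partial results into one answer.
    return reduce(lambda a, b: a + b, partial_results, 0)


if __name__ == "__main__":
    numbers = list(range(1, 101))
    print(distributed_sum_of_squares(numbers))  # prints 338350
```

On a single laptop this brings no real speed-up, since Python threads do not run CPU-bound work in parallel. The point is the shape of the computation: independent chunks, one partial result per worker, and a final aggregation.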