BigData Analysis Using Azure Databricks
![](https://d1y19n2ra9pfoy.cloudfront.net/dwbi.org/media/23bzvxtaAQ4q3MTvcMuVR1z7xhxErnkeCtsgSPWO.png)
Azure Databricks is an Apache Spark-based analytics platform optimized for Microsoft Azure. In this article, we will provision an Azure Databricks Workspace, spin up a cluster, analyze data stored in Azure Blob Storage using a PySpark Notebook & finally schedule the Notebook as a Databricks Job. To begin, log in to the Azure Portal, search for Azure Databricks & click on the Create button.
![Azure Databricks](https://d1y19n2ra9pfoy.cloudfront.net/dwbi.org/media/LTy9uphP0yhB6IPrfaIePAAn9zZ4Jq0IjLHzhkRT.png)
Select the Azure Subscription & Resource Group. Next, add a name for the Azure Databricks Workspace & select the Azure region of your choice. For our demo, we are going to use the Trial (Premium - 14-Days Free DBUs) Pricing Tier. Click on the Next: Networking > button.
![Create Azure Databricks Workspace](https://d1y19n2ra9pfoy.cloudfront.net/dwbi.org/media/zDEk9lr2eUq7as0wDesY5Pokx1hIo1qXffS6ukfJ.png)
For the demo, we are not going to deploy the Databricks cluster with public IP access. Also, we will allow the Databricks workspace to create & manage the VNet & subnets used to deploy the cluster. Click on the Next: Advanced > button.
![Azure Databricks Networking](https://d1y19n2ra9pfoy.cloudfront.net/dwbi.org/media/PS3fmTt9zejsh17VGwTnh70xNaO4z2IYnZy36r6x.png)
Since we are using the free Trial pricing tier, we will skip the Data Encryption options. Click on the Next: Tags > button.
![Azure Databricks Advanced Settings](https://d1y19n2ra9pfoy.cloudfront.net/dwbi.org/media/HrmOjpBHtmOiDjjAs1RTUm5SPygur4hoxQ32pMt3.png)
Add tags to the Azure Databricks Resource for ease of management. Click on the Review + create button.
![Azure Databricks Tags](https://d1y19n2ra9pfoy.cloudfront.net/dwbi.org/media/utN4CAJGHtZlxJBK8gd2YEvXRpAOGKKqc8AYJzvA.png)
Finally review the configuration & click on the Create button.
![Azure Databricks Create Service](https://d1y19n2ra9pfoy.cloudfront.net/dwbi.org/media/8FChWME94yl7OpPDpGoLs9tRG4IivSCWScwXqref.png)
Wait a few moments for the Azure Databricks Workspace to become active. Next, click on the Launch Workspace button. This will take us to the Databricks console.
![Azure Databricks Workspace](https://d1y19n2ra9pfoy.cloudfront.net/dwbi.org/media/pA7qQg4YAz1m8hai3u8pICsyEtvuUTUZ3tCXxcoL.png)
![Databricks Console](https://d1y19n2ra9pfoy.cloudfront.net/dwbi.org/media/f8ivgCukmbduRPwI8ZuQyVadYZUqszPc99ZMdCjg.png)
From the Databricks console, go to the Compute (Clusters) section & create a new cluster. A small cluster with the default Databricks Runtime is sufficient for this demo.
![Databricks Create Cluster](https://d1y19n2ra9pfoy.cloudfront.net/dwbi.org/media/E73GDYjueSfQ3i07VGBlseIXUp7DjQwtAXdlWING.png)
![Databricks Clusters](https://d1y19n2ra9pfoy.cloudfront.net/dwbi.org/media/j1jc9EB03H9LQUtD04X8qhzp7aM8ZvXon5hFDmMm.png)
Once the cluster is up & running, create a new Notebook, set Python as the default language & attach it to the cluster we just created.
![Databricks Notebooks](https://d1y19n2ra9pfoy.cloudfront.net/dwbi.org/media/Qx26o7nafYYVEf9QNku0mVVckiF4nCb7gCWgtqav.png)
![Databricks Create Notebook](https://d1y19n2ra9pfoy.cloudfront.net/dwbi.org/media/nGKKZoRoL1E29Uuh48c7qGCpgScf5cVfMI4mBhdM.png)
Add cells to the Notebook as required.
First, set the name of your Azure Blob Storage account & its Shared Access Signature (SAS) token.

```python
blob_account_name = "bigdatastorageacc124"
blob_sas_token = r"?sv=2017-07-29&ss=b&srt=sco&sp=rwdlacup&se=2021-12-30T00:00:00Z&st=2021-10-26T00:00:00Z&spr=https&sig=...."
```
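As an alternative to the session-scoped SAS configuration used below, the source container can also be mounted to DBFS so that every notebook on the cluster can access it. A minimal sketch, reusing the account, container & SAS token above; the mount point name /mnt/bigdatadataset is an assumption for illustration:

```python
# Sketch: mount the source container on DBFS using the SAS token.
# The mount point name (/mnt/bigdatadataset) is a hypothetical choice.
dbutils.fs.mount(
    source="wasbs://bigdatadataset124@bigdatastorageacc124.blob.core.windows.net",
    mount_point="/mnt/bigdatadataset",
    extra_configs={
        "fs.azure.sas.bigdatadataset124.bigdatastorageacc124.blob.core.windows.net": blob_sas_token
    }
)
# Files can then be read via the mount point,
# e.g. spark.read.csv("/mnt/bigdatadataset/datasets/sales/sales.psv", sep="|")
```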
Next, set the source container name, relative path & the Spark configuration that authorizes access with the SAS token.

```python
src_blob_container_name = "bigdatadataset124"
src_blob_relative_path = "datasets/"

# Build the wasbs:// URL for the source container and register the SAS token
# with the Spark session so it can authenticate against Blob Storage
src_path = 'wasbs://%s@%s.blob.core.windows.net/%s' % (src_blob_container_name, blob_account_name, src_blob_relative_path)
spark.conf.set('fs.azure.sas.%s.%s.blob.core.windows.net' % (src_blob_container_name, blob_account_name), blob_sas_token)
# print('Remote blob path: ' + src_path)
```
Read the blob data files using Spark DataFrames & analyze the dataset.
```python
from pyspark.sql.functions import sum as sum_  # alias to avoid shadowing Python's built-in sum

# Read the pipe-separated sales & product files into DataFrames
df_sales = spark.read.csv(src_path + "sales/sales.psv", sep="|", header="true", inferSchema="true")
df_product = spark.read.csv(src_path + "product/product.psv", sep="|", header="true", inferSchema="true")

# Join sales with products & derive the sale amount per line item
df_product_sales = df_sales.join(df_product, df_sales.product_id == df_product.id, "inner") \
    .select(df_product.make.alias("company"), df_sales.quantity, (df_sales.quantity * df_product.price).alias("amount"))

# Aggregate total sales amount & quantity per company
df_bda_sales = df_product_sales.groupBy("company") \
    .agg(sum_("amount").alias("sales_amount"), sum_("quantity").alias("sales_quantity"))
# df_bda_sales.show()
```
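The same join & aggregation can also be expressed in Spark SQL, which some readers may find easier to validate. A minimal sketch, assuming the DataFrames above are registered as temporary views (the view names sales & product are illustrative):

```python
# Register the source DataFrames as temporary views for SQL access
df_sales.createOrReplaceTempView("sales")
df_product.createOrReplaceTempView("product")

# Equivalent of the DataFrame join + aggregation above
df_bda_sales_sql = spark.sql("""
    SELECT p.make AS company,
           SUM(s.quantity * p.price) AS sales_amount,
           SUM(s.quantity)           AS sales_quantity
    FROM sales s
    INNER JOIN product p ON s.product_id = p.id
    GROUP BY p.make
""")
# df_bda_sales_sql.show()
```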
Similarly, set the target container name, relative path & Spark configuration.

```python
tgt_blob_container_name = "bigdataresultset124"
tgt_blob_relative_path = "resultsets/"

# Build the wasbs:// URL for the target container and register its SAS token
tgt_path = 'wasbs://%s@%s.blob.core.windows.net/%s' % (tgt_blob_container_name, blob_account_name, tgt_blob_relative_path)
spark.conf.set('fs.azure.sas.%s.%s.blob.core.windows.net' % (tgt_blob_container_name, blob_account_name), blob_sas_token)
# print('Remote blob path: ' + tgt_path)
```
Finally, write the result set to Azure Blob Storage as a single CSV file.

```python
# coalesce(1) forces a single output partition so one CSV part file is written
(df_bda_sales.coalesce(1)
    .write.mode("overwrite")
    .option("header", "true")
    .format("com.databricks.spark.csv")  # legacy name for the built-in CSV source
    .save(tgt_path + "bda_sales"))
```
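Note that even with coalesce(1), Spark writes the output as a directory containing a part-* file. If a single, predictably named file is required, one option is to copy the part file after the write. A minimal sketch using dbutils; the target file name bda_sales.csv is an assumption for illustration:

```python
# Locate the part file that Spark produced inside the output directory
out_dir = tgt_path + "bda_sales"
part_file = [f.path for f in dbutils.fs.ls(out_dir) if f.name.startswith("part-")][0]

# Copy it to a single, stable file name and (optionally) remove the directory
dbutils.fs.cp(part_file, tgt_path + "bda_sales.csv")
# dbutils.fs.rm(out_dir, recurse=True)
```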
Click on the Run All button at the top of the Notebook.
![Databricks Notebook](https://d1y19n2ra9pfoy.cloudfront.net/dwbi.org/media/akiwPPy95rfe58ahDIlnudR3Ynfv5QFo9g5b96ee.png)
Next, we will create a Databricks Job to execute our Notebook as a task, either on demand or on a schedule.
![Databricks Job](https://d1y19n2ra9pfoy.cloudfront.net/dwbi.org/media/Kw5E8lXWvu7DoOvnPMOa2oTsg8lVi9V2Eu40pivg.png)
Add a Job name. A Job can have one or many dependent tasks. Next, add a Task name & set the Type as Notebook. Browse & select the Notebook we created. From the Cluster dropdown, select the existing cluster we created earlier. Next, click on the Create button.
![Databricks Create Job](https://d1y19n2ra9pfoy.cloudfront.net/dwbi.org/media/XI1w0YjcvN9YcyCLDwVGrVYu2R101jFxtvLKQYBS.png)
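The same Job can also be created programmatically through the Databricks Jobs API 2.1. A minimal sketch; the workspace URL, notebook path, cluster ID & the DATABRICKS_TOKEN environment variable holding a personal access token are all hypothetical placeholders:

```python
import os
import requests

# Hypothetical values; replace with your workspace URL, notebook path & cluster ID
host = "https://adb-1234567890123456.7.azuredatabricks.net"
headers = {"Authorization": "Bearer " + os.environ["DATABRICKS_TOKEN"]}

payload = {
    "name": "bda-sales-job",
    "tasks": [
        {
            "task_key": "bda-sales-task",
            "notebook_task": {"notebook_path": "/Users/me@example.com/bda_sales"},
            "existing_cluster_id": "1026-123456-abcdefgh",
        }
    ],
}
resp = requests.post(host + "/api/2.1/jobs/create", headers=headers, json=payload)
job_id = resp.json()["job_id"]
print("Created job:", job_id)
```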
Once the Job is created, click on the Run Now button to execute it.
![Databricks Run Job](https://d1y19n2ra9pfoy.cloudfront.net/dwbi.org/media/nExM3utO2MUnfSIZuUgKU7TqNxOrDmUCezmeEfIA.png)
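An on-demand run can likewise be triggered over the Jobs API instead of the Run Now button. A minimal sketch, reusing the hypothetical host, headers & job_id from the creation sketch above:

```python
import time

# Trigger an on-demand run of the job and capture the run ID
resp = requests.post(host + "/api/2.1/jobs/run-now", headers=headers, json={"job_id": job_id})
run_id = resp.json()["run_id"]

# Poll the run state until it completes
while True:
    state = requests.get(host + "/api/2.1/jobs/runs/get",
                         headers=headers, params={"run_id": run_id}).json()["state"]
    if state.get("life_cycle_state") in ("TERMINATED", "SKIPPED", "INTERNAL_ERROR"):
        print("Run finished:", state.get("result_state"))
        break
    time.sleep(30)
```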
Finally, after the analytics job run completes, verify the results in the Azure Storage container.
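Apart from browsing the container in the Azure Portal, the output can also be verified from a notebook by reading it back. A minimal sketch, assuming the session-scoped SAS configuration for the target container is still in place:

```python
# Read the result set back from the target container and inspect it
df_check = spark.read.csv(tgt_path + "bda_sales", header="true", inferSchema="true")
df_check.show()
df_check.printSchema()
```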