BigData Analysis Using Azure Databricks

Select the Azure Subscription & Resource Group. Next, add a name for the Azure Databricks Workspace and select the Azure region of your choice. For our demo, we are going to use the Trial (Premium - 14-Days Free DBUs) pricing tier. Click on the Next: Networking > button.

For the demo, we are not going to deploy the Databricks cluster with public IP access. We will also allow the Databricks workspace to create and manage the VNet and subnet used to deploy the cluster. Click on the Review + create button.

Since we are using the free trial, we will skip the data encryption options. Click on the Review + create button.

Add tags to the Azure Databricks Resource for ease of management. Click on the Review + create button.

Finally, review the configuration and click on the Create button.

Wait a few moments for the Azure Databricks Workspace to become active. Next, click on the Launch Workspace button. This will take us to the Databricks console.


Add cells to the Notebook as required.
Set the name of your Azure Blob Storage account and its Shared Access Signature (SAS) token.
blob_account_name = "bigdatastorageacc124"
blob_sas_token = r"?sv=2017-07-29&ss=b&srt=sco&sp=rwdlacup&se=2021-12-30T00:00:00Z&st=2021-10-26T00:00:00Z&spr=https&sig=...."
Set Source Container name, Path & Spark Config
src_blob_container_name = "bigdatadataset124"
src_blob_relative_path = "datasets/"
src_path = 'wasbs://%s@%s.blob.core.windows.net/%s' % (src_blob_container_name, blob_account_name, src_blob_relative_path)
spark.conf.set('fs.azure.sas.%s.%s.blob.core.windows.net' % (src_blob_container_name, blob_account_name), blob_sas_token)
# print('Remote blob path: ' + src_path)
Read the blob data files using Spark DataFrames and analyze the dataset.
from pyspark.sql.functions import sum as _sum  # alias to avoid shadowing Python's built-in sum
# sales.psv and product.psv are pipe-separated files, hence sep="|"
df_sales = spark.read.csv(src_path + "sales/sales.psv", sep="|", header="true", inferSchema="true")
df_product = spark.read.csv(src_path + "product/product.psv", sep="|", header="true", inferSchema="true")
# Join sales to products and compute the sale amount per line item
df_product_sales = df_sales.join(df_product, df_sales.product_id == df_product.id, "inner").select(df_product.make.alias("company"), df_sales.quantity, (df_sales.quantity * df_product.price).alias("amount"))
# Aggregate total sales amount and quantity per company
df_bda_sales = df_product_sales.groupBy("company").agg(_sum("amount").alias("sales_amount"), _sum("quantity").alias("sales_quantity"))
# df_bda_sales.show()
Set Target Container name, Path & Spark Config
tgt_blob_container_name = "bigdataresultset124"
tgt_blob_relative_path = "resultsets/"
tgt_path = 'wasbs://%s@%s.blob.core.windows.net/%s' % (tgt_blob_container_name, blob_account_name, tgt_blob_relative_path)
spark.conf.set('fs.azure.sas.%s.%s.blob.core.windows.net' % (tgt_blob_container_name, blob_account_name), blob_sas_token)
# print('Remote blob path: ' + tgt_path)
Write Result set to Azure Blob Storage
(df_bda_sales.coalesce(1).write.mode("overwrite").option("header", "true").format("com.databricks.spark.csv").save(tgt_path + "bda_sales"))
Click on the Run All button at the top of the Notebook.

Next, we will create a Databricks Job and run our Notebook as a task, either on demand or on a schedule.

Add a Job name. A Job can have one or many dependent tasks. Next, add a Task name and set the Type to Notebook. Browse and select the Notebook we created. From the Cluster dropdown, select the existing cluster we created earlier. Finally, click on the Create button.
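
If you prefer to automate this step, the same Job can also be created with the Databricks Jobs REST API (version 2.1). Below is a minimal sketch, assuming a personal access token and placeholder values for the workspace URL, Notebook path, and cluster ID (all hypothetical; substitute your own).
import requests
# Hypothetical values: replace with your workspace URL, a personal access token,
# the path of the Notebook created above & the ID of the existing cluster.
host = "https://<databricks-instance>.azuredatabricks.net"
headers = {"Authorization": "Bearer <personal-access-token>"}
job_spec = {
    "name": "bda_sales_job",
    "tasks": [{
        "task_key": "bda_sales_notebook",
        "notebook_task": {"notebook_path": "/Users/<user>/<notebook-name>"},
        "existing_cluster_id": "<cluster-id>",
    }],
}
resp = requests.post(host + "/api/2.1/jobs/create", headers=headers, json=job_spec)
resp.raise_for_status()
job_id = resp.json()["job_id"]
# print("Created job id: " + str(job_id))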

Once the Job is created, click on the Run Now button to execute it.
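
The same run can also be triggered programmatically with the Jobs API run-now endpoint, reusing the job_id returned by the create call in the sketch above (same assumptions apply).
# Trigger an on-demand run of the Job created above (job_id from the previous sketch)
run_resp = requests.post(host + "/api/2.1/jobs/run-now", headers=headers, json={"job_id": job_id})
run_resp.raise_for_status()
# print("Triggered run id: " + str(run_resp.json()["run_id"]))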

Finally, after the analytics Job run completes, verify the results in the Azure Storage container.
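
The result set can also be sanity-checked from a Notebook cell by reading the written files back from the target container, using the tgt_path defined earlier (a quick check, not required for the demo).
# Read the result set back from the target container to verify the write
df_bda_sales_check = spark.read.csv(tgt_path + "bda_sales", header="true", inferSchema="true")
# df_bda_sales_check.show()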