Logo DWBI.org Login / Sign Up
Sign Up
Have Login?
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Login
New Account?
Recovery
Go to Login
By continuing you indicate that you agree to Terms of Service and Privacy Policy of the site.
Azure Analytics

BigData Analysis Using Azure Databricks

Updated on Oct 26, 2021

BigData Analysis Using Azure Databricks

BigData Analysis Using Azure Databricks

Azure Databricks
Azure Databricks

Select the Azure Subscription & Resource Group. Next add a name for the Azure Databricks Workspace. Select the Azure region as per your choice. For our demo, we are going to use Trial (Premium -  14-Days Free DBUs) Pricing Tire. Click on the Next: Networking > button.

Create Azure Databricks Workspace
Create Azure Databricks Workspace

For the demo we are not going to deploy Databricks Cluster with Public IP Access. Also we will allow Databricks workspace to create & manage VNet & Subnet to deploy the cluster. Click on the Review + create button.

Azure Databricks Networking
Azure Databricks Networking

Since we are using the Free trial we will skip the Data Encryption option. Click on the Review + create button.

Azure Databricks Advanced Settings
Azure Databricks Advanced Settings

Add tags to the Azure Databricks Resource for ease of management. Click on the Review + create button.

Azure Databricks Tags
Azure Databricks Tags

Finally review the configuration & click on the Create button.

Azure Databricks Create Service
Azure Databricks Create Service

Wait for a few moment for the Azure Databricks Workspace to be Active. Next click on the Launch Workspace button. This will take us to the Databricks Console

Azure Databricks Workspace
Azure Databricks Workspace
Databricks Console
Databricks Console
Databricks Create Cluster
Databricks Create Cluster
Databricks Clusters
Databricks Clusters
Databricks Notebooks
Databricks Notebooks
Databricks Create Notebook
Databricks Create Notebook

Add Cells as required.

Set name of your Azure Blob Storage Account & Shared Access Signature.

blob_account_name = "bigdatastorageacc124"
blob_sas_token = r"?sv=2017-07-29&ss=b&srt=sco&sp=rwdlacup&se=2021-12-30T00:00:00Z&st=2021-10-26T00:00:00Z&spr=https&sig=...."

Set Source Container name, Path & Spark Config

src_blob_container_name = "bigdatadataset124"
src_blob_relative_path = "datasets/"
src_path = 'wasbs://%s@%s.blob.core.windows.net/%s' % (src_blob_container_name, blob_account_name, src_blob_relative_path)
spark.conf.set('fs.azure.sas.%s.%s.blob.core.windows.net' % (src_blob_container_name, blob_account_name), blob_sas_token)
# print('Remote blob path: ' + src_path)

Read Blob Data files using Spark Dataframe. Analyze datset.

from pyspark.sql.functions import sum as sum
df_sales = spark.read.csv(src_path + "sales/sales.psv", sep="|", header="true", inferSchema="true")
df_product = spark.read.csv(src_path + "product/product.psv", sep="|", header="true", inferSchema="true")
df_product_sales = df_sales.join(df_product, df_sales.product_id == df_product.id, "inner").select(df_product.make.alias("company"), df_sales.quantity, (df_sales.quantity * df_product.price).alias("amount"))
df_bda_sales = df_product_sales.groupBy("company").agg(sum("amount").alias("sales_amount"), sum("quantity").alias("sales_quantity"))
# df_bda_sales.show()

Set Target Container name, Path & Spark Config

tgt_blob_container_name = "bigdataresultset124"
tgt_blob_relative_path = "resultsets/"
tgt_path = 'wasbs://%s@%s.blob.core.windows.net/%s' % (tgt_blob_container_name, blob_account_name, tgt_blob_relative_path)
spark.conf.set('fs.azure.sas.%s.%s.blob.core.windows.net' % (tgt_blob_container_name, blob_account_name), blob_sas_token)
# print('Remote blob path: ' + tgt_path)

Write Result set to Azure Blob Storage

(df_bda_sales.coalesce(1).write.mode("overwrite").option("header", "true").format("com.databricks.spark.csv").save(tgt_path + "bda_sales"))

Click on the Run All button at the top of the Notebook.

Databricks Notebook
Databricks Notebook

Next we will create a Databricks Job & Execute on demand or Schedule our Notebook as a Task.

Databricks Job
Databricks Job

Add a Job Name. A Job can have one or many dependent tasks. Next add a Task Name and set the Type as Notebook. Browse & select the Notebook we created. From the Cluster dropdown select the existing Cluster we created earlier. Next Click on the Create button.

Databricks Create Job
Databricks Create Job

Once the Job is created, Click on the Run Now button to execute our Job.

Databricks Run Job 
Databricks Run Job 

Finally after the analytics job run, verify the results in Azure Storage Container. 

Top 10 Articles