Create Google Cloud Dataproc Cluster
Google Cloud Dataproc lets us provision Apache Hadoop clusters and connect them to the underlying analytic data stores. With Cloud Dataproc we can easily set up and launch a cluster to process and analyze data with various big data frameworks.
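The rest of this article walks through the Cloud Console, but each step also has a rough gcloud CLI equivalent, which we sketch along the way. Below is a minimal, illustrative create command; the cluster name, project, region and image version are placeholders, and in practice the flags shown in the later sketches would be combined into this single command.

```bash
# Minimal sketch of provisioning a Dataproc cluster from the gcloud CLI.
# Cluster name, project, region and image version are placeholders for this demo.
gcloud dataproc clusters create demo-cluster \
    --project=my-gcp-project \
    --region=us-central1 \
    --image-version=2.0-debian10
```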
Navigate to the Google Cloud Dataproc homepage. Next, click the Clusters link under the Jobs on clusters section, then click the CREATE CLUSTER button.
Provide a Dataproc cluster name and the Google Cloud region in which to provision the cluster. Also select the cluster type.
Let's disable cluster autoscaling for this demo.
Under the Optional components configuration, we can choose from the many available big data frameworks.
For the purpose of the demos in our next articles, let's choose the frameworks below (the equivalent gcloud flag is shown after the list):
- Hadoop: Cluster for distributed processing of big data
- Hive: Distributed data warehouse system on top of Hadoop
- HCatalog: Table and storage management layer that allows various data processing frameworks to access Hive Metastore tables
- Pig: Scripting language to transform large data sets
- Tez: Data processing framework for creating a complex directed acyclic graph (DAG) of tasks. Pig and Hive workflows can run using Hadoop MapReduce or they can use Tez as an execution engine
- Spark: Distributed processing framework and programming model for machine learning, stream processing, or graph analytics
- Presto: In-memory distributed SQL query engine for interactive analytic queries over large datasets from multiple sources
- Jupyter: Provides a development and collaboration environment for ad hoc querying and exploratory analysis
- Zeppelin: Notebook for interactive data exploration
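In the gcloud CLI, the same selection is made with the --optional-components flag. Hadoop, Hive (with HCatalog), Pig, Tez and Spark already ship with the standard Dataproc image, so only the extra frameworks need to be listed. The snippet below is a sketch reusing the placeholder cluster name from earlier; enabling the component gateway is optional, but it is what surfaces the framework UIs as direct links under the Web Interfaces tab shown later.

```bash
# Hadoop, Hive/HCatalog, Pig, Tez and Spark are part of the base Dataproc image;
# the remaining frameworks from the list above are enabled as optional components.
gcloud dataproc clusters create demo-cluster \
    --region=us-central1 \
    --optional-components=PRESTO,JUPYTER,ZEPPELIN \
    --enable-component-gateway
```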
Choose the number of Master nodes and their machine type. Next, choose the primary disk size for the Master nodes.
Choose the number of Worker nodes and their machine type. Next, choose the primary disk size for the Worker nodes.
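The same sizing choices map to gcloud flags; the machine types and disk sizes below are illustrative values only.

```bash
# Illustrative node sizing; adjust machine types and disk sizes as needed.
gcloud dataproc clusters create demo-cluster \
    --region=us-central1 \
    --num-masters=1 \
    --master-machine-type=n1-standard-4 \
    --master-boot-disk-size=100GB \
    --num-workers=2 \
    --worker-machine-type=n1-standard-4 \
    --worker-boot-disk-size=100GB
```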
Let's disable Secondary worker nodes for this demo.
Select the VPC and a subnet in which to launch the Dataproc cluster. Add labels to the cluster for tracking and management.
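With gcloud, the network placement and labels map to the --subnet and --labels flags; the subnet name and label values below are assumptions for this demo.

```bash
# Place the cluster in a specific subnet and attach labels for tracking.
gcloud dataproc clusters create demo-cluster \
    --region=us-central1 \
    --subnet=default \
    --labels=environment=demo,team=analytics
```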
An initialization action script is used to customize settings, install applications, or make other modifications to the Dataproc cluster nodes during provisioning. We will add a script that enables Pig scripts to access the Hive HCatalog.
```bash
#!/bin/bash
# hive-hcatalog.sh
# This script installs Hive HCatalog on a Google Cloud Dataproc cluster.

set -euxo pipefail

function err() {
  echo "[$(date +'%Y-%m-%dT%H:%M:%S%z')]: $@" >&2
  return 1
}

function update_apt_get() {
  for ((i = 0; i < 10; i++)); do
    if apt-get update; then
      return 0
    fi
    sleep 5
  done
  return 1
}

update_apt_get

# Install the hive-hcatalog package
apt-get -q -y install hive-hcatalog || err 'Failed to install hive-hcatalog'

# Configure Pig to use HCatalog
cat >>/etc/pig/conf/pig-env.sh <<EOF
#!/bin/bash
includeHCatalog=true
EOF
```
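To use this script as an initialization action from the command line, upload it to a Cloud Storage bucket and reference it at cluster creation time; the bucket path below is a placeholder.

```bash
# Upload the initialization action to a Cloud Storage bucket (placeholder path).
gsutil cp hive-hcatalog.sh gs://my-dataproc-scripts/hive-hcatalog.sh

# Reference the script at cluster creation; it runs on each node during provisioning.
gcloud dataproc clusters create demo-cluster \
    --region=us-central1 \
    --initialization-actions=gs://my-dataproc-scripts/hive-hcatalog.sh \
    --initialization-action-timeout=10m
```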
Choose an existing Cloud Storage bucket as the Dataproc cluster staging bucket, used for storing job dependencies, job driver output, and cluster config files.
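In gcloud, the staging bucket is set with the --bucket flag; the bucket name below is a placeholder.

```bash
# Use an existing Cloud Storage bucket as the cluster staging bucket (placeholder name).
gcloud dataproc clusters create demo-cluster \
    --region=us-central1 \
    --bucket=my-dataproc-staging-bucket
```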
Finally, click the Create cluster button. It will take a few minutes to launch the Dataproc cluster.
Once the launch succeeds, the Dataproc cluster enters the Running state.
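The cluster state can also be verified from the command line.

```bash
# List Dataproc clusters in the region and check their status.
gcloud dataproc clusters list --region=us-central1
```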
Let's take a look under the VM Instances tab. Here we can see the Dataproc node instances.
Let's take a look under the Configuration tab to verify the desired settings of the Dataproc cluster.
Let's take a look under the Web Interfaces tab. For the various big data frameworks we selected earlier, the corresponding UI links are available.
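The same details can be fetched with gcloud; assuming the placeholder cluster name used earlier, describe prints the full cluster configuration, and if the component gateway was enabled the web UI endpoints appear under config.endpointConfig.httpPorts.

```bash
# Show the full cluster configuration; web UI endpoints appear under
# config.endpointConfig.httpPorts when the component gateway is enabled.
gcloud dataproc clusters describe demo-cluster --region=us-central1
```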
In our next article we will look at how to submit data processing jobs to a Dataproc cluster. We will also explore a few of the big data frameworks such as Presto, Jupyter, and Zeppelin.