Create Google Cloud Dataproc Cluster
Google Cloud Dataproc lets us provision Apache Hadoop clusters and connect them to the underlying analytic data stores. With Cloud Dataproc we can easily set up and launch a cluster to process and analyze data with various big data frameworks.
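The rest of this article walks through the Cloud Console, but each step also has a rough gcloud CLI equivalent, which we sketch along the way. Below is a minimal, illustrative create command; the cluster name, project, region and image version are placeholders, and in practice the flags shown in the later sketches would be combined into this single command.

```bash
# Minimal sketch of provisioning a Dataproc cluster from the gcloud CLI.
# Cluster name, project, region and image version are placeholders for this demo.
gcloud dataproc clusters create demo-cluster \
    --project=my-gcp-project \
    --region=us-central1 \
    --image-version=2.0-debian10
```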
Navigate to the Google Cloud Dataproc homepage. Next, click the Clusters link under the Jobs on clusters section, then click the CREATE CLUSTER button.
Provide a Dataproc cluster name and the Google Cloud region in which to provision the cluster. Also select the cluster type.
Let's disable cluster autoscaling for this demo.
Under the Optional components configuration, we can choose from the many available big data frameworks.
For the purpose of the demos in our next articles, let's choose the frameworks below (the equivalent gcloud flag is shown after the list):
- Hadoop: Cluster for distributed processing of big data
- Hive: Distributed data warehouse system on top of Hadoop
- HCatalog: Table and storage management layer that allows various data processing frameworks to access Hive Metastore tables
- Pig: Scripting language to transform large data sets
- Tez: Data processing framework for creating a complex directed acyclic graph (DAG) of tasks. Pig and Hive workflows can run using Hadoop MapReduce or they can use Tez as an execution engine
- Spark: Distributed processing framework and programming model for machine learning, stream processing, or graph analytics
- Presto: In-memory distributed SQL query engine for interactive analytic queries over large datasets from multiple sources
- Jupyter: Provides a development and collaboration environment for ad hoc querying and exploratory analysis
- Zeppelin: Notebook for interactive data exploration
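In the gcloud CLI, the same selection is made with the --optional-components flag. Hadoop, Hive (with HCatalog), Pig, Tez and Spark already ship with the standard Dataproc image, so only the extra frameworks need to be listed. The snippet below is a sketch reusing the placeholder cluster name from earlier; enabling the component gateway is optional, but it is what surfaces the framework UIs as direct links under the Web Interfaces tab shown later.

```bash
# Hadoop, Hive/HCatalog, Pig, Tez and Spark are part of the base Dataproc image;
# the remaining frameworks from the list above are enabled as optional components.
gcloud dataproc clusters create demo-cluster \
    --region=us-central1 \
    --optional-components=PRESTO,JUPYTER,ZEPPELIN \
    --enable-component-gateway
```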
Choose the number of Master nodes and their machine type. Next, choose the primary disk size for the Master nodes.
Choose the number of Worker nodes and their machine type. Next, choose the primary disk size for the Worker nodes.
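The same sizing choices map to gcloud flags; the machine types and disk sizes below are illustrative values only.

```bash
# Illustrative node sizing; adjust machine types and disk sizes as needed.
gcloud dataproc clusters create demo-cluster \
    --region=us-central1 \
    --num-masters=1 \
    --master-machine-type=n1-standard-4 \
    --master-boot-disk-size=100GB \
    --num-workers=2 \
    --worker-machine-type=n1-standard-4 \
    --worker-boot-disk-size=100GB
```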
Let's disable Secondary worker nodes for this demo.
Select the VPC and a subnet in which to launch the Dataproc cluster. Add labels to the cluster for tracking and management.
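With gcloud, the network placement and labels map to the --subnet and --labels flags; the subnet name and label values below are assumptions for this demo.

```bash
# Place the cluster in a specific subnet and attach labels for tracking.
gcloud dataproc clusters create demo-cluster \
    --region=us-central1 \
    --subnet=default \
    --labels=environment=demo,team=analytics
```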
An initialization action script is used to customize settings, install applications, or make other modifications to the Dataproc cluster nodes during provisioning. We will add a script that enables Pig scripts to access the Hive HCatalog.
```bash
#!/bin/bash
# hive-hcatalog.sh
# This script installs Hive HCatalog on a Google Cloud Dataproc cluster.

set -euxo pipefail

function err() {
  echo "[$(date +'%Y-%m-%dT%H:%M:%S%z')]: $@" >&2
  return 1
}

function update_apt_get() {
  for ((i = 0; i < 10; i++)); do
    if apt-get update; then
      return 0
    fi
    sleep 5
  done
  return 1
}

update_apt_get

# Install the hive-hcatalog package
apt-get -q -y install hive-hcatalog || err 'Failed to install hive-hcatalog'

# Configure Pig to use HCatalog
cat >>/etc/pig/conf/pig-env.sh <<EOF
#!/bin/bash
includeHCatalog=true
EOF
```
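To use this script as an initialization action from the command line, upload it to a Cloud Storage bucket and reference it at cluster creation time; the bucket path below is a placeholder.

```bash
# Upload the initialization action to a Cloud Storage bucket (placeholder path).
gsutil cp hive-hcatalog.sh gs://my-dataproc-scripts/hive-hcatalog.sh

# Reference the script at cluster creation; it runs on each node during provisioning.
gcloud dataproc clusters create demo-cluster \
    --region=us-central1 \
    --initialization-actions=gs://my-dataproc-scripts/hive-hcatalog.sh \
    --initialization-action-timeout=10m
```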
Choose an existing Cloud Storage bucket as the Dataproc cluster staging bucket, used for storing job dependencies, job driver output, and cluster config files.
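In gcloud, the staging bucket is set with the --bucket flag; the bucket name below is a placeholder.

```bash
# Use an existing Cloud Storage bucket as the cluster staging bucket (placeholder name).
gcloud dataproc clusters create demo-cluster \
    --region=us-central1 \
    --bucket=my-dataproc-staging-bucket
```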
Finally, click the Create cluster button. It will take a few minutes to launch the Dataproc cluster.
Once the launch succeeds, the Dataproc cluster enters the Running state.
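The cluster state can also be verified from the command line.

```bash
# List Dataproc clusters in the region and check their status.
gcloud dataproc clusters list --region=us-central1
```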
Let's take a look under the VM Instances tab. Here we can see the Dataproc node instances.
Let's take a look under the Configuration tab to verify the desired settings of the Dataproc cluster.
Let's take a look under the Web Interfaces tab. For the various big data frameworks we selected earlier, the corresponding UI links are available.
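The same details can be fetched with gcloud; assuming the placeholder cluster name used earlier, describe prints the full cluster configuration, and if the component gateway was enabled the web UI endpoints appear under config.endpointConfig.httpPorts.

```bash
# Show the full cluster configuration; web UI endpoints appear under
# config.endpointConfig.httpPorts when the component gateway is enabled.
gcloud dataproc clusters describe demo-cluster --region=us-central1
```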
In our next article we will look at how to submit data processing jobs to a Dataproc cluster. We will also explore a few of the big data frameworks such as Presto, Jupyter, and Zeppelin.