Sure enough, you have heard the term "Big Data" many times before. There is no dearth of information on the Internet and in print about it. But guess what, the term still remains vaguely defined and poorly understood. This essay is our effort to describe big data in simple technical language, stripping off all the marketing lingo and sales jargon. Shall we begin?

What is Big Data?

As is customary, let us begin with a somewhat formal definition of big data. The definition below is copied from Wikipedia. Read it anyway.

In information technology, big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications

Don't worry if the above definition failed to fire your imagination. If you just need to copy it and paste it into your presentation slide for tomorrow morning's management meeting, then feel free to do so and leave this page. Otherwise, read on.

Okay. Now take a deep breath and look back at the above definition. It is actually a very good definition. All it says is that the term "Big Data" refers to any collection of data sets so huge that traditional databases are unable to process them in a timely manner. So how big is big enough? And what is the processing limit of a traditional database?

How Big is Big?

[Image: Sources of big data. Courtesy: IBM.com]

Ask yourself this question.

What is the maximum database size you have seen so far?

Now, if you are a database professional, you may be able to answer this question. But if you do not know anything about databases (then what are you doing here, anyway?), it is not possible for you to answer it. So let me ask an easier question. How big do you think is the database where the US government stores every citizen's social security number (SSN)? The US population is somewhere around 330 million. If everybody has one SSN, that database should store approximately 330 million records (I know that is an oversimplification). Is this database big enough? Probably not. Think of Facebook. As of 31 Dec 2012, they had 680 million monthly active users. Is that big enough? Not really. How about Google, then? People perform 6,000 million searches per day on Google. Is that big data? Maybe - depending on how you store this data and what you intend to do with it. The thing is, storing just 6,000 million rows is not a big deal in itself. You can do that using conventional technologies (read RDBMS) and tools (Oracle Database, for example). But it can be more intriguing than that.

Consider this:

What if I ask you to store the 6,000 million search phrases that are typed into Google every day, continuously for 2 years, and at the end of it prepare a report on the 25 most searched keywords related to "football"?

Now you might (rightfully) say, "this is insane". 6,000 million per day for 2 years means (2 × 365 × 6,000 million) 4,380,000,000,000 records. And even if we could store it, how would we possibly perform analysis on that data? Gee! Welcome to the world of Big Data!
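To see that this is a scale problem rather than a logic problem, here is a minimal sketch of the analysis itself (the one-phrase-per-line input format and the simple substring match are assumptions made for illustration). The code is trivial; the 4.38 trillion lines are not:

```python
from collections import Counter

def top_football_keywords(lines, n=25):
    """Stream over search phrases, counting those that mention football."""
    counts = Counter()
    for line in lines:
        phrase = line.strip().lower()
        if "football" in phrase:
            counts[phrase] += 1
    return counts.most_common(n)

# A tiny sample; the real input would be trillions of lines.
sample = ["football scores", "Football scores", "pineapple pie recipe"]
print(top_football_keywords(sample))
```

Running this single loop over trillions of records on one machine would take years, which is exactly why a different processing model is needed.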

World of Big Data

Now I am sure that if you have been doing RDBMS and SQL all your life, you won't give in so easily. While you understand that your favorite RDBMS will never handle that much data, you will definitely ask this question -

WHY on EARTH anybody needs to do that kind of WEIRD analysis in the first place?

Or you may also question the rationale of it -

What about the cost factor?

Because storing such a staggering amount of data and performing analysis on it would probably cost hundreds of thousands of dollars. This means the return on investment will be pathetic!

And my answer will be: "Yes, it would be pathetic. For you. Not for big data!" Because big data does not use RDBMS, SQL queries, or conventional databases. True, had we tried to tame big data using conventional technologies, we would have failed. And that is why engineers have found a different way of handling such a huge volume of data. I will show you how. But let me answer another question before that. The question is about the need.

Why do we need big data?

In order to run your business intelligently, you have to be able to reason, plan, solve problems, think abstractly, comprehend complex ideas, learn quickly, and learn from experience. All of these abilities revolve around one simple thing: your ability to perform data-driven analysis, because that is how you discover knowledge. But the problem is, how do you get enough data? And if you happen to have enough data, how do you perform analysis on it? These are two different questions, and the answers will lead us to Big Data. So let's explore them.

Conventionally, in order to have enough data, you have to set up the means to collect it. That is not always easy, and definitely not cheap.

The Problem

Say you are the owner of a cross-country fast-food restaurant chain, and you sell chicken burgers and French fries. You are considering two new products, of which you want to introduce one in your restaurants: the Mushroom Burger and the Pineapple Pie. If you are given the task of selecting one product over the other so as to maximize your profit, how would you make the decision? Clearly, you have no prior knowledge, experience, or data about your customers' affinity towards one over the other.

If you have to make a data-driven decision, the conventional way would be to run a survey campaign. For example, you might distribute a feedback form to all of your restaurant visitors so they can express their preference, and based on the data collected you may be able to make a decision. But there are three problems with this approach. First, the process is painfully slow. Second, it is limited to the choices of your existing customers and does not take your potential customers into account. And lastly, you need to design a computerized process to convert the results of your offline survey into database records for analysis.

Possible Solution

The above example demonstrates the typical scarcity of data when a business wants to grow into uncharted territory. But it doesn't have to be like this. With the advent of social media like Facebook and Twitter, it is relatively easy to get such data. Think of it: 680 million Facebook users are visiting restaurants every month, liking particular foods, and commenting "wow! I like the Pineapple Pie after this burger" or "I hate the smell of shroom pizza" now and then.

[Image: A Facebook post showing one's affinity towards a food!]

This can be an incredible source of information for your restaurant chain. The data is already there; the question is how to use it. Clearly, creating a conventional RDBMS-based data model to store this data is out of the question, because of the time and processing required to transform unstructured text into a structured database schema. And even if you did that, there remains the herculean task of analyzing the data. Fundamentally, the conventional process requires a lot of data movement (read: physical I/O), which calls for a better process.

In case you are still thinking, let me clarify that conventional data warehouse designs based on the ER model or the dimensional model are not suitable here. If you store this huge data in an ER model, analytics will be dreadful because of the huge number of joins you need to traverse to retrieve records. If you ease the data retrieval by doing dimensional modeling instead, the data becomes unnecessarily voluminous because of redundancy.

Big Data to the Rescue - MapReduce

So up to this point, we have established the need for big data for better business intelligence, and we also understand the pain associated with lots of data. Many refer to this as the 3 Vs of Big Data, namely Volume, Variety, and Velocity (see below), which together increase the complexity of Big Data processing.

[Image: The 3 Vs - Challenges of Big Data]

Fortunately, Big Data comprises methods and technologies to solve this problem. One of these methods is called "MapReduce". You can use the MapReduce framework to analyze unstructured data through streaming. Streaming is a great thing: when you stream data, you obviate the need to store all of it first, and you can analyze the data on the go to find the answers to your questions. MapReduce is a separate topic, definitely outside the scope of this essay, but I intend to write about it when I have some time.
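To give a flavor of the idea, here is a minimal single-machine sketch of the MapReduce pattern as a word count (this is only an illustration of the programming model; it does not show how Hadoop actually distributes work across a cluster). A mapper emits key-value pairs, a shuffle groups the pairs by key, and a reducer aggregates the values for each key:

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    """Map phase: emit (word, 1) for every word in a line of text."""
    for word in line.lower().split():
        yield (word, 1)

def reducer(word, counts):
    """Reduce phase: sum all the counts emitted for one word."""
    return (word, sum(counts))

def map_reduce(lines):
    # Map: flatten the (word, 1) pairs produced from every line
    pairs = [pair for line in lines for pair in mapper(line)]
    # Shuffle: group the pairs by their key (the word)
    pairs.sort(key=itemgetter(0))
    # Reduce: one reducer call per distinct word
    return [reducer(word, (count for _, count in group))
            for word, group in groupby(pairs, key=itemgetter(0))]

print(map_reduce(["big data", "big ideas"]))
```

The power of the model is that the map and reduce calls are independent of each other, so a framework like Hadoop can run them in parallel on thousands of machines, each reading only its local slice of the data.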

The important point to note is that big data gives us the means to overcome issues that were otherwise not solvable using conventional techniques. The good thing is, once you start exploiting the potential of big data, both the scope and the possibility of data-driven analysis increase widely. As we saw in our restaurant chain example, by leveraging big data analytics on social media data you can extend the scope of your analysis far more widely and perform it in comparatively less time.

Examples of Big Data Analytics

None of this is just theory; there are diverse use cases of big data in the world right now. Amazon.com handles millions of back-end operations every day, as well as queries from more than half a million third-party sellers. Yet for every user, they can perform market basket analysis to show you the list of products you are most likely to buy, given your previous buying habits and those of millions of customers like you. The NASA Center for Climate Simulation (NCCS) stores 32 petabytes of climate observations and simulations on the Discover supercomputing cluster to enable various climate predictions. In physics, the Large Hadron Collider at CERN, Geneva has about 150 million sensors delivering data 40 million times per second. These data are processed to find, among other things, the presence of new elementary particles in nature. None of this would have been possible using conventional database technologies.
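As a toy illustration of the market basket idea mentioned above (the baskets below are invented for the example; real systems use far more sophisticated association-rule mining over billions of orders), the core is simply counting how often pairs of products appear in the same basket:

```python
from collections import Counter
from itertools import combinations

def pair_counts(baskets):
    """Count how often each pair of products is bought together."""
    counts = Counter()
    for basket in baskets:
        # Sort so that ("burger", "fries") and ("fries", "burger") match
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts

baskets = [
    {"burger", "fries", "cola"},
    {"burger", "fries"},
    {"burger", "cola"},
]
# The most frequently co-purchased pairs drive the recommendations.
print(pair_counts(baskets).most_common(2))
```

At Amazon's scale, even this simple counting becomes a big data problem, which is why such jobs are typically expressed as MapReduce-style computations over a cluster.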

There is no doubt in my mind that Big Data is the next big thing. However, like everything else, there is a lot of hype associated with it. In fact, according to a Gartner analyst, Big Data is plunging into a "trough of disillusionment" that may dissuade many companies from pursuing analytics technology. But as technologists, we need to learn to discern the reality from the hype.

Now that you know what big data is, I will conclude this article here. Here is a very useful infographic summarizing many concepts of Big Data, which you may wish to download. In my next article, we will discuss the implementation and use cases of big data analysis.

