In this article we will learn how to perform quick analysis of unstructured data using Hadoop & Hive. In this case we have a huge PHP error log file. What we want is to parse the log file and derive meaningful analytics from it: for example, which web pages encounter PHP errors most frequently, and which types of PHP issues (Warning, Notice, Error etc.) occur. So let's see how easy it is to answer these questions when we have Hadoop & Hive in our toolbox.

Hadoop Distributed File System

We already have our php_errors.log file on our EdgeNode machine. Let us quickly check what the error messages look like.

root@EdgeNode:~# tail /root/php_errors.log

Now see the sample error & warning messages below. Going forward, we need to define the parsing logic for this unstructured file using a regular expression.

[02-Sep-2016 13:04:20 Asia/Singapore] PHP Notice:  Uninitialized string offset: 0 in /srv/mywebsite.com/public_html/libraries/joomla/application/web.php on line 507
[02-Sep-2016 14:16:56 Asia/Singapore] PHP Notice:  Undefined variable: language in /srv/mywebsite.com/public_html/administrator/components/com_easysocial/tables/profile.php on line 767
[03-Sep-2016 01:06:17 Asia/Singapore] PHP Fatal error:  Allowed memory size of 209715200 bytes exhausted (tried to allocate 86 bytes) in /srv/mywebsite.com/public_html/libraries/joomla/object/object.php on line 122
[03-Sep-2016 02:23:55 Asia/Singapore] PHP Notice:  Undefined index: HTTP_HOST in /srv/mywebsite.com/public_html/libraries/joomla/application/web.php on line 863
[31-Oct-2015 18:38:47 Asia/Singapore] PHP Warning:  DateTime::createFromFormat() expects parameter 3 to be DateTimeZone, integer given in /srv/mywebsite.com/public_html/delivery/schedule.php on line 25
[31-Oct-2015 19:02:08 Asia/Singapore] PHP Fatal error:  Uncaught exception 'Exception' with message 'DateTime::__construct(): Failed to parse time string ('2015-10-28 16:00:00') at position 0 ('): Unexpected character' in /srv/mywebsite.com/public_html/delvery/schedule.php:19
Stack trace:
#0 /srv/mywebsite.com/public_html/delivery/schedule.php(19): DateTime->__construct(''2015-10-28 16:...', Object(DateTimeZone))
#1 {main}
  thrown in /srv/mywebsite.com/public_html/delivery/schedule.php on line 19

Our very first step is to copy the file from the local file system to the Hadoop Distributed File System. Before that, let's create a directory in HDFS where we will perform analysis on a few different types of log files.

root@EdgeNode:~# hadoop fs -mkdir -p /log_analysis/php_error_log
root@EdgeNode:~# hadoop fs -copyFromLocal /root/php_errors.log /log_analysis/php_error_log/php_errors.log

Now our log file resides in HDFS. Next, before we create a Hive table, we need to determine the input regular expression to parse this unstructured log file. We will be using the Hive RegexSerDe in this scenario.

File Parsing using Regular Expression

We will go step by step through deriving the correct regular expression to parse our log file. Based on the sample error messages, we have worked out the logic to generate the match groups (capture groups) of interest, as shown below. To build the full regular expression you can take the help of any online tool, e.g. http://rubular.com/
Paste the sample data into the test string area, develop your regular expression incrementally, and at each step check whether the match result & match groups are showing as expected.

A little snapshot of the derivation logic, fragment by fragment (capture groups are numbered %1 to %7):

Exclude rows starting with a pound sign (e.g. "#1 {main}")  ->  ^(?<!#)
Exclude the opening square bracket "[" from the group       ->  (?:[\[])
%1  Date, as string (e.g. "03-Sep-2016")                    ->  ([^ ]*)
Exclude the space between date & time from the group        ->  (?: )
%2  Time, as string (e.g. "01:06:17")                       ->  ([^ ]*)
Exclude the space between time & region from the group      ->  (?: )
%3  Continent/Country, as string (e.g. "Asia/Singapore")    ->  ([^ ]*)
Exclude the closing square bracket "]" from the group       ->  (?:\] )
%4  Error type, as string (e.g. "PHP Fatal error")          ->  ([^\:]*)
Exclude the colon ":" from the group                        ->  (?:\:)
%5  Error message, up to the " in " keyword (e.g. "Allowed memory size of 209715200 bytes exhausted (tried to allocate 86 bytes)")  ->  (.*?) in
%6  Error filename, as string (e.g. "/srv/mywebsite.com/public_html/libraries/joomla/object/object.php")  ->  ([^ ]*)
Exclude " on line " OR ":" from the group                   ->  (?: on line |\:)
%7  Line number, as string (e.g. "122")                     ->  (\S+)

So finally our regular expression comes out as below. NOTE: the spaces are part of the regex parse logic.

^(?<!#)(?:[\[])([^ ]*)(?: )([^ ]*)(?: )([^ ]*)(?:\] )([^\:]*)(?:\:)(.*?) in ([^ ]*)(?: on line |\:)(\S+)
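
Before wiring this expression into Hive, it is worth a quick sanity check outside the cluster. Below is a minimal Python sketch applying the regex to one of the sample lines above; since RegexSerDe requires the whole row to match, we use fullmatch() to mimic that behaviour (the variable names here are ours, for illustration):

```python
import re

# The expression derived above (single backslashes in a Python raw string;
# they must be doubled when embedded in the quoted Hive SerDe property).
LOG_RE = re.compile(
    r'^(?<!#)(?:[\[])([^ ]*)(?: )([^ ]*)(?: )([^ ]*)(?:\] )'
    r'([^\:]*)(?:\:)(.*?) in ([^ ]*)(?: on line |\:)(\S+)'
)

sample = ('[03-Sep-2016 01:06:17 Asia/Singapore] PHP Fatal error:  '
          'Allowed memory size of 209715200 bytes exhausted '
          '(tried to allocate 86 bytes) in '
          '/srv/mywebsite.com/public_html/libraries/joomla/object/'
          'object.php on line 122')

# RegexSerDe matches the entire row, so fullmatch() is the right analogue
m = LOG_RE.fullmatch(sample)
for i, g in enumerate(m.groups(), start=1):
    print(f'%{i} -> {g.strip()}')
```

Running this prints the seven capture groups (date, time, region, error type, message, filename, line number) extracted from the sample row.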

HIVE Table

Now we need to create a Hive table. We will go for an EXTERNAL table, and the row format deserializer used is 'org.apache.hadoop.hive.serde2.RegexSerDe'.
Hive's RegexSerDe uses a regular expression (regex) to deserialize data; it does not support data serialization. It deserializes each row using the regex and extracts the capture groups as columns. During deserialization, if a row does not match the regex, all columns in that row will be NULL. If a row matches the regex but has fewer groups than expected, the missing groups will be NULL. If a row matches the regex but has more groups than expected, the additional groups are simply ignored.
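
To make those three rules concrete, here is a small, hypothetical Python model of the deserialization behaviour (RegexSerDe itself is Java code; the function name and structure below are illustrative only, not the actual implementation):

```python
import re

# Same expression as in the table DDL (Python raw string, single backslashes)
LOG_RE = re.compile(
    r'^(?<!#)(?:[\[])([^ ]*)(?: )([^ ]*)(?: )([^ ]*)(?:\] )'
    r'([^\:]*)(?:\:)(.*?) in ([^ ]*)(?: on line |\:)(\S+)'
)

NUM_COLS = 7  # our table defines 7 STRING columns

def deserialize(row):
    """Model RegexSerDe's rules: no match -> every column NULL (None);
    missing trailing groups -> NULL; surplus groups -> ignored."""
    m = LOG_RE.fullmatch(row)
    if m is None:
        return [None] * NUM_COLS
    groups = list(m.groups())
    return (groups + [None] * NUM_COLS)[:NUM_COLS]

good = ('[02-Sep-2016 14:16:56 Asia/Singapore] PHP Notice:  Undefined '
        'variable: language in /srv/mywebsite.com/public_html/'
        'administrator/components/com_easysocial/tables/profile.php '
        'on line 767')
print(deserialize(good))
print(deserialize('Stack trace:'))   # a row that does not match the regex
```

A stack-trace continuation line yields a row of seven NULLs, which is exactly why a small NULL bucket shows up in the GROUP BY results later.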

Before we create the Hive table, ensure the Metastore server & HiveServer2 are running; if not, log in to the EdgeNode and start the services.

root@EdgeNode:~# hive --service metastore
root@EdgeNode:~# /usr/local/hive/bin/hiveserver2

Let's now create the Hive table backed by the HDFS log data file. Login to beeline & use the default database.

root@EdgeNode:~# beeline
beeline> !connect jdbc:hive2://localhost:10000 hive hadoop1234
0: jdbc:hive2://localhost:10000> use default;
0: jdbc:hive2://localhost:10000> CREATE EXTERNAL TABLE STG_PHP_ERROR_LOG ( 
LOG_DATE STRING,
LOG_TIME STRING,
REGION STRING,
ERROR_TYPE STRING,
ERROR_MESSAGE STRING,
FILENAME STRING,
LINE_NUMBER STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ( 
"input.regex" = "^(?<!#)(?:[\\[])([^ ]*)(?: )([^ ]*)(?: )([^ ]*)(?:\\] )([^\\:]*)(?:\\:)(.*?) in ([^ ]*)(?: on line |\\:)(\\S+)",
"output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s"
)
STORED AS TEXTFILE
LOCATION '/log_analysis/php_error_log/';

Once the table is created successfully, it's time to get the answers to our critical questions. Let's query the table.

0: jdbc:hive2://localhost:10000> SELECT COUNT(1) FROM stg_php_error_log;
+----------+--+
|    c0    |
+----------+--+
| 4456594  |
+----------+--+
1 row selected (44.58 seconds)
0: jdbc:hive2://localhost:10000> SELECT COUNT(1),error_type FROM stg_php_error_log GROUP BY error_type;

+----------+----------------------------+--+
|    c0    |         error_type         |
+----------+----------------------------+--+
| 457      | NULL                       |
| 4406499  | PHP Notice                 |
| 49528    | PHP Warning                |
| 65       | PHP Fatal error            |
| 33       | PHP Parse error            |
| 2        | PHP Catchable fatal error  |
| 10       | PHP Unknown error          |
+----------+----------------------------+--+
7 rows selected (46.773 seconds)
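
As a cross-check outside Hive, the same per-type counting can be sketched in a few lines of Python over raw log lines. The rows below are made-up, shortened samples (invented paths and messages), and the lighter regex pulls out just the error type between "] " and the first colon:

```python
import re
from collections import Counter

# Extract only the error type: the "PHP ..." text before the first colon
TYPE_RE = re.compile(r'\] (PHP [^:]+):')

# Made-up sample rows for illustration (shortened paths, invented messages)
lines = [
    '[02-Sep-2016 13:04:20 Asia/Singapore] PHP Notice:  Undefined index in /srv/a.php on line 1',
    '[03-Sep-2016 01:06:17 Asia/Singapore] PHP Fatal error:  Out of memory in /srv/b.php on line 2',
    '[31-Oct-2015 18:38:47 Asia/Singapore] PHP Warning:  Bad argument in /srv/c.php on line 3',
    '[03-Sep-2016 02:23:55 Asia/Singapore] PHP Notice:  Undefined variable in /srv/d.php on line 4',
    'Stack trace:',   # non-matching rows are simply skipped here
]
counts = Counter(m.group(1) for m in map(TYPE_RE.search, lines) if m)
print(counts.most_common())
```

This mirrors the GROUP BY query: each distinct error type becomes a key with its occurrence count.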

Let's check a few more queries:

SELECT COUNT(1),filename FROM stg_php_error_log WHERE error_type IN ('PHP Fatal error','PHP Parse error') GROUP BY filename ;

SELECT DISTINCT filename, line_number FROM stg_php_error_log WHERE YEAR(TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP(log_date, 'dd-MMMM-yyyy')))) = 2016 AND MONTH(TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP(log_date, 'dd-MMMM-yyyy')))) = 8;
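
The year/month filter above relies on parsing the first token of each row (e.g. '03-Sep-2016') as a date. A quick way to sanity-check that parse outside Hive is a short Python sketch (the function name is ours; '%d-%b-%Y' plays the role of the SimpleDateFormat pattern in the query, and the month abbreviation is locale-dependent):

```python
from datetime import datetime

def logged_in_month(log_date, year, month):
    """True if a log_date like '03-Sep-2016' falls in the given year/month."""
    # %d-%b-%Y parses day, abbreviated month name, and 4-digit year
    d = datetime.strptime(log_date, '%d-%b-%Y')
    return (d.year, d.month) == (year, month)

print(logged_in_month('03-Sep-2016', 2016, 9))
print(logged_in_month('31-Oct-2015', 2016, 9))
```

The same idea applies in the Hive query: convert the string to a date once, then compare year and month.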

That's all! In the next article we will analyze a web server's access & error log files.

