
Log File Analysis Using Hadoop & Hive - PHP Error Log

Updated on Oct 03, 2020

In this article we will learn how to perform quick analysis of unstructured data using Hadoop & Hive. In this case we have a huge PHP error log file. What we want is to parse the log file and get meaningful analytics from it: for example, which web pages are encountering frequent PHP errors, and what types of PHP issues (Warning, Notice, Error, etc.) are occurring. So let's see how easy it is to get answers to these questions when we have Hadoop & Hive in our toolbox.

Hadoop Distributed File System

We already have our php_errors.log file on our EdgeNode machine. Let us quickly check what the error messages look like.

root@EdgeNode:~# tail /root/php_errors.log

[02-Sep-2016 13:04:20 Asia/Singapore] PHP Notice:  Uninitialized string offset: 0 in /srv/mywebsite.com/public_html/libraries/joomla/application/web.php on line 507
[02-Sep-2016 14:16:56 Asia/Singapore] PHP Notice:  Undefined variable: language in /srv/mywebsite.com/public_html/administrator/components/com_easysocial/tables/profile.php on line 767
[03-Sep-2016 01:06:17 Asia/Singapore] PHP Fatal error:  Allowed memory size of 209715200 bytes exhausted (tried to allocate 86 bytes) in /srv/mywebsite.com/public_html/libraries/joomla/object/object.php on line 122
[03-Sep-2016 02:23:55 Asia/Singapore] PHP Notice:  Undefined index: HTTP_HOST in /srv/mywebsite.com/public_html/libraries/joomla/application/web.php on line 863
[31-Oct-2015 18:38:47 Asia/Singapore] PHP Warning:  DateTime::createFromFormat() expects parameter 3 to be DateTimeZone, integer given in /srv/mywebsite.com/public_html/delivery/schedule.php on line 25
[31-Oct-2015 19:02:08 Asia/Singapore] PHP Fatal error:  Uncaught exception 'Exception' with message 'DateTime::__construct(): Failed to parse time string ('2015-10-28 16:00:00') at position 0 ('): Unexpected character' in /srv/mywebsite.com/public_html/delvery/schedule.php:19
Stack trace:
#0 /srv/mywebsite.com/public_html/delivery/schedule.php(19): DateTime->__construct(''2015-10-28 16:...', Object(DateTimeZone))
#1 {main}
  thrown in /srv/mywebsite.com/public_html/delivery/schedule.php on line 19

Now look at the sample error and warning messages above. We need to define the parsing logic for this unstructured file using a regular expression.

So our very first step is to put the file from the local file system into the Hadoop Distributed File System. Before that, let's create a directory in HDFS where we will perform the analysis on a few different types of log files.

root@EdgeNode:~# hadoop fs -mkdir -p /log_analysis/php_error_log
root@EdgeNode:~# hadoop fs -copyFromLocal /root/php_errors.log /log_analysis/php_error_log/php_errors.log
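
As a quick sanity check, we can list the HDFS directory to confirm that the file has been copied (the file size shown will depend on your log):

root@EdgeNode:~# hadoop fs -ls /log_analysis/php_error_log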

Now our log file resides in HDFS. Next, before we create a Hive table, we need to determine the input regular expression to parse this unstructured log file. We will be using the Hive RegexSerDe in this scenario.

File Parsing using Regular Expression

We will go step by step through how to derive the correct regular expression to parse our log file. Based on the sample error messages, we have derived the logic to generate the match groups (capture groups) of interest, as shown below. To build the full regular expression you can take the help of any online tool, e.g. http://rubular.com/

You may paste the sample data into the test string area, develop your regular expression there, and at the same time check whether the match result and match groups show up as expected.

A snapshot of the derivation logic is given in the sample screenshot below:

So finally our regular expression comes out as below. NOTE: the spaces are part of our regex parsing logic.

^(?<!#)(?:[\[])([^ ]*)(?: )([^ ]*)(?: )([^ ]*)(?:\] )([^\:]*)(?:\:)(.*?) in ([^ ]*)(?: on line |\:)(\S+)
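
Applied to the first sample line above, the seven capture groups come out as follows (shown here purely for illustration; these are exactly the fields we will map to Hive columns in the next section):

Group 1 (date)        : 02-Sep-2016
Group 2 (time)        : 13:04:20
Group 3 (timezone)    : Asia/Singapore
Group 4 (error type)  : PHP Notice
Group 5 (message)     : Uninitialized string offset: 0   (including the leading spaces after the colon)
Group 6 (file name)   : /srv/mywebsite.com/public_html/libraries/joomla/application/web.php
Group 7 (line number) : 507

If GNU grep with PCRE support (-P) happens to be available on the EdgeNode, a rough sanity check is to compare the total number of lines in the log with the number of lines the pattern matches; stack-trace continuation lines such as "#0 ..." and "#1 {main}" are expected to fall through:

root@EdgeNode:~# wc -l /root/php_errors.log
root@EdgeNode:~# grep -cP '^(?<!#)(?:[\[])([^ ]*)(?: )([^ ]*)(?: )([^ ]*)(?:\] )([^\:]*)(?:\:)(.*?) in ([^ ]*)(?: on line |\:)(\S+)' /root/php_errors.log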

Hive Table

Now we need to create a Hive table. We will go for an EXTERNAL table, and the row format SerDe used is 'org.apache.hadoop.hive.serde2.RegexSerDe'.

The Hadoop Hive RegexSerDe uses a regular expression (regex) to deserialize data; it does not support serialization. It deserializes each row using the regex and extracts the capture groups as columns. During deserialization, if a row does not match the regex, then all columns in that row will be NULL; in our log, the stack-trace continuation lines will fall into this bucket. If a row matches the regex but has fewer groups than expected, the missing groups will be NULL. If a row matches the regex but has more groups than expected, the additional groups are simply ignored.

Before we create a Hive table, make sure the Metastore server and HiveServer2 are running; otherwise start the services. Log in to the EdgeNode and start the services.

root@EdgeNode:~# hive --service metastore
root@EdgeNode:~# /usr/local/hive/bin/hiveserver2
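
Optionally, we can verify that both services are up and listening on their default ports (9083 for the metastore and 10000 for HiveServer2), assuming netstat is available on the EdgeNode:

root@EdgeNode:~# netstat -nltp | grep -E '9083|10000'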

Let's now create the Hive table backed by the HDFS log data file. Log in to beeline and use the default database.

root@EdgeNode:~# beeline
beeline> !connect jdbc:hive2://localhost:10000 hive hadoop1234
0: jdbc:hive2://localhost:10000> use default;
0: jdbc:hive2://localhost:10000> CREATE EXTERNAL TABLE STG_PHP_ERROR_LOG ( 
LOG_DATE STRING,
LOG_TIME STRING,
REGION STRING,
ERROR_TYPE STRING,
ERROR_MESSAGE STRING,
FILENAME STRING,
LINE_NUMBER STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ( 
"input.regex" = "^(?<!#)(?:[\\[])([^ ]*)(?: )([^ ]*)(?: )([^ ]*)(?:\\] )([^\\:]*)(?:\\:)(.*?) in ([^ ]*)(?: on line |\\:)(\\S+)",
"output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s"
)
STORED AS TEXTFILE
LOCATION '/log_analysis/php_error_log/';

Once the table gets created successfully, it's time to get the answers to our critical questions. Let's query the table.
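
First, a quick spot check (the output will of course depend on your data) to confirm that the regex is splitting the fields as intended:

0: jdbc:hive2://localhost:10000> SELECT log_date, log_time, error_type, filename, line_number FROM stg_php_error_log LIMIT 5;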

0: jdbc:hive2://localhost:10000> SELECT COUNT(1) FROM stg_php_error_log;
+----------+--+
|    c0    |
+----------+--+
| 4456594  |
+----------+--+
1 row selected (44.58 seconds)

0: jdbc:hive2://localhost:10000> SELECT COUNT(1),error_type FROM stg_php_error_log GROUP BY error_type;

+----------+----------------------------+--+
|    c0    |         error_type         |
+----------+----------------------------+--+
| 457      | NULL                       |
| 4406499  | PHP Notice                 |
| 49528    | PHP Warning                |
| 65       | PHP Fatal error            |
| 33       | PHP Parse error            |
| 2        | PHP Catchable fatal error  |
| 10       | PHP Unknown error          |
+----------+----------------------------+--+
7 rows selected (46.773 seconds)

Let's check a few more queries. The first lists the files responsible for fatal and parse errors, and the second lists the distinct file and line-number combinations that logged errors in August 2016.

SELECT COUNT(1), filename
FROM stg_php_error_log
WHERE error_type IN ('PHP Fatal error', 'PHP Parse error')
GROUP BY filename;

SELECT DISTINCT filename, line_number
FROM stg_php_error_log
WHERE YEAR(TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP(log_date, 'dd-MMMM-yyyy')))) = 2016
  AND MONTH(TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP(log_date, 'dd-MMMM-yyyy')))) = 8;
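
As a final example along the same lines, we can find the files generating the most PHP notices, which answers our question from the introduction about which web pages hit frequent PHP errors. A query along these lines should do (the alias and the LIMIT are just illustrative):

SELECT filename, COUNT(1) AS error_count
FROM stg_php_error_log
WHERE error_type = 'PHP Notice'
GROUP BY filename
ORDER BY error_count DESC
LIMIT 10;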

That's all! In a future article we will analyze a web server's access and error log files.