Hadoop Data Lake and Transforming Data

A Data Lake is a term describing a large enterprise repository of data stored in Hadoop. More and more companies are concluding that a Data Lake is a better fit than a traditional ETL process feeding a restricted EDW. The Data Lake is inexpensive, scales easily, runs on commodity hardware, provides a very flexible schema, and makes it easy to transform data in parallel.

Schema
So why not use a relational database for this Data Lake? In a relational database, the schema is defined first and then data is forced into it. With Hadoop, you first load the data and then apply a schema as you read it out. This means adding new data to Hadoop is easier and faster because you don’t have to first define the schema.

Parallel
Processing is also done in parallel. You can transform the data with Hadoop tools like Pig and then either load it into a relational data store or continue to use it in Hadoop.

Greenplum Database External Tables
There are use cases where a relational database like Greenplum Database is easier to use and performs better than Hadoop. A great feature of Greenplum Database is the ability to create an External Table that points to Hadoop. These External Tables can be defined to either READ data from or WRITE data to Hadoop. Because Greenplum is an MPP database, each segment connects to Hadoop to READ/WRITE data in parallel. This makes the integration of Greenplum Database with Hadoop much faster than the single-threaded approach you might see with other database products.
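To illustrate the WRITE direction, here is a minimal sketch of a writable External Table. The hdm1 host, the /exports/regions directory, and the source table in the INSERT are assumptions made up for this sketch, not part of the example that follows later in this post.

-- Minimal sketch of a WRITABLE External Table; the host, path, and
-- source table below are illustrative assumptions.
create writable external table export_regions
(region text,
 headcount bigint)
LOCATION ('gphdfs://hdm1:8020/exports/regions')
FORMAT 'TEXT' (delimiter '|');

-- Each Greenplum segment streams its portion of the result set to Hadoop in parallel.
insert into export_regions
select region, count(*) from employees group by region;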

Transform Example
So let’s say you get an employee file from a company you just acquired. You want to take this file and make it available to the enterprise in Hadoop as well as in a data mart in Greenplum Database.

We will first load the data into a stage directory, run a program to remove a control character, and then put it in the enterprise directory.

hadoop dfs -mkdir /stage
hadoop dfs -mkdir /enterprise

Next, I use a Pig program to remove a control character. In this example, I’m removing the “null character”.

The name of this file is pig_cleaning.pig.

-- load each line of the staged file as a single chararray
dirty = load '$STAGE' as (data:chararray);
-- remove the null character (\u0000) from each line
clean = foreach dirty generate REPLACE(data, '\\u0000', '');
-- write the cleaned records to the target directory
store clean into '$TARGET';

Here is the employees.txt file I created. Notice the ^@ symbols; this is how the null control character appears in vi. The pig_cleaning.pig script will remove these control characters.

Jon|Roberts|123 Main Street|New York|NY|10202|Sales Engineer|NorthEast
Abby|Cromwell|77 Juniper Drive|Raleigh|NC|27605|CIO|NorthWest
Lilly|Vargas|7894 Rayna Rd|Atlanta|GA|30301|VP Sales|SouthEast
Emily|Palmer|224 Warren Drive|St. Louis|MO|63101|VP Marketing|MidWest
Winston|Rigby|84 West Road|^@|CA|^@|President|West

Next, I created a shell script that accepts the name of the file to load and the name of the Hadoop directory in which to store the results.

The name of this file is load.sh.

#!/bin/bash

# $1 is the filename
# $2 is the target directory name
hadoop dfs -put "$1" /stage
pig -param STAGE="/stage/$1" -param TARGET="/enterprise/$2" pig_cleaning.pig
hadoop dfs -rm "/stage/$1"

This script loads the file into the /stage directory, runs the pig program to clean the file of the null character (^@), stores the output to the /enterprise directory, and then removes the stage file.

Executing the script is as easy as:

./load.sh employees.txt employees

Now what about Greenplum Database? Here is how you can READ the data in Hadoop from the database. Note that in this example, I have Hadoop and Greenplum Database on the same single host. Typically, these will be on separate hosts, and instead of localhost you would use the name of the NameNode, such as hdm1.

create schema enterprise;

create external table enterprise.employees
(fname text,
 lname text,
 address1 text,
 city text,
 state text,
 zip text,
 job text,
 region text)
 LOCATION ('gphdfs://localhost:8020/enterprise/employees/part*')
 FORMAT 'TEXT' (delimiter '|');

And now let’s execute a SELECT statement.

gpdb=# select * from enterprise.employees;
  fname  |  lname   |     address1     |   city    | state |  zip  |      job       |  region   
---------+----------+------------------+-----------+-------+-------+----------------+-----------
 Jon     | Roberts  | 123 Main Street  | New York  | NY    | 10202 | Sales Engineer | NorthEast
 Abby    | Cromwell | 77 Juniper Drive | Raleigh   | NC    | 27605 | CIO            | NorthWest
 Lilly   | Vargas   | 7894 Rayna Rd    | Atlanta   | GA    | 30301 | VP Sales       | SouthEast
 Emily   | Palmer   | 224 Warren Drive | St. Louis | MO    | 63101 | VP Marketing   | MidWest
 Winston | Rigby    | 84 West Road     |           | CA    |       | President      | West
(5 rows)
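
The other goal from the start of this example was a data mart in Greenplum Database. Since the External Table behaves like any other table in a query, one simple way to build that mart is to materialize it into a regular table. This is just a minimal sketch; the employees_mart name and the distribution key are assumptions, not part of the original example.

-- Materialize the Hadoop data into a local Greenplum table for the data mart.
-- The table name and DISTRIBUTED BY column are illustrative choices.
create table enterprise.employees_mart as
select * from enterprise.employees
distributed by (lname);

-- The mart table can now be queried like any other Greenplum table.
select region, count(*) from enterprise.employees_mart group by region;

Because each segment reads its share of the files in /enterprise/employees, the load into the mart table runs in parallel, just like the SELECT above.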

Conclusion
Hadoop is being used by enterprises to create a Data Lake. Once the data is there, it is fast and easy to transform it. And with Greenplum Database, it is easy to use SQL tools to access the data.
