Monthly Archives: February 2013

Hadoop Data Lake and Transforming Data

A Data Lake is a term describe a large enterprise repository of data stored in Hadoop. More and more companies are concluding that a Data Lake is the right solution over a traditional ETL process and a restricted EDW. The Data Lake is inexpensive, scales easily, uses commodity hardware, provides a very flexible schema, and enables an easy way to transform data in parallel.

So why not use a relational database for this Data Lake? In a relational database, the schema is defined first and then data is forced into it. With Hadoop, you first load the data and then apply a schema as you read it out. This means adding new data to Hadoop is easier and faster because you don’t have to first define the schema.

Processing is also in parallel. You can transform the data using Hadoop tools like Pig to then be loaded into a relational data store or just use it in Hadoop.

Greenplum Database External Tables
There are use cases where a relational database like Greenplum database is easier to use and performs better than Hadoop. A great feature of Greenplum database is the ability to create an External Table to Hadoop. These External Tables can be defined to either READ or WRITE data to Hadoop. Because Greenplum is an MPP database, each segment connects to Hadoop to READ/WRITE data. This makes the integration of Greenplum Database with Hadoop much faster than a single threaded approach that you might see with other database products.

Transform Example
So let’s say you get an employee file from a company you just acquired. You want to take this file and then make it available to the Enterprise in Hadoop as well as a data mart in Greenplum database.

We will first load the data into a stage directory, run a program to remove a control character, and then put it in the enterprise directory.

hadoop dfs -mkdir /stage
hadoop dfs -mkdir /enterprise

Next, I use a Pig program to remove a control character. In this example, I’m removing the “null character”.

The name of this file is pig_cleaning.pig.

dirty = load '$STAGE' as (data:chararray);
clean = foreach dirty generate REPLACE(data, '\\u0000', '');
store clean into '$TARGET';

Here is my employees.txt file I created. Notice the ^@ symbols. These are control characters as they appear when using vi. The pig_cleaning.pig script will remove these control characters.

Jon|Roberts|123 Main Street|New York|NY|10202|Sales Engineer|NorthEast
Abby|Cromwell|77 Juniper Drive|Raleigh|NC|27605|CIO|NorthWest
Lilly|Vargas|7894 Rayna Rd|Atlanta|GA|30301|VP Sales|SouthEast
Emily|Palmer|224 Warren Drive|St. Louis|MO|63101|VP Marketing|MidWest
Winston|Rigby|84 West Road|^@|CA|^@|President|West

Now I created a shell script that accepts a filename to load and the name of the directory in Hadoop to store the results in.

The name of this file is


# $1 is the filename
# $2 is the target directory name
hadoop dfs -put $1 /stage
pig -param STAGE=/stage/$1 -param TARGET=/enterprise/$2 pig_cleaning.pig
hadoop dfs -rm /stage/$1

This script loads the file into the /stage directory, runs the pig program to clean the file of the null character (^@), stores the output to the /enterprise directory, and then removes the stage file.

Executing the script is as easy as:

./ employees.txt employees

Now what about Greenplum database? Here is how you can READ that data in Hadoop from the database. Note that in this example, I have Hadoop and Greenplum database on the same single host. Typically, these will be on separate hosts and instead of localhost, you would have the name of the NameNode like hdm1.

create schema enterprise;

create external table enterprise.employees
(fname text,
 lname text,
 address1 text,
 city text,
 state text,
 zip text,
 job text,
 region text)
 LOCATION ('gphdfs://localhost:8020/enterprise/employees/part*')
 FORMAT 'TEXT' (delimiter '|');

And now let’s execute a SELECT statement.

gpdb=# select * from enterprise.employees;
  fname  |  lname   |     address1     |   city    | state |  zip  |      job       |  region   
 Jon     | Roberts  | 123 Main Street  | New York  | NY    | 10202 | Sales Engineer | NorthEast
 Abby    | Cromwell | 77 Juniper Drive | Raleigh   | NC    | 27605 | CIO            | NorthWest
 Lilly   | Vargas   | 7894 Rayna Rd    | Atlanta   | GA    | 30301 | VP Sales       | SouthEast
 Emily   | Palmer   | 224 Warren Drive | St. Louis | MO    | 63101 | VP Marketing   | MidWest
 Winston | Rigby    | 84 West Road     |           | CA    |       | President      | West
(5 rows)

Hadoop is being used by Enterprises to create a Data Lake. Once there, it is fast and easy to transform the data. And with Greenplum database, it is easy to use SQL tools to access the data.

Sending Emails

A customer recently asked me how can they send an email from the command line to alert the DBAs when a backup starts or finishes. It can also be used to notify when something fails. So, here is my Python script that sends an email. You have to specify the SMTP server and this version assumes the SMTP doesn’t need credentials to send an email.

#!/usr/bin/env python
import sys, smtplib, string, getopt

def send_my_mail (email_smtp, email_from, email_to, email_subject, email_body):

  body = string.join((
    "From: %s" % email_from,
    "To: %s" % email_to,
    "Subject: %s" % email_subject,
  ), "\r\n")

  s = smtplib.SMTP(email_smtp)
  s.sendmail(email_from, email_to, body)

def main(argv):
   email_from = ''
   email_to = ''
   email_subject = ''
   email_body = ''
   email_smtp = ''
   if len(sys.argv)<11:
         print ' -f  -t  -s  -b  -d '
      opts, args = getopt.getopt(argv,"hf:t:d:b:s:")
   except getopt.GetoptError:
      print ' -f  -t  -s  -b  -d '
   for opt, arg in opts:
      if opt == '-h':
         print ' -f  -t  -s  -b  -d '
      elif opt in ("-f"):
         email_from = arg
      elif opt in ("-t"):
         email_to = arg
      elif opt in ("-d"):
         email_smtp = arg
      elif opt in ("-b"):
         email_body = arg
      elif opt in ("-s"):
         email_subject = arg

   send_my_mail (email_smtp, email_from, email_to, email_subject, email_body)

if __name__ == "__main__":

Most of the code is actually parsing the command line arguments. Wouldn’t it be nice to instead use a function in the database that doesn’t need to parse command line arguments? That is exactly what I did! I next created a function in Greenplum that sends an email too!

First you need to install plpythonu if you haven’t already.

  HANDLER plpython_call_handler;

Next, you create the function:

create or replace function fn_send_email(email_smtp text, email_from text, email_to text, email_subject text, email_body text) returns void as 
  import smtplib, string
  body = string.join(("From: %s" % email_from, "To: %s" % email_to, "Subject: %s" % email_subject, "", email_body), "\r\n")
  s = smtplib.SMTP(email_smtp)
  s.sendmail(email_from, email_to, body)
language plpythonu;

All of that code around parsing the command line is gone. It is very simple and easy to understand too.

Below is the function being used. Note: I used fake values for the SMTP server, email from, and email to.

select fn_send_email('', '', '', 'test subject', 'This is the body of my test message');