Monthly Archives: November 2014

Greenplum Database Connecting to Hortonworks UPDATED

Hortonworks 2.1 is now officially supported by Pivotal in the Greenplum Database version 4.3.3.x and 4.2.8.x by using External Tables.

Greenplum Database 4.3.3 Documentation
Greenplum Database 4.2.8 Documentation

The documentation has the basic configuration but I wanted to provide more detailed instructions.

Hortonworks
For my test, I downloaded the Hortonworks HDP 2.1 Sandbox. Next, I started the instance and I created a user named “gpadmin” just to make it easier for me to connect to the cluster.

Pivotal Greenplum Database
Step 1: Downloaded the Greenplum Database Trial Virtual Machine and started the cluster.

Step 2: As root, add the Hortonworks repo

wget -nv http://public-repo-1.hortonworks.com/HDP/centos6/2.x/updates/2.1.7.0/hdp.repo -O /etc/yum.repos.d/hdp.repo

Note: this will need to be done on all nodes in a full Greenplum cluster.

Step 3: As root, install the Hadoop client

yum install hadoop* openssl

Note: this will need to be done on all nodes in a full Greenplum cluster.

Step 4: As root, remove previous OpenJDK and install Oracle JDK. You will need to download the Oracle JDK from Oracle.

yum remove java
rpm -i jdk-7u67-linux-x64.rpm

Note: this will need to be done on all nodes in a full Greenplum cluster.

Step 5: As gpadmin, start the Greenplum database and set the Hadoop target version to Hortonworks

gpconfig -c gp_hadoop_target_version -v hdp2

Step 6: As gpadmin, edit your .bashrc and add the following

export JAVA_HOME=/usr/java/latest
export HADOOP_HOME=/usr/lib/hadoop/client

Note: this will need to be done on all nodes in a full Greenplum cluster.

Step 7: As gpadmin, restart the database

source .bashrc
gpstop -r -a

Step 8: As root, add host entry to the Master and Standby-Master for the Hadoop cluster

echo "192.168.239.219 hdp21" >> /etc/hosts

Note: This will need to be done on all nodes in a full Greenplum cluster and all Hortonworks nodes should be in the hosts file too. DNS can also be used.

Build Tables

DROP EXTERNAL TABLE IF EXISTS ext_foo;
CREATE WRITABLE EXTERNAL TABLE ext_foo
(i int, bar text) LOCATION ('gphdfs://hdp21/foo_bar') FORMAT 'text' (delimiter '|' null 'null');
--Query returned successfully with no result in 60 ms.
INSERT INTO ext_foo SELECT i, 'bar_' || i FROM generate_series(1, 100) AS i;
--Query returned successfully: 100 rows affected, 2636 ms execution time.
DROP EXTERNAL TABLE IF EXISTS ext_get_foo;
CREATE EXTERNAL TABLE ext_get_foo
(i int, bar text) LOCATION ('gphdfs://hdp21/foo_bar') FORMAT 'text' (delimiter '|' null 'null');
--Query returned successfully with no result in 28 ms.
SELECT * FROM ext_get_foo ORDER BY i LIMIT 10;
/*
1;"bar_1"
2;"bar_2"
3;"bar_3"
4;"bar_4"
5;"bar_5"
6;"bar_6"
7;"bar_7"
8;"bar_8"
9;"bar_9"
10;"bar_10"
*/

Maintenance Script for HAWQ and Greenplum database

A common question I get is what maintenance activity is needed in HAWQ and Greenplum database. There are lots of ways to handle this and a lot of customers do it as part of their ETL processing. I put together a script that is pretty generic and should be very useful for most implementations.

The script is available on Github in a new project I created for scripts. As I come up with more scripts for various activities, I will put it in this Github repository.

Link to repository

maintain.sh
Note: Run this script on a regular basis such as weekly or monthly.

Enjoy!