Greenplum Database Connecting to Hortonworks UPDATED

Hortonworks HDP 2.1 is now officially supported by Pivotal in Greenplum Database versions 4.3.3.x and 4.2.8.x by using External Tables.

Greenplum Database 4.3.3 Documentation
Greenplum Database 4.2.8 Documentation

The documentation covers the basic configuration, but I wanted to provide more detailed instructions.

Hortonworks
For my test, I downloaded the Hortonworks HDP 2.1 Sandbox. Next, I started the instance and created a user named “gpadmin” just to make it easier to connect to the cluster.
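
If you want to do the same, here is a minimal sketch of creating that user on the sandbox (run as root; the user name and the /user/gpadmin HDFS directory are just my choices, not anything HDP requires):

# create the OS user on the sandbox
useradd gpadmin
passwd gpadmin
# give the new user an HDFS home directory (run the hdfs commands as the hdfs superuser)
su - hdfs -c "hdfs dfs -mkdir -p /user/gpadmin"
su - hdfs -c "hdfs dfs -chown gpadmin:gpadmin /user/gpadmin"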

Pivotal Greenplum Database
Step 1: Download the Greenplum Database Trial Virtual Machine and start the cluster.

Step 2: As root, add the Hortonworks repo

wget -nv http://public-repo-1.hortonworks.com/HDP/centos6/2.x/updates/2.1.7.0/hdp.repo -O /etc/yum.repos.d/hdp.repo

Note: this will need to be done on all nodes in a full Greenplum cluster.
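
If you want to verify the repo was registered (an optional sanity check; the repo id will look something like HDP-2.1.7.0):

yum repolist | grep -i hdp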

Step 3: As root, install the Hadoop client

yum install hadoop* openssl

Note: this will need to be done on all nodes in a full Greenplum cluster.
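
To confirm the client bits landed, something like this should print the Hadoop version and the installed packages:

hadoop version
rpm -qa | grep -i hadoop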

Step 4: As root, remove the previous OpenJDK and install the Oracle JDK. You will need to download the Oracle JDK RPM from Oracle.

yum remove java
rpm -i jdk-7u67-linux-x64.rpm

Note: this will need to be done on all nodes in a full Greenplum cluster.
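
The Oracle JDK RPM installs under /usr/java and maintains a /usr/java/latest symlink, which is what JAVA_HOME points at in Step 6. A quick check after the install:

java -version
readlink -f /usr/java/latest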

Step 5: As gpadmin, start the Greenplum database and set the Hadoop target version to Hortonworks

gpconfig -c gp_hadoop_target_version -v hdp2
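
You can confirm the setting took (gpconfig writes it to the master and segment postgresql.conf files, and the restart in Step 7 picks it up):

gpconfig -s gp_hadoop_target_version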

Step 6: As gpadmin, edit your .bashrc and add the following

export JAVA_HOME=/usr/java/latest
export HADOOP_HOME=/usr/lib/hadoop/client

Note: this will need to be done on all nodes in a full Greenplum cluster.

Step 7: As gpadmin, restart the database

source .bashrc
gpstop -r -a
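
To confirm the cluster came back up cleanly (optional):

gpstate -b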

Step 8: As root, add a host entry for the Hadoop cluster on the Master and Standby Master

echo "192.168.239.219 hdp21" >> /etc/hosts

Note: this will need to be done on all nodes in a full Greenplum cluster, and all Hortonworks nodes should be in the hosts file too. DNS can also be used.
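
Before building tables, it is worth checking that the Greenplum master can actually reach the sandbox's NameNode. A minimal check as gpadmin, assuming the HDP default NameNode port of 8020:

ping -c 1 hdp21
hdfs dfs -ls hdfs://hdp21:8020/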

Build Tables

DROP EXTERNAL TABLE IF EXISTS ext_foo;
CREATE WRITABLE EXTERNAL TABLE ext_foo
(i int, bar text) LOCATION ('gphdfs://hdp21/foo_bar') FORMAT 'text' (delimiter '|' null 'null');
--Query returned successfully with no result in 60 ms.
INSERT INTO ext_foo SELECT i, 'bar_' || i FROM generate_series(1, 100) AS i;
--Query returned successfully: 100 rows affected, 2636 ms execution time.
DROP EXTERNAL TABLE IF EXISTS ext_get_foo;
CREATE EXTERNAL TABLE ext_get_foo
(i int, bar text) LOCATION ('gphdfs://hdp21/foo_bar') FORMAT 'text' (delimiter '|' null 'null');
--Query returned successfully with no result in 28 ms.
SELECT * FROM ext_get_foo ORDER BY i LIMIT 10;
/*
1;"bar_1"
2;"bar_2"
3;"bar_3"
4;"bar_4"
5;"bar_5"
6;"bar_6"
7;"bar_7"
8;"bar_8"
9;"bar_9"
10;"bar_10"
*/
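
As a final check, you can look at what gphdfs actually wrote on the Hadoop side. Each Greenplum segment writes its own file under the target directory, so on the sandbox (or anywhere with HDFS access) something like this should list the files and count the 100 rows:

hdfs dfs -ls /foo_bar
hdfs dfs -cat /foo_bar/* | wc -l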

4 thoughts on “Greenplum Database Connecting to Hortonworks UPDATED”

  1. er

    Can you explain what the new position is for Greenplum in the Hadoop world?

    Do you recommend using GP only as an MPP SQL engine, using only external tables in HDFS?

    1. Jon (post author)

      Greenplum will work best with local data. It will require less motion to get the data to a place to be processed and the optimizer will have statistics on the tables so it can generate the best query plan possible. External Tables to HDFS is a great way to move data between the Hadoop and Greenplum ecosystems.

      If you want a great SQL query engine for HDFS, then use HAWQ. It can run all 99 TPC-DS queries without modification. No other query engine for HDFS can do that. It is also much faster than the other query engines. HAWQ 2.0 is coming out soon which has so many improvements. It is all open source too. You can’t go wrong with HAWQ for querying data in HDFS.

  2. Gurjeet

    Hi Jon,

    Are the above steps valid for Greenplum DB version 4.3.5.3?
    Could you please share the link where we can download the installation software for the Hortonworks setup?
