Added a new configuration setting so you can set the Max Length for gpfdist. To use it, add a new value named “gpfdistMaxLength” to your gplink.properties file.
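As a rough illustration, the sketch below reads a gpfdistMaxLength value from properties-style text and checks it against the range gpfdist documents for its own -m (max length) option: 32 KB to 256 MB, in bytes. It is not GPLink code. The post does not say how GPLink validates the setting or passes it to gpfdist, so the helper names and the sample value are assumptions.

```python
# Sketch only, not GPLink code. The bounds below are the ones gpfdist
# documents for its -m option; whether GPLink checks them is an assumption.

GPFDIST_MIN_LENGTH = 32 * 1024            # 32 KB, also the gpfdist default
GPFDIST_MAX_LENGTH = 256 * 1024 * 1024    # 256 MB upper bound


def read_max_length(properties_text, key="gpfdistMaxLength"):
    """Return the integer value of `key`, or None if the key is absent."""
    for raw_line in properties_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith(("#", "!")):
            continue  # skip blank lines and comments
        name, sep, value = line.partition("=")
        if sep and name.strip() == key:
            return int(value.strip())
    return None


def is_valid_max_length(value):
    """True when the value falls inside gpfdist's documented -m range."""
    return GPFDIST_MIN_LENGTH <= value <= GPFDIST_MAX_LENGTH


sample = "gpfdistMaxLength=1048576\n"     # hypothetical entry: 1 MB rows
value = read_max_length(sample)
print(value, is_valid_max_length(value))
```

Larger values make gpfdist allocate more memory per request, so it is usually worth raising the limit only when rows are too wide for the default.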
GPLink has been tested successfully with Hive. This is most useful in the Greenplum environment where you want to query a Hive table. HAWQ has the PXF protocol which supports Hive and is very performant. Please use PXF in HAWQ to access Hive. But with Greenplum, gplink now supports Hive!
The setup is covered in the gplink README file, but the configuration is a bit more involved than for a typical JDBC driver. This is because of how Hive is configured for logging. The configuration has also changed across Hive versions, which makes it more complex.
You will need the following jar files from your Hadoop cluster.
I tested with a Hortonworks cluster with Hive 126.96.36.199.4.
Older versions of Hive may have duplicate SLF4J bindings and fail to work properly. The error message is “Class path contains multiple SLF4J bindings”. If you get this, remove the slf4j-log4j12.jar file from the jar/ directory, source the gplink_path.sh file, and then try again.
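If it is not obvious which jars are clashing, a small script can list them. The sketch below is a hypothetical helper, not part of GPLink. It relies on the fact that SLF4J 1.x finds a binding by locating the class org/slf4j/impl/StaticLoggerBinder on the classpath, so any jar that contains that class counts as a binding.

```python
import zipfile
from pathlib import Path

# SLF4J 1.x discovers a logging binding by looking for this class.
BINDER_CLASS = "org/slf4j/impl/StaticLoggerBinder.class"


def find_slf4j_bindings(jar_dir):
    """Return the sorted names of jars in jar_dir that ship an SLF4J binding."""
    found = []
    for jar in sorted(Path(jar_dir).glob("*.jar")):
        try:
            with zipfile.ZipFile(jar) as archive:
                if BINDER_CLASS in archive.namelist():
                    found.append(jar.name)
        except zipfile.BadZipFile:
            continue  # not a readable jar; ignore it
    return found
```

Calling `find_slf4j_bindings("jar")` from the directory that holds GPLink's jar/ folder (an assumed layout) returns the jars that carry a binding. More than one name in the result corresponds to the error above.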
The extraProps value for external connections wasn’t being parsed correctly. The extraProps is most useful for Oracle in setting the fetch size. If you are using Oracle as a source with gplink, be sure to upgrade to this new version and change your connection from “extraProperies” or “extraProperties” to “extraProps”.
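The rename can be scripted if you have several connection files. The following is a minimal sketch, assuming the connection files use plain `key=value` lines. The function name is hypothetical, and the sample value `defaultRowPrefetch=2000` is only an illustration of an Oracle fetch-size property, not a recommended setting.

```python
import re

# Old or misspelled key names mentioned in the release note.
OLD_KEYS = ("extraProperies", "extraProperties")
NEW_KEY = "extraProps"

# Match an old key at the start of a line, keeping indentation and the "=".
_KEY_PATTERN = re.compile(
    r"^([ \t]*)(?:%s)([ \t]*=)" % "|".join(OLD_KEYS), re.MULTILINE
)


def rename_extra_props(connection_text):
    """Rewrite old key names to extraProps, leaving the values untouched."""
    return _KEY_PATTERN.sub(
        lambda m: m.group(1) + NEW_KEY + m.group(2), connection_text
    )


print(rename_extra_props("extraProperties=defaultRowPrefetch=2000\n"), end="")
```

Lines that already use extraProps are left as they are, so the function is safe to run more than once on the same file.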
GPLink has a new version with the only change being the name of a script variable. The variable was renamed because it is shared between this project and another one I work on (TPC-DS). This common variable name would cause issues with the TPC-DS scripts when logging activities.
GPLink has a new version that fixes a situation where you have started other gpfdist processes on your ETL server with the parameters in a different order than the one GPLink uses. The older versions would fail with a parsing error; this new version handles the situation. As before, if the ports of other gpfdist processes are outside the Upper and Lower port range defined by GPLink, then those gpfdist processes are ignored.
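The idea of order-independent parsing can be sketched as follows. This is not GPLink's implementation (which is in Java). It only shows one way to find the `-p` port of a gpfdist command line wherever it appears and to ignore processes outside a port range. The function names, paths, and port numbers are hypothetical.

```python
import shlex


def gpfdist_port(command_line):
    """Return the -p port from a gpfdist command line, or None if absent.

    The option is located by name, so its position among the other
    arguments does not matter.
    """
    args = shlex.split(command_line)
    for i, arg in enumerate(args):
        if arg == "-p" and i + 1 < len(args) and args[i + 1].isdigit():
            return int(args[i + 1])
    return None


def ports_in_range(command_lines, lower, upper):
    """Ports of gpfdist processes inside [lower, upper]; others are ignored."""
    ports = []
    for line in command_lines:
        port = gpfdist_port(line)
        if port is not None and lower <= port <= upper:
            ports.append(port)
    return ports


running = [
    "gpfdist -p 8001 -d /data -l /tmp/gpfdist_8001.log",
    "gpfdist -d /data -l /tmp/gpfdist_8002.log -p 8002",  # different order
    "gpfdist -d /other -p 9500",                          # outside the range
]
print(ports_in_range(running, 8000, 8999))
```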
I’ve made some enhancements to gplink.
1. Automatically starts gpfdist when creating External Tables
2. New command line programs: gpltable (create table), gpldrop (drop table), and gplstart (start any missing gpfdist processes)
3. Removal of the need for the Greenplum or HAWQ client
4. No additional database objects are needed
5. Drop External Table handled from gplink also stops the gpfdist process it had started
6. Most code moved to Java with some basic shell scripts to automate calling the Java methods
Documentation and Source Code
gplink GitHub Repository
I have updated GPLink to handle timestamp and date format problems. This has only been a problem with Oracle’s JDBC driver, but it may be needed with other databases.
I created a new project that simplifies creating Greenplum or HAWQ External Tables that use gpfdist to stream data from any valid JDBC source. It is like pointing gpfdist at Sqoop to pull data without landing a file, but gplink ensures that the data is cleansed first so that it will be readable by Greenplum or HAWQ.
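To make “cleansed” concrete, here is a minimal sketch of one plausible kind of cleansing. The post does not describe GPLink's actual rules, so this is an assumption. It follows the PostgreSQL COPY text-format conventions, which Greenplum's TEXT-format external tables share: backslash is the escape character and `\N` marks NULL. The pipe delimiter and the function names are hypothetical choices.

```python
def cleanse_field(value, delimiter="|"):
    """Escape one source value so it is safe in a delimited text row."""
    if value is None:
        return r"\N"                        # text-format NULL marker
    text = str(value)
    text = text.replace("\\", "\\\\")       # escape the escape character first
    text = text.replace("\n", "\\n").replace("\r", "\\r")
    text = text.replace(delimiter, "\\" + delimiter)
    return text


def format_row(values, delimiter="|"):
    """Join cleansed values into a single line for the external table."""
    return delimiter.join(cleanse_field(v, delimiter) for v in values)


print(format_row([1, None, "two\nlines", "a|b"]))
```

Without this step, a stray delimiter or embedded newline in a source column would shift or split the row, and the external table would reject it.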
This will work with PostgreSQL (yeah!), MySQL (yuck), DB2, Informix, etc. You will have to download all third-party JDBC drivers separately.
This is a new project so I’m looking forward to any feedback you can provide.