Dynamic External Tables

A customer encouraged me to write a post about how it is possible to use shell scripts to dynamically alter the results from an External Web Table.

For this example, I will create a table in both Greenplum and SQL Server that will contain a list of files.  I want to just get the records from the External Web Table that I need dynamically rather than retrieving all of the rows from SQL Server or by recreating the External Web Table. Note: Outsourcer has “Append” jobs for getting just the new records but is has been designed for Integer data types, not character like in this example.

Greenplum Table:

CREATE TABLE myfiles
(filename TEXT NOT NULL PRIMARY KEY,
dir TEXT NOT NULL)
DISTRIBUTED BY (filename);

INSERT INTO myfiles VALUES ('2012-01-01_10-30-00_data.txt', '/data/dump');
INSERT INTO myfiles VALUES ('2012-01-02_10-30-00_data.txt', '/data/dump');
INSERT INTO myfiles VALUES ('2012-01-03_10-30-00_data.txt', '/data/dump');
INSERT INTO myfiles VALUES ('2012-01-04_10-30-00_data.txt', '/data/dump');
INSERT INTO myfiles VALUES ('2012-01-05_10-30-00_data.txt', '/data/dump');

SQL Server Table:

CREATE TABLE myfiles
(filename VARCHAR(500) NOT NULL PRIMARY KEY,
 dir VARCHAR(MAX) NOT NULL);

INSERT INTO myfiles VALUES ('2012-01-01_10-30-00_data.txt', '/data/dump');
INSERT INTO myfiles VALUES ('2012-01-02_10-30-00_data.txt', '/data/dump');
INSERT INTO myfiles VALUES ('2012-01-03_10-30-00_data.txt', '/data/dump');
INSERT INTO myfiles VALUES ('2012-01-04_10-30-00_data.txt', '/data/dump');
INSERT INTO myfiles VALUES ('2012-01-05_10-30-00_data.txt', '/data/dump');
--Note that these three rows don't exist in Greenplum
INSERT INTO myfiles VALUES ('2012-01-06_10-30-00_data.txt', '/data/dump');
INSERT INTO myfiles VALUES ('2012-01-07_10-30-00_data.txt', '/data/dump');
INSERT INTO myfiles VALUES ('2012-01-08_10-30-00_data.txt', '/data/dump');

Shell Script: /Users/gpadmin/get_myfiles.sh
Please note that the use of Outsourcer.jar requires creating a connection in os.ext_connection. Review the Outsourcer documentation on creating External Tables for more information.

max=`psql -A -t -c "SELECT MAX(filename) FROM myfiles"`
java -classpath /usr/local/os/jar/Outsourcer.jar:/usr/local/os/jar/sqljdbc4.jar:/usr/local/os/jar/ojdbc6.jar -Xms128m -Xmx256m ExternalData gpdb 5432 1 "SELECT filename, dir FROM poc.dbo.myfiles WHERE filename > '$max'"

Dynamic External Web Table in Greenplum

CREATE EXTERNAL WEB TABLE ext_myfiles
(filename TEXT,
 dir TEXT)
EXECUTE E'/Users/gpadmin/get_myfiles.sh' ON MASTER
FORMAT 'text' (delimiter '|' null 'null' escape E'\\')
ENCODING 'UTF8';

Insert statement in Greenplum

INSERT INTO myfiles SELECT * FROM ext_myfiles;

This technique will only retrieve the rows from SQL Server that need to be inserted. When dealing with millions of records, this will greatly improve performance.

You can manipulate your script to retrieve data based on anything you want all without having to recreate your External Web Table.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.