
Creating a Date Dimension

Here is another technique worth mentioning, this time for building a date dimension.  One way to do this is to create a loop and iterate over it, inserting one record per day.  That is slow in virtually every database, and there is a better way: a single INSERT statement.
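
For contrast, the loop-based approach I'm suggesting you avoid would look roughly like this in PL/pgSQL (a sketch only; the function name is made up, and it assumes the hr.date_dim table defined below):

CREATE OR REPLACE FUNCTION hr.load_date_dim_loop() RETURNS void AS $$
DECLARE
    d timestamp := '1995-01-01'::timestamp;
BEGIN
    -- One INSERT per day: row-by-row work that an MPP database handles poorly.
    WHILE d <= '2007-04-16'::timestamp LOOP
        INSERT INTO hr.date_dim
        VALUES (d,
                date_trunc('week', d),
                date_trunc('month', d),
                date_trunc('quarter', d),
                date_trunc('year', d));
        d := d + interval '1 day';
    END LOOP;
END;
$$ LANGUAGE plpgsql;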

First create the table:

DROP TABLE if exists hr.date_dim;

CREATE TABLE hr.date_dim
(
  date_key timestamp without time zone NOT NULL,
  date_week timestamp without time zone NOT NULL,
  date_month timestamp without time zone NOT NULL,
  date_quarter timestamp without time zone NOT NULL,
  date_year timestamp without time zone NOT NULL
)
DISTRIBUTED BY (date_key);

And now the magic:

INSERT INTO hr.date_dim
SELECT  mydate AS date_key,
        date_trunc('week', mydate) AS date_week,
        date_trunc('month', mydate) AS date_month,
        date_trunc('quarter', mydate) AS date_quarter,
        date_trunc('year', mydate) AS date_year
FROM    (
        SELECT '1995-01-01'::timestamp + interval '1 day' * (generate_series(0, (EXTRACT('days' FROM '2007-04-16'::timestamp - '1995-01-01'::timestamp)::int))) AS mydate
        ) AS sub;
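
A quick sanity check (optional) confirms there is one row per calendar day in the range:

SELECT COUNT(*) AS days_loaded,
       MIN(date_key) AS first_day,
       MAX(date_key) AS last_day
FROM   hr.date_dim;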

My example builds the date dimension with every day populated from January 1, 1995 through April 16, 2007.  I chose those dates because they are the minimum and maximum dates in my HR table with job history information.
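
If you would rather not hard-code the boundaries, you can derive them from the source table instead.  Here is a sketch that assumes the bounds come from hr.job_history (shown later in this post); swap in whatever table and columns hold your dates:

INSERT INTO hr.date_dim
SELECT  mydate AS date_key,
        date_trunc('week', mydate) AS date_week,
        date_trunc('month', mydate) AS date_month,
        date_trunc('quarter', mydate) AS date_quarter,
        date_trunc('year', mydate) AS date_year
FROM    (
        SELECT min_date + interval '1 day' * generate_series(0, EXTRACT('days' FROM max_date - min_date)::int) AS mydate
        FROM   (
               SELECT MIN(start_date) AS min_date,
                      MAX(COALESCE(end_date, start_date)) AS max_date
               FROM   hr.job_history
               ) AS bounds
        ) AS sub;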

This solution is similar to my other post about using generate_series to avoid a nested loop, since it relies on generate_series again.  It is a powerful and easy way to create a dataset dynamically without resorting to a loop.

Nested Loops, BETWEEN, and generate_series

Here is a common scenario in a data warehouse. You have designed yourself a date dimension and you have another table with a begin and end time for some activity.

The date dimension might look something like this:

CREATE TABLE hr.date_dim
(
date_key timestamp without time zone NOT NULL,
date_week timestamp without time zone NOT NULL,
date_month timestamp without time zone NOT NULL,
date_quarter timestamp without time zone NOT NULL,
date_year timestamp without time zone NOT NULL
)
DISTRIBUTED BY (date_key);

And your activity table would look something like this:

CREATE TABLE hr.job_history
(
employee_id numeric NOT NULL,
start_date timestamp without time zone NOT NULL,
end_date timestamp without time zone,
job_id character varying(10),
department_id numeric,
CONSTRAINT job_history_pkey PRIMARY KEY (employee_id, start_date)
)
DISTRIBUTED BY (employee_id, start_date);
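
If you want to follow along, a few sample rows will do (the values here are illustrative; the actual post used the HR sample schema's job history):

INSERT INTO hr.job_history VALUES (101, '1997-09-21', '2001-06-27', 'AC_ACCOUNT', 110);
INSERT INTO hr.job_history VALUES (102, '2001-01-13', '2006-07-24', 'IT_PROG', 60);
INSERT INTO hr.job_history VALUES (114, '1998-03-24', '1999-12-31', 'ST_CLERK', 50);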

One way to write the query joining the two tables is with BETWEEN. Here is what it looks like:

SELECT *
FROM hr.job_history jh
JOIN hr.date_dim d ON d.date_key BETWEEN jh.start_date AND jh.end_date;

The explain plan looks like this on my local install on my Mac:

Gather Motion 2:1  (slice2; segments: 2)  (cost=2.42..2301.81 rows=2494 width=78)
  ->  Nested Loop  (cost=2.42..2301.81 rows=2494 width=78)
        Join Filter: d.date_key >= jh.start_date AND d.date_key <= jh.end_date
        ->  Seq Scan on date_dim d  (cost=0.00..54.89 rows=2245 width=40)
        ->  Materialize  (cost=2.42..2.62 rows=10 width=38)
              ->  Broadcast Motion 2:2  (slice1; segments: 2)  (cost=0.00..2.40 rows=10 width=38)
                    ->  Seq Scan on job_history jh  (cost=0.00..2.10 rows=5 width=38)

Notice it is doing a Nested Loop operation. Yuck.

And here is the one that utilizes generate_series:

SELECT *
FROM (
SELECT *, start_date + interval '1 day' * (generate_series(0, (EXTRACT('days' FROM end_date - start_date)::int))) AS mydate
FROM hr.job_history
) AS jh
JOIN hr.date_dim d ON jh.mydate = d.date_key;

Look at the explain plan now!

Gather Motion 2:1  (slice2; segments: 2)  (cost=2.70..68.96 rows=7 width=166)
  ->  Hash Join  (cost=2.70..68.96 rows=7 width=166)
        Hash Cond: d.date_key = jh.mydate
        ->  Seq Scan on date_dim d  (cost=0.00..54.89 rows=2245 width=40)
        ->  Hash  (cost=2.58..2.58 rows=5 width=126)
              ->  Redistribute Motion 2:2  (slice1; segments: 2)  (cost=0.00..2.58 rows=5 width=126)
                    Hash Key: jh.mydate
                    ->  Result  (cost=0.00..2.27 rows=5 width=38)
                          ->  Seq Scan on job_history  (cost=0.00..2.27 rows=5 width=38)

It is doing a Hash Join and the cost has dropped significantly.

The lesson here is to avoid Nested Loops in Greenplum, and one way to do that is to use an equi-join instead of BETWEEN. To achieve this, you can use the generate_series function as demonstrated here.
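
One caveat worth noting: end_date is nullable in hr.job_history, and the generate_series version drops any row where end_date is NULL because the upper bound of the series becomes NULL. If you want open-ended jobs to join through today (my assumption about the desired behavior), a COALESCE handles it:

SELECT *
FROM (
SELECT *, start_date + interval '1 day' * (generate_series(0, (EXTRACT('days' FROM COALESCE(end_date, CURRENT_DATE) - start_date)::int))) AS mydate
FROM hr.job_history
) AS jh
JOIN hr.date_dim d ON jh.mydate = d.date_key;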

Dynamic External Tables

A customer encouraged me to write a post about how it is possible to use shell scripts to dynamically alter the results from an External Web Table.

For this example, I will create a table in both Greenplum and SQL Server that contains a list of files.  I want the External Web Table to return only the records I need, determined dynamically, rather than retrieving all of the rows from SQL Server or recreating the External Web Table each time. Note: Outsourcer has “Append” jobs for getting just the new records, but it has been designed for integer data types, not character data like in this example.

Greenplum Table:

CREATE TABLE myfiles
(filename TEXT NOT NULL PRIMARY KEY,
dir TEXT NOT NULL)
DISTRIBUTED BY (filename);

INSERT INTO myfiles VALUES ('2012-01-01_10-30-00_data.txt', '/data/dump');
INSERT INTO myfiles VALUES ('2012-01-02_10-30-00_data.txt', '/data/dump');
INSERT INTO myfiles VALUES ('2012-01-03_10-30-00_data.txt', '/data/dump');
INSERT INTO myfiles VALUES ('2012-01-04_10-30-00_data.txt', '/data/dump');
INSERT INTO myfiles VALUES ('2012-01-05_10-30-00_data.txt', '/data/dump');

SQL Server Table:

CREATE TABLE myfiles
(filename VARCHAR(500) NOT NULL PRIMARY KEY,
 dir VARCHAR(MAX) NOT NULL);

INSERT INTO myfiles VALUES ('2012-01-01_10-30-00_data.txt', '/data/dump');
INSERT INTO myfiles VALUES ('2012-01-02_10-30-00_data.txt', '/data/dump');
INSERT INTO myfiles VALUES ('2012-01-03_10-30-00_data.txt', '/data/dump');
INSERT INTO myfiles VALUES ('2012-01-04_10-30-00_data.txt', '/data/dump');
INSERT INTO myfiles VALUES ('2012-01-05_10-30-00_data.txt', '/data/dump');
--Note that these three rows don't exist in Greenplum
INSERT INTO myfiles VALUES ('2012-01-06_10-30-00_data.txt', '/data/dump');
INSERT INTO myfiles VALUES ('2012-01-07_10-30-00_data.txt', '/data/dump');
INSERT INTO myfiles VALUES ('2012-01-08_10-30-00_data.txt', '/data/dump');

Shell Script: /Users/gpadmin/get_myfiles.sh
Please note that the use of Outsourcer.jar requires creating a connection in os.ext_connection. Review the Outsourcer documentation on creating External Tables for more information.

#!/bin/bash
# Find the newest file already loaded, then have Outsourcer pull only newer rows from SQL Server.
max=`psql -A -t -c "SELECT MAX(filename) FROM myfiles"`
java -classpath /usr/local/os/jar/Outsourcer.jar:/usr/local/os/jar/sqljdbc4.jar:/usr/local/os/jar/ojdbc6.jar -Xms128m -Xmx256m ExternalData gpdb 5432 1 "SELECT filename, dir FROM poc.dbo.myfiles WHERE filename > '$max'"
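
Also make sure the script is executable by gpadmin, since the External Web Table below will EXECUTE it directly:

chmod +x /Users/gpadmin/get_myfiles.sh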

Dynamic External Web Table in Greenplum

CREATE EXTERNAL WEB TABLE ext_myfiles
(filename TEXT,
 dir TEXT)
EXECUTE E'/Users/gpadmin/get_myfiles.sh' ON MASTER
FORMAT 'text' (delimiter '|' null 'null' escape E'\\')
ENCODING 'UTF8';

Insert statement in Greenplum

INSERT INTO myfiles SELECT * FROM ext_myfiles;

This technique will only retrieve the rows from SQL Server that need to be inserted. When dealing with millions of records, this will greatly improve performance.
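
A quick way to confirm the incremental load behaved as expected (counts assume the sample data above):

SELECT COUNT(*) FROM myfiles;       -- 8 rows: the 5 original files plus the 3 new ones
SELECT MAX(filename) FROM myfiles;  -- 2012-01-08_10-30-00_data.txt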

You can adjust your script to retrieve data based on whatever criteria you want, all without having to recreate your External Web Table.
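
For instance, since the file names begin with a YYYY-MM-DD prefix, a variant of the script could pull only the last week of files. This is a sketch only; the cutoff logic is my assumption, and the date -v syntax is for a Mac (use date -d '7 days ago' on Linux):

#!/bin/bash
# Hypothetical variant: pull only files named within the last 7 days,
# relying on the YYYY-MM-DD prefix in the file names.
cutoff=`date -v-7d +%Y-%m-%d`
java -classpath /usr/local/os/jar/Outsourcer.jar:/usr/local/os/jar/sqljdbc4.jar:/usr/local/os/jar/ojdbc6.jar -Xms128m -Xmx256m ExternalData gpdb 5432 1 "SELECT filename, dir FROM poc.dbo.myfiles WHERE filename >= '$cutoff'"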