Nested Loops, BETWEEN, and generate_series

Here is a common scenario in a data warehouse. You have designed yourself a date dimension and you have another table with a begin and end time for some activity.

The date dimension might look something like this:

CREATE TABLE hr.date_dim
(
date_key timestamp without time zone NOT NULL,
date_week timestamp without time zone NOT NULL,
date_month timestamp without time zone NOT NULL,
date_quarter timestamp without time zone NOT NULL,
date_year timestamp without time zone NOT NULL
)
DISTRIBUTED BY (date_key);

And you activity table would look something like this:

CREATE TABLE hr.job_history
(
employee_id numeric NOT NULL,
start_date timestamp without time zone NOT NULL,
end_date timestamp without time zone,
job_id character varying(10),
department_id numeric,
CONSTRAINT job_history_pkey PRIMARY KEY (employee_id, start_date)
)
DISTRIBUTED BY (employee_id, start_date);

One way to write the query joining the two tables is with BETWEEN. Here is what it looks like:

SELECT *
FROM hr.job_history jh
JOIN hr.date_dim d ON d.date_key BETWEEN jh.start_date AND jh.end_date;

The explain plan looks like this on my local install on my Mac:

Gather Motion 2:1  (slice2; segments: 2)  (cost=2.42..2301.81 rows=2494 width=78)
  ->  Nested Loop  (cost=2.42..2301.81 rows=2494 width=78)
        Join Filter: d.date_key >= jh.start_date AND d.date_key   Seq Scan on date_dim d  (cost=0.00..54.89 rows=2245 width=40)
        ->  Materialize  (cost=2.42..2.62 rows=10 width=38)
              ->  Broadcast Motion 2:2  (slice1; segments: 2)  (cost=0.00..2.40 rows=10 width=38)
                    ->  Seq Scan on job_history jh  (cost=0.00..2.10 rows=5 width=38)

Notice it is doing a Nested Loop operation. Yuck.

And here is the one that utilizes generate_series:

SELECT *
FROM (
SELECT *, start_date + interval '1 day' * (generate_series(0, (EXTRACT('days' FROM end_date - start_date)::int))) AS mydate
FROM hr.job_history
) AS jh
JOIN hr.date_dim d ON jh.mydate = d.date_key;

Look at the explain plan now!

Gather Motion 2:1  (slice2; segments: 2)  (cost=2.70..68.96 rows=7 width=166)
  ->  Hash Join  (cost=2.70..68.96 rows=7 width=166)
        Hash Cond: d.date_key = jh.mydate
        ->  Seq Scan on date_dim d  (cost=0.00..54.89 rows=2245 width=40)
        ->  Hash  (cost=2.58..2.58 rows=5 width=126)
              ->  Redistribute Motion 2:2  (slice1; segments: 2)  (cost=0.00..2.58 rows=5 width=126)
                    Hash Key: jh.mydate
                    ->  Result  (cost=0.00..2.27 rows=5 width=38)
                          ->  Seq Scan on job_history  (cost=0.00..2.27 rows=5 width=38)

It is doing a Hash Join and the cost has dropped significantly.

The lesson here is to avoid Nested Loops in Greenplum and one way to avoid this is to use an equal join instead of between. To achieve this, you can use the generate_series function in Greenplum as demonstrated here.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.