Autonomous Transactions for Logging

Updated July 17, 2013
***************
1. External Web Table to read data should run on HOST, not MASTER.
2. The /data/log_error.sh file needs to be created on every segment host.
3. Need to chmod 755 the .sh file on the segment hosts.
4. Need to touch /data/error.log on all of the segment hosts because the file needs to exist to prevent an error reading from an empty table.
***************

An Autonomous Transaction is not a feature of Greenplum Database, but it can be achieved with some features unique to Greenplum. If you aren’t familiar with an Autonomous Transaction, it is basically a new transaction created within another transaction that isn’t tied to the first transaction.

What is the point? Well, in database programming, you may run into an EXCEPTION that you want to RAISE to the calling program but you also want to log the error. Raising an ERROR in Greenplum rolls back your transaction so the INSERT to a log table also gets rolled back.

Here is a way to create a log table that enables you to RAISE an ERROR in your code but the INSERT doesn’t get rolled back!

First create the error_log table:

DROP EXTERNAL TABLE IF EXISTS error_log;

CREATE WRITABLE EXTERNAL WEB TABLE error_log
(
function_name text,
location int,
error_message text
)
EXECUTE '/data/log_error.sh'
FORMAT 'CSV' DISTRIBUTED RANDOMLY;

Next create the /data/log_error.sh file:

#!/bin/bash
while read in; do
    echo "$in" >> /data/error.log
done

Now you can read this error.log file directly if you want, or you can create another EXTERNAL TABLE to read the data like this:

DROP EXTERNAL TABLE IF EXISTS my_errors;

CREATE EXTERNAL WEB TABLE my_errors
(
function_name text,
location int,
error_message text
)
EXECUTE 'cat /data/error.log' ON HOST
FORMAT 'CSV';

So now to test this out:

gpdb=# BEGIN;
BEGIN
gpdb=# INSERT INTO error_log VALUES ('fn_build_mart', 7500, 'ERROR:  division by zero');
INSERT 0 1
gpdb=# ROLLBACK;
ROLLBACK
gpdb=# SELECT * FROM my_errors;
 function_name | location |      error_message      
---------------+----------+-------------------------
 fn_build_mart |     7500 | ERROR: division by zero
(1 row)

So even though I rolled back my transaction, the INSERT into the WRITABLE EXTERNAL WEB TABLE in Greenplum didn’t get rolled back, and thus you have created an Autonomous Transaction in Greenplum.
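
To use this in your own PL/pgSQL code, put the INSERT into error_log in the EXCEPTION block right before you re-raise the error. Here is a minimal sketch; the fn_build_mart function and its failing statement are hypothetical stand-ins, but the pattern applies to any function.

CREATE OR REPLACE FUNCTION fn_build_mart() RETURNS void AS
$$
BEGIN
    -- the real work goes here; this statement is just a stand-in that fails
    PERFORM 1/0;
EXCEPTION
    WHEN OTHERS THEN
        -- log the error; the write to the external web table survives the rollback
        INSERT INTO error_log VALUES ('fn_build_mart', 1000, 'ERROR: ' || sqlerrm);
        -- re-raise so the calling program still sees the failure
        RAISE EXCEPTION '%', sqlerrm;
END;
$$
LANGUAGE plpgsql;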

Outsourcer 3.1 and New Name!

Outsourcer 3.1 is now available for download on the Installation page. Updated Documentation is also available on the Documentation page.

What’s new?
Oracle

  • FetchSize is now configurable. To minimize round trips to Oracle, make this setting rather large (greater than 10,000), which increases export speed at the expense of needing more RAM. Adjust this setting up or down based on your environment. The default is 40,000.

SQL Server

  • Fix to better handle non-default schemas

DDL Refresh Type

  • Several customers have requested that Outsourcer just create the tables in Greenplum based on the DDL in the source without loading them. This new Refresh Type does just that.

New Name: PivotalGuru.com

Greenplum is now part of Pivotal and to follow this name change, I now have the PivotalGuru.com site!

Pivotal is an exciting and rather large startup company with investments from EMC, VMWare, and GE. I am proud to be part of this new company and very excited about the new offerings in Big Data and Cloud Computing that Pivotal will bring to market.

Hadoop Data Lake and Transforming Data

A Data Lake is a term used to describe a large enterprise repository of data stored in Hadoop. More and more companies are concluding that a Data Lake is the right solution over a traditional ETL process and a restricted EDW. The Data Lake is inexpensive, scales easily, uses commodity hardware, provides a very flexible schema, and enables an easy way to transform data in parallel.

Schema
So why not use a relational database for this Data Lake? In a relational database, the schema is defined first and then data is forced into it. With Hadoop, you first load the data and then apply a schema as you read it out. This means adding new data to Hadoop is easier and faster because you don’t have to first define the schema.

Parallel
Processing is also done in parallel. You can transform the data using Hadoop tools like Pig and then load it into a relational data store, or just use it in Hadoop.

Greenplum Database External Tables
There are use cases where a relational database like Greenplum database is easier to use and performs better than Hadoop. A great feature of Greenplum database is the ability to create an External Table to Hadoop. These External Tables can be defined to either READ or WRITE data to Hadoop. Because Greenplum is an MPP database, each segment connects to Hadoop to READ/WRITE data. This makes the integration of Greenplum Database with Hadoop much faster than a single threaded approach that you might see with other database products.

Transform Example
So let’s say you get an employee file from a company you just acquired. You want to take this file and then make it available to the Enterprise in Hadoop as well as a data mart in Greenplum database.

We will first load the data into a stage directory, run a program to remove a control character, and then put it in the enterprise directory.

hadoop dfs -mkdir /stage
hadoop dfs -mkdir /enterprise

Next, I use a Pig program to remove a control character. In this example, I’m removing the “null character”.

The name of this file is pig_cleaning.pig.

dirty = load '$STAGE' as (data:chararray);
clean = foreach dirty generate REPLACE(data, '\\u0000', '');
store clean into '$TARGET';

Here is my employees.txt file I created. Notice the ^@ symbols. These are control characters as they appear when using vi. The pig_cleaning.pig script will remove these control characters.

Jon|Roberts|123 Main Street|New York|NY|10202|Sales Engineer|NorthEast
Abby|Cromwell|77 Juniper Drive|Raleigh|NC|27605|CIO|NorthWest
Lilly|Vargas|7894 Rayna Rd|Atlanta|GA|30301|VP Sales|SouthEast
Emily|Palmer|224 Warren Drive|St. Louis|MO|63101|VP Marketing|MidWest
Winston|Rigby|84 West Road|^@|CA|^@|President|West

Now I created a shell script that accepts a filename to load and the name of the directory in Hadoop to store the results in.

The name of this file is load.sh.

#!/bin/bash

# $1 is the filename
# $2 is the target directory name
hadoop dfs -put $1 /stage
pig -param STAGE=/stage/$1 -param TARGET=/enterprise/$2 pig_cleaning.pig
hadoop dfs -rm /stage/$1

This script loads the file into the /stage directory, runs the pig program to clean the file of the null character (^@), stores the output to the /enterprise directory, and then removes the stage file.

Executing the script is as easy as:

./load.sh employees.txt employees

Now what about Greenplum database? Here is how you can READ that data in Hadoop from the database. Note that in this example, I have Hadoop and Greenplum database on the same single host. Typically, these will be on separate hosts and instead of localhost, you would have the name of the NameNode like hdm1.

create schema enterprise;

create external table enterprise.employees
(fname text,
 lname text,
 address1 text,
 city text,
 state text,
 zip text,
 job text,
 region text)
 LOCATION ('gphdfs://localhost:8020/enterprise/employees/part*')
 FORMAT 'TEXT' (delimiter '|');

And now let’s execute a SELECT statement.

gpdb=# select * from enterprise.employees;
  fname  |  lname   |     address1     |   city    | state |  zip  |      job       |  region   
---------+----------+------------------+-----------+-------+-------+----------------+-----------
 Jon     | Roberts  | 123 Main Street  | New York  | NY    | 10202 | Sales Engineer | NorthEast
 Abby    | Cromwell | 77 Juniper Drive | Raleigh   | NC    | 27605 | CIO            | NorthWest
 Lilly   | Vargas   | 7894 Rayna Rd    | Atlanta   | GA    | 30301 | VP Sales       | SouthEast
 Emily   | Palmer   | 224 Warren Drive | St. Louis | MO    | 63101 | VP Marketing   | MidWest
 Winston | Rigby    | 84 West Road     |           | CA    |       | President      | West
(5 rows)
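
The example above READs from Hadoop. Writing works the same way with a WRITABLE External Table; here is a hedged sketch (the employees_out path is hypothetical) that copies the data back out to HDFS:

create writable external table enterprise.employees_out
(fname text,
 lname text,
 address1 text,
 city text,
 state text,
 zip text,
 job text,
 region text)
 LOCATION ('gphdfs://localhost:8020/enterprise/employees_out')
 FORMAT 'TEXT' (delimiter '|');

insert into enterprise.employees_out select * from enterprise.employees;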

Conclusion
Hadoop is being used by Enterprises to create a Data Lake. Once there, it is fast and easy to transform the data. And with Greenplum database, it is easy to use SQL tools to access the data.

Sending Emails

A customer recently asked me how they can send an email from the command line to alert the DBAs when a backup starts or finishes. It can also be used to notify when something fails. So, here is my Python script that sends an email. You have to specify the SMTP server, and this version assumes the SMTP server doesn’t need credentials to send an email.

#!/usr/bin/env python
import sys, smtplib, string, getopt

def send_my_mail (email_smtp, email_from, email_to, email_subject, email_body):

  body = string.join((
    "From: %s" % email_from,
    "To: %s" % email_to,
    "Subject: %s" % email_subject,
    "",
    email_body
  ), "\r\n")

  s = smtplib.SMTP(email_smtp)
  s.sendmail(email_from, email_to, body)
  s.quit()

def main(argv):
   email_from = ''
   email_to = ''
   email_subject = ''
   email_body = ''
   email_smtp = ''
   if len(sys.argv)<11:
         print 'mail.py -f <from> -t <to> -s <subject> -b <body> -d <smtp server>'
         sys.exit(2)
   try:
      opts, args = getopt.getopt(argv,"hf:t:d:b:s:")
   except getopt.GetoptError:
      print 'mail.py -f <from> -t <to> -s <subject> -b <body> -d <smtp server>'
      sys.exit(2)
   for opt, arg in opts:
      if opt == '-h':
         print 'mail.py -f <from> -t <to> -s <subject> -b <body> -d <smtp server>'
         sys.exit()
      elif opt in ("-f"):
         email_from = arg
      elif opt in ("-t"):
         email_to = arg
      elif opt in ("-d"):
         email_smtp = arg
      elif opt in ("-b"):
         email_body = arg
      elif opt in ("-s"):
         email_subject = arg

   send_my_mail (email_smtp, email_from, email_to, email_subject, email_body)

if __name__ == "__main__":
   main(sys.argv[1:])

Most of the code is actually parsing the command line arguments. Wouldn’t it be nice to instead use a function in the database that doesn’t need to parse command line arguments? That is exactly what I did! I next created a function in Greenplum that sends an email too!

First you need to install plpythonu if you haven’t already.

 CREATE PROCEDURAL LANGUAGE 'plpythonu'
  HANDLER plpython_call_handler;

Next, you create the function:

create or replace function fn_send_email(email_smtp text, email_from text, email_to text, email_subject text, email_body text) returns void as 
$$
  import smtplib, string
  body = string.join(("From: %s" % email_from, "To: %s" % email_to, "Subject: %s" % email_subject, "", email_body), "\r\n")
  s = smtplib.SMTP(email_smtp)
  s.sendmail(email_from, email_to, body)
  s.quit()
$$ 
language plpythonu;

All of that code around parsing the command line is gone. It is very simple and easy to understand too.

Below is the function being used. Note: I used fake values for the SMTP server, email from, and email to.

select fn_send_email('smtp_example.email.com', 'email_from@email.com', 'email_to@email.com', 'test subject', 'This is the body of my test message');

Outsourcer 3.0.1

  • Fix for Oracle dates in which the time was being truncated, an issue that started with version 2.7.  In 2.7, I added code to handle the two-digit years Oracle uses for years between 0 and 100, but I used getDate instead of getTimestamp.
  • Fix for Oracle JDBC driver mismatch which may cause ArrayIndexOutOfBoundsException when using “getString”.
  • Fix for double quotes around object name in function os.fn_create_ext_table.  If you already upgraded to 3.0 and have executed the upgrade_3_0.sql file, please recreate this function with this SQL file.  If you are upgrading from 2.x, be sure to execute the upgrade_3_0.sql file.

Link to Outsourcer 3.0.1 => os_3_0_1

 

Interesting Limit to Inserts with Java

Outsourcer uses Java and in doing so, the INSERT statement uses the Statement Class and the executeUpdate method.  This method can be seen here.

The executeUpdate method is used for INSERT, UPDATE, or DELETE and returns the number of rows impacted by the statement as an int.  In Java, an “int” has a maximum value of 2,147,483,647.  This means if your SQL statement affects more than 2.1 billion rows, Java will throw an error.  Java int documentation here.

A Greenplum customer using Outsourcer recently contacted me about this error message:

Unable to interpret the update count in command completion tag: INSERT 0 3913383319

This is the error message you will get from Java when you try to insert more than 2.1 Billion rows.  In this case, the customer inserted 3.9 Billion rows from Oracle using Outsourcer.

If you hit this limit in Outsourcer, don’t worry.  The rows actually get inserted and all you need to do is ANALYZE the table.  It is an odd limit that can be avoided.  You can use a VIEW in the source to break up the source table into smaller parts or you could use the External Table feature in Outsourcer.  You can also just ignore this error because the data does get inserted.
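
As an illustration of the VIEW approach, here is a hedged sketch for the source database; the sales table and its sale_id column are hypothetical, and you would define one Outsourcer job per view:

-- Split a very large source table into pieces, each under the Java int limit
CREATE VIEW sales_part1 AS SELECT * FROM sales WHERE sale_id <= 2000000000;
CREATE VIEW sales_part2 AS SELECT * FROM sales WHERE sale_id >  2000000000;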

 

Outsourcer 3.0

I took some time around Thanksgiving to work on Outsourcer 3.0. I have used the feedback I have received to enhance the security and performance of the application. I also addressed one of the known issues and found and fixed an unreported bug.

Links
Administrator Guide
os_3_0.zip

Security

  • The External Tables created by Outsourcer no longer contain the username and password for the source connection.
  • Oracle sources now use ALL_% objects rather than DBA_% objects. This means you don’t need to grant DBA or SELECT on the DBA Views in Oracle to use Outsourcer.
  • TRUST authentication is used to start Outsourcer so no password is stored.

Enhancements

  • Oracle connections are now faster and use less memory than with 2.x.
  • New External Table feature for defining a SQL statement to execute in Oracle or SQL Server to be the source of an External Table.
  • Better examples of the job types for Oracle and SQL Server

Fixes

  • When a job fails and under certain conditions, Outsourcer could leave a database connection open to the source. Outsourcer now properly closes the connection on failed jobs.
  • If multiple jobs executed at the same time, it was possible for more than one job to attempt to create the target schema at the same time. This is no longer a problem.

Outsourcer 2.9.3

I found that for Oracle tables stored in MixedCase, I wasn’t creating the SELECT statement correctly but with 2.9.3, this is fixed. I somehow forgot to put double quotes around the table and schema name. This only affects Oracle objects stored in mixed case.

OS_2_9_3.zip

Outsourcer 2.9.2

I would like to thank the great people at Virgin Australia for helping me find some opportunities to support more features of the Oracle database. Together we identified that I was incorrectly handling ROWID, which I fixed in 2.9. With that release, I thought I would also make the support of an Oracle INTEGER better, only to realize that I was already handling it in the optimal way, so I fixed that problem in 2.9.1. Lastly, I now support dollar signs in source instance, database, schema, and table names, which is the 2.9.2 release.

os_2_9_2.zip

Here is a link to the updated Administrator’s Guide which removes the known issue about a dollar sign in the password.

Administrator’s Guide

Outsourcer 2.9.1

I made a mistake in converting Oracle’s INT to Greenplum’s INT or BIGINT. Oracle doesn’t really have an INT data type; instead, INT is a synonym for NUMBER. To complicate matters, an Oracle INT doesn’t conform to the limits of INT or BIGINT, so these data types must be converted to NUMERIC in Greenplum, which doesn’t have those limits.

Outsourcer 2.9.1

Outsourcer 2.9

Outsourcer 2.9 contains enhancements for Oracle connections.

First, Outsourcer now supports ROWID and UROWID properly. In both cases, these Oracle data types are converted to VARCHAR in Greenplum.

Secondly, Oracle internally stores an Integer as a Numeric which doesn’t perform in Greenplum as a true Integer. Outsourcer now correctly identifies an Integer in Oracle.

The last enhancement is in the JDBC connection. Instead of using the SID, Outsourcer now uses the SERVICE NAME which supports Oracle Connection Manager.

Link to new download => OS_2_9.zip

Link to new documentation => Outsourcer_Administrator_Guide_2_9

How Do I Use Java In Greenplum?

This is the sixth in a series of “How do I do X in Greenplum database” and today’s topic is How do I use Java in Greenplum.

Greenplum supports many languages and one of these is Java, which is defined as the language “pljava”. Because it is Java, it still requires you to compile the code and put it in a JAR file that is in your classpath. There are a couple of additional steps to configure this in Greenplum, so I will go through them.

Prerequisites:
1. JDK installed either on the master or a desktop. If on a desktop, you’ll need to copy the jar file to the master so you can then copy it to the segments.
2. Logged into the master (mdw) as user gpadmin.
3. Methods must be public and static.

First step, write some Java and here is an example of a file named Example.java:

public class Example
{
        public static String substring(String text, int beginIndex, int endIndex)
        {
                return text.substring(beginIndex, endIndex);
        }
}

Create a manifest file named manifest.txt:

Manifest-Version: 1.0
Main-Class: Example
Specification-Title: "Example"
Specification-Version: "1.0"
Created-By: 1.6.0_35-b10-428-11M3811
Build-Date: 09/28/2012 10:09 AM

Compile the Java:

javac *.java

Jar the file:

jar cfm analytics.jar manifest.txt *.class

Copy analytics.jar to all Greenplum servers. The gphosts_file contains a list of all of your servers. An example of that is:

mdw
smdw
sdw1
sdw2
sdw3
sdw4

And the command to copy the Jar file:

gpscp -f gphosts_file analytics.jar =:/usr/local/greenplum-db/lib/postgresql/java/

Set your classpath variable inside Greenplum:

gpconfig -c pljava_classpath -v \'analytics.jar\'

Apply the configuration change:

gpstop -u

Install the pljava language, and in this example, I’m putting it into the gpdb database. Note: this is a one-time activity per database.

psql gpdb -f $GPHOME/share/postgresql/pljava/install.sql

Make sure the classpath is set correctly:

show pljava_classpath

It should say ‘analytics.jar’.

And finally, a working example:

create table temp (a varchar) distributed randomly;
insert into temp values ('a string');

--Example function
create or replace function java_substring(varchar, int, int)
returns varchar as 'Example.substring' language java;

--Example execution
select java_substring(a, 1, 3) from temp;

This was a pretty simple example of using PL/Java. Enjoy!

How Do I Prevent Blocking Locks in Greenplum?

This is the fifth in a series of “How do I do X in Greenplum database” and today’s topic is How do I prevent “blocking locks” in Greenplum?

Since different databases handle read consistency and locking differently, I will discuss how Oracle, Microsoft SQL Server, PostgreSQL, and finally Greenplum handle locking. This will give you a good way to compare and contrast the different products.

Oracle
Oracle handles read consistency by using “rollback”. When you update a table, the row is overwritten, but first the old version of the data is put into a rollback segment and the row is marked with a timestamp and a pointer to the old version. This allows users to query the database and not be blocked while an update happens.

If a query started at time1 and an update happens at time2, the query will see a snapshot of the data as of time1. To make this work, the query follows the pointer to the rollback segment to get the old version of the rows that were updated.

Oracle Problems
There are a couple of problems with this solution. The first is Disk IO because the database has to first read the data file and then read the rollback segment.

The second problem is that the rollback segment must be sized pretty large in a warehouse environment where a query can easily execute for a long period of time. You will eventually run out of rollback and get the “snapshot too old” error on long running queries if you are actively changing data.

SQL Server
SQL Server uses locking by default to provide read consistency. This means an update or even an insert can block other sessions from reading a table. Yuck. Starting with SQL Server 2005, you can enable Read Committed Snapshot Isolation (RCSI) which makes the database behave more like Oracle but it uses TempDB instead of rollback for the old version of the rows updated.

SQL Server Problems
To get around locking, developers will use “read uncommitted” which is a dirty read. I never use a dirty read because it is a hack. Don’t do it.

Using RCSI uses TempDB so more IO is involved. Unfortunately, I almost never see it used in companies. Most SQL Server shops don’t even know this option is available, so they continue with their dirty reads.

PostgreSQL
PostgreSQL uses Multi-Version Concurrency Control (MVCC) to handle read consistency. In a nutshell, readers don’t block writers and writers don’t block readers. Each transaction sees a snapshot of the database rather than locking tables.

To a user, PostgreSQL behaves similarly to Oracle but it doesn’t use rollback. Instead, an update adds a new row version to the existing table. Because PostgreSQL doesn’t use rollback, you will never get a “snapshot too old” error on long running queries.

PostgreSQL Problems
The downside to PostgreSQL happens when an update affects a large percentage of a table. If you update an entire table, it effectively doubles the size on disk. To make the old rows available again, a “vacuum” will need to be run so that new updates and/or inserts will use the deleted rows. PostgreSQL uses a background daemon to automatically vacuum tables so this usually isn’t a big deal.

Greenplum
Greenplum is a fork of PostgreSQL 8.2 so it uses MVCC to handle read consistency. Readers don’t block writers and writers don’t block readers. Each transaction sees a snapshot of the database rather than locking tables.

Greenplum MVCC Exception
Greenplum differs from PostgreSQL for update and delete commands. It acquires an Exclusive Lock on a table. However, this lock doesn’t block readers. SQL queries will return data and users will see a snapshot of the data that isn’t dirty. This Exclusive Lock does however block other update and delete commands which is different from PostgreSQL.
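
If you want to see this for yourself, run an UPDATE in one session and, while it is running, query pg_locks from a second session. This is just a sketch and my_table is a hypothetical table name:

SELECT locktype, relation::regclass, mode, granted
FROM pg_locks
WHERE relation = 'my_table'::regclass;

You should see the ExclusiveLock held by the updating session while your SELECT statements continue to return data.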

Summary
Greenplum Database provides read consistency with Multi-Version Concurrency Control (MVCC) similarly to PostgreSQL but update and delete operations are serialized. You don’t need to worry about blocking locks for selecting data because MVCC uses snapshots of the database instead of blocking locks. You can get a blocking lock on update and delete when another update or delete command is executing on the same table but readers aren’t blocked! You will likely never even notice this lock because the database will handle the updates and deletes one at a time while you can still select data.

How Do I Tune Greenplum?

This is the fourth in a series of “How do I do X in Greenplum database” and today’s topic is How do I tune Greenplum?

I’ve been a DBA and Developer using Oracle, SQL Server, Sybase, PostgreSQL, and Greenplum. By far, “tuning” Greenplum is much easier than the other databases, and the main reason is the architecture of Greenplum versus an OLTP database. With those other databases, I always found myself looking at the top 20 worst SQL queries and then doing three things. First, I would tell the users that they write “bad” SQL. Second, I would create more and more indexes to handle the variety of queries. Lastly, I would redesign my dimensional model to handle how users really wanted to look at the data.

Data Model
Greenplum is data model agnostic. You can use your OLTP data model directly in the database or a 3NF model or a star schema or even a very flat single table. Some databases force you into a particular data model which isn’t agile at all. It even forces waterfall development because it takes a lot of time to design those star schemas!

So now, my steps to tune Greenplum!

Step 1: Distribution
Greenplum is a shared-nothing database where no two nodes share the same data. Data is spread across multiple servers based on a distribution key defined on each table. A good key is typically a unique identifier in a table, and it can be a single column or multiple columns. If you pick a good key, each segment will have roughly the same number of rows, and at Greenplum we call this the “skew”.

Here is how to check the distribution or skew of a table:

SELECT gp_segment_id, count(*)
FROM schema.table
GROUP BY gp_segment_id;
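
If the counts come back badly skewed, you can pick a better key and redistribute the table in place. Here is a sketch; the table and the new key column are hypothetical:

-- Rewrites the table, spreading the rows across the segments on the new key
ALTER TABLE schema.table SET DISTRIBUTED BY (transaction_id);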

For most tables, this is all of the tuning that is needed. It is unusual that all of the tables in a database are very large. Usually there are a few large tables along with a large number of medium and small tables.

Step 2: Partitioning
For your larger tables, you will want to partition the tables to eliminate Disk IO. Greenplum supports list, range, and multi-level partitioning.

CREATE TABLE sales (trans_id int, date date, amount decimal(9,2), region text) 
DISTRIBUTED BY (trans_id)
PARTITION BY RANGE (date)
SUBPARTITION BY LIST (region)
SUBPARTITION TEMPLATE (
    SUBPARTITION usa VALUES ('usa'), 
    SUBPARTITION asia VALUES ('asia'), 
    SUBPARTITION europe VALUES ('europe'),
    DEFAULT SUBPARTITION other_regions)
(START (date '2008-01-01') INCLUSIVE 
END (date '2009-01-01') EXCLUSIVE EVERY (INTERVAL '1 month'),
DEFAULT PARTITION outlying_dates );

The Greenplum optimizer will eliminate partitions based on the SQL executed. Again, partitioning is for your larger tables and won’t help very much for smaller tables.
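
For example, using the sales table above, a query that filters on the partition key should only scan the partitions for that date range, which you can verify with EXPLAIN:

-- Only the January 2008 partition (and its region subpartitions) should be scanned
EXPLAIN
SELECT region, sum(amount)
FROM sales
WHERE date >= '2008-01-01' AND date < '2008-02-01'
GROUP BY region;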

Step 3: Compression and Column Orientation
Greenplum supports both Row and Column oriented tables (and partitions). For your larger tables, and especially the ones with lots of columns, using Column orientation is a good idea. It is also a good idea to compress these tables. It is basically another strategy to reduce Disk IO.

Here is an example:

CREATE TABLE my_customer 
WITH (APPENDONLY=true, ORIENTATION=column, COMPRESSTYPE=quicklz) AS
SELECT customer_id, fname, lname,
address1, address2, city, state, zip,
create_date, bill_amount
FROM customer
WHERE type_cd = 'xyz'
DISTRIBUTED BY (customer_id);

I only do this for my large tables. It isn’t worth it to do this unless the tables are pretty large.

Step 4: Indexes
Last but not least are indexes. I rarely use an index except for enforcing a Primary Key. Greenplum supports B-Tree, Bitmap, and GiST indexes.

– B-Tree indexes can be unique or not.
– Bitmap indexes are useful when there are 100 to 100,000 distinct values.
– GiST indexes are used to support PostGIS.

Again, I rarely use indexes but there are cases when it is helpful. It is the last in the line of things to do to “tune” your database.
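
Here is a sketch of the two index types I do occasionally use, based on the customer table from the earlier example. I’m assuming customer is a heap table distributed by customer_id; remember that a Primary Key in Greenplum must include the distribution key.

-- B-Tree index created implicitly to enforce the Primary Key
ALTER TABLE customer ADD PRIMARY KEY (customer_id);

-- Bitmap index on a column with a modest number of distinct values
CREATE INDEX customer_state_idx ON customer USING bitmap (state);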

Summary
Tuning in Greenplum doesn’t mean looking at lots of queries; instead, it means checking table distribution. Then, for your larger tables (probably 10% or less of your database), Partition, Compress, and make them Column oriented. Then, for maybe 1 or 2 tables, use an Index.

Gone are the days of looking at the “bad” queries. No more creating indexes for everything under the sun. No more waterfall data modeling projects either! Just load the data and follow my four tuning steps.

How Do I Determine The Language For A Function?

This is the third in a series of “How do I do X in Greenplum database” and today’s topic is How do I determine the language for a function.

Inline SQL
A common way to use a function is to transform data directly in the SELECT portion of a query. Here is a quick example.

CREATE TABLE person
(id int NOT NULL,
 fname text,
 lname text,
 company_name text)
DISTRIBUTED BY (id);

--sample data
INSERT INTO person VALUES (1, 'Jon', 'Roberts', null);
INSERT INTO person VALUES (2, null, 'Smith', null);
INSERT INTO person VALUES (3, null, null, 'ABC Company');

Now the use case is to retrieve the “name” of the person, but it might be a combination of the three text columns. A business decision was made to format the name as follows:

SELECT CASE WHEN company_name IS NULL AND fname IS NOT NULL AND lname IS NOT NULL THEN lname || ', ' || fname
            WHEN fname IS NULL AND lname IS NOT NULL THEN lname
            WHEN company_name IS NOT NULL AND fname IS NOT NULL AND lname IS NOT NULL THEN company_name || ': ' || lname || ', ' || fname
            WHEN company_name IS NOT NULL AND fname IS NULL AND lname IS NOT NULL THEN company_name || ': ' || lname 
            WHEN company_name IS NOT NULL AND fname IS NULL AND lname IS  NULL THEN company_name END as name
FROM person;

     name     
--------------
 Roberts, Jon
 ABC Company
 Smith
(3 rows)

It would be much easier and more consistent to add this logic to a function so the logic is centralized in a single place and writing the SQL is a bit easier. This is when using the language “SQL” is preferred. Here is an example; notice that parameters are referred to as $1, $2, and $3 rather than by name. The SQL language does not support named variables.

CREATE OR REPLACE function fn_get_name(p_fname text, p_lname text, p_company_name text) RETURNS text AS
$$
SELECT CASE WHEN $3 IS NULL AND $1 IS NOT NULL AND $2 IS NOT NULL THEN $2 || ', ' || $1
            WHEN $1 IS NULL AND $2 IS NOT NULL THEN $2
            WHEN $3 IS NOT NULL AND $1 IS NOT NULL AND $2 IS NOT NULL THEN $3 || ': ' || $2 || ', ' || $1
            WHEN $3 IS NOT NULL AND $1 IS NULL AND $2 IS NOT NULL THEN $3 || ': ' || $2 
            WHEN $3 IS NOT NULL AND $1 IS NULL AND $2 IS  NULL THEN $3 END as name
$$
LANGUAGE sql;

And now using this function.

SELECT fn_get_name(fname, lname, company_name) AS name FROM person;
     name     
--------------
 Roberts, Jon
 ABC Company
 Smith
(3 rows)

Transformation
The next use case is transforming data, and PL/pgSQL does a great job at this. You can reference parameters by name, declare variables, handle errors, etc. The code looks a lot like Oracle PL/SQL too.

CREATE OR REPLACE FUNCTION fn_update_person() RETURNS void AS
$$
DECLARE
    v_rowcount int;
BEGIN
    UPDATE person
    SET fname = initcap(fname),
        lname = initcap(lname),
        company_name = initcap(company_name);

    GET DIAGNOSTICS v_rowcount = ROW_COUNT;

    RAISE INFO '% Rows Updated.', v_rowcount;

END;
$$
LANGUAGE plpgsql VOLATILE;

Notice how I’m not looping through the data or issuing a commit after so many records. There are two reasons for this. First, Greenplum is designed for Big Data so manipulating lots of data in a single transaction is ideal. Secondly, a function in Greenplum is executed in a single transaction. You can’t create a transaction within a function because executing a function is by default in a transaction.

Summary
I hope this helps. When transforming data inline, use the SQL language; for heavier transformations, use the more robust PL/pgSQL language.

How Do I Create Dynamic SQL in Greenplum Database?

This is the second in a series of “How do I do X in Greenplum database” and today’s topic is How do I create dynamic SQL.

Method 1: psql
The psql program is a command line program that is similar to isql, sqlcmd, or sqlplus. You can use it to dynamically create SQL statements to execute and then pipe it to another instance of psql to execute each statement in a separate transaction.

Example:
In this example, I will create a script dynamically that will analyze each table in the retail schema.

psql -A -t -c "SELECT 'ANALYZE ' || table_schema || '.' || table_name || ';' FROM information_schema.tables WHERE table_schema = 'retail'" | psql -a
ANALYZE retail.d_customers;
ANALYZE
ANALYZE retail.d_payment_methods;
ANALYZE
ANALYZE retail.d_products;
ANALYZE
ANALYZE retail.f_sales;
ANALYZE

psql is executed from the shell with -A (unaligned table output mode), -t (print rows only), and -c (command to execute). The Dynamic SQL is generated and then executed by using a | and psql -a. The -a means that the commands generated by the first psql program are echoed and executed.

This is a nice feature when you need to have each statement executed in a separate transaction.

Method 2: PL/pgSQL
The second way of doing this same task is in PL/pgSQL.

CREATE OR REPLACE FUNCTION retail.fn_analyze() RETURNS void AS
$$

DECLARE
    v_function_name text := 'retail.fn_analyze';
    v_location int;
    v_rec record;

BEGIN
    v_location := 1000;
    FOR v_rec IN SELECT 'analyze ' || table_schema || '.' || table_name AS sql
                 FROM information_schema.tables 
                 WHERE table_schema = 'retail' LOOP

        RAISE INFO '%: %', clock_timestamp(), v_rec.sql;        
        EXECUTE v_rec.sql;

    END LOOP;
    
EXCEPTION
    WHEN OTHERS THEN
        RAISE EXCEPTION '(%:%:%)', v_function_name, v_location, sqlerrm;
END;

$$
LANGUAGE plpgsql;

And the output of the execution.

psql -c "select retail.fn_analyze()"
INFO:  2012-09-11 11:21:45.580327-05: analyze retail.d_customers
INFO:  2012-09-11 11:21:46.102631-05: analyze retail.d_payment_methods
INFO:  2012-09-11 11:21:46.225238-05: analyze retail.d_products
INFO:  2012-09-11 11:21:46.556235-05: analyze retail.f_sales
 fn_analyze 
------------
 
(1 row)

Note that all of the work done in a function is in a single transaction. If there is a problem and the function fails, all of the work done up to that point will be rolled back. So in the case of analyzing tables, the psql technique is probably a better choice but for other use cases, the function technique is better.

How Do I Get The Number Of Rows Inserted?

This is the first in a series of “How do I do X in Greenplum database” and today’s first topic is How do I get the number of rows Inserted (or Updated/Deleted).

Knowing the number of rows Inserted, Updated, or Deleted is pretty common in database programming.  You may want to key off of this to determine the next step in a procedure or just capture the number of rows in an ETL process.

Sybase and Microsoft SQL Server use a Global Variable to do this which is @@ROWCOUNT.  It is dynamically populated after each statement execution.  In Oracle, you can use SQL%ROWCOUNT to do the same thing.

In Greenplum and in a PL/pgSQL function, you can get the rowcount by using GET DIAGNOSTICS.  Here is an example.

CREATE OR REPLACE FUNCTION fn_update_person() RETURNS void AS
$$
DECLARE
    v_rowcount int;
BEGIN
    UPDATE person
    SET state_code = UPPER(state_code);

    GET DIAGNOSTICS v_rowcount = ROW_COUNT;

    RAISE INFO '% Rows Updated.', v_rowcount;

END;
$$
language plpgsql;
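
Calling the function is just a SELECT, and the count comes back as an INFO message (the number reported will depend on your data):

SELECT fn_update_person();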

There you go. Pretty simple to get the ROW_COUNT in Greenplum.

Myth: COUNT(*) is slower than COUNT(1) or COUNT(column name)

There is a strongly held myth by some that in Greenplum, COUNT(*) is somehow slower than COUNT(1) or COUNT(column name).  However, it is a myth and there isn’t a difference.  If you are interested in proof, keep reading.

The 4 arguments I’ve heard for why COUNT(*) is bad and COUNT(1) or COUNT(column name) is better are:
1. COUNT(*) is slower
2. COUNT(1) is faster when there is an index on a column
3. COUNT(column name) is faster when you count a unique column
4. COUNT(1) is faster when the table is column oriented

1. COUNT(*) is slower

Setup

create schema demo;

create table demo.foo
(id int not null,
fname text,
lname text)
distributed by (id);

insert into demo.foo
select i, 'jon' || i, 'roberts' || i
from (select generate_series(1, 1000000) as i) as sub;

analyze demo.foo;

COUNT(*)

explain analyze select count(*) from demo.foo;

Aggregate  (cost=14228.92..14228.93 rows=1 width=8)
  Rows out:  1 rows with 483 ms to end, start offset by 0.474 ms.
  ->  Gather Motion 2:1  (slice1; segments: 2)  (cost=14228.86..14228.91 rows=1 width=8)
        Rows out:  2 rows at destination with 481 ms to first row, 483 ms to end, start offset by 0.486 ms.
        ->  Aggregate  (cost=14228.86..14228.88 rows=1 width=8)
              Rows out:  Avg 1.0 rows x 2 workers.  Max 1 rows (seg0) with 483 ms to end, start offset by 0.908 ms.
              ->  Seq Scan on foo  (cost=0.00..11725.49 rows=500675 width=0)
                    Rows out:  Avg 500000.0 rows x 2 workers.  Max 500001 rows (seg0) with 0.071 ms to first row, 398 ms to end, start offset by 0.908 ms.
Slice statistics:
  (slice0)    Executor memory: 152K bytes.
  (slice1)    Executor memory: 200K bytes avg x 2 workers, 200K bytes max (seg0).
Statement statistics:
  Memory used: 128000K bytes
Total runtime: 484.379 ms

COUNT(1)

explain analyze select count(1) from demo.foo;

Aggregate  (cost=14228.92..14228.93 rows=1 width=8)
  Rows out:  1 rows with 504 ms to end, start offset by 0.532 ms.
  ->  Gather Motion 2:1  (slice1; segments: 2)  (cost=14228.86..14228.91 rows=1 width=8)
        Rows out:  2 rows at destination with 502 ms to first row, 504 ms to end, start offset by 0.564 ms.
        ->  Aggregate  (cost=14228.86..14228.88 rows=1 width=8)
              Rows out:  Avg 1.0 rows x 2 workers.  Max 1 rows (seg0) with 499 ms to end, start offset by 4.029 ms.
              ->  Seq Scan on foo  (cost=0.00..11725.49 rows=500675 width=0)
                    Rows out:  Avg 500000.0 rows x 2 workers.  Max 500001 rows (seg0) with 0.030 ms to first row, 398 ms to end, start offset by 4.029 ms.
Slice statistics:
  (slice0)    Executor memory: 152K bytes.
  (slice1)    Executor memory: 168K bytes avg x 2 workers, 168K bytes max (seg0).
Statement statistics:
  Memory used: 128000K bytes
Total runtime: 505.237 ms

Conclusion: No difference.

2. COUNT(1) is faster when there is an index on a column

COUNT(*) with PK Index

alter table demo.foo add primary key (id);

analyze demo.foo;

explain analyze select count(*) from demo.foo;

Aggregate  (cost=14209.66..14209.67 rows=1 width=8)
  Rows out:  1 rows with 511 ms to end, start offset by 0.459 ms.
  ->  Gather Motion 2:1  (slice1; segments: 2)  (cost=14209.60..14209.64 rows=1 width=8)
        Rows out:  2 rows at destination with 511 ms to end, start offset by 0.471 ms.
        ->  Aggregate  (cost=14209.60..14209.61 rows=1 width=8)
              Rows out:  Avg 1.0 rows x 2 workers.  Max 1 rows (seg0) with 510 ms to end, start offset by 0.836 ms.
              ->  Seq Scan on foo  (cost=0.00..11710.08 rows=499904 width=0)
                    Rows out:  Avg 500000.0 rows x 2 workers.  Max 500001 rows (seg0) with 0.039 ms to first row, 420 ms to end, start offset by 0.837 ms.
Slice statistics:
  (slice0)    Executor memory: 152K bytes.
  (slice1)    Executor memory: 168K bytes avg x 2 workers, 168K bytes max (seg0).
Statement statistics:
  Memory used: 128000K bytes
Total runtime: 511.958 ms

COUNT(1) with PK Index

explain analyze select count(1) from demo.foo;

Aggregate  (cost=14209.66..14209.67 rows=1 width=8)
  Rows out:  1 rows with 511 ms to end, start offset by 0.563 ms.
  ->  Gather Motion 2:1  (slice1; segments: 2)  (cost=14209.60..14209.64 rows=1 width=8)
        Rows out:  2 rows at destination with 507 ms to first row, 511 ms to end, start offset by 0.596 ms.
        ->  Aggregate  (cost=14209.60..14209.61 rows=1 width=8)
              Rows out:  Avg 1.0 rows x 2 workers.  Max 1 rows (seg0) with 507 ms to end, start offset by 1.022 ms.
              ->  Seq Scan on foo  (cost=0.00..11710.08 rows=499904 width=0)
                    Rows out:  Avg 500000.0 rows x 2 workers.  Max 500001 rows (seg0) with 0.040 ms to first row, 404 ms to end, start offset by 1.023 ms.
Slice statistics:
  (slice0)    Executor memory: 152K bytes.
  (slice1)    Executor memory: 168K bytes avg x 2 workers, 168K bytes max (seg0).
Statement statistics:
  Memory used: 128000K bytes
Total runtime: 511.875 ms

Conclusion: No difference.  The cost went down slightly with an Index but COUNT(*) versus COUNT(1) didn’t make a difference.

3. COUNT(column name) is faster when you count a unique column

COUNT(ID)

explain analyze select count(id) from demo.foo;

Aggregate  (cost=14209.66..14209.67 rows=1 width=8)
  Rows out:  1 rows with 533 ms to end, start offset by 0.609 ms.
  ->  Gather Motion 2:1  (slice1; segments: 2)  (cost=14209.60..14209.64 rows=1 width=8)
        Rows out:  2 rows at destination with 531 ms to first row, 533 ms to end, start offset by 0.647 ms.
        ->  Aggregate  (cost=14209.60..14209.61 rows=1 width=8)
              Rows out:  Avg 1.0 rows x 2 workers.  Max 1 rows (seg0) with 529 ms to end, start offset by 4.696 ms.
              ->  Seq Scan on foo  (cost=0.00..11710.08 rows=499904 width=4)
                    Rows out:  Avg 500000.0 rows x 2 workers.  Max 500001 rows (seg0) with 0.030 ms to first row, 418 ms to end, start offset by 4.697 ms.
Slice statistics:
  (slice0)    Executor memory: 152K bytes.
  (slice1)    Executor memory: 168K bytes avg x 2 workers, 168K bytes max (seg0).
Statement statistics:
  Memory used: 128000K bytes
Total runtime: 533.647 ms

Conclusion: No difference

4. COUNT(1) is faster when the table is column oriented

Setup

create table demo.foo_column
(id int not null,
 fname text,
 lname text)
 WITH (appendonly=true, orientation=column)
 distributed by (id);

insert into demo.foo_column
select i, 'jon' || i, 'roberts' || i
from (select generate_series(1, 1000000) as i) as sub;

analyze demo.foo_column;

COUNT(*)

explain analyze select count(*) from demo.foo_column;
Aggregate  (cost=13350.06..13350.07 rows=1 width=8)
  Rows out:  1 rows with 368 ms to end, start offset by 0.544 ms.
  ->  Gather Motion 2:1  (slice1; segments: 2)  (cost=13350.00..13350.04 rows=1 width=8)
        Rows out:  2 rows at destination with 362 ms to first row, 368 ms to end, start offset by 0.573 ms.
        ->  Aggregate  (cost=13350.00..13350.01 rows=1 width=8)
              Rows out:  Avg 1.0 rows x 2 workers.  Max 1 rows (seg0) with 359 ms to end, start offset by 4.160 ms.
              ->  Append-only Columnar Scan on foo_column  (cost=0.00..10850.00 rows=500000 width=0)
                    Rows out:  Avg 500000.0 rows x 2 workers.  Max 500001 rows (seg0) with 0.182 ms to first row, 266 ms to end, start offset by 4.160 ms.
Slice statistics:
  (slice0)    Executor memory: 152K bytes.
  (slice1)    Executor memory: 264K bytes avg x 2 workers, 264K bytes max (seg0).
Statement statistics:
  Memory used: 128000K bytes
Total runtime: 368.748 ms

COUNT(1)

explain analyze select count(1) from demo.foo_column;

Aggregate  (cost=13350.06..13350.07 rows=1 width=8)
  Rows out:  1 rows with 277 ms to end, start offset by 0.557 ms.
  ->  Gather Motion 2:1  (slice1; segments: 2)  (cost=13350.00..13350.04 rows=1 width=8)
        Rows out:  2 rows at destination with 204 ms to first row, 277 ms to end, start offset by 0.586 ms.
        ->  Aggregate  (cost=13350.00..13350.01 rows=1 width=8)
              Rows out:  Avg 1.0 rows x 2 workers.  Max 1 rows (seg0) with 273 ms to end, start offset by 4.296 ms.
              ->  Append-only Columnar Scan on foo_column  (cost=0.00..10850.00 rows=500000 width=0)
                    Rows out:  Avg 500000.0 rows x 2 workers.  Max 500001 rows (seg0) with 0.195 ms to first row, 174 ms to end, start offset by 4.297 ms.
Slice statistics:
  (slice0)    Executor memory: 152K bytes.
  (slice1)    Executor memory: 264K bytes avg x 2 workers, 264K bytes max (seg0).
Statement statistics:
  Memory used: 128000K bytes
Total runtime: 277.571 ms

Conclusion: No difference. The cost is lower to count the number of rows in a column oriented table but again, makes no difference if it is COUNT(1) or COUNT(*).

Summary

Don’t use silly code tricks like COUNT(1) because it doesn’t make any difference.

Outsourcer 2.7

I have a new version which is only a small bug fix for Oracle dates. If you are storing dates in Oracle less than year 100, then you will need this new version. If you aren’t, then you don’t need to upgrade.

Details:
Oracle has the ability to store dates between January 1, 4712 BCE and December 31, 4712 CE, and I convert the dates to a string to load them. I was relying on the locale to do this conversion, but Java apparently converts the date 0099-05-01 to 99-05-01. This doesn’t follow the ISO date standard, so it fails to load this value.

My fix is to simply format Oracle DATE columns to a four digit year.

Link:
Outsourcer 2.7

Reference:
Oracle Documentation

Chorus

I recently dug into Greenplum Chorus to better understand the collaboration tool.  I installed it on my Mac with my Greenplum Database, Greenplum Hadoop, and Greenplum Command Center.  In this post, I will review how Chorus was designed and how it facilitates collaboration in an organization.
Greenplum Chorus

Chorus Features
First off, the image above shows Greenplum Chorus, Database, and Hadoop. I created the image to better understand the product so this isn’t a marketing slide given to me.

Starting with Chorus, you have these main features.

  • Scheduler: Used to refresh data in Sandbox (more on this below)
  • Data Definitions: Connections to Greenplum databases and Hadoop. You can browse the data in both databases and Hadoop and then link this to Workspaces.
  • Insights: Comments made in Chorus that are so important to the business, it is shared to everyone! No longer will revelations about data be isolated to a single person or a small group. Now you can share your insights with others!

Workspace Features

Next we have Workspaces.  In my image, I have a “Sales” Workspace as well as other Workspaces like HR, Marketing, Analytics, and Accounting.  This is where people work together around data, and as you can see, the Chorus Administrator can create as many Workspaces as needed for your organization.  Each Workspace has these features:

  • Linked Data: Data from Greenplum databases and Hadoop that is linked to the workspace. This makes it simple to query this data in Chorus without needing to make a copy of the data.
  • Sandbox Data: This is either data copied from Greenplum databases using the scheduler or completely new tables derived using SQL you write. This is very powerful and goes a long way to providing business users the ability to find new value in data.
  • Chorus View: This is similar to a database view but the definition resides in Chorus.
  • Files: This is basically metadata (Text, Images, other) plus code (SQL) that is stored in the workspace. Versioning is done automatically too. You can execute the SQL directly in Chorus. Very powerful stuff.
  • Notes: This is more metadata about anything in your workspace.  Notes are also what can be promoted to an Insight for the entire organization.  You are building a living and breathing data dictionary with Chorus!

Visualization

Workspaces also have the ability to visualize data. This is done with graphing similar to a BI tool. The visualization is there to quickly understand the data and then take action on it. Maybe the action is to write a Note or Insight, or it might be to further investigate with additional queries to Greenplum database and Hadoop.  Chorus isn’t meant to replace reporting tools.  Instead, the aim is to quickly understand the data and then take action with a Note, Insight, and/or more investigation.

Security

Logging into Chorus can be handled by LDAP/Active Directory if you like.  Hadoop and Database connections can be made public to the Chorus users or you can require users to log into the data sources so security is handled by the Database and Hadoop.

Summary

Chorus is a great collaboration tool for Greenplum.  I am very impressed with the tool and expect more great things from the product in the future.

Creating a Date Dimension

Here is another technique worth mentioning in regard to building a date dimension.  One way to do this is to create a loop and iterate over it to insert the records.  This is slow in virtually every database, but we have a way to do this with a single INSERT statement.

First create the table:

DROP TABLE if exists hr.date_dim;

CREATE TABLE hr.date_dim
(
  date_key timestamp without time zone NOT NULL,
  date_week timestamp without time zone NOT NULL,
  date_month timestamp without time zone NOT NULL,
  date_quarter timestamp without time zone NOT NULL,
  date_year timestamp without time zone NOT NULL
)
DISTRIBUTED BY (date_key);

And now the magic:

INSERT INTO hr.date_dim
SELECT  mydate AS date_key,
        date_trunc('week', mydate) AS date_week,
        date_trunc('month', mydate) AS date_month,
        date_trunc('quarter', mydate) AS date_quarter,
        date_trunc('year', mydate) AS date_year
FROM    (
        SELECT '1995-01-01'::timestamp + interval '1 day' * (generate_series(0, (EXTRACT('days' FROM '2007-04-16'::timestamp - '1995-01-01'::timestamp)::int))) AS mydate
        ) AS sub;

My example builds the date dimension with every day populated between January 1, 1995 through April 16, 2007.  I did this as it was the minimum and maximum dates in my HR table with job history information.

This solution is similar to the other post I made about using generate_series to avoid a nested loop because I’m using generate_series again.  It is a very powerful and easy way to dynamically create a dataset without using a loop to do so.

Nested Loops, BETWEEN, and generate_series

Here is a common scenario in a data warehouse. You have designed yourself a date dimension and you have another table with a begin and end time for some activity.

The date dimension might look something like this:

CREATE TABLE hr.date_dim
(
date_key timestamp without time zone NOT NULL,
date_week timestamp without time zone NOT NULL,
date_month timestamp without time zone NOT NULL,
date_quarter timestamp without time zone NOT NULL,
date_year timestamp without time zone NOT NULL
)
DISTRIBUTED BY (date_key);

And you activity table would look something like this:

CREATE TABLE hr.job_history
(
employee_id numeric NOT NULL,
start_date timestamp without time zone NOT NULL,
end_date timestamp without time zone,
job_id character varying(10),
department_id numeric,
CONSTRAINT job_history_pkey PRIMARY KEY (employee_id, start_date)
)
DISTRIBUTED BY (employee_id, start_date);

One way to write the query joining the two tables is with BETWEEN. Here is what it looks like:

SELECT *
FROM hr.job_history jh
JOIN hr.date_dim d ON d.date_key BETWEEN jh.start_date AND jh.end_date;

The explain plan looks like this on my local install on my Mac:

Gather Motion 2:1  (slice2; segments: 2)  (cost=2.42..2301.81 rows=2494 width=78)
  ->  Nested Loop  (cost=2.42..2301.81 rows=2494 width=78)
        Join Filter: d.date_key >= jh.start_date AND d.date_key <= jh.end_date
        ->  Seq Scan on date_dim d  (cost=0.00..54.89 rows=2245 width=40)
        ->  Materialize  (cost=2.42..2.62 rows=10 width=38)
              ->  Broadcast Motion 2:2  (slice1; segments: 2)  (cost=0.00..2.40 rows=10 width=38)
                    ->  Seq Scan on job_history jh  (cost=0.00..2.10 rows=5 width=38)

Notice it is doing a Nested Loop operation. Yuck.

And here is the one that utilizes generate_series:

SELECT *
FROM (
SELECT *, start_date + interval '1 day' * (generate_series(0, (EXTRACT('days' FROM end_date - start_date)::int))) AS mydate
FROM hr.job_history
) AS jh
JOIN hr.date_dim d ON jh.mydate = d.date_key;

Look at the explain plan now!

Gather Motion 2:1  (slice2; segments: 2)  (cost=2.70..68.96 rows=7 width=166)
  ->  Hash Join  (cost=2.70..68.96 rows=7 width=166)
        Hash Cond: d.date_key = jh.mydate
        ->  Seq Scan on date_dim d  (cost=0.00..54.89 rows=2245 width=40)
        ->  Hash  (cost=2.58..2.58 rows=5 width=126)
              ->  Redistribute Motion 2:2  (slice1; segments: 2)  (cost=0.00..2.58 rows=5 width=126)
                    Hash Key: jh.mydate
                    ->  Result  (cost=0.00..2.27 rows=5 width=38)
                          ->  Seq Scan on job_history  (cost=0.00..2.27 rows=5 width=38)

It is doing a Hash Join and the cost has dropped significantly.

The lesson here is to avoid Nested Loops in Greenplum and one way to avoid this is to use an equal join instead of between. To achieve this, you can use the generate_series function in Greenplum as demonstrated here.

Dynamic External Tables

A customer encouraged me to write a post about how it is possible to use shell scripts to dynamically alter the results from an External Web Table.

For this example, I will create a table in both Greenplum and SQL Server that will contain a list of files.  I want to dynamically get just the records from the External Web Table that I need rather than retrieving all of the rows from SQL Server or recreating the External Web Table. Note: Outsourcer has “Append” jobs for getting just the new records, but it has been designed for Integer data types, not character like in this example.

Greenplum Table:

CREATE TABLE myfiles
(filename TEXT NOT NULL PRIMARY KEY,
dir TEXT NOT NULL)
DISTRIBUTED BY (filename);

INSERT INTO myfiles VALUES ('2012-01-01_10-30-00_data.txt', '/data/dump');
INSERT INTO myfiles VALUES ('2012-01-02_10-30-00_data.txt', '/data/dump');
INSERT INTO myfiles VALUES ('2012-01-03_10-30-00_data.txt', '/data/dump');
INSERT INTO myfiles VALUES ('2012-01-04_10-30-00_data.txt', '/data/dump');
INSERT INTO myfiles VALUES ('2012-01-05_10-30-00_data.txt', '/data/dump');

SQL Server Table:

CREATE TABLE myfiles
(filename VARCHAR(500) NOT NULL PRIMARY KEY,
 dir VARCHAR(MAX) NOT NULL);

INSERT INTO myfiles VALUES ('2012-01-01_10-30-00_data.txt', '/data/dump');
INSERT INTO myfiles VALUES ('2012-01-02_10-30-00_data.txt', '/data/dump');
INSERT INTO myfiles VALUES ('2012-01-03_10-30-00_data.txt', '/data/dump');
INSERT INTO myfiles VALUES ('2012-01-04_10-30-00_data.txt', '/data/dump');
INSERT INTO myfiles VALUES ('2012-01-05_10-30-00_data.txt', '/data/dump');
--Note that these three rows don't exist in Greenplum
INSERT INTO myfiles VALUES ('2012-01-06_10-30-00_data.txt', '/data/dump');
INSERT INTO myfiles VALUES ('2012-01-07_10-30-00_data.txt', '/data/dump');
INSERT INTO myfiles VALUES ('2012-01-08_10-30-00_data.txt', '/data/dump');

Shell Script: /Users/gpadmin/get_myfiles.sh
Please note that the use of Outsourcer.jar requires creating a connection in os.ext_connection. Review the Outsourcer documentation on creating External Tables for more information.

max=`psql -A -t -c "SELECT MAX(filename) FROM myfiles"`
java -classpath /usr/local/os/jar/Outsourcer.jar:/usr/local/os/jar/sqljdbc4.jar:/usr/local/os/jar/ojdbc6.jar -Xms128m -Xmx256m ExternalData gpdb 5432 1 "SELECT filename, dir FROM poc.dbo.myfiles WHERE filename > '$max'"

Dynamic External Web Table in Greenplum

CREATE EXTERNAL WEB TABLE ext_myfiles
(filename TEXT,
 dir TEXT)
EXECUTE E'/Users/gpadmin/get_myfiles.sh' ON MASTER
FORMAT 'text' (delimiter '|' null 'null' escape E'\\')
ENCODING 'UTF8';

Insert statement in Greenplum

INSERT INTO myfiles SELECT * FROM ext_myfiles;

This technique will only retrieve the rows from SQL Server that need to be inserted. When dealing with millions of records, this will greatly improve performance.

You can manipulate your script to retrieve data based on anything you want all without having to recreate your External Web Table.

Killing Idle Sessions

A lot of users will open a connection to a database and then leave it open for days.  They will forget about this and then open more connections.  Eventually, they will hit the max_connections limit and then not be able to create any new connections to Greenplum.

An easy fix for this is to kill sessions programmatically that have been idle for more than an hour.  Of course, you can adjust this time period as appropriate.

Here is a simple SQL statement to identify the idle sessions:

SELECT procpid
FROM pg_stat_activity
WHERE current_query = '<IDLE>'
AND clock_timestamp() - query_start > interval '1 hour';

One way to kill these idle sessions is with a single command:

eval `psql -A -t -c "SELECT 'kill ' || procpid FROM pg_stat_activity WHERE current_query = '<IDLE>' AND clock_timestamp() - query_start > interval '1 hour'"`

Put that command into a shell script (be sure to source the greenplum_path.sh script first) and create a cron job that runs every 5 minutes.

If you want to track who you killed and what time, you can use this shell script and again, use a cron job to execute it every 5 minutes.

#!/bin/bash
source /usr/local/greenplum-db/greenplum_path.sh
for i in $( psql -A -t -c "SELECT procpid, usename FROM pg_stat_activity WHERE current_query = '<IDLE>' AND clock_timestamp() - query_start > interval '1 hour'" ); do
        procpid=$(echo $i | cut -d\| -f1)
        usename=$(echo $i | cut -d\| -f2)

        echo `date`\|$procpid\|$usename >> kill.log
        kill $procpid
done

Rolling partitions with Polymorphic Storage

Greenplum’s Polymorphic Storage is a very useful feature that lets tables be created and altered with a mixture of columnar, compressed, and row-oriented partitions.

Polymorphic Storage

Here is an example of creating one of these tables and then managing this over time.

CREATE TABLE edw.sales_fact
(
  store_id integer,
  sales_person_id integer,
  product_id integer,
  sale_date date,
  sale_price numeric,
  product_cost numeric,
  spiff numeric
)
DISTRIBUTED BY (store_id, sales_person_id, product_id, sale_date)
PARTITION BY RANGE(sale_date) 
        (
        START ('2011-01-01'::timestamp) INCLUSIVE END ('2011-02-01'::timestamp) EXCLUSIVE WITH (appendonly=true, orientation=column, compresstype=zlib, compresslevel=5),
        START ('2011-02-01'::timestamp) INCLUSIVE END ('2011-03-01'::timestamp) EXCLUSIVE WITH (appendonly=true, orientation=column, compresstype=quicklz),
        START ('2011-03-01'::timestamp) INCLUSIVE END ('2011-04-01'::timestamp) EXCLUSIVE
        )
;

As you can see, this table has three partitions with each having different storage techniques.

  • January 2011 is columnar and compressed with ZLib
  • February 2011 is columnar and compressed with QuickLZ
  • March 2011 is stored as a row store and is not compressed

As time goes on, the DBA may want to make March 2011 compressed and then add a partition for April 2011. To do this, we first need to move the March 2011 data to a new table and exchange the partition. The last step is to add the new April 2011 partition.

CREATE TABLE edw.sales_fact_march
(
  store_id integer,
  sales_person_id integer,
  product_id integer,
  sale_date date,
  sale_price numeric,
  product_cost numeric,
  spiff numeric
)
WITH (appendonly=true, orientation=column, compresstype=quicklz)
DISTRIBUTED BY (store_id, sales_person_id, product_id, sale_date);

INSERT INTO edw.sales_fact_march 
SELECT * FROM edw.sales_fact 
WHERE sale_date >= '2011-03-01' 
AND sale_date < '2011-04-01';

ANALYZE edw.sales_fact_march;

This next step is optional, but you can now TRUNCATE the March 2011 partition because all of its data is in the sales_fact_march table.  We are about to exchange the two, and it is faster to exchange with a truncated table.

TRUNCATE TABLE edw.sales_fact_1_prt_3;

Note: I got the partition table name from pg_partitions.
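
For reference, a query along these lines against pg_partitions returns both the partition table name and the rank used below:

SELECT partitiontablename, partitionrank
FROM pg_partitions
WHERE schemaname = 'edw'
AND tablename = 'sales_fact'
ORDER BY partitionrank;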

Now exchange my row-oriented partition for the columnar and compressed table.

ALTER TABLE edw.sales_fact 
EXCHANGE PARTITION FOR (RANK(3)) 
WITH TABLE edw.sales_fact_march 
WITH VALIDATION;

Note: I got the RANK from pg_partitions.

We no longer need the temporary March 2011 data so we can drop this table.

DROP TABLE edw.sales_fact_march;

Lastly, add an empty April 2011 partition.

ALTER TABLE edw.sales_fact 
ADD PARTITION START ('2011-04-01'::timestamp) INCLUSIVE 
END ('2011-05-01'::timestamp) EXCLUSIVE;

The new DDL for the table looks like this:

CREATE TABLE edw.sales_fact
(
  store_id integer,
  sales_person_id integer,
  product_id integer,
  sale_date date,
  sale_price numeric,
  product_cost numeric,
  spiff numeric
)
WITH (
  OIDS=FALSE
)
DISTRIBUTED BY (store_id, sales_person_id, product_id, sale_date)
PARTITION BY RANGE(sale_date) 
          (
          START ('2011-01-01'::date) END ('2011-02-01'::date) WITH (appendonly=true, orientation=column, compresstype=zlib, compresslevel=5) , 
          START ('2011-02-01'::date) END ('2011-03-01'::date) WITH (appendonly=true, orientation=column, compresstype=quicklz), 
          START ('2011-03-01'::date) END ('2011-04-01'::date) WITH (appendonly=true, orientation=column, compresstype=quicklz), 
          START ('2011-04-01'::date) END ('2011-05-01'::date)
          )
;

Loading data with External Tables and gpfdist

Straight from the Greenplum Admin Guide:

gpfdist is Greenplum’s parallel file distribution program. It is used by readable external tables and gpload to serve external table files to all Greenplum Database segments in parallel. It is used by writable external tables to accept output streams from Greenplum Database segments in parallel and write them out to a file.

In this post, I’m going to demonstrate how to use a readable external table that uses gpfdist to load data. This makes it easy for Greenplum users to load massive amounts of data by just writing SQL!

First, start gpfdist as a background process.

gpfdist -d /Users/gpadmin -p 8080 &

Create my Table in Greenplum.

CREATE TABLE foo
(bar_id int not null,
 bar text,
 bar_description text)
DISTRIBUTED BY (bar_id);

Now create my External Table.

CREATE EXTERNAL TABLE ext_foo
(LIKE foo) LOCATION ('gpfdist://localhost:8080/demo/foo.txt')
FORMAT 'TEXT' (DELIMITER AS '|' NULL AS 'null');
  • I could have spelled out all of the columns but I used a shortcut by using “LIKE foo”
  • The location indicates it is using gpfdist and the host is local with port 8080
  • My file is located /Users/gpadmin/demo/foo.txt but gpfdist is serving /Users/gpadmin so I need to only specify the demo subdirectory and then the file name
  • You can pick TEXT or CSV format but TEXT will execute faster
  • I used pipes as my delimiter and spelled out null for my null values
Insert the data.
INSERT INTO foo SELECT * FROM ext_foo;

You are done!

Other tips:

  • When creating the External Table, you can also specify an error_table (created dynamically if not found) with a reject limit; see the sketch after this list.
LOG ERRORS INTO error_table SEGMENT REJECT LIMIT count ROWS
  • If you have a really large file to load, you can split the file into multiple files and then specify multiple locations in the External Table.  You can even specify different hosts for each file with multiple gpfdists running at once.
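
Here is a sketch of the earlier External Table with error logging added; the error table name err_foo and the limit of 100 rows are only placeholders:

CREATE EXTERNAL TABLE ext_foo_with_errors
(LIKE foo) LOCATION ('gpfdist://localhost:8080/demo/foo.txt')
FORMAT 'TEXT' (DELIMITER AS '|' NULL AS 'null')
LOG ERRORS INTO err_foo SEGMENT REJECT LIMIT 100 ROWS;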

And remember, inserting data this way pushes the data directly to the segments.  It bypasses the MASTER server, so it scales much better than other databases’ loading utilities.  Greenplum can load 10TB/Hour for a full rack DCA when using gpfdist!

LDAP Security

Security is always divided into two main actions: Authentication and Authorization.  Authentication is proving you are who you say you are, usually by providing a username and a password.  Authorization is the process of approving or denying access to objects such as SELECT on a table.  Greenplum always handles Authorization internally.  You can’t delegate this to a third party, but you can delegate Authentication.

Authentication is handled in Greenplum just like it is in PostgreSQL.  Edit your pg_hba.conf file on the MASTER and provide information about how users should be authenticated.  Typically, companies are using an LDAP solution for security and this is also usually the hub in a hub and spoke security model.  Furthermore, Microsoft Active Directory is usually the LDAP solution picked because so many desktops are using Windows.

The security hub has a company’s password rules like failed attempts, password complexity, password expiration, etc. and you can leverage this very easily with Greenplum’s authentication model.  Simply edit your pg_hba.conf file and at the very bottom of the file (it is read sequentially), add this:

host all all 0.0.0.0/0 ldap ldapserver=<ldap server> ldapprefix="<domain>\" ldapsuffix=""

For example:

  • LDAP server = ldap1
  • Windows Domain = myco

So your entry at the bottom of the pg_hba.conf file would look like this:

host all all 0.0.0.0/0 ldap ldapserver=ldap1 ldapprefix="myco\" ldapsuffix=""

During a login attempt, the pg_hba.conf file is read and the first matching entry is used.  So if you want to use basic password authentication for an ETL user or trust for local users, simply put these entries before the LDAP entry.  The last line is then the LDAP “catch-all” entry that authenticates everyone else against your Active Directory, as sketched below.
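
As a sketch, the ordering might look like this; the etl_user account and the 10.1.1.0/24 subnet are made up for the example:

# local connections on the MASTER are trusted
local   all   gpadmin                   trust
# the ETL account authenticates with a password stored in Greenplum
host    all   etl_user    10.1.1.0/24   md5
# everyone else is authenticated against Active Directory
host    all   all         0.0.0.0/0     ldap ldapserver=ldap1 ldapprefix="myco\" ldapsuffix=""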

How does it work?  Greenplum takes the username and password passed in and attempts to connect to the LDAP server with the ldapprefix (domain) specified.  If that connection succeeds, the user is authenticated and connected to Greenplum.  It is that simple.  Once in the database, Greenplum handles Authorization as normal.

Merge Statements

Merge is a feature in most databases but I have never liked it much.  My source table always had duplicates in it which prevented me from using it.

Here is an example of such with Microsoft SQL Server (BTW, Oracle behaves the same way):

CREATE TABLE stage
(
id int identity not null,
provider_id int NOT NULL,
provider text NOT NULL
);

insert into stage (provider_id, provider) values (1, 'bubba');
insert into stage (provider_id, provider) values (2, 'gump');
insert into stage (provider_id, provider) values (3, 'foo');
insert into stage (provider_id, provider) values (4, 'bar');
insert into stage (provider_id, provider) values (4, 'jon');

CREATE TABLE target
(
provider_id int NOT NULL primary key,
provider text NOT NULL
);

My stage table is where new records get loaded from a file, and my target table is what I’m trying to keep up to date.  So using a MERGE statement, I should be able to insert the new records and, if they already exist, update them.

merge target
using (select * from stage) as stage
on (target.provider_id = stage.provider_id)
when matched then
update set provider = stage.provider
when not matched then
insert (provider_id, provider)
values (stage.provider_id, stage.provider);

That is the syntax for MERGE in SQL Server and when executed, I get this error message:

Msg 2627, Level 14, State 1, Line 1

Violation of PRIMARY KEY constraint 'PK__target__00E213107132C993'. Cannot insert duplicate key in object 'dbo.target'.

The statement has been terminated.

Why?  Well, my stage table has two records with the same provider_id.  When this happens, I don’t want an error; instead, I want the last value in the file to be used.  Unfortunately, MERGE doesn’t handle this.  Instead, you have to write code.

In Greenplum, here is how you do a “smart” MERGE that handles real world data where the stage (source) table has duplicates:

CREATE TABLE stage
(
id serial not null,
provider_id int NOT NULL,
provider text NOT NULL
)
DISTRIBUTED BY (provider_id);

insert into stage (provider_id, provider) values (1, 'bubba');
insert into stage (provider_id, provider) values (2, 'gump');
insert into stage (provider_id, provider) values (3, 'foo');
insert into stage (provider_id, provider) values (4, 'bar');
insert into stage (provider_id, provider) values (4, 'jon');

CREATE TABLE target
(
provider_id int NOT NULL primary key,
provider text NOT NULL
)
DISTRIBUTED BY (provider_id);

Now create the function:

create or replace function fn_merge_target() returns void as
$$
declare
    v_function text := 'fn_merge_target';
    v_location int;

begin
    v_location := 1000;
    --delete the old records
    delete from target t
    using stage s
    where t.provider_id = s.provider_id
    and t.provider <> s.provider;

    v_location := 2000;
    --inserts
    insert into target (provider_id, provider)
    select sub.provider_id, sub.provider
    from (  select s.*,
            rank() over (partition by s.provider_id order by s.id desc) as rank
            from stage s
            left join target t on s.provider_id = t.provider_id
            where t.provider is null  ) as sub
    where sub.rank = 1;

exception
    when others then
        raise exception '(%:%:%)', v_function, v_location, sqlerrm;
end;
$$
language plpgsql;

Execute the function:

select fn_merge_target();

And now query target:

select * from target order by provider_id;

I have 4 rows, and for provider_id 4, I have the last version from the stage table rather than a failed command.
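
The output should look something like this:

 provider_id | provider 
-------------+----------
           1 | bubba
           2 | gump
           3 | foo
           4 | jon
(4 rows)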

In summary, MERGE sounds nice but I’ve never been able to use it.  My source or stage table always has duplicates, and instead of failing, I want to apply the last version of each primary key record.  I demonstrated an easy-to-use way to provide a much smarter MERGE that is optimized for Greenplum and uses one of the built-in analytical functions (rank()).

Archiving Data

A customer recently had a pretty common question about archiving old data in Greenplum.  They wanted to purge old data from their OLTP system so that it remains small and fast.  Historical data that isn’t changing stays in Greenplum.

An easy way to do this is with a view over two tables, where the view does a UNION ALL of the current and archive tables.  Unfortunately, Greenplum, like most databases, executes each branch of a UNION ALL serially.  The optimal way to handle this is to partition the data.
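
For reference, a minimal sketch of that view approach might look like this (the two underlying table names are hypothetical); the partitioned design below avoids the serial execution:

CREATE VIEW auths_all AS
SELECT * FROM auths_current
UNION ALL
SELECT * FROM auths_archive;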

So let’s say the customer has a schema called edw and a table called auths.

create schema edw;

CREATE TABLE edw.auths
(id int not null,
fname varchar,
lname varchar,
episodedate timestamp)
DISTRIBUTED BY (id);

Insert some sample data.

INSERT INTO edw.auths VALUES (1, 'foo', 'bar', '2001-05-01'::timestamp);
INSERT INTO edw.auths VALUES (2, 'foo', 'bar', '2012-12-01'::timestamp);
INSERT INTO edw.auths VALUES (3, 'too', 'cool', '2010-01-01'::timestamp);
INSERT INTO edw.auths VALUES (4, 'too', 'cool', '2010-01-02'::timestamp);
INSERT INTO edw.auths VALUES (5, 'old', 'data', '2004-01-02'::timestamp);

Next, rename this schema.

ALTER SCHEMA edw RENAME TO edw_archive;

Create the new edw schema.

CREATE SCHEMA edw;

In this case, anything older than 2011-01-01 will be archived and anything newer will be in the current OLTP system.

CREATE TABLE edw.auths
(LIKE edw_archive.auths)
PARTITION BY RANGE (episodedate)
(START ('2000-01-01'::timestamp) INCLUSIVE END ('2010-01-01'::timestamp) EXCLUSIVE WITH (appendonly=true, orientation=column, compresstype=zlib, compresslevel=5),
DEFAULT PARTITION auths_current);

I’m using a pretty high level of compression and column orientation for the archive partition.  This helps performance and uses less disk space for the archived data.  The default partition holds the current data and remains an uncompressed row store.  This is a good example of Polymorphic Storage, where partitions within the same table are stored in different ways.

Insert the old data into the new table.

INSERT INTO edw.auths SELECT * FROM edw_archive.auths;

Just to show that everything works.

SELECT * FROM edw.auths;

With Outsourcer, you will want to change the job that loads edw.auths so it inserts data into only the default partition and doesn’t touch the archive partition.  To do this, get the names of the underlying inherited tables.

SELECT * FROM pg_partitions WHERE schemaname = 'edw' AND tablename = 'auths';

You can see the two tables independently.

--new data

SELECT * FROM edw.auths_1_prt_auths_current;

--archive data

SELECT * FROM edw.auths_1_prt_2;

To change Outsourcer, first remove the old job for edw.auths and insert the new one with the target of edw.auths_1_prt_auths_current.

DELETE FROM os.job WHERE (target).schema_name = 'edw' AND (target).table_name = 'auths';

INSERT INTO os.job(
refresh_type, target, source)
VALUES ('refresh',
('edw', 'auths_1_prt_auths_current'),
('sqlserver', 'jonnywin', null, null, 'edw', 'dbo', 'auths', 'sa', 'sa'));

Now when you execute this job, it will truncate the current partition, insert the data from the OLTP system, and then analyze the partition.  The archive partition will be left untouched.  Querying data across the archived and current data will execute much faster than a UNION ALL because it runs in parallel and it is compressed and column oriented.

PL/pgSQL Functions

Overview

Greenplum leverages PostgreSQL internals, which brings great benefits in security, Multi-Version Concurrency Control, and procedural languages.  Greenplum supports a wide variety of languages but my favorite by far has to be PL/pgSQL.

PL/pgSQL feels a lot like Oracle PL/SQL.  It has the same “Ada” look and feel which makes it easy for an Oracle developer to start writing PL/pgSQL.

Since Greenplum supports the same language that PostgreSQL does, you can find excellent information online about different techniques of writing quality PL/pgSQL functions.  For example, here is good documentation on how to port from Oracle PL/SQL to PL/pgSQL => http://www.postgresql.org/docs/8.2/interactive/plpgsql-porting.html

Writing Functions

I follow the same template for each PL/pgSQL function I write, and below are the conventions I always use.

Naming Conventions

  • Functions – prefix with fn_ to denote it is a function
  • Table names – all lower case with underscores.  The data dictionary stores objects in all lower case, so it is easier to read table names with underscores.
  • Variables – prefix with v_
  • Parameters – prefix with p_

Error Handling

  • Function Name – I add the function name to each function I write so that I can include it in the exception message.  This is very helpful when you have functions executing other functions and you need to know which function really failed.
  • Location – I prefer to add a location variable to functions and update it frequently to denote where I am in the code.  I use the location value to know where the error occurred.
  • Exception Block – I raise the exception with:
  • RAISE EXCEPTION '(%:%:%)', v_function_name, v_location, sqlerrm;

The parentheses surround the error message from this function.  If the function executes another function and the second function is the one that failed, the nested parentheses indicate which function failed, the location of the error, and the error message that occurred.
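
For example, if fn_outer calls fn_inner and fn_inner fails on a division by zero, the raised message would look something like this (the names and locations are made up):

ERROR:  (fn_outer:2000:(fn_inner:1000:division by zero))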

Template

 

CREATE OR REPLACE FUNCTION fn_name() RETURNS void AS
$$

DECLARE
    v_function_name text := 'fn_name';
    v_location int;

BEGIN
    v_location := 1000;
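    -- Function body goes here; increment v_location before each logical step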

EXCEPTION
    WHEN OTHERS THEN
        RAISE EXCEPTION '(%:%:%)', v_function_name, v_location, sqlerrm;
END;

$$
LANGUAGE plpgsql;

As you develop your code, increment the v_location variable so that if you do get an error, the exception will include the approximate location in your code where it occurred.

More examples of my PL/pgSQL code can be found in the Outsourcer code.  I have about 8 functions that you can reference for examples of writing PL/pgSQL functions.

Ugly Data

It seems that everyone has it.  Ugly data.  Data in transactional systems that isn’t so clean, like control characters or rarely used characters that have programmatic meaning.  This ugly data typically errors out during a load process, which forces someone to cleanse the data by writing scripts and then load it again.

This is time consuming and feels like you are re-inventing the wheel over and over again.

Outsourcer does the cleansing for you! As the data is written to Greenplum, control characters are removed and special characters are escaped properly.

Here is a demo of Outsourcer in action loading ugly data.

Oracle Sessions

Just like for SQL Server, you may want to know what is going on in Oracle from inside Greenplum.  Here is an External Web Table that uses Outsourcer to give you that information.

First create a connection for Outsourcer:

INSERT INTO os.ext_connection
(type, server_name, instance_name, port, database_name, user_name, pass)
VALUES
('oracle', 'jonnywin', null, 1521, 'xe', 'os_test', 'os_password');

--get the ID you just inserted
SELECT id 
FROM os.ext_connection 
WHERE type = 'oracle' 
AND server_name = 'jonnywin'
AND port = 1521
AND database_name = 'xe';

--in my example, the value is 3.

Next create the schema in Greenplum for this table:

CREATE SCHEMA oracle;

This will create the External Table in Greenplum for Outsourcer:

--notice the double escaping of the dollar sign.
SELECT os.fn_create_ext_table('oracle.sessions', ARRAY[
'username varchar(30)','osuser varchar(30)','machine varchar(64)',
'terminal varchar(30)','program varchar(48)',
'module varchar(48)','sql_text text','logon_time timestamp',
'service_name varchar(64)'], 
3,
'SELECT s.USERNAME, s.OSUSER, s.MACHINE, s.TERMINAL, s.PROGRAM, s.MODULE,
a.sql_text, s.LOGON_TIME, s.service_name 
FROM v\\\\$session s, v\\\\$sqlarea a 
WHERE a.address = s.sql_address');

Now you can execute the query to get the session information:

select * from oracle.sessions;

More information on External Tables is in the Outsourcer documentation.

Why is SQL Server so slow?

This is a very good question.  When using Outsourcer, you may find a job runs longer or shorter from time to time, and in my experience, it is because the source database is waiting on something.  The source is the bottleneck 99% of the time.  So why is it slow today versus yesterday?

Databases have wait stats and SQL Server has this information too.  You can query SQL Server to find what it is waiting on by logging into SQL Server and executing a SQL command.  With the code provided below, you can leverage Outsourcer to query SQL Server from Greenplum!

First create a connection for Outsourcer:

INSERT INTO os.ext_connection
(type, server_name, instance_name, port, database_name, user_name, pass)
VALUES
('sqlserver', 'jonnywin', null, null, null, 'os_test', 'os_password');

--get the ID you just inserted
SELECT id 
FROM os.ext_connection 
WHERE type = 'sqlserver' 
AND server_name = 'jonnywin'

--in my example, the value is 1.

Next create the schema in Greenplum for this table:

CREATE SCHEMA sqlserver;

This will create the External Table in Greenplum for Outsourcer:

SELECT os.fn_create_ext_table('sqlserver.sessions', 
ARRAY['sql_time timestamp','start_time timestamp','status varchar(30)',
'session_id smallint','sqltext text'], 
1,
'SELECT getdate() as sql_time, req.start_time, req.status, 
req.session_id, sqltext.TEXT as sqltext 
FROM sys.dm_exec_requests req 
CROSS APPLY sys.dm_exec_sql_text(sql_handle) AS sqltext 
order by req.start_time');

Now you can execute the query to get the session information:

select * from sqlserver.sessions;

I have found that the most critical thing to look for from SQL Server is the “status”.  You will see it “Running” when it is actually doing something but “Suspended” when it is waiting on a resource.
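
A quick way to spot this from Greenplum (assuming the status values come back from SQL Server in lower case) is:

select * from sqlserver.sessions where lower(status) = 'suspended';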

So if you see queries generated by Outsourcer “Suspended” in SQL Server, that is why it is slower today versus yesterday.  Work with your SQL Server DBA to determine what resource(s) SQL Server is waiting on next.

More information on External Tables is in the Outsourcer documentation.