How Do I Use Java In Greenplum?

This is the sixth in a series of “How do I do X in Greenplum database” and today’s topic is How do I use Java in Greenplum.

Greenplum supports many procedural languages and one of these is Java, exposed as a language called “pljava”. Because it is Java, you still have to compile the code and put it in a JAR file that is on the classpath. There are a couple of additional steps to configure this in Greenplum, so I will go through these steps.

Prerequisites:
1. JDK installed either on the master or a desktop. If on a desktop, you’ll need to copy the jar file to the master so you can then copy it to the segments.
2. Logged into the master (mdw) as user gpadmin.
3. Methods must be public and static.

First, write some Java. Here is an example of a file named Example.java:

public class Example
{
        public static String substring(String text, int beginIndex, int endIndex)
        {
                return text.substring(beginIndex, endIndex);
        }
}

Create a manifest file named manifest.txt:

Manifest-Version: 1.0
Main-Class: Example
Specification-Title: "Example"
Specification-Version: "1.0"
Created-By: 1.6.0_35-b10-428-11M3811
Build-Date: 09/28/2012 10:09 AM

Compile the Java:

javac *.java

Jar the file:

jar cfm analytics.jar manifest.txt *.class

Copy analytics.jar to all Greenplum servers. The gphosts_file contains a list of all of your servers. An example of that is:

mdw
smdw
sdw1
sdw2
sdw3
sdw4

And the command to copy the Jar file:

gpscp -f gphosts_file analytics.jar =:/usr/local/greenplum-db/lib/postgresql/java/

Set your classpath variable inside Greenplum:

gpconfig -c pljava_classpath -v \'analytics.jar\'

Apply the configuration change:

gpstop -u

Install the pljava language and in this example, I’m putting it into the gpdb database. Note: this is a one-time activity per database.

psql gpdb -f $GPHOME/share/postgresql/pljava/install.sql
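
To verify that the language was registered, you can query the pg_language catalog. This is just a sanity check; PL/Java typically registers ‘java’ and an untrusted ‘javau’ variant, but the exact names can vary by version:

SELECT lanname FROM pg_language WHERE lanname LIKE 'java%';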

Make sure the classpath is set correctly:

show pljava_classpath

It should say ‘analytics.jar’.

And finally, a working example:

create table temp (a varchar) distributed randomly;
insert into temp values ('a string');

--Example function
create or replace function java_substring(varchar, int, int)
returns varchar as 'Example.substring' language java;

--Example execution
select java_substring(a, 1, 3) from temp;
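
If everything is configured correctly, the result follows Java’s String.substring semantics (zero-based and end-exclusive), so positions 1 and 2 of ‘a string’ are returned:

 java_substring 
----------------
  s
(1 row)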

This was a pretty simple example of using PL/Java. Enjoy!

How Do I Prevent Blocking Locks in Greenplum?

This is the fifth in a series of “How do I do X in Greenplum database” and today’s topic is How do I prevent “blocking locks” in Greenplum?

Since different databases handle read consistency and locking differently, I will discuss how Oracle, Microsoft SQL Server, PostgreSQL, and finally Greenplum handle locking. This will give you a good way to compare and contrast the different products.

Oracle
Oracle handles read consistency by using “rollback”. When you update a table, the row is overwritten, but first the old version of the data is put into a rollback segment and the row is marked with a timestamp and a pointer to the old version. This allows users to query the database and not be blocked while an update happens.

If a query starts at time1 and an update happens at time2, the query will see a snapshot of the data as of time1. To make this work, the query follows the pointer to the rollback segment to get the old version of the rows that were updated.

Oracle Problems
There are a couple of problems with this solution. The first is extra disk IO, because the database has to read the data file and then also read the rollback segment.

The second problem is that the rollback segment must be sized pretty large in a warehouse environment, where a query can easily execute for a long period of time. If you are actively changing data, you will eventually run out of rollback and get the “snapshot too old” error on long-running queries.

SQL Server
SQL Server uses locking by default to provide read consistency. This means an update or even an insert can block other sessions from reading a table. Yuck. Starting with SQL Server 2005, you can enable Read Committed Snapshot Isolation (RCSI), which makes the database behave more like Oracle, but it uses TempDB instead of rollback segments to store the old versions of updated rows.

SQL Server Problems
To get around locking, developers will use “read uncommitted” isolation, which is a dirty read. I never use a dirty read because it is a hack. Don’t do it.

RCSI uses TempDB, so more IO is involved. Unfortunately, I almost never see it used in companies. Most SQL Server shops don’t even know this option is available, so they continue with their dirty reads.

PostgreSQL
PostgreSQL uses Multi-Version Concurrency Control (MVCC) to handle read consistency. In a nutshell, readers don’t block writers and writers don’t block readers. Each transaction sees a snapshot of the database rather than locking tables.

To a user, PostgreSQL behaves similarly to Oracle, but it doesn’t use rollback. Instead, an update adds a new row version to the existing table. Because PostgreSQL doesn’t use rollback, you will never get a “snapshot too old” error on long-running queries.

PostgreSQL Problems
The downside to PostgreSQL happens when an update affects a large percentage of a table. If you update an entire table, it effectively doubles the size on disk. To make the space from the old rows reusable, a “vacuum” will need to be run so that new updates and/or inserts will use the expired rows. PostgreSQL uses a background daemon to automatically vacuum tables, so this usually isn’t a big deal.
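
For illustration, here is what a manual vacuum looks like; the table name is hypothetical, and in practice the autovacuum daemon usually handles this for you:

-- makes the space held by old row versions reusable by future inserts and updates
VACUUM my_table;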

Greenplum
Greenplum is a fork of PostgreSQL 8.2 so it uses MVCC to handle read consistency. Readers don’t block writers and writers don’t block readers. Each transaction sees a snapshot of the database rather than locking tables.

Greenplum MVCC Exception
Greenplum differs from PostgreSQL for update and delete commands: it acquires an Exclusive Lock on the table. However, this lock doesn’t block readers. SQL queries will return data and users will see a snapshot of the data that isn’t dirty. This Exclusive Lock does, however, block other update and delete commands on the same table, which is different from PostgreSQL.
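
You can see this behavior for yourself with two sessions. Here is a sketch using a hypothetical sales table, plus a query against the pg_locks catalog to see which session is waiting:

-- Session 1: leave the transaction open
BEGIN;
UPDATE sales SET amount = amount * 1.1;

-- Session 2: a second UPDATE waits until Session 1 commits or rolls back
UPDATE sales SET amount = amount * 2;

-- Session 2 (or any session): readers are never blocked
SELECT count(*) FROM sales;

-- Any session: granted = false shows who is waiting
SELECT locktype, relation::regclass, mode, granted
FROM pg_locks
WHERE relation = 'sales'::regclass;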

Summary
Greenplum Database provides read consistency with Multi-Version Concurrency Control (MVCC), similarly to PostgreSQL, but update and delete operations on the same table are serialized. You don’t need to worry about blocking locks when selecting data because MVCC uses snapshots of the database instead of blocking locks. You can get a blocking lock when an update or delete runs while another update or delete is executing on the same table, but readers aren’t blocked! You will likely never even notice this lock because the database handles the updates and deletes one at a time while you can still select data.

How Do I Tune Greenplum?

This is the fourth in a series of “How do I do X in Greenplum database” and today’s topic is How do I tune Greenplum?

I’ve been a DBA and Developer using Oracle, SQL Server, Sybase, PostgreSQL, and Greenplum. By far, “tuning” Greenplum is easier than tuning the other databases, and the main reason is the architecture of Greenplum versus an OLTP database. With those other databases, I always found myself looking at the top 20 worst SQL queries and then doing three things. First, I would tell the users that they write “bad” SQL. Second, I would create more and more indexes to handle the variety of queries. Lastly, I would redesign my dimensional model to handle how users really wanted to look at the data.

Data Model
Greenplum is data model agnostic. You can use your OLTP data model directly in the database or a 3NF model or a star schema or even a very flat single table. Some databases force you into a particular data model which isn’t agile at all. It even forces waterfall development because it takes a lot of time to design those star schemas!

So now, my steps to tune Greenplum!

Step 1: Distribution
Greenplum is a shared-nothing database where no two nodes share the same data. Data is spread across multiple servers based on a distribution key defined on each table. A good key is typically a unique identifier in a table, and this can be a single column or multiple columns. If you pick a good key, each segment will have roughly the same number of rows; at Greenplum we call the evenness of this spread the “skew”.
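
The distribution key is declared when you create the table. A minimal sketch with hypothetical names:

-- rows are spread across segments by a hash of customer_id
CREATE TABLE customer (customer_id int, fname text, lname text)
DISTRIBUTED BY (customer_id);

-- no good candidate key? spread the rows evenly instead
CREATE TABLE audit_log (event_text text) DISTRIBUTED RANDOMLY;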

Here is how to check the distribution or skew of a table:

SELECT gp_segment_id, count(*)
FROM schema.table
GROUP BY gp_segment_id;

For most tables, this is all of the tuning that is needed. It is unusual that all of the tables in a database are very large. Usually there are a few large tables along with a large number of medium and small tables.

Step 2: Partitioning
For your larger tables, you will want to partition the tables to eliminate disk IO. Greenplum supports list, range, and multi-level partitioning.

CREATE TABLE sales (trans_id int, date date, amount decimal(9,2), region text) 
DISTRIBUTED BY (trans_id)
PARTITION BY RANGE (date)
SUBPARTITION BY LIST (region)
SUBPARTITION TEMPLATE (
    SUBPARTITION usa VALUES ('usa'), 
    SUBPARTITION asia VALUES ('asia'), 
    SUBPARTITION europe VALUES ('europe'),
    DEFAULT SUBPARTITION other_regions)
(START (date '2008-01-01') INCLUSIVE 
END (date '2009-01-01') EXCLUSIVE EVERY (INTERVAL '1 month'),
DEFAULT PARTITION outlying_dates );

The Greenplum optimizer will eliminate partitions based on the SQL executed. Again, partitioning is for your larger tables and won’t help very much for smaller tables.
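
For example, a query that filters on the partition columns only has to scan the matching partitions, and you can confirm the elimination by looking at the query plan; the plan should show that only a subset of the partitions of the sales table above is scanned:

EXPLAIN SELECT * FROM sales
WHERE date = '2008-01-15' AND region = 'usa';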

Step 3: Compression and Column Orientation
Greenplum supports both row- and column-oriented tables (and partitions). For your larger tables, and especially the ones with lots of columns, using column orientation is a good idea. It is also a good idea to compress these tables. Both are basically strategies to reduce disk IO.

Here is an example:

CREATE TABLE my_customer 
WITH (APPENDONLY=true, ORIENTATION=column, COMPRESSTYPE=quicklz) AS
SELECT customer_id, fname, lname,
address1, address2, city, state, zip,
create_date, bill_amount
FROM customer
WHERE type_cd = 'xyz'
DISTRIBUTED BY (customer_id);

I only do this for my large tables. It isn’t worth it to do this unless the tables are pretty large.

Step 4: Indexes
Last but not least are indexes. I rarely use an index except for enforcing a Primary Key. Greenplum supports B-Tree, Bitmap, and GiST indexes.

– B-Tree indexes can be unique or not.
– Bitmap indexes are useful when there are 100 to 100,000 distinct values.
– GiST indexes are used to support PostGIS.
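
Here is what the first two look like; the table and column names are hypothetical, and the bitmap index assumes a column with moderate cardinality:

-- unique B-Tree index, e.g. to enforce a primary key on a heap table
CREATE UNIQUE INDEX ix_customer_id ON customer (customer_id);

-- bitmap index on a lower-cardinality column
CREATE INDEX ix_customer_state ON customer USING bitmap (state_code);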

Again, I rarely use indexes but there are cases when it is helpful. It is the last in the line of things to do to “tune” your database.

Summary
Tuning in Greenplum doesn’t mean looking at lots of queries; instead, it means checking table distribution. Then, for your larger tables (probably 10% or less of your database), partition, compress, and make them column oriented. Then, for maybe 1 or 2 tables, use an index.

Gone are the days of looking at the “bad” queries. No more creating indexes for everything under the sun. No more waterfall data modeling projects either! Just load the data and follow my four tuning steps.

How Do I Determine The Language For A Function?

This is the third in a series of “How do I do X in Greenplum database” and today’s topic is How do I determine the language for a function.

Inline SQL
A common way to use a function is to transform data directly in the SELECT portion of a query. Here is a quick example.

CREATE TABLE person
(id int NOT NULL,
 fname text,
 lname text,
 company_name text)
DISTRIBUTED BY (id);

--sample data
INSERT INTO person VALUES (1, 'Jon', 'Roberts', null);
INSERT INTO person VALUES (2, null, 'Smith', null);
INSERT INTO person VALUES (3, null, null, 'ABC Company');

Now the use case is to retrieve the “name” of the person, but it might be a combination of the three text columns. A business decision was made to format the name as follows:

SELECT CASE WHEN company_name IS NULL AND fname IS NOT NULL AND lname IS NOT NULL THEN lname || ', ' || fname
            WHEN fname IS NULL AND lname IS NOT NULL THEN lname
            WHEN company_name IS NOT NULL AND fname IS NOT NULL AND lname IS NOT NULL THEN company_name || ': ' || lname || ', ' || fname
            WHEN company_name IS NOT NULL AND fname IS NULL AND lname IS NOT NULL THEN company_name || ': ' || lname 
            WHEN company_name IS NOT NULL AND fname IS NULL AND lname IS NULL THEN company_name END as name
FROM person;

     name     
--------------
 Roberts, Jon
 ABC Company
 Smith
(3 rows)

It would be much easier and more consistent to add this logic to a function so that it is centralized in a single place, which also makes writing the SQL a bit easier. This is when using the language “SQL” is preferred. Here is an example; notice that parameters are referred to as $1, $2, and $3 rather than by name. The SQL language does not support named parameters.

CREATE OR REPLACE function fn_get_name(p_fname text, p_lname text, p_company_name text) RETURNS text AS
$$
SELECT CASE WHEN $3 IS NULL AND $1 IS NOT NULL AND $2 IS NOT NULL THEN $2 || ', ' || $1
            WHEN $1 IS NULL AND $2 IS NOT NULL THEN $2
            WHEN $3 IS NOT NULL AND $1 IS NOT NULL AND $2 IS NOT NULL THEN $3 || ': ' || $2 || ', ' || $1
            WHEN $3 IS NOT NULL AND $1 IS NULL AND $2 IS NOT NULL THEN $3 || ': ' || $2 
            WHEN $3 IS NOT NULL AND $1 IS NULL AND $2 IS NULL THEN $3 END as name
$$
LANGUAGE sql;

And now using this function.

SELECT fn_get_name(fname, lname, company_name) AS name FROM person;
     name     
--------------
 Roberts, Jon
 ABC Company
 Smith
(3 rows)

Transformation
The next use case is transforming data, and PL/pgSQL does a great job at this. You can reference parameters by name, declare variables, handle errors, etc. The code looks a lot like Oracle PL/SQL too.

CREATE OR REPLACE FUNCTION fn_update_person() RETURNS void AS
$$
DECLARE
    v_rowcount int;
BEGIN
    UPDATE person
    SET fname = initcap(fname),
        lname = initcap(lname),
        company_name = initcap(company_name);

    GET DIAGNOSTICS v_rowcount = ROW_COUNT;

    RAISE INFO '% Rows Updated.', v_rowcount;

END;
$$
LANGUAGE plpgsql VOLATILE;

Notice how I’m not looping through the data or issuing a commit after so many records. There are two reasons for this. First, Greenplum is designed for Big Data, so manipulating lots of data in a single transaction is ideal. Second, a function in Greenplum executes in a single transaction. You can’t control transactions within a function because the function call itself already runs inside a transaction.
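
Executing the function is just a SELECT, and given the three sample rows inserted earlier, you should see the INFO message report three rows:

select fn_update_person();
INFO:  3 Rows Updated.
 fn_update_person 
------------------
 
(1 row)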

Summary
I hope this helps. Use the SQL language when transforming data inline in a query, and use the more robust PL/pgSQL language for procedural transformation logic.

How Do I Create Dynamic SQL in Greenplum Database?

This is the second in a series of “How do I do X in Greenplum database” and today’s topic is How do I create dynamic SQL.

Method 1: psql
The psql program is a command line tool similar to isql, sqlcmd, or sqlplus. You can use it to dynamically create SQL statements to execute and then pipe them to another instance of psql, which executes each statement in a separate transaction.

Example:
In this example, I will create a script dynamically that will analyze each table in the retail schema.

psql -A -t -c "SELECT 'ANALYZE ' || table_schema || '.' || table_name || ';' FROM information_schema.tables WHERE table_schema = 'retail'" | psql -a
ANALYZE retail.d_customers;
ANALYZE
ANALYZE retail.d_payment_methods;
ANALYZE
ANALYZE retail.d_products;
ANALYZE
ANALYZE retail.f_sales;
ANALYZE

psql is executed from the shell with -A (unaligned table output mode), -t (print rows only), and -c (command to execute). The dynamic SQL is generated and then executed by piping it to a second psql with -a, which echoes each command as it is executed.

This is a nice feature when you need to have each statement executed in a separate transaction.

Method 2: PL/pgSQL
The second way of doing this same task is in PL/pgSQL.

CREATE OR REPLACE FUNCTION retail.fn_analyze() RETURNS void AS
$$

DECLARE
    v_function_name text := 'retail.fn_analyze';
    v_location int;
    v_rec record;

BEGIN
    v_location := 1000;
    FOR v_rec IN SELECT 'analyze ' || table_schema || '.' || table_name AS sql
                 FROM information_schema.tables 
                 WHERE table_schema = 'retail' LOOP

        RAISE INFO '%: %', clock_timestamp(), v_rec.sql;        
        EXECUTE v_rec.sql;

    END LOOP;
    
EXCEPTION
    WHEN OTHERS THEN
        RAISE EXCEPTION '(%:%:%)', v_function_name, v_location, sqlerrm;
END;

$$
LANGUAGE plpgsql;

And the output of the execution.

psql -c "select retail.fn_analyze()"
INFO:  2012-09-11 11:21:45.580327-05: analyze retail.d_customers
INFO:  2012-09-11 11:21:46.102631-05: analyze retail.d_payment_methods
INFO:  2012-09-11 11:21:46.225238-05: analyze retail.d_products
INFO:  2012-09-11 11:21:46.556235-05: analyze retail.f_sales
 fn_analyze 
------------
 
(1 row)

Note that all of the work done in a function is in a single transaction. If there is a problem and the function fails, all of the work done up to that point will be rolled back. So in the case of analyzing tables, the psql technique is probably a better choice but for other use cases, the function technique is better.

How Do I Get The Number Of Rows Inserted?

This is the first in a series of “How do I do X in Greenplum database” and today’s first topic is How do I get the number of rows Inserted (or Updated/Deleted).

Knowing the number of rows Inserted, Updated, or Deleted is pretty common in database programming.  You may want to key off of this to determine the next step in a procedure or just capture the number of rows in an ETL process.

Sybase and Microsoft SQL Server use a Global Variable to do this which is @@ROWCOUNT.  It is dynamically populated after each statement execution.  In Oracle, you can use SQL%ROWCOUNT to do the same thing.

In Greenplum and in a PL/pgSQL function, you can get the rowcount by using GET DIAGNOSTICS.  Here is an example.

CREATE OR REPLACE FUNCTION fn_update_person() RETURNS void AS
$$
DECLARE
    v_rowcount int;
BEGIN
    UPDATE person
    SET state_code = UPPER(state_code);

    GET DIAGNOSTICS v_rowcount = ROW_COUNT;

    RAISE INFO '% Rows Updated.', v_rowcount;

END;
$$
language plpgsql;
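
The same pattern works for INSERT and DELETE, and you can return the count to the caller so an ETL process can key off of it. Here is a quick sketch with hypothetical staging tables:

CREATE OR REPLACE FUNCTION fn_load_person_stage() RETURNS int AS
$$
DECLARE
    v_rowcount int;
BEGIN
    INSERT INTO person_stage
    SELECT * FROM person_source;

    -- ROW_COUNT reflects the most recently executed statement
    GET DIAGNOSTICS v_rowcount = ROW_COUNT;

    RETURN v_rowcount;
END;
$$
language plpgsql;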

There you go. Pretty simple to get the ROW_COUNT in Greenplum.