gpsnap Video

gpsnap is a utility currently available for Greenplum in AWS and Azure. It automates backup and recovery with AWS EBS snapshots and Azure Disk snapshots. For AWS, you can also copy these snapshots to a different AWS Region which is an ideal solution for Disaster Recovery.

Check it out!

Monitoring Sessions

I recently found out that one of my assumptions in Greenplum was completely wrong! After all of these years, I didn’t understand how pg_stat_activity worked!

I always thought that each row in this system table equated to a single session so, if I wanted to get the number of sessions logged in, I could do this:

select count(*) from pg_stat_activity;

But, this is wrong! One row actually means one server process and not one session.

When a session submits a query, pg_stat_activity may report multiple rows for that single session as it execute the query. Why? A single query can spawn multiple backend processes and these backend processes can show up in pg_stat_activity.

Example – Query 22 from TPC-DS
More Information on TPC-DS

select i_product_name
,i_brand
,i_class
,i_category
,avg(inv_quantity_on_hand) qoh
from inventory
,date_dim
,item
where inv_date_sk=d_date_sk
and inv_item_sk=i_item_sk
and d_month_seq between 1212 and 1212 + 11
group by rollup(i_product_name
,i_brand
,i_class
,i_category)
order by qoh, i_product_name, i_brand, i_class, i_category
limit 100;

First, look at pg_stat_activity and see that I have two sessions open. One is executing the simple query from pg_stat_activity (sess_id=1013) and the other is idle (sess_id=1011).

gpadmin=# select procpid, sess_id, substring(current_query, 1, 20) from pg_stat_activity;
 procpid | sess_id |      substring       
---------+---------+----------------------
   31027 |    1013 | select procpid, sess
   30804 |    1011 | <idle>
(2 rows)

Now, execute Query 22 from sess_id=1011.

gpadmin=# select procpid, sess_id, substring(current_query, 1, 20) from pg_stat_activity;
 procpid | sess_id |      substring       
---------+---------+----------------------
   31027 |    1013 | select procpid, sess
   30804 |    1011 | select i_product_nam
   31137 |    1011 | select i_product_nam
(3 rows)

I now see that sess_id=1011 has started a second process to complete the query. It is still a single database session but pg_stat_activity now shows two rows for this session.

Even after the query is over, I can still see the idle process created for sess_id=1011.

gpadmin=# select procpid, sess_id, substring(current_query, 1, 20) from pg_stat_activity;
 procpid | sess_id |      substring       
---------+---------+----------------------
   31027 |    1013 | select procpid, sess
   30804 |    1011 | <idle>
   31137 |    1011 | <idle>
(3 rows)

So how do you get the number of sessions from pg_stat_activity? With this query:

gpadmin=# select count(*) from (
gpadmin(#     select sess_id from pg_stat_activity group by sess_id
gpadmin(# ) as sub;
 count 
-------
     2
(1 row)

Default Storage Options

When you create a table in Greenplum, the default storage options make the table heap (appendonly=false) which also means the table is row oriented and not compressed. But, you can change the default with the GUC gp_default_storage_options.

Here are the defaults:

gpadmin=# show gp_default_storage_options;
                            gp_default_storage_options                            
----------------------------------------------------------------------------------
 appendonly=false,blocksize=32768,compresstype=none,checksum=true,orientation=row

And a quick test, shows how you can default to something else.

gpadmin=# set gp_default_storage_options='appendonly=true';                                                                                                                                               SET
gpadmin=# create table foo as select i from generate_series(1, 1000) as i distributed by (i);
SELECT 1000
gpadmin=# \d foo
Append-Only Table "public.foo"
 Column |  Type   | Modifiers 
--------+---------+-----------
 i      | integer | 
Compression Type: None
Compression Level: 0
Block Size: 32768
Checksum: t
Distributed by: (i)

You can make the changes permanent with gpconfig.

gpconfig -c 'gp_default_storage_options' -v 'appendonly=true, orientation=column'

And you can make it limited to a single database.

ALTER DATABASE mytest SET gp_default_storage_options = 'appendonly=true, orientation=column';

It’s a Snap!

gpsnap and gpcronsnap are new utilities for Pivotal Greenplum on AWS. These utilities leverage Amazon EBS Snapshots to take very quick snapshot backups.

EBS Snapshot
Elastic Block Storage (EBS) is the preferred storage in Amazon AWS for many reasons like persistence, performance, and high availability. A great feature of EBS is the ability to take a “Snapshot”. A snapshot can be taken of a disk volume and to get a consistent backup, you need to stop the database, take a snapshot of every volume, and then restart the database.

EBS Restore
The restore process requires stopping the database, detaching and deleting the existing volumes, creating new volumes from the snapshots, attaching the volumes to the right nodes, mounting the new volumes to the right directories, and then starting the database. If you don’t delete the existing volumes and just create new ones, you will incur more storage cost in AWS and you may hit your disk quota. So take a snapshot before you execute a restore so you can revert back if needed.

gpsnap makes this entire process easy. It does all of the tasks needed to take a consistent snapshot of the database. It also automates the tasks for restoring from a snapshot.

The commands include:
– gpsnap list: lists snapshots
– gpsnap create: creates a new snapshot
– gpsnap delete : deletes a specific snapshot
– gpsnap delete all: deletes all snapshots
– gpsnap restore : restores a specific snapshot

gpcronsnap is the second utility and you can probably guess from the name that it is the scheduling portion of the utility. By default, a snapshot backup will be created weekly and retain up to four snapshots. This is all configurable in a configuration file too.

So you can see, it is now a snap for Pivotal Greenplum on AWS to take snapshot backups!

You can either Bring Your Own License (BYOL) or pay Hourly for Greenplum on AWS. Here are the links:
Pivotal Greenplum on AWS (BYOL)
Pivotal Greenplum on AWS (Hourly)

Pivotal Greenplum Version 5.x

Version 3.x
I’ve been using Greenplum since version 3.0 back in 2007. That version was actually the first commercially available version too. The versions were released about once every 6 to 12 months and got as high as 3.3.

Version 4.x
Once the version got to 4.3, the number just seemed to get stuck. 4.3.0 was a pretty big release which changed “Append Only” tables to be “Append Optimized” which simply meant you could start updating and deleting tables stored in the append-only format. But more enhancements came to 4.3 but the version number never exceeded 4.3.

Labs Methodology
Major enhancements came to version 4.3.x and the innovation came at a faster pace but you may not have noticed this if you were just looking at the version number. Pivotal’s engineering team embraced the Pivotal Labs methodologies of pair programming and quick iterations. There was a huge transformation happening in this group.

Version 5.x
When 5.0 came out, it was a big deal. Greeplum had always been a fork of PostgreSQL 8.2 but with 5.0, Greenplum was rebased to 8.3. A lot of work went into this change which also requires migrating from 4.3.x to 5.0.

Now this next change, I didn’t expect. During the 4.3.x development, the only time the version number would change to either 4.4 or 5.0, it would be when the database required a migration to upgrade. With 5.x, the version numbers are coming fast and don’t require migrations to upgrade. It just a simple binary swap to upgrade.

The result has been the release of 5.0 on 2017-09-14, 5.1 on 2017-10-20, and 5.2 on 2017-11-18. Do you see the pattern? Monthly point releases! All of these releases so far have been simple binary upgrades but have a ton of improvements each time.

Version 6.0?
Version 6 is in sight. Seriously. I would expect this to be the next rebase of Greenplum to PostgreSQL 8.4 and will require a database migration to upgrade. It is amazing how the Labs culture and open source software development has enhanced the development of Greenplum.