It took a couple of tries but after help from Craig Sylvester, I think I got it now. It now handles both Greenplum version 4.x and 5.x.
I recently found out that one of my assumptions in Greenplum was completely wrong! After all of these years, I didn’t understand how pg_stat_activity worked!
I always thought that each row in this system table equated to a single session so, if I wanted to get the number of sessions logged in, I could do this:
select count(*) from pg_stat_activity;
But, this is wrong! One row actually means one server process and not one session.
When a session submits a query, pg_stat_activity may report multiple rows for that single session as it execute the query. Why? A single query can spawn multiple backend processes and these backend processes can show up in pg_stat_activity.
Example – Query 22 from TPC-DS
More Information on TPC-DS
select i_product_name ,i_brand ,i_class ,i_category ,avg(inv_quantity_on_hand) qoh from inventory ,date_dim ,item where inv_date_sk=d_date_sk and inv_item_sk=i_item_sk and d_month_seq between 1212 and 1212 + 11 group by rollup(i_product_name ,i_brand ,i_class ,i_category) order by qoh, i_product_name, i_brand, i_class, i_category limit 100;
First, look at pg_stat_activity and see that I have two sessions open. One is executing the simple query from pg_stat_activity (sess_id=1013) and the other is idle (sess_id=1011).
gpadmin=# select procpid, sess_id, substring(current_query, 1, 20) from pg_stat_activity; procpid | sess_id | substring ---------+---------+---------------------- 31027 | 1013 | select procpid, sess 30804 | 1011 | <idle> (2 rows)
Now, execute Query 22 from sess_id=1011.
gpadmin=# select procpid, sess_id, substring(current_query, 1, 20) from pg_stat_activity; procpid | sess_id | substring ---------+---------+---------------------- 31027 | 1013 | select procpid, sess 30804 | 1011 | select i_product_nam 31137 | 1011 | select i_product_nam (3 rows)
I now see that sess_id=1011 has started a second process to complete the query. It is still a single database session but pg_stat_activity now shows two rows for this session.
Even after the query is over, I can still see the idle process created for sess_id=1011.
gpadmin=# select procpid, sess_id, substring(current_query, 1, 20) from pg_stat_activity; procpid | sess_id | substring ---------+---------+---------------------- 31027 | 1013 | select procpid, sess 30804 | 1011 | <idle> 31137 | 1011 | <idle> (3 rows)
So how do you get the number of sessions from pg_stat_activity? With this query:
gpadmin=# select count(*) from ( gpadmin(# select sess_id from pg_stat_activity group by sess_id gpadmin(# ) as sub; count ------- 2 (1 row)
When you create a table in Greenplum, the default storage options make the table heap (appendonly=false) which also means the table is row oriented and not compressed. But, you can change the default with the GUC gp_default_storage_options.
Here are the defaults:
gpadmin=# show gp_default_storage_options; gp_default_storage_options ---------------------------------------------------------------------------------- appendonly=false,blocksize=32768,compresstype=none,checksum=true,orientation=row
And a quick test, shows how you can default to something else.
gpadmin=# set gp_default_storage_options='appendonly=true'; SET gpadmin=# create table foo as select i from generate_series(1, 1000) as i distributed by (i); SELECT 1000 gpadmin=# \d foo Append-Only Table "public.foo" Column | Type | Modifiers --------+---------+----------- i | integer | Compression Type: None Compression Level: 0 Block Size: 32768 Checksum: t Distributed by: (i)
You can make the changes permanent with gpconfig.
gpconfig -c 'gp_default_storage_options' -v 'appendonly=true, orientation=column'
And you can make it limited to a single database.
ALTER DATABASE mytest SET gp_default_storage_options = 'appendonly=true, orientation=column';
Added a new configuration setting so you can set the Max Length for gpfdist. This is done by adding a new value to your gplink.properties file called, “gpfdistMaxLength”.
gpsnap and gpcronsnap are new utilities for Pivotal Greenplum on AWS. These utilities leverage Amazon EBS Snapshots to take very quick snapshot backups.
Elastic Block Storage (EBS) is the preferred storage in Amazon AWS for many reasons like persistence, performance, and high availability. A great feature of EBS is the ability to take a “Snapshot”. A snapshot can be taken of a disk volume and to get a consistent backup, you need to stop the database, take a snapshot of every volume, and then restart the database.
The restore process requires stopping the database, detaching and deleting the existing volumes, creating new volumes from the snapshots, attaching the volumes to the right nodes, mounting the new volumes to the right directories, and then starting the database. If you don’t delete the existing volumes and just create new ones, you will incur more storage cost in AWS and you may hit your disk quota. So take a snapshot before you execute a restore so you can revert back if needed.
gpsnap makes this entire process easy. It does all of the tasks needed to take a consistent snapshot of the database. It also automates the tasks for restoring from a snapshot.
The commands include:
– gpsnap list: lists snapshots
– gpsnap create: creates a new snapshot
– gpsnap delete
– gpsnap delete all: deletes all snapshots
– gpsnap restore
gpcronsnap is the second utility and you can probably guess from the name that it is the scheduling portion of the utility. By default, a snapshot backup will be created weekly and retain up to four snapshots. This is all configurable in a configuration file too.
So you can see, it is now a snap for Pivotal Greenplum on AWS to take snapshot backups!
Michael Goddard, who is a colleague of mine, put together this demo of the Self-Healing feature I created for the Pivotal Greenplum on AWS Marketplace products. If any node in the cluster fails, it will automatically be replaced and correctly added back!
I’ve been using Greenplum since version 3.0 back in 2007. That version was actually the first commercially available version too. The versions were released about once every 6 to 12 months and got as high as 3.3.
Once the version got to 4.3, the number just seemed to get stuck. 4.3.0 was a pretty big release which changed “Append Only” tables to be “Append Optimized” which simply meant you could start updating and deleting tables stored in the append-only format. But more enhancements came to 4.3 but the version number never exceeded 4.3.
Major enhancements came to version 4.3.x and the innovation came at a faster pace but you may not have noticed this if you were just looking at the version number. Pivotal’s engineering team embraced the Pivotal Labs methodologies of pair programming and quick iterations. There was a huge transformation happening in this group.
When 5.0 came out, it was a big deal. Greeplum had always been a fork of PostgreSQL 8.2 but with 5.0, Greenplum was rebased to 8.3. A lot of work went into this change which also requires migrating from 4.3.x to 5.0.
Now this next change, I didn’t expect. During the 4.3.x development, the only time the version number would change to either 4.4 or 5.0, it would be when the database required a migration to upgrade. With 5.x, the version numbers are coming fast and don’t require migrations to upgrade. It just a simple binary swap to upgrade.
The result has been the release of 5.0 on 2017-09-14, 5.1 on 2017-10-20, and 5.2 on 2017-11-18. Do you see the pattern? Monthly point releases! All of these releases so far have been simple binary upgrades but have a ton of improvements each time.
Version 6 is in sight. Seriously. I would expect this to be the next rebase of Greenplum to PostgreSQL 8.4 and will require a database migration to upgrade. It is amazing how the Labs culture and open source software development has enhanced the development of Greenplum.
I haven’t posted much here lately because I’ve been working on deploying GemFire and Greenplum in the AWS Marketplace. It has been interesting work so I thought I would make a post about it here.
Amazon Web Services (AWS)
Amazon has the most popular cloud in the US with many products available which now includes Pivotal Greenplum and Pivotal GemFire. Both products are available as Bring Your Own License (BYOL) or Billed Hourly. BYOL simply means you already own a license and you wish to apply this to a deployment in AWS.
If you are not familiar with AWS, there are many ways to configure resources which can be overwhelming to learn and optimize. I have done extensive testing in these product offerings to ensure stability, security, repeatability, and performance. With just a few clicks, you can have an on-demand cluster built that rivals a full rack of on-premise hardware.