Default Storage Options

When you create a table in Greenplum, the default storage options make the table heap (appendonly=false) which also means the table is row oriented and not compressed. But, you can change the default with the GUC gp_default_storage_options.

Here are the defaults:

gpadmin=# show gp_default_storage_options;

And a quick test, shows how you can default to something else.

gpadmin=# set gp_default_storage_options='appendonly=true';                                                                                                                                               SET
gpadmin=# create table foo as select i from generate_series(1, 1000) as i distributed by (i);
gpadmin=# \d foo
Append-Only Table ""
 Column |  Type   | Modifiers 
 i      | integer | 
Compression Type: None
Compression Level: 0
Block Size: 32768
Checksum: t
Distributed by: (i)

You can make the changes permanent with gpconfig.

gpconfig -c 'gp_default_storage_options' -v 'appendonly=true, orientation=column'

And you can make it limited to a single database.

ALTER DATABASE mytest SET gp_default_storage_options = 'appendonly=true, orientation=column';

It’s a Snap!

gpsnap and gpcronsnap are new utilities for Pivotal Greenplum on AWS. These utilities leverage Amazon EBS Snapshots to take very quick snapshot backups.

EBS Snapshot
Elastic Block Storage (EBS) is the preferred storage in Amazon AWS for many reasons like persistence, performance, and high availability. A great feature of EBS is the ability to take a “Snapshot”. A snapshot can be taken of a disk volume and to get a consistent backup, you need to stop the database, take a snapshot of every volume, and then restart the database.

EBS Restore
The restore process requires stopping the database, detaching and deleting the existing volumes, creating new volumes from the snapshots, attaching the volumes to the right nodes, mounting the new volumes to the right directories, and then starting the database. If you don’t delete the existing volumes and just create new ones, you will incur more storage cost in AWS and you may hit your disk quota. So take a snapshot before you execute a restore so you can revert back if needed.

gpsnap makes this entire process easy. It does all of the tasks needed to take a consistent snapshot of the database. It also automates the tasks for restoring from a snapshot.

The commands include:
– gpsnap list: lists snapshots
– gpsnap create: creates a new snapshot
– gpsnap delete : deletes a specific snapshot
– gpsnap delete all: deletes all snapshots
– gpsnap restore : restores a specific snapshot

gpcronsnap is the second utility and you can probably guess from the name that it is the scheduling portion of the utility. By default, a snapshot backup will be created weekly and retain up to four snapshots. This is all configurable in a configuration file too.

So you can see, it is now a snap for Pivotal Greenplum on AWS to take snapshot backups!

You can either Bring Your Own License (BYOL) or pay Hourly for Greenplum on AWS. Here are the links:
Pivotal Greenplum on AWS (BYOL)
Pivotal Greenplum on AWS (Hourly)

Pivotal Greenplum Version 5.x

Version 3.x
I’ve been using Greenplum since version 3.0 back in 2007. That version was actually the first commercially available version too. The versions were released about once every 6 to 12 months and got as high as 3.3.

Version 4.x
Once the version got to 4.3, the number just seemed to get stuck. 4.3.0 was a pretty big release which changed “Append Only” tables to be “Append Optimized” which simply meant you could start updating and deleting tables stored in the append-only format. But more enhancements came to 4.3 but the version number never exceeded 4.3.

Labs Methodology
Major enhancements came to version 4.3.x and the innovation came at a faster pace but you may not have noticed this if you were just looking at the version number. Pivotal’s engineering team embraced the Pivotal Labs methodologies of pair programming and quick iterations. There was a huge transformation happening in this group.

Version 5.x
When 5.0 came out, it was a big deal. Greeplum had always been a fork of PostgreSQL 8.2 but with 5.0, Greenplum was rebased to 8.3. A lot of work went into this change which also requires migrating from 4.3.x to 5.0.

Now this next change, I didn’t expect. During the 4.3.x development, the only time the version number would change to either 4.4 or 5.0, it would be when the database required a migration to upgrade. With 5.x, the version numbers are coming fast and don’t require migrations to upgrade. It just a simple binary swap to upgrade.

The result has been the release of 5.0 on 2017-09-14, 5.1 on 2017-10-20, and 5.2 on 2017-11-18. Do you see the pattern? Monthly point releases! All of these releases so far have been simple binary upgrades but have a ton of improvements each time.

Version 6.0?
Version 6 is in sight. Seriously. I would expect this to be the next rebase of Greenplum to PostgreSQL 8.4 and will require a database migration to upgrade. It is amazing how the Labs culture and open source software development has enhanced the development of Greenplum.

AWS Marketplace

I haven’t posted much here lately because I’ve been working on deploying GemFire and Greenplum in the AWS Marketplace. It has been interesting work so I thought I would make a post about it here.

Amazon Web Services (AWS)
Amazon has the most popular cloud in the US with many products available which now includes Pivotal Greenplum and Pivotal GemFire. Both products are available as Bring Your Own License (BYOL) or Billed Hourly. BYOL simply means you already own a license and you wish to apply this to a deployment in AWS.

If you are not familiar with AWS, there are many ways to configure resources which can be overwhelming to learn and optimize. I have done extensive testing in these product offerings to ensure stability, security, repeatability, and performance. With just a few clicks, you can have an on-demand cluster built that rivals a full rack of on-premise hardware.

Greenplum and GemFire on AWS
Greenplum Billed Hourly
Greenplum Bring Your Own License (BYOL)
GemFire Billed Hourly
GemFire BYOL

Dropping Corrupt Tables in Greenplum

Greenplum is a Massively Parallel Processing database which means the data is distributed across multiple segments on multiple hosts. Each segment process stores a catalog of objects and it is possible to get this out of sync because of hardware failures or killing processes. For example, if you are vacuuming a table and then decide to kill the process on the Master host, the segments may not get the signal properly and rollback the transaction. This can happen when using “kill -9” rather than just “kill” to stop a process.

You may get this error message if a table is corrupt:

gpadmin=# select * from corrupt;
ERROR:  relation with OID 109802 does not exist  (seg0 slice1 gpdbsne:40000 pid=78502)

So you decide to drop the table and start over but that doesn’t work either.

gpadmin=# drop table corrupt;
ERROR:  table "corrupt" does not exist  (seg0 gpdbsne:40000 pid=78502)

The easy fix for this is to use “if exists” in the drop command.

gpadmin=# drop table if exists corrupt;
Time: 10.423 ms

Now the table is gone and you can recreate it. Next time, don’t use kill -9 and don’t kill vacuum commands.

More Details
If you are wondering how I managed to corrupt this table, here are the steps.
1. Create the table

gpadmin=# create table corrupt (id int, foo text, bar text) distributed by (id);
Time: 31.195 ms

2. Insert some data

insert into corrupt select i as id, 'foo_' || i, 'bar_' || i from generate_series(1,100) as i;
INSERT 0 100
Time: 46.419 ms
gpadmin=# analyze corrupt;
Time: 108.193 ms
gpadmin=# \q

3. Connect to a segment directly. I’m using a single node for this test.

[gpadmin@gpdbsne ~]$ ps -ef | grep postgres | grep M
gpadmin   78213      1  0 09:49 ?        00:00:00 /usr/local/greenplum-db- -D /data/primary/gpseg1 -p 40001 -b 3 -z 2 --silent-mode=true -i -M mirrorless -C 1
gpadmin   78214      1  0 09:49 ?        00:00:00 /usr/local/greenplum-db- -D /data/primary/gpseg0 -p 40000 -b 2 -z 2 --silent-mode=true -i -M mirrorless -C 0
gpadmin   78239      1  0 09:49 ?        00:00:00 /usr/local/greenplum-db- -D /data/master/gpseg-1 -p 5432 -b 1 -z 2 --silent-mode=true -i -M master -C -1 -x 0 -E

I see that the segments are using ports 40,000 and 40,001 while the master is using 5432. I’ll connect to seg0 on port 40,000 and then drop the table. This only drops the object in this one segment. I’ll also do a quick query to make sure I only see half of the data because I have 2 segments.

PGOPTIONS='-c gp_session_role=utility' psql -p 40000
psql (8.2.15)
Type "help" for help.

gpadmin=# select count(*) from corrupt;
(1 row)

Time: 3.485 ms

4. Corrupt the database by dropping the table only on one segment.

drop table corrupt;
Time: 9.410 ms
gpadmin=# \q

5. Confirm that the table is corrupt.

[gpadmin@gpdbsne ~]$ psql
Timing is on.
psql (8.2.15)
Type "help" for help.

gpadmin=# select * from corrupt;
ERROR:  relation with OID 109802 does not exist  (seg0 slice1 gpdbsne:40000 pid=78502)

And again, the fix:

gpadmin=# drop table if exists corrupt;
Time: 10.423 ms