– SQL Server replicated tables had incorrect scale for numeric columns.
– Changed functions to use legacy optimizer to avoid logging in Greenplum.
When you create a table in Greenplum, the default storage options make the table heap (appendonly=false) which also means the table is row oriented and not compressed. But, you can change the default with the GUC gp_default_storage_options.
Here are the defaults:
gpadmin=# show gp_default_storage_options; gp_default_storage_options ---------------------------------------------------------------------------------- appendonly=false,blocksize=32768,compresstype=none,checksum=true,orientation=row
And a quick test, shows how you can default to something else.
gpadmin=# set gp_default_storage_options='appendonly=true'; SET gpadmin=# create table foo as select i from generate_series(1, 1000) as i distributed by (i); SELECT 1000 gpadmin=# \d foo Append-Only Table "public.foo" Column | Type | Modifiers --------+---------+----------- i | integer | Compression Type: None Compression Level: 0 Block Size: 32768 Checksum: t Distributed by: (i)
You can make the changes permanent with gpconfig.
gpconfig -c 'gp_default_storage_options' -v 'appendonly=true, orientation=column'
And you can make it limited to a single database.
ALTER DATABASE mytest SET gp_default_storage_options = 'appendonly=true, orientation=column';
Added a new configuration setting so you can set the Max Length for gpfdist. This is done by adding a new value to your gplink.properties file called, “gpfdistMaxLength”.
gpsnap and gpcronsnap are new utilities for Pivotal Greenplum on AWS. These utilities leverage Amazon EBS Snapshots to take very quick snapshot backups.
Elastic Block Storage (EBS) is the preferred storage in Amazon AWS for many reasons like persistence, performance, and high availability. A great feature of EBS is the ability to take a “Snapshot”. A snapshot can be taken of a disk volume and to get a consistent backup, you need to stop the database, take a snapshot of every volume, and then restart the database.
The restore process requires stopping the database, detaching and deleting the existing volumes, creating new volumes from the snapshots, attaching the volumes to the right nodes, mounting the new volumes to the right directories, and then starting the database. If you don’t delete the existing volumes and just create new ones, you will incur more storage cost in AWS and you may hit your disk quota. So take a snapshot before you execute a restore so you can revert back if needed.
gpsnap makes this entire process easy. It does all of the tasks needed to take a consistent snapshot of the database. It also automates the tasks for restoring from a snapshot.
The commands include:
– gpsnap list: lists snapshots
– gpsnap create: creates a new snapshot
– gpsnap delete
– gpsnap delete all: deletes all snapshots
– gpsnap restore
gpcronsnap is the second utility and you can probably guess from the name that it is the scheduling portion of the utility. By default, a snapshot backup will be created weekly and retain up to four snapshots. This is all configurable in a configuration file too.
So you can see, it is now a snap for Pivotal Greenplum on AWS to take snapshot backups!
Michael Goddard, who is a colleague of mine, put together this demo of the Self-Healing feature I created for the Pivotal Greenplum on AWS Marketplace products. If any node in the cluster fails, it will automatically be replaced and correctly added back!
I’ve been using Greenplum since version 3.0 back in 2007. That version was actually the first commercially available version too. The versions were released about once every 6 to 12 months and got as high as 3.3.
Once the version got to 4.3, the number just seemed to get stuck. 4.3.0 was a pretty big release which changed “Append Only” tables to be “Append Optimized” which simply meant you could start updating and deleting tables stored in the append-only format. But more enhancements came to 4.3 but the version number never exceeded 4.3.
Major enhancements came to version 4.3.x and the innovation came at a faster pace but you may not have noticed this if you were just looking at the version number. Pivotal’s engineering team embraced the Pivotal Labs methodologies of pair programming and quick iterations. There was a huge transformation happening in this group.
When 5.0 came out, it was a big deal. Greeplum had always been a fork of PostgreSQL 8.2 but with 5.0, Greenplum was rebased to 8.3. A lot of work went into this change which also requires migrating from 4.3.x to 5.0.
Now this next change, I didn’t expect. During the 4.3.x development, the only time the version number would change to either 4.4 or 5.0, it would be when the database required a migration to upgrade. With 5.x, the version numbers are coming fast and don’t require migrations to upgrade. It just a simple binary swap to upgrade.
The result has been the release of 5.0 on 2017-09-14, 5.1 on 2017-10-20, and 5.2 on 2017-11-18. Do you see the pattern? Monthly point releases! All of these releases so far have been simple binary upgrades but have a ton of improvements each time.
Version 6 is in sight. Seriously. I would expect this to be the next rebase of Greenplum to PostgreSQL 8.4 and will require a database migration to upgrade. It is amazing how the Labs culture and open source software development has enhanced the development of Greenplum.
I haven’t posted much here lately because I’ve been working on deploying GemFire and Greenplum in the AWS Marketplace. It has been interesting work so I thought I would make a post about it here.
Amazon Web Services (AWS)
Amazon has the most popular cloud in the US with many products available which now includes Pivotal Greenplum and Pivotal GemFire. Both products are available as Bring Your Own License (BYOL) or Billed Hourly. BYOL simply means you already own a license and you wish to apply this to a deployment in AWS.
If you are not familiar with AWS, there are many ways to configure resources which can be overwhelming to learn and optimize. I have done extensive testing in these product offerings to ensure stability, security, repeatability, and performance. With just a few clicks, you can have an on-demand cluster built that rivals a full rack of on-premise hardware.
Greenplum is a Massively Parallel Processing database which means the data is distributed across multiple segments on multiple hosts. Each segment process stores a catalog of objects and it is possible to get this out of sync because of hardware failures or killing processes. For example, if you are vacuuming a table and then decide to kill the process on the Master host, the segments may not get the signal properly and rollback the transaction. This can happen when using “kill -9” rather than just “kill” to stop a process.
You may get this error message if a table is corrupt:
gpadmin=# select * from corrupt; ERROR: relation with OID 109802 does not exist (seg0 slice1 gpdbsne:40000 pid=78502)
So you decide to drop the table and start over but that doesn’t work either.
gpadmin=# drop table corrupt; ERROR: table "corrupt" does not exist (seg0 gpdbsne:40000 pid=78502)
The easy fix for this is to use “if exists” in the drop command.
gpadmin=# drop table if exists corrupt; DROP TABLE Time: 10.423 ms
Now the table is gone and you can recreate it. Next time, don’t use kill -9 and don’t kill vacuum commands.
If you are wondering how I managed to corrupt this table, here are the steps.
1. Create the table
gpadmin=# create table corrupt (id int, foo text, bar text) distributed by (id); CREATE TABLE Time: 31.195 ms
2. Insert some data
insert into corrupt select i as id, 'foo_' || i, 'bar_' || i from generate_series(1,100) as i; INSERT 0 100 Time: 46.419 ms gpadmin=# analyze corrupt; ANALYZE Time: 108.193 ms gpadmin=# \q
3. Connect to a segment directly. I’m using a single node for this test.
[gpadmin@gpdbsne ~]$ ps -ef | grep postgres | grep M gpadmin 78213 1 0 09:49 ? 00:00:00 /usr/local/greenplum-db-220.127.116.11/bin/postgres -D /data/primary/gpseg1 -p 40001 -b 3 -z 2 --silent-mode=true -i -M mirrorless -C 1 gpadmin 78214 1 0 09:49 ? 00:00:00 /usr/local/greenplum-db-18.104.22.168/bin/postgres -D /data/primary/gpseg0 -p 40000 -b 2 -z 2 --silent-mode=true -i -M mirrorless -C 0 gpadmin 78239 1 0 09:49 ? 00:00:00 /usr/local/greenplum-db-22.214.171.124/bin/postgres -D /data/master/gpseg-1 -p 5432 -b 1 -z 2 --silent-mode=true -i -M master -C -1 -x 0 -E
I see that the segments are using ports 40,000 and 40,001 while the master is using 5432. I’ll connect to seg0 on port 40,000 and then drop the table. This only drops the object in this one segment. I’ll also do a quick query to make sure I only see half of the data because I have 2 segments.
PGOPTIONS='-c gp_session_role=utility' psql -p 40000 psql (8.2.15) Type "help" for help. gpadmin=# select count(*) from corrupt; count ------- 50 (1 row) Time: 3.485 ms
4. Corrupt the database by dropping the table only on one segment.
drop table corrupt; DROP TABLE Time: 9.410 ms gpadmin=# \q
5. Confirm that the table is corrupt.
[gpadmin@gpdbsne ~]$ psql Timing is on. psql (8.2.15) Type "help" for help. gpadmin=# select * from corrupt; ERROR: relation with OID 109802 does not exist (seg0 slice1 gpdbsne:40000 pid=78502)
And again, the fix:
gpadmin=# drop table if exists corrupt; DROP TABLE Time: 10.423 ms