Greenplum Database 4.3

Pivotal recently released this new version without any major announcements but customers should definitely check it out! There are three main enhancements that I will cover in this blog post.

1. Append Optimized Tables
2. WAL Standby Replication
3. Orca Optimizer

Append-Optimized Tables
Greenplum database has had “Append-Only” tables for several versions. It only allowed for inserting new data to a table which has a savings of about 20 bytes per row because it doesn’t need the overhead of per-row visibility. Append-Only tables allows you to use compression or change the orientation to column so it is used for larger tables in the database.

With Greenplum Database 4.3, Append-Only tables are now Append-Optimized. You still get the benefits of an Append-Only table but now you will be able to UPDATE and DELETE from these tables as well.

Here is an example with a compressed and column oriented table that is Append-Optimized:

gpdb=# CREATE TABLE bar (a int, b text)
gpdb-# WITH (appendonly=true, orientation=column, compresstype=quicklz)
gpdb-# DISTRIBUTED BY (a);
CREATE TABLE
--Notice how it is still using "appendonly" in the syntax

INSERT some data and SELECT from this new table.

gpdb=# INSERT INTO bar VALUES (1, 'jon');
INSERT 0 1
gpdb=# INSERT INTO bar VALUES (2, 'roberts');
INSERT 0 1
gpdb=# SELECT * FROM bar;
 a |    b    
---+---------
 2 | roberts
 1 | jon
(2 rows)

UPDATE example:

gpdb=# UPDATE bar SET b = UPPER(b);
UPDATE 2
gpdb=# SELECT * FROM bar;
 a |    b    
---+---------
 1 | JON
 2 | ROBERTS
(2 rows)

DELETE example:

gpdb=# DELETE FROM bar WHERE a = 1;
DELETE 1
gpdb=# SELECT * FROM bar;
 a |    b    
---+---------
 2 | ROBERTS
(1 row)

This looks just like any other table but the difference is in the row versioning. Instead of using MVCC (which was borrowed from PostgreSQL), Append-Optimized tables have a Visibility Map which is a bitmap of what rows are no longer visible due to a DELETE or UPDATE statement on the table.

Because it isn’t using MVCC anymore, serializable transactions with UPDATE and DELETE to an Append-Optimized table are not allowed. A serializable transaction is a strict transaction isolation in which transactions execute as if they run one after another rather than concurrently. Serializable is almost never used in a big data platform so it isn’t a big deal.

Another change is how VACUUM works. Instead of marking stale rows as deleted so that row can be overwritten (MVCC), VACUUM compacts the physical files and resets the visibility map.

When should use use a Heap table (default) or an Append-Optimized table? If you have any OLTP work in Greenplum, Heap is better. If you have frequent UPDATE or DELETE statements, heap is likely better because VACUUM should run faster. Query performance may also be a bit better for smaller tables that are Heap. So for larger tables that have no or infrequent UPDATE or DELETE statements, use Append-Optimized. If you want to compress the table or make it column oriented, you must make it Append-Optimized.

WAL Standby Replication
Greenplum database has a Master server where users connect and execute queries. The Master server manages sessions, security, metadata, and query plans. Queries are executed on the Segment servers in the Greenplum cluster. To provide high availability, the Master server has a Standby-Master server. Prior to 4.3, Greenplum used block level replication to keep the data on the Master in sync with Standby-Master.

With 4.3, the WAL (Write-Ahead Log) is now streamed to the Standby-Master that is in a standby mode. This provides faster promotion of the Standby-Master in the event where the Master fails. It also has an improvement in performance for the Master server.

Orca
Engineering has refactored the query optimizer to make it easier to maintain and enhance. They also took this opportunity to enhance this optimizer to handle certain edge cases where the optimizer picked a poor plan. It does a better job at handling certain scenarios with partitioned tables, subqueries, and common table expressions.

Orca should not be used in Production in release 4.3. To use it, you need to be part of the Early Access Program (EAP). It is a very simple command to enable it but we want customers to provide feedback before you make Orca the default optimizer. Contact your Pivotal Field Engineer to get into this program and if you don’t know who this is, contact me and I’ll help you.

Summary
Greenplum database 4.3 is a great release with many new features. Customers can download it now from emc.subscribenet.com and documentation from support.emc.com.

12 thoughts on “Greenplum Database 4.3

  1. Siva Janamanchi

    Hi Jon,

    Happy New Year to you !

    Wanted to know the impact of this on outsourcer, if any : (from the GP 4300 Readme doc)
    “Inserting data into a partitioned table can only be done at parent partitioned table created with the CREATE TABLE command.
    When creating a partitioned table, Greenplum Database creates additional tables to manage the partitioning of data in a partitioned table. Using the INSERT command to insert data into tables created by Greenplum Database for use by a partitioned table is not allowed.”

    thanks and regards
    Siva

    Reply
    1. Jon Post author

      I’ve been talking with Engineering about this new limitation. As of now, this doesn’t directly impact Outsourcer but if you make the Target table in Outsourcer a partition, then it will fail with GPDB 4.3. To get around this, you would need to use a Transform job that uses a static External Table with the INSERT to the parent table rather than the partition.

      Reply
    1. Jon Post author

      Incremental backups are actually more powerful now. An incremental backup requires using an Append Only table but now AO tables are really Append Optimized. This means you can UPDATE and DELETE these tables and the incremental backup will only store the changes. Great question!

      Reply
      1. Joe

        Looking at the 4.3 system administrator guide, it outlines under which specific conditions a table will be “flagged” for an incremental backup:

        The difference between an incremental backup and a full backup is that append-optimized tables are backed up only if one of the following operations was performed on the table after the last backup:
        ALTER TABLE
        INSERT
        TRUNCATE
        DROP and then re-create the table

        I’m a little confused as to whether this is an exhaustive list, or if the incremental backup will indeed track updates and deletes as well.

        Reply
        1. Jon Post author

          In the Greenplum Database 4.3 Utilities Guide, the commands are:
          ALTER TABLE
          INSERT
          DELETE
          UPDATE
          TRUNCATE
          DROP and then re-create the table

          In 4.3, Append-Only became Append-Optimized so it supports UPDATE and DELETE commands. If you have an AO table, then an incremental backup will only backup new or changed data which includes UPDATE and DELETE. So yes, it will track those commands as well.

          If you look at the 4.2 documentation, it won’t have UPDATE and DELETE included for AO tables because you couldn’t UPDATE or DELETE an AO table.

          Reply
  2. betty

    Hi Jon,
    how to enable orca optimizer? we’ve installed 4.3 and want to test the performance improvement of the new optimizer. previously we’ve got a few queries with wrong execution plan. hopefully this new version could resolve it.

    Reply
    1. Jon Post author

      You will need to contact your Pivotal Account Executive to get into the Early Adoption Program in order to use the Orca optimizer in Greenplum Database 4.3.

      Reply
  3. John

    Hi,

    When you say incremental backup in GP.

    1: I took full backup of database
    2: took incremental backup — this only backup of Append-Only /Append-Optimized tables.
    Is incremental backup copy — change heap tables and other objects in DB (changed views, Changed functions) like other databases incremental OR not?

    Reply
    1. Jon Post author

      Yes, an incremental backup will backup heap tables, functions, views, etc so if there were any changes, the changes will be in the incremental backup.

      Reply
  4. Suman

    I tried to run incremental backup on specific schemas , but it fails and takes incremental backup on the last full backup of the DB but not the increment on a specific schema mentioned in the gpcrondump. If GP engineering team can add this feature, will be really great!
    Also, it doesn’t have feature added yet to restore only a specific schema, not just the tables from the entire DB backup set.

    Reply
    1. Jon Post author

      The incremental feature is for full backups only because when you run the restore, it relies on the previous full backup to provide a complete set.

      I do agree that having a -s on gpdbrestore would be a great feature to add.

      Reply

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.