Archiving Data

A customer recently had a pretty common question about archiving old data in Greenplum.  They wanted to purge old data in their OLTP system so that it remains small and fast.  Historical data that isn’t changing, stays in Greenplum.

An easy way to do this is with a view and two tables.  The view would UNION ALL the data between the two tables.  Unfortunately, Greenplum treats UNION ALL similarly to other databases which executes each query separated by the UNION ALL serially.  The optimal way to handle this is to partition the data.

So let’s say the customer has a schema called edw and a table called auths.

create schema edw;

CREATE TABLE edw.auths
(id int not null,
fname varchar,
lname varchar,
episodedate timestamp)
DISTRIBUTED BY (id);

Insert some sample data.

INSERT INTO edw.auths VALUES (1, 'foo', 'bar', '2001-05-01'::timestamp);
INSERT INTO edw.auths VALUES (2, 'foo', 'bar', '2012-12-01'::timestamp);
INSERT INTO edw.auths VALUES (3, 'too', 'cool', '2010-01-01'::timestamp);
INSERT INTO edw.auths VALUES (4, 'too', 'cool', '2010-01-02'::timestamp);
INSERT INTO edw.auths VALUES (5, 'old', 'data', '2004-01-02'::timestamp);

Next, rename this schema.

ALTER SCHEMA edw RENAME TO edw_archive;

Create the new edw schema.

CREATE SCHEMA edw;

In this case, anything older than 2011-01-01 will be archived and anything newer will be in the current OLTP system.

CREATE TABLE edw.auths
(LIKE edw_archive.auths)
PARTITION BY RANGE (episodedate)
(START ('2000-01-01'::timestamp) INCLUSIVE END ('2010-01-01'::timestamp) EXCLUSIVE WITH (appendonly=true, orientation=column, compresstype=zlib, compresslevel=5),
DEFAULT PARTITION auths_current);

I’m using a pretty high level of compression and column orientation for the archive data.  This will help performance and use less disk space for the archived data.  The default partition has the current data in it and it remains a row store that isn’t compressed.   This is a good example of Polymorphic storage where partitions within the same table are stored in different ways.

Insert the old data into the new table.

INSERT INTO edw.auths SELECT * FROM edw_archive.auths;

Just to show that everything works.

SELECT * FROM edw.auths;

With Outsourcer, you will want to change the job to load edw.auths to insert data into the default partition and don’t touch the archive partition.  To do this, get the names of the underlying inherited tables.

SELECT * FROM pg_partitions WHERE schemaname = 'edw' AND tablename = 'auths';

You can see the two tables independently.

--new data

SELECT * FROM edw.auths_1_prt_auths_current;

--archive data

SELECT * FROM edw.auths_1_prt_2;

To change Outsourcer, first remove the old job for edw.auths and insert the new one with the target of edw.auths_1_prt_auths_current.

DELETE FROM os.job WHERE (target).schema_name = 'edw' AND (target).table_name = 'auths';

INSERT INTO os.job(
refresh_type, target, source)
VALUES ('refresh',
('edw', 'auths_1_prt_auths_current'),
('sqlserver', 'jonnywin', null, null, 'edw', 'dbo', 'auths', 'sa', 'sa'));

Now when you execute this job, it will truncate the current partition, insert the data from the OLTP system, and then analyze the partition.  The archive partition will be left untouched.  Querying data across the archived and current data will execute much faster than a UNION ALL because it runs in parallel and it is compressed and column oriented.

2 thoughts on “Archiving Data

  1. Saschia

    Nice article!
    Maybe you can answer an extra question: if I change the schema for an existing partitioned table, I see the table self on the new schema but all partitions are still on the old schema. Do you know how to prevent this or how to change schema for partitions?
    Thanks in advance.

    Reply
  2. Jon Post author

    Changing a table’s schema is a rather rare task but as you pointed out, the partitions for the table aren’t changed when the table’s schema is changed.

    Example: I have a table named public.fact and has three partitions.

    select partitionschemaname, partitiontablename 
    from pg_partitions 
    where tablename = 'fact';
    

    It returns three rows:

    'public'	'fact_1_prt_1'
    'public'	'fact_1_prt_2'
    'public'	'fact_1_prt_3'
    

    I also have an edw schema that I want to move the table to so I alter the base table.

    alter table fact set schema edw;
    

    The partitions are still in the public schema so to move these, I need to alter the partition tables too.

    alter table fact_1_prt_1 set schema edw;
    alter table fact_1_prt_2 set schema edw;
    alter table fact_1_prt_3 set schema edw;
    

    Now this query shows the partitions have moved to the new schema.

    select partitionschemaname, partitiontablename 
    from pg_partitions 
    where tablename = 'fact';
    

    It returns these three rows:

    'edw'	'fact_1_prt_1'
    'edw'	'fact_1_prt_2'
    'edw'	'fact_1_prt_3'
    
    Reply

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.