Merge Statements

MERGE is a feature in most databases, but I have never liked it much. My source table always had duplicates in it, which prevented me from using it.

Here is an example of the problem with Microsoft SQL Server (BTW, Oracle behaves the same way):

CREATE TABLE stage
(
    id int IDENTITY NOT NULL,
    provider_id int NOT NULL,
    provider varchar(20) NOT NULL
);

insert into stage (provider_id, provider) values (1, 'bubba');
insert into stage (provider_id, provider) values (2, 'gump');
insert into stage (provider_id, provider) values (3, 'foo');
insert into stage (provider_id, provider) values (4, 'bar');
insert into stage (provider_id, provider) values (4, 'jon');

CREATE TABLE target
(
    provider_id int NOT NULL PRIMARY KEY,
    provider varchar(20) NOT NULL
);

My stage table is where new records get loaded from a file, and my target table is what I’m trying to keep up to date. So using a MERGE statement, I should be able to insert the new records and, if they already exist, update them.

merge target
using (select * from stage) as stage
on (target.provider_id = stage.provider_id)
when matched then
    update set provider = stage.provider
when not matched then
    insert (provider_id, provider)
    values (stage.provider_id, stage.provider);

That is the syntax for MERGE in SQL Server, and when executed, I get this error message:

Msg 2627, Level 14, State 1, Line 1

Violation of PRIMARY KEY constraint 'PK__target__00E213107132C993'. Cannot insert duplicate key in object 'dbo.target'.

The statement has been terminated.

Why? Well, my stage table has two records with the same provider_id. When this happens, I don’t want an error; instead, I want the last value in the file to be used. Unfortunately, MERGE doesn’t handle this. Instead, you have to write code.
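For illustration, here is a sketch of the kind of extra code that’s needed in SQL Server: deduplicate the source inside the USING clause so only the last row per provider_id survives (row_number() is my choice here; any analytic that picks the highest id works):

-- sketch: keep only the newest row (highest id) per provider_id
merge target
using (
    select provider_id, provider
    from (
        select s.*,
               row_number() over (partition by s.provider_id
                                  order by s.id desc) as rn
        from stage s
    ) as dedup
    where rn = 1
) as stage
on (target.provider_id = stage.provider_id)
when matched then
    update set provider = stage.provider
when not matched then
    insert (provider_id, provider)
    values (stage.provider_id, stage.provider);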

In Greenplum, here is how you do a “smart” MERGE that handles real-world data where the stage table has duplicates:

CREATE TABLE stage
(
    id serial NOT NULL,
    provider_id int NOT NULL,
    provider text NOT NULL
)
DISTRIBUTED BY (provider_id);

insert into stage (provider_id, provider) values (1, 'bubba');
insert into stage (provider_id, provider) values (2, 'gump');
insert into stage (provider_id, provider) values (3, 'foo');
insert into stage (provider_id, provider) values (4, 'bar');
insert into stage (provider_id, provider) values (4, 'jon');

CREATE TABLE target
(
    provider_id int NOT NULL PRIMARY KEY,
    provider text NOT NULL
)
DISTRIBUTED BY (provider_id);

Now create the function:

create or replace function fn_merge_target() returns void as
$$
declare
    v_function text := 'fn_merge_target';
    v_location int;

begin
    v_location := 1000;
    --delete target rows whose value changed in stage; they get
    --re-inserted below with the latest stage version
    delete from target t
    using stage s
    where t.provider_id = s.provider_id
    and t.provider <> s.provider;

    v_location := 2000;
    --insert rows not present in target, keeping only the last
    --version (highest id) of each provider_id from stage
    insert into target (provider_id, provider)
    select sub.provider_id, sub.provider
    from (select s.*,
                 rank() over (partition by s.provider_id
                              order by s.id desc) as rank
          from stage s
          left join target t on s.provider_id = t.provider_id
          where t.provider_id is null) as sub
    where sub.rank = 1;

exception
    when others then
        raise exception '(%:%:%)', v_function, v_location, sqlerrm;
end;
$$
language plpgsql;

Execute the function:

select fn_merge_target();

And now query target:

select * from target order by provider_id;
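Given the five inserts above, the output should look like this:

 provider_id | provider
-------------+----------
           1 | bubba
           2 | gump
           3 | foo
           4 | jon
(4 rows)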

I have 4 rows, and for provider_id 4, I have the last version from the stage table rather than a failed command.

In summary, MERGE sounds nice, but I’ve never been able to use it. My source or stage table always has duplicates, and instead of failing, I want to apply the last version of each primary key record. I demonstrated an easy-to-use way to provide a much smarter MERGE that is optimized for Greenplum and uses one of the built-in analytical functions, rank().

4 thoughts on “Merge Statements”

  1. Tamas

    Actually, your solution scans and joins the tables several times. Imagine that you are merging 1 million records into a 1-billion-row destination, but only a few of them have actually changed. In this case your approach will be very, very slow.

    Anyway, there are better ways to do merge statements…

    1. Jon (Post author)

      You are correct that this is not a solution for every scenario. However, I would take the same approach when merging 1 million records into a 1-billion-row table, but I would also leverage partitioning to eliminate a significant amount of I/O on the target table.
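      As a sketch of that idea (hypothetical: it assumes target is range-partitioned on provider_id, and whether partitions are actually eliminated depends on the Greenplum version and plan), the delete can be bounded by the key range present in stage:

      -- hypothetical sketch: bound the join by stage's key range so the
      -- planner can skip target partitions outside that range
      delete from target t
      using stage s,
            (select min(provider_id) as min_id,
                    max(provider_id) as max_id
             from stage) r
      where t.provider_id = s.provider_id
      and t.provider <> s.provider
      and t.provider_id between r.min_id and r.max_id;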

  2. Waldar

    “Why? Well, my stage table has two records with the same provider_id. When this happens, I don’t want an error; instead, I want the last value in the file to be used. Unfortunately, MERGE doesn’t handle this. Instead, you have to write code.”

    This is untrue, you can simply use the same analytic functions to dedupe the source, for example in Oracle:

    merge into target
    using (select *
           from (select s.*,
                        rank() over (partition by s.provider_id
                                     order by s.id desc) as rk
                 from stage s)
           where rk = 1) stage
    on (target.provider_id = stage.provider_id)
    when matched then
        update set target.provider = stage.provider
    when not matched then
        insert (provider_id, provider)
        values (stage.provider_id, stage.provider);

    The main downfall of MERGE is that you don’t know how many rows were updated or deleted; you are just told that xxx rows were merged.
    Workarounds exist but aren’t trivial.
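    For instance, in SQL Server one such workaround (a sketch; it assumes the stage table has already been deduplicated, as in the earlier example) is to capture $action with an OUTPUT clause and count by action:

    declare @rowlog table (merge_action nvarchar(10));

    -- assumes stage has been deduplicated first
    merge target
    using (select * from stage) as stage
    on (target.provider_id = stage.provider_id)
    when matched then
        update set provider = stage.provider
    when not matched then
        insert (provider_id, provider)
        values (stage.provider_id, stage.provider)
    output $action into @rowlog;

    -- per-action row counts for the merge (INSERT vs UPDATE)
    select merge_action, count(*) as rows_affected
    from @rowlog
    group by merge_action;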

    1. Jon (Post author)

      My statement is true. You must write code, like you illustrated, to handle something I think should be built into MERGE.

