Over the years I have come across a lot of MERGE statements, and most of the time I find one fundamental flaw in how they are written.

In the update section, in most cases there is no check for whether anything has actually changed between the source and the target, so every matched record gets updated every single time. If that check were there, only the records that have changed would be updated, which makes a big difference to the performance of the query.

Here is an example of a merge that checks whether anything has changed between source and target before updating the record:

MERGE dbo.DimEmployee TRG
USING Staging.Employee SRC ON TRG.EmployeeID = SRC.EmployeeID
-- Only update when at least one of the tracked columns has actually changed
WHEN MATCHED AND
(
       ISNULL(SRC.FirstName, 'Unknown')  != ISNULL(TRG.FirstName, 'Unknown')
    OR ISNULL(SRC.LastName, 'Unknown')   != ISNULL(TRG.LastName, 'Unknown')
    OR ISNULL(SRC.MiddleName, 'Unknown') != ISNULL(TRG.MiddleName, 'Unknown')
    OR ISNULL(SRC.NameStyle, 'Unknown')  != ISNULL(TRG.NameStyle, 'Unknown')
    OR ISNULL(SRC.Title, 'Unknown')      != ISNULL(TRG.Title, 'Unknown')
)
THEN UPDATE SET
     TRG.FirstName  = SRC.FirstName
    ,TRG.LastName   = SRC.LastName
    ,TRG.MiddleName = SRC.MiddleName
    ,TRG.NameStyle  = SRC.NameStyle
    ,TRG.Title      = SRC.Title

WHEN NOT MATCHED THEN
INSERT
(
     EmployeeID
    ,FirstName
    ,LastName
    ,MiddleName
    ,NameStyle
    ,Title
)
VALUES
(
     SRC.EmployeeID
    ,SRC.FirstName
    ,SRC.LastName
    ,SRC.MiddleName
    ,SRC.NameStyle
    ,SRC.Title
);

Hope this helps you improve the performance of your MERGE statements.

11 thoughts on “Merge Statements”

  1. Your test for changes can be written more simply and readably (IMHO) as

    WHEN MATCHED AND EXISTS (
    SELECT SRC.FirstName, SRC.LastName, SRC.MiddleName, SRC.NameStyle, SRC.Title
    EXCEPT
    SELECT TRG.FirstName, TRG.LastName, TRG.MiddleName, TRG.NameStyle, TRG.Title
    )

    • Yes, it can be written this way as well. With regard to performance, I have found that the way I have written it is quicker on larger data sets.

    • Hi, my site runs on WordPress and comments go through approval before appearing. I do not always check on weekends to approve comments. I have approved it now.

  2. I do something similar:

    WHEN MATCHED AND EXISTS (
    SELECT SRC.FirstName, SRC.LastName
    EXCEPT
    SELECT TRG.FirstName, TRG.LastName
    ) THEN UPDATE …

  3. Hi,

    Nice point. I absolutely agree. I have automated this in a self-developed ETL tool. However, my technique doesn't use ISNULL; instead I write out the whole condition for performance and SARGability reasons.

    Here’s an example:
    ISNULL(SRC.FirstName, 'Unknown') != ISNULL(TRG.FirstName, 'Unknown')
    turns into
    ( SRC.FirstName != TRG.FirstName
    OR (SRC.FirstName IS NULL AND TRG.FirstName IS NOT NULL)
    OR (SRC.FirstName IS NOT NULL AND TRG.FirstName IS NULL)
    )

    However, I have thought about optimizing all this by using a checksum for simplicity and perhaps performance (needs to be tested). Take a look at https://stackoverflow.com/questions/11186183/hash-a-sql-row for example.

    Martin

  4. I agree with this extra check, but I have seen this approach break down on data warehouse tables – checking every column in a wide table stopped performance dead (my colleague and I once waited three hours before calling our test “failed miserably”).
    In these cases, and generally, a HASHBYTES() derived column helps. A persisted computed column calculated as varbinary(20) using the SHA1 algorithm (SQL Server 2008 R2 and earlier) or varbinary(32) using the SHA2_256 algorithm (SQL Server 2012 and later) allows:
    * fast comparison (and one line of code)
    * a quick binary comparison for SQL Server (computers “think” in binary)
    * calculation of the hash key “on the fly” on temporary tables

    One big performance test of CHECKSUM can be found at http://blog.kejser.org/exploring-hash-functions-in-sql-server/.
    CHECKSUM might work for unimportant data sets, but it produces collisions (https://social.technet.microsoft.com/Forums/lync/en-US/684c8903-0c17-40b7-a592-e95b65d8d3b3/checksum-vs-hashbytes-issue?forum=sqldatawarehousing). A collision on SHA1 has been detected in the wild, but the chance is something close to one in an octillion.
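
    For example, a minimal sketch of that persisted hash column, using the table and column names from the post. The column list and the '|' delimiter are illustrative, it assumes SQL Server 2012 or later for SHA2_256, and note that CONCAT treats NULL as an empty string, so a NULL and an empty string hash the same:

    -- Persisted hash of the compared columns, added to both target and staging
    ALTER TABLE dbo.DimEmployee ADD RowHash AS
        HASHBYTES('SHA2_256',
            CONCAT(FirstName, '|', LastName, '|', MiddleName, '|', NameStyle, '|', Title)) PERSISTED;

    ALTER TABLE Staging.Employee ADD RowHash AS
        HASHBYTES('SHA2_256',
            CONCAT(FirstName, '|', LastName, '|', MiddleName, '|', NameStyle, '|', Title)) PERSISTED;

    The change test in the MERGE then collapses to a single binary comparison:

    WHEN MATCHED AND SRC.RowHash != TRG.RowHash THEN UPDATE …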

  5. @Martin: I think you can ignore SARGability in this case. I doubt that anyone has an index on all of the checked/modified columns, so it always has to read the clustered index (ideally it should “join” in the USING clause on the clustered index column(s)).

  6. Very true! I learned this the hard way when merging into a target that was involved in a sync between two systems. The number of updates was very large, and this put unnecessary strain on our poorly written sync process.

    Good post!
