Persisting Data – part 4: Upsert and MERGE

note- Nov 29, 2015 – test scripts at https://github.com/bwunder/data-load-methods…

In part 3, test metrics across a range of import sizes where data is persisted to an un-indexed heap table were compared. The un-indexed heap was ideal for isolating the import  – e.g., get data into the database just far enough so it is safe – operation from the ‘normalization’ activity needed to render the data useful.

An import to an un-indexed heap may be a useful step in persisting application data, but it leaves the story unfinished. The data must ultimately be normalized – or de-normalized, ab-normalized, ‘flattened’, aggregated, pivoted, etc. as the case may be. What matters here is that the term normalization is intended to describe everything that must happen just after the import so that the data is ready for use and indexed such that the application is able to retrieve the data when needed. Obviously, the actual resource cost of ‘normalization’ will vary to some undetermined degree from application to application and installation to installation. All we can be certain of is that each application will be different. By how much is unknown until someone with adequate knowledge of the data flow and the database engine digs in to get that answer.

That variability is why it is so helpful to have the un-indexed heap metrics described in part 2 available as a baseline reference.

This series of posts hope to offer a generalized test scenario intended as a starting place for digging in. If, from this information, the thought process is stimulated to identify and characterize the patterns currently in use by an application and  to see how those pattern might generally compare with the other persistence methods and models, then the purpose will be achieved. It is a fairly complex topic. Breaking the topic into a few posts allows me to better focus on different aspects in greater detail.  This post is specific to the DML operation in SQL Server to complete the final or ‘normalization’ step.

Hopefully this is not an un-interesting detour out of the data design lab and into the database engine. The old school upsert was augmented by the OUTPUT clause of the INSERT, UPDATE and DELETE T-SQL statements in SQL Server 2005 and then – apparently – complemented by the MERGE statement in SQL Server 2008. MERGE also supports the OUTPUT clause. There remains a lot of code that uses the old school upsert. The OUTPUT clause is so effective that it is quickly coming into common usage. Applications not making use of the OUTPUT clause may be missing the set-based boat. (Applications that tramp surrogate candidate keys from the database to the application code that pushed the insert, then back to the database as foreign key references surely did miss the boat.)

Continuing to work with the test methodology already established to measure, chart and compare the “Baseline Load Times per Row to Un-Indexed Heap” from part 2, a meaningful comparison of upsert and MERGE is possible. The test results are even a bit interesting, yet far from compelling one way or the other. There should be no big surprise that which is better suited for a specific purpose will invariably depend on the details of the purpose.

Ultimately, folks that find something below of interest will want to conduct similar testing using source and target data structures representative of the local application.

The chart below depicts the general range of expected throughput for the tested data flow scenario. This is a range chart where each of the wide lines represents the likely variability of the load. Sample populations include tests with and without placing a table lock on each target table.

UpsertAndMerge_vs_Heap

From 1 to about 500 rows, the normalized row-at-a-time operation is faster than all TVP methods and even faster than the heap. The row-at-a-time MERGE appears to have a slight advantage over the row-at-a-time upsert up through 500 rows. Then at 500 rows the row-at-a-time MERGE and upsert bifurcate along the upper and lower limits of the very wide row-at-a-time heap range! Only the upsert row-at-a-time operation seems to demonstrate flat scalability across the test row range. To move data row-at-a-time, probably better to stick with upserts and expect your loads to take a long time if your batches are larger.

Conversely, both methods of normalizing from a TVP show improved throughput as the set size increases. At around 50 rows the set throughput surpasses the throughput of either row-at-a-time method. The difference between the set based MERGE and the upsert is hardly significant.  The narrower width of the MERGE range suggests it is more stable than upsert – at least in these solo process test scenarios. For sets, and all other things being equal, it is a toss-up whether MERGE or upsert is the better choice.

~~~~

To conduct these tests to compare merge and upsert, two base stored procedures, 1 used for row-at-a-time upserts and 1 used for tvp based upserts were modified to replace the UPDATE + INSERT with a MERGE statement.

Row-at-a-time upsert tests use the following pattern:

CREATE PROCEDURE dbo.pRowUpsert
    ( @RowNumber INT
    , @Host NVARCHAR(128)
    , @GMTDateTime DATETIME
    , @Name VARCHAR(50)
    , @Value INT 
    , @FloatingPoint REAL
    , @Boolean BIT
    , @Bytes VARBINARY(60)
    , @LOB NVARCHAR(MAX)
    , @LOBCheckSum INT 
    , @LoggingLevel TINYINT
    , @LoadMethod NVARCHAR(128) )
AS
BEGIN

  DECLARE @TimerDT DATETIME2;

  DECLARE @MergeParentRows INT
        , @MergeParent_ms INT
        , @InsertChildRows INT
        , @InsertChild_ms INT;

  DECLARE @UpsertMapper TABLE 
      ( Id BIGINT );

  BEGIN TRY
    
   SET @MergeParentRows = 0;
   SET @MergeParent_ms = 0; 
   SET @InsertChildRows = 0;
   SET @InsertChild_ms = 0; 
   
   IF @LoggingLevel > 1
     SET @TimerDT = SYSDATETIME();

    BEGIN TRANSACTION;

     UPDATE p
     SET [Occurrences] += 1
       , [LastUpdateDate] = CAST(SYSDATETIME() AS DATE)
       , [LastUpdateTime] = CAST(SYSDATETIME() AS TIME)     
     OUTPUT inserted.Id 
     INTO @UpsertMapper
     FROM dbo.tParent p
     WHERE p.[Name] = @Name
     AND p.[LOBCheckSum] = @LOBCheckSum
     AND p.[GMTDate] = CAST(@GMTDateTime AS DATE)
     AND p.[GMTHour] = DATEPART(hour, @GMTDateTime);

     SET @MergeParentRows = @@ROWCOUNT    

     IF @MergeParentRows = 0
     
       BEGIN
     
         INSERT dbo.tParent 
           ( [Name]
           , [GMTDate]
           , [GMTHour] 
           , [LOB]
           , [LOBCheckSum] )
         OUTPUT inserted.Id 
         INTO @UpsertMapper 
         VALUES 
           ( @Name
           , CAST(@GMTDateTime AS DATE)
           , DATEPART(hour, @GMTDateTime)
           , @LOB
           , @LOBCheckSum);
                   
       END
           
     SET @MergeParentRows = (SELECT COUNT(*) FROM @UpsertMapper);

     IF @LoggingLevel > 1
       SET @MergeParent_ms = DATEDIFF(ms, @TimerDt, SYSDATETIME());

     IF @LoggingLevel > 1
       SET @TimerDT = SYSDATETIME();

     INSERT dbo.tChild 
      ( ParentId 
      , GMTDateTime
      , Value
      , Host
      , FloatingPoint
      , Boolean
      , Bytes )
     SELECT Id
         , @GMTDateTime
         , @Value
         , @Host
         , @FloatingPoint
         , @Boolean
         , @Bytes
     FROM @UpsertMapper; 

     SET @InsertChildRows = @@ROWCOUNT;

     IF @LoggingLevel > 1
       SELECT @MergeParentRows AS [@MergeParentRows]
            , @MergeParent_ms AS [@MergeParent_ms]
            , @InsertChildRows AS [@InsertChildRows]
            , DATEDIFF(ms, @TimerDt, SYSDATETIME()) AS [@InsertChild_ms];

     COMMIT TRANSACTION;
    
     RETURN @InsertChildRows;
     
  END TRY
  
  BEGIN CATCH

     DECLARE @Number INT
           , @Severity TINYINT
           , @State TINYINT
           , @Proc NVARCHAR(128)
           , @Line INT
           , @Message NVARCHAR(MAX);   

     SELECT @Number = ERROR_NUMBER()
          , @Severity = ISNULL(ERROR_SEVERITY(), 1)
          , @State = ISNULL(ERROR_STATE(), 1)
          , @Proc = ISNULL('Procedure ' + SPACE(1) + ERROR_PROCEDURE() + SPACE(1), '')
          , @Line = ERROR_LINE()
          , @Message = N':' + CHAR(13) + CHAR(10) + ERROR_MESSAGE();

     IF XACT_STATE()  0
       ROLLBACK TRANSACTION;

     RAISERROR ('dbo.pRowUpsert: Msg %d, Level %d, State %d, %s Line %d %s'
               ,16 ,1 ,@Number, @Severity, @State, @Proc, @Line, @Message)

  END CATCH

END;

GO

In the pattern above, the Upsert brings the parent identity along whether an update or insert is required. No need to create, populate, or evaluate any local variables. Genuine old school upserts might have to use an additional SELECT to get the parent surrogate key in the UPDATE case and rely upon the SCOPE_IDENTITY() system function – or perhaps the riskier @@IDENTITY global variable – to provide the identity to the child row at INSERT. The OUTPUT clause reduces the required IOs and eliminates some in-line processing requirements. It is much cleaner. Genuine old school ADO.NET upserts – with the identity marshaled between the entities by the application – require addition inter-process communication (IPC) serialization and some usually fascinating data gymnastics in the code. It should be obvious that doing the work in one place – in this case at the database server – will always be faster than distributed transactional processing.

For comparison, below is the MERGE pattern for the row-at-a-time operation:

CREATE PROCEDURE dbo.pRowMerge
    ( @RowNumber INT                                                                                                                                                                    
    , @Host NVARCHAR(128)
    , @GMTDateTime DATETIME
    , @Name VARCHAR(50)
    , @Value INT 
    , @FloatingPoint REAL
    , @Boolean BIT
    , @Bytes VARBINARY(60)
    , @LOB NVARCHAR(MAX)
    , @LOBCheckSum INT 
    , @LoggingLevel TINYINT
    , @LoadMethod NVARCHAR(128) )
AS
BEGIN

  DECLARE @TimerDT DATETIME2;

  DECLARE @MergeParentRows INT
        , @MergeParent_ms INT    
        , @InsertChildRows INT
        , @InsertChild_ms INT;

  DECLARE @MergeMapper TABLE 
      ( Id BIGINT );

  BEGIN TRY

    BEGIN TRANSACTION;

      SET @MergeParentRows = 0;
      SET @MergeParent_ms = 0; 
      SET @InsertChildRows = 0;
      SET @InsertChild_ms = 0; 
      
      IF @LoggingLevel > 1
        SET @TimerDT = SYSDATETIME();

      MERGE dbo.tParent AS p 
      USING (SELECT @RowNumber as [RowNumber]
                  , @Name AS [Name]
                  , CAST(@GMTDateTime AS DATE) AS [GMTDate]
                  , DATEPART(hour, @GMTDateTime) AS [GMTHour]
                  , @LOB AS [LOB]
                  , @LOBCheckSum AS [LOBCheckSum]
                  , 1 AS [Occurrences]) AS t
      ON p.[Name] = t.[Name]
      AND p.[LOBCheckSum] = t.[LOBCheckSum]
      AND p.[GMTDate] = t.[GMTDate]
      AND p.[GMTHour] = t.[GMTHour]
      WHEN MATCHED THEN
          UPDATE SET [Occurrences] += t.[Occurrences]
                   , [LastUpdateDate] = CAST(SYSDATETIME() AS DATE)
                   , [LastUpdateTime] = CAST(SYSDATETIME() AS TIME)     
      WHEN NOT MATCHED THEN
          INSERT ( [GMTDate] 
                 , [GMTHour] 
                 , [Name]
                 , [LOBCheckSum]
                 , [LOB] )
           VALUES
                 ( t.[GMTDate]
                 , t.[GMTHour]
                 , t.[Name]
                 , t.[LOBCheckSum]
                 , t.[LOB])
      OUTPUT Inserted.Id INTO @MergeMapper;

      SET @MergeParentRows = @@ROWCOUNT;

      IF @LoggingLevel > 1
        SET @MergeParent_ms = DATEDIFF(ms, @TimerDt, SYSDATETIME());

      IF @LoggingLevel > 1
        SET @TimerDT = SYSDATETIME();

      INSERT dbo.tChild
       ( ParentId 
       , GMTDateTime
       , Value
       , Host
       , FloatingPoint
       , Boolean
       , Bytes )
      SELECT Id
          , @GMTDateTime
          , @Value
          , @Host
          , @FloatingPoint
          , @Boolean
          , @Bytes
      FROM @MergeMapper; 

      SET @InsertChildRows = @@ROWCOUNT;

      IF @LoggingLevel > 1
        SELECT @MergeParentRows AS [@MergeParentRows]
             , @MergeParent_ms AS [@MergeParent_ms]
             , @InsertChildRows AS [@InsertChildRows]
             , DATEDIFF(ms, @TimerDt, SYSDATETIME()) AS [@InsertChild_ms];
    
    COMMIT TRANSACTION;
    
    RETURN @InsertChildRows;

  END TRY
  
  BEGIN CATCH

     DECLARE @Number INT
           , @Severity TINYINT
           , @State TINYINT
           , @Proc NVARCHAR(128)
           , @Line INT
           , @Message NVARCHAR(MAX);   

     SELECT @Number = ERROR_NUMBER()
          , @Severity = ISNULL(ERROR_SEVERITY(), 1)
          , @State = ISNULL(ERROR_STATE(), 1)
          , @Proc = ISNULL('db object' + SPACE(1) + ERROR_PROCEDURE() + SPACE(1), '')
          , @Line = ERROR_LINE()
          , @Message = N':' + CHAR(13) + CHAR(10) + ERROR_MESSAGE();

     -- don't mess with anyone else's xacts
     IF XACT_STATE()  0
       ROLLBACK TRANSACTION;

     RAISERROR ('dbo.pRowMerge Msg %d, Level %d, State %d, %s Line %d %s'
               ,16, 1, @Number, @Severity, @State, @Proc, @Line, @Message)

  END CATCH

END;

GO

One interesting difference is that the MERGE requires one less write operation to the @MergeMapper table variable. Like OUTPUT, I would have to agree that MERGE is cleaner. One might therefore expect that MERGE will require less IO than the UPDATE+INSERT pattern. If anything, test results indicate the opposite.

Table-valued parameter based upserts are easily adapted to the row patterns. All that is absolutely necessary is to change the references to parameters/local variables to  reference the passed TVP.

Here is the TVP upsert pattern:

CREATE PROCEDURE dbo.pTVPUpsert
    ( @TVP TVPTableType READONLY 
    , @LoggingLevel TINYINT
    , @LoadMethod NVARCHAR(128) )
AS

BEGIN

  DECLARE @TimerDT DATETIME2;

  DECLARE @MergeParentRows INT
        , @MergeParent_ms INT    
        , @InsertChildRows INT
        , @InsertChild_ms INT;

  CREATE TABLE #UpsertMapper 
    ( Id BIGINT PRIMARY KEY
    , Name VARCHAR(50)
    , LOBCheckSum INT
    , GMTDate DATE
    , GMTHour INT
    , UNIQUE (Name, LOBCheckSum, GMTDate, GMTHour));

  CREATE TABLE #TempTable
    ( Name VARCHAR(50)
    , LOBCheckSum NVARCHAR(128)
    , GMTDate DATE
    , GMTHour INT 
    , Occurrences INT 
    , RowNumber INT
    , PRIMARY KEY (Name, LOBCheckSum, GMTDate, GMTHour)) 
       
  -- deduplicate before upsert
  INSERT #TempTable 
    ( Name
    , LOBCheckSum
    , GMTDate
    , GMTHour
    , Occurrences 
    , RowNumber)
  SELECT Name       
       , LOBCheckSum
       , CAST([GMTDateTime] AS DATE)
       , DATEPART(hour, [GMTDateTime])
       , COUNT(*)
       , MIN(RowNumber) 
  FROM @TVP
  GROUP BY [Name]
       , [LOBCheckSum]
       , CAST([GMTDateTime] AS DATE)
       , DATEPART(hour, [GMTDateTime]);
  
  IF @LoggingLevel > 1
    SET @TimerDT = SYSDATETIME();

  BEGIN TRY
 
    BEGIN TRANSACTION;

      UPDATE p 
      SET [Occurrences] += t.[Occurrences]
        , [LastUpdateDate] = CAST(SYSDATETIME() AS DATE)
        , [LastUpdateTime] = CAST(SYSDATETIME() AS TIME)     
      OUTPUT inserted.ID
           , inserted.Name
           , inserted.LOBCheckSum
           , inserted.GMTDate
           , inserted.GMTHour INTO #UpsertMapper
      FROM dbo.tParent AS p
      JOIN #TempTable AS t
      ON p.Name = t.Name
      AND p.LOBCheckSum = t.LOBCheckSum
      AND p.GMTDate = t.GMTDate
      AND p.GMTHour = t.GMTHour;

      SET @MergeParentRows = @@ROWCOUNT;

      INSERT dbo.tParent
        ( [Name]
        , [LOBCheckSum]
        , [GMTDate]
        , [GMTHour]
        , [LOB]
        , [Occurrences])
      OUTPUT inserted.ID
           , inserted.Name
           , inserted.LOBCheckSum
           , inserted.GMTDate
           , inserted.GMTHour INTO #UpsertMapper
      SELECT tmp.[Name]
           , tmp.[LOBCheckSum]
           , tmp.[GMTDate]
           , tmp.[GMTHour]
           , tvp.[LOB] AS [LOB] 
           , tmp.[Occurrences]  
      FROM #TempTable AS tmp
      LEFT JOIN dbo.tParent AS p
      ON p.Name = tmp.Name
      AND p.LOBCheckSum = tmp.LOBCheckSum
      AND p.GMTDate = tmp.GMTDate
      AND p.GMTHour = tmp.GMTHour
      JOIN @TVP AS tvp 
      ON tvp.RowNUmber = tmp.RowNumber
      WHERE p.[Id] IS NULL;

      SET @MergeParentRows += @@ROWCOUNT;

      IF @LoggingLevel > 1
        SET @MergeParent_ms = DATEDIFF(ms, @TimerDt, SYSDATETIME());

      IF @LoggingLevel > 1
        SET @TimerDT = SYSDATETIME(); 

      INSERT dbo.tChild 
       ( ParentId 
       , GMTDateTime
       , Value
       , Host
       , FloatingPoint
       , Boolean
       , Bytes )
      SELECT 
         map.Id
       , tvp.GMTDateTime
       , tvp.Value
       , tvp.Host
       , tvp.FloatingPoint
       , tvp.Boolean
       , tvp.Bytes
      FROM #UpsertMapper AS map
      JOIN @TVP AS tvp 
      ON map.[Name] = tvp.[Name]
      AND map.[LOBCheckSum] = tvp.[LOBCheckSum]
      AND map.[GMTDate] = CAST(tvp.GMTDateTime AS DATE)
      AND map.[GMTHour] = DATEPART(hour, tvp.GMTDateTime);

      SET @InsertChildRows = @@ROWCOUNT;
       
      IF @LoggingLevel > 1
        SELECT @MergeParentRows AS [@MergeParentRows]
             , @MergeParent_ms AS [@MergeParent_ms]
             , @InsertChildRows AS [@InsertChildRows]
             , DATEDIFF(ms, @TimerDt, SYSDATETIME()) AS [@InsertChild_ms];

    COMMIT TRANSACTION;
    
    RETURN @InsertChildRows;

  END TRY
  
  BEGIN CATCH

     DECLARE @Number INT
           , @Severity TINYINT
           , @State TINYINT
           , @Proc NVARCHAR(128)
           , @Line INT
           , @Message NVARCHAR(MAX);   

     SELECT @Number = ERROR_NUMBER()
          , @Severity = ISNULL(ERROR_SEVERITY(), 1)
          , @State = ISNULL(ERROR_STATE(), 1)
          , @Proc = ISNULL('db object' + SPACE(1) + ERROR_PROCEDURE() + SPACE(1), '')
          , @Line = ERROR_LINE()
          , @Message = N':' + CHAR(13) + CHAR(10) + ERROR_MESSAGE();

     -- don't mess with anyone else's xacts
     IF XACT_STATE()  0
       ROLLBACK TRANSACTION;

     RAISERROR ('dbo.pTVPUpsert - Msg %d, Level %d, State %d, %s Line %d %s'
               ,16 ,1 ,@Number, @Severity, @State, @Proc, @Line, @Message);

  END CATCH

END;

GO

And the TVP MERGE pattern is:

CREATE PROCEDURE dbo.pTVPMerge
    ( @TVP TVPTableType READONLY 
    , @LoggingLevel TINYINT
    , @LoadMethod NVARCHAR(128) )
AS

BEGIN

  DECLARE @TimerDT DATETIME2;

  DECLARE @MergeParentRows INT
        , @MergeParent_ms INT    
        , @InsertChildRows INT
        , @InsertChild_ms INT;

  CREATE TABLE #MergeMapper 
    ( Id BIGINT PRIMARY KEY
    , Name VARCHAR(50)
    , LOBCheckSum INT
    , GMTDate DATE
    , GMTHour INT
    , UNIQUE (Name, LOBCheckSum, GMTDate, GMTHour));

  CREATE TABLE #TempTable
   ( Name VARCHAR(50)
   , LOBCheckSum NVARCHAR(128)
   , GMTDate DATE
   , GMTHour INT 
   , Occurrences INT 
   , RowNumber INT
   , PRIMARY KEY ([Name],[LOBCheckSum],[GMTDate],[GMTHour])) 

  IF @LoggingLevel > 1
    SET @TimerDT = SYSDATETIME();
       
  INSERT #TempTable 
    ( Name
    , LOBCheckSum
    , GMTDate
    , GMTHour
    , Occurrences 
    , RowNumber)
  SELECT Name
       , LOBCheckSum
       , CAST(GMTDateTime AS DATE)
       , DATEPART(hour, GMTDateTime)
       , COUNT(*)
       , MIN(RowNumber) 
  FROM @TVP
  GROUP BY Name
       , LOBCheckSum
       , CAST(GMTDateTime AS DATE)
       , DATEPART(hour, GMTDateTime);

  BEGIN TRY
 
    BEGIN TRANSACTION;

      MERGE dbo.tParent AS p
      USING (SELECT tmp.[Name]
                  , tmp.[GMTDate]
                  , tmp.[GMTHour]
                  , tvp.[LOB]
                  , tmp.[LOBCheckSum]
                  , tmp.[Occurrences]
             FROM #TempTable AS tmp
             JOIN @TVP AS tvp 
             ON tmp.RowNumber = tvp.RowNumber) AS derived
      ON p.[Name] = derived.[Name]
      AND p.[LOBCheckSum] = derived.[LOBCheckSum]
      AND p.[GMTDate] = derived.[GMTDate]
      AND p.[GMTHour] = derived.[GMTHour]
      WHEN MATCHED THEN
          UPDATE SET [Occurrences] += derived.[Occurrences]
                   , [LastUpdateDate] = CAST(SYSDATETIME() AS DATE)
                   , [LastUpdateTime] = CAST(SYSDATETIME() AS TIME)     
      WHEN NOT MATCHED THEN
          INSERT ( [GMTDate] 
                 , [GMTHour] 
                 , [Name]
                 , [LOBCheckSum]
                 , [LOB] )
           VALUES
                 ( derived.[GMTDate]
                 , derived.[GMTHour]
                 , derived.[Name]
                 , derived.[LOBCheckSum]
                 , derived.[LOB] )
      OUTPUT Inserted.Id
           , Inserted.Name
           , Inserted.LOBCheckSum
           , Inserted.GMTDate
           , Inserted.GMTHour INTO #MergeMapper;

      SET @MergeParentRows = @@ROWCOUNT;

      IF @LoggingLevel > 1
        SET @MergeParent_ms = DATEDIFF(ms, @TimerDt, SYSDATETIME());

      IF @LoggingLevel > 1
        SET @TimerDT = SYSDATETIME();

      INSERT dbo.tChild
        ( ParentId 
        , GMTDateTime
        , Value
        , Host
        , FloatingPoint
        , Boolean
        , Bytes )
      SELECT 
          map.Id
        , tvp.GMTDateTime
        , tvp.Value
        , tvp.Host
        , tvp.FloatingPoint
        , tvp.Boolean
        , tvp.Bytes
      FROM #MergeMapper map
      JOIN @TVP tvp
      ON map.[Name] = tvp.[Name]
      AND map.[LOBCheckSum] = tvp.[LOBCheckSum]
      AND map.[GMTDate] = CAST(tvp.GMTDateTime AS DATE)
      AND map.[GMTHour] = DATEPART(hour, tvp.GMTDateTime); 

      SET @InsertChildRows = @@ROWCOUNT;

      IF @LoggingLevel > 1
        SELECT @MergeParentRows AS [@MergeParentRows]
             , @MergeParent_ms AS [@MergeParent_ms]
             , @InsertChildRows AS [@InsertChildRows]
             , DATEDIFF(ms, @TimerDt, SYSDATETIME()) AS [@InsertChild_ms];

     COMMIT TRANSACTION;

     RETURN @InsertChildRows;


  END TRY
  
  BEGIN CATCH

     DECLARE @Number INT
           , @Severity TINYINT
           , @State TINYINT
           , @Proc NVARCHAR(128)
           , @Line INT
           , @Message NVARCHAR(MAX);   

     SELECT @Number = ERROR_NUMBER()
          , @Severity = ISNULL(ERROR_SEVERITY(), 1)
          , @State = ISNULL(ERROR_STATE(), 1)
          , @Proc = ISNULL('db object' + SPACE(1) + ERROR_PROCEDURE() + SPACE(1), '')
          , @Line = ERROR_LINE()
          , @Message = N':' + CHAR(13) + CHAR(10) + ERROR_MESSAGE();

     IF XACT_STATE() < 0
       ROLLBACK TRANSACTION;

     RAISERROR ('dbo.pTVPMerge - Msg %d, Level %d, State %d, %s Line %d %s'
               ,16, 1, @Number, @Severity, @State, @Proc, @Line, @Message)

  END CATCH

END;

GO

The first chart below shows row-at-a-time loads that transform the source data record into an update else insert to the parent followed by an insert to the child. Whether the parent is inserted or updated, the identity, along with the natural key columns from the incoming data is made available to the child insert without the need to join to the parent during the child insert. Tests are standardized to load identical batches of rows into the same target hierarchy. The Parent identity column is included in all child rows, whether key Batch sizes are fixed at 1, 5, 10, 50, 100, 500, 1000, 5000, 10000, 50000 and 100,000 rows. Each batch is wrapped in a transaction. Several steps were taken to assure meaningful test results. Multiple executions per batch size within each test run were used. Multiple test runs were executed for both load methods.

MERGE_n_UPDATE_raw

Looking first at the row-at-a-time series (i.e., RowMerge, RowUpdate) note these observations:

On the left side of the chart – where the batches are smaller, it is clear that the row-at-a-time operations can provide more throughput capability than set based operations at low rows per batch. The advantage decreases as the batch size increases.  Then, at the point where neither the upsert or the MERGE show an advantage – just beyond 1000 rows per batch, the row-at-a-time MERGE begins to degrade exponentially while the upsert’s scale-ability remains relatively flat up through the largest batches tested.

As demonstrated in part 2 and 3, if the transaction wrapping each batch of the MERGE  is removed a slight degradation is noticeable for small batches. Larger batches do as well as the upsert.

Bear in mind that the insert operation is an upsert or MERGE of a parent table followed by an insert to a child table. The transaction will not automatically serialize access to the table. Only when a table lock is explicitly placed on the table of enough data is changed that row or page locks escalate to HOBT locks would concurrency be degraded due to serialization. With that said, using an explicit table lock will deliver the best performance if there is no concurrency concern. The scenarios where multiple concurrent loads are required will be examine in a future post.

 

Advertisements
This entry was posted in Data Loading, Testing. Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s