Persisting Data – part 5: Concurrency and Locking

Note (Nov 29, 2015): the test scripts are finally on GitHub. See https://github.com/bwunder/data-load-methods for a working version of all TSQL shown in this post…
The table lock is important for balancing the application’s concurrency and throughput requirements during a persistence operation. We can expect the incoming data load to have better throughput characteristics when the target table is locked. We can expect less contention for other processes when only smaller-granularity locks are taken: every other active process that touches the table is blocked while our process holds the table lock. It is an important decision, and it is worthwhile to consider the locking and other configuration options that can affect throughput and concurrency.

The range charts in previous posts in this series depict ranges that combine samples with explicit table locks on each targeted table and samples where SQL Server was left to manage locks during the data change operation. The un-indexed heap charts show that an explicit table-level lock (e.g., when there is no or minimal concurrent load) makes almost no difference when the batch or set size is below 500 rows. As the row counts grow to 1,000 and above, the load rate (expressed as ms/row) is increasingly better with the table lock in place. Not only is lock management overhead dramatically reduced, but with ‘minimal logging’ the throughput the operation can sustain before disk saturation is greatly increased: as is evident when the pesky WRITELOG wait type dominates.
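As an aside, a quick way to see whether WRITELOG dominates on a test system is the wait statistics DMV. A minimal sketch (the excluded wait types are just a few common idle waits, not an exhaustive filter):

-- Top waits since the last restart (or since the stats were cleared).
SELECT TOP (5)
       wait_type,
       waiting_tasks_count,
       wait_time_ms,
       signal_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN (N'SLEEP_TASK', N'LAZYWRITER_SLEEP', N'SQLTRACE_BUFFER_FLUSH')
ORDER BY wait_time_ms DESC;
-- If WRITELOG tops this list during the load, the transaction log device is the bottleneck.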

[Chart: HeapLoadTimes]

The chart below extends those findings to the parent/child hierarchy scenario previously described.

[Chart: TargetLoadTimes]

To be sure, concurrency is considerably more configurable than simply deciding whether or not to place a lock on the target table(s). Many configuration options in addition to the table lock can affect the balance of throughput and concurrency. Generally speaking, set-based operations have an inherent advantage over row-at-a-time operations within SQL Server owing to the massive data buffers, indexes, filegroups, read-ahead, compression, lock escalation, IO packet sizes, and so on. Beyond the generalization, it is definitely worthwhile to spend a bit of time understanding and thinking about the choices during construction.
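To make the generalization concrete, here is a sketch of the two approaches (table and column names are hypothetical):

-- Row-at-a-time: 10,000 statements, 10,000 lock acquisitions and log flushes.
DECLARE @id int = 1;
WHILE @id <= 10000
BEGIN
    INSERT dbo.Target (Id, Payload) VALUES (@id, REPLICATE('x', 100));
    SET @id += 1;
END;

-- Set-based: one statement over the whole set; the engine can bring
-- read-ahead, lock escalation and bulk optimizations to bear.
INSERT dbo.Target (Id, Payload)
SELECT n, REPLICATE('x', 100)
FROM dbo.Numbers           -- a hypothetical numbers/tally table
WHERE n <= 10000;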

Below are some of my thoughts about locking (by no means a complete list).

Forcing a particular locking behavior on all processing in the interest of helping a persistence operation is rarely desirable. Better to do so only for those queries that see a net benefit, and to let SQL Server optimize the queries where the forced behavior has no benefit. To this end there are quite a number of hints that allow concurrency and throughput to be tuned at the statement level. It could take volumes to describe them all adequately. Furthermore, application requirements ultimately define the appropriate locking granularity: the right locking strategy can vary widely from one application to another.

One good reason to NOT lock the table is concurrency. Exclusive row and page locks, especially when the rows/pages have just been added, do not necessarily prevent other users from reading or even writing other rows on other pages of the table at the same time. Conversely, the table lock requires that all other transactional processes wait until the exclusive lock is released before making any query or modification anywhere in the table. The less the application must endure this serialization effect the better.
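The difference is easy to observe in the lock DMV while a load is running. A sketch, assuming the loading session’s spid is 53:

SELECT resource_type, request_mode, request_status, COUNT(*) AS lock_count
FROM sys.dm_tran_locks
WHERE request_session_id = 53          -- the loading session
GROUP BY resource_type, request_mode, request_status;
-- Expect many KEY/PAGE grants for a row-level load versus a single
-- exclusive OBJECT grant when the table lock is held.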

Short of using a table lock, there are a few SQL Server configuration and application design options available to further optimize locking behavior. One example is the transaction itself. A transaction is a prerequisite for the table lock, as was shown in part 2, and is not always adequately used in legacy applications. In those charts it is apparent that wrapping the bulk data load operation in a single transaction yields an incremental benefit up to a fairly high threshold of batch/set size. It is also apparent that larger batch/set sizes can increase memory pressure and subsequently queue at the disk.
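The pattern is simple enough. A minimal sketch (object names are placeholders, not the actual test harness from the GitHub scripts):

BEGIN TRANSACTION;

-- One transaction around the whole set; the TABLOCK hint takes the table
-- lock up front and, with a simple or bulk-logged recovery model, enables
-- minimal logging for the insert into the heap.
INSERT dbo.Target WITH (TABLOCK) (Id, Payload)
SELECT Id, Payload
FROM dbo.Staging;

COMMIT TRANSACTION;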

One common and effective old-school ‘trick’ is the use of the NOLOCK – or the synonymous READUNCOMMITTED – query hint on SELECT queries. Only do this when dirty reads are orl korrekt (OK). I am tempted to contend that dirty reads are always problematic; the consequences can manifest in the most unexpected places, and applications that allow dirty reads implicitly allow corruption. The trouble with the purist approach is that applications can become immediately usable simply by allowing dirty reads in a few select spots. And I have seen more than one web application go to the extreme of imposing NOLOCK hints on all tables in all queries: mindlessness is not an effective optimization technique. Still, it is difficult to discount something as easy and effective as lock hints when implemented thoughtfully and with adequate knowledge of application design and behavior. There are many other hints that might affect the persistence operation in a good way under the right circumstances.
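For reference, the hint itself is trivial to apply (a contrived example; use it only where a dirty read truly is acceptable):

-- The count will not block behind the loader's exclusive locks, but it can
-- include uncommitted rows or miss rows moved by a concurrent page split.
SELECT COUNT(*) AS approximate_rows
FROM dbo.Target WITH (NOLOCK);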

Below are the available hints. Any of them may affect – perhaps assist, perhaps impair – the performance, throughput and concurrency characteristics of a persistence operation. The SELECT statement is intended only to show where each of the three hint types is applied. Then the options for each hint type are listed in a carefully executed cut-n-paste from Books Online. Notice that table hints can be used as query hints, as the example after the listing shows.

SELECT <column name[, column name[, ...]]>
FROM <table or view> AS alias1 WITH(<table_hint>)
INNER <join_hint> JOIN <table or view> AS alias2 WITH(<table_hint>)
  ON alias1.<column name> = alias2.<column name> [AND ...]
OPTION(<query_hint>)
<table_hint> ::= [ NOEXPAND ] {
    INDEX ( index_value [ ,...n ] ) | INDEX = ( index_value )
  | FASTFIRSTROW | FORCESEEK | HOLDLOCK | NOLOCK | NOWAIT | PAGLOCK
  | READCOMMITTED | READCOMMITTEDLOCK | READPAST | READUNCOMMITTED
  | REPEATABLEREAD | ROWLOCK | SERIALIZABLE | TABLOCK | TABLOCKX
  | UPDLOCK | XLOCK }
<query_hint> ::= { HASH | ORDER } GROUP | { CONCAT | HASH | MERGE } UNION
  | { LOOP | MERGE | HASH } JOIN | FAST number_rows | FORCE ORDER
  | MAXDOP number_of_processors
  | OPTIMIZE FOR ( @variable { UNKNOWN | = literal_constant } [ , ...n ] )
  | OPTIMIZE FOR UNKNOWN
  | PARAMETERIZATION { SIMPLE | FORCED } | RECOMPILE | ROBUST PLAN | KEEP PLAN
  | KEEPFIXED PLAN | EXPAND VIEWS | MAXRECURSION number | USE PLAN N'xml_plan'
  | TABLE HINT ( exposed_object_name [ , <table_hint> [ [ , ]...n ] ] )
<join_hint> ::= { LOOP | HASH | MERGE | REMOTE }
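For example, the TABLE HINT query hint carries a table hint in through the OPTION clause, which is useful with plan guides where the statement text cannot be edited (names invented for illustration):

SELECT t.Id, t.Payload
FROM dbo.Target AS t
OPTION (TABLE HINT (t, READUNCOMMITTED));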
The Books Online “Query Hints” topic provides a good rundown of each option. The “Hints” topic also trumpets this good advice:

Caution:

Because the SQL Server query optimizer typically selects the best execution plan for a query, we recommend that <join_hint>, <query_hint>, and <table_hint> be used only as a last resort by experienced developers and database administrators.

It is very nearly impossible to provide any rule-of-thumb recommendations about which hints are worth pursuing and which are a waste of time without adequate knowledge of the application. Using hints is something of a claim that the query author knows how to interact with the data better than the database engine. Once in a while that is even true.

Most of the time, if persistence can be done in batches or as a set, the table lock should be used when the work happens in a batch window, and only lower-level page and/or row locks should be used when concurrency requirements are high at all times.

For the range charts shown in the Persisting Data posts, test results with and without the table lock are aggregated without regard to the lock state. There is considerable overlap in the performance range of the two options in a single-threaded context. In general, the low (fastest) end of the range comes from tests with a table lock and the high (slowest) end of each range is from a test that does not explicitly place the table lock. That is not always true, just a generalized statement: YMMV. The intent is to provide a design aid that is useful without knowing for certain whether a table lock is needed. The next step, of course, would be to compare the result of this test with results while the target tables are under a representative concurrent load. (more on that later…)

Configuration changes with a scope other than the current query may be as appropriate as hinting. Compression and limited lock escalation control can be configured only at the table level.  Locking and fill factor can be manipulated at the index level.  Transaction Isolation Level – as a more sweeping alternative to isolation manipulation at the <table_hint> level – is set at the database or transaction level. Only at the database level is the choice between resource locks and row versioning available. We can go all the way down to the hardware and identify configuration options that affect the balance between throughput and concurrency.
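A few sketches of those scopes, using invented object names:

-- Table scope: never escalate to a table lock on this table.
ALTER TABLE dbo.Target SET (LOCK_ESCALATION = DISABLE);

-- Index scope: disallow row locks so the finest grain is the page.
ALTER INDEX PK_Target ON dbo.Target SET (ALLOW_ROW_LOCKS = OFF);

-- Database scope: trade read locks for row versioning.
ALTER DATABASE MyAppDb SET READ_COMMITTED_SNAPSHOT ON;

-- Transaction scope: a broader alternative to per-statement isolation hints
-- (requires ALLOW_SNAPSHOT_ISOLATION ON at the database level).
SET TRANSACTION ISOLATION LEVEL SNAPSHOT;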

Ideally, the application design will be such that hints, non-standard table configuration, and arcane or unusual settings are not required. When special configuration is required, hint-based optimizations are more desirable for persistence operations than changes that can affect all access to an important table or the entire database, or changes that are not easily detected by the uninitiated. When the optimizations are in the query, any undesirable side effects are limited in scope to that query and the visibility to the maintenance programmer is good.
