A recap of the previous posts in this series:
Part 1 is a technology review of some of the more common .NET application layer persistence techniques. One key take-away is that many currently trending object relational mappers (ORMs) can only pull sets from the database. They do not push sets to the database. This is particularly disappointing when using a Microsoft SQL Server database engine behind Microsoft‘s ORM technologies where the row-at-a-time option cannot even be overridden with a set based options. ORM database persistence is always done one row at a time. The most elegant and efficient way to get rows from .NET into SQL Server 2008 – the table-valued parameter (TVP) – is not supported by .NET ORM libraries. Language Integrated Query (LINQ) and Entity Framework (EF) 4.0 do not even know what a TVP is though ADO.NET is all over the TVP. A DataSet or IDataReader is a TVP. It’s that simple, but it will not work in a mapper’s DataContext.
Part 2 attempts to establish a test protocol to accurately compare the effects of selected database configuration options on a persistence operation. In this test transactional batch wrappers and data compression are enabled/disable one at a time between loads to an un-indexed heap. The throughput, stated as millisecond per row (ms/row), of a long running sequence of persistence operations is established in this way. The protocol is intended to approximate the server side persistence behaviors of typical ADO.NET persistence operations like the DataAdapter.Update(), DataReader() or LINQ/EF’s DataContext.SubmitChanges() methods.
Targeting an un-indexed heap – the logical path of least resistance – when moving rows into the SQL Server allows isolation of the overhead to move rows into the database and the overhead to process the data into the target hierarchy.
Part 2 testing measures only the import overhead. The data must be further processed before data loaded to an un-indexed heap is usable. Part 2 results indicate that row-at-a-time transfers takes about the same amount of time per row when sent to database in batches of 5 to 100,000 rows, regardless if the batch size. As expected, transactions employing table locks and row compression provides improved persistence throughput over implicit transactions or page compression respectively.
Part 3 confirms that SQL Server is able to receive sets as TVPs into un-indexed heaps with remarkably less resouce intensity and therefore better throughput than the same data when sent as rows or batches of rows as is done by .NET ORM persistence methods. Perhaps even a little better throughput than the ETL bulk loading practice of using sets imported from flat files – perhaps because the latter method does not isolate the cost to create the flat file from the load operation.
Part 3 results confirm that above about 100 rows per transfer the TVP shows an exponential increase in throughput advantage over batches of the same number of rows. Furthermore, between 100 rows per transfer and 5000 rows per transfer, the table-value parameter moves rows into the database incrementally faster than BCP.exe. The TVP appears to be at least as capable as conventional ETL bulk loading methods. That trend appears to hold to some unknown level beyond the largest tested row size of 100,000.
Part 4 moves the focus from moving data into SQL Server to normalizing that data into a hierarchy of tables after import to import optimized staging tables. Testing sets out to determine if the common upsert technique (UPDATE else INSERT) or the new SQL Server 2008 MERGE statement is better when adding data to an existing set of tables. The test case is specifically aimed at measuring the persistence of 1 to 100,000 data rows into a two level relational hierarchy where the parent’s database generated key is propagated to the child. The data is the same as is used in parts 2 & 3.
Part 4 results show there is little throughput difference between the upsert technique and the MERGE statement in the test scenario. The upsert appears ever so slightly better. Because of this finding, most subsequent testing will standardized on the upsert to avoid the uninteresting doubling of data in some already information rich test cases to follow. At some point, once any bottlenecks are understood, coming back to the MERGE to see if things look better may be worthwhile.
Part 5 compares the throughput of three styles of database persistence methods:
- direct application-to-database row-at-a-time persistence methods
- direct application-to-database TVP persistence methods
- flat file originated ETL bulk loading methods
The third (iii) style carries an additional physical IO cost before the data reaches SQL Server to create the file. This cost is isolated in the test results. In the other two styles, data is moved from application buffers to SQL Server buffers. The traditional ETL flat file transfer carries a Physical IO cost to put the data into the file and then more cost as it bulk loads the file data into a SQL Server staging heap before normalizing the data into the target table hierarchy. The file load operation is not considered or measured here, only the movement of the data from the file to the database.
Part 5 shows an exponentially better throughput profile for direct TVP based approaches than row-at-a-time processing as the number of rows in the persistence operation grows above 100 rows. The TVP also shows incrementally better throughput than out of process bulk load (BCP.exe) across the range of transfer row sizes tested. T-SQL BULK INSERT TVP methods generates a curve surprisingly similar to the row-at-a-time method. Test results provide compelling support to the thesis that TVPs are the way that all external data should get into SQL Server.
Part 6 introduces SQL Server Integration Services (SSIS) as a test client to evaluate the characteristics of batched row and set based persistence operations from a client application perspective. We can expect any client, including SSIS to add inter-process communications overhead. Since the SSIS service used for testing is running locally on the same server as the SQL Server instance, there is an elevated level of local system resource contention and a short ride on the network that may not be representative of most applications.
An interesting part 6 finding is that below 100 rows, the connection overhead is insignificant because the overhead of the inter-process connection is greater. It appears that the region where row-at-a-time processing is better – small row sizes -is buried with external costs that leave row-at-a-time and TVP on [sub]par. Both are insignificant compared to the connection overhead.
Part 7 measures throughput for multiple concurrent data load processes. A controlled number of the same long running persistence process used in previous test are executed in parallel.
Part 7 test executions suffer a high rate of deadlocks. This is true regardless if batches of rows or sets of rows are sent to the database. The problem is rooted in the default SQL Server lock escalation behavior at the target tables. Clearly, testing that does not evaluate representative concurrent processing requirements is inadequate.
And Part 8 shows a method for identifying and resolving resource contention between contending persistence operations. After retooling the locking strategy at the database to avoid deadlocks in this scenario – with no need to touch the application side code base – part 7 testing is repeated.
Although there was no deadlocking after the post part 7 adjustments, some blocking under high concurrency is observed. It appears that matching the escalation locks to the row versioning isolation has an advantage. Row locks with row versioning isolation is the way to go in the test scenario.
Interestingly, row-at-a-time processing shows very good stability – even if relatively slow. It doesn’t seem to matter how many rows you add one at a time, each row takes about the same amount of time. And I doesn’t seem to matter how many concurrent processes are adding rows one row at a time, each row takes about the same amount of time. Not speedy, but linear and reliable. The throughput for 1000 rows sent one row at a time is 1000 times the throughput for 1 row. I want to call this linear scalability when the economy of scale = 0, but that may be my misunderstanding of the term linear scalability.
In contrast to the tortoise like predictability of the row at a time processing, the table-valued parameter loads have enough spikes that the hare must have a punk hair-do. Even TVPs are no panacea for contention. Testing shows that the throughput when sending sets to the database does increases exponentially with the number of rows sent and at the same time decreases incrementally across the board as the number of concurrent data loading processes increases. The exact point that user experience or service level becomes unacceptably degraded by contention is generally defined by business rule.
To be sure degradation during testing can be heavily influenced by the platform hardware and (mis)configuration. I mention this because inferring requirements based on the actual numbers produced on my test platform would be silly. The trends should hold, the comparison should be accurate enough to get you started. More precise information would have to come from tests of the intended load pattern using representative hardware. And diligence on your part to pursue and understand all test anomalies is necessary.
Test results convince me that when rows per operation are small (less than 50 ?), using a batched row-at-a-time persistence method is likely capable of better database throughput than a table variable or a flat file based load method. Conversely, at any larger rows per persistence operations the table variable and flat file methods out-class the row-at-a-time methods. Coincidentally, the TVP appears to be somewhat better than BCP up through 100,000 rows.
The test results are evidence that TVPs are well worth additional consideration. TVPs hold much promise to deliver better throughput than the out-of-the-box row-at-a-time persistence processing that comes with trending application development software. This in spite of the risk that concurrent incoming TVP’s can experience increased blocking latentcies.
The possibility presents itself to switch the load method based on the payload row size. This could involve coding a hybrid data collector into the application. The likelihood that such additional complexity or the resulting coupling is a good idea is dubious. Better to think long and hard about this one than to attempt to impose a metadata state machine on the application. The tight coupling could impeded throughput much like propagating a database generated surrogate key from parent to child at insert when done at the application.
Application scenarios where the number of rows sent is highly variable may benefit from a simple and elegant option that minimizes coupling between the application and the data store. An option that actually reduces the coupling and allows the client to send data a row-at-a-time, in a TVP or even as a file. An option that can eliminate the lock contention of parallel loads by only bringing the data into an un-indexed heap in parallel and then moving all data sent from n number of application processes during a user chosen interval into the target hierarchy. An option that benefits from SQL Server 2008‘s improvements yet brings reliability and robust capabilities to SQL Server 2000/2005. In fact, I previously wrote about this option about 10 years ago when SQL Server 2000 was the current version.
AB Tables exploits SQL Server’s ability to quickly change the definition of a view. The concept if quite simple. Tables A and B are used for staging data only. Only the persistence processing uses these tables. A and B are identical except for the character A or B in the name. They may even have different indexing strategies. Persistence processing uses a view that is based only on table A to write data sent by the application and a view based only on table B to read data from the staging table as it is processed into the table(s) ultimately used by the application.
In the state shown above, the Writer’s View is used by application to receive new data into Table A. The Reader’s View is configured to process a set of data into the target table or hierarchy from Table B but is empty and not in use as the processing begins.
In a continuous process with periodicity defined by a target minimum latency for data availability from the target hierarchy, these steps are repeated:
- Reader’s View is dropped
- Table B is prepared to begin receiving rows
- The Writer’s View is ALTERed to reference Table B instead of Table A
- Table A is prepared to begin providing rows
- Table A data is sent to the target hierarchy
After one rollover the Writer’s View is now adding rows to Table B and the Reader’s View is now reading rows from Table A.
The SQL Server 2008 ALTER VIEW operation will be speedy and easy too implement. Only an instant of serialization is required. It is necessary to kick everyone off of the view before the alteration can be applied and in OLTP/Web data scenarios taking the database to single-user to alter a view is overly disruptive.
The tunability – if there is such a word – of ABTables is expanded by capabilities borrowed from other strategies. The Writer’s View can be used with any and all transfer styles. Data values, rows, TVPs and flat files can all share the same set of ABTables. Indexing can be easily adjusted to optimize the second or upsert phases with no need to impose a penalty on sending application processing. File count and File Group placement can be adjusted as desired.
ABTable testing is based upon what has been learned thus far with test data about inserting to an un-indexed heap from parts 2 & 3 the upsert processing from part 4 creates an ABTable scenario that can be compared to the ‘direct’ method tests from parts 5 thru 8. Row sizes per transfer is limited to 1-5000 rows per transfer range. This is the range where previous tests indicate an advantage when moving data in to the database. (Larger transfers are problematic on my test environment. My laptop has serious problems trying to work with 5, or even 3 concurrent processes trying to load 100,000 rows in one set. In every likelihood, reference database server hardware would be able to work with a higher rows per transfer if that is interesting or important to the application.
Increased concurrency in this test’s context refers only to the number of cloned streams adding rows to the heap in parallel. Reader contention is not considered. The effect for the upsert process is that there will be incrementally more rows to process in the single set based insert each time the A & B tables. This one process can easily become a bottleneck as the number of active clones is increased. For testing we can control the number of rows in the first step – moving data into the database – to align with the set rows per batch levels of previous tests. The insert into the target will have to take what ever is transferred since the last execution. An attempt will be made to align the test results in a way to allow comparison with previous test results discussed in this series. Also note that in the test scenario, where the target heap is switched once a minute, the post-staging latency – the time before the the data is available once it is loaded to the SQL Server – can be up to one execution of the second step longer than the time that the sending application is tied up with the insert to the heap.
Latency would need to be measured from the instant the data originated to be meaningful. Application accumulators and queuing, delays on the wire, moving data into SQL Server and merging/upserting into an existing hierarchy are all likely to introduce latency. To best understand latency the end-to-end configuration and communication stacks that will be used in the live environment should also be used in the test environment. That is to say, if the application will rely upon JDBC, latency is best measured using a JDBC connection; or if application data will originate from a temperature sensor on the factory floor sent via oData services, latency measurements would be meaningful only to the extent that the complete data bridge is adequately simulated and observed during testing, and so on. As has been the practice in previous tests in the series, this test will only measure or consider for measurements those latencies internal to SQL Server. Application-like configuration and test design should be used to the extent practical to extend the test across the inter-process communication stack(s).
Given that we really cannot see or determine actual latency without intimate knowledge of the application(s) the database supports, the AB Table results are none-the-less interesting. Deadlocking was never observed. Queries are nicely behaved through the range of 1 thru 10 concurrent processes.
Looking only at the concurrency level of 10, the chart indicates a great reduction in contention.
AB Tables are a viable option in high concurrency data loading scenarios where moderate latentcies related to row accumulation are acceptable.