Persisting Data – part 1: The “impedance mismatch”

Note – Nov 29, 2015: finally got a copy of the test scripts on GitHub. See https://github.com/bwunder/data-load-methods for a working version of all TSQL shown in this post…
This post is an introduction to a few posts to follow that will describe a testing methodology for designing optimized database synchronization protocols. Synchronization, as used here, is the operation of persisting accumulated data changes at the application layers to the database. The methodology is intended to be equally effective whether building a new application or resolving application bottlenecks that arise as data is persisted to an application’s database. Here we merely set the table (pun intended) for the test result posts that will follow.

Care has been taken in testing not to target a particular programming language or application type. Care has also been taken to measure only the relevant database operation at the database server. Inter-process communication, network IO and the work done to prep the data are not measured. Nevertheless, the implementation details of these externalities will affect your results. It’s more that my externalities are not valid for your application. There really is no way I can help you there unless you want to give me a call or send me an email.

Shops currently developing with Object-Relational Mappers (ORMs), ADO.NET, WCF, OData or Java will often find the test results quite provocative. Those shops frequently have no option but the row-at-a-time synchronization built in to the mapping technology. The results shown in these posts may provide enough evidence to encourage some to take another look.

~~~~

I had originally understood the object-relational impedance mismatch to be a productivity issue. The intra-day task of retooling the thought process to work with JOINs and SQL predicates after hours of work with state, behaviors and callbacks is always awkward for me. From another perspective, in my role as a DBA I have helped many excellent programmers and developers put together a fairly straightforward query after they had struggled with the problem and could never quite get the answer they needed. That even the best developers struggle with the mental shift required to move between application development and database queries has been true for much longer than any ORM has been in wide use.

However, with the wide adoption of ORMs, the “impedance mismatch” has become more of a technology brick wall. A war of words seethes unabated between OOP developers and RDBMS DBAs, each side lobbing the occasional incendiary comment over the wall. This progress-stopping tit-for-tat is as unacceptable and ineffective in the software shop or data center as it is in the legislature. Much development energy today is aimed at entity-modeled solutions wired to object-relational mappers. Good DBAs and developers must find ways to best solve the problem.

Entity Framework (EF), LINQ, the Java Persistence API (JPA) and Hibernate/NHibernate are significantly more developer friendly than what came before. Developers feel more productive a higher percentage of the time. Mostly this is good. However, for all but the simplest applications someone must continue to design, test and develop in the abyss behind the object-relational mapper: first to define the ORM, then to maintain and perhaps optimize the mapper-to-database plumbing. This is where the impedance mismatch may continue to thrive. Here the costly mental shift from relational algebra to object-oriented programming remains a given.

To my way of thinking, object mappers would do well to persist through to an object store, or better yet a document store like MongoDB.

It could be that the role of DBA slowly morphs into a role whose responsibilities are to take care of all things under and behind the mapper. The skills required must include the ability to test, tune and debug queries and resolve database performance issues. The DBA already has a head start there. Data administrators in agile environments might – one day real soon, if it happens – become much more intimate with the local entity model(s) as well.

The data driven application can persist a working set of data from memory to the database almost as fast as the data changes. In ADO.NET applications this synchronization to persisted storage might typically invoke DataAdapter.Update(), or – when LINQ or the Entity Framework is used – issue DataContext.SubmitChanges(), or may even implement the IBCPSession or IRowsetFastLoad OLE DB interface or the SqlBulkCopy ADO.NET class when a DataWriter does not fill the bill. The Java Persistence API is also in wide use, though more often with database products other than SQL Server. The options for persisting data are many. Making the right choices is difficult without a broad comparative basis.

Data driven applications tend initially to rely exclusively upon a single ORM, API or interface to access the database. This protects any advantage in simplicity or performance afforded by the chosen data access method. When the chosen data access model does not satisfy SLA or performance requirements, a suitable secondary protocol may be used – for example, to resolve synchronization issues apart from the built-in mapper or dataset mechanism. The ADO.NET Command object, the Java Statement object and whatever variety of Execute() methods are on hand – a quick MSDN check shows the DataContext object has 7, including a generic, and ADO.NET has 5, including TableDirect – are the likely suspects. Other APIs are in use to varying degrees.

Application layer data caching services such as the open source memcached, the ASP.NET Application Cache, the MongoDB JSON document store or – in some cases – a custom data access layer (DAL) can – and do – add complexity to synchronization activities. Likewise, application layer optimizations such as Lucene indexes or a content management system can also constrain synchronization processing, though to a lesser degree.

As any application becomes successful, resource utilization becomes more competitive. Data changes increase in number and/or size, producing bottleneck symptoms that are obvious from even basic usage metrics (e.g., % CPU, Avg. Disk Queue Length, Pages/sec, and Batch Requests/sec). For better or worse, it is easier to tell you have a bottleneck than it is to know which bottleneck you have.
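For the SQL Server-side counter, a quick look at the counter DMV is often enough to see the surge in batch activity. A minimal sketch, assuming only that the instance exposes the standard SQL Statistics counters (the Windows counters above still come from Perfmon or a logman collection):

-- Batch Requests/sec is cumulative since service start: sample it twice and diff.
SELECT object_name,
       counter_name,
       cntr_value
FROM sys.dm_os_performance_counters
WHERE counter_name = N'Batch Requests/sec';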

To get to the bottom of a bottleneck, a methodical analysis is necessary. The “waits & queues” methodology has emerged as something of a de facto standard for finding bottlenecks. The visibility into waits made possible with Dynamic Management Views (DMVs), and SQL Profiler‘s ability to correlate a trace with Windows performance data beginning with SQL Server 2005, are powerful aids for a waits and queues analysis. I’ll leave the reader to research and obtain “waits & queues” resources; it is definitely a good idea for anyone with DBA responsibilities to master this methodology. If you google for “waits and queues” the impedance mismatch is evident in the hit list: not much mention of entities or mappers.
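To give a flavor of the waits side of the analysis, here is a minimal sketch of the kind of query a waits & queues investigation usually starts with. The excluded wait types below are only a small illustrative sample of the benign system waits most write-ups suggest filtering out:

-- Top waits by total wait time since the last service restart (or stats clear).
SELECT TOP (10)
       wait_type,
       waiting_tasks_count,
       wait_time_ms,
       signal_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN (N'SLEEP_TASK', N'LAZYWRITER_SLEEP',
                        N'BROKER_TASK_STOP', N'SQLTRACE_BUFFER_FLUSH')
ORDER BY wait_time_ms DESC;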

Even when the bottleneck is well understood, the most elegant resolution to a synchronization-induced IO bottleneck may not be obvious. The technical evolution – and increasingly the philosophical direction – of the agile development environment and the relational storage engine are in conflict. Due in no small part to the resulting polarization between ORM-oriented developers and nervous DBAs, synchronization bottlenecks in entity-modeled applications become interleaved with application logic over time. Not only is this a sad commentary on how we work together, it can leave a lot of application code to unravel when refactoring the synchronization to persistent storage in a way that provides relief from an IO bottleneck.

Increasingly, application development paradigms are gravitating toward row-at-a-time synchronization processing models. Even Microsoft‘s flagship ORM does not provide table-valued parameter (TVP) functionality in the mappers as of the .NET Framework 4.0. Batching options available through DataAdapter.UpdateBatchSize, or careful control of when DataContext.SubmitChanges() is called combined with transactional wrappers, can help performance to a degree. Everything at the database still happens one row at a time.

This is unfathomable considering that SQL Server 2008 smokes when the application feeds it table-valued parameters (TVPs). Conversely, people have actually made a name for themselves in the SQL community by detailing the by now well-known drawbacks of row-at-a-time processing. Is this another aspect of the “impedance mismatch” or just a case of EF technology diverging from SQL Server technology? TVPs and EF are both recent innovations from the same vendor intended to improve the landscape, yet they don’t work together. I am not sure what to make of that! My suspicious mind has me wondering to what extent the impedance mismatch supports the vendor’s bottom line…
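For readers who have not used them, here is a minimal sketch of the TVP pattern, assuming SQL Server 2008 or later. The table, type, procedure and parameter names are illustrative only; the ‘TVPMerge’/‘TVP’ names deliberately echo the Entity Designer message quoted later in this post:

-- Illustrative target table and table type; the application sends the whole set at once.
CREATE TABLE dbo.Target
    ( Id      INT            NOT NULL PRIMARY KEY
    , Payload NVARCHAR(4000) NULL );
GO
CREATE TYPE dbo.RowBatch AS TABLE
    ( Id      INT            NOT NULL PRIMARY KEY
    , Payload NVARCHAR(4000) NULL );
GO
CREATE PROCEDURE dbo.TVPMerge
    @TVP dbo.RowBatch READONLY     -- the whole set arrives in a single round trip
AS
BEGIN
    SET NOCOUNT ON;
    MERGE dbo.Target AS t
    USING @TVP AS s
        ON t.Id = s.Id
    WHEN MATCHED THEN
        UPDATE SET t.Payload = s.Payload
    WHEN NOT MATCHED THEN
        INSERT (Id, Payload) VALUES (s.Id, s.Payload);
END
GO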

A few blog posts suggest that Microsoft has been working toward TVP capability for the EF for quite a while. The way rows are added to the object model and then submitted to the database is ideal for the TVP. There was a connect.microsoft.com item (571256) from June with a request for SubmitChanges() to auto-magically use SqlBulkCopy when there are many rows to send to the database. Not sure of the status – the Connect item seems to have vanished.

Batching methods do help reduce overhead at the database by including many one-row operations within a shared transaction scope, thus reducing cumulative overhead. However, the batching approach bloats application memory and alters the data stream such that data rows are sent to the database in surges. The verbosity of the single-statement operations is not reduced or optimized in any other way as far as I can tell.

Undoubtedly – and equally unfortunately – any object bloat added to mask and avoid the impedance mismatch is acceptable until the application begins to experience a persistence-related bottleneck. Then the reality sets in that using sets in SQL Server is remarkably faster and less resource intensive, along with the recognition that the application may have been adapted in past row-at-a-time iterations to minimize the bottleneck(s) in a way that now proves inadequate. Undoing it all is often the cleanest option when refactoring to a set-based approach.

Other than ‘throw hardware at it’, we are left to our own devices to figure out how to make a set-based approach work when LINQ or EF don’t quite cut the mustard. TVPs can be used to performance advantage in many scenarios where synchronization would otherwise bottleneck the application. The question becomes, “At what cost?” How, and how much, must the application change to gain benefit from TVPs during data loading operations? I’ll leave those questions for you to answer.

When considering a software resolution to an IO bottleneck, the lessons learned by the data warehousing folks are invaluable. They are best applied during design and conceptual development to help answer the most fundamental questions, yet still powerful for improving existing applications.

In the data warehouse, very little is done one row at a time. Set-based operations are the rule. The data push from source systems is often delivered for data warehouse consumption as a file of exported rows. Prior to Change Data Capture in SQL Server 2008, data loading optimizations for the data warehouse consisted mainly of work flow management, decomposition, and in-flight transformations. Early on it became obvious that isolating the load operation from the operation that adds data to a live table resulted in faster loads and more flexible retry capabilities. At the pinnacle, external data is staged into an un-indexed heap table using an n-thread parallel loading pattern; the heap is then indexed and the data inserted – or more commonly upserted, or with SQL Server 2008 maybe MERGEd – into the target schema. The reduced overhead of no indexes, no relationships, and no ordering while the bits are moving into the database storage engine, along with the many IO subsystem optimizations such as ‘minimal logging’ when targeting un-indexed heaps and lightning-fast index builds, is where it gets interesting.
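A minimal sketch of that staged pattern follows, assuming SIMPLE or BULK_LOGGED recovery so the heap insert qualifies for minimal logging. The stage table, the file path, and the dbo.Target table (carried over from the TVP sketch above) are all illustrative:

-- 1. Stage into an un-indexed heap: no indexes, no constraints, no triggers.
CREATE TABLE dbo.Stage_Load
    ( Id      INT            NOT NULL
    , Payload NVARCHAR(4000) NULL );
GO
-- TABLOCK on a heap enables minimal logging under SIMPLE or BULK_LOGGED recovery.
BULK INSERT dbo.Stage_Load
FROM 'C:\loads\extract.dat'          -- hypothetical export file from the source system
WITH (TABLOCK, BATCHSIZE = 100000);

-- 2. Index the staged rows only after the bits have landed.
CREATE CLUSTERED INDEX cix_Stage_Load ON dbo.Stage_Load (Id);

-- 3. Upsert/MERGE from the stage into the live target schema.
MERGE dbo.Target AS t
USING dbo.Stage_Load AS s
    ON t.Id = s.Id
WHEN MATCHED THEN
    UPDATE SET t.Payload = s.Payload
WHEN NOT MATCHED THEN
    INSERT (Id, Payload) VALUES (s.Id, s.Payload);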

Consider this white paper produced by the sages and savants of the SQL Server Customer Advisory Team (SQLCAT) on the multitude of considerations that must go into a data warehouse data loading strategy. The white paper is loaded with tips for optimizing ‘batch’ loads of millions of rows or more into a data warehouse.

But what about OLTP and web applications where the application is pushing a large number of writes to the database and must also remain online and available during the data loading process? Are data warehouse fundamentals so easily extrapolated to databases under a significant transactional load? Can data warehouse batch processing best practices be applied to a database that must remain on-line, available and responsive 24x7? Not likely, not holistically at any rate. Loading row volumes large enough to reap a performance benefit from parallel loading into a busy OLTP database running on commodity hardware is not likely to be a wise move any time soon – maybe someday. That does not mean builders of on-line or line-of-business applications cannot benefit from the knowledge and information data warehouse builders use to deliver ever more reliable and responsive data warehouses.

Movement toward set-based operations is the most likely direction to resolve synchronization bottlenecks in web and OLTP applications. For example, I fully expect Microsoft will make table-valued parameters more available in the EF as soon as they figure out how. Will it be in an EF4 patch? I have no idea. I am equally disappointed that I do not see table-valued parameters better exploitable with NHibernate classes. At this time, the most appealing option may be to take the ORM completely out of the database access when persisting multi-row sets. Most applications would have a small number of discrete operations that would use this alternative, but overall the application could be much better behaved.

Microsoft’s ORM is already taking table-valued parameters completely out of the database access. Visual Studio 2010’s Entity Designer reports,

The function ‘TVPMerge’ has a parameter ‘TVP’ at parameter index 0 that has a data type ‘table type’ which is not supported. The function was excluded.

when you ask it to work with a table-valued parameter. The only way this can be acceptable is if either SQL Server or object-relational mappers are dead-end technologies already deprecated from the common vision at Microsoft. I doubt the latter. It’s just business as usual: we have to do the best we can with the tools available. When you work with Microsoft stuff you simply have to accept that a certain amount of the development effort will go into work-arounds, and that in future development cycles the cost of the work-around will increase. Set-based loading strategies can be beneficial work-arounds at this time. Just keep in mind that once Microsoft figures out how to implement TVPs in EF, the work-around will probably have to be removed in order to exploit the finally built-in functionality. Some things never change…

In the next few posts a test pattern for comparing the effect of several different ways to load relatively small sets of data (1 to 100,000 rows) is presented. The test pattern will support sending data to most tables or table hierarchies. The example used sends rows into a 2-table parent/child hierarchy and is built from sys.dm_os_ring_buffers output. A diverse set of column types is included, including the XML column from the DMV. A stored procedure is called for all tests, with the data values passed as parameters to the stored procedure.
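As a rough sketch only – the actual test scripts are in the GitHub repository linked at the top of this post, and the names below are illustrative – the target hierarchy and source query look something like this:

-- Illustrative 2-table parent/child target for the load tests.
CREATE TABLE dbo.RingBufferParent
    ( ParentId         INT IDENTITY(1,1) PRIMARY KEY
    , ring_buffer_type NVARCHAR(60) NOT NULL
    , capture_time     DATETIME2(3) NOT NULL DEFAULT SYSDATETIME() );

CREATE TABLE dbo.RingBufferChild
    ( ChildId          INT IDENTITY(1,1) PRIMARY KEY
    , ParentId         INT NOT NULL REFERENCES dbo.RingBufferParent (ParentId)
    , record_timestamp BIGINT NOT NULL
    , record           XML NULL );   -- the XML column carried over from the DMV
GO
-- The rows fed to each load method come from a query shaped roughly like this:
SELECT ring_buffer_type, [timestamp], CAST(record AS XML) AS record
FROM sys.dm_os_ring_buffers;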

Other scenarios might be more or less complex. To get the most benefit from the testing protocol presented, it is recommended that the example load scenario be replaced with a target table or table collection comparable to that used in the local environment.
