Making JSON Hay out of SQL Server Data

Moving data in and out of a relational database is a relentless run-time bottleneck. I suspect you would agree that effecting metadata change at the database is even more disruptive than the run of the mill CRUD. I often hear the same [straw] arguments for a new cloud vendor or new hardware or new skill set or a code rewrite to relieve throughput bottlenecks. But what if what you really need is a new data store model? What if you have been building, and rebuilding, fancy databases as cheaply as possible on scenic oceanfront property beneath a high muddy bluff across the delta of a rushing river from a massive smoking volcano in the rain? That is to say, maybe an RDBMS is not quite the right place to put your data all [most?] of the time? Maybe… just maybe… SQL Server – and Oracle and PostgreSQL – are passé and the extant justifications for normalization of data are now but archaic specks disappearing into the vortex of the black hole that is Moore’s Law?

On the off chance that there is some truth to that notion I figure it behooves us to at least be aware of the alternatives as they gain some popularity. I personally enjoy trying new stuff. I prefer to take enough time examining a thing so that what I am doing with it makes sense to me. In late 2012 the open source MongoDB project caught my attention. I was almost immediately surprised by what I found. Intelligent sharding right out of the box for starters. And MongoDB could replicate and/or shard between a database instance running on Windows and a database instance running on Linux for instance, or Android or OSX [or Arduino or LONworks?]. And there was shard aware and array element aware b-tree indexing, and db.Collection.stats() – akin to SQL Server’s SHOWPLAN. Even shard aware Map-Reduce aggregations so the shards can be so easily and properly distributed across an HDFS – or intercontinental cluster for that matter – with ease. And tools to tune queries!  I was hooked in short order on the usability and the possibilities so I dug in to better understand the best thing since sliced bread.

The “mongo shell” – used for configuration, administration and ad hoc queries – lives on an exclusive diet of javascript. Equally easy to use API drivers are available from for Python, Ruby, PHP, Scala, C, C++, C#, Java, Perl, Erlang and Haskell. There is more to the API than you find with the Windows Azure or AWS storage object, or Cassandra or SQLite for that matter, but still not as much complexity for the developer or waiting for results for the user as is invariably encountered with relational models.

In the course(s) of learning about the API and struggling to remembering all the things I never knew about javascript and the precious few things I never knew I knew about javascript I found myself working with – and schooling myself on – node.js (node). Node is a non-blocking single threaded workhorse suitable for administrative work and operational monitoring of servers, smartphones, switches, clouds and ‘the Internet of things‘.  The mongo shell is still the right tool for configuration, indexing,  testing and most janitoral grunt work at the database. Unlike node, the shell is not async by default and not all of the lowlevel power of the mongo shell is exposed through the APIs. Nonetheless, node uses the native javascript MongoDB API. And I must say that having the application and the db console in the same language and using the exact same data structures is huge for productivity. Minimal impedence for the DBA to force the mental shift between mongo server and node callbacks. Virtually no impedance for the developer to mentally shift between app layers and data layers!

Perhaps node.js is a seriously excellent cross platform, cross device administrative tool as I believe, but I can only guarantee that it is fun. It is an open source environment with potential beyond any I can imagine for Powershell or ssh. Packages that expose functionaity through javascript libraries and written in  C, C++, C#, Java and/or Python   exist for connection to node. node makes no bones that MongoDB is the preferred data store. I am by no means a data store connoisseur though I have snacked at the NoSQL corner store and stood in the Linux lunch line often enough to feel entitled to an opinion. You’ll have to take it from there.

FWIW:, MongoDB’s sponsor corp has a real good 6 lesson course on-line for kinesthetic learners that will give you a journeyman skills with MongoDB and get you started with node.js. Mostly hands on so there is a lot of homework. And its free.

Update April 22, 2015: While checking out the otherwise mediocre Node js Jump Start course on the Microsoft Virtual Academy I did come upon another resource that might be better suited to the visual learner: The Little MongoDb Book by Carl Seguin available as a no-cost pdf – or in low-cost hardcopy at your favorite book store. 

To otherwise help ease your introduction – if you decide to kick the MongoDB tires using node.js and you are a SQL DBA – I provide an example data migration below moving a SQL Server model into a MongoDB document that you can easily set up locally or modify to suit your data. Most of the work will be completing the download and installation of the few open source software libraries required.

For data here I use the simplified products table hierarchy from the AdventureWorksLT2012 sample database available from codeplex. product is easily recognizable as what I will call the unit of aggregation.

The unit of aggregation is all data that describes an atomic application object or entity. From the already abstracted relational perspective one could think of the unit of aggregation as everything about an entity de-normalized into one row in one table. In practice, many relational data models have already suffered this fate to one extent or another.      

In the AdventureWorksLT database I see three candidates for unit of aggregation: customers (4 tables), products (6 tables) and Sales (2 tables, parent-child – the child is probably the unit of aggregation). Product is interesting because there are nested arrays (1 to many relationships) and a grouping hierarchy (category). Here is a diagram of the SQL Server data:


This is loaded to into a collection of JSON documents of products with the following JSON format. The value (right) side of each ‘name -value’ pair in the document indicates the table and column to be used from the SQL data.

    _id : Product.ProductID,
    Name: Product.Name,
    ProductNumber: Product.ProductNumber,
    Color: Product.Color,
    StandardCost: Product.StandardCost,
    ListPrice: Product.ListPrice,
    Size: Product.Size,
    Weight: Product.Weight,
    SellStartDate: Product.SellStartDate,
    SellEndDate: Product.SellEndDate,
    DiscontinuedDate: Product.DiscontinuedDate,
    ThumbNailPhoto: Product.ThumbNailPhoto,
    ThumbNailPhotoFileName: Product.ThumbNailPhotoFileName,
    rowguid: Product.rowguid,	
    ModifiedDate: Product.ModifiedDate,
        ProductCategoryID: ProductCategory.ProductCategoryID,
        ParentProductCategoryID : ProductCategory.ParentProductCategoryID,
        Name: ProductCategory.Name,
        ModifiedDate: ProductCategory.ModifiedDate 	
        ProductModelID: ProductModel.ProductModelID,
        Name: ProductModel.Name,
        CatalogDescription: ProductModel.CatalogDescription ,
        ModifiedDate: ProductModel.ModifiedDate 	
              ProductDescriptionID: ProductModel.ProductDescriptionID, 			
              Culture: ProductModelProductDescription.Culture,
              Description: ProductDescription.Description,
              ModifiedDate: ProductDescription.ModifiedDate 	
            ,{more descrs in the square bracketed array}... 
   ,{more products - it's an array too}... 

The code is executed from a command prompt with the /nodejs/ directory in the environment path . I am using node (0.10.25) on Windows Server 2012 with SQL Server 2012 SP1 Developer Edition at the default location and MongoDB 2.2.1 already installed prior to installing node. SQL Server is running as a service and mongod is running from a command prompt. I am using only Windows Authentication. For SQL Server access I am using the edge and edge-sql npm packages. edge asynchronously marshals T-SQL through the local NET framework libraries and returns JSON but only works with Windows.

( April 3, 2017 update: have a look at the mssql package on NPM. Microsoft is the package owner and the output from the database is JSON by default. Works with Node.js on Linux, iOS and Windows )

edge-sql result sets come back to the javascript application as name-value pairs marshaled from a .NET ExpandoObject that looks and smells like JSON to me. The work left after the queries return results is merely to assemble the atomic data document from the pieces of relatioinal contention and shove it into a MongoDB collection. This all works great for now, but I am not totally convinced that edge will make the final cut. I will also warn you that if you decided to start adapting the script to another table hierarchy you are will be forced to also come to understand Closures and Scope in Javascript callbacks. I hope you do. It’s good stuff. Not very SQLish though.

  npm install mongodb
  npm install edge
  npm install edge-sql

  edge expects valid SQLClient connection string in environment variable
  before node is started.
  EDGE_SQL_CONNECTION_STRING=Data Source=localhost;Initial Catalog=AdventureWorksLT;Integrated Security=True

  edge-sql is a home run when aiming SQL data at a JSON target because
  you just supply valid T-SQL and edge will return the ADO recordset
  as a JSON collection of row objects to the scope of the query callback.   	

  edge-sql for a production system is a strike out (w/backwards K)
  1. returns only the first result requested no matter how
    many results produced
  2. the javascript file containing edge.func text is vulnerable to
    SQL injection hijack by adding a semicolon followed by any valid T-SQL
    command or statement provided first word in edge.func callback comment
    is insert, update, delete or select (not case sensitive)
  STEEEEERIKE 3. the connection is made with the security context of the
    Windows user running the script so database permissions and data can
    be hijacked	through an attack on the file with an edge.func('sql')

var edge = require('edge');
var mongoClient = require('mongodb').MongoClient;

var mongoURI = 'mongodb://localhost:27017/test';

// named function expressions (compile time)
// you paste tested SQL queries in each functions comment block
// edge parses it back out and executes as async ADO
var sqlProductCursor = edge.func( 'sql', function () {/*
  SELECT  ProductID
        , ProductCategoryID
        , ProductModelID
   FROM SalesLT.Product;
var sqlProduct = edge.func( 'sql', function () {/*
  SELECT  ProductID AS _id
        , Name
        , ProductNumber
        , Color
        , StandardCost
        , ListPrice
        , Size
        , Weight
        , SellStartDate
        , SellEndDate
        , DiscontinuedDate
        , ThumbnailPhoto -- varbinary MAX!
        , ThumbnailPhotoFileName
        , rowguid
        , ModifiedDate
  FROM SalesLT.Product
  WHERE ProductID = @ProductID;
var sqlProductCategory =  edge.func( 'sql', function () {/*
  SELECT ProductCategoryID
       , ParentProductCategoryID
       , Name
  FROM SalesLT.ProductCategory
    WHERE ProductCategoryID = @ProductCategoryID;
*/} );
var sqlProductModel = edge.func( 'sql', function () {/*
  SELECT ProductModelID
        , Name
        , CatalogDescription
        , ModifiedDate
  FROM SalesLT.ProductModel
  WHERE ProductModelID = @ProductModelID;
var sqlProductModelProductDescription =
  edge.func( 'sql', function () {/*
    SELECT 	pmpd.ProductDescriptionID
          , pmpd.Culture
          , pd.Description
          , pd.ModifiedDate
    FROM SalesLT.ProductModelProductDescription AS pmpd
    LEFT JOIN SalesLT.ProductDescription AS pd
    ON pmpd.ProductDescriptionID = pd.ProductDescriptionID
    WHERE ProductModelID = @ProductModelID;

mongoClient.connect( mongoURI, function( error, db ) {

mongoClient.connect( mongoURI, function( error, db ) {
  sqlProductCursor( null, function( error, sqlp ) {
    if ( error ) throw error;
    for (var i=0; i < sqlp.length; i++) {
      ( function (j) {
          sqlProduct (
            { "ProductID" : j.ProductID },
            function ( error, product ) {
              sqlProductCategory (
                { "ProductCategoryID" : j.ProductCategoryID },
                function ( error, category ) {
                  sqlProductModel (
                    { "ProductModelID" : j.ProductModelID },
                    function ( error, model ) {
                      sqlProductModelProductDescription (
                        {	"ProductModelID" : j.ProductModelID },
                        function ( error, descrs ) {
                          model[0].descrs = descrs;
                          product[0].category = category[0];
                          product[0].model = model[0];
                          db.collection('products').insert( product ,
                            function( error, inserted ) {
                              if (error) throw error;
                        });	// descrs
                    }); // model
                  }); // category
            }); // product
        })(sqlp[i]); // product closure

That’s all there is to it.

This entry was posted in NoSQL. Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s