One brute force attack is an all out assault that tries every possibility to guess the content of a cell. One brute force target is a ‘word’ stored in a hashed cell. A cell in SQL Server cell-level encryption is generally taken as a column. A dictionary attack is a digital commando raid: a smaller more coordinated attack; and very likely to be more effective than brute force. Both can target hash values stored in SQL Server whether SQL Server did the hashing (HASHBYTES) or the application generated the values. The dictionary attacker assembles the most likely phrases or keys – the dictionary, then ‘joins’ the dictionary to the values to be hacked to identify the key or phrase used to generate the hash. The dictionaries can be built off-line and before-hand to create a swift and powerful attack. The more information exposed to the attacker a priori, the more precise the possible attack. The brute force attacker uses every possible key or phrase so is usually slower.
Dictionary attacking seems to be a popular past-time judging by the number of freely downloadable dictionary attacking applications available. There does not appear to be reliable research to be certain of the prevalence of brute force and dictionary attacks against hashed values in SQL server. My sense is that there may be a disturbing number of attacks that go completely undetected and/or unreported. And as more applications necessarily wade into the cryptographic swamp, attacks will surely increase.
It is important to mention that without encryption attackers have a cake walk. Information is lost the instant the attacker makes that first connection. No need to build a dictionary or try even a single possibility of brute force. Poor data governance will reduce or eliminate the number of brute force and dictionary attacks on your systems!
A simple hash of a value is particularly vulnerable to a dictionary attack. A simple hash being one that simply applies the encryption algorithm to the value.
HASHBYTES( N'SHA1', @Value );
Salting a value before hashing helps to confound the dictionary attacker essentially by injecting nonsense into the value. If you can use it, the more salt the better. A salt of comparable size to the value being hashed is recommended.
HASHBYTES( N'SHA1', @Value + @Salt);
With salt, the dictionary attack is useless provided the attacker does not have access to the salt. The salt must be carefully protected. A brute force attack may still succeed though may take significantly longer to complete than without salt. Salt exposed as clear text in another column of the table or in another table or even hard coded creates vulnerability to the dictionary attack once discovered by the attacker.
This underscores the relentless high-stakes battle being waged between data application owners and hackers over hashing and encryption. If a reasonably skilled hacker is determined to crack your encryption, they can eventually do so. SQL Server is particularly vulnerable because the encryption keys are stored with the data. Good defense from dictionary and brute force attacks include:
- strong, well tested, multi-dimensional security (defense-in-depth)
- strongest encryption possible (security above performance)
- anonymous lookups – using salted HASHBYTES() (decrypt only when needed)
- securely store and backup all salt – (store local and remote copies)
- securely share public keys (never expose private keys)
- encrypt data at rest and all data backups (TDE, file system encryption)
- backup encryption keys (store local and remote copies)
- securely backup all encryption passphrases (store local and remote copies)
- rotate encryption keys and passphrases (force brute force attack restarts)
- encryption by combination of keys and passphrases (separation of responsibilities)
- very strong passphrase strength
- stringent access approval requirements (separation of responsibilities)
- continuously test and update roll-back and roll-forward recovery plans
Dynamic key management can get messy and/or expensive real quick in recovery scenarios. However, every time you rotate in a new key most attacks reset to square one. The benefit of having a known good backup when you need it is priceless. The headache of sound key management is worth the benefit. Rotate keys faster than the secrets can be guessed and you should be good. How big that window is depends on the strength of the security, strength of the encryption algorithm, strength of the salt as well as the general level of interest in the data and the sophistication of the attacker.
Hashing only part of the value is a common way to further explode the attacker’s dictionary. This approach includes some unknown part of the value – of length lambda – in the hash. The excluded part of the value in the hash – along with the salt – makes it that much more difficult for the attacker to assemble the dictionaries.
HASHBYTES( N'SHA1' , LEFT( @Value , FLOOR( LEN( @Value ) / 2.0 ) + @Salt)
Hashing also increases the risk that two rows with different values will produce the same hash. It is a small possibility but that risk is always there with HASHBYTES() or any function that tries to uniquely identify a value using a type with a domain too narrow to uniquely identify all possible values. 15 years ago DATETIME columns with a 1/3 mille-second minimum resolution granularity were being used to uniquely identify rows with some success. Few would even consider a DATETIME as a candidate key today. In less than 15 years the same will be true for HASHBYTES. It is not safe in T-SQL to presume a hash absolutely and uniquely identifies the value represented by the hashed value unless the hash is defined as a unique constraint or unique index of the table where the encrypted value is stored. When the uniqueness of the hash is enforced by DRI, it is also a good idea to always handle the duplicate key error. It is more likely that other columns must be evaluated to determine if a match on the hash is in fact a match on the value. Often – sometimes even intentionally – another column is available to resolve the query; others time – and again sometimes even intentionally – an encrypted value may need to be unencrypted to complete the join. Having to un-encrypt to fully identify the row can cause application pain if queries must decrypt large sets, but nicely reduces the vulnerability to attack. It is possible in almost all cases to first get the rows that match a specified hash value into a derived table and then to decrypt only the few resulting rows to find the one requested by a query that knows more about the value it wants than just the hash. The bigger the encrypted blobs, the more the encryption of a few resulting rows can become a significant effort for both process and database server. In this regard it is essential to attenuate the hash processing to the application. Probably a good idea to do a sanity check about the value of the hash for a large cell. In general, it is more likely that full-text query processing is needed than an exact match on a 1GB blob.
In light of this unmitigated potential for conflict a de-duplication external to the hashing operation is needed. Never treat a hash of a value as unique unless you put a constraint on the value that it must be unique in the set. I’m thinking, if the hash cannot be considered unique anyway and the logic must deal with possibility of a duplicate and the data I intend to encrypt is huge (< million rows of customer info is not huge) or under exceptionally heavy load (<5 short running concurrent batches per second is not a heavy load for my laptop), it seems reasonable to me to reduce the hash one step further and define a bucket of hashes that can be represented by the same BINARY_CHECKSUM of the hash. It is not necessary to explicitly put any values into the buckets. That will (may) happen by itself as the hashed value is persisted to the database. In most of my use cases the chance of a duplicate even at the integer level of narrowing is somewhere close to the chance of moon falling out of the sky. The important aspect to recognize is that the buckets do not necessarily represent a single hash just as the hash does not necessarily represent as single value; and then to assure that when duplicates are found in the bucket there is enough information in the query or schema to resolved which of the multiple matches is the one desired. Narrowing the type of the hash to an integer will result in an easier number to work with for the developer, a reduced execution context footprint and reduced read-ahead requirements.
The narrowing method used here also opens the door to randomly generating row identity values at the INT level – or why not SMALLINT or BIGINT or NUMERIC(7,5) or any primitive SQL type? – so that the sequence of creation is not revealed in the surrogate key.
Raul Garcia blogged a hash like Message Authentication Code (MAC) scheme that protected against dictionary attacks a while back. In Mr Garcia’s example the salt appended to the value is the hash of an encrypted NEWID(). As he points out, the encryption of the NEWID produces a non-deterministic value that is a much better random number generator than the SQL Server NEWID() or RAND() functions. The statement he uses to compute a MAC is:
HASHBYTES( N'SHA1' , ENCRYPTBYKEY( KEY_GUID( N'ValueKey' ) , CONVERT( VARBINARY(100) , NEWID() ) ) )
The new GUID is cast to a VARBINARY because SQL Server encryption functions do not have the UNIQUEIDENTIFIER in its vocabulary. The GUID cast as VARBINARY is encrypted to produce a nondeterministic and therefore random value. The encryption function returns a random VARBINARY(8000) that is subsequently hashed with the SHA1 algorithm to produce a VARBINARY(20) digest that represents the encrypted new GUID in a way that has no means to be reverse-engineered. The result is non-deterministic so may be interesting as an unordered identity column – but there remains always that chance of a duplicate hash value.
CRYPT_GEN_RANDOM can also be used but the name I would have to give the technique is not near as fun. Instead of “checksum truncation of the random lambda” we get “VARBINARY truncation of the random random.” My guess is this might not scale quite as well either due to the overhead of encryption functions versus hashing functions – but I have not test that, just guessing…
CAST( CRYPT_GEN_RANDOM ( 4 , ENCRYPTBYKEY( KEY_GUID('ValueKey') , N'type align with NEWID cast' , 1 , CAST( NEWID() AS NCHAR(36) ) ) ) AS INT );
Instead of substring truncation as described above I now propose in most scenarios to use a BINARY_CHECKSUM of the hash to capture an integer representation of the original GUID (Correct me if I am wrong but is not that still deterministic truncation akin SUBSTRING, but in a way that considers the complete underlying value to reduce the probability of any brute force guess being correct instead of throwing part of the values unique identity on the floor?). An index of 4 bytes INT values will fit on 1/10th the number of pages as would an index of the same number of rows with about 40 bytes per row in a VARBINARY(8000). Integers will get you to about the same place in terms of selectivity of the hash value with a significantly reduced total scan length requiring less read-ahead and therefore less memory pressure. Data keyed on whole numbers is endlessly easier to humanly work with than data keyed on a large binary value.
For the MAC that looks something like:
BINARY_CHECKSUM( HASHBYTES( N'SHA1' , ENCRYPTBYKEY( KEY_GUID( N'ValueKey' ) , CONVERT( VARBINARY(100) , NEWID() ) ) ) )
And for a deterministic hashed matching value more like:
BINARY_CHECKSUM( HASHBYTES( N'SHA1' , LEFT( @Value , FLOOR( LEN( @Value ) / 2.0 ) + @Salt) )
;where @SALT is a well protected static value, perhaps generated as a MAC akin to Raul Garcia’s original proposal?
The per row storage advantage of the INT returned by BINARY_CHECKSUM over the VARBINARY(20) returned by HASHBYTES using the ‘SHA1’ algorithm is tiny. But it will add-up in a table with millions and millions of rows. Plus, the overhead difference between the INT and the VARBINARY is significant in other ways. As already mentioned, indexes and therefore scan lengths will be somewhat shorter. Sorting will be slightly faster. Development and testing are also marginally easier with the INT type. I’m spending too much time here to be certain that any performance benefit of INT of VARBINARY is not exaggerated – if it even exists.
It may also be controversial to claim CHECKSUM as a truncation method. CHECKSUM bucketing is a well established and accepted practice however the de facto truncation method at this time is based upon the left most (higher order) bits of the value. The NIST publication SP-800-133, “Recommendation for Cryptographic Key Generation” seems to at least allow for some flexibility in truncation of the hash (I highlighted a few spots in red):
Let K be the directly-generated key or the seed to be used to generate a key. K is a bit string value of the following form:
K = U ⊕ V, (1)
• U and V are of the same length as K,
• U and V are independent of each other, and
• U is derived from the output of an approved Random Bit Generator (RBG)
within the key-generating module that is generating K; the output of the RBG may have been transformed by an approved post-processing method (see Section 5.2) to obtain U.
FIPS 198-1, “Keyed-hash Message Authentication Code (HMAC)” specification – the more recent NIST darling over the classic Cipher Block Chaining MAC (CBC-MAC) that SQL Server uses – has something to say about truncation as well:
A well-known practice with MACs is to truncate their outputs (i.e., the length of the MACs used is less than the length of the output of the HMAC function L). Applications of this standard may truncate the outputs of the HMAC. When truncation is used, the λ leftmost bits of the output of the HMAC function shall be used as the MAC. For information about the choice of λ and the security implications of using truncated outputs of the HMAC function, see SP 800-107
Note that in the HMAC specification, the bits to use are stated as an option-less directive! (“…shall be used.“) SP-800-107, “Recommendation for Applications Using Approved Hash Algorithms” seems pretty clear that the standard truncation method for a hash is only a recommendation for the readers convenience. (Is this really how standards are born? One person writes down an example to try to be helpful. Later someone needs to borrow the captured helpful thought. e viola! In the process of the borrow, the convenience morphs into an edict. Right or wrong, a standard is born.)
5 Cryptographic Hash Function Usage
5.1 Truncated Message Digest
Some applications may require a message digest that is shorter than the (full-length) message digest provided by an approved cryptographic hash function specified in [FIPS 180-3]. In such cases, it may be appropriate to use a subset of the bits produced by the cryptographic hash function as the (shortened) message digest.
For application interoperability, a standard method for truncating cryptographic hash function outputs (i.e., message digests) is provided strictly as a convenience for implementers and application developers. The proper use of a truncated message digest is an application-level issue.
Let the shortened message digest be called a truncated message digest, and let λ be its desired length in bits. A truncated message digest may be used if the following requirements are met:
1. If collision resistance is required, λ shall be at least twice the required collision resistance strength s (in bits) for the truncated message digest (i.e., λ ≥ 2s).
2. The length of the output block of the approved cryptographic hash function to be used shall be greater than λ (i.e., L > λ).
3. The λ left-most bits of the full-length message digest shall be selected as the truncated message digest
In Raul Garcia’s MAC the lambda – the length of bits being hashed- is not exploited to forge the authentication code. It is based on a new GUID at its core, cast to VARBINARY(100) – somewhat arbitrarily using that type and size – because the UNIQUEIDENTIFIER type cannot be used in the hashing function.
In the NIST SP-800-133 sample method the left-most lambda bytes of the value the code authenticates are used. The protection from dictionary attacks is about as perfect as it gets in Mr. Garcia’s example, however the NIST method can be reproduced as needed in hashed password authentication schemes. Raul’s method may be better for use in queries where the MAC’s value is always part of the question (e.g. DRI, joins and indexes). The NIST recommended method is the only option of the two that could correctly be used in a uniqueness test that reproduces the hash and searches for a match in the data set – or at least a pre-test to get it down to one bucket of MACs that must be cracked open to finally determine the desired row.
In all cases, the only way to assure that a hash or hash-bucket collision in T-SQL does not crash or corrupt the application it to code for collisions. The duplicate hash value collision is a fundamental keyed set dilemma comparable to the old school design mistake of using a DATETIME as a unique key. When a data type cannot describe all the unique values that it are assigned in the set that type can be described as narrowing. Hashed values are narrowing columns for clear text values, buckets are narrowing columns for hashes. A collision occurs when a duplicate value is actually placed in a narrowing column. When a narrowing column is a candidate key an additional piece or two of uniquely identifiable information must be included to resolve the collision. The same hash value must be usable to describe two or more unique values or the application must be open to spurious duplicate key errors. Specifically, a table with a narrowing column must include additional columns in any unique index or key to assure uniqueness. A narrowing column requires query and indexing that do not presume uniqueness of hash or bucket columns along with DRI that can correctly disambiguate a collision.
Whether and how a collision resolver deals with duplicates at the VARBINARY(20) bucket size or the INT (4 bytes) bucket size is not meaningfully different. The beauty of using BINARY_CHECKSUM as a truncation method is that instead of using a fixed subset of bytes from the value as the original value, a subset of each byte is used. With the NIST substring method, the hacker can still perpetrate a dictionary attack on the hash, but is left with the task of mapping the part of the original value not reached by the lambda. That could take a while. With Raul’s random method, a dictionary attack using his MAC is not possible. The only relationship they have is that the MAC was created because the value already existed.
Using CHECKSUM instead of BINARY_CHECKSUM will reduce the number of buckets as will removing the sign (ABS()).
Interestingly, RSA defines MAC as synonymous with checksum: http://www.rsa.com/rsalabs/node.asp?id=2177
Until convinced otherwise, I plan to make use of checksum truncation for bucketing of lookups in my encrypted data where I can.