Verifying Little Bits of Big Data

Most data is small. Filesystems store executables, text files, configuration files, and a few large files. Databases are enormous tables of small rows. Version control systems are chains of directory trees made up of small bits of code. Because of the volume of data, efficiently managing this becomes difficult. In this post, we’re going to look at how to verify that data hasn’t been corrupted or tampered with.

To start, we will assume that most data has not been corrupted. So we want to have some way of identifying which pieces of data have been corrupted. One method is store hashes (cryptographically-secure ones of course) of every piece of data. For something like a database, this might seem to be enough since the only structure that exists is within a row. So let’s look at how we might go about verifying our data. Say we have a frozen snapshot of a database, a hash of every row in the database, and that both follow the same order.

We can now verify the contents of this snapshot by scanning over each row and checking its hash against our list of hashes. If the hash we compute for a row differs from the one in our list, then we know exactly which rows have been corrupted and can discard or restore these from an earlier backup.

But wait, this doesn’t actually verify whether database itself is complete! We could have lost both rows and hashes. To fix this, we can compute a hash of all the hashes and use that to verify that the database snapshot contains exactly the rows that were there at the time of the snapshot and that they are in the correct order. In fact, we could avoid storing the individual row hashes and only store the database hash. When verifying the snapshot, if we find that the original and newly-computed hashes match, then we know that database is fully intact. This reduces the verification overhead to the size of just one hash! However, we now have no information to tell us which rows are valid and which have been corrupted. To track that information, we would still need the row-level hashes (alternatively, we could hash blocks of rows and we would know which blocks are valid).

This idea of recursively applying hashes is powerful and is called a Merkle tree (or hash tree). Merkle trees can be used to verify data in version control systems and filesystems. By having directory nodes store hashes of the hashes of all of their children, Merkle trees can be used to verify the integrity of a complete directory tree.

The most famous application Merkle trees is probably in Bitcoin. Bitcoin consists of an ever-growing chain of block headers (hence the term blockchain). Each block header contains a hash of the previous block, some metadata, and the root of a Merkle tree. This allows for anyone to efficiently verify whether or not a transaction took place. All they would need besides the actual transaction content are just O(log n) hashes since the contents of most subtrees will be irrelevant.

How many hashes we want to track determines the granularity at which we can verify corruption and also the speed with which we can verify any particular piece of data. Say we only store the higher levels of the tree for a full Merkle tree as in the example below.

In this example, if we want to verify the contents of just one data node, we have to process the contents of 4. If there is any corruption, we don’t know which of the 4 are actually corrupted. By contrast, if we store hashes for the full tree, then we would only need to process the contents of a single data node, and we could pinpoint which have been corrupted and which are valid. Generalizing this, for a full binary tree of height h, if we store hashes for a tree depth of d, then we have 2d+1-1 hashes to store and can identify failures at a granularity of 2h-d data nodes. Similarly, to verify any failures, we would have to process 2h-d data nodes.

Despite their effectiveness, Merkle trees are not that well know outside of their use in cryptocurrencies. At their core, the data structure is really quite simple: trees of hashes where each node is the hash of its children and leaf nodes are hashes of actual data. It easy to take almost any tree-based data structure and use Merkle trees to provide effective integrity protection for the entire tree. Next time you have to store large amounts of data, consider protecting it with a Merkle tree.