Marvelous Merkle Trees

This article was written by author Ian Forrest

In my previous blog post, I wrote about how one might go about building a Tamperproof Logging Implementation. A good-sized chunk of that post was about how one could use Merkle Trees to verify the individual log messages while also verifying the consistency of the entire set of logs. What I didn’t cover is why Merkle Trees were used in the first place.

First, a quick refresher on Merkle Trees. A Merkle Tree is implemented as a binary tree where each leaf node is a hash of data. Each intermediate node is a hash of its two immediate children, culminating in a unique root hash at the top of the tree.

The result is something that looks like this:

The properties of a Merkle Tree allow for the verification of any leaf node’s inclusion in the tree by providing what’s called a Merkle Proof, sometimes referred to as an inclusion proof. A Merkle Proof consists of a leaf node’s sibling and each intermediate hash needed to calculate the root hash. By calculating the root hash with Merkle Proof and comparing it to a published root hash from a trusted location, data can easily be validated for inclusion.

Merkle Tree Benefits

In order to understand why Merkle Trees are so prolific and why we, at Pangea, chose to use them for our Secure Audit Log, it’s essential to understand their benefits.

  • Small Storage Requirement
    Due to Merkle Trees storing only hashes of the underlying data, the storage requirements are relatively small for the benefits they provide.

  • Fast, lightweight data validation
    Compared to hash chaining or other data validation and consistency verification techniques, a Merkle proof is considerably smaller and faster to validate. Consider this, even as the number of leaf nodes in a Merkle Tree doubles, the number of hashes in a Merkle proof only grows by two.

  • Content Addressable
    Content Addressability means that items are addressable by their content, rather than an artificial ID (e.g., sequence, GUID). A root hash in a Merkle Tree is unique based on the leaf nodes contained therein, thus making it content addressable. Moreover, each node in the tree is a hash of its contents as well. Content-addressable objects are inherently tamperproof, because to change the contents would be to change the address.

  • *Consistency Verification
    *In a growing Merkle Tree, any previous root hash and thereby the data it represents can be proved to be included in a newly calculated root hash. Pangea’s Secure Audit Log refers to the proof required as a consistency proof, which is key to ensuring that records have not been inserted or deleted from the log.

Merkle Trees in Action

To further understand why the benefits of Merkle Trees matter, it’s helpful to know how and where they’re used in the real world.

Crypto (BitCoin)

Today, more people than ever are talking about Merkle Trees, largely due to its usage in cryptocurrency. The unique properties of Merkle Trees allow for fast transaction validation and minimize data replication required. Here’s a great article on Merkle Trees and their usage in the Simplified Payment Verification (SPV) nodes of Bitcoin. For an interesting twist on what you know about Merkle trees, read about Ethereums usage of Patricia Merkle Tries.

Data Replication (Cassandra DB)

As mentioned several times, the root hash of a Merkle Tree is unique based on its contents. Some databases like Cassandra use Merkle Trees to validate replicas. By hashing each row a root and adding those hashes to a Merkle Tree, a unique root hash is generated for each data replica. Identifying inconsistencies between replicas is as trivial as comparing the two root hashes to see if they match. Further, when an inconsistency is found, one can compare each intermediate node to identify which rows have been corrupted. This article is an excellent read for understanding in more detail how Cassandra uses Merkle trees for this process.

Distributed File Systems (BitTorrent)

Another example of a Merkle Tree implementation is in the BitTorrent Merkle Hash Extension. Typically torrent files contain a flat list of hashes representing each downloadable chunk of a content file. However, a problem arises when content files increase in size — either the size of the torrent file containing the list of hashes must increase, or the size of the chunks must increase. Neither of these are great options. The Merkle Hash Extension was created to solve this problem by allowing the root hash to be used for verification rather than a flat list of all hashes. This article explains the usage in detail.

Consistency Verification (Pangea Secure Audit Log)

I’d be remiss without pointing out that Pangea’s Secure Audit Log uses Merkle Trees extensively for the validation of data and the consistency of that data. Using Merkle proofs, users can validate each log entry as untampered and a member of the Audit Log. Further, the integrity of the entire log can be validated using consistency proofs, guaranteeing that no records have been inserted or deleted from the logs.

Wrapping Up

Merkle Trees are a fantastic contribution to Computer Science and have a great number of uses. At Pangea, we’re excited about our use of Merkle Trees in our Secure Audit Log service and would love to get your feedback on it. Sign up for our Pangea Secure Audit Log now.

0
Subscribe to my newsletter

Read articles from Pranav Shikarpur directly inside your inbox. Subscribe to the newsletter, and don't miss out.

Written by

Pranav Shikarpur
Pranav Shikarpur