Skip to content
eawagner edited this page Jun 25, 2013 · 2 revisions

Mango Hash

Problem

Here's the problem- you've got distributed systems that are maintaining a dataset that needs to be consistent. That is, when asked, each system needs to be able to present the same dataset. To make the problem harder, the systems could be isolated geographically and may have low bandwidth by which to communicate. It's also possible that systems could go down without warning. BitTorrent uses Merkle Trees to determine whether or not a potential "server's" blocks in a dataset match the blocks that a client is seeking to request. Git uses Merkle Trees to represent commit histories that can be compared and merged without the need to look at each change in each file.

Solution

Mango Hash presents a data structure for data convergence in distributed systems called a Merkle Tree (http://en.wikipedia.org/wiki/Merkle_tree). The Merkle Tree allows us to summarize large datasets and further determine differences as an nlog(n) operation. The idea is that we forfeight some computing juice up front and, once the tree is created, we can easily determine the differences and, with some level of granularity, determine where those differences occur. A Merkle Tree consists of leaves, which are hashed summaries of a raw dataset, and hash nodes, which are hashed summaries of their children. The Merkle Tree structure in this codebase is a k-ary implementation, meaning that each hash node can have some arbitrary k number of children.

Usage

Generally, a Merkle Tree doesn't just exist on its own. That is to say, it's usually being compared against another Merkle Tree that was generated against a raw dataset on a foreign system. Merkle Trees are based on some raw data set- in this example, we will use pre-defined chunks of a file to build a merkle tree. In the Merkle Tree provided with Mango Hash, the Hash Nodes automatically has their children. In order to construct a Merkle Tree, we just need to fill in the leaves with the hashed summaries of our raw data.

Diving into some code, let's splits this README.md file into chunks and construct a merkle tree. We'll opt for the simple solution for purposes of an example. We simply need to iterate through our file and create a leaf out of each hashed chunk.

First thing we'll want to do is create an extension of a HashLeaf. The HashLeaf class was purposefully left abstract so that we can place a little more meaningful data about our buckets. In the case of read chunks from a file, it'd probably be useful to know what chunk in the file we are on.

public class FileChunkHashLeaf extends HashLeaf {
  
  protected int chunk;
  
  public FileChunkHashLeaf(String hash, int chunk) {
    super(hash);
    this.chunk = chunk;
  }
}
String filename = "/workspace/mango/hash/README.md";

BufferedInputStream bufferedInput = null;
byte[] buffer = new byte[32];   // let's chunk our tree into 32 bytes

List<HashLeaf> leaves = new ArrayList<HashLeaf>();

try {
            
  bufferedInput = new BufferedInputStream(new FileInputStream(filename));
            
  int bytesRead = 0;
  int chunk = 0;
  while ((bytesRead = bufferedInput.read(buffer)) != -1) {
    leaves.add(new FileChunkHashLeaf(new String(HashUtils.hashBytes(buffer)), chunk));
    chunk++;
  }
} catch(Exception e) {
  throw new RuntimeException(e);
} finally {
  bufferedInput.close();
}


MerkleTree merkleTree = new MerkleTree(leaves);

Assuming we have a second merkle tree that was generated by some foriegn system, we can find out which leaves (chunks) differed using the following line of code:

List<? extends HashLeaf> diffLeaves = remoteTree.diff(localTree);

If all we care about is equality, we can easily compare two trees to see if they are equal. Because the trees need to contain the same arity, number of leaves, as well as equal hashes, many times equality is a constant time operation.

boolean equals = localTree.equals(remoteTree);

Clone this wiki locally