5.92

8 3/24: Hashes and bloom filters

Warmup with sets

We know you’re sick of dictionaries, so we’re going to talk about a new kind of data structure that’s not a dictionary. We’re going to talk about sets.

Many of you are taking a discrete mathematics course, so you probably have an idea of how pervasive sets are in computer science. Many algorithms depend on having a set data structure around. They provide an interface like this:

// A Set<T> implements

//

// add : T -> Void

// Adds an element to the set

// Effect: updates the set

//

// contains : T -> Boolean

// Check if an element is in the set

The key thing about sets is they let you check if an element is in it. That matches up with how sets are defined in mathematics.

Sets are important enough that Java comes with a Set interface.

Before we move on, can you implement a Set with any of the data structures you have implemented so far using delegation?

Exercise 1. Optional: Design a data definition and class definition that implements the Set interface. You should delegate to a data structure you have already designed in a previous lab or assignment. (e.g., lists, dictionaries, etc.)

NB: If you feel that you know this material well enough already or if it is taking too much time, move on to the next non-optional exercise.

Bloom filters

Since in class we have been discussing hash tables and hash codes, we will look at a hash-based implementation of the Set interface. Bloom filters are an efficient way to represent sets that are probabilistic. That is, sometimes it will lie and tell you that an element is in the set when it is not. However, it will never lie that an element is not in the set.

The way that you implement a bloom filter is by using a bit vector. A bit vector is like a List where each position represents a single bit (true or false, 1 or 0, etc.) and you can either get, set, or flip bits in your vector. It implements this interface:

// A BitVector is a

//   new BitVector(Integer size)

// and implements

//

// get : Integer -> Boolean

// Produce the bit at the given position

//

// set : Integer -> Void

// Set the bit to true at the given position

// Effect: updates the vector

//

// flip : Integer -> Void

// Flip a bit at the position

// Effect: updates the vector

Exercise 2. Implement a bit vector.

As an aside, here is an application of bit vectors: a neat algorithm for sorting that uses bit vectors. Suppose you’re sorting several million distinct integers and you know the maximum size of any given integer you’re sorting. You may not want to use a typical sorting algorithm because it will allocate a bunch of intermediate data structures into your computer’s RAM.

Instead, you can build a single bit vector that’s as long as the maximum size. Then, when you see an integer just set the bit with the integer as the position. After processing all of your integers, just read off all of the positions setto true in your bit vectors in order, which gives you the sorted list. This is guaranteed to work if your integers are all distinct.

Exercise 3. Optional: Implement the bit vector sorting algorithm explained above.

Test it with a random list of a few million distinct integers.

Okay, back to bloom filters. Here is the data definition:

// A BloomFilter<T> is a

//   new BloomFilter<T>(Integer k, Integer m)

//

// and implements Set<T>

A bloom filter is built with a bit vector of size m. How a bit vector works is that when you add an element to it, you use k different hash codes to give you k positions to set in your bit vector.

When you do a lookup, you will compute the same k hash codes. If a bit at any position is not set, then you know that the element is definitely not in the set. However, if all bits are set, then you only know that it might be in the set.

The reason you don’t know if the element is actually in the set is because there’s a chance another element would have the same positions because of how you modulo the hash codes. If you increase both k and m it is much less likely that you will get these false positives.

One thing you might be wondering is how you obtain k different hash codes when you only have one hashCode method. We can use a trick with random numbers to do this:

import java.util.List;

import java.util.ArrayList;

import java.util.Random;

 

public class HashGenerator {

  // hashCode : Object Integer Integer -> List<Integer>

  // produces k hashCodes for the given object

  public static List<Integer> hashCodes(Object o, Integer k, Integer m) {

        Random rand = new Random(o.hashCode());

    ArrayList<Integer> keys = new ArrayList<Integer>();

 

    return hashCodesAccum(o, k, m, keys, rand);

  }

 

  // hashCodesAccum : Object Integer Integer List<Integer> -> List<Integer>

  // accumulator helper for the function above

  // Invariant: lst is the list of keys generated so far

  private static List<Integer> hashCodesAccum(Object o, Integer k, Integer m,

                  List<Integer> keys, Random rand) {

    Integer next = rand.nextInt(m);

 

    if (k == 0) {

      return keys;

    }

    else {

      keys.add(next);

      return hashCodesAccum(o, k - 1, m, keys, rand);

    }

  }

}

Using the HashGenerator.hashCodes method defined above, you can generate k bit vector positions based on the result of any object’s hashCode method.

Exercise 4. Using bit vectors and hash codes, implement a BloomFilter.

Exercise 5. Randomly test your BloomFilter by adding many (try 10s, then 100s, of elements) random elements and testing if they’re in the set. The elements can be Integers or anything else you can randomly generate.

You should notice that for small bloom filters and good values of k and m that your bloom filter will rarely lie. Does it keep working if you increase the number of elements?

We should note that the method we used to generate k hash codes above is not very good because it’s not guaranteed to give you well distributed hash codes. It turns out that there are much better hashing mechanisms such as MurmurHash.

We have provided an implementation of MurmurHash (based on a public domain implementation by Derek Young here) which you can get here.

It implements the same interface as the HashGenerator above but you’ll have to invoke MurmurHash2.hashCodes instead.

Exercise 6. Use the MurmurHash implementation in your bloom filter. Does it do any better than your original for false positives?