
Lecture 26: Hashing and Equality

A fast data structure for finding data by a key; the full rules for equality in Java

26.1 Introduction

In the past few lectures we’ve talked about ArrayList, which is a data structure that provides particularly efficient ways to access items at any position within a list. But what if we need to look up data not by position, but rather by some other type of key? What can we do? Let’s examine two examples of such data.

26.1.1 Dictionaries

Suppose we need to represent a typical English dictionary, with words mapped to their meanings. We might start by defining a class to represent dictionary entries:
 // Represents one word in a dictionary, together with its definition
 class DictEntry {
   String word;
   String meaning;
 }
We then need to represent a collection of these entries. We have several choices of data structures to use to represent this collection:
• IList<DictEntry>, with entries in no particular order

• IList<DictEntry> that is sorted alphabetically by word

• Deque<DictEntry>, with entries in no particular order

• Deque<DictEntry> that is sorted alphabetically by word

• ArrayList<DictEntry>, with entries in no particular order

• ArrayList<DictEntry> that is sorted alphabetically by word

• Binary trees of DictEntry

• Binary search trees of DictEntry, ordered alphabetically by word

Let’s consider what operations we’ll typically perform on such a dictionary. The single most important operation on a dictionary of words is to look up a word to find its definition, so ideally we want that operation to be as fast as possible.

Right away, we can see that the unsorted IList, Deque, ArrayList and binary tree options are woefully outmatched by their sorted versions: at the cost of maintaining the sort order, we get far better algorithms to access the items. (Of course, the overhead of maintaining the sorting order may not be trivial, so if we can avoid that, so much the better.)

Some slightly more careful thought shows that both ILists and Deques are less effective than ArrayLists: with a sorted ArrayList we can (as we saw in Lecture 22) use binary search over indices to quickly narrow in on a word, whereas with an IList or a Deque we must plod our way through the list from front to back.
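As a reminder of what that binary search looks like for dictionary entries, here is a minimal sketch. It assumes the list is already sorted by word in String.compareTo order; the class name DictLookupSketch, the constructor, the lookup and sample methods, and the sample definitions are all made up for illustration.

```java
import java.util.ArrayList;

// Hypothetical sketch: binary search over a sorted ArrayList of entries.
class DictLookupSketch {
  // Same fields as DictEntry above; the constructor is added for convenience.
  static class DictEntry {
    String word;
    String meaning;

    DictEntry(String word, String meaning) {
      this.word = word;
      this.meaning = meaning;
    }
  }

  // Returns the meaning of the given word, or null if it is absent.
  // ASSUMPTION: entries is sorted in String.compareTo order by word.
  static String lookup(ArrayList<DictEntry> entries, String word) {
    int low = 0;
    int high = entries.size() - 1;
    while (low <= high) {
      int mid = low + (high - low) / 2;
      int cmp = word.compareTo(entries.get(mid).word);
      if (cmp == 0) {
        return entries.get(mid).meaning;
      } else if (cmp < 0) {
        high = mid - 1; // the word, if present, lies before mid
      } else {
        low = mid + 1; // the word, if present, lies after mid
      }
    }
    return null;
  }

  // A tiny, already-sorted sample dictionary (made-up definitions).
  static ArrayList<DictEntry> sample() {
    ArrayList<DictEntry> entries = new ArrayList<DictEntry>();
    entries.add(new DictEntry("apple", "a fruit"));
    entries.add(new DictEntry("banana", "a long yellow fruit"));
    entries.add(new DictEntry("cherry", "a small red fruit"));
    return entries;
  }

  public static void main(String[] args) {
    System.out.println(lookup(sample(), "banana"));
    System.out.println(lookup(sample(), "zebra"));
  }
}
```

Each pass through the loop discards half of the remaining range, which is exactly what walking an IList or Deque from the front cannot do.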

26.1.2 Wikis

A wiki is like a dictionary that anyone can edit. Here, instead of words and meanings, we would have
 // Represents a Wiki entry
 class WikiEntry {
   String url;
   String contents;
 }
(Of course, in practice we would use more sophisticated classes to represent URLs and contents, but strings will suffice here.) Here the performance tradeoffs are a bit less clear: we need the ability to quickly access an entry by its URL, and we need the ability to quickly edit an entry, but we also need the ability to create or delete entries quickly, too. Further, URLs don’t have an obvious ordering: technically we could sort them alphabetically, but that ordering is artificial. Users of a wiki don’t scan through the entire wiki looking for the article they want; they search for the term or just type in the URL directly.

With this in mind, our best contenders above (the sorted ArrayList and the binary search trees) are looking less ideal. Further, because we are now worried about creating and deleting entries, we have to worry about the cost of maintaining the sort order.

26.2 Introducing Hash-tables

Our goal is to design a data structure that gives us fast access to items when looked up by a key, the ability to add, remove, and modify items, and preferably does not require that we impose a sort order on the keys at all. While we’re choosing our requirements, we might as well generalize, and say that this data structure should permit any kind of data as keys, and any kind of data as values. A tall order! But it turns out to be possible — let’s see how.

Of all the data structures we have seen so far, only one of them gives us perfectly fast access to its items: assuming we know what index to look for, an ArrayList will simply let us get that item instantly. (We’ll see in Lecture 27: Introduction to Big-O Analysis more precisely what “instantly” means.) If only we could somehow summarize the key, and compute an index from it, then we could represent our dictionaries and our wikis as

ArrayLists where the value for key $$k$$ can be found at index $$summarize(k)$$

where $$summarize$$ is some function that takes a key and produces a non-negative integer.

This isn’t a DictEntry or a WikiEntry but we did wish for an even-more-general structure of which dictionaries and wikis were just examples!

Let’s see an example. Suppose we wanted to record a mapping of professors’ names to their office numbers. We need to record:
• Ben Lerner ===> WVH314

• Leena Razzaq ===> WVH310B

• Olin Shivers ===> WVH318

• Matthias Felleisen ===> WVH308B

Suppose we picked the following as our $$summarize$$ function:

$$summarize(name) =$$ length in characters of the first name

Then we pick as our data an ArrayList<String> of length 10, and compute:
• “Ben Lerner” gets mapped to index 3

• “Leena Razzaq” gets mapped to index 5

• “Olin Shivers” gets mapped to index 4

• “Matthias Felleisen” gets mapped to index 8

so our ArrayList will look like this:
index:     0     1     2       3        4        5       6     7       8        9
data:   [     |     |     | WVH314 | WVH318 | WVH310B |     |     | WVH308B |        ]
This is amazingly efficient — this one trick lets us get all the operations we want!
• To find the room number for a given professor, we simply compute the length of their first name, and go directly to that index in the ArrayList.

• To modify the room number for a given professor, we again compute the length of their first name, and set the data at that index in the ArrayList.

• To delete the information for a given professor, we again compute the length of their first name, and clear the data at that index from the ArrayList.

• Adding information for a new professor is no different than modifying information for that professor.

Assuming that this “summarize” function is easy to compute, this is far faster than walking through a list, or even a binary search tree.
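The operations just described can be sketched in a few lines. This is only an illustration under stated assumptions: the names ToyTableSketch, emptyTable, put, get and remove are hypothetical, empty slots are represented by null, and we assume every computed index fits within the table.

```java
import java.util.ArrayList;

// Hypothetical sketch of the toy table: values stored at index summarize(key).
class ToyTableSketch {
  // Length in characters of the first name (everything before the first space).
  static int summarize(String name) {
    int space = name.indexOf(' ');
    return space < 0 ? name.length() : space;
  }

  // Builds a table of the given size whose slots are all empty (null).
  static ArrayList<String> emptyTable(int size) {
    ArrayList<String> table = new ArrayList<String>();
    for (int i = 0; i < size; i++) {
      table.add(null);
    }
    return table;
  }

  // Adding and modifying are the same operation: overwrite the slot.
  // ASSUMPTION: summarize(name) is a valid index into the table.
  static void put(ArrayList<String> table, String name, String room) {
    table.set(summarize(name), room);
  }

  // Go directly to the computed index; no searching required.
  static String get(ArrayList<String> table, String name) {
    return table.get(summarize(name));
  }

  // Deleting clears the slot.
  static void remove(ArrayList<String> table, String name) {
    table.set(summarize(name), null);
  }

  public static void main(String[] args) {
    ArrayList<String> rooms = emptyTable(10);
    put(rooms, "Ben Lerner", "WVH314");
    put(rooms, "Leena Razzaq", "WVH310B");
    put(rooms, "Olin Shivers", "WVH318");
    put(rooms, "Matthias Felleisen", "WVH308B");
    System.out.println(get(rooms, "Leena Razzaq"));
  }
}
```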

Do Now!

What could possibly go wrong with the approach mentioned above?

26.3 Introducing HashMaps

This data structure is so widely used that, like ArrayLists, Java provides an implementation for us. It’s called a HashMap<K, V>, and it’s a generic data type that is parameterized by the type K of keys and the type V of values. Accordingly, each of our examples above would be represented as HashMap<String, String>, where the first String represents words, URLs or professor names, and the second String represents meanings, contents, or room assignments respectively. A HashMap maps keys to values. (Note: To use a HashMap, we need to include the line
 import java.util.HashMap;
at the top of each file that uses it, just like we needed to do to use ArrayLists.)

But how does Java decide what to use for the “summarize” function? In Java, every class implicitly extends a common base class called Object. (If we define a class that explicitly extends another class, then that base class eventually extends Object.) Because of this, every object type inherits a method called hashCode, which simply returns an integer. (This is also why every class has a toString method, and various other methods we haven’t needed to examine in this course.) Java even provides a default implementation of this method for us. A hash code is simply a summarization of a piece of data as a number, and the hashCode method is a hash function that computes this hash code for us. There is a very important caveat about hashCode, which will be discussed below.

Note that externally, we cannot see that a HashMap uses an ArrayList-like structure as its internal data, just like externally we could not see that our Deques used Sentinels and Nodes internally. All we know is the methods available on HashMap. (In fact, HashMaps don’t use ArrayLists directly, but rather use something similar that’s a bit more efficient yet.)

We can build and use our example room-assignments as follows:
 class ExampleHashMaps {
   void testHashMaps(Tester t) {
     HashMap<String, String> rooms = new HashMap<String, String>();
     // Put all the data into the hashtable
     rooms.put("Ben Lerner", "WVH314");
     rooms.put("Leena Razzaq", "WVH310B");
     rooms.put("Olin Shivers", "WVH318");
     rooms.put("Matthias Felleisen", "WVH308B");
     // Get the data
     t.checkExpect(rooms.get("Ben Lerner"), "WVH314");
     t.checkExpect(rooms.get("Olin Shivers"), "WVH318");
     // Check that some data is present
     t.checkExpect(rooms.containsKey("Leena Razzaq"), true);
     t.checkExpect(rooms.containsKey("Amal Ahmed"), false);
     // Data that isn't present will return null
     t.checkExpect(rooms.get("Amal Ahmed"), null);
   }
 }

Because adding and setting values for a key are essentially the same operation, this action was named put (to contrast it with get), rather than add. (It may help to understand the HashMap methods by noticing that set on ArrayLists is almost exactly like put on HashMaps, except the “key” type is required to be an int.)

Also note the additional method containsKey, which tells us whether a key is present in the HashMap or not. This is crucially important if you ever design a HashMap that stores the value null: thanks to an old design decision, the get method will not throw an error if a key is not found, but instead will return null. Accordingly, the only way to tell whether a returned value of null means the key wasn’t actually present is to use containsKey. But why would you deliberately store null in a HashMap?
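To make the ambiguity concrete, here is a small sketch that distinguishes the two cases. The class NullValueSketch, its describe and sample methods, and the "Visiting Scholar" entry are hypothetical, invented only for this illustration.

```java
import java.util.HashMap;

// Hypothetical sketch: telling "key absent" apart from "key mapped to null".
class NullValueSketch {
  // Describes what the map knows about the given name.
  static String describe(HashMap<String, String> rooms, String name) {
    if (!rooms.containsKey(name)) {
      return "no entry";
    }
    String room = rooms.get(name);
    return room == null ? "entry with no room" : room;
  }

  // A sample map in which one key is deliberately mapped to null.
  static HashMap<String, String> sample() {
    HashMap<String, String> rooms = new HashMap<String, String>();
    rooms.put("Ben Lerner", "WVH314");
    rooms.put("Visiting Scholar", null); // made-up entry with no office yet
    return rooms;
  }

  public static void main(String[] args) {
    HashMap<String, String> rooms = sample();
    // get alone cannot tell these two apart: both return null
    System.out.println(rooms.get("Visiting Scholar"));
    System.out.println(rooms.get("Amal Ahmed"));
    // containsKey can
    System.out.println(describe(rooms, "Visiting Scholar"));
    System.out.println(describe(rooms, "Amal Ahmed"));
  }
}
```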

26.4 Hash collisions

Let’s go back to our example implementation of a hash table. Suppose we want to add another professor:
• Amal Ahmed ===> WVH328

According to our rules, we compute the length of her first name, and go to index 4 in the ArrayList...but it’s already occupied! This is called a collision, and it’s a rather critical problem. After all, we can’t assume that every person in our table will have a first name of a unique length!

There are several ways to solve this problem, and we will not discuss them in exhaustive detail here: we sketch out the space of possibilities just to illustrate how creative (and complex!) the solutions can get. The first way is simply to choose a better hash function that avoids more collisions. But collisions are inevitable: there are a lot of names, and only a few positions in our ArrayList, and eventually it will be full. The second way is simply to make the ArrayList bigger to begin with, so there is more space. But this may be wasteful, if most of those locations are empty.

The remaining ways to solve collisions require that we store both the key and the value in our data structure, as opposed to just the values (and using the keys only to compute indices). The third way to solve collisions is by probing, which says “Compute the hashCode of the key. If that index is already used by some other data, then look at the next index, and keep going until you find a free index.” In our example, “Amal Ahmed” hashes to index 4, which is full, and so is index 5. But index 6 is free, so Amal’s room number would be placed there. This approach works for a little while, but after ten items are inserted into our ArrayList, it is completely full, and everything will collide and nothing will find a place!
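A minimal sketch of probing, continuing our toy example, might look like the following. The names ProbingSketch, Entry, findSlot, put and get are hypothetical. We also make two assumptions that the description above leaves open: the probe wraps around to index 0 when it runs off the end, and a full table is reported with an exception. Deletion is deliberately left out, since removing an entry can break the probe path to later ones.

```java
import java.util.ArrayList;

// Hypothetical sketch of probing over the toy first-name-length table.
class ProbingSketch {
  // We must now store the key alongside the value.
  static class Entry {
    final String key;
    String value;

    Entry(String key, String value) {
      this.key = key;
      this.value = value;
    }
  }

  final ArrayList<Entry> slots = new ArrayList<Entry>();

  ProbingSketch(int size) {
    for (int i = 0; i < size; i++) {
      this.slots.add(null); // null marks a free slot
    }
  }

  // Length in characters of the first name.
  static int summarize(String name) {
    int space = name.indexOf(' ');
    return space < 0 ? name.length() : space;
  }

  // Returns the index that holds key, or else the first free index on its
  // probe path. Returns -1 if the table is full and key is not in it.
  int findSlot(String key) {
    int start = summarize(key) % this.slots.size();
    for (int step = 0; step < this.slots.size(); step++) {
      int i = (start + step) % this.slots.size(); // ASSUMPTION: wrap around
      Entry e = this.slots.get(i);
      if (e == null || e.key.equals(key)) {
        return i;
      }
    }
    return -1;
  }

  void put(String key, String value) {
    int i = this.findSlot(key);
    if (i < 0) {
      throw new IllegalStateException("table is full");
    }
    this.slots.set(i, new Entry(key, value));
  }

  String get(String key) {
    int i = this.findSlot(key);
    if (i < 0 || this.slots.get(i) == null) {
      return null;
    }
    return this.slots.get(i).value;
  }

  public static void main(String[] args) {
    ProbingSketch rooms = new ProbingSketch(10);
    rooms.put("Olin Shivers", "WVH318");
    rooms.put("Leena Razzaq", "WVH310B");
    rooms.put("Amal Ahmed", "WVH328"); // collides at 4, then 5, lands at 6
    System.out.println(rooms.get("Amal Ahmed"));
  }
}
```

Notice that get must compare keys as it probes: landing on an occupied slot no longer guarantees that the slot holds the entry we asked for.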

The fourth way is called chaining, and replaces our ArrayList of values with an ArrayList of ArrayLists of key-value entries. Each inner list, or bucket, can hold all the entries whose keys hash to its index. Of course, this eventually defeats the purpose of having a hash table in the first place: we have to scan that inner list.
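Here is one way chaining might be sketched for the toy table. The names ChainingSketch, Entry, bucketFor, put and get are hypothetical, and each bucket stores key-value entries so that keys sharing an index can be told apart.

```java
import java.util.ArrayList;

// Hypothetical sketch of chaining: each index holds a bucket of entries.
class ChainingSketch {
  static class Entry {
    final String key;
    String value;

    Entry(String key, String value) {
      this.key = key;
      this.value = value;
    }
  }

  final ArrayList<ArrayList<Entry>> buckets = new ArrayList<ArrayList<Entry>>();

  ChainingSketch(int size) {
    for (int i = 0; i < size; i++) {
      this.buckets.add(new ArrayList<Entry>());
    }
  }

  // Length in characters of the first name.
  static int summarize(String name) {
    int space = name.indexOf(' ');
    return space < 0 ? name.length() : space;
  }

  // The bucket in which the given key belongs.
  ArrayList<Entry> bucketFor(String key) {
    return this.buckets.get(summarize(key) % this.buckets.size());
  }

  // Updates the entry for key if there is one, else appends a new entry.
  void put(String key, String value) {
    ArrayList<Entry> bucket = this.bucketFor(key);
    for (Entry e : bucket) {
      if (e.key.equals(key)) {
        e.value = value;
        return;
      }
    }
    bucket.add(new Entry(key, value));
  }

  // Scans only the one bucket that could contain key.
  String get(String key) {
    for (Entry e : this.bucketFor(key)) {
      if (e.key.equals(key)) {
        return e.value;
      }
    }
    return null;
  }

  public static void main(String[] args) {
    ChainingSketch rooms = new ChainingSketch(10);
    rooms.put("Olin Shivers", "WVH318");
    rooms.put("Amal Ahmed", "WVH328"); // same bucket as Olin, no conflict
    System.out.println(rooms.get("Olin Shivers"));
    System.out.println(rooms.get("Amal Ahmed"));
  }
}
```

The scan in get is short as long as the buckets are short; with a poor hash function, everything lands in a few buckets and we are back to walking a list.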

The fifth way is called rehashing, and essentially uses that inner ArrayList as another hash table, using a different hash function so that the inner table doesn’t immediately fill with collisions. This requires additional implementation effort beyond the built-in support in Java, and is not a general-purpose solution.

Finally, when all else fails, we can resize the hashtable on demand. This is an expensive operation, but if done sporadically, can be the most efficient solution of them all.
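Resizing can be layered on top of chaining, roughly as sketched below. Everything here is an assumption made for illustration rather than a description of how Java's HashMap behaves: the names ResizingSketch, makeBuckets, indexFor and resize, the policy of doubling once there are more entries than buckets, and the use of String's own hashCode (reduced with Math.floorMod, since hash codes can be negative) in place of our toy summarize function.

```java
import java.util.ArrayList;

// Hypothetical sketch: a chained table that doubles when it gets crowded.
class ResizingSketch {
  static class Entry {
    final String key;
    String value;

    Entry(String key, String value) {
      this.key = key;
      this.value = value;
    }
  }

  ArrayList<ArrayList<Entry>> buckets;
  int count = 0; // number of entries currently stored

  ResizingSketch(int size) {
    this.buckets = makeBuckets(size);
  }

  static ArrayList<ArrayList<Entry>> makeBuckets(int size) {
    ArrayList<ArrayList<Entry>> result = new ArrayList<ArrayList<Entry>>();
    for (int i = 0; i < size; i++) {
      result.add(new ArrayList<Entry>());
    }
    return result;
  }

  // The index depends on the table size, so it changes when we resize.
  static int indexFor(String key, int size) {
    return Math.floorMod(key.hashCode(), size);
  }

  void put(String key, String value) {
    ArrayList<Entry> bucket = this.buckets.get(indexFor(key, this.buckets.size()));
    for (Entry e : bucket) {
      if (e.key.equals(key)) {
        e.value = value;
        return;
      }
    }
    bucket.add(new Entry(key, value));
    this.count++;
    // ASSUMED policy: grow once there are more entries than buckets
    if (this.count > this.buckets.size()) {
      this.resize();
    }
  }

  // The expensive step: every entry is re-placed in a table twice as big.
  void resize() {
    ArrayList<ArrayList<Entry>> bigger = makeBuckets(this.buckets.size() * 2);
    for (ArrayList<Entry> bucket : this.buckets) {
      for (Entry e : bucket) {
        bigger.get(indexFor(e.key, bigger.size())).add(e);
      }
    }
    this.buckets = bigger;
  }

  String get(String key) {
    for (Entry e : this.buckets.get(indexFor(key, this.buckets.size()))) {
      if (e.key.equals(key)) {
        return e.value;
      }
    }
    return null;
  }

  public static void main(String[] args) {
    ResizingSketch rooms = new ResizingSketch(2);
    rooms.put("Ben Lerner", "WVH314");
    rooms.put("Leena Razzaq", "WVH310B");
    rooms.put("Olin Shivers", "WVH318");
    System.out.println(rooms.buckets.size());
    System.out.println(rooms.get("Olin Shivers"));
  }
}
```

Because each resize doubles the table, resizes become rarer as the table grows, which is why the occasional expensive step can still pay off overall.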

26.5 Hash codes and equality

There is a crucial rule that must be followed regarding hash codes. The built-in HashMap computes the hashCode of a key, and then examines the various keys with that hash code to see which one is equal, and returns the corresponding value. This raises the question: what happens when we want to define our own equality?

Until this point, we have defined our own equality methods as sameShape or sameList or sameTree, when there’s a perfectly good name, equals, waiting to be used! We could easily override the equals method, but there is a catch: we must then also override the hashCode method to match. Any two objects that are equal according to the equals method must also have equal hash codes. This is crucial, and seems backwards at times, so let’s be explicit:
• If we override equals such that objA.equals(objB) is true, then we must also override hashCode to ensure that objA.hashCode() == objB.hashCode().

• If we override equals such that objA.equals(objB) is false, then objA.hashCode() and objB.hashCode() may or may not be the same.

• If we override hashCode such that objA.hashCode() != objB.hashCode(), then we must also override equals to ensure that objA.equals(objB) is false.

• If we override hashCode such that objA.hashCode() == objB.hashCode(), then objA.equals(objB) may or may not be true.
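To see why the first rule matters to a HashMap, consider the following sketch, in which a hypothetical Room class (not part of our running examples) overrides both methods consistently. The equals method uses the instanceof-and-cast pattern for the Object-typed parameter. Because equal Rooms produce equal hash codes, a lookup with a freshly constructed but equal key finds the stored value.

```java
import java.util.HashMap;

// Hypothetical sketch: equals and hashCode overridden to agree.
class EqualsHashSketch {
  static class Room {
    final String building;
    final int number;

    Room(String building, int number) {
      this.building = building;
      this.number = number;
    }

    public boolean equals(Object other) {
      if (!(other instanceof Room)) {
        return false;
      }
      Room that = (Room) other;
      return this.building.equals(that.building) && this.number == that.number;
    }

    // Uses only fields that equals compares, so equal Rooms hash alike.
    public int hashCode() {
      return this.building.hashCode() * 1000 + this.number;
    }
  }

  // Looks up the occupant using a NEW Room object as the key.
  static String occupantOf(String building, int number) {
    HashMap<Room, String> occupants = new HashMap<Room, String>();
    occupants.put(new Room("WVH", 314), "Ben Lerner");
    return occupants.get(new Room(building, number));
  }

  public static void main(String[] args) {
    System.out.println(occupantOf("WVH", 314));
    System.out.println(occupantOf("WVH", 999));
  }
}
```

If Room overrode equals but kept the default hashCode, the two equal Room objects would very likely report different hash codes, and the HashMap would look in the wrong place and never get as far as calling equals.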

26.5.1 Defining custom hashCodes

Suppose we wanted to define a custom hash code for our Book and Author classes: we could do so as follows:
 class Book {
   Author author;
   String title;
   int year;
   public int hashCode() {
     return this.author.hashCode() * 10000 + this.year;
   }
 }
 class Author {
   Book book;
   String name;
   int yob;
   public int hashCode() {
     return this.name.hashCode() * 10000 + this.yob;
   }
 }
We pick the constant 10000 since years (at least for the foreseeable future!) are numbers less than that. Notice that the hash code for Authors does not include any information about the book field in its computation (because otherwise we’d have infinite recursion), but this does not violate our rules above.
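The point about infinite recursion is easiest to appreciate with a Book and an Author that actually refer to each other. The sketch below repeats the two classes, adding hypothetical constructors and a made-up title and year, and shows that hashing the Book terminates because Author's hashCode never follows the book field.

```java
// Hypothetical sketch: hashing a Book and Author that refer to each other.
class CyclicHashSketch {
  static class Book {
    Author author;
    String title;
    int year;

    Book(Author author, String title, int year) {
      this.author = author;
      this.title = title;
      this.year = year;
    }

    public int hashCode() {
      return this.author.hashCode() * 10000 + this.year;
    }
  }

  static class Author {
    Book book; // filled in after the Book is constructed
    String name;
    int yob;

    Author(String name, int yob) {
      this.name = name;
      this.yob = yob;
    }

    // Deliberately ignores this.book, so the cycle is never followed.
    public int hashCode() {
      return this.name.hashCode() * 10000 + this.yob;
    }
  }

  // Builds a Book and Author that point at each other (made-up title and year).
  static Book sample() {
    Author author = new Author("Bill Nye", 1955);
    Book book = new Book(author, "Example Title", 2000);
    author.book = book;
    return book;
  }

  public static void main(String[] args) {
    Book book = sample();
    // Terminates: Book asks Author, and Author asks only its name
    System.out.println(book.hashCode());
  }
}
```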

26.5.2 Defining custom equals methods for simple classes

The equals method is ultimately inherited from the Object class. Accordingly, the signature for equals must work for any object at all (unlike our sameBook or sameShape methods), so it looks like
 public boolean equals(Object other);
If we want to override the default behavior of equals, then we must implement exactly this signature. But because the type of other is merely Object, we don’t know anything about it, and so cannot make progress. We will need to cast from Object to the particular class we are in. This is the one and only valid use of instanceof and casting in this entire course! It is both sufficient to let us implement our equality method, and necessary because the types are otherwise uninformative. Here are properly-implemented custom equality methods for Book and Author:
 // In Book
 public boolean equals(Object other) {
   if (!(other instanceof Book)) { return false; }
   // this cast is safe, because we just checked instanceof
   Book that = (Book)other;
   return this.author.equals(that.author)
     && this.year == that.year
     && this.title.equals(that.title);
 }

 // In Author
 public boolean equals(Object other) {
   if (!(other instanceof Author)) { return false; }
   // this cast is safe, because we just checked instanceof
   Author that = (Author)other;
   return this.name.equals(that.name)
     && this.yob == that.yob;
 }

The fields that are used in computing hashCodes must be a subset of the fields used for computing equals or else we could violate the rule above relating hashcodes and equality.

Exercise

Design faulty hashCode and equals methods for Book, such that two Books could have different hashcodes and yet still be equal.

Note that if you implement an equals method with any signature other than the one above, Java will not complain! But, your program also will not run as you might expect. Remind yourself of the properties we want in an equality relation. Suppose we defined our Author class to have the following method, with a too-specific signature:
 // In Author
 public boolean equals(Author that) {
   return this.name.equals(that.name)
     && this.yob == that.yob;
 }
And suppose we had the following two example objects:
 Author author1 = new Author("Bill Nye", 1955);
 Object author2 = new Author("Bill Nye", 1955);
If we check whether author1.equals(author2), we would like the answer to be true. But since author2 is declared to be of type Object, Java will look for a method with signature boolean equals(Object other) and we didn’t define any such method! Instead, we get the default implementation of equals, from the Object class, and it will return false. This sort of subtle bug is hard to detect; our autograders will warn you if you get this wrong, but Eclipse will not. Be alert!
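The following sketch reproduces this bug in a form you can run. The Author class here is trimmed down (no book field) and given a constructor, and the names OverloadPitfallSketch, viaAuthorType and viaObjectType are hypothetical. The only difference between the two helper methods is the declared type of the second variable.

```java
// Hypothetical sketch: an equals with the wrong signature overloads
// Object's equals instead of overriding it.
class OverloadPitfallSketch {
  static class Author {
    String name;
    int yob;

    Author(String name, int yob) {
      this.name = name;
      this.yob = yob;
    }

    // Too-specific signature: only chosen when the argument's
    // DECLARED type is Author.
    public boolean equals(Author that) {
      return this.name.equals(that.name) && this.yob == that.yob;
    }
  }

  // Both variables are declared as Author, so equals(Author) is chosen.
  static boolean viaAuthorType() {
    Author author1 = new Author("Bill Nye", 1955);
    Author author2 = new Author("Bill Nye", 1955);
    return author1.equals(author2);
  }

  // author2 is declared as Object, so Java chooses equals(Object),
  // which Author inherits unchanged from the Object class.
  static boolean viaObjectType() {
    Author author1 = new Author("Bill Nye", 1955);
    Object author2 = new Author("Bill Nye", 1955);
    return author1.equals(author2);
  }

  public static void main(String[] args) {
    System.out.println(viaAuthorType()); // true
    System.out.println(viaObjectType()); // false
  }
}
```

Library code such as HashMap and ArrayList only ever sees keys and items as Objects, so it always takes the second path. That is why the too-specific version appears to work in simple tests and then fails once the objects are placed in a collection.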

26.5.3 Defining custom equals methods for itemizations

If we want to define customized equality testing for itemizations, we obviously can’t throw away the hard-won correct behavior of our sameness-testing methods from Lecture 12: Defining sameness for complex data, part 2. In particular, we know that using instanceof is not sufficient in the presence of inheritance. But we also are forced to use the signature for equals given above.

The solution in this case is a “hybrid” of the code above and the double-dispatch technique of Lecture 12. In fact, we do not even need to rewrite that code; we just add to it. Recall our definition of Circle, for example, and the AShape abstract base class:
 interface IShape {
   boolean sameShape(IShape that);
   boolean sameCircle(Circle that);
   boolean sameSquare(Square that);
   boolean sameRect(Rect that);
 }
 abstract class AShape implements IShape {
   public boolean sameCircle(Circle that) { return false; }
   public boolean sameSquare(Square that) { return false; }
   public boolean sameRect(Rect that) { return false; }
 }
 class Circle extends AShape {
   int radius;
   Circle(int radius) { this.radius = radius; }
   public boolean sameShape(IShape that) { return that.sameCircle(this); }
   public boolean sameCircle(Circle that) { return that.radius == this.radius; }
 }
We have here a suite of methods that are perfectly suited to determining when two IShapes are the same or not. We merely need to cause the equals methods for shapes to delegate into those methods.

Do Now!

Design a single implementation for equals that will suffice for Circle, Square and Rect.

All we need to do is override equals on the AShape base class, as follows:
 // In AShape
 public boolean equals(Object other) {
   // the parentheses matter: ! binds more tightly than instanceof
   if (!(other instanceof IShape)) { return false; }
   // this cast is safe, because we just checked instanceof
   IShape that = (IShape)other;
   return this.sameShape(that);
 }
Now any Circle, Square or Rect object will inherit its equals method from AShape, and that method first checks that the other object is at least an IShape (or else it’s definitely not equal to this object!), and then casts down to IShape and delegates to the sameShape method, which implements the double-dispatch technique for sameness testing that we worked out already.

Do Now!

Now that we’ve overridden equals, our shape classes have violated an important principle. What principle is it? Fix it!