Lecture 26: Hashing and Equality
A fast data structure for finding data by a key; the full rules for equality in Java
26.1 Introduction
In the past few lectures we’ve talked about ArrayList, which is a data structure that provides particularly efficient ways to access items at any position within a list. But what if we need to look up data not by position, but rather by some other type of key? What can we do? Let’s examine two examples of such data.
26.1.1 Dictionaries
// Represents one word in a dictionary, together with its definition
class DictEntry {
  String word;
  String meaning;
}
IList<DictEntry>, with entries in no particular order
IList<DictEntry> that is sorted alphabetically by word
Deque<DictEntry>, with entries in no particular order
Deque<DictEntry> that is sorted alphabetically by word
ArrayList<DictEntry>, with entries in no particular order
ArrayList<DictEntry> that is sorted alphabetically by word
Binary trees of DictEntry
Binary search trees of DictEntry, ordered alphabetically by word
Let’s consider what operations we’ll typically perform on such a dictionary. The single most important operation on a dictionary of words is to look up a word to find its definition, so ideally we want that operation to be as fast as possible.
Right away, we can see that the unsorted IList, Deque, ArrayList and binary tree options are woefully outmatched by their sorted versions: at the cost of maintaining the sort order, we get far better algorithms to access the items. (Of course, the overhead of maintaining the sorting order may not be trivial, so if we can avoid that, so much the better.)
Some slightly more careful thought shows that both ILists and Deques are less effective than ArrayLists, since with a sorted ArrayList we can (as we saw in Lecture 22) use binary search over indices to quickly narrow in on a word, rather than plod our way through the list from front to back.
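As a reminder of what that binary search looks like for dictionary entries, here is a minimal sketch. It assumes the list is already sorted by word in String.compareTo order; the constructor on DictEntry and the helper class DictUtils are hypothetical names added only for this illustration.

```java
import java.util.ArrayList;

// Same fields as DictEntry above, plus a convenience constructor (an addition for this sketch)
class DictEntry {
  String word;
  String meaning;

  DictEntry(String word, String meaning) {
    this.word = word;
    this.meaning = meaning;
  }
}

// Hypothetical helper class, not part of the lecture's code
class DictUtils {
  // Returns the meaning of the given word, or null if the word is absent.
  // ASSUMES entries is sorted alphabetically by word (String.compareTo order).
  static String lookup(ArrayList<DictEntry> entries, String word) {
    int low = 0;
    int high = entries.size() - 1;
    while (low <= high) {
      int mid = low + (high - low) / 2;
      int cmp = word.compareTo(entries.get(mid).word);
      if (cmp == 0) {
        return entries.get(mid).meaning;
      } else if (cmp < 0) {
        high = mid - 1; // the word, if present, is in the lower half
      } else {
        low = mid + 1;  // the word, if present, is in the upper half
      }
    }
    return null;
  }
}
```

Each step halves the range of indices still under consideration, which is only possible because an ArrayList lets us jump straight to the middle index.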
26.1.2 Wikis
// Represents a Wiki entry
class WikiEntry {
  String url;
  String contents;
}
With this in mind, our best contenders above (the sorted ArrayList and the binary search trees) are looking less ideal. Further, because we are now worried about creating and deleting entries, we have to worry about the cost of maintaining the sort order.
26.2 Introducing Hash-tables
Our goal is to design a data structure that gives us fast access to items when looked up by a key,
the ability to add, remove, and modify items, and preferably does not require that we impose a sort order
on the keys at all. While we’re choosing our requirements, we might as well generalize, and say that
this data structure should permit any kind of data as keys, and any kind of data as values.
A tall order! But it turns out to be possible, using ArrayLists where the value for key \(k\) can be found at index \(summarize(k)\), for some function \(summarize\) that turns each key into an index.
This isn’t a DictEntry or a WikiEntry, but we did wish for an even-more-general structure of which dictionaries and wikis were just examples!
Ben Lerner ===> WVH314
Leena Razzaq ===> WVH310B
Olin Shivers ===> WVH318
Matthias Felleisen ===> WVH308B
\(summarize(name) =\) length in characters of the first name
“Ben Lerner” gets mapped to index 3
“Leena Razzaq” gets mapped to index 5
“Olin Shivers” gets mapped to index 4
“Matthias Felleisen” gets mapped to index 8
index:   0   1   2     3        4         5      6   7      8      9
data:  [   |   |   | WVH314 | WVH318 | WVH310B |   |   | WVH308B |   ]
To find the room number for a given professor, we simply compute the length of their first name, and go directly to that index in the ArrayList.
To modify the room number for a given professor, we again compute the length of their first name, and set the data at that index in the ArrayList.
To delete the information for a given professor, we again compute the length of their first name, and clear the data at that index from the ArrayList.
Adding information for a new professor is no different than modifying information for that professor.
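The four operations above can be sketched in a few lines of Java. This is only an illustration of the toy scheme, not how any real library does it: the class name ToyRoomTable and its method names are made up for this example, null is used to mark an empty slot, and the sketch assumes every first name is shorter than the size of the list.

```java
import java.util.ArrayList;

// Hypothetical sketch of the toy table described above
class ToyRoomTable {
  ArrayList<String> data;

  ToyRoomTable(int size) {
    this.data = new ArrayList<String>();
    for (int i = 0; i < size; i = i + 1) {
      this.data.add(null); // null marks an empty slot (an assumption of this sketch)
    }
  }

  // The "summarize" function: length in characters of the first name
  int summarize(String name) {
    int space = name.indexOf(' ');
    if (space < 0) {
      return name.length(); // no space, so the whole string is the first name
    }
    return space;
  }

  // Adding and modifying are the same operation
  void put(String name, String room) {
    this.data.set(this.summarize(name), room);
  }

  String get(String name) {
    return this.data.get(this.summarize(name));
  }

  void remove(String name) {
    this.data.set(this.summarize(name), null);
  }
}
```

Notice that none of these methods ever loops over the data: each one computes a single index and goes straight to it.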
Do Now!
What could possibly go wrong with the approach mentioned above?
26.3 Introducing HashMaps
import java.util.HashMap;
But how does Java decide what to use for the “summarize” function? In Java, every class implicitly extends a common base class called Object. (If we define a class that explicitly extends another class, then that base class eventually extends Object.) Because of this, every object type inherits a method called hashCode, which simply returns an integer. (This is also why every class has a toString method, and various other methods we haven’t needed to examine in this course.) Java even provides a default implementation of this method for us. A hash code is simply a summarization of a piece of data as a number, and the hashCode method is a hash function that computes this hash code for us. There is a very important caveat about hashCode, which will be discussed below.
And in fact, HashMaps don’t use ArrayLists directly, but rather use something similar that’s a bit more efficient yet.
class ExampleHashMaps {
  void testHashMaps(Tester t) {
    HashMap<String, String> rooms = new HashMap<String, String>();
    // Put all the data into the hashtable
    rooms.put("Ben Lerner", "WVH314");
    rooms.put("Leena Razzaq", "WVH310B");
    rooms.put("Olin Shivers", "WVH318");
    rooms.put("Matthias Felleisen", "WVH308B");
    // Get the data
    t.checkExpect(rooms.get("Ben Lerner"), "WVH314");
    t.checkExpect(rooms.get("Olin Shivers"), "WVH318");
    // Check that some data is present
    t.checkExpect(rooms.containsKey("Leena Razzaq"), true);
    t.checkExpect(rooms.containsKey("Amal Ahmed"), false);
    // Data that isn't present will return null
    t.checkExpect(rooms.get("Amal Ahmed"), null);
  }
}
Because adding and setting values for a key are essentially the same operation, this action was named put (to contrast it with get), rather than add. (It may help to understand the HashMap methods by noticing that set on ArrayLists is almost exactly like put on HashMaps, except the “key” type is required to be an int.)
Also note the additional method containsKey, which tells us whether a key is present in the HashMap or not. This is crucially important if you ever design a HashMap that stores the value null: thanks to an old design decision, the get method will not throw an error if a key is not found, but instead will return null. Accordingly, the only way to tell whether a returned null means the key was absent, or was present with the value null, is to use containsKey. But why would you deliberately store null in a HashMap?
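To make the ambiguity concrete, here is a small sketch that distinguishes the three possible situations for a key. The class NullValueDemo and its describe method are hypothetical names for this example; containsKey and get are the real HashMap methods.

```java
import java.util.HashMap;

// Hypothetical helper, used only to illustrate get versus containsKey
class NullValueDemo {
  // Reports whether the key is absent, present with a null value, or present with a real value
  static String describe(HashMap<String, String> map, String key) {
    if (!map.containsKey(key)) {
      return "absent";
    } else if (map.get(key) == null) {
      return "present but null";
    } else {
      return "present";
    }
  }
}
```

If we put "Amal Ahmed" into a map with the value null, then get returns null both for "Amal Ahmed" and for a name that was never added; only containsKey tells the two cases apart.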
26.4 Hash collisions
Amal Ahmed ===> WVH328
There are several ways to solve this problem, and we will not discuss them in exhaustive detail here: we sketch out the space of possibilities just to illustrate how creative (and complex!) the solutions can get. The first way is simply to choose a better hash function that avoids more collisions. But collisions are inevitable: there are a lot of names, and only a few positions in our ArrayList, and eventually it will be full. The second way is simply to make the ArrayList bigger to begin with, so there is more space. But this may be wasteful, if most of those locations are empty.
The remaining ways to solve collisions require that we store both the key and the value in our data structure, as opposed to just the values (and using the keys only to compute indices). The third way to solve collisions is by probing, which says “Compute the hashCode of the key. If that index is already used by some other data, then look at the next index, and keep going until you find a free index.” In our example, “Amal Ahmed” hashes to index 4, which is full, and so is index 5. But index 6 is free, so Amal’s room number would be placed there. This approach works for a little while, but after ten items are inserted into our ArrayList, it is completely full, and everything will collide and nothing will find a place!
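Here is a minimal sketch of probing, using the toy first-name-length hash function from earlier so that it reproduces the example. Several details are assumptions made for this sketch rather than anything stated above: the class and method names are made up, keys and values live in two parallel ArrayLists, null marks a free slot, probing wraps around to index 0 at the end of the list, and deletion is left out entirely (removing entries from a probing table needs extra care).

```java
import java.util.ArrayList;

// Hypothetical sketch of probing; not how java.util.HashMap works
class ProbingTable {
  ArrayList<String> keys;   // null marks a free slot
  ArrayList<String> values;

  // ASSUMES size is at least 1
  ProbingTable(int size) {
    this.keys = new ArrayList<String>();
    this.values = new ArrayList<String>();
    for (int i = 0; i < size; i = i + 1) {
      this.keys.add(null);
      this.values.add(null);
    }
  }

  // Toy hash function from the running example: length of the first name
  int summarize(String name) {
    int space = name.indexOf(' ');
    return (space < 0) ? name.length() : space;
  }

  // Finds the slot holding this key, or else the first free slot at or after its home index.
  // Returns -1 if every slot is taken by other keys.
  int findSlot(String key) {
    int size = this.keys.size();
    int start = this.summarize(key) % size;
    for (int step = 0; step < size; step = step + 1) {
      int index = (start + step) % size; // wrap around (an assumption of this sketch)
      String existing = this.keys.get(index);
      if (existing == null || existing.equals(key)) {
        return index;
      }
    }
    return -1;
  }

  void put(String key, String value) {
    int index = this.findSlot(key);
    if (index < 0) {
      throw new IllegalStateException("table is full");
    }
    this.keys.set(index, key);
    this.values.set(index, value);
  }

  String get(String key) {
    int index = this.findSlot(key);
    if (index < 0 || this.keys.get(index) == null) {
      return null; // key is not in the table
    }
    return this.values.get(index);
  }
}
```

Storing the keys is what makes get possible: when we land on an occupied slot, we have to compare keys to know whether it holds the entry we want or one that merely collided with it.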
The fourth way is called chaining, and replaces our ArrayList of values with an ArrayList of ArrayLists of values. Each inner list, or bucket, can hold all the values it needs to. Of course, this eventually defeats the purpose of having a hash table in the first place: we have to scan that inner list.
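A minimal sketch of chaining might look like the following. The names ChainEntry and ChainingTable are hypothetical, the buckets store key/value pairs so that colliding keys can be told apart, and the index is computed from the key's hashCode using Math.floorMod so that negative hash codes still give a valid index. This illustrates the idea only; it is not a description of how the built-in HashMap is implemented.

```java
import java.util.ArrayList;

// One key/value pair stored in a bucket (hypothetical helper class)
class ChainEntry {
  String key;
  String value;

  ChainEntry(String key, String value) {
    this.key = key;
    this.value = value;
  }
}

// Hypothetical sketch of a chained hash table
class ChainingTable {
  ArrayList<ArrayList<ChainEntry>> buckets;

  // ASSUMES size is at least 1
  ChainingTable(int size) {
    this.buckets = new ArrayList<ArrayList<ChainEntry>>();
    for (int i = 0; i < size; i = i + 1) {
      this.buckets.add(new ArrayList<ChainEntry>());
    }
  }

  // Maps a key to a bucket index; floorMod keeps the result non-negative
  int indexFor(String key) {
    return Math.floorMod(key.hashCode(), this.buckets.size());
  }

  void put(String key, String value) {
    ArrayList<ChainEntry> bucket = this.buckets.get(this.indexFor(key));
    for (ChainEntry entry : bucket) {
      if (entry.key.equals(key)) {
        entry.value = value; // key already present: overwrite its value
        return;
      }
    }
    bucket.add(new ChainEntry(key, value));
  }

  String get(String key) {
    // This scan of the bucket is the cost the paragraph above warns about
    for (ChainEntry entry : this.buckets.get(this.indexFor(key))) {
      if (entry.key.equals(key)) {
        return entry.value;
      }
    }
    return null;
  }
}
```

With a table of size 1, every key lands in the same bucket and the structure degenerates into a plain list that must be scanned on every lookup.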
The fifth way is called rehashing, and essentially uses that inner ArrayList as another hash table, using a different hash function so that the inner table doesn’t immediately fill with collisions. This requires additional implementation effort beyond the built-in support in Java, and is not a general-purpose solution.
Finally, when all else fails, we can resize the hashtable on demand. This is an expensive operation, but if done sporadically, can be the most efficient solution of them all.
26.5 Hash codes and equality
There is a crucial rule that must be followed regarding hash codes. The built-in HashMap computes the hashCode of a key, and then examines the various keys with that hash code to see which one is equal, and returns the corresponding value. This raises the question: what happens when we want to define our own equality?
If we override equals such that objA.equals(objB) is true, then we must also override hashCode to ensure that objA.hashCode() == objB.hashCode().
If we override equals such that objA.equals(objB) is false, then objA.hashCode() and objB.hashCode() may or may not be the same.
If we override hashCode such that objA.hashCode() != objB.hashCode(), then we must also override equals to ensure that objA.equals(objB) is false.
If we override hashCode such that objA.hashCode() == objB.hashCode(), then objA.equals(objB) may or may not be true.
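To see why the rule matters for HashMap lookups, consider this small sketch. The class Posn and the helper PosnDemo are hypothetical, introduced only for this illustration. Because hashCode is computed from the same fields that equals compares, a freshly constructed key that is equal to a stored key lands at the same hash code and is found.

```java
import java.util.HashMap;

// Hypothetical class, used only to illustrate the rule above
class Posn {
  int x;
  int y;

  Posn(int x, int y) {
    this.x = x;
    this.y = y;
  }

  // Two Posns are equal when their coordinates match
  public boolean equals(Object other) {
    if (!(other instanceof Posn)) {
      return false;
    }
    Posn that = (Posn) other;
    return this.x == that.x && this.y == that.y;
  }

  // Built from the same fields as equals, so equal Posns get equal hash codes
  public int hashCode() {
    return this.x * 31 + this.y;
  }
}

class PosnDemo {
  // Looks up a label using a new Posn that is equal to,
  // but not the same object as, the key that was stored
  static String lookupWithFreshKey() {
    HashMap<Posn, String> labels = new HashMap<Posn, String>();
    labels.put(new Posn(1, 2), "start");
    return labels.get(new Posn(1, 2));
  }
}
```

If hashCode were left at its default while equals was overridden, the two equal Posn objects would generally have different hash codes, and the lookup could fail to find the stored value even though the keys are equal.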
26.5.1 Defining custom hashCodes
class Book {
  Author author;
  String title;
  int year;
  public int hashCode() {
    return this.author.hashCode() * 10000 + this.year;
  }
}
class Author {
  Book book;
  String name;
  int yob;
  public int hashCode() {
    return this.name.hashCode() * 10000 + this.yob;
  }
}
26.5.2 Defining custom equals methods for simple classes
public boolean equals(Object other);
// In Book
public boolean equals(Object other) {
  if (!(other instanceof Book)) {
    return false;
  }
  // this cast is safe, because we just checked instanceof
  Book that = (Book)other;
  return this.author.equals(that.author)
      && this.year == that.year
      && this.title.equals(that.title);
}
// In Author
public boolean equals(Object other) {
  if (!(other instanceof Author)) {
    return false;
  }
  // this cast is safe, because we just checked instanceof
  Author that = (Author)other;
  return this.name.equals(that.name)
      && this.yob == that.yob;
}
Exercise
Design faulty hashCode and equals methods for Book, such that two Books could have different hashcodes and yet still be equal.
// In Author
public boolean equals(Author that) {
  return this.name.equals(that.name)
      && this.yob == that.yob;
}
Author author1 = new Author("Bill Nye", 1955);
Object author2 = new Author("Bill Nye", 1955);
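The two snippets above appear to set up a classic pitfall, which the following sketch makes runnable. The class names OverloadedAuthor and OverloadDemo are hypothetical, chosen so as not to clash with the Author class. Because the parameter type is OverloadedAuthor rather than Object, this method overloads equals instead of overriding it, and Java chooses between overloads using the declared type of the argument.

```java
// Hypothetical stand-in for the Author class with the narrower equals signature
class OverloadedAuthor {
  String name;
  int yob;

  OverloadedAuthor(String name, int yob) {
    this.name = name;
    this.yob = yob;
  }

  // NOTE: this OVERLOADS equals; it does not override equals(Object)
  public boolean equals(OverloadedAuthor that) {
    return this.name.equals(that.name) && this.yob == that.yob;
  }
}

class OverloadDemo {
  // The argument is declared as Object, so Java selects the inherited
  // equals(Object), which compares object identity
  static boolean compareViaObject() {
    OverloadedAuthor author1 = new OverloadedAuthor("Bill Nye", 1955);
    Object author2 = new OverloadedAuthor("Bill Nye", 1955);
    return author1.equals(author2);
  }

  // The argument is declared as OverloadedAuthor, so Java selects
  // the custom equals(OverloadedAuthor)
  static boolean compareViaAuthor() {
    OverloadedAuthor author1 = new OverloadedAuthor("Bill Nye", 1955);
    OverloadedAuthor author2 = new OverloadedAuthor("Bill Nye", 1955);
    return author1.equals(author2);
  }
}
```

Here compareViaObject returns false while compareViaAuthor returns true, even though both compare the same data. Library code such as HashMap always calls equals(Object), so it would never see the custom version; this is why the signature must take an Object.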
26.5.3 Defining custom equals methods for itemizations
If we want to define customized equality testing for itemizations, we obviously can’t throw away the hard-won correct behavior of our sameness-testing methods from Lecture 12: Defining sameness for complex data, part 2. In particular, we know that using instanceof is not sufficient in the presence of inheritance. But we also are forced to use the signature for equals given above.
interface IShape {
  boolean sameShape(IShape that);
  boolean sameCircle(Circle that);
  boolean sameSquare(Square that);
  boolean sameRect(Rect that);
}
abstract class AShape implements IShape {
  public boolean sameCircle(Circle that) { return false; }
  public boolean sameSquare(Square that) { return false; }
  public boolean sameRect(Rect that) { return false; }
}
class Circle extends AShape {
  int radius;
  Circle(int radius) { this.radius = radius; }
  public boolean sameShape(IShape that) { return that.sameCircle(this); }
  public boolean sameCircle(Circle that) { return that.radius == this.radius; }
}
Do Now!
Design a single implementation for equals that will suffice for Circle, Square and Rect.
// In AShape
public boolean equals(Object other) {
  // the parentheses matter: !other instanceof IShape would not compile
  if (!(other instanceof IShape)) {
    return false;
  }
  // this cast is safe, because we just checked instanceof
  IShape that = (IShape)other;
  return this.sameShape(that);
}
Do Now!
Now that we’ve overridden equals, our shape classes have violated an important principle. What principle is it? Fix it!