Lecture 7: Applications of Equality
Applications of Equality, Comparison and Hashing
Overview
Now that we have two general ways of comparing objects to each other (the Comparable and the Comparator interfaces in Java) we can use them in instances where we need to compare and order objects. Many data structures and algorithms need this capability, including some that exist in Java.
Searching through a list
Searching for an element in a list is a fundamental operation in many applications. This operation requires us to answer the following question: “is this object equal to what I want?” The equals method allows us to answer this question for any Java object.
Naive searching
The simplest way to search through a list is to iterate through it and check each object. For a List object, we can use its iterator as follows:
public static <T> boolean naiveSearch(List<T> list,T value) { Iterator<T> it = list.iterator(); boolean found = false; while ((it.hasNext()) && (!found)) { T obj = it.next(); if (obj.equals(value)) { found = true; } } return found; }
For a regular arrays, we can use a counting-for loop similarly:
public static <T> boolean naiveSearch(T[] array,T value) { int i=0; boolean found = false; while ((i<array.length) && (!found)) { if (array[i].equals(value)) { found = true; } } return found; }
Better searching: binary search
Consider a list of numbers that is already sorted.
Element: 2 4 7 9 10 11 13 Index : 0 1 2 3 4 5 6
Consider searching for the number 4 in this list. Let us compare it to the number in the middle of the list: 9. Obviously it is not the number we are looking for, but it is greater than it. Since the list is sorted this guarantees that attempting to look for 4 to the right of 9 is futile: it will never be there. In other words, if 4 exists in the list it has to be to the left of 9. Thus we can now look for 4 in the list {2 4 7}. The intuition here is that we can throw away an entire part of the list by making only one comparison. We can continue this process: the middle of the new list is 4, which is our number!
We will see later a better way of quantifying “faster”
What if we looked for 12, a number that isn’t in the list? We start from the entire list and compare 12 to the middle number 9. Since 12 is greater than 9, we look for the right-half {10 11 13}. We compare 12 to its middle number 11. Since 12 is greater than 11, we look for the right-half {13}. We compare 12 to 13, and they are not equal we conclude that 12 is not present in the list. We arrived at this conclusion by making only 3 comparisons. Using the naive algorithm we would have to compare 12 with all 7 numbers in the list. Thus, this searching seems to be “faster”. Note that the entire algorithm required only the ability to compare two elements in the list. We can generalize this algorithm to a list of objects of any type, so long as they are comparable to each other. This algorithm is called the binary search algorithm, and it works only if the list is already sorted.
We can express this operation recursively as follows:
Let middle = middle element of list
If middle = value return true
If middle > value recurse into left half of list (binarySearch(left-half-of-list,value))
If middle < value recurse into right half of list (binarySearch(right-half-of-list,value))
How can we express “left-half-of-list” and “right-half-of-list”? Furthermore how can we quickly get to the middle of the list? These two questions suggest that the ability to access any element of the list quickly is desirable (random access). Therefore we will use array-type lists (arrays or ArrayList). Then we can express any part of the list using two indices, left and right.
Finally we need a way to compare two objects: we can use a Comparator object for this purpose.
public static <T> boolean binarySearch(T[] array,T value,Comparator<T> comparator) { return binarySearchRec(array,0,array.length-1,value,comparator); } private static <T> boolean binarySearchRec(T[] array,int low,int high,T value,Comparator<T> comparator) { //low and high are inclusive, i.e. [low,high] if (low>high) { //crossed over, so element does not exist return false; } int med = (low+high)/2; int compareValue = comparator.compare(value,array[med]); if (compareValue==0) { //found it return true; } else if (compareValue<0) { //object in location med is greater than value return binarySearchRec(array, low, med - 1, value, comparator); } else { //object in location med is greater than value return binarySearchRec(array, med + 1, high, value, comparator); } }
Do Now!
Try to implement binary search non-recursively.
Sorting
Sorting a list is a fundamental operation by itself and in many applications (e.g. searching becomes faster due to binary search). The sorting operation requires ordering, and therefore the ability to compare objects.
Insertion sort
In Lecture 4 we wrote a method to sort a list of books by price. The algorithm we used was insertion sort: repeatedly insert a new element into a sorted list so that the resulting list is also sorted. We implemented this on our list recursively. How might we do this on array-type lists?
Consider an array arr. Let us assume that the sub-array arr[0..k-1] is sorted already. We now attempt to insert a[k] into this array, so that the sub-arrayarr[0..k] is sorted. How would we insert an element into a sorted array? We will have to find its correct place in the array, and then make space for it by shifting all the elements to its right by one space to the right.
temp = arr[k]; int j=k-1; while ((j>=0) && (comparator.compare(arr[j],temp)>0)) { arr[j+1] = arr[j]; //shift arr[j] one place to the right j = j-1; } arr[j+1] = temp; //put temp in its correct place
If we iteratively follow this procedure from k=1 (insert arr[1] into array arr[0]) to k=n-1 (insert arr[n-1] into sorted array arr[0..n-2]) then we will end up with a completely sorted array.
public static <T> void insertionSort(T[] arr,Comparator<T> comparator) { for (int k=1;k<arr.length;k++) { //insert arr[k] into arr[0..k-1] which is already sorted T temp = arr[k]; int j=k-1; while ((j>=0) && (comparator.compare(arr[j],temp)>0)) { arr[j+1] = arr[j]; //shift arr[j] one place to the right j = j-1; } arr[j+1] = temp; //put temp in its correct place } }
Selection sort
Consider an unsorted list:
8 3 1 9 4 2 -3
After we sort this list, what should be the last number in the list? It should be the maximum number in the list (and hence would appear last). Let us swap 9 with -3 in the list to obtain the list below
8 3 1 -3 4 2 9
We now know that one element (9) is at its correct place in sorted order. If we continue this procedure (find the maximum, swap with the last element) with the sub-list (8 3 1 -3 4 2) we will swap 8 with 2. Thus we move closer to the sorted list one-step at a time. We select the maximum number in the list, swap it with the last element and repeat the procedure with the list excluding this element. This algorithm is called selection sort and is another way to sort a list of things.
We need to find the maximum element in a list. This involves picking an element and comparing it with every other element. If we have a comparator for this purpose we can generalize this algorithm to lists with arbitrary objects that can be compared to each other.
public static <T> void selectionSort(T[] arr,Comparator<T> comparator) { for (int k = arr.length - 1; k >= 1; k = k - 1) { //find the maximum in the array arr[0..k] and swap it with arr[k] //To swap two elements we need to know their indices //that is why we keep track of where the maximum is, instead of the //maximum itself //we assume that the maximum is the first element, i.e. arr[0] int maxindex = 0; for (int j = 1; j <= k; j++) { if (comparator.compare(arr[j], arr[maxindex]) > 0) { //element at j is greater than element at maxindex, so j is the //new maximum maxindex = j; } } //swap arr[maxindex] and arr[k] swap(arr, maxindex, k); } } private static <T> void swap(T[] arr,int i,int j) { //swap elements at indices i and j T temp = arr[i]; arr[i] = arr[j]; arr[j] = temp; }
Merge sort
Consider the following problem: you are given two lists that are already sorted. You must combine them into a list that is also sorted. Although we can do this by simply appending one list to another and using one of the above algorithms to sort the resulting list, we can do better (because the two lists to be merged are already sorted).
We start from the first elements of the two lists and an empty result list.
v v [ 4 6 9 12 ] [ 2 5 8 9 ] Result: []
We compare the element we are at in the first list to that in the second list. We select the smaller of the two elements, put it in the result and advance the position for that list.
v v [ 4 6 9 12 ] [ 2 5 8 9 ] Result: [ 2 ]
We do the same again: compare two elements, put the smaller one in the result and advance the position. If they are equal we break the tie arbitrarily.
v v [ 4 6 9 12 ] [ 2 5 8 9 ] Result: [ 2 4 ]
v v [ 4 6 9 12 ] [ 2 5 8 9 ] Result: [ 2 4 5 ]
v v [ 4 6 9 12 ] [ 2 5 8 9 ] Result: [ 2 4 5 6 ]
v v [ 4 6 9 12 ] [ 2 5 8 9 ] Result: [ 2 4 5 6 8 ]
v [ 4 6 9 12 ] [ 2 5 8 9 ] Result: [ 2 4 5 6 8 9]
At this point we have exhausted one of the lists. We know that the remaining elements in the other list are all greater than the last element in the result list (why?). So we simply copy them over.
[ 4 6 9 12 ] [ 2 5 8 9 ] Result: [ 2 4 5 6 8 9 9 12 ]
How is this relevant to sorting a list?
Given an arbitrary list, we can sort it by following these steps:
Divide the list into two.
Recursively sort the left half and the right half.
Merge the two sorted halves.
Step 3 is the above merging algorithm. But there are some details to implementing this:
The sorting algorithm must sort the list in-place, i.e. we expect it to mutate the list rather than return the sorted output as a separate list.
How to “divide” this list into two? If we are using array-type lists, we can use indices like we did in binary search.
The “two sorted halves” are two parts of the same array-type list. The merge algorithm must be implemented to take the two lists this way.
The merge algorithm is expected to store its results in the original list itself, but the above merging algorithm used a separate list. So we need a temporary list that stores the results of the merge, and then copy it back to the original list. This will ensure that the sorting is in-place.
This is known as the merge sort algorithm. A variation of this algorithm is used in Java’s Arrays.sort(Object[] a) method.
We first look at the implementation of the merge method:
private static <T> void merge(T[] arr, List<T> temp, int low, int med, int high, Comparator<T> comparator) { //merge the sorted arrays arr[low..med] and arr[med+1..high] using //the temporary array temp and store the result back in arr[low..high] int i, j, k; i = low; j = med + 1; k = low; while ((i <= med) && (j <= high)) { if (comparator.compare(arr[i], arr[j]) < 0) { //arr[i] is lesser than arr[j], move arr[i] to temp and advance i temp.set(k, arr[i]); i = i + 1; } else { //arr[j] is less than or equal to arr[i], move arr[j] to temp and // advance j temp.set(k,arr[j]); j = j + 1; } //advance k regardless k = k + 1; } //the above loop exits when one of the subarrays is exhausted, so we copy // over the remaining part of the other array since it is already sorted while (i <= med) { temp.set(k,arr[i]); i = i + 1; k = k + 1; } while (j <= high) { temp.set(k,arr[j]); j = j + 1; k = k + 1; } //copy over temp[low..high] to arr[low..high] for (i=low;i<=high;i++) { arr[i] = temp.get(i); } }
This is called by the mergesort algorithm as follows:
public static <T> void mergesort(T[] arr, Comparator<T> comparator) { List<T> temp = new ArrayList<T>(); for (T obj:arr) { temp.add(obj); } mergesortRec(arr, temp, 0, arr.length - 1, comparator); } private static <T> void mergesortRec(T[] arr, List<T> temp, int low, int high, Comparator<T> comparator) { //sort the array arr[low..high] using the temp array as a temporary array if (low < high) { int med = (low + high) / 2; //recursively sort arr[low..med] mergesortRec(arr, temp, low, med, comparator); //recursively sort arr[med+1..high] mergesortRec(arr, temp, med + 1, high, comparator); //merge arr[low..med] and arr[med+1..high] and store the result in arr[low..high] merge(arr, temp, low, med, high, comparator); } }
Ordered sets and maps
The notion of ordering doesn’t just apply to storing items in a list. There are other, non-linear ways of storing items.
One such application is the notion of ordered sets. Sets in computer science are similar to sets in discrete math. A mathematical set consists of several items. By definition mathematical sets do not contain duplicates. Sets offer operations like addition, removal, membership (is a specific item in the set?), etc. This notion of sets can be implemented in programming and used for many applications.
An ordered set is a set that stores its elements using some notion of ordering. This ordering does not necessarily manifest itself as a position or index (like lists) but still, elements are stored relative to each other in some way. This ordering leads to benefits in implementing the above set operations. Any ordered set naturally requires a way to order elements relative to each other.
In Java the Set interface encapsulates the operations on a set, and the SortedSet interface is an ordered set. The TreeSet class implements the SortedSet interface using a balanced, binary search tree. It can work with both Comparable and Comparator objects to decide ordering.
A map can be conceptually thought of as a table with two columns: keys and values. Contrary to lists and sets, one adds a pair (tuple) to a map. The value is generallly the data that one wishes to store, whereas the key is something that can be used to uniquely identify the value. Similar to how set cannot store duplicates, maps cannot store multiple entries with the same key (there is no such restriction for values). A map is usually able to efficiently search for an entry given its key. A map is an example of an associative data structure (given a map and a key, provide the associated value).
Examples of things that can be stored in maps are employee records (key: social security number, value: data about employee), pizza accounts (key: phone number, value: buying history, contact information, etc.), etc. When thinking about using a map to store data, think about one central question: can something be used to uniquely identify that data? If so, a map may be a good candidate.
Operations on a map are encapsulated in the Map interface. An ordered map is similar to the idea of an ordered set: it imposes/uses some ordering on the keys of the map. These operations are encapsulated in the SortedMap interface. The TreeMap class implements the SortedMap interface by using a balanced binary search tree arranged by keys.
Hash tables
Introduction
Hash tables are another way to store data so that it can be searched efficiently. The idea of a hash table is simple but powerful.
Consider using a traditional array of objects. We would use it as follows:
SomeClass []objects = new SomeClass[10]; objects[0] = //some object objects[1] = //some object SomeClass so = objects[i]; //get the ith object in the array
One can think of an array as associating an object with a position (index). But the index in this case, a simple number in the range [0,array.length-1] may not be related to the object at all. For data that can be represented by an integer number but in another range (e.g. a social security number), one could use some arithmetic to transform this integer to the appropriate range, such that unique integers produced unique array indices. How can we generalize this concept further to apply to data that is non-numeric, or even composite (i.e. objects)? Instead of simple arithmetic we can envision a function that will take this object and compute an integer number that is a valid array index. Thus the concept can be:
SomeClass []objects = new SomeClass[10]; SomeClass obj = //some object to be inserted into this array int index = func(obj); //convert from obj to index objects[index] = obj; ... // Look for an object anotherobj int whereCanItBe = func(anotherobj); //If objects[whereCanItBe] is anotherobj it is present, else absent
Thus the idea is to write a function to convert any object somehow to an integer number that is a valid index into an array. When an object is inserted we use this function to find the position in which it must be stored. Similarly to search an object we can use the same function to find the position where it should be. Ideally if it is not at this position we can conclude it is not present in the array at all. The idea of converting an object into an integer is known as hashing, this magic converting function is called a hash function, its result is the object’s hash value and the array is a hash table.
A hash table promises to be outstanding: one can now search by checking one array index, provided the hash function is good and fast. This promise breaks down if unique objects do not generate unique hash values. This uniqueness is further limited by the condition that the hash value should be a valid array index. Most importantly, the hash function must be written separately for each type of object.
Hash tables in Java
Java provides two implementations of hash tables: HashSet that implements Set and HashMap that implements Map. Unlike the binary search tree the hash table does not promise any ordering of its contents (the hash function can distribute them randomly within the table). Thus these are examples of unordered sets and maps.
All Java classes, courtesy the Object class, also contain a method called public int hashCode(). This is the hash function that both of the above implementations will use. In order to enable objects of a class to be stored in hashsets and hashmaps, one must override this method in that class. Java provides several helper functions to facilitate creating a good hash function.
In order to write this method for a class, we first identify what parts of its state (i.e. which instance variables) can uniquely identify an object. Then we write a hash function that uses all those parts in the hope that it will generate a unique hash value. For example, the Person class has the following structure:
public class Person{ private String firstName; private String lastName; private int yearOfBirth; ... }
Assuming that our application will never see two people with the same first and last names and the same year of birth, a combination of these three attributes uniquely identifies a Person object. Thus we need a hash function that will take two String objects and an integer number and return a good hash value.
public class Person{ private String firstName; private String lastName; private int yearOfBirth; ... @Override public int hashCode() { return Objects.hash(firstName,lastName,yearOfBirth); } }
The Objects.hash method is a Swiss-army knife of hash functions. It takes an arbitrary number of arguments of different types and generates a hash value. It does so by combining the hash values of each argument (using their own hashCode methods). Thus the above example uses the hashCode() methods of the String and Integer class to compute the final hash value. Both these classes already implement the hashCode method.
Do Now!
Read more about it here, and see where it leads you.
While it is possible to create your own hash function from scratch, these built-in methods make the process much simpler. It is highly recommended to use them.
Equality and hashing
Recall how the hash table finds whether an object is present: it goes the position denoted by the hash value and it checks whether the object there is the same as what it is looking for. Conceptually if two objects are deemed to be the same they ought to behave the same in a hash table.
In Java this sameness checking is standardized by the....public boolean equals(Object other). Thus the following rule is paramount to correct usage of HashSet and HashMap objects:
"If two objects are equal to each other according to your equals method then they must produce the same hash values using their respective hashCode() methods"
In other words, if we override equals for a class, we must override hashCode so that if objects deemed to be equal by the first generate the same number in the second.