Welcome to Lesson 14!

Learning Objectives
By the end of this lesson, you should know...
  • How do you select a good hash algorithm?
  • How do you override the hashCode() method for a class?
  • What is the efficiency of the Hash Table operations?
  • How does the size of the Hash Table affect its efficiency?
  • When is it necessary to resize a hash table and how would you write a simple resizing method?

1. Improving Our Hash Algorithm
  • The previous hash algorithm, which mapped a key to an index according to the length of the key, was not a good hashing algorithm.
  • What was wrong with this algorithm?
    • Too many collisions - we were unable to get an even distribution of values across the table; instead, values clustered around buckets 2-10
    • As a related issue, some indices were unusable - index 0 can never hold a value, as there are no names of length 0
  • Both of these problems contribute to a loss of efficiency.
  • Perhaps we can find a better hash algorithm...
  • Let's try the following:
  • Sum the Unicode values of each character in the key
  • Scale the result by applying % table size.
private int hash(String key, int TABLE_SIZE) {
    int sum = 0;
    for (int i = 0; i < key.length(); i++) {
        sum += (int) key.charAt(i); //summing the Unicode values for each char
    }
    //dividing the summed Unicode values by table size and keeping the remainder as the index
    return sum % TABLE_SIZE;
}
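  • For example, calling this method with a table size of 37 (a quick hypothetical check, using the names revisited in section 3 below):

System.out.println(hash("Mark", 37)); //sum of Unicode values is 395; 395 % 37 = 25
System.out.println(hash("Matt", 37)); //sum of Unicode values is 406; 406 % 37 = 36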


2. Overriding hashCode() for a Class
  • In CIS 36B, we looked at how to override methods inherited from the Object class. Specifically, we focused on:
    • toString()
    • equals()
  • hashCode() is another method that all classes inherit from Object.
    • It is useful when working with hash tables and hash maps
    • In fact, this method is called automatically by Java's built-in collections framework classes HashMap, HashSet, and Hashtable
  • For each class you write, it is good practice to override the hashCode() method for that specific class, especially if you are already overriding equals.
  • Therefore, when we write Lab 7, we will override the hashCode() method for any class interacting with the Hash Table.
  • According to the Oracle documentation about this method,

    The general contract of hashCode is:

    • Whenever it is invoked on the same object more than once during an execution of a Java application, the hashCode method must consistently return the same integer, provided no information used in equals comparisons on the object is modified. This integer need not remain consistent from one execution of an application to another execution of the same application.
    • If two objects are equal according to the equals(Object) method, then calling the hashCode method on each of the two objects must produce the same integer result.
    • It is not required that if two objects are unequal according to the equals(java.lang.Object) method, then calling the hashCode method on each of the two objects must produce distinct integer results. However, the programmer should be aware that producing distinct integer results for unequal objects may improve the performance of hash tables.
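  • For example, here is a minimal sketch of a class (a hypothetical Point, not one of our lab classes) that keeps equals() and hashCode() consistent, satisfying the second rule of the contract:

public class Point {
    private final int x;
    private final int y;

    public Point(int x, int y) {
        this.x = x;
        this.y = y;
    }

    @Override public boolean equals(Object obj) {
        if (this == obj) return true;
        if (!(obj instanceof Point)) return false;
        Point other = (Point) obj;
        return x == other.x && y == other.y; //equality depends on x and y only
    }

    @Override public int hashCode() {
        return 31 * x + y; //uses the same fields as equals, so equal Points hash alike
    }
}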
  • Our new hash algorithm will call the hashCode() method of each class, and then scale the returned integer by applying % table size:
private int hash(T t) {
    int code = t.hashCode();
    return Math.abs(code % table.size()); //hashCode() may be negative, so take the absolute value
}
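  • Note: hashCode() is free to return a negative int, so we apply Math.abs() to guarantee the result is a valid array index.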
  • Then, for each class interacting with the hash table class, we will need to ensure that it overrides hashCode(). For example, given a Book class, we might override hashCode like this:
public class Book {
    private String title;
    private String author;

    @Override public int hashCode() {
        String key = title + author; //define key for this class
        int sum = 0;
        for (int i = 0; i < key.length(); i++) {
            sum += (int) key.charAt(i); //sum the Unicode values
        }
        return sum;
    }
}
  • Or, given a class Butterfly, we might override hashCode as follows:
public class Butterfly {
    private String genus;
    private String species;

    @Override public int hashCode() {
        String key = genus + species; //define key for this class
        int sum = 0;
        for (int i = 0; i < key.length(); i++) {
            sum += (int) key.charAt(i); //sum the Unicode values
        }
        return sum;
    }
}
  • Or, given a class Student, we might override hashCode as follows:
public class Student {
    private int studentID;

    @Override public int hashCode() {
        String key = "" + studentID; //define key for this class
        int sum = 0;
        for (int i = 0; i < key.length(); i++) {
            sum += (int) key.charAt(i); //sum the Unicode values
        }
        return sum;
    }
}
  • Note that we can use the same algorithm, which sums Unicode values, but simply place it inside the hashCode() method for each class we write.
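  • Since that loop is identical in every class, one option (a sketch, not a lab requirement - the helper class name here is hypothetical) is to factor it out and call it from each hashCode():

public class HashUtil { //hypothetical helper class
    //sums the Unicode values of each character in the key
    public static int sumUnicode(String key) {
        int sum = 0;
        for (int i = 0; i < key.length(); i++) {
            sum += (int) key.charAt(i);
        }
        return sum;
    }
}

  • Book's hashCode() would then reduce to a single line: return HashUtil.sumUnicode(title + author);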

3. Guidelines for Choosing a Good Hash Algorithm
  • But, how do we know that our new algorithm is a good one?
  • Here are some guidelines for writing good hashing methods:
  1. Use all of the information provided by a key.
    • Our algorithm uses all characters in the name to create the key.
  2. Make sure the elements to insert can be spread evenly across the entire table.
    • By applying the % operator, we map the summed Unicode values onto the full range of table indices, distributing them throughout the table.
  3. Verify that your algorithm can map similar keys to different buckets.
    • With the new algorithm, will similar keys like "Mark" and "Matt" be placed in the same bucket?
    • Applying the hashing algorithm with a table size of 37:
    • Mark: mapped to index 25 (395 % 37 = 25)
    • Matt: mapped to index 36 (406 % 37 = 36)
  4. Make sure it uses only simple operations (such as the basic arithmetic operators) to minimize run-time. 
    • When you write a hash table, you will call your hashing method often (insert, search, delete). 
    • Make certain your hashing method is not computationally intensive. 
  • While the hashing method is an important aspect in helping to increase the effectiveness of a hash table, other aspects also need to be considered.

4. Big-O Run-Time of a Hash Table
  • For hash table operations such as search, insert, and remove, we get the following Big-O run-times:
  • Best case: O(1)
    • Element is alone at collision free index.
  • Worst case: O(n)
    • All n elements hash to the same bucket, so the operation degenerates to a linear search, which is O(n).
    • This holds for both linear probing and chaining.

[Figure: Worst Case for Separate Chaining - all keys collide into one bucket, forming a single long chain]

  • Average case: O(1)? Yes!
  • The average-case run-time of hash table operations requires a closer look. Assuming a perfect hash method (one that never collides), these operations are always O(1).
  • However, without a perfect hash method, average run-time will be dependent upon how big the table is and how many values it is currently storing (i.e. on how full the table is).
  • As more and more values are inserted into the table, the likelihood of collision increases, resulting in longer chains.
  • The speed of hash table operations is therefore dependent on something called the load factor, which is defined to be n/k, where k is the table size (number of buckets) and n is the number of keys.

load factor = n / k
where,
n = the number of keys (elements)
k = the table size (number of buckets)
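  • For example, 50 keys stored in a table with 100 buckets gives a load factor of 50/100 = 0.5, i.e. an average of half an element per bucket.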
  • The larger the value for the load factor, the slower the hash table operations.
  • Therefore, average case for a hash table is sometimes considered to be O(n/k).
  • You might be thinking: But... O(n/k) = O(n), as we do not include lower order terms and coefficients in our Big-O run-time.
  • So, is a hash table about as efficient as a linked list then?
  • In practice, O(n/k) is still a practical improvement over O(n).
  • Moreover, if your table size is large enough, average run-time approaches constant time.
  • Rule of thumb: let your table size be twice to three times the number of anticipated keys.
  • For example, if we set k = 2n (table size is twice the number of keys):
  • O(n/2n) = O(1/2) = O(1)


5. Guidelines for Selecting a Table Size

  • In order to effectively minimize collisions (thereby maintaining efficiency), a good rule of thumb is to make your table size twice the number of entries you plan to insert.
  • However, more often than not, the number of entries to be inserted will not be known in advance.
  • Therefore, it may be necessary to re-size your hash table:
  • Remember the load factor: n / k (number_of_keys divided by table_size)
  • As the load factor increases, collisions will become more common, resulting in decreased efficiency.
  • How can we decrease the load factor? We increase the table size!
  • When should we re-size the table?
  • Resizing also requires computation time, so only resize when necessary to maintain lookup efficiency.
    • Linear probing: Array indices are used up quickly. Re-size when load factor >= 1/2
    • Separate chaining: opinions vary; re-size when the load factor reaches somewhere between 2/3 and 3/4.
  • Simplest re-sizing method:
    • Create a new hash table whose array is twice as large as the current one.
    • Linearly traverse the old hash table.
    • Copy each element and insert it into the new hash table.
    • Elements are automatically rehashed as they are inserted into the new table.
    • Delete the old hash table (in Java, drop the reference and let the garbage collector reclaim it).
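  • A minimal sketch of this resizing strategy, assuming a separate-chaining table stored as an array of linked lists (the field and method names are hypothetical, not the Lab 7 specification):

import java.util.LinkedList;

public class HashTable<T> {
    private LinkedList<T>[] table;
    private int numElements;

    @SuppressWarnings("unchecked")
    public HashTable(int size) {
        table = new LinkedList[size];
        for (int i = 0; i < size; i++) {
            table[i] = new LinkedList<>();
        }
        numElements = 0;
    }

    private int hash(T t) {
        return Math.abs(t.hashCode() % table.length);
    }

    public void insert(T t) {
        //re-size when the load factor reaches 2/3 (separate chaining guideline above)
        if ((double) numElements / table.length >= 2.0 / 3.0) {
            resize();
        }
        table[hash(t)].add(t);
        numElements++;
    }

    @SuppressWarnings("unchecked")
    private void resize() {
        LinkedList<T>[] oldTable = table;
        table = new LinkedList[oldTable.length * 2]; //double the array size
        for (int i = 0; i < table.length; i++) {
            table[i] = new LinkedList<>();
        }
        //linearly traverse the old table; hash() now uses the new length,
        //so each element is rehashed automatically as it is re-inserted
        for (LinkedList<T> bucket : oldTable) {
            for (T t : bucket) {
                table[hash(t)].add(t);
            }
        }
        //oldTable goes out of scope here and is garbage collected
    }
}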


Wrap Up
  • Answer the review questions on Canvas for this Lesson
  • Nice 7 minute summary of Hash Tables: Harvard CS 50 Video

Upcoming Assignments

  • Lab due Monday
  • Study for your Quiz
~Enjoy Your Day!~