By the end of this lesson, you should know...
 How do you select a good hash algorithm?
 How do you override the hashCode() method for a class?
 What is the efficiency of the Hash Table operations?
 How does the size of the Hash Table affect its efficiency?
 When is it necessary to resize a hash table and how would you write a simple resizing method?
1. Improving Our Hash Algorithm
 The previous hash algorithm, which mapped a key to an index according to the length of the key, was not a good hashing algorithm.
 What was wrong with this algorithm?
 Too many collisions: we were unable to get an even distribution of values across the table
 Clustering of values around buckets 2 through 10
 As a related issue, some indices were unusable
 index 0 can never hold a value as there are no names of length 0
 Both of these problems contribute to a loss of efficiency.
 Perhaps we can find a better hash algorithm...
 Let's try the following:
 Sum the Unicode values of each character in the key
 Scale the result by applying % table size.
private int hash(String key, int TABLE_SIZE) {
    int sum = 0;
    for (int i = 0; i < key.length(); i++) {
        sum += (int) key.charAt(i); // summing the Unicode values for each char
    }
    return sum % TABLE_SIZE; // dividing the summed Unicode values by table size
                             // and returning the remainder as the index
}
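 As a quick sanity check (a hypothetical usage; it assumes the method above is callable from test code), hashing two names with a table size of 37 gives:

hash("Mark", 37); // returns 25
hash("Matt", 37); // returns 36

 These are the same values used in the "Mark"/"Matt" example in section 3 below.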
2. Overriding hashCode() for a Class
 Our new hash algorithm will call the hashCode() method of the element's class, and then take the returned integer value modulo the table size:
private int hash(T t) {
    int code = t.hashCode();
    return code % table.size(); // scale the hash code to a valid index
}
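 One caveat worth noting (an addition, not from the slides): in Java, hashCode() may return a negative value, and a negative code % table.size() is itself negative. A defensive variant, sketched below with the same assumed table field, folds the result back into the valid range:

private int hash(T t) {
    int code = t.hashCode();
    int k = table.size();
    return ((code % k) + k) % k; // always in the range 0 .. k-1, even when
                                 // hashCode() returns a negative value
}

 The Unicode-summing hashCode() methods below never return a negative value, so the simpler version works for them.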
 Then, for each class interacting with the hash table class, we will need to ensure that it overrides hashCode(). For example, given a Book class, we might override hashCode like this:
public class Book {
    private String title, author; // fields that form this class's key
    @Override
    public int hashCode() {
        String key = title + author; // define the key for this class
        int sum = 0;
        for (int i = 0; i < key.length(); i++) {
            sum += (int) key.charAt(i);
        }
        return sum;
    }
}
 Or, given a class Butterfly, we might override hashCode as follows:
public class Butterfly {
    private String genus, species; // fields that form this class's key
    @Override
    public int hashCode() {
        String key = genus + species; // define the key for this class
        int sum = 0;
        for (int i = 0; i < key.length(); i++) {
            sum += (int) key.charAt(i);
        }
        return sum;
    }
}
 Or, given a class Student, we might override hashCode as follows:
public class Student {
    private int studentID; // field that forms this class's key
    @Override
    public int hashCode() {
        String key = "" + studentID; // define the key for this class
        int sum = 0;
        for (int i = 0; i < key.length(); i++) {
            sum += (int) key.charAt(i);
        }
        return sum;
    }
}
 Note that we can use the same algorithm, which sums Unicode values, and simply place it inside the hashCode() method of each class we write.
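 To see how the pieces fit together, here is a hypothetical usage (the Book constructor and the TABLE_SIZE value are assumptions for illustration):

Book b = new Book("The Hobbit", "J.R.R. Tolkien"); // assumed constructor
int index = b.hashCode() % TABLE_SIZE; // the bucket where b would be stored

 The hash table class never needs to know how each class defines its key; it simply calls hashCode() and scales the result.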
3. Guidelines for Choosing a Good Hash Algorithm
 But, how do we know that our new algorithm is a good one?
 Here are some guidelines for writing good hashing methods:
 Use all of the information provided by a key.
 Our algorithm uses all characters in the name to create the key.
 Make sure the elements to insert can be spread evenly across the entire table.
 By applying the % operator, we make sure our summed Unicode values are better distributed throughout the table.
 Verify that your algorithm can map similar keys to different buckets.
 With the new algorithm, will values like "Mark" and "Matt" be placed in the same bucket?
 Applying the hashing algorithm with a table size of 37:
 Mark: 77 + 97 + 114 + 107 = 395, and 395 % 37 = 25, so "Mark" maps to index 25
 Matt: 77 + 97 + 116 + 116 = 406, and 406 % 37 = 36, so "Matt" maps to index 36
 Make sure it uses only simple operations (such as the basic arithmetic operators) to minimize runtime.
 When you write a hash table, you will call your hashing method often (insert, search, delete).
 Make certain your hashing method is not computationally intensive.
 While the hashing method is an important aspect in helping to increase the effectiveness of a hash table, other aspects also need to be considered.
4. Big-O Runtime of a Hash Table
 For hash table operations such as search, insert, and remove, we get the following Big-O runtimes:
 Best case: O(1)
 The element is alone at a collision-free index.
 Worst case: O(n)
 Every element hashed to the same index, so the operation degenerates into a linear search, which is O(n).
 This holds for both linear probing and separate chaining.
[Figure: worst case for separate chaining, with every element chained in a single bucket]
 Average case: O(1)? Yes!
 The average-case runtime of hash table operations takes more care to pin down. Assuming a perfect hash method, these operations are always O(1).
 However, without a perfect hash method, average runtime will be dependent upon how big the table is and how many values it is currently storing (i.e. on how full the table is).
 As more and more values are inserted into the table, the likelihood of collision increases, resulting in longer chains.
 The speed of hash table operations is therefore dependent on something called the load factor, defined as n/k, where n is the number of keys and k is the table size (number of buckets).
load factor = n / k
where,
n = the number of keys (elements)
k = the table size (number of buckets)
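 For example, a table with k = 100 buckets holding n = 50 keys has a load factor of 50/100 = 0.5; after 100 more insertions the load factor would be 150/100 = 1.5, and some buckets would necessarily hold multiple keys.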
 The larger the value for the load factor, the slower the hash table operations.
 Therefore, average case for a hash table is sometimes considered to be O(n/k).
 You might be thinking: But... O(n/k) = O(n), as we do not include lower-order terms and constant coefficients in our Big-O runtimes.
 So, is a hash table about as efficient as a linked list then?
 In practice, O(n/k) is still a practical improvement over O(n).
 Moreover, if your table size is large enough, average runtime approaches constant time.
 Rule of thumb: let your table size be twice to three times the number of anticipated keys.
 If I set k = 2n (table size is twice the number of keys)
 O(n/2n) = O(1/2) = O(1)
5. Guidelines for Selecting a Table Size
 In order to effectively minimize collisions (thereby maintaining efficiency), a good rule of thumb is to make your table size twice the number of entries you plan to insert.
 However, more often than not, the number of entries to be inserted will not be known in advance.
 Therefore, it may be necessary to resize your hash table:
 Remember the load factor: n / k (number_of_keys divided by table_size)
 As the load factor increases, collisions will become more common, resulting in decreased efficiency.
 How can we decrease the load factor? We increase the table size!
 When should we resize the table?
 Resizing also requires computation time, so only resize when necessary to maintain lookup efficiency.
 Linear probing: array indices are used up quickly. Resize when the load factor reaches 1/2.
 Separate chaining: opinions vary; common thresholds fall between 2/3 and 3/4.
 Simplest resizing method (a sketch follows this list):
 Create new hash table with an array sized twice as large as the current hash table.
 Linearly traverse the old hash table.
 Copy each element and insert into new hash table.
 Elements will be rehashed when inserted into new hash table.
 Delete old hash table.
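 Here is a minimal sketch of that method, assuming a separate-chaining table stored in a field named table (a java.util.ArrayList of java.util.LinkedList buckets) and an existing insert method that applies the hash function; none of these names come from the slides:

private void resize() {
    ArrayList<LinkedList<T>> oldTable = table;
    // 1. Create a new table with an array twice as large
    table = new ArrayList<>();
    for (int i = 0; i < oldTable.size() * 2; i++) {
        table.add(new LinkedList<>());
    }
    // 2-4. Linearly traverse the old table and re-insert each element;
    //      insert() calls hash(), so every element is rehashed for the new size
    for (LinkedList<T> bucket : oldTable) {
        for (T element : bucket) {
            insert(element);
        }
    }
    // 5. The old table is no longer referenced, so Java's garbage collector
    //    reclaims it; no explicit delete is needed
}

 A natural place to call resize() is at the top of insert(), whenever the load factor reaches the chosen threshold.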
Wrap Up
 Answer the review questions on Canvas for this lesson
 Nice 7 minute summary of Hash Tables: Harvard CS 50 Video
Upcoming Assignments
 Lab due Monday
 Study for your Quiz
~Enjoy Your Day!~
