Learning Objectives
By the end of this class, you should know...
  • What is a Hash Table?
  • What are the advantages of using a Hash Table over other ADTs we have studied?
    • Arrays?
    • Linked Lists (and other linear structures)?
    • Binary Search Trees?
  • What is a hashing algorithm?
  • How do you write a good hashing algorithm?
  • What are some solutions to the problem of collisions?
  • What is the efficiency of common hashing operations?
  • How does the size of the Hash Table affect its efficiency?
  • How do you search for an element in a Hash Table?


Announcements
  • Quiz 4 on Thursday - BSTs and Hash Tables
  • Lab 7 posted - Hash Tables and BSTs
  • Feedback from a former student about internship search

Review Activity
With a partner, complete the following:
  • Insert the following values (in this order) into an empty BST: 7, 3, 2, 9, 18, 5, 14, 8, 4, -1
  • Redraw the tree after calling remove on node 7 using the algorithm given in class.
  • Now, trace pre, post and inorder print on the BST.
  • Write the pre, post and inorder print functions for a BST
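  • For reference, here is a minimal sketch of the three traversal print functions, assuming a simple Node struct with data, left, and right members (these names are illustrative and may differ from our course BST class):
#include <iostream>
using namespace std;

struct Node //assumed node layout; adjust to match your own BST class
{
    int data;
    Node* left;
    Node* right;
};

void preOrderPrint(Node* root) //visit the node, then its left subtree, then its right subtree
{
    if (root == nullptr)
        return;
    cout << root->data << " ";
    preOrderPrint(root->left);
    preOrderPrint(root->right);
}

void inOrderPrint(Node* root) //left subtree, then the node, then the right subtree
{
    if (root == nullptr)
        return;
    inOrderPrint(root->left);
    cout << root->data << " ";
    inOrderPrint(root->right);
}

void postOrderPrint(Node* root) //left subtree, then the right subtree, then the node
{
    if (root == nullptr)
        return;
    postOrderPrint(root->left);
    postOrderPrint(root->right);
    cout << root->data << " ";
}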

Hash Table Pre-Assessment: Remember This?
  • ASCII Review: Write a function which prints out the ASCII value of each character in a string:
void printASCII(string str)
{
    //write function body here
}
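  • One possible solution sketch (casting each character to int yields its ASCII value, the same trick our hashing function will use later in the lesson):
#include <iostream>
#include <string>
using namespace std;

void printASCII(string str)
{
    for (int i = 0; i < (int) str.length(); i++)
        cout << str[i] << ": " << (int) str[i] << endl; //cast each character to int to print its ASCII value
}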



The Hash Table ADT
  • A Hash Table is a structure where values are mapped or assigned to an array index according to the rules of an algorithm.
    •  This algorithm is called the hashing function.
    •  The hashing function takes an input (called a key) and then returns the index of an array.
    •  This array index is where the value is to be stored
  • As we are going to see, hash table operations such as search, insert, and delete can, on average, be performed in just O(1) time.

Group Activity: Hash Table Bingo!
  • To help you understand a Hash Table, let's play Hash Table Bingo.
  • The class as a whole is going to play the role of the underlying array structure in a Hash Table.
  • Each student in the class will be assigned an index based on his or her name on the class roster.
  • Using a hash function, I will enter the name of a piece of candy as a key.
  • The piece of candy is the value to store at the array index.
  • The hash function will return to me the index at which I want to store the candy.
  • If your index is returned by the hash function, you will get a piece of candy.
  • At the end of the exercise, I can hash the names of the pieces of candy in order to find out who received a piece and who didn't.

A Simple Array Only Example
  • Suppose that I wish to store the names of all the students in my class.
  • We could start with an array of strings.
const int CLASS_SIZE = 40; //# students enrolled in the class
string names[CLASS_SIZE];

  • Now, our first thought might be to store each student's name alphabetically in the array.

names[0] = "Anthony";

names[1] = "Brandon";

names[2] = "Cameron";

...

names[39] = "Zih-Ying";

  • However, later, if a student drops the class, I would need to:
    • Use binary search to search the array for the student's name: O(log n) time.
    • Delete the name and update the array by copying all the values after that student's name down by one: O(n) time (sketched below).
    • Is there a way that I can do better?
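  • For reference, here is a minimal sketch of this array-based removal, assuming the array is kept sorted and a count of stored names is tracked (findName and removeName are illustrative helpers, not course code):
#include <string>
using namespace std;

//Binary search over a sorted array of names: O(log n)
int findName(const string names[], int count, string target)
{
    int low = 0, high = count - 1;
    while (low <= high)
    {
        int mid = (low + high) / 2;
        if (names[mid] == target)
            return mid;
        else if (names[mid] < target)
            low = mid + 1;
        else
            high = mid - 1;
    }
    return -1; //name not found
}

//Remove the name at the given index by copying later names down by one: O(n)
void removeName(string names[], int& count, int index)
{
    for (int i = index; i < count - 1; i++)
        names[i] = names[i + 1];
    names[count - 1] = "";
    count--;
}
  • The copying loop is what makes removal O(n); the hash table approach below avoids it.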

Let's Use a Hash Table Instead
  • The underlying structure of a hash table is an array.
  • However, when we insert elements into our hash table array, we utilize a hashing algorithm.
  • The hashing algorithm is used to determine at which array index to place an element in our table. 
  • Each index is also known as a bucket.
    • As we will see later, more than one element can be associated with an index. 
  • Let's look at a simple example of a hash table:
  • Using the array above as the basis for a hash table provides a more efficient alternative when searching for or deleting a name from the class roster.
const int CLASS_SIZE = 40; //# students enrolled in the class
string names[CLASS_SIZE];
  • Let's assume that I have a simple hashing algorithm to determine at which array index to store each name on my roster. 
    • When searching for a student's name, I would call the hashing algorithm, passing the name as input.
    • The hashing algorithm would return the index with which the name is associated.
    • Lookup then involves searching for the name at a single array index, which is O(1).
  • Assume my hashing algorithm is implemented as the following function:
int hash(string student_name)
{
    int index = student_name.length();
    return index;
}
...
//applying my hashing algorithm to the first name in my CIS 22C roster
int index = hash("Anthony"); //Should return 7
names[index] = "Anthony"; //storing the value at index 7
  • In the above example, the value "Anthony" is called the key.
  • The key is the value that is passed into the hashing algorithm to determine the placement of the name within the table.
  • Now, when I add students to the roster, they will be stored at the index provided by the hashing algorithm:
    names[0] = "";
    names[1] = "";
    names[2] = "";
    names[3] = "Kun";
    names[4] = "Doma";
    names[5] = "Howon";
    names[6] = "Maryna";
    names[7] = "Anthony";
    ...
    names[39] = "";
  • Later, when a student drops, I can search for the name and remove it by simply calling the hashing algorithm:
int index = hash("Anthony");//hashing algorithm provides index to search for the name
if (names[index] == "Anthony") //check if the value at index is equal to the name I am searching for
    names[index] = ""; //removing the name of the student who dropped
else
    cout << "Anthony isn't in the table." << endl;
  • As we can see from this simple example, with a hash table, insertion, search, and deletion can each be performed in just O(1) time.
  • We will discuss the Big-O run-time of hash table operations later in the lesson.

Collisions



  • As you probably noticed, there is a problem with the above example.
  • How many people in this class have a first name that is 3 letters long? 4 letters? 5 letters? 6 letters? 7 letters? 8 letters?
  • This problem is known as a collision.
    • A collision occurs when more than one value is mapped to the same index in a hash table.
    • In my above example, most names would be mapped to the indices 2 - 10.
    • Trivia: Average name length is 5.9 letters.
  • Collisions can be mitigated with better hashing algorithms.
  • However, regardless of how good a hashing algorithm is, collisions will occur.
  • A perfect hash function (one that produces no collisions) can only be achieved if all possible keys are known ahead of time, which is very unlikely in practice.
  • There are various approaches to resolving collisions. We will discuss two principal approaches below.

Linear Probing
  • One simple approach to handling collisions within a hash table is called linear probing.
  • In linear probing, if an element is hashed to an index that is already storing another value, the new element is inserted at the next open index.
  • For example, imagine our names hash table contains the following values:
         names[0] = "";
    names[1] = "";
    names[2] = "";
    names[3] = "Kun";
    names[4] = "Doma";
    names[5] = "Howon";
    names[6] = "Maryna";
    names[7] = "Anthony";
    ...
    names[39] = "";
  • Now, further suppose that I want to insert the name "Brandon" into my names table.
    • Whoops! There is a collision when I apply my hashing algorithm ("Brandon" also has 7 letters).
    • Index 7 is already occupied by "Anthony".
    • I can use linear probing to resolve this collision.
  • Using linear probing, I insert the name "Brandon" in the next available index after the hashed index of 7.
    • Where would I place "Brandon" in the table?
    • names[8] = "Brandon";
  • Now, my table looks like the following:
    names[0] = "";
    names[1] = "";
    names[2] = "";
    names[3] = "Kun";
    names[4] = "Doma";
    names[5] = "Howon";
    names[6] = "Maryna";
    names[7] = "Anthony";
    names[8] = "Brandon";
    ...
    names[39] = "";
  • Next, suppose I want to insert the name "Cameron". Where would his name go as index 7 already contains the name "Anthony" and index 8 already contains the name "Brandon"?
  • In this case, I would need to place "Cameron" two indices away from his hashed index of 7, at index 9:
    names[0] = "";
    names[1] = "";
    names[2] = "";
    names[3] = "Kun";
    names[4] = "Doma";
    names[5] = "Howon";
    names[6] = "Maryna";
    names[7] = "Anthony";
    names[8] = "Brandon";
    names[9] = "Cameron";
    ...
    names[39] = "";
  • Now, suppose that I want to search for the name "Cameron."
    • Applying my hash function, I would look for his name at index 7.
    • When I do not locate him there, I know that I need to keep searching farther down the table.
  • As you can see, the more collisions, the worse the efficiency of my hash table when performing the insert and search operations.
  • Given sufficient collisions in my table, search approaches a Big-O run-time of O(n), no improvement over a simple array.
  • Deleting also becomes problematic.
    • We need to deal with the empty slot left by a deletion; otherwise, a later probe sequence may stop at the gap and miss elements stored beyond it.
    • Which elements were stored at their hashed index, and which were placed farther down due to linear probing? We cannot tell just by looking at the array.
  • Is there a better approach than linear probing?
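  • Before answering that, here is a minimal sketch of linear-probing insert and search for the names table, assuming "" marks an empty bucket and the table never becomes completely full (the function names and the scaling of the length-based hash by the table size are assumptions):
#include <string>
using namespace std;

const int TABLE_SIZE = 40; //# of buckets in the names table

//Insert with linear probing: step forward from the hashed index until an empty bucket is found
void insertProbing(string names[], string name)
{
    int index = name.length() % TABLE_SIZE; //length-based hash, scaled to the table size
    while (names[index] != "")              //collision: probe the next index
        index = (index + 1) % TABLE_SIZE;   //wrap around at the end of the table
    names[index] = name;
}

//Search with linear probing: stop when we find the name or reach an empty bucket
int searchProbing(const string names[], string name)
{
    int index = name.length() % TABLE_SIZE;
    while (names[index] != "")
    {
        if (names[index] == name)
            return index;
        index = (index + 1) % TABLE_SIZE;
    }
    return -1; //name not in the table
}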

Separate Chaining
  • Another approach to collision resolution is called separate chaining.
  • Rather than limiting ourselves to storing only a single value at each index of the table, with separate chaining we store a series of 0 or more linked elements at each index.
  • In essence, we make our hash table an array of linked lists, as depicted below:
  • Now, each time that a name is inserted, and there is a collision, we simply push the new name onto the back of our list at that particular index.
  • Therefore, we could re-envision our previous names hash table like so:
    names[0] = "";
    names[1] = "";
    names[2] = "";
    names[3] = "Kun";
    names[4] = "Doma" --> "Sean";
    names[5] = "Howon" --> "Qilin";
    names[6] = "Maryna";
    names[7] = "Anthony" --> "Brandon" --> "Cameron";
    ...
    names[39] = "";
  • Now when I want to search for the name "Cameron," if his name is in the table, it must be in the list at index 7.
  • To check, simply perform a linear search through the list at names[7].
  • Later in the lesson, we will look at how to implement a version of a hash table that uses separate chaining.
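  • As a rough preview of that implementation, here is a minimal sketch of a chained table that uses the standard library's list as the chain type (the names below are illustrative; our own version may use a different list class):
#include <list>
#include <string>
using namespace std;

const int TABLE_SIZE = 40;
list<string> names[TABLE_SIZE]; //an array of linked lists: one chain per bucket

int hashName(string name) //length-based hash from earlier, scaled to the table size
{
    return name.length() % TABLE_SIZE;
}

void insertChained(string name)
{
    names[hashName(name)].push_back(name); //a collision simply extends the chain at that bucket
}

bool searchChained(string name)
{
    for (const string& value : names[hashName(name)]) //linear search through one bucket's chain
        if (value == name)
            return true;
    return false;
}
  • Note that deletion is now straightforward: we simply remove the node from the chain, with no probe sequences to repair.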


Big-O Run-Time of a Hash Table 
  • For hash table operations such as search, insert, and delete, we get the following Big-O run-times:
  • Best case: O(1)
    • Element is alone at collision free index.
  • Worst case: O(n)
    • Worst case is a linear search, which is O(n).
    • Holds for both linear probing and chaining.
  • Average case: O(1)? Essentially, yes!
  • The average-case run-time deserves a closer look. Assuming a perfect hash function, these operations are always O(1).
  • However, without a perfect hash function, average run-time will be dependent upon how big the table is and how many values it is currently storing (i.e. on how full the table is).
  • As more and more values are inserted into the table, the likelihood of collision increases resulting in longer chains.
  • The speed of hash table operations is therefore dependent on something called the load factor, which is defined to be n/k, where k is the table size (number of buckets) and n is the number of keys.
load factor = n / k
  • The larger the value for the load factor, the slower the hash table operations.
  • Therefore, average case for a hash table is sometimes considered to be O(n/k). 
  • You might be thinking: But... O(n/k) = O(n), as we do not include lower order terms and coefficients in our Big-O run-time. 
    • So, is a hash table about as efficient as a linked list then?
    • In practice, O(n/k) is still a significant improvement over O(n).
    • Moreover, if your table size is large enough, average run-time approaches constant time:
      • If I set k = 2n (table size is twice the number of keys)
      • O(n/2n) -> O(1/2) -> O(1) 

Improving Our Hashing Algorithm
  • The above hashing algorithm, which mapped a key to an index according to the length of the key, was not a good hashing algorithm.
  • What was wrong with this algorithm?
    • Too many collisions - we were unable to get an even distribution of values across the table
      • Clustering of values around buckets 2-10
    • As a related issue, some indices were unusable
      • index 0 can never hold a value as there are no names of length 0
    • Both of these problems contribute to a loss of efficiency.
  • Perhaps we can find a better hashing algorithm...
  • Let's try the following:
    • Sum the ASCII values of each character in the key
    • Scale the result by applying % table size.
int hash(string key, const int TABLE_SIZE)
{
    int index, sum = 0;
    for(int i = 0; i < key.length(); i++)
        sum += (int) key[i]; //summing the ASCII values for each character in the string
    index = sum % TABLE_SIZE; //dividing the summed ASCII values by the table size and storing the remainder as my index
    return index;
}
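  • For example, with a table size of 37 (the size used in the guidelines below), similar keys now land in different buckets:
int markIndex = hash("Mark", 37); //ASCII sum 395; 395 % 37 = 25
int mattIndex = hash("Matt", 37); //ASCII sum 406; 406 % 37 = 36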
Guidelines for Choosing a Good Hash Function
  • But, how do we know that our new algorithm is a good one?
  • Here are some guidelines for writing good hashing functions:
    1. Use all of the information provided by a key.
      • Our algorithm uses all characters in the key (the name) to compute the index.
    2. Make sure the elements to insert can be spread across the entire table.
      • By applying % operator we make sure our summed ASCII values will be better distributed throughout the table.
    3. Verify that your algorithm can map similar keys to different buckets.
      • Will values like "Mark" and "Matt" be placed in the same bucket?
      • Applying our new hashing algorithm with a table size of 37:
      • Mark: mapped to index 25
      • Matt: mapped to index 36
    4. Make sure it uses only simple operations (such as the basic arithmetic operators) to minimize run-time.
      • When you write a hash table, you will call your hashing function often (insert, search, delete).
      • Make certain your hashing function is not computationally intensive.
  • While the hashing function is an important aspect in helping to increase the effectiveness of a hash table, other aspects also need to be considered.

Guidelines for Selecting a Table Size

  • In order to effectively minimize collisions (thereby maintaining efficiency), a good rule of thumb is to make your table size twice the number of entries you are planning to insert.
  • However, more often than not, the number of entries to be inserted will not be known in advance.
  • Therefore, it may be necessary to re-size your hash table:
    • Remember the load factor: n/k (number of keys divided by table size).
    • As the load factor increases, collisions will become more common, resulting in decreased efficiency.
    • How can we decrease the load factor? We increase the table size!
    • When should we re-size the table?
      • Linear probing: Array indices are used up quickly. Re-size when load factor >= 1/2
      • Separate chaining: opinions vary; common thresholds range from 2/3 to 3/4.
    • Simplest re-sizing method:
      • Create new hash table with an array sized twice as large as the current hash table.
      • Linearly traverse the old hash table.
        • Copy each element and insert into new hash table.
        • Elements will be rehashed when inserted into new hash table.
      • Delete old hash table.
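    • A minimal sketch of this re-sizing strategy for a separate-chaining table, assuming the same length-based hash scaled by the table size (the function name and signature are illustrative):
#include <list>
#include <string>
using namespace std;

//Create a new table twice as large, rehash every element into it, and delete the old table
list<string>* resizeTable(list<string>* oldTable, int oldSize, int& newSize)
{
    newSize = oldSize * 2;                             //new array sized twice as large
    list<string>* newTable = new list<string>[newSize];
    for (int i = 0; i < oldSize; i++)                  //linearly traverse the old hash table
        for (const string& name : oldTable[i])         //copy each element from its chain
            newTable[name.length() % newSize].push_back(name); //element is rehashed on insert
    delete [] oldTable;                                //delete the old hash table
    return newTable;
}
    • Every element must be rehashed because its bucket depends on the table size, so the old indices are no longer valid.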


Wrap Up
  • With your partner, answer the questions from today's learning objectives
  • Nice 7-minute summary of Hash Tables: Harvard CS 50 Video

Upcoming Assignments

  • Lab 7 due Tuesday
  • Quiz 4 on Thursday