Welcome to Lesson 13!

Learning Objectives
By the end of this class, you should know...
  • What is a Hash Table?
  • What is a hashing algorithm?
  • What are some solutions to the problem of collisions?
  • What is the difference between linear probing and separate chaining?


1. Hash Table Pre-Assessment: Remember This?
  • Unicode Review: Write a method which prints out the Unicode value of each character in a String:
public static void printUnicode(String str)
{
    //write method body here
}





2. The Hash Table ADT

A hash table is an array that uses a hash function to determine placement of elements (keys)
  • A Hash Table is a structure where values are mapped to an array index according to the rules of an algorithm.
    •  This algorithm is called the hashing algorithm.
    •  The hashing algorithm takes an input (called a key) and then returns the index of an array.
    •  This array index (also called a bucket) is the location in the array where the value can be stored.
    •  The key can be the data itself or some portion of the data, as in the example below where the hash table is storing employee objects.
  • As we are going to see, hash table operations, such as search, insert, and delete can be performed in just O(1) time!

2.1. A Hash Table Example

  • Suppose that I wish to store the names of all the students in my class.
  • We could start with an array of Strings.
final int CLASS_SIZE = 42; //# students enrolled in the class
String[CLASS_SIZE] names = new String[];
  • The underlying structure of a hash table is an array.
  • However to insert elements into our hash table array, we utilize a hashing algorithm.
  • In general, a hashing algorithm is used to produce a value, usually a number, that is consistent for a given input.
  • For a hash table, the hashing algorithm is used to determine at which array index (or bucket) to place a value in the table.
    • We will also need to use the hashing algorithm to search for a value (at which index was the value placed in the table) and to remove a value.
  • Let's assume a simple hashing algorithm to determine at which array index to store each name on the roster. 
    • When searching for a student by name, I would call the hashing algorithm, passing the name as input.
    • The hashing algorithm would return the index with which the name is associated.


  • Assume the hashing algorithm is implemented as the following method:
private int hash(String student_name) //student_name is the key
{
    int index = student_name.length();
    return index;
}
...
//applying the hash algorithm to the first name in my CIS 22C roster
int index = hash("Anthony"); //Should return 7
names[index] = "Anthony"; //storing the value at index 7
  • In the above example, the value "Anthony" is the key.
  • The key is the value that is used by the hash method to determine the index of the name within the table.
  • Now, when I add students to the roster, they will be stored at the index provided by the hash method:
    names[0] = "";
    names[1] = "";
    names[2] = "";
    names[3] = "Kun";
    names[4] = "Doma";
    names[5] = "Howon";
    names[6] = "Maryna";
    names[7] = "Anthony";
    ...
    names[41] = "";
  • Later, when I search for a name in my table, I can simply call the hash method:
hash("Anthony") => index at which Anthony is stored if he is in the table

2.2. Collisions



  • As you probably noticed, there is a problem with the above example.
  • How many people in this class have a first name that is 3 letters long? 4 letters? 5 letters? 6 letters? 7 letters? 8 letters?
  • This problem is known as a collision.
    • A collision occurs when more than one value is mapped to the same index in a hash table.
    • In my above example, most names would be mapped to the indices 2 - 10.
    • Trivia: Average name length is 5.9 letters.
  • Collisions can be mitigated with better hashing algorithms.
  • However, regardless of how good a hashing algorithm is, collisions will occur.
  • perfect hash method (no collisions) can only be achieved if all values are known ahead of time; very unlikely.
  • There are various approaches to resolving collisions. We will discuss two principle approaches below.

2.3. Linear Probing
  • One simple approach to handling collisions within a hash table is called linear probing.
  • In linear probing, if an element is hashed to an index that is already storing another value, the new element is inserted at the next open index.
  • For example, imagine our names hash table contains the following values:
         names[0] = "";
    names[1] = "";
    names[2] = "";
    names[3] = "Kun";
    names[4] = "Doma";
    names[5] = "Howon";
    names[6] = "Maryna";
    names[7] = "Anthony";
    names[8] = "";
    names[9] = "";
    ...
    names[41] = "";
  • Now, further suppose that I want to insert the name "Brandon" into my names table.
    • Woops! There is a collision when I apply my hash algorithm.
    • Index 7 is already occupied by "Anthony"
    • I can use linear probing to resolve this collision.
  • Using linear probing, I insert the name "Brandon" in the next available index after the hashed index of 7.
    • Where would I place "Brandon" in the table?
    • names[8] = "Brandon";
  • Now, my table looks like the following:
    names[0] = "";
    names[1] = "";
    names[2] = "";
    names[3] = "Kun";
    names[4] = "Doma";
    names[5] = "Howon";
    names[6] = "Maryna";
    names[7] = "Anthony";
    names[8] = "Brandon"; //placed at next available bucket
    names[9] = "";
    ...
    names[41] = "";
  • Next, suppose I want to insert the name "Cameron". Where would his name go as index 7 already contains the name "Anthony" and index 8 already contains the name "Brandon"?
  • In this case, I would need to place "Cameron" two indices away from his hashed index of 7, at index 9:
    names[0] = "";
    names[1] = "";
    names[2] = "";
    names[3] = "Kun";
    names[4] = "Doma";
    names[5] = "Howon";
    names[6] = "Maryna";
    names[7] = "Anthony";
    names[8] = "Brandon";
    names[9] = "Cameron"; //placed at next available bucket
    ...
    names[41] = "";
  • Now, suppose that I want to search for the name "Cameron."
    • Applying my hash method, I would look for his name at index 7.
    • When I do not locate him there, I know that I need to keep searching farther down the table.
  • As you can see, the more collisions, the worse the efficiency of my hash table when performing the insert and search operations.
  • Given sufficient collisions in my table, search approaches a Big-O run-time of O(n), no improvement over a simple array.
  • Is there another approach to linear probing?

2.4. Separate Chaining
  • Another approach to collision resolution is called separate chaining.
  • Rather than limiting ourselves to storing only a single value at each index of the table, with separate chaining we store a series of 0 or more linked elements at each index.
  • In essence, we make our hash table an array of linked lists, as depicted below:
  • Now, if a name is inserted, and there is a collision, we simply push the new name onto the back of our list at that particular index.
  • Therefore, we could re-envision our previous names hash table like so:
    names[0] = "";
    names[1] = "";
    names[2] = "";
    names[3] = "Kun";
    names[4] = "Doma" --> "Sean";
    names[5] = "Howon" --> "Qilin";
    names[6] = "Maryna";
    names[7] = "Anthony" --> "Brandon" --> "Cameron";
    ...
    names[41] = "";
  • Now when I want to search for the name "Cameron," if his name is in the table, it must be in the list at index 7.
  • To check, simply perform a linear search through the list at names[7].
  • For lab 7, we will look at how to implement a version of a hash table that uses separate chaining.


Wrap Up
  • Answer the review questions on Canvas for this lesson
  • Nice 7 minute summary of Hash Tables: Harvard CS 50 Video
  • Review of Unicode Video

Upcoming Assignments

  • Lab 7 due Monday
  • Study for your Quiz

~Enjoy Your Day!~