Learning Objectives
By the end of this class, you should know...
• What is a Hash Table?
• What are the advantages of using a Hash Table over other ADTs we have studied?
• Arrays?
• Linked Lists (and other linear structures)?
• Binary Search Trees?
• What is a hashing algorithm?
• How do you write a good hashing algorithm?
• What are some solutions to the problem of collisions?
• What is the difference between linear probing and separate chaining?
• What is the efficiency of the Hash Table operations?
• How does the size of the Hash Table affect its efficiency?
• When is it necessary to resize a hash table and how would you write a simple resizing function?

Announcements
• Quiz after the break
• Lab 7 posted - Hash Tables and BSTs

Hash Table Pre-Assessment: Remember This?
• ASCII Review: Write a function which prints out the ASCII value of each character in a string:
void printASCII(string str)
{
    //write function body here
}

• A Hash Table is a structure in which values are mapped to array indices according to the rules of an algorithm.
• This algorithm is called the hashing algorithm.
• The hashing algorithm takes an input (called a key) and returns an index into an array.
• This array index (also called a bucket) is the location in the array where the value is stored.
• The key can be the data itself or some portion of the data, as in the example below where the hash table stores employee objects.
• As we are going to see, hash table operations such as search, insert, and delete can be performed in O(1) time on average!

Group Activity: Hash Table Bingo!
• To help you get a feeling for the way a hash table is organized, let's play Hash Table Bingo.
• The class as a whole is going to play the role of the underlying array structure in a Hash Table.
• Each student in the class will be assigned an index.
• Using a hash function, I will enter the name of a piece of candy as a key.
• The hash function will return the index at which to store the candy.
• If your index number is returned by the hash function, you will get a piece of candy.
• At the end of the exercise, I could hash the names of the pieces of candy in order to find out who received a piece and who didn't.

Another Hash Table Example

• Suppose that I wish to store the names of all the students in my class.
const int CLASS_SIZE = 42; //# students enrolled in the class
string names[CLASS_SIZE];
• The underlying structure of a hash table is an array.
• However, to insert elements into our hash table array, we utilize a hashing algorithm.
• In general, a hashing algorithm is used to produce a value, usually a number, that is consistent for a given input.
• For a hash table, the hashing algorithm is used to determine at which array index (or bucket) to place a value in the table.
• We will also need the hashing algorithm to search for a value (to determine at which index the value was placed in the table) and to remove a value.
• Let's assume a simple hashing algorithm to determine at which array index to store each name on the roster.
• When searching for a student by name, I would call the hashing algorithm, passing the name as input.
• The hashing algorithm would return the index with which the name is associated.

• Assume the hashing algorithm is implemented as the following function:
int hash(string student_name)
{
    int index = student_name.length();
    return index;
}
...
//applying the hash algorithm to the first name in my CIS 22C roster
int index = hash("Anthony"); //Should return 7
names[index] = "Anthony"; //storing the value at index 7
• In the above example, the value "Anthony" is the key.
• The key is the value that is used by the hash function to determine the index of the name within the table.
• Now, when I add students to the roster, they will be stored at the index provided by the hash function:
names[0] = "";
names[1] = "";
names[2] = "";
names[3] = "Kun";
names[4] = "Doma";
names[5] = "Howon";
names[6] = "Maryna";
names[7] = "Anthony";
...
names[41] = "";
• Later, when I search for a name in my table, I can simply call the hash function:
hash("Anthony") => index at which Anthony is stored if he is in the table

Collisions

• As you probably noticed, there is a problem with the above example.
• How many people in this class have a first name that is 3 letters long? 4 letters? 5 letters? 6 letters? 7 letters? 8 letters?
• This problem is known as a collision.
• A collision occurs when more than one value is mapped to the same index in a hash table.
• In my above example, most names would be mapped to the indices 2 - 10.
• Trivia: Average name length is 5.9 letters.
• Collisions can be mitigated with better hashing algorithms.
• However, regardless of how good a hashing algorithm is, collisions will occur.
• A perfect hash function (one with no collisions) can only be achieved if all values are known ahead of time, which is very unlikely.
• There are various approaches to resolving collisions. We will discuss two principal approaches below.

Linear Probing
• One simple approach to handling collisions within a hash table is called linear probing.
• In linear probing, if an element is hashed to an index that is already storing another value, the new element is inserted at the next open index.
• For example, imagine our names hash table contains the following values:
names[0] = "";
names[1] = "";
names[2] = "";
names[3] = "Kun";
names[4] = "Doma";
names[5] = "Howon";
names[6] = "Maryna";
names[7] = "Anthony";
names[8] = "";
names[9] = "";
...
names[41] = "";
• Now, further suppose that I want to insert the name "Brandon" into my names table.
• Whoops! There is a collision when I apply my hash algorithm.
• Index 7 is already occupied by "Anthony"
• I can use linear probing to resolve this collision.
• Using linear probing, I insert the name "Brandon" in the next available index after the hashed index of 7.
• Where would I place "Brandon" in the table?
• names[8] = "Brandon";
• Now, my table looks like the following:
names[0] = "";
names[1] = "";
names[2] = "";
names[3] = "Kun";
names[4] = "Doma";
names[5] = "Howon";
names[6] = "Maryna";
names[7] = "Anthony";
names[8] = "Brandon"; //placed at next available bucket
names[9] = "";
...
names[41] = "";
• Next, suppose I want to insert the name "Cameron". Where would his name go as index 7 already contains the name "Anthony" and index 8 already contains the name "Brandon"?
• In this case, I would need to place "Cameron" two indices away from his hashed index of 7, at index 9:
names[0] = "";
names[1] = "";
names[2] = "";
names[3] = "Kun";
names[4] = "Doma";
names[5] = "Howon";
names[6] = "Maryna";
names[7] = "Anthony";
names[8] = "Brandon";
names[9] = "Cameron"; //placed at next available bucket
...
names[41] = "";
• Now, suppose that I want to search for the name "Cameron."
• Applying my hash function, I would look for his name at index 7.
• When I do not locate him there, I know I need to keep probing successive indices until I either find him or reach an empty bucket.
• As you can see, the more collisions, the worse the efficiency of my hash table when performing the insert and search operations.
• Given sufficient collisions in my table, search approaches a Big-O run-time of O(n), no improvement over a simple array.
• Is there an alternative approach to linear probing?

Separate Chaining
• Another approach to collision resolution is called separate chaining.
• Rather than limiting ourselves to storing only a single value at each index of the table, with separate chaining we store a chain of zero or more linked elements at each index.
• In essence, we make our hash table an array of linked lists, as depicted below:
• Now, if a name is inserted, and there is a collision, we simply push the new name onto the back of our list at that particular index.
• Therefore, we could re-envision our previous names hash table like so:
names[0] = "";
names[1] = "";
names[2] = "";
names[3] = "Kun";
names[4] = "Doma" --> "Sean";
names[5] = "Howon" --> "Qilin";
names[6] = "Maryna";
names[7] = "Anthony" --> "Brandon" --> "Cameron";
...
names[41] = "";
• Now when I want to search for the name "Cameron," if his name is in the table, it must be in the list at index 7.
• To check, simply perform a linear search through the list at names[7].
• For Lab 7, we will look at how to implement a version of a hash table that uses separate chaining.

Big-O Run-Time of a Hash Table
• For hash table operations such as search, insert, and remove, we get the following Big-O run-times:
• Best case: O(1)
• The element is alone at a collision-free index.
• Worst case: O(n)
• The worst case degenerates to a linear search, which is O(n).
• This holds for both linear probing and separate chaining.

Worst Case for Separate Chaining
• In the worst case, every key hashes to the same bucket, producing a single chain of n elements that must be searched linearly.

• Average case: O(1)? Yes!
• What is the average-case run-time of hash table operations? Assuming a perfect hash function, these operations are always O(1).
• However, without a perfect hash function, average run-time will be dependent upon how big the table is and how many values it is currently storing (i.e. on how full the table is).
• As more and more values are inserted into the table, the likelihood of collision increases, resulting in longer chains.
• The speed of hash table operations therefore depends on something called the load factor, defined as n/k, where n is the number of keys and k is the table size (number of buckets).
load factor = n / k
where,
n = the number of keys (elements)
k = the table size (number of buckets)
• The larger the value for the load factor, the slower the hash table operations.
• Therefore, average case for a hash table is sometimes considered to be O(n/k).
• You might be thinking: But... O(n/k) = O(n), as we do not include lower order terms and coefficients in our Big-O run-time.
• So, is a hash table about as efficient as a linked list then?
• In practice, O(n/k) is still a practical improvement over O(n).
• Moreover, if your table size is large enough, average run-time approaches constant time.
• Rule of thumb: let your table size be twice to three times the number of anticipated keys.
• If I set k = 2n (table size is twice the number of keys)
• O(n/2n) -> O(1/2) -> O(1)

Improving Our Hash Algorithm
• The above hash algorithm, which mapped a key to an index according to the length of the key, was not a good hashing algorithm.
• What was wrong with this algorithm?
• Too many collisions - we were unable to get an even distribution of values across the table
• Clustering of values around buckets 2-10
• As a related issue, some indices were unusable
• index 0 can never hold a value as there are no names of length 0
• Both of these problems contribute to a loss of efficiency.
• Perhaps we can find a better hash algorithm...
• Let's try the following:
• Sum the ASCII values of each character in the key
• Scale the result by applying % table size.
int hash(string key, const int TABLE_SIZE)
{
    int index, sum = 0;
    for (int i = 0; i < (int) key.length(); i++)
        sum += (int) key[i]; //summing the ASCII values of each character in the string
    index = sum % TABLE_SIZE; //dividing the summed ASCII values by TABLE_SIZE
                              //and storing the remainder as the index
    return index;
}

Guidelines for Choosing a Good Hash Function
• But, how do we know that our new algorithm is a good one?
• Here are some guidelines for writing good hashing functions:
1. Use all of the information provided by a key.
• Our algorithm uses all characters in the name to create the key.
2. Make sure the elements to insert can be spread evenly across the entire table.
• By applying the % operator, we make sure our summed ASCII values will be better distributed throughout the table.
3. Verify that your algorithm can map similar keys to different buckets.
• Will values like "Mark" and "Matt" be placed in the same bucket?
• Applying the hashing algorithm with a table size of 37:
• Mark: mapped to index 25
• Matt: mapped to index 36
4. Make sure it uses only simple operations (such as the basic arithmetic operators) to minimize run-time.
• When you write a hash table, you will call your hashing function often (insert, search, delete).
• Make certain your hashing function is not computationally intensive.
• While the hashing function is an important aspect in helping to increase the effectiveness of a hash table, other aspects also need to be considered.

Guidelines for Selecting a Table Size

• In order to effectively minimize collisions (thereby maintaining efficiency), a good rule-of-thumb is to make your table size be twice the number of entries you are planning to insert.
• However, more often than not, the number of entries to be inserted will not be known in advance.
• Therefore, it may be necessary to re-size your hash table:
• Remember the load factor: n / k (number_of_keys divided by table_size)
• As the load factor increases, collisions will become more common, resulting in decreased efficiency.
• How can we decrease the load factor? We increase the table size!
• When should we re-size the table?
• Resizing also requires computation time, so only resize when necessary to maintain lookup efficiency.
• Linear probing: Array indices are used up quickly. Re-size when load factor >= 1/2
• Separate chaining: opinions vary; common thresholds fall between 2/3 and 3/4.
• Simplest re-sizing method:
• Create new hash table with an array sized twice as large as the current hash table.
• Linearly traverse the old hash table.
• Copy each element and insert into new hash table.
• Elements will be rehashed when inserted into new hash table.
• Delete old hash table.