Learning Objectives
By the end of this class, you should know...
 What is a Hash Table?
 What are the advantages of using a Hash Table over other ADTs we have studied?
 Arrays?
 Linked Lists (and other linear structures)?
 Binary Search Trees?
 What is a hashing algorithm?
 How do you write a good hashing algorithm?
 What are some solutions to the problem of collisions?
 What is the difference between linear probing and separate chaining?
 What is the efficiency of the Hash Table operations?
 How does the size of the Hash Table affect its efficiency?
 When is it necessary to resize a hash table and how would you write a simple resizing function?
Announcements
 Quiz after the break
 Lab 7 posted - Hash Tables and BSTs
Hash Table PreAssessment: Remember This?
 ASCII Review: Write a function which prints out the ASCII value of each character in a string:
void printASCII(string str) {
    //write function body here
}
The Hash Table ADT
 A Hash Table is a structure where values are mapped to an array index according to the rules of an algorithm.
 This algorithm is called the hashing algorithm.
 The hashing algorithm takes an input (called a key) and then returns the index of an array.
 This array index (also called a bucket) is the location in the array where the value can be stored.
 The key can be the data itself or some portion of the data, as in the example below where the hash table is storing employee objects.
 As we are going to see, hash table operations, such as search, insert, and delete can be performed in just O(1) time!
Group Activity: Hash Table Bingo!
 To help you get a feeling for the way a hash table is organized, let's play Hash Table Bingo.
 The class as a whole is going to play the role of the underlying array structure in a Hash Table.
 Each student in the class will be assigned an index.
 Using a hash function, I will enter the name of a piece of candy as a key.
 The hash function will return the index at which to store the candy.
 If your index number is returned by the hash function, you will get a piece of candy.
 At the end of the exercise, I could hash the names of the pieces of candy in order to find out who received a piece and who didn't.
Another Hash Table Example
 Suppose that I wish to store the names of all the students in my class.
 We could start with an array of strings.
const int CLASS_SIZE = 42; //# students enrolled in the class
string names[CLASS_SIZE];
 The underlying structure of a hash table is an array.
 However, to insert elements into our hash table array, we utilize a hashing algorithm.
 In general, a hashing algorithm is used to produce a value, usually a number, that is consistent for a given input.
 For a hash table, the hashing algorithm is used to determine at which array index (or bucket) to place a value in the table.
 We will also need to use the hashing algorithm to search for a value (at which index was the value placed in the table) and to remove a value.
 Let's assume a simple hashing algorithm to determine at which array index to store each name on the roster.
 When searching for a student by name, I would call the hashing algorithm, passing the name as input.
 The hashing algorithm would return the index with which the name is associated.
 Assume the hashing algorithm is implemented as the following function:
int hash(string student_name) {
    int index = student_name.length();
    return index;
}

//applying the hash algorithm to the first name in my CIS 22C roster
int index = hash("Anthony"); //Should return 7
names[index] = "Anthony"; //storing the value at index 7
 In the above example, the value "Anthony" is the key.
 The key is the value that is used by the hash function to determine the index of the name within the table.
 Now, when I add students to the roster, they will be stored at the index provided by the hash function:
names[0] = "";
names[1] = "";
names[2] = "";
names[3] = "Kun";
names[4] = "Doma";
names[5] = "Howon";
names[6] = "Maryna";
names[7] = "Anthony";
...
names[41] = "";
 Later, when I search for a name in my table, I can simply call the hash function:
hash("Anthony") => index at which Anthony is stored if he is in the table
 As you probably noticed, there is a problem with the above example.
 How many people in this class have a first name that is 3 letters long? 4 letters? 5 letters? 6 letters? 7 letters? 8 letters?
 This problem is known as a collision.
 A collision occurs when more than one value is mapped to the same index in a hash table.
 In my above example, most names would be mapped to the indices 2-10.
 Trivia: Average name length is 5.9 letters.
 Collisions can be mitigated with better hashing algorithms.
 However, regardless of how good a hashing algorithm is, collisions will occur.
 A perfect hash function (no collisions) can only be achieved if all values are known ahead of time; very unlikely.
 There are various approaches to resolving collisions. We will discuss two principal approaches below.
Linear Probing
 One simple approach to handling collisions within a hash table is called linear probing.
 In linear probing, if an element is hashed to an index that is already storing another value, the new element is inserted at the next open index.
 For example, imagine our names hash table contains the following values:
names[0] = "";
names[1] = "";
names[2] = "";
names[3] = "Kun";
names[4] = "Doma";
names[5] = "Howon";
names[6] = "Maryna";
names[7] = "Anthony";
names[8] = "";
names[9] = "";
...
names[41] = "";
 Now, further suppose that I want to insert the name "Brandon" into my names table.
 Whoops! There is a collision when I apply my hash algorithm.
 Index 7 is already occupied by "Anthony".
 I can use linear probing to resolve this collision.
 Using linear probing, I insert the name "Brandon" in the next available index after the hashed index of 7.
 Where would I place "Brandon" in the table?
 names[8] = "Brandon";
 Now, my table looks like the following:
names[0] = "";
names[1] = "";
names[2] = "";
names[3] = "Kun";
names[4] = "Doma";
names[5] = "Howon";
names[6] = "Maryna";
names[7] = "Anthony";
names[8] = "Brandon"; //placed at next available bucket
names[9] = "";
...
names[41] = "";
 Next, suppose I want to insert the name "Cameron". Where would his name go, as index 7 already contains the name "Anthony" and index 8 already contains the name "Brandon"?
 In this case, I would need to place "Cameron" two indices away from his hashed index of 7, at index 9:
names[0] = "";
names[1] = "";
names[2] = "";
names[3] = "Kun";
names[4] = "Doma";
names[5] = "Howon";
names[6] = "Maryna";
names[7] = "Anthony";
names[8] = "Brandon";
names[9] = "Cameron"; //placed at next available bucket
...
names[41] = "";
 Now, suppose that I want to search for the name "Cameron."
 Applying my hash function, I would look for his name at index 7.
 When I do not locate him there, I know that I need to keep searching farther down the table.
 As you can see, the more collisions, the worse the efficiency of my hash table when performing the insert and search operations.
 Given sufficient collisions in my table, search approaches a Big-O runtime of O(n), no improvement over a simple array.
 Is there an alternative to linear probing?
Separate Chaining
 Another approach to collision resolution is called separate chaining.
 Rather than limiting ourselves to storing only a single value at each index of the table, with separate chaining we store a series of 0 or more linked elements at each index.
 In essence, we make our hash table an array of linked lists.
 Now, if a name is inserted, and there is a collision, we simply
push the new name onto the back of our list at that particular index.
 Therefore, we could re-envision our previous names hash table like so:
names[0] = "";
names[1] = "";
names[2] = "";
names[3] = "Kun";
names[4] = "Doma" -> "Sean";
names[5] = "Howon" -> "Qilin";
names[6] = "Maryna";
names[7] = "Anthony" -> "Brandon" -> "Cameron";
...
names[41] = "";
 Now when I want to search for the name "Cameron," if his name is in the table, it must be in the list at index 7.
 To check, simply perform a linear search through the list at names[7].
 For lab 7, we will look at how to implement a version of a hash table that uses separate chaining.
Big-O Run-Time of a Hash Table
 For hash table operations such as search, insert, and remove, we get the following Big-O runtimes:
 Best case: O(1)
 Element is alone at collision free index.
 Worst case: O(n)
 Worst case is a linear search, which is O(n).
 Holds for both linear probing and chaining.
 Average case: O(1)? Yes!
 The average-case runtime of hash table operations is a more subtle question. Assuming a perfect hash function, these operations are always O(1).
 However, without a perfect hash function, average runtime will be dependent upon how big the table is and how many values it is currently storing (i.e., on how full the table is).
 As more and more values are inserted into the table, the likelihood of collision increases, resulting in longer chains.
 The speed of hash table operations is therefore dependent on something called the load factor, which is defined to be n/k, where k is the table size (number of buckets) and n is the number of keys.
load factor = n / k
where,
n = the number of keys (elements)
k = the table size (number of buckets)
 The larger the value for the load factor, the slower the hash table operations.
 Therefore, average case for a hash table is sometimes considered to be O(n/k).
 You might be thinking: But... O(n/k) = O(n), as we do not include lower-order terms and coefficients in our Big-O runtime.
 So, is a hash table about as efficient as a linked list then?
 In practice, O(n/k) is still a practical improvement over O(n).
 Moreover, if your table size is large enough, average runtime approaches constant time.
 Rule of thumb: let your table size be twice to three times the number of anticipated keys.
 If I set k = 2n (table size is twice the number of keys)
 O(n/2n) -> O(1/2) -> O(1)
Improving Our Hash Algorithm
 The above hash algorithm, which mapped a key to an index according to the length of the key, was not a good hashing algorithm.
 What was wrong with this algorithm?
 Too many collisions - we were unable to get an even distribution of values across the table
 Clustering of values around buckets 2-10
 As a related issue, some indices were unusable
 index 0 can never hold a value as there are no names of length 0
 Both of these problems contribute to a loss of efficiency.
 Perhaps we can find a better hash algorithm...
 Let's try the following:
 Sum the ASCII values of each character in the key
 Scale the result by applying % table size.
int hash(string key, const int TABLE_SIZE) {
    int index, sum = 0;
    for (int i = 0; i < key.length(); i++)
        sum += (int) key[i]; //summing the ASCII values for each character in the string
    index = sum % TABLE_SIZE; //dividing the summed ASCII values by TABLE_SIZE
                              //and storing the remainder as my index
    return index;
}
Guidelines for Choosing a Good Hash Function
But, how do we know that our new algorithm is a good one?
 Here are some guidelines for writing good hashing functions:
 Use all of the information provided by a key.
 Our algorithm uses all characters in the name to compute the index.
 Make sure the elements to insert can be spread evenly across the entire table.
 By applying the % operator, we make sure our summed ASCII values will be better distributed throughout the table.
 Verify that your algorithm can map similar keys to different buckets.
 Will values like "Mark" and "Matt" be placed in the same bucket?
 Applying the hashing algorithm, with table size of 37
 Mark: mapped to index 25
 Matt: mapped to index 36
 Make sure it uses only simple operations (such as the basic arithmetic operators) to minimize runtime.
 When you write a hash table, you will call your hashing function often (insert, search, delete).
 Make certain your hashing function is not computationally intensive.
 While the hashing function is an important aspect in helping to increase the effectiveness of a hash table, other aspects also need to be considered.
Guidelines for Selecting a Table Size
 In order to effectively minimize collisions (thereby maintaining efficiency), a good rule of thumb is to make your table size twice the number of entries you are planning to insert.
 However, more often than not, the number of entries to be inserted will not be known in advance.
 Therefore, it may be necessary to resize your hash table:
 Remember the load factor: n / k (number_of_keys divided by table_size)
 As the load factor increases, collisions will become more common, resulting in decreased efficiency.
 How can we decrease the load factor? We increase the table size!
 When should we resize the table?
 Resizing also requires computation time, so only resize when necessary to maintain lookup efficiency.
 Linear probing: array indices are used up quickly, so resize when load factor >= 1/2.
 Separate chaining: opinions vary, but typically resize when load factor reaches 2/3 to 3/4.
 Simplest resizing method:
 Create new hash table with an array sized twice as large as the current hash table.
 Linearly traverse the old hash table.
 Copy each element and insert into new hash table.
 Elements will be rehashed when inserted into new hash table.
 Delete old hash table.
Wrap Up
 With your partner, answer the questions from today's learning objectives
 Nice 7-minute summary of Hash Tables: Harvard CS 50 Video
