Learning ObjectivesBy the end of this class, you should know...
 What is a Hash Table?
 What are the advantages of using a Hash Table over other ADTs we have studied?
 Arrays?
 Linked Lists (and other linear structures)?
 Binary Search Trees?
 What is a hashing algorithm?
 How do you write a good hashing algorithm?
 What are some solutions to the problem of collisions?
 What is the efficiency of common hashing operations?
 How does the size of the Hash Table affect its efficiency?
 How do you search for an element in a Hash Table?
Announcements
 Quiz 4 on Thursday  BSTs and Hash Tables
 Lab 7 posted  Hash Tables and BSTs
 Feedback from a former student about internship search
Review Activity With a partner, complete the following:
 Insert the following values (in this order) into an empty BST: 7, 3, 2, 9, 18, 5, 14, 8, 4, 1
 Redraw the tree after calling remove on node 7 using the algorithm given in class.
 Now, trace pre, post and inorder print on the BST.
 Write the pre, post and inorder print functions for a BST
Hash Table PreAssessment: Remember This?
 ASCII Review: Write a function which prints out the ASCII value of each character in a string:
void printASCII(string str) {
//write function body here }
The Hash Table ADT A Hash Table is a structure where values are mapped or assigned to an array index according to the rules of an algorithm.
 This algorithm is called the hashing function.
 The hashing function takes an input (called a key) and then returns the index of an array.
 This array index is where the value is to be stored
 As we are going to see, hash table operations, such as search, insert, and delete can be performed in just O(1) time.
Group Activity: Hash Table Bingo!  To help you understand a Hash Table, let's play Hash Table Bingo.
 The class as a whole is going to play the role of the underlying array structure in a Hash Table.
 Each student in the class will be assigned an index based on his or her name on the class roster.
 Using a hash function, I will enter the name of a piece of candy as a key.
 The value to store at the array index
 The hash function will return to me the index at which I want to store the candy.
 If your index is returned by the hash function, you will get a piece of candy.
 At
the end of the exercise, I can hash the names of the pieces of candy in
order to find out who received a piece and who didn't.
A Simple Array Only Example  Suppose that I wish to store the names of all the students in my class.
 We could start with an array of strings.
const int CLASS_SIZE = 40; //# students enrolled in the class string names[CLASS_SIZE];
 Now, our first thought might be to store each student's name alphabetically in the array.
 However, later, if a student drops the class, I would need to:
 Use binary search to search the array for the student's name: O(log n) time.
 Delete the name and update the array by copying all the values after that student's name down by one: O(n) time
 Is there a way that I can do better?
Let's Use a Hash Table Instead  The underlying structure of a hash table is an array.
 However when we insert elements into our hash table array, we utilize a hashing algorithm
 The hashing algorithm is used to determine at which array index to place an element in our table.
 Each index is also known as a bucket.
 As we will see later, more than one element can be associated with an index.
 Let's look at a simple example of a hash table:
 Using
the array above as the basis for a hash table provides a more efficient
alternative when searching or deleting a name from the class roster.
const int CLASS_SIZE = 40; //# students enrolled in the class string names[CLASS_SIZE];
 Let's assume that I have a simple hashing algorithm to determine at which array index to store each name on my roster.
 When searching for a student's name, I would call the hashing algorithm, passing the name as input.
 The hashing algorithm would return the index with which the name is associated.
 Lookup then involves searching for the name at a single array index, which is O(1).
 Assume my hashing algorithm is implemented as the following function:
int hash(string student_name) {
int index = student_name.length();
return index;
//applying my hashing algorithm to the first name in my CIS 22C roster
int index = hash("Anthony"); //Should return 7
names[index] = "Anthony"; //storing the value at index 7
 In the above example, the value "Anthony" is called the key.
 The key is the value that is passed into the hashing algorithm to determine the placement of the name within the table.
 Now, when I add students to the roster, they will be stored at the index provided by the hashing algorithm:
names[0] = ""; names[1] = "";
names[2] = "";
names[3] = "Kun"; names[4] = "Doma";
names[5] = "Howon";
names[6] = "Maryna";
names[7] = "Anthony";
... names[39] = "";  Later, when I search for a name in my table, I can simply call the hashing algorithm:
int index = hash("Anthony");//hashing algorithm provides index to search for the name
if (names[index] == "Anthony") //check if the value at index is equal to the name I am searching for
names[index] = ""; //removing the name of the student who dropped
else cout << "Anthony isn't in the table." << endl;
 As we can see from this simple example, with a hash table, insertion, search and deletion can be performed in just O(1).
 We will discuss the BigO runtime of hash table operations later in the lesson.
 As you probably noticed, there is a problem with the above example.
 How many people in this class have a first name that is 3 letters long? 4 letters? 5 letters? 6 letters? 7 letters? 8 letters?
 This problem is known as a collision.
 A collision occurs when more than one value is mapped to the same index in a hash table.
 In my above example, most names would be mapped to the indices 2  10.
 Trivia: Average name length is 5.9 letters.
 Collisions can be mitigated with better hashing algorithms.
 However, regardless of how good a hashing algorithm is, collisions will occur.
 A perfect hash function (no collisions) can only be achieved if all values are known ahead of time; very unlikely.
 There are various approaches to resolving collisions. We will discuss two principle approaches below.
Linear Probing  One simple approach to handling collisions within a hash table is called linear probing.
 In
linear probing, if an element is hashed to an index that is already
storing another value, the new element is inserted at the next open
index.
 For example, imagine our names hash table contains the following values:
names[0] = ""; names[1] = "";
names[2] = "";
names[3] = "Kun"; names[4] = "Doma";
names[5] = "Howon";
names[6] = "Maryna";
names[7] = "Anthony";
... names[39] = "";  Now, further suppose that I want to insert the name "Brandon" into my names table.
 Woops! There is a collision when I apply my hashing algorithm.
 Index 7 is already occupied by "Anthony"
 I can use linear probing to resolve this collision.
 Using linear probing, I insert the name "Brandon" in the next available index after the hashed index of 7.
 Where would I place "Brandon" in the table?
 names[8] = "Brandon";
 Now, my table looks like the following:
names[0] = ""; names[1] = "";
names[2] = "";
names[3] = "Kun"; names[4] = "Doma";
names[5] = "Howon";
names[6] = "Maryna";
names[7] = "Anthony";
names[8] = "Brandon";
... names[39] = "";  Next,
suppose I want to insert the name "Cameron". Where would his name go as
index 7 already contains the name "Anthony" and index 8 already
contains the name "Brandon"?
 In this case, I would need to place "Cameron" two indices away from his hashed index of 7, at index 9:
names[0] = ""; names[1] = "";
names[2] = "";
names[3] = "Kun"; names[4] = "Doma";
names[5] = "Howon";
names[6] = "Maryna";
names[7] = "Anthony";
names[8] = "Brandon";
names[9] = "Cameron";
... names[39] = "";  Now, suppose that I want to search for the name "Cameron."
 Applying my hash function, I would look for his name at index 7.
 When, I do not locate him there, I know that I need to keep searching farther down the table.
 As you can see, the more collisions, the worse the efficiency of my hash table when performing the insert and search operations.
 Given sufficient collisions in my table, search approaches a BigO runtime of O(n), no improvement over a simple array.
 Deleting also becomes problematic.
 Need to fill empty space from deletion.
 Which elements were properly indexed? Which were placed due to linear probing?
 Is there a better approach than linear probing?
Separate Chaining  Another approach to collision resolution is called separate chaining.
 Rather
than limiting ourselves to storing only a single value at each index of
the table, with separate chaining we store a series of 0 or more linked
elements at each index.
 In essence, we make our hash table an array of linked lists, as depicted below:
 Now,
each time that a name is inserted, and there is a collision, we simply
push the new name onto the back of our list at that particular index.
 Therefore, we could reenvision our previous names hash table like so:
names[0] = ""; names[1] = "";
names[2] = "";
names[3] = "Kun"; names[4] = "Doma" > "Sean";
names[5] = "Howon" > "Qilin";
names[6] = "Maryna";
names[7] = "Anthony" > "Brandon" > "Cameron";
... names[39] = "";  Now when I want to search for the name "Cameron," if his name is in the table, it must be in the list at index 7.
 To check, simply perform a linear search through the list at names[7].
 Later in the lesson, we will look at how to implement a version of a hash table that uses separate chaining.
BigO RunTime of a Hash Table  For hash table operations such as search, insert and delete we get the following BigO runtimes:
 Best case: O(1)
 Element is alone at collision free index.
 Worst case: O(n)
 Worst case is a linear search, which is O(n).
 Holds for both linear probing and chaining.
 Average case: O(1)? Yes!
 Average
case runtime of hash table operations is a question. Assuming a
perfect hash function, these operations are always O(1).
 However,
without a perfect hash function, average runtime will be dependent
upon how big the table is and how many values it is currently storing
(i.e. on how full the table is).
 As more and more values are inserted into the table, the likelihood of collision increases resulting in longer chains.
 The speed of hash table operations is therefore dependent on something called the load factor, which is defined to be n/k, where k is the the table size (number of buckets) and n is the number of keys.
load factor = n / k
 The larger the value for the load factor, the slower the hash table operations.
 Therefore, average case for a hash table is sometimes considered to be O(n/k).
 You might be thinking: But... O(n/k) = O(n), as we do not include lower order terms and coefficients in our BigO runtime.
 So, is a hash table about as efficient as a linked list then?
 In practice, O(n/k) is still a practical improvement over O(n).
 Moreover, if your table size is large enough, average runtime approaches constant time:
 If I set k = 2n (table size is twice the number of keys)
 O(n/2n) > O(1/2) > O(1)
Improving Our Hashing Algorithm  The
above hashing algorithm, which mapped a key to an index according to
the length of the key, was not a good hashing algorithm.
 What was wrong with this algorithm?
 Too many collisions  we were unable to get an even distribution of values across the table
 Clustering of values around buckets 210
 As a related issue, some indices were unusable
 index 0 can never hold a value as there are no names of length 0
 Both of these problems contribute to a loss of efficiency.
 Perhaps we can find a better hashing algorithm...
 Let's try the following:
 Sum the ASCII values of each character in the key
 Scale the result by applying % table size.
int hash(string key, const int TABLE_SIZE) { int index, sum = 0; for(int i = 0; i < key.length(); i++)
sum += (int) key[i]; //summing the ASCII values for each character in the string
index = sum % TABLE_SIZE; //dividing the summed ASCII values by 35 && storing remainder as my index
return index;
} . Guidelines for Choosing a Good Hash Function But, how do we know that our new algorithm is a good one?
 Here are some guidelines for writing good hashing functions:
 Use all of the information provided by a key.
 Our algorithm uses all characters in the name to create the key.
 Make sure the elements to insert can be spread across the entire table.
 By applying % operator we make sure our summed ASCII values will be better distributed throughout the table.
 Verify that your algorithm can map similar keys to different buckets.
 Will values like "Mark" and "Matt" be placed in the same bucket?
 Applying the hashing algorithm, with table size of 37
 Mark: mapped to index 25
 Matt: mapped to index 36
 Make sure it uses only simple operations (such as the basic arithmetic operators) to minimize runtime.
 When you write a hash table, you will call your hashing function often (insert, search, delete).
 Make certain your hashing function is not computationally intensive.
 While
the hashing function is an important aspect in helping to increase the
effectiveness of a hash table, other aspects also need to be considered.
Guidelines for Selecting a Table Size  In
order to effectively minimize collisions (thereby maintaining
efficiency), a good ruleofthumb is to make your table size be twice
the number of entries you are planning to insert.
 However, more often than not, the number of entries to be inserted will not be known in advance.
 Therefore, it may be necessary to resize your hash table:
 Remember the load factor: n /k (number_of_keys divided by table_size)
 As the load factor increases, collisions will become more common, resulting in decreased efficiency.
 How can we decrease the load factor? We increase the table size!
 When should we resize the table?
 Linear probing: Array indices are used up quickly. Resize when load factor >= 1/2
 Separate Chaining: Opinions vary: between 2/3 and 3/4.
 Simplest resizing method:
 Create new hash table with an array sized twice as large as the current hash table.
 Linearly traverse the old hash table.
 Copy each element and insert into new hash table.
 Elements will be rehashed when inserted into new hash table.
 Delete old hash table.
Wrap Up
 With your partner, answer the questions from today's learning objectives
 Nice 7 minute summary of Hash Tables: Harvard CS 50 Video
Upcoming Assignments Lab 7 due Tuesday
 Quiz 4 on Thursday
