Welcome to Lesson 13!
Learning Objectives
 How do you write BST search and remove?
 What is a Hash Table?
 What are the advantages of using a Hash Table over other ADTs we have studied?
 Arrays
 Linked Lists (and other linear structures)
 Binary Search Trees
 What is a hashing algorithm?
 How do you write a good hashing algorithm?
 What are some solutions to the problem of collisions?
 What is the efficiency of common hashing operations?
 How does the size of the Hash Table affect its efficiency?
 How do you search for an element in a Hash Table?
Announcements
 Quiz 5 on Thursday
 Lab 6 due Thursday
 Today you will learn about the second of two data structures you will need for your course project:
 BST
 Hash Tables
BSTs Continued
Searching a BST
 Searching a BST is an almost identical process to adding a new node to the tree.
 In fact, we can use the same image as for insertion!
 Let's trace an example, using the same example as for insertion.
 The Pseudocode is almost identical to insert as well.
 However, search returns a boolean instead of being a void function
Pseudocode for findHelper(root, value)
1. Base Case: the value is equal to the data stored at the root
   1a. return true
2. Check if the value being sought is smaller than the data stored at the root
   2a. If root left is NULL
       2aa. return false
   2b. If root left is not NULL
       2bb. recursively call the findHelper function, passing it root left and the value
3. Otherwise
   3a. If root right is NULL
       3aa. return false
   3b. If root right is not NULL
       3bb. recursively call the findHelper function, passing it root right and the value
Writing the BST Search Function find
 Add the following function prototypes to your BinarySearchTree.h file.
private: bool findHelper(NodePtr root, bstitem value);
public: bool find(bstitem value);
 Then, below the class definition, but inside of BinarySearchTree.h, add the following wrapper function definition for contains:
template <class bstitem>
bool BinarySearchTree<bstitem>::find(bstitem value) {
    //add code to check for the precondition here
    return findHelper(root, value);
}
 It is your job to write the helper function findHelper(root, value) and also write the code to check for the precondition in find.
 Test your function inside of BinarySearchTreeTest.cpp to make sure it is working properly.
Finding the Minimum Value in a BST
 The purpose of the minimum function is to find the minimum value in the tree.
 Where is the minimum value always located in a BinarySearchTree?
 If you are uncertain, look at an example of a BST to see where the minimum is located.
 To find the minimum, all we need to do is traverse the nodes in the tree to the proper location.
Pseudocode for minimumHelper(root) //Note: Non-recursive
 Repeat while root's left node is not NULL
 Set root equal to root's left node
 Return the data at the root
Activity: Writing Minimum
 Add the following function prototypes to your BinarySearchTree.h file.
private: bstitem minimumHelper(NodePtr root);
public: bstitem minimum();
 Then, below the class definition, but inside of BinarySearchTree.h, add the following wrapper function definition:
template <class bstitem>
bstitem BinarySearchTree<bstitem>::minimum() {
    //add code to handle the precondition here
    return minimumHelper(root);
}
 It is your job to write the helper function minimumHelper(root).
 Test your function inside of BinarySearchTreeTest.cpp to make sure it is working properly.
BST Remove
 Remove is perhaps the most challenging BST algorithm to implement.
 Like search, insert, and the tree traversals it is also recursive.
 Remove works by first searching for a particular value in a BST and then freeing the memory for this Node.
 However, complication comes from the fact that we could encounter several different cases:
 Case 1: The node we wish to remove is a leaf node <- easy case
 Case 2: The node we wish to remove has only one child <- medium case
 Case 3: The node we wish to remove has two children <- hard case
 When deleting nodes in each of these cases, you must be mindful to preserve the BST property.
 Let's look at each of these possibilities in the diagram below:
 Therefore, when writing remove, we need to handle all three cases.
 The first 3 steps in the pseudocode below locate the value to remove inside the tree
 Step 4 handles the three cases discussed above.
Pseudocode for removeHelper(root, value)
 If root is NULL
 return root
 Otherwise, if value < the data at the root
 set root left equal to the recursive call of removeHelper on root left
 Otherwise, if value > the data at the root
 set root right equal to the recursive call of removeHelper on root right
 Otherwise
 If root is a leaf node
 delete root and return NULL
 Otherwise, if root has a left child but no right child
 set the left child to be the new root, then delete the original root
 Otherwise, if root has a right child but no left child
 set the right child to be the new root, then delete the original root
 Otherwise
 Search for the minimum value in the right subtree of the root
 Set the data of the root to be the minimum data of root right
 Set root right equal to the recursive call of removeHelper, passing it root right and the minimum data of root right (i.e. delete the duplicate value in the right subtree)
 return the root
Activity: Writing BST Remove
 Add the following function prototypes to your BinarySearchTree.h file.
private: NodePtr removeHelper(NodePtr root, bstitem value);
public: void remove(bstitem value);
 Then, below the class definition, but inside of BinarySearchTree.h, add the following wrapper function definition for remove:
template <class bstitem>
void BinarySearchTree<bstitem>::remove(bstitem value) {
    //add code to handle the precondition here
    root = removeHelper(root, value);
}
 The next step is to implement the removeHelper(root, value) function.
 Note that you can use the minimumHelper(root) function that you wrote above as an additional helper function.
 Note 2: When you implement the removeHelper function, you will need to alter its signature outside of the class definition to be:
template <class bstitem>
typename BinarySearchTree<bstitem>::NodePtr BinarySearchTree<bstitem>::removeHelper(NodePtr root, bstitem value) {
    //implement function here
}
 Otherwise, your compiler will not recognize what a NodePtr is outside of the BinarySearchTree class definition.
 Test your function inside of BinarySearchTreeTest.cpp to make sure it is working properly.
Hash Table Pre-Assessment: Remember This?
 ASCII Review: Write a function which prints out the ASCII value of each character in a string:
void printASCII(string str) {
    //write function body here
}
The Hash Table ADT
 A Hash Table is a structure where values are mapped or assigned to an array index according to the rules of an algorithm.
 This algorithm is called the hashing function.
 The hashing function takes an input (called a key) and then returns the index of an array.
 This array index is where the value is to be stored
 As we are going to see, hash table operations, such as search, insert, and delete, can be performed in O(1) time on average.
Group Activity: Hash Table Bingo!
 To help you understand a Hash Table, let's play Hash Table Bingo.
 The class as a whole is going to play the role of the underlying array structure in a Hash Table.
 Each student in the class will be assigned an index based on his or her name on the class roster.
 Using a hash function, I will enter the name of a piece of candy as a key.
 The candy itself is the value to store at the array index.
 The hash function will return to me the index at which I want to store the candy.
 If your index is returned by the hash function, you will get a piece of candy.
 At the end of the exercise, I can hash the names of the pieces of candy in order to find out who received a piece and who didn't.
A Simple Array Only Example
 Suppose that I wish to store the names of all the students in my class.
 We could start with an array of strings.
const int CLASS_SIZE = 35; //# students enrolled in the class
string names[CLASS_SIZE];
 Now, our first thought might be to store each student's name alphabetically in the array.
 However, later, if a student drops the class, I would need to:
 Use binary search to search the array for the student's name: O(log n) time.
 Delete the name and update the array by copying all the values after that student's name down by one: O(n) time
 Is there a way that I can do better?
Let's Use a Hash Table Instead
 The underlying structure of a hash table is an array.
 However, when we insert elements into our hash table array, we utilize a hashing algorithm.
 The hashing algorithm is used to determine at which array index to place an element in our table.
 Each index is also known as a bucket.
 As we will see later, more than one element can be associated with an index.
 Let's look at a simple example of a hash table:
 Using the array above as the basis for a hash table provides a more efficient alternative when searching or deleting a name from the class roster.
const int CLASS_SIZE = 35; //# students enrolled in the class
string names[CLASS_SIZE];
 Let's assume that I have a simple hashing algorithm to determine at which array index to store each name on my roster.
 When searching for a student's name, I would call the hashing algorithm, passing the name as input.
 The hashing algorithm would return the index with which the name is associated.
 Lookup then involves searching for the name at a single array index, which is O(1).
 Assume my hashing algorithm is implemented as the following function:
int hash(string student_name) {
    int index = student_name.length();
    return index;
}

//applying my hashing algorithm to the first name in my CIS 22C roster
int index = hash("Anthony"); //should return 7
names[index] = "Anthony"; //storing the value at index 7
 In the above example, the value "Anthony" is called the key.
 The key is the value that is passed into the hashing algorithm to determine the placement of the name within the table.
 Now, when I add students to the roster, they will be stored at the index provided by the hashing algorithm:
names[0] = "";
names[1] = "";
names[2] = "";
names[3] = "Kun";
names[4] = "Doma";
names[5] = "Howon";
names[6] = "Maryna";
names[7] = "Anthony";
...
names[34] = "";
 Later, when I search for a name in my table, I can simply call the hashing algorithm:
int index = hash("Anthony"); //hashing algorithm provides index to search for the name
if (names[index] == "Anthony") //check if the value at index is equal to the name I am searching for
    names[index] = ""; //removing the name of the student who dropped
else
    cout << "Anthony isn't in the table." << endl;
 As we can see from this simple example, with a hash table, insertion, search, and deletion can ideally be performed in just O(1) time.
 We will discuss the BigO runtime of hash table operations later in the lesson.
 As you probably noticed, there is a problem with the above example.
 How many people in this class have a first name that is 3 letters long? 4 letters? 5 letters? 6 letters? 7 letters? 8 letters?
 This problem is known as a collision.
 A collision occurs when more than one value is mapped to the same index in a hash table.
 In my above example, most names would be mapped to the indices 2 through 10.
 Trivia: Average name length is 5.9 letters.
 Collisions can be mitigated with better hashing algorithms.
 However, regardless of how good a hashing algorithm is, collisions will occur.
 A perfect hash function (no collisions) can only be achieved if all values are known ahead of time; very unlikely.
 There are various approaches to resolving collisions. We will discuss two principal approaches below.
Linear Probing
 One simple approach to handling collisions within a hash table is called linear probing.
 In linear probing, if an element is hashed to an index that is already storing another value, the new element is inserted at the next open index.
 For example, imagine our names hash table contains the following values:
names[0] = "";
names[1] = "";
names[2] = "";
names[3] = "Kun";
names[4] = "Doma";
names[5] = "Howon";
names[6] = "Maryna";
names[7] = "Anthony";
...
names[34] = "";
 Now, further suppose that I want to insert the name "Brandon" into my names table.
 Whoops! There is a collision when I apply my hashing algorithm.
 Index 7 is already occupied by "Anthony"
 I can use linear probing to resolve this collision.
 Using linear probing, I insert the name "Brandon" in the next available index after the hashed index of 7.
 Where would I place "Brandon" in the table?
 names[8] = "Brandon";
 Now, my table looks like the following:
names[0] = "";
names[1] = "";
names[2] = "";
names[3] = "Kun";
names[4] = "Doma";
names[5] = "Howon";
names[6] = "Maryna";
names[7] = "Anthony";
names[8] = "Brandon";
...
names[34] = "";
 Next, suppose I want to insert the name "Cameron". Where would his name go, as index 7 already contains the name "Anthony" and index 8 already contains the name "Brandon"?
 In this case, I would need to place "Cameron" two indices away from his hashed index of 7, at index 9:
names[0] = "";
names[1] = "";
names[2] = "";
names[3] = "Kun";
names[4] = "Doma";
names[5] = "Howon";
names[6] = "Maryna";
names[7] = "Anthony";
names[8] = "Brandon";
names[9] = "Cameron";
...
names[34] = "";
 Now, suppose that I want to search for the name "Cameron."
 Applying my hash function, I would look for his name at index 7.
 When I do not locate him there, I know that I need to keep searching farther down the table.
 As you can see, the more collisions, the worse the efficiency of my hash table when performing the insert and search operations.
 Given sufficient collisions in my table, search approaches a Big-O runtime of O(n), no improvement over a simple array.
 Deleting also becomes problematic.
 Need to fill empty space from deletion.
 Which elements were properly indexed? Which were placed due to linear probing?
 Is there a better approach than linear probing?
Separate Chaining
 Another approach to collision resolution is called separate chaining.
 Rather than limiting ourselves to storing only a single value at each index of the table, with separate chaining we store a series of 0 or more linked elements at each index.
 In essence, we make our hash table an array of linked lists, as depicted below:
 Now, each time that a name is inserted and there is a collision, we simply push the new name onto the back of our list at that particular index.
 Therefore, we could re-envision our previous names hash table like so:
names[0] = "";
names[1] = "";
names[2] = "";
names[3] = "Kun";
names[4] = "Doma" -> "Sean";
names[5] = "Howon" -> "Qilin";
names[6] = "Maryna";
names[7] = "Anthony" -> "Brandon" -> "Cameron";
...
names[34] = "";
 Now, when I want to search for the name "Cameron," if his name is in the table, it must be in the list at index 7.
 To check, simply perform a linear search through the list at names[7].
 Later in the lesson, we will look at how to implement a version of a hash table that uses separate chaining.
Big-O Runtime of a Hash Table
 For hash table operations such as search, insert, and delete, we get the following Big-O runtimes:
 Best case: O(1)
 The element is alone at a collision-free index.
 Worst case: O(n)
 Worst case is a linear search, which is O(n).
 Holds for both linear probing and chaining.
 Average case: O(1)?
 The average case runtime of hash table operations deserves a closer look. Assuming a perfect hash function, these operations are always O(1).
 However, without a perfect hash function, the average runtime will depend on how big the table is and how many values it is currently storing (i.e. on how full the table is).
 As more and more values are inserted into the table, the likelihood of collision increases resulting in longer chains.
 The speed of hash table operations is therefore dependent on something called the load factor, which is defined to be n/k, where k is the table size (number of buckets) and n is the number of keys.
load factor = n / k
 The larger the value for the load factor, the slower the hash table operations.
 Therefore, average case for a hash table is sometimes considered to be O(n/k).
 You might be thinking: But... O(n/k) = O(n), as we do not include lower order terms and coefficients in our BigO runtime.
 So, is a hash table about as efficient as a linked list then?
 In practice, O(n/k) is still a practical improvement over O(n).
 Moreover, if your table size is large enough, average runtime approaches constant time:
 If I set k = 2n (table size is twice the number of keys)
 O(n/2n) = O(1/2) = O(1)
Improving Our Hashing Algorithm
 The above hashing algorithm, which mapped a key to an index according to the length of the key, was not a good hashing algorithm.
 What was wrong with this algorithm?
 Too many collisions: we were unable to get an even distribution of values across the table
 Clustering of values around buckets 2 through 10
 As a related issue, some indices were unusable
 index 0 can never hold a value as there are no names of length 0
 Both of these problems contribute to a loss of efficiency.
 Perhaps we can find a better hashing algorithm...
 Let's try the following:
 Sum the ASCII values of each character in the key
 Scale the result by applying % table size.
int hash(string key, const int TABLE_SIZE) {
    int index, sum = 0;
    for (int i = 0; i < key.length(); i++)
        sum += (int) key[i]; //summing the ASCII values for each character in the string
    index = sum % TABLE_SIZE; //dividing the summed ASCII values by the table size && storing the remainder as my index
    return index;
}
Guidelines for Choosing a Good Hash Function
 But, how do we know that our new algorithm is a good one?
 Here are some guidelines for writing good hashing functions:
 Use all of the information provided by a key.
 Our algorithm uses all characters in the key (the name) to compute the index.
 Make sure the elements to insert can be spread across the entire table.
 By applying % operator we make sure our summed ASCII values will be better distributed throughout the table.
 Verify that your algorithm can map similar keys to different buckets.
 Will values like "Mark" and "Matt" be placed in the same bucket?
 Applying the hashing algorithm, with table size of 37
 Mark: mapped to index 25
 Matt: mapped to index 36
 Make sure it uses only simple operations (such as the basic arithmetic operators) to minimize runtime.
 When you write a hash table, you will call your hashing function often (insert, search, delete).
 Make certain your hashing function is not computationally intensive.
 While the hashing function is an important aspect in helping to increase the effectiveness of a hash table, other aspects also need to be considered.
Guidelines for Selecting a Table Size
 In order to effectively minimize collisions (thereby maintaining efficiency), a good rule of thumb is to make your table size twice the number of entries you are planning to insert.
 However, more often than not, the number of entries to be inserted will not be known in advance.
 Therefore, it may be necessary to resize your hash table:
 Remember the load factor: n/k (number of keys divided by table size)
 As the load factor increases, collisions will become more common, resulting in decreased efficiency.
 How can we decrease the load factor? We increase the table size!
 When should we resize the table?
 Linear probing: Array indices are used up quickly. Resize when load factor >= 1/2
 Separate Chaining: Opinions vary: between 2/3 and 3/4.
 Simplest resizing method:
 Create new hash table with an array sized twice as large as the current hash table.
 Linearly traverse the old hash table.
 Copy each element and insert into new hash table.
 Elements will be rehashed when inserted into new hash table.
 Delete old hash table.
Wrap Up
 With your partner, answer the questions from today's learning objectives.
 Lab 6 due Thursday
 Quiz 5 on Thursday
~See You Thursday!~
