Welcome to Lesson 13!

Learning Objectives
  • How do you write BST search and remove?
  • What is a Hash Table?
  • What are the advantages of using a Hash Table over other ADTs we have studied?
    • Arrays
    • Linked Lists (and other linear structures)
    • Binary Search Trees
  • What is a hashing algorithm?
    • How do you write a good hashing algorithm?
  • What are some solutions to the problem of collisions?
  • What is the efficiency of common hashing operations?
    • How does the size of the Hash Table affect its efficiency?
  • How do you search for an element in a Hash Table?

Announcements
  • Quiz 5 on Thursday
    • Quiz 4 returned
  • Lab 6 due Thursday
  • Today you will learn about the second of two data structures you will need for your course project:
  1. BST
  2. Hash Tables

BSTs Continued

Searching a BST
  • Searching a BST is an almost identical process to adding a new node to the tree.
  • In fact, we can use the same image as for insertion!

image

image source

  • Let's trace an example, using the same example as for insertion.
  • The Pseudocode is almost identical to insert as well.
  • However, search returns a boolean instead of being a void function

Pseudocode for findHelper(root, value)

1. Base Case: the value is equal to the data stored at the root

1a. return true

2. Check if the value being sought is smaller than the data stored at the root.

2a. If this is true

2aa. If root left is NULL

2aaa. return false

2b. If the root left is not NULL

2bb. recursively call the findHelper function passing it root left and value

3. Otherwise

3a. Check if the root right is NULL

3aa. If this is true, return false

3b. If the root right is not NULL

3bb. recursively call the findHelper function passing it root right and value

4. return false


Writing the BST Search Function find
  • Add the following function prototypes to your BinarySearchTree.h file.
private:
    bool findHelper(NodePtr root, bstitem value);

public:
    bool find(bstitem value);
  • Then, below the class definition, but inside of BinarySearchTree.h, add the following wrapper function definition for contains:
template <class bstitem>
bool BinarySearchTree<bstitem>::find(bstitem value)
{
//add code to check for the precondition here
if (value == root->data)
        return true;
else
return findHelper(root, value);

}
  • It is your job to write the helper function findHelper(root, value) and also write the code to check for the precondition in find.
  • Test your function inside of BinarySearchTreeTest.cpp to make sure it is working properly.

Finding the Minimum Value in a BST
  • The purpose of the minimum function is to find the minimum value in the tree.
  • Where is the minimum value always located in a BinarySearchTree?
    • If you are uncertain, look at an example of BST to see where the minimum is located.

A binary search tree of size 9 and depth 3, with 8 at the root. The leaves are not drawn.

image source

  • To find the minimum, all we need to do is to scroll through the nodes in the tree to the proper location.
Pseudocode for minimumHelper(root) //Note: Non-recursive
  • Repeat while root's left node is not Null
    • set root to be root left
  • Return the data at the root

Activity: Writing Minimum
  • Add the following function prototypes to your BinarySearchTree.h file.
private:
    bstitem minimumHelper(NodePtr root);

public:
    bstitem minimum();
  • Then, below the class definition, but inside of BinarySearchTree.h, add the following wrapper function definition:
template <class bstitem>
bstitem BinarySearchTree<bstitem>::minimum()
{
          //add code to handle the precondition
return minimumHelper(root);
}
  • It is your job to write the helper function minimumHelper(root).
  • Test your function inside of BinarySearchTreeTest.cpp to make sure it is working properly.


BST Remove
  • Remove is perhaps the most challenging BST algorithm to implement.
  • Like search, insert, and the tree traversals it is also recursive.
  • Remove works by first searching for a particular value in a BST and then freeing the memory for this Node.
  • However, complication comes from the fact that we could encounter several different cases:
    • Case 1: The node we wish to remove if a leaf node <-- easy case
    • Case 2: The node we wish to remove has only one child <-- medium case
    • Case 3: The node we wish to remove has two children <-- hard case
  • When deleting nodes in each of these cases, you must be mindful to preserve the BST property.
  • Let's look at each of these possibilities in the diagram below:
Deletion rules


  • Therefore, when writing remove, we need to handle all three cases.
  • The first 3 steps in the pseudocode below locate the value to remove inside the tree
  • Step 4 handles the three cases discussed above.
Pseudocode for removeHelper(root, value)
  1. If root is null
    1. return root
  2. Otherwise, if value < the data at the root
    1. set root left equal to the recursive call of remove on root left
  3. Otherwise, if value > the data at the root
    1. set root right equal to the recursive call of remove on root right
  4. Otherwise,
    1. If root is a leaf node
      1. delete root
    2. Otherwise if root has a left child but no right child
      1. set the left child to be the root
      2. call delete on the original root
    3. Otherwise if root has a right child but no left child
      1. set the right child to be the root
      2. call delete on the original root
    4. Otherwise
      1. Search for the minimum value in the right subtree of the root
      2. Set the data of the root to be the data of the minimum value of the root right
      3. Set root right equal to the recursive call of remove, passing it root right and the minimum data of root right (i.e. delete the duplicate value in the right subtree)
  5. return the root

Activity: Writing BST Remove
  • Add the following function prototypes to your BinarySearchTree.h file.
private:
    NodePtr removeHelper(NodePtr root, bstitem value);
  
public:
    void remove(bstitem value);
  • Then, below the class definition, but inside of BinarySearchTree.h, add the following wrapper function definition for remove:
template <class bstitem>
void BST<bstitem>::remove(bstitem value)
{
//add code to handle the preconditions
root = remove(root, value);
}
  • The next step is to implement the removeHelper(root, value) function.
  • Note that you can use the minimumHelper(root) function that you wrote above as an additional helper function.
  • Note 2: When you implement the removeHelper function, you will need to alter it's signature outside of the class definition to be:
typename BinarySearchTree<bstitem>::NodePtr BinarySearchTree<bstitem>::removeHelper(NodePtr root, bstitem value)
{
    //implement function here
}
  • Otherwise, your compiler will not recognize what a NodePtr is outside of the BinarySearchTree class definition.
  • Test your function inside of BinarySearchTreeTest.cpp to make sure it is working properly.


Hash Table Pre-Assessment: Remember This?
  • ASCII Review: Write a function which prints out the ASCII value of each character in a string:
void printASCII(string str)
{
    //write function body here
}



The Hash Table ADT
  • A Hash Table is a structure where values are mapped or assigned to an array index according to the rules of an algorithm.
    •  This algorithm is called the hashing function.
    •  The hashing function takes an input (called a key) and then returns the index of an array.
    •  This array index is where the value is to be stored
  • As we are going to see, hash table operations, such as search, insert, and delete can be performed in just O(1) time.

Group Activity: Hash Table Bingo!
  • To help you understand a Hash Table, let's play Hash Table Bingo.
  • The class as a whole is going to play the role of the underlying array structure in a Hash Table.
  • Each student in the class will be assigned an index based on his or her name on the class roster.
  • Using a hash function, I will enter the name of a piece of candy as a key.
  • The value to store at the array index
  • The hash function will return to me the index at which I want to store the candy.
  • If your index is returned by the hash function, you will get a piece of candy.
  • At the end of the exercise, I can hash the names of the pieces of candy in order to find out who received a piece and who didn't.

A Simple Array Only Example
  • Suppose that I wish to store the names of all the students in my class.
  • We could start with an array of strings.
const int CLASS_SIZE = 35; //# students enrolled in the class
string names[CLASS_SIZE];

  • Now, our first thought might be to store each student's name alphabetically in the array.

names[0] = "Anthony";

names[1] = "Brandon";

names[2] = "Cameron";

...

names[36] = "Zih-Ying";

  • However, later, if a student drops the class, I would need to:
    • Use binary search to search the array for the student's name: O(log n) time.
    • Delete the name and update the array by copying all the values after that student's name down by one: O(n) time
    • Is there a way that I can do better?
Let's Use a Hash Table Instead
  • The underlying structure of a hash table is an array.
  • However when we insert elements into our hash table array, we utilize a hashing algorithm
  • The hashing algorithm is used to determine at which array index to place an element in our table. 
  • Each index is also known as a bucket.
    • As we will see later, more than one element can be associated with an index. 
  • Let's look at a simple example of a hash table:
  • Using the array above as the basis for a hash table provides a more efficient alternative when searching or deleting a name from the class roster.
const int CLASS_SIZE = 35; //# students enrolled in the class
string names[CLASS_SIZE];
  • Let's assume that I have a simple hashing algorithm to determine at which array index to store each name on my roster. 
    • When searching for a student's name, I would call the hashing algorithm, passing the name as input.
    • The hashing algorithm would return the index with which the name is associated.
    • Lookup then involves searching for the name at a single array index, which is O(1).
  • Assume my hashing algorithm is implemented as the following function:
int hash(string student_name)
{
    int index = student_name.length();
    return index;
}
...
//applying my hashing algorithm to the first name in my CIS 22C roster
int index = hash("Anthony"); //Should return 7
names[index] = "Anthony"; //storing the value at index 7
  • In the above example, the value "Anthony" is called the key.
  • The key is the value that is passed into the hashing algorithm to determine the placement of the name within the table.
  • Now, when I add students to the roster, they will be stored at the index provided by the hashing algorithm:
    names[0] = "";
    names[1] = "";
    names[2] = "";
    names[3] = "Kun";
    names[4] = "Doma";
    names[5] = "Howon";
    names[6] = "Maryna";
    names[7] = "Anthony";
    ...
    names[34] = "";
  • Later, when I search for a name in my table, I can simply call the hashing algorithm:
int index = hash("Anthony");//hashing algorithm provides index to search for the name
if (names[index] == "Anthony") //check if the value at index is equal to the name I am searching for
    names[index] = ""; //removing the name of the student who dropped
else
    cout << "Anthony isn't in the table." << endl;
  • As we can see from this simple example, with a hash table, insertion, search and deletion can be performed in just O(1).
  • We will discuss the Big-O run-time of hash table operations later in the lesson.

Collisions



  • As you probably noticed, there is a problem with the above example.
  • How many people in this class have a first name that is 3 letters long? 4 letters? 5 letters? 6 letters? 7 letters? 8 letters?
  • This problem is known as a collision.
    • A collision occurs when more than one value is mapped to the same index in a hash table.
    • In my above example, most names would be mapped to the indices 2 - 10.
    • Trivia: Average name length is 5.9 letters.
  • Collisions can be mitigated with better hashing algorithms.
  • However, regardless of how good a hashing algorithm is, collisions will occur.
  • perfect hash function (no collisions) can only be achieved if all values are known ahead of time; very unlikely.
  • There are various approaches to resolving collisions. We will discuss two principle approaches below.

Linear Probing
  • One simple approach to handling collisions within a hash table is called linear probing.
  • In linear probing, if an element is hashed to an index that is already storing another value, the new element is inserted at the next open index.
  • For example, imagine our names hash table contains the following values:
         names[0] = "";
    names[1] = "";
    names[2] = "";
    names[3] = "Kun";
    names[4] = "Doma";
    names[5] = "Howon";
    names[6] = "Maryna";
    names[7] = "Anthony";
    ...
    names[34] = "";
  • Now, further suppose that I want to insert the name "Brandon" into my names table.
    • Woops! There is a collision when I apply my hashing algorithm.
    • Index 7 is already occupied by "Anthony"
    • I can use linear probing to resolve this collision.
  • Using linear probing, I insert the name "Brandon" in the next available index after the hashed index of 7.
    • Where would I place "Brandon" in the table?
    • names[8] = "Brandon";
  • Now, my table looks like the following:
    names[0] = "";
    names[1] = "";
    names[2] = "";
    names[3] = "Kun";
    names[4] = "Doma";
    names[5] = "Howon";
    names[6] = "Maryna";
    names[7] = "Anthony";
    names[8] = "Brandon";
    ...
    names[34] = "";
  • Next, suppose I want to insert the name "Cameron". Where would his name go as index 7 already contains the name "Anthony" and index 8 already contains the name "Brandon"?
  • In this case, I would need to place "Cameron" two indices away from his hashed index of 7, at index 9:
    names[0] = "";
    names[1] = "";
    names[2] = "";
    names[3] = "Kun";
    names[4] = "Doma";
    names[5] = "Howon";
    names[6] = "Maryna";
    names[7] = "Anthony";
    names[8] = "Brandon";
    names[9] = "Cameron";
    ...
    names[34] = "";
  • Now, suppose that I want to search for the name "Cameron."
    • Applying my hash function, I would look for his name at index 7.
    • When, I do not locate him there, I know that I need to keep searching farther down the table.
  • As you can see, the more collisions, the worse the efficiency of my hash table when performing the insert and search operations.
  • Given sufficient collisions in my table, search approaches a Big-O run-time of O(n), no improvement over a simple array.
  • Deleting also becomes problematic.
    • Need to fill empty space from deletion.
    • Which elements were properly indexed? Which were placed due to linear probing?
  • Is there a better approach than linear probing?

Separate Chaining
  • Another approach to collision resolution is called separate chaining.
  • Rather than limiting ourselves to storing only a single value at each index of the table, with separate chaining we store a series of 0 or more linked elements at each index.
  • In essence, we make our hash table an array of linked lists, as depicted below:
  • Now, each time that a name is inserted, and there is a collision, we simply push the new name onto the back of our list at that particular index.
  • Therefore, we could re-envision our previous names hash table like so:
    names[0] = "";
    names[1] = "";
    names[2] = "";
    names[3] = "Kun";
    names[4] = "Doma" --> "Sean";
    names[5] = "Howon" --> "Qilin";
    names[6] = "Maryna";
    names[7] = "Anthony" --> "Brandon" --> "Cameron";
    ...
    names[34] = "";
  • Now when I want to search for the name "Cameron," if his name is in the table, it must be in the list at index 7.
  • To check, simply perform a linear search through the list at names[7].
  • Later in the lesson, we will look at how to implement a version of a hash table that uses separate chaining.


Big-O Run-Time of a Hash Table 
  • For hash table operations such as search, insert and delete we get the following Big-O run-times:
  • Best case: O(1)
    • Element is alone at collision free index.
  • Worst case: O(n)
    • Worst case is a linear search, which is O(n).
    • Holds for both linear probing and chaining.
  • Average case: O(1)?
  • Average case run-time of hash table operations is a question. Assuming a perfect hash function, these operations are always O(1). 
  • However, without a perfect hash function, average run-time will be dependent upon how big the table is and how many values it is currently storing (i.e. on how full the table is).
  • As more and more values are inserted into the table, the likelihood of collision increases resulting in longer chains.
  • The speed of hash table operations is therefore dependent on something called the load factor, which is defined to be n/k, where k is the the table size (number of buckets) and n is the number of keys.
load factor = n / k
  • The larger the value for the load factor, the slower the hash table operations.
  • Therefore, average case for a hash table is sometimes considered to be O(n/k). 
  • You might be thinking: But... O(n/k) = O(n), as we do not include lower order terms and coefficients in our Big-O run-time. 
    • So, is a hash table about as efficient as a linked list then?
    • In practice, O(n/k) is still a practical improvement over O(n).
    • Moreover, if your table size is large enough, average run-time approaches constant time:
      • If I set k = 2n (table size is twice the number of keys)
      • O(n/2n) -> O(1/2) -> O(1) 

Improving Our Hashing Algorithm
  • The above hashing algorithm, which mapped a key to an index according to the length of the key, was not a good hashing algorithm.
  • What was wrong with this algorithm?
    • Too many collisions - we were unable to get an even distribution of values across the table
      • Clustering of values around buckets 2-10
    • As a related issue, some indices were unusable
      • index 0 can never hold a value as there are no names of length 0
    • Both of these problems contribute to a loss of efficiency.
  • Perhaps we can find a better hashing algorithm...
  • Let's try the following:
    • Sum the ASCII values of each character in the key
    • Scale the result by applying % table size.
int hash(string key, const int TABLE_SIZE)
{
    int index, sum = 0;
    for(int i = 0; i < key.length(); i++)
        sum += (int) key[i]; //summing the ASCII values for each character in the string
    index = sum % TABLE_SIZE; //dividing the summed ASCII values by 35 && storing remainder as my index
    return index;
}
.
Guidelines for Choosing a Good Hash Function
  • But, how do we know that our new algorithm is a good one?
  • Here are some guidelines for writing good hashing functions:
    1. Use all of the information provided by a key.
      • Our algorithm uses all characters in the name to create the key.
    2. Make sure the elements to insert can be spread across the entire table.
      • By applying % operator we make sure our summed ASCII values will be better distributed throughout the table.
    3. Verify that your algorithm can map similar keys to different buckets.
      • Will values like "Mark" and "Matt" be placed in the same bucket?
      • Applying the hashing algorithm, with table size of 37
      • Mark: mapped to index 25
      • Matt: mapped to index 36
    4. Make sure it uses only simple operations (such as the basic arithmetic operators) to minimize run-time.
      • When you write a hash table, you will call your hashing function often (insert, search, delete).
      • Make certain your hashing function is not computationally intensive.
  • While the hashing function is an important aspect in helping to increase the effectiveness of a hash table, other aspects also need to be considered.

Guidelines for Selecting a Table Size

  • In order to effectively minimize collisions (thereby maintaining efficiency), a good rule-of-thumb is to make your table size be twice the number of entries you are planning to insert.
  • However, more often than not, the number of entries to be inserted will not be known in advance.
  • Therefore, it may be necessary to re-size your hash table:
    • Remember the load factor: n /k (number_of_keys  divided by table_size)
    • As the load factor increases, collisions will become more common, resulting in decreased efficiency.
    • How can we decrease the load factor? We increase the table size!
    • When should we re-size the table?
      • Linear probing: Array indices are used up quickly. Re-size when load factor >= 1/2
      • Separate Chaining: Opinions vary: between 2/3 and 3/4.
    • Simplest re-sizing method:
      • Create new hash table with an array sized twice as large as the current hash table.
      • Linearly traverse the old hash table.
        • Copy each element and insert into new hash table.
        • Elements will be rehashed when inserted into new hash table.
      • Delete old hash table.
Wrap Up
  • With your partner, answer the questions from today's learning objectives


Upcoming Assignments
  • Lab 6 due Thursday
  • Quiz 5 on Thursday

~See You Thursday!~