Welcome to Lesson 14!

**Learning Objectives**

By the end of today's lesson, you should know:

- What is the Big-O runtime of the hash table operations (insert, delete, search)?
- What is the load factor?
- What are some guidelines for selecting a good hashing algorithm?
**Announcements**

- Quiz 5 after the break
- Collect Lab 6
- Return Lab 5
- Lab 7 Assigned - due one week from today
- Midterm postponed by one class day
  - New date: Tuesday, November 22
- No online office hours tomorrow due to the Veterans Day holiday
  - However, I will check my email a couple of times tomorrow to answer questions
- Discuss Project Report 3 due in one week
**Review Activity**

With a partner, answer the following questions (a worked BST trace to check against appears after this list):

- Trace BST insert on the following values: 10, 12, 15, 4, 7, 13, 9, 11
- What would be the output when calling pre-order, post-order, and in-order print on the resulting BST?
- What is the Big-O runtime of the BST operations (search, insert, delete)? How do these compare to the Big-O runtime for a linked list?
- How is data organized in a hash table?
- What is a hashing algorithm?
- What is a collision?
- What are two strategies that are used to handle collisions?
- What is the difference between them?
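To check your trace, here is a minimal sketch that builds the BST by repeated insertion and prints all three traversals; the `Node` struct and function names are illustrative, not from any assigned code.

```cpp
#include <iostream>
using namespace std;

struct Node { int val; Node *left = nullptr, *right = nullptr; };

// Standard BST insert: smaller keys go left, larger (or equal) keys go right.
Node* insert(Node *root, int v) {
    if (!root) return new Node{v};
    if (v < root->val) root->left  = insert(root->left, v);
    else               root->right = insert(root->right, v);
    return root;
}

void preOrder(Node *n)  { if (n) { cout << n->val << " "; preOrder(n->left); preOrder(n->right); } }
void inOrder(Node *n)   { if (n) { inOrder(n->left); cout << n->val << " "; inOrder(n->right); } }
void postOrder(Node *n) { if (n) { postOrder(n->left); postOrder(n->right); cout << n->val << " "; } }

int main() {
    Node *root = nullptr;
    for (int v : {10, 12, 15, 4, 7, 13, 9, 11}) root = insert(root, v);
    preOrder(root);  cout << endl;   // 10 4 7 9 12 11 15 13
    inOrder(root);   cout << endl;   // 4 7 9 10 11 12 13 15
    postOrder(root); cout << endl;   // 9 7 4 11 13 15 12 10
}
```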
**Hash Tables Continued**

**Separate Chaining**

- Another approach to collision resolution is called **separate chaining**.
- Rather than limiting ourselves to storing only a single value at each index of the table, with separate chaining we store a series of 0 or more linked elements at each index.
- In essence, we make our hash table an array of linked lists, as depicted below:
- Now, each time that a name is inserted, and there is a collision, we simply push the new name onto the back of our list at that particular index.
- Therefore, we could re-envision our previous names hash table like so:
```
names[1]  = "";
names[2]  = "";
names[3]  = "Kun";
names[4]  = "Doma" --> "Sean";
names[5]  = "Howon" --> "Qilin";
names[6]  = "Maryna";
names[7]  = "Anthony" --> "Brandon" --> "Cameron";
...
names[34] = "";
```

- Now when I want to search for the name "Cameron," if his name is in the table, it must be in the list at index 7.
- To check, simply perform a linear search through the list at names[7].
- Later in the lesson, we will look at how to implement a version of a hash table that uses separate chaining.
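As a preview, here is a minimal sketch of a separate-chaining table built as an array of linked lists. It reuses the length-based hash from the names example above; the `insertName`/`searchName` names and the 35-bucket size are assumptions for illustration only.

```cpp
#include <iostream>
#include <list>
#include <string>
#include <vector>
using namespace std;

const int TABLE_SIZE = 35;

// Length-based hash from the names example (not a good hash; improved below).
int hashKey(const string &key) {
    return key.length() % TABLE_SIZE;
}

// The table itself: an array (vector) of linked lists, one chain per bucket.
vector<list<string>> names(TABLE_SIZE);

// On a collision, the new name is simply pushed onto the back of the chain.
void insertName(const string &name) {
    names[hashKey(name)].push_back(name);
}

// Search is a linear scan of the one chain the key could live in.
bool searchName(const string &name) {
    for (const string &n : names[hashKey(name)])
        if (n == name) return true;
    return false;
}

int main() {
    for (string n : {"Kun", "Doma", "Sean", "Anthony", "Brandon", "Cameron"})
        insertName(n);
    cout << boolalpha << searchName("Cameron") << endl;  // true: found in the chain at index 7
}
```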
**Big-O Run-Time of a Hash Table**

- For hash table operations such as search, insert, and delete we get the following Big-O run-times:
- **Best case**: O(1)
  - The element is alone at a collision-free index.
- **Worst case**: O(n)
  - Every key has collided into the same index, so the operation degenerates to a linear search, which is O(n).
  - This holds for both linear probing and separate chaining.
- **Average case**: O(1)?
  - The average-case run-time of hash table operations is less clear-cut. Assuming a perfect hash function, these operations are always O(1).
- However, without a perfect hash function, average run-time will be dependent upon how big the table is and how many values it is currently storing (i.e. on how full the table is).
- As more and more values are inserted into the table, the likelihood of collision increases, resulting in longer chains.
- The speed of hash table operations therefore depends on something called the **load factor**, which is defined to be **n/k**, where n is the number of keys and k is the table size (number of buckets).
- The larger the value for the load factor, the slower the hash table operations.
- Therefore, average case for a hash table is sometimes considered to be O(n/k).
- You might be thinking: But... O(n/k) = O(n), as we do not include lower order terms and coefficients in our Big-O run-time.
- So, is a hash table about as efficient as a linked list then?
- In practice, however, O(n/k) is still a meaningful improvement over O(n): we search one chain of about n/k elements instead of all n.
- Moreover, if your table size is large enough, average run-time approaches constant time:
  - If I set k = 2n (table size is twice the number of keys):
  - O(n/2n) -> O(1/2) -> O(1)
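To make that arithmetic concrete, a tiny sketch (the values of n and k are illustrative): with an even distribution of keys, the average chain length equals the load factor, so each operation touches about n/k elements.

```cpp
#include <iostream>
using namespace std;

int main() {
    int n = 1000;                        // number of keys (illustrative)
    int k = 2 * n;                       // table size twice the number of keys
    double loadFactor = (double) n / k;  // n/k = 0.5
    // With an even distribution, each chain holds ~0.5 elements on average,
    // so search/insert/delete each inspect a constant number of elements.
    cout << "load factor = " << loadFactor << endl;  // prints 0.5
}
```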
**Improving Our Hashing Algorithm**

- The hashing algorithm above, which mapped a key to an index according to the length of the key, was not a good one.
- What was wrong with this algorithm?
- Too many collisions: we were unable to get an even distribution of values across the table
  - **Clustering** of values around buckets 2-10
- As a related issue, some indices were unusable
  - Index 0 can never hold a value, as there are no names of length 0
- Both of these problems contribute to a loss of efficiency.
- Perhaps we can find a better hashing algorithm...
- Let's try the following:
- Sum the ASCII values of each character in the key
- Scale the result by applying % table size.
```cpp
int hash(string key, const int TABLE_SIZE) {
    int index, sum = 0;
    for (int i = 0; i < key.length(); i++)
        sum += (int) key[i];   // sum the ASCII values of each character in the string
    index = sum % TABLE_SIZE;  // divide the sum by the table size and store the remainder as the index
    return index;
}
```

**Guidelines for Choosing a Good Hash Function**

- But how do we know that our new algorithm is a good one?
- Here are some guidelines for writing good hashing functions:
- Use all of the information provided by a key.
  - Our algorithm uses all characters in the name to compute the index.
- Make sure the elements to insert can be spread across the entire table.
  - By applying the % operator, we make sure our summed ASCII values will be better distributed throughout the table.
- Verify that your algorithm can map similar keys to different buckets.
  - Will values like "Mark" and "Matt" be placed in the same bucket?
  - Applying the hashing algorithm with a table size of 37:
    - Mark: mapped to index 25
    - Matt: mapped to index 36
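A quick way to verify those two indices is to run the hash function above in a throwaway driver; the function is renamed `hashIndex` here only to avoid colliding with `std::hash` under `using namespace std`.

```cpp
#include <iostream>
#include <string>
using namespace std;

// ASCII-sum hash from above.
int hashIndex(const string &key, const int TABLE_SIZE) {
    int sum = 0;
    for (int i = 0; i < key.length(); i++)
        sum += (int) key[i];
    return sum % TABLE_SIZE;
}

int main() {
    cout << hashIndex("Mark", 37) << endl;  // 395 % 37 = 25
    cout << hashIndex("Matt", 37) << endl;  // 406 % 37 = 36
}
```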
- Make sure it uses only simple operations (such as the basic arithmetic operators) to minimize run-time.
  - When you write a hash table, you will call your hashing function often (insert, search, delete).
  - Make certain your hashing function is not computationally intensive.
- While the hashing function is an important aspect in helping to increase the effectiveness of a hash table, other aspects also need to be considered.
- In order to effectively minimize collisions (and thereby maintain efficiency), a good rule of thumb is to make your table size twice the number of entries you plan to insert.
- However, more often than not, the number of entries to be inserted will not be known in advance.
- Therefore, it may be necessary to re-size your hash table:
- Remember the load factor: n/k (number of keys divided by table size)
- As the load factor increases, collisions will become more common, resulting in decreased efficiency.
- How can we decrease the load factor? We increase the table size!
- When should we re-size the table?
  - Linear probing: array indices are used up quickly, so re-size when the load factor >= 1/2.
  - Separate chaining: opinions vary; common thresholds fall between 2/3 and 3/4.
- Simplest re-sizing method (a sketch follows this list):
  - Create a new hash table with an array twice as large as the current one.
  - Linearly traverse the old hash table.
  - Copy each element and insert it into the new hash table.
  - Elements will be rehashed when inserted into the new table.
  - Delete the old hash table.
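Here is a minimal sketch of that re-sizing method for a separate-chaining table of strings, assuming the ASCII-sum hash from earlier; the `resize` name and the `vector<list<string>>` representation are illustrative.

```cpp
#include <list>
#include <string>
#include <utility>
#include <vector>
using namespace std;

// ASCII-sum hash from earlier, parameterized on the current table size.
int hashIndex(const string &key, int tableSize) {
    int sum = 0;
    for (char c : key) sum += (int) c;
    return sum % tableSize;
}

// Double the table and rehash every element into its new bucket.
void resize(vector<list<string>> &table) {
    vector<list<string>> bigger(table.size() * 2);  // new table, twice as large
    for (const list<string> &chain : table)         // linear traversal of the old table
        for (const string &key : chain)
            bigger[hashIndex(key, bigger.size())].push_back(key);  // rehashed on insert
    table = move(bigger);  // old table's contents are replaced (and destroyed)
}
```

Because elements are rehashed with the new, larger size, a key may land in a different bucket after the re-size; that is expected, and it is what restores the low load factor.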
**Work on Lab 7**

**Wrap Up**

- With your partner, answer the questions from today's learning objectives
**Upcoming Assignments**

- Lab 7 due Thursday
~Have a Great Weekend!~