Hash table
- Hash function (return non-negative integers -- hash code) + array (store data types placed in the array location[hash code] in the structure and make that data type a pointer to the linked list)
- Almost a constant search/insert time: O(1)
- A hash table is essentially an array?of?linked lists, drawing from top to bottom. Each linked list in the array has elements of a certain category.
- For example, store names into an array with 26 positions, one for each letter of the alphabet. It takes only one step to find the alphabetic location according to ASCII (A-[0]; H-[7]). If there are names starting with the same letter, the collision can be solved by stitching them together along the linked list from left to right. No need to grow the size of array or move any value.
- Hash function: take some input and deterministically maps it to the location it should go in (input letter output number). In our example, the hash function just returns an index corresponding to the first letter of the name, such as?0?for “Albus” and?25?for “Zacharias”.
unsigned int hash( char *str )
{
? ? ? ? int sum = 0;
? ? ? ? for (int j = 0; str[j] != '\0'; j++)
? ? ? ? {
? ? ? ? ? ? ? ? sum += str[j];
? ? ? ? }
? ? ? ? return sum % HASH_MAX;? ? ? ? ? ? ? ? ? ????????? // return the non-negative hash code
}
- For the worst case, all the names might start with the same letter, so we might end up with the equivalent of a single linked list again ---------------------------------- COLLISION (when some pieces of data run through the hash function and yield the same hash code)
? ? ? ? Solutions for collisions:
? ? ? ? Linear Probing: place the data in the next consecutive element in the array until we find a vacancy
? ? ? ? ? ? ? ? Clustering: more likely to grow the cluster
? ? ? ? Chaining: each element in the array is a pointer to the head of a linked list, which eliminates clustering. Theoretically it may not benefit searching, but the sorting speed can be accelerated in the real world.?
- Optimization when meet too many collisions: we might look at the first two letters, and allocate enough buckets for 26*26 possible hashed values (AA-ZZ), or even the first three letters?(AAA-ZZZ), requiring 26*26*26 buckets:
? ? While using more space in memory, we’re more likely to only need one step to look for a value, reducing our running time for search.
- Example: sorting cards: first putting them in piles by suit, (spades, diamonds, hearts, and clubs), then sort each pile. So it will takes only 13 steps rather than 13*4 steps.
- The worst case running time for a hash table is O(n)?asymptotically, since, as?n?gets very large, each bucket will have on the order of?n?values, even if we have hundreds or thousands of buckets. In practice, though, our running time will be faster since we’re dividing our values into multiple buckets.
Trie?
- Short for “retrieval”. A trie is a tree with arrays as nodes: (H - the root of the tree, unique key & simple boolean value)
- Each array stores 26 letters. Each node from top to bottom is an array of pointers to other nodes.?
- For each word, the first letter will point to an array, where the next valid letter will point to another array, and so on, until we reach a boolean value indicating the end of a valid word, marked in green above. If our word isn’t in the trie, then one of the arrays won’t have a pointer or terminating character for our word.
- The searching time depends on the length of the word we’re looking for only. This might be a fixed maximum and not depend on the number of other words in the data structure, so we can have?O(1) for searching and insertion.
- The cost for this, though, is that we need lots of memory to store pointers and boolean values as indicators of valid words, even though lots of them won’t be used (even the null pointer takes 8 bits, which indicates very inefficient memory usage, rewinding from the contiguous property of arrays).
// Example
typedef struct _trie
{
? ? ? ? char university[20];? ? ? ? ? ? ? ? ? ? ? ? ? ? ?// spaces for 20 characters
? ? ? ? struct _trie* paths[10];? ? ????????????????? ? // an array of ten pointers to other nodes of the same type
}
trie;
Abstract data structures
There are even higher-level constructs,?abstract data structures, where we use our building blocks of arrays, linked lists, hash tables, and tries to?implement?a solution to some problem (implement with some other data structures).
- queue
For example, one abstract data structure is a?queue, like a line of people waiting, where the first value we put in are the first values that are removed, or first-in-first-out (FIFO).?
To add a value we?enqueue?it, and to remove a value we?dequeue?it. This data structure is abstract because it’s an idea that we can implement in different ways: with an array that we resize as we add and remove items, or with a linked list where we append values to the end.
Problems for queue implementing using array: cannot dynamically change the array (one in one out everyone one step forward).
- stack
An “opposite” data structure would be a?stack, where items most recently added are removed first: last-in-first-out (LIFO). At a clothing store, we might take, or?pop, the top sweater from a stack, and new sweaters would be added, or?pushed, to the top as well.
- dictionary
Another example of an abstract data structure is a?dictionary, where we can map keys to values (associate keys with values), such as words to their definitions. We can implement one with a hash table or an array, taking into account the tradeoff between time and space. (think about initial labels for names at the food pickup section)
For the perverse corner case, we meet the two same keys, or too many keys starting from the same letter (remember arrays with finite sizes so we have to cheat on near spaces if we fill spaces for one key).?
Summary
1. Array
? ? ? ? bad insertion, bad deletion, great lookup, relatively easy to sort (like bubble and merge), relatively small size-wise, stuck with fixed-size, no flexibility
2. Linked list
? ? ? ? great insertion, great deletion, bad lookup (rely on linear search), relatively difficult to sort, relatively small size-wise (larger than arrays)
3. hash table
? ? ? ? easy insertion (hash code then insert by linear probing or chaining), easy deletion (if chaining once we found the element), better lookup than linked lists on average, not ideal for sorting, can run the gamut of size
4. trie
? ? ? ? complex insertion, easy deletion (just free the node), fast lookup (depend on the length of data itself), already sorted, heavily space-exhausted