本次Java作業(yè)是實(shí)現(xiàn)一個完整的哈希表
COM S 228, Fall 2020Programming Project 5
Generating a Perfect Hash Table Implementation
Problem Overview
Graphs are the most general of the common data structures; consider that a scalar object is a degenerate list,which is a degenerate tree, which is itself a degenerate graph, or in other words, every data structure we’ve discussed this semester can be represented as a graph. Due to their extreme generality, it’s rare that you’ll use a generic “graph” data structure. The limited set of things you might do with a list make two generic list implementations—an array based- and a linked-structure—good enough for most implementations, and it’s only if you are doing, e.g., low-level system programming that you might have a good reason to write a custom, one-off list. But there is so much variation in graphs that it is perhaps impossible to write a good and useful generic graph implementation.
Hash tables provide a means to get constant-time look-up performance on a collection. Usually, the data comes to us at runtime, often in an on-line fashion. In this case, we take what foreknowledge we can to inform best practices in hash table construction, choosing a hash function, etc., to build a data structure with as close to optimal performance as possible; however, if the keys in the table are known beforehand, it’s always possible to construct a perfect hash table, a table in which every key maps to a unique slot.There are many algorithms to construct such a table and the associated hash function; almost all of them are implemented using graphs. For this project, you will be implementing a perfect hash table construction algorithm1?
?This algorithm generates a pair of tables of random numbers. Keys are hashed with the values in each these tables, giving two integers (one for each table). These integers are then taken as node indices in a graph, with an edge between them. So long as the resultant graph—after hashing all of the keys—contains no cycles, the algorithm is ready to generate the hash table as a function of the tables (the T tables) used in the graph generation and a function g() on those tables.
Algorithm Example
NOTE: This algorithm looks pretty intimidating at first glance. With some careful reading and workingthrough of examples, you should find it’s actually fairly straightforward. That said, you don’t necessarily have to understand it at all! If you find yourself getting anxious about all this “scary-looking stuff”, jump down to the Requirements sections, read that, then come back here with a more relaxed mindset. It is very common for professional programmers to be required to implement solutions to problems that they don’t actually understand, so if you find yourself doing that, here or elsewhere, please know that it is normal. This is not to advocate for ignorance, however–it’s always better to understand something than not to understand it–only to recognize that sometimes the cost-benefit of understanding is too high! The example below was generated by hand, and uses an optimization that would be very difficult to implement in program control. We note that our original keys (the number names from “one” to “ten”) 1The algorithm is presented in: Zbigniew J. Czech, George Havas, and Bohdan S. Majewski, “An optimal algorithm for generating minimal perfect has functions”, Information Processing Letters, 43(5):257-264, October 1992. are unambiguously differentiable by the first two letters. Thus, we can simplify the presentation by using only the first two letters of the keys. In your Java program, you will always use the entirety of every word; however, the method that hashes the words is already implemented for you, so you don’t need to concern yourself with it unless you are endeavoring to fully understand the implementation.Our keys and the “sub-keys” that we’re using in this example follow:
Key Minimum Unique Sub-key
one on
two tw
three th
four fo
five fi
six si
seven se
eight ei
nine ni
ten te
A modulus must be chosen, giving a maximum number in our tables, and a maximum number of nodes in the resultant graph. The paper authors choose a value that is one more than twice the number of keys, so we do too. This choice has implications on the runtime of the algorithm (smaller moduli increase the probability of cycles in the resultant graph, forcing a restart of the algorithm). If the modulus is too small, the algorithm will enter an infinite loop.
Next, we populate our tables, T1 and T2, with random numbers less than the modulus. Each of the two tables will have the same number of rows as the length l of the sub-key. The i
th row, 0 <= i <= l ? 1,defines a separate mapping from letters at the i
th position (increasing from left to right) within the sub-key to random numbers.
T1
e f h i n o s t w
8 15 10 2 10 19 13 0 13
3 5 0 5 5 6 19 5 6
T2
e f h i n o s t w
20 2 4 19 11 15 3 8 11
16 4 2 6 19 10 18 4 18
These tables define a function that we will use to transform our keys into nodes in a graph. For each key,
we apply the function twice, once each for T1 and T2. Index the tables by the letter position and the letter
value, add the numbers found in the tables, and return a result modulo the modulus2
. We get two values per
key, one for T1, and one for T2. These values are taken as node IDs in our graph, and an edge runs between
them corresponding with the index of the key in the input vector, so “one” corresponds with index zero,
“two” with index one, etc.
2
If you look carefully at the hash function defined in the project source template, you will notice that we always use tables
of size 4 × 64. We’re doing things slightly differently there, to reduce code complexity. Instead of indexing with letter position,
we index with letter position mod 4, and instead of letter value, we use letter value mod 64. The example in this document uses
a simpler (and better) approach that is easy to do when hand tuning, but difficult to automate. The fundamental structure of the
algorithm is unchanged in either case.
Applying our table to the first key, “one”, and recall that we are using only the first two letters in this
example (and that we always use the entirety of every word in our program code), we have T1(0, o) ==
19 and T1(1, n) == 5. 19 + 5 ≡ 3 mod 21, so one end of our “one” edge is node 3. The other end is
given by T2(0, o) == 15 and T2(1, n) == 19, and 15 + 19 ≡ 13 mod 21. So the “one” edge runs from
node 3 to node 13.
The following table gives the calculation for the entire input set:
key T1(0,letter) T1(1,letter) sum mod 21 T2(0,letter) T2(1,letter) sum mod 21
one 19 5 3 15 19 13
two 0 6 6 8 18 5
three 0 0 0 8 2 10
four 15 6 0 2 10 12
five 15 5 20 2 6 8
six 13 5 18 3 6 9
seven 13 3 16 3 16 19
eight 8 5 13 20 6 5
nine 10 5 15 11 6 17
ten 0 3 3 8 16 3
The T1 sum column is also called u(key) and the T2 sum is v(key). We’ll be referring to them by these
names later in the algorithm.
Now our graph is built. The next step is to check if the graph is suitable to build our hash table. A
graph is suitable if it contains no cycles. Cycle detection is usually done with a depth first search, but in this
example, we’ll inspect by eye; in program control we would call a method for this check, one that you will
be implementing!
Look at the final row in the above table and note that the “ten” edge is a self loop; it both begins and
ends on node 3. Therefore, we need to discard this graph and the associated tables. We generate a new set
of random tables:
T1
e f h i n o s t w
1 13 16 3 3 7 10 1 17
19 11 17 6 20 3 0 17 14
T2
e f h i n o s t w
7 5 6 13 20 8 19 10 16
8 3 0 1 13 0 17 14 4
and calculate a new graph:
key T1(0,letter) T1(1,letter) sum mod 21 T2(0,letter) T2(1,letter) sum mod 21
one 7 20 6 8 13 0
two 1 14 15 10 4 14
three 1 17 18 10 0 10
four 13 3 16 5 0 5
five 13 6 19 5 1 6
six 10 6 16 19 1 20
seven 10 19 8 19 8 6
eight 1 6 7 7 1 8
nine 3 6 9 20 1 0
ten 1 19 20 10 8 18
Applying our cycle-detecting eyeballs, we see that this graph has no loops:
9
0
6
19
8
7
15
14
5
16
20
18
10
four (3)
nine (8)
one (0)
five (4)
seven (6)
eight (7)
two (1)
three (2)
six (5)
ten (9)
The final step is to fill the array that defines the g() function (as named in the paper; “g” does not stand
for “graph”). g() maps edges in our graph to indices in our original input array. We do this by traversing
the connected components of the graph. We always start with the lowest-numbered, unvisited vertex. When
traversing through a node where you have a choice of which direction to travel next, the choice does not
matter.
We start with node zero and assign g(0) = 0. Then we traverse the neighbors of node 0; since neighbor
order doesn’t matter, let’s do node 9 next. g(9) = 8 ? g(0) mod 10 ≡ 8, where the 8 in the subtraction
comes from the edge value and 10 is the number of keys in the input vector. Going back to node 1 and
heading the other direction, we have g(6) = 0 ? g(0) mod 10 ≡ 0; again, the 0 in the subtraction is the
edge value. Continuing to node 8, g(8) = 6 ? g(6) mod 10 ≡ 6. Node 7, g(7) = 7 ? g(8) mod 10 ≡ 1.
And node 19, g(19) = 4 ? g(6) mod 10 ≡ 4.
This completes the first connected subgraph. Let’s move to the next subgraph. The lowest-numbered,
unvisited node is node 5. We assign g(5) = 0, and proceed as above. Node 16, g(16) = 3 ? g(5)
mod 10 ≡ 3. Node 20, g(20) = 5 ? g(16) mod 10 ≡ 2. Node 18, g(18) = 9 ? g(20) mod 10 ≡ 7. And
node 10 completes this subgraph, g(10) = 2 ? g(18) mod 10 ≡ 5.
There is one more subgraph, containing nodes 14 and 15. Start with the smaller, g(14) = 0, and
g(15) = 1 ? g(14) mod 10 ≡ 1.
And here is g:
index 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
g 0 0 0 0 0 0 0 1 6 8 5 0 0 0 0 1 3 0 7 4 2
Positions in this array that were not assigned in the last step do not matter. We choose to fill them in with
zeros.
As stated above, u(key) is the first “sum mod 21” in the final graph calculation table above, and v(key)
is the second such column. Our hash function is given by:
HASH(key) = (g (u (key)) + g (v (key))) mod 10
thus:
HASH(“seven”) = g(8) + g(6) mod 10
= 6 + 0 mod 10
= 6
which is the seventh element, since arrays are zero indexed; and
HASH(“three”) = g(18) + g(10) mod 10
= 7 + 5 mod 10
= 12 mod 10
= 2