有没有高手帮忙翻译下,关于哈希表的
III. DESCRIPTION OF THE SIGNATURE ARRAY HASH TABLEHash tables with linked-list chains can be used to store
string signature values (where signatures are the hashes of the
strings). Hash tables support easy addition and removal of
values. However, hash tables require a significant amount of
memory due to pointers that are stored with each value in the
hash table chains. To eliminate the memory needed for
pointers, we have the idea to store hash chains in blocks in a
signature array and use another array to contain offsets to the
blocks. The actual values stored are signatures of the strings.
Thus, a false positive will occur if an inString hashes to the
same offset location (i.e., chain) and same signature value as a
non-matching stored string.
Figure 2 shows the two key algorithms for our Signature
Array Hash Table (SAHT). The function mapStrings() maps
m strings into the SAHT and uses a hash function to generate
hash values, hash1 and hash2, for each string. On exit, two
arrays are defined and comprise the SAHT:
• ptrArray consisting of m <sigOffset, blockLen> pairs
• sigArray consisting of m stringSig values
where sigOffset is the offset into sigArray for a chain with
index hash1 modulo m and blockLen is the length of the chain
stored in this block. A temporary array, tempArray, is defined
within mapStrings(). For a 32-bit machine, we can define
hash1 and hash2 to be 32 bits, sigOffset to be 24 bits, chainLen
to be 8 bits, and stringSig to be 16 or 32 bits. The values of
sigOffset and chainLen are derived from hash1 and hash2 in
the function. A 24-bit sigOffset value limits m to be less than
224 = 16,777,216 (i.e., no more than 16,777,216 strings can be
mapped into the SAHT with the variables defined in size as
they are here). The arrays ptrArray and sigArray are used in
testString() to test for membership of an input string.
The probability of false positive of the SAHT can be
computed as a balls-and-bins problem where the number of
balls and bins are equal. For m strings and b bits in stringSig,
[ ] Σ=
−
⎟ ⎟
⎠
⎞
⎜ ⎜
⎝
⎛
⎟ ⎟
⎠
⎞
⎜ ⎜
⎝
⎛
⎟⎠
⎞
⎜⎝
− ⎛ − ⎟⎠
⎞
⎜⎝
⎛ − ⎟⎠
⎞
⎜⎝
⎛
⎟ ⎟⎠
⎞
⎜ ⎜⎝
⎛
=
m
i
i
b
i m i
i m m
m
1 2
Pr false positive 1 1 1 1 1 1 (2)
where the first binomial expression is the probability of a
i = 1, 2,…, m strings mapping to a given block in sigArray and
the second expression is the probability of the string having a
unique stringSig in the chain at the given block and assumes
that there may already be duplicate stringSig values in a chain.
Equation (2) cannot be simplified into a closed form. However,
if we assume that each chain contains only non-duplicated
values, then the second expression is i 2b . The modified
equation (2) then simplifies to,
[ ] 2b
Pr false positive = 1 . (3)
In practice, we find that equations (2) and (3) compute to
numerical values that are very close to the same.
We solve for the mean number of look-ups for the SAHT
for the case of a miss and hit. Let U be a random variable for
the number of look-ups. Then the expected value,
[ ] 1 1 1 1 1 1 2
1
= + = ⎟
⎟
⎠
⎞
⎜ ⎜
⎝
⎛
⎟⎠
⎞
⎜⎝
⎛ − ⎟⎠
⎞
⎜⎝
⎛
⎟ ⎟⎠
⎞
⎜ ⎜⎝
⎛
+ = Σ=
m −
i
i m i
miss i m m
m
E U i (4)
where the initial “1” is for the look-up in the ptrTable and the
summation is the mean chain length. For Ehit [U] we condition
on the probability of a block (or bin) being non-empty. Thus,
( ) 1 Pr[chain length 0]
[ ] 1 1
− =
Ehit U = + (5)
where
m e
m Pr[chain length 0] 1 1 ≈ 1 ⎟⎠
⎞
⎜⎝
= = ⎛ − (6)
and thus, Ehit [U] = 2.582 for large m.
In the SAHT, a mapped string can be removed by setting its
mapped sigArray value to zero (this assumes that the hash
function always returns a non-zero value). A new string can be
added at a somewhat greater cost; sigArray needs to be shifteddown
to create a space for a newly mapped-in stringSig. The
algorithms removeString() and addString() are not shown due
to space restrictions.
VI. RELATED WORK
A counting Bloom filter was first proposed in [2] as a
means of supporting element removal. A counting Bloom filter
uses multiple bits (typically 4) for each location. Thus, a 4-bit
counting Bloom filter requires 4x the amount of memory. To
reduce the amount of time spent in hashing in a Bloom filter,
the use of a Random Number Generator (RNG) has been
explored [4]. In this approach, one hash is made and then this
hash value is used to seed the RNG to generate the k −1
remaining “fake” hash values. This approach will result in
higher false positive rates when the initial hash values are the
same for different strings.
Our work is not the first to propose the use of an array to
store hash chains to eliminate the need for pointers. In [5] the
authors used a sparse array to save space by allocating space
only for locations in use. Space is allocated as signatures are
inserted. Signatures are pointed to by a second array used as a
reference. However, it is not clear which data structure they
used to implement it. Our work uses a dense array instead,
which allocates space for the expected size of the set. We do
not waste space since we know the set size from the beginning.