概覽
最近在看redis源碼,發(fā)現(xiàn)redis采用了幾種不同的算法來計算Hash Code;因此打算借此整理下JDK中的實現(xiàn),加深理解;
Redis
Thomas Wang's 32 bit Mix Function
關(guān)于該算法的具體內(nèi)容运悲,可以參考這篇文章,算法源碼如下:
public int hash32shift(int key)
{
key = ~key + (key << 15); // key = (key << 15) - key - 1;
key = key ^ (key >>> 12);
key = key + (key << 2);
key = key ^ (key >>> 4);
key = key * 2057; // key = (key + (key << 3)) + (key << 11);
key = key ^ (key >>> 16);
return key;
}
可以看到它是計算int類型的hash code,返回結(jié)果也是int類型;另外它還提供了64bit的算法涩咖,有興趣的可以自行查閱:
根據(jù)Thomas Wang所描述,由于上述代碼可以利用CPU的native指令经柴,在HP 9000 workstations機器上只需要11個時鐘周期,速度很快墩朦;
比較費解這邊的常量值是怎么確定的坯认,無奈算法基礎(chǔ)比較差,如果有了解的同學(xué)氓涣,歡迎答疑牛哺;
Austin Appleby's MurmurHash2
MurmurHash是一種非加密型哈希函數(shù),由Austin Appleby在2008年發(fā)明劳吠,并且有多個變種;該算法的作者2011去了google,開發(fā)出來了新的算法CityHash引润;
unsigned int dictGenHashFunction(const void *key, int len) {
/* 'm' and 'r' are mixing constants generated offline.
They're not really 'magic', they just happen to work well. */
uint32_t seed = dict_hash_function_seed;
const uint32_t m = 0x5bd1e995;
const int r = 24;
/* Initialize the hash to a 'random' value */
uint32_t h = seed ^ len;
/* Mix 4 bytes at a time into the hash */
const unsigned char *data = (const unsigned char *)key;
while(len >= 4) {
uint32_t k = *(uint32_t*)data;
k *= m;
k ^= k >> r;
k *= m;
h *= m;
h ^= k;
data += 4;
len -= 4;
}
/* Handle the last few bytes of the input array */
switch(len) {
case 3: h ^= data[2] << 16;
case 2: h ^= data[1] << 8;
case 1: h ^= data[0]; h *= m;
};
/* Do a few final mixes of the hash to ensure the last few
* bytes are well-incorporated. */
h ^= h >> 13;
h *= m;
h ^= h >> 15;
return (unsigned int)h;
}
Murmur可以計算字符串的hash code,基本思想就是把key分成n組,每組4個字符痒玩,把這4個字符看成是一個uint_32淳附,進行n次運算,得到一個h蠢古,然會在對h進行處理奴曙,得到一個相對離散的hash code;
DJB Hash
unsigned int DJBHash(char *str)
{
unsigned int hash = 5381;
while (*str){
hash = ((hash << 5) + hash) + (*str++); /* times 33 */
}
hash &= ~(1 << 31); /* strip the highest bit */
return hash;
}
Redis對其進行了部分調(diào)整,不區(qū)分大小寫:
unsigned int dictGenCaseHashFunction(const unsigned char *buf, int len) {
unsigned int hash = (unsigned int)dict_hash_function_seed;
while (len--)
hash = ((hash << 5) + hash) + (tolower(*buf++)); /* hash * 33 + c */
return hash;
}
JDK
Austin Appleby's MurmurHash3
從JDK7開始草讶,JDK引入了一種新的計算Hash code的算法洽糟,可以在如下的集合對象中使用:
- HashMap
- Hashtable
- HashSet
- LinkedHashMap
- LinkedHashSet
- WeakHashMap
- ConcurrentHashMap
final int hash(Object k) {
int h = hashSeed;
if (0 != h && k instanceof String) {
return sun.misc.Hashing.stringHash32((String) k);
}
h ^= k.hashCode();
// This function ensures that hashCodes that differ only by
// constant multiples at each bit position have a bounded
// number of collisions (approximately 8 at default load factor).
h ^= (h >>> 20) ^ (h >>> 12);
return h ^ (h >>> 7) ^ (h >>> 4);
}
可以看到只有key為字符類型,而且hashSeed不為0時才采用新的哈希算法堕战;
sun.misc.Hashing.stringHash32實際上調(diào)用的是String.hash32方法:
int hash32() {
int h = hash32;
//h==0表示未計算過hash code
//可以看到hash code只會計算一次
if (0 == h) {
//HASHING_SEED是根據(jù)時間戳等計算出來的隨機數(shù)
h = sun.misc.Hashing.murmur3_32(HASHING_SEED, value, 0, value.length);
//確保結(jié)果非0坤溃,避免重新計算
h = (0 != h) ? h : 1;
hash32 = h;
}
return h;
}
可以看到最終調(diào)用的是MurmurHash3算法,那么什么情況下hashSeed不為0呢嘱丢?
final boolean initHashSeedAsNeeded(int capacity) {
boolean currentAltHashing = hashSeed != 0;
boolean useAltHashing = sun.misc.VM.isBooted() &&
(capacity >= Holder.ALTERNATIVE_HASHING_THRESHOLD);
boolean switching = currentAltHashing ^ useAltHashing;
if (switching) {
hashSeed = useAltHashing
? sun.misc.Hashing.randomHashSeed(this)
: 0;
}
return switching;
}
可以看到是否采用新算法和Holder.ALTERNATIVE_HASHING_THRESHOLD有關(guān)薪介;實際上JDK定義了一個新的屬性jdk.map.althashing.threshold(默認值-1),只有當(dāng)系統(tǒng)容量大于該屬性值時越驻,才會采用新的算法汁政;因此可以將通過-Djdk.map.althashing.threshold=0設(shè)置該屬性為0道偷,啟用新的哈希函數(shù);
String.hashCode
public int hashCode() {
int h = hash;
if (h == 0 && value.length > 0) {
char val[] = value;
for (int i = 0; i < value.length; i++) {
h = 31 * h + val[i];
}
hash = h;
}
return h;
}
Object.hashCode
之前曾經(jīng)在Java對象結(jié)構(gòu) 中介紹了Java的MarkWord,其中記錄了對象的hashCode;如果對象計算過hashCode烂完,則會存儲到對象頭中;如果是第一次計算則是調(diào)用
static inline intptr_t get_next_hash(Thread * Self, oop obj) {
intptr_t value = 0 ;
if (hashCode == 0) {
// This form uses an unguarded global Park-Miller RNG,
// so it's possible for two threads to race and generate the same RNG.
// On MP system we'll have lots of RW access to a global, so the
// mechanism induces lots of coherency traffic.
value = os::random() ;
} else
if (hashCode == 1) {
// This variation has the property of being stable (idempotent)
// between STW operations. This can be useful in some of the 1-0
// synchronization schemes.
intptr_t addrBits = intptr_t(obj) >> 3 ;
value = addrBits ^ (addrBits >> 5) ^ GVars.stwRandom ;
} else
if (hashCode == 2) {
value = 1 ; // for sensitivity testing
} else
if (hashCode == 3) {
value = ++GVars.hcSequence ;
} else
if (hashCode == 4) {
value = intptr_t(obj) ;
} else {
// Marsaglia's xor-shift scheme with thread-specific state
// This is probably the best overall implementation -- we'll
// likely make this the default in future releases.
unsigned t = Self->_hashStateX ;
t ^= (t << 11) ;
Self->_hashStateX = Self->_hashStateY ;
Self->_hashStateY = Self->_hashStateZ ;
Self->_hashStateZ = Self->_hashStateW ;
unsigned v = Self->_hashStateW ;
v = (v ^ (v >> 19)) ^ (t ^ (t >> 8)) ;
Self->_hashStateW = v ;
value = v ;
}
value &= markOopDesc::hash_mask;
if (value == 0) value = 0xBAD ;
assert (value != markOopDesc::no_hash, "invariant") ;
TEVENT (hashCode: GENERATE) ;
return value;
}
可以看到诵棵,這邊提供了多種不同的實現(xiàn)抠蚣,具體采用哪種,取決于hashCode的值履澳,在Linux機器上,hashCode默認為0:
因此是通過os::random計算hashCode,可以參考Java對象結(jié)構(gòu) 嘶窄;
其它
Character、Integer距贷、Long柄冲、Double的hashCode方法返回包裝的基本類型;