上一節(jié)對(duì)global_State
做了子模塊劃分摔癣,回顧一下global_State
整理后的代碼:
typedef struct global_State {
// Version
const lua_Number *version; /* pointer to version number */
// Hash
unsigned int seed; /* randomized seed for hashes */
// Global Registry
TValue l_registry;
// Global String table
stringtable strt; /* hash table for strings */
Mbuffer buff; /* temporary buffer for string concatenation */
// Global Meta table
TString *tmname[TM_N]; /* array with tag-method names */
struct Table *mt[LUA_NUMTAGS]; /* metatables for basic types */
// Global Thread list
struct lua_State *mainthread;
struct lua_State *twups; /* list of threads with open upvalues */
// Memory Allocator
lua_Alloc frealloc; /* function to reallocate memory */
void *ud; /* auxiliary data to 'frealloc' */
// GC Info
GCInfo *gcinfo;
// Error Recover
lua_CFunction panic; /* to be called in unprotected errors */
}
加深一下印象晃听,也就是說global_State
內(nèi)部包含了下面這些重要部分:
- Version:pointer to version number
- Hash:randomized seed for hashes
- Global Registry
- Global String table
- Global Meta table
- Global Thread list
- Memory Allocator
- GC Info
- Error Recover
其中差凹,Version指向了版本指針,通常我們發(fā)布程序版本號(hào)都是外置的颗味,Lua把版本號(hào)內(nèi)置在全局狀態(tài)里舔亭,使得可以在運(yùn)行時(shí)動(dòng)態(tài)判斷自己的版本號(hào)脖律。
而谢肾,Hash部分的seed則是用在散列表里面計(jì)算哈希時(shí)的種子,在Lua的代碼里可以找到這段代碼:
//lstring.c
unsigned int luaS_hash (const char *str, size_t l, unsigned int seed) {
unsigned int h = seed ^ cast_uint(l);
size_t step = (l >> LUAI_HASHLIMIT) + 1;
for (; l >= step; l -= step)
h ^= ((h<<5) + (h>>2) + cast_byte(str[l - 1]));
return h;
}
第1眼看上去luaS_hash
的代碼沒什么道理可言小泉,其實(shí)一步步展開還是能理解的芦疏。首先,哈希函數(shù)有兩類:
- 加密用的哈希函數(shù)微姊,例如SHA-256, SHA-512, MD5, RIPEMD-160等
- 非加密用的哈希函數(shù)
luaS_hash
即屬于非加密用的哈希函數(shù)酸茴,在散列表里的主要作用是:
- 把輸入字符串隨機(jī)映射到散列表的索引范圍[0, len-1]內(nèi)
- 哈希應(yīng)該在[0,len-1]內(nèi)盡量符合均勻分布(Unifrom),也就是合法的輸入被映射到[0,len-1]中任何一個(gè)地址的概率是相等的兢交,這就使得輸入經(jīng)過哈希后可以得到一個(gè)“隨機(jī)的地址”薪捍,以減少碰撞。
可見配喳,它本質(zhì)上是一種偽隨機(jī)數(shù)生成器酪穿,而我們知道偽隨機(jī)數(shù)生成器最古典的就是線性同余方法(wiki:Linear congruential generator(LCG)),它的算法如下:
LCG算出來的偽隨機(jī)數(shù)序列在模M之后就是一個(gè)周期整數(shù)序列界逛,周期的大小決定了碰撞的概率昆稿。LCG的周期最大是M,但大部分情況都會(huì)少于M息拜,如果要讓LCG達(dá)到最大周期,應(yīng)該要符合以下Hull–Dobell Theorem條件:
- B净响,M互質(zhì)
- M的所有質(zhì)因子都能整除A-1
- 若M是4的倍數(shù)少欺,A-1也是。
- A馋贤,B赞别,都比M小
- A,B都是正整數(shù)
LCG本身只是偽隨機(jī)數(shù)生成器配乓,需要滿足均勻分布以減少碰撞才能在散列表里面使用仿滔。因此一個(gè)非密碼學(xué)使用的哈希函數(shù)是否足夠好(good hash function),就要看它是否足夠均勻分布犹芹。
先看一個(gè)常見的Hash函數(shù)的實(shí)現(xiàn)崎页,叫做DJBX33A (Daniel J. Bernstein, Times 33 with Addition):
unsigned int DJBHash(char* str, unsigned int len){
unsigned int hash = 5381;
unsigned int i = 0;
for(i = 0; i < len; str++, i++) {
hash = ((hash << 5) + hash) + (*str);
}
return hash;
}
在for循環(huán)里做的hash = ((hash << 5) + hash) + (*str);
實(shí)際上就是hash = hash*33+str[i]
,就是LCG遞歸算法的一個(gè)中間步驟腰埂。對(duì)比一下公式飒焦,可以看到:
- =左邊的hash
- =右邊的hash
- A=33
- B=str[i],str是ASCII字符串屿笼,str[i]是8 bits牺荠,從而B是[0,256)范圍內(nèi)的整數(shù)翁巍。
- M=(或者 ,取決于int的位數(shù))
可以看到A休雌,B灶壶,M滿足如下的幾個(gè)性質(zhì):
- 如果B是奇數(shù),B和M互質(zhì)杈曲,至少一半的概率
- M的所有質(zhì)因子是2驰凛,它可以整除A-1=32
- M( 或者 )是4的倍數(shù),A-1=32也是4的倍數(shù)
- A鱼蝉,B洒嗤,都比M小
- A,B是正整數(shù)(除0外)魁亦。
因此DJBX33A這個(gè)哈希函數(shù)具有很好的最大周期性渔隶,從而盡可能減少碰撞。但這只解釋了一半原因洁奈,另一半原因是:
- A選擇33這個(gè)接近的數(shù)字间唉,可以充分利用計(jì)算機(jī)里面用shift-and-add的方式計(jì)算乘法,即:h^33 = h^32+h = (h<<5)+h
- 假設(shè)h是32位利术,則二進(jìn)制表達(dá)式為A1A2...A32呈野,其中Ai取0或者1,那么h<<5的二進(jìn)制表達(dá)式是A6A7...A3200000印叁,那么(h<<5)+h實(shí)際上把h的bit做了一次位移后再逐位和h自己相加被冒,得到的h^33的每一個(gè)bit就盡可能均勻地混合了原來h的每一個(gè)bit信息。
- 從這個(gè)角度來說轮蜕,如果h<<n里面的n太大昨悼,則會(huì)每次只混合了比較少的h原來bit的信息;如果n太小跃洛,則h<<n的每個(gè)bit離h的每個(gè)bit很近率触,這會(huì)需要更多的迭代才能讓h的32個(gè)bit混合均勻。而5和32互質(zhì)汇竭,也有利于多次迭代中讓h的每個(gè)bit有更大概率與輸出h的每個(gè)bit都有機(jī)會(huì)做混合葱蝗。
- 考慮B=str[i],str是ASCII字符串细燎,str[i]是8 bits两曼。ASCII的前4個(gè)bit都含有0x3。則:
- 迭代一次:h1=h0<<n+h0+str[i]
- 迭代二次:h2=h1<<n+h1+str[i+1] = h1'+str[i+1]
- 可以看到如果n是8找颓,h1'的低8位就是str[i]合愈,那么h1'和str[i+1]的[4,8]位之間就會(huì)有相同的值做混合,不利于增加信息。n取4和2也是一樣的問題佛析。而n取5則可以讓str[i]的低4位有更多機(jī)會(huì)和str[i+1]的高4位做混合益老。
上面這段解釋來自參考資料[2]的評(píng)論,而參考資料[3]的評(píng)論下則有一段對(duì)DJB里的這幾個(gè)魔數(shù)的來源做了另一番解釋:直接暴力計(jì)算對(duì)比寸莫,you can you up捺萌!
/*
* DJBX33A (Daniel J. Bernstein, Times 33 with Addition)
*
* This is Daniel J. Bernstein's popular `times 33' hash function as
* posted by him years ago on comp.lang.c. It basically uses a function
* like ``hash(i) = hash(i-1) * 33 + str[i]''. This is one of the best
* known hash functions for strings. Because it is both computed very
* fast and distributes very well.
*
* The magic of number 33, i.e. why it works better than many other
* constants, prime or not, has never been adequately explained by
* anyone. So I try an explanation: if one experimentally tests all
* multipliers between 1 and 256 (as RSE did now) one detects that even
* numbers are not useable at all. The remaining 128 odd numbers
* (except for the number 1) work more or less all equally well. They
* all distribute in an acceptable way and this way fill a hash table
* with an average percent of approx. 86%.
*
* If one compares the Chi^2 values of the variants, the number 33 not
* even has the best value. But the number 33 and a few other equally
* good numbers like 17, 31, 63, 127 and 129 have nevertheless a great
* advantage to the remaining numbers in the large set of possible
* multipliers: their multiply operation can be replaced by a faster
* operation based on just one shift plus either a single addition
* or subtraction operation. And because a hash function has to both
* distribute good _and_ has to be very fast to compute, those few
* numbers should be preferred and seems to be the reason why Daniel J.
* Bernstein also preferred it.
*
*
* -- Ralf S. Engelschall <rse@engelschall.com>
*/
除了DJBX33A哈希函數(shù)外,下面兩個(gè)著名的哈希函數(shù)也是類似的做法:
- wiki:Bob Jenkin's hash function
- wiki:Fowler–Noll–Vo hash function
- 參考資料[4]上列出了一共16種不同的非加密用的哈希函數(shù)
回到luaS_hash
膘茎,與DJBX33A哈希函數(shù)是類似的桃纯,其中:
- 通過隨機(jī)種子
unsigned int h = seed ^ cast_uint(l);
來防御哈希碰撞攻擊 - 通過
LUAI_HASHLIMIT
控制step
,通過控制step可以控制哈希計(jì)算的速度 -
(h<<5)
與(h>>2)
做混合
可以看到seed正是使用global_State里的seed披坏,下面的代碼驗(yàn)證了這點(diǎn):
static TString *internshrstr (lua_State *L, const char *str, size_t l) {
TString *ts;
global_State *g = G(L);
stringtable *tb = &g->strt;
unsigned int h = luaS_hash(str, l, g->seed);
TString **list = &tb->hash[lmod(h, tb->size)];
...
}
而global_State->seed的初始化如下:
LUA_API lua_State *lua_newstate (lua_Alloc f, void *ud) {
...
g->seed = luai_makeseed(L);
...
}
進(jìn)一步态坦,我們看下luai_makeseed(L);
的代碼:
static unsigned int luai_makeseed (lua_State *L) {
char buff[3 * sizeof(size_t)];
unsigned int h = cast_uint(time(NULL));
int p = 0;
addbuff(buff, p, L); /* heap variable */
addbuff(buff, p, &h); /* local variable */
addbuff(buff, p, &lua_newstate); /* public function */
lua_assert(p == sizeof(buff));
return luaS_hash(buff, p, h);
}
這就是隨機(jī)種子初始化的地方,可以看到最后一句調(diào)用了luaS_hash來遞歸的生成種子本身棒拂,而buff
,p
,h
就是初始值伞梯,其中h
就是“生成隨機(jī)種子的種子”,分拆下:
- 字符串
buff
包含了heap variable
,local variable
,address of lua_newstate
三種信息 - p就是buff的長(zhǎng)度
- 而種子h使用時(shí)間函數(shù)
time
來初始化帚屉,實(shí)際上用戶可以在 luaconf.h 中配置 luai_makeseed 定義自己的隨機(jī)方法谜诫。
待續(xù)
至此,我們分解清楚了global_State->seed的作用攻旦,以及l(fā)uaS_hash函數(shù)背后的原理喻旷。下一次,我們會(huì)繼續(xù)探索global_State-> l_registry牢屋。對(duì)技術(shù)背后原理的好奇且预,是前進(jìn)的最大動(dòng)力。
[1] https://en.wikipedia.org/wiki/Linear_congruential_generator
[2] https://stackoverflow.com/questions/1579721/why-are-5381-and-33-so-important-in-the-djb2-algorithm
[3] https://stackoverflow.com/questions/10696223/reason-for-5381-number-in-djb-hash-function/13809282#13809282
[4] https://en.wikipedia.org/wiki/List_of_hash_functions#Non-cryptographic_hash_functions