redis字典實現(xiàn)

一众辨、原理

1.1 數(shù)組+鏈表的底層數(shù)據(jù)結構

字典是用hash實現(xiàn)的煤蚌，通過數(shù)組+鏈表解決hash沖突耕挨。通過hash函數(shù)計算出key的index后，在對應的鏈表上進行數(shù)據(jù)的增刪查尉桩。

字典的hash實現(xiàn)

1.2 漸進式rehash

rehash含義：當dict中的數(shù)據(jù)越來越多時俗孝，數(shù)組中的鏈表會越來越長，dict的增刪改效率越來越低魄健，因此需要對數(shù)組進行擴容赋铝，這就是rehash。
漸進式rehash的含義：當數(shù)組擴容時沽瘦，需要進行一次數(shù)據(jù)遷移革骨。在觸發(fā)擴容時农尖，并不執(zhí)行數(shù)據(jù)遷移，而是在后續(xù)的每次增刪改等操作中良哲，遷移一部分數(shù)據(jù)盛卡。
為什么漸進式rehash：當dict很大時，擴容涉及的數(shù)據(jù)非常的多筑凫，如果一次遷移完會導致服務器卡頓滑沧。

二、數(shù)據(jù)結構定義

2.1 dict

dictType是hash相關的處理函數(shù)巍实，C的多態(tài)實現(xiàn)滓技，在運行時確定需要調用的函數(shù)。
privdata棚潦，在dictType中的處理函數(shù)里會使用該數(shù)據(jù)令漂。
ht[2]。實際的數(shù)據(jù)存儲丸边，非rehash情況下叠必，使用ht[0]存儲數(shù)據(jù)，rehash過程中妹窖，ht[0]存老的數(shù)據(jù)纬朝，ht[1]存新的數(shù)據(jù)。
rehashidx骄呼。rehash的進度玄组，-1表示不在rehash，其它值表示當前rehash的進度谒麦，數(shù)據(jù)遷移是遍歷ht[0]把數(shù)據(jù)遷移到ht[1]中俄讹，rehashidx記錄了當前遍歷ht[0]的下標。
iterators表示指向該dict的安全迭代器的數(shù)量绕德。

typedef struct dict {
    dictType *type;
    void *privdata;
    dictht ht[2];
    long rehashidx; /* rehashing not in progress if rehashidx == -1 */
    unsigned long iterators; /* number of iterators currently running */
} dict;

2.2 dictType

dict綁定的處理函數(shù)患膛，主要有hash函數(shù)，key比較函數(shù)耻蛇，key-value的拷貝和析構函數(shù)踪蹬。

typedef struct dictType {
    uint64_t (*hashFunction)(const void *key);
    void *(*keyDup)(void *privdata, const void *key);
    void *(*valDup)(void *privdata, const void *obj);
    int (*keyCompare)(void *privdata, const void *key1, const void *key2);
    void (*keyDestructor)(void *privdata, void *key);
    void (*valDestructor)(void *privdata, void *obj);
} dictType;

2.3 dictht

實際存儲key-value的結構。

typedef struct dictht {
    dictEntry **table; // 存儲數(shù)據(jù)的數(shù)組
    unsigned long size; // 數(shù)組大小
    unsigned long sizemask; // size為2^n  sizemask則為size - 1
    unsigned long used; // 哈希表中k-v的數(shù)量
} dictht;

2.4 dictEntry

字典的元存儲數(shù)據(jù)結構臣咖，k-v類型跃捣，包含存儲的鍵值對和數(shù)組中對應的鏈表的下一個元素。

typedef struct dictEntry {
    void *key; // key
    union {
        void *val;
        uint64_t u64;
        int64_t s64;
        double d;
    } v; // val
    struct dictEntry *next; // 指向下一個元素
} dictEntry;

2.5 dictIterator

dict的迭代器夺蛇，迭代的過程可以從上面的dict結構大致可以看出疚漆。2層嵌套循環(huán)，外層遍歷數(shù)組，內(nèi)層遍歷數(shù)組元素對應的鏈表娶聘。

typedef struct dictIterator {
    dict *d;  // 迭代器對應的字典
    long index; // 遍歷的hash表對應的數(shù)組下標
    int table, safe; // table對應dict->ht[2]的下標 如果是正在rehash這2個表中都會有數(shù)據(jù)
// safe=1表示該迭代器可以在迭代中間進行增刪改等操作 反之則不行
    dictEntry *entry, *nextEntry; // entry表示當前迭代器指向的元素 nextEntry表示
    /* unsafe iterator fingerprint for misuse detection. */
    long long fingerprint; // 對于unsafe的迭代器 遍歷前根據(jù)dict的值計算一個fingerprint 
    // 當?shù)瓿珊?比較當前的fingerprint和迭代前的值是否相同
    // 不同則說明在迭代過程中有增刪操作 迭代器使用不合法
} dictIterator;

三闻镶、疑問和答案

3.1 key-value的存儲格式

key為void*類型，只有起始地址丸升。類型和長度由上層業(yè)務控制铆农。可能類型有是int狡耻，long墩剖，double，sds等類型夷狰。
value可以是int64, uint64, double等內(nèi)置類型岭皂，也可以是void*。需要業(yè)務去解析內(nèi)存的類型和長度孵淘。

3.2 擴容判斷算法

static unsigned int dict_force_resize_ratio = 5;
#define DICT_HT_INITIAL_SIZE     4
static int _dictExpandIfNeeded(dict *d)
{
    if (dictIsRehashing(d)) return DICT_OK; // 如果已在rehash中 不再觸發(fā)

    if (d->ht[0].size == 0) return dictExpand(d, DICT_HT_INITIAL_SIZE); // 初始化大小為4

    if (d->ht[0].used >= d->ht[0].size &&
        (dict_can_resize ||
         d->ht[0].used/d->ht[0].size > dict_force_resize_ratio))
    {
        return dictExpand(d, d->ht[0].used*2); // 擴容時數(shù)組容量翻倍
    }
    return DICT_OK;
}

1.正常情況蒲障，hash表的的數(shù)據(jù)量大于數(shù)組長度時擴容歹篓。
2.特殊情況瘫证，hash表的數(shù)據(jù)量大于數(shù)組長度的5倍時，強制擴容庄撮。

3.3 擴容的容量計算

hash表的初始大小為4背捌。
每次擴容時，大小翻倍洞斯。

3.4 何時數(shù)據(jù)遷移

涉及的相關代碼較多毡庆，這里不貼源碼了，數(shù)據(jù)遷移調用函數(shù)_dictRehashStep烙如。通過在源碼中搜索該函數(shù)的調用可以發(fā)現(xiàn)么抗，在對dict進行增刪查改時，會觸發(fā)該函數(shù)亚铁。

3.5 如何數(shù)據(jù)遷移

數(shù)據(jù)遷移比較簡單蝇刀，每次最多找10個空的鏈表或者處理一個有數(shù)據(jù)的鏈表。

static void _dictRehashStep(dict *d) {
    if (d->iterators == 0) dictRehash(d,1);
}

int dictRehash(dict *d, int n) {
    int empty_visits = n*10; /* Max number of empty buckets to visit. */
    if (!dictIsRehashing(d)) return 0;

    while(n-- && d->ht[0].used != 0) {
        dictEntry *de, *nextde;

        assert(d->ht[0].size > (unsigned long)d->rehashidx);
        while(d->ht[0].table[d->rehashidx] == NULL) { // 遞增rehashidx找非空的鏈表 但是最多找  n*10個 找不到則返回
            d->rehashidx++;
            if (--empty_visits == 0) return 1;
        }
        de = d->ht[0].table[d->rehashidx];
        while(de) { // 找到數(shù)組中非空的鏈表后 遍歷鏈表將所有數(shù)據(jù)都進行遷移
            uint64_t h;

            nextde = de->next;
            /* Get the index in the new hash table */
            h = dictHashKey(d, de->key) & d->ht[1].sizemask;
            de->next = d->ht[1].table[h];
            d->ht[1].table[h] = de;
            d->ht[0].used--;
            d->ht[1].used++;
            de = nextde;
        }
        d->ht[0].table[d->rehashidx] = NULL;
        d->rehashidx++;
    }

    // 如果已經(jīng)遷移完 設置標志位并且啟用dict->ht[0] 棄用dict->ht[1]
    if (d->ht[0].used == 0) {
        zfree(d->ht[0].table);
        d->ht[0] = d->ht[1];
        _dictReset(&d->ht[1]);
        d->rehashidx = -1;
        return 0;
    }

    /* More to rehash... */
    return 1;
}

3.6 dict的釋放

釋放的實現(xiàn)也比較簡單徘溢，不看源碼也可以猜到無非是遍歷整個dict釋放所有元素吞琐，最后釋放dict自身。相關函數(shù)dictRelease然爆，有興趣可以看源碼站粟。

3.7 dict落地的序列化

這個在序列化模塊找答案，查看dict.h/c中的所有代碼后曾雕，未找到序列化相關的函數(shù)奴烙。

3.8 rehash過程中再次觸發(fā)擴容

有可能rehash過程中再次觸發(fā)擴容。
上面源碼中可以看到，觸發(fā)二次擴容會直接返回缸沃。

3.9 迭代器遍歷實現(xiàn)

上面已經(jīng)提到了恰起，迭代器分為安全和非安全2種。區(qū)別在于迭代過程中是否能進行增刪改趾牧。
非安全迭代會在開始前根據(jù)dict的數(shù)據(jù)生成一個fingerprint检盼，在結束后再次生成fingerprint，比較2個值是否相同即可判斷中間是否有增刪改翘单。
rehash過程中吨枉，2個哈希表都需要遍歷。

dictEntry *dictNext(dictIterator *iter)
{
    while (1) { // 只有遍歷完或者找到下一個非空的元素才退出循環(huán)
        if (iter->entry == NULL) {
            dictht *ht = &iter->d->ht[iter->table];
            if (iter->index == -1 && iter->table == 0) { // 迭代開始的特殊處理
                if (iter->safe)
                    iter->d->iterators++; // 表示dict當前有效的迭代器+1
                else
                    iter->fingerprint = dictFingerprint(iter->d); // 非安全模式的迭代器
            }
            iter->index++;
            if (iter->index >= (long) ht->size) { // 如果當前的hash表已遍歷完
                if (dictIsRehashing(iter->d) && iter->table == 0) { // 判斷是否在rehash中 并且當前是ht[0] 是則遍歷ht[1]
                    iter->table++;
                    iter->index = 0;
                    ht = &iter->d->ht[1];
                } else {
                    break; // 遍歷完成
                }
            }
            iter->entry = ht->table[iter->index];
        } else {
            iter->entry = iter->nextEntry;
        }
        if (iter->entry) { // 找到非空的元素
            /* We need to save the 'next' here, the iterator user
             * may delete the entry we are returning. */
            // 上面注釋解釋了為什么要保存nextEntry的原因 當entry被刪除時 nextEntry可以記錄遍歷的進度
            iter->nextEntry = iter->entry->next; 
            return iter->entry;
        }
    }
    return NULL;
}

3.10 rehash中的迭代器遍歷

從上面源碼可知哄芜，rehash過程的遍歷和普通的遍歷的差別僅在于遍歷完dict->ht[0]后需要遍歷dict->ht[1]貌亭。

3.11 迭代過程中元素的增刪

通過上面的源碼可以對安全模式下迭代器遍歷時有字典有數(shù)據(jù)的增刪的情況討論。

增加且增加到nextEntry之前认臊，新增的數(shù)據(jù)無法被遍歷到圃庭。
增加且增加到nextEntry之后，新增的數(shù)據(jù)在后續(xù)過程中會被遍歷到失晴。
刪除且刪除nextEntry之前的數(shù)據(jù)包括entry剧腻，不會對遍歷產(chǎn)生影響。
刪除nextEntry涂屁。會導致野指針的訪問书在，后續(xù)迭代訪問nextEntry時訪問的是野指針。
刪除nextEntry之后的數(shù)據(jù)拆又。不會對遍歷產(chǎn)生影響儒旬。

總結

漸進式rehash是個非常常見的情況，go中的map使用的也是漸進式rehash帖族。對于做服務器的而言栈源，應該把這個作為常識，一個操作的復雜度超過一定閾值時竖般，必須考慮分多次操作甚垦，不然有可能會引起服務器的卡頓。
非安全迭代器的校驗捻激。非安全迭代器在迭代前后會各生成一個fingerprint制轰，前后對比就能判斷中間是否有數(shù)據(jù)修改。這個想法很好胞谭，但是實現(xiàn)的方式有待商榷垃杖，源碼如下：

long long dictFingerprint(dict *d) {
    long long integers[6], hash = 0;
    int j;

    integers[0] = (long) d->ht[0].table;
    integers[1] = d->ht[0].size;
    integers[2] = d->ht[0].used;
    integers[3] = (long) d->ht[1].table;
    integers[4] = d->ht[1].size;
    integers[5] = d->ht[1].used;

    for (j = 0; j < 6; j++) {
        hash += integers[j];
        /* For the hashing step we use Tomas Wang's 64 bit integer hash. */
        hash = (~hash) + (hash << 21); // hash = (hash << 21) - hash - 1;
        hash = hash ^ (hash >> 24);
        hash = (hash + (hash << 3)) + (hash << 8); // hash * 265
        hash = hash ^ (hash >> 14);
        hash = (hash + (hash << 2)) + (hash << 4); // hash * 21
        hash = hash ^ (hash >> 28);
        hash = hash + (hash << 31);
    }
    return hash;
}

從上面的源碼可以看到，無論它使用的hash算法多好丈屹，只要size和used前后一致调俘，算出來的結果就是一樣的伶棒，也就意味著只要增加和刪除的操作是成對的，不管操作的是否是同一個元素彩库，最后dict被修改了也無法通過fingerprint看出來肤无。