Designing a Key-Value Store for a Search Engine
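1. Describe the use cases and constraints
Use cases:
- A user request either hits the cache or misses it
Assumptions and constraints:
- Traffic is unevenly distributed, so some keys are hot
- Lookups should be as fast as possible
- A cache eviction policy is needed
- 100 million users
- 10 billion queries per month on average
Capacity estimates: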
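Data structure: the key is the query and the value is the results:
- query: 50 bytes
- title: 20 bytes
- snippet: 200 bytes
- total: 270 bytes
If no query were ever repeated, that would be 2.7 TB of data per month (270 bytes × 10 billion queries).
Roughly 4,000 read requests per second.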
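The estimates above can be checked with a few lines of arithmetic. This is a back-of-envelope sketch; the 30-day month is an assumption, since the text does not say how the per-second figure was rounded.

```python
# Back-of-envelope capacity estimate using the figures from the text.
QUERY_BYTES = 50
TITLE_BYTES = 20
SNIPPET_BYTES = 200
QUERIES_PER_MONTH = 10_000_000_000  # 10 billion queries per month

# Assumption: a 30-day month.
SECONDS_PER_MONTH = 30 * 24 * 60 * 60

entry_bytes = QUERY_BYTES + TITLE_BYTES + SNIPPET_BYTES
monthly_bytes = entry_bytes * QUERIES_PER_MONTH
reads_per_second = QUERIES_PER_MONTH / SECONDS_PER_MONTH

print(entry_bytes)               # 270
print(monthly_bytes / 10**12)    # 2.7 (decimal terabytes)
print(round(reads_per_second))   # 3858, i.e. roughly 4,000
```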
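2. Create the system design diagram
3. Design the key components
Use case: the user's request is already in the cache
The cache can be Memcached or Redis, which reduces read load on the reverse index service and the document service. For the eviction policy, LRU (least recently used) is a reasonable choice.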
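- A query request goes through the following steps:
  - Parse the request
  - Tokenize it and correct spelling
  - Check the cache for a matching result
  - On a hit, move the entry to the front of the LRU list
    - Return the cached result
  - On a miss:
    - Call the reverse index service to find matching results
    - Query the document service for the titles and snippets
    - Store the result in the LRU cache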
An LRU cache can be implemented with a hash table and a doubly linked list:
class Node(object):

    def __init__(self, query, results):
        self.query = query
        self.results = results


class LinkedList(object):

    def __init__(self):
        self.head = None
        self.tail = None

    def move_to_front(self, node):
        ...

    def append_to_front(self, node):
        ...

    def remove_from_tail(self):
        ...

class Cache(object):

    def __init__(self, MAX_SIZE):
        self.MAX_SIZE = MAX_SIZE
        self.size = 0
        self.lookup = {}  # key: query, value: node
        self.linked_list = LinkedList()

    def get(self, query):
        """Get the stored query result from the cache.

        Accessing a node updates its position to the front of the LRU list.
        """
        # dict.get returns None on a miss instead of raising KeyError
        node = self.lookup.get(query)
        if node is None:
            return None
        self.linked_list.move_to_front(node)
        return node.results

    def set(self, query, results):
        """Set the result for the given query key in the cache.

        When updating an entry, updates its position to the front of the LRU list.
        If the entry is new and the cache is at capacity, removes the oldest entry
        before the new entry is added.
        """
        node = self.lookup.get(query)
        if node is not None:
            # Key exists in cache, update the value
            node.results = results
            self.linked_list.move_to_front(node)
        else:
            # Key does not exist in cache
            if self.size == self.MAX_SIZE:
                # Remove the oldest entry from the linked list and lookup
                self.lookup.pop(self.linked_list.tail.query, None)
                self.linked_list.remove_from_tail()
            else:
                self.size += 1
            # Add the new key and value
            new_node = Node(query, results)
            self.linked_list.append_to_front(new_node)
            self.lookup[query] = new_node

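The listing above leaves the linked-list operations elided. As one possible way to fill them in, here is a minimal, self-contained sketch of the same hash table plus doubly linked list idea. The class name LRUCache, the helper methods _unlink and _push_front, and the use of sentinel head and tail nodes are choices made for this illustration, not part of the original design, and the sketch assumes max_size is at least 1.

```python
class Node:
    def __init__(self, query=None, results=None):
        self.query = query
        self.results = results
        self.prev = None
        self.next = None


class LRUCache:
    """Illustrative LRU cache: dict for O(1) lookup, linked list for recency order."""

    def __init__(self, max_size):
        self.max_size = max_size  # assumed to be >= 1
        self.lookup = {}  # key: query, value: node
        # Sentinel nodes: head.next is the most recently used entry,
        # tail.prev is the least recently used entry.
        self.head = Node()
        self.tail = Node()
        self.head.next = self.tail
        self.tail.prev = self.head

    def _unlink(self, node):
        node.prev.next = node.next
        node.next.prev = node.prev

    def _push_front(self, node):
        node.prev = self.head
        node.next = self.head.next
        self.head.next.prev = node
        self.head.next = node

    def get(self, query):
        node = self.lookup.get(query)
        if node is None:
            return None
        # A hit makes this entry the most recently used one.
        self._unlink(node)
        self._push_front(node)
        return node.results

    def set(self, query, results):
        node = self.lookup.get(query)
        if node is not None:
            node.results = results
            self._unlink(node)
            self._push_front(node)
            return
        if len(self.lookup) >= self.max_size:
            # Evict the least recently used entry.
            oldest = self.tail.prev
            self._unlink(oldest)
            del self.lookup[oldest.query]
        node = Node(query, results)
        self.lookup[query] = node
        self._push_front(node)


cache = LRUCache(max_size=2)
cache.set("a", 1)
cache.set("b", 2)
cache.get("a")         # "a" becomes the most recently used entry
cache.set("c", 3)      # cache is full, so "b" is evicted
print(cache.get("b"))  # None
print(cache.get("a"))  # 1
```

The sentinel nodes remove the empty-list and single-node special cases, so every insert and removal is the same four pointer updates.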
Query service:
class QueryApi(object):

    def __init__(self, memory_cache, reverse_index_service):
        self.memory_cache = memory_cache
        self.reverse_index_service = reverse_index_service

    def parse_query(self, query):
        """Remove markup, break text into terms, deal with typos,
        normalize capitalization, convert to use boolean operations.
        """
        ...

    def process_query(self, query):
        query = self.parse_query(query)
        # Try the cache first; fall back to the reverse index on a miss
        results = self.memory_cache.get(query)
        if results is None:
            results = self.reverse_index_service.process_search(query)
            self.memory_cache.set(query, results)
        return results

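To make the read path concrete, the following sketch runs the same check-cache-then-fall-back logic against stand-ins. DictCache and FakeIndexService are hypothetical helpers invented for this example, and the strip-and-lowercase step is only a placeholder for the real query parsing, which the source leaves unspecified.

```python
class DictCache:
    """Hypothetical stand-in for the LRU cache: same get/set interface, no eviction."""

    def __init__(self):
        self.data = {}

    def get(self, query):
        return self.data.get(query)

    def set(self, query, results):
        self.data[query] = results


class FakeIndexService:
    """Hypothetical backend that counts how often it is called."""

    def __init__(self):
        self.calls = 0

    def process_search(self, query):
        self.calls += 1
        return ["result for " + query]


def process_query(query, cache, index_service):
    query = query.strip().lower()  # placeholder for the real parsing step
    results = cache.get(query)
    if results is None:
        # Cache miss: ask the backend, then remember the answer.
        results = index_service.process_search(query)
        cache.set(query, results)
    return results


cache = DictCache()
index = FakeIndexService()
first = process_query("Redis Cluster", cache, index)
second = process_query("redis cluster ", cache, index)
print(first == second)  # True
print(index.calls)      # 1
```

Because both inputs normalize to the same key, only the first request reaches the backend. This is why normalizing the query before the cache lookup matters for the hit rate.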
The cache needs to be updated when:
- Page content changes
- Pages are added or removed
- Page ranking changes
4. Refine the design
For a distributed cache, Redis Cluster is a useful reference.
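As a rough illustration of how a distributed cache can spread keys across nodes, the sketch below follows the hash slot scheme documented for Redis Cluster: a CRC16 of the key modulo 16384. It uses Python's binascii.crc_hqx, which computes the same CRC-CCITT (XMODEM) variant. The three-node layout and the node names are assumptions made up for this example, and the sketch ignores Redis hash tags (the {...} syntax).

```python
import binascii

NUM_SLOTS = 16384  # Redis Cluster's fixed number of hash slots


def key_slot(key: str) -> int:
    """Map a key to a hash slot: CRC16 (XMODEM variant) modulo 16384."""
    return binascii.crc_hqx(key.encode("utf-8"), 0) % NUM_SLOTS


# Hypothetical layout: three nodes, each owning a contiguous slot range.
NODES = [
    ("cache-node-a", range(0, 5461)),
    ("cache-node-b", range(5461, 10923)),
    ("cache-node-c", range(10923, 16384)),
]


def node_for(key: str) -> str:
    """Return the name of the node that owns the key's slot."""
    slot = key_slot(key)
    for name, slots in NODES:
        if slot in slots:
            return name
    raise ValueError("slot %d is not assigned to any node" % slot)


for query in ["weather today", "python lru cache", "redis cluster"]:
    print(query, key_slot(query), node_for(query))
```

Because keys map to slots and slots map to nodes, adding a node only requires moving some slot ranges. The keys themselves do not need to be rehashed.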