Using Faiss

Add faiss to the Python path:

vim ~/anaconda2/lib/python2.7/site-packages/my.pth
Add the path /home/bi_tag/faiss/faiss to that file
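
If editing the .pth file is inconvenient, the path can also be appended at runtime. A minimal sketch (the /home/bi_tag/faiss/faiss path is this machine's build location and will differ elsewhere):

import sys
sys.path.append('/home/bi_tag/faiss/faiss')   # machine-specific build path
import faiss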
See the official wiki: https://github.com/facebookresearch/faiss/wiki/Getting-started
import numpy as np
import faiss                   # make faiss available

# construct the data
import time
d = 64                           # dimension
nb = 100000                      # database size
nq = 10000                       # nb of queries
np.random.seed(1234)             # make reproducible
xb = np.random.random((nb, d)).astype('float32')
xb[:, 0] += np.arange(nb) / 1000.
xq = np.random.random((nq, d)).astype('float32')
xq[:, 0] += np.arange(nq) / 1000.
  • IndexFlatL2
# Build an IndexFlatL2 index for the vector set: the simplest index type, which performs brute-force L2 distance search
%time index = faiss.IndexFlatL2(d)   # build the index
print(index.is_trained)
index.add(xb)                  # add vectors to the index
print(index.ntotal)
k = 4   # we want to see 4 nearest neighbors
%time D, I = index.search(xb[:5], k) # sanity check
print(I)     # indices of the nearest vectors
print(D)     # corresponding squared L2 distances
%time D, I = index.search(xq, k)     # actual search
print(I[:5])                   # neighbors of the 5 first queries
print(I[-5:]) # neighbors of the 5 last queries
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 80.1 μs
True
100000
CPU times: user 197 ms, sys: 7 ms, total: 204 ms
Wall time: 4.86 ms
[[  0 393 363  78]
 [  1 555 277 364]
 [  2 304 101  13]
 [  3 173  18 182]
 [  4 288 370 531]]
[[ 0.          7.17517376  7.20763016  7.25116253]
 [ 0.          6.32356453  6.6845808   6.79994583]
 [ 0.          5.79640865  6.39173603  7.28151226]
 [ 0.          7.27790546  7.527987    7.66284657]
 [ 0.          6.76380348  7.29512024  7.36881447]]
CPU times: user 26.7 s, sys: 3.66 s, total: 30.3 s
Wall time: 683 ms
[[ 381  207  210  477]
 [ 526  911  142   72]
 [ 838  527 1290  425]
 [ 196  184  164  359]
 [ 526  377  120  425]]
[[ 9900 10500  9309  9831]
 [11055 10895 10812 11321]
 [11353 11103 10164  9787]
 [10571 10664 10632  9638]
 [ 9628  9554 10036  9582]]
  • IndexIVFFlat (with clustering)
# https://github.com/facebookresearch/faiss/wiki/Faster-search
# speeding up the search

'''
To speed up the search, it is possible to segment the dataset into pieces.
We define Voronoi cells in the d-dimensional space, and each database vector
falls in one of the cells. At search time, only the database vectors y
contained in the cell the query x falls in, and a few neighboring ones,
are compared against the query vector.
'''

# This is done with the IndexIVFFlat index: it partitions the dataset by
# defining Voronoi cells in the d-dimensional space, and only computes
# distances between x and the vectors y that fall in the same cell
'''
This is done via the IndexIVFFlat index. This type of index requires a
training stage, that can be performed on any collection of vectors that has
the same distribution as the database vectors. In this case we just use the
database vectors themselves.
'''
# requires pre-training
# the search method has two parameters: nlist (the number of cells) and
# nprobe (the number of cells visited to perform a search)
'''
There are two parameters to the search method: nlist, the number of cells,
and nprobe, the number of cells (out of nlist) that are visited to perform a
search. The search time roughly increases linearly with the number of probes
plus some constant due to the quantization.

The nprobe parameter is always a way of adjusting the tradeoff between speed
and accuracy of the result. Setting nprobe = nlist gives the same result as
the brute-force search (but slower).
'''
# nprobe trades off accuracy against search time (see the sweep sketch after
# the outputs below)

nlist = 100 # number of cells
k = 4
quantizer = faiss.IndexFlatL2(d)  # the coarse quantizer; d is the vector dimension
%time index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_L2)
# here we specify METRIC_L2, by default it performs inner-product search
assert not index.is_trained
%time index.train(xb)
assert index.is_trained
%time index.add(xb)                  # add may be a bit slower as well
%timeit D, I = index.search(xq, k)     # actual search (note: %timeit runs in its own scope, so D, I are not updated here)
print(I[-5:])                  # neighbors of the 5 last queries
index.nprobe = 10              # number of cells (out of nlist) visited during the search; default nprobe is 1, try a few more
%time D, I = index.search(xq, k)
print(I[-5:]) # neighbors of the 5 last queries
index.search??
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 111 μs
CPU times: user 1.51 s, sys: 33 ms, total: 1.54 s
Wall time: 62.9 ms
CPU times: user 1.02 s, sys: 26 ms, total: 1.05 s
Wall time: 23.9 ms
100 loops, best of 3: 16.3 ms per loop
[[ 9900 10500  9309  9831]
 [11055 10895 10812 11321]
 [11353 11103 10164  9787]
 [10571 10664 10632  9638]
 [ 9628  9554 10036  9582]]
CPU times: user 4.08 s, sys: 50 ms, total: 4.13 s
Wall time: 92.2 ms
[[ 9900 10500  9309  9831]
 [11055 10895 10812 11321]
 [11353 11103 10164  9787]
 [10571 10664 10632  9638]
 [ 9628  9554 10036  9582]]
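
To make the nprobe speed/accuracy tradeoff concrete, here is a hedged sketch that sweeps nprobe on the IVF index above and measures recall@1 against the exact flat-index result (time was imported at the top):

flat = faiss.IndexFlatL2(d)                  # exact reference index
flat.add(xb)
_, gt_I = flat.search(xq, 1)                 # ground-truth nearest neighbors
for nprobe in [1, 5, 10, 50, 100]:
    index.nprobe = nprobe
    t0 = time.time()
    _, I = index.search(xq, 1)
    recall = (I[:, 0] == gt_I[:, 0]).mean()
    print(nprobe, recall, time.time() - t0)  # nprobe = nlist should reach recall 1.0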
# https://github.com/facebookresearch/faiss/wiki/Lower-memory-footprint
# lossy compressed storage
'''
The indexes we have seen, IndexFlatL2 and IndexIVFFlat, both store the full
vectors. To scale up to very large datasets, Faiss offers variants that
compress the stored vectors with a lossy compression based on product
quantizers.

The vectors are still stored in Voronoi cells, but their size is reduced to
a configurable number of bytes m (d must be a multiple of m).
'''
# m is the compression parameter; the vector dimension d must be a multiple of m
'''
The compression is based on a Product Quantizer, that can be seen as an
additional level of quantization, that is applied on sub-vectors of the
vectors to encode.

In this case, since the vectors are not stored exactly, the distances that
are returned by the search method are also approximations.
'''
# the returned results are approximate; the compression mapping is based on a Product Quantizer

'''
They can be compared with the IVFFlat results above. For this case, most
results are wrong, but they are in the correct area of the space, as shown
by the IDs around 10000. The situation is better for real data because:

   uniform data is very difficult to index because there is no regularity
that can be exploited to cluster or reduce dimensionality

   for natural data, the semantic nearest neighbor is often significantly
closer than irrelevant results.
'''
# on real-world data distributions the approximate results are better

nlist = 100
m = 8   # number of sub-quantizers
k = 4
quantizer = faiss.IndexFlatL2(d)  # this remains the same
# To scale to very large datasets, Faiss offers variants that compress the
# stored vectors with lossy compression based on product quantization.
# This costs some precision: note that the distance of a vector to itself is
# no longer 0, which is due to the lossy compression.
%time index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)
faiss.IndexIVFPQ??
# 8 specifies that each sub-vector is encoded as 8 bits
%time index.train(xb)
%time index.add(xb)
%time D, I = index.search(xb[:5], k) # sanity check
print(I)
print(D)
index.nprobe = 10              # make comparable with experiment above
%time D, I = index.search(xq, k)     # search
print(I[-5:])
# shorthand: build the index from a string description
print "Simplifying index construction"
%time index = faiss.index_factory(d, "IVF100,PQ8")
%time index.train(xb)
%time index.add(xb)
index.nprobe = 10              # make comparable with experiment above
%time D, I = index.search(xq, k)     # search
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 144 μs
CPU times: user 1min 29s, sys: 2.12 s, total: 1min 31s
Wall time: 2.04 s
CPU times: user 6.87 s, sys: 113 ms, total: 6.98 s
Wall time: 152 ms
CPU times: user 11 ms, sys: 1e+03 μs, total: 12 ms
Wall time: 257 μs
[[   0  424  363  278]
 [   1  555 1063   24]
 [   2  304   46  346]
 [   3  773  182 1529]
 [   4  288  754  531]]
[[ 1.45568264  6.03136778  6.18729019  6.38852692]
 [ 1.4934082   5.74254704  6.19941282  6.21501732]
 [ 1.60279989  6.20174742  6.32792664  6.78541422]
 [ 1.69804895  6.2623148   6.26956797  6.56042767]
 [ 1.30235744  6.13624763  6.33899879  6.51442099]]
CPU times: user 3.82 s, sys: 69 ms, total: 3.89 s
Wall time: 84.8 ms
[[10664 10914  9922  9380]
 [10260  9014  9458 10310]
 [11291  9380 11103 10392]
 [10856 10284  9638 11276]
 [10304  9327 10152  9229]]
Simplifying index construction
CPU times: user 6 ms, sys: 1 ms, total: 7 ms
Wall time: 163 μs
CPU times: user 2min 27s, sys: 2.21 s, total: 2min 29s
Wall time: 8.72 s
CPU times: user 8.89 s, sys: 185 ms, total: 9.08 s
Wall time: 202 ms
CPU times: user 3.05 s, sys: 37 ms, total: 3.08 s
Wall time: 67.7 ms
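
To quantify how approximate the PQ results are, a small sketch that measures recall@4 against the exact neighbors (the reference flat index is rebuilt so the snippet is self-contained):

flat = faiss.IndexFlatL2(d)
flat.add(xb)
_, gt_I = flat.search(xq, 1)                   # true nearest neighbor of each query
_, I = index.search(xq, 4)
recall_at_4 = (I == gt_I).any(axis=1).mean()   # fraction of true NNs found in the top 4
print(recall_at_4)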
  • Using the GPU (not deployed; this server has no GPU)
# https://github.com/facebookresearch/faiss/wiki/Running-on-GPUs
# using GPUs
# single GPU
res = faiss.StandardGpuResources()  # use a single GPU
# build a flat (CPU) index
index_flat = faiss.IndexFlatL2(d)
# make it into a gpu index
gpu_index_flat = faiss.index_cpu_to_gpu(res, 0, index_flat)
gpu_index_flat.add(xb)         # add vectors to the index
print(gpu_index_flat.ntotal)

k = 4                          # we want to see 4 nearest neighbors
D, I = gpu_index_flat.search(xq, k)  # actual search
print(I[:5])                   # neighbors of the 5 first queries
print(I[-5:])  
# multiple GPUs
ngpus = faiss.get_num_gpus()
print("number of GPUs:", ngpus)
cpu_index = faiss.IndexFlatL2(d)
gpu_index = faiss.index_cpu_to_all_gpus(  # build the index
    cpu_index
)
gpu_index.add(xb)              # add vectors to the index
print(gpu_index.ntotal)
k = 4                          # we want to see 4 nearest neighbors
D, I = gpu_index.search(xq, k) # actual search
print(I[:5])                   # neighbors of the 5 first queries
print(I[-5:])                  # neighbors of the 5 last queries
0
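
Since this server has no GPU, a guard like the following keeps a script runnable on CPU-only machines (a minimal sketch using only the calls shown above):

index = faiss.IndexFlatL2(d)
if faiss.get_num_gpus() > 0:                  # only move the index to GPU when one exists
    res = faiss.StandardGpuResources()
    index = faiss.index_cpu_to_gpu(res, 0, index)
index.add(xb)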
# Faiss building blocks: clustering, PCA, quantization
# k-means clustering, PCA, PQ quantization

# k-means clustering
ncentroids = 1024
niter = 20
verbose = True
n = 20000
d = 32
x = np.random.rand(n, d).astype('float32')
d = x.shape[1]
kmeans = faiss.Kmeans(d, ncentroids, niter=niter, verbose=verbose)
kmeans.train(x)
# kmeans.centroids holds the cluster centroids
D, I = kmeans.index.search(x, 1)   # assign each point in x to its nearest centroid
index = faiss.IndexFlatL2(d)
index.add(x)
%time D, I = index.search(kmeans.centroids, 15)  # the 15 nearest points in x to each centroid
print(D[:5])                   # distances for the first 5 centroids
print(I[:5])                   # neighbors of the first 5 centroids
CPU times: user 544 ms, sys: 11 ms, total: 555 ms
Wall time: 12.1 ms
[[ 0.75993538  1.01302719  1.23967171  1.24554443  1.35987091  1.36220741
   1.3649559   1.37965965  1.3970623   1.40263176  1.43087006  1.47748756
   1.49857903  1.50987244  1.51762772]
 [ 0.64031982  1.11455727  1.19411278  1.2266655   1.27806664  1.3130722
   1.32496071  1.37686729  1.41514015  1.41737747  1.46724701  1.48617554
   1.49199867  1.49544144  1.51000404]
 [ 0.91652107  1.06410217  1.15173531  1.39237595  1.40605354  1.41710854
   1.42310143  1.44347572  1.44360542  1.45162773  1.49108124  1.49484253
   1.53658867  1.5404129   1.55116844]
 [ 0.80885315  1.33774185  1.34846497  1.3493042   1.36525154  1.36763763
   1.46035767  1.46951294  1.47649002  1.48202133  1.49103546  1.51737404
   1.5261879   1.52928543  1.55130959]
 [ 0.86759949  1.15416336  1.17353058  1.19447136  1.21015167  1.33818817
   1.37376595  1.3926239   1.49480057  1.52285957  1.53648758  1.54061317
   1.55315018  1.57767677  1.58125877]]
[[ 7088 11759    99 19526 16154 12055 17769 12503  9853 13708 13931  6740
   4466 17428 19830]
 [11262  8439 11477  8793 19599  6928  3824  7343 14503  9918 18099  4363
  18700  9995 17213]
 [16546 15334  8798  3045  5780  4047 19566 19456  1025 11855 11011  4220
   6267 14521 15692]
 [ 1301 19438  1677  9848  5933  1423 12830  2033  4208 19255 16002 12055
   2408 17972  2365]
 [18764 17835  7337  3042  3290  6483  9176  3692 17901 19445  3422  2857
  19815  4279  9130]]
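
The assignments returned by kmeans.index.search can also be used to inspect the clustering itself; a small sketch that counts cluster sizes with numpy:

_, assign = kmeans.index.search(x, 1)                      # nearest centroid per point
sizes = np.bincount(assign.ravel(), minlength=ncentroids)  # points per cluster
print(sizes.min(), sizes.max())                            # smallest / largest cluster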
  • PCA dimensionality reduction
# PCA dimensionality reduction
# random training data 
mt = np.random.rand(1000, 40).astype('float32')
mat = faiss.PCAMatrix(40, 10)  # reduce 40-d vectors to 10-d
mat.train(mt)
assert mat.is_trained
tr = mat.apply_py(mt)
# print this to show that the magnitude of tr's columns is decreasing
print (tr ** 2).sum(0)
[ 116.68821716  116.36938477  107.59984589  107.00305939  105.75762939
  103.17525482  101.48827362  101.10425568   98.96426392   96.75302887]
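
The trained PCAMatrix can also be chained in front of an index so that vectors are reduced automatically on add and search; a sketch using faiss.IndexPreTransform:

sub_index = faiss.IndexFlatL2(10)                 # index over the 10-d reduced vectors
index_pca = faiss.IndexPreTransform(mat, sub_index)
index_pca.add(mt)                                 # mt is PCA-reduced before indexing
D, I = index_pca.search(mt[:5], 4)                # queries are reduced the same way
print(I)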
  • PQ encoding / decoding: lossy vector compression and decompression
# PQ encoding / decoding
# lossy compression and decompression of vectors
# The ProductQuantizer object can be used to encode or decode vectors to codes:
d = 32  # data dimension
cs = 4  # code size (bytes)
# train set 
nt = 10000
xt = np.random.rand(nt, d).astype('float32')
# dataset to encode (could be same as train)
n = 20000
x = np.random.rand(n, d).astype('float32')
pq = faiss.ProductQuantizer(d, cs, 8)
pq.train(xt)
# encode 
codes = pq.compute_codes(x)
# decode
x2 = pq.decode(codes)
# compute reconstruction error
avg_relative_error = ((x - x2)**2).sum() / (x ** 2).sum()
print avg_relative_error

0.0663308
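
The memory saving can be read off the arrays themselves: each vector is stored in cs = 4 bytes instead of d * 4 = 128 bytes of float32, a 32x reduction:

print(x.nbytes, codes.nbytes)   # 2560000 bytes vs 80000 bytes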

  • IDMap
# IndexIDMap adds an ID mapping
index = faiss.IndexFlatL2(xb.shape[1]) 
ids = np.arange(xb.shape[0])
# index.add_with_ids(xb, ids)  # raises an error: not supported, because IndexFlatL2 does not support add_with_ids
index2 = faiss.IndexIDMap(index)
index2.add_with_ids(xb, ids) # ok, works, the vectors are stored in the underlying index
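
The IDs returned by search are then the caller-supplied ones. A small sketch with an arbitrary (hypothetical) offset, just to make the mapping visible:

ids = np.arange(xb.shape[0]) + 100000          # custom 64-bit IDs
index3 = faiss.IndexIDMap(faiss.IndexFlatL2(xb.shape[1]))
index3.add_with_ids(xb, ids)
D, I = index3.search(xq[:5], 4)
print(I)                                       # neighbor IDs all start from 100000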

https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index
https://github.com/facebookresearch/faiss/wiki/Faiss-indexes

How to choose an index type?

Faiss provides APIs for many index types.

All of them can be built through the unified faiss.index_factory(d, "......") interface (see the factory sketch after the summary below).

'''
10000 queries * 64 dimensions
1. IndexFlatL2: Exact Search for L2, brute-force, "Flat"
   Exact L2 distances; no training needed; slow (about 26.7 s CPU time, 683 ms wall time above)
2. IndexIVFFlat: take another index to assign vectors to inverted lists, "IVFx,Flat"
   Partitions into cells; approximate; requires train(); larger nprobe is more accurate (up to nlist).
   The tutorial sets nprobe to 1/10 of nlist, which cuts the search time to roughly 1/7.
3. IndexPQ: Product quantizer (PQ) in flat mode, "PQx"
   Compression, "PQx"
4. IndexIVFPQ: IVFADC (coarse quantizer + PQ on residuals), "IVF100,PQ8"
   Cells plus compression; can be built with index_factory: "IVF100,PQ8" -> 100 cells, 8-byte PQ codes.
   Search time drops further and, more importantly, the lossy PQ compression reduces memory usage.
'''
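
A sketch of the same variants built through index_factory (the strings follow the table above; IVF variants still need train() before add()):

index_flat  = faiss.index_factory(d, "Flat")          # brute-force L2
index_ivf   = faiss.index_factory(d, "IVF100,Flat")   # 100 inverted lists, no compression
index_pq    = faiss.index_factory(d, "PQ8")           # PQ codes only, 8 bytes per vector
index_ivfpq = faiss.index_factory(d, "IVF100,PQ8")    # inverted lists + PQ compression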

Choosing an index type

'''
1. "HNSWx" (IndexHNSWFlat): no training needed; does not support removing vectors.
   Supported on GPU: no. Accurate. (See the HNSW sketch below.)
2. "...,Flat":
   "..." means a clustering of the dataset has to be performed beforehand
   (reading in and training on the data can be slow); similarity is then only
   computed within buckets, which is fast. After clustering, "Flat" just
   organizes the vectors into buckets, so it does not compress them; the
   storage size is the same as that of the original dataset.
   The tradeoff between speed and accuracy is set via the nprobe parameter.
   Supported on GPU: yes (but the clustering method must be supported as well)
3. "How big is the dataset?"
   (1) below 1M vectors: "...,IVFx,..."  (supported on GPU)
   (2) 1M - 10M: "...,IMI2x10,..."  (not supported on GPU)
   (3) 10M - 100M: "...,IMI2x12,..."
   (4) 100M - 1B: "...,IMI2x14,..."
'''
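
For completeness, a minimal HNSW sketch matching point 1 above (32 links per node is assumed here as a typical value; no train() call is needed):

index_hnsw = faiss.IndexHNSWFlat(xb.shape[1], 32)   # 32 = neighbors per node in the graph
index_hnsw.add(xb)                                  # no training stage
D, I = index_hnsw.search(xq[:5], 4)
print(I)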
