添加faiss到python路徑
vim ~/anaconda2/lib/python2.7/site-packages/my.pth
添加路徑 /home/bi_tag/faiss/faiss
參照官網(wǎng): https://github.com/facebookresearch/faiss/wiki/Getting-started
import numpy as np
import faiss # make faiss available
# 構(gòu)造數(shù)據(jù)
import time
d = 64 # dimension
nb = 100000 # database size
nq = 10000 # nb of queries
np.random.seed(1234) # make reproducible
xb = np.random.random((nb, d)).astype('float32')
xb[:, 0] += np.arange(nb) / 1000.
xq = np.random.random((nq, d)).astype('float32')
xq[:, 0] += np.arange(nq) / 1000.
- IndexFlatL2
# 為向量集構(gòu)建IndexFlatL2索引抗果,它是最簡單的索引類型,只執(zhí)行強力L2距離搜索
%time index = faiss.IndexFlatL2(d) # build the index
print(index.is_trained)
index.add(xb) # add vectors to the index
print(index.ntotal)
k = 4 # we want to see 4 nearest neighbors
%time D, I = index.search(xb[:5], k) # sanity check
print(I) # 向量索引位置
print(D) # 相似度矩陣
%time D, I = index.search(xq, k) # actual search
print(I[:5]) # neighbors of the 5 first queries
print(I[-5:]) # neighbors of the 5 last queries
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 80.1 μs
True
100000
CPU times: user 197 ms, sys: 7 ms, total: 204 ms
Wall time: 4.86 ms
[[ 0 393 363 78]
[ 1 555 277 364]
[ 2 304 101 13]
[ 3 173 18 182]
[ 4 288 370 531]]
[[ 0. 7.17517376 7.20763016 7.25116253]
[ 0. 6.32356453 6.6845808 6.79994583]
[ 0. 5.79640865 6.39173603 7.28151226]
[ 0. 7.27790546 7.527987 7.66284657]
[ 0. 6.76380348 7.29512024 7.36881447]]
CPU times: user 26.7 s, sys: 3.66 s, total: 30.3 s
Wall time: 683 ms
[[ 381 207 210 477]
[ 526 911 142 72]
[ 838 527 1290 425]
[ 196 184 164 359]
[ 526 377 120 425]]
[[ 9900 10500 9309 9831]
[11055 10895 10812 11321]
[11353 11103 10164 9787]
[10571 10664 10632 9638]
[ 9628 9554 10036 9582]]
- IndexIVFFlat 加聚類
# https://github.com/facebookresearch/faiss/wiki/Faster-search
# 加速計算
'''
To speed up the search, it is possible to segment the dataset into pieces.
We define Voronoi cells in the d-dimensional space, and each database vector falls in one of the cells.
At search time,
only the database vectors y contained in the cell the query x
falls in and a few neighboring ones are compared against the query vector.
'''
# 通過使用IndexIVFFlat索引再菊,將數(shù)據(jù)集分割成多個斥滤,我們在d維空間中定義Voronoi cells赞咙,
# 只計算和x落在同一個cells中的y的距離
'''
This is done via the IndexIVFFlat index. This type of index requires a training stage,
that can be performed on any collection of vectors
that has the same distribution as the database vectors.
In this case we just use the database vectors themselves.
# 需要預(yù)訓(xùn)練
# 搜索方法有兩個參數(shù):nlist(單元格數(shù)),nprobe(執(zhí)行搜索訪問的單元格數(shù))
'''
'''
There are two parameters to the search method: nlist, the number of cells,
and nprobe, the number of cells (out of nlist) that are visited to perform a search.
The search time roughly increases linearly with the number of probes plus some constant due to the quantization.
'''
The nprobe parameter is always a way of adjusting the tradeoff between speed and accuracy of the result.
Setting nprobe = nlist gives the same result as the brute-force search (but slower).'''
# nprobe, 準(zhǔn)確度和時間的折中
nlist = 100 # 單元格數(shù)
k = 4
quantizer = faiss.IndexFlatL2(d) # the other index d是向量維度
%time index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_L2)
# here we specify METRIC_L2, by default it performs inner-product search
assert not index.is_trained
%time index.train(xb)
assert index.is_trained
%time index.add(xb) # add may be a bit slower as well
%timeit D, I = index.search(xq, k) # actual search
print(I[-5:]) # neighbors of the 5 last queries
index.nprobe = 10 # 執(zhí)行搜索訪問的單元格數(shù)(nlist以外) # default nprobe is 1, try a few more
%time D, I = index.search(xq, k)
print(I[-5:]) # neighbors of the 5 last queries
index.search??
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 111 μs
CPU times: user 1.51 s, sys: 33 ms, total: 1.54 s
Wall time: 62.9 ms
CPU times: user 1.02 s, sys: 26 ms, total: 1.05 s
Wall time: 23.9 ms
100 loops, best of 3: 16.3 ms per loop
[[ 9900 10500 9309 9831]
[11055 10895 10812 11321]
[11353 11103 10164 9787]
[10571 10664 10632 9638]
[ 9628 9554 10036 9582]]
CPU times: user 4.08 s, sys: 50 ms, total: 4.13 s
Wall time: 92.2 ms
[[ 9900 10500 9309 9831]
[11055 10895 10812 11321]
[11353 11103 10164 9787]
[10571 10664 10632 9638]
[ 9628 9554 10036 9582]]
- IndexIVFPQ 加聚類酱酬, 加壓縮, 壓縮算法:https://hal.inria.fr/file/index/docid/514462/filename/paper_hal.pdf
# https://github.com/facebookresearch/faiss/wiki/Lower-memory-footprint
# 有損壓縮存儲
#
'''
The indexes we have seen, IndexFlatL2 and IndexIVFFlat both store the full vectors.
To scale up to very large datasets,
Faiss offers variants that compress the stored vectors with a lossy compression based on product quantizers.
The vectors are still stored in Voronoi cells,
but their size is reduced to a configurable number of bytes m (d must be a multiple of m)
'''
# 向量降維參數(shù) m诈豌, 向量維度d必須是m的倍數(shù)
'''
The compression is based on a Product Quantizer,
that can be seen as an additional level of quantization,
that is applied on sub-vectors of the vectors to encode.
In this case, since the vectors are not stored exactly,
the distances that are returned by the search method are also approximations.
'''
# 結(jié)果的近似計算仆救, 向量壓縮映射基于Product Quantizer
'''
They can be compared with the IVFFlat results above. For this case,
most results are wrong, but they are in the correct area of the space,
as shown by the IDs around 10000. The situation is better for real data because:
uniform data is very difficult to index because
there is no regularity that can be exploited to cluster or reduce dimensionality
for natural data, the semantic nearest neighbor is often significantly closer than irrelevant results.
'''
# 在真實數(shù)據(jù)分布, 近似結(jié)果會更好
nlist = 100
m = 8
k = 4
quantizer = faiss.IndexFlatL2(d) # this remains the same
# 為了擴展到非常大的數(shù)據(jù)集矫渔,F(xiàn)aiss提供了基于產(chǎn)品量化器的有損壓縮來壓縮存儲的向量的變體彤蔽。壓縮的方法基于乘積量化。
# 損失了一定精度為代價蚌斩, 自身距離也不為0铆惑, 這是由于有損壓縮。
%time index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)
faiss.IndexIVFPQ??
# 8 specifies that each sub-vector is encoded as 8 bits
%time index.train(xb)
%time index.add(xb)
%time D, I = index.search(xb[:5], k) # sanity check
print(I)
print(D)
index.nprobe = 10 # make comparable with experiment above
%time D, I = index.search(xq, k) # search
print(I[-5:])
# 簡化寫法送膳, 傳String參數(shù)
print "Simplifying index construction"
%time index = faiss.index_factory(d, "IVF100,PQ8")
%time index.train(xb)
%time index.add(xb)
index.nprobe = 10 # make comparable with experiment above
%time D, I = index.search(xq, k) # search
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 144 μs
CPU times: user 1min 29s, sys: 2.12 s, total: 1min 31s
Wall time: 2.04 s
CPU times: user 6.87 s, sys: 113 ms, total: 6.98 s
Wall time: 152 ms
CPU times: user 11 ms, sys: 1e+03 μs, total: 12 ms
Wall time: 257 μs
[[ 0 424 363 278]
[ 1 555 1063 24]
[ 2 304 46 346]
[ 3 773 182 1529]
[ 4 288 754 531]]
[[ 1.45568264 6.03136778 6.18729019 6.38852692]
[ 1.4934082 5.74254704 6.19941282 6.21501732]
[ 1.60279989 6.20174742 6.32792664 6.78541422]
[ 1.69804895 6.2623148 6.26956797 6.56042767]
[ 1.30235744 6.13624763 6.33899879 6.51442099]]
CPU times: user 3.82 s, sys: 69 ms, total: 3.89 s
Wall time: 84.8 ms
[[10664 10914 9922 9380]
[10260 9014 9458 10310]
[11291 9380 11103 10392]
[10856 10284 9638 11276]
[10304 9327 10152 9229]]
Simplifying index construction
CPU times: user 6 ms, sys: 1 ms, total: 7 ms
Wall time: 163 μs
CPU times: user 2min 27s, sys: 2.21 s, total: 2min 29s
Wall time: 8.72 s
CPU times: user 8.89 s, sys: 185 ms, total: 9.08 s
Wall time: 202 ms
CPU times: user 3.05 s, sys: 37 ms, total: 3.08 s
Wall time: 67.7 ms
- GPU使用(未部署员魏, 服務(wù)器沒有GPU)
# https://github.com/facebookresearch/faiss/wiki/Running-on-GPUs
# 使用GPU
# 單GPU
res = faiss.StandardGpuResources() # use a single GPU
# build a flat (CPU) index
index_flat = faiss.IndexFlatL2(d)
# make it into a gpu index
gpu_index_flat = faiss.index_cpu_to_gpu(res, 0, index_flat)
gpu_index_flat.add(xb) # add vectors to the index
print(gpu_index_flat.ntotal)
k = 4 # we want to see 4 nearest neighbors
D, I = gpu_index_flat.search(xq, k) # actual search
print(I[:5]) # neighbors of the 5 first queries
print(I[-5:])
# 多GPU
ngpus = faiss.get_num_gpus()
print("number of GPUs:", ngpus)
cpu_index = faiss.IndexFlatL2(d)
gpu_index = faiss.index_cpu_to_all_gpus( # build the index
cpu_index
)
gpu_index.add(xb) # add vectors to the index
print(gpu_index.ntotal)
k = 4 # we want to see 4 nearest neighbors
D, I = gpu_index.search(xq, k) # actual search
print(I[:5]) # neighbors of the 5 first queries
print(I[-5:]) # neighbors of the 5 last queries
0
- k-means 聚類
https://github.com/facebookresearch/faiss/wiki/Faiss-building-blocks:-clustering,-PCA,-quantization
# Faiss building blocks: clustering, PCA, quantization
# k-means 聚類, PCA, PQ量化
# k-means 聚類
ncentroids = 1024
niter = 20
verbose = True
n = 20000
d = 32
x = np.random.rand(n, d).astype('float32')
d = x.shape[1]
kmeans = faiss.Kmeans(d, ncentroids, niter, verbose)
kmeans.train(x)
# kmeans.centroids 聚類中心
D, I = kmeans.index.search(x, 1)
index = faiss.IndexFlatL2 (d)
index.add (x)
%time D, I = index.search (kmeans.centroids, 15) # 最近的15個聚類中心
print(D[:5]) # neighbors of the 5 first queries
print(I[:5]) # neighbors of the 5 last queries
CPU times: user 544 ms, sys: 11 ms, total: 555 ms
Wall time: 12.1 ms
[[ 0.75993538 1.01302719 1.23967171 1.24554443 1.35987091 1.36220741
1.3649559 1.37965965 1.3970623 1.40263176 1.43087006 1.47748756
1.49857903 1.50987244 1.51762772]
[ 0.64031982 1.11455727 1.19411278 1.2266655 1.27806664 1.3130722
1.32496071 1.37686729 1.41514015 1.41737747 1.46724701 1.48617554
1.49199867 1.49544144 1.51000404]
[ 0.91652107 1.06410217 1.15173531 1.39237595 1.40605354 1.41710854
1.42310143 1.44347572 1.44360542 1.45162773 1.49108124 1.49484253
1.53658867 1.5404129 1.55116844]
[ 0.80885315 1.33774185 1.34846497 1.3493042 1.36525154 1.36763763
1.46035767 1.46951294 1.47649002 1.48202133 1.49103546 1.51737404
1.5261879 1.52928543 1.55130959]
[ 0.86759949 1.15416336 1.17353058 1.19447136 1.21015167 1.33818817
1.37376595 1.3926239 1.49480057 1.52285957 1.53648758 1.54061317
1.55315018 1.57767677 1.58125877]]
[[ 7088 11759 99 19526 16154 12055 17769 12503 9853 13708 13931 6740
4466 17428 19830]
[11262 8439 11477 8793 19599 6928 3824 7343 14503 9918 18099 4363
18700 9995 17213]
[16546 15334 8798 3045 5780 4047 19566 19456 1025 11855 11011 4220
6267 14521 15692]
[ 1301 19438 1677 9848 5933 1423 12830 2033 4208 19255 16002 12055
2408 17972 2365]
[18764 17835 7337 3042 3290 6483 9176 3692 17901 19445 3422 2857
19815 4279 9130]]
- PCA 向量降維
# PCA降維
# random training data
mt = np.random.rand(1000, 40).astype('float32')
mat = faiss.PCAMatrix (40, 10)
mat.train(mt)
assert mat.is_trained
tr = mat.apply_py(mt)
# print this to show that the magnitude of tr's columns is decreasing
print (tr ** 2).sum(0)
[ 116.68821716 116.36938477 107.59984589 107.00305939 105.75762939
103.17525482 101.48827362 101.10425568 98.96426392 96.75302887]
- PQ encoding / decoding 向量的有損壓縮與解壓
# PQ encoding / decoding
# 向量的有損壓縮與解壓
# The ProductQuantizer object can be used to encode or decode vectors to codes:
d = 32 # data dimension
cs = 4 # code size (bytes)
# train set
nt = 10000
xt = np.random.rand(nt, d).astype('float32')
# dataset to encode (could be same as train)
n = 20000
x = np.random.rand(n, d).astype('float32')
pq = faiss.ProductQuantizer(d, cs, 8)
pq.train(xt)
# encode
codes = pq.compute_codes(x)
# decode
x2 = pq.decode(codes)
# compute reconstruction error
avg_relative_error = ((x - x2)**2).sum() / (x ** 2).sum()
print avg_relative_error
0.0663308
- IDMAP
# IndexIDMap 加ID映射
index = faiss.IndexFlatL2(xb.shape[1])
ids = np.arange(xb.shape[0])
# index.add_with_ids(xb, ids) # 報錯叠聋, 不支持 because IndexFlatL2 does not support add_with_ids
index2 = faiss.IndexIDMap(index)
index2.add_with_ids(xb, ids) # ok, works, the vectors are stored in the underlying index
https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index
https://github.com/facebookresearch/faiss/wiki/Faiss-indexes
如何選擇使用哪種索引方式撕阎,
faiss提供了很多索引方式的api供選擇
所有的索引構(gòu)建都可以通過 faiss.index_factory(d, "......") 統(tǒng)一接口來實現(xiàn)
'''
10000 * 64
1, IndexFlatL2, Exact Search for L2碌补, brute-force虏束, "Flat"
L2距離 精確計算,無需train厦章,速度慢镇匀, CPU time需要26s, real需要687ms
2, IndexIVFFlat, Take another index to assign vectors to inverted lists "IVFx,Flat"
切割到cells, 估算袜啃,需要train, nprobe 越大越精確(最大nlist), 教程中取了1/10汗侵, 執(zhí)行時間也大概縮短為1/7
3, IndexPQ, Product quantizer (PQ) in flat mode "PQx"
壓縮,"PQx"
4, IndexIVFPQ群发, IVFADC (coarse quantizer+PQ on residuals) "IVF100,PQ8"
分桶加壓縮晰韵, 可以用index_factory來簡化構(gòu)造, "IVF100,PQ8" -> 100個桶熟妓, 壓縮到8bit
時間進一步降低雪猪, 更重要的是, 通過PQ的有損壓縮起愈,降低了內(nèi)存使用
'''
索引類型選擇
'''
1, HNSWx IndexHNSWFlat 方法 無需train, 不支持removing 向量樣本
Supported on GPU: no只恨, 準(zhǔn)確
2译仗, "...,Flat",
先聚類分桶,讀入訓(xùn)練過程會比較慢坤次, 再計算相似度古劲, 快, 無壓縮缰猴, 存儲消耗等同于原數(shù)據(jù)集大小产艾, 通過nprobe來折中速度與精度
"..." means a clustering of the dataset has to be performed beforehand (read below).
After clustering, "Flat" just organizes the vectors into buckets,
so it does not compress them, the storage size is the same as that of the original dataset.
The tradeoff between speed and accuracy is set via the nprobe parameter.
Supported on GPU: yes (but see below, the clustering method must be supported as well)
3, "How big is the dataset?"
(1), 小于1M vectors "...,IVFx,..." 支持GPU
(2), 1M ~ 10M: "...,IMI2x10,..." 不支持GPU
(3), 10M - 100M "...,IMI2x12,..."
(4), 100M - 1B "...,IMI2x14,..."
'''