Performance Benchmark of Locality-Sensitive Hashing and KD-Tree Algorithms

1 The Purposes

  • Get familiar with common ANN algorithms such as KD-Tree and LSH
  • Learn the implementation of LSH and related coding skills
  • Analyze the performance of KD-Tree and LSH under different dimensions

2 The Principles

2.1 ANN

Efficient and accurate methods for Approximate Nearest Neighbor (ANN) queries are important for large-scale dataset applications such as image feature matching, recommendation systems, DNA sequencing, and so on.
Popular ANN methods include the randomized kd-tree (KD-Tree), hierarchical k-means (HKM), and locality-sensitive hashing (LSH).

2.1.1 KD-Tree

A KD-Tree partitions a vector space by recursively generating hyperplanes that cut along the coordinate with maximal variance in the data. KD-Tree works well for searching in low-dimensional spaces, while its efficiency decreases as the dimensionality grows beyond about 10.

  • Building a kd-tree: O(N log N) time and O(N) space
  • Nearest neighbor search: O(log N)
  • M nearest neighbors: O(M log N)
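The structure behind these complexities can be sketched in a few lines of pure Python (an illustration only, not FLANN's randomized kd-tree implementation): build by median split on the axis of maximal spread, search by depth-first descent with backtracking whenever the splitting plane could hide a closer point.

```python
import math

def build_kdtree(points):
    """Recursively build a KD-tree as nested dicts (median split)."""
    if not points:
        return None
    # Split on the axis with the largest spread at this node.
    dims = len(points[0])
    axis = max(range(dims),
               key=lambda d: max(p[d] for p in points) - min(p[d] for p in points))
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid]),
        "right": build_kdtree(points[mid + 1:]),
    }

def nearest(node, target, best=None):
    """Exact nearest-neighbor search with backtracking; returns (dist, point)."""
    if node is None:
        return best
    d = math.dist(node["point"], target)
    if best is None or d < best[0]:
        best = (d, node["point"])
    axis = node["axis"]
    diff = target[axis] - node["point"][axis]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, target, best)
    # Only visit the far side if the splitting plane is closer than the best so far.
    if abs(diff) < best[0]:
        best = nearest(far, target, best)
    return best
```

For example, `nearest(build_kdtree(pts), q)` returns the same point as a linear scan, but typically visits only O(log N) nodes on low-dimensional data.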

2.1.2 LSH

The LSH algorithm maps similar input items to the same hash code with higher probability than dissimilar items. For a given key size K, we generate K randomized hyperplanes and check whether a point P is above or below each hyperplane, assigning a bit of 0/1 for below/above. We then perform a Hamming-distance search on the K-bit keys to find approximate nearest neighbors.

  • Building an LSH table: O(N*K) time and O(N) space, where K is the key size
  • Nearest neighbor search: O(1) in the best case
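The scheme above can be sketched as follows (an illustration of random-hyperplane hashing, not the actual LsHash code; scanning all buckets within a Hamming radius is a simplification of the candidate search):

```python
import random

def make_hyperplanes(k, dim, seed=0):
    """K random hyperplanes through the origin, Gaussian normals."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(k)]

def hash_key(point, planes):
    """K-bit key: bit i is 1 iff the point lies above hyperplane i."""
    return tuple(1 if sum(w * x for w, x in zip(plane, point)) >= 0 else 0
                 for plane in planes)

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def build_table(points, planes):
    """Bucket every point under its K-bit key."""
    table = {}
    for p in points:
        table.setdefault(hash_key(p, planes), []).append(p)
    return table

def query(table, planes, q, max_hamming=1):
    """Collect candidates from all buckets within max_hamming bits of q's key."""
    qk = hash_key(q, planes)
    candidates = []
    for key, bucket in table.items():
        if hamming(key, qk) <= max_hamming:
            candidates.extend(bucket)
    return candidates
```

A larger key size K makes buckets smaller (faster queries, lower recall); widening `max_hamming` trades speed back for recall.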

2.2 Related Open Source Projects

FLANN is a library for performing fast approximate nearest neighbor searches in high-dimensional spaces. It contains a collection of algorithms found to work best for nearest neighbor search and a system for automatically choosing the best algorithm and optimal parameters depending on the dataset.

LsHash is an open source project on GitHub that implements fast locality-sensitive hashing in Python with persistence support.

We benchmark our KD-Tree performance with the FLANN and pyflann projects. The dynamic library libflann.so is compiled and linked through the pyflann Python bindings. Our LSH algorithm is based on LsHash with some modifications.

2.3 Datasets

2.3.1 SIFT10K

The SIFT10K dataset contains 128-dimensional SIFT features. It comprises 10 thousand data vectors and one hundred query vectors. For each query vector, a ground truth file with the top k=100 nearest neighbors is also given.

2.3.2 SIFT1M

The SIFT1M dataset contains 128-dimensional SIFT features. It comprises 1 million data vectors and 10 thousand query vectors. For each query vector, a ground truth file with the top k=100 nearest neighbors is also given.

2.3.3 GIST1M

The GIST1M dataset contains 960-dimensional GIST features. It comprises 1 million data vectors and one thousand query vectors. For each query vector, a ground truth file with the top k=100 nearest neighbors is also given.
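These benchmark datasets are commonly distributed in the .fvecs format: each vector is stored as a 4-byte little-endian integer d (the dimension) followed by d 4-byte floats. Assuming that layout for our downloads, a minimal reader looks like this:

```python
import struct

def read_fvecs(path):
    """Yield each vector in an .fvecs file as a list of floats."""
    with open(path, "rb") as f:
        while True:
            header = f.read(4)
            if not header:
                break  # end of file
            (d,) = struct.unpack("<i", header)          # dimension prefix
            vec = struct.unpack("<%df" % d, f.read(4 * d))
            yield list(vec)
```

The companion .ivecs ground-truth files use the same layout with 4-byte integers in place of floats.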

2.4 Measure

We choose three main factors to benchmark the performance of KD-Tree and LSH. Precision can be seen as a measure of exactness or quality, and time/memory cost are also taken into account.
For each experiment, we plot (x, y) lines for the different algorithms, where y is the search time gain over linear search and x is the corresponding precision. The precision is the average precision over all query points.
The expected experiment figure may look like Fig. 1:

Fig. 1: expected experiment figure comparing the kd-tree and LSH algorithms. THIS IS NOT ACTUAL EXPERIMENTAL DATA!
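The two plotted quantities can be computed as in the sketch below (function names are our own; `retrieved` and `ground_truth` are lists of neighbor ids per query):

```python
def precision_at_k(retrieved, ground_truth):
    """Fraction of the true top-k neighbors that the ANN search returned."""
    return len(set(retrieved) & set(ground_truth)) / len(ground_truth)

def avg_precision(all_retrieved, all_truth):
    """Average precision over all query points (the x axis of our plots)."""
    pairs = list(zip(all_retrieved, all_truth))
    return sum(precision_at_k(r, t) for r, t in pairs) / len(pairs)

def speedup(linear_seconds, ann_seconds):
    """Search time gain over linear scan (the y axis of our plots)."""
    return linear_seconds / ann_seconds
```

For example, if a query returns ids [1, 2, 3] and the ground truth top-3 is [2, 3, 4], the precision for that query is 2/3.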

3 The Procedures

This is the machine configuration:

  • Ubuntu 16.04, 64-bit, with Intel Core i7-4970K 4.0 GHz x8 and 16 GB RAM
  • Main programming languages: Python 2.7 & C++11
  • FLANN 1.8.4 and pyflann bindings
  • LSH implemented in Python, based on the open source LsHash

3.1 Data Dimensionality

The standard dataset SIFT1M contains 128-dimensional SIFT features, and GIST1M comprises 960-dimensional GIST features. We compare the performance of KD-Tree and LSH on dimensions 128 and 960 respectively.

For KD-Tree, the parameters include the tree number K and how many leaves to visit when searching for neighbors, namely Checks. For LSH, the parameters include the table number L and the key size.
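The parameter combinations for each sweep can be enumerated as below (the grid values are hypothetical placeholders, not the settings used in our experiments):

```python
import itertools

# Hypothetical parameter grids for the two algorithms.
KD_GRID = {"trees": [1, 4, 8, 16], "checks": [32, 64, 128, 256]}
LSH_GRID = {"tables": [1, 2, 4, 8], "key_size": [12, 16, 20, 24]}

def configs(grid):
    """Yield every parameter combination in the grid as a dict."""
    names = sorted(grid)
    for values in itertools.product(*(grid[n] for n in names)):
        yield dict(zip(names, values))
```

Each configuration is then run once, producing one (precision, search time gain over linear) point in the scatter plot.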

For each dimension, we calculate the distribution of (precision, search time gain over linear) and plot it using Python matplotlib.
Fig. 2 and Fig. 3 show the corresponding experiment results.

Fig. 2: (precision, search time gain over linear) distribution on SIFT dataset

Fig. 3: (precision, search time gain over linear) distribution on GIST dataset

Besides, we will calculate the memory cost on both the 128- and 960-dimensional data.

3.2 Dataset Size

We select dataset sizes of 10K and 1M from the standard datasets SIFT10K and SIFT1M. We discard the GIST1M dataset because its GIST feature dimension differs. For a better comparison, SIFT1B should also be taken into account; however, it is too large to run on our machine at present.
In this experiment, we vary the tree number K and search leaves Checks for the KD-Tree algorithm, and the table number L and key size for the LSH algorithm, to get the distribution of (precision, search time gain over linear) under different dataset sizes.
Fig. 4 and Fig. 5 show the corresponding experiment results.

Fig. 4: (precision, search time gain over linear) distribution on SIFT10K dataset

Fig. 5: (precision, search time gain over linear) distribution on SIFT1M dataset

Besides, we will calculate the memory cost on both the 10K and 1M datasets.

3.3 Varying k

The ground truth dataset comprises the top k=100 features for each query. To compare the performance of the KD-Tree and LSH algorithms at specific values of k, we select k=1, 3, 30, 100. We choose SIFT1M as the dataset and discard SIFT10K and GIST1M to keep every other dimension the same while varying k.
In this experiment, we vary the tree number K and search leaves Checks for the KD-Tree algorithm, and the table number L and key size for the LSH algorithm, to get the distribution of (precision, search time gain over linear) under different k.

Fig. 6: (precision, search time gain over linear) distribution on SIFT1M dataset with query k=1

Fig. 7: (precision, search time gain over linear) distribution on SIFT1M dataset with query k=3

Fig. 8: (precision, search time gain over linear) distribution on SIFT1M dataset with query k=30

Fig. 9: (precision, search time gain over linear) distribution on SIFT1M dataset with query k=100

4 The Results

To be continued...

5 The Bibliography

[1] Marius Muja and David G. Lowe. Nearest Neighbor Algorithms for High Dimensional Data. Pattern Analysis and Machine Intelligence (PAMI), Vol. 36, 2014.

[2] A. Gionis, P. Indyk, and R. Motwani. Similarity Search in High Dimensions via Hashing. In Proc. Int'l Conf. on Very Large Data Bases (VLDB), 1999.

[3] Qin Lv, William Josephson, Zhe Wang, Moses Charikar, Kai Li. Multi-Probe LSH: Efficient Indexing for High-Dimensional Similarity Search. Proceedings of the 33rd International Conference on Very Large Data Bases (VLDB), Vienna, Austria, September 2007.

[4] Marius Muja, David Lowe. FLANN - Fast Library for Approximate Nearest Neighbors User Manual. January 24, 2013.

[5] C. Silpa-Anan and R. Hartley. Optimised KD-trees for Fast Image Descriptor Matching. In Proc. Computer Vision and Pattern Recognition (CVPR), 2008.

[6] Ioannis Z. Emiris and Dimitri Nicolopoulos. Randomized kd-trees for Approximate Nearest Neighbor Search. November 28, 2013.

[7] Jingdong Wang, Heng Tao Shen, Jingkuan Song, and Jianqiu Ji. Hashing for Similarity Search: A Survey. arXiv:1408.2927v1 [cs.DS], 13 Aug 2014.

[8] Malcolm Slaney, Yury Lifshits and Junfeng He. Optimal Parameters for Locality-Sensitive Hashing. Proceedings of the IEEE, Vol. 100, No. 9, September 2012.

[9] Nearest neighbor search. Wikipedia, https://en.wikipedia.org/wiki/Nearest_neighbor_search. Accessed 12 Oct 2016.

[10] FLANN - Fast Library for Approximate Nearest Neighbors. University of British Columbia, http://www.cs.ubc.ca/research/flann/. Accessed 14 Oct 2016.

[11] ALGLIB User Guide. ALGLIB Project, http://www.alglib.net/other/nearestneighbors.php. Accessed 14 Oct 2016.

[12] Jason. Analysis of the LSH Algorithm Framework. Web, http://blog.jasonding.top/2018/01/01/MLStick/ http://www.cnblogs.com/hxsyl/p/4626092.html. Accessed 11 Oct 2016.

[13] 壞人. The LSH (Locality-Sensitive Hashing) Algorithm. CSDN, http://blog.csdn.net/u013378306/article/details/52473176, 8 Sep 2016.
