參考:R document
Seurat識別細胞類群的原理FindNeighbors和FindClusters
Jaccard系數(shù)_百度百科
Annoy解析
https://blog.csdn.net/qq_37696858/article/details/88143156
(Shared) Nearest-neighbor graph construction
- Description
Computes the k.param nearest neighbors for a given dataset. Can also optionally (via compute.SNN), construct a shared nearest neighbor graph by calculating the neighborhood overlap (Jaccard index) between every cell and its k.param nearest neighbors.
首先計算每個細胞的KNN,也就是計算每個細胞之間的相互距離,依據(jù)細胞之間鄰居的overlap來構(gòu)建snn graph顾瞻。
計算給定數(shù)據(jù)集的k.param最近鄰漾峡。也可以選擇(通過compute.SNN)定欧,通過計算每個細胞最近鄰之間的鄰域重疊(Jaccard索引)和其鄰近的k.param來構(gòu)造SNN信殊。 - Usage
FindNeighbors(object, ...)
Default S3 method:
FindNeighbors(
object,
query = NULL,
distance.matrix = FALSE,
k.param = 20,
return.neighbor = FALSE,
compute.SNN = !return.neighbor,
prune.SNN = 1/15,
nn.method = "annoy",
n.trees = 50,
annoy.metric = "euclidean",
nn.eps = 0,
verbose = TRUE,
force.recalc = FALSE,
l2.norm = FALSE,
cache.index = FALSE,
index = NULL,
...
)
S3 method for class 'Assay'
FindNeighbors(
object,
features = NULL,
k.param = 20,
return.neighbor = FALSE,
compute.SNN = !return.neighbor,
prune.SNN = 1/15,
nn.method = "annoy",
n.trees = 50,
annoy.metric = "euclidean",
nn.eps = 0,
verbose = TRUE,
force.recalc = FALSE,
l2.norm = FALSE,
cache.index = FALSE,
...
)
S3 method for class 'dist'
FindNeighbors(
object,
k.param = 20,
return.neighbor = FALSE,
compute.SNN = !return.neighbor,
prune.SNN = 1/15,
nn.method = "annoy",
n.trees = 50,
annoy.metric = "euclidean",
nn.eps = 0,
verbose = TRUE,
force.recalc = FALSE,
l2.norm = FALSE,
cache.index = FALSE,
...
)
S3 method for class 'Seurat'
FindNeighbors(
object,
reduction = "pca",
dims = 1:10,
assay = NULL,
features = NULL,
k.param = 20,
return.neighbor = FALSE,
compute.SNN = !return.neighbor,
prune.SNN = 1/15,
nn.method = "annoy",
n.trees = 50,
annoy.metric = "euclidean",
nn.eps = 0,
verbose = TRUE,
force.recalc = FALSE,
do.plot = FALSE,
graph.name = NULL,
l2.norm = FALSE,
cache.index = FALSE,
...
)
- Arguments
object
An object
...
Arguments passed to other methods
query
Matrix of data to query against object. If missing, defaults to object.
詢問
根據(jù)對象查詢的數(shù)據(jù)矩陣挽绩。如果缺少荧库,默認為對象堰塌。
distance.matrix
Boolean value of whether the provided matrix is a distance matrix; note, for objects of class dist, this parameter will be set automatically
距離矩陣
提供的矩陣是否為距離矩陣的布爾值;注意分衫,對于dist類的對象场刑,此參數(shù)設(shè)置為 automatically
k.param
Defines k for the k-nearest neighbor algorithm
這個參數(shù)用于設(shè)置 KNN 算法中最近鄰的個數(shù),默認20
return.neighbor
Return result as Neighbor object. Not used with distance matrix input.
返回的鄰居
將結(jié)果作為鄰居對象返回。不適用于距離矩陣輸入牵现。
compute.SNN
also compute the shared nearest neighbor graph
計算共享鄰居的數(shù)量铐懊,一般不設(shè)置。
prune.SNN
Sets the cutoff for acceptable Jaccard index when computing the neighborhood overlap for the SNN construction. Any edges with values less than or equal to this will be set to 0 and removed from the SNN graph. Essentially sets the stringency of pruning (0 — no pruning, 1 — prune everything).
計算SNN構(gòu)造的領(lǐng)域重疊時瞎疼,設(shè)置可接受的Jaccard指數(shù)的截止值科乎。任何小于或者等于此值的任何邊將被設(shè)置為0并從SNN圖中刪除。本質(zhì)上設(shè)置修剪的嚴格性(0-不修剪贼急,1-修剪所有內(nèi)容)茅茂。
nn.method
Method for nearest neighbor finding. Options include: rann, annoy
這個參數(shù)提供了如何判斷鄰居的方法,提供的可選是rann, annoy
n.trees
More trees gives higher precision when using annoy approximate nearest neighbor search
樹的數(shù)量
當(dāng)使用annoy的近似最近鄰搜索時太抓,樹越多空闲,給出的精度越高。
annoy.metric
Distance metric for annoy. Options include: euclidean, cosine, manhattan, and hamming
annoy的距離走敌,選項包括:歐幾里德碴倾、余弦、曼哈頓和漢明
nn.eps
Error bound when performing nearest neighbor seach using RANN; default of 0.0 implies exact nearest neighbor search
使用RANN執(zhí)行最近鄰搜索時的錯誤界限掉丽;默認值0.0表示精確最近鄰搜索
verbose
Whether or not to print output to the console
是否將輸出打印到控制臺
force.recalc
Force recalculation of (S)NN.
SNN強制重新計算跌榔,一般不設(shè)置
l2.norm
Take L2Norm of the data
L2正則化
cache.index
Include cached index in returned Neighbor object (only relevant if return.neighbor = TRUE)
在返回的鄰居對象中包含緩存的索引(僅當(dāng)return.neighbor = TRUE時相關(guān))
index
Precomputed index. Useful if querying new data against existing index to avoid recomputing.
索引
預(yù)計算索引。如果根據(jù)現(xiàn)有索引查詢新數(shù)據(jù)以避免重新計算捶障,則非常有用矫户。
features
Features to use as input for building the (S)NN; used only when dims is NULL
特征
用作構(gòu)建SNN神經(jīng)網(wǎng)絡(luò)輸入的特征;僅當(dāng)dims為空時使用
reduction
Reduction to use as input for building the (S)NN
輸入的降維方法残邀,用于構(gòu)建SNN神經(jīng)網(wǎng)絡(luò)
dims
Dimensions of reduction to use as input
輸入的降維的維度
assay
Assay to use in construction of (S)NN; used only when dims is NULL
用于構(gòu)造SNN神經(jīng)網(wǎng)絡(luò)的分析;僅當(dāng)dims為空時使用
do.plot
Plot SNN graph on tSNE coordinates
在tSNE坐標上繪制SNN圖
graph.name
Optional naming parameter for stored (S)NN graph (or Neighbor object, if return.neighbor = TRUE). Default is assay.name_(s)nn. To store both the neighbor graph and the shared nearest neighbor (SNN) graph, you must supply a vector containing two names to the graph.name parameter. The first element in the vector will be used to store the nearest neighbor (NN) graph, and the second element used to store the SNN graph. If only one name is supplied, only the NN graph is stored.
可選命名存儲的神經(jīng)網(wǎng)絡(luò)圖的參數(shù)(或鄰居對象柑蛇,如果返回的鄰居=真)芥挣。默認為assay.name_(s)nn。要存儲鄰居圖和共享最近鄰居(SNN)圖耻台,必須向graph.name參數(shù)提供一個包含兩個名稱的向量空免。向量中的第一個元素將用于存儲最近鄰圖,第二個元素用于存儲SNN圖盆耽。如果只提供一個名稱蹋砚,則只存儲神經(jīng)網(wǎng)絡(luò)圖。
- Value
This function can either return a Neighbor object with the KNN information or a list of Graph objects with the KNN and SNN depending on the settings of return.neighbor and compute.SNN. When running on a Seurat object, this returns the Seurat object with the Graphs or Neighbor objects stored in their respective slots. Names of the Graph or Neighbor object can be found with Graphs or Neighbors.
值
根據(jù)return.neighbor和compute.SNN的設(shè)置摄杂,該函數(shù)可以返回帶有KNN信息的neighbor對象坝咐,也可以返回帶有KNN和SNN的Graph對象列表。當(dāng)在Seurat 對象上運行時析恢,這將返回帶有存儲在各自slots中的Graph或Neighbor對象的Seurat對象墨坚。圖形或鄰居對象的名稱可以在圖形或鄰居中找到。 - Examples
data("pbmc_small")
pbmc_small
Compute an SNN on the gene expression level
pbmc_small <- FindNeighbors(pbmc_small, features = VariableFeatures(object = pbmc_small))
More commonly, we build the SNN on a dimensionally reduced form of the data
such as the first 10 principle components.
pbmc_small <- FindNeighbors(pbmc_small, reduction = "pca", dims = 1:10)
1.KNN是什么映挂?
所謂K最近鄰泽篮,就是K個最近的鄰居的意思盗尸,說的是每個樣本都可以用它最接近的K個鄰近值來代表。近鄰算法就是將數(shù)據(jù)集合中每一個記錄進行分類的方法帽撑。
2.jaccard index是什么泼各?
雅卡爾指數(shù) (Jaccard index)或者稱為交并比、雅卡爾相似系數(shù)
可以用于比較樣本集的相似性與多樣性亏拉。其定義為兩個集合交集大小與并集大小之間的比例:
定義
給定兩個集合A,B扣蜻,Jaccard 系數(shù)定義為A與B交集的大小與A與B并集的大小的比值,定義如下:
當(dāng)A和B是空集時专筷,定義jaccard index為1弱贼。
雅卡爾距離(Jaccard distance)
則用于量度樣本集之間的不相似度,其定義為1減去雅卡爾系數(shù)磷蛹。Jaccard 距離越大吮旅,樣本相似度越低。公式定義如下:其中對參差(symmetric difference)
性質(zhì)
-
相似性
非對稱二元屬性的相似性
在數(shù)據(jù)挖掘領(lǐng)域味咳,常常需要比較兩個具有布爾值屬性的對象之間的距離庇勃,Jaccard距離就是常用的一種方法。給定兩個比較對象A槽驶,B责嚷。A, B 均有n個二元屬性,即每個屬性取值為{0,1}掂铐。定義如下4個統(tǒng)計量:
如圖1數(shù)示:
顯然有
Jaccard 系數(shù):
Jaccard距離:
對于非對稱二元屬性而言(比如說對于患癌癥和不患癌癥的屬性而言罕拂,不患癌癥是0,患癌癥是1全陨,那么0的數(shù)量遠遠大于1爆班,但是我們卻更關(guān)注1的數(shù)量):
M11 - A和B中都是1;
M01 - A中是0辱姨,B中是1柿菩;
M10 - A中是1,B中是0雨涛;
M00 - 兩者都是0.
3.如何鄰居判定枢舶?annoy、rann
-
Annoy是高維空間求近似最近鄰的一個開源模塊替久。
Annoy構(gòu)建一棵二叉樹凉泄,查詢時間為O(logn)。
Annoy通過隨機挑選兩個點蚯根,并使用垂直于這個點的等距離超平面將集合劃分為兩部分旧困。
如圖所示,圖中灰色線是連接兩個點,超平面是加粗的黑線吼具。按照這個方法在每個子集上迭代進行劃分僚纷。
依此類推,直到每個集合最多剩余k個點拗盒,下圖是一個k = 10 的情況怖竭。
相應(yīng)的完整二叉樹結(jié)構(gòu):
隨機投影森林。
一個思想依據(jù)是:在原空間中相鄰的點陡蝇,在樹結(jié)構(gòu)上也表現(xiàn)出相互靠近的特點痊臭,也就是說,如果兩個點在空間上相互靠近登夫,那么他們很可能被樹結(jié)構(gòu)劃分到一起广匙。
如果要在空間中查找臨近點,我們可以在這個二叉樹中搜索恼策。上圖中每個節(jié)點用超平面來定義鸦致,所以我們可以計算出該節(jié)點往哪個方向遍歷,搜索時間 log n
技巧1:使用優(yōu)先隊列
如果一個劃分的兩邊“靠得足夠近”(量化方式在后面介紹)涣楷,我們就兩邊都遍歷分唾。這樣就不只是遍歷一個節(jié)點的一邊,我們將遍歷更多的點
我們可以設(shè)置一個閾值狮斗,用來表示是否愿意搜索劃分“錯”的一遍绽乔。如果設(shè)置為0,我們將總是遍歷“對”的一片碳褒。但是如果設(shè)置成0.5折砸,就按照上面的搜索路徑。
這個技巧實際上是利用優(yōu)先級隊列沙峻,依據(jù)兩邊的最大距離睦授。好處是我們能夠設(shè)置比0大的閾值,逐漸增加搜索范圍专酗。
技巧2:構(gòu)建一個森林
我們能夠用一個優(yōu)先級隊列,同時搜索所有的樹盗扇。這樣有另外一個好處祷肯,搜索會聚焦到那些與已知點靠得最近的那些樹——能夠把距離最遠的空間劃分出去
每棵樹都包含所有的點,所以當(dāng)我們搜索多棵樹的時候疗隶,將找到多棵樹上的多個點佑笋。如果我們把所有的搜索結(jié)果的葉子節(jié)點都合在一起,那么得到的最近鄰就非常符合要求斑鼻。依照上述方法蒋纬,我們找到一個近鄰的集合,接下來就是計算所有的距離和對這些點進行排序,找到最近的k個點蜀备。
很明顯关摇,我們會丟掉一些最近的點,這也是為什么叫近似最近鄰的原因碾阁。
Annoy在實際使用的時候输虱,提供了一種機制可以調(diào)整(搜索k),你能夠根據(jù)它來權(quán)衡性能(時間)和準確度(質(zhì)量)脂凶。
tips:
1.距離計算宪睹,采用歸一化的歐氏距離:vectors = sqrt(2-2*cos(u, v)) 即余弦距離
2.向量維度較小(<100),即使維度到達1000變現(xiàn)也不錯
3.內(nèi)存占用小
4.索引創(chuàng)建與查找分離(特別是一旦樹已經(jīng)創(chuàng)建蚕钦,就不能添加更多項)
5.有兩個參數(shù)可以用來調(diào)節(jié)Annoy 樹的數(shù)量n_trees和搜索期間檢查的節(jié)點數(shù)量search_k
n_trees在構(gòu)建時提供亭病,并影響構(gòu)建時間和索引大小。 較大的值將給出更準確的結(jié)果嘶居,但更大的索引罪帖。
search_k在運行時提供,并影響搜索性能食听。 較大的值將給出更準確的結(jié)果胸蛛,但將需要更長的時間返回。
如果不提供search_k樱报,它將默認為n * n_trees葬项,其中n是近似最近鄰的數(shù)目。 否則迹蛤,search_k和n_tree大致是獨立的民珍,即如果search_k保持不變,n_tree的值不會影響搜索時間盗飒,反之亦然嚷量。 基本上,建議在可用負載量的情況下盡可能大地設(shè)置n_trees逆趣,并且考慮到查詢的時間限制蝶溶,建議將search_k設(shè)置為盡可能大。
from annoy import AnnoyIndex
import random
f = 40 #維度
t = AnnoyIndex(f) # Length of item vector that will be indexed
for i in xrange(1000):
v = [random.gauss(0, 1) for z in xrange(f)]
t.add_item(i, v) #添加向量
t.build(10) # 10 trees
t.save('test.ann')
u = AnnoyIndex(f)
u.load('test.ann') # super fast, will just mmap the file
print(u.get_nns_by_item(0, 1000)) # will find the 1000 nearest neighbors of the first(0) vec
seurat4的annoy的源代碼:
# Run annoy
#
# @param data Data to build the index with
# @param query A set of data to be queried against data
# @param metric Distance metric; can be one of "euclidean", "cosine", "manhattan",
# "hamming"
# @param n.trees More trees gives higher precision when querying
# @param k Number of neighbors
# @param search.k During the query it will inspect up to search_k nodes which
# gives you a run-time tradeoff between better accuracy and speed.
# @param include.distance Include the corresponding distances
# @param index optional index object, will be recomputed if not provided
#
AnnoyNN <- function(data,
query = data,
metric = "euclidean",
n.trees = 50,
k,
search.k = -1,
include.distance = TRUE,
index = NULL
) {
idx <- index %||% AnnoyBuildIndex(
data = data,
metric = metric,
n.trees = n.trees)
nn <- AnnoySearch(
index = idx,
query = query,
k = k,
search.k = search.k,
include.distance = include.distance)
nn$idx <- idx
nn$alg.info <- list(metric = metric, ndim = ncol(x = data))
return(nn)
}
# Build the annoy index
#
# @param data Data to build the index with
# @param metric Distance metric; can be one of "euclidean", "cosine", "manhattan",
# "hamming"
# @param n.trees More trees gives higher precision when querying
#
#' @importFrom RcppAnnoy AnnoyEuclidean AnnoyAngular AnnoyManhattan AnnoyHamming
#
AnnoyBuildIndex <- function(data, metric = "euclidean", n.trees = 50) {
f <- ncol(x = data)
a <- switch(
EXPR = metric,
"euclidean" = new(Class = RcppAnnoy::AnnoyEuclidean, f),
"cosine" = new(Class = RcppAnnoy::AnnoyAngular, f),
"manhattan" = new(Class = RcppAnnoy::AnnoyManhattan, f),
"hamming" = new(Class = RcppAnnoy::AnnoyHamming, f),
stop ("Invalid metric")
)
for (ii in seq(nrow(x = data))) {
a$addItem(ii - 1, data[ii, ])
}
a$build(n.trees)
return(a)
}
# Search an Annoy approximate nearest neighbor index
#
# @param Annoy index, built with AnnoyBuildIndex
# @param query A set of data to be queried against the index
# @param k Number of neighbors
# @param search.k During the query it will inspect up to search_k nodes which
# gives you a run-time tradeoff between better accuracy and speed.
# @param include.distance Include the corresponding distances in the result
#
# @return A list with 'nn.idx' (for each element in 'query', the index of the
# nearest k elements in the index) and 'nn.dists' (the distances of the nearest
# k elements)
#
#' @importFrom future plan
#' @importFrom future.apply future_lapply
#
AnnoySearch <- function(index, query, k, search.k = -1, include.distance = TRUE) {
n <- nrow(x = query)
idx <- matrix(nrow = n, ncol = k)
dist <- matrix(nrow = n, ncol = k)
convert <- methods::is(index, "Rcpp_AnnoyAngular")
if (!inherits(x = plan(), what = "multicore")) {
oplan <- plan(strategy = "sequential")
on.exit(plan(oplan), add = TRUE)
}
res <- future_lapply(X = 1:n, FUN = function(x) {
res <- index$getNNsByVectorList(query[x, ], k, search.k, include.distance)
# Convert from Angular to Cosine distance
if (convert) {
res$dist <- 0.5 * (res$dist * res$dist)
}
list(res$item + 1, res$distance)
})
for (i in 1:n) {
idx[i, ] <- res[[i]][[1]]
if (include.distance) {
dist[i, ] <- res[[i]][[2]]
}
}
return(list(nn.idx = idx, nn.dists = dist))
}
- rann包
KD樹法—大規(guī)模點集
這個包只有一個函數(shù)nn2宣渗。功能十分強大抖所,支持很高的維數(shù)(文檔里說20維以上可能不太好了,但還是可以試試)痕囱,返回值包括了k級最鄰近點的序號和k級最鄰近點距離兩個內(nèi)容田轧,非常實用。由于實際上調(diào)用的是c++庫鞍恢,所以速度非成嫡常快每窖,我用了30w+點的點集,瞬間完成弦悉。
這包說明詳見RANN package - RDocumentation
Finds the k nearest neighbours for every point in a given dataset in O(N log N) time using Arya and Mount's ANN library (v1.1.3). There is support for approximate as well as exact searches, fixed radius searches and bd as well as kd trees.
使用Arya和Mount的ANN庫(v1.1.3)在O(N log N)時間內(nèi)為給定數(shù)據(jù)集中的每個點查找k個最近的鄰居窒典。支持近似和精確搜索,固定半徑搜索警绩,bd和kd樹崇败。
This package implements nearest neighbors for the Euclidean (L2) metric. For the Manhattan (L1) metric, install the RANN1 package.
這個包實現(xiàn)了歐幾里德(L2)度量的最近鄰。對于曼哈頓(L1)指標肩祥,安裝RANN1軟件包后室。
library(RANN)
p_co=st_coordinates(point_set)
#獲取點坐標。由于這個包只返回歐式距離混狠,
不是專門針對地理空間開發(fā)的岸霹,
所以在計算前還是先把點坐標轉(zhuǎn)換成投影坐標系為宜。
d=nn2(p_co,p_co,k=2)
#k=2即為設(shè)定K級最鄰近點将饺,
因為這個包只支持計算兩個點集之間的最鄰近點關(guān)系贡避,所以只能計算包含自身的2級最鄰近點。
該函數(shù)還有其他參數(shù)可以設(shè)定搜尋方法予弧、查尋樹的類型刮吧、搜索半徑等。
d_indice=d[[1]][,2]
#返回值包括兩個list元素掖蛤,
第一個元素就是最鄰近點序號矩陣杀捻。
第二列就是我們要找的最鄰近點序號。
'RANN'提供一個快速的最近鄰搜索算法蚓庭,能夠?qū)⑺惴ǖ臅r間復(fù)雜度縮短為O(n log n)致讥。
L2正則化:
L1正則化就是在loss function后邊所加正則項為L1范數(shù),加上L1范數(shù)容易得到稀疏解(0比較多)器赞。L2正則化就是loss function后邊所加正則項為L2范數(shù)的平方垢袱,加上L2正則相比于L1正則來說,得到的解比較平滑(不是稀疏)港柜,但是同樣能夠保證解中接近于0(但不是等于0请契,所以相對平滑)的維度比較多,降低模型的復(fù)雜度夏醉。