目前Gelly擁有越來越多的圖算法侨颈,可以方便地進(jìn)行大型圖分析县恕。使用也非常簡(jiǎn)單峰搪,只需要在需要分析的圖上執(zhí)行run()
方法即可辟灰,以標(biāo)簽傳播進(jìn)行社區(qū)發(fā)現(xiàn)為例个榕,代碼如下:
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
Graph<Long, Long, NullValue> graph = ...
// run Label Propagation for 30 iterations to detect communities on the input graph
DataSet<Vertex<Long, Long>> verticesWithCommunity = graph.run(new LabelPropagation<Long>(30));
// print the result
verticesWithCommunity.print();
Community Detection
Overview
在圖論中,社區(qū)指的是內(nèi)部連接緊密但與其他組連接稀疏的節(jié)點(diǎn)組合芥喇。Gelly算法庫(kù)中提供的社區(qū)發(fā)現(xiàn)算法是論文 Towards real-time community detection in large networks中提到的社區(qū)發(fā)現(xiàn)算法的實(shí)現(xiàn)西采。
Details
該算法使用scatter-gather iterations
模型來實(shí)現(xiàn)。一開始继控,為每個(gè)頂點(diǎn)分配一個(gè)Tuple2
用于保存該頂點(diǎn)原始標(biāo)簽值label和一個(gè)等于1.0的分值score械馆。在每次迭代中胖眷,頂點(diǎn)向其鄰接頂點(diǎn)發(fā)送一個(gè)包含當(dāng)前頂點(diǎn)label和score的消息。當(dāng)一個(gè)頂點(diǎn)從其鄰接頂點(diǎn)接收到消息時(shí)霹崎,選擇具有最大score的label作為當(dāng)前節(jié)點(diǎn)的新label珊搀,并根據(jù)邊上的值以及用戶定義的度衰減系數(shù)和當(dāng)前superstep號(hào)對(duì)當(dāng)前score進(jìn)行重新打分。當(dāng)頂點(diǎn)不再更新其label值或達(dá)到最大迭代次數(shù)時(shí)尾菇,算法收斂境析。
Usage
該算法以任意頂點(diǎn)ID類型,Long型頂點(diǎn)值和Double型邊值的圖數(shù)據(jù)作為輸入错沽,并返回一個(gè)類型和輸入圖類型相同的新圖簿晓,其中返回的頂點(diǎn)值對(duì)應(yīng)于社區(qū)的標(biāo)簽。即如果兩個(gè)頂點(diǎn)具有相同的頂點(diǎn)值千埃,則它們屬于同一個(gè)社區(qū)憔儿。該構(gòu)造函數(shù)具有如下兩個(gè)參數(shù):
-
maxIterations
:最大迭代次數(shù)。 -
delta
:衰減系數(shù)放可,默認(rèn)0.5谒臼。
Label Propagation
Overview
這是一個(gè)著名的標(biāo)簽傳播算法的實(shí)現(xiàn),具體可以參考這篇論文Near linear time algorithm to detect community structures in large-scale networks耀里。該算法通過在鄰接頂點(diǎn)間互相傳播標(biāo)簽來實(shí)現(xiàn)社區(qū)發(fā)現(xiàn)蜈缤。與上一個(gè)社區(qū)發(fā)現(xiàn)算法不同,該算法實(shí)現(xiàn)不需要考慮每個(gè)社區(qū)的標(biāo)簽label的score值冯挎。
Details
該算法同樣使用scatter-gather iterations
模型來實(shí)現(xiàn)底哥,標(biāo)簽的類型應(yīng)該是Comparable,并使用輸入圖的頂點(diǎn)value值初始化(注意房官,初始化的時(shí)候可以將圖中每個(gè)頂點(diǎn)的ID值作為初始標(biāo)簽值賦值給頂點(diǎn)的value)趾徽。該算法迭代地通過傳播標(biāo)簽來細(xì)化發(fā)現(xiàn)的社區(qū),在每次迭代過程中翰守,每個(gè)頂點(diǎn)將其鄰接頂點(diǎn)中標(biāo)簽出現(xiàn)次數(shù)最多的標(biāo)簽作為當(dāng)前節(jié)點(diǎn)的新標(biāo)簽孵奶,如果出現(xiàn)次數(shù)最多的標(biāo)簽有多個(gè),將標(biāo)簽值最大的值作為當(dāng)前節(jié)點(diǎn)的新標(biāo)簽(注意:該算法理論上是允許隨機(jī)選則一個(gè)最大的標(biāo)簽的蜡峰,但是Gelly中的實(shí)現(xiàn)是取標(biāo)簽最大值的了袁,這也是為什么標(biāo)簽在初始化的時(shí)候必須是Comparable類型)。當(dāng)頂點(diǎn)不再更新其label值或達(dá)到最大迭代次數(shù)時(shí)湿颅,算法收斂载绿。由于LPA算法是不穩(wěn)定的,不同的初始化方式可能導(dǎo)致不同的社區(qū)發(fā)現(xiàn)結(jié)果油航。
Usage
該算法要求輸入的圖必須滿足兩個(gè)條件:即頂點(diǎn)的類型(ID)和頂點(diǎn)值value都是Comparable類型的崭庸,邊值value類型不考慮。該算法返回一個(gè)節(jié)點(diǎn)組成的數(shù)據(jù)集DataSet
,其中頂點(diǎn)的value值表示節(jié)點(diǎn)所屬的對(duì)應(yīng)社區(qū)冀自。該構(gòu)造函數(shù)需要傳入一個(gè)參數(shù):
-
maxIterations
:最大迭代次數(shù)揉稚。
Connected Components
Overview
該算法是弱連通分支算法的實(shí)現(xiàn),在收斂時(shí)熬粗,如果兩個(gè)頂點(diǎn)之間存在一條連通的路徑(不考慮方向)搀玖,那么這兩個(gè)點(diǎn)就屬于同一個(gè)連通分支。
Details
該算法也是使用scatter-gather iterations
模型來實(shí)現(xiàn)的驻呐。初始化的時(shí)候灌诅,使用唯一的comparable類型的頂點(diǎn)value表示連通分支的ID。每次迭代的時(shí)候含末,頂點(diǎn)向鄰接頂點(diǎn)傳播當(dāng)前的頂點(diǎn)value值猜拾,每個(gè)頂點(diǎn)在接收到的所有ID中選擇一個(gè)最小的ID與當(dāng)前頂點(diǎn)的value值比較,如果當(dāng)前頂點(diǎn)的value值小于選出最小ID值佣盒,則將當(dāng)前頂點(diǎn)的value值修改為這個(gè)選出的最小值挎袜。當(dāng)頂點(diǎn)不再更新其value值或達(dá)到最大迭代次數(shù)時(shí),算法收斂肥惭。
Usage
該算法生成一個(gè)頂點(diǎn)的數(shù)據(jù)集合DataSet
盯仪,其中頂點(diǎn)的value值對(duì)用于指定的連通分支,該構(gòu)造函數(shù)需要傳入一個(gè)參數(shù):
-
maxIterations
:最大迭代次數(shù)蜜葱。
GSA Connected Components
Overview
與前一個(gè)算法一樣全景,該算法也是弱連通分支算法的實(shí)現(xiàn),在收斂時(shí)牵囤,如果兩個(gè)頂點(diǎn)之間存在一條連通的路徑(不考慮方向)爸黄,那么這兩個(gè)點(diǎn)就屬于同一個(gè)連通分支。
Details
與前一個(gè)算法不同揭鳞,該算法使用gather-sum-apply iterations
模型來實(shí)現(xiàn)的炕贵。初始化的時(shí)候,使用唯一的comparable類型的頂點(diǎn)value表示連通分支的ID汹桦。在gather階段鲁驶,每個(gè)頂點(diǎn)收集其所有鄰接頂點(diǎn)的value值鉴裹;在sum階段舞骆,從上一階段收集的所有頂點(diǎn)value值中選出最小值;在apply階段径荔,如果該最小值小于當(dāng)前頂點(diǎn)的value值督禽,則將當(dāng)前頂點(diǎn)的value值修改為該最小值。當(dāng)頂點(diǎn)不再更新其value值或達(dá)到最大迭代次數(shù)時(shí)总处,算法收斂狈惫。
Usage
參數(shù):
-
maxIterations
:最大迭代次數(shù)。
Single Source Shortest Paths
Overview
該方法是針對(duì)加權(quán)圖中的單源點(diǎn)最短路徑算法的實(shí)現(xiàn)。給定一個(gè)源頂點(diǎn)胧谈,該算法會(huì)計(jì)算出該頂點(diǎn)與圖中其他所有頂點(diǎn)的最短路徑忆肾。
Details
該算法是使用scatter-gather iterations
模型來實(shí)現(xiàn)的。在每次迭代中菱肖,每個(gè)頂點(diǎn)向其鄰接頂點(diǎn)發(fā)送一條頂點(diǎn)到源頂點(diǎn)的距離和邊權(quán)重之和的消息Sum(distance, weight)
作為鄰接頂點(diǎn)的候選距離客冈,在接收到鄰接頂點(diǎn)的候選距離時(shí),如果候選距離中的最小值小于當(dāng)前頂點(diǎn)的距離值稳强,則更新當(dāng)前頂點(diǎn)的距離值场仲,否則不更新。如果在此次迭代中退疫,該頂點(diǎn)距離值沒有更新渠缕,則在下次迭代中,該頂點(diǎn)不再向鄰接頂點(diǎn)發(fā)送消息褒繁。當(dāng)頂點(diǎn)不再更新其value值或達(dá)到最大迭代次數(shù)時(shí)亦鳞,算法收斂。
Usage
該方法要求輸入的邊的值是Double類型棒坏,其他類型不考慮蚜迅。輸出的結(jié)果是個(gè)頂點(diǎn)的數(shù)據(jù)集合,其中頂點(diǎn)的value值表示當(dāng)前頂點(diǎn)到源點(diǎn)的最短距離俊抵。
參數(shù):
-
srcVertexId
:源點(diǎn)ID -
maxIterations
:最大迭代次數(shù)谁不。
Triangle Enumerator
Overview
該方法用于枚舉輸入的圖中出現(xiàn)的所有的唯一的三角形。一個(gè)三角形由三個(gè)頂點(diǎn)和三個(gè)頂點(diǎn)間互相連接的三個(gè)邊組成徽诲。該方法的實(shí)現(xiàn)不考慮邊的方向刹帕。
Details
該算法的實(shí)現(xiàn)思路也很簡(jiǎn)單,首先對(duì)所有的共享同一個(gè)頂點(diǎn)的邊進(jìn)行分組谎替,并構(gòu)建一個(gè)三元組偷溺,即三個(gè)頂點(diǎn)被兩條邊連接在一起。然后過濾掉所有不存在第三條邊的三元組钱贯,剩下的就是該圖中的所有的三角形枚舉結(jié)果挫掏。對(duì)于一組共享同一個(gè)頂點(diǎn)的n條邊,一共可以組成((n*(n-1))/2)個(gè)三元組秩命。因此尉共,優(yōu)化算法的一種方法是在輸出度較小的頂點(diǎn)上對(duì)邊進(jìn)行分組,以減少三角形的數(shù)量弃锐。該實(shí)現(xiàn)在基礎(chǔ)算法的基礎(chǔ)上進(jìn)行了擴(kuò)展袄友,通過計(jì)算邊緣頂點(diǎn)的輸出度,并對(duì)邊的小度頂點(diǎn)進(jìn)行邊的分組霹菊。
Usage
該算法以有向圖作為輸入剧蚣,并輸出一個(gè)Tuple3
的DataSet
,圖中的頂點(diǎn)ID是Comparable類型的。每個(gè)三元組對(duì)應(yīng)一個(gè)三角形鸠按,其中每個(gè)域代表一個(gè)頂點(diǎn)的ID礼搁。
Local Clustering Coefficient
Overview
局部聚類系數(shù)度量每個(gè)頂點(diǎn)鄰接頂點(diǎn)之間的連通性。分值范圍從0.0(鄰接頂點(diǎn)之間沒有邊)到1.0(鄰接頂點(diǎn)之間互相關(guān)聯(lián))目尖。
Details
在圖中叹坦,如果一個(gè)頂點(diǎn)的兩個(gè)相鄰頂點(diǎn)之間存在一條邊,則可以構(gòu)成一個(gè)三角形卑雁。統(tǒng)計(jì)一個(gè)頂點(diǎn)的鄰接頂點(diǎn)之間的邊數(shù)等價(jià)于統(tǒng)計(jì)一個(gè)頂點(diǎn)組成的三角形個(gè)數(shù)募书。聚類系數(shù)為一個(gè)頂點(diǎn)的鄰接定點(diǎn)間的邊數(shù)處以鄰接頂點(diǎn)間潛在的邊數(shù)〔舛祝可以看下面的例子:
頂點(diǎn)A的鄰接頂點(diǎn)有B莹捡,C,D扣甲,因此鄰接頂點(diǎn)之間的邊數(shù)為2(紅色的)篮赢,而A的鄰接頂點(diǎn)之間的潛在邊數(shù)為3條(所有可能的邊數(shù),包括虛線)琉挖。故頂點(diǎn)A的聚類系數(shù)為2/3=0.6667启泣,另外我們也可以知道頂點(diǎn)B的聚類系數(shù)為0.3333,頂點(diǎn)C的聚類系數(shù)為0.6667示辈,頂點(diǎn)D的聚類系數(shù)為0.333寥茫。
Usage
該方法即支持有向圖也支持無向圖,該方法的返回值是一個(gè)類型為UnaryResult
的DataSet
矾麻,其中每個(gè)UnaryResult
中包含當(dāng)前頂點(diǎn)的ID纱耻,頂點(diǎn)的度數(shù)以及頂點(diǎn),以及包含當(dāng)前頂點(diǎn)的三角形個(gè)數(shù)险耀。然后UnaryResult
提供了計(jì)算Local Clustering Efficiency的方法弄喘,直接調(diào)用即可。要注意的是圖中頂點(diǎn)ID必須是Comparable和Copyable類型甩牺。該方法具有以下參數(shù):
-
setIncludeZeroDegreeVertices
:是否包含度為零的頂點(diǎn)的結(jié)果蘑志。 -
setParallelism
:設(shè)置并發(fā)數(shù)。
Average Clustering Coefficient
Overview
平均聚類系數(shù)度量一個(gè)圖的平均連通性贬派。系數(shù)范圍從0.0(所有鄰接頂點(diǎn)之間都沒有邊)到1.0(完全圖)急但。
Details
平均聚類系數(shù)是具有至少兩個(gè)鄰接頂點(diǎn)的所有頂點(diǎn)的局部聚類系數(shù)得分的平均值。
Usage
該方法對(duì)有向圖和無相圖均適用赠群,返回結(jié)果是一個(gè)AnalyticResult
類型的對(duì)象羊始,里面包含當(dāng)前圖的平均聚類系數(shù)和頂點(diǎn)數(shù)旱幼。要注意的是圖中頂點(diǎn)ID必須是Comparable和Copyable類型查描。該方法的參數(shù):
-
setParallelism
:設(shè)置并發(fā)數(shù)。
Global Clustering Coefficient
Overview
全局聚類系數(shù)度量一個(gè)圖的連通性。系數(shù)范圍從0.0(所有鄰接頂點(diǎn)之間都沒有邊)到1.0(完全圖)冬三。
Details
全局聚類系數(shù)指的是每個(gè)節(jié)點(diǎn)的鄰接頂點(diǎn)中互相連接的個(gè)數(shù)在整個(gè)圖中的占比匀油。通常情況下,一個(gè)頂點(diǎn)關(guān)聯(lián)的相互連接的頂點(diǎn)對(duì)越多勾笆,該值越大敌蚜。
Usage
該方法對(duì)有向圖和無相圖均適用,返回結(jié)果是一個(gè)AnalyticResult
類型的對(duì)象窝爪。里面包含該圖中的三元組數(shù)目和三角形數(shù)弛车,并提供了方法計(jì)算每個(gè)頂點(diǎn)的全局聚類系數(shù)。要注意的是圖中頂點(diǎn)ID必須是Comparable和Copyable類型蒲每。該方法的參數(shù)
-
setParallelism
:設(shè)置并發(fā)數(shù)纷跛。
Triadic Census
Overview
在一個(gè)圖中,任意三個(gè)頂點(diǎn)都可以組成一個(gè)三元組邀杏,這三個(gè)頂點(diǎn)之間可能相連也可能不相連贫奠。Triadic Census用來計(jì)算圖中任意類型的三元組出現(xiàn)的次數(shù)。
Details
This analytic counts the four undirected triad types (formed with 0, 1, 2, or 3 connecting edges) or 16 directed triad types by counting the triangles from Triangle Listing and running Vertex Metrics to obtain the number of triplets and edges. Triangle counts are then deducted from triplet counts, and triangle and triplet counts are removed from edge counts.
Usage
Directed and undirected variants are provided. The algorithms take a simple graph as input and output a DataSet of TertiaryResult containing the three triangle vertices and, for the directed algorithm, a bitmask marking each of the six potential edges connecting the three vertices. The graph ID type must be Comparable and Copyable.
- setParallelism: override the parallelism of operators processing small amounts of data
- setSortTriangleVertices: normalize the triangle listing such that for each result (K0, K1, K2) the vertex IDs are sorted K0 < K1 < K2
Hyperlink-Induced Topic Search
Overview
Hyperlink-Induced Topic Search computes two interdependent scores for every vertex in a directed graph. Good hubs are those which point to many good authorities and good authorities are those pointed to by many good hubs
Details
Every vertex is assigned the same initial hub and authority scores. The algorithm then iteratively updates the scores until termination. During each iteration new hub scores are computed from the authority scores, then new authority scores are computed from the new hub scores. The scores are then normalized and optionally tested for convergence. HITS is similar to PageRank but vertex scores are emitted in full to each neighbor whereas in PageRank the vertex score is first divided by the number of neighbors.
Usage
The algorithm takes a simple directed graph as input and outputs a DataSet of UnaryResult containing the vertex ID, hub score, and authority score. Termination is configured by the number of iterations and/or a convergence threshold on the iteration sum of the change in scores over all vertices.
- setIncludeZeroDegreeVertices: whether to include zero-degree vertices in the iterative computation
- setParallelism: override the operator parallelism
PageRank
Overview
PageRank is an algorithm that was first used to rank web search engine results. Today, the algorithm and many variations are used in various graph application domains. The idea of PageRank is that important or relevant vertices tend to link to other important vertices.
Details
The algorithm operates in iterations, where pages distribute their scores to their neighbors (pages they have links to) and subsequently update their scores based on the sum of values they receive. In order to consider the importance of a link from one page to another, scores are divided by the total number of out-links of the source page. Thus, a page with 10 links will distribute 1/10 of its score to each neighbor, while a page with 100 links will distribute 1/100 of its score to each neighboring page.
Usage
The algorithm takes a directed graph as input and outputs a DataSet where each Result contains the vertex ID and PageRank score. Termination is configured with a maximum number of iterations and/or a convergence threshold on the sum of the change in score for each vertex between iterations.
- setParallelism: override the operator parallelism
Adamic-Adar
Overview
Adamic-Adar measures the similarity between pairs of vertices as the sum of the inverse logarithm of degree over shared neighbors. Scores are non-negative and unbounded. A vertex with higher degree has greater overall influence but is less influential to each pair of neighbors
Details
The algorithm first annotates each vertex with the inverse of the logarithm of the vertex degree then joins this score onto edges by source vertex. Grouping on the source vertex, each pair of neighbors is emitted with the vertex score. Grouping on vertex pairs, the Adamic-Adar score is summed.
Usage
參數(shù):
The algorithm takes a simple undirected graph as input and outputs a DataSet of BinaryResult containing two vertex IDs and the Adamic-Adar similarity score. The graph ID type must be Copyable.
- setMinimumRatio: filter out Adamic-Adar scores less than the given ratio times the average score
- setMinimumScore: filter out Adamic-Adar scores less than the given minimum
setParallelism: override the parallelism of operators processing small amounts of data
Jaccard Index
Overview
The Jaccard Index measures the similarity between vertex neighborhoods and is computed as the number of shared neighbors divided by the number of distinct neighbors. Scores range from 0.0 (no shared neighbors) to 1.0 (all neighbors are shared).
Details
Counting shared neighbors for pairs of vertices is equivalent to counting connecting paths of length two. The number of distinct neighbors is computed by storing the sum of degrees of the vertex pair and subtracting the count of shared neighbors, which are double-counted in the sum of degrees.
The algorithm first annotates each edge with the target vertex’s degree. Grouping on the source vertex, each pair of neighbors is emitted with the degree sum. Grouping on vertex pairs, the shared neighbors are counted.
Usage
參數(shù):
The algorithm takes a simple undirected graph as input and outputs a DataSet of tuples containing two vertex IDs, the number of shared neighbors, and the number of distinct neighbors. The result class provides a method to compute the Jaccard Index score. The graph ID type must be Copyable.
- setMaximumScore: filter out Jaccard Index scores greater than or equal to the given maximum fraction
- setMinimumScore: filter out Jaccard Index scores less than the given minimum fraction
- setParallelism: override the parallelism of operators processing small amounts of data
Vertex Metrics
Overview
Gelly的圖分析針對(duì)有向圖和無向圖望蜡,支持下列統(tǒng)計(jì)信息:
- number of vertices:頂點(diǎn)數(shù)
- number of edges:頂點(diǎn)的邊數(shù)
- average degree:頂點(diǎn)的平均度數(shù)
- number of triplets:頂點(diǎn)關(guān)聯(lián)的三元組個(gè)數(shù)
- maximum degree:節(jié)點(diǎn)的最大度數(shù)
- maximum number of triplets:節(jié)點(diǎn)關(guān)聯(lián)的最大三元組個(gè)數(shù)
針對(duì)無向圖唤崭,還有以下幾個(gè)統(tǒng)計(jì)信息: - number of unidirectional edges:?jiǎn)蜗蜻厒€(gè)數(shù)
- number of bidirectional edges:雙向邊個(gè)數(shù)
- maximum out degree:最大出度
- maximum in degree:最大入度
Details
這些頂點(diǎn)的統(tǒng)計(jì)信息分別由degree.annotate.directed.VertexDegrees
和 degree.annotate.undirected.VertexDegree
。
Usage
該方法對(duì)有向圖和無相圖均適用脖律,返回結(jié)果是一個(gè)AnalyticResult
類型的對(duì)象谢肾。里面包含圖中獲取節(jié)點(diǎn)統(tǒng)計(jì)信息的方法,并提供了方法計(jì)算每個(gè)頂點(diǎn)的全局聚類系數(shù)小泉。要注意的是圖中頂點(diǎn)ID必須是Comparable和Copyable類型勒叠。該方法的參數(shù):
-
setIncludeZeroDegreeVertices
:是否包含度為0的頂點(diǎn)結(jié)果。 -
setParallelism
:設(shè)置并發(fā)數(shù)膏孟。 -
setReduceOnTargetId
:僅支持無向圖眯分,用來計(jì)算不考慮方向的度數(shù)。
Edge Metrics
Overview
- number of triangle triplets:邊上的三角形三元組個(gè)數(shù)
- number of rectangle triplets:邊上矩形三元組個(gè)數(shù)
- maximum number of triangle triplets:邊上最大的三角形三元組個(gè)數(shù)
- maximum number of rectangle triplets:邊上最大的矩形三元組個(gè)數(shù)
Details
這些邊的統(tǒng)計(jì)信息分別由degree.annotate.directed.EdgeDegreesPair
和 degree.annotate.undirected.EdgeDegreePair
柒桑。
Usage
參數(shù):
-
setParallelism
:設(shè)置并發(fā)數(shù)弊决。 -
setReduceOnTargetId
:僅支持無向圖,用來計(jì)算不考慮方向的度數(shù)魁淳。