Using PySpark KMeans for Real-World Clustering Problems

Today I used Spark to cluster an article-embedding dataset I had prepared earlier. After working through a few problems, I wrote down the process (this article is still a work in progress). As a next step I will add some visualizations with matplotlib.

My data is stored in a text file, one record per line, formatted as follows:

{# of record, [v_d1, v_d2, ... , v_dn]}

where v_di is the value of the i-th dimension.
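The curly braces above are only notation for a record. Judging from the parsing code below, which strips the square brackets and splits on commas, a raw line with a 3-dimensional embedding would look roughly like this (the id and values are made up):

42,[0.127,-0.953,0.338]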

KMeans algorithm

First we import the modules we will use (re, numpy and sqrt are needed by the parsing and evaluation code further down):

import re
import numpy
from math import sqrt
from pyspark.mllib.clustering import KMeans, KMeansModel
from pyspark.mllib.linalg import DenseVector
from pyspark.mllib.linalg import SparseVector

Then we preprocess the source data:

data = sc.textFile("/Users/pluto/data/notes_vector.csv")

def parseline(line):
    # Strip the square brackets and split the line on commas
    temp = re.sub(r'\[|\]', '', line).split(',')
    # temp[0] is the record id; the remaining len(temp) - 1 fields are the
    # embedding values, which gives the dimension of the vector
    return SparseVector(len(temp) - 1,
                        [(idx, float(value)) for idx, value in enumerate(temp[1:])])

parsedData = data.map(parseline)
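
Since KMeans makes multiple passes over the training data, it is usually worth caching the parsed RDD, and taking a quick look at one parsed record is a cheap sanity check:

parsedData.cache()
print(parsedData.take(1))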

Our data preprocessing is done. Now call the train method to fit the model:

clusters = KMeans.train(parsedData, 100, maxIterations=10, runs=10, initializationMode="random")

Explanation of the parameters:

  • data: training points, stored as an RDD of vectors
  • k: number of clusters
  • maxIterations: maximum number of iterations
  • runs: number of parallel runs, defaults to 1. The best model is returned.
  • initializationMode: initialization mode, either "random" or "k-means||" (default)
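
Once training is done, the returned model can be queried directly; in Spark 1.4 and later it can also be saved to disk and loaded back. A minimal sketch (the save path is made up):

# Cluster index assigned to the first parsed point
first_point = parsedData.first()
print(clusters.predict(first_point))

# Inspect the learned cluster centres (a list of numpy arrays)
print(len(clusters.centers))

# Persist and reload the model (Spark 1.4+; the path is hypothetical)
clusters.save(sc, "/Users/pluto/models/notes_kmeans")
sameModel = KMeansModel.load(sc, "/Users/pluto/models/notes_kmeans")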

To evaluate the performance of the model, we need an error (loss) function. The idea is to minimise the total Euclidean distance between each data point and the cluster centre it is assigned to.

We define an error function as follows:

def error(point):
    # Centre of the cluster this point is assigned to
    center = clusters.centers[clusters.predict(point)]
    denseCenter = DenseVector(numpy.ndarray.tolist(center))
    # Euclidean distance between the point and its cluster centre
    return sqrt(sum([x**2 for x in (DenseVector(point.toArray()) - denseCenter)]))

Calculate the total error over all points:

WSSSE = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y)

Here toArray is a method of SparseVector; converting both the point and the centre to DenseVector makes the subtraction well defined.
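
The choice of k = 100 above is somewhat arbitrary. WSSSE generally decreases as k grows, so a common rough heuristic is to train models for several values of k and look for an "elbow" in the cost curve. A sketch of that idea (the candidate k values are made up):

def model_error(model, point):
    # Same distance as error(), but parameterised by the model
    center = model.centers[model.predict(point)]
    denseCenter = DenseVector(numpy.ndarray.tolist(center))
    return sqrt(sum([x**2 for x in (DenseVector(point.toArray()) - denseCenter)]))

for k in [20, 50, 100, 200]:
    model = KMeans.train(parsedData, k, maxIterations=10,
                         initializationMode="random")
    cost = parsedData.map(lambda p: model_error(model, p)).reduce(lambda x, y: x + y)
    print(k, cost)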

Gaussian Mixture

A Gaussian Mixture Model represents a composite distribution whereby points are drawn from one of k Gaussian sub-distributions, each with its own probability. The MLlib implementation uses the expectation-maximization algorithm to induce the maximum-likelihood model given a set of samples. The implementation has the following parameters:

  • k is the number of desired clusters.
  • convergenceTol is the maximum change in log-likelihood at which we consider convergence achieved.
  • maxIterations is the maximum number of iterations to perform without reaching convergence.
  • initialModel is an optional starting point from which to start the EM algorithm. If this parameter is omitted, a random starting point will be constructed from the data.

Training a Gaussian mixture model on the same parsed data looks very similar (GaussianMixture needs to be imported first):

from pyspark.mllib.clustering import GaussianMixture

gmm = GaussianMixture.train(parsedData, 100)
for i in range(100):
    print("weight = ", gmm.weights[i], "mu = ", gmm.gaussians[i].mu,
          "sigma = ", gmm.gaussians[i].sigma.toArray())

Some traps

  1. SparseVector indices

This one comes from a mailing-list thread about training KMeans on sparse vectors; the poster's resolution is quoted first, followed by the original question.

I figured it out. My indices parameter for the sparse vector was messed up. It was a good lesson for me: when using Vectors.sparse(int size, int[] indices, double[] values) to generate a vector, size is the size of the whole vector, not just the number of elements that carry a value. The indices array also needs to be in ascending order. In many cases it is probably easier to use the other two forms of the Vectors.sparse function if the indices and values are not naturally sorted.

-Yao

Subject: KMeans - java.lang.IllegalArgumentException: requirement failed

I am trying to train a KMeans model with sparse vectors on Spark 1.0.1.
When I run the training I got the following exception:
java.lang.IllegalArgumentException: requirement failed
                at scala.Predef$.require(Predef.scala:221)
                at org.apache.spark.mllib.util.MLUtils$.fastSquaredDistance(MLUtils.scala:271)
                at org.apache.spark.mllib.clustering.KMeans$.fastSquaredDistance(KMeans.scala:398)
                at org.apache.spark.mllib.clustering.KMeans$$anonfun$findClosest$1.apply(KMeans.scala:372)
                at org.apache.spark.mllib.clustering.KMeans$$anonfun$findClosest$1.apply(KMeans.scala:366)
                at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
                at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
                at org.apache.spark.mllib.clustering.KMeans$.findClosest(KMeans.scala:366)
                at org.apache.spark.mllib.clustering.KMeans$.pointCost(KMeans.scala:389)
                at org.apache.spark.mllib.clustering.KMeans$$anonfun$17$$anonfun$apply$7.apply(KMeans.scala:269)
                at org.apache.spark.mllib.clustering.KMeans$$anonfun$17$$anonfun$apply$7.apply(KMeans.scala:268)
                at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
                at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
                at scala.collection.immutable.Range.foreach(Range.scala:141)
                at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
                at scala.collection.AbstractTraversable.map(Traversable.scala:105)
                at org.apache.spark.mllib.clustering.KMeans$$anonfun$17.apply(KMeans.scala:268)
                at org.apache.spark.mllib.clustering.KMeans$$anonfun$17.apply(KMeans.scala:267)

What does this mean? How do I troubleshoot this problem?
Thanks.

-Yao
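
To make that fix concrete, here is a small sketch of the three ways pyspark's Vectors.sparse can build a sparse vector; size is the full dimensionality, and the separate indices/values form must already be sorted by index (the dimension 5 and the values here are made up):

from pyspark.mllib.linalg import Vectors

# Two parallel arrays: indices must be in ascending order
v1 = Vectors.sparse(5, [0, 3], [1.0, 2.5])

# A dict of {index: value}: indices need not be pre-sorted
v2 = Vectors.sparse(5, {3: 2.5, 0: 1.0})

# A list of (index, value) pairs: also sorted internally
v3 = Vectors.sparse(5, [(0, 1.0), (3, 2.5)])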

References

Spark MLlib clustering documentation
