Consumer Persona based on Clustering Algorithms

Hello everyone, it has been a while since I last updated my posts, because I changed jobs from automated trading to data mining. No matter what, challenges and opportunities always come first. The first project I completed was about labeling the behavioral attributes of medicine consumers.

The original data are receipts, which are unstructured, so we need to extract the useful data and build a structured database. Then we have to build our features based on our understanding of the business. For instance, the structured data fields include orderNo (order number), card ID (member number), buy date, sex, age, pay money (order amount), count (purchase quantity), product name, per customer transaction (average spend per customer), per customer consumption (number of items per customer), company of product, ATC (ATC level-4 classification), approval (national drug approval number), etc. From these we extract features that denote higher-level labels, like loyalty, consuming capacity, health consciousness, activity (how active the customer is), seasonal consumption distribution, medicine preference (for predicting customers' diseases), user value (for precision marketing) and so on. The features within each higher-level label should not be strongly correlated with each other; we can use seaborn to draw a heatmap to check this.
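A minimal sketch of that correlation check, assuming synthetic data and hypothetical feature names (`visit_count`, `total_spend`, `avg_basket`), none of which come from the real project. The helper `correlated_pairs` and the 0.8 threshold are illustrative choices too. The seaborn import is kept inside `plot_heatmap` so the numeric check still runs without plotting libraries.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the features of one higher-level label.
# The column names are hypothetical, not taken from the real project.
rng = np.random.default_rng(0)
base = rng.normal(size=500)
features = pd.DataFrame({
    "visit_count": base,
    "total_spend": 2.0 * base + 0.1 * rng.normal(size=500),  # nearly a copy of visit_count
    "avg_basket": rng.normal(size=500),                       # independent of the others
})

corr = features.corr()  # Pearson correlation matrix


def correlated_pairs(corr_matrix, threshold=0.8):
    """Return feature pairs whose absolute correlation exceeds the threshold."""
    cols = list(corr_matrix.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if abs(corr_matrix.iloc[i, j]) > threshold:
                pairs.append((cols[i], cols[j]))
    return pairs


def plot_heatmap(corr_matrix):
    """Draw the correlation heatmap (needs seaborn and matplotlib installed)."""
    import matplotlib.pyplot as plt
    import seaborn as sns

    sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
    plt.show()


flagged = correlated_pairs(corr)
print(flagged)
```

Calling `plot_heatmap(corr)` shows the same matrix visually. A flagged pair is a hint that one of the two features could be dropped or merged before clustering.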

After validating the rationality of those clusters, we can put the models into practical use.

The process of using fundamental data to build something that denotes higher-level labels (attributes) is called feature engineering. It is the most critical step and involves business privacy, so this post only records my working experience. Well, let's begin.

Firstly, data processing. I think this step is the most time-consuming, and second in importance only to feature engineering, because you will confront many issues: missing values, redundancy, logically invalid data, wrongly typed data, outlier detection, data transformation (like log transformation), etc. Different handling methods are used under different business scenarios. For example, there are some distributions below, extremely unbalanced, with long tails.
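To make the cleaning issues concrete, here is a small sketch with pandas. The toy rows, the column names (`order_no`, `age`, `pay_money`, `count`) and the rules in the hypothetical helper `clean_orders` are assumptions for illustration. The real rules depend on the business scenario.

```python
import numpy as np
import pandas as pd

# Toy order table; values and column names are made up for illustration.
raw = pd.DataFrame({
    "order_no": ["A1", "A1", "A2", "A3", "A4"],
    "age": ["34", "34", "n/a", "51", "28"],      # wrong type: stored as text
    "pay_money": [25.0, 25.0, 40.0, -5.0, np.nan],  # negative and missing amounts
    "count": [1, 1, 2, 1, 3],
})


def clean_orders(df):
    """Apply a few assumed cleaning rules and return a new frame."""
    out = df.drop_duplicates().copy()                       # redundancy
    out["age"] = pd.to_numeric(out["age"], errors="coerce")  # type errors become NaN
    # Logic errors: an order amount must be positive (missing values are kept for now).
    out = out[out["pay_money"].isna() | (out["pay_money"] > 0)].copy()
    # Missing values: fill with the column median (one option among many).
    out["age"] = out["age"].fillna(out["age"].median())
    out["pay_money"] = out["pay_money"].fillna(out["pay_money"].median())
    return out.reset_index(drop=True)


cleaned = clean_orders(raw)
print(cleaned)
```

Median filling is only one choice. Dropping the rows or filling by customer group may fit other scenarios better.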

Then we need to turn the non-normal distribution into a normal or close-to-normal one, which has better statistical properties and better fits the basic assumptions of some clustering algorithms like k-means. So I applied a log transformation, after which the data looks like a left-skewed normal distribution. You can also try a square-root, reciprocal or inverse trigonometric transformation.
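A quick way to see the effect of such a transformation is to compare skewness before and after. The sketch below uses synthetic log-normal data as a stand-in for a long-tailed field such as the order amount. The variable names and the hand-written `skewness` helper are assumptions, not the project's code.

```python
import numpy as np


def skewness(x):
    """Sample skewness: the third standardized moment."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return float((centered ** 3).mean() / x.std() ** 3)


# Synthetic long-tailed data standing in for something like order amounts.
rng = np.random.default_rng(42)
pay_money = rng.lognormal(mean=5.0, sigma=1.0, size=10_000)

log_pay = np.log1p(pay_money)   # log(1 + x), safe when x can be 0
sqrt_pay = np.sqrt(pay_money)   # milder alternative

skew_raw = skewness(pay_money)
skew_log = skewness(log_pay)
skew_sqrt = skewness(sqrt_pay)

print(f"raw: {skew_raw:.2f}  sqrt: {skew_sqrt:.2f}  log: {skew_log:.2f}")
```

A skewness close to 0 means the distribution is roughly symmetric. On this synthetic data the log transformation pulls the tail in more strongly than the square root does.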

As for outlier detection, you can use non-machine-learning methods like quartile (IQR) detection or the n-sigma criterion. If you have clear definitions of outliers from a business perspective, I think it is better and more time-saving to use methods that need no training. As for methods that need training, I preferred Isolation Forest (abbr. IForest). When using this method, you can use only "pure" data (without outliers) as the training set, which does not affect the result on the test set. This process is called Novelty Detection, and it is often used when outliers are very rare (like spam). Sklearn ships an IForest implementation, so we can use it directly. The results did not look good, though. No matter what kind of training set (pure or impure) I built, all of them performed badly. There are nearly 320 thousand records, with 280 thousand inliers and 40 thousand outliers after IForest detection. In the pictures below, black points denote inliers and white points denote outliers. Therefore, most outlier detection is based on an understanding of the practical scenario.
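The three approaches mentioned here can be sketched as follows. The data is synthetic, the helper names `iqr_outliers` and `sigma_outliers` are hypothetical, and the thresholds (1.5 x IQR, 3 sigma) are the common defaults, not necessarily the ones used in the project. The Isolation Forest part trains on "pure" data only, in the novelty-detection style described above.

```python
import numpy as np
from sklearn.ensemble import IsolationForest


def iqr_outliers(x, k=1.5):
    """Quartile rule: flag points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)


def sigma_outliers(x, n=3.0):
    """n-sigma rule: flag points more than n standard deviations from the mean."""
    x = np.asarray(x, dtype=float)
    return np.abs(x - x.mean()) > n * x.std()


# Quartile rule on a tiny made-up sample with one extreme amount.
amounts = np.array([10, 12, 11, 13, 12, 11, 10, 12, 200])
iqr_mask = iqr_outliers(amounts)

# n-sigma rule on a larger synthetic sample with one extreme value appended.
rng = np.random.default_rng(0)
values = np.append(rng.normal(size=1000), 50.0)
sigma_mask = sigma_outliers(values)

# Isolation Forest fitted on "pure" data only, then applied to new points.
train = rng.normal(size=(500, 2))
test = np.array([[0.0, 0.0], [0.2, -0.1], [8.0, 8.0]])
forest = IsolationForest(n_estimators=100, random_state=0).fit(train)
pred = forest.predict(test)  # 1 = inlier, -1 = outlier

print(iqr_mask.sum(), sigma_mask.sum(), pred)
```

The n-sigma rule is sensitive to sample size: on the 9-point `amounts` array the extreme value inflates the standard deviation so much that it stays within 3 sigma, which is one reason the quartile rule is often preferred for small or heavy-tailed samples.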

After preparing the data set, we can do some training for fun. K-means, Mini-Batch K-means, Birch, MeanShift, GaussianMixture (GMM), whatever you like, just try it. However, I do not suggest using only one of them in practice. I imitate the AdaBoost rationale and use multiple clustering methods as weak clusterers (this method will not be shown here). Or we can combine MeanShift, which does not need a specified number of clusters, with the methods that do. For example, we could run MeanShift first to find the cluster centers without specifying the number of clusters (here it finds 10).

Then use those cluster centers as input for K-means or GaussianMixture, and specify the number of clusters we need.
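One possible reading of this two-stage idea, sketched on synthetic blobs: MeanShift finds the centers on its own, K-means then groups those centers into the required number of clusters, and every sample inherits the final label of its MeanShift center. The data, the bandwidth of 1.2 and the target of 3 clusters are assumptions for illustration, and the original project may have wired the two steps together differently.

```python
import numpy as np
from sklearn.cluster import KMeans, MeanShift
from sklearn.datasets import make_blobs

# Synthetic data: six tight blobs that form three natural pairs.
blob_centers = [[0, 0], [3, 0], [10, 10], [13, 10], [20, 0], [23, 0]]
X, _ = make_blobs(n_samples=600, centers=blob_centers, cluster_std=0.4, random_state=0)

# Stage 1: MeanShift discovers cluster centers without a preset cluster count.
ms = MeanShift(bandwidth=1.2).fit(X)
centers = ms.cluster_centers_

# Stage 2: K-means groups the MeanShift centers into the number of clusters we need.
n_final = 3
if len(centers) < n_final:
    raise ValueError("MeanShift found fewer centers than the requested cluster count")
km = KMeans(n_clusters=n_final, n_init=10, random_state=0).fit(centers)

# Each sample takes the K-means label of the MeanShift center it belongs to.
final_labels = km.labels_[ms.labels_]

print(len(centers), np.bincount(final_labels))
```

The bandwidth controls how many centers MeanShift returns, so it is worth checking that the first stage yields comfortably more centers than the final cluster count.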

Eventually, the clustering result looks like the following:

We still cannot confirm whether the result is good or not. For instance, this label contains 3 features, and we need to build 3 clusters denoting "High", "Median" and "Low" respectively. From those features we create 3 clusters, but we still do not know which one represents "High", "Median" or "Low". Here, I give the 3 features weights based on business considerations and calculate a final "score" for each of the 3 clusters. If the scores are apparently different from each other, then I can be confident that the features and methods we use are practical.
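The scoring step can be sketched as a weighted sum over cluster centers. The center values, the weights and the feature names below are invented for illustration, and the sketch assumes the features were already scaled to a comparable range (for example 0 to 1) so the weights are meaningful.

```python
import numpy as np

# Hypothetical cluster centers on 3 scaled features (rows = clusters).
feature_names = ["feature_a", "feature_b", "feature_c"]
cluster_centers = np.array([
    [0.2, 0.1, 0.3],
    [0.9, 0.8, 0.7],
    [0.5, 0.5, 0.4],
])

# Business weights for the 3 features (assumed values, summing to 1).
weights = np.array([0.5, 0.3, 0.2])

# Final score of each cluster: weighted sum of its center coordinates.
scores = cluster_centers @ weights

# Rank clusters by score and attach the level names, highest first.
order = np.argsort(scores)[::-1]
level_names = ["High", "Median", "Low"]
cluster_to_level = {int(c): level for c, level in zip(order, level_names)}

# Smallest gap between neighbouring scores: a rough check of how distinct the levels are.
min_gap = float(np.min(np.diff(np.sort(scores))))

print(scores, cluster_to_level, min_gap)
```

If `min_gap` is close to 0, two clusters score almost the same and the "High / Median / Low" naming is not well supported by the chosen features and weights.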
