How to Think Like a Data Scientist
Monica Rogati, a data scientist at LinkedIn, gave us the following 10 common pitfalls that entrepreneurs should avoid as they dig into the data their startups capture.
- Assuming the data is clean. Cleaning the data you capture is often most of the work, and the simple act of cleaning it up can often reveal important patterns. “Is an instrumentation bug causing 30% of your numbers to be null?” asks Monica. “Do you really have that many users in the 90210 zip code?” Check your data at the door to be sure it’s valid and useful.
- Not normalizing. Let’s say you’re making a list of popular wedding destinations. You could count the number of people flying in for a wedding, but unless you consider the total number of air travellers coming to that city as well, you’ll just get a list of cities with busy airports.
- Excluding outliers. Those 21 people using your product more than a thousand times a day are either your biggest fans, or bots crawling your site for content. Whichever they are, ignoring them would be a mistake.
- Including outliers. While those 21 people using your product a thousand times a day are interesting from a qualitative perspective, because they can show you things you didn’t expect, they’re not good for building a general model. “You probably want to exclude them when building data products,” cautions Monica. “Otherwise, the ‘you may also like’ feature on your site will have the same items everywhere—the ones your hardcore fans wanted.”
- Ignoring seasonality. “Whoa, is ‘intern’ the fastest-growing job of the year? Oh, wait, it’s June.” Failure to consider time of day, day of week, and monthly changes when looking at patterns leads to bad decision making.
- Ignoring size when reporting growth. Context is critical. Or, as Monica puts it, “When you’ve just started, technically, your dad signing up does count as doubling your user base.”
- Data vomit. A dashboard isn’t much use if you don’t know where to look.
- Metrics that cry wolf. You want to be responsive, so you set up alerts to let you know when something is awry in order to fix it quickly. But if your thresholds are too sensitive, they get “whiny,” and you’ll start to ignore them.
- The “Not Collected Here” syndrome. “Mashing up your data with data from other sources can lead to valuable insights,” says Monica. “Do your best customers come from zip codes with a high concentration of sushi restaurants?” This might give you a few great ideas about what experiments to run next, or even influence your growth strategy.
- Focusing on noise. “We’re hardwired (and then programmed) to see patterns where there are none,” Monica warns. “It helps to set aside the vanity metrics, step back, and look at the bigger picture.”
Excerpted from Alistair Croll and Benjamin Yoskovitz, Lean Analytics.
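A few of these pitfalls are easy to see with numbers in hand. The sketch below is not from the book: it is a minimal Python illustration of normalizing, handling outliers, and reporting growth with its base. All city names, user names, and figures are made up, and the function names and the "20 times the median" cutoff are assumptions chosen only for the example.

```python
from statistics import median


def wedding_share(wedding_arrivals, total_arrivals):
    """Normalize: fraction of each city's arrivals who came for a wedding."""
    return {
        city: wedding_arrivals[city] / total_arrivals[city]
        for city in wedding_arrivals
        if total_arrivals.get(city, 0) > 0  # skip cities with no traffic data
    }


def split_outliers(daily_uses, factor=20):
    """Separate users above `factor` times the median so they can be
    reviewed on their own instead of being silently dropped or included."""
    cutoff = factor * median(daily_uses.values())
    typical = {user: n for user, n in daily_uses.items() if n <= cutoff}
    heavy = {user: n for user, n in daily_uses.items() if n > cutoff}
    return typical, heavy


def growth_report(previous, current):
    """Report relative growth next to the absolute numbers behind it."""
    return {
        "previous": previous,
        "current": current,
        "added": current - previous,
        "growth_pct": 100.0 * (current - previous) / previous,
    }


# Hypothetical arrival counts (invented for illustration).
wedding = {"Hub City": 9000, "Mid City": 5000, "Resort Town": 4000}
total = {"Hub City": 1_000_000, "Mid City": 300_000, "Resort Town": 20_000}

shares = wedding_share(wedding, total)
by_raw = sorted(wedding, key=wedding.get, reverse=True)
by_share = sorted(shares, key=shares.get, reverse=True)
print(by_raw)    # ranking by raw counts favors the busiest airport
print(by_share)  # ranking by share favors the actual wedding destination

# Hypothetical daily usage counts per user.
uses = {"ann": 3, "bo": 5, "cy": 4, "fan_or_bot": 1200}
typical, heavy = split_outliers(uses)
print(typical, heavy)

# The same "growth" looks very different once the base is shown.
tiny = growth_report(1, 2)
large = growth_report(10_000, 11_000)
print(tiny)
print(large)
```

Ranking by raw counts puts the hypothetical hub airport first, while ranking by share puts the resort first. The outlier helper returns both groups rather than discarding one, so the heavy users can still be studied while a general model is built on the typical ones.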