This is a course on Coursera:
Machine Learning with TensorFlow on Google Cloud Platform
https://www.coursera.org/learn/google-machine-learning/home/welcome
Below are some excerpts from the week 1 content, along with my own thoughts.
1. To be successful at ML, you need to think not just about creating models, but also about serving out ML predictions.
To make machine learning actually work for your business, think hard about how to implement the prediction step, not just about building models.
2. We should make sure that we can process batch data and stream data the same way.
This is the same issue as #1. Many companies fail at applying ML to their business: for example, they build a model but don't know how to keep feeding production data into it for training. Batch data means the familiar log files, images, and so on; stream data means collected metrics and events. Have you thought about how to convert these different kinds of real-world data into the format your model expects and then feed them into the model?
From the data perspective: is the conversion reliable?
From the engineering perspective: which message engine will you choose?
From the code perspective: how will you implement it?
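A minimal sketch of the idea, with a made-up record format (all names here are hypothetical, not from the course): one shared transform turns raw records into model features, and both the batch path and the stream path call it, so both kinds of data reach the model in exactly the same shape.

```python
def to_features(raw_record: str) -> dict:
    """Turn one raw log line 'user,action,latency_ms' into model features."""
    user, action, latency_ms = raw_record.strip().split(",")
    return {"user": user, "action": action, "latency_ms": float(latency_ms)}

def batch_pipeline(log_lines):
    """Batch path: a finite list of historical log lines."""
    return [to_features(line) for line in log_lines]

def stream_pipeline(event_source):
    """Stream path: an unbounded iterator of live events, same transform."""
    for event in event_source:
        yield to_features(event)

batch = batch_pipeline(["alice,click,120.5", "bob,scroll,30.0"])
live = stream_pipeline(iter(["carol,click,88.0"]))
```

The point is that `to_features` is the only place feature logic lives; neither path reimplements it.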
3. Need to be good at data engineering
Machine Learning
Data Pipeline
Data Analytics
Data Collection
Scalability Reliability Engineering
To practice machine learning well, you need a solid foundation in data engineering. The five items above form a pyramid, listed from top to bottom. Notice that the bottom layer is also an "SRE", except the S here stands for Scalability rather than Site. My reading of that bottom layer: can you guarantee that your scalable services, as they evolve, will keep fitting this ML practice (e.g., supplying training data, and later consuming the predictions the model provides)?
Next is data collection. The hard part is building the collection system itself: it must stand up to the load, scale along with the service, and still be easy to implement.
Once the data is collected comes analysis and processing: filter out the dirty data, and refine the data that is too raw.
After that, feeding the data into the model is itself a problem (the pipeline). There are plenty of TensorFlow examples online; take the hugely popular clothes-and-shoes image classification example: the model expects 28x28-pixel images as input. The question is: how do you turn the heterogeneous data you pull from your production environment into such regular data and pipeline it into the model for training?
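As a toy illustration of that normalization step (a sketch only: the 28x28 target comes from the example above, but the average-pooling approach and the divisible-side assumption are mine; a real pipeline would use something like `tf.image.resize`):

```python
import numpy as np

def downsample_to_28x28(img: np.ndarray) -> np.ndarray:
    """Average-pool a square grayscale image whose side is a multiple of 28."""
    h, w = img.shape
    assert h == w and h % 28 == 0, "sketch assumes a square, 28-divisible image"
    f = h // 28
    # Group pixels into a 28x28 grid of f-by-f blocks and average each block.
    return img.reshape(28, f, 28, f).mean(axis=(1, 3))

small = downsample_to_28x28(np.ones((280, 280)))  # a 280x280 "photo" -> 28x28
```

Production data would also need decoding, cropping, and normalization before this step; the sketch only shows the shape problem.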
4. What's the difference between ML and AI?
AI (artificial intelligence) is a discipline: making machines act like humans.
ML (machine learning) is a toolset, such as neural networks.
AI contains ML.
What is the relationship between machine learning and artificial intelligence? In short: AI is the discipline, the one that makes machines act the way humans do (for example, humans bring emotion as well as reason to many judgments).
Machine learning is a toolset, e.g., neural networks.
Artificial intelligence includes machine learning, like the relationship between "fruit" and "apple".
5瓦堵,the old neurton network just have one hidden layer for :
computer power
data
computational tricks
神經(jīng)網(wǎng)絡(luò)其實(shí)在三十多年前就提出了基协,并且當(dāng)時也實(shí)現(xiàn)了,但那時候的應(yīng)用很有限菇用,基本上也只有一層隱藏層(不像現(xiàn)在的模型有很多層澜驮,可能卷積或者壓縮層就有好幾層)
造成這個原因主要有:計算機(jī)算力太弱、數(shù)據(jù)量太少惋鸥、tricks(這里可以理解為一些輔助技術(shù)杂穷,比如現(xiàn)在做ML,很多卷積模型揩慕、過濾手段亭畜、壓縮手段都非常重組)
6. Every product at Google has a dozen ML models.
This is an example:
Predict product demand
Predict inventory
Predict restocking time
By the end of 2016 Google had roughly 4,000 ML models internally; by now the number is probably over ten thousand. Note that a single product can involve multiple machine-learning models.
Google Photos
Google Translate: when you take a photo of a sign that you don't recognize
Model1: Identify the Sign
Model2: OCR the Characters
Model3: Identify Language
Model4: Translate Language
Model5: Superimpose Text
Model6: Select Correct Font
For example, when you use Google Translate to photograph a sign and it is recognized automatically, at least six models are involved: 1) identify the sign (a parking sign, a speed-limit sign, etc.); 2) extract the characters (OCR); 3) identify the language of the characters; 4) translate; 5) superimpose the translated text onto the original sign; 6) choose a suitable font.
Smart Reply in Inbox: a complicated ML application
a sequence-to-sequence model
the output of the previous model becomes the input of the next model
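The "output of one model feeds the next" structure can be sketched with toy stand-in functions for the first three Translate stages above (none of these are real models; they only show the chaining):

```python
def detect_sign(photo: str) -> str:        # stage 1: find the sign region
    return f"sign_region({photo})"

def ocr_characters(region: str) -> str:    # stage 2: extract the characters
    return f"chars({region})"

def identify_language(chars: str) -> str:  # stage 3: detect the language
    return f"lang({chars})"

def run_pipeline(photo: str, stages) -> str:
    """Feed each stage's output into the next stage."""
    result = photo
    for stage in stages:
        result = stage(result)
    return result

out = run_pipeline("photo.jpg", [detect_sign, ocr_characters, identify_language])
```

In a real system each stage is a separately trained model, and the contract between stages (the shape of each output) is part of the design.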
7. What kinds of problems can ML solve?
Eric Schmidt said ML is about replacing programming, but most of us think of it as predicting from data.
What problems does machine learning really solve?
Note!
We can train a model from existing data, and then it's natural to think of using the trained model to make predictions. But the understanding needs to go a step further: the real purpose of machine learning is to replace our existing business model, in which we discover a requirement, write code by hand, and iterate.
Machine learning scales better than hand-coded rules
Machine learning scales better than hand-written business-logic rules.
For example, when you search for "park" in a search engine:
The example given: years ago, when you searched Google for a park, there was actually a hand-written set of logic behind it that looked at the user's location (or the location in their profile) and returned results according to those hand-coded rules.
Hand-coded rules are really hard to maintain; ML scales better because it's automated.
Hand-written rules like these are just too hard to maintain.
Google RankBrain (a deep neural network for search ranking that improved performance significantly)
RankBrain is the machine-learning system behind Google Search; it learns from users' searches to judge, or guess, what the user actually wants to find.
So we reach the conclusion: what kinds of problems can ML solve?
The answer: anything for which you are writing rules today.
So, to conclude, what problems does machine learning actually solve?
Wherever your business needs someone to write rules by hand, machine learning can take over.
(This reminds me of building a CMDB for network devices: adding devices of different types from different vendors could mean writing different inspection rules and different metric-collection rules.)
8. It's all about data.
When you search for "coffee near me":
An example means labelled data; the label for the example above is "Does the user like the result or does he not?"
9. Framing an ML problem
If we want to practice machine learning, we can frame the problem at three levels:
Cast it as a learning problem (what data is for training, what is for predicting?)
The ML level: which data to train on, which information to predict, and how to model it as train_data and labels. (Compare with the TensorFlow sample code: the digits 0-9 can naturally be labelled 0-9, but in a real problem, how do you define the labels for your data?)
Cast it as a software problem (what API will the service expose? who will use the service? how is it done today?)
The software level: what API will we ultimately provide? Who will use the service (and what do they care about)? How was the business handled before we practiced ML (what are the pain points)?
Cast it in the framework of a data problem (key actions to collect, analyze, ...)
Some scenarios
10. Infuse your apps with ML
Some successful practical experience
AUCNET as an example
AUCNET is a Japanese site: you photograph a car, it identifies the model for you, and it integrates other services (such as all the configurations of that model, purchase information, and so on).
11. What is a pre-trained model?
GCP provide:
Vision API
Speech API
Jobs API
Translation API
Natural Language API
Video Intelligence API
This is essentially an ad for GCP (Google Cloud Platform): if you're building, say, a recognize-and-translate service, there's no need to implement every model yourself; the platform already provides a machine-learning API for visual recognition.
12. The ML marketplace is moving towards increasing levels of ML abstraction.
The market for ML is moving toward higher levels of abstraction. (How to understand this? My own analogy is the mathematics you encounter from primary school through university and on to a master's or PhD.)
The core of mathematics is explaining reality through models, and clearly an equation like y = kx + b can cover far fewer real-world problems than Fourier analysis can abstract.
13. Build a data strategy around ML
14. Simple ML and more data > fancy ML and small data
So spend your energy collecting more data: not only quantity but also variety.
What matters most in machine learning is not how beautiful your model is, or how advanced and precise (the model is an iterative process anyway); what matters more is a large amount of data, and beyond sufficient "quantity", as much variety as possible.
15. How to successfully apply ML?
Collecting data is often the longest and hardest part of an ML project, and the part most likely to fail.
In applying ML, the most time-consuming step, and the one most likely to cause failure, is collecting the data.
Collecting data includes rating; rating means finding labels for the data.
Collection here also covers rating, i.e., assigning labels to the data. (My understanding: in real business settings, most production data is hard to describe with a simple yes or no; we might describe it as "okay" or "great", but a machine cannot work with such fuzzy descriptions. Training data in particular needs labels that discriminate clearly.)
ML is a journey towards automation and scale.
Remember: the purpose of practicing machine learning is automation and scaling the business.
When we talk about ML, most engineers keep thinking about training, but the true utility of ML comes during prediction.
Most engineers fixate on how to train, but machine learning's real usefulness lies in the prediction phase. Keep hold of that point and don't obsess over model training.
Your models have to work on streaming data.
The model you build must be able to work on stream data. (When we learn TensorFlow the datasets may be prepared in advance, but in real production the model needs to be continually corrected by stream data, and to make predictions for it as well.)
Failures sometimes come from something called training-serving skew.
To reduce this skew, you'd better take the same code that was used to process historical data during training and reuse it during prediction.
We need to make sure training and prediction use the same environment and the same code.
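A minimal sketch of that advice, with all names hypothetical: the feature code lives in one function, and both the training path and the serving path call it, so the model never sees a feature computed two different ways.

```python
def preprocess(raw: dict) -> list:
    """The single source of truth for feature computation."""
    return [raw["clicks"] / max(raw["impressions"], 1), float(raw["is_mobile"])]

def build_training_set(historical_rows):
    """Training path: batch-transform historical data with `preprocess`."""
    return [preprocess(row) for row in historical_rows]

def serve_prediction(model, live_request: dict):
    """Serving path: reuse the exact same `preprocess` on the live request."""
    return model(preprocess(live_request))

features = build_training_set([{"clicks": 3, "impressions": 10, "is_mobile": True}])
pred = serve_prediction(lambda f: sum(f), {"clicks": 1, "impressions": 4, "is_mobile": False})
```

If training had its own copy of this logic (say, a slightly different clamp on `impressions`), the served model would silently see skewed features.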
Your data pipeline has to process both batch and stream.
Your data pipeline needs to handle both batch and stream data; this is the same point as "work on streaming data" above. Batch data is easy to understand and easy to handle, but stream data is not (this is also easy to see, especially for a data-hungry practice like ML; if you have ever built and operated a distributed message engine, you know the trouble stream data can bring).
During prediction, the key performance aspect is speed of response.
In the prediction phase, the most important performance metric is response speed.
The magic of ML comes from quantity, not complexity.
ML's magic comes from large amounts of data, not from how complex the thing is (it's not about writing fancy, high-sounding code).
Unstructured data accounts for 90% of enterprise data (like email, video footage, texts, reports, catalogs, events).
The training data we use while learning ML and TensorFlow is all tidy, but in real business more than 90% of the data is unstructured: email, video, text, reports, and so on.
Pre-trained models make processing unstructured data easier.
So learn to use the ready-made models offered by other companies and institutions for data processing (partly an ad for GCP's ML APIs, and partly a warning to engineers who want to practice ML: don't insist on implementing every stage of ML yourself).
How can a business benefit from ML?
1. Infuse your apps with ML: simplify user input, adapt to the user.
2. Fine-tune your business: streamline your business processes.
3. Anticipate users' needs: creatively fulfill intent.
How Google Does ML
Google suggests that we should focus more on collecting data and building infrastructure than on optimizing the ML algorithm.
Google all but says it is the most successful company at applying machine learning, without even adding "one of". If you want to practice ML and help your own business, put enough energy into collecting data and building infrastructure (infrastructure here meaning, for example, data pipelines, and infrastructure for deploying services at scale; remember that one of ML's stated goals is scale).
Avoid these top 10 ML pitfalls
The 10 ML pitfalls:
1. ML requires just as much software infrastructure
A successful ML practice needs a lot of things around the algorithm, like a whole software stack to serve it.
2. No data collected yet
There is no point talking about ML without collecting great data, or having access to great data.
3. Assuming the data is ready for use
4. Forgetting to keep humans in the loop
5. Launching a product focused on the wrong thing
6. ML optimizing for the wrong thing
7. Not checking whether your ML is improving things in the real world
8. Confusing using a pre-trained ML algorithm with building your own
9. ML algorithms are trained more than once
10. Trying to design your own perception or NLP algorithm
A slide-design trick worth noting here: besides listing the 10 pitfalls, the slide marks each item with a colored square indicating the phase in which that pitfall can appear.
The good thing to hear: most of the value comes along the way.
As you march towards ML you may not get there, and you will still greatly improve everything you're working on; and once you're ready, ML improves almost everything it touches.
After all the talk about why ML efforts fail, here is some encouragement: the very process of practicing ML will bring benefits to your business.
If the process of building and using ML is hard for your company, it's likely hard for the other members of your industry too.
You need to understand: if the practice feels hard to you, it is just as hard for your peers.
But once you produce even a small result, the feedback is excellent: customers easily perceive the better service and give you more positive, accurate data in return, and that feedback in turn drives you to fine-tune the business.
ML and business processes
Look at 5 phases:
1. Individual contributor
2. Delegation
3. Digitization
4. Big data and analytics
5. Machine learning
Phases 1-3 are the traditional business model; phases 4 and 5 are the big data and machine learning that have been hot in recent years.
Finally, great ML systems will need humans in the loop.
You should think of ML as a way to expand, or scale, the impact of your people, not as a way of completely removing them.
The more people you have in your organization, the more voices will say that automation is impossible.
This is the problem of staying too long in the delegation phase above: the more people in your organization, the harder automation becomes to achieve.
Learn how to identify the origins of bias in ML, make models inclusive, and evaluate ML models with biases in mind.
ML and human bias
Picture a shoe: different people will picture different shoes. That is human bias.
But just because something is based on data doesn't automatically make it neutral.
Because models are trained by humans, and different people lean differently even about the same thing, human bias is a problem that deserves attention.
A common way that we evaluate performance in ML is by using a confusion matrix.
One way to evaluate an ML model's performance is the confusion matrix.
Statistical measurement and acceptable tradeoffs
We should focus on the false positive rate (the label says something doesn't exist, but the model predicts that it does).
In the figure above, it is the false positive rate that deserves the most attention.
False positive rate (α) = type I error = 1 − specificity = FP / (FP + TN) = 180 / (180 + 1820) = 9%
False negative rate (β) = type II error = 1 − sensitivity = FN / (TP + FN) = 10 / (20 + 10) = 33%
True positive rate (TPR) = recall = sensitivity = probability of detection = TP / (TP + FN)
Accuracy (ACC) = (TP + TN) / total population
Precision = TP / (TP + FP)
https://en.wikipedia.org/wiki/Sensitivity_and_specificity
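The rates above can be sanity-checked in a few lines, using the example counts from the formulas (TP=20, FN=10, FP=180, TN=1820):

```python
def confusion_metrics(tp, fn, fp, tn):
    """Standard rates derived from the four confusion-matrix counts."""
    return {
        "false_positive_rate": fp / (fp + tn),   # type I error, 1 - specificity
        "false_negative_rate": fn / (fn + tp),   # type II error, 1 - sensitivity
        "true_positive_rate": tp / (tp + fn),    # recall / sensitivity
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "precision": tp / (tp + fp),
    }

m = confusion_metrics(tp=20, fn=10, fp=180, tn=1820)
```

With these counts, `false_positive_rate` comes out to the 9% and `false_negative_rate` to the 33% quoted above.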
后面所講的內(nèi)容就是利用google的datalab在線進(jìn)行學(xué)習(xí)與測試
這個產(chǎn)品就是類似google docs的在線編輯器