Original article: Big Data Uncovered: What Does A Data Scientist Really Do?
Big Data Uncovered: What Does A Data Scientist Really Do?
The world of Big Data and data science can often seem complex or even arcane from the outside looking in. In business, a lot of people by now probably understand the basics of what Big Data analysis involves – collecting the ever-growing amount of data we are generating, and using it to come up with meaningful insights. But what does this actually involve on a day-to-day level for the professionals who get their hands dirty with the nuts and bolts?
To have a look under the hood of a job that some describe as the 'Sexiest Job Of The 21st Century', I spoke to leading data scientist Dr Steve Hanks to get an overview of what the work of a data scientist actually involves, and what sort of person is likely to be successful in the field.
Dr Hanks gained a PhD in computer science at Yale University, has spent 15 years as a professor of computer science and has worked at companies including Amazon, Yahoo and Microsoft. Today he is chief data scientist at Whitepages.com where he is responsible for overseeing the Contact Graph – a database containing contact information for over 200 million people. The database is searched around two billion times every month and is the company's primary business asset.
This database has driven Whitepages' business since it was launched in 1997, and more recently the company has diversified into app development. Caller ID, its replacement mobile user interface, queries the main Whitepages database to give more complete information on who is calling, and to help cut nuisance and spam calls. The company also generates another revenue stream by providing its data to other companies to use in fraud prevention.
Key capabilities of a data scientist
The term "data scientist" can cover many roles across many industries and organizations, from academia to finance or government. Hanks leads a team of 12 to 15 members responsible for all of the analytics at Whitepages, and their skill sets and duties vary. However, he tells me, there are three key capabilities which every data scientist has to understand.
You have to understand that data has meaning
Hanks makes the point that we often overlook the fact that data means something and that it is important to understand that meaning. We have to look beyond the numbers and understand what they stand for if we are to gain any valid insights from them. Hanks points out: "It doesn't have anything to do with algorithms or engineering or anything like that. Understanding data is really an art, and it's really important."
You have to understand the problem that you need to solve, and how the data relates to that
Here is where you open your tool-kit to find the right analytics approaches and algorithms to work with your data. Hanks talks about machine learning, which is very popular right now, but makes the point that there are hundreds of techniques for using data to solve problems, among them operations research, decision theory, game theory and control theory, all of which have been around for a very long time. Hanks says: "Once you understand the data and you understand the problem you're trying to solve, that's when you can match the algorithm and get a meaningful solution."
You have to understand the engineering
The third capability is about understanding and delivering the infrastructure required to perform any analysis. In Hanks's words: "It doesn't do any good to solve the problem if you don't have the infrastructure in place to deliver the solution effectively, accurately and at the right time and place."
Being a good data scientist is really about paying attention to all three of those capabilities. You have to pay attention to the data and what it means, understand the problems and know about matching algorithms to those problems, and you have to understand the engineering to come up with solutions.
At the same time, this doesn't mean there's no room for specialization. Hanks makes the point that it is virtually impossible to be an expert in all three of those areas, not to mention all the sub-divisions of each of them. It is okay to specialize in one of these areas as long as you have an appreciation of all of them. Hanks tells me: "Even if you're primarily an algorithm person or primarily an engineer, if you don't understand the problem you're solving and what your data is, you're going to make bad decisions."
Key qualities of a data scientist
In terms of personal qualities, a curiosity about data is essential, as are communication skills, says Hanks. "People on my team spend a lot of time talking to customers to figure out what problems they need to solve, or talking to data vendors to find out what they can provide. So you become a middle man and communication is very important."
Lots of different types of people go into data science, and Hanks explained to me that he was probably not a very typical example. However, in my experience there is no such thing as a typical data scientist. The key capabilities Hanks mentioned cover a broad range of skills, and people of different personality types and mindsets are attracted to the profession.
"I just really loved the interplay," Hanks says. "From the beginning I was just totally fascinated. My first exposure to data science was probably in operations research, and I just loved the idea that you could take big data sets and use them to learn things, and improve things, and I found out that you really could use them to make a difference. I've found that fascinating for over 30 years now."
Even after all that time in the business, though, problems still come up which have him scratching his head, and these serve as a great example of the sort of challenges data scientists find themselves struggling with on a day-to-day basis.
"Just this morning I was working on something and one of the algorithms just wasn't doing what it was supposed to do – basically it was showing us a link between a particular person and a particular phone number which we just knew was incorrect. These problems can be very intermittent and very hard to diagnose.
"We have very specific algorithms that are supposed to do very specific things, and when they don't we just have to take them apart and find out why not. The problem is these days they are very complex and have a lot of working pieces! I can be completely mystified, like I am right now … but we will get there – we always do! That's really the sort of challenge we face day to day – systems which just don't behave the way they are supposed to according to our schematics."
In the time that Hanks has been working with data he has seen huge changes in the field, from working on structured databases on mainframes, to distributed Hadoop networks, to the cloud-based, real-time data processing world of today. So where does he see the future taking analytics and Big Data?
The future of data science
Hanks sees a future of increased data streaming and real-time data processing, as opposed to huge batch processing of data. He believes that in this new world Hadoop MapReduce is less appropriate, and in his work he is starting to use other technologies such as Scala and Akka.
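To make the batch versus streaming distinction a little more concrete, here is a minimal sketch in plain Python. It is not taken from Hanks or from Whitepages' systems: the function names and the sample numbers are made up for illustration. The batch version waits for the whole data set before answering, while the streaming version keeps an up-to-date answer as each record arrives.

```python
from typing import Iterable, Iterator, Sequence


def batch_mean(values: Sequence[float]) -> float:
    """Batch style: wait for the full data set, then compute once."""
    return sum(values) / len(values)


def streaming_mean(values: Iterable[float]) -> Iterator[float]:
    """Streaming style: emit an updated mean after every new record."""
    count = 0
    mean = 0.0
    for value in values:
        count += 1
        # Incremental update, so earlier records never need to be stored.
        mean += (value - mean) / count
        yield mean


if __name__ == "__main__":
    # Hypothetical figures, e.g. lookups observed per minute.
    lookups_per_minute = [120.0, 80.0, 100.0, 140.0]
    print(batch_mean(lookups_per_minute))
    print(list(streaming_mean(lookups_per_minute)))
```

Both approaches end up at the same final value; the difference is that the streaming version has a usable answer at every step, which is roughly what real-time processing is after.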
One of the biggest challenges Hanks sees is keeping up with the fast development of new technologies and new algorithms. He believes that in order to be an effective data scientist you have to be holistic. It is relatively easy, he says, to become a specialist in MapReduce or a particular machine learning algorithm, but the challenge is keeping up with the general speed of development in data science. "It's a field that is just stunningly big and complex, and has incredible breadth and depth," Hanks tells me. "You have to understand all of the pieces but the field is getting so vast – that's going to be the challenge facing data scientists going into the future."
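Translator's note: This is my first translation exercise of the year, and I hope readers will offer plenty of criticism and corrections. I'm gradually getting my touch back...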