Data Scientist Interview: Chris Moody

Chris Moody

Data Scientist at Stitch Fix

Astrophysics to Data Science

Chris Moody started off his journey towards data science by peering off into distant galaxies, studying computational astrophysics at UC Santa Cruz as a graduate student.

As the data revolution hit the fields of science, however, Chris found himself having to learn how to use more sophisticated tools that could process more data. He dove into programming and contributing towards open-source astrophysics projects.

All this culminated in a data science fellowship at Insight Data Science. After completing his Fellowship, Chris joined Square’s Data Science team. After leaving Square, Chris is now a data scientist at Stitch Fix, a fashion startup.


Thank you very much for being with us, Chris. Can you tell us a little bit about your background?

I went to Caltech as an undergrad to study Physics. There, I had projects that were largely computational.

For example, a project I was involved in was looking at dark matter simulations. Basically, we don’t know that much about dark matter, but we can guess at things that it could possibly do. One of those things is that it could decay. If it decays, the dark matter particle gets a kick, and it goes off in a random direction at a random speed. Galaxies are sitting at the bottom of a gravity well; they’re like bread crumbs in a big bowl of dark matter. If the dark matter were spontaneously decaying and getting lots of extra energy, it could popcorn out, and totally change the profiles of galaxies in an essential way. This was a strongly computational project that taught me many skills.

After Caltech, I came to Santa Cruz for graduate studies, still working in computational astrophysics. While I was there, I was doing all sorts of things pertaining to galaxies. We would look through the Hubble Space Telescope at the youngest galaxies in the universe and notice that they were not at all like the galaxies today. Galaxies today are beautiful spiral structures. But when you look back at the youngest galaxies, they are lumpy and clumpy… they look like soup.

So, one of the questions was: Does that mesh well with our ideas of how our universe formed? We started to look at the simulations and realized that what we observed through the telescope is what we were seeing in our simulations. We were super surprised at these theoretical predictions coming true!

The next part followed the standard trajectory of a lot of businesses. We got one or two really positive examples of galaxies matching our predictions, and were very excited about the progress. But it was only one or two examples; we wanted to know if this was statistically significant, and so we started to scale up our data. We exploded from 100 gigabytes to hundreds of terabytes of data. This all started at the NASA Ames supercomputer.

It turns out that it’s really hard to answer simple questions when those questions don’t fit onto one computer. So we had to scale up a lot of our algorithms, and build our own infrastructure and framework. It was at that point that we started to get really interesting results. We started to see that this is generally true, and this attracted a lot of people to our project, scaling up our people power. So we’d get other new graduate student astronomers and explain, ‘This is how we work; this is how you can be efficient.’

I think the romantic, public idea of a scientist is that you jump into a cave and then five months later, you have a “Eureka” moment and you come out. Then it’s glorious. But that’s not really how it works. The reality is: you have lots of bugs, you make lots of errors, and you have to work as a team, which means you have to be able to work efficiently. You have to know how a pull request works. You have to know how commits work. You have to know how to document. You have to file bugs and report to issue trackers. You have to do all of these things.

At the end of all that, I realized that I most liked working with data. I liked working with algorithms. Actually, I absolutely loved working with algorithms.

I spent more time reading about how the algorithms worked and how they found all this truth, despite all the noise and red herrings in the data. I loved doing that and working with people on a project together. It was great. I thought galaxies were cool, don’t get me wrong, but I liked algorithms more.

It sounds like you spotted a project, saw that it was interesting and used your experience of working on it to explore your interests. How did your background in science inform your work as a data scientist?

Science is getting harder to do. It’s harder to do it individually and it has to happen as part of a team; a collaborative effort, so we can measure different things. Looking at papers from 50 years ago: having a paper with 50 authors on it was ridiculous, that just never happened. Half the papers out there were published with only one or two authors on them.

Now, that’s ridiculously absurd. I can’t remember the last time I read a paper with only one author on it.

It’s just because the instruments you have to use are larger. We end up having to use supercomputer resources or we have to use the Hubble Space Telescope to get somewhere. This means that the data and ideas are starting to grow much larger than one person can manage. In turn, it means that you have to learn how to work with other people. So that’s a paradigm shift of science, and also something that I think industry has been familiar with for a much longer period of time.

At the same time, a lot of my exposure to things like software engineering best practices, or even computer science, was completely self-taught. I didn’t take any formal classes in these fields.

That’s really interesting that it worked out so well, and also that that didn’t hinder you.

I think that’s actually pretty normal. Look at some start-ups. They’re really interested in finding someone who can actually do the work; someone who is trying to find and build a whole community and foster that growth. Take that person from the 90th percentile and just teach them the remaining 10% of the small skills needed. These startups are basically instilling habits; thinking about what you’re going to do and how it’s going to reflect on everyone else in the network, instead of being an isolated person.

Sometimes, that has to happen as a feedback reflex. You have to think of how you’re going to fit in with everything else. You have to think about how your code is going to be used by others. I was lucky in that I had a community leader in my project who was really interested in teaching everyone else how to work together, and I learned a lot from him.

Of your friends and peers from Caltech, many of whom have also gone on to do heavy computational physics research, have you found that a substantial portion of them are heading towards industry?

Yes, especially in astrophysics. I can’t tell you how many plots I have seen in the last year with the number of faculty jobs remaining constant, or maybe even slightly decreasing with time, compared with the sky-high number of post doctorates. That means that the likelihood of a post doctorate job opening is going down at a ridiculous rate. Even when I was in graduate school, the expected number of postdoctoral candidates went from two to three. If it kept going at that rate, by the time I’d finished my first post doctorate, the expected rate would be four postdocs to every one position.

Clearly, there’s a huge supply of post doctorates and not that many positions within academia.

How much did those academic job statistics influence your decision on what you wanted to do after graduate school? Did you feel you could get the same intellectual stimulation from problems in industry as you received in academia?

Yes, it was a hard decision, but you look at it and think, ‘How many times do I really want to roll the dice? How much do I really like this?’ That fear of not finding a job really destroys a lot of the romance of science. I feel like a lot of people start doing science because they have this romantic notion of becoming the best scientist, or contributing in a noble way. But the truth is that science is a shitty ride.

You can do a lot of the same things that science will let you do, but you don’t have to do these things in the world of academia. You can work on science in industry. When I made that realization, and understood I could do a lot of the science, and be involved in a lot of the cool stuff I’d tried to do in the first place, it made me realize that I could switch to a new job outside of academia. At the same time, I didn’t feel that I was giving up on what drove me initially. There are a lot of startups that are changing the world, so instead of trying to define clumps and galaxies, I could try to actually work with somebody, and try to change the world. I thought this was really cool and super exciting.

So then you joined Insight – a six week long Fellowship for PhDs looking to enter the field of data science. How much of what the Fellowship taught you would you say was new to you?

All of it. There’s a paradigm shift from science to industry. Everything in science is about a fully detailed presentation of an idea; exhaustively explicating all of the caveats. All of the communication is bordered on fully defined facts, or at least as much as possible. You look at the borders of your project, the borders of the results, and you know the downsides and you know the upsides, and that’s because you’re terrified that someone will find a deficit in your project, and then nail you for it.

But then the opposite is true in business. The biggest problem is that people have very limited bandwidth. It takes a lot of effort, and there are a lot of people demanding it. So the crux of everything in business is actually being able to move all of your results in as terse and precise a fashion as possible.

You don’t need to delineate all of the possibilities, you just need to say what is the major point, and you can go on from there. So, a lot of what Insight taught me was that you need to condense all of your results down as quickly as possible. You get someone’s attention and you go; that’s the hardest part. As scientists, we were taught to give an hour-long lecture on our project. We didn’t have to consider whether our audience was being entertained or not. If they’re not interested, you don’t care. They’re not your audience if they weren’t interested in the first place.

It’s the opposite idea during the Insight Data Science Fellowship. You have to go out and you have to make every single connection for yourself. You have to boil it down and make it completely convincing that everything you’re saying is relevant to them, and you have to do it in 5 seconds. Everything is an elevator pitch. Every YC Company has to give demos in 180 seconds. So Insight is all about building a demo in those 6 weeks, and then pitching it in 180 seconds. You’re basically pitching yourself as a candidate to those companies. You’re saying, ‘Don’t look at me like a graduate student. I’m actually super goal-orientated, or systems-orientated. I can take all this data, apply these algorithms, and give you some amazing results.’ That’s what those three minutes are for, and that’s the whole paradigm shift. Now, the focus is not so much on the new idea or how much you’ve added to the body of knowledge. The focus is what can you tell me in 100 seconds. That’s all the CEO has time for.

In scientific lectures, you’re not trying to reach a super-broad audience. In the case of science, you’re trying to deliver an idea, and then you try to back it up in 15,000 words.

You need to do that in business as well. You need to be able to take your idea and defend it. The thing is that, here, you’re no longer trying to defend it to the CEO, you’re no longer trying to defend it to anyone else. You just need to defend it to yourself, and then you need to give them the ideas; there’s an implicit trust there.

No one else is going to check your work and no one else should check your work. You need to be an independent party and you need to break it down as to what is important.

You have to build up small kernels of truth, and that’s all you can deliver. A lot of the time, people find it distressing, but I thought it was great. I thought it was an awesome challenge to be able to compress my message down and figure out what all the tidbits are. It’s like a whole design philosophy. I liked the idea of throwing out everything except for what you need to function. I like it from a designer standpoint and also from an algorithms and data analysis standpoint. I think that embodying that philosophy was the single most successful part of the Insight Fellowship.

“Data science” has now become a very common phrase in many business sectors. Yet, it’s still nebulous and no one is really sure what it means. So, what does data science mean to you? How would you break it down?

It means a lot. It always means measuring data, being able to make sense of that data, creating models of that data, and most importantly, being able to communicate what that data means.

I think data science splits into two fields, and I believe a lot of hiring companies are starting to reflect this. Data science is starting to break off into descriptive analytics and predictive analytics.

Descriptive analytics is, ‘we saw this trend.’ Or, for example, ‘We saw this spike or dip… is that because our service crashed? We saw this huge spike…is it a multiplicity of things?’ It’s always asking questions of dynamics, and then asking what is going on. So the raw data comes back, and then you make something useful – actionable business intelligence – from that data. That’s descriptive analytics, taking data that has been produced and trying to make heads or tails of it, to drive some decisions out of it. So that might mean, ‘We saw some really exciting events in Bulgaria, but why is our site exploding in Bulgaria and nowhere else?’ You may find out that it’s not really from Bulgaria, or that it’s raining everywhere else, or a volcano just went off and everybody’s Tweeting about it, or something ridiculous like that.

The other side of data science is predictive analytics; being ahead of the game. This is where you’re shifting towards machine learning algorithms. You’re looking at things such as fraud, where you’re trying to predict whether a transaction is fraudulent or not. Or, you’re trying to figure out security applications: is this malevolent activity? But that’s what it is, fundamentally. It’s pattern finding within all the data, in real-time, which adds additional constraints on computational complexity.

Data science is rapidly becoming something concrete, especially as it becomes a more well-defined field. But it’s definitely splitting off into those two directions. One is taking the data, analyzing it and figuring out underlying trends. If there are multiple trends, maybe it’s multiple elements stacking up to produce the signal you’re looking at. Maybe it’s not really a signal at all, and it’s a bug somewhere, so you have to look at the data.

The other side is not just trying to make heads or tails of the data, but also making predictions. Which city are we going to open up in next? What are the relevant quantities? A lot of business is driven by intuition and gut feelings, and this scares a lot of people. CEOs are trying to pitch entire companies on feelings, essentially. They’re trying to drive home their points on a colloquial basis. The whole field of data science is trying to turn that feeling into something a little more rigorous; trying to deliver on something that’s not intuition, and finding something that you can ground yourself on. That gives your business a lot of stability, especially when there’s a lot of startups and they’re all thinking of great ideas, but only some of them are really as great as they believe, and most of them won’t pan out.

You engage a data scientist at the point when you’re looking to add an incremental value. That’s not going to make your business take off, it’s not guaranteed. But at least it will give you something that’s not solely based on a feeling.

Of the two different types of data science you articulated, do they also require different skills?

For the most part, they require a lot of the same core skills. Predictive data science requires a little more machine learning type skills, and descriptive probably requires a lot more statistical skills. But then, in predictive data analysis, you might be using a lot more random forests or neural networks – all these really cool algorithms.

Which side of data science, from your physics background, seems more intuitive to you?

I started learning programming in high school, because I wanted to play around with genetic algorithms. So that’s been a long-running interest. Even though I went off and did experimental physics and computational astrophysics, I’ve always had this background of really wanting to do machine learning. That appeals more to the predictive side than the descriptive side. Both of them have a lot of overlap. There’s not a wall between the two, but you can start to see the continuum of data science. So, I think I’m far more attracted to the predictive side. Neural networks I just think are really cool because you’re essentially training artificial intelligence. You’re taking these tiny artificial brains and making a decision with them. You’re actually turning a whole company based on that.

What do you feel are the defining qualities of a top-notch data scientist, compared with someone who is merely good?

I think it deals with communication. I think that’s the difference between the good scientists and the great. Both are going to know a lot about statistics, the techniques they can use, and how to design, implement, and execute an experiment. Those things are all important. The biggest thing, though, is that you need to be able to communicate those results. That’s a lot harder than it looks.

I think the easiest thing for a graduate student to do, coming into this field, is to gloss over it, but that’s the single most important thing. Most people complain that graduate students don’t have a great programming background. All of their other intuitions, well-designed experiments, caveated results, are sound. But I think that a lot of people believe that a programming background is not necessary.

So, maybe it is programming for a lot of people, but if you’re already pretty good, then you’re probably already a good programmer. The last step is just communication. People need to sense the passion inside of you. This defines the most successful people. It’s the realization that you are working with other people, and for a lot of scientists, I think that’s quite a shock. It really goes against this notion of romantic science.

Isaac Newton spent three years in a shack during the plague. He didn’t want to get the plague and he hated talking to everyone. Granted, he was possibly autistic in some ways, but I think a lot of people follow that archetype of going back and living by themselves, and then they emerge with all of their findings. But in reality, it needs to be a much more continuous process. It needs to be a much smoother process than just coming back and reeling off a list of accomplishments. So it’s always communication, but that’s the easiest part to skip over.

What do you see as the promise with data science, and also the interplay between mathematics and computer science, that really speaks to you? Where does your passion lie?

We’re living in a really exciting time because I think what were formerly highly theoretical principles are finally having an impact on the world. Before, I was looking at clumps and galaxies. To do that I needed to run clustering algorithms. I needed to be able to run distributed frameworks on thousands of nodes to answer basic questions.

Now, I can do almost the same stuff, and I can tweak a learning algorithm that teaches students in the best way they can learn. There’s a whole feedback system that says, “you should answer these questions, and then five minutes from now, we’ll come back and repeat it, and then we’ll come back a week later and repeat it again.”
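As a rough illustration of the feedback loop described here, the following is a minimal sketch of an expanding review schedule, not the algorithm any particular product uses. The five-minute and one-week gaps come from the answer above; the one-day step between them, the reset-on-wrong-answer rule, and the names `INTERVALS` and `schedule_next` are assumptions made for the example.

```python
from datetime import datetime, timedelta
from typing import Tuple

# Gaps between reviews. Five minutes and one week are taken from the interview;
# the one-day step in between is an assumption for illustration.
INTERVALS = [timedelta(minutes=5), timedelta(days=1), timedelta(weeks=1)]


def schedule_next(step: int, correct: bool, now: datetime) -> Tuple[int, datetime]:
    """Return (next step, time of next review) for one question.

    A wrong answer resets the question to the shortest gap (an assumed rule);
    a correct answer moves it one gap further out, capped at the longest gap.
    """
    if not correct:
        step = 0
    wait = INTERVALS[min(step, len(INTERVALS) - 1)]
    return step + 1, now + wait


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0)
    step = 0
    # Three correct answers in a row, then one miss.
    for correct in (True, True, True, False):
        step, now = schedule_next(step, correct, now)
        print(step, now.isoformat())
```

The only design choice worth noting is that the schedule is driven by the learner's answers rather than by a fixed calendar, which is what makes it a feedback system.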

The wonderful thing is that those algorithms, that whole pattern, is being replicated from galaxies to psychology and cognition. All of these high topics of knowledge are beginning to trickle down, and they’re actually making a real impact on day-to-day interactions. There is not a single company on NASDAQ that doesn’t use some aspect of this. Your Facebook Newsfeed is highly tweaked to give you everything that you think is relevant, and new content to test your preferences.

LinkedIn is using all kinds of graph networks. Square is using all these fraud detection techniques. HealthTap is fielding all of these questions, and training a computer to understand what these questions are. And there really are doctors who will be answering a lot of those medical questions.

The cool thing here is that they can take a doctor and clone him virtually. He can answer a question, and that might reduce patient time in a hospital somewhere. And when you take that power, and you multiply it by the number of patients in the whole world – it’s a huge number. These are real things. We’re not limited to theoretical worlds. You really can go out and have awesome effects immediately, and they’re tangible. We’re collecting more and more data, to the point that there are not that many aspects of life that aren’t becoming data driven. So it’s super exciting.

Imagine if you were able to go back to the beginning of your graduate school career, and you meet yourself coming in the corridors and you have a five minute window to speak to yourself. Would you tell yourself to do anything differently?

A lot of it would have centered on working more with people. I joined an open source project, and that was the single best decision in all of graduate school. I learned how to code in a collaborative way.

The second most important thing probably would have been communication. Every week, I would deliver a presentation on my results during the past week, and usually, it would boil down to giving a two or three minute feedback session at the end of that. So I was already doing a lot of communication and I wouldn’t have changed that.

My programming context was great; maybe I should have started that earlier and taken more formal programming classes. If you were to design the curriculum, I’d say you have to have a lot of programming. A lot of classes are like, ‘go and do this assignment.’ The real world is, ‘go do this assignment but you only have to do this module, and someone else will do the next module. You guys need to be working collaboratively.’

They should also be doing lots of statistics, and they should be able to do it as quickly as possible. People love to talk about this Pareto Principle, where 80% of the outcomes result from 20% of the effort. The hard part is trying to figure out where that 80% line actually is, and once you realize you’re at it, stop.

How can people find open source projects to participate in?

A lot of the time, they already exist. You probably already know what they are because you hear about them. The biggest thing is not to be shy about it, and not to be scared off. It took me a long time to work up the courage to actually push code back out and be able to take the criticism. No matter where you’re working, there are other people working with similar problems. Just go out and search for them. If they haven’t solved your specific niche problem, join the effort. It’s a worthwhile process. It’s really hard to convince graduate students about this, who are already overwhelmed with a lot of other things, but it is definitely the best part of those five years.

Your advisor is going to be pushing you for results, and my advisor said it had been years since he’d written any code. So you might not realize how important this is. But in a world that is becoming way more team-based, both in industry and science, it’s super important to push everything into a team-based context.

Also, if you’re in science, you’re all about trying to communicate your results. One of the best ways to do that is through your open source network. They have an audience there, waiting for you, and they might be really interested. A lot of it is, ‘I built this feature onto this project.’ They’ll go try it out and maybe they’ll write a paper about it, and then you get an extra citation.

There are a lot of extra indirect effects. The direct effect is that you’ll be better. The indirect effects are that there are a lot of other people who will benefit, and that will reflect very well on you.

It’s a little unfortunate that the primary currency of science is citations and not source code, even though that’s a big infrastructure push. I think that will have to change going forward because everything is being done in a team-based context. To do science more efficiently, it has to be that way. There’s no other alternative.
