Data Scientist's Toolbox

【W1-01】 Specialization Motivation

About this course

The key word in data science is "science", not "data"

  • An introduction to the key ideas behind working with data in a scientific way that will produce new and reproducible insight
  • An introduction to the tools that allow you to execute a data analytic strategy, from raw data in a database to a complete report with interactive graphics
  • Hands-on practice

1 【W1-01】 Specialization Motivation

1.1 Why do data science

Credit belongs to the person who is actually trying to ==get things done==, even when there are obstacles in the way.

It's important to strive valiantly to do these sorts of things, even if you're going to take some criticism.

1.2 The key challenge of Data Science

The heart of the data science philosophy is ==answering questions with data==. The question should come first, and the data should follow.

  • Finding a problem worth answering
  • Having the right information to answer the question
  • Having the information in advance
  • Having the right amount of data (no more, no less)

Answer the question you are interested in with the data you have.

1.3 Why data science

  • Data deluge: data is now much cheaper and easier to collect, store, and process
  • Big data: we have data in areas where we didn't use to have any, which allows us to answer questions we never could before

1.4 Why statistical data science

  • Statistics is the science of learning from data
  • Statistics deals with the uncertainty involved in answering questions with data

1.5 Why now

  • Explosive growth of data in every possible area
  • Tools, competitions, and websites have all developed around the idea of helping people learn from data
  • Huge investment in algorithm and prediction development
  • Opportunities to get involved in projects with very high-profile results

1.6 Why R

  • It is increasingly the most commonly used programming language in data science
  • It has a comprehensive set of packages for every step of data science (from the rawest of raw files to interactive reports and web apps)
  • It is free
  • It has one of the best development environments: RStudio
  • It has an amazing ecosystem of developers
  • Packages are easy to install and integrate

1.7 Who is a data scientist

  • Someone who uses data to answer all kinds of questions

1.8 Goal

Data science Venn diagram
  • Hacking skills
    • Computer programming: access data, clean data, analyze data, and plot data
    • Figure out answers for yourself

2 [W1-02] The Toolbox

2.1 What data scientists do

  • Define the question
  • Identify the ideal data set
  • Determine what data is accessible
  • Obtain data
  • Clean data
  • Exploratory data analysis, including making plots and clustering to identify patterns in the data set
  • Statistical modeling
  • Interpret results
  • Synthesize and write up results
  • Create reproducible code
  • Distribute results
    • Interactive graphics, write-ups, presentations, and interactive apps

2.2 Main workhorses

  • R, RStudio, R scripts, R Markdown
  • Git & GitHub (distributed version control)

3 [W1-03] Getting Help and Finding Answers

3.1 Asking questions

  • Often the fastest answer is the one you find yourself
  • Be an active participant in online communities (message boards, Stack Overflow, etc.)
    • If you figure out the answer to a question, post it back to the message board

3.1.1 How to ask an R question.

  • What steps will reproduce the problem
  • What is the expected output and what do you get instead
  • What version of R and which R packages are being used
  • What operating system is being used

3.1.2 How to ask a data analysis question

  • What is the question you are trying to answer
  • What steps or tools did you use to answer it
  • What is the expected output and what do you get instead
  • What other solutions have you thought about

3.2 Find the answer for yourself

  • Google it or search on Stack Overflow
  • Post the solution you found

3.3 Getting help with R ( see Evernote )

3.4 Key characteristics of hackers

  • Willing to find answers on their own
  • Knowledgeable about where to find answers (e.g. Cross Validated for data analysis/statistics)
  • Unintimidated by new data types or R packages
  • Unafraid to say they don't know
  • Polite but relentless

3.5 How to search

  • Stack Overflow with the "[r]" tag
  • R mailing list for software questions
  • Cross Validated for more general questions
  • Google [data type] R package

4 Types of Data Science Questions

4.1 Descriptive analysis

Goal: Describe a set of data
  • It is the first kind of data analysis performed
  • Most commonly applied to census data
  • The description and interpretation of the data are different steps
    • Descriptions usually cannot be generalized without additional statistical modeling

4.2 Exploratory analysis

Goal: Find new relationships but not necessarily confirm them
  • Exploratory models are good for discovering new connections
  • They are useful for defining future studies that can confirm the findings
  • Exploratory analyses are usually not the final conclusion
  • Exploratory analysis alone should not be used for generalizing or predicting
  • Correlation does not imply causation

4.3 Inferential Analysis

Goal: Use a relatively small sample of data to extrapolate or generalize to a larger population
  • Inference is commonly the goal of statistical models
  • Inference involves estimating both the quantity you are interested in and the uncertainty about that estimate
  • Inference depends heavily on both the population and the sampling scheme
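
The point about estimating both a quantity and its uncertainty can be illustrated with a minimal Python sketch. It is not from the course: the population, the sample size of 200, and the helper `mean_with_ci` are all hypothetical, and the interval uses a simple normal approximation.

```python
import math
import random
import statistics

def mean_with_ci(sample, z=1.96):
    """Return (estimate, lower, upper) for the population mean.

    Uses a normal approximation; z=1.96 gives a roughly 95% interval.
    """
    n = len(sample)
    estimate = statistics.mean(sample)
    std_error = statistics.stdev(sample) / math.sqrt(n)
    return estimate, estimate - z * std_error, estimate + z * std_error

rng = random.Random(0)

# Hypothetical population: 100,000 values centered near 50 (an assumption)
population = [rng.gauss(50, 10) for _ in range(100_000)]

# Simple random sample: every item has the same chance of being picked
sample = rng.sample(population, 200)

estimate, lower, upper = mean_with_ci(sample)
print(f"estimate={estimate:.2f}, 95% CI=({lower:.2f}, {upper:.2f})")
```

The interval is only trustworthy because the sample was drawn at random from the population of interest; a different sampling scheme would call for a different calculation.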

4.4 Predictive Analysis

Goal: Use the data on some objects to predict values for another object
  • If X predicts Y, it doesn't mean that X causes Y
  • Accurate prediction depends heavily on measuring the right variables
  • Although there are better and worse prediction models, more data and a simple model often work really well
  • Prediction is very hard, especially about the future

4.5 Causal Analysis

Goal: Find out what happens to one variable when you change another variable
  • Usually randomized studies are required to identify causation
  • There are approaches to inferring causation in non-randomized studies, but they are complicated and sensitive to assumptions
  • Causal relationships are usually identified as average effects, and may not apply to every individual
  • Causal models are usually the "gold standard" for data analysis

4.6 Mechanistic Analysis

Goal: Understand the exact changes in variables that lead to changes in other variables for individual objects
  • Incredibly hard to infer, except in simple situations
  • Usually applies to situations that are modeled by a deterministic set of equations (physical/engineering science)
  • Generally the random component of the data is measurement error
  • If the equations are known but the parameters are not, they may be inferred with data analysis

5 What is Data

5.1 Definition of Data

Data are values of qualitative or quantitative variables, belonging to a set of items

  • Set of items: sometimes called the population; the set of objects you are interested in and make measurements on
  • Variables: a measurement or characteristic of an item
  • Qualitative: not necessarily ordered and not necessarily measured on a scale
  • Quantitative: usually measured on a continuous scale, with an ordering on that scale

5.2 Data is the Second Most Important Thing in Data Science

  • The most important thing in data science is the question, so the data should follow the question
  • Often the data will limit or enable the question
    • You start with a question, but you may not have the data to answer it, so you have to modify the question
  • But having data is useless if you don't have a question

6 What about Big Data

  • One way to solve a big data problem is to wait until hardware catches up with the size of the data
  • Most questions you are trying to answer don't have a big data component that requires a huge number of computers
  • It is now possible to collect and analyze much more data, much more cheaply, than before
  • But the question is how much of that data is useful for answering the question that you are interested in
  • Regardless of the size of the data, you need the right data

7 Experimental Design

7.1 Why should we care

An exciting result can lead you astray if you are not very careful about experimental design and analysis

Things to be aware of when designing an experiment or running a data science project:

  • Know and care about the analysis plan
    • Pay attention to every aspect of the design and analysis of the study, from data cleaning to data analysis to reporting, so that you are aware of the key issues that can trip you up
  • Have a plan for data and code sharing
    • e.g. GitHub for code sharing
    • e.g. figshare for data sharing
    • The Leek group guide to data sharing on GitHub

7.2 Formulate your question in advance

  • The first and most important step in performing an experiment

7.3 Statistical inference

7.3.1

[Figure: image.png]

7.3.2 Confounding and spurious correlation

  • Be careful about other variables that may be causing a relationship
  • Correlation is not causation
  • Even if you observe that two variables are correlated, you have to rule out that they are correlated only because of some other variable you didn't measure
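
To make the confounding idea concrete, here is a small simulation sketch in Python. It is an illustration under stated assumptions, not course material: `z` plays the role of the unmeasured variable, `x` and `y` are hypothetical measurements that each depend on `z` but not on each other, and `pearson` is a hand-written helper.

```python
import math
import random
import statistics

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_a, mean_b = statistics.fmean(a), statistics.fmean(b)
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)

rng = random.Random(1)
n = 5000

# z is the confounder; x and y each depend on z but not on each other
z = [rng.gauss(0, 1) for _ in range(n)]
x = [zi + rng.gauss(0, 1) for zi in z]
y = [zi + rng.gauss(0, 1) for zi in z]

corr_raw = pearson(x, y)

# Remove z's contribution (possible here only because the simulation knows it)
x_adj = [xi - zi for xi, zi in zip(x, z)]
y_adj = [yi - zi for yi, zi in zip(y, z)]
corr_adjusted = pearson(x_adj, y_adj)

print(f"correlation ignoring z:       {corr_raw:.2f}")
print(f"correlation after removing z: {corr_adjusted:.2f}")
```

In this setup the raw correlation sits near 0.5 even though `x` never influences `y`, and it drops to about zero once `z` is accounted for. With real data the confounder's effect is not known exactly, so it would have to be estimated or controlled by design.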

7.3.3 Deal with potential confounders: Randomization and Blocking

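  • Fix some variables, so they cannot differ between groups
  • Stratify on some variables (blocking), so each group is measured at every level of the variable
  • Randomize the remaining variables; the aim is to balance the confounding effect across groups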
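
A minimal Python sketch of why randomization helps, under made-up assumptions: the confounder is a hypothetical "age" variable, and the self-selection rule is invented purely to create imbalance. It compares the age gap between treated and control groups under self-selection and under random assignment.

```python
import random
import statistics

rng = random.Random(2)

# Hypothetical confounder: age of 10,000 study participants
ages = [rng.gauss(40, 10) for _ in range(10_000)]

def age_gap(treated, control):
    """Difference in mean age between the treated and control groups."""
    return statistics.fmean(treated) - statistics.fmean(control)

# Self-selection: older participants are more likely to end up treated
self_treated, self_control = [], []
for age in ages:
    if age + rng.gauss(0, 5) > 40:
        self_treated.append(age)
    else:
        self_control.append(age)

# Randomization: shuffle, then split in half regardless of age
shuffled = ages[:]
rng.shuffle(shuffled)
half = len(shuffled) // 2
rand_treated, rand_control = shuffled[:half], shuffled[half:]

gap_self = age_gap(self_treated, self_control)
gap_random = age_gap(rand_treated, rand_control)
print(f"age gap with self-selection: {gap_self:.1f} years")
print(f"age gap with randomization:  {gap_random:.1f} years")
```

Random assignment does not remove the confounder; it spreads it evenly across the groups, so any remaining difference in outcome is less likely to be explained by age.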

7.4 Prediction

7.4.1

[Figure: image.png]

7.4.2 Prediction versus inference

  • For prediction, the distributions of the groups need to be more clearly separated than they do for inference
  • It is important to pay attention to the relative size of effects when considering prediction versus inference
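
The difference can be sketched with a simulation. The following Python example is only an illustration with assumed numbers: two groups with standard deviation 1 whose means differ by a hypothetical `shift`, a z statistic for the difference in means (inference), and the accuracy of a midpoint-threshold classifier (prediction).

```python
import math
import random
import statistics

def compare_groups(shift, n, rng):
    """Simulate two groups whose means differ by `shift` (both with sd 1).

    Returns (z_statistic, accuracy): the z statistic for the difference in
    means, and the accuracy of classifying each value by a midpoint threshold.
    """
    group_a = [rng.gauss(0, 1) for _ in range(n)]
    group_b = [rng.gauss(shift, 1) for _ in range(n)]

    diff = statistics.fmean(group_b) - statistics.fmean(group_a)
    std_error = math.sqrt(
        statistics.variance(group_a) / n + statistics.variance(group_b) / n
    )
    z_statistic = diff / std_error

    threshold = shift / 2  # predict group B when the value exceeds the midpoint
    correct = sum(v <= threshold for v in group_a) + sum(v > threshold for v in group_b)
    return z_statistic, correct / (2 * n)

rng = random.Random(3)
z_small, acc_small = compare_groups(shift=0.2, n=5000, rng=rng)
z_large, acc_large = compare_groups(shift=3.0, n=5000, rng=rng)

print(f"shift=0.2: z={z_small:.1f}, accuracy={acc_small:.2f}")
print(f"shift=3.0: z={z_large:.1f}, accuracy={acc_large:.2f}")
```

With a small shift and a large sample, the difference in means is clearly detectable, yet predicting an individual's group is barely better than a coin flip. Only when the distributions are well separated does prediction become accurate.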

7.4.3 Prediction key quantities

[Figure: image.png]

7.5 Data dredging

Data dredging (also data fishing, data snooping, and p-hacking) is the use of data mining to uncover patterns in data that can be presented as statistically significant, without first devising a specific hypothesis as to the underlying causality.
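
A short Python sketch shows how dredging produces "significant" findings from pure noise. Everything here is an assumption made for illustration: 200 hypothetical predictors, 50 observations, no real relationship anywhere, and a 5% two-sided test based on the Fisher z-transform approximation.

```python
import math
import random
import statistics

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_a, mean_b = statistics.fmean(a), statistics.fmean(b)
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)

rng = random.Random(4)
n_obs, n_predictors = 50, 200

# The outcome is pure noise, unrelated to any predictor
outcome = [rng.gauss(0, 1) for _ in range(n_obs)]

false_positives = 0
for _ in range(n_predictors):
    predictor = [rng.gauss(0, 1) for _ in range(n_obs)]
    r = pearson(predictor, outcome)
    # Fisher z-transform: approximately standard normal when there is no true correlation
    z = math.atanh(r) * math.sqrt(n_obs - 3)
    if abs(z) > 1.96:  # "significant" at roughly the 5% level
        false_positives += 1

print(f"{false_positives} of {n_predictors} noise predictors look significant")
```

About 5% of the tests are expected to pass the threshold by chance alone, which is why a hypothesis should be stated before looking through the data, or the number of tests should be accounted for.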
