Introduction to Data Science

Before we start

Libraries that we are using

There is a bunch of libraries that data scientists use during their research; here are some of the most important ones:

  • NumPy: provides a multidimensional array with basic operations and linear algebra functions.
  • SciPy: provides numerical algorithms and domain-specific toolboxes: signal processing, optimization, statistics...
  • Matplotlib: data visualization.
  • scikit-learn: machine learning library with classification, regression, clustering, dimensionality reduction, model selection, and preprocessing.
  • Pandas: data analysis library. It imports CSV, text files, xls, SQL databases, and other formats, and it can handle missing data and intelligent data alignment.

Python version and import toolboxes

This book uses the Anaconda distribution of Python, which integrates most of the important packages we will use. Installation details are not covered in these notes. We are using the Jupyter Notebook environment. All the code runs on Python 2.7; the reason we don't use Python 3.6 is the poorer third-party module support, and everything runs fine with 2.7.

Begin with python

Remember to import the toolboxes that we will use in the program

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# so that we can use the shortcuts pd, np, plt

To run the code in just one cell, use Cell -> Run or press Ctrl+Enter.
To add a cell below, use Insert -> Insert Cell Below or press B in command mode.

Data input and output

The basic data structure of Pandas is the DataFrame object. It is essentially a tabular data structure, made of rows and columns.
Each column is a Series, consisting of a list of several values.
e.g.

data = {'year': [2010, 2011, 2012,
                 2010, 2011, 2012,
                 2010, 2011, 2012],
        'team': ['FCBarcelona', 'FCBarcelona', 'FCBarcelona',
                 'RMadrid', 'RMadrid', 'RMadrid',
                 'ValenciaCF', 'ValenciaCF', 'ValenciaCF'],
        'wins':   [30, 28, 32, 29, 32, 26, 21, 17, 19],
        'draws':  [6, 7, 4, 5, 4, 7, 8, 10, 8],
        'losses': [2, 3, 2, 4, 2, 5, 9, 11, 11]}

football = pd.DataFrame(data, columns = ['year', 'team', 'wins', 'draws', 'losses'])

This is what the result looks like:

index year team wins draws losses
0 2010 FCBarcelona 30 6 2
1 2011 FCBarcelona 28 7 3
2 2012 FCBarcelona 32 4 2
3 2010 RMadrid 29 5 4
4 2011 RMadrid 32 4 2
5 2012 RMadrid 26 7 5
6 2010 ValenciaCF 21 8 9
7 2011 ValenciaCF 17 10 11
8 2012 ValenciaCF 19 8 11

We can input data not only from a dictionary, but also from a CSV file.
e.g.

edu = pd.read_csv('files/Data.csv',
na_values = ':',
usecols = ["A","B","C"])
edu

na_values sets the character that represents NaN.
usecols selects which columns to load.
You can also use Pandas to read other file types, using read_excel(), read_hdf(), read_table(), read_clipboard().
By calling edu.head() and edu.tail() we can see the first 5 rows and the last 5 rows of the data in edu.
If we want to know the column names, the index, or the values, use the attributes columns, index, and values.
If you want a quick view of the statistics, call edu.describe().
It gives us the count, mean, standard deviation, min, max and the 25%, 50%, 75% quantiles of each numeric column (Series).
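
As a quick sketch (assuming edu has been loaded with read_csv as above):

edu.head()      # first 5 rows
edu.tail()      # last 5 rows
edu.columns     # column names
edu.index       # row index
edu.values      # the data as a NumPy array
edu.describe()  # count, mean, std, min, 25%, 50%, 75%, max of each numeric column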

Selecting Data

To select or filter data we normally use square brackets [ ].

# select just the column named "Value"
edu['Value']

# select some rows
edu[10:14]

# select some rows of some columns
edu.ix[90:94, ['TIME', 'GEO']]  # in newer Pandas versions use .loc instead of .ix

We can also filter out data that we don't want. The filtering criteria can use any of the Boolean operators.

# we want Value larger than 6.5, and just the last 5 rows of the data
edu[edu['Value'] > 6.5].tail()

To select the rows where the value is NaN, use isnull():

edu[edu["Value"].isnull()]

Manipulating Data

Aggregation function

We can call aggregation functions to get summary information about the data.
Here are some commonly used functions: count(), sum(), mean(), median(), max(), min(), std(), var(), prod() (product of values).
They share a common parameter, axis:
axis = 0 means the function is applied down the rows, giving one result per column.
axis = 1 means the function is applied across the columns, giving one result per row (normally this doesn't make sense).
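
For instance, a minimal sketch on the numeric columns of edu (TIME and Value, used above):

edu[['TIME', 'Value']].mean(axis = 0)  # one mean per column (down the rows)
edu[['TIME', 'Value']].mean(axis = 1)  # one mean per row (across the columns)
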
Note that the max function of Pandas and the built-in max of Python behave differently:
in Pandas, NaN values are excluded.

The built-in Python max function, in contrast, does not skip NaN, so it can return NaN:

print "Pandas max function:" edu['Value'].max()  #return 8.81
print "Python max function:" max(edu['Value'])  #return NaN

We can also apply NumPy functions to a Series with apply(np.function_name).
e.g.

s = edu["Value"].apply(np.sqrt)
s.head()

we can also apply a lambda function

s = edu["Value"].apply(lambda d: d**2)

A lambda function is an easy way to define a simple anonymous function that will only be used once.

def f(x):
    return x**2
print f(8)  # result: 64

# is equivalent to
g = lambda x: x**2
print g(8)  # result: 64

Insert and remove

To add a new column we can simply use the assignment operator on the DataFrame.

edu['ValueNorm'] = edu['Value']/edu['Value'].max()

Some of the data should be removed. The axis parameter selects whether we are removing a row or a column. The inplace parameter selects whether we modify a copy or the original DataFrame; if you don't want to keep the old data, set it to True.

edu.drop('ValueNorm', axis = 1, inplace = True)

Insert a new row at the bottom using append.
Remember to set ignore_index to True so the new row is assigned the next index automatically; appending a plain dictionary without it raises an error.

edu = edu.append({"TIME": 2000, "Value": 5.00, "GEO": 'a'}, ignore_index = True)

Deleting the last row is easy if you write it like this:

edu.drop(max(edu.index), axis = 0, inplace = True)

To delete the rows containing NaN we can use the isnull() function and set axis to 0 to delete the whole row.

eduDrop = edu.drop(edu.index[edu['Value'].isnull()], axis = 0)  # drop the index labels where Value is NaN

Or we can use the dedicated dropna() function:

eduDrop = edu.dropna(how = 'any', subset = ['Value'])

If we want to set a default value for all the NaN entries, we can use fillna():

eduFilled = edu.fillna(value = {"Value": 0})

Sorting

Data sorted by a column is commonly needed. We can use the sort functions.

edu.sort_values(by = 'Value', ascending = False,
inplace = True)

Note that if inplace is True, no new DataFrame is returned.
The sorted data keep their original index, so we can use sort_index to return to the original order:

edu.sort_index(axis = 0, ascending = True, inplace = True)

Grouping Data

We often want to group data that meet some common criterion. Pandas has the groupby function for this.

# shows the mean of the value of each country
group = edu[["GEO", "Value"]].groupby("GEO").mean()

result:

                 Value
GEO
Austria         5.618333
Belgium         6.189091
Bulgaria        4.093333
Cyprus          7.023333
Czech Republic  4.16833

Rearranging Data

Sometimes we need to make the index meaningful by rearranging the data. The result is normally a new table containing the data we want, well arranged.

filtered_data = edu[edu["TIME"] > 2005]
pivedu = pd.pivot_table(filtered_data, values = 'Value',
                        index = ['GEO'],
                        columns = ['TIME'])
pivedu.head()

result:

TIME 2006 2007 2008 2009 2010 2011
GEO
Austria ... ... ... ... .. ..
Belgium .... ... ... ... ... ...
Bulgaria ... ... .. ... ... ..
Cyprus ... .. .. .. ... ..
Czech Republic ... .... ... .... .. ...

And we can directly select specific rows and columns by label in the new table that we just made.

pivedu.ix[['Spain','Portugal'], [2006, 2011]]

Ranking

Ranking is really useful. For example, we want to know the rank of each country by year.
In the example, we first rename the entry for Germany, which has a very long description, then we drop all the NaN rows, and finally we rank.

pivedu = pivedu.rename(index = {'Germany(until 1990 former territory of the FRG)': 'Germany'})
pivedu = pivedu.dropna()
pivedu.rank(ascending = False, method = 'first').head()

If we want to see the rank from top to bottom, we can first rank and then sort.

totalSum = pivedu.sum(axis = 1)
totalSum.rank(ascending = False, method = 'dense').sort_values().head()

result:

GEO
Denmark   1
Cyprus    2
Finland   3
Malta     4
Belgium   5

Plotting

By using the plot function, we can draw graphics.

totalSum = pivedu.sum(axis = 1).sort_values(ascending = False)
myplot = totalSum.plot(kind = 'bar', style = 'b', alpha = 0.4,
                       title = "Total Values for Country")

kind defines which kind of graphic we are drawing.
style refers to the style properties; for a bar graph it sets the color of the bars.
alpha is the transparency of the color (0-1).
If we want to add a legend (a description of the graph), use the legend function on the axes object returned by plot (myplot above).

myplot.legend(loc = 'center left',       # relative location of the legend with respect to the plot
              bbox_to_anchor = (1, .5))  # anchor the legend box at this position with respect to the plot

Basic of statistics in Python

First we introduce two concepts: population and sample. A population is a collection of objects; a sample is a part of the population that is observed.
Before doing statistics we should prepare data that suit our needs.

Data Preparation

Data preparation includes the following operations:

  1. Obtaining the data: e.g. reading a file, scraping from the internet
  2. Parsing the data: unwrapping the data package
  3. Cleaning the data: dealing with incomplete records
  4. Building the data structure: e.g. a database

file = open('files/ch03/adult.data', 'r')
data = []  # start with an empty list
for line in file:
    data1 = line.split(', ')
    if len(data1) == 15:
        data.append([.., .., .., .., .., .., .., .., ..., ..., .., .., .., .., ..])

df = pd.DataFrame(data)
df.columns = ['..', ... ,'..']
# give each column a label via the columns property

After fitting the data into a DataFrame, we can use the shape attribute to see the exact number of data samples, as (rows, columns).

df.shape
# return (32561, 15)

We can also filter the DataFrame like we did earlier.

ml50 = df[(df.sex == 'Male') & (df.income == '>50K\n')]

Summarizing some statistical values

  • Mean
    $$\mu=\frac{1}{n}\sum_{i=1}^nx_i$$
ml['age'].mean()
  • Sample Variance
    $$\sigma^2=\frac{1}{n}\sum_{i}(x_i-\mu)^2$$
    σ is the standard deviation
ml['age'].var()
ml['age'].std()
  • Sample Median: ml['age'].median()
  • Quantile: a fraction p of the data values is less than or equal to $x_p$ (see the quantile sketch after this list)
  • Data distribution: commonly a histogram is used to show the distribution
mlAge = ml['age']
mlAge.hist(normed = 0, histtype = 'stepfilled', bins = 20)
# bins: number of bins
# normed: 1 to show the value of the density function at each bin (a value in 0-1)
#         0 to show the number of samples in each bin (an integer count)
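
A minimal quantile sketch on the same ml['age'] series (0.25, 0.5 and 0.75 are just example fractions):

ml['age'].quantile(0.5)                # the same as the median
ml['age'].quantile([0.25, 0.5, 0.75])  # several quantiles at once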

To make two histograms comparable, we can overlay them in the same graphic.

import seaborn as sns
# fmAge: the analogous 'age' Series for the female subset (defined like mlAge)
fmAge.hist(normed = 0, histtype = 'stepfilled', alpha = .5, bins = 20)
mlAge.hist(normed = 0, histtype = 'stepfilled', alpha = 0.5,
           color = sns.desaturate("indianred", .75),
           bins = 10)

If we want to show the cumulative histogram, add the parameter cumulative = True.
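
For example, keeping the same parameters as above (a sketch):

mlAge.hist(normed = 0, histtype = 'stepfilled', bins = 20,
           cumulative = True)  # each bin now counts all samples up to that value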

Outlier Treatment

First define the standard for an outlier: normally, samples that are far from the median or whose value exceeds the mean by 2 or 3 standard deviations.

df2 = df.drop(df.index[
    (df.income == '>50K\n') &
    (df['age'] > df['age'].median() + 35) &
    (df['age'] > df['age'].median() - 15)
])

You can check the removal by overlapping the histograms before and after removing the data.
histogram() returns (see the sketch below):

  • hist: the value of each bar (according to normed), an array
  • bin_edges: the edges of the bins, an array of floats
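
A short sketch with NumPy's histogram function on the same series:

hist, bin_edges = np.histogram(mlAge, bins = 20)
# hist: the 20 bar values (counts here, densities if a normed/density argument is set)
# bin_edges: the 21 edges delimiting the bins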

Measuring Asymmetry: Skewness and Pearson's Median Skewness Coefficient

Skewness: how asymmetric the distribution is.
Skewness of a set of n data samples, $x_i$:
$$ g_1 = \frac{1}{n}\frac{\sum_{i}(x_i - \mu)^3}{\sigma^3} $$
Negative skewness means the distribution "skews left". A normal distribution has a skewness of 0.
We can define a function to calculate the skewness of a data set:

def skewness(x):
    res = 0
    m = x.mean()
    s = x.std()
    for i in x:
        res += (i - m) * (i - m) * (i - m)
    res /= (len(x) * s * s * s)
    return res

Normal Distribution

Also called the Gaussian distribution,
$$ P(x) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}} $$
it is really important in statistics.

Kernel Density

We don't always know the distribution parameters of a data set.
Instead, we use kernel density estimation: we place a Gaussian kernel around each data point, and the sum of these Gaussians gives a continuous function, so we obtain a continuous representation of the original distribution.

from scipy.stats import norm

x1 = np.random.normal(-1, 0.5, 15)
# random.normal has 3 parameters:
# loc: center of the normal distribution
# scale: standard deviation
# size: output shape, can be an int or a tuple of ints
x2 = np.random.normal(6, 1, 10)
y = np.r_[x1, x2]
# r_[] merges all the inputs into a single array
x = np.linspace(min(y), max(y), 100)
# linspace returns evenly spaced numbers as an array

s = 0.4  # smoothing parameter

# calculate the kernels
kernels = np.transpose([norm.pdf(x, yi, s) for yi in y])
# norm.pdf(x, loc, scale)
# pdf stands for probability density function
plt.plot(x, kernels, 'k:')
plt.plot(x, kernels.sum(1), 'r')
plt.plot(y, np.zeros(len(y)), 'bo', ms = 10)

We can also call the built-in function of SciPy:

from scipy.stats import kde
density = kde.gaussian_kde(y)
# kde.gaussian_kde Representation of a kernel-density estimate using Gaussian kernels.
xgrid = np.linspace(x.min(), x.max(), 200)
plt.hist(y, bins = 28, normed = True)
plt.plot(xgrid, density(xgrid), 'r-')

Estimation

Mean

We estimate the mean $\mu$ of the distribution using the sample mean. This process is called estimation, and the statistic (e.g. the sample mean) is called an estimator.
The sample mean minimizes the following mean squared error:
$$ MSE = \frac{1}{n}\sum{(\bar{x}-\mu)^2} $$

NTs = 200    # number of trials
mu = 0.0
var = 1.0
err = 0.0
NPs = 1000   # number of points per sample
for i in range(NTs):
    x = np.random.normal(mu, var, NPs)
    err += (x.mean() - mu)**2
print 'MSE: ', err/NTs

Variance

If the number of samples is large enough, we can use the sample variance estimator
$$ \bar{\sigma}^2 = \frac{1}{n}\sum(x_i-\bar{x})^2 $$
If we don't have enough samples, this estimator doesn't work well; it is biased. A better estimator is
$$ \bar{\sigma}^2 = \frac{1}{n-1}\sum(x_i-\bar{x})^2 $$
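
In Pandas/NumPy the two estimators correspond to the ddof parameter; a minimal sketch on the age series used above:

biased   = ml['age'].var(ddof = 0)  # divides by n     (the 1/n estimator)
unbiased = ml['age'].var(ddof = 1)  # divides by n - 1 (the Pandas default)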

Standard Score

$$ z_i = \frac{(x_i-\mu)}{\sigma} $$
This is a normalized set of values. After normalization, the distribution has mean 0 and variance 1, and it inherits the shape of the original dataset.
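
A minimal sketch, normalizing the same age series:

x = ml['age']
z = (x - x.mean()) / x.std()
print z.mean(), z.std()  # approximately 0 and 1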

Covariance, Rank Correlation

Covariance measures whether two variables share the same tendency.
We center the data with respect to their means, and look at the product of the deviations.
$$ Cov(X, Y) = \frac{1}{n}\sum_{i=1}^n dx_i\, dy_i $$
Pearson's correlation is obtained after we normalize the data and multiply the deviations together.
$$ \rho_i = \frac{x_i-\mu_X}{\sigma_X}\frac{y_i-\mu_Y}{\sigma_Y} $$
$$ \rho = \frac{1}{n}\sum_{i=1}^n{\rho_i}=\frac{Cov(X,Y)}{\sigma_X\sigma_Y} $$
Pearson's correlation is always between -1 and +1. The magnitude shows the strength of the correlation; the sign shows whether the relation is positive or negative.
$\rho = 0$ does not mean that the variables are unrelated; it only captures first-order (linear) correlation. Note also that it doesn't work well in the presence of outliers.
Spearman's rank correlation is a robust alternative, especially when the data contain outliers. It uses the ranks of the sorted sample data.
$$ r_s = 1-\frac{6\sum_i{d_i^2}}{n\cdot(n^2-1)} $$
with $$ d_i = rg(x_i)-rg(y_i)$$
the difference between the ranks of $x_i$ and $y_i$.
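
Pandas computes both coefficients directly with Series.corr; a sketch assuming two equally indexed numeric series (the second one here is made up just for illustration):

x = ml['age']
y = x**2 + np.random.normal(0, 100, len(x))  # hypothetical second variable
print x.corr(y, method = 'pearson')   # linear (first-order) correlation
print x.corr(y, method = 'spearman')  # rank correlation, robust to outliers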

Statistical Inference

There are two ways to address the problem of statistical inference:

  • the frequentist approach
  • the Bayesian approach

The frequentist approach

If we accept this approach, it means that we draw conclusions about the population parameters from the analysis of samples.
There are three things we are mostly concerned with:

  • point estimation: a particular value that approximates some parameter of interest, e.g. the mean or the variance
  • confidence intervals, or set estimates: a range of values that best represents the parameter of interest
  • hypothesis testing: accept or reject a claim about the parameter

Estimation

Remember that an estimate is not the truth; with more data, the estimate gets better.

Point estimation

Sampling

data = pd.read_csv("files/aaa.csv")
data['Date'] = (data[u'Dia de mes'].apply(lambda x: str(x)) +
                '-' +
                data[u'Mes de any'].apply(lambda x: str(x)))
data['Date'] = pd.to_datetime(data['Date'])
accidents = data.groupby(['Date']).size()
print accidents.mean()

How do we estimate the sampling distribution of the sample mean?

  1. Draw $s$ (a large number of) independent samples $\{x^1, ..., x^s\}$ from the population, where each element $x^j$ is composed of $\{x_i^j\}_{i=1,...,n}$.
  2. Evaluate the sample mean $\hat{\mu}^j=\frac{1}{n}\sum_{i=1}^n x_i^j$ of each sample.
  3. Estimate the sampling distribution of $\hat{\mu}$ by the empirical distribution of the sample replications.

# population
df = accidents.to_frame()  # convert accidents to a DataFrame
nTest = 10000
elements = 200

# mean array of samples
means = [0] * nTest  # build an array of nTest zeros

# sample generation
for i in range(nTest):
    rows = np.random.choice(df.index.values, elements)
    sampled_df = df.ix[rows]
    means[i] = sampled_df.mean()

Traditional Approach

But we cannot always access the whole population, so we have to solve the problem by using some theoretical results from traditional statistics.
Standard error
$$ SE = \frac{\sigma_x}{\sqrt{n}} $$
This formula uses the population standard deviation $\sigma_x$, which is not known. However, if the population distribution is not skewed, the estimate is sufficiently good when $n>30$; this allows us to use the standard deviation of the sample when we cannot access the whole population.

import math

rows = np.random.choice(df.index.values, 200)
sampledDf = df.ix[rows]
estSigmaMean = sampledDf.std() / math.sqrt(200)

Computationally intensive approach

When the full dataset (the population) is very large, this is an alternative to the traditional approach.
In the bootstrap, we draw n observations with replacement from the original data to create a bootstrap sample, or resample. Then we calculate the mean of this resample. By repeating this process many times, we get a good approximation of the sampling distribution of the mean.

def meanBootstrap(X, numberb):
    x = [0] * numberb
    for i in range(numberb):
        sample = [X[j]
                  for j in np.random.randint(len(X), size = len(X))]
        x[i] = np.mean(sample)
    return x

m = meanBootstrap(accidents, 10000)
np.mean(m)

Confidence Interval

A point estimate $\Theta$ is hardly ever perfect, so we want to provide a plausible range of values for the parameter. This range is called a confidence interval.
We care about two things: the interval and its plausibility.
If the point estimate follows the normal model with standard error SE, then a confidence interval for the population parameter is $$\Theta \pm z \times SE$$

Confidence Level 90% 95% 99% 99.9%
z Value 1.65 1.96 2.58 3.291

Use bootstrapping to calculate the 95% confidence interval

m = meanBootstrap(accidents, 10000)
sampleMean = np.mean(m)
sampleSe = np.std(m)

# Confidence Interval
ci = [np.percentile(m, 2.5), np.percentile(m, 97.5)]

95% confidence means:

In 95% of the cases, when I compute the 95% confidence interval from a sample, the true mean of the population will fall within the interval defined by the bounds $\pm1.96\times SE$ around the sample mean.

Hypothesis Testing

Two kinds of hypotheses:

  • $H_0$: the two distributions are the same (null hypothesis)
  • $H_1$: the two distributions are different (alternative hypothesis)

# counts2013: e.g. the daily counts for the year we are testing
n = len(counts2013)
mean = counts2013.mean()
s = counts2013.std()
ci = [mean - s*1.96/np.sqrt(n), mean + s*1.96/np.sqrt(n)]

Interpretation of the CI test:
The result can be to reject the null hypothesis, or to fail to reject it.

If we use a 95% confidence interval to test a problem where the null hypothesis is true, we will make an error whenever the point estimate is at least $1.96 \times SE$ away from the population parameter. This happens about 5% of the time.
