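Getting Started with Python: Data Analysis Steps

Broadly speaking, data analysis can be divided into five steps: defining the question, obtaining data, cleaning data, analyzing data, and reporting conclusions.

Defining the Question

We can think about a question from two angles: is it descriptive or inferential? A descriptive question focuses on what the data in front of us looks like, while an inferential question asks what predictions about unknown cases we can make from that data. For example, consider the following data: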

# Daily study time (minutes)
learn_time = {"Lao Wang": 30, "A Qiang": 5, "Niuniu": 13, "Xiao Zhao": 0, "Zhang Xiaoming": 22}
# Python project completion (0.0 to 1.0)
project_complete = {"Lao Wang": 1, "A Qiang": 0.1, "Niuniu": 0.2, "Xiao Zhao": 0.0, "Zhang Xiaoming": 0.7}

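The example above shows how long each Python student studies per day and how far along their project is. From it we can pose two questions:
1) Do two of the students study more than 20 minutes a day?
2) A new student, Mengzai, has a project completion of 0.9. Does Mengzai study more than 20 minutes a day?
The first is a descriptive question, because it can be answered directly from the data we have. The second is an inferential question, because Mengzai is not in the data and the answer has to be estimated.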
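To make the difference concrete, here is a minimal sketch that answers both questions. The descriptive part simply counts. The inferential part fits a straight line by ordinary least squares, which is an assumed model chosen for illustration (the text does not prescribe one), and the names `over_20`, `slope`, `intercept`, and `estimated_minutes` are made up for this sketch.

```python
# Data copied from the example above (names romanized).
learn_time = {"Lao Wang": 30, "A Qiang": 5, "Niuniu": 13, "Xiao Zhao": 0, "Zhang Xiaoming": 22}
project_complete = {"Lao Wang": 1, "A Qiang": 0.1, "Niuniu": 0.2, "Xiao Zhao": 0.0, "Zhang Xiaoming": 0.7}

# Descriptive: count what is already in the data.
over_20 = [name for name, minutes in learn_time.items() if minutes > 20]
print("Students above 20 minutes/day:", len(over_20))

# Inferential: fit a line (assumed model) from completion to study time,
# then apply it to a student who is not in the data.
names = list(learn_time)
xs = [project_complete[n] for n in names]
ys = [learn_time[n] for n in names]
mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

mengzai_completion = 0.9
estimated_minutes = intercept + slope * mengzai_completion
print("Estimated minutes/day for Mengzai: {:.1f}".format(estimated_minutes))
```

With only five data points the estimate is an illustration, not a reliable prediction; the point is that the second question cannot be answered by counting alone.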

# Share of Stack Overflow questions that are about Python, by year
survey = "https://stackoverflow.blog/2017/09/06/incredible-growth-python/"
SO_question = {2012:0.040, 2013:0.045, 2014:0.050, 2015:0.060, 2016: 0.075, 2017: 0.09, 2018: 0.10}

print("According to statistics from Stack Overflow:")
for year, perc in SO_question.items():
    # .1f keeps one decimal place regardless of floating-point rounding
    print("In {year}, {perc:.1f}% of questions were about Python".format(year=year, perc=perc * 100))
print("Source: " + survey)

# Output:
# According to statistics from Stack Overflow:
# In 2012, 4.0% of questions were about Python
# In 2013, 4.5% of questions were about Python
# In 2014, 5.0% of questions were about Python
# In 2015, 6.0% of questions were about Python
# In 2016, 7.5% of questions were about Python
# In 2017, 9.0% of questions were about Python
# In 2018, 10.0% of questions were about Python
# Source: https://stackoverflow.blog/2017/09/06/incredible-growth-python/

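Obtaining Data

Sometimes we pose a question based on data we have already seen; other times we pose the question first and then work out how to obtain the data to answer it. Data is typically obtained in one of the following ways:

1) Reading files:
Code can read files stored on your computer and extract the data they contain. The details are covered in the "Files" chapter.

2) Web scraping (crawlers):
The spread of the internet has driven today's boom in data science, and the web-scraping skills we learned earlier are a common way to obtain data. Make sure, however, that you scrape a site's data only by lawful means.

3) Using an API:
Many websites provide an application programming interface (API) that lets you explicitly request data in a structured format, which saves you the trouble of scraping it.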
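For the first route, reading files, a minimal sketch might look like the following. The file name `learn_time_sample.csv` and its two columns are assumptions made up for this example; the sketch writes the file first so that it can run on its own.

```python
import csv
import os
import tempfile

# Create a small sample file so the sketch is self-contained;
# in practice the file would already exist on disk.
sample_path = os.path.join(tempfile.gettempdir(), "learn_time_sample.csv")
with open(sample_path, "w", newline="", encoding="utf-8") as f:
    f.write("name,minutes\nLao Wang,30\nA Qiang,5\n")

# Read the file back: each row becomes a dict keyed by the header line.
with open(sample_path, newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

print(rows)
```

Note that every value read this way is a string, so numeric columns still need converting before any calculation.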

# Fetch a web page and print its raw contents
from urllib.request import urlopen

web_adr = "https://assets.baydn.com/baydn/public/codetime/1/scrape_py.html"
web_response = urlopen(web_adr)

# read() returns the page body as bytes
print(web_response.read())
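The third route, using an API, usually returns structured data such as JSON rather than an HTML page. The sketch below shows one way this could look with the standard library. The helper `fetch_json` and the sample payload are hypothetical: no real endpoint is assumed, so the sketch parses a stand-in response body instead of making a network request.

```python
import json
from urllib.request import urlopen


def fetch_json(url):
    """Request a URL and parse the response body as JSON."""
    with urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))


# A made-up payload standing in for what an API might return.
sample_body = ('{"language": "Python", "share": ['
               '{"year": 2016, "percent": 7.5}, {"year": 2017, "percent": 9.0}]}')

# Parsing turns the JSON text into ordinary dicts and lists.
payload = json.loads(sample_body)
for entry in payload["share"]:
    print(entry["year"], entry["percent"])
```

Compared with scraping, the values arrive already typed (numbers are numbers), which removes part of the cleaning work.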

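Cleaning Data

The data we obtain is often "dirty" and needs to be cleaned before use.
Cleaning generally covers three things: handling outliers, handling missing values, and adjusting the data (for example, converting types).

The data below is public data downloaded for free from the World Bank website and records the world population for each year. We want to calculate how the population changes from year to year, but the values arrive as strings and cannot be used in calculations directly.

Cleaning step: convert the dictionary values (strings) to float.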
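The three kinds of cleaning can be sketched on a small list. The values in `raw_minutes` are made up for illustration, and the outlier rule (a day has at most 1440 minutes) is an assumption chosen for this example rather than a general method.

```python
# Raw answers to "minutes studied per day" (made-up values).
raw_minutes = ["30", "", "13", "2000", " 22 ", None]

cleaned = []
for item in raw_minutes:
    # Missing values: skip None and empty strings.
    if item is None or item.strip() == "":
        continue
    # Data adjustment: strip whitespace and convert the string to a number.
    minutes = float(item.strip())
    # Outliers: a day has only 1440 minutes, so larger values cannot be valid.
    if minutes < 0 or minutes > 1440:
        continue
    cleaned.append(minutes)

print(cleaned)  # [30.0, 13.0, 22.0]
```

Whether to drop, correct, or keep a suspicious value depends on the question being asked; dropping is only the simplest option.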

data = "https://data.worldbank.org/indicator/sp.pop.totl"
world_population = {2017:"7530000000", 2016: "7444000000", 2015: "7358000000", 2014: "7271000000", 2013: "7185000000"}

print("World population (2013-2017):")
for pop in world_population.values():
    # Convert the dictionary value (a string) to a float
    pop = float(pop)
    print(pop)
    print(type(pop))

print("Data source: " + data)

# Output (dictionaries keep insertion order in Python 3.7+):
# World population (2013-2017):
# 7530000000.0
# <class 'float'>
# 7444000000.0
# <class 'float'>
# 7358000000.0
# <class 'float'>
# 7271000000.0
# <class 'float'>
# 7185000000.0
# <class 'float'>
# Data source: https://data.worldbank.org/indicator/sp.pop.totl
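The loop above prints the converted values but does not keep them. To actually compute the year-to-year change mentioned earlier, the cleaned values can be stored in a new dictionary. This is a sketch; the names `population` and `changes` are introduced here, and the 2013 figure is taken as 7185000000, an assumption consistent with the neighbouring years.

```python
world_population = {2017: "7530000000", 2016: "7444000000", 2015: "7358000000",
                    2014: "7271000000", 2013: "7185000000"}

# Store the converted values in a new dict instead of only printing them.
population = {year: float(value) for year, value in world_population.items()}

# Year-over-year change, computed in chronological order.
years = sorted(population)
changes = {year: population[year] - population[year - 1] for year in years[1:]}

for year, change in changes.items():
    print("{}: {:+,.0f}".format(year, change))
```

Each change is the population of that year minus the population of the year before, so 2013 has no entry.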

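Analyzing Data

Once the data is clean, we can dig useful information out of it. Following our classification of questions, the analysis can likewise be descriptive or inferential.

Consider the program below: based on its output, which company would you rather work for?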

company_a = [3000,3500,3300,4000,3200,30000,4300,3000,4200,3000]
company_b = [6000,6500,6000,5500,5300,5300,6300,5800]

def ave_income(company):
    """Return the arithmetic mean of a list of monthly incomes."""
    total = 0
    count = 0
    for num in company:
        total += num
        count += 1
    return total / count

print("Average income at company A: {} yuan/month".format(ave_income(company_a)))
print("Average income at company B: {} yuan/month".format(ave_income(company_b)))

# Output:
# Average income at company A: 6150.0 yuan/month
# Average income at company B: 5837.5 yuan/month
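The averages suggest company A pays more, but company A's list contains one value (30000) far above the rest. A quick way to check how much that single value matters is to compare the mean with the median, here using the standard library's `statistics` module. This comparison is an addition for illustration, not part of the original program.

```python
from statistics import mean, median

company_a = [3000, 3500, 3300, 4000, 3200, 30000, 4300, 3000, 4200, 3000]
company_b = [6000, 6500, 6000, 5500, 5300, 5300, 6300, 5800]

# The median is the middle value of the sorted list, so one extreme
# income barely moves it, unlike the mean.
for name, incomes in (("A", company_a), ("B", company_b)):
    print("Company {}: mean {}, median {}".format(name, mean(incomes), median(incomes)))
```

The median is 3400 at company A and 5900 at company B, so a typical employee earns noticeably more at company B even though company A has the higher mean.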

Reporting Conclusions

The most common way to present conclusions is to visualize the data, for example with Python's pygal library.

Line: line chart

import pygal
line_chart = pygal.Line()
line_chart.title = 'Browser usage (%)'
line_chart.x_labels = map(str, range(2002, 2013))
# None marks years for which a browser has no data
line_chart.add('Firefox', [None, None,    0, 16.6,   25,   31, 36.4, 45.5, 46.3, 42.8, 37.1])
line_chart.add('Chrome',  [None, None, None, None, None, None,    0,  3.9, 10.8, 23.8, 35.3])
line_chart.add('IE',      [85.8, 84.6, 84.7, 74.5,   66, 58.6, 54.7, 44.8, 36.2, 26.6, 20.1])
line_chart.add('Others',  [14.2, 15.4, 15.3,  8.9,    9, 10.4,  8.9,  5.8,  6.7,  6.8,  7.5])
# render() returns the chart as SVG; render_to_file('chart.svg') saves it to disk
line_chart.render()


Bar: bar chart

import pygal

"""
Counts for the 891 Titanic passengers:
Female survivors: 233
Male survivors: 109
Female victims: 81
Male victims: 468
"""

survive_female = 233
non_survive_female = 81
survive_male = 109
non_survive_male = 468

bar_chart = pygal.Bar()
bar_chart.title = 'Titanic survival statistics'
bar_chart.x_labels = ["Survivors", "Victims"]
bar_chart.add('Female', [survive_female, non_survive_female])
bar_chart.add('Male', [survive_male, non_survive_male])
# render() returns the chart as SVG
bar_chart.render()
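Raw counts can be hard to compare because there were far more men than women on board. As a small supplement to the chart, the sketch below turns the counts into survival rates. The helper `survival_rate` is introduced here for illustration, and the number of female survivors is taken as 233 so that the four counts add up to 891.

```python
survive_female, non_survive_female = 233, 81
survive_male, non_survive_male = 109, 468


def survival_rate(survived, died):
    """Share of a group that survived."""
    return survived / (survived + died)


female_rate = survival_rate(survive_female, non_survive_female)
male_rate = survival_rate(survive_male, non_survive_male)
print("Female survival rate: {:.1%}".format(female_rate))
print("Male survival rate: {:.1%}".format(male_rate))
```

The rates (roughly 74% for women and 19% for men) make the contrast in the bar chart explicit.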