《用Python玩轉(zhuǎn)數(shù)據(jù)》-05 基本數(shù)據(jù)統(tǒng)計(jì)

5.1 便捷數(shù)據(jù)獲取

  • 便捷網(wǎng)絡(luò)數(shù)據(jù)
    直接利用matplotlib.finance模塊進(jìn)行數(shù)據(jù)獲取,詳情可見matplotlib技術(shù)文檔旺聚,另溫故已有的兩次數(shù)據(jù)獲取方式
#cnn_stock
import re
import requests
import pandas as pd

def strievieDijList():
    r = requests.get('http://money.cnn.com/data/dow30/')
    pattern_s = re.compile('class="wsod_symbol">(.*?)<\/a>.*?<span title=.*?">(.*?)<\/span>.*?\n.*?class="wsod_stream">(.*?)<\/span>')
                           #class="wsod_symbol">(.*?)<\/a>.*?<span.*?">(.*?)<\/span>.*?\n.*?class="wsod_stream">(.*?)<\/span>
    dijListInText = re.findall(pattern_s, r.text)
    return dijListInText

djiList = strievieDijList()
dataFrame = pd.DataFrame(djiList,columns = ['Abbr','Name','Price'])
print(dataFrame)
#quotes_history
import requests
import re
import json
import pandas as pd
from datetime import date

def retrieve_quotes_historical(stock_code):
    quotes = []
    url = 'https://finance.yahoo.com/quote/%s/history?p=%s' % (stock_code, stock_code)
    r = requests.get(url)
    m = re.findall('"HistoricalPriceStore":{"prices":(.*?),"isPending"', r.text)
    if m:
        quotes = json.loads(m[0])
        quotes = quotes[::-1]
    return [item for item in quotes if not 'type' in item]

quotes = retrieve_quotes_historical('AXP')
list1 = []
for i in range(len(quotes)):
    x = date.fromtimestamp(quotes[i]['date'])
    y = date.strftime(x, '%y-%m-%d')
    list1.append(y)

quotesdf_ori = pd.DataFrame(quotes, index = list1)
quotesdf = quotesdf_ori.drop(['date'], axis = 1)
#quotesdf = quotesdf_ori.drop(['unadjclose'], axis = 1)
print(quotesdf)

5.2 數(shù)據(jù)準(zhǔn)備

對(duì)數(shù)據(jù)的前期整理可以對(duì)后期分析工作起到很大幫助

  • 修改列索引和行索引
#更新列明
cols = ['code', 'name', 'price']
dataFrame.columns = cols

#更新index序號(hào)
dataFrame.index = range(1, len(dataFrame) + 1)

  • 修改日期形式
#絕對(duì)時(shí)間轉(zhuǎn)化
from datetime import date
firstday = date.fromtimestamp(1464010200)

#轉(zhuǎn)化成固定的是時(shí)間格式
finalday = date.strftime(firstday, '%Y-%m-%d')
  • 創(chuàng)建時(shí)間序列
import pandas as pd
dates = pd.date_range('20170520', periods = 7)

import numpy as np
datesdf = pd.DataFrame(np.random.randn(7,3), index = dates, columns = list('ABC'))
print(datesdf)

Out[13]: 
                   A         B         C
2017-05-20  1.135092  0.163002 -1.232315
2017-05-21 -0.017070 -0.429692  0.090594
2017-05-22 -0.811601 -0.859939 -0.290507
2017-05-23 -0.332632  1.147155 -0.393566
2017-05-24 -0.823202 -0.094191  1.420667
2017-05-25  0.499164 -0.418811 -0.675755
2017-05-26  0.713523 -0.445534 -0.392662

5.3 數(shù)據(jù)顯示

對(duì)數(shù)據(jù)的結(jié)果進(jìn)行一次查驗(yàn),可以對(duì)數(shù)據(jù)的正誤進(jìn)行一次查看仗岸,一面后期程序進(jìn)行出現(xiàn)問題

  • 行索引 list(dataFrame.index)
  • 列索引 list(dataFrame.columns)
  • 查看值 dataFrame.values
  • 數(shù)據(jù)描述查看 dataFrame.describe
  • 查看數(shù)值基本類型是否對(duì)應(yīng)dataFrame.lasttrade

5.4 數(shù)據(jù)選擇

在數(shù)據(jù)分析前需要對(duì)數(shù)據(jù)對(duì)象進(jìn)行選擇

  • 選擇行
  • 選擇列
  • 選擇區(qū)域
  • 篩選(條件選擇)
#篩選時(shí)間
quotesdf['18-03-15':'18-03-27']
Out[8]: 
           adjclose      close       high        low       open   volume
18-03-15  94.389999  94.389999  95.410004  93.959999  95.059998  3213000
18-03-16  95.610001  95.610001  96.389999  94.269997  94.550003  5124900
18-03-19  94.839996  94.839996  95.480003  93.800003  95.440002  2610400
18-03-20  94.949997  94.949997  95.650002  94.720001  95.349998  2501600
18-03-21  94.739998  94.739998  96.269997  94.400002  94.620003  2124000
18-03-22  91.410004  91.410004  93.959999  91.120003  93.739998  4113900
18-03-23  90.449997  90.449997  92.360001  90.410004  91.820000  5100400
18-03-26  92.620003  92.620003  92.989998  90.800003  91.370003  2769800
18-03-27  91.419998  91.419998  93.739998  90.870003  92.809998  4306200

#選中單列
quotesdf['close']
#或者
quotesdf.close

#選中多行多列前标,loc類
 dataFrame.loc[1:5,['Name','Price']]
Out[19]: 
               Name   Price
1  American Express   91.42
2             Apple  168.34
3            Boeing  321.12
4       Caterpillar  146.99
5           Chevron  114.66

#選中多行列墓贿,iloc類
dataFrame.iloc[1:5,[0,2]]
Out[20]: 
   Abbr   Price
1   AXP   91.42
2  AAPL  168.34
3    BA  321.12
4   CAT  146.99

#選中確定的某個(gè)數(shù)值
dataFrame.iat[1,2]

5.5簡(jiǎn)單統(tǒng)計(jì)與處理

簡(jiǎn)單統(tǒng)計(jì)與篩選

  • 求平均值
dataFrame.Price.mean()
Out[58]: 116.14099999999999
  • 條件篩選
dataFrame[dataFrame.Price >= 120].Name
Out[5]: 
0                          3M
2                       Apple
3                      Boeing
4                 Caterpillar
12              Goldman Sachs
13                 Home Depot
14                        IBM
16          Johnson & Johnson
18                 McDonald's
24    Travelers Companies Inc
25        United Technologies
26               UnitedHealth
Name: Name, dtype: object

#統(tǒng)計(jì)AXP股價(jià)上漲下跌的天數(shù)
len(quotesdf[quotesdf.close > quotesdf.open])
len(quotesdf[quotesdf.close <= quotesdf.open])

#統(tǒng)計(jì)AXP一年相鄰兩天收盤情況
status = np.sign(np.diff(quotesdf.close))
status[np.where(status == 1)].size
Out[11]: 132
status[np.where(status == -1)].size
Out[12]: 114
  • 排序
    DataFrame.sort_values(by = , ascending = False)逆序排列
tempdf = dataFrame.sort_values(by = 'Price', ascending = False)
tempdf
Out[17]: 
    Abbr                     Name   Price
3     BA                   Boeing  327.88
12    GS            Goldman Sachs  251.86
0    MMM                       3M  219.52
26   UNH             UnitedHealth  214.00
13    HD               Home Depot  178.24
2   AAPL                    Apple  167.78
18   MCD               McDonald's  156.38
14   IBM                      IBM  153.43
4    CAT              Caterpillar  147.38
24   TRV  Travelers Companies Inc  138.86
16   JNJ        Johnson & Johnson  128.15
25   UTX      United Technologies  125.82
28     V                     Visa  119.62
5    CVX                  Chevron  114.04
17   JPM           JPMorgan Chase  109.97
8    DIS                   Disney  100.44
1    AXP         American Express   93.28
20  MSFT                Microsoft   91.27
29   WMT                 Wal-Mart   88.97
23    PG         Procter & Gamble   79.28
10   XOM              Exxon Mobil   74.61
21   NKE                     Nike   66.44
9   DWDP            DowDuPont Inc   63.71
19   MRK                    Merck   54.47
15  INTC                    Intel   52.08
27    VZ                  Verizon   47.82
7     KO                Coca-Cola   43.43
6   CSCO                    Cisco   42.89
22   PFE                   Pfizer   35.49
11    GE         General Electric   13.48

tempdf[:3].Name
Out[18]: 
3            Boeing
12    Goldman Sachs
0                3M
Name: Name, dtype: object
  • 計(jì)數(shù)統(tǒng)計(jì)
#統(tǒng)計(jì)1月開盤天數(shù)
t = quotesdf[(quotesdf.index >= '2017-01-01') & (quotesdf.index < '2017-02-01')]
len(t)

#統(tǒng)計(jì)每個(gè)月的股票開盤天數(shù)
#運(yùn)用到strptime()方法
import time
...
listtemp = []
for i in range(len(quotesdf)):
    temp = time.strptime(quotesdf.index[i], '%y-%m-%d')
    listtemp.append(temp.tm_mon)
    
tempdf = quotesdf.copy()
tempdf['month'] = listtemp
print(tempdf['month'].value_counts())  

5.6 Grouping

用分組的思想進(jìn)行我們的統(tǒng)計(jì)
DataFrame.groupby()

x = tempdf.groupby('month').count()
print(x)
Out[8]: 
       close  high  low  open  volume
month                                
1         21    21   21    21      21
2         19    19   19    19      19
3         21    21   21    21      21
4         19    19   19    19      19
5         22    22   22    22      22
6         22    22   22    22      22
7         20    20   20    20      20
8         23    23   23    23      23
9         20    20   20    20      20
10        22    22   22    22      22
11        21    21   21    21      21
12        20    20   20    20      20

#任意選擇一個(gè)屬性進(jìn)行月數(shù)的統(tǒng)計(jì)
print(x.close)
month
1     21
2     19
3     21
4     19
5     22
6     22
7     20
8     23
9     20
10    22
11    21
12    20

#統(tǒng)計(jì)每個(gè)月的成交量,同理還有mean()、min()甜熔、max()等方法
tempdf.groupby('month').sum().volume

#高效統(tǒng)計(jì)每個(gè)月的成交量
tempdf.groupby('month').volume.sum()

問題的核心在于用什么樣的方式來處理相對(duì)應(yīng)的字段名

5.7 Merge

  • append 追加
#把AXP公司本年度1月1日至1月5日間的股票交易信息合并到近一年中前兩天的股票信息
p = quotesdf[:2]
print(p)
Out[23]: 
           adjclose      close       high        low       open   volume
17-04-03  77.401367  78.589996  79.180000  77.970001  79.169998  3022700
17-04-04  77.076370  78.260002  78.610001  78.150002  78.489998  2563700

q = quotesdf['18-01-01':'18-01-03']
print(q)
Out[25]: 
           adjclose      close       high        low       open   volume
18-01-02  98.592148  98.940002  99.730003  98.220001  99.730003  2746700
18-01-03  99.200005  99.550003  99.760002  99.019997  99.239998  2976400

p.append(q)
Out[26]: 
           adjclose      close       high        low       open   volume
17-04-03  77.401367  78.589996  79.180000  77.970001  79.169998  3022700
17-04-04  77.076370  78.260002  78.610001  78.150002  78.489998  2563700
18-01-02  98.592148  98.940002  99.730003  98.220001  99.730003  2746700
18-01-03  99.200005  99.550003  99.760002  99.019997  99.239998  2976400
  • conact 連接
#將AXP公司近一年股票數(shù)據(jù)中前五個(gè)和后五個(gè)合并
pieces = [tempdf[:5], tempdf[len(tempdf)-5:]]
pd.concat(pieces)
Out[35]: 
              close       high        low       open   volume  month
17-04-03  78.589996  79.180000  77.970001  79.169998  3022700      4
17-04-04  78.260002  78.610001  78.150002  78.489998  2563700      4
17-04-05  77.760002  79.029999  77.660004  78.589996  2858400      4
17-04-06  77.919998  78.300003  77.150002  77.760002  2914800      4
17-04-07  77.769997  78.239998  77.370003  77.480003  2203000      4
18-03-23  90.449997  92.360001  90.410004  91.820000  5100400      3
18-03-26  92.620003  92.989998  90.800003  91.370003  2769800      3
18-03-27  91.419998  93.739998  90.870003  92.809998  4357300      3
18-03-28  92.209999  93.540001  91.599998  91.760002  5509700      3
18-03-29  93.279999  94.370003  92.290001  92.389999  4914600      3

#利用concat連接不同數(shù)據(jù)類型
pieces1 = quotesdf[:3]

pieces2 = tempdf[:3]

pd.concat([pieces1, pieces2])
Out[38]: 
           adjclose      close       high        low  month       open  \
17-04-03  77.401367  78.589996  79.180000  77.970001    NaN  79.169998   
17-04-04  77.076370  78.260002  78.610001  78.150002    NaN  78.489998   
17-04-05  76.898369  77.760002  79.029999  77.660004    NaN  78.589996   
17-04-03        NaN  78.589996  79.180000  77.970001    4.0  79.169998   
17-04-04        NaN  78.260002  78.610001  78.150002    4.0  78.489998   
17-04-05        NaN  77.760002  79.029999  77.660004    4.0  78.589996   

           volume  
17-04-03  3022700  
17-04-04  2563700  
17-04-05  2858400  
17-04-03  3022700  
17-04-04  2563700  
17-04-05  2858400  

pd.concat([pieces1,pieces2],ignore_index = True)#即不使用連接軸上的信息
Out[39]: 
    adjclose      close       high        low  month       open   volume
0  77.401367  78.589996  79.180000  77.970001    NaN  79.169998  3022700
1  77.076370  78.260002  78.610001  78.150002    NaN  78.489998  2563700
2  76.898369  77.760002  79.029999  77.660004    NaN  78.589996  2858400
3        NaN  78.589996  79.180000  77.970001    4.0  79.169998  3022700
4        NaN  78.260002  78.610001  78.150002    4.0  78.489998  2563700
5        NaN  77.760002  79.029999  77.660004    4.0  78.589996  2858400
  • join(SQL類型) 連接
#和數(shù)據(jù)庫中的join概念相同圆恤,join前提是要有共同字段
pd.merge(diff.drop(['lasttrade'], axis = 1), AKdf, on = 'code') #該表未實(shí)際生成,此處了解即可
最后編輯于
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者
  • 序言:七十年代末腔稀,一起剝皮案震驚了整個(gè)濱河市盆昙,隨后出現(xiàn)的幾起案子,更是在濱河造成了極大的恐慌焊虏,老刑警劉巖淡喜,帶你破解...
    沈念sama閱讀 219,366評(píng)論 6 508
  • 序言:濱河連續(xù)發(fā)生了三起死亡事件,死亡現(xiàn)場(chǎng)離奇詭異炕淮,居然都是意外死亡拆火,警方通過查閱死者的電腦和手機(jī),發(fā)現(xiàn)死者居然都...
    沈念sama閱讀 93,521評(píng)論 3 395
  • 文/潘曉璐 我一進(jìn)店門涂圆,熙熙樓的掌柜王于貴愁眉苦臉地迎上來们镜,“玉大人,你說我怎么就攤上這事润歉∧O粒” “怎么了?”我有些...
    開封第一講書人閱讀 165,689評(píng)論 0 356
  • 文/不壞的土叔 我叫張陵踩衩,是天一觀的道長(zhǎng)嚼鹉。 經(jīng)常有香客問我,道長(zhǎng),這世上最難降的妖魔是什么? 我笑而不...
    開封第一講書人閱讀 58,925評(píng)論 1 295
  • 正文 為了忘掉前任宽涌,我火速辦了婚禮,結(jié)果婚禮上线脚,老公的妹妹穿的比我還像新娘。我一直安慰自己叫榕,他們只是感情好浑侥,可當(dāng)我...
    茶點(diǎn)故事閱讀 67,942評(píng)論 6 392
  • 文/花漫 我一把揭開白布。 她就那樣靜靜地躺著晰绎,像睡著了一般寓落。 火紅的嫁衣襯著肌膚如雪。 梳的紋絲不亂的頭發(fā)上荞下,一...
    開封第一講書人閱讀 51,727評(píng)論 1 305
  • 那天伶选,我揣著相機(jī)與錄音史飞,去河邊找鬼。 笑死考蕾,一個(gè)胖子當(dāng)著我的面吹牛祸憋,可吹牛的內(nèi)容都是我干的会宪。 我是一名探鬼主播肖卧,決...
    沈念sama閱讀 40,447評(píng)論 3 420
  • 文/蒼蘭香墨 我猛地睜開眼,長(zhǎng)吁一口氣:“原來是場(chǎng)噩夢(mèng)啊……” “哼掸鹅!你這毒婦竟也來了塞帐?” 一聲冷哼從身側(cè)響起,我...
    開封第一講書人閱讀 39,349評(píng)論 0 276
  • 序言:老撾萬榮一對(duì)情侶失蹤巍沙,失蹤者是張志新(化名)和其女友劉穎葵姥,沒想到半個(gè)月后,有當(dāng)?shù)厝嗽跇淞掷锇l(fā)現(xiàn)了一具尸體句携,經(jīng)...
    沈念sama閱讀 45,820評(píng)論 1 317
  • 正文 獨(dú)居荒郊野嶺守林人離奇死亡榔幸,尸身上長(zhǎng)有42處帶血的膿包…… 初始之章·張勛 以下內(nèi)容為張勛視角 年9月15日...
    茶點(diǎn)故事閱讀 37,990評(píng)論 3 337
  • 正文 我和宋清朗相戀三年,在試婚紗的時(shí)候發(fā)現(xiàn)自己被綠了矮嫉。 大學(xué)時(shí)的朋友給我發(fā)了我未婚夫和他白月光在一起吃飯的照片削咆。...
    茶點(diǎn)故事閱讀 40,127評(píng)論 1 351
  • 序言:一個(gè)原本活蹦亂跳的男人離奇死亡,死狀恐怖蠢笋,靈堂內(nèi)的尸體忽然破棺而出拨齐,到底是詐尸還是另有隱情,我是刑警寧澤昨寞,帶...
    沈念sama閱讀 35,812評(píng)論 5 346
  • 正文 年R本政府宣布瞻惋,位于F島的核電站,受9級(jí)特大地震影響援岩,放射性物質(zhì)發(fā)生泄漏歼狼。R本人自食惡果不足惜,卻給世界環(huán)境...
    茶點(diǎn)故事閱讀 41,471評(píng)論 3 331
  • 文/蒙蒙 一享怀、第九天 我趴在偏房一處隱蔽的房頂上張望羽峰。 院中可真熱鬧,春花似錦凹蜈、人聲如沸限寞。這莊子的主人今日做“春日...
    開封第一講書人閱讀 32,017評(píng)論 0 22
  • 文/蒼蘭香墨 我抬頭看了看天上的太陽履植。三九已至,卻和暖如春悄晃,著一層夾襖步出監(jiān)牢的瞬間玫霎,已是汗流浹背凿滤。 一陣腳步聲響...
    開封第一講書人閱讀 33,142評(píng)論 1 272
  • 我被黑心中介騙來泰國打工, 沒想到剛下飛機(jī)就差點(diǎn)兒被人妖公主榨干…… 1. 我叫王不留庶近,地道東北人翁脆。 一個(gè)月前我還...
    沈念sama閱讀 48,388評(píng)論 3 373
  • 正文 我出身青樓,卻偏偏與公主長(zhǎng)得像鼻种,于是被迫代替她去往敵國和親反番。 傳聞我的和親對(duì)象是個(gè)殘疾皇子,可洞房花燭夜當(dāng)晚...
    茶點(diǎn)故事閱讀 45,066評(píng)論 2 355