Python數(shù)據(jù)分析筆記-08

1.I/O API工具

2.讀取csv或文本文件中的數(shù)據(jù)

1)創(chuàng)建一個.csv文件

White,red,blue,green,animal

1,5,2,3,cat

2,7,8,5,dog

3,3,6,7,horse

2,2,8,3,duck

4,4,2,1,mouse

2)用read_csv()函數(shù)讀取他的內(nèi)容剧浸,同時將其轉(zhuǎn)化為DataFrame對象

>>> csvframe=pd.read_csv('/Users/caojin/Desktop/myCSV_01.csv')

>>> csvframe

??White?red?blue?green animal

0???1??5???2???3??cat

1???2??7???8???5??dog

2???3??3???6???7?horse

3???2??2???8???3??duck

4???4??4???2???1?mouse

3)既然csv也是文本文件卡辰,還可以使用read_table()函數(shù)但是得指定分隔符

>>> csvframe=pd.read_table('/Users/caojin/Desktop/myCSV_01.csv',sep=',')

>>> csvframe

??White?red?blue?green animal

0???1??5???2???3??cat

1???2??7???8???5??dog

2???3??3???6???7?horse

3???2??2???8???3??duck

4???4??4???2???1?mouse

4)上面例子逊朽,標(biāo)識各列的表頭位于csv文件的第一行,但一般情況并非如此航厚,可能第一行就是列表數(shù)據(jù)如下:

5)沒有表頭的數(shù)據(jù)使用read_csv()函數(shù)時候肉康,使用header選項,將其設(shè)置為None抡句,pandas會自動為其添加默認(rèn)表頭

未使用header

>>> csvframe=pd.read_csv('/Users/caojin/Desktop/myCSV_02.csv')

>>> csvframe

??1?5?2?3??cat

0?2?7?8?5??dog

1?3?3?6?7?horse

2?2?2?8?3??duck

3?4?4?2?1?mouse

使用header

>>> csvframe=pd.read_csv('/Users/caojin/Desktop/myCSV_02.csv',header=None)

>>> csvframe

??0?1?2?3???4

0?1?5?2?3??cat

1?2?7?8?5??dog

2?3?3?6?7?horse

3?2?2?8?3??duck

4?4?4?2?1?mouse

6)或者可以使用read_csv()函數(shù)的時候滩租,使用names指定表頭,直接把存有各列名稱的數(shù)組賦給它即可

>>> csvframe=pd.read_csv('/Users/caojin/Desktop/myCSV_02.csv',names=['a','b','v','d','w'])

>>> csvframe

??a?b?v?d???w

0?1?5?2?3??cat

1?2?7?8?5??dog

2?3?3?6?7?horse

3?2?2?8?3??duck

4?4?4?2?1?mouse

7)讀取csv創(chuàng)建一個具有等級結(jié)構(gòu)的DataFrame利朵,可以read_csv()指定index_col選項律想,把想要轉(zhuǎn)換為索引的列名稱賦給index_col

>>> csvframe=pd.read_csv('/Users/caojin/Desktop/myCSV_03.csv',index_col=['colors','status'])

>>> csvframe

????????item1?item2?item3

colors status

black?up?????3???4???6

????down????2???6???7

white?up?????5???5???5

????down????3???3???2

????left????1???2???1

red??up?????2???2???2

????down????1???1???4

3.用RegExp(正則表達(dá)式)解析txt文件

1)不是以逗號或者分號分割的需要用read_table()函數(shù),并指sep選項為一個正則表達(dá)式

2)以下txt元素都以一個或者多個制表符或者空格相隔

>>> import pandas as pd

>>> txtframe=pd.read_table('/Users/caojin/Desktop/ch05_04.txt',sep='\s*')

>>> txtframe

??White?red?blue?green

0???1??5???2???3

1???2??7???8???5

2???3??3???6???7

3)以下txt文件中數(shù)字和字母雜糅绍弟,需要從中抽取數(shù)字部分技即,無表頭需要用header設(shè)置成None

>>> txtframe=pd.read_table('/Users/caojin/Desktop/ch05_05.txt',sep='\D*',header=None)

>>> txtframe

??0??1??2

0?0?123?122

1?1?124?321

2?2?125?333

4)使用skiprows可以排除多余的行

如果排除前5行

如果只排除第5行

>>> txtframe=pd.read_table('/Users/caojin/Desktop/ch05_06.txt',sep=',',skiprows=[0,1,3,6])

>>> txtframe

??White?red?blue green animal

0???1??5???4??cat??NaN

1???2??7???8???5??dog

2???3??3???6???7?horse

3???2??2???8???3??duck

4???4??4???2???1?mouse

4.從txt文件中讀取部分?jǐn)?shù)據(jù)

只想讀取文件一部分,可明確指定要解析的行號這時要用到nrows和skiprows選項樟遣,可以指定起始行n(n=skiprows)和從起始行往后讀多少行(nrows=i)

>>> csvframe=pd.read_csv('/Users/caojin/Desktop/myCSV_02.csv',skiprows=[2],nrows=3,header=None)

>>> csvframe

??0?1?2?3???4

0?1?5?2?3??cat

1?2?7?8?5??dog

2?2?2?8?3?duck

5.切分想要的文本而叼,遍歷各個部分逐一對其執(zhí)行某一特定操作

對于一列數(shù)字,每隔兩行取一個累加起來豹悬,最后把和插入倒Series對象中(暫時略過)

6.往csv文件寫入數(shù)據(jù)

1)攜帶索引和列名的寫入

>>> import numpy as np

>>> import pandas as pd

>>> frame=pd.DataFrame(np.arange(16).reshape(4,4),index=[1,2,3,4],columns=['a','b','c','d'])

>>> frame

??a??b??c??d

1??0??1??2??3

2??4??5??6??7

3??8??9?10?11

4?12?13?14?15

>>> frame.to_csv('/Users/caojin/Desktop/ch05_07.csv')

2)取消攜帶索引和列名

>>> frame.to_csv('/Users/caojin/Desktop/ch05_08.csv',index=False,header=False)

3)數(shù)據(jù)結(jié)構(gòu)中的NaN寫入csv文件后顯示為空

4)使用to_csv()函數(shù)中的na_rep選項把空字段替換為你需要的值

>>> frame=pd.read_csv('/Users/caojin/Desktop/ch05_09.csv')

>>> frame

??Unnamed: 0??a??b??c??d

0??????0?NaN?2.0?NaN?1.0

1??????1?1.0?NaN?NaN?2.0

2??????2?NaN?1.0?NaN?NaN

3??????3?1.0?NaN??d?NaN

4??????4?3.0?3.0?NaN?5.0

>>> frame.to_csv('/Users/caojin/Desktop/ch05_10.csv',na_rep='h')

7.安裝html5lib模塊

8.寫入數(shù)據(jù)到HTML文件

>>> frame=pd.DataFrame(ny.arange(16).reshape(4,4))

>>> frame

??0??1??2??3

0??0??1??2??3

1??4??5??6??7

2??8??9?10?11

3?12?13?14?15

>>> print(frame.to_html())




???0

???1

???2

???3





???0

???0

???1

???2

???3



???1

???4

???5

???6

???7



???2

???8

???9

???10

???11



???3

???12

???13

???14

???15



9.從HTML文件讀取數(shù)據(jù)

1)先寫入DataFrame到一個網(wǎng)頁

>>> import html5lib

>>> import numpy as np

>>> import pandas as pd

>>> frame=pd.DataFrame(np.random.random((4,4)),index=['white','black','red','blue'],columns=['up','down','right','left'])

>>> frame

???????up???down???right???left

white?0.003468?0.319286?0.713373?0.169162

black?0.228553?0.289013?0.263125?0.817748

red??0.032618?0.286309?0.099676?0.765746

blue??0.824121?0.820978?0.858056?0.468772

>>> s=['']

>>> s.append('MY DATAFRAME')

>>> s.append('')

>>> s.append(frame.to_html())

>>> s.append('')

>>> html=''.join(s)

>>> html_file=open('/Users/caojin/Desktop/myFrame.html','w')

>>> html_file.write(html)

835

>>> html_file.close()

·2)從這個網(wǎng)頁讀取數(shù)據(jù)

>>> import lxml

>>> import numpy as np

>>> import pandas as pd

>>> web_frames=pd.read_html('file:///Users/caojin/Desktop/myFrame.html')

>>> web_frames

[?Unnamed: 0????up???down???right???left

0???white?0.003468?0.319286?0.713373?0.169162

1???black?0.228553?0.289013?0.263125?0.817748

2????red?0.032618?0.286309?0.099676?0.765746

3????blue?0.824121?0.820978?0.858056?0.468772]

10.從XML讀取數(shù)據(jù)(暫時略過)

11.讀寫excel文件

1)讀出excel文件

>>> pd.read_excel('/Users/caojin/Desktop/data.xls')

??white?red?green?black

a???12??23???17???18

b???22??16???19???18

c???14??23???22???21

>>> pd.read_excel('/Users/caojin/Desktop/data.xls','Sheet2')

??yellow?purple?blue?orange

A???11???16??44???22

B???20???22??23???44

C???30???31??37???32

>>> pd.read_excel('/Users/caojin/Desktop/data.xls','Sheet1')

??white?red?green?black

a???12??23???17???18

b???22??16???19???18

c???14??23???22???21

>>> pd.read_excel('/Users/caojin/Desktop/data.xls',1)

??yellow?purple?blue?orange

A???11???16??44???22

B???20???22??23???44

C???30???31??37???32

>>> pd.read_excel('/Users/caojin/Desktop/data.xls',0)

??white?red?green?black

a???12??23???17???18

b???22??16???19???18

c???14??23???22???21

2)將dataframe對象寫入xlsx文件中

>>> import numpy as np

>>> import pandas as pd

>>> import xlrd as xd

>>> import openpyxl as oxl

>>> frame=pd.DataFrame(np.random.random((4,4)),index=['white','black','red','blue'],columns=['up','down','right','left'])

>>> frame

???????up???down???right???left

white?0.464526?0.030887?0.893895?0.998091

black?0.116045?0.423554?0.255795?0.407073

red??0.378234?0.124358?0.475569?0.041171

blue??0.764820?0.093615?0.830751?0.356033

>>> frame.to_excel('/Users/caojin/Desktop/data2.xlsx')

12.讀寫json文件

1)將DataFrame轉(zhuǎn)化位json

>>> frame

???????up???down???right???left

white?0.464526?0.030887?0.893895?0.998091

black?0.116045?0.423554?0.255795?0.407073

red??0.378234?0.124358?0.475569?0.041171

blue??0.764820?0.093615?0.830751?0.356033

>>> frame.to_json('/Users/caojin/Desktop/frame.json')

2)讀取json

>>> pd.read_json('/Users/caojin/Desktop/frame.json')

??????down???left???right????up

black?0.423554?0.407073?0.255795?0.116045

blue??0.093615?0.356033?0.830751?0.764820

red??0.124358?0.041171?0.475569?0.378234

white?0.030887?0.998091?0.893895?0.464526

3)復(fù)雜的json文件

編寫復(fù)雜的json文件

以上結(jié)構(gòu)不再是列表形式葵陵,而是一種更為復(fù)雜的形式,因此無法在使用read_json()來處理瞻佛,所以首先要對負(fù)責(zé)的json進(jìn)行格式化規(guī)范化

>>> import numpy as np

>>> import pandas as pd

>>> import json #由于后面要用到j(luò)son.loads()函數(shù)將json文件轉(zhuǎn)化成python結(jié)果脱篙,所以要引入json包

>>> from pandas.io.json import json_normalize #由于后面要將json的數(shù)據(jù)進(jìn)行規(guī)范化所以要引入,規(guī)范化后就產(chǎn)出一個dataframe格式的對象

>>> file=open('/Users/caojin/Desktop/books.json','r') #以只讀的形式打開已經(jīng)存號的json文件

>>> text=file.read() #讀出json內(nèi)容賦值給text

>>> print(text)

[

{

"writer":"mark ross",

"nationalit":"usa",

"books":

[

{

"title":"xmlcookbiook",

"price":23.56

},

{

"title":"python fundamentals",

"price":50.70

},

{

"title":"the numpy library",

"price":12.30

}

]

},

{

"writer":"barbara bracket",

"nationalit":"uk",

"books":

[

{

"title":"java Enterprise",

"price":28.60

},

{

"title":"html5",

"price":31.35

},

{

"title":"python for dummies",

"price":28.00

}

]

}

]

>>> text2=json.loads(text)#利用loads函數(shù)對讀出的text內(nèi)容進(jìn)行轉(zhuǎn)換成python格式

>>> print(text2)

[{'nationalit': 'usa', 'books': [{'title': 'xmlcookbiook', 'price': 23.56}, {'title': 'python fundamentals', 'price': 50.7}, {'title': 'the numpy library', 'price': 12.3}], 'writer': 'mark ross'}, {'nationalit': 'uk', 'books': [{'title': 'java Enterprise', 'price': 28.6}, {'title': 'html5', 'price': 31.35}, {'title': 'python for dummies', 'price': 28.0}], 'writer': 'barbara bracket'}]

>>> text3=json_normalize(text2,'books')#利用json_normalize函數(shù)對text2的內(nèi)容按照books鍵進(jìn)行產(chǎn)出

>>> text3

??price????????title

0?23.56?????xmlcookbiook

1?50.70?python fundamentals

2?12.30??the numpy library

3?28.60???java Enterprise

4?31.35????????html5

5?28.00??python for dummies

然而可以將其余同books統(tǒng)一級別的其他鍵的作為第三個數(shù)組參數(shù)傳入

>>> text4=json_normalize(text2,'books',['writer','nationalit'])

>>> text4

??price????????title nationalit??????writer

0?23.56?????xmlcookbiook????usa????mark ross

1?50.70?python fundamentals????usa????mark ross

2?12.30??the numpy library????usa????mark ross

3?28.60???java Enterprise?????uk?barbara bracket

4?31.35????????html5?????uk?barbara bracket

5?28.00??python for dummies?????uk?barbara bracket

13.HDF5格式

如果想要分析大量數(shù)據(jù)伤柄,最好使用二進(jìn)制格式

python有很多二進(jìn)制數(shù)據(jù)處理工具绊困,HDF5庫比較優(yōu)秀,這種文件的數(shù)據(jù)結(jié)構(gòu)由節(jié)點組成适刀,能夠存儲大量數(shù)據(jù)集

>>> import numpy as np

>>> import pandas as pd

>>> import tables as tb #后續(xù)要使用HDFS函數(shù)必須用這個模塊

>>> from pandas.io.pytables import HDFStore

>>> frame=pd.DataFrame(np.arange(16).reshape(4,4),index=['white','black','red','blue'],columns=['up','down','right','left'])

>>> frame

????up?down?right?left

white??0???1???2???3

black??4???5???6???7

red???8???9???10??11

blue??12??13???14??15

>>> store=HDFStore('/Users/caojin/Desktop/mydata.h5')#創(chuàng)建一個h5格式文件

>>> store['obj1']=frame#將dataframe對象放入倒h5中

>>> store['obj1']

????up?down?right?left

white??0???1???2???3

black??4???5???6???7

red???8???9???10??11

blue??12??13???14??15

14.pickle--python對象序列化

15.用cPickle實現(xiàn)Python對象序列化

序列化=將對象的層級結(jié)構(gòu)轉(zhuǎn)換位字節(jié)流的過程

16.用pandas實現(xiàn)對象序列化

>>> import pandas as pd

>>> import numpy as np

>>> frame=pd.DataFrame(np.arange(16).reshape(4,4),index=['white','black','red','blue'],columns=['up','down','right','left'])

>>> frame

????up?down?right?left

white??0???1???2???3

black??4???5???6???7

red???8???9???10??11

blue??12??13???14??15

>>> frame.to_pickle('/Users/caojin/Desktop/frame.pkl')

>>> pd.read_pickle('/Users/caojin/Desktop/frame.pkl')#反序列化

????up?down?right?left

white??0???1???2???3

black??4???5???6???7

red???8???9???10??11

blue??12??13???14??15

17.對接數(shù)據(jù)庫--mysql數(shù)據(jù)庫連接(這個之后找時間專門寫一天筆記)

1)python直接鏈接數(shù)據(jù)庫(這個之后找時間專門寫一天的筆記)

2)Python借助pandas鏈接數(shù)據(jù)庫

import pandas as pd

import MySQLdb

conn=MySQLdb.connect(host="localhot",user="root",passwd="*****",db="test",charset="utf8")

sql = "select * from user limit 3"

df = pd.read_sql(sql,conn,index_col="id")

print df

cur = conn.cursor()

cur.execute("drop table if exists user")

cur.execute('create table user(id int,name varchar(20))' )

pd.io.sql.write_frame(df,"user",conn)

18.SQLite3數(shù)據(jù)讀寫(暫時略過)

19.PostgreSQL數(shù)據(jù)讀寫(展示略過)

20.NoSQL數(shù)據(jù)庫MongDB數(shù)據(jù)讀寫(展示略過)

最后編輯于
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請聯(lián)系作者
  • 序言:七十年代末秤朗,一起剝皮案震驚了整個濱河市,隨后出現(xiàn)的幾起案子笔喉,更是在濱河造成了極大的恐慌取视,老刑警劉巖,帶你破解...
    沈念sama閱讀 218,640評論 6 507
  • 序言:濱河連續(xù)發(fā)生了三起死亡事件常挚,死亡現(xiàn)場離奇詭異贫途,居然都是意外死亡,警方通過查閱死者的電腦和手機(jī)待侵,發(fā)現(xiàn)死者居然都...
    沈念sama閱讀 93,254評論 3 395
  • 文/潘曉璐 我一進(jìn)店門丢早,熙熙樓的掌柜王于貴愁眉苦臉地迎上來,“玉大人,你說我怎么就攤上這事怨酝】酰” “怎么了?”我有些...
    開封第一講書人閱讀 165,011評論 0 355
  • 文/不壞的土叔 我叫張陵农猬,是天一觀的道長赡艰。 經(jīng)常有香客問我,道長斤葱,這世上最難降的妖魔是什么慷垮? 我笑而不...
    開封第一講書人閱讀 58,755評論 1 294
  • 正文 為了忘掉前任,我火速辦了婚禮揍堕,結(jié)果婚禮上料身,老公的妹妹穿的比我還像新娘。我一直安慰自己衩茸,他們只是感情好芹血,可當(dāng)我...
    茶點故事閱讀 67,774評論 6 392
  • 文/花漫 我一把揭開白布。 她就那樣靜靜地躺著楞慈,像睡著了一般幔烛。 火紅的嫁衣襯著肌膚如雪。 梳的紋絲不亂的頭發(fā)上囊蓝,一...
    開封第一講書人閱讀 51,610評論 1 305
  • 那天饿悬,我揣著相機(jī)與錄音,去河邊找鬼聚霜。 笑死乡恕,一個胖子當(dāng)著我的面吹牛,可吹牛的內(nèi)容都是我干的俯萎。 我是一名探鬼主播傲宜,決...
    沈念sama閱讀 40,352評論 3 418
  • 文/蒼蘭香墨 我猛地睜開眼,長吁一口氣:“原來是場噩夢啊……” “哼夫啊!你這毒婦竟也來了函卒?” 一聲冷哼從身側(cè)響起,我...
    開封第一講書人閱讀 39,257評論 0 276
  • 序言:老撾萬榮一對情侶失蹤撇眯,失蹤者是張志新(化名)和其女友劉穎报嵌,沒想到半個月后,有當(dāng)?shù)厝嗽跇淞掷锇l(fā)現(xiàn)了一具尸體熊榛,經(jīng)...
    沈念sama閱讀 45,717評論 1 315
  • 正文 獨居荒郊野嶺守林人離奇死亡锚国,尸身上長有42處帶血的膿包…… 初始之章·張勛 以下內(nèi)容為張勛視角 年9月15日...
    茶點故事閱讀 37,894評論 3 336
  • 正文 我和宋清朗相戀三年,在試婚紗的時候發(fā)現(xiàn)自己被綠了玄坦。 大學(xué)時的朋友給我發(fā)了我未婚夫和他白月光在一起吃飯的照片血筑。...
    茶點故事閱讀 40,021評論 1 350
  • 序言:一個原本活蹦亂跳的男人離奇死亡绘沉,死狀恐怖,靈堂內(nèi)的尸體忽然破棺而出豺总,到底是詐尸還是另有隱情车伞,我是刑警寧澤,帶...
    沈念sama閱讀 35,735評論 5 346
  • 正文 年R本政府宣布喻喳,位于F島的核電站另玖,受9級特大地震影響,放射性物質(zhì)發(fā)生泄漏表伦。R本人自食惡果不足惜谦去,卻給世界環(huán)境...
    茶點故事閱讀 41,354評論 3 330
  • 文/蒙蒙 一、第九天 我趴在偏房一處隱蔽的房頂上張望蹦哼。 院中可真熱鬧鳄哭,春花似錦、人聲如沸翔怎。這莊子的主人今日做“春日...
    開封第一講書人閱讀 31,936評論 0 22
  • 文/蒼蘭香墨 我抬頭看了看天上的太陽赤套。三九已至,卻和暖如春珊膜,著一層夾襖步出監(jiān)牢的瞬間容握,已是汗流浹背。 一陣腳步聲響...
    開封第一講書人閱讀 33,054評論 1 270
  • 我被黑心中介騙來泰國打工车柠, 沒想到剛下飛機(jī)就差點兒被人妖公主榨干…… 1. 我叫王不留剔氏,地道東北人。 一個月前我還...
    沈念sama閱讀 48,224評論 3 371
  • 正文 我出身青樓竹祷,卻偏偏與公主長得像谈跛,于是被迫代替她去往敵國和親。 傳聞我的和親對象是個殘疾皇子塑陵,可洞房花燭夜當(dāng)晚...
    茶點故事閱讀 44,974評論 2 355

推薦閱讀更多精彩內(nèi)容