1.I/O API工具
2.讀取csv或文本文件中的數(shù)據(jù)
1)創(chuàng)建一個.csv文件
White,red,blue,green,animal
1,5,2,3,cat
2,7,8,5,dog
3,3,6,7,horse
2,2,8,3,duck
4,4,2,1,mouse
2)用read_csv()函數(shù)讀取他的內(nèi)容剧浸,同時將其轉(zhuǎn)化為DataFrame對象
>>> csvframe=pd.read_csv('/Users/caojin/Desktop/myCSV_01.csv')
>>> csvframe
??White?red?blue?green animal
0???1??5???2???3??cat
1???2??7???8???5??dog
2???3??3???6???7?horse
3???2??2???8???3??duck
4???4??4???2???1?mouse
3)既然csv也是文本文件卡辰,還可以使用read_table()函數(shù)但是得指定分隔符
>>> csvframe=pd.read_table('/Users/caojin/Desktop/myCSV_01.csv',sep=',')
>>> csvframe
??White?red?blue?green animal
0???1??5???2???3??cat
1???2??7???8???5??dog
2???3??3???6???7?horse
3???2??2???8???3??duck
4???4??4???2???1?mouse
4)上面例子逊朽,標(biāo)識各列的表頭位于csv文件的第一行,但一般情況并非如此航厚,可能第一行就是列表數(shù)據(jù)如下:
5)沒有表頭的數(shù)據(jù)使用read_csv()函數(shù)時候肉康,使用header選項,將其設(shè)置為None抡句,pandas會自動為其添加默認(rèn)表頭
未使用header
>>> csvframe=pd.read_csv('/Users/caojin/Desktop/myCSV_02.csv')
>>> csvframe
??1?5?2?3??cat
0?2?7?8?5??dog
1?3?3?6?7?horse
2?2?2?8?3??duck
3?4?4?2?1?mouse
使用header
>>> csvframe=pd.read_csv('/Users/caojin/Desktop/myCSV_02.csv',header=None)
>>> csvframe
??0?1?2?3???4
0?1?5?2?3??cat
1?2?7?8?5??dog
2?3?3?6?7?horse
3?2?2?8?3??duck
4?4?4?2?1?mouse
6)或者可以使用read_csv()函數(shù)的時候滩租,使用names指定表頭,直接把存有各列名稱的數(shù)組賦給它即可
>>> csvframe=pd.read_csv('/Users/caojin/Desktop/myCSV_02.csv',names=['a','b','v','d','w'])
>>> csvframe
??a?b?v?d???w
0?1?5?2?3??cat
1?2?7?8?5??dog
2?3?3?6?7?horse
3?2?2?8?3??duck
4?4?4?2?1?mouse
7)讀取csv創(chuàng)建一個具有等級結(jié)構(gòu)的DataFrame利朵,可以read_csv()指定index_col選項律想,把想要轉(zhuǎn)換為索引的列名稱賦給index_col
>>> csvframe=pd.read_csv('/Users/caojin/Desktop/myCSV_03.csv',index_col=['colors','status'])
>>> csvframe
????????item1?item2?item3
colors status
black?up?????3???4???6
????down????2???6???7
white?up?????5???5???5
????down????3???3???2
????left????1???2???1
red??up?????2???2???2
????down????1???1???4
3.用RegExp(正則表達(dá)式)解析txt文件
1)不是以逗號或者分號分割的需要用read_table()函數(shù),并指sep選項為一個正則表達(dá)式
2)以下txt元素都以一個或者多個制表符或者空格相隔
>>> import pandas as pd
>>> txtframe=pd.read_table('/Users/caojin/Desktop/ch05_04.txt',sep='\s*')
>>> txtframe
??White?red?blue?green
0???1??5???2???3
1???2??7???8???5
2???3??3???6???7
3)以下txt文件中數(shù)字和字母雜糅绍弟,需要從中抽取數(shù)字部分技即,無表頭需要用header設(shè)置成None
>>> txtframe=pd.read_table('/Users/caojin/Desktop/ch05_05.txt',sep='\D*',header=None)
>>> txtframe
??0??1??2
0?0?123?122
1?1?124?321
2?2?125?333
4)使用skiprows可以排除多余的行
如果排除前5行
如果只排除第5行
>>> txtframe=pd.read_table('/Users/caojin/Desktop/ch05_06.txt',sep=',',skiprows=[0,1,3,6])
>>> txtframe
??White?red?blue green animal
0???1??5???4??cat??NaN
1???2??7???8???5??dog
2???3??3???6???7?horse
3???2??2???8???3??duck
4???4??4???2???1?mouse
4.從txt文件中讀取部分?jǐn)?shù)據(jù)
只想讀取文件一部分,可明確指定要解析的行號這時要用到nrows和skiprows選項樟遣,可以指定起始行n(n=skiprows)和從起始行往后讀多少行(nrows=i)
>>> csvframe=pd.read_csv('/Users/caojin/Desktop/myCSV_02.csv',skiprows=[2],nrows=3,header=None)
>>> csvframe
??0?1?2?3???4
0?1?5?2?3??cat
1?2?7?8?5??dog
2?2?2?8?3?duck
5.切分想要的文本而叼,遍歷各個部分逐一對其執(zhí)行某一特定操作
對于一列數(shù)字,每隔兩行取一個累加起來豹悬,最后把和插入倒Series對象中(暫時略過)
6.往csv文件寫入數(shù)據(jù)
1)攜帶索引和列名的寫入
>>> import numpy as np
>>> import pandas as pd
>>> frame=pd.DataFrame(np.arange(16).reshape(4,4),index=[1,2,3,4],columns=['a','b','c','d'])
>>> frame
??a??b??c??d
1??0??1??2??3
2??4??5??6??7
3??8??9?10?11
4?12?13?14?15
>>> frame.to_csv('/Users/caojin/Desktop/ch05_07.csv')
2)取消攜帶索引和列名
>>> frame.to_csv('/Users/caojin/Desktop/ch05_08.csv',index=False,header=False)
3)數(shù)據(jù)結(jié)構(gòu)中的NaN寫入csv文件后顯示為空
4)使用to_csv()函數(shù)中的na_rep選項把空字段替換為你需要的值
>>> frame=pd.read_csv('/Users/caojin/Desktop/ch05_09.csv')
>>> frame
??Unnamed: 0??a??b??c??d
0??????0?NaN?2.0?NaN?1.0
1??????1?1.0?NaN?NaN?2.0
2??????2?NaN?1.0?NaN?NaN
3??????3?1.0?NaN??d?NaN
4??????4?3.0?3.0?NaN?5.0
>>> frame.to_csv('/Users/caojin/Desktop/ch05_10.csv',na_rep='h')
7.安裝html5lib模塊
8.寫入數(shù)據(jù)到HTML文件
>>> frame=pd.DataFrame(ny.arange(16).reshape(4,4))
>>> frame
??0??1??2??3
0??0??1??2??3
1??4??5??6??7
2??8??9?10?11
3?12?13?14?15
>>> print(frame.to_html())
???0
???1
???2
???3
???0
???0
???1
???2
???3
???1
???4
???5
???6
???7
???2
???8
???9
???10
???11
???3
???12
???13
???14
???15
9.從HTML文件讀取數(shù)據(jù)
1)先寫入DataFrame到一個網(wǎng)頁
>>> import html5lib
>>> import numpy as np
>>> import pandas as pd
>>> frame=pd.DataFrame(np.random.random((4,4)),index=['white','black','red','blue'],columns=['up','down','right','left'])
>>> frame
???????up???down???right???left
white?0.003468?0.319286?0.713373?0.169162
black?0.228553?0.289013?0.263125?0.817748
red??0.032618?0.286309?0.099676?0.765746
blue??0.824121?0.820978?0.858056?0.468772
>>> s=['']
>>> s.append('MY DATAFRAME')
>>> s.append('')
>>> s.append(frame.to_html())
>>> s.append('')
>>> html=''.join(s)
>>> html_file=open('/Users/caojin/Desktop/myFrame.html','w')
>>> html_file.write(html)
835
>>> html_file.close()
·2)從這個網(wǎng)頁讀取數(shù)據(jù)
>>> import lxml
>>> import numpy as np
>>> import pandas as pd
>>> web_frames=pd.read_html('file:///Users/caojin/Desktop/myFrame.html')
>>> web_frames
[?Unnamed: 0????up???down???right???left
0???white?0.003468?0.319286?0.713373?0.169162
1???black?0.228553?0.289013?0.263125?0.817748
2????red?0.032618?0.286309?0.099676?0.765746
3????blue?0.824121?0.820978?0.858056?0.468772]
10.從XML讀取數(shù)據(jù)(暫時略過)
11.讀寫excel文件
1)讀出excel文件
>>> pd.read_excel('/Users/caojin/Desktop/data.xls')
??white?red?green?black
a???12??23???17???18
b???22??16???19???18
c???14??23???22???21
>>> pd.read_excel('/Users/caojin/Desktop/data.xls','Sheet2')
??yellow?purple?blue?orange
A???11???16??44???22
B???20???22??23???44
C???30???31??37???32
>>> pd.read_excel('/Users/caojin/Desktop/data.xls','Sheet1')
??white?red?green?black
a???12??23???17???18
b???22??16???19???18
c???14??23???22???21
>>> pd.read_excel('/Users/caojin/Desktop/data.xls',1)
??yellow?purple?blue?orange
A???11???16??44???22
B???20???22??23???44
C???30???31??37???32
>>> pd.read_excel('/Users/caojin/Desktop/data.xls',0)
??white?red?green?black
a???12??23???17???18
b???22??16???19???18
c???14??23???22???21
2)將dataframe對象寫入xlsx文件中
>>> import numpy as np
>>> import pandas as pd
>>> import xlrd as xd
>>> import openpyxl as oxl
>>> frame=pd.DataFrame(np.random.random((4,4)),index=['white','black','red','blue'],columns=['up','down','right','left'])
>>> frame
???????up???down???right???left
white?0.464526?0.030887?0.893895?0.998091
black?0.116045?0.423554?0.255795?0.407073
red??0.378234?0.124358?0.475569?0.041171
blue??0.764820?0.093615?0.830751?0.356033
>>> frame.to_excel('/Users/caojin/Desktop/data2.xlsx')
12.讀寫json文件
1)將DataFrame轉(zhuǎn)化位json
>>> frame
???????up???down???right???left
white?0.464526?0.030887?0.893895?0.998091
black?0.116045?0.423554?0.255795?0.407073
red??0.378234?0.124358?0.475569?0.041171
blue??0.764820?0.093615?0.830751?0.356033
>>> frame.to_json('/Users/caojin/Desktop/frame.json')
2)讀取json
>>> pd.read_json('/Users/caojin/Desktop/frame.json')
??????down???left???right????up
black?0.423554?0.407073?0.255795?0.116045
blue??0.093615?0.356033?0.830751?0.764820
red??0.124358?0.041171?0.475569?0.378234
white?0.030887?0.998091?0.893895?0.464526
3)復(fù)雜的json文件
編寫復(fù)雜的json文件
以上結(jié)構(gòu)不再是列表形式葵陵,而是一種更為復(fù)雜的形式,因此無法在使用read_json()來處理瞻佛,所以首先要對負(fù)責(zé)的json進(jìn)行格式化規(guī)范化
>>> import numpy as np
>>> import pandas as pd
>>> import json #由于后面要用到j(luò)son.loads()函數(shù)將json文件轉(zhuǎn)化成python結(jié)果脱篙,所以要引入json包
>>> from pandas.io.json import json_normalize #由于后面要將json的數(shù)據(jù)進(jìn)行規(guī)范化所以要引入,規(guī)范化后就產(chǎn)出一個dataframe格式的對象
>>> file=open('/Users/caojin/Desktop/books.json','r') #以只讀的形式打開已經(jīng)存號的json文件
>>> text=file.read() #讀出json內(nèi)容賦值給text
>>> print(text)
[
{
"writer":"mark ross",
"nationalit":"usa",
"books":
[
{
"title":"xmlcookbiook",
"price":23.56
},
{
"title":"python fundamentals",
"price":50.70
},
{
"title":"the numpy library",
"price":12.30
}
]
},
{
"writer":"barbara bracket",
"nationalit":"uk",
"books":
[
{
"title":"java Enterprise",
"price":28.60
},
{
"title":"html5",
"price":31.35
},
{
"title":"python for dummies",
"price":28.00
}
]
}
]
>>> text2=json.loads(text)#利用loads函數(shù)對讀出的text內(nèi)容進(jìn)行轉(zhuǎn)換成python格式
>>> print(text2)
[{'nationalit': 'usa', 'books': [{'title': 'xmlcookbiook', 'price': 23.56}, {'title': 'python fundamentals', 'price': 50.7}, {'title': 'the numpy library', 'price': 12.3}], 'writer': 'mark ross'}, {'nationalit': 'uk', 'books': [{'title': 'java Enterprise', 'price': 28.6}, {'title': 'html5', 'price': 31.35}, {'title': 'python for dummies', 'price': 28.0}], 'writer': 'barbara bracket'}]
>>> text3=json_normalize(text2,'books')#利用json_normalize函數(shù)對text2的內(nèi)容按照books鍵進(jìn)行產(chǎn)出
>>> text3
??price????????title
0?23.56?????xmlcookbiook
1?50.70?python fundamentals
2?12.30??the numpy library
3?28.60???java Enterprise
4?31.35????????html5
5?28.00??python for dummies
然而可以將其余同books統(tǒng)一級別的其他鍵的作為第三個數(shù)組參數(shù)傳入
>>> text4=json_normalize(text2,'books',['writer','nationalit'])
>>> text4
??price????????title nationalit??????writer
0?23.56?????xmlcookbiook????usa????mark ross
1?50.70?python fundamentals????usa????mark ross
2?12.30??the numpy library????usa????mark ross
3?28.60???java Enterprise?????uk?barbara bracket
4?31.35????????html5?????uk?barbara bracket
5?28.00??python for dummies?????uk?barbara bracket
13.HDF5格式
如果想要分析大量數(shù)據(jù)伤柄,最好使用二進(jìn)制格式
python有很多二進(jìn)制數(shù)據(jù)處理工具绊困,HDF5庫比較優(yōu)秀,這種文件的數(shù)據(jù)結(jié)構(gòu)由節(jié)點組成适刀,能夠存儲大量數(shù)據(jù)集
>>> import numpy as np
>>> import pandas as pd
>>> import tables as tb #后續(xù)要使用HDFS函數(shù)必須用這個模塊
>>> from pandas.io.pytables import HDFStore
>>> frame=pd.DataFrame(np.arange(16).reshape(4,4),index=['white','black','red','blue'],columns=['up','down','right','left'])
>>> frame
????up?down?right?left
white??0???1???2???3
black??4???5???6???7
red???8???9???10??11
blue??12??13???14??15
>>> store=HDFStore('/Users/caojin/Desktop/mydata.h5')#創(chuàng)建一個h5格式文件
>>> store['obj1']=frame#將dataframe對象放入倒h5中
>>> store['obj1']
????up?down?right?left
white??0???1???2???3
black??4???5???6???7
red???8???9???10??11
blue??12??13???14??15
14.pickle--python對象序列化
15.用cPickle實現(xiàn)Python對象序列化
序列化=將對象的層級結(jié)構(gòu)轉(zhuǎn)換位字節(jié)流的過程
16.用pandas實現(xiàn)對象序列化
>>> import pandas as pd
>>> import numpy as np
>>> frame=pd.DataFrame(np.arange(16).reshape(4,4),index=['white','black','red','blue'],columns=['up','down','right','left'])
>>> frame
????up?down?right?left
white??0???1???2???3
black??4???5???6???7
red???8???9???10??11
blue??12??13???14??15
>>> frame.to_pickle('/Users/caojin/Desktop/frame.pkl')
>>> pd.read_pickle('/Users/caojin/Desktop/frame.pkl')#反序列化
????up?down?right?left
white??0???1???2???3
black??4???5???6???7
red???8???9???10??11
blue??12??13???14??15
17.對接數(shù)據(jù)庫--mysql數(shù)據(jù)庫連接(這個之后找時間專門寫一天筆記)
1)python直接鏈接數(shù)據(jù)庫(這個之后找時間專門寫一天的筆記)
2)Python借助pandas鏈接數(shù)據(jù)庫
import pandas as pd
import MySQLdb
conn=MySQLdb.connect(host="localhot",user="root",passwd="*****",db="test",charset="utf8")
sql = "select * from user limit 3"
df = pd.read_sql(sql,conn,index_col="id")
print df
cur = conn.cursor()
cur.execute("drop table if exists user")
cur.execute('create table user(id int,name varchar(20))' )
pd.io.sql.write_frame(df,"user",conn)
18.SQLite3數(shù)據(jù)讀寫(暫時略過)
19.PostgreSQL數(shù)據(jù)讀寫(展示略過)
20.NoSQL數(shù)據(jù)庫MongDB數(shù)據(jù)讀寫(展示略過)