Data Loading, Storage, and File Formats

Text Formats

1. Parsing functions

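Parsing function	Description
read_csv	Load delimited data from a file, URL, or file-like object (default delimiter: comma)
read_table	Load delimited data from a file, URL, or file-like object (default delimiter: tab, '\t')
read_fwf	Read data in fixed-width column format (no delimiters)
read_clipboard	Read data from the clipboard (a clipboard version of read_table)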

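- Features: indexing, type inference and data conversion, date parsing, iterating over chunks, and handling messy data

- Parameters:

- Custom column names: names=[...]

- Use a column as the index: index_col='col_name'

- Delimiter: sep='\s+' handles fields separated by a variable amount of whitespace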
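
The three options above can be tried on in-memory text. This is a minimal sketch: the sample data and the column names are made up for illustration, and `io.StringIO` stands in for a real file path.

```python
import io
import pandas as pd

# Hypothetical sample data with no header row, comma-separated
csv_text = "1,2,3,hello\n5,6,7,world\n"
df = pd.read_csv(
    io.StringIO(csv_text),
    names=['a', 'b', 'c', 'message'],  # supply the column names ourselves
    index_col='message',               # use the 'message' column as the index
)

# Hypothetical sample data whose fields are separated by varying whitespace
ws_text = "A   B  C\n1 2     3\n4    5 6\n"
ws = pd.read_csv(io.StringIO(ws_text), sep=r'\s+')

print(df)
print(ws)
```

Passing `names` without `header` makes pandas treat the first line as data rather than as a header row.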

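read_csv/read_table argument	Description
path	File system location, URL, or file-like object
sep or delimiter	Delimiter (a character sequence or regular expression) used to split fields
header	Row number to use as column names; default 0, use None if there is no header row
index_col	Column number(s) or name(s) to use as the row index
skiprows	Number of rows at the start of the file to skip (or a list of row numbers)
na_values	Sequence of values to treat as NA
comment	Character(s) marking the start of a comment; the rest of the line is dropped
parse_dates	Attempt to parse data as dates; default False; a list of column numbers or names can be given
keep_date_col	When several columns are joined to parse a date, keep the joined columns
converters	Dict mapping column numbers or names to functions; `{'foo': f}` applies function f to every value in column 'foo'
dayfirst	When parsing ambiguous dates, treat the day as coming first (7/6/2012 becomes June 7, 2012); default False
date_parser	Function to use to parse dates
nrows	Number of rows to read from the start of the file
iterator	Return a TextParser object for reading the file piece by piece
chunksize	Size of the file chunks (in rows) when iterating
skip_footer	Number of rows to ignore at the end of the file (spelled skipfooter in newer pandas versions)
verbose	Print parser output information
encoding	Text encoding of the file, e.g. 'utf-8'
squeeze	If the parsed data contains only one column, return a Series
thousands	Thousands separator, e.g. ',' or '.'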
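
A few of these arguments are easier to follow in combination. The sketch below is only an illustration: the text, the column names, and the 'missing' marker are invented, and `io.StringIO` again stands in for a file.

```python
import io
import pandas as pd

# Hypothetical file content: one junk line, a header, then three records
raw = (
    "generated by some export tool\n"
    "date,city,temp\n"
    "2012-06-07,Oslo,12.5\n"
    "2012-06-08,Rome,missing\n"
    "2012-06-09,Cairo,35.0\n"
)

df = pd.read_csv(
    io.StringIO(raw),
    skiprows=1,              # skip the junk line before the header
    na_values=['missing'],   # treat this marker as NA
    parse_dates=['date'],    # parse the 'date' column as datetimes
    nrows=2,                 # read only the first two data rows
)
print(df)
print(df.dtypes)
```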

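Reading in chunks:

- Read only the first few rows: nrows=5 reads 5 rows

- Read in chunks: chunksize=...; read_csv then returns a TextParser object (TextFileReader in current pandas) that can be iterated over

import pandas as pd

chunker = pd.read_csv('a.csv', chunksize=1000)  # yields 1000 rows at a time

tot = pd.Series([], dtype='float64')
for piece in chunker:
    # accumulate the value counts of the 'key' column across all chunks
    tot = tot.add(piece['key'].value_counts(), fill_value=0)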
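
The same aggregation can be run without a file on disk. This is a small sketch under assumptions: the five-row text and the chunk size of 2 are made up so that the chunk boundaries are easy to see.

```python
import io
import pandas as pd

# Hypothetical data: the 'key' column holds a, b, a, c, a
text = "key,value\na,1\nb,2\na,3\nc,4\na,5\n"
chunker = pd.read_csv(io.StringIO(text), chunksize=2)  # 2 rows per chunk

tot = pd.Series(dtype='float64')
for piece in chunker:
    # add this chunk's counts to the running total; keys missing on
    # either side count as 0
    tot = tot.add(piece['key'].value_counts(), fill_value=0)

tot = tot.sort_values(ascending=False)
print(tot)
```

Because each chunk is an ordinary DataFrame, only one chunk needs to be in memory at a time.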

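Writing out with to_csv:

- Leave out the row labels: index=False

- Write only some of the columns: columns=[...] (this argument was named cols in old pandas versions)

- Write missing values as a marker string instead of an empty field: na_rep='Null'

Series also has a to_csv method

- Read a CSV as a Series: Series.from_csv() (removed from recent pandas versions; use pd.read_csv and select the column instead)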
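
A minimal sketch of the to_csv options listed above, writing into an in-memory buffer instead of a file. The frame contents are invented, and the column-selection argument is spelled `columns`, which is the name used by current pandas.

```python
import io
import pandas as pd

# Hypothetical frame with one missing value
df = pd.DataFrame({'a': [1, 2], 'b': [float('nan'), 4.0], 'c': ['x', 'y']})

buf = io.StringIO()
df.to_csv(
    buf,
    index=False,         # do not write the row labels
    columns=['a', 'b'],  # write only these columns
    na_rep='NULL',       # write missing values as the string NULL
)
out = buf.getvalue()
print(out)
```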

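Working with delimited formats manually:

import csv

# Read every line once; each line comes back as a list of strings
with open('a.csv', newline='') as f:
    lines = list(csv.reader(f))

for line in lines:
    print(line)

# Reshape into a dict of columns
header, values = lines[0], lines[1:]
data_dict = {h: v for h, v in zip(header, zip(*values))}

# Define a custom format by subclassing csv.Dialect
class my_dialect(csv.Dialect):
    lineterminator = '\n'
    delimiter = ';'
    quotechar = ':'
    quoting = csv.QUOTE_MINIMAL   # required, otherwise the dialect is rejected

f = open('a.csv', newline='')
reader = csv.reader(f, dialect=my_dialect)   # options can also be passed as keywords, e.g. delimiter=';'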

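csv dialect option	Description
`delimiter`	One-character string that separates fields
`lineterminator`	Line terminator used when writing
`quotechar`	Quote character for fields that contain special characters such as the delimiter
`quoting`	Quoting convention; one of csv.QUOTE_ALL, csv.QUOTE_MINIMAL, csv.QUOTE_NONNUMERIC, csv.QUOTE_NONE
`skipinitialspace`	Ignore whitespace after each delimiter
`doublequote`	How to handle the quote character inside a field; if True, it is doubled
`escapechar`	Character used to escape the delimiter when quoting is set to csv.QUOTE_NONE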

Writing delimited files manually: csv.writer

with open('a.csv', 'w') as f:
    writer = csv.writer(f, dialect=my_dialect)
    writer.writerow(...)
    writer.writerow(...)
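
To see what a custom dialect does to the output, the writer and reader can be paired on an in-memory buffer. This is a sketch, not the notes' own example: the dialect name `SemicolonDialect` and the rows are hypothetical.

```python
import csv
import io

class SemicolonDialect(csv.Dialect):  # hypothetical dialect for illustration
    delimiter = ';'
    quotechar = '"'
    doublequote = True
    lineterminator = '\n'
    quoting = csv.QUOTE_MINIMAL       # quote only fields that need it

buf = io.StringIO()
writer = csv.writer(buf, dialect=SemicolonDialect)
writer.writerow(['one', 'two', 'three'])
writer.writerow(['1', 'a;b', '3'])    # this field contains the delimiter

text = buf.getvalue()
print(text)

# Reading back with the same dialect recovers the original fields
rows = list(csv.reader(io.StringIO(text), dialect=SemicolonDialect))
print(rows)
```

With QUOTE_MINIMAL, only the field that contains the delimiter is wrapped in the quote character.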

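2. JSON

One of the standard formats for sending data by HTTP request between web browsers and other applications

import json
import pandas as pd

# Convert between a JSON string and Python objects (dict);
# json.load and json.dump do the same for files
result = json.loads(obj)   # obj is a JSON string

asjson = json.dumps(result)

# Convert to a DataFrame
siblings = pd.DataFrame(result['siblings'], columns=['name', 'age'])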
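
A self-contained version of the round trip, assuming a small made-up JSON document for `obj` (the names and ages are sample values only):

```python
import json
import pandas as pd

# Hypothetical JSON text; null becomes None, arrays become lists
obj = """
{"name": "Wes",
 "pet": null,
 "siblings": [{"name": "Scott", "age": 25, "pet": "Zuko"},
              {"name": "Katie", "age": 33, "pet": "Cisco"}]}
"""

result = json.loads(obj)     # JSON string -> dict
asjson = json.dumps(result)  # dict -> JSON string

# A list of dicts maps naturally onto rows; keep only two of the fields
siblings = pd.DataFrame(result['siblings'], columns=['name', 'age'])
print(siblings)
```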

3. HTML: `lxml`

from lxml.html import parse
from urllib.request import urlopen   # urllib2 in Python 2

# Fetch the page and parse it into a tree of tags
parsed = parse(urlopen('http://finance.yahoo.com/q/op?s=AAPL+Options'))
doc = parsed.getroot()  # get the root node

# Find every <a> tag in the document
links = doc.findall('.//a')

# Get the URL and the display text of one link
lnk = links[28]
lnk.get('href')
lnk.text_content()

# As a list comprehension: the href of every <a> tag
urls = [lnk.get('href') for lnk in doc.findall('.//a')]

# Find the tables
tables = doc.findall('.//table')
calls = tables[9]
puts = tables[13]

# Get the text of every 'td' (or 'th') cell in a row
def _unpack(row, kind='td'):
    elts = row.findall('.//%s' % kind)
    return [val.text_content() for val in elts]


# TextParser converts types automatically: numeric columns become floats
from pandas.io.parsers import TextParser
def parse_options_data(table):
    rows = table.findall('.//tr')            # every <tr> row
    header = _unpack(rows[0], kind='th')    # header row
    data = [_unpack(r) for r in rows[1:]]   # data rows
    return TextParser(data, names=header).get_chunk()

# Build the final DataFrames
call_data = parse_options_data(calls)
put_data = parse_options_data(puts)

4. XML: `lxml.objectify`

A common structured data format that supports hierarchical, nested data with metadata

from lxml import objectify
path = 'filename'
parsed = objectify.parse(open(path))
root = parsed.getroot()

# Extract the data
data = []
skip_field = ['tags', 'to', 'skip']   # tags that should not be extracted
for elt in root.INDICATOR:             # every INDICATOR tag
    el_data = {}                       # one dict per INDICATOR record
    for child in elt.getchildren():    # every child tag of this INDICATOR
        if child.tag in skip_field:
            continue
        el_data[child.tag] = child.pyval  # pyval is the typed Python value; text is the raw string
    data.append(el_data)
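
The extraction loop can be exercised on an inline document. This is a sketch under assumptions: the XML below and its tag names (other than INDICATOR) are invented, and each record is collected into its own dict.

```python
import io
from lxml import objectify

# Hypothetical XML with two INDICATOR records
xml = b"""<PERFORMANCE>
  <INDICATOR>
    <AGENCY_NAME>Metro</AGENCY_NAME>
    <YTD_ACTUAL>96.9</YTD_ACTUAL>
    <DESCRIPTION>free text we do not need</DESCRIPTION>
  </INDICATOR>
  <INDICATOR>
    <AGENCY_NAME>Bus</AGENCY_NAME>
    <YTD_ACTUAL>95.0</YTD_ACTUAL>
    <DESCRIPTION>more free text</DESCRIPTION>
  </INDICATOR>
</PERFORMANCE>"""

root = objectify.parse(io.BytesIO(xml)).getroot()

data = []
skip_fields = ['DESCRIPTION']          # tags to leave out
for elt in root.INDICATOR:             # iterates over every INDICATOR sibling
    el_data = {}
    for child in elt.getchildren():
        if child.tag in skip_fields:
            continue
        el_data[child.tag] = child.pyval   # typed value: str or float here
    data.append(el_data)

print(data)
```

A list of dicts like `data` can be passed directly to the DataFrame constructor.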

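Binary Formats

Uses Python's built-in pickle serialization; recommended only for short-term storage

- Save: frame.save() (frame.to_pickle() in current pandas)

- Load: pd.load() (pd.read_pickle() in current pandas)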
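
A minimal round trip through pickle. Note the assumption: it uses `to_pickle` and `pd.read_pickle`, the names current pandas provides, rather than the older `frame.save()` and `pd.load()` spellings; the frame and the temporary file name are made up.

```python
import os
import tempfile
import pandas as pd

frame = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})  # sample data

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'frame.pkl')   # hypothetical file name
    frame.to_pickle(path)                   # serialize to disk
    restored = pd.read_pickle(path)         # load it back

print(restored.equals(frame))
```

Pickle files are tied to the library versions that wrote them, which is why the format suits short-term storage only.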

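HDF5

HDF5 is an industrial-grade C library, with interfaces in many languages, for efficiently reading and writing scientific data stored on disk in binary format. Each HDF5 file contains a file-system-like node structure. It reads and writes in chunks, which makes it well suited to very large datasets.

Python has two interfaces to it: PyTables and h5py.

pandas has a dict-like HDFStore class that stores objects through PyTables; stored objects are retrieved the same way as dict entries

store = pd.HDFStore('mydata.h5')    # file name
# store objects under keys
store['obj1'] = frame
store['obj1_col1'] = frame['a']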

Microsoft Excel

Requires the xlrd and openpyxl packages

# Create an ExcelFile instance
xls_file = pd.ExcelFile('data.xls')
# Read the sheet named sheet1 into a DataFrame
table = xls_file.parse('sheet1')

HTML and Web APIs: requests

import requests
import json
resp = requests.get(url)        # url is the address to fetch; sends a GET request
data = json.loads(resp.text)    # resp.text is the response body; parse it as JSON
print(data.keys())

Databases

sqlite3

import sqlite3
query = """
CREATE TABLE test
(a VARCHAR(20), b VARCHAR(20),
c REAL,         d INTEGER
);"""
con = sqlite3.connect(':memory:')   # keep the database in memory
# con = sqlite3.connect('urls.db')  # or connect to a database file
con.execute(query)
con.commit()

# Query and fetch all rows
cursor = con.execute('select * from test')
rows = cursor.fetchall()    # or fetchone() for a single row
# The description attribute holds the column information
print(cursor.description)

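Shortcut: the read_frame() function (replaced by read_sql in current pandas)

import pandas.io.sql as sql
# Only the query string and the connection object are needed
sql.read_sql('select * from test', con)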
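
The manual route and the shortcut can be compared side by side. This sketch assumes `pd.read_sql`, the function current pandas offers for this purpose, and inserts two made-up rows so that the query returns something.

```python
import sqlite3
import pandas as pd

con = sqlite3.connect(':memory:')
con.execute(
    "CREATE TABLE test (a VARCHAR(20), b VARCHAR(20), c REAL, d INTEGER);"
)
# Hypothetical rows for illustration
rows_in = [('Atlanta', 'Georgia', 1.25, 6), ('Tallahassee', 'Florida', 2.6, 3)]
con.executemany("INSERT INTO test VALUES (?, ?, ?, ?)", rows_in)
con.commit()

# Manual route: fetch tuples, take column names from cursor.description
cursor = con.execute('select * from test')
rows = cursor.fetchall()
columns = [d[0] for d in cursor.description]
manual = pd.DataFrame(rows, columns=columns)

# Shortcut: pandas runs the query and builds the DataFrame
direct = pd.read_sql('select * from test', con)
con.close()

print(direct)
```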

MongoDB (NoSQL): `pymongo`

import pymongo
con = pymongo.MongoClient('localhost', port=27017)  # connect on the default port (pymongo.Connection in old pymongo versions)
