上次說到了pandas中常用的一些選擇以及切片的方法,總結(jié)一下拴泌,主要有以下幾種:
data['A’]
data[['A','B’]]
data.iloc[:,1] #截取第二列數(shù)據(jù),iloc只能用數(shù)字截取, Select row by integer location
data.loc[2] #截取index label為2的第二行數(shù)據(jù),loc只能用label來截取器联,Select row by label
data.loc[:,'A':'B’] #截取截取從header為’A’到’B’的列
data.ix[:, ‘A’:'B’] #截取從header為’A’到’B’的列
data.ix[:,0:2] #截取前兩列數(shù)據(jù)
有一點(diǎn)要注意:
In pandas version 0.20.0 and above, ix is deprecated and the use of loc and iloc is encouraged instead.
在pandas最新的documentation里,loc和iloc是比較推薦使用的婿崭,在stack overflow上有個(gè)問題專門問了這三種截取方法的區(qū)別拨拓,這里引用一下一遍我們深入了解他們的區(qū)別:
loc works on labels in the index.
iloc works on the positions in the index (so it only takes integers).
ix usually tries to behave like loc but falls back to behaving like iloc if the label is not in the index.
舉個(gè)栗子,如果一個(gè)data frame, 它的index label是mixed type氓栈,既包含數(shù)字類型渣磷,又包含文本類型,那么ix既可用位置數(shù)字去截取授瘦,也可以用label去截取, 但是一定要記住醋界,ix是優(yōu)先label的竟宋,如果label不存在,就會(huì)用位置數(shù)字去截取 (僅僅在mixed type下適用形纺,一旦我們的label是固定類型丘侠,那么ix和loc的作用就完全一樣了)
接下來我想說一說數(shù)據(jù)過濾的問題, 這個(gè)問題也是我在數(shù)據(jù)預(yù)處理的時(shí)候經(jīng)常遇到的問題,我們先來創(chuàng)建一個(gè)數(shù)據(jù)表, 包含一些城市的信息
import pandas as pd
data = pd.read_excel('rhythm.xlsx')
print data
Out[23]:
A B C D city house_price \
0 1 2 3 4 Beijing 70000
1 2 2 3 4 Shanghai 120000
2 3 2 3 4 NaN 5000
3 4 2 3 4 New York 140000
4 5 2 3 4 Brasilia 50000
5 6 2 3 4 Atlanta 20000
6 7 2 3 4 Tokyo 130000
7 8 2 3 4 #NAME 30000
information
0 page not found
1 Shanghai is a Chinese city located on the east...
2 Kunming is also called the Spring city due to ...
3 New York is a state in the northeastern United...
4 Brasília (Portuguese pronunciation: [b?a?zilj?...
5 404 not found
6 Tokyo (Japanese: [to?kjo?] ( listen), English ...
7 Mumbai (/m?m?ba?/; also known as Bombay, the o...
我們可以看到逐样,這個(gè)dataset有很多問題蜗字,比如city列有NaN數(shù)據(jù)缺失的問題,information列有’page not found’的錯(cuò)誤信息脂新,我們要怎么剔除這些信息呢挪捕?
第一問:如果我想把city列含有NaN的行去掉,要怎么做呢争便?
drop_nan = data.dropna(subset=['city'])
print drop_nan
output:
A B C D city house price \
0 1 2 3 4 Beijing 70000
1 2 2 3 4 Shanghai 120000
3 4 2 3 4 New York 140000
4 5 2 3 4 Brasilia 50000
5 6 2 3 4 Atlanta 20000
6 7 2 3 4 Tokyo 130000
information
0 page not found
1 Shanghai is a Chinese city located on the east...
3 New York is a state in the northeastern United...
4 Brasília (Portuguese pronunciation: [b?a?zilj?...
5 404 not found
6 Tokyo (Japanese: [to?kjo?] ( listen), English ...
可以看到级零,我們city缺失的那一行已經(jīng)完全消失了,dropna就是專門用于過濾空值的函數(shù)滞乙,一般來說它有兩種形式:
data.dropna(how='any') #to drop if any value in the row has a nan 只要任何一列包含有na值妄讯,就會(huì)完全刪除那一行
data.dropna(how='all') #to drop if all values in the row are nan 如果所有列的值都是na,才刪除那一行
在這里酷宵,需要注意行和列的區(qū)別
第二問:我想把東京和亞特蘭大那兩行刪掉
drop_value = data[data.city.str.contains('Tokyo', 'Atlanta') == False]
#or
drop_value = data[~data.city.str.contains('Tokyo', 'Atlanta',na=False)]
output:
A B C D city house price \
0 1 2 3 4 Beijing 70000
1 2 2 3 4 Shanghai 120000
2 3 2 3 4 NaN 5000
3 4 2 3 4 New York 140000
4 5 2 3 4 Brasilia 50000
information
0 page not found
1 Shanghai is a Chinese city located on the east...
2 Kunming is also called the Spring city due to ...
3 New York is a state in the northeastern United...
4 Brasília (Portuguese pronunciation: [b?a?zilj?...
第三問,我想把house price 大于等于70000小于等于14000的行提取出來:
在做這一步之前躬窜,有個(gè)很重要的問題需要我們先解決了浇垦,仔細(xì)觀察我們這個(gè)數(shù)據(jù)集,就會(huì)發(fā)現(xiàn)“house price”這個(gè)header是包含有空格的荣挨,這對(duì)我們之后的數(shù)據(jù)處理會(huì)造成很大麻煩男韧,所以我們需要先把這個(gè)空格替換為”_"
data.columns = [c.replace(' ', '_') for c in data.columns]
print data.columns.values
output:
[u'A' u'B' u'C' u'D' u'city' u'house_price' u'information']
接下來就可以進(jìn)行數(shù)字的范圍截取了:
select_range = data[(data.house_price >= 70000) & (data.house_price <= 140000)]
output:
A B C D city house_price \
0 1 2 3 4 Beijing 70000
1 2 2 3 4 Shanghai 120000
3 4 2 3 4 New York 140000
6 7 2 3 4 Tokyo 130000
information
0 page not found
1 Shanghai is a Chinese city located on the east...
3 New York is a state in the northeastern United...
6 Tokyo (Japanese: [to?kjo?] ( listen), English ...
第四問:我想把information那一列中,包含“page not found”, “404 not found”這些文本的行刪除默垄,并且此虑,如果city一列中,開頭的第一個(gè)文本是符號(hào)”#",那么也將這一行刪除
ignore_list = ['404 not found','page not found']
remove_start_with = data[~data['city'].str.startswith('#', na=False)
& ~data['information'].str.contains('|'.join(ignore_list), na=True)]
print remove_start_with
A B C D city house_price \
1 2 2 3 4 Shanghai 120000
2 3 2 3 4 NaN 5000
3 4 2 3 4 New York 140000
4 5 2 3 4 Brasilia 50000
6 7 2 3 4 Tokyo 130000
information
1 Shanghai is a Chinese city located on the east...
2 Kunming is also called the Spring city due to ...
3 New York is a state in the northeastern United...
4 Brasília (Portuguese pronunciation: [b?a?zilj?...
6 Tokyo (Japanese: [to?kjo?] ( listen), English ...
第五問口锭,我想把截取information文本長(zhǎng)度大于30個(gè)詞的行
len_filter = data[data.information.str.len() >= 30]
print len_filter
從上面的各種例子來看朦前,我們就會(huì)發(fā)現(xiàn),pandas在截取數(shù)據(jù)鹃操,過濾數(shù)據(jù)的時(shí)候功能強(qiáng)大韭寸,且寫出的代碼可讀性極高。我會(huì)繼續(xù)復(fù)習(xí)pandas相關(guān)的知識(shí)荆隘,在下一篇文章中說談?wù)勅绾斡胮andas內(nèi)聯(lián)結(jié)兩個(gè)數(shù)據(jù)集恩伺。