Index

1. Summary of common pandas methods
2. pandas time series
3. Time class: datetime
4. String operations
5. Grouping, statistics, joins, and pivot tables
6. Visualization

I. Summary of common pandas methods

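--type() returns the type of an object
--dict([("low",0),("medium",1),("high",2)]) builds a dictionary from key-value pairs
--range(i, j, step) # i: start value; j: stop value (exclusive); step: step size
--Series one-dimensional labelled array
--DataFrame two-dimensional labelled table
--df.shape dimensions of the data as (rows, columns)
--df.info() summary of the table (column dtypes, non-null counts, memory usage)
--df.dtypes data type of each column
--df.isnull() Boolean mask marking missing values
--df['column_name'].unique() unique values in a column
--df.values the data as a NumPy array
--df.columns column names
--df.head() first rows (5 by default)
--df.tail() last rows (5 by default)
--df.drop_duplicates(subset=['col1', 'col2', ...]) drop duplicate rows, comparing only the listed columns
--df['col1'].value_counts() count how often each value occurs in a column
--df.T transpose
--del df['class'] delete a column in place
--df.drop("column_name", axis=1) drop a column; axis defaults to 0, which looks for a row label and raises KeyError if none matches
--df.cumsum() cumulative sum along an axis of a DataFrame or Series
--df.set_index() set the index
--df.reset_index() reset the index to the default integer range
--df.describe() descriptive statistics for the table
--pd.concat([df1, df2]) concatenate objects; they must be passed together as a list
--pd.to_numeric(df['TotalCharges'], errors='coerce') force conversion to numeric; replaces convert_objects(convert_numeric=True), which has been removed from pandas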
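The inspection methods listed above can be tried on a small table. This is a minimal sketch; the DataFrame and its column names are made up for illustration and are not part of the notes.

```python
import pandas as pd

# Hypothetical sample data, for illustration only.
df = pd.DataFrame({
    "city": ["beijing", "shanghai", "beijing", "shenzhen"],
    "price": [1200, 3400, 1200, None],
})

shape = df.shape                          # (rows, columns)
null_counts = df.isnull().sum()           # missing values per column
cities = df["city"].unique()              # distinct values, in order of appearance
city_counts = df["city"].value_counts()   # occurrences of each value
deduped = df.drop_duplicates(subset=["city", "price"])  # rows 0 and 2 are duplicates
without_price = df.drop("price", axis=1)  # axis=1 targets columns
stacked = pd.concat([df, df], ignore_index=True)        # note the list argument
```

None of these calls modifies `df`; each returns a new object.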

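Handling missing values (drop or fill)
--a. Drop rows that contain any missing value: df.dropna(how='any')
--b. Fill missing values with the number 0: df.fillna(value=0)
--c. Fill NA in price with the column mean: df['price'].fillna(df['price'].mean())
--Strip whitespace: df['city']=df['city'].map(str.strip)
--Change case: df['city']=df['city'].str.lower() or df['city']=df['city'].str.upper()
--Change the data type: df['price'].astype('int')
--Rename a column: df.rename(columns={'category': 'category-size'})
--Replace values: df['city'].replace('sh', 'shanghai')
--Note: dropna, fillna, astype, rename, and replace return a new object; assign the result back to keep it
Index operations
--Set the index column: df_inner.set_index('id')
--Reset the index: df.reset_index()
--Sort by the values of a column: df_inner.sort_values(by=['age'])
--Sort by the index: df_inner.sort_index()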
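The cleaning steps above are usually chained, with each result assigned back. A minimal sketch, assuming a made-up table whose columns (`id`, `city`, `price`, `category`) mirror the names used in the notes:

```python
import pandas as pd

# Hypothetical raw data with stray spaces, mixed case, and a bad price.
df = pd.DataFrame({
    "id": [3, 1, 2],
    "city": [" SH ", "beijing", " Shenzhen"],
    "price": ["1200", "n/a", "3400"],
    "category": ["a", "b", "c"],
})

df["city"] = df["city"].str.strip().str.lower()            # trim, then lower-case
df["city"] = df["city"].replace("sh", "shanghai")          # whole-value replacement
df["price"] = pd.to_numeric(df["price"], errors="coerce")  # "n/a" becomes NaN
df["price"] = df["price"].fillna(df["price"].mean())       # fill NaN with the mean
df["price"] = df["price"].astype("int")
df = df.rename(columns={"category": "category-size"})
df = df.set_index("id").sort_index()                       # index on id, sorted
```

`Series.replace` matches whole values here, so "shenzhen" is left alone even though it starts with "sh".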

--.loc selects data by index label (name); often used together with set_index()
--df.set_index(keys=['birth_city','birth_state'],append=True,drop=False)
--df.sort_index(na_position="last",inplace=True)
--.iloc selects data by integer position
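The difference between label-based and position-based selection is easiest to see when the index labels are not 0, 1, 2. A minimal sketch with made-up data:

```python
import pandas as pd

# Hypothetical table whose index labels differ from the row positions.
df = pd.DataFrame(
    {"name": ["Ann", "Bob", "Cid"], "age": [31, 25, 40]},
    index=[10, 20, 30],
)

by_label = df.loc[20, "age"]         # row labelled 20
by_position = df.iloc[1, 1]          # second row, second column
indexed = df.set_index("name")       # use the name column as the index
ann_age = indexed.loc["Ann", "age"]  # look up by the new label
label_slice = df.loc[10:20]          # label slices include both endpoints
pos_slice = df.iloc[0:1]             # position slices exclude the stop
```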

II. pandas time series

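--pandas time series (4): moving-window functions
--The pd.rolling_*() functions on the left of each pair are the legacy spelling and have been removed from pandas; use the df.rolling() form on the right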
--pd.rolling_count(df,int)<===>df.rolling(int).count()
--pd.rolling_sum(df,int)<===>df.rolling(int).sum()
--pd.rolling_mean(df,int)<===>df.rolling(int).mean()
--pd.rolling_median(df,int)<===>df.rolling(int).median()
--pd.rolling_var(df,int)<===>df.rolling(int).var()
--pd.rolling_std(df,int)<===>df.rolling(int).std()
--pd.rolling_max(df,int)<===>df.rolling(int).max()
--pd.rolling_min(df,int)<===>df.rolling(int).min()
..............
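A minimal sketch of the `.rolling()` form on a made-up daily series. The `min_periods` argument is not mentioned in the notes; it is shown here only because it controls the leading NaN values.

```python
import pandas as pd

# Hypothetical daily series.
s = pd.Series(
    [1, 2, 3, 4, 5],
    index=pd.date_range("2023-01-01", periods=5, freq="D"),
)

rolling_mean = s.rolling(3).mean()   # first two values are NaN (window not full)
rolling_sum = s.rolling(3).sum()
partial = s.rolling(3, min_periods=1).mean()  # allow windows with fewer than 3 values
```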
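*Downsampling: move to a coarser time granularity, for example aggregating daily data into weekly data with a sum or a mean
--df.resample('7D',closed='right',label='left').sum() or .mean()
*Upsampling: move to a finer time granularity. The main concern is how to fill the new, empty slots. There are three options:
--No fill: slots without a value become NaN. The method is asfreq.
--Forward fill: use the previous value. The method is ffill (pad in older pandas versions).
--Backward fill: use the next value. The method is bfill; the b stands for back.
--df.resample('7H').asfreq()
--df.resample('7H').ffill()
--df.resample('7H').bfill()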
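A minimal sketch of both directions, using made-up series. It uses the default `closed` and `label` settings rather than the ones in the line above, and day-based frequencies so the bins are easy to check by hand.

```python
import pandas as pd

# Downsampling: 14 daily values aggregated into two 7-day bins.
daily = pd.Series(
    range(1, 15),
    index=pd.date_range("2023-01-01", periods=14, freq="D"),
)
weekly_sum = daily.resample("7D").sum()

# Upsampling: two observations two days apart, expanded to daily slots.
sparse = pd.Series(
    [10.0, 20.0],
    index=pd.to_datetime(["2023-01-01", "2023-01-03"]),
)
as_is = sparse.resample("D").asfreq()    # new slot stays NaN
forward = sparse.resample("D").ffill()   # new slot takes the previous value
backward = sparse.resample("D").bfill()  # new slot takes the next value
```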

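--df.shift(5) shifts the data down 5 rows, so each row holds the value from 5 rows earlier; the first 5 rows become NaN
--df.shift(-5) shifts the data up 5 rows, so each row holds the value from 5 rows later; the last 5 rows become NaN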
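A small made-up series shows the direction of each shift; a shift of 1 is used instead of 5 to keep the example short.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40])

lagged = s.shift(1)      # NaN, 10, 20, 30: each row sees the previous value
lead = s.shift(-1)       # 20, 30, 40, NaN: each row sees the next value
change = s - s.shift(1)  # a common use: difference from the previous row
```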

III. Time class: datetime

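--datetime is a module for working with dates and times. It is part of the Python standard library; it is large and powerful, but a few common functions cover most needs:
1.--Get the current time
--from datetime import datetime
--print(datetime.now())
2.--Create a specific point in time
--dt = datetime(2017,8,1)
3.--Convert between strings and datetimes
--s = '20170901'
--s1 = datetime.strptime(s,'%Y%m%d')
--s = "2019/05/03"
--s2 = datetime.strptime(s,'%Y/%m/%d')
4.--Extract parts of a datetime
--s1.day
--s1.hour
--s1.year
--s1.date()
5.--Date arithmetic uses the timedelta class
--from datetime import datetime, timedelta
--s2 - s1
--s2 + timedelta(100) # the first positional argument of timedelta is days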
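Put together, the parsing and arithmetic steps above look like this. The sketch reuses the `s1` and `s2` values from the notes; the other variable names are illustrative.

```python
from datetime import datetime, timedelta

s1 = datetime.strptime("20170901", "%Y%m%d")
s2 = datetime.strptime("2019/05/03", "%Y/%m/%d")

gap = s2 - s1                        # subtracting datetimes gives a timedelta
later = s2 + timedelta(days=100)     # adding a timedelta gives a datetime
text = later.strftime("%Y-%m-%d")    # back to a string
```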

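6.--Format dates (through the .dt accessor of a pandas Series):
--.dt.strftime('%Y-%m-%d')
7.--Extract date components:
--.dt.year
--.dt.month
--.dt.day
--.dt.weekday
--.dt.dayofyear
--.dt.days # timedelta Series only
--.dt.isocalendar().week # replaces .dt.weekofyear, which was removed in pandas 2.0
--.dt.date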
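A minimal sketch of the `.dt` accessor on a made-up Series of dates. It assumes a pandas version that provides `.dt.isocalendar()` (1.1 or later).

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2019-05-03", "2020-12-31"]))

years = dates.dt.year
weekdays = dates.dt.weekday                  # Monday is 0
iso_weeks = dates.dt.isocalendar().week      # ISO 8601 week number
labels = dates.dt.strftime("%Y-%m-%d")       # formatted strings
elapsed = (dates - pd.Timestamp("2019-01-01")).dt.days  # .dt.days needs timedeltas
```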

IV. String operations

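Splitting strings into columns
--.str.split('-',expand=True) # split on "-" and expand the pieces into separate columns
--.str.split('-').str.get(1) # split on "-" and take the second element
--.str.split('-',expand=True,n=1) # split on "-" at most once and expand into columns
--.str.rsplit('-',expand=True,n=1) # rsplit works like split but in reverse, from the end of the string towards the start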
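The effect of `n=1` and of `rsplit` is clearest on strings with more than one separator. A minimal sketch with made-up values:

```python
import pandas as pd

s = pd.Series(["2019-05-03", "2020-12-31"])

parts = s.str.split("-", expand=True)             # three columns
second = s.str.split("-").str.get(1)              # second piece of each string
first_split = s.str.split("-", n=1, expand=True)  # split once from the left
last_split = s.str.rsplit("-", n=1, expand=True)  # split once from the right
```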

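--.str.replace("", "") # string replacement; pass regex=True to treat the pattern as a regular expression
--.str.strip() # remove leading and trailing whitespace

--s.str.lower() convert letters to lower case
--s.str.upper() convert letters to upper case
--s.str.len() length of each string
--.str.strip() remove leading and trailing whitespace
--.str.lstrip() remove leading whitespace
--.str.rstrip() remove trailing whitespace
--.str.contains('str') test whether each string contains str
--df.columns column names
--.interpolate() fill missing values by linear interpolation (for a single gap, the mean of the values before and after it)
--df['grammer'].map(lambda x: len(x)) length of each string in the grammer column
String slicing
--Series.str.slice(start=None, stop=None, step=None) slice each string in the Series
--start position where the slice starts
--stop position where the slice ends (exclusive)
--step slice step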
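A minimal sketch that applies several of the `.str` methods above to a made-up Series:

```python
import pandas as pd

s = pd.Series(["  Python ", "pandas", " NumPy"])

clean = s.str.strip()                             # drop surrounding spaces
lengths = clean.str.len()
has_py = clean.str.lower().str.contains("py")     # case-insensitive via lower()
head = clean.str.slice(start=0, stop=3)           # first three characters
no_vowels = clean.str.replace(r"[aeiou]", "", regex=True)  # regex replacement
```

Passing `regex=True` explicitly avoids depending on the default, which differs between pandas versions.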

V. Grouping, statistics, joins, and pivot tables

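Grouping
1.--df.groupby("director_name")
--Similar to the GROUP BY statement in SQL, but pandas offers richer functions on top of it
--Grouping can be done on the index or on columns, by a single key or by any number of keys. The computation after grouping works the same way whether the keys come from the index or from columns.
2.--Binning
--Equal-frequency binning: pd.qcut(x,q,labels,retbins) chooses bin edges from the quantiles of the values, so each bin holds roughly the same number of values
--Equal-width binning: pd.cut(x,bins,labels,retbins) chooses evenly spaced bin edges from the range of the values, so each bin has the same width
*Parameters
--x array; must be one-dimensional
--bins integer or sequence of scalars; the number of bins, or the explicit bin edges
--labels array or bool, default None. When an array is passed, it names the bins; when False is passed, only the integer bin codes are returned
--retbins bool, default False; whether to return the bins. When True, the bin edges are returned as well.
--precision int, default 3; the precision used for the bin labels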
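The difference between the two binning functions shows up on skewed data. A minimal sketch with a made-up series that contains one large value:

```python
import pandas as pd

# Hypothetical values: eight small numbers and one large one.
ages = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])
names = ["low", "mid", "high"]

equal_width = pd.cut(ages, bins=3, labels=names)  # same bin width
equal_freq = pd.qcut(ages, q=3, labels=names)     # same number of values per bin

# labels=False returns integer bin codes; retbins=True also returns the edges.
codes, edges = pd.cut(ages, bins=3, labels=False, retbins=True)
```

With equal-width bins the large value stretches the range, so eight of the nine values fall into the first bin; with equal-frequency bins each bin receives three values.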

3.--Conditional grouping
--If the value in the price column is > 3000, the group column shows high; otherwise it shows low
--df_inner['group'] = np.where(df_inner['price'] > 3000,'high','low')
--Flag the rows that satisfy several conditions at once
--df_inner.loc[(df_inner['city'] == 'beijing') & (df_inner['price']>= 4000), 'sign']=1
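A minimal runnable version of the two statements above, assuming a made-up `df_inner` with `city` and `price` columns:

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration.
df_inner = pd.DataFrame({
    "city": ["beijing", "shanghai", "beijing"],
    "price": [4500, 3200, 2800],
})

# Two-way label from a single condition.
df_inner["group"] = np.where(df_inner["price"] > 3000, "high", "low")

# Flag rows that meet both conditions; other rows are left as NaN.
mask = (df_inner["city"] == "beijing") & (df_inner["price"] >= 4000)
df_inner.loc[mask, "sign"] = 1
```

Each condition needs its own parentheses, because `&` binds more tightly than `==` and `>=`.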

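Statistics: exploratory data analysis
--df[""].quantile(q=0.25) first quartile
--df[""].median() median
--df[""].var() variance
--df[""].std() standard deviation
--df[""].mode() mode
--df[""].skew() skewness: greater than 0 is positive skew (long right tail, mean usually greater than the median); less than 0 is negative skew (long left tail, mean usually less than the median); 0 indicates a symmetric distribution, such as the normal
--df[""].kurt() kurtosis: pandas returns excess kurtosis, for which a normal distribution scores 0; greater than 0 means heavier tails than the normal, less than 0 lighter tails (the threshold is 3 only under the non-excess definition)
--df.sample(100) random sample of 100 rows
--df.dropna(how="any", axis=0) drop rows that contain missing values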
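A minimal sketch of these statistics on a made-up series with one large value, which pulls the mean above the median and gives positive skew:

```python
import pandas as pd

# Hypothetical values with a single large outlier.
s = pd.Series([1, 2, 2, 3, 4, 5, 20])

q1 = s.quantile(q=0.25)
median = s.median()
mode = s.mode()          # a Series, because there can be several modes
mean = s.mean()
skewness = s.skew()      # positive: long right tail
kurtosis = s.kurt()      # excess kurtosis; positive: heavier tails than the normal
```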

Joining tables
Merge two tables (like VLOOKUP in Excel)
--df_inner=pd.merge(df,df1,how='inner') # how can be left, right, inner, or outer
--pd.merge(left, right, on=['key1', 'key2']) # join on multiple keys
--pd.merge(left,right,left_on=['key1','key2'],right_on=['key3','key4']) # join when the key columns have different names in each table
--pd.merge(left,right,left_index=True,right_index=True) # join on the index
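The `how` argument decides which keys survive the join. A minimal sketch with two made-up tables that share only some `id` values:

```python
import pandas as pd

# Hypothetical tables; ids 2 and 3 appear in both.
left = pd.DataFrame({"id": [1, 2, 3], "city": ["beijing", "shanghai", "shenzhen"]})
right = pd.DataFrame({"id": [2, 3, 4], "price": [3200, 2800, 4100]})

inner = pd.merge(left, right, on="id", how="inner")     # keys in both tables
left_join = pd.merge(left, right, on="id", how="left")  # every key from left
outer = pd.merge(left, right, on="id", how="outer")     # keys from either table
```

Rows without a match on the other side get NaN in the columns that come from it.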

Pivot tables
--pd.pivot_table(df, index=['city'], values=['price'], columns=['size'], aggfunc=[len, np.sum], fill_value=0, margins=True)

--pd.crosstab(df['director_name'],df['color'],margins=True)
--crosstab computes a cross-tabulation of two or more factors. By default it counts how often each combination occurs, unless an array of values and an aggregation function are passed
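A minimal sketch of both functions on a made-up table. It uses a single string aggregation (`"sum"`) instead of the list in the line above, to keep the result easy to read.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({
    "city": ["beijing", "beijing", "shanghai", "shanghai"],
    "size": ["S", "L", "S", "S"],
    "price": [100, 200, 300, 400],
})

# Sum of price for each city/size pair; missing pairs are filled with 0.
table = pd.pivot_table(
    df, index="city", columns="size", values="price",
    aggfunc="sum", fill_value=0,
)

# Frequency of each city/size pair, with row and column totals ("All").
counts = pd.crosstab(df["city"], df["size"], margins=True)
```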

VI. Visualization

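%matplotlib inline
import numpy as np # np is used by the examples in this section
import matplotlib.pyplot as plt
1. Drawing multiple plots with subplot
plt.subplot(nrows, ncols, index, **kwargs) # nrows: rows; ncols: columns; index: position, counted left to right, then top to bottom
plt.subplot(211) # equivalent to subplot(2,1,1)
2. BAR CHART: bar chart
data = [5,25,50,20]
Vertical: plt.bar(range(len(data)),data)
Horizontal: plt.barh(range(len(data)),data)
Multiple bar series side by side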
data = [[5,25,50,20], [4,23,51,17],[6,22,52,19]]
X = np.arange(4)
plt.bar(X + 0.00, data[0], color = 'b', width = 0.25,label = "A")
plt.bar(X + 0.25, data[1], color = 'g', width = 0.25,label = "B")
plt.bar(X + 0.50, data[2], color = 'r', width = 0.25,label = "C")
plt.legend()
--Stacked
plt.bar(X, data[0], color = 'b', width = 0.25)
plt.bar(X, data[1], color = 'g', width = 0.25,bottom = data[0])
plt.bar(X, data[2], color = 'r', width = 0.25,bottom = np.array(data[0]) + np.array(data[1]))
plt.show()
3. SCATTER POINTS: scatter plot
A scatter plot shows the relationship between two continuous variables
N = 50; x = np.random.rand(N); y = np.random.rand(N)
plt.scatter(x, y)
--Bubble chart
colors = np.random.randn(N)
area = np.pi * (15 * np.random.rand(N))**2 # marker sizes
plt.scatter(x, y, c=colors, alpha=0.5, s=area)
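4. Histogram
Explanation: a histogram shows the distribution of a continuous variable. Before building one we define the bins (value ranges): the continuous values are divided into intervals, and the number of data points falling in each interval is counted.
a = np.random.randn(100) # randn draws from the standard normal, matching the title; rand would draw from a uniform distribution
plt.hist(a,bins=40)
plt.ylim(0,15)
plt.title("Standard Normal Distribution")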
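The binning step described above can be inspected without drawing anything. This sketch uses `np.histogram`, which returns the per-bin counts and the bin edges; the seeded generator is an assumption made here so the sample is reproducible.

```python
import numpy as np

rng = np.random.default_rng(0)      # seeded generator, for reproducibility
a = rng.standard_normal(100)        # 100 draws from the standard normal

counts, edges = np.histogram(a, bins=40)
# counts[i] is the number of values between edges[i] and edges[i + 1]
```

Forty bins need forty-one edges, and the counts add up to the sample size.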
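5. BOXPLOTS: box plot
--A boxplot summarizes the percentile distribution of a continuous feature. In statistics it is often used to spot outliers in a single variable, or to examine the relationship between a discrete feature and a continuous feature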
x = np.random.randint(20,100,size = (30,3))
plt.boxplot(x)
plt.ylim(0,120)
plt.xticks([1,2,3],['A','B','C'])
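The numbers behind a box plot can be computed directly. This sketch assumes the common 1.5 x IQR rule for the whiskers, which is matplotlib's default; the data is made up.

```python
import numpy as np

# Hypothetical sample with one extreme value.
x = np.array([20, 22, 25, 27, 30, 31, 33, 35, 95])

q1, median, q3 = np.percentile(x, [25, 50, 75])  # box edges and middle line
iqr = q3 - q1                                    # interquartile range
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = x[(x < lower_fence) | (x > upper_fence)]  # drawn as separate points
```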
6. COLORS/TEXTS/annotate
data = [[5,25,50,20],[4,23,51,17],[6,22,52,19]]; X = np.arange(4)
fig, ax = plt.subplots(facecolor='teal') # create the figure first so the bars are drawn on it
ax.bar(X, data[0], color='darkorange', width=0.25, label='A')
ax.bar(X, data[1], color='steelblue', width=0.25, bottom=data[0], label='B')
ax.bar(X, data[2], color='violet', width=0.25, bottom=np.array(data[0]) + np.array(data[1]), label='C')
ax.set_title("Figure 1")
ax.legend()
--Text is often used in a figure to annotate some of its features. The annotate() method makes this easy. It takes the coordinates of two points: the point being annotated, xy=(x, y), and the position of the text, xytext=(x, y)
X = np.linspace(0, 2 * np.pi, 100) # evenly spaced sample points
Y = np.sin(X); Y1 = np.cos(X)
plt.plot(X,Y); plt.plot(X,Y1)
plt.annotate('Points',xy=(1, np.sin(1)),xytext=(2, 0.5), fontsize=16, arrowprops=dict(arrowstyle="->"))
plt.title("這是一副測試圖！") # a Chinese title, used to test CJK rendering
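--For matplotlib to display Chinese text correctly, a couple of settings are needed
import matplotlib.pyplot as plt
plt.rcParams['font.sans-serif']=['SimHei'] # render Chinese labels correctly (requires the SimHei font to be installed)
plt.rcParams['axes.unicode_minus']=False # render the minus sign correctly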
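7. seaborn
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_style(style="darkgrid") # set the style
sns.set_context(context="poster",font_scale=1.5) # set the font scaling
sns.set_palette(sns.color_palette("RdBu", n_colors=7)) # set the color palette
Calling seaborn's set() function before drawing applies its default theme: sns.set(); sinplot() # sinplot is the helper defined in the seaborn tutorial, not a library function
sns.lmplot(x="", y="", data=, fit_reg=False, hue="") scatter plot / regression plot
plt.xlim(0,150) set the x-axis range
plt.ylim(0,200) set the y-axis range
sns.boxplot(data=) box plot (boxplot has no axis parameter)
sns.violinplot(x="", y="", data=) violin plot
sns.swarmplot(x='Type 1', y='Attack', data=df) swarm plot
sns.histplot(df[""]) histogram (replaces the deprecated sns.distplot)
sns.countplot(x="", data=df, palette=) bar chart of counts (for visualizing categorical variables; palette sets the colors)
plt.xticks(rotation=-45) rotate the x tick labels
g = sns.catplot(x='Type 1', y='Attack', data=df, # catplot is the current name of factorplot
hue='Stage', # color by Stage
col='Stage', # one panel per Stage
kind='swarm') # draw swarm plots
g.set_xticklabels(rotation=-45) # rotate the x tick labels; catplot can split the chart into panels by category
sns.kdeplot(x=df['Attack'], y=df['Defense']) # density plot of the joint distribution of two variables; the regions inside the innermost contours are where observations are most concentrated
sns.jointplot(x='Attack', y='Defense', data=df) joint plot: combines a scatter plot with histograms to give a detailed view of the bivariate distribution
More examples: http://seaborn.pydata.org/examples/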

Summary of Python data-cleaning methods