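Kin Lim Lee walks through eight simple preprocessing code snippets covering eight scenarios:
dropping multiple columns, changing data types, converting categorical variables to numeric variables, checking for missing data, removing strings from columns, removing spaces from columns, concatenating two columns with strings (with a condition), and converting timestamps (from string to datetime format).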
Delete multiple columns
Not all columns are useful for data analysis. df.drop makes it easy to delete the columns you specify.
    def drop_multiple_col(col_names_list, df):
        """
        AIM    -> Drop multiple columns based on their column names
        INPUT  -> List of column names, df
        OUTPUT -> updated df with dropped columns
        """
        # axis=1 targets columns; inplace=True modifies df directly
        df.drop(col_names_list, axis=1, inplace=True)
        return df
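As a quick illustration of the idea above, the following sketch drops two columns from a small throwaway DataFrame. The sample data and the column names `a`, `b`, `c` are made up for the example, and the helper here uses the non-mutating `drop(columns=...)` form as an alternative to `inplace=True`; it is not the author's exact version.

```python
import pandas as pd

def drop_multiple_col(col_names_list, df):
    """Return a copy of df without the listed columns (df itself is untouched)."""
    return df.drop(columns=col_names_list)

# Hypothetical sample data, for illustration only
df_demo = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})
df_small = drop_multiple_col(["b", "c"], df_demo)

print(list(df_small.columns))
print(list(df_demo.columns))
```

Returning a new DataFrame keeps the original available for later steps, at the cost of an extra copy.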
Convert data types
As a data set grows, converting columns to smaller data types saves memory.
    def change_dtypes(col_int, col_float, df):
        """
        AIM    -> Changing dtypes to save memory
        INPUT  -> List of column names (int, float), df
        OUTPUT -> updated df with smaller memory
        """
        # downcast the listed columns to 32-bit types
        df[col_int] = df[col_int].astype('int32')
        df[col_float] = df[col_float].astype('float32')
        return df
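To make the memory saving concrete, here is a minimal sketch that measures a DataFrame before and after downcasting. The sample data and the column names `qty` and `price` are hypothetical. Note that 32-bit types hold a smaller range and less precision, so this only suits columns whose values fit.

```python
import pandas as pd

# Hypothetical sample data; pandas stores these as int64 / float64 by default
df_mem = pd.DataFrame({"qty": list(range(1000)), "price": [0.5] * 1000})
before = df_mem.memory_usage(index=False).sum()

# Downcast both columns to 32-bit types
df_mem["qty"] = df_mem["qty"].astype("int32")
df_mem["price"] = df_mem["price"].astype("float32")
after = df_mem.memory_usage(index=False).sum()

print(f"bytes before: {before}, bytes after: {after}")
```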
Convert categorical variables to numeric variables
Some machine learning models require variables in numeric format, so categorical variables must first be converted to numeric ones. You can also keep the original categorical variables for data visualization.
    def convert_cat2num(df):
        # Convert categorical variable to numerical variable
        num_encode = {'col_1': {'YES': 1, 'NO': 0},
                      'col_2': {'WON': 1, 'LOSE': 0, 'DRAW': 0}}
        df.replace(num_encode, inplace=True)
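If you want to keep the categorical column for visualization, as mentioned above, one option is to write the encoded values to a new column instead of replacing in place. The sketch below does this with `Series.map`; the sample data and the column name `col_1_num` are assumptions made for the example.

```python
import pandas as pd

# Hypothetical sample data, for illustration only
df_cat = pd.DataFrame({"col_1": ["YES", "NO", "YES"]})

# Encode into a new column so the original labels remain available
df_cat["col_1_num"] = df_cat["col_1"].map({"YES": 1, "NO": 0})

print(df_cat)
```

Unlike `replace`, `map` turns any value missing from the dictionary into NaN, which makes unexpected categories easy to spot.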
Check for missing data
To check the amount of missing data per column, the following code is a quick way to do it. It shows which columns are missing the most data, which helps you decide how to proceed with data cleaning and analysis.
    def check_missing_data(df):
        # check for any missing data in the df (display in descending order)
        return df.isnull().sum().sort_values(ascending=False)
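A variation on the count above is the share of missing values per column, which is easier to compare across columns. The sketch below is one way to compute it; the sample data is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with gaps, for illustration only
df_na = pd.DataFrame({
    "x": [1.0, np.nan, np.nan, 4.0],
    "y": ["a", None, "c", "d"],
    "z": [1, 2, 3, 4],
})

# The mean of a boolean mask is the fraction of True values, i.e. the missing share
missing_share = df_na.isnull().mean().sort_values(ascending=False)

print(missing_share)
```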
Delete strings in columns
Sometimes unwanted characters or other strange symbols appear in a string column. They can be removed simply with df['col_1'].replace.
    def remove_col_str(df):
        # remove a portion of string in a dataframe column - col_1
        # (the pattern to remove is missing from the source; '\n' is an assumed placeholder)
        df['col_1'] = df['col_1'].replace('\n', '', regex=True)
        # remove all the characters after &# (including &#) for column - col_1
        df['col_1'] = df['col_1'].replace('&#.*', '', regex=True)
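The same cleanup can also be written with the `.str.replace` accessor. The sketch below is one possible variant, run on hypothetical sample values; the newline pattern is an assumption chosen for the example, not something taken from the original code.

```python
import pandas as pd

# Hypothetical sample data, for illustration only
df_str = pd.DataFrame({"col_1": ["apple&#123;", "pear\n", "plum"]})

cleaned = (
    df_str["col_1"]
    .str.replace(r"&#.*", "", regex=True)   # drop "&#" and everything after it
    .str.replace("\n", "", regex=False)     # drop literal newlines (assumed pattern)
)

print(cleaned.tolist())
```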
Delete spaces in columns
When data is messy, anything can happen. Strings often have spaces at the beginning. The following code removes leading spaces from the strings in a column.
    def remove_col_white_space(df, col):
        # remove white space at the beginning of string
        df[col] = df[col].str.lstrip()
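For comparison, the sketch below contrasts `str.lstrip()`, which removes only leading whitespace, with `str.strip()`, which trims both ends. The sample data and the column name `name` are hypothetical.

```python
import pandas as pd

# Hypothetical sample data, for illustration only
df_ws = pd.DataFrame({"name": ["  alice", " bob ", "carol"]})

left_trimmed = df_ws["name"].str.lstrip()   # leading whitespace only
fully_trimmed = df_ws["name"].str.strip()   # leading and trailing whitespace

print(left_trimmed.tolist())
print(fully_trimmed.tolist())
```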
Concatenate two columns with strings (with condition)
This code is helpful when you want to conditionally join two columns together with a string. For example, you can mark the rows to join with certain letters at the end of the first column and then concatenate those rows with the second column. If needed, the trailing letters can be removed once the concatenation is complete.
    def concat_col_str_condition(df):
        # concat 2 columns with strings if the last 3 letters of the first column are 'pil'
        mask = df['col_1'].str.endswith('pil', na=False)
        col_new = df.loc[mask, 'col_1'] + df.loc[mask, 'col_2']
        # replace 'pil' with a space
        col_new = col_new.replace('pil', ' ', regex=True)
        return col_new
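Here is a small worked example of the conditional join, using hypothetical sample data. As a design variation, it removes the marker by slicing off the last three characters rather than by a regex replace, so a `pil` that happens to occur elsewhere in the string is left alone.

```python
import pandas as pd

# Hypothetical sample data, for illustration only
df_cc = pd.DataFrame({
    "col_1": ["abcpil", "xyz", None],
    "col_2": ["123", "456", "789"],
})

# Rows whose first column ends with the marker; missing values count as False
mask = df_cc["col_1"].str.endswith("pil", na=False)

# Strip the 3-letter marker, then join the two columns with a space
joined = df_cc.loc[mask, "col_1"].str[:-3] + " " + df_cc.loc[mask, "col_2"]

print(joined.tolist())
```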
Convert timestamps (from string to datetime format)
When processing time series data, we are likely to encounter timestamp columns in string format. These need to be converted from string to datetime format (or another format that suits our needs) for meaningful analysis of the data.
    import pandas as pd

    def convert_str_datetime(df):
        """
        AIM    -> Convert datetime(String) to datetime(format we want)
        INPUT  -> df
        OUTPUT -> updated df with new datetime format
        """
        df.insert(loc=2, column='timestamp', value=pd.to_datetime(df.transdate, format='%Y-%m-%
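Since the format string above is cut off in the source, the following is only a sketch of how such a conversion might look in full. The format `%Y-%m-%d %H:%M:%S.%f`, the sample rows, and the `id` and `amount` columns are all assumptions made for the example; the format must be adjusted to match the actual strings in your data.

```python
import pandas as pd

# Hypothetical sample data; "transdate" holds timestamps stored as strings
df_ts = pd.DataFrame({
    "id": [1, 2],
    "amount": [9.5, 3.0],
    "transdate": ["2018-03-01 10:15:30.250", "2018-03-02 23:59:59.000"],
})

# Assumed format: year-month-day hour:minute:second.fraction
parsed = pd.to_datetime(df_ts["transdate"], format="%Y-%m-%d %H:%M:%S.%f")

# Insert the parsed values as a new third column (position index 2)
df_ts.insert(loc=2, column="timestamp", value=parsed)

print(df_ts.dtypes)
```

Passing an explicit `format` makes parsing stricter: strings that do not match raise an error instead of being guessed at.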
Common DataFrame functions for quick summary statistics:
    df.count()        # number of non-null elements
    df.min()          # minimum
    df.max()          # maximum
    df.idxmin()       # index of the minimum, similar to which.min in R
    df.idxmax()       # index of the maximum, similar to which.max in R
    df.quantile(0.1)  # 10% quantile
    df.sum()          # sum
    df.mean()         # mean
    df.median()       # median
    df.mode()         # mode
    df.var()          # variance
    df.std()          # standard deviation
    df.mad()          # mean absolute deviation (removed in pandas 2.0)
    df.skew()         # skewness
    df.kurt()         # kurtosis
    df.describe()     # several descriptive statistics at once
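The short sketch below runs a few of these functions on hypothetical sample data. Because `mad()` is no longer available in pandas 2.0 and later, the mean absolute deviation is computed by hand here as one possible substitute.

```python
import pandas as pd

# Hypothetical sample data, for illustration only
df_stats = pd.DataFrame({"v": [1, 2, 2, 3, 10]})

summary = df_stats.describe()          # count, mean, std, min, quartiles, max
median_v = df_stats["v"].median()
idx_of_max = df_stats["v"].idxmax()    # index label of the largest value

# Mean absolute deviation: average distance of each value from the mean
mad_v = (df_stats["v"] - df_stats["v"].mean()).abs().mean()

print(summary)
print(median_v, idx_of_max, mad_v)
```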