df_all_cities is the Pandas DataFrame we built to hold all of the data. Depending on the goal of the analysis, we may need to extract a subset of the data to study the specific questions we care about. To make exploring the data easier, we define two functions below, filter_data and reading_stats. Given different conditions as input, they select the matching subset of the data.
def filter_data(data, condition):
    """
    Remove rows that do not match the condition provided.
    Takes a DataFrame and a single condition string as input and returns
    the filtered DataFrame.
    The condition should be a string of the following format:
      '<field> <op> <value>'
    where the following operations are valid: >, <, >=, <=, ==, !=
    Example: "city == 'Beijing'"
    """
    # Only want to split on first two spaces separating field from operator and
    # operator from value: spaces within value should be retained.
    field, op, value = condition.split(" ", 2)
    # check if field is valid
    if field not in data.columns:
        raise ValueError("'{}' is not a feature of the dataframe. Did you spell something wrong?".format(field))
    # convert value into number or strip excess quotes if string
    try:
        value = float(value)
    except ValueError:
        value = value.strip("\'\"")
    # get booleans for filtering
    if op == ">":
        matches = data[field] > value
    elif op == "<":
        matches = data[field] < value
    elif op == ">=":
        matches = data[field] >= value
    elif op == "<=":
        matches = data[field] <= value
    elif op == "==":
        matches = data[field] == value
    elif op == "!=":
        matches = data[field] != value
    else:  # catch invalid operation codes
        raise ValueError("Invalid comparison operator. Only >, <, >=, <=, ==, != allowed.")
    # keep only the matching rows and renumber the index from 0
    data = data[matches].reset_index(drop=True)
    return data
# Plotting libraries used by reading_stats (skip if already imported earlier).
import matplotlib.pyplot as plt
import seaborn

def reading_stats(data, filters=None, verbose=True):
    """
    Report number of readings and average PM2.5 readings for data points that meet
    specified filtering criteria. Returns the filtered DataFrame.
    """
    n_data_all = data.shape[0]
    # Apply filters to data, one condition at a time (all must hold).
    for condition in (filters or []):
        data = filter_data(data, condition)
    # Compute number of data points that met the filter criteria.
    n_data = data.shape[0]
    # Compute statistics for PM 2.5 readings.
    pm_mean = data['PM_US_Post'].mean()
    # .to_numpy() replaces .as_matrix(), which was removed in pandas 1.0.
    pm_qtiles = data['PM_US_Post'].quantile([.25, .5, .75]).to_numpy()
    # Report computed statistics if verbosity is set to True (default).
    if verbose:
        if filters:
            print('There are {:d} readings ({:.2f}%) matching the filter criteria.'.format(n_data, 100. * n_data / n_data_all))
        else:
            print('There are {:d} readings in the dataset.'.format(n_data))
        print('The average PM 2.5 reading is {:.2f} ug/m^3.'.format(pm_mean))
        print('The median PM 2.5 reading is {:.2f} ug/m^3.'.format(pm_qtiles[1]))
        print('25% of PM 2.5 readings are smaller than {:.2f} ug/m^3.'.format(pm_qtiles[0]))
        print('25% of PM 2.5 readings are larger than {:.2f} ug/m^3.'.format(pm_qtiles[2]))
        seaborn.boxplot(x=data['PM_US_Post'], showfliers=False)
        plt.title('Boxplot of PM 2.5 of filtered data')
        plt.xlabel('PM_US Post (ug/m^3)')
    # Return the filtered DataFrame
    return data
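To make the condition format concrete, here is a minimal standalone sketch of the parsing step that filter_data performs, without any DataFrame involved. The names parse_condition and OPS are hypothetical helpers introduced only for this illustration; they are not part of the original code.

```python
import operator

# Maps each supported operator string to the matching comparison function.
# (OPS is a hypothetical name used only in this sketch.)
OPS = {
    ">": operator.gt,
    "<": operator.lt,
    ">=": operator.ge,
    "<=": operator.le,
    "==": operator.eq,
    "!=": operator.ne,
}

def parse_condition(condition):
    """Split '<field> <op> <value>' into its three parts (hypothetical helper)."""
    # Split on the first two spaces only, so spaces inside the value survive.
    field, op, value = condition.split(" ", 2)
    if op not in OPS:
        raise ValueError("Invalid comparison operator: {!r}".format(op))
    try:
        value = float(value)        # numeric values become floats
    except ValueError:
        value = value.strip("'\"")  # string values lose their surrounding quotes
    return field, op, value

print(parse_condition("city == 'Beijing'"))
print(parse_condition("PM_US_Post >= 35"))
print(parse_condition("start_city == 'San Francisco'"))
```

Because the split is limited to two separators, a value such as 'San Francisco' stays in one piece, which is why exactly one space is required between the field, the operator, and the value.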
In practice we only need to call reading_stats. Because it calls filter_data internally, we never have to work with filter_data directly. Some notes on the function follow.
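reading_stats takes three parameters:
First parameter (required): the dataframe to load; the analysis starts from this data.
Second parameter (optional): the data filters, which restrict the data points to be analyzed according to a series of conditions. The filters are given as a list of conditions: each condition is written as a string inside double quotes, conditions are separated by commas, and the whole list is wrapped in []. Each individual condition is a string made of three elements, '<field> <op> <value>' (with one space character between elements), where <op> can be any of the following operators: >, <, >=, <=, ==, !=. A data point must satisfy all conditions to be counted. For example, ["city == 'Beijing'", "season == 'Spring'"] keeps only the data for Beijing in the spring season. In the first condition, <field> is city, <op> is ==, and <value> is 'Beijing'; because Beijing is a string it is wrapped in single quotes, and the three elements are each separated by one space. Finally, the condition as a whole is enclosed in double quotes. This example uses two conditions, separated by a comma, and both are placed inside [].
Third parameter (optional): verbose, which determines whether detailed statistics for the selected data are printed. If verbose = True, the function prints the number of readings, the mean, and the quartiles, and draws a box plot. If verbose = False, it only returns the filtered dataframe and prints nothing.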
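As a rough illustration of what the two-condition example above selects, the sketch below applies the same logic directly with pandas boolean indexing. The toy DataFrame and its values are invented for this example; only the column names city, season, and PM_US_Post come from the text and code above.

```python
import pandas as pd

# Toy data invented for illustration; the real df_all_cities is much larger.
toy = pd.DataFrame({
    "city": ["Beijing", "Beijing", "Shanghai", "Beijing"],
    "season": ["Spring", "Winter", "Spring", "Spring"],
    "PM_US_Post": [80.0, 150.0, 40.0, 60.0],
})

# Equivalent of filters=["city == 'Beijing'", "season == 'Spring'"]:
# a row is kept only if every condition holds (logical AND).
mask = (toy["city"] == "Beijing") & (toy["season"] == "Spring")
selected = toy[mask].reset_index(drop=True)

# The same summary values that reading_stats reports when verbose is True.
n_selected = len(selected)
pm_mean = selected["PM_US_Post"].mean()
pm_qtiles = selected["PM_US_Post"].quantile([0.25, 0.5, 0.75]).to_numpy()

print(n_selected)
print(pm_mean)
print(pm_qtiles)
```

With the functions defined earlier, the same selection would be written as reading_stats(toy, ["city == 'Beijing'", "season == 'Spring'"], verbose=False), which returns the filtered dataframe without printing or plotting.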