Content from the DataCamp course "pandas Foundations".
The data and code are on GitHub.
Data:
Dataset 1
weather_data_austin_2010:
Weather in Austin for 2010
To make later steps easier, use Date as the index:
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date')
df.head()
Dataset 2
NOAA_QCLCD_2011_hourly_13904.txt
Weather for 2011. The file has no header row and 44 columns, some of which are dropped below.
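The notes do not show how df2 is loaded. A minimal sketch of reading a headerless file and labeling its columns afterwards, assuming the file is comma-separated; the three-column sample below is made up and only stands in for the real 44-column file:

```python
import io
import pandas as pd

# Hypothetical three-column sample standing in for the real 44-column file
raw = io.StringIO("13904,20110101,53\n13904,20110101,153\n")

# header=None tells pandas that the first row is data, not column names
sample = pd.read_csv(raw, header=None)

# Assign labels from a comma-separated string, the same way df2 is labeled
sample.columns = "Wban,date,Time".split(",")
print(sample)
```

Without header=None, pandas would consume the first data row as column names.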
column_labels='Wban,date,Time,StationType,sky_condition,sky_conditionFlag,visibility,visibilityFlag,wx_and_obst_to_vision,wx_and_obst_to_visionFlag,dry_bulb_faren,dry_bulb_farenFlag,dry_bulb_cel,dry_bulb_celFlag,wet_bulb_faren,wet_bulb_farenFlag,wet_bulb_cel,wet_bulb_celFlag,dew_point_faren,dew_point_farenFlag,dew_point_cel,dew_point_celFlag,relative_humidity,relative_humidityFlag,wind_speed,wind_speedFlag,wind_direction,wind_directionFlag,value_for_wind_character,value_for_wind_characterFlag,station_pressure,station_pressureFlag,pressure_tendency,pressure_tendencyFlag,presschange,presschangeFlag,sea_level_pressure,sea_level_pressureFlag,record_type,hourly_precip,hourly_precipFlag,altimeter,altimeterFlag,junk'
column_labels_list = column_labels.split(',')
df2.columns = column_labels_list
list_to_drop=['sky_conditionFlag', 'visibilityFlag', 'wx_and_obst_to_vision', 'wx_and_obst_to_visionFlag', 'dry_bulb_farenFlag', 'dry_bulb_celFlag', 'wet_bulb_farenFlag', 'wet_bulb_celFlag', 'dew_point_farenFlag', 'dew_point_celFlag', 'relative_humidityFlag', 'wind_speedFlag', 'wind_directionFlag', 'value_for_wind_character', 'value_for_wind_characterFlag', 'station_pressureFlag', 'pressure_tendencyFlag', 'pressure_tendency', 'presschange', 'presschangeFlag', 'sea_level_pressureFlag', 'hourly_precip', 'hourly_precipFlag', 'altimeter', 'record_type', 'altimeterFlag', 'junk']
df2_dropped = df2.drop(list_to_drop,axis='columns')
print(df2_dropped.head())
Data cleaning: combine date and Time into a single datetime and use it as the index
# Convert the date column to string: df2_dropped['date']
df2_dropped['date'] = df2_dropped['date'].astype(str)
# Pad leading zeros to the Time column: df2_dropped['Time']
df2_dropped['Time'] = df2_dropped['Time'].apply(lambda x:'{:0>4}'.format(x))
# Concatenate the new date and Time columns: date_string
date_string = df2_dropped['date'] + df2_dropped['Time']
# Convert the date_string Series to datetime: date_times
date_times = pd.to_datetime(date_string, format='%Y%m%d%H%M')
# Set the index to be the new date_times container: df2_clean
df2_clean = df2_dropped.set_index(date_times)
# Print the output of df2_clean.head()
print(df2_clean.head())
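To see why the zero padding matters, here is a small sketch on made-up rows (the values are hypothetical, not taken from the NOAA file). A Time of 53 means 00:53, and without padding the concatenated string would not match the '%Y%m%d%H%M' format:

```python
import pandas as pd

# Hypothetical rows: Time is stored as an integer, so 00:53 appears as 53
toy = pd.DataFrame({"date": [20110101, 20110101], "Time": [53, 1353]})

# Same steps as above: date to string, Time padded to four characters
toy["date"] = toy["date"].astype(str)
toy["Time"] = toy["Time"].apply(lambda x: "{:0>4}".format(x))

# Concatenate and parse with an explicit format
stamps = pd.to_datetime(toy["date"] + toy["Time"], format="%Y%m%d%H%M")
toy_clean = toy.set_index(stamps)
print(toy_clean.index)
```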
Handling missing values: convert the entries marked 'M' in the table to NaN
# Print the dry_bulb_faren temperature between 8 AM and 9 AM on June 20, 2011
print(df2_clean.loc['2011-6-20 8:00:00':'2011-6-20 9:00:00','dry_bulb_faren' ])
# Convert the dry_bulb_faren column to numeric values: df_clean['dry_bulb_faren']
df2_clean['dry_bulb_faren'] = pd.to_numeric(df2_clean['dry_bulb_faren'], errors='coerce')
# Print the transformed dry_bulb_faren temperature between 8 AM and 9 AM on June 20, 2011
print(df2_clean.loc['2011-6-20 8:00:00':'2011-6-20 9:00:00', 'dry_bulb_faren'])
# Convert the wind_speed and dew_point_faren columns to numeric values
df2_clean['wind_speed'] = pd.to_numeric(df2_clean['wind_speed'], errors='coerce')
df2_clean['dew_point_faren'] = pd.to_numeric(df2_clean['dew_point_faren'], errors='coerce')
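A minimal sketch of what errors='coerce' does, using a made-up Series: any value that cannot be parsed as a number, such as the 'M' marker, becomes NaN and the column ends up with a float dtype.

```python
import pandas as pd

# Hypothetical readings stored as text, with 'M' marking a missing value
readings = pd.Series(["73", "M", "75"])

# Unparseable entries are replaced by NaN instead of raising an error
numeric = pd.to_numeric(readings, errors="coerce")
print(numeric)
```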
Exploring dataset 2
# Print the median of the dry_bulb_faren column
print(df2_clean.dry_bulb_faren.median())
# Print the median of the dry_bulb_faren column for the time range '2011-Apr':'2011-Jun'
print(df2_clean.loc['2011-Apr':'2011-Jun', 'dry_bulb_faren'].median())
# Print the median of the dry_bulb_faren column for the month of January
print(df2_clean.loc['2011-Jan', 'dry_bulb_faren'].median())
72.0
78.0
48.0
Only the median of the dry-bulb temperature is analyzed here, both overall and for different time ranges.
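The selections above rely on partial string indexing of a DatetimeIndex. A small sketch on synthetic data (one made-up value per day of 2011), using numeric month strings:

```python
import pandas as pd

# Synthetic daily series for 2011; the values are just 0, 1, 2, ...
idx = pd.date_range("2011-01-01", "2011-12-31", freq="D")
temps = pd.Series(range(len(idx)), index=idx, dtype="float64")

# A month string selects every row in that month
january = temps.loc["2011-01"]

# A slice between two month strings includes both end months in full
spring = temps.loc["2011-04":"2011-06"]

print(len(january), len(spring), january.median())
```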
How much hotter was each day in 2011 than expected from the 30-year average? Compute the mean difference.
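# Downsample dry_bulb_faren in df2_clean by day and aggregate by mean: daily_mean_2011
# (selecting the numeric column first avoids a TypeError that the text columns can raise in pandas 2.x)
daily_mean_2011 = df2_clean[['dry_bulb_faren']].resample('D').mean()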
# Extract the dry_bulb_faren column from daily_mean_2011 using .values: daily_temp_2011
daily_temp_2011 = daily_mean_2011['dry_bulb_faren'].values
# Downsample df_climate by day and aggregate by mean: daily_climate
daily_climate = df.resample('D').mean()
# Extract the Temperature column from daily_climate using .reset_index(): daily_temp_climate
daily_temp_climate = daily_climate.reset_index()['Temperature']
# Compute the difference between the two arrays and print the mean difference
difference = daily_temp_2011 - daily_temp_climate
print(difference.mean())
1.3301831870056477
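The .values and .reset_index() calls are there because the two datasets carry different dates (2011 versus 2010). A sketch with made-up numbers showing what happens with and without them:

```python
import pandas as pd

# Hypothetical daily means indexed by 2011 and 2010 dates
a = pd.Series([50.0, 60.0], index=pd.to_datetime(["2011-01-01", "2011-01-02"]))
b = pd.Series([48.0, 61.0], index=pd.to_datetime(["2010-01-01", "2010-01-02"]))

# Subtracting directly aligns on index labels; no dates match, so all NaN
aligned = a - b

# Dropping the labels on both sides gives a position-by-position difference
positional = a.values - b.reset_index(drop=True)

print(aligned)
print(positional.mean())
```

Positional subtraction only makes sense when both sides have the same number of days in the same order.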
Sunny or rainy?
On average, how much hotter is it when the sun is shining? In this exercise, you will compare temperatures on sunny days against temperatures on overcast days.
Your job is to use Boolean selection to filter out sunny and overcast days, and then compute the difference of the mean daily maximum temperatures between each type of day.
The column 'sky_condition' provides information about whether the day was sunny ('CLR') or overcast ('OVC').
# Using df_clean, when is sky_condition 'CLR'?
is_sky_clear = df2_clean['sky_condition']=='CLR'
# Filter df_clean using is_sky_clear
sunny = df2_clean.loc[is_sky_clear]
# Resample sunny by day then calculate the max
sunny_daily_max = sunny.resample('D').max()
# Using df_clean, when does sky_condition contain 'OVC'?
is_sky_overcast = df2_clean['sky_condition'].str.contains('OVC')
# Filter df_clean using is_sky_overcast
overcast = df2_clean.loc[is_sky_overcast]
# Resample overcast by day then calculate the max
overcast_daily_max = overcast.resample('D').max()
# Calculate the mean of sunny_daily_max (numeric columns only)
sunny_daily_max_mean = sunny_daily_max.mean(numeric_only=True)
# Calculate the mean of overcast_daily_max (numeric columns only)
overcast_daily_max_mean = overcast_daily_max.mean(numeric_only=True)
# Print the difference (sunny minus overcast)
print(sunny_daily_max_mean-overcast_daily_max_mean)
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Wban 0.000000
StationType 0.000000
dry_bulb_faren 6.504304
dew_point_faren -4.339286
wind_speed -3.246062
dtype: float64
The average daily maximum dry bulb temperature was 6.5 degrees Fahrenheit higher on sunny days compared to overcast days.
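The same Boolean selection and daily-maximum logic on a tiny made-up table, restricted to the temperature column. Note that == matches 'CLR' exactly, while .str.contains('OVC') also matches compound codes:

```python
import pandas as pd

# Hypothetical observations: two on a clear day, two on an overcast day
idx = pd.to_datetime([
    "2011-01-01 08:00", "2011-01-01 14:00",
    "2011-01-02 08:00", "2011-01-02 14:00",
])
toy = pd.DataFrame(
    {
        "sky_condition": ["CLR", "CLR", "OVC045", "BKN030 OVC050"],
        "dry_bulb_faren": [50.0, 70.0, 45.0, 60.0],
    },
    index=idx,
)

# Boolean masks for clear and overcast observations
is_clear = toy["sky_condition"] == "CLR"
is_overcast = toy["sky_condition"].str.contains("OVC")

# Daily maximum temperature within each group
sunny_max = toy.loc[is_clear, "dry_bulb_faren"].resample("D").max()
overcast_max = toy.loc[is_overcast, "dry_bulb_faren"].resample("D").max()

# Difference of the mean daily maxima (sunny minus overcast)
diff = sunny_max.mean() - overcast_max.mean()
print(diff)
```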
Visibility and temperature
Your job is to plot the weekly average temperature and visibility as subplots.
# Import matplotlib.pyplot as plt
import matplotlib.pyplot as plt
# Select the visibility and dry_bulb_faren columns and resample them: weekly_mean
weekly_mean = df2_clean[['visibility','dry_bulb_faren']].resample('W').mean()
# Print the output of weekly_mean.corr()
print(weekly_mean.corr())
# Plot weekly_mean with subplots=True
weekly_mean.plot(subplots=True)
plt.show()
Computing the fraction of sunny hours
# Using df_clean, when is sky_condition 'CLR'?
is_sky_clear = df2_clean['sky_condition']=='CLR'
# Resample is_sky_clear by day
resampled = is_sky_clear.resample('D')
# Calculate the number of sunny hours per day
sunny_hours = resampled.sum()
# Calculate the number of measured hours per day
total_hours = resampled.count()
# Calculate the fraction of hours per day that were sunny
sunny_fraction = sunny_hours/total_hours
sunny_fraction.plot(kind='box')
plt.show()
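A sketch of the sum/count idea on made-up observations: summing a Boolean Series counts the True values, and count() gives the number of observations, so their ratio is the sunny fraction per day.

```python
import pandas as pd

# Hypothetical observations: four on the first day, two on the second
idx = pd.to_datetime([
    "2011-01-01 00:00", "2011-01-01 06:00",
    "2011-01-01 12:00", "2011-01-01 18:00",
    "2011-01-02 00:00", "2011-01-02 12:00",
])
sky = pd.Series(["CLR", "CLR", "OVC", "CLR", "OVC", "CLR"], index=idx)

# True where the sky is clear
is_clear = sky == "CLR"

# Group by day, then divide the number of clear observations by the total
resampled = is_clear.resample("D")
fraction = resampled.sum() / resampled.count()
print(fraction)
```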
Dew point and temperature
Dew point is a measure of relative humidity based on pressure and temperature. A dew point above 65 is considered uncomfortable while a temperature above 90 is also considered uncomfortable.
In this exercise, you will explore the maximum temperature and dew point of each month. The columns of interest are 'dew_point_faren' and 'dry_bulb_faren'. After resampling them appropriately to get the maximum temperature and dew point in each month, generate a histogram of these values as subplots.
# Resample dew_point_faren and dry_bulb_faren by Month, aggregating the maximum values: monthly_max
monthly_max = df2_clean[['dew_point_faren','dry_bulb_faren']].resample('M').max()
# Generate a histogram with bins=8, alpha=0.5, subplots=True
monthly_max.plot(kind='hist',bins=8,alpha=0.5,subplots=True)
# Show the plot
plt.show()
Probability of high temperatures: CDF
We already know that 2011 was hotter than the climate normals for the previous thirty years. In this final exercise, you will compare the maximum temperature in August 2011 against that of the August 2010 climate normals. More specifically, you will use a CDF plot to determine the probability of the 2011 daily maximum temperature in August being above the 2010 climate normal value. To do this, you will leverage the data manipulation, filtering, resampling, and visualization skills you have acquired throughout this course.
The two DataFrames df_clean and df_climate are available in the workspace. Your job is to select the maximum temperature in August in df_climate, and then maximum daily temperatures in August 2011. You will then filter out the days in August 2011 that were above the August 2010 maximum, and use this to construct a CDF plot.
# Extract the maximum temperature in August 2010 from df_climate: august_max
august_max = df.loc['2010-Aug','Temperature'].max()
print(august_max)
# Resample August 2011 temps in df_clean by day & aggregate the max value: august_2011
august_2011 = df2_clean.loc['2011-Aug','dry_bulb_faren'].resample('D').max()
# Filter for days in august_2011 where the value exceeds august_max: august_2011_high
august_2011_high = august_2011.loc[august_2011 > august_max]
# Construct a CDF of august_2011_high
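# (density=True replaces the normed=True argument, which recent matplotlib versions no longer accept)
august_2011_high.plot(kind='hist', density=True, cumulative=True, bins=25)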
# Display the plot
plt.show()
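The histogram-based CDF can be cross-checked numerically. A minimal sketch with made-up daily maxima and a made-up threshold standing in for august_max; it computes the fraction of days above the threshold and an empirical CDF of those days with NumPy instead of a plot:

```python
import numpy as np
import pandas as pd

# Hypothetical daily maximum temperatures and threshold (not the real data)
daily_max = pd.Series([96.0, 101.0, 99.0, 104.0, 97.0])
threshold = 98.0

# Days that exceed the threshold, as in august_2011_high
high = daily_max[daily_max > threshold]

# Share of all days that were above the threshold
fraction_above = len(high) / len(daily_max)

# Empirical CDF of the hot days: sorted values against cumulative share
x = np.sort(high.values)
ecdf = np.arange(1, len(x) + 1) / len(x)

print(fraction_above)
print(list(zip(x, ecdf)))
```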