寫在前面:
這是我見(jiàn)過(guò)的最嚴(yán)肅的數(shù)據(jù)集,幾乎每一行數(shù)據(jù)背后都是生命和鮮血的代價(jià)疯汁。這次探索分析并不妄圖說(shuō)明什么乒融,僅僅是對(duì)數(shù)據(jù)處理能力的鍛煉。因此本次的探索分析只會(huì)展示數(shù)據(jù)該有的樣子而不會(huì)進(jìn)行太多的評(píng)價(jià)欠橘。有一句話叫“因?yàn)檎鋹?ài)和平矩肩,我們回首戰(zhàn)爭(zhēng)”。這里也是肃续,因?yàn)檎鋹?ài)生命黍檩,所以回首空難。現(xiàn)在安全的飛行是10萬(wàn)多無(wú)辜的人通過(guò)性命換來(lái)的始锚,向這些偉大的探索者致敬刽酱。
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
導(dǎo)入數(shù)據(jù)集
crash = pd.read_csv("./Airplane_Crashes_and_Fatalities_Since_1908.csv")
crash.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5268 entries, 0 to 5267
Data columns (total 13 columns):
Date 5268 non-null object
Time 3049 non-null object
Location 5248 non-null object
Operator 5250 non-null object
Flight # 1069 non-null object
Route 3562 non-null object
Type 5241 non-null object
Registration 4933 non-null object
cn/In 4040 non-null object
Aboard 5246 non-null float64
Fatalities 5256 non-null float64
Ground 5246 non-null float64
Summary 4878 non-null object
dtypes: float64(3), object(10)
memory usage: 535.1+ KB
crash = crash.drop(["Summary","cn/In","Flight #","Route","Location"],axis=1)
crash.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5268 entries, 0 to 5267
Data columns (total 8 columns):
Date 5268 non-null object
Time 3049 non-null object
Operator 5250 non-null object
Type 5241 non-null object
Registration 4933 non-null object
Aboard 5246 non-null float64
Fatalities 5256 non-null float64
Ground 5246 non-null float64
dtypes: float64(3), object(5)
memory usage: 329.3+ KB
print(crash[2200:2205])
Date Time Operator Type \
2200 03/06/1968 8:00 Military - U.S. Air Force Fairchild C-123K
2201 03/08/1968 19:18 Air Manila Fairchild F-27
2202 03/09/1968 23:20 Military - French Air Force Douglas DC6B
2203 03/19/1968 19:37 Viking Airways - Air Taxi Cessna 182
2204 03/23/1968 13:00 Fortaire Aviation - Air Taxi Brantly 305
Registration Aboard Fatalities Ground
2200 54-0590 49.0 49.0 0.0
2201 PI-C871 14.0 14.0 0.0
2202 43748 20.0 19.0 0.0
2203 N2623F 2.0 2.0 0.0
2204 N2224U 5.0 3.0 0.0
傷亡分析
傷亡排序
print(crash["Fatalities"].sum())
fatal_crash = crash[crash["Fatalities"].notnull()]
fatal_crash = fatal_crash.sort_values(by="Fatalities")
print(fatal_crash[-5:])
105479.0
Date Time Operator \
3562 06/23/1985 7:15 Air India
2726 03/03/1974 11:41 Turkish Airlines (THY)
4455 11/12/1996 18:40 Saudi Arabian Airlines / Kazastan Airlines
3568 08/12/1985 18:56 Japan Air Lines
2963 03/27/1977 17:07 Pan American World Airways / KLM
Type Registration Aboard \
3562 Boeing B-747-237B VT-EFO 329.0
2726 McDonnell Douglas DC-10-10 TC-JAV 346.0
4455 Boeing B-747-168B / Ilyushin IL-76TD HZAIH/UN-76435 349.0
3568 Boeing B-747-SR46 JA8119 524.0
2963 Boeing B-747-121 / Boeing B-747-206B N736PA/PH-BUF 644.0
Fatalities Ground
3562 329.0 0.0
2726 346.0 0.0
4455 349.0 0.0
3568 520.0 0.0
2963 583.0 0.0
- 內(nèi)特里費(fèi)空難:兩架波音-747相撞,死亡583人瞧捌,又稱世紀(jì)大空難
- 日航123空難:波音747撞富士山棵里,單架飛機(jī)失事最高死亡記錄
- 恰爾基達(dá)德里撞機(jī)事件润文,最嚴(yán)重的的空中撞機(jī)事件
- 土耳其航空981號(hào)班機(jī)空難:貨艙門未鎖定導(dǎo)致爆炸性施壓
- 印度航空182號(hào)班機(jī):恐怖襲擊
傷亡概率
print(fatal_crash["Fatalities"].sum() / fatal_crash["Aboard"].sum())
0.729700936002
print(fatal_crash[fatal_crash["Fatalities"] != 0]["Fatalities"].sum() / fatal_crash[fatal_crash["Fatalities"] != 0]["Aboard"].sum())
0.754191781605
機(jī)型處理
處理函數(shù)
type_crash = fatal_crash["Type"]
def type_handle(x):
x = str(x)
if "McDonnell Douglas" in x:
return "McDonnell Douglas"
elif "Douglas" in x:
return "Douglas"
elif "Boeing" in x:
return "Boeing"
elif "Airbus" in x:
return "Airbus"
elif "Embraer" in x:
return "Embraer"
elif "Ilyushin" in x:
return "Ilyushin"
else:
return "other"
company_crash = type_crash.map(type_handle)
print(pd.value_counts(company_crash))
other 3581
Douglas 984
Boeing 376
McDonnell Douglas 123
Ilyushin 96
Embraer 61
Airbus 35
Name: Type, dtype: int64
fatal_crash["company"] = company_crash
print(fatal_crash[:2])
Date Time Operator Type Registration \
108 10/21/1926 13:15 Imperial Airways Handley Page W-10 G-EBMS
5178 11/08/2007 8:00 Juba Air Cargo Antonov 12 ST-JUA
Aboard Fatalities Ground year month company
108 12.0 0.0 0.0 1926 10 other
5178 4.0 0.0 2.0 2007 11 other
處理結(jié)果
def airplane_counte(x):
fatal_ratio = x["Fatalities"].sum() / x["Aboard"].sum()
crash_time = x.shape[0]
fatal_num = x["Fatalities"].sum()
return pd.Series({"fatal_num":fatal_num,"crash_time":crash_time,"fatal_ratio":fatal_ratio})
company = fatal_crash.groupby(['company']).apply(airplane_counte)
print(company)
plt.close()
plt.figure(figsize=(16,4))
plt.subplot(1,3,1)
company['crash_time'].drop("other").plot(kind='bar',title="time")
plt.subplot(1,3,2)
company['fatal_num'].drop("other").plot(kind='bar',title="fatal_num")
plt.subplot(1,3,3)
company['fatal_ratio'].plot(kind='bar',title="fatal_ratio")
plt.show()
crash_time fatal_num fatal_ratio
company
Airbus 35.0 2980.0 0.510711
Boeing 376.0 18705.0 0.649434
Douglas 984.0 16899.0 0.794350
Embraer 61.0 644.0 0.779661
Ilyushin 96.0 4547.0 0.883084
McDonnell Douglas 123.0 6827.0 0.531946
other 3581.0 54877.0 0.785854
分廠商分析結(jié)果
時(shí)間分析
年
def get_year(x):
return x.split("/")[-1]
fatal_crash['year'] = fatal_crash["Date"].map(get_year)
year_fatal = fatal_crash[fatal_crash["year"] != np.NaN][["year","Fatalities","Aboard"]]
year_fatal.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5256 entries, 108 to 2963
Data columns (total 3 columns):
year 5256 non-null object
Fatalities 5256 non-null float64
Aboard 5246 non-null float64
dtypes: float64(2), object(1)
memory usage: 164.2+ KB
def year_analysis(x):
return pd.Series({"fatal_num":x["Fatalities"].sum(),"time":x.shape[0],"fatal_ratio":x["Fatalities"].sum() / x["Aboard"].sum()})
year = year_fatal.groupby(["year"]).apply(year_analysis)
year = year.sort_index()
year.info()
<class 'pandas.core.frame.DataFrame'>
Index: 98 entries, 1908 to 2009
Data columns (total 3 columns):
fatal_num 98 non-null float64
fatal_ratio 98 non-null float64
time 98 non-null float64
dtypes: float64(3)
memory usage: 3.1+ KB
plt.close()
plt.figure(figsize=(16,4))
plt.subplot(1,3,1)
year["fatal_num"].plot(title="fatal_num")
plt.subplot(1,3,2)
year["time"].plot(title="crash_time")
plt.subplot(1,3,3)
year["fatal_ratio"].plot(title="fatal_ratio")
plt.show()
按年分析結(jié)果
月
def get_month(x):
return x.split("/")[0]
fatal_crash['month'] = fatal_crash["Date"].map(get_month)
month_fatal = fatal_crash[fatal_crash["month"] != np.NaN][["month","Fatalities","Aboard"]]
month_fatal.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5256 entries, 108 to 2963
Data columns (total 3 columns):
month 5256 non-null object
Fatalities 5256 non-null float64
Aboard 5246 non-null float64
dtypes: float64(2), object(1)
memory usage: 164.2+ KB
def month_analysis(x):
return pd.Series({"fatal_num":x["Fatalities"].sum(),"time":x.shape[0],"fatal_ratio":x["Fatalities"].sum() / x["Aboard"].sum()})
month = month_fatal.groupby(["month"]).apply(year_analysis)
month = month.sort_index()
print(month)
fatal_num fatal_ratio time
month
01 8425.0 0.768354 494.0
02 7966.0 0.693057 395.0
03 8708.0 0.787057 453.0
04 6769.0 0.711852 378.0
05 7130.0 0.731807 370.0
06 7909.0 0.681399 385.0
07 9232.0 0.700349 427.0
08 10174.0 0.729162 474.0
09 10286.0 0.760349 458.0
10 8388.0 0.778758 452.0
11 10033.0 0.766522 454.0
12 10459.0 0.668478 516.0
plt.close()
plt.figure(figsize=(16,4))
plt.subplot(1,3,1)
month["fatal_num"].plot(kind="bar",title="fatal_num")
plt.subplot(1,3,2)
month["time"].plot(kind="bar",title="crash_time")
plt.subplot(1,3,3)
month["fatal_ratio"].plot(kind="bar",title="fatal_ratio")
plt.show()
按月份分析
小時(shí)
def get_hour(x):
hour = x.split(":")[0]
try:
hour = float(hour)
if int(hour) == hour and hour < 24:
return hour
else:
return np.nan
except:
return np.nan
time_fatal = fatal_crash[fatal_crash["Time"].isnull() == False]
time_fatal["hour"] = time_fatal["Time"].map(get_hour)
time_fatal.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3049 entries, 108 to 2963
Data columns (total 13 columns):
Date 3049 non-null object
Time 3049 non-null object
Operator 3046 non-null object
Type 3048 non-null object
Registration 2952 non-null object
Aboard 3049 non-null float64
Fatalities 3049 non-null float64
Ground 3046 non-null float64
airplane 3049 non-null object
company 3049 non-null object
year 3049 non-null object
month 3049 non-null object
hour 3036 non-null float64
dtypes: float64(4), object(9)
memory usage: 333.5+ KB
c:\users\qiank\appdata\local\programs\python\python35\lib\site-packages\ipykernel_launcher.py:13: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
del sys.path[0]
def hour_analysis(x):
return pd.Series({"fatal_num":x["Fatalities"].sum(),"time":x.shape[0],"fatal_ratio":x["Fatalities"].sum() / x["Aboard"].sum()})
hour = time_fatal.groupby(["hour"]).apply(hour_analysis)
plt.close()
plt.figure(figsize=(16,4))
plt.subplot(1,3,1)
hour["fatal_num"].plot(title="fatal_num")
plt.subplot(1,3,2)
hour["time"].plot(title="crash_time")
plt.subplot(1,3,3)
hour["fatal_ratio"].plot(title="fatal_ratio")
plt.show()
按時(shí)間分析