1. Importing the Data
import pandas  # running in an IPython/Jupyter notebook
titanic = pandas.read_csv("titanic_train.csv")
titanic.head(5)
[Figure: preview of the data]
As shown above, the dataset has 12 fields, with the following meanings:
- PassengerId: integer; the passenger's ID, a monotonically increasing index with no predictive value
- Survived: integer; whether the passenger survived (0 = died, 1 = survived); convenient to handle as a factor variable
- Pclass: integer; the passenger's socio-economic class (1 = Upper, 2 = Middle, 3 = Lower)
- Name: string; besides surname and given name, it contains Western-style titles such as Mr., Mrs., and Dr.
- Sex: string; the passenger's sex, well suited to conversion into a factor variable
- Age: integer; the passenger's age; contains missing values (confirmed by the check after this list)
- SibSp: integer; the number of siblings and spouses aboard (Sib stands for sibling, Sp for spouse)
- Parch: integer; the number of parents and children aboard (Par stands for parent, Ch for child)
- Ticket: string; the passenger's ticket number
- Fare: numeric; the ticket fare
- Cabin: string; the passenger's cabin; contains missing values
- Embarked: string; the port of embarkation, well suited to conversion into a factor variable
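Before preprocessing, it is worth confirming which fields actually contain missing values and how each column's type was inferred. A minimal sketch (not in the original post), run against the DataFrame loaded above:

# Count missing values per column; Age, Cabin, and Embarked should be non-zero
print(titanic.isnull().sum())
# Inspect the dtype pandas inferred for each column
print(titanic.dtypes)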
2. Data Preprocessing
2.1 Descriptive Statistics
titanic.describe()
[Figure: descriptive statistics output]
We can see that the Age field has missing values; we fill them with the median age:
titanic["Age"] = titanic["Age"].fillna(titanic["Age"].median())
2.2 Converting Non-numeric Fields to Numeric Codes
# Encode male as 0 and female as 1
titanic.loc[titanic["Sex"] == "male", "Sex"] = 0
titanic.loc[titanic["Sex"] == "female", "Sex"] = 1
print(titanic["Embarked"].unique())
# Fill missing Embarked values with 'S', the most common port
titanic["Embarked"] = titanic["Embarked"].fillna('S')
# Encode S, C, and Q as 0, 1, and 2 respectively
titanic.loc[titanic["Embarked"] == "S", "Embarked"] = 0
titanic.loc[titanic["Embarked"] == "C", "Embarked"] = 1
titanic.loc[titanic["Embarked"] == "Q", "Embarked"] = 2
Preview of the processed data:
[Figure: processed data]
2.3 Splitting the Training Data for Cross-Validation
from sklearn.model_selection import KFold  # the old cross_validation module has been replaced by model_selection
# the seven predictor features
predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]
3. Logistic Regression
# Import the logistic regression implementation from sklearn
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
alg = LogisticRegression(random_state=1)
scores = model_selection.cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=3)
print(scores.mean())
Output:
[Figure: logistic regression cross-validation score]
The model reaches an accuracy of roughly 78%, a decent result.
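To see which features drive the prediction, one can fit the model on the full training set and inspect the learned coefficients. A sketch, not part of the original walkthrough:

alg.fit(titanic[predictors], titanic["Survived"])
# One coefficient per predictor; the sign indicates the direction of influence
for name, coef in zip(predictors, alg.coef_[0]):
    print(name, round(coef, 3))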
4. Random Forest
from sklearn import model_selection
from sklearn.ensemble import RandomForestClassifier
predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]
# Initialize our algorithm with the default parameters
# n_estimators is the number of trees we want to make
# min_samples_split is the minimum number of rows we need to make a split
# min_samples_leaf is the minimum number of samples we can have at the place where a tree branch ends (the bottom points of the tree)
alg = RandomForestClassifier(random_state=1, n_estimators=10, min_samples_split=2, min_samples_leaf=1)
# Compute the accuracy score for all the cross validation folds. (much simpler than what we did before!)
kf = model_selection.KFold(n_splits=3, shuffle=True, random_state=1)
scores = model_selection.cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=kf)
# Take the mean of the scores (because we have one for each fold)
print(scores.mean())
Output:
[Figure: random forest cross-validation score]
The random forest reaches an accuracy of about 80%, outperforming logistic regression.
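Random forests also expose a per-feature importance score; fitting on the full training set and printing it is a quick way to compare features (a sketch, not in the original post):

alg.fit(titanic[predictors], titanic["Survived"])
# Importances sum to 1; larger values mean the feature contributed more to the splits
for name, imp in zip(predictors, alg.feature_importances_):
    print(name, round(imp, 3))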
4.1 Tuning the Random Forest Parameters
# Tune the parameters: more trees and stricter split/leaf thresholds
alg = RandomForestClassifier(random_state=1, n_estimators=100, min_samples_split=4, min_samples_leaf=2)
# Compute the accuracy score for all the cross validation folds. (much simpler than what we did before!)
kf = model_selection.KFold(n_splits=3, shuffle=True, random_state=1)
scores = model_selection.cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=kf)
# Take the mean of the scores (because we have one for each fold)
print(scores.mean())
Output:
[Figure: cross-validation score after parameter tuning]
This shows that tuning the parameters improves the model's predictive accuracy.
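Rather than trying parameter values by hand, a small grid search automates the tuning. A sketch using sklearn's GridSearchCV, with a parameter grid chosen purely for illustration:

from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [10, 50, 100],
    "min_samples_split": [2, 4, 8],
    "min_samples_leaf": [1, 2, 4],
}
# Exhaustively evaluate every combination with 3-fold cross-validation
search = GridSearchCV(RandomForestClassifier(random_state=1), param_grid, cv=3)
search.fit(titanic[predictors], titanic["Survived"])
print(search.best_params_, search.best_score_)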