project checklist
frame the problem
select a performance measure
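RMSE: root mean squared error
MAE: mean absolute error
RMSE corresponds to the l2 norm and MAE to the l1 norm. The higher the norm index, the more it focuses on large values and neglects small ones, so RMSE is more sensitive to outliers than MAE. When the data is normally distributed, however, RMSE performs better.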
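The difference between the two measures can be sketched in plain Python. The target and prediction values below are made up for illustration, and the helper names rmse and mae are not from any library.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error (based on the l2 norm of the errors)."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error (based on the l1 norm of the errors)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical targets and predictions; every error has magnitude 1.
y_true = [10.0, 12.0, 14.0, 16.0]
y_pred = [11.0, 11.0, 15.0, 15.0]

# Same predictions, except the last one is off by 9 (one outlier error).
y_pred_outlier = [11.0, 11.0, 15.0, 25.0]

print(rmse(y_true, y_pred), mae(y_true, y_pred))
print(rmse(y_true, y_pred_outlier), mae(y_true, y_pred_outlier))
```

With equal-sized errors the two measures agree. With a single large error, RMSE rises more than MAE, because squaring gives the large error more weight.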
Download and load the data
Take a quick look at the data structure
data.head()
data.info()
data['attribute'].value_counts()
data.describe()
You can also plot histograms to see the distribution of each numerical attribute
data.hist(bins=50, figsize=(20, 15))
create a test set
random sampling
from sklearn.model_selection import train_test_split
train_set,test_set = train_test_split(data,test_size = 0.2, random_state = 42)
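As a small sketch of why random_state is fixed: repeating the call with the same seed should give the same split, so the test set stays the same between runs. The list of row ids below is a hypothetical stand-in for the real data.

```python
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for a dataset: ten row ids.
rows = list(range(10))

# Two calls with the same random_state produce the same split.
train_a, test_a = train_test_split(rows, test_size=0.2, random_state=42)
train_b, test_b = train_test_split(rows, test_size=0.2, random_state=42)

print(len(train_a), len(test_a))   # 8 rows for training, 2 for testing
print(test_a == test_b)            # True: the split is reproducible
```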
stratified sampling: split the data by sampling within each stratum of a grouping attribute, so the test set keeps that attribute's proportions
from sklearn.model_selection import StratifiedShuffleSplit
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in splitter.split(data, data['category']):
    strat_train_set = data.loc[train_index]
    strat_test_set = data.loc[test_index]
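A minimal sketch of what stratification buys, assuming a made-up DataFrame with an imbalanced 'category' column (70% / 20% / 10%). The column names and values are hypothetical and only serve to compare proportions.

```python
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical data: 1000 rows with an imbalanced grouping attribute.
data = pd.DataFrame({
    "value": range(1000),
    "category": ["a"] * 700 + ["b"] * 200 + ["c"] * 100,
})

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in splitter.split(data, data["category"]):
    strat_train_set = data.loc[train_index]
    strat_test_set = data.loc[test_index]

# Share of each category in the full data and in the stratified test set.
overall = data["category"].value_counts(normalize=True).sort_index()
in_test = strat_test_set["category"].value_counts(normalize=True).sort_index()
print(pd.DataFrame({"overall": overall, "stratified test": in_test}))
```

The two columns should be nearly identical. Note that split yields positional indices, so data.loc only matches data.iloc when the DataFrame still has its default integer index.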
exploring the data: discover and visualize the data to gain insights
visualizing geographical data
import matplotlib.pyplot as plt
housing.plot(kind='scatter', x='longitude', y='latitude', alpha=0.4, s=housing['population'] / 100, label='population', c='median_house_value', cmap=plt.get_cmap('jet'), colorbar=True)
plt.legend()