Inspecting the coefficient of each feature in a trained model helps with feature selection. The sections below use a different method for each feature type to recover those coefficients.
# Uses the model and data from http://www.reibang.com/p/20456b512fa7
# Assume the data is already preprocessed, so we train the model and predict directly
from itertools import chain
from pyspark.ml import Pipeline
# The raw data looks like this
births.show(3)
+----------------------+-----------+----------------+------------------+----------+---------+---------+---------+----------------+-----------------+----------------------+------------------+------------+-------------+------------+-------------+------------------+
|INFANT_ALIVE_AT_REPORT|BIRTH_PLACE|MOTHER_AGE_YEARS|FATHER_COMBINE_AGE|CIG_BEFORE|CIG_1_TRI|CIG_2_TRI|CIG_3_TRI|MOTHER_HEIGHT_IN|MOTHER_PRE_WEIGHT|MOTHER_DELIVERY_WEIGHT|MOTHER_WEIGHT_GAIN|DIABETES_PRE|DIABETES_GEST|HYP_TENS_PRE|HYP_TENS_GEST|PREV_BIRTH_PRETERM|
+----------------------+-----------+----------------+------------------+----------+---------+---------+---------+----------------+-----------------+----------------------+------------------+------------+-------------+------------+-------------+------------------+
| 0| 1| 29| 99| 0| 0| 0| 0| 99| 999| 999| 99| 0| 0| 0| 0| 0|
| 0| 1| 22| 29| 0| 0| 0| 0| 65| 180| 198| 18| 0| 0| 0| 0| 0|
| 0| 1| 38| 40| 0| 0| 0| 0| 63| 155| 167| 12| 0| 0| 0| 0| 0|
+----------------------+-----------+----------------+------------------+----------+---------+---------+---------+----------------+-----------------+----------------------+------------------+------------+-------------+------------+-------------+------------------+
# Note: before training, all features must be merged into a single column with VectorAssembler
pipeline = Pipeline(stages=[encoder, featuresCreator, logistic])
model = pipeline.fit(birth_train)
test_res = model.transform(birth_test)
lrm = model.stages[-1]
# Get the (index, name) pair of every feature from the metadata of the features column
attrs = sorted(
    (attr["idx"], attr["name"])
    for attr in chain(*test_res
                      .schema[lrm.summary.featuresCol]
                      .metadata["ml_attr"]["attrs"].values()))
print(attrs)
# Output
[(0, 'BIRTH_PLACE_VEC_0'),
(1, 'BIRTH_PLACE_VEC_1'),
(2, 'BIRTH_PLACE_VEC_2'),
(3, 'BIRTH_PLACE_VEC_3'),
(4, 'BIRTH_PLACE_VEC_4'),
(5, 'BIRTH_PLACE_VEC_5'),
(6, 'BIRTH_PLACE_VEC_6'),
(7, 'BIRTH_PLACE_VEC_7'),
(8, 'BIRTH_PLACE_VEC_8'),
(9, 'MOTHER_AGE_YEARS'),
(10, 'FATHER_COMBINE_AGE'),
(11, 'CIG_BEFORE'),
(12, 'CIG_1_TRI'),
(13, 'CIG_2_TRI'),
(14, 'CIG_3_TRI'),
(15, 'MOTHER_HEIGHT_IN'),
(16, 'MOTHER_PRE_WEIGHT'),
(17, 'MOTHER_DELIVERY_WEIGHT'),
(18, 'MOTHER_WEIGHT_GAIN'),
(19, 'DIABETES_PRE'),
(20, 'DIABETES_GEST'),
(21, 'HYP_TENS_PRE'),
(22, 'HYP_TENS_GEST'),
(23, 'PREV_BIRTH_PRETERM')]
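The chain(...) expression above can look opaque, so here is a minimal sketch of what it flattens. The dictionary below is hand-written and only assumed to mirror the shape of the ml_attr metadata (attributes grouped by type, each carrying idx and name); it is not read from a real DataFrame.

```python
from itertools import chain

# Hypothetical metadata, shaped like the "ml_attr" entry of an assembled
# features column; the names are a subset of the output above.
metadata = {
    "ml_attr": {
        "attrs": {
            "numeric": [
                {"idx": 9, "name": "MOTHER_AGE_YEARS"},
                {"idx": 10, "name": "FATHER_COMBINE_AGE"},
            ],
            "binary": [
                {"idx": 0, "name": "BIRTH_PLACE_VEC_0"},
                {"idx": 1, "name": "BIRTH_PLACE_VEC_1"},
            ],
        }
    }
}

# .values() yields one list per attribute type; chain(*...) joins them
# into a single stream, and sorting by idx restores the vector order.
groups = metadata["ml_attr"]["attrs"].values()
attrs_demo = sorted((a["idx"], a["name"]) for a in chain(*groups))
print(attrs_demo)
```

Sorting matters because the attributes are grouped by type, not stored in vector order.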
# Pair each feature name with its coefficient
feats_coef = [(name, lrm.coefficients[idx]) for idx, name in attrs]
print(feats_coef)
[('BIRTH_PLACE_VEC_0', 0.0),
('BIRTH_PLACE_VEC_1', 0.594420849821937),
('BIRTH_PLACE_VEC_2', 2.4075589670913335),
('BIRTH_PLACE_VEC_3', 1.7823125440410161),
('BIRTH_PLACE_VEC_4', -1.6531133349571725),
('BIRTH_PLACE_VEC_5', -0.5495784312261248),
('BIRTH_PLACE_VEC_6', -1.7332912701009395),
('BIRTH_PLACE_VEC_7', 0.039713396666346504),
('BIRTH_PLACE_VEC_8', 0.0),
('MOTHER_AGE_YEARS', 0.00576202997456978),
('FATHER_COMBINE_AGE', -0.01461223060174637),
('CIG_BEFORE', 0.011062646656450726),
('CIG_1_TRI', 0.0080557042396814),
('CIG_2_TRI', 0.004632194351793351),
('CIG_3_TRI', 0.021007970934441053),
('MOTHER_HEIGHT_IN', -0.0010835415347563793),
('MOTHER_PRE_WEIGHT', -0.002190453970910452),
('MOTHER_DELIVERY_WEIGHT', -0.0011442841260634116),
('MOTHER_WEIGHT_GAIN', 0.02308236363565165),
('DIABETES_PRE', -0.9841689991671982),
('DIABETES_GEST', 0.7913093211204729),
('HYP_TENS_PRE', -0.2552870610582304),
('HYP_TENS_GEST', 0.26936315771969194),
('PREV_BIRTH_PRETERM', -1.2085697819317305)]
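For feature selection it is often handy to rank these pairs by magnitude. The sketch below does that on a few pairs copied from the output above; rank_by_magnitude is a hypothetical helper name, not part of PySpark. Raw coefficients depend on each feature's scale, so the ranking is only a rough guide.

```python
# A few (name, coefficient) pairs copied from the output above.
feats_coef_sample = [
    ("BIRTH_PLACE_VEC_2", 2.4075589670913335),
    ("MOTHER_AGE_YEARS", 0.00576202997456978),
    ("DIABETES_PRE", -0.9841689991671982),
    ("PREV_BIRTH_PRETERM", -1.2085697819317305),
    ("BIRTH_PLACE_VEC_0", 0.0),
]


def rank_by_magnitude(pairs):
    """Sort (name, coefficient) pairs by absolute coefficient, largest first."""
    return sorted(pairs, key=lambda p: abs(p[1]), reverse=True)


ranked = rank_by_magnitude(feats_coef_sample)
for name, coef in ranked:
    print(f"{name:22s} {coef: .4f}")
```

Features whose coefficient is exactly 0.0, such as BIRTH_PLACE_VEC_0, end up at the bottom and are natural candidates for removal.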
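The method above shows the coefficient of every feature in a model, which supports feature selection, but the summary attribute currently works only for binary classification. In addition, most of the features in the data above are numeric. In practice some features are extracted from text and must be converted into term-count vectors with CountVectorizer. In that case the following approach gives the coefficient of each term: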
# The data looks like this; channel is the label, os and name are the features
df.show(4)
+-------+-------------------------------------------------------------------------------------+-------+
|os |name |channel|
+-------+-------------------------------------------------------------------------------------+-------+
|iOS |-中國X檔案:馴火奇人.mp4-娛樂-高清正版視頻在線觀看–愛奇藝 |綜藝 |
|android|0001.土豆網-錫劇新版全本《珍珠塔》--周東亮董云華許美-綜藝-高清正版視頻在線觀看–愛奇藝|綜藝 |
|iOS |0051彝族麗江打跳 (16)_baofeng-娛樂-高清正版視頻在線觀看–愛奇藝 |綜藝 |
|iOS |10歲男孩從軍 沒想到竟是個神槍狙擊手 男子看傻了-電視劇-高清正版視頻在線觀看–愛奇藝 |電視劇 |
+-------+-------------------------------------------------------------------------------------+-------+
### Approach 1: put all terms into one column, then vectorize with CountVectorizer
def text2terms(sentence):
    '''Extract keywords with TextRank.'''
    import jieba.analyse
    terms = jieba.analyse.textrank(sentence, topK=20, withWeight=False, allowPOS=('ns', 'n', 'vn', 'v'))
    if not terms:  # if TextRank returns nothing, fall back to TF-IDF keyword extraction
        terms = jieba.analyse.extract_tags(sentence, topK=20, withWeight=False, allowPOS=())
    numerals = ['一', '二', '三', '四', '五', '六', '七', '八', '九', '十', 'Ⅰ', 'Ⅱ', 'Ⅲ', 'Ⅳ', 'Ⅴ', 'Ⅵ', 'Ⅶ', 'Ⅷ', 'Ⅸ']
    # Build a new list rather than calling terms.remove() while looping over terms
    return [t for t in terms if not (t.isnumeric() or t in numerals)]

def get_features(row):
    # The OS and the title keywords go into one list of terms
    features = [row.os]
    features += text2terms(row.name)
    # Label 0 for the '電視劇' channel, 1 for everything else
    if row.channel == '電視劇':
        label = 0
    else:
        label = 1
    return row.channel, label, row.name, features
df1 = df.rdd.map(lambda row: get_features(row)).toDF(['channel', 'label', 'name', 'terms'])
df1.show(2, truncate=False)
+-------+-----+-------------------------------------------------------------------------------------+-----------------------------------------------------+
|channel|label|name |terms |
+-------+-----+-------------------------------------------------------------------------------------+-----------------------------------------------------+
|綜藝 |1 |-中國X檔案:馴火奇人.mp4-娛樂-高清正版視頻在線觀看–愛奇藝 |[iOS, 視頻, 正版, 馴火, 檔案, 娛樂, 奇人, 觀看, 中國]|
|綜藝 |1 |0001.土豆網-錫劇新版全本《珍珠塔》--周東亮董云華許美-綜藝-高清正版視頻在線觀看–愛奇藝|[android, 視頻, 正版, 錫劇, 珍珠, 全本, 綜藝, 觀看] |
+-------+-----+-------------------------------------------------------------------------------------+-----------------------------------------------------+
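A note on filtering numerals out of a term list, as text2terms does: removing items from a list while looping over that same list skips the element that follows each removal. The sketch below contrasts the two patterns on a made-up term list; both helper names and the sample data are hypothetical.

```python
NUMERALS = {'一', '二', '三', '四', '五', '六', '七', '八', '九', '十'}


def drop_numerals_in_place(terms):
    """Remove-while-iterating pattern: the item after each removal is skipped."""
    terms = list(terms)  # copy so the caller's list is untouched
    for t in terms:
        if t.isnumeric() or t in NUMERALS:
            terms.remove(t)
    return terms


def drop_numerals(terms):
    """Build a new list; every element is checked exactly once."""
    return [t for t in terms if not (t.isnumeric() or t in NUMERALS)]


sample = ['2019', '10', 'android']
print(drop_numerals_in_place(sample))  # '10' survives because it was skipped
print(drop_numerals(sample))
```

With two numerals in a row, the in-place version leaves the second one behind, while the list comprehension removes both.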
### Fit the model
import pyspark.ml.classification as cl
from pyspark.ml.feature import CountVectorizer
cv = CountVectorizer(inputCol='terms', outputCol='features')
cv_model = cv.fit(df1)
df1 = cv_model.transform(df1)
df2 = df1.select('label', 'features')
logistic = cl.LogisticRegression(maxIter=10,
                                 regParam=0.01,
                                 featuresCol='features',
                                 labelCol='label')
lr_model = logistic.fit(df2)
res = lr_model.transform(df2)
# List the terms in the vocabulary; only the first 10 are shown here
cv_model.vocabulary[:10]
# Output
['視頻', '觀看', '正版', 'iOS', 'wp', 'android', '娛樂', '電視劇', '片花', '綜藝']
# Coefficients of the first 10 terms
lr_model.coefficients[:10]
array([ 0.380, 0.357, 1.339, -0.182, -0.250, -0.587, 2.607, -2.313,
-0.966, 3.085])
# Put the terms and their coefficients together
for i, j in zip(cv_model.vocabulary[:10], lr_model.coefficients[:10]):
    print(i, j)
視頻 0.37996859807875527
觀看 0.3567728962448092
正版 1.3386805525611496
iOS -0.18176377875140984
wp -0.2501881651132442
android -0.5865211142654886
娛樂 2.6067191688211433
電視劇 -2.3126880551420914
片花 -0.966167767504617
綜藝 3.0854430292662474
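Because label 0 was assigned to the 電視劇 channel and label 1 to everything else, the sign of a coefficient shows which side a term pushes toward. The sketch below splits the ten terms by sign using the rounded values printed above; strongest_terms is a hypothetical helper, and with a real model the inputs would be cv_model.vocabulary and lr_model.coefficients.

```python
# First 10 vocabulary terms and their rounded coefficients from the output above.
vocab = ['視頻', '觀看', '正版', 'iOS', 'wp', 'android', '娛樂', '電視劇', '片花', '綜藝']
coefs = [0.380, 0.357, 1.339, -0.182, -0.250, -0.587, 2.607, -2.313, -0.966, 3.085]


def strongest_terms(vocabulary, coefficients, k=3):
    """Return the k most positive and the k most negative (term, coef) pairs."""
    pairs = sorted(zip(vocabulary, coefficients), key=lambda p: p[1])
    most_positive = pairs[-k:][::-1]  # largest coefficient first
    most_negative = pairs[:k]         # smallest coefficient first
    return most_positive, most_negative


toward_label_1, toward_label_0 = strongest_terms(vocab, coefs)
print("toward label 1:", toward_label_1)
print("toward label 0:", toward_label_0)
```

The term 電視劇 itself has the most negative coefficient, which is consistent with label 0 standing for that channel.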
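The example above shows how to get the coefficient of each term in the term vector. The key is the vocabulary attribute of CountVectorizerModel, which lists the terms in vector order so that each term can be paired with its coefficient. Note that the earlier summary-based approach for reading feature names does not work here: the returned feature names are empty, possibly because all the original terms sit in a single column.
Next, we put the os and name features into two separate columns, vectorize each one, and then combine them to train the model: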
def get_features2(row):
    terms = text2terms(row.name)
    if row.channel == '電視劇':
        label = 0
    else:
        label = 1
    # os is wrapped in a list so that CountVectorizer can consume it
    return row.channel, label, [row.os], terms
df4 = df.rdd.map(lambda x: get_features2(x)).toDF(['channel', 'label', 'os', 'terms'])
df4.show(2)
# Output
+-------+-----+---------+------------------------------------------------+
|channel|label|os |terms |
+-------+-----+---------+------------------------------------------------+
|綜藝 |1 |[iOS] |[視頻, 正版, 馴火, 檔案, 娛樂, 奇人, 觀看, 中國]|
|綜藝 |1 |[android]|[視頻, 正版, 錫劇, 珍珠, 全本, 綜藝, 觀看] |
+-------+-----+---------+------------------------------------------------+
### Vectorize each column, merge the vectors, and finally fit the model
from pyspark.ml.feature import VectorAssembler
cv1 = CountVectorizer(inputCol='os', outputCol='os_vec')
cv_os = cv1.fit(df4)
df5 = cv_os.transform(df4)
cv2 = CountVectorizer(inputCol='terms', outputCol='terms_vec')
cv_term = cv2.fit(df5)
df6 = cv_term.transform(df5)
assembler = VectorAssembler(inputCols=['os_vec', 'terms_vec'], outputCol='features')
df7 = assembler.transform(df6)
df7.show()
# Output
+-------+-----+---------+----------------------------+-------------+--------------------+--------------------+
|channel|label| os| terms| os_vec| terms_vec| features|
+-------+-----+---------+----------------------------+-------------+--------------------+--------------------+
| 綜藝| 1| [iOS]|[視頻, 正版, 馴火, 檔案, ...|(3,[0],[1.0])|(888,[0,1,2,3,9,1...|(891,[0,3,4,5,6,1...|
| 綜藝| 1|[android]|[視頻, 正版, 錫劇, 珍珠, ...|(3,[2],[1.0])|(888,[0,1,2,6,49,...|(891,[2,3,4,5,9,5...|
+-------+-----+---------+----------------------------+-------------+--------------------+--------------------+
logistic = cl.LogisticRegression(maxIter=10,
                                 regParam=0.01,
                                 featuresCol='features',
                                 labelCol='label')
lr2 = logistic.fit(df7)
res2 = lr2.transform(df7)
# Read the (index, name) pairs from the metadata of the assembled column
attrs = sorted(
    (attr["idx"], attr["name"])
    for attr in chain(*res2
                      .schema['features']
                      .metadata["ml_attr"]["attrs"].values()))
for i, j in zip(attrs[:10], lr2.coefficients[:10]):
    print(i, j)
# Output
(0, 'os_vec_0') -0.18176377875140926
(1, 'os_vec_1') -0.25018816511324415
(2, 'os_vec_2') -0.5865211142654884
(3, 'terms_vec_0') 0.37996859807875594
(4, 'terms_vec_1') 0.35677289624480935
(5, 'terms_vec_2') 1.33868055256115
(6, 'terms_vec_3') 2.606719168821142
(7, 'terms_vec_4') -2.312688055142091
(8, 'terms_vec_5') -0.9661677675046166
(9, 'terms_vec_6') 3.0854430292662482
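It is not obvious at first glance, but comparing the numbers shows that the result matches the one above: the three os_vec coefficients equal those of iOS, wp and android, and the terms_vec coefficients equal those of the remaining terms in the same order. The drawback is that the exact name of each feature is not available here.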
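One way to recover readable names in this two-column setup is to concatenate the vocabularies in the same order as the assembler's input columns. The sketch below is plain Python with short vocabulary prefixes inferred from the outputs above (not the full 888 terms); assembled_feature_names is a hypothetical helper, and with real models the lists would come from cv_os.vocabulary and cv_term.vocabulary.

```python
def assembled_feature_names(columns):
    """columns: list of (column name, vocabulary) in VectorAssembler input order."""
    names = []
    for col_name, vocabulary in columns:
        names.extend(f"{col_name}-{term}" for term in vocabulary)
    return names


# Vocabulary prefixes inferred from the outputs above (assumed, truncated).
os_vocab = ['iOS', 'wp', 'android']
terms_vocab = ['視頻', '觀看', '正版', '娛樂', '電視劇', '片花', '綜藝']

# First 10 coefficients of lr2, copied from the output above.
coefs = [-0.18176377875140926, -0.25018816511324415, -0.5865211142654884,
         0.37996859807875594, 0.35677289624480935, 1.33868055256115,
         2.606719168821142, -2.312688055142091, -0.9661677675046166,
         3.0854430292662482]

names = assembled_feature_names([('os', os_vocab), ('terms', terms_vocab)])
for name, coef in zip(names, coefs):
    print(name, coef)
```

This relies on the assembler placing each input column's slots contiguously and in the listed order, which matches the os_vec_0..2 then terms_vec_0.. layout shown above.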
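Getting the names that correspond to the features
In day-to-day code it is common to process each column separately and then chain everything into one pipeline for fit and transform. This keeps the code simple, but it makes recovering the original feature names somewhat troublesome.
A simple example: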
pipeline = Pipeline(stages=[stringIndexer1, stringIndexer2, stringIndexer3, onehotEncoder, CounterVectorizer1, CounterVectorizer2, CounterVectorizer3])
pipe_model = pipeline.fit(data)
# Merge all features into one column so they can be fed to the model
# (concat_cols is assumed to list the output columns of the stages above)
assembler = VectorAssembler(inputCols=concat_cols, outputCol='features')
data_transformed = assembler.transform(pipe_model.transform(data))
lr = logistic.fit(data_transformed)
train_res = lr.transform(data_transformed)
How do we get the original feature name for every feature in lr?
from pyspark.ml.feature import CountVectorizerModel
attrs = sorted((attr["idx"], attr["name"]) for attr in
               chain(*train_res.schema['features'].metadata["ml_attr"]["attrs"].values()))
# First pull the OneHotEncoder features out of attrs; the suffix of each name is the original feature name
oh_features = [i for i in attrs if '_oh_' in i[1]]
# Then recover the original terms behind the CountVectorizer features
for cv in pipe_model.stages:
    # isinstance does not depend on how the stage object prints
    if isinstance(cv, CountVectorizerModel):
        start = len(oh_features)
        field_name = cv.getInputCol()
        cv_features = [(i, field_name + "-" + "_".join(w.split())) for i, w in enumerate(cv.vocabulary, start)]
        oh_features += cv_features
There is another way to get the original feature names for a StringIndexer model: the labels attribute of StringIndexerModel. However, the final features may contain one extra feature whose name ends in __unknown, and that feature is not available in labels. It does appear in attrs, so comparing the attrs extracted above against labels shows whether an extra feature ending in __unknown is present.
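The comparison described above can be sketched as follows. Everything here is illustrative: the labels, the city_oh prefix and the attribute names are made up, and the exact naming of one-hot attributes in a real pipeline should be checked against the actual attrs output.

```python
def unmatched_attrs(attr_names, labels, prefix):
    """Return the attribute names that do not correspond to any indexer label."""
    known = {f"{prefix}_{label}" for label in labels}
    return [name for name in attr_names if name not in known]


# Hypothetical values: what StringIndexerModel.labels might return ...
labels = ['beijing', 'shanghai']
# ... and the matching one-hot names found in attrs (assumed naming scheme).
attr_names = ['city_oh_beijing', 'city_oh_shanghai', 'city_oh___unknown']

extra = unmatched_attrs(attr_names, labels, prefix='city_oh')
print(extra)
```

Any name left over after the comparison, such as the one ending in __unknown here, has to be named by hand because labels does not contain it.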