Competition link
Data description:
Data
*Note: the data download becomes available after you register for the competition or join a team.
The data consists of two csv files:
train_set.csv: this set is used to train the model; each row corresponds to one article. The articles have been anonymized at both the "character" level and the "word" level. There are four columns:
the first column is the article index (id); the second is the article body represented at the "character" level, i.e. the text as separated character codes (article); the third is the representation at the "word" level, i.e. the text as separated word codes (word_seg); the fourth is the article's label (class).
Note: each number corresponds to one "character", "word", or "punctuation mark". The "character" ids and the "word" ids are numbered independently of each other!
test_set.csv: this data is used for testing. Its format is the same as train_set.csv, but it does not contain the class column.
Note: the article ids in test_set and train_set are numbered independently.
1. Reading the data
import pandas as pd
train_data = pd.read_csv(r"F:\NLP\比賽\“達(dá)觀杯”文本智能處理挑戰(zhàn)賽\new_data\train_set.csv")
# test_data = pd.read_csv(r"F:\NLP\比賽\“達(dá)觀杯”文本智能處理挑戰(zhàn)賽\new_data\test_set.csv")
2. Inspecting the data structure
train_data.columns
Index(['id', 'article', 'word_seg', 'class'], dtype='object')
Check how many classes the texts fall into
train_data["class"].unique()
array([14, 3, 12, 13, 1, 10, 19, 18, 7, 9, 4, 17, 2, 8, 6, 11, 15,
5, 16], dtype=int64)
The classes are described by numbers; there are 19 classes in total.
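A quick way to double-check this from the DataFrame loaded in step 1 (nothing is assumed here beyond train_data itself):
train_data["class"].nunique()        # 19
train_data["class"].value_counts()   # number of articles per class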
3. Approach
Build word vectors
Data preprocessing
TF-IDF
TF-IDF is a statistical method for evaluating how important a term is to a single document within a collection or corpus. A term's importance increases in proportion to the number of times it appears in that document, but decreases in proportion to how often it appears across the whole corpus. Various forms of TF-IDF weighting are widely used by search engines as a measure or ranking of the relevance between a document and a user query.
TF-IDF has two parts: "term frequency" (TF) and "inverse document frequency" (IDF).
In a given document, the term frequency (TF) is simply the number of times a given term appears in that document.
The inverse document frequency (IDF) is a measure of how informative a term is in general; a common form is IDF(t) = log(N / (df(t) + 1)), where N is the total number of documents and df(t) is the number of documents that contain term t.
The +1 is a smoothing term (it avoids division by zero for terms that appear in no document).
Application in sklearn
from sklearn.feature_extraction.text import TfidfVectorizer

# word-level tf-idf features; df_all / df_train / df_test are built from the two csv files (see the sketch below)
vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=3, max_df=0.9, sublinear_tf=True)
vectorizer.fit(df_all['word_seg'])
x_train = vectorizer.transform(df_train['word_seg'])
x_test = vectorizer.transform(df_test['word_seg'])
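The snippet above refers to df_all, df_train and df_test, which are not defined earlier in this post; a minimal sketch of how they could be prepared from the two competition csv files (paths shortened here, variable names taken from the snippet):

import pandas as pd

df_train = pd.read_csv('train_set.csv')   # columns: id, article, word_seg, class
df_test = pd.read_csv('test_set.csv')     # same columns, without class
# fit the tf-idf vocabulary and idf statistics on the train and test texts together
df_all = pd.concat([df_train[['word_seg']], df_test[['word_seg']]], ignore_index=True)
y_train = df_train['class']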
Word2vec word vectors
Word vectors (word embeddings):
A word vector represents a word as a vector. Machine learning tasks need every input quantified as a numerical representation (a dense vector, DenseVector), so that the computer's computing power can be fully exploited to produce the final result. One basic form of word vector is the one-hot representation, illustrated below.
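As a toy illustration of the one-hot form (the five-word vocabulary below is made up for this example; with a real vocabulary of hundreds of thousands of words these vectors become extremely long and sparse, which is why dense embeddings are preferred):

vocab = ['banks', 'bonds', 'stocks', 'money', 'hockey']   # hypothetical tiny vocabulary
# each word is a vector with a single 1 at its own index and 0 everywhere else
one_hot = {w: [1 if j == i else 0 for j in range(len(vocab))] for i, w in enumerate(vocab)}
print(one_hot['stocks'])   # [0, 0, 1, 0, 0]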
CBOW (continuous bag of words) and skip-gram
These are word2vec's two training modes: CBOW predicts the target word from the surrounding context of the sentence it appears in, while skip-gram works the other way around, using the target word to predict information about the original sentence, i.e. its context.
How to obtain word vectors
Methods based on singular value decomposition (SVD):
- Word-document matrix
Underlying assumption:
Related words tend to appear in the same documents. For example, banks is more related to bonds, stocks and money, and they often appear together in one document, whereas banks is unlikely to appear together with octopus, banana or hockey. A word-document matrix can therefore be built, and applying SVD to this matrix yields vector representations of the words.
- Word-word matrix
Underlying assumption:
A word's meaning is determined by its context; if two words have similar contexts, can we infer that the words themselves are very similar? Fix a context window, build the word-word co-occurrence matrix from the corpus, and obtain word vectors by applying SVD to that matrix (a sketch follows below).
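A minimal sketch of the word-word variant on a made-up toy corpus (the sentences, window size, and output dimensionality here are illustrative only):

import numpy as np
from sklearn.decomposition import TruncatedSVD

# toy corpus: each "sentence" is a list of tokens
corpus = [['banks', 'raise', 'money'],
          ['stocks', 'and', 'bonds'],
          ['banks', 'and', 'money'],
          ['hockey', 'game']]
vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# count co-occurrences within a +/-1 word window
window = 1
cooc = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                cooc[idx[w], idx[sent[j]]] += 1

# SVD of the co-occurrence matrix; each row of `vectors` is a 2-dimensional word vector
svd = TruncatedSVD(n_components=2, random_state=0)
vectors = svd.fit_transform(cooc)
for word in vocab:
    print(word, vectors[idx[word]].round(2))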
Iteration-based methods:
Most iteration-based approaches obtain word vectors by training a language model: for a reasonable sentence the model should assign a high probability, and for an unreasonable sentence a low one. Formally:
First formula, the unigram language model, which assumes each word's probability depends only on the word itself: P(w_1, w_2, ..., w_n) = ∏_{i=1}^{n} P(w_i);
Second formula, the bigram language model, which assumes each word's probability depends on the preceding word: P(w_1, w_2, ..., w_n) = ∏_{i=2}^{n} P(w_i | w_{i-1}). The question then becomes: how do we learn from a corpus the probability of the current word given its context?
By using the two modes described above: CBOW and skip-gram.
Using gensim to implement word2vec
1. Installation
# dependencies
python>=2.6
NumPy>=1.3
Scipy>=0.7
## install
pip install gensim
2. Parameters
Word2Vec(sentences=None, size=100, alpha=0.025, window=5, min_count=5, max_vocab_size=None, sample=1e-3, seed=1, workers=3, min_alpha=0.0001, sg=0, hs=0, negative=5, cbow_mean=1, hashfxn=hash, iter=5, null_word=0, trim_rule=None, sorted_vocab=1, batch_words=MAX_WORDS_IN_BATCH, compute_loss=False)
sg defines the training algorithm. By default (sg=0), CBOW is used. Otherwise (sg=1), skip-gram is employed.
size is the dimensionality of the feature vectors.
window is the maximum distance between the current and predicted word within a sentence.
alpha is the initial learning rate (will linearly drop to min_alpha as training progresses).
seed = for the random number generator. Initial vectors for each word are seeded with a hash of the concatenation of word + str(seed). Note that for a fully deterministically-reproducible run, you must also limit the model to a single worker thread, to eliminate ordering jitter from OS thread scheduling. (In Python 3, reproducibility between interpreter launches also requires use of the PYTHONHASHSEED environment variable to control hash randomization.)
min_count = ignore all words with total frequency lower than this.
max_vocab_size = limit RAM during vocabulary building; if there are more unique words than this, then prune the infrequent ones. Every 10 million word types need about 1GB of RAM. Set to None for no limit (default).
sample = threshold for configuring which higher-frequency words are randomly downsampled; default is 1e-3, useful range is (0, 1e-5).
workers = use this many worker threads to train the model (=faster training with multicore machines).
hs = if 1, hierarchical softmax will be used for model training. If set to 0 (default), and negative is non-zero, negative sampling will be used.
negative = if > 0, negative sampling will be used, the int for negative specifies how many "noise words" should be drawn (usually between 5-20). Default is 5. If set to 0, no negative sampling is used.
cbow_mean = if 0, use the sum of the context word vectors. If 1 (default), use the mean. Only applies when cbow is used.
hashfxn = hash function to use to randomly initialize weights, for increased training reproducibility. Default is Python's rudimentary built-in hash function.
iter = number of iterations (epochs) over the corpus. Default is 5.
trim_rule = vocabulary trimming rule, specifies whether certain words should remain in the vocabulary, be trimmed away, or handled using the default (discard if word count < min_count). Can be None (min_count will be used), or a callable that accepts parameters (word, count, min_count) and returns either utils.RULE_DISCARD, utils.RULE_KEEP or utils.RULE_DEFAULT. Note: The rule, if given, is only used to prune vocabulary during build_vocab() and is not stored as part of the model.
sorted_vocab = if 1 (default), sort the vocabulary by descending frequency before assigning word indexes.
batch_words = target size (in words) for batches of examples passed to worker threads (and thus cython routines). Default is 10000. (Larger batches will be passed if individual texts are longer than 10000 words, but the standard cython code truncates to that maximum.)
1. sg=1 selects the skip-gram algorithm, which is more sensitive to low-frequency words; the default sg=0 selects CBOW.
2. size is the dimensionality of the output word vectors. Too small a value causes the word mapping to suffer from collisions and hurts the results; too large a value costs memory and slows down training. Values between 100 and 200 are typical.
3. window is the maximum distance between the current word and the target word in a sentence. window=3 means looking at 3-b words before the target word and b words after it (b is drawn at random between 0 and 3).
4. min_count filters words: words whose frequency is lower than min_count are ignored. The default is 5.
5. negative and sample can be fine-tuned based on the training results; sample is the threshold above which higher-frequency words are randomly downsampled. The default is 1e-3.
6. hs=1 means hierarchical softmax is used for training; with the default hs=0 and a non-zero negative, negative sampling is used instead.
7. workers controls training parallelism; this parameter only takes effect when Cython is installed, otherwise only a single core can be used. (A call that puts these parameters together is sketched below.)
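For illustration, one way to combine the parameters above on this competition's word-level texts; the specific values are plausible choices rather than tuned results, and the old (pre-4.0) gensim parameter names from the signature above are used:

from gensim.models import Word2Vec

# each article's word_seg field is a whitespace-separated string of word ids
sentences = [doc.split() for doc in train_data['word_seg']]
model = Word2Vec(sentences, size=100, window=5, min_count=5,
                 sg=1, hs=0, negative=5, sample=1e-3, workers=4, iter=5)
# note: gensim >= 4.0 renames size to vector_size and iter to epochs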
Usage code
import gensim.models as g
from gensim.models.word2vec import LineSentence
'''The input to Word2vec is an iterator over sentences (LineSentence): the raw training corpus must be turned into an iterator that yields one sentence per iteration, where a sentence is a list of words (utf-8 strings). We then use this iterator as the input to construct gensim's built-in word2vec model object.
'''
# data/Corpus.txt is the input file
model = g.Word2Vec(LineSentence('data/Corpus.txt'), size=100, window=1, min_count=1)
#--------------------------------------------------
# save the trained model (and its word vectors) to data/vectors.bin; the binary form is convenient to keep for later work
model.save('data/vectors.bin')
# to make the trained word vectors easy to inspect, they can also be written to the plain-text file data/vectors.txt
model.wv.save_word2vec_format('data/vectors.txt', binary=False)
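Once saved, the model can be loaded back and queried; a small example (the token queried here is hypothetical and must be a word that actually occurs in Corpus.txt at least min_count times):

# reload the full model saved with model.save()
model = g.Word2Vec.load('data/vectors.bin')
word = '1024'                                 # hypothetical token from the corpus
print(model.wv[word])                         # its 100-dimensional vector
print(model.wv.most_similar(word, topn=5))    # nearest tokens by cosine similarity
# the text file can also be loaded on its own as KeyedVectors:
# from gensim.models import KeyedVectors
# vectors = KeyedVectors.load_word2vec_format('data/vectors.txt', binary=False)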
4. Prediction with KNN
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.neighbors import KNeighborsClassifier

# voc / label: assumed to be the word-level texts and class labels from train_set.csv
voc, label = train_data['word_seg'], train_data['class']
train_voc, test_voc, train_y, test_y = train_test_split(voc, label, test_size=0.1, shuffle=True)

# note: newer scikit-learn versions removed non_negative; alternate_sign=False is the closest equivalent
vectorizer = HashingVectorizer(stop_words='english', non_negative=True,
                               n_features=10000)
fea_train = vectorizer.fit_transform(train_voc)
fea_test = vectorizer.transform(test_voc)   # HashingVectorizer is stateless, so transform suffices
knnclf = KNeighborsClassifier()
knnclf.fit(fea_train, train_y)
precdict = knnclf.predict(fea_test)
Evaluation function:
from sklearn import metrics
def get_score(actual, pred):
    """
    :param actual: ground-truth labels
    :param pred: predicted labels
    :return: per-class precision, recall and F1
    """
    m_precision = metrics.precision_score(actual, pred, average=None)
    m_recall = metrics.recall_score(actual, pred, average=None)
    m_f1 = metrics.f1_score(actual, pred, average=None)
    return m_precision, m_recall, m_f1
m_precision, m_recall, m_f1 = get_score(test_y, precdict)
Results:
Precision:
array([0.37407953, 0.38492063, 0.61172902, 0.76487252, 0.75113122,
0.88423154, 0.53184713, 0.59090909, 0.82114883, 0.61748634,
0.57777778, 0.55202312, 0.48722317, 0.75415282, 0.88945578,
0.52380952, 0.64822134, 0.81324278, 0.34197731])
Overall accuracy:
0.645804050307131
Recall:
array([0.46098004, 0.69039146, 0.7202381 , 0.71808511, 0.66135458,
0.66020864, 0.52351097, 0.67651195, 0.86282579, 0.45934959,
0.65 , 0.36380952, 0.71055901, 0.65512266, 0.68997361,
0.2483871 , 0.53770492, 0.66251729, 0.39962121])
Overall recall:
0.6198670316777474
F1:
array([0.41300813, 0.49426752, 0.6615637 , 0.74074074, 0.70338983,
0.7559727 , 0.52764613, 0.63081967, 0.84147157, 0.52680653,
0.61176471, 0.43857635, 0.57806973, 0.7011583 , 0.77711738,
0.33698031, 0.58781362, 0.73018293, 0.36855895])
Overall F1:
0.6228989846281351