SIF
Smooth Inverse Frequency
- Represent a sentence as a weighted average of its word vectors;
- For a word w with estimated unigram probability p(w) and smoothing parameter a, the weight is a / (a + p(w));
- Then apply PCA/SVD and remove the first principal component (common component removal).
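A minimal NumPy sketch of this pipeline, assuming hypothetical inputs `word_vecs` (token to vector), `word_probs` (token to estimated unigram probability p(w)), and the smoothing parameter `a`:

```python
import numpy as np

def sif_embeddings(sentences, word_vecs, word_probs, a=1e-3):
    """SIF: weighted average of word vectors, then remove the first
    principal component of the resulting sentence matrix."""
    dim = len(next(iter(word_vecs.values())))
    emb = np.zeros((len(sentences), dim))
    for i, sent in enumerate(sentences):
        tokens = [t for t in sent if t in word_vecs]
        if not tokens:
            continue
        # weight each word vector by a / (a + p(w)), then average
        weights = np.array([a / (a + word_probs.get(t, 0.0)) for t in tokens])
        vectors = np.array([word_vecs[t] for t in tokens])
        emb[i] = weights @ vectors / len(tokens)
    # common component removal: project out the first principal component
    _, _, vt = np.linalg.svd(emb, full_matrices=False)
    pc = vt[0]
    return emb - np.outer(emb @ pc, pc)
```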
FastSent
We propose using the definitions found in everyday dictionaries as a means of bridging this gap between lexical and phrasal semantics. Neural language embedding models can be effectively trained to map dictionary definitions (phrases) to (lexical) representations of the words defined by those definitions. We present two applications of these architectures: reverse dictionaries that return the name of a concept given a definition or description and general-knowledge crossword question answerers. On both tasks, neural language embedding models trained on definitions from a handful of freely-available lexical resources perform as well or better than existing commercial systems that rely on significant task-specific engineering.
Skip-Thought
Skip-Thought Vectors
Code: https://github.com/ryankiros/skip-thoughts
Unsupervised learning of sentence vectors.
Dataset: exploits the continuity of text in books.
Hypothesis: sentences that share semantic and syntactic properties are mapped to similar vector representations.
OOV handling: vocabulary expansion, so that the vocabulary can cover on the order of a million words (see the sketch below).
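The expansion works by learning a linear map from a large pre-trained word2vec space into the encoder's word-embedding space; a least-squares sketch of that idea (matrix names are illustrative):

```python
import numpy as np

def fit_expansion_matrix(X, Y):
    """X: word2vec vectors for words in both vocabularies, shape (n, d_w2v).
    Y: the encoder's trained embeddings for the same words, shape (n, d_rnn).
    Solve Y ~= X @ W in the least-squares sense."""
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W  # shape (d_w2v, d_rnn)

# A word missing from the encoder vocabulary but present in word2vec
# can then be mapped into the encoder space: v_rnn = v_w2v @ W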
Model
Model structure: encoder-decoder
skip-gram: use a word to predict its surrounding context;
skip-thought: encode a sentence to predict the sentences around it.
Training corpus: the BookCorpus dataset, a collection of novels spanning 16 different genres.
Input: a sentence tuple (s_{i-1}, s_i, s_{i+1})
Encoder: feature extractor that maps s_i to the skip-thought vector h_i
Decoders: one for the previous sentence s_{i-1}, one for the next sentence s_{i+1}
Objective function: the sum of log-probabilities of the previous and next sentences conditioned on the encoder output, sum_t log P(w_{i+1}^t | w_{i+1}^{<t}, h_i) + sum_t log P(w_{i-1}^t | w_{i-1}^{<t}, h_i) (see the sketch after this list).
Downstream tasks: 8 in total: semantic relatedness, paraphrase detection, image-sentence ranking, question-type classification, and 4 benchmark sentiment and subjectivity datasets.
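A minimal PyTorch sketch of the encoder-decoder objective above. For brevity the decoders here are conditioned on the skip-thought vector only through their initial hidden state (the paper's conditional GRU injects it at every step); the 620d embeddings and 2400d hidden state follow the paper, other details are illustrative:

```python
import torch
import torch.nn as nn

class SkipThought(nn.Module):
    """GRU encoder for s_i plus two GRU decoders for s_{i-1} and s_{i+1}."""

    def __init__(self, vocab_size, emb_dim=620, hid_dim=2400):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.dec_prev = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.dec_next = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)
        self.xent = nn.CrossEntropyLoss(ignore_index=0)  # 0 = padding id

    def forward(self, s_i, s_prev, s_next):
        # encode the middle sentence into the skip-thought vector h_i
        _, h_i = self.encoder(self.emb(s_i))
        loss = 0.0
        for dec, tgt in ((self.dec_prev, s_prev), (self.dec_next, s_next)):
            # teacher forcing: feed tgt[:, :-1], predict tgt[:, 1:]
            out, _ = dec(self.emb(tgt[:, :-1]), h_i)
            logits = self.out(out)
            loss = loss + self.xent(logits.reshape(-1, logits.size(-1)),
                                    tgt[:, 1:].reshape(-1))
        return loss
```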
InferSent
Supervised Learning of Universal Sentence Representations from Natural Language Inference Data
Code: https://github.com/facebookresearch/InferSent
Supervised learning of sentence vectors.
Dataset: the Stanford Natural Language Inference (SNLI) dataset, 570k human-written English sentence pairs, each labeled with one of three relations: entailment, contradiction, neutral.
Word representation: 300d GloVe vectors
Hypothesis: the SNLI dataset is rich enough to learn general-purpose sentence representations.
Model
Three ways to combine the two sentence vectors u and v (assembled as in the sketch after this list):
- concatenation of u, v;
- element-wise product u * v;
- absolute element-wise difference |u - v|
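A short PyTorch sketch of assembling these pair features for the NLI classifier (the released model concatenates all three combinations):

```python
import torch

def pair_features(u, v):
    """Combine two sentence embeddings: concatenation, absolute
    element-wise difference, and element-wise product."""
    return torch.cat([u, v, torch.abs(u - v), u * v], dim=-1)
```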
Seven encoder architectures:
- LSTM
- GRU
- GRU_last: concatenation of the last hidden states of the forward and backward GRUs
- BiLSTM with mean pooling
- BiLSTM with max pooling
- Self-attentive network
- Hierarchical convolutional networks.
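The BiLSTM with max pooling is the best-performing encoder in the paper; a minimal PyTorch sketch, assuming pre-embedded inputs (e.g. GloVe) and illustrative dimensions:

```python
import torch
import torch.nn as nn

class BiLSTMMaxEncoder(nn.Module):
    """Bi-directional LSTM over word embeddings, max-pooled over time."""

    def __init__(self, emb_dim=300, hid_dim=2048):
        super().__init__()
        self.lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True,
                            bidirectional=True)

    def forward(self, word_embs, lengths):
        # word_embs: (batch, seq_len, emb_dim); lengths: (batch,)
        out, _ = self.lstm(word_embs)        # (batch, seq_len, 2 * hid_dim)
        # mask padded positions so they cannot win the max
        mask = (torch.arange(out.size(1), device=out.device)[None, :]
                < lengths[:, None])
        out = out.masked_fill(~mask[:, :, None], float("-inf"))
        return out.max(dim=1).values         # (batch, 2 * hid_dim)
```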
Universal Sentence Encoder
Universal Sentence Encoder
Code: https://tfhub.dev/google/universal-sentence-encoder/2
Encodes sentences by training on multiple NLP tasks, yielding general-purpose sentence representations.
Model
Training data: SNLI
Transfer tasks: MR, CR, SUBJ, MPQA, TREC, SST, STS Benchmark, WEAT
Transfer input: concatenation of (sentence, word) level embeddings
Model structure: 2 kinds of encoders:
Transformer
High accuracy, but high model complexity and compute cost.
Steps:
a. Word representation: element-wise sum of (word, word position) representations
b. Sentence representation: the encoder takes a PTB-tokenized string and outputs a 512d vector
DAN (deep averaging network)
Gives up a little accuracy, but is much more efficient.
a. Words + bigrams
b. averaged embeddings
c. a DNN produces the sentence embeddings
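A minimal PyTorch sketch of the DAN variant under these steps; the layer sizes and the lack of padding handling are simplifying assumptions:

```python
import torch
import torch.nn as nn

class DANEncoder(nn.Module):
    """Deep averaging network: average word + bigram embeddings,
    then pass the average through a feed-forward net."""

    def __init__(self, vocab_size, bigram_vocab_size, emb_dim=512):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, emb_dim)
        self.bigram_emb = nn.Embedding(bigram_vocab_size, emb_dim)
        self.ffn = nn.Sequential(
            nn.Linear(emb_dim, emb_dim), nn.ReLU(),
            nn.Linear(emb_dim, emb_dim), nn.ReLU(),
            nn.Linear(emb_dim, 512),        # 512d sentence embedding
        )

    def forward(self, word_ids, bigram_ids):
        # average all unigram and bigram embeddings of the sentence
        embs = torch.cat([self.word_emb(word_ids),
                          self.bigram_emb(bigram_ids)], dim=1)
        return self.ffn(embs.mean(dim=1))
```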
SentenceBERT
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
Model
Training data: SNLI and Multi-Genre NLI (MultiNLI), both sentence-pair datasets labeled with three classes.
BERT output vectors used (pooling strategies):
- CLS: the output vector of the [CLS] token
- MEAN: the mean of the BERT output vectors
- MAX: the element-wise max over the BERT output vectors
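A small PyTorch sketch of the three pooling strategies, using the attention mask so padding tokens do not affect MEAN/MAX (function and argument names are illustrative):

```python
import torch

def pool(token_embs, attention_mask, strategy="MEAN"):
    """token_embs: (batch, seq_len, hidden); attention_mask: (batch, seq_len)."""
    if strategy == "CLS":
        return token_embs[:, 0]                  # [CLS] is the first token
    mask = attention_mask.unsqueeze(-1).bool()   # (batch, seq_len, 1)
    if strategy == "MEAN":
        summed = (token_embs * mask).sum(dim=1)
        return summed / mask.sum(dim=1).clamp(min=1)
    if strategy == "MAX":
        return token_embs.masked_fill(~mask, float("-inf")).max(dim=1).values
    raise ValueError(strategy)
```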
Objective functions:
- Classification: o = softmax(W_t (u, v, |u - v|)), trained with cross-entropy;
- Regression: cosine similarity between u and v, trained with mean-squared-error loss;
- Triplet: max(||s_a - s_p|| - ||s_a - s_n|| + ε, 0)
where:
- s_a: anchor sentence
- s_p: positive sentence
- s_n: negative sentence
The model must make the anchor-to-positive distance smaller than the anchor-to-negative distance by at least the margin ε.
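A short PyTorch sketch of the triplet objective, assuming Euclidean distance and a margin of 1 as in the paper:

```python
import torch

def triplet_loss(s_a, s_p, s_n, eps=1.0):
    """Push the anchor closer to the positive than to the negative by eps."""
    d_pos = torch.norm(s_a - s_p, dim=-1)   # ||s_a - s_p||
    d_neg = torch.norm(s_a - s_n, dim=-1)   # ||s_a - s_n||
    return torch.relu(d_pos - d_neg + eps).mean()
```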