1: word2vec can also be used to mine sequential data, such as product-browsing or app-download logs: word2vec yields vector representations of the products or apps, which can then be used for recommendation, personalized display, and so on.
http://ginobefunny.com/post/learning_word2vec/
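A minimal sketch of this idea in Python with gensim, treating each user session as a "sentence" and each item ID as a "word". The session data and item IDs below are made-up examples, and parameter names follow gensim 4.x (older versions used size/iter instead of vector_size/epochs):

from gensim.models import Word2Vec

# Each user session is treated as a "sentence"; each item/app ID as a "word".
# These sessions are made-up examples for illustration only.
sessions = [
    ["item_101", "item_205", "item_101", "item_309"],
    ["item_205", "item_309", "item_412"],
    ["item_101", "item_412", "item_205"],
]

# Train item vectors; sg=1 (skip-gram) is a common choice for this kind of data.
model = Word2Vec(sessions, vector_size=64, window=5, min_count=1, sg=1, epochs=10)

# Items that co-occur in similar browsing contexts get similar vectors, so
# nearest neighbors can serve as simple "users also viewed" recommendations.
print(model.wv.most_similar("item_205", topn=3))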
2: Some practical usage tips
There are no universal rules of thumb, as what makes a set of word-vectors good for one purpose might not be best for other purposes. (For example, word-vecs that do best on the analogies test may not also do best at a topical-classification task that works on some mean-of-word-vectors.)
That said:
be sure to use the latest gensim; earlier versions could be significantly slower on very-short text examples (like tweets)
larger window sizes seem to position words closer according to topical-domain/field-of-use/semantic similarity; shorter window sizes position words closer based on functional/syntactic similarity (serve same role in sentence)
as your dataset gets larger, sometimes very-small values of 'window' and 'negative' are just as good (or better) and faster than larger values
as your dataset gets larger, more-aggressive frequent-word downsampling (the 'sample' parameter becoming smaller but not zero) can offer both speed and quality benefits (by spending fewer training cycles on redundant well-represented words)
it's typical to use more than one iteration, but as your data gets larger (and if you're confident word/word-senses are randomly distributed from front to back) the benefits of extra iterations will lessen
- Gordon
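The parameters Gordon mentions (window, negative, sample, plus the iteration count) map directly onto gensim's Word2Vec constructor. Below is a sketch of how these tips might be applied on a large corpus; the specific values and the corpus path are illustrative assumptions, not recommendations:

from gensim.models import Word2Vec

# Illustrative settings for a large corpus, following the tips above.
# gensim 4.x parameter names; 3.x used size/iter instead of vector_size/epochs.
model = Word2Vec(
    corpus_file="large_corpus.txt",  # assumed path; one whitespace-tokenized sentence per line
    vector_size=300,
    window=2,       # small window -> similarity leans functional/syntactic; larger -> more topical
    negative=2,     # on big data, small negative-sampling counts can match larger ones
    sample=1e-5,    # aggressive frequent-word downsampling (smaller, but not zero)
    epochs=2,       # extra iterations help less as the dataset grows
    workers=8,
)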