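1. Analysis
Chinese sentiment analysis can be done with Cilin (詞林), which has one major category (class G) covering mental activity, but it is still too simple compared with WordNet. This post therefore uses an nltk + WordNet approach, as follows:
1) Chinese word segmentation: jieba (結巴分詞)
2) Chinese-English translation: the Chinese Open Wordnet, which can be downloaded from:
http://compling.hss.ntu.edu.sg/cow/
3) Sentiment analysis: the SentiWordNet component of WordNet
4) Stop words: see the page below, with common punctuation marks added
http://blog.csdn.net/u010533386/article/details/51458591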
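Step 4 (a stop-word list extended with common punctuation) can be sketched as below. This is a minimal sketch rather than the list from the linked page: `BASE_STOPWORDS`, `CJK_PUNCTUATION`, `build_stopwords` and `filter_tokens` are hypothetical names, the four stop words are placeholders, and the token list stands in for real segmenter output.

```python
import string

# Hypothetical stop words; in practice they would be read from a file.
BASE_STOPWORDS = ["的", "了", "是", "在"]

# A few common full-width Chinese punctuation marks (not exhaustive).
CJK_PUNCTUATION = ",。!?、;:“”‘’()《》…"


def build_stopwords(words):
    """Return a set of stop words extended with punctuation marks."""
    stop = {w.strip() for w in words if w.strip()}
    stop.update(string.punctuation)   # ASCII punctuation, one character each
    stop.update(CJK_PUNCTUATION)      # full-width punctuation, one character each
    return stop


def filter_tokens(tokens, stopwords):
    """Drop stop words and whitespace-only tokens, keeping the original order."""
    return [t for t in tokens if t.strip() and t not in stopwords]


if __name__ == "__main__":
    stopwords = build_stopwords(BASE_STOPWORDS)
    tokens = ["今天", "的", "天气", "很", "好", "。", "\n"]
    print(filter_tokens(tokens, stopwords))
```

Keeping the stop words in a set makes each membership test constant time, which matters once the list grows to a few thousand entries.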
2. Code
# encoding=utf-8
# Python 3 version; needs jieba and nltk with the wordnet and sentiwordnet corpora.
import sys

import jieba
from nltk.corpus import wordnet as wn
from nltk.corpus import sentiwordnet as swn


def doSeg(filename):
    """Segment a Chinese text file with jieba and drop stop words."""
    with open(filename, encoding="utf-8") as f:
        text = f.read()
    with open("./stop_words.txt", encoding="utf-8") as f:
        stopwords = set(line.strip() for line in f)
    # Skip stop words and whitespace-only tokens.
    return [seg for seg in jieba.cut(text)
            if seg.strip() and seg not in stopwords]


def loadWordNet():
    """Load (synset, lemma) pairs from the Chinese Open Wordnet file."""
    known = set()
    with open("./cow-not-full.txt", encoding="utf-8") as f:
        for l in f:
            if l.startswith('#') or not l.strip():
                continue
            row = l.strip().split("\t")
            if len(row) == 3:
                (synset, lemma, status) = row
            elif len(row) == 2:
                (synset, lemma) = row
                status = 'Y'
            else:
                print("ill-formed line:", l.strip())
                continue  # do not reuse values left over from an earlier line
            if status in ['Y', 'O']:
                known.add((synset.strip(), lemma.strip()))
    return known


def findWordNet(known, key):
    """Return the IDs of all synsets whose lemma equals key."""
    return [synset for (synset, lemma) in known if lemma == key]


def id2ss(ID):
    """Convert an ID of the form <8-digit offset>-<pos> to an NLTK synset."""
    return wn.synset_from_pos_and_offset(ID[-1], int(ID[:8]))


def getSenti(word):
    """Look up the SentiWordNet entry for a WordNet synset."""
    return swn.senti_synset(word.name())


if __name__ == '__main__':
    known = loadWordNet()
    words = doSeg(sys.argv[1])
    n = 0
    p = 0
    for word in words:
        ll = findWordNet(known, word)
        if len(ll) != 0:
            n1 = 0.0
            p1 = 0.0
            for wid in ll:
                desc = id2ss(wid)
                swninfo = getSenti(desc)
                p1 = p1 + swninfo.pos_score()
                n1 = n1 + swninfo.neg_score()
            if p1 != 0.0 or n1 != 0.0:
                print(word, '-> n', n1 / len(ll), ', p', p1 / len(ll))
            # Add the word's average over all of its senses to the totals.
            p = p + p1 / len(ll)
            n = n + n1 / len(ll)
    print("n", n, ", p", p)
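findWordNet scans the whole set of (synset, lemma) pairs once for every segmented word, which becomes slow on long texts. One possible alternative, sketched here under assumptions, is to build a lemma-to-synsets dictionary once while reading the file. `build_index`, `split_synset_id` and `SAMPLE_LINES` are hypothetical names, and the sample lines are made up to show the tab-separated layout; they are not copied from cow-not-full.txt (the status `W` is only a stand-in for any value other than `Y` or `O`).

```python
from collections import defaultdict

# Made-up lines in the synset<TAB>lemma<TAB>status layout (illustrative only).
SAMPLE_LINES = [
    "# comment line",
    "00000001-n\t快乐\tY",
    "00000002-a\t快乐\tO",
    "00000003-n\t悲伤",
    "00000004-v\t怀疑\tW",
    "",
]


def build_index(lines):
    """Map each lemma to the set of synset IDs whose status is Y or O."""
    index = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        row = line.split("\t")
        if len(row) == 3:
            synset, lemma, status = row
        elif len(row) == 2:
            synset, lemma = row
            status = "Y"  # same default as in loadWordNet
        else:
            continue  # skip ill-formed lines
        if status in ("Y", "O"):
            index[lemma.strip()].add(synset.strip())
    return index


def split_synset_id(synset_id):
    """Split an ID such as '00000001-n' into (pos, offset)."""
    return synset_id[-1], int(synset_id[:8])


if __name__ == "__main__":
    index = build_index(SAMPLE_LINES)
    print(sorted(index["快乐"]))
    print(split_synset_id("00000003-n"))
```

With such an index, the per-word lookup is `index.get(word, set())` instead of a pass over every pair.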
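3. Open problems
1) Words produced by jieba do not map one-to-one onto the words in the Chinese WordNet
Although jieba can load a custom dictionary, some of the words it produces still have no matching sense in WordNet, for example "太后" (empress dowager) and "童子" (boy), as well as compound expressions such as "很早以前" (long ago) and "黃山" (Huangshan). Most of them are nouns and need further "learning".
The temporary workaround is to treat them as "proper nouns".
2) One word with several senses / one sense with several words
Whether the task is sentiment analysis or semantic analysis, in Chinese or in English, the mapping between words and senses has to be resolved.
The temporary workaround is to find all senses of the word and take the average of their sentiment scores. In addition, jieba can also determine the part of speech, which can serve as a further reference.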
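The averaging workaround, together with the idea of using part of speech to narrow the senses, could look roughly like the sketch below. It is not the code from section 2: `SCORES` holds hypothetical (positive, negative) values keyed by made-up synset IDs, `average_scores` is a hypothetical helper, and mapping a tagger's labels onto WordNet's part-of-speech letters is assumed to happen elsewhere.

```python
# Hypothetical (positive, negative) scores keyed by made-up synset IDs.
SCORES = {
    "00000001-n": (0.5, 0.0),
    "00000002-a": (0.75, 0.25),
    "00000003-v": (0.0, 0.5),
}


def average_scores(synset_ids, scores, pos=None):
    """Average (positive, negative) scores over a word's senses.

    If pos is given, only senses whose ID ends in that part-of-speech
    letter are used; when none match, all senses are used instead.
    """
    synset_ids = list(synset_ids)
    if pos is not None:
        filtered = [s for s in synset_ids if s.endswith("-" + pos)]
        synset_ids = filtered or synset_ids
    if not synset_ids:
        return 0.0, 0.0
    p = sum(scores[s][0] for s in synset_ids) / len(synset_ids)
    n = sum(scores[s][1] for s in synset_ids) / len(synset_ids)
    return p, n


if __name__ == "__main__":
    senses = ["00000001-n", "00000002-a", "00000003-v"]
    print(average_scores(senses, SCORES))           # plain average over all senses
    print(average_scores(senses, SCORES, pos="a"))  # adjective senses only
```

Falling back to all senses when the filter matches nothing keeps the behaviour identical to plain averaging whenever the part-of-speech tag is unhelpful.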
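3) Semantics
Semantics is the most fundamental problem. It requires analyzing sentence structure, and it also depends on the content. Long articles in particular often use devices such as "criticism first, praise afterwards" (先抑后揚) and "comparative analysis", which make the overall sentiment hard to judge.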
4. References:
1) Learning lexical scales: WordNet and SentiWordNet
http://compprag.christopherpotts.net/wordnet.html
2) SentiWordNet Interface