python--stanfordcorenlp
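Stanford CoreNLP is a toolkit for natural language processing. It is written in Java, but Python wrappers are now available as well. A while ago I tried using it from Python:
First, import the stanfordcorenlp package.
In the Python file: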
from stanfordcorenlp import StanfordCoreNLP
The stanfordcorenlp package contains a single class, StanfordCoreNLP.
Creating a StanfordCoreNLP object:
The constructor takes a path argument that points to the folder holding the CoreNLP jar files. That folder can be downloaded from: https://stanfordnlp.github.io/CoreNLP/download.html
I used: stanford-corenlp-full-2016-10-31
nlp = StanfordCoreNLP(path) # path is the location of the stanford-corenlp-full-2016-10-31 folder
Usage
Using it is very simple:
from stanfordcorenlp import StanfordCoreNLP
nlp = StanfordCoreNLP(path)
sentence = "i 've had the player for about 2 years now and it still performs nicely with the exception of an occasional wwhhhrrr sound from the motor ."
print(nlp.dependency_parse(sentence))
nlp.close()
However, running the script directly fails:
$ python WordFormation.py
Traceback (most recent call last):
File "WordFormation.py", line 1, in <module>
from stanfordcorenlp import StanfordCoreNLP
ModuleNotFoundError: No module named 'stanfordcorenlp'
or:
PermissionError: [Errno 1] Operation not permitted
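The first of these errors usually means that the package is not installed for the interpreter that runs the script, which is a different problem from the permission error. The following is a minimal check using only the standard library; the helper name module_available is made up for this illustration.

```python
import importlib.util
import sys


def module_available(name):
    """Return True if `name` can be imported by the running interpreter."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # Show which interpreter is in use; `python` and `sudo python` may differ.
    print("interpreter:", sys.executable)
    print("stanfordcorenlp importable:", module_available("stanfordcorenlp"))
```

If the check prints False, installing the package for that same interpreter (for example with its own pip) is the likely fix.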
In my case, running the script with root privileges worked, and the dependencies were returned:
$ sudo python WordFormation.py
[('ROOT', 0, 3), ('nsubj', 3, 1), ('aux', 3, 2), ('det', 5, 4), ('dobj', 3, 5), ('case', 9, 6), ('advmod', 8, 7), ('nummod', 9, 8), ('nmod', 5, 9), ('advmod', 3, 10), ('cc', 3, 11), ('nsubj', 14, 12), ('advmod', 14, 13), ('conj', 3, 14), ('advmod', 14, 15), ('case', 18, 16), ('det', 18, 17), ('nmod', 14, 18), ('case', 23, 19), ('det', 23, 20), ('amod', 23, 21), ('compound', 23, 22), ('nmod', 18, 23), ('case', 26, 24), ('det', 26, 25), ('nmod', 23, 26), ('punct', 3, 27)]
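Each triple is (relation, governor index, dependent index). The numbers are word positions in the sentence, counted from 1. The 0 in ('ROOT', 0, 3) does not refer to any word in the sentence.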
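To make the indices easier to read, they can be mapped back to words. The sketch below assumes that splitting on whitespace reproduces CoreNLP's tokens, which holds for this pre-tokenized sentence (27 tokens, highest index 27). In general the token list should come from CoreNLP itself. The helper name resolve is made up for illustration.

```python
sentence = ("i 've had the player for about 2 years now and it still performs "
            "nicely with the exception of an occasional wwhhhrrr sound from "
            "the motor .")

# The first few triples, copied from the output above.
triples = [('ROOT', 0, 3), ('nsubj', 3, 1), ('aux', 3, 2),
           ('det', 5, 4), ('dobj', 3, 5)]


def resolve(triples, tokens):
    """Replace 1-based word indices with the words themselves."""
    words = ["ROOT"] + tokens  # index 0 is the artificial root node
    return [(rel, words[gov], words[dep]) for rel, gov, dep in triples]


readable = resolve(triples, sentence.split())
for item in readable:
    print(item)
```

Placing "ROOT" at position 0 keeps the 1-based indices usable without any offset arithmetic.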
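StanfordCoreNLP also offers other features, such as POS tagging.
However, I could not find a method on the StanfordCoreNLP class that returns more detailed dependencies. For example, for the nominal modifier relation nmod I only get the bare label nmod, not the form that carries the preposition, such as nmod:for.
Since I could not find a suitable way to do this, I decided to try Java instead.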
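As an aside, for this particular sentence the preposition-bearing labels can be approximated from the basic triples alone. An nmod dependent that has a case child takes that child's word as a suffix, and a conj relation takes the word of the cc attached to the same governor. The sketch below is a simplified heuristic under those assumptions, not CoreNLP's own procedure, and the helper name add_markers is made up for illustration.

```python
sentence = ("i 've had the player for about 2 years now and it still performs "
            "nicely with the exception of an occasional wwhhhrrr sound from "
            "the motor .")

# Basic dependency triples, copied from the Python output above.
triples = [
    ('ROOT', 0, 3), ('nsubj', 3, 1), ('aux', 3, 2), ('det', 5, 4),
    ('dobj', 3, 5), ('case', 9, 6), ('advmod', 8, 7), ('nummod', 9, 8),
    ('nmod', 5, 9), ('advmod', 3, 10), ('cc', 3, 11), ('nsubj', 14, 12),
    ('advmod', 14, 13), ('conj', 3, 14), ('advmod', 14, 15), ('case', 18, 16),
    ('det', 18, 17), ('nmod', 14, 18), ('case', 23, 19), ('det', 23, 20),
    ('amod', 23, 21), ('compound', 23, 22), ('nmod', 18, 23), ('case', 26, 24),
    ('det', 26, 25), ('nmod', 23, 26), ('punct', 3, 27),
]


def add_markers(triples, tokens):
    """Append the case word to nmod labels and the cc word to conj labels."""
    words = ["ROOT"] + tokens  # index 0 is the artificial root node
    # Word of the `case` child of each node, keyed by that node's index.
    case_of = {gov: words[dep] for rel, gov, dep in triples if rel == "case"}
    # Word of the `cc` child of each node, keyed by that node's index.
    cc_of = {gov: words[dep] for rel, gov, dep in triples if rel == "cc"}
    out = []
    for rel, gov, dep in triples:
        if rel == "nmod" and dep in case_of:
            rel = "nmod:" + case_of[dep]
        elif rel == "conj" and gov in cc_of:
            rel = "conj:" + cc_of[gov]
        out.append((rel, gov, dep))
    return out


marked = add_markers(triples, sentence.split())
print([t for t in marked if ":" in t[0]])
```

For this sentence the heuristic yields nmod:for, conj:and, nmod:with, nmod:of and nmod:from. It does not handle multi-word prepositions or several coordinators under one governor.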
JAVA--stanfordcorenlp
With Java, the code is somewhat more verbose.
First, add the required jars.
Since my project is a Maven project,
I added the following to pom.xml:
<properties>
    <corenlp.version>3.9.2</corenlp.version>
</properties>
<dependencies>
    <dependency>
        <groupId>edu.stanford.nlp</groupId>
        <artifactId>stanford-corenlp</artifactId>
        <version>${corenlp.version}</version>
    </dependency>
    <dependency>
        <groupId>edu.stanford.nlp</groupId>
        <artifactId>stanford-corenlp</artifactId>
        <version>${corenlp.version}</version>
        <classifier>models</classifier>
    </dependency>
</dependencies>
Now run it:
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.semgraph.SemanticGraph;
import edu.stanford.nlp.semgraph.SemanticGraphCoreAnnotations;
import edu.stanford.nlp.util.CoreMap;
import java.util.Properties;

public class StanfordEnglishNlpExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Set the pipeline properties
        props.put("annotators", "tokenize,ssplit,pos,parse,depparse");
        props.put("tokenize.options", "ptb3Escaping=false");
        props.put("parse.maxlen", "10000");
        props.put("depparse.extradependencies", "SUBJ_ONLY");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props); // create the StanfordCoreNLP object
        String str = "i 've had the player for about 2 years now and it still performs nicely with the exception of an occasional wwhhhrrr sound from the motor .";
        Annotation document = new Annotation(str);
        pipeline.annotate(document);
        // Take the first (and only) sentence of the document
        CoreMap sentence = document.get(CoreAnnotations.SentencesAnnotation.class).get(0);
        // Get the dependency graph
        SemanticGraph dependency_graph = sentence.get(SemanticGraphCoreAnnotations.CollapsedCCProcessedDependenciesAnnotation.class);
        // Print the relations directly
        System.out.println("\n\nDependency Graph: " + dependency_graph.toString(SemanticGraph.OutputFormat.LIST));
    }
}
Result:
Dependency Graph: root(ROOT-0, had-3)
nsubj(had-3, i-1)
aux(had-3, 've-2)
det(player-5, the-4)
dobj(had-3, player-5)
case(years-9, for-6)
advmod(2-8, about-7)
nummod(years-9, 2-8)
nmod:for(player-5, years-9)
advmod(had-3, now-10)
cc(had-3, and-11)
nsubj(performs-14, it-12)
advmod(performs-14, still-13)
conj:and(had-3, performs-14)
advmod(performs-14, nicely-15)
case(exception-18, with-16)
det(exception-18, the-17)
nmod:with(performs-14, exception-18)
case(sound-23, of-19)
det(sound-23, an-20)
amod(sound-23, occasional-21)
compound(sound-23, wwhhhrrr-22)
nmod:of(exception-18, sound-23)
case(motor-26, from-24)
det(motor-26, the-25)
nmod:from(sound-23, motor-26)
punct(had-3, .-27)
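Relations such as nmod:with and nmod:of now show up in the output.
But there is still a problem.
Take the SemanticGraph object dependency_graph from the code above.
To get its dependency relations as objects, one would write:
List<SemanticGraphEdge> list = dependency_graph.edgeListSorted();
It then turns out that this list does not contain the root relation.
In fact, the root relation can only be obtained by separately fetching the list of roots from dependency_graph (for example with getRoots()), and that does not fit cleanly into the ordering of the other relations.
So I used another approach: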
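To make the work a bit more complete, I also perform POS tagging here.
The POS tagger needs the english-left3words-distsim.tagger file. This file is included in the stanford-corenlp-full-2016-10-31 download and can be used directly, but a version mismatch between the jar on the classpath and the tagger file may well cause errors.
In fact, the stanford-corenlp models jar that was added as a dependency already contains this tagger file, but reading it out takes a little work: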
URL url = new URL("jar:file:"+ path +
"!/edu/stanford/nlp/models/pos-tagger/english-left3words/" +
"english-left3words-distsim.tagger");
// path is the location of the jar file; the part after "!" is the tagger file's path inside the jar
JarURLConnection jarURLConnection = (JarURLConnection) url.openConnection();
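A jar file is a zip archive, so the part after "!/" in the URL above is simply an entry name inside that archive. If it is unclear where the tagger sits in a given models jar, the entries can be listed. Below is a minimal Python sketch using only the standard library. The helper name find_entries and the tiny demo archive are illustrative stand-ins, not part of CoreNLP.

```python
import os
import tempfile
import zipfile


def find_entries(jar_path, suffix):
    """Return the names of all archive entries that end with `suffix`."""
    with zipfile.ZipFile(jar_path) as jar:
        return [name for name in jar.namelist() if name.endswith(suffix)]


# Build a tiny stand-in archive so the sketch runs without the real models jar.
demo_dir = tempfile.mkdtemp()
demo_jar = os.path.join(demo_dir, "demo-models.jar")
entry = ("edu/stanford/nlp/models/pos-tagger/english-left3words/"
         "english-left3words-distsim.tagger")
with zipfile.ZipFile(demo_jar, "w") as jar:
    jar.writestr(entry, b"placeholder, not a real tagger model")

found = find_entries(demo_jar, ".tagger")
print(found)
```

Pointing find_entries at the real models jar should list the tagger entries it contains, and any of those names can be used after "!/" in the jar URL.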
The constructor of the POS tagger class, MaxentTagger, accepts either a path or an InputStream object:
MaxentTagger tagger = new MaxentTagger(jarURLConnection.getInputStream());
With that the tagger object is created successfully. The complete program:
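import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.ling.TaggedWord;
import edu.stanford.nlp.parser.nndep.DependencyParser;
import edu.stanford.nlp.process.DocumentPreprocessor;
import edu.stanford.nlp.tagger.maxent.MaxentTagger;
import edu.stanford.nlp.trees.GrammaticalStructure;
import edu.stanford.nlp.trees.TypedDependency;
import java.io.IOException;
import java.io.StringReader;
import java.net.JarURLConnection;
import java.net.URL;
import java.util.List;
import java.util.StringJoiner;
public class PosAndDependencyExample { // class name chosen for illustration
    public static void main(String[] args) throws IOException {
        String path = args[0]; // assumption: the models jar path is passed as the first argument
        URL url = new URL("jar:file:" + path +
                "!/edu/stanford/nlp/models/pos-tagger/english-left3words/" +
                "english-left3words-distsim.tagger");
        JarURLConnection jarURLConnection = (JarURLConnection) url.openConnection();
        MaxentTagger tagger = new MaxentTagger(jarURLConnection.getInputStream());
        DependencyParser parser = DependencyParser.loadFromModelFile(DependencyParser.DEFAULT_MODEL); // dependency parser
        String review = "i 've had the player for about 2 years now and it still performs nicely with the exception of an occasional wwhhhrrr sound from the motor .";
        StringJoiner result = new StringJoiner(",", "[", "]");
        DocumentPreprocessor tokenizer = new DocumentPreprocessor(new StringReader(review)); // split the text into sentences
        for (List<HasWord> sentence : tokenizer) {
            List<TaggedWord> tagged = tagger.tagSentence(sentence); // tag each word in the sentence
            GrammaticalStructure gs = parser.predict(tagged);
            List<TypedDependency> tdl = gs.typedDependenciesCCprocessed(); // get the dependency relations
            for (TypedDependency td : tdl) {
                result.add(td.reln() + "(" + td.gov() + ", " + td.dep() + ")");
            }
        }
        System.out.println(result);
    }
}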
Result:
[nsubj(had/VBD, i/FW),aux(had/VBD, 've/VBP),root(ROOT, had/VBD),det(player/NN, the/DT),dobj(had/VBD, player/NN),case(years/NNS, for/IN),advmod(2/CD, about/IN),nummod(years/NNS, 2/CD),nmod:for(player/NN, years/NNS),advmod(had/VBD, now/RB),cc(had/VBD, and/CC),nsubj(performs/VBZ, it/PRP),advmod(performs/VBZ, still/RB),conj:and(had/VBD, performs/VBZ),advmod(performs/VBZ, nicely/RB),case(exception/NN, with/IN),det(exception/NN, the/DT),nmod:with(performs/VBZ, exception/NN),case(sound/NN, of/IN),det(sound/NN, an/DT),amod(sound/NN, occasional/JJ),compound(sound/NN, wwhhhrrr/NN),nmod:of(exception/NN, sound/NN),case(motor/NN, from/IN),det(motor/NN, the/DT),nmod:from(sound/NN, motor/NN),punct(had/VBD, ./.)]