一 Using StanfordNLP features in NLTK
1 Install nltk:
Install it with the following command:
pip install nltk
2 Download the nltk data:
import nltk
nltk.download()
Because of network speed, nltk.download() can be very slow or fail outright. In that case, make a note of the Download Directory shown in the GUI that nltk.download() opens.
3 Download the packages from the NLTK website, unzip them, and rename the resulting folder to nltk_data. Placing that folder into the directory recorded in step 2 is all that is needed (the search paths can be verified with the snippet below).
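If you are unsure which directories NLTK searches, you can print them from Python; putting nltk_data in any of them works. A minimal check (the appended path is only an example):
import nltk
print(nltk.data.path)                      # directories NLTK searches for nltk_data
nltk.data.path.append("/home/nltk_data")   # optionally register a custom location (example path)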
4 Run a test statement to confirm the installation:
from nltk.book import *
5 Install the Java JDK:
Download the JDK archive, unzip it, move it to the directory of your choice, and configure the environment variables (see the sketch below for pointing NLTK at your JDK).
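If the Stanford wrappers later complain that they cannot find Java, you can also point NLTK at the JDK directly from Python before creating any Stanford objects. A minimal sketch; the JDK path below is only an example and must match your installation:
import os
os.environ['JAVAHOME'] = '/usr/lib/jvm/jdk1.8.0_181/bin/java'  # example path, adjust to your JDK
# equivalently: from nltk.internals import config_java; config_java('/usr/lib/jvm/jdk1.8.0_181/bin/java')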
6 Download the required packages from the Stanford NLP website and put them all into one folder (I used a folder named StanfordNLTK, which the paths below assume; a sketch of the resulting layout follows this list):
① Segmentation (StanfordSegmenter and StanfordTokenizer): download stanford-segmenter-2018-10-16.zip, unzip it, copy stanford-segmenter-3.9.2.jar from the extracted directory as stanford-segmenter.jar, and also take slf4j-api.jar.
② POS tagging (StanfordPOSTagger): download stanford-postagger-full-2018-10-16.zip and take stanford-postagger.jar from it.
③ Named entity recognition (StanfordNERTagger): download stanford-ner-2018-10-16.zip and take stanford-ner.jar and the classifiers folder.
④ Constituency parsing (StanfordParser) and dependency parsing (StanfordDependencyParser): download stanford-parser-full-2018-10-17.zip and take stanford-parser.jar and stanford-parser-3.9.2-models.jar.
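For orientation, the code in the following steps assumes the files end up roughly in this layout (the exact jar names depend on the version you downloaded; the models/ and data/ contents come from the packages above):
/home/StanfordNLTK/
    stanford-segmenter.jar (or a versioned name such as stanford-segmenter-4.2.0.jar)
    slf4j-api.jar
    stanford-postagger.jar
    stanford-ner.jar
    stanford-parser.jar
    stanford-parser-4.2.0-models.jar
    models/        # tagger models, e.g. english-bidirectional-distsim.tagger, chinese-distsim.tagger
    classifiers/   # NER models
    data/          # segmenter data (pku.gz, dict-chris6.ser.gz) and PCFG grammars (englishPCFG.caseless.ser.gz, ...)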
7 Chinese word segmentation (a warning may appear; it can be ignored):
from nltk.tokenize.stanford_segmenter import StanfordSegmenter
segmenter = StanfordSegmenter(
    path_to_jar="/home/StanfordNLTK/stanford-segmenter-4.2.0.jar",
    path_to_slf4j="/home/StanfordNLTK/slf4j-api.jar",
    path_to_sihan_corpora_dict="/home/StanfordNLTK/data",
    path_to_model="/home/StanfordNLTK/data/pku.gz",
    path_to_dict="/home/StanfordNLTK/data/dict-chris6.ser.gz",
    java_class='edu.stanford.nlp.ie.crf.CRFClassifier'  # recent versions apparently no longer need this; remove it if it raises an error
)
sentence = "我爱吃苹果"
result = segmenter.segment(sentence)
print(result)
8 English tokenization:
from nltk.tokenize.stanford import StanfordTokenizer
tokenizer = StanfordTokenizer(path_to_jar="/home/StanfordNLTK/stanford-parser.jar")
sent = "Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\nThanks."
print(tokenizer.tokenize(sent))
9 English POS tagging:
from nltk.tag import StanfordPOSTagger
eng_tagger = StanfordPOSTagger(model_filename='/home/StanfordNLTK/models/english-bidirectional-distsim.tagger', path_to_jar='/home/StanfordNLTK/stanford-postagger.jar')
print(eng_tagger.tag('What is the airspeed of an unladen swallow ?'.split()))
10 Chinese POS tagging:
from nltk.tag import StanfordPOSTagger
chi_tagger = StanfordPOSTagger(model_filename='/home/StanfordNLTK/models/chinese-distsim.tagger', path_to_jar='/home/StanfordNLTK/stanford-postagger.jar')
sent = "我在中国,我热爱这片土地"
result = segmenter.segment(sent)  # Chinese text must be segmented first; reuse the segmenter from step 7
print(chi_tagger.tag(result.split()))
11 English constituency parsing
from nltk.parse.stanford import StanfordParser
eng_parser = StanfordParser("/home/StanfordNLTK/stanford-parser.jar", "/home/StanfordNLTK/stanford-parser-4.2.0-models.jar", "/home/StanfordNLTK/data/englishPCFG.caseless.ser.gz")  # the English model path is optional here
print(list(eng_parser.parse("the quick brown fox jumps over the lazy dog".split())))
# To visualize the parse tree, run the following:
parser_result = eng_parser.parse("In many natural language processing tasks, words are often represented by their tfidf scores.".split())
for line in parser_result:
    for ele in line:
        ele.draw()
12 Chinese constituency parsing
from nltk.parse.stanford import StanfordParser
chi_parser = StanfordParser("/home/StanfordNLTK/stanford-parser.jar", "/home/StanfordNLTK/stanford-parser-4.2.0-models.jar", "/home/StanfordNLTK/data/chinesePCFG.ser.gz")  # a Chinese model must be given here; I chose chinesePCFG.ser.gz
sent = u'北海 已 成为 中国 对外开放 中 升起 的 一 颗 明星'
print(list(chi_parser.parse(sent.split())))
13 English dependency parsing
from nltk.parse.stanford import StanfordDependencyParser
eng_parser = StanfordDependencyParser("/home/StanfordNLTK/stanford-parser.jar", "/home/StanfordNLTK/stanford-parser-4.2.0-models.jar", "/home/StanfordNLTK/data/englishPCFG.caseless.ser.gz")  # the English model path is optional here
res = list(eng_parser.parse("the quick brown fox jumps over the lazy dog".split()))
for row in res[0].triples():
    print(row)
for triple in res[0].triples():
    print(triple[1], "(", triple[0][0], ", ", triple[2][0], ")")
# Visualization option 1: print the DOT representation and paste it into a Graphviz viewer to generate the graph:
p = next(eng_parser.parse("the quick brown fox jumps over the lazy dog".split()))  # take the first parse
print(p.to_dot())
# Visualization option 2 is also good. Note that installing graphviz locally can be problematic (usually sudo apt-get install -y graphviz libgraphviz-dev is enough; if it fails, sort out graphviz separately):
from graphviz import Source
from nltk.parse.stanford import StanfordDependencyParser
eng_parser = StanfordDependencyParser("/home/StanfordNLTK/stanford-parser.jar", "/home/StanfordNLTK/stanford-parser-4.2.0-models.jar", "/home/StanfordNLTK/data/englishPCFG.caseless.ser.gz")  # the English model path is optional here
result = list(eng_parser.parse("In many natural language processing tasks, words are often represented by their tfidf scores".split()))
dep_tree_dot_repr = result[0].to_dot()  # DOT source of the first dependency graph
source = Source(dep_tree_dot_repr, filename="dep_tree", format='png')
source.view()
Appendix: test code for the graphviz installation
from graphviz import Digraph
g = Digraph('test-graph')
g.node(name='a',color='red')
g.node(name='b',color='blue')
g.edge('a','b',color='green')
g.view()
# Visualization option 3
Replace p.to_dot() from option 1 with p.tree().draw(). Note, however, that this method cannot show the edge (dependency relation) labels.
from nltk.parse.stanford import StanfordDependencyParser
eng_parser = StanfordDependencyParser("/home/StanfordNLTK/stanford-parser.jar", "/home/StanfordNLTK/stanford-parser-4.2.0-models.jar", "/home/StanfordNLTK/data/englishPCFG.caseless.ser.gz")  # the English model path is optional here
p = next(eng_parser.parse("The detection mutation on exon-19 of EGFR gene was present in 16 patients, while the L858E point mutation on exon-21 was noted.".split()))  # take the first parse
print(p.to_dot())
p.tree().draw()
Reference:
https://stackoverflow.com/questions/39340907/converting-output-of-dependency-parsing-to-tree
14 Chinese dependency parsing
from nltk.parse.stanford import StanfordDependencyParser
chi_parser = StanfordDependencyParser("/home/StanfordNLTK/stanford-parser.jar", "/home/StanfordNLTK/stanford-parser-4.2.0-models.jar", "/home/StanfordNLTK/data/xinhuaPCFG.ser.gz")  # a Chinese model is required here; I chose xinhuaPCFG.ser.gz
res = list(chi_parser.parse(u'四川 已 成为 中国 西部 对外开放 中 升起 的 一 颗 明星'.split()))
for row in res[0].triples():
    print(row)
References:
https://www.cnblogs.com/baiboy/p/nltk1.html
https://blog.csdn.net/lizzy05/article/details/88148097
https://stackoverflow.com/questions/13883277/how-to-use-stanford-parser-in-nltk-using-python
https://stackoverflow.com/questions/34395127/stanford-nlp-parse-tree-format
https://stackoverflow.com/questions/13883277/how-to-use-stanford-parser-in-nltk-using-python/49345866#49345866
https://www.cnblogs.com/AsuraDong/p/7050859.html#%E6%A0%91%E7%8A%B6%E5%9B%BE
https://blog.csdn.net/weixin_40231212/article/details/107209028
References for visualization:
https://blog.csdn.net/qq_39971645/article/details/106326879
https://stackoverflow.com/questions/33433274/anaconda-graphviz-cant-import-after-installation/47043173#47043173
References for paper implementations:
https://www.cnblogs.com/Harukaze/p/14274720.html
二 Installing StanfordNLP on its own
(1) About StanfordCoreNLP
Online, especially on Chinese-language sites, StanfordCoreNLP is the more commonly used tool, so there are plenty of write-ups and guides about it. I list some of them here for reference:
https://stanfordnlp.github.io/CoreNLP/download.html
https://github.com/nltk/nltk/issues/2057
https://blog.csdn.net/qq_40426415/article/details/80994622
https://blog.csdn.net/lizzy05/article/details/87483539
https://zhuanlan.zhihu.com/p/62519341
https://www.jianshu.com/p/002157665bfd
https://blog.csdn.net/qq_35203425/article/details/80451243
https://cloud.tencent.com/developer/article/1537648?from=article.detail.1613017
https://blog.csdn.net/qq_39971645/article/details/106326879
https://blog.csdn.net/sunflower_sara/article/details/106475583
However, StanfordCoreNLP requires a JDK installation, which I found too much trouble, so I chose StanfordNLP instead.
(2) What is the difference between the two NLP tools?
As for how the two are related and how they differ, I again list some links below for anyone who wants to dig deeper. My understanding is that StanfordNLP is a more fully integrated package, with CoreNLP as its core; in practical use they behave the same.
https://stackoverflow.com/questions/40011896/nltk-vs-stanford-nlp
https://stackoverflow.com/questions/38855943/what-is-difference-between-core-nlp-and-stanford-nlp
https://meta.stackoverflow.com/questions/345593/whats-the-difference-between-stanford-nlp-and-corenlp
(3) From StanfordNLP to Stanza
StanfordNLP has since been renamed to Stanza. The package is simple to use, but because it is built on neural network models it depends on PyTorch, so PyTorch needs to be available first (a minimal install sketch follows the links below). If you want the earlier stanfordnlp releases, see:
https://pypi.org/project/stanfordnlp/
https://www.analyticsvidhya.com/blog/2019/02/stanfordnlp-nlp-library-python/
https://stanfordnlp.github.io/stanfordnlp/installation_usage.html#human-languages-supported-by-stanfordnlp
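pip install stanza will normally pull in PyTorch as a dependency, but if you need a specific build (for example a GPU build) it may be worth installing PyTorch yourself first; the exact command varies by platform. The generic CPU install is just:
pip install torch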
(4) Let's Begin with Stanza
① Install with pip:
pip install stanza
② In a conda environment, use the following instead:
conda install -c stanfordnlp stanza
③ If this fails with ModuleNotFoundError: No module named 'pip._internal', it can be fixed as follows (see https://blog.csdn.net/wangweiwells/article/details/88374070):
python -m ensurepip
python -m pip install --upgrade pip
④ Examples:
import stanza
stanza.download('en', processors='tokenize,lemma,pos', package=None)  # only needed the first time; can be skipped afterwards
nlp = stanza.Pipeline('en', processors='tokenize,lemma,pos', package=None)  # language; processors: tokenization, lemmatization, POS tagging
doc = nlp('Barack Obama was born in Hawaii.')
for sentence in doc.sentences:
    for word in sentence.words:
        print(word.text, word.lemma, word.pos)
import stanza
stanza.download('zh', processors='ner,depparse')  # Chinese: NER and dependency parsing; only needed the first time
nlp = stanza.Pipeline('zh', processors='ner,depparse')  # if stanza complains about missing prerequisite processors, add 'tokenize,pos,lemma' to the list
doc = nlp('我爱吃苹果')
for sentence in doc.sentences:
    print(sentence.ents)
    print(sentence.dependencies)
doc.sentences[0].print_dependencies()
References:
https://stanfordnlp.github.io/stanza/
https://stanfordnlp.github.io/stanza/installation_usage.html
https://stanfordnlp.github.io/stanza/pipeline.html#processors
Appendix:
If you only need a quick display, the following site renders a dependency tree directly (a local alternative is sketched after the link):
https://explosion.ai/demos/displacy?text=Convulsions%20that%20occur%20after%20DTaP%20are%20caused%20by%20a%20fever.&model=en_core_web_sm&cpu=1&cph=1
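If you would rather render the same kind of tree locally, the displaCy visualizer from spaCy can do it. A minimal sketch, assuming spaCy and its en_core_web_sm model (the one used in the demo link above) are installed:
import spacy
from spacy import displacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Convulsions that occur after DTaP are caused by a fever.")
displacy.serve(doc, style="dep")  # serves the dependency tree on a local web page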