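Dataset and purpose
Task purpose
The goal of this task is to parse text with the StanfordNLP tools and use the resulting analysis for follow-up research.
Dataset
We collected the HTML of news articles about 8 countries from a news website, extracted the body text and publication date of each article, and saved them to txt files.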
Sample news content (the corpus is Chinese, so the sample is kept in the original language): “本报伦敦3月20日电(记者邢雪)英国最大工会组织“工会联盟”近期发布报告称,英国社会男女收入不平等,女性员工工资平均比男性员工少15.4%。英国还需要30多年才能弥合这一差距。英国财政研究所此前发布的报告也显示,过去25年英国男女收入差距“几乎没有任何变化”。
据报道,在英国不同行业,男女收入差距的严重程度不一。在金融和保险领域,男女收入差距达到32.2%,相当于女性一年内少获得将近4个月的工资。即使是在教育、医疗护理等以女性为主的行业领域,女性收入仍旧普遍低于男性。
英国财政研究所研究部门副主任科斯塔·迪亚斯认为,英国在就业、工资、工作时间等方面,依然存在较大性别差距。英国政府的政策缺乏一套“连贯性”激励机制,以确保男女社会责任均等、实现职场性别平等。
”
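Data classification
Because the results of this task feed into follow-up research, the news articles need to be grouped first.
- First, group the news by publication date into one-week buckets
- Then apply hierarchical clustering to the news within each week
The core hierarchical clustering code is as follows: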
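The weekly grouping in the first step could look roughly like the sketch below. It assumes each article is available as a (publication date, text) pair with dates in YYYY-MM-DD form, and it uses ISO calendar weeks as the week boundary. The function name `group_by_week` and the input layout are illustrative assumptions, not the original implementation.

```python
from collections import defaultdict
from datetime import datetime


def group_by_week(articles):
    """Group (date_string, text) pairs by ISO year and week number.

    Assumes date strings look like '2020-04-15'; adjust the format
    string if the scraped dates are stored differently.
    """
    weeks = defaultdict(list)
    for date_string, text in articles:
        day = datetime.strptime(date_string, "%Y-%m-%d").date()
        iso_year, iso_week, _ = day.isocalendar()
        weeks[(iso_year, iso_week)].append(text)
    return dict(weeks)
```

Each value of the returned dictionary is the list of articles for one week, which can then be passed to the clustering step.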
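import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

def cluster(article_list):
    # get_word_vector, cos_dist and stop_word are defined elsewhere in the project.
    num = len(article_list)
    if num < 3:
        # Too few articles to score a clustering; keep them in a single cluster.
        return np.zeros(num, dtype=int), 1
    # Pairwise cosine distance (1 - cosine similarity) between articles.
    metric = np.zeros((num, num))
    for i in range(num):
        for j in range(i + 1, num):
            v1, v2 = get_word_vector(article_list[i], article_list[j], stop_word)
            dist = max(0.0, 1 - cos_dist(v1, v2))
            metric[i, j] = dist
            metric[j, i] = dist
    # Try k = 2..10 (k must stay below num) and keep the best silhouette score.
    scores = []
    labels = []
    for k in range(2, min(11, num)):
        # `metric=` was named `affinity=` before scikit-learn 1.2.
        model = AgglomerativeClustering(n_clusters=k, metric='precomputed', linkage='average')
        label = model.fit_predict(metric)
        # The matrix already holds distances, so it is scored as 'precomputed'.
        scores.append(silhouette_score(metric, label, metric='precomputed'))
        labels.append(label)
    best = int(np.argmax(scores))
    # scores[0] corresponds to k = 2, hence the offset.
    return labels[best], best + 2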
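`cluster` relies on `get_word_vector`, `cos_dist`, and `stop_word`, which are not shown in this post. One plausible shape for them is sketched below as an assumption: each article is taken to be an already segmented list of words, the two vectors are word counts over the shared vocabulary, and `cos_dist` returns cosine similarity (which would explain why `cluster` uses `1 - val` as the distance). The actual helpers may differ.

```python
import math
from collections import Counter


def get_word_vector(words_a, words_b, stop_word):
    """Build aligned word-count vectors for two tokenized articles."""
    count_a = Counter(w for w in words_a if w not in stop_word)
    count_b = Counter(w for w in words_b if w not in stop_word)
    # A shared, sorted vocabulary keeps both vectors in the same order.
    vocab = sorted(set(count_a) | set(count_b))
    v1 = [count_a[w] for w in vocab]
    v2 = [count_b[w] for w in vocab]
    return v1, v2


def cos_dist(v1, v2):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norm if norm else 0.0
```

With these definitions, two articles that share no words get distance 1, and two articles with identical word counts get a distance of approximately 0.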
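Using stanfordnlp
Detailed installation and usage tutorials are available elsewhere, and there are many on CSDN as well.
Open a command line in the model folder and run the following command: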
java -mx6g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -props StanfordCoreNLP-chinese.properties -fileList ../filelist.txt -outputFormat json -outputDirectory ../output/England/
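stanfordnlp processes the filelist line by line, with each line corresponding to one news text. A sample of the resulting JSON file is placed at the end of this post because of its length.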
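The file passed to `-fileList` is a plain text file with one input file path per line. A minimal sketch for generating it, assuming the news texts are `.txt` files in a single directory (the function name `write_filelist` and the directory layout are assumptions for illustration):

```python
from pathlib import Path


def write_filelist(text_dir, filelist_path):
    """Write the paths of all .txt files in text_dir to filelist_path, one per line."""
    filelist = Path(filelist_path).resolve()
    # Skip the filelist itself in case it lives in the same directory.
    paths = sorted(p.resolve() for p in Path(text_dir).glob("*.txt")
                   if p.resolve() != filelist)
    filelist.write_text("\n".join(str(p) for p in paths) + "\n", encoding="UTF-8")
    return len(paths)
```

Absolute paths are written so that the list works regardless of the directory the `java` command is run from.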
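Coreference resolution
Using the stanfordnlp analysis results, we perform coreference resolution on the original text. The code is as follows: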
import json

def get_coref(file):
    # Load one stanfordnlp JSON result and return its sentences and coreference chains.
    with open(file, 'r', encoding='UTF-8') as f:
        result = json.load(f)
    return [result['sentences'], result['corefs']]

def deal_origin(file, res):
    # Replace every later mention in a chain with the text of the chain's first
    # mention, then write the rewritten text to `file` (overwriting it).
    sentences, corefs = res
    for mentions in corefs.values():
        # sentNum and startIndex are 1-based; endIndex is exclusive.
        first = mentions[0]
        origin_tokens = sentences[first['sentNum'] - 1]['tokens']
        origin_word = "".join(
            origin_tokens[j]["originalText"]
            for j in range(first['startIndex'] - 1, first['endIndex'] - 1))
        for mention in mentions[1:]:
            tgt_tokens = sentences[mention['sentNum'] - 1]['tokens']
            start = mention['startIndex'] - 1
            end = mention['endIndex'] - 1
            for k in range(start, end):
                # Put the first mention's text on the first token and blank out the rest.
                tgt_tokens[k]["originalText"] = origin_word if k == start else ""
    # Join all tokens back into a single text.
    text = "".join(tok["originalText"] for s in sentences for tok in s['tokens'])
    with open(file, 'w', encoding='UTF-8') as f:
        f.write(text)
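Running this code produces a new version of the source text, which is then processed with stanfordnlp once more. This run uses a new filelist, so do not forget to regenerate it.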
java -mx6g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -props StanfordCoreNLP-chinese.properties -fileList ../filelist.txt -outputFormat json -outputDirectory ../output/England/
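This yields the stanfordnlp results for the processed text. The command-line options can be adjusted as needed; the available options and output formats are documented on the stanfordnlp website.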
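Before rewriting any files, it can help to look at what each chain would substitute. The sketch below lists every coreference chain in one result as (first mention text, later mention texts), using the `text` field that the JSON output attaches to each mention. It follows the code above in treating the first mention of a chain as the replacement text; the function name `list_chains` and the toy input are illustrative.

```python
import json


def list_chains(result):
    """Return (first mention text, [later mention texts]) for each coref chain."""
    chains = []
    for mentions in result.get("corefs", {}).values():
        if len(mentions) < 2:
            continue  # a single mention has nothing to replace
        chains.append((mentions[0]["text"], [m["text"] for m in mentions[1:]]))
    return chains


# Toy input in the shape of the "corefs" section (values are made up).
sample = json.loads("""
{"sentences": [],
 "corefs": {"1": [{"text": "英国政府", "sentNum": 1, "startIndex": 1, "endIndex": 3},
                  {"text": "它", "sentNum": 2, "startIndex": 1, "endIndex": 2}]}}
""")
print(list_chains(sample))
```

On real data, `result` would be the dictionary loaded from one of the JSON files in the output directory.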
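Problems encountered
- While learning the tool, I found that the stanfordnlp Python library raised errors. I tried several fixes and switched versions, but the problems remained. I also learned from the stanfordnlp website that the team has developed a Python NLP library called Stanza. For these reasons, I tried using Stanza instead.
- Stanza did not provide a coreference resolution module, but it can access the stanfordnlp interface. After some study, I called the coreference resolution interface through it and implemented the text conversion.
- Calling stanfordnlp from the command line is also convenient and worth considering; that is the approach used in this post.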
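If the command-line route is driven from a Python script, the same invocation can be assembled and launched with `subprocess`. This is a sketch under assumptions: the helper names `build_corenlp_command` and `run_corenlp` are hypothetical, the flags are exactly those of the command shown earlier, and actually running it requires Java and the jars in the working directory.

```python
import subprocess


def build_corenlp_command(filelist, output_dir,
                          props="StanfordCoreNLP-chinese.properties",
                          memory="6g"):
    """Assemble the command line used in this post as an argument list."""
    return [
        "java", f"-mx{memory}", "-cp", "*",
        "edu.stanford.nlp.pipeline.StanfordCoreNLP",
        "-props", props,
        "-fileList", filelist,
        "-outputFormat", "json",
        "-outputDirectory", output_dir,
    ]


def run_corenlp(filelist, output_dir, corenlp_dir):
    """Run the command from corenlp_dir (the folder holding the jars); raises on failure."""
    command = build_corenlp_command(filelist, output_dir)
    subprocess.run(command, cwd=corenlp_dir, check=True)
```

Passing the arguments as a list avoids shell quoting issues; the `*` classpath entry is handed to Java unchanged, just as the quoted `"*"` is in the shell command.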
output.json
{
"docId": "2020-04-15_all_3.txt",
"sentences": [
{
"index": 0,
"parse": "(ROOT\r\n (IP\r\n (VP\r\n (VCD (VV 延伸) (VV 阅读))\r\n (PU :)\r\n (IP\r\n (NP\r\n (NP\r\n (NP (NN 全球) (NN 疫情) (NN 要览))\r\n (PRN (PU ()\r\n (NP (NT 4月) (NT 16日))\r\n (PU ))))\r\n (NP\r\n (NP (NR 亚欧))\r\n (NP (NN 地区)))\r\n (NP (NN 疫情)))\r\n (VP (VV 持续)\r\n (VP\r\n (VP (VV 蔓延)\r\n (NP\r\n (NP (NR 德国))\r\n (NP (NN 社交) (NN 限制) (NN 措施))))\r\n (VP (VV 延长))))))))",
"basicDependencies": [
{
"dep": "ROOT",
"governor": 0,
"governorGloss": "ROOT",
"dependent": 1,
"dependentGloss": "延伸"
},
{
"dep": "compound:vc",
"governor": 1,
"governorGloss": "延伸",
"dependent": 2,
"dependentGloss": "阅读"
},
{
"dep": "punct",
"governor": 1,
"governorGloss": "延伸",
"dependent": 3,
"dependentGloss": ":"
},
{
"dep": "compound:nn",
"governor": 6,
"governorGloss": "要览",
"dependent": 4,
"dependentGloss": "全球"
},
{
"dep": "compound:nn",
"governor": 6,
"governorGloss": "要览",
"dependent": 5,
"dependentGloss": "疫情"
},
{
"dep": "compound:nn",
"governor": 13,
"governorGloss": "疫情",
"dependent": 6,
"dependentGloss": "要览"
},
{
"dep": "punct",
"governor": 9,
"governorGloss": "16日",
"dependent": 7,
"dependentGloss": "("
},
{
"dep": "compound:nn",
"governor": 9,
"governorGloss": "16日",
"dependent": 8,
"dependentGloss": "4月"
},
{
"dep": "parataxis:prnmod",
"governor": 6,
"governorGloss": "要览",
"dependent": 9,
"dependentGloss": "16日"
},
{
"dep": "punct",
"governor": 9,
"governorGloss": "16日",
"dependent": 10,
"dependentGloss": ")"
},
{
"dep": "nmod:assmod",
"governor": 12,
"governorGloss": "地区",
"dependent": 11,
"dependentGloss": "亚欧"
},
{
"dep": "compound:nn",
"governor": 13,
"governorGloss": "疫情",
"dependent": 12,
"dependentGloss": "地区"
},
{
"dep": "nsubj",
"governor": 15,
"governorGloss": "蔓延",
"dependent": 13,
"dependentGloss": "疫情"
},
{
"dep": "xcomp",
"governor": 15,
"governorGloss": "蔓延",
"dependent": 14,
"dependentGloss": "持续"
},
{
"dep": "ccomp",
"governor": 1,
"governorGloss": "延伸",
"dependent": 15,
"dependentGloss": "蔓延"
},
{
"dep": "nmod:assmod",
"governor": 19,
"governorGloss": "措施",
"dependent": 16,
"dependentGloss": "德国"
},
{
"dep": "compound:nn",
"governor": 19,
"governorGloss": "措施",
"dependent": 17,
"dependentGloss": "社交"
},
{
"dep": "compound:nn",
"governor": 19,
"governorGloss": "措施",
"dependent": 18,
"dependentGloss": "限制"
},
{
"dep": "dobj",
"governor": 15,
"governorGloss": "蔓延",
"dependent": 19,
"dependentGloss": "措施"
},
{
"dep": "conj",
"governor": 15,
"governorGloss": "蔓延",
"dependent": 20,
"dependentGloss": "延长"
}
],
"enhancedDependencies": [
{
"dep": "ROOT",
"governor": 0,
"governorGloss": "ROOT",
"dependent": 1,
"dependentGloss": "延伸"
},
{
"dep": "compound:vc",
"governor": 1,
"governorGloss": "延伸",
"dependent": 2,
"dependentGloss": "阅读"
},
{
"dep": "punct",
"governor": 1,
"governorGloss": "延伸",
"dependent": 3,
"dependentGloss": ":"
},
{
"dep": "compound:nn",
"governor": 6,
"governorGloss": "要览",
"dependent": 4,
"dependentGloss": "全球"
},
{
"dep": "compound:nn",
"governor": 6,
"governorGloss": "要览",
"dependent": 5,
"dependentGloss": "疫情"
},
{
"dep": "compound:nn",
"governor": 13,
"governorGloss": "疫情",
"dependent": 6,
"dependentGloss": "要览"
},
{
"dep": "punct",
"governor": 9,
"governorGloss": "16日",
"dependent": 7,
"dependentGloss": "("
},
{
"dep": "compound:nn",
"governor": 9,
"governorGloss": "16日",
"dependent": 8,
"dependentGloss": "4月"
},
{
"dep": "parataxis:prnmod",
"governor": 6,
"governorGloss": "要览",
"dependent": 9,
"dependentGloss": "16日"
},
{
"dep": "punct",
"governor": 9,
"governorGloss": "16日",
"dependent": 10,
"dependentGloss": ")"
},
{
"dep": "nmod:assmod",
"governor": 12,
"governorGloss": "地区",
"dependent": 11,
"dependentGloss": "亚欧"
},
{
"dep": "compound:nn",
"governor": 13,
"governorGloss": "疫情",
"dependent": 12,
"dependentGloss": "地区"
},
{
"dep": "nsubj",
"governor": 15,
"governorGloss": "蔓延",
"dependent": 13,
"dependentGloss": "疫情"
},
{
"dep": "xcomp",
"governor": 15,
"governorGloss": "蔓延",
"dependent": 14,
"dependentGloss": "持续"
},
{
"dep": "ccomp",
"governor": 1,
"governorGloss": "延伸",
"dependent": 15,
"dependentGloss": "蔓延"
},
{
"dep": "nmod:assmod",
"governor": 19,
"governorGloss": "措施",
"dependent": 16,
"dependentGloss": "德国"
},
{
"dep": "compound:nn",
"governor": 19,
"governorGloss": "措施",
"dependent": 17,
"dependentGloss": "社交"
},
{
"dep": "compound:nn",
"governor": 19,
"governorGloss": "措施",
"dependent": 18,
"dependentGloss": "限制"
},
{
"dep": "dobj",
"governor": 15,
"governorGloss": "蔓延",
"dependent": 19,
"dependentGloss": "措施"
},
{
"dep": "conj",
"governor": 15,
"governorGloss": "蔓延",
"dependent": 20,
"dependentGloss": "延长"
}
],
"enhancedPlusPlusDependencies": [
{
"dep": "ROOT",
"governor": 0,
"governorGloss": "ROOT",
"dependent": 1,
"dependentGloss": "延伸"
},
{
"dep": "compound:vc",
"governor": 1,
"governorGloss": "延伸",
"dependent": 2,
"dependentGloss": "阅读"
},
{
"dep": "punct",
"governor": 1,
"governorGloss": "延伸",
"dependent": 3,
"dependentGloss": ":"
},
{
"dep": "compound:nn",
"governor": 6,
"governorGloss": "要览",
"dependent": 4,
"dependentGloss": "全球"
},
{
"dep": "compound:nn",
"governor": 6,
"governorGloss": "要览",
"dependent": 5,
"dependentGloss": "疫情"
},
{
"dep": "compound:nn",
"governor": 13,
"governorGloss": "疫情",
"dependent": 6,
"dependentGloss": "要览"
},
{
"dep": "punct",
"governor": 9,
"governorGloss": "16日",
"dependent": 7,
"dependentGloss": "("
},
{
"dep": "compound:nn",
"governor": 9,
"governorGloss": "16日",
"dependent": 8,
"dependentGloss": "4月"
},
{
"dep": "parataxis:prnmod",
"governor": 6,
"governorGloss": "要览",
"dependent": 9,
"dependentGloss": "16日"
},
{
"dep": "punct",
"governor": 9,
"governorGloss": "16日",
"dependent": 10,
"dependentGloss": ")"
},
{
"dep": "nmod:assmod",
"governor": 12,
"governorGloss": "地区",
"dependent": 11,
"dependentGloss": "亚欧"
},
{
"dep": "compound:nn",
"governor": 13,
"governorGloss": "疫情",
"dependent": 12,
"dependentGloss": "地区"
},
{
"dep": "nsubj",
"governor": 15,
"governorGloss": "蔓延",
"dependent": 13,
"dependentGloss": "疫情"
},
{
"dep": "xcomp",
"governor": 15,
"governorGloss": "蔓延",
"dependent": 14,
"dependentGloss": "持续"
},
{
"dep": "ccomp",
"governor": 1,
"governorGloss": "延伸",
"dependent": 15,
"dependentGloss": "蔓延"
},
{
"dep": "nmod:assmod",
"governor": 19,
"governorGloss": "措施",
"dependent": 16,
"dependentGloss": "德国"
},
{
"dep": "compound:nn",
"governor": 19,
"governorGloss": "措施",
"dependent": 17,
"dependentGloss": "社交"
},
{
"dep": "compound:nn",
"governor": 19,
"governorGloss": "措施",
"dependent": 18,
"dependentGloss": "限制"
},
{
"dep": "dobj",
"governor": 15,
"governorGloss": "蔓延",
"dependent": 19,
"dependentGloss": "措施"
},
{
"dep": "conj",
"governor": 15,
"governorGloss": "蔓延",
"dependent": 20,
"dependentGloss": "延长"
}
],
"entitymentions": [
{
"docTokenBegin": 7,
"docTokenEnd": 9,
"tokenBegin": 7,
"tokenEnd": 9,
"text": "4月16日",
"characterOffsetBegin": 14,
"characterOffsetEnd": 19,
"ner": "DATE",
"normalizedNER": "XXXX-04-16",
"nerConfidences": {
"DATE": -1
}
},
{
"docTokenBegin": 10,
"docTokenEnd": 11,
"tokenBegin": 10,
"tokenEnd": 11,
"text": "亚欧",
"characterOffsetBegin": 20,
"characterOffsetEnd": 22,
"ner": "LOCATION",
"nerConfidences": {
"LOCATION": 0.48412511863581
}
},
{
"docTokenBegin": 15,
"docTokenEnd": 16,
"tokenBegin": 15,
"tokenEnd": 16,
"text": "德国",
"characterOffsetBegin": 30,
"characterOffsetEnd": 32,
"ner": "COUNTRY",
"nerConfidences": {
"GPE": 0.9540884277315
}
}
],
"tokens": [
{
"index": 1,
"word": "延伸",
"originalText": "延伸",
"lemma": "延伸",
"characterOffsetBegin": 0,
"characterOffsetEnd": 2,
"pos": "VV",
"ner": "O",
"speaker": "PER0"
},
{
"index": 2,
"word": "阅读",
"originalText": "阅读",
"lemma": "阅读",
"characterOffsetBegin": 2,
"characterOffsetEnd": 4,
"pos": "VV",
"ner": "O",
"speaker": "PER0"
},
{
"index": 3,
"word": ":",
"originalText": ":",
"lemma": ":",
"characterOffsetBegin": 4,
"characterOffsetEnd": 5,
"pos": "PU",
"ner": "O",
"speaker": "PER0"
},
{
"index": 4,
"word": "全球",
"originalText": "全球",
"lemma": "全球",
"characterOffsetBegin": 7,
"characterOffsetEnd": 9,
"pos": "NN",
"ner": "O",
"speaker": "PER0"
},
{
"index": 5,
"word": "疫情",
"originalText": "疫情",
"lemma": "疫情",
"characterOffsetBegin": 9,
"characterOffsetEnd": 11,
"pos": "NN",
"ner": "O",
"speaker": "PER0"
},
{
"index": 6,
"word": "要览",
"originalText": "要览",
"lemma": "要览",
"characterOffsetBegin": 11,
"characterOffsetEnd": 13,
"pos": "NN",
"ner": "O",
"speaker": "PER0"
},
{
"index": 7,
"word": "(",
"originalText": "(",
"lemma": "(",
"characterOffsetBegin": 13,
"characterOffsetEnd": 14,
"pos": "PU",
"ner": "O",
"speaker": "PER0"
},
{
"index": 8,
"word": "4月",
"originalText": "4月",
"lemma": "4月",
"characterOffsetBegin": 14,
"characterOffsetEnd": 16,
"pos": "NT",
"ner": "DATE",
"normalizedNER": "XXXX-04-16",
"speaker": "PER0"
},
{
"index": 9,
"word": "16日",
"originalText": "16日",
"lemma": "16日",
"characterOffsetBegin": 16,
"characterOffsetEnd": 19,
"pos": "NT",
"ner": "DATE",
"normalizedNER": "XXXX-04-16",
"speaker": "PER0"
},
{
"index": 10,
"word": ")",
"originalText": ")",
"lemma": ")",
"characterOffsetBegin": 19,
"characterOffsetEnd": 20,
"pos": "PU",
"ner": "O",
"speaker": "PER0"
},
{
"index": 11,
"word": "亚欧",
"originalText": "亚欧",
"lemma": "亚欧",
"characterOffsetBegin": 20,
"characterOffsetEnd": 22,
"pos": "NR",
"ner": "LOCATION",
"speaker": "PER0"
},
{
"index": 12,
"word": "地区",
"originalText": "地区",
"lemma": "地区",
"characterOffsetBegin": 22,
"characterOffsetEnd": 24,
"pos": "NN",
"ner": "O",
"speaker": "PER0"
},
{
"index": 13,
"word": "疫情",
"originalText": "疫情",
"lemma": "疫情",
"characterOffsetBegin": 24,
"characterOffsetEnd": 26,
"pos": "NN",
"ner": "O",
"speaker": "PER0"
},
{
"index": 14,
"word": "持续",
"originalText": "持续",
"lemma": "持续",
"characterOffsetBegin": 26,
"characterOffsetEnd": 28,
"pos": "VV",
"ner": "O",
"speaker": "PER0"
},
{
"index": 15,
"word": "蔓延",
"originalText": "蔓延",
"lemma": "蔓延",
"characterOffsetBegin": 28,
"characterOffsetEnd": 30,
"pos": "VV",
"ner": "O",
"speaker": "PER0"
},
{
"index": 16,
"word": "德国",
"originalText": "德国",
"lemma": "德国",
"characterOffsetBegin": 30,
"characterOffsetEnd": 32,
"pos": "NR",
"ner": "COUNTRY",
"speaker": "PER0"
},
{
"index": 17,
"word": "社交",
"originalText": "社交",
"lemma": "社交",
"characterOffsetBegin": 32,
"characterOffsetEnd": 34,
"pos": "NN",
"ner": "O",
"speaker": "PER0"
},
{
"index": 18,
"word": "限制",
"originalText": "限制",
"lemma": "限制",
"characterOffsetBegin": 34,
"characterOffsetEnd": 36,
"pos": "NN",
"ner": "O",
"speaker": "PER0"
},
{
"index": 19,
"word": "措施",
"originalText": "措施",
"lemma": "措施",
"characterOffsetBegin": 36,
"characterOffsetEnd": 38,
"pos": "NN",
"ner": "O",
"speaker": "PER0"
},
{
"index": 20,
"word": "延长",
"originalText": "延长",
"lemma": "延长",
"characterOffsetBegin": 38,
"characterOffsetEnd": 40,
"pos": "VV",
"ner": "O",
"speaker": "PER0"
}
]
}
],
"corefs": {
}
}
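The fields in a result file like the one above can be read back with the standard `json` module. The sketch below pulls out (word, POS tag) pairs and named entities per sentence, using only keys that appear in the sample (`sentences`, `tokens`, `word`, `pos`, `entitymentions`, `text`, `ner`). The function name `summarize` and the trimmed inline sample are illustrative.

```python
import json


def summarize(result):
    """Return one dict per sentence with its tagged words and named entities."""
    summary = []
    for sentence in result["sentences"]:
        summary.append({
            "words": [(tok["word"], tok["pos"]) for tok in sentence["tokens"]],
            "entities": [(m["text"], m["ner"])
                         for m in sentence.get("entitymentions", [])],
        })
    return summary


# A trimmed-down stand-in for output.json, keeping only the keys used here.
sample = json.loads("""
{"sentences": [{"index": 0,
  "tokens": [{"index": 1, "word": "德国", "pos": "NR", "ner": "COUNTRY"},
             {"index": 2, "word": "措施", "pos": "NN", "ner": "O"}],
  "entitymentions": [{"text": "德国", "ner": "COUNTRY"}]}],
 "corefs": {}}
""")
for item in summarize(sample):
    print(item["words"], item["entities"])
```

For a real file, replace the inline sample with `json.load` on a file from the output directory, opened with `encoding='UTF-8'`.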