[Artificial Intelligence] 2021-05-27 CNdeepdive installation, deployment, and pitfalls


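I mainly relied on the following blog posts. Some content is copied directly from them, for personal study only; thanks to the authors for their write-ups.

Relation extraction tool: DeepDive environment setup and practical pitfalls: https://zhuanlan.zhihu.com/p/53804721

DeepDive in practice, from download to running away: https://github.com/theDoctor2013/DeepDive-tutorial/blob/master/Deepdive_new.md

DeepDive installation tutorial: https://blog.csdn.net/qq_28840013/article/details/88406918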

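1. Download

Reaching GitHub requires a proxy (VPN).

Modify the ./install.sh file, then run it:

cxt@cxt:~/deepdive/CNdeepdive$ ./install.sh

Problem: curl: (7) Failed to connect to raw.githubusercontent.com port 443: Connection refused

Fix:

sudo apt-get update

After running this, the next step could proceed.

The following menu appears; choose 1: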

### DeepDive installer for Ubuntu
1) deepdive                   5) jupyter_notebook
2) deepdive_docker_sandbox    6) postgres
3) deepdive_example_notebook  7) run_deepdive_tests
4) deepdive_from_release      8) spouse_example
# Install what (enter to repeat options, a to see all, q to quit, or a number)? 1

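The following error appears:

E: 有未能满足的依赖关系。 (E: Unmet dependencies.)

Fix: switch to the official apt sources, then run:

cxt@cxt:~$ sudo apt-get update

cxt@cxt:~$ sudo apt-get upgrade

cxt@cxt:~$ sudo apt-get -f install

Then quit and rerun ./install.sh.

It succeeds.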

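3. Configure the environment

cxt@cxt:~/deepdive/CNdeepdive$ gedit ~/.bashrc
Add: export PATH="$HOME/local/bin:$PATH" (use $HOME rather than ~, because the shell does not expand ~ inside double quotes)


if [ -f ~/.bash_aliases ]; then
    . ~/.bash_aliases
fi
export PATH="$HOME/local/bin:$PATH"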
# enable programmable completion features (you don't need to enable
# this, if it's already enabled in /etc/bash.bashrc and /etc/profile
# sources /etc/bash.bashrc).
if ! shopt -oq posix; then
  if [ -f /usr/share/bash-completion/bash_completion ]; then
    . /usr/share/bash-completion/bash_completion
  elif [ -f /etc/bash_completion ]; then
    . /etc/bash_completion
  fi
fi

Apply the environment variable:

cxt@cxt:~/deepdive/CNdeepdive$ source ~/.bashrc
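A quick way to confirm that the new entry really took effect is to inspect PATH from Python. The sketch below is only an illustration; the helper names `path_has_directory` and `has_literal_tilde` are made up here. It also flags entries that still begin with a literal `~`, which is what a tilde written inside double quotes in ~/.bashrc leaves behind.

```python
import os


def path_has_directory(path_value, directory):
    """Return True if `directory` is one of the entries in a PATH-style string.

    Both sides are compared after expanding a leading "~" and normalising,
    so "~/local/bin" and "/home/user/local/bin" count as the same entry.
    """
    wanted = os.path.normpath(os.path.expanduser(directory))
    entries = [e for e in path_value.split(os.pathsep) if e]
    return any(
        os.path.normpath(os.path.expanduser(e)) == wanted for e in entries
    )


def has_literal_tilde(path_value):
    """Return the PATH entries that still start with an unexpanded "~"."""
    return [e for e in path_value.split(os.pathsep) if e.startswith("~")]


if __name__ == "__main__":
    current = os.environ.get("PATH", "")
    print("~/local/bin on PATH:", path_has_directory(current, "~/local/bin"))
    print("entries with a literal tilde:", has_literal_tilde(current))
```

If the second list is not empty, rewriting that entry with $HOME in ~/.bashrc and sourcing the file again should clear it.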

4. Database

cxt@cxt:~/deepdive/CNdeepdive$ bash <(curl -fsSL git.io/getdeepdive) postgres
Problem: curl: (22) The requested URL returned error: 503 Service Unavailable

Fix: rerun the command until it succeeds.

echo "postgresql://lumy@localhost:5432/transaction" > db.url

Installation succeeded.

cxt@cxt:~$ psql postgres
psql (9.5.25)
Type "help" for help.

postgres=# CREATE USER cxt WITH PASSWORD 'cxt';
ERROR:  role "cxt" already exists
postgres=# CREATE DATABASE transaction OWNER cxt;
CREATE DATABASE
postgres=# \q

Next, run:

cxt@cxt:~$ psql -U cxt -d transaction -h 127.0.0.1 -p 5432

cxt@cxt:~$ psql postgres
CREATE DATABASE transaction OWNER cxt;

Configuration:

echo "postgresql://cxt@localhost:5432/transaction" > db.url
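Since db.url is just a one-line connection URL, it can be sanity-checked before running DeepDive. The following is a minimal sketch using only the standard library; `parse_db_url` is a hypothetical helper, not part of DeepDive. It also catches the case where typographic quotes (“ ”) were pasted into the shell: those are not shell quotes, so they end up inside the file and the URL no longer starts with a valid scheme.

```python
from urllib.parse import urlsplit


def parse_db_url(url):
    """Split a db.url value into its connection parts.

    Raises ValueError when the value does not start with postgresql://,
    for example because stray quote characters were written to the file.
    """
    parts = urlsplit(url.strip())
    if parts.scheme != "postgresql":
        raise ValueError("unexpected scheme: %r" % parts.scheme)
    return {
        "user": parts.username,
        "host": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/"),
    }


if __name__ == "__main__":
    print(parse_db_url("postgresql://cxt@localhost:5432/transaction"))
```

In practice one would pass the contents of db.url, e.g. `parse_db_url(open("db.url", encoding="utf-8").read())`.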

(To be deleted)

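Set up the Chinese NLP environment

cxt@cxt:~/deepdive/CNdeepdive$ ./nlp_setup.sh
Install Dependency.
Denpendency Already Installed.

DeepDive relies on some features of Stanford CoreNLP, so processing Chinese requires downloading additional Chinese models.

Download the Chinese models from this link: http://nlp.stanford.edu/software/stanford-chinese-corenlp-2016-01-19-models.jar

After downloading, place the jar file in udf/bazaar/parser/lib under transaction. Then, in udf/bazaar/parser/, build with:

    sbt/sbt stage
Once the build finishes, start the parser with:

    ./run.sh -p 8080
To test it, POST a passage of Chinese text to localhost:8080 and check that it is segmented and tagged correctly.
----------------
Copyright notice: this is an original article by CSDN blogger "cyh_90", licensed under CC 4.0 BY-SA. Please include the original source link and this notice when reposting.
Original link: https://blog.csdn.net/cyh_90/article/details/88093513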
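The parser test mentioned above (POSTing Chinese text to localhost:8080) can be scripted. This is a rough sketch with the standard library only. The request path `/`, the `Content-Type` header, and the helper names are assumptions; the source does not describe the parser's HTTP interface beyond "POST a passage of Chinese", so the response is returned as raw text without interpreting it.

```python
import urllib.request


def build_parser_request(text, host="localhost", port=8080):
    """Build (but do not send) a POST request carrying UTF-8 text."""
    url = "http://%s:%d/" % (host, port)  # assumed endpoint path
    return urllib.request.Request(
        url,
        data=text.encode("utf-8"),
        headers={"Content-Type": "text/plain; charset=utf-8"},
        method="POST",
    )


def post_text(text, host="localhost", port=8080, timeout=30):
    """Send the text to the running parser and return the raw response body."""
    request = build_parser_request(text, host, port)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Only builds the request; call post_text(...) once ./run.sh -p 8080 is up.
    req = build_parser_request("今天天气很好。")
    print(req.get_method(), req.full_url, len(req.data), "bytes")
```

Encoding the body explicitly as UTF-8 avoids depending on the terminal locale, which matters for Chinese input.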

The command below kept failing for two reasons: first, pgAdmin was open; second, Postman was open.

$ deepdive compile && deepdive do transaction_dbdata

cxt@cxt:~/deepdive/CNdeepdive/project_cxt/udf/bazaar/parser$ sbt/sbt stage

cxt@cxt:~/deepdive/CNdeepdive/project_cxt/udf/bazaar/parser$ deepdive compile

cxt@cxt:~/deepdive/CNdeepdive/project_cxt/udf/bazaar/parser$ deepdive do articles

"run/FINISHED" -> "run/FINISHED~"
"run/FINISHED" -> "20210604/220307.088140845"

This output means the run succeeded.

Query:

cxt@cxt:~/deepdive/CNdeepdive/project_cxt/udf/bazaar/parser$ deepdive query '?- articles(id, _).'

     id
------------
 1201734370
(1 row)

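Text processing with the NLP module

By default, DeepDive uses Stanford CoreNLP for text processing. Given text data as input, the NLP module works sentence by sentence and returns each sentence's tokens, lemmas, POS tags, NER tags, and dependency parse, in preparation for the later feature extraction. We store these results in the sentences table.

Define the sentences table in the app.ddlog file to hold the NLP results: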

sentences(
    doc_id text,
    sentence_index int,
    sentence_text text,
    tokens text[],
    lemmas text[],
    pos_tags text[],
    ner_tags text[],
    doc_offsets int[],
    dep_types text[],
    dep_tokens int[]
    ).

Define the NLP processing function nlp_markup:

function nlp_markup over (
    doc_id text,
    content text
) returns rows like sentences
implementation "udf/nlp_markup.sh" handles tsv lines.

Call nlp_markup with the following syntax, reading input from the articles table and writing output to the sentences table:

sentences += nlp_markup(doc_id, content) :-
articles(doc_id, content).

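This declares a DDlog function that takes an article's doc_id and content as input and produces output in the column format of the sentences table. The function calls udf/nlp_markup.sh, which invokes the NLP module. The script itself is in the udf/ folder of the transaction example code and is implemented by calling run.sh under udf/bazaar/parser. The nlp_markup.sh from transaction/udf in the CNdeepdive example code must be copied into the corresponding directory of the current project.

Run

deepdive compile && deepdive do sentences

to build the sentences table.

Query the generated results with: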

deepdive query 'doc_id, index, tokens, ner_tags | 5
?- sentences(doc_id, index, text, tokens, lemmas, pos_tags, ner_tags, _, _, _).'

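Entity extraction and candidate entity pair generation

In this step we extract the candidate entities (companies) from the text and generate candidate entity pairs. First, define the entity table in app.ddlog: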

company_mention(
    mention_id text,
    mention_text text,
    doc_id text,
    sentence_index int,
    begin_index int,
    end_index int
).

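Each entity is one row in the table, storing the entity's id, its text, the id of the document it appears in, the sentence index, and its start and end positions within the sentence.

Next, define the entity extraction function: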

function map_company_mention over (
    doc_id text,
    sentence_index int,
    tokens text[],
    ner_tags text[]
) returns rows like company_mention
implementation "udf/map_company_mention.py" handles tsv lines.

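map_company_mention.py also has to be copied from transaction/udf in the CNdeepdive example code into the corresponding directory of the current project. The script goes through every sentence in the database, finds consecutive runs of tokens whose NER tag is ORG, applies some further filtering, and returns the candidate entities. It is a generator function that emits output rows with yield statements. All other scripts and files under transaction/udf in the CNdeepdive example code must be copied over as well (including company_full_short.csv).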
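To make the "consecutive ORG tags" idea concrete, here is a small self-contained sketch. It is not the code of map_company_mention.py: the function names, the inclusive end index, and the mention_id format are assumptions for illustration, and the extra filtering and TSV input handling of the real script are left out.

```python
def find_org_spans(ner_tags, target="ORG"):
    """Yield (begin_index, end_index) for each maximal run of `target` tags.

    Both indexes are inclusive token positions within the sentence.
    """
    begin = None
    for i, tag in enumerate(ner_tags):
        if tag == target:
            if begin is None:
                begin = i  # a new run starts here
        elif begin is not None:
            yield begin, i - 1  # the run ended on the previous token
            begin = None
    if begin is not None:
        yield begin, len(ner_tags) - 1  # run reaches the end of the sentence


def candidate_mentions(doc_id, sentence_index, tokens, ner_tags):
    """Yield rows shaped like the company_mention table."""
    for begin, end in find_org_spans(ner_tags):
        mention_text = "".join(tokens[begin:end + 1])
        # Hypothetical id scheme: any value unique per mention would do.
        mention_id = "%s_%d_%d_%d" % (doc_id, sentence_index, begin, end)
        yield [mention_id, mention_text, doc_id, sentence_index, begin, end]


if __name__ == "__main__":
    tokens = ["A", "公司", "收购", "B", "集团", "。"]
    tags = ["ORG", "ORG", "O", "ORG", "ORG", "O"]
    for row in candidate_mentions("doc1", 0, tokens, tags):
        print("\t".join(str(col) for col in row))
```

Tokens are joined without spaces because Chinese text is not space-delimited; for English tokens a space separator would be the natural choice.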

Then write the function call in app.ddlog, taking input from the sentences table and writing to company_mention:

company_mention += map_company_mention(
doc_id, sentence_index, tokens, ner_tags) :-
sentences(doc_id, sentence_index, _, tokens, _, _, ner_tags, _, _, _).

Finally, compile and run:

$ deepdive compile && deepdive do company_mention

Check the entity table that was just extracted:

$ deepdive query 'mention_id, mention_text, doc_id,sentence_index, begin_index, end_index
| 50 ?- company_mention(
mention_id, mention_text, doc_id, sentence_index, begin_index, end_index).'

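The result is shown in the figure:

Running

deepdive compile && deepdive do company_mention

kept failing.

It showed kill: (6258) - 没有那个进程 (No such process).

Fix: kill the process.

Then it reported that the path to the company_full_short.csv file could not be found.

Fix: add the file path in transform, prefixing the file name with "/home/cxt/deepdive/CNdeepdive/project_cxt/udf/":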

ENTITY_FILE = "/home/cxt/deepdive/CNdeepdive/project_cxt/udf/company_full_short.csv"
entity_dict = loaddict(ENTITY_FILE)
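The absolute path above works on this one machine. A more portable variant is to resolve the CSV relative to the script that loads it, so the lookup no longer depends on the working directory the script is started from. This is only a sketch; `beside_script` is a hypothetical helper, not something from the CNdeepdive code.

```python
import os


def beside_script(script_path, filename):
    """Return the absolute path of `filename` in the same directory as the script."""
    script_dir = os.path.dirname(os.path.abspath(script_path))
    return os.path.join(script_dir, filename)


# Inside a UDF script one would typically pass __file__:
#   ENTITY_FILE = beside_script(__file__, "company_full_short.csv")

if __name__ == "__main__":
    print(beside_script("udf/transform.py", "company_full_short.csv"))
```

This assumes company_full_short.csv sits in the same udf/ directory as the script, which matches the copy step described earlier.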

Then compile and run to generate the feature table:

$ deepdive compile && deepdive do transaction_feature

Run the following statement to view the generated results:

deepdive query '| 20 ?- transaction_feature(p1_id, p2_id, feature).'

Here I used another author's code directly: https://github.com/changyiru-code/-

Building the knowledge graph:

{
  "p": {
    "start": {
      "identity": 133,
      "labels": [
        "题目"
      ],
      "properties": {
        "Title_name": ".NET框架下的防SQL注入登录模块的研究与实现",
        "Abstract_content": "随着信息化、数据化的逐步深入,安全问题暴露无疑,大家对安全的意识也在不断提高。对于中职学生来说,编程类课程越来越丰富,涉及的安全问题也就越来越多,了解和掌握常见的程序安全问题,并加以防范和修补也逐渐成为编程学习的一部分。SQL注入是最常见、最古老、最流行的程序漏洞之一。在编程中,尤其是在登录阶段该漏洞表现的尤为突出。如果在编程阶段就将SQL注入漏洞等安全问题进行有效的防范对整个项目安全性的提高,将有极大的帮助。"
      }
    },
    "end": {
      "identity": 162,
      "labels": [
        "关键词"
      ],
      "properties": {
        "Keywords_name": "SQL注入"
      }
    },
    "segments": [
      {
        "start": {
          "identity": 133,
          "labels": [
            "题目"
          ],
          "properties": {
            "Title_name": ".NET框架下的防SQL注入登录模块的研究与实现",
            "Abstract_content": "随着信息化、数据化的逐步深入,安全问题暴露无疑,大家对安全的意识也在不断提高。对于中职学生来说,编程类课程越来越丰富,涉及的安全问题也就越来越多,了解和掌握常见的程序安全问题,并加以防范和修补也逐渐成为编程学习的一部分。SQL注入是最常见、最古老、最流行的程序漏洞之一。在编程中,尤其是在登录阶段该漏洞表现的尤为突出。如果在编程阶段就将SQL注入漏洞等安全问题进行有效的防范对整个项目安全性的提高,将有极大的帮助。"
          }
        },
        "relationship": {
          "identity": 207,
          "start": 133,
          "end": 162,
          "type": "keywords_of",
          "properties": {}
        },
        "end": {
          "identity": 162,
          "labels": [
            "关键词"
          ],
          "properties": {
            "Keywords_name": "SQL注入"
          }
        }
      }
    ],
    "length": 1.0
  }
},

Entity merging:

import csv

from py2neo import Graph


def main():
    graph = Graph("http://localhost:7474", username="××××", password='×××××')
    # Open the CSV files; file1, file2 and file3 are defined elsewhere
    # in the original script (not shown here).
    csv_file1 = csv.reader(open(
        '/home/cxt/下载/changyirukg/data/作者.csv',
        'r', encoding='utf-8'))
    print(csv_file1)  # prints only the reader object, not its rows
    csv_file2 = csv.reader(open(
        '/home/cxt/下载/changyirukg/data/关键词.csv',
        'r', encoding='utf-8'))
    print(csv_file2)  # prints only the reader object, not its rows
    csv_file3 = csv.reader(open(
        '/home/cxt/下载/changyirukg/data/Bug.csv',
        'r', encoding='utf-8'))
    print(csv_file3)  # prints only the reader object, not its rows
    # cxt: output.csv is commented out
    # csv_file4 = csv.reader(open(
    #     '/home/cxt/下载/changyirukg/data/output.csv',
    #     'r', encoding='utf-8'))
    # print(csv_file4)  # prints only the reader object, not its rows
    file1(csv_file1, graph)
    file2(csv_file2, graph)
    file3(csv_file3, graph)
    # file4(csv_file4, graph)

    gql1 = 'MATCH (a:entity1),(b:entity2)  where (a.entity_1 = b.entity_2 and id(a) <> id(b))  call apoc.refactor.mergeNodes([a,b]) YIELD node  RETURN node;'
    gql2 = 'MATCH (a:entity1),(b:title)  where (a.entity_1 = b.Title_name and id(a) <> id(b)) call apoc.refactor.mergeNodes([a,b]) YIELD node  RETURN node'
    gql3 = 'MATCH (a:entity1),(b:keywords)  where (a.entity_1 = b.Keywords_name and id(a) <> id(b))  call apoc.refactor.mergeNodes([a,b]) YIELD node  RETURN node;'
    gql4 = 'MATCH (a:entity2),(b:title)  where (a.entity_2 = b.Title_name and id(a) <> id(b)) call apoc.refactor.mergeNodes([a,b]) YIELD node  RETURN node'
    gql5 = 'MATCH (a:entity2),(b:keywords)  where (a.entity_2 = b.Keywords_name and id(a) <> id(b)) call apoc.refactor.mergeNodes([a,b]) YIELD node  RETURN node'
    gql6 = 'MATCH (a:title),(b:keywords)  where (a.Title_name = b.Keywords_name and id(a) <> id(b)) call apoc.refactor.mergeNodes([a,b]) YIELD node  RETURN node'
    graph.run(gql1)
    graph.run(gql2)
    graph.run(gql3)
    graph.run(gql4)
    graph.run(gql5)
    graph.run(gql6)
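The six statements follow one pattern: every unordered pair of the four node labels is compared on its name property. As a sketch (the names `NODE_KEYS`, `merge_statement`, and `all_merge_statements` are invented here), the same Cypher can be generated from a small table instead of being written out six times:

```python
from itertools import combinations

# (label, property) pairs taken from the six statements above.
NODE_KEYS = [
    ("entity1", "entity_1"),
    ("entity2", "entity_2"),
    ("title", "Title_name"),
    ("keywords", "Keywords_name"),
]


def merge_statement(label_a, prop_a, label_b, prop_b):
    """Build one Cypher statement that merges equal-valued nodes of two labels."""
    return (
        "MATCH (a:%s),(b:%s) "
        "WHERE a.%s = b.%s AND id(a) <> id(b) "
        "CALL apoc.refactor.mergeNodes([a,b]) YIELD node "
        "RETURN node" % (label_a, label_b, prop_a, prop_b)
    )


def all_merge_statements(node_keys=NODE_KEYS):
    """Return one statement per unordered pair of labels, in list order."""
    return [
        merge_statement(la, pa, lb, pb)
        for (la, pa), (lb, pb) in combinations(node_keys, 2)
    ]


if __name__ == "__main__":
    for statement in all_merge_statements():
        print(statement)
```

With a connected py2neo graph, `for s in all_merge_statements(): graph.run(s)` would replace the six `graph.run(...)` calls; the pairs come out in the same order as gql1 to gql6.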

The following problem occurred during entity merging:

Traceback (most recent call last):
  File "/home/cxt/下载/changyirukg/2020.9.5合并实体.py", line 110, in <module>
    main()
  File "/home/cxt/下载/changyirukg/2020.9.5合并实体.py", line 103, in main
    graph.run(gql1)
  File "/usr/local/lib/python3.5/dist-packages/py2neo/database/__init__.py", line 709, in run
    return self.auto().run(cypher, parameters, **kwparameters)
  File "/usr/local/lib/python3.5/dist-packages/py2neo/database/work.py", line 128, in run
    readonly=self.readonly, hydrant=hydrant)
  File "/usr/local/lib/python3.5/dist-packages/py2neo/client/__init__.py", line 936, in auto_run
    result = cx.auto_run(graph_name, cypher, parameters, readonly=readonly)
  File "/usr/local/lib/python3.5/dist-packages/py2neo/client/http.py", line 183, in auto_run
    rs.audit()
  File "/usr/local/lib/python3.5/dist-packages/py2neo/client/http.py", line 448, in audit
    raise failure
py2neo.database.work.ClientError: [Procedure.ProcedureNotFound] There is no procedure with the name `apoc.refactor.mergeNodes` registered for this database instance. Please ensure you've spelled the procedure name correctly and that the procedure is properly deployed.
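The message says the server has no procedure named apoc.refactor.mergeNodes. A likely cause, though the source does not record how this was resolved, is that the APOC plugin is not installed or not enabled on this Neo4j instance. One way to fail earlier with a clearer message is to ask the server whether the procedure exists before running the merge statements. The sketch below assumes a Neo4j version that still provides `dbms.procedures()` (it is not available in Neo4j 5) and a py2neo-style `run(...).data()` interface; `procedure_is_registered` is a hypothetical helper.

```python
def procedure_is_registered(graph, name):
    """Return True if the database reports a procedure called `name`.

    `graph` is anything with a py2neo-style run(cypher, **params) method
    whose result offers data() returning a list of dicts.
    """
    cypher = (
        "CALL dbms.procedures() YIELD name "
        "WHERE name = $wanted RETURN count(*) AS n"
    )
    rows = graph.run(cypher, wanted=name).data()
    return bool(rows) and rows[0]["n"] > 0
```

In main() this could guard the merge step, for example by raising a RuntimeError that mentions the missing plugin when `procedure_is_registered(graph, "apoc.refactor.mergeNodes")` is false.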