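I wanted to build a question-answering system based on word-vector search, adapting code from GitHub for my own project. It was my first time using both BERT and ElasticSearch, so I ran into all sorts of problems and needed four or five days to get everything running.
The main lesson from practice: versions matter (and I had not checked them thoroughly before starting).
Python 3.6, TensorFlow 1.10.0
Python 2 is not supported, and neither is TensorFlow 2.0. As far as I could tell, Python 3.7 and above can only install TensorFlow 2.0 or later; see the TensorFlow/Python version compatibility table.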
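The version constraints above can be checked before installing anything else. The following is only a minimal sketch: the bounds come from these notes (Python 3.6, TensorFlow at least 1.10 and below 2.0), and the helper names `parse_version` and `check_environment` are made up for illustration.

```python
# Version bounds taken from these notes; the helper names are hypothetical.
MIN_TF = (1, 10)           # bert-serving-server rejects anything older
MAX_TF_EXCLUSIVE = (2, 0)  # TensorFlow 2.x did not work for this setup
TESTED_PYTHON = (3, 6)     # the only Python version confirmed in these notes


def parse_version(text):
    """Turn '1.10.0' into (1, 10, 0), stopping at suffixes such as 'rc1'."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def check_environment(python_version, tf_version):
    """Return a list of problems; an empty list means the combination looks fine."""
    problems = []
    if tuple(python_version[:2]) != TESTED_PYTHON:
        problems.append("Python %d.%d was not the version that worked here (3.6)"
                        % tuple(python_version[:2]))
    tf = parse_version(tf_version)[:2]
    if tf < MIN_TF:
        problems.append("TensorFlow %s is older than 1.10" % tf_version)
    elif tf >= MAX_TF_EXCLUSIVE:
        problems.append("TensorFlow %s is 2.x, which is not supported" % tf_version)
    return problems
```

Calling `check_environment(sys.version_info, tensorflow.__version__)` inside the target environment would report the same mismatches that took me several days to find by trial and error.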
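TensorFlow 1.2.0 (wrong)
BERT runs on TensorFlow, so TensorFlow has to be installed. I first installed the then-latest version, 2.0, and it failed. Following the official install guide ("Install TensorFlow 2"), the pip-installed TensorFlow 2.0 raised an error saying that no NVIDIA device was found. It turned out my machine has no discrete GPU, so it can only run on the CPU. The current documentation says "for releases 1.15 and older, CPU and GPU packages are separate", which implies that newer releases ship a single package supporting both, yet the install still failed and I never found out why. Since bert_serving also does not support TensorFlow 2.0 and later, I downgraded to tensorflow-1.2.0.
Python 3.5 -> 3.6 (correct)
tensorflow-1.2.0 works with Python 3.5 to 3.6, so I picked the lower one, 3.5.
tensorflow-1.2.0: Python 3.5-3.6
After downgrading Python to 3.5, pip commands failed with SyntaxError: invalid syntax: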
PS F:\kg\zhuanan> python -m pip install --upgrade pip
Error processing line 1 of D:\python-data\site-packages\distutils-precedence.pth:
Traceback (most recent call last):
File "C:\Users\hfore\AppData\Local\Programs\Python\Python35\lib\site.py", line 167, in addpackage
exec(line)
File "<string>", line 1, in <module>
File "D:\python-data\site-packages\_distutils_hack\__init__.py", line 194
f'spec_for_{name}',
^
SyntaxError: invalid syntax
Remainder of file ignored
Traceback (most recent call last):
File "C:\Users\hfore\AppData\Local\Programs\Python\Python35\lib\runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "C:\Users\hfore\AppData\Local\Programs\Python\Python35\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "D:\python-data\site-packages\pip\__main__.py", line 29, in <module>
from pip._internal.cli.main import main as _main
File "D:\python-data\site-packages\pip\_internal\cli\main.py", line 57
sys.stderr.write(f"ERROR: {exc}")
^
SyntaxError: invalid syntax
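Both tracebacks point at f-strings (`f'spec_for_{name}'` and `f"ERROR: {exc}"`). F-strings were added in Python 3.6, so the pip and setuptools files under D:\python-data\site-packages, which were written for a newer interpreter, cannot even be parsed by Python 3.5. A small sketch of how to test an interpreter for this (the function name is hypothetical):

```python
def supports_fstrings():
    """Return True if the running interpreter can compile an f-string."""
    try:
        # The source is passed as a plain string, so this file itself
        # still parses on interpreters that lack f-string support.
        compile('f"{1 + 1}"', "<fstring-check>", "eval")
    except SyntaxError:
        return False
    return True
```

On Python 3.5 this returns False, which is the same condition that makes the shared pip installation fail.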
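Configuring the interpreter in PyCharm (File -> Settings) also failed: in the latest PyCharm, 2021.2, the Python Interpreter setting does not support Python 3.5. Python 3.5 has been retired upstream, so PyCharm no longer supports it either, and I upgraded Python to 3.6.
TensorFlow 1.10.0 (correct)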
pip install tensorflow==1.2.0
pip install bert-serving-server
pip install bert-serving-client
pip install Elasticsearch
pip install flask
pip install helpers
pip install CORS
Running it produced the error below: the TensorFlow version was wrong, so I uninstalled it and installed a newer one.
ModuleNotFoundError: Tensorflow >=1.10 (one-point-ten) is required!
pip uninstall tensorflow==1.2.0
pip install tensorflow==1.10.0
Creating a virtual environment with conda
To make Python version management easier, I installed Python through conda.
conda create --name python36 python=3.6
conda activate python36
bert-serving-start -model_dir F:\kg\zhuanan\chinese_L-12_H-768_A-12 -num_worker=4
python ask_es.py -model_dir chinese_L-12_H-768_A-12/ -num_worker=4
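Once bert-serving-start is running, a question can be turned into a vector through the client package. This is a sketch rather than the code in ask_es.py: `encode_question` and `make_bert_encoder` are hypothetical helpers, and the client is assumed to connect to a server on the same machine with default ports.

```python
def encode_question(text, encode):
    """Encode one question and return its vector as a plain list of floats.

    `encode` is any callable that maps a list of strings to a sequence of
    vectors, for example BertClient().encode.
    """
    vectors = encode([text])  # the service works on batches; send a batch of one
    return [float(x) for x in vectors[0]]


def make_bert_encoder():
    """Connect to a running bert-serving-start process and return its encoder."""
    # Imported lazily so the helper above can be used without the package installed.
    from bert_serving.client import BertClient
    return BertClient().encode
```

Converting to plain Python floats keeps the vector JSON-serialisable, which matters when it is later placed in the `params` of an ES query.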
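The ES vector query failed with: elasticsearch.BadRequestError: BadRequestError(400, 'search_phase_execution_exception', 'runtime error')
The vector query looks just like the one in the ES documentation, so the query itself is not the problem.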
script_query = {
    "script_score": {
        "query": {
            "match_all": {
            },
        },
        "script": {
            "source": "cosineSimilarity(params.query_vector, 'query_vector') + 1.0",
            "params": {"query_vector": query_vector}
        }
    }
}
body = {
    "_source": ["standard_question", "answer"],
    "size": 1,
    "query": script_query
}
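The same request body can be wrapped in a small function so the vector field name and result size are not hard-coded. This is a sketch only; `build_search_body` and its parameters are names I made up, and the query it produces is the one shown above.

```python
def build_search_body(query_vector, vector_field="query_vector", size=1):
    """Build a script_score request body that ranks documents by cosine similarity."""
    script_query = {
        "script_score": {
            "query": {"match_all": {}},
            "script": {
                # The +1.0 shifts the cosine score from [-1, 1] to [0, 2],
                # because script_score must not produce negative scores.
                "source": "cosineSimilarity(params.query_vector, '{}') + 1.0".format(vector_field),
                # Plain floats keep the body JSON-serialisable.
                "params": {"query_vector": [float(x) for x in query_vector]},
            },
        }
    }
    return {
        "_source": ["standard_question", "answer"],
        "size": size,
        "query": script_query,
    }
```

Note that `cosineSimilarity` only works when the named field is mapped as `dense_vector`, which turns out to be the crux of the error described next.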
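The error shown in PyCharm is terse; the Kibana console gives the full details
Running the same query in the Kibana console returns the detailed error. According to it, the problem is that DenseVectorDocValuesField and FloatDocValuesField are different field types, and one cannot be cast to the other. But the vector field was declared when the ES index was created, and the query data passed in is also a vector produced by BERT, so I did not understand why the types would differ.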
{
"error": {
"root_cause": [
{
"type": "script_exception",
"reason": "runtime error",
"script_stack": [
"org.elasticsearch.server@8.4.3/org.elasticsearch.script.VectorScoreScriptUtils$DenseVectorFunction.<init>(VectorScoreScriptUtils.java:38)",
"org.elasticsearch.server@8.4.3/org.elasticsearch.script.VectorScoreScriptUtils$CosineSimilarity.<init>(VectorScoreScriptUtils.java:112)",
"cosineSimilarity(params.query_vector, 'query_vector') + 1.0",
" ^---- HERE"
],
"script": "cosineSimilarity(params.query_vector, 'query_vector') + 1.0",
"lang": "painless",
"position": {
"offset": 23,
"start": 0,
"end": 59
}
}
],
"type": "search_phase_execution_exception",
"reason": "all shards failed",
"phase": "query",
"grouped": true,
"failed_shards": [
{
"shard": 0,
"index": "test-index",
"node": "gQaKss5wTrOYc5ncBarXfA",
"reason": {
"type": "script_exception",
"reason": "runtime error",
"script_stack": [
"org.elasticsearch.server@8.4.3/org.elasticsearch.script.VectorScoreScriptUtils$DenseVectorFunction.<init>(VectorScoreScriptUtils.java:38)",
"org.elasticsearch.server@8.4.3/org.elasticsearch.script.VectorScoreScriptUtils$CosineSimilarity.<init>(VectorScoreScriptUtils.java:112)",
"cosineSimilarity(params.query_vector, 'query_vector') + 1.0",
" ^---- HERE"
],
"script": "cosineSimilarity(params.query_vector, 'query_vector') + 1.0",
"lang": "painless",
"position": {
"offset": 23,
"start": 0,
"end": 59
},
"caused_by": {
"type": "class_cast_exception",
"reason": "class org.elasticsearch.script.field.FloatDocValuesField cannot be cast to class org.elasticsearch.script.field.vectors.DenseVectorDocValuesField (org.elasticsearch.script.field.FloatDocValuesField and org.elasticsearch.script.field.vectors.DenseVectorDocValuesField are in module org.elasticsearch.server@8.4.3 of loader 'app')"
}
}
}
]
},
"status": 400
}
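The useful part of this response is the nested `caused_by` entry, which is easy to miss among the repeated script stacks. A hedged sketch of a helper that pulls it out of the parsed JSON (the function name is hypothetical, and it works on a plain dict rather than on any particular client exception class):

```python
def find_root_causes(error_body):
    """Collect every distinct (type, reason) pair found under 'caused_by' keys."""
    found = []

    def walk(node):
        if isinstance(node, dict):
            cause = node.get("caused_by")
            if isinstance(cause, dict):
                item = (cause.get("type"), cause.get("reason"))
                if item not in found:
                    found.append(item)
            for value in node.values():
                walk(value)
        elif isinstance(node, list):
            for value in node:
                walk(value)

    walk(error_body)
    return found
```

For the response above, this yields a single `class_cast_exception` entry, which says that the stored field is a float field and not a dense vector.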
A search shows that the data was stored successfully, and the query_vector field looked right as well.
I still could not see where the problem was.
GET test-index/_search
{
"query":{"match_all": {}}
}
Check the data types in the mapping:
GET test-index/_mapping
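Then something odd showed up: the type of the query_vector field was actually long. I have not worked out why; I am new to this, so if you know the reason, please leave a comment.
The correct procedure (a big pitfall, possibly caused by a mistake on my side)
At this point I knew where the problem was: the field type in the index had changed when the data was inserted. Afterwards I went step by step: create the index -> check the index -> insert the data. Make sure to check the index right after creating it. I still do not know why the type changed.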
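One plausible explanation, which is an assumption and not something these notes confirm: if documents reach an index that has no explicit mapping for query_vector (for example because the index was deleted and then recreated implicitly by the first insert), Elasticsearch's dynamic mapping infers a numeric type such as long or float for the array, and `cosineSimilarity` then fails. The sketch below shows an explicit mapping and a check on the `GET test-index/_mapping` response before inserting data. The `text` types for the other two fields are assumed, and the helper names are hypothetical; 768 is the hidden size of chinese_L-12_H-768_A-12.

```python
VECTOR_DIMS = 768  # hidden size of the chinese_L-12_H-768_A-12 model


def build_index_body(vector_field="query_vector", dims=VECTOR_DIMS):
    """Index creation body that declares the vector field explicitly."""
    return {
        "mappings": {
            "properties": {
                "standard_question": {"type": "text"},  # assumed type
                "answer": {"type": "text"},              # assumed type
                vector_field: {"type": "dense_vector", "dims": dims},
            }
        }
    }


def vector_field_type(mapping_response, index, field="query_vector"):
    """Read the field type out of a parsed `GET <index>/_mapping` response."""
    properties = mapping_response.get(index, {}).get("mappings", {}).get("properties", {})
    return properties.get(field, {}).get("type")


def is_ready_for_vectors(mapping_response, index, field="query_vector"):
    """True only when the field is mapped as dense_vector."""
    return vector_field_type(mapping_response, index, field) == "dense_vector"
```

Running the check between "create the index" and "insert the data" turns the manual "check the index" step into something that fails loudly when the mapping is not what was intended.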