I. Install the ingest attachment plugin
Installation guide: https://blog.csdn.net/catoop/article/details/124468788
II. Define the text extraction pipeline
1. Single attachment (example)
PUT _ingest/pipeline/attachment
{
  "description": "Extract attachment information",
  "processors": [
    {
      "attachment": {
        "field": "data",
        "ignore_missing": true
      }
    },
    {
      "remove": {
        "field": "data"
      }
    }
  ]
}
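The remove processor deletes the attachment itself once the pipeline has processed it: only the text extracted from the attachment is stored in ES, and the attachment's own base64 data is discarded.
2. Multiple attachments (example)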
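To make the role of the data field concrete, here is a minimal sketch of how a client might prepare an index request that goes through this pipeline. It only builds the request and does not send it. The helper name, the document ID, and the sample title are hypothetical; the index name newdoc_dispatch is the one used in the mapping examples of this article, and the pipeline query parameter is the standard way to route a document through an ingest pipeline.

```python
import base64
import json


def build_single_attachment_request(index, doc_id, title, file_bytes):
    """Return (method, path, body) for indexing one document through the pipeline.

    Hypothetical helper for illustration; it does not talk to Elasticsearch.
    """
    # The attachment processor reads base64 text from the field named by
    # "field" in the pipeline definition ("data" in the example above).
    encoded = base64.b64encode(file_bytes).decode("ascii")
    body = {"title": title, "data": encoded}
    # ?pipeline=attachment sends the document through the ingest pipeline.
    path = f"/{index}/_doc/{doc_id}?pipeline=attachment"
    return "PUT", path, json.dumps(body, ensure_ascii=False)


method, path, body = build_single_attachment_request(
    "newdoc_dispatch", "1", "Sample notice", b"hello attachment"
)
print(method, path)
print(body)
```

After indexing, the stored document should contain attachment.content with the extracted text, while data is gone because of the remove processor.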
PUT _ingest/pipeline/attachment
{
  "description": "Extract attachment information",
  "processors": [
    {
      "foreach": {
        "field": "attachments",
        "processor": {
          "attachment": {
            "field": "_ingest._value.data",
            "target_field": "_ingest._value.attachment"
          }
        }
      }
    },
    {
      "foreach": {
        "field": "attachments",
        "processor": {
          "remove": {
            "field": "_ingest._value.data"
          }
        }
      }
    }
  ]
}
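Note that with multiple attachments, field and target_field of the attachment processor must be written as _ingest._value.*; otherwise the processor cannot resolve the correct fields inside each array element.
Starting with ES 8.0, dropping the binary file content only requires setting remove_binary to true on the attachment processor, so a separate remove processor like the one above is no longer needed.
III. Create the document mapping
1. Single attachment (example)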
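The two variants described above can be sketched side by side. The function below is a hypothetical helper that builds the body for PUT _ingest/pipeline/attachment either with remove_binary on the attachment processor or with the extra foreach + remove step. It assumes, as stated above, that the target cluster is recent enough to support remove_binary; check the plugin documentation for the exact version before relying on it.

```python
import json


def multi_attachment_pipeline(remove_binary=True):
    """Build the multi-attachment pipeline body as a dict (illustrative helper)."""
    attachment = {
        "field": "_ingest._value.data",
        "target_field": "_ingest._value.attachment",
    }
    if remove_binary:
        # Assumption: the cluster supports remove_binary (8.0+ per the text above).
        attachment["remove_binary"] = True

    processors = [
        {"foreach": {"field": "attachments",
                     "processor": {"attachment": attachment}}}
    ]
    if not remove_binary:
        # Older clusters: drop the base64 data with a second foreach + remove.
        processors.append(
            {"foreach": {"field": "attachments",
                         "processor": {"remove": {"field": "_ingest._value.data"}}}}
        )
    return {"description": "Extract attachment information",
            "processors": processors}


print(json.dumps(multi_attachment_pipeline(), indent=2))
```

Calling multi_attachment_pipeline(remove_binary=False) yields the same two-processor layout as the example above.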
PUT newdoc_dispatch
{
  "mappings": {
    "properties": {
      "businessId": {
        "type": "keyword"
      },
      "title": {
        "type": "text",
        "analyzer": "ik_smart"
      },
      "fullDocNO": {
        "type": "text",
        "analyzer": "ik_smart"
      },
      "drafterUser": {
        "type": "keyword"
      },
      "dispatchNO": {
        "type": "text",
        "analyzer": "ik_smart"
      },
      "dispatchDept": {
        "type": "keyword"
      },
      "dispatchTime": {
        "type": "date"
      },
      "abolish": {
        "type": "keyword"
      },
      "tenantId": {
        "type": "keyword"
      },
      "attachment": {
        "properties": {
          "content": {
            "type": "text",
            "analyzer": "ik_smart"
          }
        }
      }
    }
  }
}
2. Multiple attachments (example)
PUT newdoc_dispatch
{
  "mappings": {
    "properties": {
      "businessId": {
        "type": "keyword"
      },
      "title": {
        "type": "text",
        "analyzer": "ik_smart"
      },
      "fullDocNO": {
        "type": "text",
        "analyzer": "ik_smart"
      },
      "drafterUser": {
        "type": "keyword"
      },
      "dispatchNO": {
        "type": "text",
        "analyzer": "ik_smart"
      },
      "dispatchDept": {
        "type": "keyword"
      },
      "dispatchTime": {
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
      },
      "abolish": {
        "type": "keyword"
      },
      "tenantid": {
        "type": "keyword"
      },
      "attachments": {
        "properties": {
          "attachment": {
            "properties": {
              "content": {
                "type": "text",
                "analyzer": "ik_smart"
              }
            }
          }
        }
      }
    }
  }
}
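The code in the project implements the multi-attachment example; for the object that corresponds to this mapping structure, see ESDispatchDocumentVo.
Official reference: https://www.elastic.co/guide/en/elasticsearch/plugins/current/using-ingest-attachment.html
Other reference: https://www.cnblogs.com/ncore/p/10475909.html
Code project: https://gitee.com/catoop/es-attachment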
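As a rough illustration of the document shape that the multi-attachment pipeline and mapping expect, the sketch below builds a request body with an attachments array. It is not taken from the project: the function name and the filename field are hypothetical (filename is not part of the mapping above and would be mapped dynamically), and only a few of the mapped fields are filled in.

```python
import base64
import json


def build_multi_attachment_doc(business_id, title, files):
    """Build a document body whose "attachments" array feeds the foreach pipeline.

    files is a list of (filename, bytes) pairs. "filename" is an illustrative
    extra field, not one defined in the mapping above.
    """
    return {
        "businessId": business_id,
        "title": title,
        "attachments": [
            # Each element carries its own base64 "data", which the pipeline
            # reads as _ingest._value.data and replaces with "attachment".
            {"filename": name,
             "data": base64.b64encode(content).decode("ascii")}
            for name, content in files
        ],
    }


doc = build_multi_attachment_doc(
    "biz-001", "Sample notice",
    [("a.txt", b"first file"), ("b.txt", b"second file")],
)
print(json.dumps(doc, ensure_ascii=False, indent=2))
```

Sent with ?pipeline=attachment, each array element should end up with attachment.content, which matches the attachments.attachment.content field in the mapping.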