一、系统分词器
- 可以使用GET发送_analyze命令,指定分析器和需要分析的文本内容
- 标准分析器,按照最小粒度
GET _analyze
{
"analyzer": "standard",
"text": ["中国人ABC"]
}
{
"tokens" : [
{
"token" : "中",
"start_offset" : 0,
"end_offset" : 1,
"type" : "<IDEOGRAPHIC>",
"position" : 0
},
{
"token" : "国",
"start_offset" : 1,
"end_offset" : 2,
"type" : "<IDEOGRAPHIC>",
"position" : 1
},
{
"token" : "人",
"start_offset" : 2,
"end_offset" : 3,
"type" : "<IDEOGRAPHIC>",
"position" : 2
},
{
"token" : "abc",
"start_offset" : 3,
"end_offset" : 6,
"type" : "<ALPHANUM>",
"position" : 3
}
]
}
- 作为关键词,关键词不会拆分
GET _analyze
{
"analyzer": "keyword",
"text": ["中国人ABC"]
}
{
"tokens" : [
{
"token" : "中国人ABC",
"start_offset" : 0,
"end_offset" : 6,
"type" : "word",
"position" : 0
}
]
}
二、IK分词器
2.1 IK分词器说明
- IK分词器提供两个分词算法:ik_smart、ik_max_word
- ik_smart:最少拆分
- ik_max_word:最为细粒度切分
2.2 IK分词器安装
- 下载地址:https://github.com/medcl/elasticsearch-analysis-ik/releases
- 注意事项:版本一定要和ES版本一致
- 解压ik分词器到es/plugins中,文件夹名称用ik
- 重启Elasticsearch,安装完成,在界面启动时将会有插件加载信息
2.3 IK分词器使用
- 可以通过_analyze来测试分词器的使用
GET _analyze
{
"analyzer": "分词器类型",
"text": "我是中国人码坐标"
}
2.3.1 ik_smart
GET _analyze
{
"analyzer": "ik_max_word",
"text": "我是中国人码坐标"
}
{
"tokens" : [
{
"token" : "我",
"start_offset" : 0,
"end_offset" : 1,
"type" : "CN_CHAR",
"position" : 0
},
{
"token" : "是",
"start_offset" : 1,
"end_offset" : 2,
"type" : "CN_CHAR",
"position" : 1
},
{
"token" : "中国人",
"start_offset" : 2,
"end_offset" : 5,
"type" : "CN_WORD",
"position" : 2
},
{
"token" : "码",
"start_offset" : 5,
"end_offset" : 6,
"type" : "CN_CHAR",
"position" : 3
},
{
"token" : "坐标",
"start_offset" : 6,
"end_offset" : 8,
"type" : "CN_WORD",
"position" : 4
}
]
}
2.3.2 ik_max_word
GET _analyze
{
"analyzer": "ik_max_word",
"text": "我是中国人码坐标"
}
{
"tokens" : [
{
"token" : "我",
"start_offset" : 0,
"end_offset" : 1,
"type" : "CN_CHAR",
"position" : 0
},
{
"token" : "是",
"start_offset" : 1,
"end_offset" : 2,
"type" : "CN_CHAR",
"position" : 1
},
{
"token" : "中国人",
"start_offset" : 2,
"end_offset" : 5,
"type" : "CN_WORD",
"position" : 2
},
{
"token" : "中国",
"start_offset" : 2,
"end_offset" : 4,
"type" : "CN_WORD",
"position" : 3
},
{
"token" : "国人",
"start_offset" : 3,
"end_offset" : 5,
"type" : "CN_WORD",
"position" : 4
},
{
"token" : "码",
"start_offset" : 5,
"end_offset" : 6,
"type" : "CN_CHAR",
"position" : 5
},
{
"token" : "坐标",
"start_offset" : 6,
"end_offset" : 8,
"type" : "CN_WORD",
"position" : 6
}
]
}
2.4 自定义数据词典
- 在elasticsearch/plugins/ik/config下新建.dic文件,例如此处为codecoord.dic
- 编辑codecoord.dic文件,在其中加入词典,加入的信息在分词器中将会作为一个词语使用,不会进行拆分
- 编辑ik/config/IKAnalyzer.cfg.xml文件,在ext_dict中加入刚刚创建的codecoord.dic词典,多个使用逗号分开
- 此时进行分词器的使用,将会作为一个词语显示
{
"tokens" : [
{
"token" : "我",
"start_offset" : 0,
"end_offset" : 1,
"type" : "CN_CHAR",
"position" : 0
},
{
"token" : "是",
"start_offset" : 1,
"end_offset" : 2,
"type" : "CN_CHAR",
"position" : 1
},
{
"token" : "中国人",
"start_offset" : 2,
"end_offset" : 5,
"type" : "CN_WORD",
"position" : 2
},
{
"token" : "码坐标",
"start_offset" : 5,
"end_offset" : 8,
"type" : "CN_CHAR",
"position" : 3
},
{
"token" : "坐标",
"start_offset" : 6,
"end_offset" : 8,
"type" : "CN_WORD",
"position" : 4
}
]
}
2.5 IK分词器查询
- 创建索引并指定分析器
PUT index
{
"mappings": {
"properties": {
"content": {
"type": "text",
"analyzer": "ik_max_word",
"search_analyzer": "ik_smart"
}
}
}
}
- 创建文档
POST index/_doc/1
{
"content": "美国留给伊拉克的是个烂摊子吗"
}
POST index/_doc/2
{
"content": "公安部:各地校车将享最高路权"
}
POST index/_doc/3
{
"content": "中韩渔警冲突调查:韩警平均每天扣1艘中国渔船"
}
POST index/_doc/4
{
"content": "中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"
}
- 搜索时指定高亮信息
GET index/_search
{
"query": {
"match": {
"content": "中国"
}
},
"highlight" : {
"pre_tags" : ["<tag1>", "<tag2>"],
"post_tags" : ["</tag1>", "</tag2>"],
"fields" : {
"content" : {}
}
}
}
- 将会在高亮highlight中返回高亮信息
{
"took" : 50,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 0.642793,
"hits" : [
{
"_index" : "index",
"_type" : "_doc",
"_id" : "3",
"_score" : 0.642793,
"_source" : {
"content" : "中韩渔警冲突调查:韩警平均每天扣1艘中国渔船"
},
"highlight" : {
"content" : [
"中韩渔警冲突调查:韩警平均每天扣1艘<tag1>中国</tag1>渔船"
]
}
},
{
"_index" : "index",
"_type" : "_doc",
"_id" : "4",
"_score" : 0.642793,
"_source" : {
"content" : "中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"
},
"highlight" : {
"content" : [
"<tag1>中国</tag1>驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"
]
}
}
]
}
}
|