[Big Data] Elasticsearch Built-in Analyzers

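Elasticsearch provides 8 built-in analyzers that can be used without any configuration. Each analyzer is made up of three components: character filters, a tokenizer, and token filters. These three components process the input text in sequence, like a pipeline.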

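Character Filters: pre-process the raw input text, for example by stripping HTML tags.
Tokenizer: splits the output of the previous step into individual words according to its rules.
Token Filters: post-process the tokens, for example by lowercasing them, removing stop words, or adding synonyms.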

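For example, take the text "<h2>Hi</h2>,My name is QiQi." and run it through a pipeline made of a character filter, the standard tokenizer, and a lowercase token filter. (The built-in standard analyzer has no character filter of its own, so the first step assumes one has been added.)

Character filters strip the tag markup such as < and >, giving: h2 Hi h2,My name is QiQi. (different character filters produce different results)
The tokenizer splits that result into words and drops the punctuation, giving: h2 Hi h2 My name is QiQi
Token filters process the tokens further and lowercase them, giving: h2 hi h2 my name is qiqi
The final result is h2 hi h2 my name is qiqi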
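
The three stages above can be sketched in a few lines of Python. This is only an illustration of the pipeline idea, not how Elasticsearch is implemented: the function names are hypothetical, the character filter simply replaces <, > and / with spaces, and a `\w+` regular expression stands in for real word-boundary tokenization.

```python
import re

def char_filter(text):
    # Hypothetical character filter: replace tag markup characters with spaces.
    return re.sub(r"[<>/]", " ", text)

def tokenizer(text):
    # Rough stand-in for word-boundary tokenization: keep runs of word characters.
    return re.findall(r"\w+", text)

def token_filter(tokens):
    # Lowercase token filter.
    return [t.lower() for t in tokens]

def analyze(text):
    # Run the three stages in order, each consuming the previous stage's output.
    return token_filter(tokenizer(char_filter(text)))

print(analyze("<h2>Hi</h2>,My name is QiQi."))
# ['h2', 'hi', 'h2', 'my', 'name', 'is', 'qiqi']
```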

Elasticsearch Built-in Analyzers

Standard Analyzer

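The standard analyzer splits text on word boundaries (as defined by the Unicode Text Segmentation algorithm), removes most punctuation, lowercases the tokens, and supports removing stop words (disabled by default).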

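What are stop words?
In information retrieval, stop words are characters or words that are automatically filtered out when text is processed, in order to save storage space and improve search efficiency. They fall roughly into two categories. The first is function words, which are extremely common and carry little meaning of their own, such as "the", "is", "which", and "on". The second is lexical words such as "want": they are used so widely that a search engine cannot guarantee truly relevant results for them, so they do little to narrow a search and they lower search efficiency. In practice these words are usually filtered out of the text, which saves index storage and improves search performance.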

For example, analyzing "I am a student.":

curl -L -X GET 'http://192.168.205.128:9200/_analyze' \
-H 'content-type: application/json' \
-d '{
    "text":"I am a student.",
    "analyzer":"standard"
}'

{
    "tokens": [
        {
            "token": "i",
            "start_offset": 0,
            "end_offset": 1,
            "type": "<ALPHANUM>",
            "position": 0
        },
        {
            "token": "am",
            "start_offset": 2,
            "end_offset": 4,
            "type": "<ALPHANUM>",
            "position": 1
        },
        {
            "token": "a",
            "start_offset": 5,
            "end_offset": 6,
            "type": "<ALPHANUM>",
            "position": 2
        },
        {
            "token": "student",
            "start_offset": 7,
            "end_offset": 14,
            "type": "<ALPHANUM>",
            "position": 3
        }
    ]
}

The standard analyzer splits the text into individual words, lowercases them, and removes most punctuation.
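
Stop word removal is off by default in the standard analyzer. The sketch below builds an index settings body that turns it on through the analyzer's `stopwords` parameter with the predefined `_english_` list. The analyzer name `std_english` is an assumption chosen for illustration, and the script only prints the JSON; it does not contact a cluster.

```python
import json

# Hypothetical analyzer name; any name can be used.
ANALYZER_NAME = "std_english"

def standard_with_stopwords(stopwords="_english_"):
    # Build index settings that configure a standard analyzer with stop words enabled.
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    ANALYZER_NAME: {
                        "type": "standard",
                        "stopwords": stopwords,
                    }
                }
            }
        }
    }

body = standard_with_stopwords()
print(json.dumps(body, indent=4))
```

The printed body could be sent when creating an index, after which the analyzer can be tried through that index's `_analyze` endpoint by passing `"analyzer": "std_english"`.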

Simple Analyzer

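The simple analyzer splits text at every non-letter character and lowercases the resulting tokens.
For example, analyzing I123am a student. with the simple analyzer: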

curl -L -X GET 'http://192.168.205.128:9200/_analyze' \
-H 'content-type: application/json' \
--data-raw '{
    "analyzer":"simple",
    "text":"I123am a student."
}'

{
    "tokens": [
        {
            "token": "i",
            "start_offset": 0,
            "end_offset": 1,
            "type": "word",
            "position": 0
        },
        {
            "token": "am",
            "start_offset": 4,
            "end_offset": 6,
            "type": "word",
            "position": 1
        },
        {
            "token": "a",
            "start_offset": 7,
            "end_offset": 8,
            "type": "word",
            "position": 2
        },
        {
            "token": "student",
            "start_offset": 9,
            "end_offset": 16,
            "type": "word",
            "position": 3
        }
    ]
}

I123am is split into the two terms i and am, and the digits 123 are discarded. The terms are lowercased and the punctuation is removed.
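
The behaviour seen above can be approximated with a regular expression that keeps only runs of letters. This is a minimal sketch for illustration, not the Lucene implementation; the function name `simple_analyze` is hypothetical.

```python
import re

def simple_analyze(text):
    # Approximate the simple analyzer: keep runs of letters, then lowercase them.
    # [^\W\d_] matches any word character that is not a digit or underscore, i.e. a letter.
    tokens = []
    for position, match in enumerate(re.finditer(r"[^\W\d_]+", text)):
        tokens.append({
            "token": match.group().lower(),
            "start_offset": match.start(),
            "end_offset": match.end(),
            "position": position,
        })
    return tokens

for tok in simple_analyze("I123am a student."):
    print(tok)
```

Run on I123am a student., the sketch yields the tokens i, am, a, student with the same offsets as the response above.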

Whitespace Analyzer

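The whitespace analyzer splits text on whitespace. It does not lowercase the tokens and does not remove punctuation from them.
For example, analyzing the text Good morning!Miss Wang!: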

curl -L -X GET 'http://192.168.205.128:9200/_analyze' \
-H 'content-type: application/json' \
-d '{
    "analyzer":"whitespace",
    "text":"Good morning!Miss Wang!"
}'

{
    "tokens": [
        {
            "token": "Good",
            "start_offset": 0,
            "end_offset": 4,
            "type": "word",
            "position": 0
        },
        {
            "token": "morning!Miss",
            "start_offset": 5,
            "end_offset": 17,
            "type": "word",
            "position": 1
        },
        {
            "token": "Wang!",
            "start_offset": 18,
            "end_offset": 23,
            "type": "word",
            "position": 2
        }
    ]
}
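
The same splitting rule can be sketched with a regular expression that collects runs of non-whitespace characters. This is an approximation for illustration only; `whitespace_analyze` is a hypothetical name.

```python
import re

def whitespace_analyze(text):
    # Split on whitespace only; case and punctuation are left untouched.
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]

print(whitespace_analyze("Good morning!Miss Wang!"))
# [('Good', 0, 4), ('morning!Miss', 5, 17), ('Wang!', 18, 23)]
```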

Stop Analyzer

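The stop analyzer is very similar to the simple analyzer, except that it also supports removing stop words. By default it uses the _english_ stop word list.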
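
A rough sketch of that behaviour: tokenize on non-letters, lowercase, then drop any token found in a stop word list. The list below is a small illustrative subset chosen for this example, not the full list shipped with Elasticsearch, and `stop_analyze` is a hypothetical name.

```python
import re

# Illustrative subset only; the real _english_ list contains more words.
STOP_WORDS = {"a", "an", "and", "is", "of", "on", "the", "to"}

def stop_analyze(text, stop_words=STOP_WORDS):
    # Step 1: same splitting and lowercasing as the simple analyzer.
    tokens = [m.group().lower() for m in re.finditer(r"[^\W\d_]+", text)]
    # Step 2: drop tokens that appear in the stop word list.
    return [t for t in tokens if t not in stop_words]

print(stop_analyze("I am a student."))
# ['i', 'am', 'student']
```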

Keyword Analyzer

The keyword analyzer does not split the input text at all; the entire input is emitted as a single term.

curl -L -X GET 'http://192.168.205.128:9200/_analyze' \
-H 'content-type: application/json' \
-d '{
    "analyzer":"keyword",
    "text":"Good morning!Miss Wang!"
}'

{
    "tokens": [
        {
            "token": "Good morning!Miss Wang!",
            "start_offset": 0,
            "end_offset": 23,
            "type": "word",
            "position": 0
        }
    ]
}

Pattern Analyzer

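The pattern analyzer splits text using a regular expression. It also supports lowercasing and stop word removal. The default regular expression is \W+, which matches the runs of non-word characters that separate tokens, so punctuation is dropped.
For example, analyzing Good morning!Miss Wang!: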

curl -L -X GET 'http://192.168.205.128:9200/_analyze' \
-H 'content-type: application/json' \
-d '{
    "analyzer":"pattern",
    "text":"Good morning!Miss Wang!"
}'

{
    "tokens": [
        {
            "token": "good",
            "start_offset": 0,
            "end_offset": 4,
            "type": "word",
            "position": 0
        },
        {
            "token": "morning",
            "start_offset": 5,
            "end_offset": 12,
            "type": "word",
            "position": 1
        },
        {
            "token": "miss",
            "start_offset": 13,
            "end_offset": 17,
            "type": "word",
            "position": 2
        },
        {
            "token": "wang",
            "start_offset": 18,
            "end_offset": 22,
            "type": "word",
            "position": 3
        }
    ]
}
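
Splitting on a separator pattern can be sketched with `re.split`. This is an approximation under stated assumptions: `pattern_analyze` is a hypothetical name, and Python's `\W` is Unicode-aware by default, so results for non-ASCII text may differ from the Java regular expression engine that Elasticsearch uses.

```python
import re

def pattern_analyze(text, pattern=r"\W+", lowercase=True):
    # The pattern describes the separators between tokens, not the tokens themselves.
    pieces = [p for p in re.split(pattern, text) if p]
    return [p.lower() for p in pieces] if lowercase else pieces

print(pattern_analyze("Good morning!Miss Wang!"))
# ['good', 'morning', 'miss', 'wang']
print(pattern_analyze("Red,Green,Blue", pattern=","))
# ['red', 'green', 'blue']
```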

Language Analyzers

Elasticsearch provides many language-specific analyzers, such as those for English or French.
For example, using chinese to analyze 我是中国人！:

curl -L -X GET 'http://192.168.205.128:9200/_analyze' \
-H 'content-type: application/json' \
-d '{
    "analyzer":"chinese",
    "text":"我是中国人！"
}'

{
    "tokens": [
        {
            "token": "我",
            "start_offset": 0,
            "end_offset": 1,
            "type": "<IDEOGRAPHIC>",
            "position": 0
        },
        {
            "token": "是",
            "start_offset": 1,
            "end_offset": 2,
            "type": "<IDEOGRAPHIC>",
            "position": 1
        },
        {
            "token": "中",
            "start_offset": 2,
            "end_offset": 3,
            "type": "<IDEOGRAPHIC>",
            "position": 2
        },
        {
            "token": "国",
            "start_offset": 3,
            "end_offset": 4,
            "type": "<IDEOGRAPHIC>",
            "position": 3
        },
        {
            "token": "人",
            "start_offset": 4,
            "end_offset": 5,
            "type": "<IDEOGRAPHIC>",
            "position": 4
        }
    ]
}

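As the output shows, Chinese is not handled well: the text is simply split into single characters. The analyzers bundled with ES are generally not used for Chinese; a third-party plugin such as IK can be used instead.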

Fingerprint Analyzer

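The fingerprint analyzer is a special analyzer that creates a fingerprint which can be used for duplicate detection. It lowercases the input and removes punctuation, then sorts the tokens, removes duplicates, and concatenates them into a single token.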

curl -L -X GET 'http://192.168.205.128:9200/_analyze' \
-H 'content-type: application/json' \
-d '{
    "analyzer":"fingerprint",
    "text":"我是你大爷！"
}'

{
    "tokens": [
        {
            "token": "你 大 我 是 爷",
            "start_offset": 0,
            "end_offset": 6,
            "type": "fingerprint",
            "position": 0
        }
    ]
}
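
The single sorted token in the response can be reproduced with a short sketch. It is an approximation under stated assumptions: the function names are hypothetical, the regular expression emits one token per CJK ideograph (mimicking how the text was split above) and otherwise takes runs of word characters, and stripping combining marks stands in for ASCII folding.

```python
import re
import unicodedata

def fold(token):
    # Strip diacritics as a rough stand-in for ASCII folding.
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def fingerprint(text, separator=" "):
    # One token per CJK ideograph, otherwise runs of word characters.
    raw = re.findall(r"[\u4e00-\u9fff]|[^\W\u4e00-\u9fff]+", text)
    # Lowercase and fold, then deduplicate with a set.
    normalized = {fold(t.lower()) for t in raw}
    # Sort and join into a single fingerprint token.
    return separator.join(sorted(normalized))

print(fingerprint("我是你大爷！"))
# 你 大 我 是 爷
print(fingerprint("Yes yes, Gödel said this sentence is consistent and."))
# and consistent godel is said sentence this yes
```

Because duplicates are removed and the tokens are sorted, two inputs that differ only in word order, case, or repetition produce the same fingerprint.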

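In fact, none of the ES built-in analyzers above handle Chinese well. Chinese analysis is currently usually done with third-party plugins such as the IK analyzer, which supports custom dictionaries and can also hot-reload them.
Another option is THULAC (THU Lexical Analyzer for Chinese), a Chinese lexical analysis toolkit developed by the Natural Language Processing and Computational Social Science Lab at Tsinghua University. It provides Chinese word segmentation and part-of-speech tagging.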

Of course, ES also supports custom analyzers!
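
As a minimal sketch of what a custom analyzer definition can look like, the settings body below combines the three component types described at the start of the article: the built-in `html_strip` character filter, the `standard` tokenizer, and the `lowercase` token filter. The analyzer name `my_custom_analyzer` is an assumption for illustration, and the script only prints the JSON rather than sending it to a cluster.

```python
import json

def custom_analyzer_settings(name="my_custom_analyzer"):
    # "name" is a hypothetical analyzer name chosen for this example.
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    name: {
                        "type": "custom",
                        "char_filter": ["html_strip"],  # character filters
                        "tokenizer": "standard",        # exactly one tokenizer
                        "filter": ["lowercase"],        # token filters
                    }
                }
            }
        }
    }

settings = custom_analyzer_settings()
print(json.dumps(settings, indent=4))
```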
