[Big Data] Elasticsearch: Fixing inconsistent scores and scrambled ordering when searching identical content

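Problem

Search results need to be ranked intelligently by relevance.
When documents have similar text and therefore the same score, a secondary sort rule, such as creation time, has to decide the order.
We later found that among such similar documents, some received a different score from the rest, which pushed them down the ranking.
Take the data below as an example. A fuzzy search for "上半年经济运行" ("economic performance in the first half of the year") should match on the title, and hits with equal scores should then be sorted by time in descending order. In practice, however, the 2009 document came first and the 2021 document second, which is not acceptable.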

[
    {
        "createDate": "2009-07-21",
        "id": "7917561",
        "title": "2009年上半年全省经济运行情况"
    },
    {
        "createDate": "2021-08-02",
        "id": "8193901",
        "title": "2021年上半年全省经济运行情况"
    },
    {
        "createDate": "2020-08-02",
        "id": "8193891",
        "title": "2020年上半年全省经济运行情况"
    },
    {
        "createDate": "2019-08-02",
        "id": "8193881",
        "title": "2019年上半年全省经济运行情况"
    },
    {
        "createDate": "2014-08-02",
        "id": "8193861",
        "title": "2014年上半年全省经济运行情况"
    },
    {
        "createDate": "2019-07-18",
        "id": "4271871",
        "title": "2019年上半年全省经济运行情况"
    },
    {
        "createDate": "2017-08-02",
        "id": "8193871",
        "title": "2017年上半年全省经济运行情况"
    },
    {
        "createDate": "2017-01-23",
        "id": "7914371",
        "title": "2016年全省经济运行情况"
    },
    {
        "createDate": "2016-01-22",
        "id": "7914981",
        "title": "2015年全省经济运行情况"
    },
    {
        "createDate": "2015-01-22",
        "id": "7915411",
        "title": "2014年全省经济运行情况"
    },
    {
        "createDate": "2014-01-23",
        "id": "7915791",
        "title": "2013年全省经济运行情况"
    },
    {
        "createDate": "2012-01-20",
        "id": "7916451",
        "title": "2011年全省经济运行情况"
    },
    {
        "createDate": "2011-01-24",
        "id": "7916941",
        "title": "2010年全省经济运行情况"
    },
    {
        "createDate": "2010-01-23",
        "id": "7917271",
        "title": "2009年全省经济运行情况"
    }
]
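
The intended ordering can be sketched in plain Java, independent of Elasticsearch: sort by score descending, then by createDate descending. The class name, the Doc holder type, and the score values are assumptions made up for illustration; only the ids and dates come from the sample data above.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class TieBreakSortSketch {

    /** Hypothetical holder for one search hit. */
    static final class Doc {
        final String id;
        final LocalDate createDate;
        final double score;

        Doc(String id, String createDate, double score) {
            this.id = id;
            this.createDate = LocalDate.parse(createDate);
            this.score = score;
        }
    }

    // Higher score first.
    static final Comparator<Doc> BY_SCORE_DESC =
            Comparator.comparingDouble((Doc d) -> d.score).reversed();

    // Newer document first.
    static final Comparator<Doc> BY_DATE_DESC =
            Comparator.comparing((Doc d) -> d.createDate).reversed();

    /** Returns a new list ordered by score, with createDate as the tie-breaker. */
    static List<Doc> rank(List<Doc> hits) {
        List<Doc> sorted = new ArrayList<>(hits);
        sorted.sort(BY_SCORE_DESC.thenComparing(BY_DATE_DESC));
        return sorted;
    }

    public static void main(String[] args) {
        List<Doc> equalScores = new ArrayList<>();
        equalScores.add(new Doc("7917561", "2009-07-21", 1.0)); // assumed score
        equalScores.add(new Doc("8193901", "2021-08-02", 1.0)); // assumed score
        System.out.println(rank(equalScores).get(0).id); // prints 8193901

        // A slightly higher score on the 2009 document defeats the tie-breaker.
        List<Doc> unequalScores = new ArrayList<>();
        unequalScores.add(new Doc("7917561", "2009-07-21", 1.05)); // assumed score
        unequalScores.add(new Doc("8193901", "2021-08-02", 1.0));  // assumed score
        System.out.println(rank(unequalScores).get(0).id); // prints 7917561
    }
}
```

The date tie-breaker only takes effect when the scores are exactly equal, which is why even a small scoring difference between shards changes the visible order.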

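Root cause analysis

Shards and Lucene

The same data can receive different relevance scores on different shards of an index.
This is because every shard is a Lucene instance, and Lucene computes relevance from TF/IDF statistics (the classic TF/IDF similarity, or BM25, the default since Elasticsearch 5.0, which relies on the same kind of term statistics). Each Lucene instance only stores its own TF and IDF statistics, so a shard only knows how often a term occurs within that shard, not across the whole cluster.

TF: short for Term Frequency, how often the term occurs in the current document
IDF: short for Inverse Document Frequency, an inverse measure of how many documents contain the term; the fewer documents contain it, the higher the IDF

As the TF/IDF algorithm shows, the more often a term occurs in the current document, the higher the score, and the rarer the term is across all documents, the higher the score. A term's score therefore depends not only on the matching document but also on how many documents the shard holds and what they contain.
Documents are assigned to shards by a hash function, so the shards do not always hold the same number of documents. The imbalance can be quite noticeable when the total number of documents is small. As a result, the same document text can score differently for a term depending on which shard holds it.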
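
To make the effect concrete, here is a minimal sketch of how IDF alone changes with a shard's local statistics. It assumes the BM25 form of IDF, ln(1 + (N - n + 0.5) / (n + 0.5)), where N is the number of documents on the shard and n the number of those containing the term. The class name and the shard sizes are made up for illustration, and a real score also involves TF and length normalization.

```java
final class ShardIdfSketch {

    /** BM25-style IDF computed from one shard's local statistics. */
    static double idf(long docCount, long docFreq) {
        return Math.log(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    public static void main(String[] args) {
        // Hypothetical shards: the term occurs in 4 documents on each,
        // but the shards hold a different number of documents overall.
        double shardA = idf(5, 4); // 5 documents on the shard
        double shardB = idf(9, 4); // 9 documents on the shard
        System.out.printf("shard A idf=%.4f, shard B idf=%.4f%n", shardA, shardB);
    }
}
```

Two documents with identical text would get the same TF on both shards, yet the differing IDF values are enough to give them different scores.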

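searchType

QUERY_THEN_FETCH

Elasticsearch searches use QUERY_THEN_FETCH by default.
According to the official documentation, a QUERY_THEN_FETCH search runs through the following steps:
- Send the query to every shard
- Find all matching documents and score them, using the shard's local TF/IDF statistics
- Build a priority queue of the results (sorting, pagination, and so on)
- Return enough metadata about the results to the requesting node; note that this does not include the document content
- Merge the scores from all shards and sort them on the requesting node, selecting the documents for the requested page and size
- Finally, fetch the actual documents from the individual shards that hold them (this time including the document content)
- Assemble the results as requested and return them to the user
As these steps show, the default mode does not guarantee that identical documents receive identical scores.
In practice, though, the results are still good when accuracy requirements are not that strict, so the default is sufficient for typical search scenarios.
Elasticsearch (not Lucene) routes documents to shards by hashing a routing value, which by default is the document _id. With a large amount of data, the hash spreads documents roughly evenly across shards, so the default mode also produces quite good results.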
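
The scatter/gather flow above can be sketched in plain Java. This is a simplified simulation, not Elasticsearch code: the Hit type, the method names, and the scores are assumptions for illustration. Each shard keeps its best hits by locally computed score, and the coordinating node merges those lists without rescoring them.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class QueryThenFetchSketch {

    /** Hypothetical metadata returned by the query phase: id and score only. */
    static final class Hit {
        final String id;
        final double score;

        Hit(String id, double score) {
            this.id = id;
            this.score = score;
        }
    }

    /** Keeps the `size` best hits, highest score first. */
    static List<Hit> top(List<Hit> hits, int size) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return new ArrayList<>(sorted.subList(0, Math.min(size, sorted.size())));
    }

    /** Coordinating node: merges the per-shard top lists and cuts to `size`. */
    static List<Hit> merge(List<List<Hit>> perShardHits, int size) {
        List<Hit> merged = new ArrayList<>();
        for (List<Hit> shardHits : perShardHits) {
            merged.addAll(top(shardHits, size)); // query phase on one shard
        }
        return top(merged, size); // the fetch phase would load these ids
    }

    public static void main(String[] args) {
        // Same title text, but scored with different local statistics.
        List<Hit> shard0 = new ArrayList<>();
        shard0.add(new Hit("7917561", 1.30)); // assumed local score
        List<Hit> shard1 = new ArrayList<>();
        shard1.add(new Hit("8193901", 1.25)); // assumed local score
        shard1.add(new Hit("8193891", 1.25)); // assumed local score

        List<List<Hit>> perShard = new ArrayList<>();
        perShard.add(shard0);
        perShard.add(shard1);
        for (Hit hit : merge(perShard, 2)) {
            System.out.println(hit.id + " " + hit.score);
        }
    }
}
```

Because the merge step trusts the scores each shard reports, the hit from the shard with the more favorable local statistics ends up first.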

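DFS_QUERY_THEN_FETCH

The search_type parameter selects a different search mode, and DFS_QUERY_THEN_FETCH is the one Elasticsearch provides for the problem described above.
It is largely the same as QUERY_THEN_FETCH.
The difference is an initial scatter phase in which DFS_QUERY_THEN_FETCH asks all shards for their TF/IDF statistics in order to score more accurately.
Each shard can then run its part of the query using the global TF/IDF statistics gathered in that pre-query phase.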
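
A minimal sketch of the idea behind the extra phase, assuming the BM25 form of IDF, ln(1 + (N - n + 0.5) / (n + 0.5)): the per-shard document counts and document frequencies are summed first, and every shard then scores with the same global IDF. The ShardStats type, the method names, and the numbers are hypothetical, and the real implementation is more involved than this.

```java
import java.util.Arrays;
import java.util.List;

final class DfsStatsSketch {

    /** Hypothetical per-shard statistics for a single term. */
    static final class ShardStats {
        final long docCount; // documents on the shard
        final long docFreq;  // documents on the shard containing the term

        ShardStats(long docCount, long docFreq) {
            this.docCount = docCount;
            this.docFreq = docFreq;
        }
    }

    /** BM25-style IDF. */
    static double idf(long docCount, long docFreq) {
        return Math.log(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    /** Sums the statistics of all shards and derives one shared IDF. */
    static double globalIdf(List<ShardStats> shards) {
        long docCount = 0;
        long docFreq = 0;
        for (ShardStats shard : shards) {
            docCount += shard.docCount;
            docFreq += shard.docFreq;
        }
        return idf(docCount, docFreq);
    }

    public static void main(String[] args) {
        List<ShardStats> shards = Arrays.asList(
                new ShardStats(5, 4),  // assumed statistics for shard A
                new ShardStats(9, 4)); // assumed statistics for shard B
        System.out.printf("local A=%.4f, local B=%.4f, global=%.4f%n",
                idf(5, 4), idf(9, 4), globalIdf(shards));
    }
}
```

With the shared value, identical text scores identically no matter which shard holds it, so a secondary sort on time can break the tie as intended.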

Source code

/*
 * Licensed to Elasticsearch under one or more contributor
 * license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright
 * ownership. Elasticsearch licenses this file to you under
 * the Apache License, Version 2.0 (the "License"); you may
 * not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing,
 * software distributed under the License is distributed on an
 * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 * KIND, either express or implied.  See the License for the
 * specific language governing permissions and limitations
 * under the License.
 */

package org.elasticsearch.action.search;

/**
 * Search type represent the manner at which the search operation is executed.
 *
 *
 */
public enum SearchType {
    /**
     * Same as {@link #QUERY_THEN_FETCH}, except for an initial scatter phase which goes and computes the distributed
     * term frequencies for more accurate scoring.
     */
    DFS_QUERY_THEN_FETCH((byte) 0),
    /**
     * The query is executed against all shards, but only enough information is returned (not the document content).
     * The results are then sorted and ranked, and based on it, only the relevant shards are asked for the actual
     * document content. The return number of hits is exactly as specified in size, since they are the only ones that
     * are fetched. This is very handy when the index has a lot of shards (not replicas, shard id groups).
     */
    QUERY_THEN_FETCH((byte) 1),
    // 2 used to be DFS_QUERY_AND_FETCH

    /**
     * Only used for pre 5.3 request where this type is still needed
     */
    @Deprecated
    QUERY_AND_FETCH((byte) 3);

    /**
     * The default search type ({@link #QUERY_THEN_FETCH}.
     */
    public static final SearchType DEFAULT = QUERY_THEN_FETCH;

    private byte id;

    SearchType(byte id) {
        this.id = id;
    }

    /**
     * The internal id of the type.
     */
    public byte id() {
        return this.id;
    }

    /**
     * Constructs search type based on the internal id.
     */
    public static SearchType fromId(byte id) {
        if (id == 0) {
            return DFS_QUERY_THEN_FETCH;
        } else if (id == 1
            || id == 3) { // This is a BWC layer for pre 5.3 indices where QUERY_AND_FETCH was id 3 but we don't have it anymore from 5.3 on
            return QUERY_THEN_FETCH;
        } else {
            throw new IllegalArgumentException("No search type for [" + id + "]");
        }
    }

    /**
     * The a string representation search type to execute, defaults to {@link SearchType#DEFAULT}. Can be
     * one of "dfs_query_then_fetch"/"dfsQueryThenFetch", "dfs_query_and_fetch"/"dfsQueryAndFetch",
     * "query_then_fetch"/"queryThenFetch" and "query_and_fetch"/"queryAndFetch".
     */
    public static SearchType fromString(String searchType) {
        if (searchType == null) {
            return SearchType.DEFAULT;
        }
        if ("dfs_query_then_fetch".equals(searchType)) {
            return SearchType.DFS_QUERY_THEN_FETCH;
        } else if ("query_then_fetch".equals(searchType)) {
            return SearchType.QUERY_THEN_FETCH;
        } else {
            throw new IllegalArgumentException("No search type for [" + searchType + "]");
        }
    }
}

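Solution

If scores must be consistent, use DFS_QUERY_THEN_FETCH. This mode may cost a little query performance because of the extra round trip to the shards; in our production environment the overhead has been negligible so far.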

searchRequestBuilder.setSearchType(SearchType.DFS_QUERY_THEN_FETCH).get();

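If the data volume is small, consider using a single shard by setting number_of_shards=1 in the index settings. Note that number_of_shards is fixed when the index is created, so an existing index has to be reindexed or shrunk to change it.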

Added: 2021-08-03 11:16:33 Updated: 2021-08-03 11:17:37
