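I. Three pagination techniques in Elasticsearch
1. from + size pagination: slow for deep pages
This is the most basic pagination method. To serve a page, every shard has to collect and sort its top from + size hits, and the coordinating node then merges them and discards the first from. This causes the deep-pagination problem: the larger the page number, the slower the response. It suits paging through small result sets.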
By default, you cannot use from and size to page through more than 10,000 hits. This limit is a safeguard set by the index.max_result_window index setting. If you need to page through more than 10,000 hits, use the search_after parameter instead.
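To make the cost of deep pages concrete, here is a rough sketch of how many hits a from + size request has to collect across shards. The class and method names are hypothetical (this is not an Elasticsearch API), and the shard count of 5 is only an assumed example.

```java
// Rough cost model for from + size paging (hypothetical helper, not an Elasticsearch API).
final class DeepPagingCost {

    // Default value of index.max_result_window, as quoted above.
    static final int MAX_RESULT_WINDOW = 10_000;

    // Each shard collects its own top (from + size) hits before the coordinating node merges them.
    static long hitsCollected(int shards, int from, int size) {
        return (long) shards * ((long) from + size);
    }

    // True when from + size goes past the default result window.
    static boolean exceedsWindow(int from, int size) {
        return (long) from + size > MAX_RESULT_WINDOW;
    }

    public static void main(String[] args) {
        System.out.println(hitsCollected(5, 0, 10));      // first page: 50
        System.out.println(hitsCollected(5, 9_990, 10));  // last page inside the default window: 50000
        System.out.println(exceedsWindow(10_000, 10));    // true
    }
}
```

The page size is the same in both calls, yet the deep page collects a thousand times more hits, which is why the window is capped by default.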
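2. scroll search
scroll has long been the recommended way to page through large result sets, and it avoids the deep-pagination problem. Note that recent versions of the Elasticsearch documentation no longer recommend scroll for deep pagination and point to search_after with a point in time (PIT) instead. Its drawbacks: a search context has to be maintained, too many open scroll queries degrade Elasticsearch performance, and the segments used by a scroll cannot be deleted after a merge until the context expires.
The query steps are as follows:
- Submit a scroll search request. Elasticsearch creates a search context, which is effectively a snapshot of the result set. The scroll=1m parameter keeps that context alive for one minute. The response contains the first batch of hits and a _scroll_id.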
POST /twitter/_search?scroll=1m
{
  "size": 100,
  "query": {
    "match": {
      "title": "elasticsearch"
    }
  }
}
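- Pass the _scroll_id from the previous response to the scroll API to fetch the next batch. This request also returns a _scroll_id. It is usually the same as the previous one but can change, so always use the most recently returned _scroll_id for the next request. The scroll parameter resets the keep-alive of the context to one minute, so that the context does not expire before all the data has been read.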
GET /_search/scroll
{
  "scroll": "1m",
  "scroll_id": "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAAD4WYm9laVYtZndUQlNsdDcwakFMNjU1QQ=="
}
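The two steps above are normally driven by a client loop. The sketch below shows one way to write it: read the first batch from the initial search, keep calling the scroll endpoint with the newest id until a batch comes back empty, then clear the context (Elasticsearch exposes DELETE /_search/scroll for that). ScrollPage, ScrollClient and FakeScrollClient are hypothetical names invented for this illustration; the fake client serves an in-memory list so the loop can run without a cluster.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, minimal view of a scroll response: one batch of hits plus the id for the next call.
final class ScrollPage {
    final String scrollId;
    final List<String> hits;
    ScrollPage(String scrollId, List<String> hits) { this.scrollId = scrollId; this.hits = hits; }
}

// Hypothetical client abstraction; a real one would send the HTTP requests shown above.
interface ScrollClient {
    ScrollPage start(int size);        // POST /twitter/_search?scroll=1m
    ScrollPage next(String scrollId);  // GET /_search/scroll
    void clear(String scrollId);       // DELETE /_search/scroll
}

// In-memory stand-in that hands out a new id on every call and rejects stale ones.
final class FakeScrollClient implements ScrollClient {
    private final List<String> snapshot;
    private int size;
    private int cursor;
    private int generation;
    boolean cleared;

    FakeScrollClient(List<String> docs) { this.snapshot = new ArrayList<>(docs); }

    public ScrollPage start(int size) { this.size = size; this.cursor = 0; return page(); }

    public ScrollPage next(String scrollId) {
        if (!scrollId.equals("scroll-" + generation)) throw new IllegalStateException("stale scroll id");
        return page();
    }

    public void clear(String scrollId) { cleared = true; }

    private ScrollPage page() {
        int end = Math.min(cursor + size, snapshot.size());
        List<String> hits = new ArrayList<>(snapshot.subList(cursor, end));
        cursor = end;
        generation++;
        return new ScrollPage("scroll-" + generation, hits);
    }
}

final class ScrollLoop {
    static List<String> readAll(ScrollClient client, int size) {
        List<String> all = new ArrayList<>();
        ScrollPage page = client.start(size);   // the initial search already returns the first batch
        String scrollId = page.scrollId;
        try {
            while (!page.hits.isEmpty()) {
                all.addAll(page.hits);
                page = client.next(scrollId);
                scrollId = page.scrollId;        // always carry the most recent id forward
            }
        } finally {
            client.clear(scrollId);              // release the context instead of waiting for the timeout
        }
        return all;
    }
}
```

Clearing in a finally block means the context is released even if reading fails halfway, which matters because open contexts are the main cost of scroll.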
3. search_after
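search_after performs well and is mainly used for querying live data. Its drawbacks are that it is more complex to implement, it needs a globally unique field as a sort tiebreaker, and when paging continuously each query depends on the result of the previous one. It suits paging through very large data sets.
search_after pagination relies on sort. The query steps are as follows:
- The first page does not need search_after; an ordinary query + sort + size request is enough.
- The second page needs query + sort + size + search_after, where search_after is set to the sort values of the last document on the previous page.
- Repeat the second step until the number of hits is 0, which means every page has been fetched.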
GET /_search
{
  "size": 10000,
  "query": {
    "match": {
      "user.id": "elkbee"
    }
  },
  "sort": [
    {"date": "asc"},
    {"id_copy": "asc"}
  ]
}
GET /_search
{
  "size": 10000,
  "query": {
    "match": {
      "user.id": "elkbee"
    }
  },
  "search_after": [1463538857, "654323"],
  "sort": [
    {"date": "asc"},
    {"id_copy": "asc"}
  ]
}
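The three steps can be sketched as a cursor loop. The example below runs against an in-memory list instead of a cluster, so search() is only a stand-in for one search request, and Doc and SearchAfterDemo are hypothetical names. It uses the same ordering as the requests above (date, then id_copy as the tiebreaker).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical document carrying the two sort fields used in the requests above.
final class Doc {
    final long date;
    final String idCopy;
    Doc(long date, String idCopy) { this.date = date; this.idCopy = idCopy; }
}

final class SearchAfterDemo {
    // Same ordering as "sort": [{"date": "asc"}, {"id_copy": "asc"}].
    static final Comparator<Doc> SORT =
            Comparator.comparingLong((Doc d) -> d.date).thenComparing(d -> d.idCopy);

    // Stand-in for one search request: up to `size` docs strictly after `after` (null means first page).
    static List<Doc> search(List<Doc> index, Doc after, int size) {
        List<Doc> sorted = new ArrayList<>(index);
        sorted.sort(SORT);
        List<Doc> page = new ArrayList<>();
        for (Doc d : sorted) {
            if (after != null && SORT.compare(d, after) <= 0) continue; // skip everything up to the cursor
            page.add(d);
            if (page.size() == size) break;
        }
        return page;
    }

    // First page without a cursor, then repeat with the last hit's sort values until no hits remain.
    static List<Doc> readAll(List<Doc> index, int size) {
        List<Doc> all = new ArrayList<>();
        Doc after = null;
        while (true) {
            List<Doc> page = search(index, after, size);
            if (page.isEmpty()) break;            // zero hits: every page has been read
            all.addAll(page);
            after = page.get(page.size() - 1);    // sort values of the last hit become search_after
        }
        return all;
    }
}
```

Without the unique tiebreaker, two documents with the same date could straddle a page boundary and one of them would be skipped, because the cursor comparison is strict.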
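Things to note when using search_after:
- By default search_after does not guarantee a consistent view across pages. If a refresh happens while paging (for example after documents are indexed or deleted), results can be inconsistent between pages, unlike scroll, which always reads from a snapshot. This is similar to phantom reads in relational databases.
- The sort must include a globally unique field. Elasticsearch advises against sorting on _id because doc values are disabled for it, so sorting on it would require loading all of its values into memory. The suggestion is to add a new field that holds a copy of _id with doc values enabled, and to use that as the sort field.
- from must be set to 0 or -1.
- search_after cannot jump to an arbitrary page; pages can only be read in sequence.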
II. Returning large result sets with scroll in Spring Boot
Elasticsearch has a scroll API for getting big result sets in chunks. This is internally used by Spring Data Elasticsearch to provide the implementation of the <T> SearchHitsIterator<T> SearchOperations.searchForStream(Query query, Class<T> clazz, IndexCoordinates index) method.
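// Assumes Spring Data Elasticsearch 4.x; matchAllQuery() is a static import from QueryBuilders.
IndexCoordinates index = IndexCoordinates.of("sample-index");
Query searchQuery = new NativeSearchQueryBuilder()
    .withQuery(matchAllQuery())
    .withFields("message")
    .withPageable(PageRequest.of(0, 10))
    .build();
List<SampleEntity> sampleEntities = new ArrayList<>();
// try-with-resources closes the iterator even if reading fails part way through
try (SearchHitsIterator<SampleEntity> stream =
        elasticsearchTemplate.searchForStream(searchQuery, SampleEntity.class, index)) {
    while (stream.hasNext()) {
        // each element is a SearchHit<SampleEntity>; getContent() unwraps the entity
        sampleEntities.add(stream.next().getContent());
    }
}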
The SearchOperations API has no methods for accessing the scroll id. If you need it, the following methods of the ElasticsearchRestTemplate can be used:
@Autowired ElasticsearchRestTemplate template;
IndexCoordinates index = IndexCoordinates.of("sample-index");
Query searchQuery = new NativeSearchQueryBuilder()
    .withQuery(matchAllQuery())
    .withFields("message")
    .withPageable(PageRequest.of(0, 10))
    .build();
// 1000 is the scroll keep-alive in milliseconds
SearchScrollHits<SampleEntity> scroll = template.searchScrollStart(1000, searchQuery, SampleEntity.class, index);
String scrollId = scroll.getScrollId();
List<SampleEntity> sampleEntities = new ArrayList<>();
while (scroll.hasSearchHits()) {
    // getSearchHits() returns SearchHit<SampleEntity> wrappers, so unwrap each one
    scroll.getSearchHits().forEach(hit -> sampleEntities.add(hit.getContent()));
    scrollId = scroll.getScrollId();
    // the index argument is assumed from the 4.x signature; check it against your version
    scroll = template.searchScrollContinue(scrollId, 1000, SampleEntity.class, index);
}
template.searchScrollClear(scrollId);
To use the Scroll API with repository methods, the return type must be defined as Stream in the Elasticsearch Repository. The implementation of the method will then use the scroll methods from the ElasticsearchTemplate.
interface SampleEntityRepository extends Repository<SampleEntity, String> {
    Stream<SampleEntity> findBy();
}