[Big Data] (1) Query operations with elasticsearch-dsl

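I: Connection configuration

1: Basic connection configuration (the connection is registered under the default alias, "default"):

from elasticsearch_dsl import connections

connections.create_connection(hosts=['xxx'], timeout=20)

2: Giving a connection an alias: if you register a connection under an alias, you must pass that alias when creating a Search object; otherwise the Search object uses the "default" connection.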
```
from elasticsearch_dsl import connections

connections.create_connection(
    alias="my_connection",
    hosts=['xxx'],
    timeout=20)
```
3: Authentication:
- Some production environments require authentication. In that case the credentials can be embedded in the host URL:
```
hosts = ["http://username:password@10.73.18.240:9200"]
```
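
If the username or password contains characters that are reserved in URLs (such as @, / or :), they need to be percent-encoded before being embedded in the host string. A minimal sketch using only the standard library; the helper name build_host_url and the sample password are hypothetical and are not part of elasticsearch-dsl:

```python
from urllib.parse import quote


def build_host_url(username, password, host, port, scheme="http"):
    """Build a host URL with percent-encoded credentials (hypothetical helper)."""
    # safe="" makes quote() also encode "/", which it leaves alone by default
    user = quote(username, safe="")
    pwd = quote(password, safe="")
    return f"{scheme}://{user}:{pwd}@{host}:{port}"


# The password below is made up for illustration
hosts = [build_host_url("username", "p@ss/word", "10.73.18.240", 9200)]
print(hosts)
```

The resulting list could then be passed as hosts=hosts to connections.create_connection().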

II: Query operations

1: Simple queries:

1: Instantiating a Search object:
- s = Search(using="connection alias", index="index name")

from elasticsearch_dsl import connections, Search

connections.create_connection(
    alias="my_connection",
    hosts=['xxx'],
    timeout=20)
s = Search(using="my_connection", index="xxxxx")

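2: Running a simple query:
- search.query("query type", field_name=field_value).execute()
- The result is a Response object, which behaves like a list.
- Iterating over it yields Hit objects, which behave like dictionaries; values can be read with [].
- hit.to_dict() converts a Hit into a plain dictionary.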

from elasticsearch_dsl import connections, Search

# Create the connection (hosts, timeout and alias are defined elsewhere)
connections.create_connection(hosts=hosts, timeout=timeout, alias=alias)
# Create the Search object, specifying the connection and the index
search = Search(using=alias, index="xxxxxx")
# Full-text match query
response = search.query("match", lc_no="LC20072413391922").execute()
# The Response object is list-like; each item is a dict-like Hit object
for res in response:
    print(res['status'], res['request_time'])
    # 200 2021-05-04T18:23:57.594285+0800
# A Hit can be converted to a dictionary with to_dict()
for res in response:
    my_dict = res.to_dict()
    print(my_dict)

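3: Deleting by query:
- Call search.delete() on a Search object that carries the query.
- (Not demonstrated here, because these examples run against a production environment.)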

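2: Common queries:

1: Match: full-text matching

If the field is analyzed (for example with the IK Chinese analyzer), the query text is tokenized and every document containing a matching token is returned.
If the field is not analyzed, the query is equivalent to an exact match.
query() accepts a Match object directly.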

from elasticsearch_dsl import connections, Search
from elasticsearch_dsl.query import Match

# Create the connection
connections.create_connection(hosts=hosts, timeout=timeout, alias=alias)
# Create the Search object, specifying the connection and the index
search = Search(using=alias, index="xxxxx")
# query() returns a new Search object; iterating over it executes the query
response = search.query(Match(request_id={"query": "a088d751"}))
for res in response:
    print(res.to_dict().get('request_id'))
    # a088d751

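2: MultiMatch: full-text matching across multiple fields.

from elasticsearch_dsl import connections, Search
from elasticsearch_dsl.query import MultiMatch

# Create the connection
connections.create_connection(hosts=hosts, timeout=timeout, alias=alias)
# Create the Search object, specifying the connection and the index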
search = Search(using=alias, index="xxxxxxx")
response = search.query(MultiMatch(query="a088d751", fields=['request_id', ]))
for res in response:
    print(res.to_dict().get('request_id'))
    # a088d751

3: Using Q objects:

A Q object can also be passed directly to query():
from elasticsearch_dsl.query import Q
A Q object can wrap a raw DSL dictionary, or it can be built from keyword arguments.

from elasticsearch_dsl import connections, Search
from elasticsearch_dsl.query import Q

# Create the connection
connections.create_connection(hosts=hosts, timeout=timeout, alias=alias)
# Create the Search object, specifying the connection and the index
search = Search(using=alias, index="xxxxxxx")

response = search.query(Q("multi_match", query="a088d751", fields=['request_id', ]))
for res in response:
    print(res.to_dict().get('request_id'))
    # a088d751

response2 = search.query(Q({"multi_match": {"query": "a088d751", "fields": ["request_id", ]}}))
for res in response2:
    print(res.to_dict().get('request_id'))
    # a088d751

query() also accepts the contents of a Q object directly, so the Q wrapper can be omitted.

response = search.query("multi_match", query="a088d751", fields=['request_id', ])
for res in response:
    print(res.to_dict().get('request_id'))
    # a088d751

response2 = search.query({"multi_match": {"query": "a088d751", "fields": ["request_id", ]}})
for res in response2:
    print(res.to_dict().get('request_id'))
    # a088d751

Q objects can also be nested inside one another:

response2 = search.query(Q('bool', must=[Q('match', request_id='a088d751'), Q('match', lc_no='LC20072413391922')]))
for res in response2:
    print(res.to_dict().get('request_id'))
    print(res.to_dict().get('lc_no'))
    # a088d751
    # LC20072413391922

3: Combining queries:

1: Combining Q objects with operators:

Operators: | (or), & (and), ~ (not)

response2 = search.query(Q("match", request_id='a088d751') | Q("match", request_id="4f8e648a"))
for res in response2:
    print(res.to_dict().get('request_id'))
    # 4f8e648a
    # a088d751
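
In the Elasticsearch query DSL, these operators correspond to clauses of a bool query: | to should, & to must, and ~ to must_not. The sketch below builds those raw query bodies with plain dictionaries so that the mapping is visible. The helper names match, or_, and_ and not_ are hypothetical; they are not elasticsearch-dsl functions, and the exact dictionary the library emits may differ in detail.

```python
def match(field, value):
    """Raw body of a match query on a single field."""
    return {"match": {field: value}}


def or_(*queries):
    # Q(...) | Q(...): at least one "should" clause has to match
    return {"bool": {"should": list(queries)}}


def and_(*queries):
    # Q(...) & Q(...): every "must" clause has to match
    return {"bool": {"must": list(queries)}}


def not_(query):
    # ~Q(...): the "must_not" clause must not match
    return {"bool": {"must_not": [query]}}


either = or_(match("request_id", "a088d751"), match("request_id", "4f8e648a"))
print(either)
```

A dictionary of this shape can also be handed to Q(...) or query(...) directly, as shown earlier for multi_match.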

2: Chaining query() calls:

This is equivalent to combining Q objects with &.

response2 = search.query("match", request_id="4f8e648a").query("match", lc_no="LC21081909540613")
for res in response2:
    print(res.to_dict().get('request_id'))
    # 4f8e648a

3: Building complex queries with Q objects:

q = Q('bool',
    must=[Q('match', title='python')],
    should=[Q(...), Q(...)],
    minimum_should_match=1
)
s = Search().query(q)

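III: Filtering

Case 1: find all records for device serial QbJK/5PqQ between 00:00 on 25 August 2021 and 00:00 on 26 August 2021. (The result is 30 records.)

Approach: query by serial number first, then filter by date.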

response = search.query("term", serial="QbJK/5PqQ").filter("range",
                                                           request_time={"gte": "2021-08-25T00:00:00+0800",
                                                                         "lt": "2021-08-26T00:00:00+0800"}
                                                           ).params(size=10000)
for hit in response:
    print(hit.to_dict().get('serial'), hit.to_dict().get('request_time'))
    # QbJK/5PqQ 2021-08-25T22:04:16+0800
    # QbJK/5PqQ 2021-08-25T23:03:47+0800 ... (30 records in total)

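Case 2: count the records for device serials QtAVdFXBb and QbJK/zLIw between 00:00 on 25 August 2021 and 00:00 on 26 August 2021. (With a closed interval the count is 385; with a half-open interval, closed on the left, it is 383.)

Note: when count() is used, params() can no longer be used to limit the number of results.
For exact matching, use term for a single value and terms for multiple values.

# Create the connection
connections.create_connection(hosts=settings.ES_HOST, timeout=settings.ES_TIME_OUT, alias=settings.ES_ALIAS)
# Create the Search object, specifying the connection and the index
search = Search(using=settings.ES_ALIAS, index="xxxxxxxx")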

counts = search.query("terms", serial=["QtAVdFXBb", "QbJK/zLIw"]).filter("range",
                                                                         request_time={
                                                                             "gte": "2021-08-25T00:00:00+0800",
                                                                             "lte": "2021-08-26T00:00:00+0800"}
                                                                         ).count()
print(counts)
# 385
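
The two counts differ because a closed interval (gte/lte) also includes records stamped exactly at the end boundary, while a half-open interval (gte/lt) does not. When consecutive days are queried, the half-open form keeps a boundary record from being counted twice. A small sketch that produces such bounds for one calendar day, assuming the +0800 offset and timestamp format used in the examples above; day_range is a hypothetical helper:

```python
from datetime import date, datetime, timedelta, timezone

CST = timezone(timedelta(hours=8))  # fixed +0800 offset, as in the examples
FMT = "%Y-%m-%dT%H:%M:%S%z"


def day_range(day, tz=CST):
    """Return half-open [gte, lt) bounds covering one calendar day."""
    start = datetime(day.year, day.month, day.day, tzinfo=tz)
    end = start + timedelta(days=1)
    return {"gte": start.strftime(FMT), "lt": end.strftime(FMT)}


bounds = day_range(date(2021, 8, 25))
print(bounds)
```

The resulting dictionary could be passed as request_time=bounds to filter("range", ...).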

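IV: Aggregations:

1: Defining an aggregation:
- A("aggregation type", field="field name") creates an aggregation object.
2: Nesting aggregations:
- bucket(): groups the documents that satisfy a given condition into buckets.
- metric(): computes a statistic over the documents in a bucket (for example min, sum, or max).
- pipeline(): a pipeline aggregation, which operates on the output of other aggregations.
3: Adding an aggregation to a Search object:
- .aggs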

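4.1: Aggregation examples:

Test data format:

Each document describes one car sale and has four properties: price, color, make, and sale date.

| field | type    | notes                |
|-------|---------|----------------------|
| price | integer |                      |
| color | text    | fielddata            |
| make  | text    | fielddata            |
| sold  | date    | format: "yyyy-MM-dd" |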

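1: Find which car color sells best

Approach: aggregate by the color field.

from elasticsearch_dsl import A

# 1: Define the aggregation: a terms aggregation on the color field
a = A("terms", field="color")
# 2: Attach the aggregation to the Search object under a name
search.aggs.bucket("popular_color", a)
# 3: Execute the search to run the aggregation
search.execute()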

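2: Average price for each color:

Approach: aggregate by color, then average the price within each color bucket.

# 1: Define the aggregations
a1 = A("terms", field="color")
a2 = A("avg", field="price")
# 2: Attach them to the Search object
search.aggs.bucket("colors", a1).metric("avg_price", a2)
# 3: Execute
search.execute()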

3: Average price for each make within each color:

Approach: aggregate by color, then by make, then average the price for each make.

# 1: Aggregate by color
search.aggs.bucket("colors", "terms", field="color")
# 2: Within each color, aggregate by make
search.aggs['colors'].bucket("make", "terms", field="make")
# 3: Average the price for each make
search.aggs['colors'].aggs["make"].metric("avg_price", "avg", field="price")
# 4: Execute
search.execute()

4: Highest and lowest price for each make within each color:

search.aggs.bucket("colors", "terms", field="color")
search.aggs["colors"].bucket("make", "terms", field="make")
search.aggs["colors"].aggs["make"].metric("min_price", "min", field="price")
search.aggs["colors"].aggs["make"].metric("max_price", "max", field="price")
search.execute()
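
The examples above stop at execute() and do not show how to read the result. In the raw JSON response, a terms aggregation returns a list of buckets, and each sub-aggregation is nested inside its bucket under the name it was given. The sketch below walks such a structure for the min/max example. The sample data is made up, and flatten_price_ranges is a hypothetical helper that works on a plain dictionary shaped like the aggregations part of a response.

```python
def flatten_price_ranges(aggregations):
    """Yield (color, make, min_price, max_price) rows from nested buckets."""
    for color_bucket in aggregations["colors"]["buckets"]:
        for make_bucket in color_bucket["make"]["buckets"]:
            yield (
                color_bucket["key"],
                make_bucket["key"],
                make_bucket["min_price"]["value"],
                make_bucket["max_price"]["value"],
            )


# Made-up sample shaped like an Elasticsearch aggregation result
sample = {
    "colors": {"buckets": [
        {"key": "red", "doc_count": 2, "make": {"buckets": [
            {"key": "honda", "doc_count": 2,
             "min_price": {"value": 10000.0},
             "max_price": {"value": 20000.0}},
        ]}},
    ]}
}

rows = list(flatten_price_ranges(sample))
print(rows)
```

The names colors, make, min_price and max_price are the ones chosen when the aggregations were attached, so they must match whatever names were passed to bucket() and metric().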

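V: Sorting:

1: Passing a field name to sort() sorts by that field in ascending order.
2: Prefixing the field name with a minus sign sorts by that field in descending order.
3: Note: when printing the results to check the order, the author found it helpful to pause briefly with time.sleep between prints; without it the terminal output still appeared out of order.

4: Example: find all records for device serial QbJK/5PqQ between 00:00 on 25 August 2021 and 00:00 on 26 August 2021, sorted by time in ascending order:

import time

response = search.query("range",
                        request_time={"gte": "2021-08-25T00:00:00+0800",
                                      "lt": "2021-08-26T00:00:00+0800"}
                        ).query("term", serial="QbJK/5PqQ").params(size=10000).sort("request_time")

for hit in response:
    data = hit.to_dict()
    print(data.get("serial"), data.get('request_time'))
    time.sleep(1)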

5: Example: find all records for device serial QbJK/5PqQ between 00:00 on 25 August 2021 and 00:00 on 26 August 2021, sorted by time in descending order:

response = search.query("range",
                        request_time={"gte": "2021-08-25T00:00:00+0800",
                                      "lt": "2021-08-26T00:00:00+0800"}
                        ).query("term", serial="QbJK/5PqQ").params(size=10000).sort("-request_time")

for hit in response:
    data = hit.to_dict()
    print(data.get("serial"), data.get('request_time'))
    time.sleep(1)
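
In the raw search request, the leading minus sign corresponds to a descending order on that field. A sketch of that mapping with plain dictionaries; to_sort_clause is a hypothetical helper written for illustration, not the library's own implementation:

```python
def to_sort_clause(key):
    """Translate "field" or "-field" into a raw Elasticsearch sort entry."""
    if key.startswith("-"):
        # Strip the minus sign and request descending order
        return {key[1:]: {"order": "desc"}}
    return {key: {"order": "asc"}}


print(to_sort_clause("request_time"))
print(to_sort_clause("-request_time"))
```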

VI: Pagination:

1: Using slicing directly:

connections.create_connection(hosts=settings.ES_HOST, timeout=settings.ES_TIME_OUT, alias=settings.ES_ALIAS)
# Create the Search object, specifying the connection and the index
search = Search(using=settings.ES_ALIAS, index="xxxxxx")
response = search.query()[10:11]
for hit in response:
    print(hit.to_dict().get("product_id"))
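
A slice [start:stop] corresponds to the from and size parameters of the search request: from is start, and size is stop - start. The slice [10:11] therefore returns a single hit, the eleventh. A sketch of the arithmetic for page-based access; page_to_slice and slice_to_from_size are hypothetical helpers:

```python
def page_to_slice(page, per_page):
    """Convert a 1-based page number into a slice for Search[...]."""
    if page < 1 or per_page < 1:
        raise ValueError("page and per_page must be >= 1")
    start = (page - 1) * per_page
    return slice(start, start + per_page)


def slice_to_from_size(s):
    """Return the from/size values that a slice corresponds to."""
    return {"from": s.start, "size": s.stop - s.start}


print(slice_to_from_size(slice(10, 11)))
print(slice_to_from_size(page_to_slice(3, 20)))
```

With these helpers, search[page_to_slice(3, 20)] would request the third page of 20 hits. Note that Elasticsearch limits from + size through the index.max_result_window setting (10,000 by default), so deep pages cannot be reached this way.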

2: To retrieve all matching documents, use scan():

# Create the connection
connections.create_connection(hosts=settings.ES_HOST, timeout=settings.ES_TIME_OUT, alias=settings.ES_ALIAS)
# Create the Search object, specifying the connection and the index
search = Search(using=settings.ES_ALIAS, index="xxxxxx")
response = search.query()
for hit in response.scan():
    print(hit.to_dict().get("product_id"))