说明
本文章其实是在调研测试分区键,排序健,主键过程中的一些总结和记录; 有助于了解到分区键排序健主键的区别,以及官方文档没有提及的东西 关于CH的一下基本知识的介绍,可以看结尾官方文档介绍的比较好;
问题
分区键 year,month,查询month的时候需不需要year 有没有办法达到映射,year,month,day,我查询day的时候,不需要带year,month条件
表结构
create table order_detail_4 on cluster default_cluster
(
customer_id Int32,
detail_id String,
year Int16,
month Int32,
day Int32,
sign Int8,
version Int64
)
engine = ReplicatedVersionedCollapsingMergeTree('/clickhouse/tables/{shard}/1/order_detail_4_2', '{replica}',
sign, version)
PARTITION BY (year, month, intDiv(day, 100))
ORDER BY (customer_id, day, detail_id)
SETTINGS index_granularity = 8192;
分区键测试
数据大小
count:3200240 parts:21 测试前之前强制合并
分区键顺序有影响吗?
A
PARTITION BY (year, month)
ORDER BY (customer_id, year, month, day, detail_id)
B
PARTITION BY (month, year)
ORDER BY (customer_id, day, detail_id)
select count(*) from sales_db.order_detail_1 where customer_id=10433 and year =2022;
A:Selected 9/21 parts by partition key, 9 parts by primary key, 175/184 marks by primary key, 175 marks to read from 9 ranges
B:Selected 9/21 parts by partition key, 9 parts by primary key, 173/182 marks by primary key, 173 marks to read from 9 ranges
select count(*) from sales_db.order_detail_1 where customer_id=10433 and month =202201;
A:Selected 1/21 parts by partition key, 1 parts by primary key, 18/19 marks by primary key, 18 marks to read from 1 ranges
B:Selected 1/21 parts by partition key, 1 parts by primary key, 18/19 marks by primary key, 18 marks to read from 1 ranges
可以看出分区键顺序无影响
再试一下排序健不修改
A:
PARTITION BY (year, month)
ORDER BY (customer_id, day, detail_id)
B:
PARTITION BY (month)
ORDER BY (customer_id, day, detail_id)
select count(*) from sales_db.order_detail where customer_id=10433 and year =2022;
A:
Selected 9/21 parts by partition key, 9 parts by primary key, 176/185 marks by primary key, 176 marks to read from 9 ranges
B:
Selected 21/21 parts by partition key, 21 parts by primary key, 406/427 marks by primary key, 406 marks to read from 21 ranges
select count(*) from sales_db.order_detail where customer_id=10433 and month =202201;
A:
Selected 1/21 parts by partition key, 1 parts by primary key, 18/19 marks by primary key, 18 marks to read from 1 ranges
B:
Selected 1/21 parts by partition key, 1 parts by primary key, 18/19 marks by primary key, 18 marks to read from 1 ranges
交换复合分区键位置
A:
PARTITION BY (year, month)
ORDER BY (customer_id, day, detail_id)
B:
PARTITION BY (month, year)
ORDER BY (customer_id, day, detail_id)
select count(*) from sales_db.order_detail where customer_id=10433 and year =2022;
A:
Selected 9/21 parts by partition key, 9 parts by primary key, 176/185 marks by primary key, 176 marks to read from 9 ranges
B:
Selected 9/21 parts by partition key, 9 parts by primary key, 175/184 marks by primary key, 175 marks to read from 9 ranges
select count(*) from sales_db.order_detail where customer_id=10433 and month =202201;
A:
Selected 1/21 parts by partition key, 1 parts by primary key, 18/19 marks by primary key, 18 marks to read from 1 ranges
B:
Selected 1/21 parts by partition key, 1 parts by primary key, 18/19 marks by primary key, 18 marks to read from 1 ranges
测试天分区映射能否实现
A:
PARTITION BY (year, month)
ORDER BY (customer_id, day, detail_id)
B:
PARTITION BY (year, month, intDiv(day, 100))
ORDER BY (customer_id, day, detail_id)
select count(*) from sales_db.order_detail_3 where customer_id=10433 and day=20220101;
A:
Selected 21/21 parts by partition key, 1 parts by primary key, 2/427 marks by primary key, 2 marks to read from 1 ranges
B:
Selected 1/21 parts by partition key, 1 parts by primary key, 2/19 marks by primary key, 2 marks to read from 1 ranges
结论
CH并不需要左匹配逻辑,列入year,month,查询month=202201的时候并不需要带上year=2022这样的条件; 并且CK可以支持花里胡哨的映射(day)
官方文档并没有说明这一点,也就是不需要左匹配;
但是我找到了相关大佬的回答,虽然大佬说过可能已经有不正确的部分了,可以参考;https://stackoverflow.com/questions/60142967/how-to-understand-part-and-partition-of-clickhouse/60155974#60155974 翻译如下: Select 不关心分区。 Select 不知道分区键。 因为每个部分都有特殊文件 minmax_{PARTITIONING_KEY_COLUMN}.idx 这些文件包含这部分中这些列的最小值和最大值。此外,这个 minmax_ 值存储在内存中的(c++ 向量)部件列表中。
create table X (A Int64, B Date, K Int64,C String)
Engine=MergeTree partition by (A, toYYYYMM(B)) order by K;
insert into X values (1, today(), 1, '1');
cd /var/lib/clickhouse/data/default/X/1-202002_1_1_0/
ls -1 *.idx
minmax_A.idx <-----
minmax_B.idx <-----
primary.idx
SET send_logs_level = 'debug';
select * from X where A = 555;
(SelectExecutor): MinMax index condition: (column 0 in [555, 555])
(SelectExecutor): Selected 0 parts by date
SelectExecutor 检查内存中的零件列表并找到 0 个零件,因为 minmax_A.idx = (1,1) 并且此选择需要 (555, 555)。 CH 不存储分区键值。 例如 toYYYYMM(today()) = 202002 但这个 202002 没有存储在零件或任何地方。 minmax_B.idx 商店 (18302, 18302) (2020-02-10 == select toInt16(today()))
看他这个回答,就像是每个分区键会对每个part建立minmax索引,也就不需要左匹配这种东西了; 为了验证一下,本地使用docker搭建了一个;
PARTITION BY (year, month, intDiv(day, 100))
ORDER BY (customer_id, day, detail_id)
可以看见,3个key,每个分区part下面都会生成3个索引文件,索引值也如图,会生成最小最大,来过滤数据; ps:索引查看需要使用od命令,https://www.wangfenjin.com/posts/clickhouse-od/
后面大佬又回复我了
翻译结果:
使用PARTITION BY intDiv(day,100)每个部分将有day列的最小值和最大值。
如果你的分区表达式有多个列分区BY (account_id,intDiv(day,100),substr(ccc,10))每个部分将有3对min_max每列。没有前缀,这些min - max是独立的。partition_by中的列的顺序并不重要。每个min_max可以单独检查,也可以一起检查。
例如PARTITION BY (account_id,toYYYYMM(tm)),该表有两个部分
Part1:account_id_min 10,account_id_max 10,tm_min 22-01-01 00:00:00,tm_ma×= 22-01-1512:43:21
Part2:account_id_min 11,account_id_max 11,tm_min 22-01-01 00:00:00,tmma×= 22-01-3123:59:59
查询select···from…其中account_id=lggg分区修剪将消除所有部分,因为没有匹配account_min_max的部分
当您查询select··from.·where tm>='2g22-g1-17 00:00:00'时,分区修剪将消除第一部分,因为只有第二部分具有匹配的tm_min_max等等。
它稍微复杂一些,因为ClickHouse能够在修剪过程中使用函数。因此,toYYYYMM(tm)>=202201也可以检查,因为可以将toYYYYMM应用于tm_min_max并获得每个部分的函数结果。
也可以检查更复杂的函数,例如'toYYYYMMDD(tm+100)-10 > 20211010',因为可以对tm_min_max应用toYYYYMMDD(tm+100)-10,并获得每个部分的函数结果。
与我所测试的结论一致,分区键无顺序要求。分区键还支持复杂的函数,目前没用到,有兴趣可以测试
##展开
分区介绍
PARTITION BY - 声明如何将表中的数据拆分为可以单独操作的部分。分区既不能太小也不能太大。通常是几个月/周/天 - 允许您删除旧分区,或对分区级别进行其他操作。Clickhouse 允许您删除/优化/冻结/分离/附加分区,清除其中的某些列。即分区是一种你可以操纵的元素。 当您使用允许读取较少分区的条件时 - 将使用该条件。即,如果您有 12 个每月分区并且您使用条件 where event_date = '2018-01-01' ,那么只有一个包含该日期的分区将被读取。如果您不对主键提供额外条件 - 将读取整个分区。 在每个分区内,数据根据您的主键进行物理排序,并且上面有一个索引。如果您将提供包含在主键中使用的字段的条件 - clickhouse 将仅读取该分区内包含您提供的主键的某些行范围。 在所有其他情况下,使用全扫描。
官网这样描述:
在大多数情况下,您不需要分区键,而在大多数其他情况下,您不需要比几个月更细的分区键。分区不会加速查询(与 ORDER BY 表达式相反)。 官方说分区不会加速查询,这不很奇怪吗,对我们认知分区就是减少数据量,剪枝用的; 这其实是官方承认的一个误导性描述,但是官方说为了防止新手乱用partition:https://github.com/ClickHouse/ClickHouse/issues/28850 官方是强烈不建议多分区数量的: 分区建议:
- 单个分区的好大小是 1-300Gb。
- 用于求和/替换更小的 (400Mb-40Gb)
- 最好避免使用典型的 SELECT 查询接触更多的几十个分区。
- 单次插入应该将数据带到一个或几个分区。
- 表中的分区数 - 几十个或几百个,而不是几千个。
排序键
排序键官网其实有比较详细的说明,包括row,mark,主键索引,跳数索引,讲的比较详细 https://clickhouse.com/docs/en/guides/improving-query-performance/sparse-primary-indexes/sparse-primary-indexes-intro
建议使用分区键
PARTITION BY (year, month, intDiv(day, 100))
primary key (customer_id, day)
ORDER BY (customer_id, day, detail_id)
分区键可以不需要左侧前提,只要有的分区键都可以用于剪枝; 主键默认不设置就是排序键,但是不建议,我们只需要讲常用来查询,低基数的列放到主键,来加速查询, orderby仅仅是为了合并数据; 这样可以减少主键索引空间,并且索引是常驻内存的,减少字段是有必要的 ps:内存不够索引,CH会直接报错,拒绝服务
|