[大数据] postgreSQL数据倾斜时全表vs索引

开发: C++知识库 Java知识库 JavaScript Python PHP知识库人工智能区块链大数据移动开发嵌入式开发工具数据结构与算法开发测试游戏开发网络协议系统运维
教程: HTML教程 CSS教程 JavaScript教程 Go语言教程 JQuery教程 VUE教程 VUE3教程 Bootstrap教程 SQL数据库教程 C语言教程 C++教程 Java教程 Python教程 Python3教程 C#教程
数码: 电脑笔记本显卡显示器固态硬盘硬盘耳机手机 iphone vivo oppo 小米华为单反装机图拉丁

-> 大数据 -> postgreSQL数据倾斜时全表vs索引 -> 正文阅读

[大数据]postgreSQL数据倾斜时全表vs索引

开发反馈一个简单的count语句接口压测超时，查看到该表上原本作为条件的索引被disable了，加上表中数据倾斜严重，导致不论传参在表中分布情况如何都会走全表扫

测试表

taria=# create table skewtest(id int, record varchar(30));
CREATE TABLE
taria=# create index idx_record on  skewtest(record);
CREATE INDEX
taria=# \d+ skewtest
                                         Table "public.skewtest"
 Column |         Type          | Collation | Nullable | Default | Storage  | Stats target | Description 
--------+-----------------------+-----------+----------+---------+----------+--------------+-------------
 id     | integer               |           |          |         | plain    |              | 
 record | character varying(50) |           |          |         | extended |              | 
Indexes:
    "idx_record" btree (record)
Access method: heap

造数据

ALTER TABLE skewtest ALTER COLUMN record TYPE varchar(50);

taria=# insert into skewtest(id,record) select generate_series(0,9) ,md5(random()::text) from generate_series(1,10); 
INSERT 0 100
taria=# select * from skewtest limit 10 ;
 id |              record              
----+----------------------------------
  0 | 515e2288a1f4472c24b1a5374c9ddc9a
  1 | 4c4d0a0c541f4fc2ff017e621a0c291b
  2 | a74a4a39a08649578319cc7066e404b5
  3 | 68bd3b4c42b2143d88204e663022021c
  4 | 1d81e8daf0f0e6c605a1e80aa271a41f
  5 | f047ab023c79015541e4a4c5a35f915b
  6 | 3878cf3b8774689868e04dbbe269e7c3
  7 | 5635dc3b9306943b1711c20492ee132d
  8 | 575677864bb988f1d3520ddc9d65348c
  9 | c651c43aacf7fbece4be09bb733bf3c1
(10 rows)

taria=# insert into  skewtest2(id,record) select generate_series(0,9),'f047ab023c79015541e4a4c5a35f915b' from generate_series(0,3350000);
INSERT 0 3350010
taria=# select count(*) from skewtest;
  count  
---------
 3350110
(1 row)

普通数据走索引
 Aggregate  (cost=4.58..4.58 rows=1 width=8) (actual time=0.148..0.148 rows=1 loops=1)
   ->  Index Only Scan using idx_record on skewtest  (cost=0.56..4.57 rows=1 width=0) (actual time=0.139..0.141 rows=1 loops=1)
         Index Cond: (record = '3878cf3b8774689868e04dbbe269e7c3'::text)
         Heap Fetches: 0
 Planning Time: 0.096 ms
 Execution Time: 0.182 ms
(6 rows)

 update pg_index set indisvalid = true where indexrelid ='idx_record':: regclass ;

强制普通数据走全表
 Aggregate  (cost=46366.59..46366.60 rows=1 width=8) (actual time=220.071..220.071 rows=1 loops=1)
   ->  Gather  (cost=1000.00..46366.59 rows=1 width=0) (actual time=0.189..223.421 rows=1 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Parallel Seq Scan on skewtest  (cost=0.00..45366.49 rows=1 width=0) (actual time=142.571..215.831 rows=0 loops=3)
               Filter: ((record)::text = '3878cf3b8774689868e04dbbe269e7c3'::text)
               Rows Removed by Filter: 1116703
 Planning Time: 0.077 ms
 Execution Time: 223.464 ms

倾斜数据走全表
 explain analyze select count(*) from skewtest where record ='f047ab023c79015541e4a4c5a35f915b';
 Finalize Aggregate  (cost=49856.40..49856.41 rows=1 width=8) (actual time=402.832..402.832 rows=1 loops=1)
   ->  Gather  (cost=49856.19..49856.40 rows=2 width=8) (actual time=402.714..405.983 rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial Aggregate  (cost=48856.19..48856.20 rows=1 width=8) (actual time=398.518..398.518 rows=1 loops=3)
               ->  Parallel Seq Scan on skewtest  (cost=0.00..45366.49 rows=1395879 width=0) (actual time=0.062..284.255 rows=1116670 loops=3)
                     Filter: ((record)::text = 'f047ab023c79015541e4a4c5a35f915b'::text)
                     Rows Removed by Filter: 33
 Planning Time: 0.103 ms
 Execution Time: 406.033 ms
(10 rows)

taria=# set enable_seqscan TO off;
SET
这里不是不走全表扫，是走了个不带并发聚合的全表 更慢了

倾斜数据强制走索引
  Aggregate  (cost=10000078169.65..10000078169.66 rows=1 width=8) (actual time=1160.122..1160.122 rows=1 loops=1)
   ->  Seq Scan on skewtest  (cost=10000000000.00..10000069794.38 rows=3350110 width=0) (actual time=0.049..820.449 rows=3350011 loops=1)
         Filter: ((record)::text = 'f047ab023c79015541e4a4c5a35f915b'::text)
         Rows Removed by Filter: 99
 Planning Time: 0.089 ms
 Execution Time: 1160.159 ms
(6 rows)
 
 差异在Partial Aggregate 因为count带来的并行聚合，可以用set max_parallel_workers控制并发

为了测试安个pg_hint_plan

大数据量3000w+
explain analyze select count(*) from skewtest where record ='f047ab023c79015541e4a4c5a35f915b';                             
---------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=489543.90..489543.91 rows=1 width=8) (actual time=2038.704..2039.813 rows=1 loops=1)
   ->  Gather  (cost=489543.69..489543.90 rows=2 width=8) (actual time=2038.634..2039.807 rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial Aggregate  (cost=488543.69..488543.70 rows=1 width=8) (actual time=2034.316..2034.316 rows=1 loops=3)
               ->  Parallel Seq Scan on skewtest  (cost=0.00..453647.74 rows=13958379 width=0) (actual time=0.056..1412.934 rows=11166670 loops=3)
                     Filter: ((record)::text = 'f047ab023c79015541e4a4c5a35f915b'::text)
                     Rows Removed by Filter: 33
 Planning Time: 0.314 ms
 Execution Time: 2039.909 ms
(10 rows)

explain analyze /*+ IndexScan(skewtest) */select count(*) from skewtest where record ='f047ab023c79015541e4a4c5a35f915b';
                                                                             QUERY PLAN                                                                              
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=825342.34..825342.35 rows=1 width=8) (actual time=2220.391..2227.868 rows=1 loops=1)
   ->  Gather  (cost=825342.13..825342.34 rows=2 width=8) (actual time=2220.386..2227.864 rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial Aggregate  (cost=824342.13..824342.14 rows=1 width=8) (actual time=2215.918..2215.919 rows=1 loops=3)
               ->  Parallel Index Scan using idx_record on skewtest  (cost=0.56..789446.18 rows=13958379 width=0) (actual time=0.129..1593.646 rows=11166670 loops=3)
                     Index Cond: ((record)::text = 'f047ab023c79015541e4a4c5a35f915b'::text)
 Planning Time: 0.097 ms
 Execution Time: 2227.973 ms
(9 rows)

小数据量33w+
taria=# select count(*) from skewtest2;
 count  
--------
 335110
(1 row)
      1 | c2876aa9a0642bbc86b6955731b35a19
      1 | c3a7199d80054e67b798e889b0618af4
      1 | c40cd2b1e0dcb8a6aae65a81e3ccfe58
      1 | c433904380e4ef6075b8cd1e5d8536f8
      1 | c64c7c448294aec005e8d00b4113e96b
      1 | c7fbf598f403f62ca704da0059d4c31a
      1 | c8c6b566ff4ccd52ae523c303b4f50be
      1 | cd28ad7edb887b46d29c576cae60a012
      1 | d0026ad8f3753e5c52e1e1dbb7746b56
      1 | d582ed9a85d618e60251989e91820179
      1 | d76ad7e27cc0fa261554167c4e358fa0
      1 | d7c555e9e731b724de2e7a07d861339c
      1 | dec495dc37e497e92b3ac7b38993737f
      1 | e4101c5bcd989a9b0af539184d477643
      1 | e933bf160e13a94c8f66101e4949b5f1
      1 | e9b75293969bf33e70a0571b9cbcc208
      1 | ecb4306a179958491a9d6b94d22c31ab
      1 | ef5c7f2f06ecd9f9316d20af59ce81f3
 335010 | f047ab023c79015541e4a4c5a35f915b
      1 | f255cf5361158e5847d843ac200d6442
      1 | fe78bb64b36914d41fc92f0b18245bcf

set max_parallel_workers =0;

explain analyze /*+ SeqScan(skewtest2)*/select count(*) from skewtest2 where record ='f047ab023c79015541e4a4c5a35f915b';                 
                                                                 QUERY PLAN                                                                 
--------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=6749.80..6749.81 rows=1 width=8) (actual time=72.599..72.623 rows=1 loops=1)
   ->  Gather  (cost=6749.69..6749.80 rows=1 width=8) (actual time=72.595..72.619 rows=1 loops=1)
         Workers Planned: 1
         Workers Launched: 0
         ->  Partial Aggregate  (cost=5749.69..5749.70 rows=1 width=8) (actual time=72.477..72.478 rows=1 loops=1)
               ->  Parallel Seq Scan on skewtest2  (cost=0.00..5257.04 rows=197058 width=0) (actual time=0.015..49.267 rows=335010 loops=1)
                     Filter: ((record)::text = 'f047ab023c79015541e4a4c5a35f915b'::text)
                     Rows Removed by Filter: 100
 Planning Time: 0.119 ms
 Execution Time: 72.644 ms
(10 rows)



explain analyze /*+ IndexScan(skewtest2) */select count(*) from skewtest2 where record ='f047ab023c79015541e4a4c5a35f915b';
                                                                           QUERY PLAN                                                                           
----------------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=9883.59..9883.60 rows=1 width=8) (actual time=75.539..75.570 rows=1 loops=1)
   ->  Gather  (cost=9883.48..9883.59 rows=1 width=8) (actual time=75.535..75.567 rows=1 loops=1)
         Workers Planned: 1
         Workers Launched: 0
         ->  Partial Aggregate  (cost=8883.48..8883.49 rows=1 width=8) (actual time=75.417..75.417 rows=1 loops=1)
               ->  Parallel Index Scan using idx_record2 on skewtest2  (cost=0.42..8390.83 rows=197058 width=0) (actual time=0.019..51.857 rows=335010 loops=1)
                     Index Cond: ((record)::text = 'f047ab023c79015541e4a4c5a35f915b'::text)
 Planning Time: 0.092 ms
 Execution Time: 75.597 ms