?在使用Prometheus监控的Postgres数据库时, 会发生主从复制延迟产生告警,但实际上数据库正常的问题,我们使用的exporter为https://github.com/prometheus-community/postgres_exporter,告警表达式为:
pg_replication_lag > 300
这个指标的说明如下:
# HELP pg_replication_lag Replication lag behind master in seconds
# TYPE pg_replication_lag gauge
pg_replication_lag{server=""}
但实际上,这个指标表示的是主从之间有多长时间未发生复制,从https://github.com/prometheus-community/postgres_exporter/blob/master/queries.yaml我们可以查到这个指标使用的SQL为:
SELECT
CASE
WHEN NOT pg_is_in_recovery() THEN 0
ELSE GREATEST(0, EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_TIMESTAMP())))
END
AS LAG
官方文档对两个函数的说明如下:
Name | Return Type | Description |
---|
pg_is_in_recovery() | bool | True if recovery is still in progress. | pg_last_xact_replay_TIMESTAMP() | timestamp with time zone | Get time stamp of last transaction replayed during recovery. This is the time at which the commit or abort WAL record for that transaction was generated on the primary. If no transactions have been replayed during recovery, this function returns NULL. Otherwise, if recovery is still in progress this will increase monotonically. If recovery has completed then this value will remain static at the value of the last transaction applied during that recovery. When the server has been started normally without recovery the function returns NULL. |
此SQL的结果为:主库为0,从库为当前时间与最后一次恢复事务时间之差。
所以一个特殊情况是,如主库中无事务提交,那么pg_last_xact_replay_TIMESTAMP() 的值会保持不变,相对应的pg_replication_lag 值则会不断增加,但并不代表主从复制发生故障。
因此,如果想要避免误告警,我们可以在主库创建一张测试表,每分钟更新表中的数据,保持数据库的活跃,这样如果发生告警才真正表示数据发生了严重故障。具体做法如下:
psql (11.7)
Type "help" for help.
postgres=
postgres=
postgres=
postgres=
配置定时任务:
* * * * * time=`date`;/usr/pgsql-11/bin/psql -h localhost -p 18083 -d test -c "UPDATE public.test SET time = '${time}'"
|