IT数码 购物 网址 头条 软件 日历 阅读 图书馆
TxT小说阅读器
↓语音阅读,小说下载,古典文学↓
图片批量下载器
↓批量下载图片,美女图库↓
图片自动播放器
↓图片自动播放器↓
一键清除垃圾
↓轻轻一点,清除系统垃圾↓
开发: C++知识库 Java知识库 JavaScript Python PHP知识库 人工智能 区块链 大数据 移动开发 嵌入式 开发工具 数据结构与算法 开发测试 游戏开发 网络协议 系统运维
教程: HTML教程 CSS教程 JavaScript教程 Go语言教程 JQuery教程 VUE教程 VUE3教程 Bootstrap教程 SQL数据库教程 C语言教程 C++教程 Java教程 Python教程 Python3教程 C#教程
数码: 电脑 笔记本 显卡 显示器 固态硬盘 硬盘 耳机 手机 iphone vivo oppo 小米 华为 单反 装机 图拉丁
 
   -> 系统运维 -> 四、系统监控之Prometheus实战篇 - 虚拟机监控 -> 正文阅读

[系统运维]四、系统监控之Prometheus实战篇 - 虚拟机监控

4 Prometheus实战之虚拟机监控

node-exporter用于监控Linux主机的度量信息。

参考自:https://grafana.com/oss/prometheus/exporters/node-exporter/?tab=installation#node-exporter-quickstart

4.1 安装node-exporter

下载地址:https://prometheus.io/download/#node_exporter

# 创建node-exporter目录
mkdir /usr/local/node-exporter
# 跳转到node-exporter目录
cd /usr/local/node-exporter
# 下载
wget https://github.com/prometheus/node_exporter/releases/download/v1.2.2/node_exporter-1.2.2.linux-amd64.tar.gz
# 解压
tar -zxvf node_exporter-1.2.2.linux-amd64.tar.gz
# 跳转到安装目录
cd node_exporter-1.2.2.linux-amd64
# 启动
./node_exporter

默认监听9100端口,也可以通过如下参数修改:

./node_exporter --web.listen-address 127.0.0.1:8082

虚拟机的主机名是centos7,使用主机名访问:http://centos7:9100/metrics

4.2 修改Prometheus配置

当前虚拟机使用主机名访问,现在配置了两台虚拟机:

scrape_configs:
  - job_name: 'node'
    scrape_interval: 5s
    static_configs:
      - targets: ['centos7:9100', 'centos8:9100']

注意:

1、targets里面不要写https或者http,如果需要的话,使用schema指定(默认是http)。

2、job名称一定得是node,因为后面Dashboard里面配置的默认Job名称都是node

4.3 配置Recording Rule

为什么要配置记录规则?

使用记录规则,可以预计算和缓存频繁查询的metrics。比如Dashboard需要使用到rate()查询,如果有了Recording Rule,则会进行计算并产生一个新的时间序列。这样就能避免在Dashboard更新的时候,必须抓取数据并计算数据。

在导入Dashboard之前,应当配置Recording Rule!

node_exporter_recording_rules.yml

Recording rule文件如下(规则名称是node-exporter.rules,默认选择器器是job=node):

"groups":
- "name": "node-exporter.rules"
  "rules":
  - "expr": |
      count without (cpu) (
        count without (mode) (
          node_cpu_seconds_total{job="node"}
        )
      )
    "record": "instance:node_num_cpu:sum"
  - "expr": |
      1 - avg without (cpu, mode) (
        rate(node_cpu_seconds_total{job="node", mode="idle"}[1m])
      )
    "record": "instance:node_cpu_utilisation:rate1m"
  - "expr": |
      (
        node_load1{job="node"}
      /
        instance:node_num_cpu:sum{job="node"}
      )
    "record": "instance:node_load1_per_cpu:ratio"
  - "expr": |
      1 - (
        node_memory_MemAvailable_bytes{job="node"}
      /
        node_memory_MemTotal_bytes{job="node"}
      )
    "record": "instance:node_memory_utilisation:ratio"
  - "expr": |
      rate(node_vmstat_pgmajfault{job="node"}[1m])
    "record": "instance:node_vmstat_pgmajfault:rate1m"
  - "expr": |
      rate(node_disk_io_time_seconds_total{job="node", device!=""}[1m])
    "record": "instance_device:node_disk_io_time_seconds:rate1m"
  - "expr": |
      rate(node_disk_io_time_weighted_seconds_total{job="node", device!=""}[1m])
    "record": "instance_device:node_disk_io_time_weighted_seconds:rate1m"
  - "expr": |
      sum without (device) (
        rate(node_network_receive_bytes_total{job="node", device!="lo"}[1m])
      )
    "record": "instance:node_network_receive_bytes_excluding_lo:rate1m"
  - "expr": |
      sum without (device) (
        rate(node_network_transmit_bytes_total{job="node", device!="lo"}[1m])
      )
    "record": "instance:node_network_transmit_bytes_excluding_lo:rate1m"
  - "expr": |
      sum without (device) (
        rate(node_network_receive_drop_total{job="node", device!="lo"}[1m])
      )
    "record": "instance:node_network_receive_drop_excluding_lo:rate1m"
  - "expr": |
      sum without (device) (
        rate(node_network_transmit_drop_total{job="node", device!="lo"}[1m])
      )
    "record": "instance:node_network_transmit_drop_excluding_lo:rate1m"

配置上述规则

修改prometheus.yml:

rule_files:
  - "node_exporter_recording_rules.yml"

重启Prometheus后,可以看到node-exporter.rules规则已经生效:

4.4 导入Dashboard

搜索Dashboard

访问:https://grafana.com/grafana/dashboards搜索node-exporter的Dashboard(就在Dashboard首页就能看到这个下面的面板)如下:

https://grafana.com/grafana/dashboards/13978?pg=dashboards&plcmt=featured-sub1

导入Dashboard

在Grafana中按照ID导入:

Dashboard显示效果

这个Dashboard包含如下面板信息:

  • CPU Usage
  • Load Average
  • Memory Usage
  • Disk I/O
  • Disk Usage
  • Network Received
  • Network Transmitted

最终显示效果:

4.5 配置告警规则

告警的含义

当PromQL表达式在一定时间内违背了某些限制条件或者满足了特定的条件,就会触发告警。只要触发了告警,告警就会进入Pending状态,当周期时间内满足for参数定义的条件后,就会进入Firing状态。这个时候配置告警工具(比如alertmanager)就可以通知了。

比如下面的时钟未同步告警,如果触发了一次告警,则会变成Pending状态,如果10min后还是满足条件,则会进入Firing状态。

node_exporter_alerting_rules.yml

告警配置包括如下内容:

  • NodeFilesystemSpaceFillingUp、NodeFilesystemAlmostOutOfSpace、NodeFilesystemFilesFillingUp、NodeFilesystemAlmostOutOfFiles:磁盘快满了
  • NodeNetworkReceiveErrs:过去两分钟内网卡收到很多错误
  • NodeHighNumberConntrackEntriesUsed:% of conntrack entries are used.
  • NodeTextFileCollectorScrapeError:Node Exporter文本文件收集器抓取失败
  • NodeClockSkewDetected:时钟不准,且超过300秒
  • NodeClockNotSynchronising:时钟没有同步,确保服务器上配置了NTP
  • NodeRAIDDegraded:因为磁盘故障导致RADI实例进入降级状态
  • NodeRAIDDiskFailure:RADI磁盘故障
"groups":
- "name": "node-exporter"
  "rules":
  - "alert": "NodeFilesystemSpaceFillingUp"
    "annotations":
      "description": "Filesystem on {{ $labels.device }} at {{ $labels.instance }} has only {{ printf \"%.2f\" $value }}% available space left and is filling up."
      "summary": "Filesystem is predicted to run out of space within the next 24 hours."
    "expr": |
      (
        node_filesystem_avail_bytes{job="node",fstype!=""} / node_filesystem_size_bytes{job="node",fstype!=""} * 100 < 40
      and
        predict_linear(node_filesystem_avail_bytes{job="node",fstype!=""}[6h], 24*60*60) < 0
      and
        node_filesystem_readonly{job="node",fstype!=""} == 0
      )
    "for": "1h"
    "labels":
      "severity": "warning"
  - "alert": "NodeFilesystemSpaceFillingUp"
    "annotations":
      "description": "Filesystem on {{ $labels.device }} at {{ $labels.instance }} has only {{ printf \"%.2f\" $value }}% available space left and is filling up fast."
      "summary": "Filesystem is predicted to run out of space within the next 4 hours."
    "expr": |
      (
        node_filesystem_avail_bytes{job="node",fstype!=""} / node_filesystem_size_bytes{job="node",fstype!=""} * 100 < 20
      and
        predict_linear(node_filesystem_avail_bytes{job="node",fstype!=""}[6h], 4*60*60) < 0
      and
        node_filesystem_readonly{job="node",fstype!=""} == 0
      )
    "for": "1h"
    "labels":
      "severity": "critical"
  - "alert": "NodeFilesystemAlmostOutOfSpace"
    "annotations":
      "description": "Filesystem on {{ $labels.device }} at {{ $labels.instance }} has only {{ printf \"%.2f\" $value }}% available space left."
      "summary": "Filesystem has less than 5% space left."
    "expr": |
      (
        node_filesystem_avail_bytes{job="node",fstype!=""} / node_filesystem_size_bytes{job="node",fstype!=""} * 100 < 5
      and
        node_filesystem_readonly{job="node",fstype!=""} == 0
      )
    "for": "1h"
    "labels":
      "severity": "warning"
  - "alert": "NodeFilesystemAlmostOutOfSpace"
    "annotations":
      "description": "Filesystem on {{ $labels.device }} at {{ $labels.instance }} has only {{ printf \"%.2f\" $value }}% available space left."
      "summary": "Filesystem has less than 3% space left."
    "expr": |
      (
        node_filesystem_avail_bytes{job="node",fstype!=""} / node_filesystem_size_bytes{job="node",fstype!=""} * 100 < 3
      and
        node_filesystem_readonly{job="node",fstype!=""} == 0
      )
    "for": "1h"
    "labels":
      "severity": "critical"
  - "alert": "NodeFilesystemFilesFillingUp"
    "annotations":
      "description": "Filesystem on {{ $labels.device }} at {{ $labels.instance }} has only {{ printf \"%.2f\" $value }}% available inodes left and is filling up."
      "summary": "Filesystem is predicted to run out of inodes within the next 24 hours."
    "expr": |
      (
        node_filesystem_files_free{job="node",fstype!=""} / node_filesystem_files{job="node",fstype!=""} * 100 < 40
      and
        predict_linear(node_filesystem_files_free{job="node",fstype!=""}[6h], 24*60*60) < 0
      and
        node_filesystem_readonly{job="node",fstype!=""} == 0
      )
    "for": "1h"
    "labels":
      "severity": "warning"
  - "alert": "NodeFilesystemFilesFillingUp"
    "annotations":
      "description": "Filesystem on {{ $labels.device }} at {{ $labels.instance }} has only {{ printf \"%.2f\" $value }}% available inodes left and is filling up fast."
      "summary": "Filesystem is predicted to run out of inodes within the next 4 hours."
    "expr": |
      (
        node_filesystem_files_free{job="node",fstype!=""} / node_filesystem_files{job="node",fstype!=""} * 100 < 20
      and
        predict_linear(node_filesystem_files_free{job="node",fstype!=""}[6h], 4*60*60) < 0
      and
        node_filesystem_readonly{job="node",fstype!=""} == 0
      )
    "for": "1h"
    "labels":
      "severity": "critical"
  - "alert": "NodeFilesystemAlmostOutOfFiles"
    "annotations":
      "description": "Filesystem on {{ $labels.device }} at {{ $labels.instance }} has only {{ printf \"%.2f\" $value }}% available inodes left."
      "summary": "Filesystem has less than 5% inodes left."
    "expr": |
      (
        node_filesystem_files_free{job="node",fstype!=""} / node_filesystem_files{job="node",fstype!=""} * 100 < 5
      and
        node_filesystem_readonly{job="node",fstype!=""} == 0
      )
    "for": "1h"
    "labels":
      "severity": "warning"
  - "alert": "NodeFilesystemAlmostOutOfFiles"
    "annotations":
      "description": "Filesystem on {{ $labels.device }} at {{ $labels.instance }} has only {{ printf \"%.2f\" $value }}% available inodes left."
      "summary": "Filesystem has less than 3% inodes left."
    "expr": |
      (
        node_filesystem_files_free{job="node",fstype!=""} / node_filesystem_files{job="node",fstype!=""} * 100 < 3
      and
        node_filesystem_readonly{job="node",fstype!=""} == 0
      )
    "for": "1h"
    "labels":
      "severity": "critical"
  - "alert": "NodeNetworkReceiveErrs"
    "annotations":
      "description": "{{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf \"%.0f\" $value }} receive errors in the last two minutes."
      "summary": "Network interface is reporting many receive errors."
    "expr": |
      rate(node_network_receive_errs_total[2m]) / rate(node_network_receive_packets_total[2m]) > 0.01
    "for": "1h"
    "labels":
      "severity": "warning"
  - "alert": "NodeNetworkTransmitErrs"
    "annotations":
      "description": "{{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf \"%.0f\" $value }} transmit errors in the last two minutes."
      "summary": "Network interface is reporting many transmit errors."
    "expr": |
      rate(node_network_transmit_errs_total[2m]) / rate(node_network_transmit_packets_total[2m]) > 0.01
    "for": "1h"
    "labels":
      "severity": "warning"
  - "alert": "NodeHighNumberConntrackEntriesUsed"
    "annotations":
      "description": "{{ $value | humanizePercentage }} of conntrack entries are used."
      "summary": "Number of conntrack are getting close to the limit."
    "expr": |
      (node_nf_conntrack_entries / node_nf_conntrack_entries_limit) > 0.75
    "labels":
      "severity": "warning"
  - "alert": "NodeTextFileCollectorScrapeError"
    "annotations":
      "description": "Node Exporter text file collector failed to scrape."
      "summary": "Node Exporter text file collector failed to scrape."
    "expr": |
      node_textfile_scrape_error{job="node"} == 1
    "labels":
      "severity": "warning"
  - "alert": "NodeClockSkewDetected"
    "annotations":
      "description": "Clock on {{ $labels.instance }} is out of sync by more than 300s. Ensure NTP is configured correctly on this host."
      "summary": "Clock skew detected."
    "expr": |
      (
        node_timex_offset_seconds > 0.05
      and
        deriv(node_timex_offset_seconds[5m]) >= 0
      )
      or
      (
        node_timex_offset_seconds < -0.05
      and
        deriv(node_timex_offset_seconds[5m]) <= 0
      )
    "for": "10m"
    "labels":
      "severity": "warning"
  - "alert": "NodeClockNotSynchronising"
    "annotations":
      "description": "Clock on {{ $labels.instance }} is not synchronising. Ensure NTP is configured on this host."
      "summary": "Clock not synchronising."
    "expr": |
      min_over_time(node_timex_sync_status[5m]) == 0
      and
      node_timex_maxerror_seconds >= 16
    "for": "10m"
    "labels":
      "severity": "warning"
  - "alert": "NodeRAIDDegraded"
    "annotations":
      "description": "RAID array '{{ $labels.device }}' on {{ $labels.instance }} is in degraded state due to one or more disks failures. Number of spare drives is insufficient to fix issue automatically."
      "summary": "RAID Array is degraded"
    "expr": |
      node_md_disks_required - ignoring (state) (node_md_disks{state="active"}) > 0
    "for": "15m"
    "labels":
      "severity": "critical"
  - "alert": "NodeRAIDDiskFailure"
    "annotations":
      "description": "At least one device in RAID array on {{ $labels.instance }} failed. Array '{{ $labels.device }}' needs attention and possibly a disk swap."
      "summary": "Failed device in RAID array"
    "expr": |
      node_md_disks{state="failed"} > 0
    "labels":
      "severity": "warning"

配置上述规则

修改prometheus.yml:

rule_files:
  - "node_exporter_alerting_rules.yml"

重启后,即可看到配置的规则:


在告警页面,可以看到所有的告警:

4.6 配置告警 - alertmanager

假设现在虚拟机发生了告警,比如centos8虚拟机没有设置时钟的NTP同步。具体告警使用alertmanager实现。

4.6.1 alertmanager概述

Prometheus告警组成部分

Prometheus中的告警分成两部分:Prometheus Server中定义告警规则,然后发送告警给alertmanager。alertmanager管理(去重、分组、路由)这些告警并使用邮件、聊天平台等方式发送出去。

告警和通知实现方式

告警和通知需要这样才能实现:

  • 在Prometheus中告警规则
  • 配置Prometheus(Prometheus作为alertmanager的客户端应用)与alertmanager进行通讯
  • 启动并配置alertmanager

Alertmanager中的核心概念

  • 分组(grouping):把多个具备特征的合并到一个里面去。用于大规模系统失败,触发成千上万告警的时候。比如网络出现,导致大量应用无法连接数据库。这些告警就可以归一到一个通知里面去。
  • 压制(inhibition):如果某些其它的告警已经触发了,那这个告警就需要被压制住,不要再发了。比如整个都不可达了,然后特定告警已经触发,那么就可以把其它关注于这个集群的告警都mute掉,防止发出无限多的通知出去。
  • 静默(silences):是一种直接让告警mute的方式,其配置基于匹配器(matcher),所有到来的告警都会和匹配器进行匹配,如果和这个silence匹配器匹配,则出现告警时,就不会发出通知。

Alertmanager中的参数配置

Alertmanager由命令行参数(即启动命令上加--参数)和配置文件来配置。命令行参数配置不可变的系统参数,配置文件配置压制规则、通知路由和通知接受者。

使用altermanager -h命令(即--help)可以查看所有可用的命令行参数:

Alertmanager如何在运行时重载配置

alertmanager可以在运行时重新加载配置。如果新配置格式问题,则不会新配置并在日志中记录错误。

通过给进程发送SIGNUP信号(还没找到实现方式)或者通过HTTP POST请求/-/reload端点即可重载配置(如下图所示,同时日志中出现“Completed loading of configuration file”)。

4.6.2 安装alertmanager

下载路径:https://github.com/prometheus/alertmanager/releases/download/v0.23.0/alertmanager-0.23.0.linux-amd64.tar.gz

# 创建alertmanager目录
mkdir /usr/local/alertmanager
# 跳转到alertmanager目录
cd /usr/local/alertmanager
# 下载
wget https://hub.fastgit.org/prometheus/alertmanager/releases/download/v0.23.0/alertmanager-0.23.0.linux-amd64.tar.gz
# 解压
tar -zxvf alertmanager-0.23.0.linux-amd64.tar.gz
# 跳转到安装目录
cd alertmanager-0.23.0.linux-amd64
# 启动(先不要启动,我们还得改下配置文件)
# ./alertmanager --config.file=alertmanager.yml

4.6.3 修改alertmanager.yml

这里简单的配置一个QQ邮箱发送邮件,发件人我自己的邮箱634501730@qq.com,收件人为了方便测试用的也是我自己的邮箱634501730@qq.com(即发件人和收件人是相同的邮箱):

global:
  # The default SMTP From header field.
  smtp_from: '634501730@qq.com'
  # The default SMTP smarthost used for sending emails, including port number.
  # Port number usually is 25, or 587 for SMTP over TLS (sometimes referred to as STARTTLS).
  # Example: smtp.example.org:587
  smtp_smarthost: 'smtp.qq.com:465'
  # The default hostname to identify to the SMTP server.
  #smtp_hello: <string> | default = "localhost"
  # SMTP Auth using CRAM-MD5, LOGIN and PLAIN. If empty, Alertmanager doesn't authenticate to the SMTP server.
  smtp_auth_username: '634501730@qq.com'
  # SMTP Auth using LOGIN and PLAIN.
  smtp_auth_password: '这个是授权码,不是密码哈!'
  # SMTP Auth using PLAIN.
  #smtp_auth_identity: <string>
  # SMTP Auth using CRAM-MD5.
  #smtp_auth_secret: <secret>
  # The default SMTP TLS requirement.
  # Note that Go does not support unencrypted connections to remote SMTP endpoints.
  smtp_require_tls: false

templates:
#

route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  # 默认发送时间间隔是repeat_interval + group_interval
  repeat_interval: 1h
  receiver: 'email'

receivers:
- name: 'email'
  email_configs:
  - to: '634501730@qq.com'

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']

配置完成后,启动alertmanager:

./alertmanager --config.file=alertmanager.yml

启动日志如下:

caller=main.go:225 msg="Starting Alertmanager" version="(version=0.23.0, branch=HEAD, revision=61046b17771a57cfd4c4a51be370ab930a4d7d54)"

caller=main.go:226 build_context="(go=go1.16.7, user=root@e21a959be8d2, date=20210825-10:48:55)"

caller=cluster.go:184 component=cluster msg="setting advertise address explicitly" addr=192.168.31.245 port=9094

caller=cluster.go:671 component=cluster msg="Waiting for gossip to settle..." interval=2s

caller=coordinator.go:113 component=configuration msg="Loading configuration file" file=alertmanager.yml

caller=coordinator.go:126 component=configuration msg="Completed loading of configuration file" file=alertmanager.yml

caller=main.go:518 msg=Listening address=:9093

4.6.4 Prometheus中配置alertmanager

alertmanager是支持部署的(所以下面的配置是一个数组),这里只部署一台。因为是和Prometheus部署在一起的,所以地址写成了localhost:9093。修改prometheus.yml如下:

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
           - localhost:9093

配置修改完成后,重启Prometheus。

4.6.5 查看告警邮件

此时,查看邮箱就能看到收到的邮件:

这里尽管收到了告警信息,但是告警还是没有解决。在Prometheus中依然会显示出这个告警:

4.6.6 解决告警

因为告警是centos8虚拟机没有设置NTP时钟同步,那在centos8上设置一下就可以消除告警:

dnf install chrony
systemctl enable chronyd
重启服务器

重启完CentOS8后即可看到同步状态变成了1,然后告警消失。

至此,一个Prometheus完整的监控流程,就完整了。

  系统运维 最新文章
配置小型公司网络WLAN基本业务(AC通过三层
如何在交付运维过程中建立风险底线意识,提
快速传输大文件,怎么通过网络传大文件给对
从游戏服务端角度分析移动同步(状态同步)
MySQL使用MyCat实现分库分表
如何用DWDM射频光纤技术实现200公里外的站点
国内顺畅下载k8s.gcr.io的镜像
自动化测试appium
ctfshow ssrf
Linux操作系统学习之实用指令(Centos7/8均
上一篇文章      下一篇文章      查看所有文章
加:2021-09-05 11:27:40  更:2021-09-05 11:28:40 
 
开发: C++知识库 Java知识库 JavaScript Python PHP知识库 人工智能 区块链 大数据 移动开发 嵌入式 开发工具 数据结构与算法 开发测试 游戏开发 网络协议 系统运维
教程: HTML教程 CSS教程 JavaScript教程 Go语言教程 JQuery教程 VUE教程 VUE3教程 Bootstrap教程 SQL数据库教程 C语言教程 C++教程 Java教程 Python教程 Python3教程 C#教程
数码: 电脑 笔记本 显卡 显示器 固态硬盘 硬盘 耳机 手机 iphone vivo oppo 小米 华为 单反 装机 图拉丁

360图书馆 购物 三丰科技 阅读网 日历 万年历 2024年12日历 -2024/12/30 2:01:43-

图片自动播放器
↓图片自动播放器↓
TxT小说阅读器
↓语音阅读,小说下载,古典文学↓
一键清除垃圾
↓轻轻一点,清除系统垃圾↓
图片批量下载器
↓批量下载图片,美女图库↓
  网站联系: qq:121756557 email:121756557@qq.com  IT数码