Use cases
- Alerting on failed tasks in big-data workflows
- Monitoring alerts for big-data service nodes
Creating a WeCom (企业微信) group bot
- Create a group chat
- Add a bot to the group chat
- Once the bot is created, click its configuration instructions
- Review the message syntax
- Note the data format for text-type messages
Note: creating a group chat requires inviting two other people. Once the group exists you can remove both of them, leaving only yourself and the bot in the group 😓
Installing curl on CentOS 7
yum -y install curl
man curl
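| Option | Man page text | Explanation |
|---|---|---|
| -H, --header <header> | (HTTP) Extra header to use when getting a web page | An extra HTTP message header to send with the request |
| -d, --data <data> | Sends the specified data in a POST request to the HTTP server | Sends the given data to the HTTP server in a POST request |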
curl 'https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=XXXXXXXXXXXXXXXXXXXXXXXXXXX' \
  -H 'Content-Type: application/json' \
  -d '
  {
    "msgtype": "text",
    "text": {
      "content": "hello world"
    }
  }'
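Doing the same thing as curl in Python
import json

import requests

url = 'https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=XXXXXXXXXXXXXXXXXXXXXXXXXXX'
headers = {'Content-Type': 'application/json'}
data = {
    "msgtype": "text",
    "text": {
        "content": "Node yyy of workflow xxx failed",
        # userids to @-mention; "@all" mentions everyone in the group
        "mentioned_list": ["some_group_member", "@all"],
    },
}
# Serialize the payload to JSON and POST it to the webhook
response = requests.post(url=url, headers=headers, data=json.dumps(data), timeout=10)
print(response)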
Result: [screenshot]
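Printing the response object only shows the HTTP status. The webhook also replies with a small JSON body; assuming the `errcode` / `errmsg` fields described in the WeCom bot documentation (with `errcode` equal to 0 on success), a minimal sketch of checking it could look like this. `check_wecom_response` is a hypothetical helper name, not part of any library.

```python
def check_wecom_response(body: dict) -> None:
    """Raise RuntimeError if the webhook reply reports a failure.

    Assumes the reply carries "errcode" (0 on success) and "errmsg".
    """
    errcode = body.get("errcode")
    if errcode != 0:
        errmsg = body.get("errmsg")
        raise RuntimeError(
            f"WeCom webhook failed: errcode={errcode}, errmsg={errmsg}"
        )
```

It would be called on the decoded reply, for example `check_wecom_response(requests.post(url, headers=headers, data=payload).json())`, so that a rejected message does not go unnoticed.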
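Monitoring DolphinScheduler nodes with Zabbix and sending alerts
Background: DolphinScheduler (DS) goes down frequently, the WorkerServer in particular. Tuning the DS heartbeat interval and memory settings did not stop the crashes. In most cases the cause is that DS cannot connect to the ZooKeeper that ships with CDH, which has been unhealthy for a long time. Insufficient memory can also bring DS down; that particular problem went away after the memory was increased.
Excerpt from the error log: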
org.apache.dolphinscheduler.plugin.registry.zookeeper.ZookeeperConnectionStateListener:[50] - Registry suspended
org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient:[545] - registry connection state is SUSPENDED, ready to retry connection
org.apache.curator.ConnectionState:[376] - Session expired event received
org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient:[552] - registry connection state is DISCONNECTED, ready to stop myself
org.apache.dolphinscheduler.server.master.processor.queue.StateEventResponseService:[115] - persist task error
java.lang.InterruptedException: null
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2048)
at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
at org.apache.dolphinscheduler.server.master.processor.queue.StateEventResponseService$StateEventResponseWorker.run(StateEventResponseService.java:112)
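The excerpt shows the registry connection going from SUSPENDED to DISCONNECTED before the master stops itself. As a rough sketch of how those state changes could be pulled out of a log file, the snippet below matches the "registry connection state is ..." wording seen above. The function name and the regular expression are my own, and other DS versions may word the message differently.

```python
import re

# Matches the state messages that appear in the excerpt above.
STATE_RE = re.compile(r"registry connection state is (\w+)")


def registry_states(lines):
    """Return the registry connection states in order of appearance."""
    states = []
    for line in lines:
        match = STATE_RE.search(line)
        if match:
            states.append(match.group(1))
    return states


sample = [
    "MasterRegistryClient:[545] - registry connection state is SUSPENDED, ready to retry connection",
    "org.apache.curator.ConnectionState:[376] - Session expired event received",
    "MasterRegistryClient:[552] - registry connection state is DISCONNECTED, ready to stop myself",
]
print(registry_states(sample))  # ['SUSPENDED', 'DISCONNECTED']
```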
1. Create a template
Configuration => Templates => Create template
2. Create an item in the template
Templates => select the template you just created => Items => Create item
ps -df | grep org.apache.dolphinscheduler.server.worker.WorkerServer
Taking the WorkerServer as the example: the class name org.apache.dolphinscheduler.server.worker.WorkerServer identifies the WorkerServer process precisely.
Set the item key to proc.num[,,all,WorkerServer]
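To get a feel for what this key counts, here is a small sketch that mimics it outside Zabbix. It assumes, as I understand the Zabbix agent documentation, that the fourth parameter of proc.num is matched against each process command line as a regular expression. The function names are hypothetical, and this is not how the agent itself is implemented.

```python
import re
import subprocess


def count_processes(cmdlines, pattern):
    """Count the command lines that match the regular expression `pattern`."""
    regex = re.compile(pattern)
    return sum(1 for cmd in cmdlines if regex.search(cmd))


def read_cmdlines_from_ps():
    """Return the command line of every process, one string per process."""
    out = subprocess.run(
        ["ps", "-eo", "args="],  # "args=" prints full command lines, no header
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return [line.strip() for line in out.splitlines() if line.strip()]
```

Calling `count_processes(read_cmdlines_from_ps(), "WorkerServer")` on a worker host should give a value comparable to the item. Unlike the `ps | grep` pipeline, it does not pick up the grep process itself.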
3. Create a trigger in the template
Switch to Triggers => Create trigger
Add an expression: [screenshot]
Condition: [screenshot]
4. Actions
Configuration => Actions => Create action
Switch to Operations and add an operation
Fill in the operation: [screenshot]
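The operation details are only visible in the screenshot, so the following is just one possible way to wire the alert to the bot, not necessarily what was configured here. It assumes a script that Zabbix runs with three arguments, in the order bot key, alert subject, alert message (for instance from the {ALERT.SENDTO}, {ALERT.SUBJECT} and {ALERT.MESSAGE} macros of a script media type). The function names and the argument order are assumptions.

```python
import json
import sys
import urllib.request

# Same webhook endpoint as in the curl example; the bot key is appended.
WEBHOOK_URL = "https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key="


def build_payload(subject, message):
    """Build a text-type message from the alert subject and body."""
    return {"msgtype": "text", "text": {"content": f"{subject}\n{message}"}}


def send(key, payload):
    """POST the payload to the webhook and return the decoded JSON reply."""
    request = urllib.request.Request(
        WEBHOOK_URL + key,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__" and len(sys.argv) == 4:
    # Assumed argument order: bot key, subject, message
    bot_key, alert_subject, alert_message = sys.argv[1:4]
    print(send(bot_key, build_payload(alert_subject, alert_message)))
```

Only the standard library is used, so the script does not depend on `requests` being installed on the Zabbix server.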
5. Link the template to multiple hosts at once
Configuration => Hosts => select the hosts => Mass update
Templates => tick Link templates => select the template created earlier
6. Test
Kill the WorkerServer process, then check whether an alert arrives in WeCom
7. Automatically restarting the DS service processes (not working yet)
cd /opt/module/dolphinscheduler/bin/;./stop-all.sh;./start-all.sh
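Since this step is not working yet, the following is only a sketch of the decision logic, under the assumption that the restart should run when the WorkerServer process count drops to zero. The directory is taken from the command above; the function names are hypothetical, and the sketch has not been tried against a real DS installation.

```python
import subprocess

DS_BIN = "/opt/module/dolphinscheduler/bin"  # path from the command above


def restart_commands(worker_count, bin_dir=DS_BIN):
    """Return the commands to run; nothing while a WorkerServer is alive."""
    if worker_count > 0:
        return []
    return [[f"{bin_dir}/stop-all.sh"], [f"{bin_dir}/start-all.sh"]]


def run(commands, bin_dir=DS_BIN):
    """Run each command from the DS bin directory, stopping on failure."""
    for command in commands:
        subprocess.run(command, cwd=bin_dir, check=True)
```

Keeping the decision (`restart_commands`) separate from the execution (`run`) makes it possible to check the logic without touching the services.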