Scenario
A common need in real work: batch-filtering dozens or even hundreds of log files, each several hundred megabytes, for a keyword. Simply running a shell command like grep "keyword" *.log over that many files is obviously very inefficient.
Introduction
So how can a shell script get the kind of multi-threaded behavior that high-level languages like Java offer? Read on.
Note: the examples below are for reference.
I. Sequential execution example
[root@test-zookeeper-01 test]# cat test1.sh
#!/bin/bash
Njob=15 # total number of tasks
for ((i=0; i<$Njob; i++)); do
{
echo "progress $i is sleeping for 1 seconds zzz…"
sleep 1
}
done
echo -e "time-consuming: $SECONDS seconds" # print the script's elapsed time
[root@test-zookeeper-01 test]# bash test1.sh
progress 0 is sleeping for 1 seconds zzz…
progress 1 is sleeping for 1 seconds zzz…
progress 2 is sleeping for 1 seconds zzz…
progress 3 is sleeping for 1 seconds zzz…
progress 4 is sleeping for 1 seconds zzz…
progress 5 is sleeping for 1 seconds zzz…
progress 6 is sleeping for 1 seconds zzz…
progress 7 is sleeping for 1 seconds zzz…
progress 8 is sleeping for 1 seconds zzz…
progress 9 is sleeping for 1 seconds zzz…
progress 10 is sleeping for 1 seconds zzz…
progress 11 is sleeping for 1 seconds zzz…
progress 12 is sleeping for 1 seconds zzz…
progress 13 is sleeping for 1 seconds zzz…
progress 14 is sleeping for 1 seconds zzz…
time-consuming: 15 seconds
Sequential execution like this behaves single-threaded: the script simulates 15 tasks of 1 s each, runs them one at a time, and takes 15 s in total.
II. Parallel execution examples
1. Concurrent execution without concurrency control
Example code:
[root@test-zookeeper-01 test]# cat test2.sh
#!/bin/bash
Njob=15 # total number of tasks
for ((i=0; i<$Njob; i++)); do
{
echo "progress $i is sleeping for 1 seconds zzz…"
sleep 1 & # run sleep in the background so all 15 tasks proceed in parallel
}
done
wait # wait for all background jobs to finish before continuing
echo -e "time-consuming: $SECONDS seconds" # print the script's elapsed time
[root@test-zookeeper-01 test]# bash test2.sh
progress 0 is sleeping for 1 seconds zzz…
progress 1 is sleeping for 1 seconds zzz…
progress 2 is sleeping for 1 seconds zzz…
progress 3 is sleeping for 1 seconds zzz…
progress 4 is sleeping for 1 seconds zzz…
progress 5 is sleeping for 1 seconds zzz…
progress 6 is sleeping for 1 seconds zzz…
progress 7 is sleeping for 1 seconds zzz…
progress 8 is sleeping for 1 seconds zzz…
progress 9 is sleeping for 1 seconds zzz…
progress 10 is sleeping for 1 seconds zzz…
progress 11 is sleeping for 1 seconds zzz…
progress 12 is sleeping for 1 seconds zzz…
progress 13 is sleeping for 1 seconds zzz…
progress 14 is sleeping for 1 seconds zzz…
time-consuming: 1 seconds
Without any concurrency control, the simulation finished in 1 s, a 15x speedup. But avoid uncontrolled scripts like this in production: launching every task at once can overwhelm the machine.
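If you want a cap without reaching for xargs, newer bash can do it with built-ins alone. A minimal sketch, assuming bash 4.3+ for `wait -n` (Njob and max_jobs are illustrative values):

```shell
#!/bin/bash
# Cap concurrency using only bash built-ins: before starting a new job,
# block while the number of running background jobs is at the limit.
Njob=15
max_jobs=5

for ((i = 0; i < Njob; i++)); do
    while (( $(jobs -rp | wc -l) >= max_jobs )); do
        wait -n          # wait for any one background job to finish (bash 4.3+)
    done
    {
        echo "progress $i is sleeping for 1 seconds zzz..."
        sleep 1
    } &
done
wait                     # wait for the remaining jobs
echo "time-consuming: $SECONDS seconds"
```

With 15 one-second tasks and a cap of 5, this runs in roughly three batches, so it should finish in about 3 s like the xargs version below.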
2. Concurrent execution with simple concurrency control
Example code:
[root@test-zookeeper-01 test]# cat test3.sh
#!/bin/bash
Njob=15 # total number of tasks
thread_num=5
seq 1 ${Njob} | xargs -n 1 -I {} -P ${thread_num} sh -c "echo progress {} is sleeping for 1 seconds zzz…;sleep 1"
echo -e "time-consuming: $SECONDS seconds" # print the script's elapsed time
# xargs' -P option caps the number of concurrent workers at 5
Output:
[root@test-zookeeper-01 test]# bash test3.sh
progress 1 is sleeping for 1 seconds zzz…
progress 2 is sleeping for 1 seconds zzz…
progress 3 is sleeping for 1 seconds zzz…
progress 4 is sleeping for 1 seconds zzz…
progress 5 is sleeping for 1 seconds zzz…
progress 6 is sleeping for 1 seconds zzz…
progress 7 is sleeping for 1 seconds zzz…
progress 8 is sleeping for 1 seconds zzz…
progress 9 is sleeping for 1 seconds zzz…
progress 10 is sleeping for 1 seconds zzz…
progress 11 is sleeping for 1 seconds zzz…
progress 12 is sleeping for 1 seconds zzz…
progress 13 is sleeping for 1 seconds zzz…
progress 14 is sleeping for 1 seconds zzz…
progress 15 is sleeping for 1 seconds zzz…
time-consuming: 3 seconds
# 5 tasks ran at a time, so the 15 tasks took 3 s in total
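A small variant of test3.sh worth knowing (a sketch, not the script above): instead of splicing each item into the command string with -I {}, you can let xargs append it as a positional argument. It then reaches sh -c as $0 inside the child shell, which sidesteps quoting problems when items contain shell metacharacters:

```shell
#!/bin/bash
# Same workload as test3.sh, but the item is passed positionally:
# xargs runs  sh -c '...' ITEM  and ITEM becomes $0 in the child shell.
Njob=15
thread_num=5
seq 1 "${Njob}" | xargs -n 1 -P "${thread_num}" sh -c \
    'echo "progress $0 is sleeping for 1 seconds zzz..."; sleep 1'
echo "time-consuming: $SECONDS seconds"
```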
3. Concurrency control in real work
a. Scenario: filter 237 files for the keyword "商品", each file roughly 500 MB. Plain grep is clearly too slow here (at least an hour and a half) because the files are so numerous and so large:
[root@local app]# ll -thr /opt/app/app.2022-03-11*.tmp |wc -l
237
[root@local app]# du -h /opt/app/app.2022-03-11*.tmp |head -10
267M /opt/app/app.2022-03-11.0.log12206568827092188.tmp
640M /opt/app/app.2022-03-11.0.log14026664999879751.tmp
1.2G /opt/app/app.2022-03-11.0.log6690038135875677.tmp
691M /opt/app/app.2022-03-11.10.log12220885563253784.tmp
351M /opt/app/app.2022-03-11.10.log5577542893032849.tmp
267M /opt/app/app.2022-03-11.10.log6702694056370112.tmp
501M /opt/app/app.2022-03-11.11.log12222929747182881.tmp
974M /opt/app/app.2022-03-11.11.log14040978299277329.tmp
887M /opt/app/app.2022-03-11.11.log6704352102433537.tmp
513M /opt/app/app.2022-03-11.12.log12226101152228757.tmp
b. Approach: write a script that controls the concurrency to speed up the filtering. My machine has 4 CPU cores, and I allowed 8 parallel jobs:
[root@local app]# cat grep.sh
#!/bin/bash
thread_num=8
a=$(date +%H%M%S)
ls /opt/app/app.2022-03-11*.tmp | xargs -n 1 -I {} -P ${thread_num} sh -c "grep '商品' -C 10 {} >> /opt/app/b.txt"
b=$(date +%H%M%S)
echo -e "startTime:\t$a" >> /opt/app/time.txt
echo -e "endTime:\t$b" >> /opt/app/time.txt
# This records the time a different way; you could also use $SECONDS as above and save a few lines
c. Result:
[root@local app]# cat time.txt
startTime: 174957
endTime: 175722
# From 17:49:57 to 17:57:22 is about 7.5 minutes (~445 s), roughly a 12x speedup over the estimated hour and a half. (Note: 175722 - 174957 = 765 is not a number of seconds; HHMMSS values cannot be subtracted as plain integers.)
# CPU load was high while it ran and the system felt sluggish at the prompt; without a concurrency cap the machine would be overwhelmed.
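Because HHMMSS strings like 174957 do not subtract correctly across minute or hour boundaries, a safer measurement uses epoch seconds from date +%s. A minimal sketch, with sleep 2 standing in for the real grep workload:

```shell
#!/bin/bash
# Measure elapsed time with epoch seconds, which subtract correctly.
start=$(date +%s)
sleep 2                      # stand-in for the actual work (the grep pipeline)
end=$(date +%s)
echo "time-consuming: $(( end - start )) seconds"
```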
Summary
Shell work does run into multi-threading scenarios just as Java does. This article showed a fairly simple method, xargs; other approaches exist, such as a named pipe (FIFO). When you hit this kind of parallel workload, be sure to cap the concurrency even as you run tasks in parallel. Note: the hostnames, logs, and directories above are all simulated; please don't read anything into them.
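For completeness, here is a minimal sketch of the named-pipe (FIFO) approach mentioned above: the FIFO acts as a token bucket, each job takes a token before starting and returns it when done, so at most thread_num jobs run at once (the task counts are illustrative):

```shell
#!/bin/bash
# FIFO-as-token-bucket concurrency control.
Njob=15
thread_num=5

fifo=$(mktemp -u)            # temporary path for the FIFO (sketch only)
mkfifo "$fifo"
exec 9<>"$fifo"              # open it read-write on fd 9
rm -f "$fifo"                # the open fd keeps it alive; the path is no longer needed

# Seed the bucket with thread_num tokens (one newline each).
for ((i = 0; i < thread_num; i++)); do
    echo >&9
done

for ((i = 0; i < Njob; i++)); do
    read -u 9                # take a token; blocks when none are left
    {
        echo "progress $i is sleeping for 1 seconds zzz..."
        sleep 1
        echo >&9             # return the token when the job finishes
    } &
done
wait
exec 9>&-                    # close the fd
echo "time-consuming: $SECONDS seconds"
```

Like the xargs version, 15 one-second tasks at a width of 5 should finish in about 3 s; the FIFO method's advantage is that it needs no external tools and works inside an existing loop.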