This lecture note is mainly about transforming data from one format into another. A common scenario is inspecting logs: you chain pipes and other tools together until the data ends up in the format you want.
A log-processing example
journalctl queries the logs collected by the systemd-journald service, the log-collection daemon provided by the systemd init system. Its direct output tends to contain far too many entries, so further processing is usually needed to filter out the information you want, e.g. ssh myserver journalctl | grep sshd. Pulling the whole remote log down and only then grepping it still takes quite a while, so this can be optimized further: ssh myserver 'journalctl | grep sshd | grep "Disconnected from"' | less. Note the single quotes here: they make the filtering happen on the remote side, while the local less merely makes the result easier to scroll through; you could also save the filtered data to a local file first. The output still contains a lot of noise, though, which the sed tool can deal with.
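A few invented log lines (the message text, names, and addresses are made up for illustration) show locally what the two grep stages do to a journalctl-style stream:

```shell
# Fake journalctl-style lines; only sshd disconnect events should survive
printf '%s\n' \
  'Jan 17 sshd[1234]: Disconnected from invalid user admin 1.2.3.4 port 22 [preauth]' \
  'Jan 17 CRON[99]: pam_unix(cron:session): session opened' \
  'Jan 17 sshd[1235]: Accepted publickey for alice' |
  grep sshd |                  # keep only sshd messages
  grep "Disconnected from"     # keep only disconnect events
# -> Jan 17 sshd[1234]: Disconnected from invalid user admin 1.2.3.4 port 22 [preauth]
```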
sed
sed is a stream editor: it makes it easy to transform data as it flows through. One of its most common uses is substitution with regular expressions:
ssh myserver journalctl
| grep sshd
| grep "Disconnected from"
| sed 's/.*Disconnected from //'
The s command has the form s/REGEX/SUBSTITUTION/, i.e. the regular expression to match followed by the text to replace it with.
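A minimal standalone example of the s command (the input line is invented):

```shell
# Delete everything up to and including "Disconnected from "
echo 'Jan 17 sshd[1234]: Disconnected from invalid user admin' |
  sed 's/.*Disconnected from //'
# -> invalid user admin
```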
Regular expressions
They come up constantly in data processing. Common patterns include:
. means "any single character" except newline
* zero or more of the preceding match
+ one or more of the preceding match
[abc] any one character of a, b, and c
(RX1|RX2) either something that matches RX1 or RX2
^ the start of the line
$ the end of the line
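Each of these can be tried with grep -E on made-up strings (a line is printed only when the pattern matches it):

```shell
echo 'cat'   | grep -E 'c.t'           # .  matches any single character
echo 'color' | grep -E 'colou*r'       # *  zero or more "u"s, so both spellings match
echo 'aaa'   | grep -E 'a+'            # +  one or more "a"s
echo 'bag'   | grep -E 'b[aeiou]g'     # [abc]-style set: any one listed character
echo 'dog'   | grep -E '^(cat|dog)$'   # alternation, with ^/$ anchoring the whole line
```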
One caveat: sed cannot handle the non-greedy matching that ? adds to a quantifier (as in .*?); for that you have to switch to Perl: perl -pe 's/.*?Disconnected from //'
sed can also print matching lines (the p command), perform multiple substitutions, search, insert text (the i command), and edit files in place (the -i flag); see man sed for details. Regular expressions are genuinely tricky to get right, so an online regex debugger is a handy way to sanity-check what you have written.
Continuing with this data, we can sort the extracted usernames and inspect them:
sort
ssh myserver journalctl
| grep sshd
| grep "Disconnected from"
| sed -E 's/.*Disconnected from (invalid |authenticating )?user (.*) [^ ]+ port [0-9]+( \[preauth\])?$/\2/'
| sort | uniq -c
sort sorts its input, and uniq -c collapses runs of consecutive identical lines into a single line prefixed with the repetition count. We can then sort by that count and keep only the entries that occur most often:
ssh myserver journalctl
| grep sshd
| grep "Disconnected from"
| sed -E 's/.*Disconnected from (invalid |authenticating )?user (.*) [^ ]+ port [0-9]+( \[preauth\])?$/\2/'
| sort | uniq -c
| sort -nk1,1 | tail -n10
-n sorts numerically rather than lexicographically, and -k1,1 sorts by the first whitespace-separated field only. The sort is ascending by default; -r reverses it.
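A toy input (the names are invented) makes the sort | uniq -c | sort -nk1,1 combination concrete:

```shell
printf '%s\n' bob alice alice carol bob alice |
  sort | uniq -c |   # count occurrences of each name
  sort -nk1,1 |      # ascending by the numeric count in field 1
  tail -n2           # keep the two most frequent names
# last line is "3 alice" (uniq -c may pad the count with leading spaces)
```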
If, instead of one username per line, we want a single comma-separated line of usernames, we can process the output further:
ssh myserver journalctl
| grep sshd
| grep "Disconnected from"
| sed -E 's/.*Disconnected from (invalid |authenticating )?user (.*) [^ ]+ port [0-9]+( \[preauth\])?$/\2/'
| sort | uniq -c
| sort -nk1,1 | tail -n10
| awk '{print $2}' | paste -sd,
paste combines all input lines into one (-s), separated by the given single-character delimiter (-d, here a comma). (macOS ships the BSD variant of paste, whose behavior can differ slightly from GNU paste.)
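paste on its own, applied to a few invented names:

```shell
# -s joins all lines into one; -d, uses a comma as the separator
printf '%s\n' alice bob carol | paste -sd, -
# -> alice,bob,carol
```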
awk
awk is very well suited to processing text streams. The block in {} says what to do with each matching line; by default every line matches. $0 refers to the entire line, while $1 through $n refer to the individual fields the line is split into, separated by whitespace by default. In the command above, awk prints the second whitespace-separated field of every line, which here is the username. More complex processing is possible too, e.g. counting only the usernames that appear exactly once, start with c, and end with e: | awk '$1 == 1 && $2 ~ /^c[^ ]*e$/ { print $2 }' | wc -l. awk is really a programming language in its own right and can largely replace both grep and sed.
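The condition-plus-action form can be run on a small invented count/name table:

```shell
# Print field 2 only when field 1 equals 1 and field 2 is c...e
printf '%s\n' '1 cole' '2 cat' '1 nope' '1 candle' |
  awk '$1 == 1 && $2 ~ /^c[^ ]*e$/ { print $2 }'
# -> cole
# -> candle
```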
Other data-analysis tools
bc can do arithmetic like a calculator, e.g. | paste -sd+ | bc -l to sum a column of numbers. You can also bring in the R language, which makes complex data analysis and plotting convenient:
ssh myserver journalctl
| grep sshd
| grep "Disconnected from"
| sed -E 's/.*Disconnected from (invalid |authenticating )?user (.*) [^ ]+ port [0-9]+( \[preauth\])?$/\2/'
| sort | uniq -c
| awk '{print $1}' | R --no-echo -e 'x <- scan(file="stdin", quiet=TRUE); summary(x)'
gnuplot can also produce simple plots:
ssh myserver journalctl
| grep sshd
| grep "Disconnected from"
| sed -E 's/.*Disconnected from (invalid |authenticating )?user (.*) [^ ]+ port [0-9]+( \[preauth\])?$/\2/'
| sort | uniq -c
| sort -nk1,1 | tail -n10
| gnuplot -p -e 'set boxwidth 0.5; plot "-" using 1:xtic(2) with boxes'
Additionally, xargs can glue commands together, acting as a filter that turns its input into arguments for another command; many commands do not accept arguments via a pipe, and that is where xargs comes in. Note also that the shell can handle binary data such as images, not just text.
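A minimal xargs illustration: the lines on stdin become command-line arguments of echo (the input values are arbitrary):

```shell
printf '%s\n' a b c | xargs echo got
# -> got a b c
```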
Exercises
1. Complete the regex tutorial.
2. Find the number of words (in /usr/share/dict/words) that contain at least three as and don't have a 's ending. What are the three most common last two letters of those words? sed's y command, or the tr program, may help you with case insensitivity. How many of those two-letter combinations are there? And for a challenge: which combinations do not occur?
3. To do in-place substitution it is quite tempting to do something like sed s/REGEX/SUBSTITUTION/ input.txt > input.txt. However this is a bad idea, why? Is this particular to sed? Use man sed to find out how to accomplish this.
4. Find your average, median, and max system boot time over the last ten boots. Use journalctl on Linux and log show on macOS, and look for log timestamps near the beginning and end of each boot.
5. Look for boot messages that are not shared between your past three reboots (see journalctl's -b flag). Break this task down into multiple steps. First, find a way to get just the logs from the past three boots. There may be an applicable flag on the tool you use to extract the boot logs, or you can use sed '0,/STRING/d' to remove all lines previous to one that matches STRING. Next, remove any parts of the line that always varies (like the timestamp). Then, de-duplicate the input lines and keep a count of each one (uniq is your friend). And finally, eliminate any line whose count is 3 (since it was shared among all the boots).
6. Find an online data set. Fetch it using curl and extract out just two columns of numerical data. If you're fetching HTML data, pup might be helpful. For JSON data, try jq. Find the min and max of one column in a single command, and the difference of the sum of each column in another.