[System Operations] [Missing Semester L4] Data format conversion on Linux


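These lecture notes are mainly about converting data from one form into another. This comes up most often when reading logs: you chain pipes and other tools until the data ends up in the format you want.

Log processing example

journalctl queries the logs collected by the systemd-journald service, the log-collection service provided by the systemd init system. Its raw output usually contains far too many entries, so it generally needs further filtering to pull out the information you need.
For example: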
ssh myserver journalctl | grep sshd
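However, pulling the entire remote log over the network and only then running grep still takes quite a while. This can be optimized further: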
ssh myserver 'journalctl | grep sshd | grep "Disconnected from"' | less
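Note the single quotes: they make the filtering run on the remote machine, while the local less just makes the output easier to scroll through. You can also save the filtered data to a local file first.
This still leaves a lot of noise, which is where sed comes in.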

sed

sed is a stream editor that makes it easy to manipulate data. One of its most common uses is substitution based on regular expressions.

ssh myserver journalctl |
  grep sshd |
  grep "Disconnected from" |
  sed 's/.*Disconnected from //'

The s command takes the form s/REGEX/SUBSTITUTION/, that is, a regular expression followed by the text that replaces each match.
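As a minimal sketch of the s command, the substitution can be tried locally on a single line. The log line below is invented for illustration (host name, user, and address are placeholders), not real journalctl output.

```shell
# Hypothetical log line, made up for this example.
line='Jan 17 03:13:00 myserver sshd[2631]: Disconnected from invalid user admin 203.0.113.7 port 55920 [preauth]'

# s/REGEX/SUBSTITUTION/: replace everything up to and including
# "Disconnected from " with the empty string, keeping the rest of the line.
tail_part=$(printf '%s\n' "$line" | sed 's/.*Disconnected from //')
printf '%s\n' "$tail_part"
```

Only the part of the line after "Disconnected from " is left, which is the piece the later steps work on.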

Regular expressions

These are used a lot in data processing. Common patterns include:

. means “any single character” except newline
* zero or more of the preceding match
+ one or more of the preceding match
[abc] any one character of a, b, and c
(RX1|RX2) either something that matches RX1 or RX2
^ the start of the line
$ the end of the line
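The patterns in the list above can be tried out with grep -E on a handful of words. This is only a sketch: the word list and the helper name match are made up for the example.

```shell
# Hypothetical helper: print the words from a fixed list that match
# the extended regular expression in $1, joined on one line.
match() {
  printf '%s\n' cat cot coat ct dog | grep -E "$1" | xargs
}

match '^c.t$'        # . is exactly one character: cat cot
match '^co*t$'       # * is zero or more of the preceding o: cot ct
match '^co+t$'       # + is one or more of the preceding o: cot
match '^c[ao]t$'     # [ao] is one character, either a or o: cat cot
match '^(cat|dog)$'  # (RX1|RX2) is either alternative: cat dog
```

The ^ and $ anchors force the pattern to cover the whole line; without them, grep reports any line that contains a match somewhere.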

However, sed does not support the non-greedy matching requested by a ? suffix on * or +, so for that you need to switch to perl:
perl -pe 's/.*?Disconnected from //'

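sed can also print lines explicitly (the p command), perform multiple substitutions, search, insert text (the i command), and more; see man sed for details. The -i option, by contrast, edits files in place.
Regular expressions really are complicated and hard to write, so an online regex tester is a good way to sanity-check a pattern first.

Continuing with the data processing above, the extracted login usernames can be sorted and inspected.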

sort

ssh myserver journalctl |
  grep sshd |
  grep "Disconnected from" |
  sed -E 's/.*Disconnected from (invalid |authenticating )?user (.*) [^ ]+ port [0-9]+( \[preauth\])?$/\2/' |
  sort | uniq -c

sort sorts its input, and uniq -c collapses consecutive identical lines into a single line prefixed with the number of occurrences.
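The extraction and counting steps can be tried without a server. The sketch below feeds a few invented log lines (the function sample_log and every host, user, and address in it are placeholders) through the same sed expression, then counts the usernames. The trailing awk is an addition that strips the padding uniq -c puts in front of each count, which differs between implementations.

```shell
# Made-up lines standing in for the filtered journalctl output.
sample_log() {
  cat <<'EOF'
Jan 17 03:13:00 myserver sshd[2631]: Disconnected from invalid user admin 203.0.113.7 port 55920 [preauth]
Jan 17 03:14:10 myserver sshd[2640]: Disconnected from authenticating user root 203.0.113.9 port 40112 [preauth]
Jan 17 03:15:22 myserver sshd[2655]: Disconnected from authenticating user root 203.0.113.9 port 40120 [preauth]
Jan 17 03:16:05 myserver sshd[2661]: Disconnected from user alice 198.51.100.4 port 50022
Jan 17 03:17:41 myserver sshd[2670]: Disconnected from invalid user admin 203.0.113.7 port 55931 [preauth]
Jan 17 03:18:59 myserver sshd[2682]: Disconnected from authenticating user root 203.0.113.9 port 40133 [preauth]
EOF
}

# \2 is the second capture group, i.e. the username.
# sort groups identical names together so uniq -c can count them.
counts=$(sample_log |
  sed -E 's/.*Disconnected from (invalid |authenticating )?user (.*) [^ ]+ port [0-9]+( \[preauth\])?$/\2/' |
  sort | uniq -c |
  awk '{print $1, $2}')
printf '%s\n' "$counts"
```

Without the sort, uniq -c would only merge names that happen to be on adjacent lines, so the same user could show up several times.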
The counts can then be sorted in turn to keep only the most frequent usernames.

ssh myserver journalctl |
  grep sshd |
  grep "Disconnected from" |
  sed -E 's/.*Disconnected from (invalid |authenticating )?user (.*) [^ ]+ port [0-9]+( \[preauth\])?$/\2/' |
  sort | uniq -c |
  sort -nk1,1 | tail -n10

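-n sorts numerically instead of lexicographically, and -k1,1 sorts by the first whitespace-separated column only (not the first character). The sort is ascending, so tail -n10 keeps the ten largest counts; -r reverses the order.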
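A small sketch makes the effect of these flags visible. The counts and names are invented, and LC_ALL=C is set on the non-numeric sort only to make its byte-wise ordering predictable.

```shell
# Hypothetical "count name" lines, as uniq -c would produce them.
counts=$(printf '%s\n' '10 root' '9 admin' '2 guest' '100 test')

# Without -n the first column is compared as text, so "100" sorts before "2".
lexical=$(printf '%s\n' "$counts" | LC_ALL=C sort -k1,1 | awk '{print $1}' | xargs)

# With -n the first column is compared as a number.
numeric=$(printf '%s\n' "$counts" | sort -nk1,1 | awk '{print $1}' | xargs)

# Ascending sort plus tail keeps the largest counts ...
top_asc=$(printf '%s\n' "$counts" | sort -nk1,1 | tail -n2 | awk '{print $2}' | xargs)

# ... while -r with head yields the same entries, largest first.
top_desc=$(printf '%s\n' "$counts" | sort -rnk1,1 | head -n2 | awk '{print $2}' | xargs)

printf '%s\n' "$lexical" "$numeric" "$top_asc" "$top_desc"
```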

To get the usernames as a single comma-separated line instead of one per line, process the output further:

ssh myserver journalctl |
  grep sshd |
  grep "Disconnected from" |
  sed -E 's/.*Disconnected from (invalid |authenticating )?user (.*) [^ ]+ port [0-9]+( \[preauth\])?$/\2/' |
  sort | uniq -c |
  sort -nk1,1 | tail -n10 |
  awk '{print $2}' | paste -sd,

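paste joins the lines into one (-s), separated by commas (-d,). Note that the command as written does not work with the paste shipped with macOS.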
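The last two stages can be checked on invented input. The sketch below passes an explicit - operand to paste, which tells it to read standard input; this is an assumption about portability rather than something the original command does, based on POSIX describing paste as taking file operands.

```shell
# Hypothetical "count name" lines, padded the way uniq -c pads them.
counts=$(printf '%s\n' '   2 guest' '   9 admin' '  10 root')

# awk picks the second field (the name); paste -s joins all lines into one,
# and -d, uses a comma as the separator. "-" means "read from stdin".
csv=$(printf '%s\n' "$counts" | awk '{print $2}' | paste -sd, -)
printf '%s\n' "$csv"
```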

awk

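awk is very well suited to processing text streams. The block in {} says what to do with each matching line, and by default the pattern matches every line. $0 refers to the whole line, and $1 through $n refer to the individual fields it is split into, separated by whitespace by default.
In the command above, that means printing the second whitespace-separated field of each line, which here is the username.
More complex processing is possible too, such as counting the usernames that appear only once, start with c, and end with e:
| awk '$1 == 1 && $2 ~ /^c[^ ]*e$/ { print $2 }' | wc -l
awk is really a programming language in its own right, and it can replace most uses of grep and sed.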
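The pattern-plus-action form can be exercised on a few invented "count name" lines. The second command is a sketch of how awk can do the counting itself instead of piping into wc -l; the variable name rows is arbitrary.

```shell
# Hypothetical output of sort | uniq -c: a count followed by a username.
counts=$(printf '%s\n' '1 charlie' '3 claire' '1 carol' '1 chloe' '1 eve')

# Pattern: count is 1 AND the name starts with c and ends with e.
# Action: print the name.
names=$(printf '%s\n' "$counts" |
  awk '$1 == 1 && $2 ~ /^c[^ ]*e$/ { print $2 }' | xargs)

# Same pattern, but count the matching lines inside awk.
# BEGIN runs before any input, END after the last line.
total=$(printf '%s\n' "$counts" |
  awk 'BEGIN { rows = 0 } $1 == 1 && $2 ~ /^c[^ ]*e$/ { rows += 1 } END { print rows }')

printf '%s\n' "$names" "$total"
```

Here claire is excluded because its count is 3, carol because it does not end with e, and eve because it does not start with c.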

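Other data analysis tools

bc does arithmetic like a calculator. For example, to add up the numbers on each line, join them with + and let bc evaluate the result: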
| paste -sd+ | bc -l
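Splitting the fragment above into two steps shows what bc actually receives. The numbers are invented, the explicit - operand for paste is an assumption made for portability, and because bc is not installed on every system the sketch checks for it first.

```shell
# Join one number per line into a single arithmetic expression.
expr_text=$(printf '%s\n' 2 9 10 | paste -sd+ -)
printf '%s\n' "$expr_text"

# Let bc evaluate the expression, if bc is available.
if command -v bc >/dev/null 2>&1; then
  total=$(printf '%s\n' "$expr_text" | bc -l)
  printf '%s\n' "$total"
fi
```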
R can be used as well; it makes complex data analysis and plotting convenient.

ssh myserver journalctl |
  grep sshd |
  grep "Disconnected from" |
  sed -E 's/.*Disconnected from (invalid |authenticating )?user (.*) [^ ]+ port [0-9]+( \[preauth\])?$/\2/' |
  sort | uniq -c |
  awk '{print $1}' | R --no-echo -e 'x <- scan(file="stdin", quiet=TRUE); summary(x)'

gnuplot can also draw simple plots:

ssh myserver journalctl |
  grep sshd |
  grep "Disconnected from" |
  sed -E 's/.*Disconnected from (invalid |authenticating )?user (.*) [^ ]+ port [0-9]+( \[preauth\])?$/\2/' |
  sort | uniq -c |
  sort -nk1,1 | tail -n10 |
  gnuplot -p -e 'set boxwidth 0.5; plot "-" using 1:xtic(2) with boxes'

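In addition, xargs can be used to combine commands, acting as a filter that passes arguments to another command. Many commands do not accept their arguments through a pipe, and that is when xargs is needed.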
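rm is one such command: it takes file names as arguments and ignores standard input. The sketch below works in a throwaway directory created with mktemp, and the file names are made up.

```shell
# Scratch directory with two log files and one file to keep.
workdir=$(mktemp -d)
touch "$workdir/a.log" "$workdir/b.log" "$workdir/keep.txt"

# find prints one path per line; xargs turns those lines into
# arguments, so this ends up running: rm <dir>/a.log <dir>/b.log
# (Paths containing spaces would need find -print0 with xargs -0.)
find "$workdir" -name '*.log' | xargs rm

remaining=$(ls "$workdir")
printf '%s\n' "$remaining"

# Clean up the scratch directory.
rm -rf "$workdir"
```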
The shell can also process binary data, such as images.

Exercises

1. Complete the regex tutorial.
2. Find the number of words (in /usr/share/dict/words) that contain at least three a's and don’t have a 's ending. What are the three most common last two letters of those words? sed’s y command, or the tr program, may help you with case insensitivity. How many of those two-letter combinations are there? And for a challenge: which combinations do not occur?
3. To do in-place substitution it is quite tempting to do something like sed s/REGEX/SUBSTITUTION/ input.txt > input.txt. However this is a bad idea, why? Is this particular to sed? Use man sed to find out how to accomplish this.

4. Find your average, median, and max system boot time over the last ten boots. Use journalctl on Linux and log show on macOS, and look for log timestamps near the beginning and end of each boot. On Linux, they may look something like:
5. Look for boot messages that are not shared between your past three reboots (see journalctl’s -b flag). Break this task down into multiple steps. First, find a way to get just the logs from the past three boots. There may be an applicable flag on the tool you use to extract the boot logs, or you can use sed '0,/STRING/d' to remove all lines previous to one that matches STRING. Next, remove any parts of the line that always vary (like the timestamp). Then, de-duplicate the input lines and keep a count of each one (uniq is your friend). And finally, eliminate any line whose count is 3 (since it was shared among all the boots).
6. Find an online data set like this one, this one, or maybe one from here. Fetch it using curl and extract out just two columns of numerical data. If you’re fetching HTML data, pup might be helpful. For JSON data, try jq. Find the min and max of one column in a single command, and the difference of the sum of each column in another.

Added: 2022-03-13 22:13:16 Updated: 2022-03-13 22:15:47
