[Embedded] Analyzing Malware Samples with AI + Big Data (Part 34)

DisCo: Combining Disassemblers for Improved Performance

RAID 2021

Source code: https://github.com/gsrishaila/DisCo-Combining-Disassemblers-for-Improved-Performance/tree/main/SourceCode

abstract

Malware infects thousands of systems globally each day causing millions of dollars in damages.

Which disassembler should a malware analyst choose in order to get the most accurate disassembly and be able to detect, analyze and defuse malware quickly?

There is no clear answer to this question: (a) the performance of disassemblers varies across configurations, and (b) most prior work on disassemblers focuses on benign software and the x86 CPU architecture.

In this work, we take a different approach and ask: why not use all the disassemblers instead of picking one?

We present DisCo, a novel and effective approach to harness the collective capability of a group of disassemblers combining their output into an ensemble consensus.

We develop and evaluate our approach using 1760 IoT malware binaries compiled with different compilers and compiler options for the ARM and MIPS architectures.

First, we show that DisCo can combine the collective wisdom of disassemblers effectively.

For example, our approach outperforms the best contributing disassembler by as much as 17.8 in the F1 score for function start identification for MIPS binaries compiled using GCC with O3 option.

Second, the collective wisdom of the disassemblers can be brought back to improve each disassembler.

As a proof of concept, we show that byte-level signatures identified by DisCo can improve the performance of Ghidra by as much as 13.6 in terms of the F1 score.

Third, we quantify the effect of the architecture, the compiler, and the compiler options on the performance of disassemblers.

Finally, the systematic evaluation within our approach led to a bug discovery in Ghidra v9.1, which was acknowledged by the Ghidra team.

introduction

This paper is mainly about combining multiple disassemblers to improve disassembly accuracy.

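Binary disassembly is an essential tool in malware defense. When the WannaCry and Petya ransomware swept the world in 2017, malware analysts needed to quickly understand their propagation mechanisms and modes of operation in order to contain them.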

Which disassembler should a malware analyst choose for a rapidly-spreading malware binary to get the most accurate results? This is the question that motivates our work.

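The paper focuses on malware for the MIPS and ARM architectures. Disassembler performance varies with the type of binary, and a binary can be produced in different ways depending on:

Compiler
Compiler optimization flags
Target CPU architecture

These variations all lead to significant differences in the assembly code of the binary.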

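The authors propose an effective method for combining disassemblers:

Evaluate the effectiveness of each disassembler to create training data: compile the malware source code under various configurations and compare each disassembler's output against the ground truth
Create and train a machine learning method that turns the individual outputs into one combined output: a neural network is used to build a stacked ensemble that takes as input the output of each disassembler plus data selected from the actual binary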

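The authors consider two architectures, MIPS and ARM, two compilers, GCC and Clang, and five compiler optimization levels. They focus on the function start identification metric, a key disassembly metric.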

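Correctly identified function starts (CFS)

Instruction identification and function start identification are considered the two fundamental metrics for evaluating a disassembler, because the other outputs, such as the control flow graph and the call graph, are derived from them.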
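As a minimal sketch of how function start identification can be scored, the snippet below computes precision, recall and F1 from a set of predicted start addresses and a set of ground-truth addresses. The function name `function_start_scores` and the example addresses are illustrative assumptions, not taken from the paper or its code; only the standard F1 definition is relied on.

```python
def function_start_scores(predicted, ground_truth):
    """Return (precision, recall, f1) for predicted function start addresses."""
    predicted, ground_truth = set(predicted), set(ground_truth)
    # Correctly identified function starts (CFS): predicted AND in the ground truth.
    correct = len(predicted & ground_truth)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(ground_truth) if ground_truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


# Hypothetical addresses, for illustration only.
truth = {0x1000, 0x1040, 0x10A0, 0x1100}
found = {0x1000, 0x1040, 0x2000}

precision, recall, f1 = function_start_scores(found, truth)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Here two of the three reported starts are real and two of the four real starts are found, so a disassembler can look precise while still missing half of the functions.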

Training and evaluation use 1760 IoT binaries, compiled from 88 IoT programs under various configuration options.

The authors consider five baseline disassemblers:

Angr
IDA
Ghidra
BAP
Radare2

Five optimization levels: O0, O1, O2, O3 and Os
Two architectures: ARM 5 and MIPS R3000
Compilers: GCC 5.5.0 and Clang 9.0
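The configuration space above is small enough to enumerate. The sketch below assumes every one of the 88 programs is compiled under every architecture, compiler and optimization-level combination; that assumption is not stated explicitly here, but it matches the total of 1760 binaries. The variable names are illustrative.

```python
from itertools import product

ARCHITECTURES = ["ARM", "MIPS"]
COMPILERS = ["GCC", "Clang"]
OPT_LEVELS = ["O0", "O1", "O2", "O3", "Os"]
NUM_PROGRAMS = 88  # IoT source programs

# Every (architecture, compiler, optimization level) combination.
configs = list(product(ARCHITECTURES, COMPILERS, OPT_LEVELS))
total_binaries = NUM_PROGRAMS * len(configs)

print(len(configs), total_binaries)  # 20 1760
```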

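Stripped vs. non-stripped binaries: a compiled binary can contain debugging information that is not needed to run the program but is used to debug it and locate problems or errors. A stripped binary has had these debug symbols removed, so it is smaller and harder to disassemble.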

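The figure gives a motivating example: IDA identifies 241 additional true functions and Ghidra identifies 352 additional true functions, so combining the two effectively can improve performance.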

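The intuition is that different disassemblers should have complementary capabilities, because they use different algorithms to identify the structure of a binary.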

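Disassemblers can see different things, so combining the baseline disassemblers is beneficial if each one recovers a different part of the binary.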

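Results should be merged with care: a simple union does not necessarily give the best performance. Majority voting also leads to low recall, because some function starts are identified by only a minority of the disassemblers.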
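The trade-off between the two naive merging strategies can be illustrated with a toy example. Everything below is hypothetical: the per-tool outputs and addresses are made up, and `union_vote` and `majority_vote` are simple baselines written for this sketch, not the learned ensemble the paper proposes.

```python
from collections import Counter


def union_vote(outputs):
    """Accept an address if any disassembler reports it."""
    return set().union(*outputs.values())


def majority_vote(outputs):
    """Accept an address only if more than half of the disassemblers report it."""
    counts = Counter(addr for starts in outputs.values() for addr in starts)
    needed = len(outputs) // 2 + 1
    return {addr for addr, n in counts.items() if n >= needed}


def precision_recall(predicted, truth):
    correct = len(predicted & truth)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical ground truth and per-tool function starts.
truth = {0x1000, 0x1040, 0x1080, 0x10C0}
outputs = {
    "angr": {0x1000, 0x1040, 0x2000},
    "IDA": {0x1000, 0x1040, 0x1080},
    "Ghidra": {0x1000, 0x1040, 0x10C0, 0x3000},
    "BAP": {0x1000},
    "Radare2": {0x1000, 0x1040},
}

union = union_vote(outputs)
majority = majority_vote(outputs)
print("union   ", precision_recall(union, truth))
print("majority", precision_recall(majority, truth))
```

In this toy data the union finds every true start but also keeps both false positives, while the majority vote keeps only the two starts that most tools agree on and drops the two that only IDA or only Ghidra found. That gap is the motivation for learning how to weigh the votes.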

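Creating the ground truth: start from the source code and compile with -g so that richer debugging information is attached to the resulting binary, then use a DWARF library to identify the function start addresses. These addresses form the ground truth.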

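Input:

ARM: four instructions before and after the function start, i.e. 16 bytes
MIPS: two instructions after the function start, i.e. 8 bytes

Plus one vote from each disassembler, taking 5 positions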
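One plausible way to turn this input into a fixed-width vector is sketched below for the 8-byte MIPS case. It assumes each byte is one-hot encoded over 256 values and that the five votes are appended as 0/1 flags, which is consistent with the `8*256 +5 = 2053` comment in the model code but is otherwise an assumption; the function name, the vote order and the example bytes are illustrative.

```python
# Assumed vote order; the actual order used by the authors is not given here.
DISASSEMBLERS = ["angr", "IDA", "Ghidra", "BAP", "Radare2"]


def encode_candidate(raw_bytes, votes):
    """One-hot encode each byte (256 slots), then append one 0/1 vote per tool."""
    features = []
    for value in raw_bytes:
        one_hot = [0] * 256
        one_hot[value] = 1
        features.extend(one_hot)
    features.extend(1 if votes.get(name, False) else 0 for name in DISASSEMBLERS)
    return features


# Illustrative 8-byte window at a candidate MIPS function start.
mips_window = bytes.fromhex("27bdffe0afbf001c")
vector = encode_candidate(mips_window, {"IDA": True, "Ghidra": True})

print(len(vector))   # 8 * 256 + 5 = 2053
print(vector[-5:])   # the five vote flags
```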

Model configuration:

    from keras.models import Sequential
    from keras.layers import Dense

    model = Sequential()
    # Input width: 8 bytes * 256 + 5 disassembler votes = 2053
    model.add(Dense(2053, input_dim=2053, activation='relu'))
    model.add(Dense(1000, activation='relu'))  # extra hidden layer
    model.add(Dense(250, activation='relu'))   # extra hidden layer
    # model.add(Dense(60, activation='relu'))  # disabled extra layer
    # Single sigmoid output for the binary decision
    model.add(Dense(1, activation='sigmoid'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

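Results:

Takeaways

The core idea is small: use a neural network to combine five different disassemblers, building a ground truth to train the model and make predictions. The result is better than any single disassembler. But this single point is clearly not enough, which is why it takes up only 5 pages. The authors then devote detailed space to the observations and findings that follow from this work, which is very important. One could even say that without the follow-up analysis, this paper could not have been published at a B-ranked conference.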

Added: 2021-08-13 12:17:17 Updated: 2021-08-13 12:26:19
