[Artificial Intelligence] CellPAD (machine-learning anomaly detection) code walkthrough: notes for my own review


First, the GitHub link to the paper's code: https://github.com/XuJiaxin000/CellPAD#cellpad-detecting-performance-anomalies-in-cellular-networks-via-regression-analysis

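I have recently been studying regression-based anomaly detection built on statistical modeling and machine learning. I had already put together a basic framework, but it was judged too crude, so I am going back over the CellPAD code. This post only covers the detection of Sudden_Drop anomalies.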

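Start by opening example_drop.py. The test() function is easy to follow: it reads the KPI, injects anomalies, detects Sudden_Drop anomalies, and evaluates performance. Reading the KPI is not covered here.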

Injecting anomalies:

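Anomalies are injected with the syn_drop() function of the DropSynthesiser class. Its parameters are as follows: point_fraction is the fraction of point anomalies in the data set; lowest_drop_ratio is the minimum ratio by which an anomalous value suddenly drops; segment_cnt is the number of continuous (segment) anomalies; shortest_period and longest_period are the shortest and longest durations of a continuous anomaly, counted in data points.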

add_point_anomalies() injects point anomalies, add_segment_anomalies() injects continuous anomalies, and filter_by_rule() labels the obvious (sudden-drop) anomalies that are already present in the data set.

That completes the anomaly-injection step.
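To make the role of point_fraction and lowest_drop_ratio concrete, here is a minimal sketch of point-anomaly injection. It is not the code of syn_drop(): the function name inject_point_drops, the seed parameter, and the uniform choice of the drop size are all assumptions made for illustration.

```python
import random


def inject_point_drops(series, point_fraction, lowest_drop_ratio, seed=0):
    """Hypothetical helper: scale down a random fraction of the points.

    Returns the modified series and a list of boolean anomaly labels.
    """
    rng = random.Random(seed)
    n = len(series)
    count = int(n * point_fraction)
    positions = rng.sample(range(n), count)
    new_series = list(series)
    labels = [False] * n
    for pos in positions:
        # Assumption: each anomalous point loses at least
        # `lowest_drop_ratio` of its original value.
        drop = rng.uniform(lowest_drop_ratio, 1.0)
        new_series[pos] = series[pos] * (1.0 - drop)
        labels[pos] = True
    return new_series, labels


series = [100.0] * 50
syn_series, syn_labels = inject_point_drops(
    series, point_fraction=0.1, lowest_drop_ratio=0.3)
```

With 50 points and point_fraction=0.1, five points are marked anomalous, and each of them ends up at 70% of its original value or lower.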

Training the model:

This is the key part of the whole paper, and also the hardest part of the code.

    controller = DropController(timestamps=timestamps,
                                series=syn_series,
                                period_len=168,
                                feature_types=["Indexical", "Numerical"],
                                feature_time_grain=["Weekly"],
                                feature_operations=["Wma", "Ewma", "Mean", "Median"],
                                bootstrap_period_cnt=2,
                                to_remove_trend=True,
                                trend_remove_method="center_mean",
                                anomaly_filter_method="gauss",
                                anomaly_filter_coefficient=3.0)

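timestamps and series need no explanation: they are the timestamps and the original KPI series. period_len is the period of the time series. The remaining parameters are analyzed in detail below: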

In controller.py, the raw data is preprocessed first:

# the dict of the attributes of the time series.
        dict_series = {}
        if not to_remove_trend:
            dict_series["detected_series"] = np.array(series)
        else:
            preprocessor = Preprocessor()
            dict_series["detected_series"] = preprocessor.remove_trend(series, period_len, method=trend_remove_method)
        dict_series["timestamps"] = timestamps
        dict_series["series_len"] = len(dict_series["detected_series"])
        dict_series["period_len"] = period_len
        self.dict_series = dict_series

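to_remove_trend controls whether the trend is removed, and trend_remove_method selects the detrending method. The methods are center_mean and past_mean, and the trend can be removed either by division or by subtraction. center_mean takes the mean of one period of data around the current point (before and after it); past_mean takes the mean of the past period of data.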
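The difference between center_mean and past_mean is easier to see in code. The following is only a sketch under stated assumptions, not CellPAD's Preprocessor.remove_trend(): the helper names, the mode argument, the inclusion of the current point in both windows, and the clipping at the series boundaries are my own choices.

```python
def window_mean(series, start, end):
    """Mean of series[start:end], clipped to the valid index range."""
    window = series[max(start, 0):min(end, len(series))]
    return sum(window) / len(window)


def remove_trend(series, period_len, method="center_mean", mode="divide"):
    """Hypothetical detrending helper (illustration only)."""
    half = period_len // 2
    detrended = []
    for i, value in enumerate(series):
        if method == "center_mean":
            # About one period of data centered on the current point.
            trend = window_mean(series, i - half, i + half + 1)
        elif method == "past_mean":
            # The most recent period of data, ending at the current point.
            trend = window_mean(series, i - period_len + 1, i + 1)
        else:
            raise ValueError(f"unknown method: {method}")
        if mode == "divide":
            detrended.append(value / trend if trend != 0 else 0.0)
        else:
            detrended.append(value - trend)
    return detrended


flat = remove_trend([5.0] * 10, period_len=4, mode="divide")
ramp = [float(i) for i in range(10)]
ramp_center = remove_trend(ramp, period_len=4, mode="subtract")
ramp_past = remove_trend(ramp, period_len=4, method="past_mean", mode="subtract")
```

On a linear ramp the centered window cancels the trend at interior points, while the past window lags behind it, which is why the two methods give different residuals.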

Second, feature engineering: the feature dictionary dict_feature is created:

        # the dict of features related variables.
        dict_feature = {}
        dict_feature["operations"] = feature_operations
        dict_feature["time_grain"] = feature_time_grain
        dict_feature["feature_types"] = feature_types
        dict_feature["feature_tool"] = FeatureTools()
        dict_feature["feature_list"] = dict_feature["feature_tool"].set_feature_names(
            dict_feature["feature_types"],
            dict_feature["time_grain"],
            dict_feature["operations"])

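feature_operations lists the operations used to derive features, that is, how the data is aggregated: Mean is the average, Wma is the weighted moving average, Ewma is the exponentially weighted moving average, and Median is the median.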
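As a quick reference, the four operations can be sketched on a small window of values ordered from oldest to newest. The linear weights in wma() and the smoothing factor alpha in ewma() are assumptions; CellPAD may weight the values differently.

```python
from statistics import mean, median


def wma(values):
    """Weighted moving average with linearly increasing weights (assumed)."""
    weights = range(1, len(values) + 1)
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def ewma(values, alpha=0.5):
    """Exponentially weighted moving average; alpha is an assumed default."""
    smoothed = values[0]
    for v in values[1:]:
        smoothed = alpha * v + (1 - alpha) * smoothed
    return smoothed


window = [1.0, 2.0, 3.0]
features = {
    "Wma": wma(window),
    "Ewma": ewma(window),
    "Mean": mean(window),
    "Median": median(window),
}
```

Both weighted variants lean towards the most recent value, so on an increasing window they land above the plain mean.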

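feature_time_grain is the time granularity of the features. Granularity is the smallest unit of time being managed; put concretely, it is the step at which data points are picked, i.e. the sampling frequency.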

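feature_types gives the kinds of features, of which there are two: Indexical and Numerical. Indexical features are derived from the index, i.e. the timestamps; Numerical features are derived from the values, i.e. the KPI.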

The FeatureTools class lives in feature.py, so let us step into it:

Go straight to the set_feature_names() function:

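If the feature type is Indexical, the timestamps are processed. With feature_time_grain=Weekly, data is picked once a week, so a week of data has both day and hour features; with feature_time_grain=Day, data is picked once a day, so a day of data has only the hour feature. The feature names are stored in feature_list.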

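If the feature type is Numerical, the KPI is processed. With operation=Raw, the KPI is used directly as a feature without any processing; otherwise a feature named win_(feature_time_grain)_operation is built from the window length win, feature_time_grain, and operation. For example, 2_Weekly_Mean is the mean of 2 data points taken at weekly granularity.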

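In this example feature_types=["Indexical", "Numerical"], feature_time_grain=["Weekly"], and feature_operations=["Wma", "Ewma", "Mean", "Median"]. The extracted features are Hour, Day, and [3,5,7,10,13]_Weekly_["Wma","Ewma","Mean","Median"], i.e. every combination of the five window lengths with the four operations.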
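The naming scheme can be reproduced with a few lines. This is a sketch of the behaviour described above, not the body of FeatureTools.set_feature_names(); in particular the windows argument and the ordering of the names are assumptions.

```python
def set_feature_names(feature_types, time_grains, operations,
                      windows=(3, 5, 7, 10, 13)):
    """Hypothetical re-implementation of the feature naming rules."""
    names = []
    if "Indexical" in feature_types:
        # Hour is always available; Day only makes sense for weekly grain.
        names.append("Hour")
        if "Weekly" in time_grains:
            names.append("Day")
    if "Numerical" in feature_types:
        if "Raw" in operations:
            names.append("Raw")
        for grain in time_grains:
            for op in operations:
                if op == "Raw":
                    continue
                for win in windows:
                    names.append(f"{win}_{grain}_{op}")
    return names


feature_list = set_feature_names(
    feature_types=["Indexical", "Numerical"],
    time_grains=["Weekly"],
    operations=["Wma", "Ewma", "Mean", "Median"],
)
```

For the parameters used in this post the list contains 2 indexical names plus 5 x 4 numerical names, 22 in total.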

feature_list is built up in this way.

Third, a bootstrap training set is taken from the data set and described by the dict_bootstrap dictionary. (I am not sure what exactly it is for.)

 # the dict of the bootstrap parameters.
        dict_bootstrap = {}
        dict_bootstrap["period_cnt"] = bootstrap_period_cnt
        dict_bootstrap["bootstrap_series_len"] = bootstrap_period_cnt * period_len
        dict_bootstrap["bootstrap_series"] = self.dict_series["detected_series"][:dict_bootstrap["bootstrap_series_len"]]
        self.dict_bootstrap = dict_bootstrap

Here bootstrap_period_cnt is the number of periods of data used for bootstrap training, so the total amount of data is bootstrap_period_cnt * period_len points, taken from the start of the data set.

Then the dictionary of anomaly-filter parameters is created.

# the dict of anomaly filter parameters.
        dict_filter = {}
        dict_filter["method"] = anomaly_filter_method
        dict_filter["coefficient"] = anomaly_filter_coefficient
        self.dict_filter = dict_filter

anomaly_filter_method is the filtering method. This example uses the Gaussian method; its implementation is covered later.

anomaly_filter_coefficient is the boundary value of the confidence interval.

Finally, the dictionary that stores the training data is created.

# the dict of the storage for training data.
        dict_storage = {}
        dict_storage["normal_features_matrix"] = pd.DataFrame()
        dict_storage["normal_response_series"] = []
        self.dict_storage = dict_storage

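Here normal_features_matrix is the feature matrix of the normal data, used as the model input, and normal_response_series is the output for the normal data. In regression-based anomaly detection the output is the corresponding KPI value.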

At this point the initialization of DropController is complete.

Next comes anomaly detection.

controller.detect(predictor="RF")

Go back to controller.py and find the detect() function.

This post uses random forest (RF) as the example, so detect() goes on to call __detect_by_regression().

        if predictor == "RT" or predictor == "RF" or predictor == "SLR" or predictor == "HR":
            self.__detect_by_regression(predictor=predictor)

__detect_by_regression() has a default parameter n_estimators=100, which for a random forest is the number of trees.

It first calls self.__init_bootstrap() to initialize the bootstrap training set. This initialization function creates the dict_result dictionary, which stores drop_ratios, drop_scores, drop_labels, and predicted_series.

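Second, RegressionPredictor(predictor) is called to build the regression model, model. It is defined in algorithm.py, where self.reg = RandomForestRegressor(n_estimators=100, criterion="mse") creates the regressor with 100 trees (n_estimators) and mean squared error (criterion="mse") as the split criterion. Recent scikit-learn releases spell this criterion "squared_error".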

Third, features are extracted. Below is the feature extraction for the bootstrap training set:

first_train_features = self.dict_feature["feature_tool"].compute_feature_matrix(
                                     timestamps=self.dict_series["timestamps"],
                                     series=self.dict_bootstrap["bootstrap_series"],
                                     labels=[False] * self.dict_bootstrap["bootstrap_series_len"],
                                     ts_period_len=self.dict_series["period_len"],
                                     feature_list=self.dict_feature["feature_list"],
                                     start_pos=0,
                                     end_pos=self.dict_bootstrap["bootstrap_series_len"])

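start_pos is the start position, 0, and end_pos is the end position, the length of the training series, i.e. one past the last data point. The call goes to self.dict_feature["feature_tool"].compute_feature_matrix() in feature.py, which builds the feature matrix.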

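What is fed in at once here is one bootstrap, i.e. two periods or 168*2 data points. Control passes to compute_features() of the FeatureExtractor class, which iterates over feature_list and calls self.compute_one_feature(feature_name, start_pos, end_pos) to extract each feature vector. Inside that function: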

If feature_name is Hour or Day, the corresponding timestamp component is appended to feature_values and returned as a feature.

If feature_name is Raw, the original KPI is returned as a feature.

If feature_name has the form 3_Weekly_Mean, every data point is visited and the following functions are called:

feature_period_len = self.compute_feature_period_len(period_grain) 
vs = self.get_sametime_instances(current_index=idx,
                                 feature_period_len=feature_period_len,
                                 ts_period_len=self.ts_period_len,
                                 instance_count=win)

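In self.compute_feature_period_len(period_grain), time_delta is the time difference between adjacent data points, and weekly_time_delta/time_delta is 7*24. The function therefore returns feature_period_len, the number of data points in one time grain.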
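The 7*24 figure follows directly from dividing the grain's duration by the sampling interval. A minimal sketch, assuming hourly timestamps and the grain names "Weekly" and "Day" (the GRAIN_SPANS table and the function signature are hypothetical, not CellPAD's):

```python
from datetime import datetime, timedelta

# Assumed mapping from grain name to its duration.
GRAIN_SPANS = {"Weekly": timedelta(weeks=1), "Day": timedelta(days=1)}


def compute_feature_period_len(timestamps, period_grain):
    """Number of samples that fit into one time grain."""
    # Sampling interval, taken from the first two timestamps.
    time_delta = timestamps[1] - timestamps[0]
    return GRAIN_SPANS[period_grain] // time_delta


start = datetime(2024, 1, 1)
hourly = [start + timedelta(hours=h) for h in range(3)]
weekly_len = compute_feature_period_len(hourly, "Weekly")
daily_len = compute_feature_period_len(hourly, "Day")
```

For hourly data this gives 168 samples per week and 24 per day.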

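In self.get_sametime_instances(), current_index is the index of the current data point, feature_period_len is the number of data points in one time grain (168), ts_period_len is the period of the time series (168), and instance_count is the sliding-window length, one of [3,5,7,10,13].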

As I understand it, there is an assumption here: the value collected at t+period reflects the data at time t.

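Since the period of the time series is 168, the first 168 entries of the data set are effectively unusable. pos is set to the current index minus 168; when pos<0 the function returns 0, so the first 168 data points all yield 0. For later points, after stepping back one period, one value is picked every time grain until the sliding-window count is reached or the data runs out.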
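The backward stepping described here can be sketched as follows. This is my reading of the logic rather than the actual get_sametime_instances(): the series argument is passed explicitly, and an empty list stands in for the "returns 0" case, leaving it to the caller to substitute 0.

```python
def get_sametime_instances(series, current_index, feature_period_len,
                           ts_period_len, instance_count):
    """Collect past values observed at the same phase as current_index."""
    # Step back one full period first, so the current value is never used.
    pos = current_index - ts_period_len
    instances = []
    while pos >= 0 and len(instances) < instance_count:
        instances.append(series[pos])
        # Then keep stepping back one time grain at a time.
        pos -= feature_period_len
    return instances


toy_series = list(range(20))  # the value equals its index, for readability
full = get_sametime_instances(toy_series, 14, 4, 4, 3)
short = get_sametime_instances(toy_series, 9, 4, 4, 5)
empty = get_sametime_instances(toy_series, 2, 4, 4, 3)
```

With a toy period of 4, index 14 looks back at indices 10, 6 and 2; index 9 only finds two earlier instances, and index 2 finds none.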

Back in self.compute_one_feature(feature_name, start_pos, end_pos), the operation function is called on these values to compute the feature, and the result is returned as feature_values.

The returned feature values are assigned to first_train_features as the model input, and dict_bootstrap["bootstrap_series"] is assigned to first_train_response as the model output. Finally, the feature matrix and the output are stored in self.dict_storage["normal_features_matrix"] and self.dict_storage["normal_response_series"] respectively.

Then the model is trained with model.train.

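round_cnt is set to the number of periods in the time series (as an integer). Each period is visited in turn, repeating the feature engineering above to obtain the model's input and output.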

For this input, the RF regression model predicts the output this_predicted_series, while the actual output this_practical_series is the originally observed value.

Anomaly detection:

this_drop_ratios, this_drop_labels, this_drop_scores = \
                                    self.__filter_anomaly(predicted_series=this_predicted_series,
                                                        practical_series=this_practical_series)

Here self.__filter_anomaly() is called to test the model.

predicted_series is the model's prediction, i.e. the expected output, and practical_series is the actual observed output. __filter_anomaly() looks like this:

anomaly_filter = DropAnomalyFilter(rule=self.dict_filter["method"],
                                           coef=self.dict_filter["coefficient"])
drop_ratios, drop_labels, drop_scores = \
            anomaly_filter.detect_anomaly(predicted_series=predicted_series,
                                          practical_series=practical_series)
return drop_ratios, drop_labels, drop_scores

anomaly_filter is initialized as an instance of the DropAnomalyFilter class, which is defined in filter.py.

From the earlier code, rule=gauss and coef=3.0.

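Moving on to detect_anomaly(): it iterates over practical_series and computes dp, the drop ratio of practical_series relative to predicted_series. It then iterates over the drop ratios: if a drop ratio is greater than or equal to 0, the anomaly score is 0; otherwise the anomaly score is its negation. Finally it calls self.filter_anomaly(drop_ratios) to generate the anomaly labels.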

In that function, the drop ratios are standardized with their mean and variance, and self.filter_by_threshold(drop_ratios, threshold) is then called to generate the anomaly labels.

In that function, a point is labeled anomalous if its standardized drop ratio exceeds the threshold, and normal otherwise. The threshold is coef=3.0.
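The three steps (drop ratio, drop score, Gaussian threshold) fit in one short function. This is a sketch of the idea, not DropAnomalyFilter itself: the exact formula for the ratio, the use of the population standard deviation, and flagging only points that lie more than coef deviations below the mean are assumptions on my part.

```python
from statistics import mean, pstdev


def detect_drops(predicted_series, practical_series, coef=3.0):
    """Hypothetical drop detector following the steps described above."""
    # Relative change of the observed value against the prediction;
    # negative values mean the KPI is lower than expected.
    ratios = [(a - p) / p if p != 0 else 0.0
              for p, a in zip(predicted_series, practical_series)]
    # Score: 0 for non-drops, otherwise the size of the drop.
    scores = [-r if r < 0 else 0.0 for r in ratios]
    mu = mean(ratios)
    sigma = pstdev(ratios)
    # Gaussian rule (assumed form): flag ratios far below the mean.
    labels = [sigma > 0 and (mu - r) / sigma > coef for r in ratios]
    return ratios, labels, scores


predicted = [100.0] * 20
practical = [100.0] * 19 + [40.0]
drop_ratios, drop_labels, drop_scores = detect_drops(predicted, practical)
```

In this toy input only the last point, which falls 60% below its prediction, is flagged; its drop score is 0.6 and every other score is 0.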


 self.__store_this_results(this_predicted_series, this_drop_ratios,
                                    this_drop_labels, this_drop_scores)
 self.__store_features_response(this_features_matrix=this_predicted_features,
                                this_response_series=this_practical_series,
                                this_labels=this_drop_labels)

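The test results are stored and dict_result is updated: the latest training and test results are appended at the end, so the further back an entry sits in the dictionary, the better trained the model that produced it. dict_storage is updated at the same time.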

Training then continues on the updated dict_storage, iterating in this way to obtain better results.
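The overall loop (bootstrap, predict one period, filter anomalies, keep only the normal points, retrain) can be sketched without any machine-learning library. Everything here is a stand-in chosen for brevity: a per-phase mean replaces the random forest, a fixed ratio threshold replaces the Gaussian filter, and the function name rolling_detect is hypothetical.

```python
def rolling_detect(series, period_len, bootstrap_period_cnt, threshold=0.5):
    """Period-by-period detection that retrains on normal points only."""
    boot_len = bootstrap_period_cnt * period_len
    # Storage of normal samples, keyed by the position within the period.
    normal = {phase: [] for phase in range(period_len)}
    for i in range(boot_len):
        normal[i % period_len].append(series[i])
    labels = [False] * len(series)
    for start in range(boot_len, len(series), period_len):
        end = min(start + period_len, len(series))
        # "Training": per-phase mean of the stored normal values
        # (a stand-in for fitting the regression model).
        model = {ph: sum(v) / len(v) for ph, v in normal.items()}
        for i in range(start, end):
            predicted = model[i % period_len]
            ratio = (series[i] - predicted) / predicted if predicted else 0.0
            labels[i] = ratio < -threshold
            if not labels[i]:
                # Only points judged normal are added to the training data.
                normal[i % period_len].append(series[i])
    return labels


kpi = [10.0, 20.0, 30.0, 40.0] * 5
kpi[14] = 5.0  # a sudden drop at a position whose normal value is 30
detected = rolling_detect(kpi, period_len=4, bootstrap_period_cnt=2)
```

Because the dropped value is never stored, the prediction for that phase stays at 30 in the following rounds, which is the point of keeping a separate store of normal data.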

Performance evaluation:

    results = controller.get_results()

    auc, prauc = evaluate(results["drop_scores"][2*168:], syn_labels[2*168:])

    print("front_mean", "auc", auc, "prauc", prauc)

The code above evaluates the AUC and PRAUC of the experimental results.
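For intuition about the first of these two numbers, ROC AUC can be computed by hand as the probability that a randomly chosen anomalous point scores higher than a randomly chosen normal one. The sketch below is a pure-Python illustration of that definition, not the evaluate() function from the repository. Note also that the [2*168:] slices in the call above skip the two bootstrap periods, which are never scored.

```python
def roc_auc(scores, labels):
    """ROC AUC via pairwise comparison of positive and negative scores."""
    pos = [s for s, l in zip(scores, labels) if l]
    neg = [s for s, l in zip(scores, labels) if not l]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


toy_scores = [0.9, 0.4, 0.6, 0.2]
toy_labels = [True, True, False, False]
toy_auc = roc_auc(toy_scores, toy_labels)
```

Here three of the four positive-negative pairs are ranked correctly, so the AUC is 0.75. The quadratic loop is fine for a toy example but too slow for long series.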
