Experiment 5: Modeling and Prediction with a Purely Data-Mining-Based Approach
October 11, 2021. During the previous experiments, Dr. Li (Yao-ge) said that roughly 45 data features should be dropped outright (feature selection based on expert experience). As a novice without any of that domain expertise, I was quite puzzled: how does one tell that a sensor has failed (which requires going on site), and why should those features be deleted directly? Here I try to verify this with feature-engineering methods.
Experiment approach:
Additional note: the original modeling task is multi-label classification with 38 label columns. The idea tried here is to treat each sample's row of label values as a binary encoding, turning the problem into regression.
Train a regression model, then map each regression output back to a binary string, i.e. the sample's 0/1 label values. This experiment is a feasibility check of that idea; a minimal encode/decode sketch follows.
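A minimal sketch of the intended encode/decode round trip (assuming a 38-element 0/1 label row; taking the first label as the most significant bit is an assumption made only for illustration):

def labels_to_int(row_bits):
    # Pack a 0/1 label row into one integer (first label = most significant bit).
    return int("".join(str(int(b)) for b in row_bits), 2)

def int_to_labels(value, n_bits=38):
    # Map a (rounded) regression output back to a 38-bit 0/1 label row.
    return [int(c) for c in format(int(round(value)), "0{}b".format(n_bits))]

bits = [0] * 38
bits[17] = 1                      # toy row with a single positive label
code = labels_to_int(bits)        # integer regression target
assert int_to_labels(code) == bits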
Experiment results:

- The coefficient of determination R^2 on the test set reached 0.92;
- R^2 on the training set reached 0.97;
- The data-driven feature selection largely agrees with the features dropped by expert judgment, i.e. the features deleted outright in the earlier experiments are mostly identified here as well;
- However, after binarizing the regression outputs and evaluating with multi-label classification metrics, the scores fall far short of the figures above, so the feasibility of the approach needs further consideration.
Reflections:

- Compared with Experiments 3 and 4, and in light of the Experiment 5 results, using all features does not noticeably hurt model performance, i.e. the extra features do not seem to introduce additional noise;
- Across Experiments 3, 4 and 5, random forest regression performs about the same with 43, 85 or 92 features, so can the redundant features simply be dropped?
- In big-data settings such as computing, one would not hesitate to keep fewer features when performance is comparable; but in high-risk fields such as civil engineering, is a purely data-driven approach acceptable?
- In machine learning and data mining, how can expert experience be woven into the whole modeling and prediction process?
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
import matplotlib.pyplot as plt
data = pd.read_csv("./Data_all_V2.csv")
data.head()
| | TIME | XC-1 | XC-2 | XC-3 | XC-4 | XC-5 | RF-1 | RF-2 | RF-3 | RF-4 | ... | K4-6 | K4-7 | K4-8 | K5-1 | K5-2 | K5-3 | K5-4 | K5-5 | K5-6 | Y_dec |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2020-07-19 00:00:00 | 5.734000 | 5.426833 | 5.575333 | 5.534667 | 5.606333 | 0.1 | 0.1 | 1.503 | 1.2695 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 |
| 1 | 2020-07-19 01:00:00 | 5.730667 | 5.418667 | 5.571167 | 5.531667 | 5.603667 | 0.1 | 0.1 | 1.484 | 1.2685 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 |
| 2 | 2020-07-19 02:00:00 | 5.737833 | 5.424500 | 5.577833 | 5.537167 | 5.611667 | 0.1 | 0.1 | 1.443 | 1.2610 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1048576 |
| 3 | 2020-07-19 03:00:00 | 5.730833 | 5.418333 | 5.571000 | 5.525500 | 5.609333 | 0.1 | 0.1 | 1.474 | 1.2570 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1048576 |
| 4 | 2020-07-19 04:00:00 | 5.740000 | 5.429167 | 5.581000 | 5.549500 | 5.606333 | 0.1 | 0.1 | 1.489 | 1.2525 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 |

5 rows × 170 columns
data.describe()
| | XC-1 | XC-2 | XC-3 | XC-4 | XC-5 | RF-1 | RF-2 | RF-3 | RF-4 | RF-5 | ... | K4-6 | K4-7 | K4-8 | K5-1 | K5-2 | K5-3 | K5-4 | K5-5 | K5-6 | Y_dec |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 623.000000 | 623.000000 | 623.000000 | 623.000000 | 623.000000 | 601.000000 | 601.000000 | 601.000000 | 601.000000 | 601.000000 | ... | 623.000000 | 623.000000 | 623.000000 | 623.000000 | 623.000000 | 623.000000 | 623.000000 | 623.000000 | 623.000000 | 6.230000e+02 |
| mean | 8.821620 | 8.584109 | 8.705866 | 8.692483 | 8.725204 | 0.637256 | 0.118895 | 1.245230 | 1.111804 | 1.206263 | ... | 0.154093 | 0.044944 | 0.065811 | 0.086677 | 0.115570 | 0.104334 | 0.081862 | 0.109149 | 0.048154 | 2.928224e+10 |
| std | 3.064538 | 3.213162 | 3.134184 | 3.175944 | 3.098137 | 0.663537 | 0.187299 | 0.339832 | 0.354685 | 0.349192 | ... | 0.361328 | 0.207347 | 0.248150 | 0.281588 | 0.319965 | 0.305939 | 0.274375 | 0.312077 | 0.214264 | 5.544480e+10 |
| min | 5.608333 | 5.415000 | 5.568833 | 5.520167 | 5.588667 | 0.100000 | 0.100000 | 0.456000 | 0.100000 | 0.430500 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000e+00 |
| 25% | 6.083667 | 5.524750 | 5.810583 | 5.778250 | 5.833250 | 0.100000 | 0.100000 | 0.993000 | 0.880500 | 0.903000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.099200e+06 |
| 50% | 7.419000 | 7.354167 | 7.417333 | 7.409500 | 7.466167 | 0.379000 | 0.100000 | 1.367500 | 1.142000 | 1.276500 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 5.368709e+08 |
| 75% | 11.877583 | 11.671000 | 11.764083 | 11.918000 | 11.606667 | 0.961500 | 0.100000 | 1.458000 | 1.310500 | 1.523000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 3.439343e+10 |
| max | 14.251500 | 14.176000 | 14.207167 | 14.253500 | 14.155833 | 2.817000 | 3.077500 | 2.036500 | 4.603500 | 2.730500 | ... | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 2.190783e+11 |

8 rows × 169 columns
import missingno as msno
import seaborn as sns
msno.bar(data)
<matplotlib.axes._subplots.AxesSubplot at 0x1f6bab1e308>
def missing_values_table(df):
    # Summarize the number and percentage of missing values per column,
    # keeping only the columns that actually contain missing entries.
    mis_val = df.isnull().sum()
    mis_val_percent = mis_val / len(df) * 100
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
    mis_val_table_ren_columns = mis_val_table.rename(
        columns={0: '缺失值', 1: '缺失值占比'})
    mis_val_table_ren_columns = mis_val_table_ren_columns[
        mis_val_table_ren_columns.iloc[:, 1] != 0].sort_values(
        '缺失值占比', ascending=False).round(1)
    print("一共" + str(df.shape[1]) + "列,\n"
          "其中" + str(mis_val_table_ren_columns.shape[0]) +
          "列有缺失值。")
    return mis_val_table_ren_columns
missing_values_table(data)
一共170列,
其中100列有缺失值。
| | 缺失值 | 缺失值占比 |
| --- | --- | --- |
| RF-1 | 22 | 3.5 |
| RF-81 | 22 | 3.5 |
| RF-95 | 22 | 3.5 |
| RF-94 | 22 | 3.5 |
| RF-93 | 22 | 3.5 |
| ... | ... | ... |
| RF-50 | 9 | 1.4 |
| RF-51 | 9 | 1.4 |
| RF-58 | 9 | 1.4 |
| RF-59 | 9 | 1.4 |
| RF-103 | 9 | 1.4 |

100 rows × 2 columns
data.duplicated().value_counts()
False 623
dtype: int64
data[['XC-1']].boxplot()
<matplotlib.axes._subplots.AxesSubplot at 0x1f6bd3055c8>
Feature selection 1: removing data from faulty sensors
Yao-ge said the sensors behind these features were broken, so the features were deleted outright.
It is still not clear to me how a sensor fault is identified; one assumed heuristic is sketched below.
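Lacking access to the site, one data-driven proxy is to flag channels whose readings are flatlined or stuck at a single value. The near-zero-variance rule below is only an assumed heuristic, not the criterion Dr. Li actually applied:

def suspect_sensors(df, var_eps=1e-6):
    # Flag numeric channels that barely vary or hold a single constant value.
    numeric = df.select_dtypes(include="number")
    stds = numeric.std()
    counts = numeric.nunique()
    flat = set(stds[stds < var_eps].index)     # near-zero variance
    stuck = set(counts[counts <= 1].index)     # literally constant
    return sorted(flat | stuck)

print(suspect_sensors(data.drop(columns=["TIME", "Y_dec"])))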
data_new = data.dropna()
data_new.shape
(601, 170)
X = data_new.iloc[:,1:131]
X.head()
| | XC-1 | XC-2 | XC-3 | XC-4 | XC-5 | RF-1 | RF-2 | RF-3 | RF-4 | RF-5 | ... | RF-116 | RF-117 | RF-118 | RF-119 | RF-120 | RF-121 | RF-122 | RF-123 | RF-124 | RF-125 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 5.734000 | 5.426833 | 5.575333 | 5.534667 | 5.606333 | 0.1 | 0.1 | 1.503 | 1.2695 | 1.5110 | ... | 0.6505 | 0.3530 | 0.6080 | 0.5690 | 2.5085 | 2.5195 | 1.9170 | 3.4460 | 2.9660 | 2.2025 |
| 1 | 5.730667 | 5.418667 | 5.571167 | 5.531667 | 5.603667 | 0.1 | 0.1 | 1.484 | 1.2685 | 1.5110 | ... | 0.6545 | 0.3550 | 0.6110 | 0.5700 | 2.5035 | 2.5055 | 2.8310 | 3.4210 | 2.9560 | 2.1920 |
| 2 | 5.737833 | 5.424500 | 5.577833 | 5.537167 | 5.611667 | 0.1 | 0.1 | 1.443 | 1.2610 | 1.5055 | ... | 0.6570 | 0.3595 | 0.6145 | 0.5725 | 2.4950 | 2.4910 | 2.6855 | 3.3850 | 2.9385 | 2.1760 |
| 3 | 5.730833 | 5.418333 | 5.571000 | 5.525500 | 5.609333 | 0.1 | 0.1 | 1.474 | 1.2570 | 1.4990 | ... | 0.6590 | 0.3620 | 0.6195 | 0.5750 | 2.4885 | 2.4870 | 4.0795 | 3.3595 | 2.9250 | 2.1605 |
| 4 | 5.740000 | 5.429167 | 5.581000 | 5.549500 | 5.606333 | 0.1 | 0.1 | 1.489 | 1.2525 | 1.4960 | ... | 0.6630 | 0.3630 | 0.6240 | 0.5785 | 2.4845 | 2.4905 | 3.4460 | 3.3470 | 2.9160 | 2.1595 |

5 rows × 130 columns
Y = (data_new['Y_dec'])
Y
0 0
1 0
2 1048576
3 1048576
4 0
...
596 43017177090
597 43017177106
598 43017177106
599 43017177106
600 43017177106
Name: Y_dec, Length: 601, dtype: int64
Feature selection 2: correlation heatmap
print(X.columns)
Index(['XC-1', 'XC-2', 'XC-3', 'XC-4', 'XC-5', 'RF-1', 'RF-2', 'RF-3', 'RF-4',
'RF-5',
...
'RF-116', 'RF-117', 'RF-118', 'RF-119', 'RF-120', 'RF-121', 'RF-122',
'RF-123', 'RF-124', 'RF-125'],
dtype='object', length=130)
data_new = pd.concat([X, Y], axis=1)
data_new.shape
(601, 131)
data_new.head()
| | XC-1 | XC-2 | XC-3 | XC-4 | XC-5 | RF-1 | RF-2 | RF-3 | RF-4 | RF-5 | ... | RF-117 | RF-118 | RF-119 | RF-120 | RF-121 | RF-122 | RF-123 | RF-124 | RF-125 | Y_dec |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 5.734000 | 5.426833 | 5.575333 | 5.534667 | 5.606333 | 0.1 | 0.1 | 1.503 | 1.2695 | 1.5110 | ... | 0.3530 | 0.6080 | 0.5690 | 2.5085 | 2.5195 | 1.9170 | 3.4460 | 2.9660 | 2.2025 | 0 |
| 1 | 5.730667 | 5.418667 | 5.571167 | 5.531667 | 5.603667 | 0.1 | 0.1 | 1.484 | 1.2685 | 1.5110 | ... | 0.3550 | 0.6110 | 0.5700 | 2.5035 | 2.5055 | 2.8310 | 3.4210 | 2.9560 | 2.1920 | 0 |
| 2 | 5.737833 | 5.424500 | 5.577833 | 5.537167 | 5.611667 | 0.1 | 0.1 | 1.443 | 1.2610 | 1.5055 | ... | 0.3595 | 0.6145 | 0.5725 | 2.4950 | 2.4910 | 2.6855 | 3.3850 | 2.9385 | 2.1760 | 1048576 |
| 3 | 5.730833 | 5.418333 | 5.571000 | 5.525500 | 5.609333 | 0.1 | 0.1 | 1.474 | 1.2570 | 1.4990 | ... | 0.3620 | 0.6195 | 0.5750 | 2.4885 | 2.4870 | 4.0795 | 3.3595 | 2.9250 | 2.1605 | 1048576 |
| 4 | 5.740000 | 5.429167 | 5.581000 | 5.549500 | 5.606333 | 0.1 | 0.1 | 1.489 | 1.2525 | 1.4960 | ... | 0.3630 | 0.6240 | 0.5785 | 2.4845 | 2.4905 | 3.4460 | 3.3470 | 2.9160 | 2.1595 | 0 |

5 rows × 131 columns
len(data_new.columns)
131
data_new.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 601 entries, 0 to 600
Columns: 131 entries, XC-1 to Y_dec
dtypes: float64(130), int64(1)
memory usage: 619.8 KB
import numpy as np
from mlxtend.plotting import heatmap
import matplotlib.pyplot as plt
cols = data_new.columns[:20]
cm = np.corrcoef(data_new[cols].values.T)
hm = heatmap(cm, row_names=cols, column_names=cols, column_name_rotation=45, figsize=(15, 15))
plt.savefig('./heatmap-1.png', dpi=300)
plt.show()
cm = np.corrcoef(data_new[data_new.columns].values.T)
print(cm)
[[1. 0.99801785 0.99838858 ... 0.8491032 0.49766594 0.28886902]
[0.99801785 1. 0.9986607 ... 0.84473304 0.48884636 0.30590216]
[0.99838858 0.9986607 1. ... 0.84974265 0.49435048 0.30096088]
...
[0.8491032 0.84473304 0.84974265 ... 1. 0.59760806 0.19152249]
[0.49766594 0.48884636 0.49435048 ... 0.59760806 1. 0.10425281]
[0.28886902 0.30590216 0.30096088 ... 0.19152249 0.10425281 1. ]]
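As a hedged follow-up to the heatmap, one option is to drop one member of every highly correlated feature pair; the 0.98 cutoff below is an illustrative choice, not a value used in this experiment:

# Upper triangle of the absolute correlation matrix (features only, Y_dec excluded).
corr = data_new[data_new.columns[:-1]].corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.98).any()]
print(len(to_drop), "features exceed the 0.98 correlation cutoff")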
Feature selection 3: recursive feature elimination / L1-based selection / sequential backward selection / feature importance
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
X_train.shape
(480, 130)
from sklearn.ensemble import RandomForestRegressor
feat_labels = data_new.columns[:130]
forest = RandomForestRegressor(random_state=1)
forest.fit(X_train, y_train)
importances = forest.feature_importances_
indices = np.argsort(importances)[::-1]
for f in range(X_train.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 30,
                            feat_labels[indices[f]],
                            importances[indices[f]]))
plt.rcParams['font.sans-serif'] = ['FangSong']
plt.figure(figsize=(18, 5))
plt.title('特征重要性评估')
plt.bar(range(X_train.shape[1]),
        importances[indices],
        align='center')
plt.xticks(range(X_train.shape[1]),
           feat_labels[indices], rotation=90)
plt.xlim([-1, X_train.shape[1]])
plt.tight_layout()
plt.show()
1) RF-92 0.200760
2) RF-10 0.180728
3) RF-87 0.051799
4) RF-95 0.028264
5) RF-24 0.028129
6) RF-73 0.022808
7) RF-100 0.020915
8) RF-82 0.018629
9) RF-9 0.018139
10) RF-86 0.017640
11) RF-23 0.015947
12) RF-88 0.014931
13) RF-84 0.014429
14) RF-30 0.014212
15) RF-26 0.013331
16) RF-125 0.012515
17) RF-52 0.012292
18) RF-83 0.012036
19) RF-7 0.011987
20) RF-34 0.011529
21) RF-21 0.011237
22) RF-33 0.011077
23) RF-31 0.010535
24) RF-67 0.010051
25) RF-123 0.010015
26) RF-32 0.009059
27) RF-79 0.009013
28) RF-39 0.008055
29) RF-96 0.007899
30) XC-2 0.007668
31) RF-109 0.007457
32) RF-18 0.006936
33) RF-47 0.006934
34) RF-49 0.006903
35) RF-25 0.006135
36) RF-38 0.005785
37) XC-1 0.004617
38) RF-20 0.004595
39) RF-107 0.004383
40) XC-5 0.004339
41) RF-5 0.004297
42) RF-11 0.004213
43) RF-8 0.004203
44) RF-120 0.004015
45) RF-27 0.003573
46) RF-64 0.003397
47) RF-74 0.003380
48) RF-93 0.003163
49) RF-97 0.002960
50) RF-118 0.002928
51) RF-54 0.002877
52) RF-112 0.002807
53) RF-44 0.002696
54) RF-45 0.002664
55) RF-40 0.002663
56) RF-48 0.002653
57) RF-69 0.002637
58) RF-43 0.002630
59) RF-57 0.002541
60) RF-22 0.002534
61) RF-55 0.002529
62) RF-65 0.002446
63) RF-58 0.002433
64) XC-4 0.002097
65) RF-104 0.002049
66) RF-15 0.002011
67) RF-101 0.001984
68) RF-80 0.001779
69) RF-4 0.001730
70) RF-78 0.001726
71) RF-99 0.001721
72) RF-124 0.001704
73) RF-50 0.001680
74) RF-56 0.001673
75) RF-1 0.001652
76) RF-121 0.001644
77) RF-102 0.001643
78) RF-60 0.001586
79) RF-53 0.001452
80) RF-42 0.001424
81) RF-111 0.001396
82) RF-91 0.001379
83) RF-35 0.001346
84) RF-17 0.001324
85) RF-122 0.001320
86) RF-103 0.001314
87) RF-41 0.001214
88) RF-117 0.001205
89) RF-16 0.001203
90) RF-46 0.001163
91) RF-119 0.001126
92) RF-113 0.001126
93) RF-63 0.001057
94) RF-71 0.001048
95) RF-61 0.000935
96) RF-19 0.000858
97) RF-114 0.000819
98) RF-85 0.000786
99) RF-70 0.000708
100) RF-116 0.000691
101) RF-105 0.000685
102) RF-98 0.000570
103) RF-3 0.000515
104) RF-106 0.000429
105) RF-110 0.000421
106) RF-81 0.000414
107) RF-115 0.000270
108) RF-6 0.000260
109) XC-3 0.000243
110) RF-36 0.000191
111) RF-68 0.000176
112) RF-75 0.000128
113) RF-77 0.000108
114) RF-76 0.000046
115) RF-13 0.000010
116) RF-72 0.000005
117) RF-29 0.000002
118) RF-2 0.000001
119) RF-12 0.000000
120) RF-94 0.000000
121) RF-51 0.000000
122) RF-90 0.000000
123) RF-59 0.000000
124) RF-37 0.000000
125) RF-14 0.000000
126) RF-108 0.000000
127) RF-62 0.000000
128) RF-28 0.000000
129) RF-66 0.000000
130) RF-89 0.000000
from sklearn.feature_selection import SelectFromModel
sfm = SelectFromModel(forest, threshold=0.001057, prefit=True)
X_selected = sfm.transform(X_train)
print('最终确定选择的特征个数:', X_selected.shape[1])
最终确定选择的特征个数: 92
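The same selection can also be read off the fitted selector via get_support(), which avoids rebuilding the list from the sorted importances as the loop below does (a sketch; note it returns the names in the original column order rather than sorted by importance):

selected_cols = X_train.columns[sfm.get_support()]   # boolean mask over the 130 input columns
print(len(selected_cols))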
list1 = []
for f in range(X_selected.shape[1]):
    list1.append(feat_labels[indices[f]])
    print("%2d) %-*s %f" % (f + 1, 30,
                            feat_labels[indices[f]],
                            importances[indices[f]]))
1) RF-92 0.200760
2) RF-10 0.180728
3) RF-87 0.051799
4) RF-95 0.028264
5) RF-24 0.028129
6) RF-73 0.022808
7) RF-100 0.020915
8) RF-82 0.018629
9) RF-9 0.018139
10) RF-86 0.017640
11) RF-23 0.015947
12) RF-88 0.014931
13) RF-84 0.014429
14) RF-30 0.014212
15) RF-26 0.013331
16) RF-125 0.012515
17) RF-52 0.012292
18) RF-83 0.012036
19) RF-7 0.011987
20) RF-34 0.011529
21) RF-21 0.011237
22) RF-33 0.011077
23) RF-31 0.010535
24) RF-67 0.010051
25) RF-123 0.010015
26) RF-32 0.009059
27) RF-79 0.009013
28) RF-39 0.008055
29) RF-96 0.007899
30) XC-2 0.007668
31) RF-109 0.007457
32) RF-18 0.006936
33) RF-47 0.006934
34) RF-49 0.006903
35) RF-25 0.006135
36) RF-38 0.005785
37) XC-1 0.004617
38) RF-20 0.004595
39) RF-107 0.004383
40) XC-5 0.004339
41) RF-5 0.004297
42) RF-11 0.004213
43) RF-8 0.004203
44) RF-120 0.004015
45) RF-27 0.003573
46) RF-64 0.003397
47) RF-74 0.003380
48) RF-93 0.003163
49) RF-97 0.002960
50) RF-118 0.002928
51) RF-54 0.002877
52) RF-112 0.002807
53) RF-44 0.002696
54) RF-45 0.002664
55) RF-40 0.002663
56) RF-48 0.002653
57) RF-69 0.002637
58) RF-43 0.002630
59) RF-57 0.002541
60) RF-22 0.002534
61) RF-55 0.002529
62) RF-65 0.002446
63) RF-58 0.002433
64) XC-4 0.002097
65) RF-104 0.002049
66) RF-15 0.002011
67) RF-101 0.001984
68) RF-80 0.001779
69) RF-4 0.001730
70) RF-78 0.001726
71) RF-99 0.001721
72) RF-124 0.001704
73) RF-50 0.001680
74) RF-56 0.001673
75) RF-1 0.001652
76) RF-121 0.001644
77) RF-102 0.001643
78) RF-60 0.001586
79) RF-53 0.001452
80) RF-42 0.001424
81) RF-111 0.001396
82) RF-91 0.001379
83) RF-35 0.001346
84) RF-17 0.001324
85) RF-122 0.001320
86) RF-103 0.001314
87) RF-41 0.001214
88) RF-117 0.001205
89) RF-16 0.001203
90) RF-46 0.001163
91) RF-119 0.001126
92) RF-113 0.001126
print(list1)
['RF-92', 'RF-10', 'RF-87', 'RF-95', 'RF-24', 'RF-73', 'RF-100', 'RF-82', 'RF-9', 'RF-86', 'RF-23', 'RF-88', 'RF-84', 'RF-30', 'RF-26', 'RF-125', 'RF-52', 'RF-83', 'RF-7', 'RF-34', 'RF-21', 'RF-33', 'RF-31', 'RF-67', 'RF-123', 'RF-32', 'RF-79', 'RF-39', 'RF-96', 'XC-2', 'RF-109', 'RF-18', 'RF-47', 'RF-49', 'RF-25', 'RF-38', 'XC-1', 'RF-20', 'RF-107', 'XC-5', 'RF-5', 'RF-11', 'RF-8', 'RF-120', 'RF-27', 'RF-64', 'RF-74', 'RF-93', 'RF-97', 'RF-118', 'RF-54', 'RF-112', 'RF-44', 'RF-45', 'RF-40', 'RF-48', 'RF-69', 'RF-43', 'RF-57', 'RF-22', 'RF-55', 'RF-65', 'RF-58', 'XC-4', 'RF-104', 'RF-15', 'RF-101', 'RF-80', 'RF-4', 'RF-78', 'RF-99', 'RF-124', 'RF-50', 'RF-56', 'RF-1', 'RF-121', 'RF-102', 'RF-60', 'RF-53', 'RF-42', 'RF-111', 'RF-91', 'RF-35', 'RF-17', 'RF-122', 'RF-103', 'RF-41', 'RF-117', 'RF-16', 'RF-46', 'RF-119', 'RF-113']
Re-selecting the features to redefine X
X = data_new[list1]
X.columns
Index(['RF-92', 'RF-10', 'RF-87', 'RF-95', 'RF-24', 'RF-73', 'RF-100', 'RF-82',
'RF-9', 'RF-86', 'RF-23', 'RF-88', 'RF-84', 'RF-30', 'RF-26', 'RF-125',
'RF-52', 'RF-83', 'RF-7', 'RF-34', 'RF-21', 'RF-33', 'RF-31', 'RF-67',
'RF-123', 'RF-32', 'RF-79', 'RF-39', 'RF-96', 'XC-2', 'RF-109', 'RF-18',
'RF-47', 'RF-49', 'RF-25', 'RF-38', 'XC-1', 'RF-20', 'RF-107', 'XC-5',
'RF-5', 'RF-11', 'RF-8', 'RF-120', 'RF-27', 'RF-64', 'RF-74', 'RF-93',
'RF-97', 'RF-118', 'RF-54', 'RF-112', 'RF-44', 'RF-45', 'RF-40',
'RF-48', 'RF-69', 'RF-43', 'RF-57', 'RF-22', 'RF-55', 'RF-65', 'RF-58',
'XC-4', 'RF-104', 'RF-15', 'RF-101', 'RF-80', 'RF-4', 'RF-78', 'RF-99',
'RF-124', 'RF-50', 'RF-56', 'RF-1', 'RF-121', 'RF-102', 'RF-60',
'RF-53', 'RF-42', 'RF-111', 'RF-91', 'RF-35', 'RF-17', 'RF-122',
'RF-103', 'RF-41', 'RF-117', 'RF-16', 'RF-46', 'RF-119', 'RF-113'],
dtype='object')
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
X_train.shape
(480, 92)
Lasso regression
from sklearn import linear_model
reg = linear_model.Lasso(alpha=0.00001)
reg.fit(X_train,y_train)
D:\installation\anaconda3\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:532: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 1.6435372468214974e+23, tolerance: 1.4973870832681085e+20
positive)
Lasso(alpha=1e-05)
plt.plot(np.abs(reg.coef_),'--o')
plt.axhline(y=0.02,c='r')
<matplotlib.lines.Line2D at 0x1f6c19fe748>
np.mean(np.abs(reg.coef_))
18311770661.999977
reg.score(X_test,y_test)
0.732052935705332
reg.score(X_train,y_train)
0.7804793075636222
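The ConvergenceWarning above is common when the target spans values around 1e11 while the features stay near single digits. A hedged remedy is to scale the inputs (MinMaxScaler was imported earlier but never used) and raise max_iter; this is only a sketch, not a rerun of the experiment, and the scores may differ:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Scale features to [0, 1] before the coordinate-descent Lasso solver.
lasso_scaled = make_pipeline(MinMaxScaler(), linear_model.Lasso(alpha=1e-5, max_iter=100000))
lasso_scaled.fit(X_train, y_train)
print(lasso_scaled.score(X_test, y_test))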
pred = reg.predict(X_test)
pred
array([-5.91279193e+09, 1.52970644e+11, 1.21356713e+11, 1.61144231e+09,
-2.76269753e+09, 1.66805503e+11, -2.55209259e+09, 6.57431795e+10,
2.09392306e+09, 3.00407735e+10, 5.10659695e+07, 1.44706078e+10,
2.75700557e+09, 8.36149121e+09, 1.09271166e+10, 1.53840924e+10,
4.89804223e+09, -3.27441451e+09, 1.01305009e+10, 5.01193482e+09,
1.13311659e+10, 7.10822012e+09, -1.91037528e+10, 1.12133120e+10,
2.94543418e+10, 3.37137129e+10, 6.84364644e+09, 7.92962844e+09,
1.39378767e+11, 6.76242943e+10, 6.81095813e+09, 2.35075869e+08,
-9.01481269e+09, -3.12739062e+09, 3.70687644e+10, 8.92374996e+10,
-9.79097338e+09, -1.36527924e+10, 2.95734998e+10, 1.21358170e+11,
3.69019313e+10, -7.57923541e+08, -8.41263434e+09, 7.08175212e+10,
-2.16577354e+10, 2.28445865e+09, 4.30734803e+10, -1.39146317e+10,
-8.80558734e+09, 1.21641591e+11, -1.02231574e+10, -2.09872847e+09,
1.00209332e+09, 1.35053067e+11, 2.42631162e+09, -1.21676250e+10,
-1.04680970e+10, 2.41489349e+10, 2.87759396e+09, -9.82787928e+09,
7.25166935e+08, 7.30268846e+09, 8.89993429e+10, 5.27150258e+10,
6.40587784e+10, 6.49945475e+08, 9.59509662e+10, 7.83788370e+09,
-6.21577580e+09, 7.91980402e+09, -2.44524065e+10, -2.00705094e+09,
1.53761308e+11, 1.37174202e+10, 3.12556649e+10, 1.22462383e+10,
-1.37108599e+10, 1.19086132e+10, 1.17209150e+11, 1.58853291e+10,
1.89075897e+11, 1.14837540e+11, 1.77801946e+10, 4.35574632e+10,
1.71481394e+10, -1.93642884e+10, 1.16004011e+10, -1.43264415e+10,
-4.38351832e+10, -9.74000150e+09, 1.16270513e+10, 5.39487296e+10,
1.88127992e+09, -1.13001965e+10, -1.69820432e+09, 1.52138838e+11,
-2.58193379e+10, -3.93855871e+09, 1.30366979e+11, -1.90334073e+10,
9.72897998e+09, -1.11751827e+10, 1.58294676e+10, 7.61421260e+10,
5.90676211e+07, 2.31508892e+10, -4.34778130e+09, 1.70370590e+11,
1.93214298e+10, 1.33169111e+11, 1.54414149e+10, 5.60941514e+10,
3.17653949e+10, 6.26557415e+09, -4.92238959e+10, 9.81001782e+10,
3.71165210e+10, 1.70241027e+10, 7.99393150e+10, 1.77629873e+09,
-1.55065335e+10])
plt.rcParams['axes.unicode_minus']=False
plt.plot(pred,'-o')
plt.plot(y_test.values,'-x')
[<matplotlib.lines.Line2D at 0x1f6c1a78bc8>]
Random forest regression
from sklearn.ensemble import RandomForestRegressor
forest = RandomForestRegressor(n_estimators=500,random_state=1)
forest.fit(X_train, y_train)
RandomForestRegressor(n_estimators=500, random_state=1)
forest.score(X_test, y_test)
0.9261195509450422
forest.score(X_train, y_train)
0.9755027191067267
pred = forest.predict(X_test)
plt.figure(figsize=(9, 5))
plt.plot(pred, '-o')
plt.plot(y_test.values, '-x')
plt.legend(['pred','y_test'])
plt.savefig('./拟合结果3.png', dpi=300)
plt.show()
y_test.shape
(121,)
pred
array([1.24675352e+08, 2.05173475e+11, 1.12789236e+11, 8.00262968e+07,
7.54421520e+08, 1.85587425e+11, 1.55189248e+05, 9.81478164e+10,
5.74017449e+07, 4.24724722e+09, 4.66480466e+07, 5.08733195e+08,
1.00001649e+08, 6.44360553e+09, 1.42626958e+10, 3.13156145e+07,
4.31114181e+07, 1.26695267e+08, 7.76068297e+08, 7.59096300e+07,
8.48918886e+06, 7.27663689e+09, 1.17102611e+10, 3.46650597e+10,
1.10393074e+10, 7.40686389e+09, 4.81951760e+07, 4.17041954e+08,
1.18842789e+11, 1.97305119e+10, 3.22288935e+10, 3.89535361e+08,
6.65526165e+07, 5.75985656e+08, 6.27017229e+08, 6.49339062e+10,
3.85719023e+08, 1.89915162e+09, 1.55328249e+10, 9.74938349e+10,
8.37426616e+09, 8.95340757e+06, 2.69645769e+09, 2.85048298e+10,
1.87720648e+07, 7.47554743e+08, 6.02530517e+10, 2.15799840e+07,
4.55129420e+08, 8.61797050e+10, 8.86174570e+08, 5.78620744e+08,
2.09165917e+08, 1.16666346e+11, 1.08871650e+10, 2.74455762e+09,
3.28018528e+08, 6.68731939e+08, 5.17918113e+08, 4.66320301e+08,
7.33137992e+07, 6.73922735e+08, 3.76987720e+10, 1.56428215e+10,
5.41022623e+10, 2.90909762e+07, 1.28268144e+11, 8.92707179e+09,
4.63233331e+07, 1.17945429e+09, 9.14957227e+08, 1.12137744e+07,
1.83293585e+11, 4.09141865e+10, 1.33210105e+10, 1.31548227e+10,
6.86595383e+08, 1.04837817e+08, 1.25573677e+11, 8.05585407e+09,
1.80610749e+11, 4.69923831e+10, 4.03452763e+10, 1.16284253e+10,
8.46682973e+07, 5.45160184e+06, 1.06452752e+09, 4.64929293e+08,
2.83771817e+10, 3.62310379e+08, 4.31160930e+06, 3.97632956e+10,
1.96497034e+08, 6.50559268e+09, 1.02850940e+09, 1.63454052e+11,
1.93554624e+06, 1.53093184e+05, 1.34768527e+11, 8.28636442e+07,
2.03984846e+08, 3.90851214e+08, 4.71234506e+09, 8.92876697e+10,
4.98159381e+07, 1.76728362e+10, 3.07549350e+08, 1.28416273e+11,
3.70288798e+10, 1.36616852e+11, 5.65949200e+07, 5.09815668e+10,
1.16628258e+10, 8.99349129e+09, 3.09913240e+10, 1.34674082e+11,
3.16818562e+10, 3.90879421e+08, 1.28392593e+11, 6.20190259e+09,
4.31842726e+08])
list_forest_pred = list(pred)
list_forest_pred
[124675352.248,
205173474701.0255,
112789236039.8526,
80026296.82,
754421520.08,
185587425370.55,
155189.248,
98147816379.65141,
57401744.884,
4247247221.6748667,
46648046.592,
508733194.752,
100001649.256,
6443605531.820465,
14262695805.496351,
31315614.504,
43111418.068,
126695266.728,
776068297.328,
75909630.036,
8489188.864,
7276636891.948751,
11710261065.919668,
34665059729.64825,
11039307422.680023,
7406863889.814001,
48195175.96,
417041953.82,
118842788770.70879,
19730511943.936913,
32228893465.0654,
389535361.076,
66552616.464,
575985656.2351364,
627017229.124,
64933906221.420006,
385719023.424,
1899151623.344267,
15532824945.758196,
97493834936.33267,
8374266160.352725,
8953407.568,
2696457689.141733,
28504829788.50441,
18772064.84,
747554743.296,
60253051736.09195,
21579984.008,
455129419.776,
86179705011.769,
886174569.6234286,
578620744.4596444,
209165916.504,
116666345834.87895,
10887164990.000875,
2744557618.8492255,
328018527.82,
668731939.1013333,
517918112.6706667,
466320300.524,
73313799.18,
673922734.728,
37698771993.227325,
15642821490.815332,
54102262309.18782,
29090976.244,
128268143897.54721,
8927071788.246752,
46323333.06,
1179454292.77,
914957226.78,
11213774.4,
183293585492.91522,
40914186531.93481,
13321010536.4859,
13154822719.0016,
686595383.3788,
104837817.052,
125573677330.19968,
8055854065.595169,
180610749089.55646,
46992383090.550606,
40345276294.254875,
11628425323.005379,
84668297.288,
5451601.836,
1064527519.5970285,
464929293.44,
28377181747.692825,
362310378.664,
4311609.304,
39763295589.8338,
196497034.256,
6505592679.9942665,
1028509400.3388445,
163454051992.4498,
1935546.236,
153093.184,
134768526920.15146,
82863644.164,
203984845.772,
390851214.396,
4712345059.564934,
89287669687.39032,
49815938.1,
17672836223.857914,
307549349.88159996,
128416272544.25165,
37028879840.38972,
136616852034.36374,
56594920.04,
50981566750.73101,
11662825816.32518,
8993491285.912485,
30991324005.078377,
134674082283.40555,
31681856183.926003,
390879421.376,
128392592893.31665,
6201902588.575354,
431842726.056]
"""
Python向下取整:直接使用int
向上取整,直接使用ceil()
"""
import math
list_bin = []
for i in list_forest_pred:
list_bin.append(bin(int(math.ceil(i))))
list_bin
['0b111011011100110010100011001',
'0b10111111000101010010101100010110001110',
'0b1101001000010110000110010100101001000',
'0b100110001010001101010111001',
'0b101100111101111000111100010001',
'0b10101100110101110111110110000001011011',
'0b100101111000110110',
'0b1011011011010000100001101001110111100',
'0b11011010111110000110010001',
'0b11111101001001111101100101110110',
'0b10110001111100101011101111',
'0b11110010100101010011100001011',
'0b101111101011110011101110010',
'0b110000000000100011001111000011100',
'0b1101010010000111110111011101111110',
'0b1110111011101011010011111',
'0b10100100011101001111111011',
'0b111100011010011011101100011',
'0b101110010000011101110011001010',
'0b100100001100100100111111111',
'0b100000011000100011100101',
'0b110110001101110001010101011011100',
'0b1010111001111111000110011101001010',
'0b100000010010001100101101010110010010',
'0b1010010001111111100111011010011111',
'0b110111001011110111100011000010010',
'0b10110111110110011001101000',
'0b11000110110111000111000100010',
'0b1101110101011100101010000101110100011',
'0b10010011000000001111011100001001000',
'0b11110000000111111011110001100011010',
'0b10111001101111101011010000010',
'0b11111101111000001100101001',
'0b100010010101001101011111111001',
'0b100101010111111000011000001110',
'0b111100011110010111001100011100101110',
'0b10110111111011001101011110000',
'0b1110001001100101100000100001000',
'0b1110011101110101000001100101110010',
'0b1011010110011000101011101110010111001',
'0b111110011001001010010100100110001',
'0b100010001001111001000000',
'0b10100000101110001010110111011010',
'0b11010100011000001010010111101011101',
'0b1000111100111000001100001',
'0b101100100011101100011110111000',
'0b111000000111010111001001101101011001',
'0b1010010010100100011010001',
'0b11011001000001011100101001100',
'0b1010000010000101101011111000010110100',
'0b110100110100011111001101101010',
'0b100010011111010000110101001001',
'0b1100011101111001111001011101',
'0b1101100101001110110110010100101101011',
'0b1010001000111011001111010000111111',
'0b10100011100101101010000000110011',
'0b10011100011010010101001100000',
'0b100111110111000000101000100100',
'0b11110110111101100110110100001',
'0b11011110010110111101110101101',
'0b100010111101010111000001000',
'0b101000001010110011111010101111',
'0b100011000111000001011001110000011010',
'0b1110100100011000101000001101110011',
'0b110010011000101111110000001000100110',
'0b1101110111110010010100001',
'0b1110111011101011000001000110100011010',
'0b1000010100000110000100111000101101',
'0b10110000101101011010000110',
'0b1000110010011010000101101010101',
'0b110110100010010010001110101011',
'0b101010110001101111001111',
'0b10101010101101001001100010100001010101',
'0b100110000110101011001111000100100100',
'0b1100011001111111100111110101101001',
'0b1100010000000101101010101001000000',
'0b101000111011001001110100111000',
'0b110001111111011001010111010',
'0b1110100111100110001100100000100010011',
'0b111100000001010101001001111110010',
'0b10101000001101001111010101001010100010',
'0b101011110000111101101101110001110011',
'0b100101100100110001000000111110000111',
'0b1010110101000110111011000001101100',
'0b101000010111110111110001010',
'0b10100110010111101010010',
'0b111111011100110110011010100000',
'0b11011101101100100001000001110',
'0b11010011011011010010110111000110100',
'0b10101100110000110101011101011',
'0b10000011100101000111010',
'0b100101000010000100111011110101100110',
'0b1011101101100100111010001011',
'0b110000011110000110111011101101000',
'0b111101010011011100111011011001',
'0b10011000001110100111101110011010011001',
'0b111011000100010111011',
'0b100101011000000110',
'0b1111101100000110101001000011001001001',
'0b100111100000110011000011101',
'0b1100001010001000111111001110',
'0b10111010010111110101010001111',
'0b100011000111000001010110111100100',
'0b1010011001001111101011011011110111000',
'0b10111110000010000110000011',
'0b10000011101011000100001010010000000',
'0b10010010101001101010010100110',
'0b1110111100110001101001101000010100001',
'0b100010011111000101111101110111100001',
'0b1111111001110111111111011101001000011',
'0b11010111111001000111101001',
'0b101111011110101111001111100100011111',
'0b1010110111001010001001100101011001',
'0b1000011000000011011100100101010110',
'0b11100110111001110100001001101100110',
'0b1111101011011001100110110100111101100',
'0b11101100000011000101100001010111000',
'0b10111010011000101100010111110',
'0b1110111100100110010110111110111111110',
'0b101110001101010011000010111111101',
'0b11001101111010110010110100111']
len(list_bin)
121
(pd.DataFrame(y_test)).to_csv('./y_test.csv')
(pd.DataFrame(list_bin)).to_csv('./list_bin.csv')
print(int("0b111011011100110010100011001",2))
124675353
Padding the binary strings to 38 bits
str1 = '0b1010101'
print(str1.zfill(10))
00b1010101
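Because zfill pads in front of the '0b' prefix (as the output above shows), the prefix has to be stripped before padding. An equivalent one-step alternative, shown only as a sketch, is to format the integer directly:

print(format(124675353, '038b'))   # 38-character zero-padded binary string, no '0b' prefix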
list_bin_Q0b = []
for str_bin in list_bin:
    list_bin_Q0b.append(str_bin[2:])
list_bin_Q0b
['111011011100110010100011001',
'10111111000101010010101100010110001110',
'1101001000010110000110010100101001000',
'100110001010001101010111001',
'101100111101111000111100010001',
'10101100110101110111110110000001011011',
'100101111000110110',
'1011011011010000100001101001110111100',
'11011010111110000110010001',
'11111101001001111101100101110110',
'10110001111100101011101111',
'11110010100101010011100001011',
'101111101011110011101110010',
'110000000000100011001111000011100',
'1101010010000111110111011101111110',
'1110111011101011010011111',
'10100100011101001111111011',
'111100011010011011101100011',
'101110010000011101110011001010',
'100100001100100100111111111',
'100000011000100011100101',
'110110001101110001010101011011100',
'1010111001111111000110011101001010',
'100000010010001100101101010110010010',
'1010010001111111100111011010011111',
'110111001011110111100011000010010',
'10110111110110011001101000',
'11000110110111000111000100010',
'1101110101011100101010000101110100011',
'10010011000000001111011100001001000',
'11110000000111111011110001100011010',
'10111001101111101011010000010',
'11111101111000001100101001',
'100010010101001101011111111001',
'100101010111111000011000001110',
'111100011110010111001100011100101110',
'10110111111011001101011110000',
'1110001001100101100000100001000',
'1110011101110101000001100101110010',
'1011010110011000101011101110010111001',
'111110011001001010010100100110001',
'100010001001111001000000',
'10100000101110001010110111011010',
'11010100011000001010010111101011101',
'1000111100111000001100001',
'101100100011101100011110111000',
'111000000111010111001001101101011001',
'1010010010100100011010001',
'11011001000001011100101001100',
'1010000010000101101011111000010110100',
'110100110100011111001101101010',
'100010011111010000110101001001',
'1100011101111001111001011101',
'1101100101001110110110010100101101011',
'1010001000111011001111010000111111',
'10100011100101101010000000110011',
'10011100011010010101001100000',
'100111110111000000101000100100',
'11110110111101100110110100001',
'11011110010110111101110101101',
'100010111101010111000001000',
'101000001010110011111010101111',
'100011000111000001011001110000011010',
'1110100100011000101000001101110011',
'110010011000101111110000001000100110',
'1101110111110010010100001',
'1110111011101011000001000110100011010',
'1000010100000110000100111000101101',
'10110000101101011010000110',
'1000110010011010000101101010101',
'110110100010010010001110101011',
'101010110001101111001111',
'10101010101101001001100010100001010101',
'100110000110101011001111000100100100',
'1100011001111111100111110101101001',
'1100010000000101101010101001000000',
'101000111011001001110100111000',
'110001111111011001010111010',
'1110100111100110001100100000100010011',
'111100000001010101001001111110010',
'10101000001101001111010101001010100010',
'101011110000111101101101110001110011',
'100101100100110001000000111110000111',
'1010110101000110111011000001101100',
'101000010111110111110001010',
'10100110010111101010010',
'111111011100110110011010100000',
'11011101101100100001000001110',
'11010011011011010010110111000110100',
'10101100110000110101011101011',
'10000011100101000111010',
'100101000010000100111011110101100110',
'1011101101100100111010001011',
'110000011110000110111011101101000',
'111101010011011100111011011001',
'10011000001110100111101110011010011001',
'111011000100010111011',
'100101011000000110',
'1111101100000110101001000011001001001',
'100111100000110011000011101',
'1100001010001000111111001110',
'10111010010111110101010001111',
'100011000111000001010110111100100',
'1010011001001111101011011011110111000',
'10111110000010000110000011',
'10000011101011000100001010010000000',
'10010010101001101010010100110',
'1110111100110001101001101000010100001',
'100010011111000101111101110111100001',
'1111111001110111111111011101001000011',
'11010111111001000111101001',
'101111011110101111001111100100011111',
'1010110111001010001001100101011001',
'1000011000000011011100100101010110',
'11100110111001110100001001101100110',
'1111101011011001100110110100111101100',
'11101100000011000101100001010111000',
'10111010011000101100010111110',
'1110111100100110010110111110111111110',
'101110001101010011000010111111101',
'11001101111010110010110100111']
list_bin_BQ = []
for str_bin in list_bin_Q0b:
    list_bin_BQ.append(str_bin.zfill(38))
list_bin_BQ
['00000000000111011011100110010100011001',
'10111111000101010010101100010110001110',
'01101001000010110000110010100101001000',
'00000000000100110001010001101010111001',
'00000000101100111101111000111100010001',
'10101100110101110111110110000001011011',
'00000000000000000000100101111000110110',
'01011011011010000100001101001110111100',
'00000000000011011010111110000110010001',
'00000011111101001001111101100101110110',
'00000000000010110001111100101011101111',
'00000000011110010100101010011100001011',
'00000000000101111101011110011101110010',
'00000110000000000100011001111000011100',
'00001101010010000111110111011101111110',
'00000000000001110111011101011010011111',
'00000000000010100100011101001111111011',
'00000000000111100011010011011101100011',
'00000000101110010000011101110011001010',
'00000000000100100001100100100111111111',
'00000000000000100000011000100011100101',
'00000110110001101110001010101011011100',
'00001010111001111111000110011101001010',
'00100000010010001100101101010110010010',
'00001010010001111111100111011010011111',
'00000110111001011110111100011000010010',
'00000000000010110111110110011001101000',
'00000000011000110110111000111000100010',
'01101110101011100101010000101110100011',
'00010010011000000001111011100001001000',
'00011110000000111111011110001100011010',
'00000000010111001101111101011010000010',
'00000000000011111101111000001100101001',
'00000000100010010101001101011111111001',
'00000000100101010111111000011000001110',
'00111100011110010111001100011100101110',
'00000000010110111111011001101011110000',
'00000001110001001100101100000100001000',
'00001110011101110101000001100101110010',
'01011010110011000101011101110010111001',
'00000111110011001001010010100100110001',
'00000000000000100010001001111001000000',
'00000010100000101110001010110111011010',
'00011010100011000001010010111101011101',
'00000000000001000111100111000001100001',
'00000000101100100011101100011110111000',
'00111000000111010111001001101101011001',
'00000000000001010010010100100011010001',
'00000000011011001000001011100101001100',
'01010000010000101101011111000010110100',
'00000000110100110100011111001101101010',
'00000000100010011111010000110101001001',
'00000000001100011101111001111001011101',
'01101100101001110110110010100101101011',
'00001010001000111011001111010000111111',
'00000010100011100101101010000000110011',
'00000000010011100011010010101001100000',
'00000000100111110111000000101000100100',
'00000000011110110111101100110110100001',
'00000000011011110010110111101110101101',
'00000000000100010111101010111000001000',
'00000000101000001010110011111010101111',
'00100011000111000001011001110000011010',
'00001110100100011000101000001101110011',
'00110010011000101111110000001000100110',
'00000000000001101110111110010010100001',
'01110111011101011000001000110100011010',
'00001000010100000110000100111000101101',
'00000000000010110000101101011010000110',
'00000001000110010011010000101101010101',
'00000000110110100010010010001110101011',
'00000000000000101010110001101111001111',
'10101010101101001001100010100001010101',
'00100110000110101011001111000100100100',
'00001100011001111111100111110101101001',
'00001100010000000101101010101001000000',
'00000000101000111011001001110100111000',
'00000000000110001111111011001010111010',
'01110100111100110001100100000100010011',
'00000111100000001010101001001111110010',
'10101000001101001111010101001010100010',
'00101011110000111101101101110001110011',
'00100101100100110001000000111110000111',
'00001010110101000110111011000001101100',
'00000000000101000010111110111110001010',
'00000000000000010100110010111101010010',
'00000000111111011100110110011010100000',
'00000000011011101101100100001000001110',
'00011010011011011010010110111000110100',
'00000000010101100110000110101011101011',
'00000000000000010000011100101000111010',
'00100101000010000100111011110101100110',
'00000000001011101101100100111010001011',
'00000110000011110000110111011101101000',
'00000000111101010011011100111011011001',
'10011000001110100111101110011010011001',
'00000000000000000111011000100010111011',
'00000000000000000000100101011000000110',
'01111101100000110101001000011001001001',
'00000000000100111100000110011000011101',
'00000000001100001010001000111111001110',
'00000000010111010010111110101010001111',
'00000100011000111000001010110111100100',
'01010011001001111101011011011110111000',
'00000000000010111110000010000110000011',
'00010000011101011000100001010010000000',
'00000000010010010101001101010010100110',
'01110111100110001101001101000010100001',
'00100010011111000101111101110111100001',
'01111111001110111111111011101001000011',
'00000000000011010111111001000111101001',
'00101111011110101111001111100100011111',
'00001010110111001010001001100101011001',
'00001000011000000011011100100101010110',
'00011100110111001110100001001101100110',
'01111101011011001100110110100111101100',
'00011101100000011000101100001010111000',
'00000000010111010011000101100010111110',
'01110111100100110010110111110111111110',
'00000101110001101010011000010111111101',
'00000000011001101111010110010110100111']
n = 0
for i in list_bin_BQ:
    if n < len(list_bin_BQ) - 1:
        with open("./output.csv", 'a', encoding='utf8') as name:
            name.write(i + '\n')
        n += 1
    else:
        with open("./output.csv", 'a', encoding='utf8') as name:
            name.write(i)
        n += 1
list_y_test = list(y_test)
list_y_test_bin = []
for i in list_y_test:
    list_y_test_bin.append(bin(i))
list_y_test_bin
['0b10000000000000000',
'0b11001100000010000101010000100100000001',
'0b10000000000100000000000000000000000000',
'0b0',
'0b110000000000000000000000000000',
'0b10110000100000000000100010000000000001',
'0b0',
'0b1101000100000000001000000100000000000',
'0b0',
'0b1101001000100001000010100000000',
'0b10000000000000000',
'0b10010000000010000100100000000',
'0b10000000000000000',
'0b10000000000100000000000000000',
'0b101000010000000100000000100000010101',
'0b1000000000000000000000',
'0b100000000000000000000000000',
'0b100000000000000000000',
'0b110000000000000000000000000000',
'0b100000100000000000000000000',
'0b100000000000000000000',
'0b1000010000100000100100000000000000',
'0b0',
'0b101000000100000001100000100000010010',
'0b10101000100000000100001000000',
'0b1000000000000100000000100000000',
'0b1000000000000000000000',
'0b11000000000000000100000000000',
'0b10000000000100010000110000000000001000',
'0b10000100001000000000000000000000000',
'0b101000000100000001100000100000000010',
'0b110000000010000000001000000000',
'0b100000100000000000000000000',
'0b1000000000',
'0b100000000000000000000000000000',
'0b1000000001000000001000010000000110000',
'0b11010000000000000100000000000',
'0b10110000101000100001000000001000',
'0b1100000010000000000100001000',
'0b110010000000000000000000000000001000',
'0b100010100000000000000100000000000',
'0b100000000000000000000000000',
'0b100000010000001000001100010',
'0b10000000000000000000000000100000000',
'0b0',
'0b110000000000000000000000000000',
'0b100010100000100001010000101000000000',
'0b0',
'0b100000001010000000000100000000',
'0b10000000001010000000101000000001000000',
'0b1000100000',
'0b1000100100',
'0b10000000000000001000100000000',
'0b10000000000100010000110000000000001000',
'0b100000100000000000',
'0b1000000001000100000000100101000',
'0b100000000100000000000000000000',
'0b110000000010000000001000000000',
'0b1000000000000000000000000000',
'0b100000000000000000100000000000',
'0b10000000000000000',
'0b110000000010000000010000000000',
'0b10000000010010001100000010100',
'0b0',
'0b100000000011000000100010000000001001',
'0b1000000000000000000000',
'0b10000000000100000000000001000000001000',
'0b10000000000100000000000000000',
'0b100000000000000000000000000',
'0b11000011000000000100000011100',
'0b1000100010010100110000',
'0b0',
'0b11001100000010000101010000100100000010',
'0b101000000000000100000100100000000100',
'0b101010000100000000010',
'0b1000000000000100000000100000000',
'0b1100011001010000100010011100',
'0b10000000000000',
'0b10000000000100010000000001000000001000',
'0b100000000001010000100000000000',
'0b10110000100000000000000010000001000011',
'0b10000101010000100100000010',
'0b101000000100000001100000100000010010',
'0b101000000000000100000000000100000',
'0b1000000000000000000000',
'0b0',
'0b100000111010000011000000000',
'0b110100000000000000000000',
'0b101000010000000100001000100000010101',
'0b100000000100000000000000000000',
'0b0',
'0b100000000010000000100010000100001001',
'0b1010000000000100000000',
'0b10000000001000000010000100010000',
'0b1000',
'0b10110000000000000000100000000000000000',
'0b10000000000000000000',
'0b0',
'0b10000000000000000000000000000000000010',
'0b100000100000000000000000000',
'0b10000000000000000000000000000',
'0b100000000100000000000000000000',
'0b1000000000000100000000100000000',
'0b10100000010000100000000001001000010000',
'0b0',
'0b10000100000000001010000100000000000',
'0b10010000100000000000000000000',
'0b10110000100000000000100000000000000001',
'0b10001000',
'0b10000000000000000000000000000000000000',
'0b100000000000',
'0b1001000010001001000000000100000000100',
'0b100000000000000001000000',
'0b1000010000100000000100000000000000',
'0b101000010000000100001000100000010101',
'0b10010000100000000000100001010000100010',
'0b100000000000100000000001001000000001',
'0b11100000101000000010000000000',
'0b10000001001010000000001000001000001100',
'0b1000010000000000000000000',
'0b11110000101000000000000000000']
list_y_test_bin_Q0b = []
for str_bin in list_y_test_bin:
    list_y_test_bin_Q0b.append(str_bin[2:])
list_y_test_bin_Q0b
['10000000000000000',
'11001100000010000101010000100100000001',
'10000000000100000000000000000000000000',
'0',
'110000000000000000000000000000',
'10110000100000000000100010000000000001',
'0',
'1101000100000000001000000100000000000',
'0',
'1101001000100001000010100000000',
'10000000000000000',
'10010000000010000100100000000',
'10000000000000000',
'10000000000100000000000000000',
'101000010000000100000000100000010101',
'1000000000000000000000',
'100000000000000000000000000',
'100000000000000000000',
'110000000000000000000000000000',
'100000100000000000000000000',
'100000000000000000000',
'1000010000100000100100000000000000',
'0',
'101000000100000001100000100000010010',
'10101000100000000100001000000',
'1000000000000100000000100000000',
'1000000000000000000000',
'11000000000000000100000000000',
'10000000000100010000110000000000001000',
'10000100001000000000000000000000000',
'101000000100000001100000100000000010',
'110000000010000000001000000000',
'100000100000000000000000000',
'1000000000',
'100000000000000000000000000000',
'1000000001000000001000010000000110000',
'11010000000000000100000000000',
'10110000101000100001000000001000',
'1100000010000000000100001000',
'110010000000000000000000000000001000',
'100010100000000000000100000000000',
'100000000000000000000000000',
'100000010000001000001100010',
'10000000000000000000000000100000000',
'0',
'110000000000000000000000000000',
'100010100000100001010000101000000000',
'0',
'100000001010000000000100000000',
'10000000001010000000101000000001000000',
'1000100000',
'1000100100',
'10000000000000001000100000000',
'10000000000100010000110000000000001000',
'100000100000000000',
'1000000001000100000000100101000',
'100000000100000000000000000000',
'110000000010000000001000000000',
'1000000000000000000000000000',
'100000000000000000100000000000',
'10000000000000000',
'110000000010000000010000000000',
'10000000010010001100000010100',
'0',
'100000000011000000100010000000001001',
'1000000000000000000000',
'10000000000100000000000001000000001000',
'10000000000100000000000000000',
'100000000000000000000000000',
'11000011000000000100000011100',
'1000100010010100110000',
'0',
'11001100000010000101010000100100000010',
'101000000000000100000100100000000100',
'101010000100000000010',
'1000000000000100000000100000000',
'1100011001010000100010011100',
'10000000000000',
'10000000000100010000000001000000001000',
'100000000001010000100000000000',
'10110000100000000000000010000001000011',
'10000101010000100100000010',
'101000000100000001100000100000010010',
'101000000000000100000000000100000',
'1000000000000000000000',
'0',
'100000111010000011000000000',
'110100000000000000000000',
'101000010000000100001000100000010101',
'100000000100000000000000000000',
'0',
'100000000010000000100010000100001001',
'1010000000000100000000',
'10000000001000000010000100010000',
'1000',
'10110000000000000000100000000000000000',
'10000000000000000000',
'0',
'10000000000000000000000000000000000010',
'100000100000000000000000000',
'10000000000000000000000000000',
'100000000100000000000000000000',
'1000000000000100000000100000000',
'10100000010000100000000001001000010000',
'0',
'10000100000000001010000100000000000',
'10010000100000000000000000000',
'10110000100000000000100000000000000001',
'10001000',
'10000000000000000000000000000000000000',
'100000000000',
'1001000010001001000000000100000000100',
'100000000000000001000000',
'1000010000100000000100000000000000',
'101000010000000100001000100000010101',
'10010000100000000000100001010000100010',
'100000000000100000000001001000000001',
'11100000101000000010000000000',
'10000001001010000000001000001000001100',
'1000010000000000000000000',
'11110000101000000000000000000']
list_bin_y_test_BQ = []
for str_bin in list_y_test_bin_Q0b:
    list_bin_y_test_BQ.append(str_bin.zfill(38))
list_bin_y_test_BQ
['00000000000000000000010000000000000000',
'11001100000010000101010000100100000001',
'10000000000100000000000000000000000000',
'00000000000000000000000000000000000000',
'00000000110000000000000000000000000000',
'10110000100000000000100010000000000001',
'00000000000000000000000000000000000000',
'01101000100000000001000000100000000000',
'00000000000000000000000000000000000000',
'00000001101001000100001000010100000000',
'00000000000000000000010000000000000000',
'00000000010010000000010000100100000000',
'00000000000000000000010000000000000000',
'00000000010000000000100000000000000000',
'00101000010000000100000000100000010101',
'00000000000000001000000000000000000000',
'00000000000100000000000000000000000000',
'00000000000000000100000000000000000000',
'00000000110000000000000000000000000000',
'00000000000100000100000000000000000000',
'00000000000000000100000000000000000000',
'00001000010000100000100100000000000000',
'00000000000000000000000000000000000000',
'00101000000100000001100000100000010010',
'00000000010101000100000000100001000000',
'00000001000000000000100000000100000000',
'00000000000000001000000000000000000000',
'00000000011000000000000000100000000000',
'10000000000100010000110000000000001000',
'00010000100001000000000000000000000000',
'00101000000100000001100000100000000010',
'00000000110000000010000000001000000000',
'00000000000100000100000000000000000000',
'00000000000000000000000000001000000000',
'00000000100000000000000000000000000000',
'01000000001000000001000010000000110000',
'00000000011010000000000000100000000000',
'00000010110000101000100001000000001000',
'00000000001100000010000000000100001000',
'00110010000000000000000000000000001000',
'00000100010100000000000000100000000000',
'00000000000100000000000000000000000000',
'00000000000100000010000001000001100010',
'00010000000000000000000000000100000000',
'00000000000000000000000000000000000000',
'00000000110000000000000000000000000000',
'00100010100000100001010000101000000000',
'00000000000000000000000000000000000000',
'00000000100000001010000000000100000000',
'10000000001010000000101000000001000000',
'00000000000000000000000000001000100000',
'00000000000000000000000000001000100100',
'00000000010000000000000001000100000000',
'10000000000100010000110000000000001000',
'00000000000000000000100000100000000000',
'00000001000000001000100000000100101000',
'00000000100000000100000000000000000000',
'00000000110000000010000000001000000000',
'00000000001000000000000000000000000000',
'00000000100000000000000000100000000000',
'00000000000000000000010000000000000000',
'00000000110000000010000000010000000000',
'00000000010000000010010001100000010100',
'00000000000000000000000000000000000000',
'00100000000011000000100010000000001001',
'00000000000000001000000000000000000000',
'10000000000100000000000001000000001000',
'00000000010000000000100000000000000000',
'00000000000100000000000000000000000000',
'00000000011000011000000000100000011100',
'00000000000000001000100010010100110000',
'00000000000000000000000000000000000000',
'11001100000010000101010000100100000010',
'00101000000000000100000100100000000100',
'00000000000000000101010000100000000010',
'00000001000000000000100000000100000000',
'00000000001100011001010000100010011100',
'00000000000000000000000010000000000000',
'10000000000100010000000001000000001000',
'00000000100000000001010000100000000000',
'10110000100000000000000010000001000011',
'00000000000010000101010000100100000010',
'00101000000100000001100000100000010010',
'00000101000000000000100000000000100000',
'00000000000000001000000000000000000000',
'00000000000000000000000000000000000000',
'00000000000100000111010000011000000000',
'00000000000000110100000000000000000000',
'00101000010000000100001000100000010101',
'00000000100000000100000000000000000000',
'00000000000000000000000000000000000000',
'00100000000010000000100010000100001001',
'00000000000000001010000000000100000000',
'00000010000000001000000010000100010000',
'00000000000000000000000000000000001000',
'10110000000000000000100000000000000000',
'00000000000000000010000000000000000000',
'00000000000000000000000000000000000000',
'10000000000000000000000000000000000010',
'00000000000100000100000000000000000000',
'00000000010000000000000000000000000000',
'00000000100000000100000000000000000000',
'00000001000000000000100000000100000000',
'10100000010000100000000001001000010000',
'00000000000000000000000000000000000000',
'00010000100000000001010000100000000000',
'00000000010010000100000000000000000000',
'10110000100000000000100000000000000001',
'00000000000000000000000000000010001000',
'10000000000000000000000000000000000000',
'00000000000000000000000000100000000000',
'01001000010001001000000000100000000100',
'00000000000000100000000000000001000000',
'00001000010000100000000100000000000000',
'00101000010000000100001000100000010101',
'10010000100000000000100001010000100010',
'00100000000000100000000001001000000001',
'00000000011100000101000000010000000000',
'10000001001010000000001000001000001100',
'00000000000001000010000000000000000000',
'00000000011110000101000000000000000000']
n = 0
for i in list_bin_y_test_BQ:
    if n < len(list_bin_y_test_BQ) - 1:
        with open("./output_y_test.csv", 'a', encoding='utf8') as name:
            name.write(i + '\n')
        n += 1
    else:
        with open("./output_y_test.csv", 'a', encoding='utf8') as name:
            name.write(i)
        n += 1
Measuring how closely the predicted strings match the true strings
Idea: if the predicted binary strings can be compared directly with the test-set binary strings, that comparison might serve as an evaluation.
import difflib
def string_similar(s1, s2):
    return difflib.SequenceMatcher(None, s1, s2).quick_ratio()
xiangsidu_list = []
for i in range(0, len(list_bin_BQ)):
    for j in range(0, len(list_bin_y_test_BQ)):
        xiangsidu_list.append(string_similar(list_bin_BQ[i], list_bin_y_test_BQ[j]))
        print(string_similar(list_bin_BQ[i], list_bin_y_test_BQ[j]))
# The per-pair printout is omitted here.
np.mean(xiangsidu_list)
0.6696695293318331
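Note that the loop above averages the similarity over all 121 × 121 prediction/truth pairs. If the intent is to score each prediction against its own ground truth, a pairwise variant would be (a sketch of that alternative, not what produced the 0.67 above):

# Compare the i-th predicted string only with the i-th true string.
pairwise = [string_similar(p, t) for p, t in zip(list_bin_BQ, list_bin_y_test_BQ)]
print(np.mean(pairwise))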
Computing the Hamming loss
print(type(list_bin_BQ[0]))
print(type(list_bin_y_test_BQ[0]))
<class 'str'>
<class 'str'>
from sklearn.metrics import hamming_loss
hamming_loss(y_true=[1,2,3,4], y_pred=[2,2,3,4])
0.25
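For the multi-label case used further below, hamming_loss also accepts 0/1 indicator arrays and returns the fraction of label positions that disagree; a toy check (two samples, three labels, one mismatch out of six positions gives 1/6):

print(hamming_loss(y_true=np.array([[1, 0, 1], [0, 1, 0]]),
                   y_pred=np.array([[1, 0, 0], [0, 1, 0]])))   # -> 0.1666...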
s = 'xiaoyao'   # avoid shadowing the built-in str
print(list(s))
['x', 'i', 'a', 'o', 'y', 'a', 'o']
y_pred_list = [list(s) for s in list_bin_BQ]
y_test_list = [list(s) for s in list_bin_y_test_BQ]
print(type(y_pred_list), type(y_test_list))
print(y_pred_list[0], y_test_list[0])
<class 'list'> <class 'list'>
['0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '1', '1', '1', '0', '1', '1', '0', '1', '1', '1', '0', '0', '1', '1', '0', '0', '1', '0', '1', '0', '0', '0', '1', '1', '0', '0', '1'] ['0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '1', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0']
pred_array = np.array(y_pred_list)
test_array = np.array(y_test_list)
print((pred_array).shape,(test_array).shape)
(121, 38) (121, 38)
type(pred_array[0][0])
numpy.str_
int_pred_array = pred_array.astype(np.int32)
int_test_array = test_array.astype(np.int32)
hamming_loss(y_true=int_test_array, y_pred=int_pred_array)
0.4275772074815137
Computing the Jaccard similarity coefficient
from sklearn.metrics import jaccard_score
print(jaccard_score(int_test_array, int_pred_array, average='macro'))
print(jaccard_score(int_test_array, int_pred_array, average='samples'))
print(jaccard_score(int_test_array, int_pred_array, average=None))
0.11170386238918098
0.09724124663042615
[0.29411765 0.05555556 0.26470588 0.28 0.23684211 0.05882353
0.05263158 0.07142857 0.21153846 0.26984127 0.14583333 0.11940299
0.06557377 0.05084746 0.046875 0.06153846 0.09677419 0.15
0.07142857 0.13513514 0.16438356 0.07894737 0.02666667 0.03076923
0.0877193 0.05 0.18181818 0.03030303 0.11267606 0.14492754
0.01754386 0.03389831 0.08823529 0.11111111 0.08974359 0.1372549
0.08955224 0.03030303]
Computing the F1 score
from sklearn.metrics import f1_score
print(f1_score(int_test_array, int_pred_array, average='macro'))
print(f1_score(int_test_array, int_pred_array, average='samples'))
print(f1_score(int_test_array, int_pred_array, average=None))
0.19289818577352147
0.16674516791123403
[0.45454545 0.10526316 0.41860465 0.4375 0.38297872 0.11111111
0.1 0.13333333 0.34920635 0.425 0.25454545 0.21333333
0.12307692 0.09677419 0.08955224 0.11594203 0.17647059 0.26086957
0.13333333 0.23809524 0.28235294 0.14634146 0.05194805 0.05970149
0.16129032 0.0952381 0.30769231 0.05882353 0.20253165 0.25316456
0.03448276 0.06557377 0.16216216 0.2 0.16470588 0.24137931
0.16438356 0.05882353]