Experiment Dates
2021.10.02 — 2021.10.06
Experiment Environment
KNN
The core of the KNN experiment is computing the distance matrix. The computation can be done in three ways: with two loops, with one loop, or fully vectorized with no loops.
- Two loops:
for i in range(num_test):
    for j in range(num_train):
        dists[i][j] = np.sqrt(np.sum(np.square(X[i, :] - self.X_train[j, :])))
- The resulting distance matrix holds the distance from each test example to every training example: while iterating over the test set, the feature values of the test example are subtracted from the corresponding features of a training example and combined into an L2 distance. In the end, dists[i][j] is the L2 distance between the i-th test example and the j-th training example.
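As a quick sanity check, a minimal sketch with two made-up vectors (not the CIFAR-10 data) showing that the expression inside the loop is just the Euclidean norm of the difference:

import numpy as np

a = np.array([1.0, 2.0, 3.0])    # stand-in for one test example X[i, :]
b = np.array([4.0, 6.0, 3.0])    # stand-in for one training example self.X_train[j, :]

# same formula as inside the two-loop version
print(np.sqrt(np.sum(np.square(a - b))))   # 5.0
print(np.linalg.norm(a - b))               # 5.0, identical result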
- One loop:
for i in range(num_test):
    dists[i] = np.sqrt(np.sum(np.square(X[i, :] - self.X_train), axis=1))
- Only one loop, over the test examples, is used here. A property of numpy arrays is that when a vector is added to or subtracted from a matrix, the vector is automatically broadcast to the matrix's shape and the operation is applied row by row. The test example here is a row vector, so subtracting X_train subtracts it from every row of X_train at once; squaring, summing along axis 1, and taking the square root then gives all the L2 distances for that test example.
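A minimal broadcasting sketch with made-up shapes (not the actual data) showing a row vector being subtracted from every row of a matrix at once:

import numpy as np

X_train = np.arange(12, dtype=float).reshape(4, 3)  # 4 "training" rows, 3 features
x_test = np.array([1.0, 1.0, 1.0])                  # one "test" example, shape (3,)

diff = x_test - X_train        # x_test is broadcast against every row of X_train
print(diff.shape)              # (4, 3)

dists_row = np.sqrt(np.sum(np.square(diff), axis=1))
print(dists_row.shape)         # (4,) -- one distance per training row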
- No loops:
a = np.sum(np.square(X), axis=1, keepdims=True)
b = np.sum(np.square(self.X_train), axis=1)
c = np.multiply(np.dot(X, self.X_train.T), -2)
dists = np.add(a, b)
dists = np.add(dists, c)
dists = np.sqrt(dists)
- The trick here is the algebraic identity $(x-y)^2 = x^2 + y^2 - 2xy$. To obtain the left-hand side we compute the three right-hand terms $x^2$, $y^2$ and $-2xy$ separately; these correspond to a, b and c above. Note that a is a column vector and b is a row vector, so adding them broadcasts into a full matrix.
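A small self-contained check on random data (not the assignment's X / X_train) that the expansion really reproduces the pairwise L2 distances:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 7))        # pretend test set
X_train = rng.normal(size=(9, 7))  # pretend training set

# expansion: ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y
a = np.sum(np.square(X), axis=1, keepdims=True)   # (5, 1)
b = np.sum(np.square(X_train), axis=1)            # (9,)
c = -2.0 * X.dot(X_train.T)                       # (5, 9)
dists_fast = np.sqrt(a + b + c)

# brute-force pairwise distances for comparison
dists_slow = np.sqrt(((X[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2))
print(np.allclose(dists_fast, dists_slow))        # True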
- Timing of the three versions:
- Two loop version took 29.890179 seconds
- One loop version took 57.145778 seconds
- No loop version took 0.229963 seconds
Completing predict
closest_y = self.y_train[np.argsort(dists[i,:])[:k]]
y_pred[i] = np.argmax(np.bincount(closest_y))
- np.argsort gives the indices of the k smallest distances; np.bincount counts how many times each label occurs among those k nearest neighbors; np.argmax then returns the majority label as the prediction.
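A tiny sketch of the argsort + bincount voting step, with made-up distances and labels:

import numpy as np

dists_i = np.array([0.9, 0.1, 0.5, 0.3, 0.7])  # distances from one test point
y_train = np.array([2,   0,   1,   0,   2  ])  # labels of the 5 training points
k = 3

closest_y = y_train[np.argsort(dists_i)[:k]]   # labels of the 3 nearest neighbors
print(closest_y)                               # [0 0 1]
print(np.bincount(closest_y))                  # votes per label: [2 1]
print(np.argmax(np.bincount(closest_y)))       # majority label: 0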
Cross-validation
X_train_folds = np.array_split(X_train , num_folds , axis = 0)
y_train_folds = np.array_split(y_train , num_folds , axis = 0)
for k in k_choices:
accuracies = []
for i in range(num_folds):
X_train_cv = np.vstack(X_train_folds[0:i] + X_train_folds[i+1:])
y_train_cv = np.hstack(y_train_folds[0:i] + y_train_folds[i+1:])
X_valid_cv = X_train_folds[i]
y_valid_cv = y_train_folds[i]
classifier.train(X_train_cv , y_train_cv)
dists = classifier.compute_distances_no_loops(X_valid_cv)
y_test_pred = classifier.predict_labels(dists , k)
num_correct = np.sum(y_test_pred == y_valid_cv)
accuracy = float(num_correct) / y_valid_cv.shape[0]
accuracies.append(accuracy)
k_to_accuracies[k] = accuracies
SVM
First, derive the gradient of the SVM loss function with respect to w.
For each sample, the score of the correct class should be at least $\Delta$ higher than the score of each incorrect class; in practice $\Delta$ is usually taken to be 1.
- If the correct class's score is higher than an incorrect class's score by at least 1, that prediction is considered correct.
- Loss function: $L_i = \sum_{j \ne y_i} \max(0,\ w_j^T x_i - w_{y_i}^T x_i + \Delta)$
- $w_{y_i}^T x_i$ is the score of $x_i$ for class $y_i$, i.e. the score of the correct class.
- $w_j^T x_i$ is the score of $x_i$ for class $j$, i.e. the score of an incorrect class.
- $\Delta$ is the margin by which the correct class's score must exceed an incorrect class's score for that class to count as correctly separated; correctly separated classes contribute 0 loss.
- $\Delta$ is usually set to 1, because a change in $\Delta$ can be absorbed into a rescaling of $w$ (a small worked example of the loss follows).
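A small worked example with made-up scores: suppose one sample's class scores are [3.0, 5.0, -2.0], the correct class is 0, and $\Delta = 1$:

import numpy as np

scores = np.array([3.0, 5.0, -2.0])   # hypothetical scores for 3 classes
y = 0                                 # correct class
delta = 1.0

margins = np.maximum(0, scores - scores[y] + delta)
margins[y] = 0                        # the correct class never contributes
print(margins)                        # [0. 3. 0.]
print(margins.sum())                  # L_i = 3.0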
- Gradient
  - Differentiating the loss function directly actually splits into two parts: one part differentiates the 0 branch, the other differentiates $\sum_{j \ne y_i}(w_j^T x_i - w_{y_i}^T x_i + \Delta)$, which is why an indicator function shows up in the gradient.
- Gradient with respect to the correct class:
  - $\nabla_{w_{y_i}} L_i = -\left(\sum_{j \ne y_i} 1(w_j^T x_i - w_{y_i}^T x_i + \Delta > 0)\right) x_i$
  - Note: a summation is needed here, because every $j$ shares the same $y_i$.
  - If a class is correctly separated, the > 0 condition in the indicator is not met, so that term's gradient is 0.
  - If the margin is violated, the condition is met and that term's gradient is $-x_i$.
- Gradient with respect to an incorrect class:
  - $\nabla_{w_j} L_i = 1(w_j^T x_i - w_{y_i}^T x_i + \Delta > 0)\, x_i$
  - Note: no summation here, since each $j$ is a different class.
  - If the margin condition is met, the indicator is 0 and the term's gradient is 0.
  - If the margin is violated, the indicator is 1 and the term's gradient is $x_i$ (both formulas are checked numerically below).
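To double-check the two formulas above, here is a minimal numerical gradient check on one made-up sample (random W and x, toy sizes; not the assignment's data). The analytic gradient built from the indicator expressions should match central finite differences:

import numpy as np

rng = np.random.default_rng(1)
D, C = 6, 4                       # toy feature dimension and number of classes
W = rng.normal(size=(D, C))
x = rng.normal(size=D)
y = 2                             # correct class
delta = 1.0

def svm_loss_single(W):
    scores = x.dot(W)
    margins = np.maximum(0, scores - scores[y] + delta)
    margins[y] = 0
    return margins.sum()

# analytic gradient from the derived formulas
scores = x.dot(W)
indicator = ((scores - scores[y] + delta) > 0).astype(float)
indicator[y] = 0
dW = np.outer(x, indicator)        # column j: 1(margin_j > 0) * x
dW[:, y] = -indicator.sum() * x    # correct class: -(number of violations) * x

# numerical gradient via central differences
num_dW = np.zeros_like(W)
h = 1e-5
for i in range(D):
    for j in range(C):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += h
        Wm[i, j] -= h
        num_dW[i, j] = (svm_loss_single(Wp) - svm_loss_single(Wm)) / (2 * h)

print(np.allclose(dW, num_dW, atol=1e-4))   # True (assuming no margin sits exactly at the hinge kink)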
- The corresponding code for the loss and gradient. Only margin-violating classes need to be considered, because a correctly separated class contributes 0 to both the loss and the gradient.
dW = np.zeros(W.shape)
num_classes = W.shape[1]
num_train = X.shape[0]
loss = 0.0
for i in range(num_train):
scores = X[i].dot(W)
correct_class_score = scores[y[i]]
for j in range(num_classes):
if j == y[i]:
continue
margin = scores[j] - correct_class_score + 1
if margin > 0:
loss += margin
dW[:,y[i]] += -X[i]
dW[:,j] += X[i]
Completing svm_loss_naive
dW /= num_train
dW += reg * W
Completing svm_loss_vectorized
- First compute the loss:
scores = np.dot(X,W)
num_train = X.shape[0]
correct_class_score = scores[np.arange(num_train),y]
correct_class_score = np.reshape(correct_class_score,(num_train,-1))
margins = np.maximum(0, scores - correct_class_score + 1)
margins[np.arange(num_train),y] = 0
loss += np.sum(margins) / num_train
loss += reg * np.sum(W * W)
- Things to note:
  - correct_class_score, the scores of the correct classes, starts out as a 1-D row of length N; it has to be reshaped into a column vector so that the subtraction broadcasts per row, since each row of scores holds one sample's scores for all classes (see the sketch after this list).
  - When computing margins, remember that the correct class must contribute 0 to the loss.
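A minimal sketch with toy shapes of why the reshape to a column vector matters (subtracting the length-N vector directly would broadcast along the wrong axis or fail):

import numpy as np

scores = np.array([[3.0, 5.0, 2.0],
                   [1.0, 0.0, 4.0]])   # N = 2 samples, 3 classes
y = np.array([0, 2])

correct = scores[np.arange(2), y]      # shape (2,): [3., 4.]
correct_col = correct.reshape(2, 1)    # shape (2, 1): one correct score per row

margins = np.maximum(0, scores - correct_col + 1)
margins[np.arange(2), y] = 0           # correct classes contribute 0
print(margins)
# [[0. 3. 0.]
#  [0. 0. 0.]]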
- Compute the gradient:
margins[margins > 0] = 1
row_sum = np.sum(margins,axis=1)
margins[np.arange(num_train),y] = -row_sum
dW = np.dot(X.T,margins)
dW /= num_train
dW += reg * W
- First set every entry of margins that is greater than 0 to 1, because every margin-violating class adds one more contribution to dW[:,y[i]].
- For every margin > 0, the correct class's gradient must subtract one X[i], hence margins[np.arange(num_train),y] = -row_sum.
- For the margin-violating (incorrect) classes, the gradient simply adds X[i].
- The key step is dW = np.dot(X.T,margins):
  - When X.T is multiplied by margins, each entry of margins can be read as a contribution weight.
  - If margins[i][j] = k, then dW[:,j] += k * X[i]; this identity is the core of the vectorized computation (verified in the sketch below).
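A quick check on random toy matrices that np.dot(X.T, margins) is exactly the loop that adds margins[i][j] * X[i] into column j of dW:

import numpy as np

rng = np.random.default_rng(2)
N, D, C = 4, 3, 5                      # toy sizes
X = rng.normal(size=(N, D))
M = rng.normal(size=(N, C))            # stands in for the 0/1/-row_sum margins matrix

dW_loop = np.zeros((D, C))
for i in range(N):
    for j in range(C):
        dW_loop[:, j] += M[i, j] * X[i]   # "contribution weight" interpretation

dW_vec = np.dot(X.T, M)
print(np.allclose(dW_loop, dW_vec))       # True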
Hyperparameter tuning (cross-validation)
- Just a simple grid search: iterate over the candidates, train, predict, and keep the best parameters.
- Result: best validation accuracy achieved during cross-validation: 0.398000
- linear SVM on raw pixels final test set accuracy: 0.377000
for learning_rate in learning_rates:
for regularization_strength in regularization_strengths:
svm = LinearSVM()
loss_hist = svm.train(X_train, y_train, learning_rate=learning_rate, reg=regularization_strength,
num_iters=1500, verbose=True)
y_train_pred = svm.predict(X_train)
y_val_pred = svm.predict(X_val)
y_train_acc = np.mean(y_train_pred==y_train)
y_val_acc = np.mean(y_val_pred==y_val)
results[(learning_rate,regularization_strength)] = [y_train_acc, y_val_acc]
if y_val_acc > best_val:
best_val = y_val_acc
best_svm = svm
softmax
First, derive the gradient of the softmax loss function with respect to w.
- Softmax: $S_i = \dfrac{e^{f_{y_i}}}{\sum_j e^{f_j}}$
- Softmax loss: $L_i = -\log \dfrac{e^{f_{y_i}}}{\sum_j e^{f_j}}$, where $f_j = X(i)W(j)$. The higher the correct-class score $f_{y_i}$, the larger its share of the total, and the smaller $L_i$.
- Gradient
  - Each X[i] can contribute to the parameters of every class, so we need to differentiate with respect to every $W(j)$.
  - $\dfrac{\partial L_i}{\partial W_j} = \dfrac{\partial L_i}{\partial f_j}\dfrac{\partial f_j}{\partial W_j}$. In the first factor of the chain rule the numerator carries index $i$ (through $f_{y_i}$) and the denominator carries index $j$, so that factor is derived separately for $i = j$ and $i \ne j$ (i.e. whether $j$ is the correct class $y_i$).
- i = j (writing $\Sigma = \sum_j e^{f_j}$ for the denominator):
  - $\dfrac{\partial L_i}{\partial f_j} = -\dfrac{\Sigma}{e^{f_{y_i}}} \cdot \dfrac{e^{f_{y_i}}\Sigma - e^{f_{y_i}} e^{f_j}}{\Sigma^2} = \dfrac{e^{f_j} - \Sigma}{\Sigma}$
- i ≠ j
  - $\dfrac{\partial L_i}{\partial f_j} = -\dfrac{\Sigma}{e^{f_{y_i}}} \cdot \dfrac{-e^{f_{y_i}} e^{f_j}}{\Sigma^2} = \dfrac{e^{f_j}}{\Sigma}$
- It is easy to see that $\dfrac{\partial f_j}{\partial W_j} = \dfrac{\partial X_i W_j}{\partial W_j} = X_i$
- Putting it together:
  - i = j: $\dfrac{\partial L_i}{\partial W_j} = \dfrac{e^{f_j} - \Sigma}{\Sigma} X_i = (S_j - 1) X_i$
  - i ≠ j: $\dfrac{\partial L_i}{\partial W_j} = \dfrac{e^{f_j}}{\Sigma} X_i = S_j X_i$
  - (a small numerical check of these results follows)
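A small numerical check on a single made-up sample: the gradient of $L_i$ with respect to the raw scores $f$ should equal the softmax probabilities with 1 subtracted at the correct class; the extra factor $X_i$ then comes from $\partial f_j / \partial W_j$ as derived above.

import numpy as np

f = np.array([2.0, -1.0, 0.5])          # hypothetical scores for 3 classes
y = 0                                    # correct class

def softmax_loss(f):
    shifted = f - np.max(f)              # shift for numerical stability
    return -shifted[y] + np.log(np.sum(np.exp(shifted)))

p = np.exp(f - np.max(f)) / np.sum(np.exp(f - np.max(f)))
analytic = p.copy()
analytic[y] -= 1                         # (S_j - 1) at j = y, S_j elsewhere

h = 1e-5
numeric = np.array([(softmax_loss(f + h * np.eye(3)[j]) -
                     softmax_loss(f - h * np.eye(3)[j])) / (2 * h)
                    for j in range(3)])
print(np.allclose(analytic, numeric, atol=1e-6))   # True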
Completing softmax_loss_naive
num_classes = W.shape[1]
num_train = X.shape[0]
for i in range(num_train):
scores = X[i].dot(W)
shift_scores = scores - max(scores)
loss += (-shift_scores[y[i]] + np.log(sum(np.exp(shift_scores))) )
for j in range(num_classes):
softmax_output = np.exp(shift_scores[j])/sum(np.exp(shift_scores))
if j == y[i]:
dW[:,j] += (-1 + softmax_output) *X[i]
else:
dW[:,j] += softmax_output *X[i]
loss /= num_train
loss += 0.5* reg * np.sum(W * W)
dW = dW/num_train + reg* W
Completing softmax_loss_vectorized
num_classes = W.shape[1]
num_train = X.shape[0]
scores = X.dot(W)
shift_scores = scores - np.max(scores, axis = 1, keepdims = True)
softmax_output = np.exp(shift_scores)/np.sum(np.exp(shift_scores), axis = 1, keepdims = True)
loss = -np.sum(np.log(softmax_output[range(num_train), list(y)]))
loss = loss / num_train + 0.5* reg * np.sum(W * W)
dscores = softmax_output.copy()
dscores[range(num_train), y] += -1
dW = (X.T).dot(dscores)
dW = dW/num_train + reg* W
Hyperparameter tuning (cross-validation)
- The cross-validation here mirrors the SVM case, so no further explanation is needed.
- Result: best validation accuracy achieved during cross-validation: 0.358000
- softmax on raw pixels final test set accuracy: 0.361000
for reg in regularization_strengths:
for lr in learning_rates:
svm = Softmax()
loss_hist = svm.train(X_train, y_train, lr, reg, num_iters=1500)
y_train_pred = svm.predict(X_train)
train_accuracy = np.mean(y_train == y_train_pred)
y_val_pred = svm.predict(X_val)
val_accuracy = np.mean(y_val == y_val_pred)
if val_accuracy > best_val:
best_val = val_accuracy
best_softmax = svm
results[(lr,reg)] = train_accuracy, val_accuracy
three_layer_net
Completing loss
h1 = np.dot(X , W1) + b1
first_layer = np.maximum(0 , h1)
h2 = np.dot(first_layer , W2) + b2
second_layer = np.maximum(0 , h2)
scores = np.dot(second_layer , W3) + b3
shift_scores = scores - np.max(scores, axis = 1, keepdims = True)
softmax_output = np.exp(shift_scores)/np.sum(np.exp(shift_scores), axis = 1, keepdims = True)
loss = -np.sum(np.log(softmax_output[range(N), y]))
loss = loss / N + 0.5 * reg * np.sum(W1 * W1) + 0.5 * reg * np.sum(W2 * W2) + 0.5 * reg * np.sum(W3 * W3)
- The loss here is just a softmax computation, exactly the same as the softmax above, except that a few regularization terms are added.
dscores = softmax_output
dscores[range(N),y] -= 1
grads['W3'] = np.dot(second_layer.T,dscores) / N + reg * W3
grads['b3'] = np.sum(dscores,axis=0,keepdims=False) / N
mask = np.zeros_like(h2)
mask[h2>0] = 1
delta2 = np.dot(dscores , W3.T) * mask
grads['W2'] = first_layer.T.dot(delta2) / N + reg * W2
grads['b2'] = np.ones(N).dot(delta2) / N
mask2 = np.zeros_like(h1)
mask2[h1>0] = 1
delta1 = np.dot(delta2 , W2.T) * mask2
grads['W1'] = X.T.dot(delta1) / N + reg * W1
grads['b1'] = np.ones(N).dot(delta1) / N
- Points to watch here:
- Third layer
  - dscores is, in essence, the gradient of the total loss with respect to the scores.
  - The gradient of W3 is just the softmax gradient with respect to W3 derived earlier, i.e. the derivative of the total loss with respect to W3.
  - The gradient of b3 works the same way as W3, only without the X[i] factor, since nothing multiplies b.
- Second layer
  - By the chain rule, the gradient of the total loss with respect to second_layer (the output of the second layer after ReLU) is the gradient of the loss with respect to scores times the gradient of scores with respect to second_layer.
  - second_layer plays the role of X in $W_3 X + b_3$; it is the input of the third layer, so the derivative of scores with respect to second_layer is $W_3^T$.
  - ReLU sets negative values to 0 and passes positive values through unchanged, so the derivative of second_layer with respect to the pre-activation h2 is either 0 or 1.
  - The gradient of h2 with respect to W2 is $\text{first\_layer}^T$.
  - That is what this part does; it is simply the chain rule in action:
    mask = np.zeros_like(h2)
    mask[h2>0] = 1
    delta2 = np.dot(dscores , W3.T) * mask
- First layer
  - Going from the second layer to the first is the same as going from the third to the second; just repeat the steps above (a shape check of the full backward pass follows).
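A shape-only sketch of the same forward/backward chain-rule steps with made-up layer sizes and random data (not the assignment's dimensions), checking that every weight gradient ends up with the same shape as its parameter:

import numpy as np

rng = np.random.default_rng(3)
N, D, H1, H2, C = 5, 8, 6, 4, 3          # toy batch size and layer widths
X = rng.normal(size=(N, D))
W1, b1 = rng.normal(size=(D, H1)), np.zeros(H1)
W2, b2 = rng.normal(size=(H1, H2)), np.zeros(H2)
W3, b3 = rng.normal(size=(H2, C)), np.zeros(C)
y = rng.integers(0, C, size=N)

# forward pass
h1 = X.dot(W1) + b1
a1 = np.maximum(0, h1)
h2 = a1.dot(W2) + b2
a2 = np.maximum(0, h2)
scores = a2.dot(W3) + b3

# backward pass (same chain-rule steps as in the report)
p = np.exp(scores - scores.max(axis=1, keepdims=True))
p /= p.sum(axis=1, keepdims=True)
dscores = p
dscores[range(N), y] -= 1

dW3 = a2.T.dot(dscores) / N
delta2 = dscores.dot(W3.T) * (h2 > 0)    # ReLU mask
dW2 = a1.T.dot(delta2) / N
delta1 = delta2.dot(W2.T) * (h1 > 0)
dW1 = X.T.dot(delta1) / N

print(dW1.shape == W1.shape, dW2.shape == W2.shape, dW3.shape == W3.shape)  # True True True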
Completing train
batch_inx = np.random.choice(num_train, batch_size)
X_batch = X[batch_inx,:]
y_batch = y[batch_inx]
self.params['W1'] -= learning_rate * grads['W1']
self.params['b1'] -= learning_rate * grads['b1']
self.params['W2'] -= learning_rate * grads['W2']
self.params['b2'] -= learning_rate * grads['b2']
self.params['W3'] -= learning_rate * grads['W3']
self.params['b3'] -= learning_rate * grads['b3']
Completing predict
scores = self.loss(X)
y_pred = np.argmax(scores, axis=1)
Hyperparameter tuning (cross-validation)
best_acc = 0
learning_rate = [1e-4, 5e-3, 1e-3,1e-2]
regulations = [1e-4, 5e-4, 1e-3, 4e-3,1e-2]
for lr in learning_rate:
for reg in regulations:
stats = net.train(X_train, y_train, X_val, y_val,
num_iters=2500, batch_size=150,
learning_rate=lr, learning_rate_decay=0.95,
reg=reg, verbose=True)
val_acc = (net.predict(X_val) == y_val).mean()
if val_acc > best_acc:
best_acc = val_acc
best_net = net
print('lr = ',lr ,' reg = ',reg, ' acc = ', best_acc)
features
- Instead of applying the classifier to the raw image pixels, features are first computed from the raw data and the classifier is then applied to those features.
- Two feature extractors are used:
  - hog_feature captures texture
  - color_histogram_hsv captures color
- SVM
- Result: best validation accuracy achieved during cross-validation: 0.420000
for lr in learning_rates:
for reg_str in regularization_strengths:
svm = LinearSVM()
loss_hist = svm.train(X_train_feats, y_train, learning_rate=lr, reg=reg_str,
num_iters=1500, verbose=False)
y_train_pred = svm.predict(X_train_feats)
accuracy_train = np.mean(y_train == y_train_pred)
y_val_pred = svm.predict(X_val_feats)
accuracy_val = np.mean(y_val == y_val_pred)
results[(lr, reg_str)] = (accuracy_train, accuracy_val)
if accuracy_val > best_val:
print ("lr:",lr)
print ("reg:", reg_str)
best_val = accuracy_val
best_svm = svm
- ThreeLayerNet
- Result: lr = 1 reg = 1e-05 acc = 0.558
- test_acc: 0.551
- Tuning gives the better parameters learning_rate = 1, reg = 1e-5, num_iters = 10000:
best_acc = -1
input_size = 32 * 32 * 3
learning_rate_choice = [1e-2 , 1e-1 , 1]
reg_choice = [1e-5 , 1e-4 , 1e-3 , 1e-2 , 1e-1]
for learning_rate_curr in learning_rate_choice:
for reg_cur in reg_choice:
net = ThreeLayerNet(input_dim, hidden_dim, num_classes)
stats = net.train(X_train_feats, y_train, X_val_feats, y_val,
num_iters=10000, batch_size=150,
learning_rate=learning_rate_curr, learning_rate_decay=0.95,
reg=reg_cur, verbose=True)
val_acc = (net.predict(X_val_feats) == y_val).mean()
if val_acc>best_acc:
best_acc = val_acc
best_net = net
best_stats = stats
print('lr = ',learning_rate_curr ,' reg = ',reg_cur, ' acc = ', best_acc)
Conclusions and Reflections
- In this experiment I derived the gradients of the SVM loss function with respect to W and b myself, which gave me a deeper understanding of the SVM loss.
- I also derived the gradients of the SoftMax loss function with respect to W and b myself, which gave me a deeper understanding of the SoftMax loss.
- I built a three-layer neural network by hand and got a concrete feel for how each layer's input, W, b, and output relate to one another, especially how the dimensions line up.
- I derived the chain rule for the multi-layer network by hand.
- I carried out backpropagation by hand.
- I tuned the hyperparameters by hand (still feels a bit like black magic).
- Filling in code on top of a provided skeleton is a format that teaches a lot, though it is quite time-consuming.
Problems Encountered and Solutions
- The main problem was not knowing how to derive the gradients; at first I had no idea where to start.
  - SVM came first, and the gradient problem showed up there; I searched blog posts, then derived the gradients by hand and summarized them.
  - When moving on to SoftMax after finishing SVM, the same approach worked, and the derivation could be done directly by hand.
- While working on the three-layer network I ran into this line; I did not understand what the inner product was for, or why the derivative with respect to the second layer uses only third-layer quantities:
  delta2 = np.dot(dscores , W3.T) * mask
  - After discussing with classmates I realized that this step is simply the chain rule.
  - With that understood, it became clear that the output of the second layer plus ReLU is the input $X_3$ of the third layer; dscores is the gradient of the total loss with respect to scores, and since $scores = X_3 W_3 + b_3$, the $W_3$ appearing here is exactly the derivative of scores with respect to $X_3$. That cleared up how each layer's gradient follows from the chain rule.