Solving Logistic Regression with Newton's Method
1. The Logistic Regression Model and the Newton's Method Derivation

- The logistic model is (where $\mathbf{x}_{i}$ is the $i$-th sample, a row vector, and $\mathbf{w}$ is the weight vector, a column vector)

$$p(y_{i}\mid\mathbf{x}_{i},\mathbf{w}) = \begin{cases} \sigma(\mathbf{x}_{i}\mathbf{w}) = \dfrac{1}{1+e^{-\mathbf{x}_{i}\mathbf{w}}}, & y_{i}=1, \\ 1-\sigma(\mathbf{x}_{i}\mathbf{w}), & y_{i}=0. \end{cases}$$
- Its loss function is

$$l = Loss(\mathbf{w}) = -\frac{1}{m}\ln p(\mathbf{y}\mid\mathbf{x},\mathbf{w}) = -\frac{1}{m}\left[\mathbf{y}^{T}\ln(\sigma) + (1-\mathbf{y})^{T}\ln(1-\sigma)\right].$$
- The first derivative of the loss function is

$$\frac{\partial l}{\partial w_{j}} = \frac{1}{m}\sum_{i=1}^{m}(\sigma_{i}-y_{i})x_{ij},$$

which gives the vector form
$$\frac{\partial l}{\partial \mathbf{w}} = -\frac{1}{m}\,\frac{\partial\left[\mathbf{y}^{T}\ln(\mathrm{sigmoid}(X\mathbf{w})) + (1-\mathbf{y})^{T}\ln(1-\mathrm{sigmoid}(X\mathbf{w}))\right]}{\partial \mathbf{w}},$$
$$\mathrm{d}l = -\frac{1}{m}\mathrm{tr}\left(\Big(\mathbf{y}\odot\frac{\sigma(1-\sigma)}{\sigma}\Big)^{T}(\mathrm{d}X\mathbf{w}) - \Big((1-\mathbf{y})\odot\frac{\sigma(1-\sigma)}{1-\sigma}\Big)^{T}(\mathrm{d}X\mathbf{w})\right)$$

$$= -\frac{1}{m}\mathrm{tr}\left(\Big(\mathbf{y}\odot\frac{\sigma(1-\sigma)}{\sigma}\Big)^{T}X(\mathrm{d}\mathbf{w}) - \Big((1-\mathbf{y})\odot\frac{\sigma(1-\sigma)}{1-\sigma}\Big)^{T}X(\mathrm{d}\mathbf{w})\right)$$

$$= -\frac{1}{m}\mathrm{tr}\left(\Big(X^{T}\Big(\mathbf{y}\odot\frac{\sigma(1-\sigma)}{\sigma}\Big)\Big)^{T}(\mathrm{d}\mathbf{w}) - \Big(X^{T}\Big((1-\mathbf{y})\odot\frac{\sigma(1-\sigma)}{1-\sigma}\Big)\Big)^{T}(\mathrm{d}\mathbf{w})\right) = -\frac{1}{m}\mathrm{tr}\left((X^{T}(\mathbf{y}-\sigma))^{T}(\mathrm{d}\mathbf{w})\right).$$
Matching this against the identity

$$\mathrm{d}l = \mathrm{tr}\left(\frac{\partial l}{\partial \mathbf{w}}^{T}\mathrm{d}\mathbf{w}\right)$$

lets us read off the gradient:

$$\frac{\partial l}{\partial \mathbf{w}} = \frac{1}{m}X^{T}(\sigma-\mathbf{y}).$$
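As a quick sanity check, this gradient formula can be compared against central finite differences of the loss on a small random problem. A minimal sketch; the data here is made up purely for illustration:

% Numerical check of the gradient (1/m) X'(sigma - y) on made-up data.
rng(1);
m = 20; d = 3;
X = randn(m, d);
y = double(rand(m, 1) > 0.5);
w = randn(d, 1);
sig = @(z) 1 ./ (1 + exp(-z));
loss = @(w) -(y' * log(sig(X * w)) + (1 - y)' * log(1 - sig(X * w))) / m;
g = X' * (sig(X * w) - y) / m;        % analytic gradient from the formula above
g_num = zeros(d, 1);
h = 1e-6;
for j = 1:d
    e = zeros(d, 1); e(j) = h;
    g_num(j) = (loss(w + e) - loss(w - e)) / (2 * h);  % central difference
end
disp(max(abs(g - g_num)));            % should be on the order of 1e-10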
- Next we compute the second derivative of the loss function (the Hessian matrix):

$$H_{ij} = \frac{\partial^{2}l}{\partial w_{i}\partial w_{j}} = \frac{1}{m}\frac{\partial}{\partial w_{j}}\sum_{t=1}^{m}(\sigma_{t}-y_{t})x_{ti} = \frac{1}{m}\sum_{t=1}^{m}\sigma_{t}(1-\sigma_{t})x_{ti}x_{tj},$$

which gives the matrix form
$$\mathrm{d}\nabla_{\mathbf{w}}l = \frac{1}{m}X^{T}\left(\mathrm{sigmoid}'(X\mathbf{w})\odot(\mathrm{d}X\mathbf{w})\right) = \frac{1}{m}X^{T}\left(\mathrm{sigmoid}'(X\mathbf{w})\odot X(\mathrm{d}\mathbf{w})\right),$$
$$\mathrm{vec}(\mathrm{d}\nabla_{\mathbf{w}}l) = \frac{1}{m}X^{T}\mathrm{diag}(\sigma(1-\sigma))X\,\mathrm{vec}(\mathrm{d}\mathbf{w}),$$

$$\mathrm{vec}(\mathrm{d}\nabla_{\mathbf{w}}l) = (\nabla^{2}_{\mathbf{w}}l)^{T}\mathrm{vec}(\mathrm{d}\mathbf{w}),$$
$$H = \nabla^{2}_{\mathbf{w}}l = H(\mathbf{w}) = \frac{1}{m}X^{T}AX, \qquad A = \mathrm{diag}(\sigma_{i}(1-\sigma_{i})).$$
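The Hessian formula can be verified the same way, by finite-differencing the gradient column by column. A minimal sketch on the same kind of made-up data:

% Numerical check of H = (1/m) X' A X against finite differences of the gradient.
rng(1);
m = 20; d = 3;
X = randn(m, d);
y = double(rand(m, 1) > 0.5);
w = randn(d, 1);
sig = @(z) 1 ./ (1 + exp(-z));
grad = @(w) X' * (sig(X * w) - y) / m;
s = sig(X * w);
H = X' * diag(s .* (1 - s)) * X / m;  % analytic Hessian from the formula above
H_num = zeros(d, d);
h = 1e-6;
for j = 1:d
    e = zeros(d, 1); e(j) = h;
    H_num(:, j) = (grad(w + e) - grad(w - e)) / (2 * h);
end
disp(max(abs(H(:) - H_num(:))));      % should be on the order of 1e-10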
- Since $H$ is a Gram-type quadratic form with non-negative weights $\sigma_{i}(1-\sigma_{i})$, $H$ is positive semidefinite (positive definite when $X$ has full column rank). The loss function $l$ is therefore convex and attains a minimum, so the optimal $\mathbf{w}$ can be obtained by solving $\frac{\partial l}{\partial \mathbf{w}}=0$. Moreover, Newton's method converges quadratically, so it solves this problem efficiently.
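To illustrate that quadratic convergence (a toy example, unrelated to the model itself), a few Newton steps on the scalar equation $x^2-2=0$ roughly square the error at each iteration:

% Newton's method on f(x) = x^2 - 2; the error roughly squares each step.
x = 1;
for k = 1:5
    x = x - (x^2 - 2) / (2 * x);      % x := x - f(x)/f'(x)
    fprintf('iter %d: error = %.3e\n', k, abs(x - sqrt(2)));
end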
- Following the steps of Newton's method, we obtain the iteration

$$\mathbf{w} := \mathbf{w} - \alpha\left(H^{-1}\frac{\partial l}{\partial \mathbf{w}}\right).$$

The direction $H^{-1}\frac{\partial l}{\partial \mathbf{w}}$ is called the Newton direction. Because Newton's method is only a second-order approximation, each update multiplies this direction by a learning rate $\alpha$ to compensate for the neglected higher-order (higher-derivative) terms; thus $\alpha\left(H^{-1}\frac{\partial l}{\partial \mathbf{w}}\right)$ is the step taken at each Newton iteration.
2. Implementing the Logistic Regression Model in Code (Matlab)

- sigmoid.m (computes the sigmoid function values)

function g = sigmoid(z)
%SIGMOID Compute sigmoid function
%   g = SIGMOID(z) computes the sigmoid of z.
g = 1 ./ (1 + exp(-z));
end
- loss_function.m (computes the model's loss over all samples; the file name matches the function name)

function J = loss_function(w, X, y)
%LOSS_FUNCTION Compute the loss of the logistic regression model
%   J = LOSS_FUNCTION(w, X, y) computes the loss of the logistic
%   regression model over all samples with weight w.
m = size(X, 1);
J = 0;
for i = 1:m
    if y(i) == 1
        J = J - log(sigmoid(X(i,:) * w));
    else
        J = J - log(1 - sigmoid(X(i,:) * w));
    end
end
J = J / m;
end
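The per-sample loop above can equivalently be written in vectorized form. A minimal sketch (same variables as inside loss_function; not one of the original files):

% Vectorized equivalent of the loss loop (same J, no loop).
s = sigmoid(X * w);
J = -(y' * log(s) + (1 - y)' * log(1 - s)) / m;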
- Loss_d1.m (computes the first derivative of the loss function, a vector)

function dl = Loss_d1(w, X, y)
%LOSS_D1 Compute the derivative
%   dl = LOSS_D1(w, X, y) computes the derivative of the loss function
%   of the logistic regression.
m = size(y, 1);
dl = X' * (sigmoid(X * w) - y) / m;
end
- Loss_d2.m (computes the second derivative of the loss function, the Hessian matrix)

function H = Loss_d2(w, X, y)
%LOSS_D2 Compute the Hessian matrix
%   H = LOSS_D2(w, X, y) computes the Hessian matrix of the loss function
%   of the logistic regression.
m = size(y, 1);
n = zeros(1, m);
for i = 1:m
    n(i) = sigmoid(X(i,:) * w) * (1 - sigmoid(X(i,:) * w));
end
A = diag(n);
H = X' * A * X / m;
end
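For large m, forming the m-by-m matrix diag(n) is wasteful. A vectorized sketch that avoids it (relies on implicit expansion, available in Matlab R2016b and later; not one of the original files):

% Vectorized equivalent of Loss_d2 without the m-by-m diagonal matrix.
s = sigmoid(X * w);
H = X' * (X .* (s .* (1 - s))) / m;   % scales each row of X by s_i(1-s_i)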
- newton_method.m (solves the model by Newton's method)

function [out, iteration, weight] = newton_method(alpha, w, X, y)
%NEWTON_METHOD Solve the logistic regression by Newton's method
%   [out, iteration, weight] = NEWTON_METHOD(alpha, w, X, y) gives the
%   predicted results, the number of iterations, and the weight vector
%   of the logistic regression.
%limit is the maximum number of iterations
%alpha is the learning rate
limit = 1000000;
%Iterative solution (Newton's method)
for i = 1:limit
    dl = Loss_d1(w, X, y);
    H = Loss_d2(w, X, y);
    w_next = w - alpha * (pinv(H) * dl); %Use pinv in case the matrix is close to singular
    if loss_function(w_next, X, y) < 1e-6 %Convergence criterion
        w = w_next;
        break
    end
    w = w_next;
end
%Predict all samples with the weight obtained above
out = sigmoid(X * w);
iteration = i;
weight = w;
end
- evaluation.m (plots the ROC curve and computes the AUC)

function auc = evaluation(out, y, alpha)
%EVALUATION Evaluate the performance of the model
%   auc = EVALUATION(out, y, alpha) evaluates the performance by plotting
%   the ROC curve and computing the AUC.
%(xx, yy) is the initial point
%x_step is the step taken when an FP is encountered
%y_step is the step taken when a TP is encountered
xx = 1.0;
yy = 1.0;
pos_num = sum(y == 1);
neg_num = sum(y == 0);
x_step = 1.0 / neg_num;
y_step = 1.0 / pos_num;
%Sort the predicted values and use the index to reorder the true
%labels y, so each point can be judged as a TP or an FP
[out, index] = sort(out);
y = y(index);
X(1) = 1;
Y(1) = 1;
for i = 1:length(y)
    if y(i) == 1 %TP
        yy = yy - y_step;
    else %FP
        xx = xx - x_step;
    end
    X(i + 1) = xx;
    Y(i + 1) = yy;
end
%Plot the ROC curve
name = "alpha" + num2str(alpha) + '.jpg';
figure('name', name);
plot(X, Y, '-ro', 'LineWidth', 2, 'MarkerSize', 3);
xlabel('FPR');
ylabel('TPR');
title('ROC');
f = gca;
f.XAxisLocation = 'origin';
f.YAxisLocation = 'origin';
f.XLim = [0 1.1];
f.YLim = [0 1.1];
saveas(gcf, name);
%Compute the area under the ROC curve (X runs from 1 down to 0, hence the minus sign)
auc = -trapz(X, Y);
end
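As a cross-check, the AUC can also be computed from the rank statistic of the positive class. A sketch, not one of the original files; it uses tiedrank from the Statistics and Machine Learning Toolbox and should agree with evaluation(out, y, alpha) up to tie handling:

% AUC via the Mann-Whitney rank statistic.
r = tiedrank(out);                    % ranks of all predicted scores
pos_num = sum(y == 1);
neg_num = sum(y == 0);
auc_rank = (sum(r(y == 1)) - pos_num * (pos_num + 1) / 2) / (pos_num * neg_num);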
- Logistic_Reg.m (solves the model)

%'data.xlsx' is the dataset
%X is the feature matrix
%y is the classification vector
%m is the number of features (including the constant term)
%n is the number of samples
%w_initial is the initial weight vector (a vector of ones)
%w is the weight vector obtained by Newton's method
%alpha is the learning rate
%out is the predicted results
%auc is the evaluation metric
X = xlsread('data.xlsx', 'A:B');
y = xlsread('data.xlsx', 'C:C');
X = [ones(size(X,1), 1) X]; %Add a column of ones for the constant term
y = (y == 1); %Change the label '-1' to '0' to simplify the computation
y = double(y);
m = size(X, 2);
n = size(X, 1);
w_initial = ones(m, 1);
alpha = 0.05;
w = zeros(80, m);
out = zeros(n, 80);
iteration = zeros(80, 1);
auc = zeros(80, 1);
%Solve by Newton's method for 80 learning rates (0.05, 0.10, ..., 4.00)
%and evaluate the performance at the same time
for i = 1:80
    [out(:,i), iteration(i), w(i,:)] = newton_method(alpha, w_initial, X, y);
    auc(i) = evaluation(out(:,i), y, alpha);
    alpha = alpha + 0.05;
end
save('out.mat');
save('iteration.mat');
save('auc.mat');
save('w.mat');
3. Based on each sample's label and prediction, outcomes divide into true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN). Set different thresholds p, write code to plot the ROC curve with FPR on the x-axis and TPR on the y-axis, and compute the AUC

Using the evaluation.m function from Section 2, the ROC curve can be plotted and the AUC computed directly from the classification results out.
The y-axis of the ROC curve is $TPR = \frac{TP}{TP+FN}$, and the x-axis is $FPR = \frac{FP}{FP+TN}$; the shape of the curve reflects how TPR and FPR change as the threshold p varies.
The curve can be drawn by the following procedure:
- Count the positive and negative examples among the labels, and use $\frac{1.0}{\#\text{positives}}$ as the step size in the y direction and $\frac{1.0}{\#\text{negatives}}$ as the step size in the x direction.
- Sort the predictions in ascending order, and reorder the labels accordingly.
- Start from the point (1.0, 1.0) and iterate over each value of y:
  - When a label of 1 is encountered, taking that point's predicted value as the threshold turns one positive example into a predicted negative, i.e., the TP count decreases by 1 (step down in y).
  - When a label of 0 is encountered, taking that point's predicted value as the threshold turns one negative example into a correctly predicted negative, i.e., the FP count decreases by 1 (step left in x).
- Finally, the area enclosed by the ROC curve and the x-axis (the AUC) serves as the evaluation metric of the model.
4. Comparing the Effect of Different Learning Rates on Performance

Since Newton's method is only a second-order approximation of the problem, we can, by analogy with gradient descent, introduce a learning rate $\alpha$ to stand in for the neglected higher-order terms, and then pick a learning rate that converges quickly by comparing performance. Typically, $\alpha$ is swept over an interval with a fixed step and each value is evaluated separately, as the loop in Logistic_Reg.m above does; a plotting sketch follows.
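A minimal sketch for visualizing the sweep, assuming the iteration and auc variables produced by Logistic_Reg.m are still in the workspace:

% Plot iterations-to-convergence and AUC against the learning rate sweep.
alphas = 0.05:0.05:4.0;               % the 80 learning rates used above
subplot(2, 1, 1);
plot(alphas, iteration, '-b');
xlabel('\alpha'); ylabel('iterations');
subplot(2, 1, 2);
plot(alphas, auc, '-r');
xlabel('\alpha'); ylabel('AUC');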
5. Discussion and Improvements

- The current model only considers how well it fits the given dataset and imposes no constraints, so the resulting weight vector may overfit the data and generalize poorly. Introducing a regularization term to constrain the weight magnitudes can mitigate problems caused by overfitting; for example, with a quadratic (L2) regularizer the loss function becomes

$$l = Loss(\mathbf{w}) = -\frac{1}{m}\ln p(\mathbf{y}\mid\mathbf{x},\mathbf{w}) + \lambda\mathbf{w}^{T}\mathbf{w} = -\frac{1}{m}\left[\mathbf{y}^{T}\ln(\sigma) + (1-\mathbf{y})^{T}\ln(1-\sigma)\right] + \lambda\mathbf{w}^{T}\mathbf{w}.$$
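The derivatives change accordingly: the $\lambda\mathbf{w}^{T}\mathbf{w}$ term adds $2\lambda\mathbf{w}$ to the gradient and $2\lambda I$ to the Hessian. A sketch of hypothetical adapted routines Loss_d1_reg.m and Loss_d2_reg.m (not the original files; in practice the bias weight is often excluded from the penalty):

% Hypothetical regularized variants (L2 penalty lambda * w' * w).
function dl = Loss_d1_reg(w, X, y, lambda)
m = size(y, 1);
dl = X' * (sigmoid(X * w) - y) / m + 2 * lambda * w;
end

function H = Loss_d2_reg(w, X, y, lambda)
m = size(y, 1);
s = sigmoid(X * w);
H = X' * (X .* (s .* (1 - s))) / m + 2 * lambda * eye(size(X, 2));
end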
- The current evaluation uses only the ROC curve and its area (AUC). Precision, $precision = \frac{TP}{TP+FP}$, and recall, $recall = \frac{TP}{TP+FN}$, could also be introduced to evaluate the model more comprehensively and obtain a more accurate assessment.
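A minimal sketch of computing both metrics for a single prediction vector out against labels y, at an assumed fixed threshold of 0.5:

% Precision and recall at threshold 0.5.
pred = (out >= 0.5);
TP = sum(pred & (y == 1));
FP = sum(pred & (y == 0));
FN = sum(~pred & (y == 1));
precision = TP / (TP + FP);
recall = TP / (TP + FN);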
- The current model selection is based solely on the training set. To further improve performance, the dataset could be split into training, validation, and test sets in fixed proportions, with the different parameters selected on the appropriate parts; this can effectively improve generalization. A sketch of such a split follows.
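A minimal sketch of a random 60/20/20 split, assuming X and y as loaded in Logistic_Reg.m:

% Random 60/20/20 train/validation/test split of n samples.
n = size(X, 1);
idx = randperm(n);
n_tr = round(0.6 * n);
n_va = round(0.2 * n);
X_train = X(idx(1:n_tr), :);           y_train = y(idx(1:n_tr));
X_val   = X(idx(n_tr+1:n_tr+n_va), :); y_val   = y(idx(n_tr+1:n_tr+n_va));
X_test  = X(idx(n_tr+n_va+1:end), :);  y_test  = y(idx(n_tr+n_va+1:end));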