How L-BFGS works
L-BFGS is the limited-memory BFGS quasi-Newton method: the n×n inverse-Hessian approximation is never formed explicitly when computing the search direction, which saves a great deal of memory. Recall that in BFGS the update formula for the inverse Hessian approximation is:
$$
D_{k+1} = \left( I - \frac{s_k y_k^T}{y_k^T s_k} \right) D_k \left( I - \frac{y_k s_k^T}{y_k^T s_k} \right) + \frac{s_k s_k^T}{y_k^T s_k}
$$

where
$$
y_k = g_{k+1} - g_k, \qquad s_k = x_{k+1} - x_k .
$$

If we write
$$
\rho_k = \frac{1}{y_k^T s_k}, \qquad V_k = I - \rho_k y_k s_k^T ,
$$

then the update can be written compactly as:
$$
D_{k+1} = V_k^T D_k V_k + \rho_k s_k s_k^T .
$$

The initial inverse Hessian may be any symmetric positive definite matrix; in L-BFGS it is simply taken to be the identity (why?).
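One sanity check the derivation above skips: since $\rho_k s_k^T y_k = 1$, we have $V_k y_k = y_k - \rho_k y_k (s_k^T y_k) = 0$, so the compact update automatically satisfies the secant condition:

$$
D_{k+1} y_k = V_k^T D_k \underbrace{(V_k y_k)}_{=\,0} + \rho_k s_k (s_k^T y_k) = s_k .
$$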
Given a maximum of m stored search-direction records, for k + 1 ≤ m the recursion expands to:

$$
\begin{aligned}
D_0 &= I \\
D_1 &= V_0^T V_0 + \rho_0 s_0 s_0^T \\
D_2 &= V_1^T V_0^T V_0 V_1 + \rho_0 V_1^T s_0 s_0^T V_1 + \rho_1 s_1 s_1^T \\
&\;\;\vdots \\
D_{k+1} &= V_k^T V_{k-1}^T \cdots V_1^T V_0^T V_0 V_1 \cdots V_{k-1} V_k \\
&\quad + \rho_0\, V_k^T V_{k-1}^T \cdots V_1^T s_0 s_0^T V_1 \cdots V_{k-1} V_k \\
&\quad + \rho_1\, V_k^T V_{k-1}^T \cdots V_2^T s_1 s_1^T V_2 \cdots V_{k-1} V_k \\
&\quad + \cdots \\
&\quad + \rho_{k-1}\, V_k^T s_{k-1} s_{k-1}^T V_k \\
&\quad + \rho_k\, s_k s_k^T
\end{aligned}
$$

Once more than m − 1 records have accumulated, the oldest record has to be overwritten, so the update formula for D_{k+1} becomes:
$$
\begin{aligned}
D_{k+1} &= V_k^T V_{k-1}^T \cdots V_{k-m+1}^T V_{k-m+1} \cdots V_{k-1} V_k \\
&\quad + \rho_{k-m+1}\, V_k^T V_{k-1}^T \cdots V_{k-m+2}^T s_{k-m+1} s_{k-m+1}^T V_{k-m+2} \cdots V_{k-1} V_k \\
&\quad + \cdots \\
&\quad + \rho_{k-1}\, V_k^T s_{k-1} s_{k-1}^T V_k \\
&\quad + \rho_k\, s_k s_k^T
\end{aligned}
$$
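To make the limited-memory bookkeeping concrete, here is a small NumPy sketch (the names lbfgs_inverse_hessian, s_pairs, y_pairs are mine, not from this post) that rebuilds D_{k+1} by iterating the compact recursion over only the newest m pairs. It forms the full matrix, so it is only useful for checking the formula on small examples, not something an actual L-BFGS implementation would do.

import numpy as np

def lbfgs_inverse_hessian(s_pairs, y_pairs, m):
    # Rebuild D_{k+1} from the newest m pairs (1-D arrays, oldest first) by iterating
    # D <- V_i^T D V_i + rho_i s_i s_i^T, starting from D = I.
    n = s_pairs[0].shape[0]
    D = np.eye(n)
    for s, y in zip(s_pairs[-m:], y_pairs[-m:]):   # keep only the most recent m records
        rho = 1.0 / float(y @ s)
        V = np.eye(n) - rho * np.outer(y, s)
        D = V.T @ D @ V + rho * np.outer(s, s)
    return D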
Computing the search direction
$$
p_{k+1} = -D_{k+1}\, g_{k+1}
$$

In the program we need three kinds of linear maps acting on vectors:
$$
\begin{aligned}
(1)\;& s_i s_i^T p = (s_i^T p)\, s_i \\
(2)\;& V_i p = (I - \rho_i y_i s_i^T)\, p = p - \rho_i (s_i^T p)\, y_i \\
(3)\;& V_i^T p = (I - \rho_i s_i y_i^T)\, p = p - \rho_i (y_i^T p)\, s_i
\end{aligned}
$$
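Each of these maps costs only O(n) per application and never forms an n×n matrix, which is the whole point of L-BFGS. A minimal NumPy sketch (function names are mine; vectors are 1-D arrays and rho = 1/(y^T s)):

import numpy as np

def apply_ssT(s, p):                 # (1)  s s^T p = (s^T p) s
    return (s @ p) * s

def apply_V(s, y, rho, p):           # (2)  V p = p - rho (s^T p) y
    return p - rho * (s @ p) * y

def apply_VT(s, y, rho, p):          # (3)  V^T p = p - rho (y^T p) s
    return p - rho * (y @ p) * s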
A hand-written version in TensorFlow
TensorFlow already ships a ready-made L-BFGS optimizer (in TensorFlow Probability for TF 2.0), and so does scipy.optimize. To show the quadratic termination property of the BFGS family, I instead optimize by hand a positive definite quadratic of the form
$$
f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T A\,\mathbf{x} + b^T \mathbf{x},
$$

which, when A is symmetric positive definite, attains the exact minimum value
$$
-\frac{1}{2} b^T A^{-1} b .
$$

After each search direction is computed, the step size is chosen by an exact one-dimensional line search:
$$
t_k = \frac{-\,p_k^T \nabla f(x_k)}{p_k^T A\, p_k}, \qquad x_{k+1} = x_k + t_k p_k .
$$
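This step size is just the minimizer of the quadratic along the ray x_k + t p_k: restricted to that ray, f becomes a one-dimensional quadratic in t, and setting its derivative to zero gives exactly the expression above:

$$
\begin{aligned}
\varphi(t) &= f(x_k + t p_k), \\
\varphi'(t) &= p_k^T \nabla f(x_k + t p_k) = p_k^T \nabla f(x_k) + t\, p_k^T A\, p_k, \\
\varphi'(t_k) = 0 \;&\Longrightarrow\; t_k = \frac{-\,p_k^T \nabla f(x_k)}{p_k^T A\, p_k},
\end{aligned}
$$

and positive definiteness of A guarantees p_k^T A p_k > 0, so t_k is indeed a minimizer.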
Code
- Import the third-party libraries
import tensorflow as tf
import numpy as np
import copy
from scipy.stats import ortho_group
- Generate the positive definite quadratic function
(Built via an orthogonal similarity transform: generate positive eigenvalues first, then rotate with a random orthogonal matrix.)
m = 10                                   # maximum number of stored (s, y) records
maxepoch = 50                            # maximum number of outer iterations
epsilon = 1.0e-5                         # gradient-norm stopping tolerance
dimension = 10
s_list = np.zeros((dimension,m))         # stored s_k = x_{k+1} - x_k
y_list = np.zeros((dimension,m))         # stored y_k = g_{k+1} - g_k
rho = np.zeros((1,m))                    # stores y_k^T s_k (note: not its reciprocal)
D = np.diag(np.random.rand(dimension))*5+1          # matrix with positive eigenvalues
Q = ortho_group.rvs(size=1, dim=dimension, random_state=2)   # random orthogonal matrix
D = Q @ D @ np.linalg.inv(Q)             # similarity transform; inv(Q) == Q.T for orthogonal Q
b = np.random.randn(dimension,1)
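A quick optional check (not in the original post) that the generated matrix really is symmetric positive definite, which both the analytic minimum and the exact line search rely on:

print(np.allclose(D, D.T))                 # symmetry, up to floating-point error
print(np.all(np.linalg.eigvalsh(D) > 0))   # all eigenvalues strictly positive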
- Objective function, gradient, exact line search, and the three linear transforms
def f_fun(X):
    # f(X) = 0.5 * X^T D X - X^T b
    return 0.5*tf.transpose(X) @ D @ X - tf.transpose(X) @ b

def f_grad(X):
    # gradient of f at X via automatic differentiation
    X = tf.Variable(X)
    with tf.GradientTape() as tape:
        f = f_fun(X)
    return tape.gradient(f,X)

def linear_search(Xk,p):
    # exact line search on the quadratic: X_{k+1} = X_k + t_k p with
    # t_k = -p^T grad f(X_k) / (p^T D p)
    Xk = tf.Variable(Xk)
    with tf.GradientTape() as tape:
        f = f_fun(Xk)
    dxk = tape.gradient(f,Xk)
    p = tf.cast(p,tf.float64)
    dxk = tf.cast(dxk,tf.float64)
    Xk_1 = Xk - tf.cast(tf.transpose(p) @ dxk/(tf.transpose(p) @ tf.cast(D,tf.float64) @ p)*p,tf.float32)
    return Xk_1

# The three linear maps from the derivation; rho stores y^T s, hence the divisions.
def linear_transform_1(index,p):
    # s_i s_i^T p = (s_i^T p) s_i
    return np.sum(s_list[:,index]*p)*s_list[:,index]

def linear_transform_2(index,p):
    # V_i p = p - rho_i (s_i^T p) y_i
    return p - np.sum(s_list[:,index]*p)*y_list[:,index]/rho[0,index]

def linear_transform_3(index,p):
    # V_i^T p = p - rho_i (y_i^T p) s_i
    return p - np.sum(y_list[:,index]*p)*s_list[:,index]/rho[0,index]
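For comparison, the product D_{k+1} g can also be computed with the standard L-BFGS two-loop recursion, which is what library implementations typically use instead of expanding every term the way the code below does. A self-contained sketch (the function and argument names are mine, not part of the post's code; s_pairs/y_pairs are lists of 1-D arrays, oldest first):

def two_loop_recursion(g, s_pairs, y_pairs):
    # returns D @ g with D built from the stored pairs and D_0 = I,
    # without ever forming the matrix D
    q = g.copy()
    alphas = []
    for s, y in zip(reversed(s_pairs), reversed(y_pairs)):          # newest to oldest
        rho = 1.0 / np.dot(y, s)
        a = rho * np.dot(s, q)
        alphas.append((rho, a))
        q = q - a * y
    r = q                                                           # initial D_0 = I
    for (rho, a), s, y in zip(reversed(alphas), s_pairs, y_pairs):  # oldest to newest
        beta = rho * np.dot(y, r)
        r = r + (a - beta) * s
    return r

Calling it with the current gradient and the stored columns of s_list/y_list should reproduce the direction that the explicit expansion below computes.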
- L-BFGS
print('Analytic solution: {:.6f}'.format(-0.5*np.dot(b.T,np.dot(np.linalg.inv(D),b))[0][0]))
D = tf.cast(tf.convert_to_tensor(D),tf.float32)
b = tf.cast(tf.convert_to_tensor(b),tf.float32)
X = tf.Variable(tf.random.normal((dimension,1))*100.0,name='w')
print('Initial function value: {:6f}'.format(float(f_fun(X))))

# First two iterations by hand: a steepest-descent step, then a step with D_1,
# to fill in the first stored (s, y) record.
with tf.GradientTape() as tape:
    f = f_fun(X)
G0 = tape.gradient(f,X)
p0 = -G0                                   # first direction: steepest descent
X1 = linear_search(X,p0)
print('Iteration: {:d}'.format(1),end='|')
print('Current value: {:6f}'.format(float(f_fun(X1))))
s_list[:,0] = (X1 - X).numpy().flatten()
G1 = f_grad(X1)
y_list[:,0] = (G1-G0).numpy().flatten()
rho[0,0] = np.sum(y_list[:,0]*s_list[:,0])
# p = D_1 g_1 = V_0^T V_0 g_1 + rho_0 s_0 s_0^T g_1, written out term by term
p = G1 - np.sum(G1*s_list[:,0].reshape(-1,1))*y_list[:,0].reshape(-1,1)/rho[0,0]
p = p - np.sum(p*y_list[:,0].reshape(-1,1))*s_list[:,0].reshape(-1,1)/rho[0,0]
p = p + np.sum(s_list[:,0].reshape(-1,1)*G1)*s_list[:,0].reshape(-1,1)/rho[0,0]
G0 = G1
X2 = linear_search(X1,-p)
print('Iteration: {:d}'.format(2),end='|')
print('Current value: {:6f}'.format(float(f_fun(X2))))
s_list[:,1] = tf.reshape(X2 - X1,[X1.shape[0]])

# Iterations while fewer than m records are stored: use all of them.
for j in range(1,m):
    G1 = f_grad(X2)
    X1 = X2
    y_list[:,j] = tf.reshape(G1 - G0,[G0.shape[0]])
    rho[0,j] = np.sum(y_list[:,j]*s_list[:,j])
    odd = G1.numpy().flatten()
    # first term: V_j^T ... V_0^T V_0 ... V_j g
    q1 = copy.deepcopy(odd)
    for i in range(j+1):
        q1 = linear_transform_2(j-i,q1)
    for i in range(j+1):
        q1 = linear_transform_3(i,q1)
    # middle terms: rho_i V_j^T ... V_{i+1}^T s_i s_i^T V_{i+1} ... V_j g
    for i in range(0,j):
        q2 = copy.deepcopy(odd)
        for k in range(j-i):
            q2 = linear_transform_2(j-k,q2)
        q2 = linear_transform_1(i,q2)
        for k in range(j-i):
            q2 = linear_transform_3(k+i+1,q2)
        q1 = q1 + q2/rho[0,i]
    # last term: rho_j s_j s_j^T g
    q1 = q1 + linear_transform_1(j,odd)/rho[0,j]
    q1 = q1.reshape(-1,1)
    X2 = linear_search(X1,-q1)
    if np.sqrt(np.sum(f_grad(X2)**2))<epsilon:
        print('Iteration: {:d}'.format(j+2),end='|')
        print('Minimizer found:',end='|')
        print('Minimum value: {:6f}'.format(float(f_fun(X2))))
        break
    else:
        print('Iteration: {:d}'.format(j+2),end='|')
        print('Current value: {:6f}'.format(float(f_fun(X2))))
    G0 = G1
    if j < m-1:
        s_list[:,j+1] = tf.reshape(X2 - X1,[X1.shape[0]])

# Once all m slots are used, overwrite the oldest record each iteration.
for ll in range(maxepoch-m):
    s_list[:,0:-2] = s_list[:,1:-1]        # shift stored records one slot toward the front
    y_list[:,0:-2] = y_list[:,1:-1]
    rho[0,0:-2] = rho[0,1:-1]
    j = m-1
    G1 = f_grad(X2)
    X1 = X2
    y_list[:,j] = tf.reshape(G1 - G0,[G0.shape[0]])
    rho[0,j] = np.sum(y_list[:,j]*s_list[:,j])
    odd = G1.numpy().flatten()
    q1 = copy.deepcopy(odd)
    for i in range(j+1):
        q1 = linear_transform_2(j-i,q1)
    for i in range(j+1):
        q1 = linear_transform_3(i,q1)
    for i in range(0,j):
        q2 = copy.deepcopy(odd)
        for k in range(j-i):
            q2 = linear_transform_2(j-k,q2)
        q2 = linear_transform_1(i,q2)
        for k in range(j-i):
            q2 = linear_transform_3(k+i+1,q2)
        q1 = q1 + q2/rho[0,i]
    q1 = q1 + linear_transform_1(j,odd)/rho[0,j]
    q1 = q1.reshape(-1,1)
    X2 = linear_search(X1,-q1)
    if np.sqrt(np.sum(f_grad(X2)**2))<epsilon:
        print('Iteration: {:d}'.format(ll+m),end='|')
        print('Minimizer found:',end='|')
        print('Minimum value: {:6f}'.format(float(f_fun(X2))))
        break
    else:
        print('Iteration: {:d}'.format(ll+m),end='|')
        print('Current value: {:6f}'.format(float(f_fun(X2))))
    G0 = G1
    s_list[:,j] = tf.reshape(X2 - X1,[X1.shape[0]])
- Run results
Analytic solution: -32.108312
Initial function value: 149942.296875
Iteration: 1|Current value: 66679.601562
Iteration: 2|Current value: 5903.409180
Iteration: 3|Current value: 676.357117
Iteration: 4|Current value: 65.207367
Iteration: 5|Current value: 3.300758
Iteration: 6|Current value: -15.957199
Iteration: 7|Current value: -29.664780
Iteration: 8|Current value: -30.898552
Iteration: 9|Current value: -31.169037
Iteration: 10|Current value: -32.108303
Iteration: 11|Minimizer found:|Minimum value: -32.108315
Iteration: 10|Minimizer found:|Minimum value: -32.108292

(The second "Iteration: 10" line comes from the second loop, whose counter restarts at ll + m.)
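As an optional cross-check (not part of the original run), the same quadratic can be handed to SciPy's built-in L-BFGS; here A_np and b_np are assumed to be NumPy copies of D and b taken before the tf.cast conversions above:

from scipy.optimize import minimize

def quad(x):
    return 0.5 * x @ A_np @ x - b_np.flatten() @ x

def quad_grad(x):
    return A_np @ x - b_np.flatten()

res = minimize(quad, np.zeros(dimension), jac=quad_grad, method='L-BFGS-B')
print(res.fun)   # should be close to the analytic value -0.5 * b^T A^{-1} b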
Implementation details
The run shows the quadratic termination property of the (L-)BFGS family: on a positive definite quadratic it reaches the exact minimum in a finite number of iterations. TensorFlow also demands some care with dtype casts: the "@" operator (matmul) requires both operands to have the same dtype (tf.float64 inside the line search above), and the same holds for addition and subtraction (tf.float32 elsewhere), so the explicit tf.cast calls are unavoidable.
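A minimal illustration of that pitfall (my own toy example, not from the post): TensorFlow does not cast implicitly, so mixing dtypes in a matmul fails until one side is cast explicitly.

a32 = tf.constant([[1.0, 2.0]], dtype=tf.float32)
b64 = tf.constant([[3.0], [4.0]], dtype=tf.float64)
# a32 @ b64                          # fails (InvalidArgumentError): matmul needs matching dtypes
c = a32 @ tf.cast(b64, tf.float32)   # works after an explicit cast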