Sliced Inverse Regression
Introduction
Sliced Inverse Regression (SIR) is a classical sufficient dimension reduction method proposed by Li (1991). SIR studies the regression of the multivariate predictor $x$ on the univariate response $y$, rather than the regression of the one-dimensional $y$ on the multivariate $x$. In reducing the dimension of the multivariate explanatory variable $x$, it therefore makes full use of the information carried by $y$.
The General Model
- General model:
$$y = f(\beta_{1}^{T}x, \dots, \beta_{K}^{T}x, \varepsilon)$$
- $\varepsilon$ is a random variable independent of $x$, and $f$ is an unknown link function;
- the conditional distribution of $y$ given $x$ can be obtained from the $K$ linear combinations $\beta_{1}^{T}x, \dots, \beta_{K}^{T}x$ of $x$, without losing any of the original information contained in $x$;
- equivalently, conditional on the $K$ linear combinations of $x$, the response $y$ is independent of $x$; when $K$ is much smaller than the dimension $p$ of $x$, dimension reduction is achieved;
- Key theorems:
  - Under the linearity condition, the centered inverse regression curve $E(X \mid Y) - E(X)$ is contained in the linear space spanned by $\Sigma_{x}\beta_{i},\ i = 1, \dots, K$;
  - Under the linearity condition, the generalized eigenvalue problem $\Sigma_{\eta}b_{i} = \lambda_{i}\Sigma_{x}b_{i}$, where $\Sigma_{\eta}$ is the covariance matrix of the inverse regression curve $E(X \mid Y)$, has at most $K$ nonzero eigenvalues $\lambda_{i}$, and the corresponding eigenvectors $b_{i}$ are contained in the linear space spanned by $\beta_{i},\ i = 1, \dots, K$.
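The first theorem can be checked numerically. Below is a minimal sketch with made-up quantities: Gaussian $x$ satisfies the linearity condition, and for the toy model $y = \beta_1^T x + (\beta_2^T x)^2 + \varepsilon$ with $\beta_1 = e_1$, $\beta_2 = e_2$, and $\Sigma_x = I$, the slice means of $X$ should stay close to $\mathrm{span}\{\beta_1, \beta_2\}$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5000, 6
beta1 = np.eye(p)[0]   # true directions (toy choice): e1 and e2
beta2 = np.eye(p)[1]

X = rng.standard_normal((n, p))        # Gaussian x, so the linearity condition holds
y = X @ beta1 + (X @ beta2) ** 2 + 0.1 * rng.standard_normal(n)

# Slice means E(X | Y in I_h); with Sigma_x = I they should stay
# close to span{beta1, beta2} = span{e1, e2}
order = np.argsort(y)
slice_means = np.array([X[idx].mean(axis=0) for idx in np.array_split(order, 10)])
off_span = np.abs(slice_means[:, 2:]).max()   # coordinates outside the span
print(off_span)   # small, on the order of 1/sqrt(n/H)
```

Only the first two coordinates of the slice means carry signal; the remaining coordinates fluctuate at the sampling-noise level, which is exactly what the theorem predicts.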
Procedure
- SIR operates on the data $(Y_{i}, X_{i}),\ i = 1, \dots, n$;
- Standardize the $X_{i}$ by the affine transformation $\widetilde{X}_{i} = \widehat{\Sigma}_{xx}^{-1/2}(X_{i} - \overline{X}),\ i = 1, \dots, n$, where $\widehat{\Sigma}_{xx}$ and $\overline{X}$ are the sample covariance matrix and the sample mean of $X$;
- Sort the responses $Y_{i}$ in ascending order and cut them into $H$ slices $I_{1}, I_{2}, \dots, I_{H}$. The proportion $P_{h}$ of the $Y_{i}$ falling into slice $I_{h}$ is
$$P_{h} = \frac{1}{n}\sum_{i=1}^{n}\delta_{h}(Y_{i}), \qquad \delta_{h}(Y_{i}) = \begin{cases} 1, & \text{if } Y_{i} \in I_{h} \\ 0, & \text{if } Y_{i} \notin I_{h} \end{cases}$$
- For each slice, compute the sample mean $m_{h}$ of the standardized observations:
$$m_{h} = \frac{1}{nP_{h}} \sum_{Y_{i} \in I_{h}} \widetilde{X}_{i};$$
- Carry out a (weighted) principal component analysis of the $m_{h}$, with weighted covariance matrix
$$V = \sum_{h=1}^{H} P_{h}\, m_{h} m_{h}^{T},$$
and find the eigenvalues and eigenvectors of $V$;
- Let $\eta_{1}, \dots, \eta_{K}$ be the eigenvectors belonging to the $K$ largest eigenvalues; the estimated directions are $\widehat{\beta}_{k} = \widehat{\Sigma}_{xx}^{-1/2} \eta_{k},\ k = 1, 2, \dots, K$.
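As a quick numerical check of the slicing step (with made-up responses): for $n = 10$ and $H = 3$, the first $n \bmod H = 1$ slice takes 4 points and the remaining slices take 3, so $P_{1} = 0.4$ and $P_{2} = P_{3} = 0.3$:

```python
import numpy as np

n, H = 10, 3
y = np.array([2.1, 0.3, 1.7, 0.9, 2.5, 0.1, 1.1, 0.6, 1.9, 1.4])  # toy responses

# Slice sizes: n // H per slice, with the first n % H slices taking one extra point
sizes = np.full(H, n // H)
sizes[: n % H] += 1

slices = np.split(np.sort(y), np.cumsum(sizes)[:-1])
P = sizes / n
print(P)          # [0.4 0.3 0.3]
print(slices[0])  # the 4 smallest responses: [0.1 0.3 0.6 0.9]
```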
Python Implementation of SIR
import numpy as np
import pandas as pd
from scipy.linalg import fractional_matrix_power
from scipy.sparse.linalg import eigs


class SIR:
    def __init__(self, H, K, estdim=0, Cn=1):
        self.H = H            # number of slices
        self.K = K            # number of directions to extract
        self.estdim = estdim  # 0: K is given; otherwise estimate K from the data
        self.Cn = Cn          # penalty constant used when estimating K

    def fit(self, X, Y):
        self.X = X
        self.Y = Y
        self.mean_x = np.mean(X, axis=0)
        self.sigma_x = np.cov(X, rowvar=False)
        # Standardize: Z_i = Sigma_xx^{-1/2} (X_i - mean(X))
        self.Z = np.matmul(X - self.mean_x,
                           fractional_matrix_power(self.sigma_x, -0.5))
        n, p = self.Z.shape
        if self.Y.ndim == 1:
            self.Y = self.Y.reshape(-1, 1)
        ny, py = self.Y.shape
        W = np.ones((n, 1)) / n  # uniform observation weights
        if n != ny:
            raise ValueError('X and Y must have the same number of samples')
        elif p == 1:
            raise ValueError('X must have at least 2 dimensions')
        elif py != 1:
            raise ValueError('Y must have only 1 dimension')
        # Slice sizes: n // H each; the first n % H slices take one extra point
        c = np.ones((1, self.H)) * (n // self.H) + np.hstack(
            [np.ones((1, n % self.H)),
             np.zeros((1, self.H - n % self.H))])
        cumc = np.cumsum(c)
        # Sort (Z, Y, W) jointly by the response
        temp = np.hstack((self.Z, self.Y, W))
        temp = temp[np.argsort(temp[:, p])]
        z, w = temp[:, :p], temp[:, p + 1]
        # Per-slice sums of Z and per-slice weights
        muh = np.zeros((self.H, p))
        wh = np.zeros((self.H, 1))
        k = 0
        for i in range(n):
            if i >= cumc[k]:
                k += 1
            muh[k, :] = muh[k, :] + z[i, :]
            wh[k] = wh[k] + w[i]
        muh = muh / (np.tile(wh, (1, p)) * n)   # slice means m_h
        # Weighted PCA matrix M = sum_h P_h m_h m_h'
        self.M = np.zeros((p, p))
        for i in range(self.H):
            self.M = self.M + wh[i] * np.outer(muh[i, :], muh[i, :])
        if self.estdim == 0:
            # K largest-magnitude eigenpairs of M
            self.D, self.V = eigs(A=self.M, k=self.K, which='LM')
            self.D, self.V = self.D.real, self.V.real  # M is symmetric
        else:
            # Estimate K with a BIC-type criterion:
            # G(k) = n/2 * sum(L_1..L_k) / sum(L) - Cn * k * (k + 1) / p
            D_full, _ = np.linalg.eig(self.M)
            lam = np.sort(np.abs(D_full))[::-1]
            L = np.log(lam + 1) - lam
            if self.Cn == 1:
                Cn = n ** (1 / 2)
            elif self.Cn == 2:
                Cn = n ** (1 / 3)
            else:
                Cn = 0.5 * n ** 0.25
            G = np.array([n / 2 * np.sum(L[:k + 1]) / np.sum(L)
                          - Cn * (k + 1) * (k + 2) / p for k in range(p)])
            self.K = int(np.argmax(G)) + 1
            self.D, self.V = eigs(A=self.M, k=self.K, which='LM')
            self.D, self.V = self.D.real, self.V.real
        return self.V, self.K, self.M, self.D

    def transform(self):
        # beta_k = Sigma_xx^{-1/2} eta_k
        hatbeta = np.matmul(fractional_matrix_power(self.sigma_x, -0.5), self.V)
        return hatbeta


X = pd.read_csv('X.csv', header=None).values
Y = pd.read_csv('Y.csv', header=None).values
model = SIR(H=6, K=4, estdim=0, Cn=1)
V, K, M, D = model.fit(X, Y)
hatbeta = model.transform()
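The CSV files above are assumed to exist alongside the script. As a self-contained sanity check, the same steps can be run end to end on simulated data where the true directions are known (a compact sketch of the algorithm; the toy model and the choices H=5, K=2 are illustrative, not from the original):

```python
import numpy as np
from scipy.linalg import fractional_matrix_power

def sir_directions(X, y, H, K):
    """Compact SIR: standardize, slice on sorted y, weighted PCA of slice means."""
    n, p = X.shape
    Sxx_inv_sqrt = fractional_matrix_power(np.cov(X, rowvar=False), -0.5)
    Z = (X - X.mean(axis=0)) @ Sxx_inv_sqrt
    M = np.zeros((p, p))
    for idx in np.array_split(np.argsort(y), H):   # slices of (nearly) equal size
        m_h = Z[idx].mean(axis=0)
        M += (len(idx) / n) * np.outer(m_h, m_h)   # V = sum_h P_h m_h m_h^T
    _, eigvec = np.linalg.eigh(M)                  # ascending eigenvalues
    eta = eigvec[:, ::-1][:, :K]                   # top-K eigenvectors
    return Sxx_inv_sqrt @ eta                      # beta_k = Sigma_xx^{-1/2} eta_k

rng = np.random.default_rng(1)
n, p = 2000, 5
X_sim = rng.standard_normal((n, p))
# Toy model: y depends on x only through the first two coordinates
y_sim = X_sim[:, 0] + np.exp(X_sim[:, 1]) + 0.1 * rng.standard_normal(n)

B = sir_directions(X_sim, y_sim, H=5, K=2)
# The columns of B should span roughly {e1, e2}: rows 2..4 stay near zero
print(np.round(np.abs(B) / np.abs(B).max(), 2))
```

Because both components of the toy model are monotone in their direction, SIR recovers the span of both; a symmetric dependence (e.g. a pure quadratic) would be missed by SIR, a known limitation of the method.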