题目
基于信息增益,对下述数据集进行决策树构建,描述过程
一个关于配眼镜的一个决策分类所需要的数据,数据集包含4属性:age, astigmatism, trear-prod-rate为输入特征,contact-lenses为决策属性。
属性集
A
=
{
A
G
E
,
A
S
T
,
T
E
A
}
A=\{AGE,AST,TEA\}
A={AGE,AST,TEA},类别为
C
O
N
CON
CON。计算根节点的信息熵
E
n
t
(
D
)
=
?
(
2
12
log
?
2
2
12
+
3
12
log
?
2
3
12
+
7
12
log
?
2
7
12
)
=
1.384
Ent(D)=-(\frac{2}{12}\log_2{\frac{2}{12}}+\frac{3}{12}\log_2{\frac{3}{12}}+\frac{7}{12}\log_2{\frac{7}{12}})=1.384
Ent(D)=?(122?log2?122?+123?log2?123?+127?log2?127?)=1.384 计算每个属性的信息熵和信息增益,记
p
(
s
o
f
t
)
=
p
1
,
p
(
h
a
r
d
)
=
p
2
,
p
(
n
o
n
e
)
=
p
3
p(soft)=p_1,p(hard)=p_2,p(none)=p_3
p(soft)=p1?,p(hard)=p2?,p(none)=p3?,
-
A
G
E
AGE
AGE 有三个可能取值
{
y
o
u
n
g
,
p
r
e
?
p
r
e
,
p
r
e
}
\{young,pre-pre,pre\}
{young,pre?pre,pre},对应下标
1
,
2
,
3
1,2,3
1,2,3
D
1
=
{
1
,
2
,
3
}
p
1
=
1
3
,
?
p
2
=
1
3
,
?
p
3
=
1
3
E
n
t
(
D
1
)
=
?
(
1
3
log
?
2
1
3
+
1
3
log
?
2
1
3
+
1
3
log
?
2
1
3
)
=
1.585
D
2
=
{
4
,
5
,
6
,
7
,
8
}
p
1
=
1
5
,
?
p
2
=
1
5
,
?
p
3
=
3
5
E
n
t
(
D
2
)
=
?
(
1
5
log
?
2
1
5
+
1
5
log
?
2
1
5
+
3
5
log
?
2
3
5
)
=
1.371
D
3
=
{
9
,
10
,
11
,
12
}
p
1
=
0
,
?
p
2
=
1
4
,
?
p
3
=
3
4
E
n
t
(
D
3
)
=
?
(
1
4
log
?
2
1
4
+
3
4
log
?
2
3
4
)
=
0.811
G
a
i
n
(
D
,
A
G
E
)
=
E
n
t
(
D
)
?
∑
v
=
1
3
∣
D
v
∣
∣
D
∣
E
n
t
(
D
v
)
=
1.384
?
3
12
×
1.585
?
5
12
×
1.371
?
4
12
×
0.811
=
0.146
\begin{aligned} &D^1=\{1,2,3\} \quad p_1=\frac{1}{3},\ p_2=\frac{1}{3},\ p_3=\frac{1}{3} \quad Ent(D^1)=-(\frac{1}{3}\log_2{\frac{1}{3}}+\frac{1}{3}\log_2{\frac{1}{3}}+\frac{1}{3}\log_2{\frac{1}{3}})=1.585 \\ &D^2=\{4,5,6,7,8\} \quad p_1=\frac{1}{5},\ p_2=\frac{1}{5},\ p_3=\frac{3}{5} \quad Ent(D^2)=-(\frac{1}{5}\log_2{\frac{1}{5}}+\frac{1}{5}\log_2{\frac{1}{5}}+\frac{3}{5}\log_2{\frac{3}{5}})=1.371 \\ &D^3=\{9,10,11,12\} \quad p_1=0,\ p_2=\frac{1}{4},\ p_3=\frac{3}{4} \quad Ent(D^3)=-(\frac{1}{4}\log_2{\frac{1}{4}}+\frac{3}{4}\log_2{\frac{3}{4}})=0.811 \\ &Gain(D,AGE)=Ent(D)-\sum_{v=1}^3 \frac{|D^v|}{|D|}Ent(D^v)=1.384-\frac{3}{12}\times 1.585-\frac{5}{12}\times 1.371-\frac{4}{12}\times 0.811=0.146 \end{aligned}
?D1={1,2,3}p1?=31?,?p2?=31?,?p3?=31?Ent(D1)=?(31?log2?31?+31?log2?31?+31?log2?31?)=1.585D2={4,5,6,7,8}p1?=51?,?p2?=51?,?p3?=53?Ent(D2)=?(51?log2?51?+51?log2?51?+53?log2?53?)=1.371D3={9,10,11,12}p1?=0,?p2?=41?,?p3?=43?Ent(D3)=?(41?log2?41?+43?log2?43?)=0.811Gain(D,AGE)=Ent(D)?v=1∑3?∣D∣∣Dv∣?Ent(Dv)=1.384?123?×1.585?125?×1.371?124?×0.811=0.146? -
A
S
T
AST
AST 有两个可能取值
{
y
e
s
,
n
o
}
\{yes,no\}
{yes,no},对应下标
1
,
2
1,2
1,2
D
1
=
{
2
,
3
,
6
,
7
,
8
,
11
,
12
}
p
1
=
0
,
?
p
2
=
3
7
,
?
p
3
=
4
7
E
n
t
(
D
1
)
=
?
(
3
7
log
?
2
3
7
+
4
7
log
?
2
4
7
)
=
0.985
D
2
=
{
1
,
4
,
5
,
9
,
10
}
p
1
=
2
5
,
?
p
2
=
0
,
?
p
3
=
3
5
E
n
t
(
D
2
)
=
?
(
2
5
log
?
2
2
5
+
3
5
log
?
2
3
5
)
=
0.971
G
a
i
n
(
D
,
A
S
T
)
=
E
n
t
(
D
)
?
∑
v
=
1
2
∣
D
v
∣
∣
D
∣
E
n
t
(
D
v
)
=
1.384
?
7
12
×
0.985
?
5
12
×
0.971
=
0.405
\begin{aligned} &D^1=\{2,3,6,7,8,11,12\} \quad p_1=0,\ p_2=\frac{3}{7},\ p_3=\frac{4}{7} \quad Ent(D^1)=-(\frac{3}{7}\log_2{\frac{3}{7}}+\frac{4}{7}\log_2{\frac{4}{7}})=0.985 \\ &D^2=\{1,4,5,9,10\} \quad p_1=\frac{2}{5},\ p_2=0,\ p_3=\frac{3}{5} \quad Ent(D^2)=-(\frac{2}{5}\log_2{\frac{2}{5}}+\frac{3}{5}\log_2{\frac{3}{5}})=0.971 \\ &Gain(D,AST)=Ent(D)-\sum_{v=1}^2 \frac{|D^v|}{|D|}Ent(D^v)=1.384-\frac{7}{12}\times 0.985-\frac{5}{12}\times 0.971=0.405 \end{aligned}
?D1={2,3,6,7,8,11,12}p1?=0,?p2?=73?,?p3?=74?Ent(D1)=?(73?log2?73?+74?log2?74?)=0.985D2={1,4,5,9,10}p1?=52?,?p2?=0,?p3?=53?Ent(D2)=?(52?log2?52?+53?log2?53?)=0.971Gain(D,AST)=Ent(D)?v=1∑2?∣D∣∣Dv∣?Ent(Dv)=1.384?127?×0.985?125?×0.971=0.405? -
T
E
A
TEA
TEA 有两个可能的取值
{
n
o
r
m
a
l
,
r
e
d
u
c
e
d
}
\{normal,reduced\}
{normal,reduced},对应下标
1
,
2
1,2
1,2
D
1
=
{
1
,
3
,
5
,
6
,
7
,
8
,
10
,
12
}
p
1
=
2
8
,
?
p
2
=
3
8
,
?
p
3
=
3
8
E
n
t
(
D
1
)
=
?
(
2
8
log
?
2
2
8
+
3
8
log
?
2
3
8
+
3
8
log
?
2
3
8
)
=
1.561
D
2
=
{
2
,
4
,
9
,
11
}
p
1
=
0
,
?
p
2
=
0
,
?
p
3
=
1
E
n
t
(
D
2
)
=
?
(
1
log
?
2
1
)
=
0
G
a
i
n
(
D
,
T
E
A
)
=
E
n
t
(
D
)
?
∑
v
=
1
2
∣
D
v
∣
∣
D
∣
E
n
t
(
D
v
)
=
1.384
?
8
12
×
1.561
?
5
12
×
0
=
0.343
\begin{aligned} &D^1=\{1,3,5,6,7,8,10,12\} \quad p_1=\frac{2}{8},\ p_2=\frac{3}{8},\ p_3=\frac{3}{8} \quad Ent(D^1)=-(\frac{2}{8}\log_2{\frac{2}{8}}+\frac{3}{8}\log_2{\frac{3}{8}}+\frac{3}{8}\log_2{\frac{3}{8}})=1.561 \\ &D^2=\{2,4,9,11\} \quad p_1=0,\ p_2=0,\ p_3=1 \quad Ent(D^2)=-(1\log_2{1})=0 \\ &Gain(D,TEA)=Ent(D)-\sum_{v=1}^2 \frac{|D^v|}{|D|}Ent(D^v)=1.384-\frac{8}{12}\times 1.561-\frac{5}{12}\times 0=0.343 \end{aligned}
?D1={1,3,5,6,7,8,10,12}p1?=82?,?p2?=83?,?p3?=83?Ent(D1)=?(82?log2?82?+83?log2?83?+83?log2?83?)=1.561D2={2,4,9,11}p1?=0,?p2?=0,?p3?=1Ent(D2)=?(1log2?1)=0Gain(D,TEA)=Ent(D)?v=1∑2?∣D∣∣Dv∣?Ent(Dv)=1.384?128?×1.561?125?×0=0.343?
于是,
G
a
i
n
(
D
,
A
S
T
)
Gain(D,AST)
Gain(D,AST)最大,选它为划分属性,
对左分支节点划分,可用属性集
A
=
{
A
G
E
,
T
E
A
}
A=\{AGE,TEA\}
A={AGE,TEA},类别为
C
O
N
CON
CON。该节点的信息熵
E
n
t
(
D
1
)
=
0.985
Ent(D^1)=0.985
Ent(D1)=0.985。计算每个属性的信息熵和信息增益,记
p
(
s
o
f
t
)
=
p
1
,
p
(
h
a
r
d
)
=
p
2
,
p
(
n
o
n
e
)
=
p
3
p(soft)=p_1,p(hard)=p_2,p(none)=p_3
p(soft)=p1?,p(hard)=p2?,p(none)=p3?,
-
A
G
E
AGE
AGE 有三个可能取值
{
y
o
u
n
g
,
p
r
e
?
p
r
e
,
p
r
e
}
\{young,pre-pre,pre\}
{young,pre?pre,pre},对应下标
1
,
2
,
3
1,2,3
1,2,3
D
11
=
{
2
,
3
}
p
1
=
0
,
?
p
2
=
1
2
,
?
p
3
=
1
2
E
n
t
(
D
11
)
=
?
(
1
2
log
?
2
1
2
+
1
2
log
?
2
1
2
)
=
1.000
D
12
=
{
6
,
7
,
8
}
p
1
=
0
,
?
p
2
=
1
3
,
?
p
3
=
2
3
E
n
t
(
D
12
)
=
?
(
1
3
log
?
2
1
3
+
2
3
log
?
2
2
3
)
=
0.918
D
13
=
{
11
,
12
}
p
1
=
0
,
?
p
2
=
1
2
,
?
p
3
=
1
2
E
n
t
(
D
13
)
=
?
(
1
2
log
?
2
1
2
+
1
2
log
?
2
1
2
)
=
1.000
G
a
i
n
(
D
1
,
A
G
E
)
=
E
n
t
(
D
)
?
∑
v
=
1
3
∣
D
1
v
∣
∣
D
1
∣
E
n
t
(
D
1
v
)
=
0.985
?
2
7
×
1.000
?
3
7
×
0.918
?
2
7
×
1.000
=
0.020
\begin{aligned} &D^{11}=\{2,3\} \quad p_1=0,\ p_2=\frac{1}{2},\ p_3=\frac{1}{2} \quad Ent(D^{11})=-(\frac{1}{2}\log_2{\frac{1}{2}}+\frac{1}{2}\log_2{\frac{1}{2}})=1.000 \\ &D^{12}=\{6,7,8\} \quad p_1=0,\ p_2=\frac{1}{3},\ p_3=\frac{2}{3} \quad Ent(D^{12})=-(\frac{1}{3}\log_2{\frac{1}{3}}+\frac{2}{3}\log_2{\frac{2}{3}})=0.918 \\ &D^{13}=\{11,12\} \quad p_1=0,\ p_2=\frac{1}{2},\ p_3=\frac{1}{2} \quad Ent(D^{13})=-(\frac{1}{2}\log_2{\frac{1}{2}}+\frac{1}{2}\log_2{\frac{1}{2}})=1.000 \\ &Gain(D^1,AGE)=Ent(D)-\sum_{v=1}^3 \frac{|D^{1v}|}{|D^1|}Ent(D^{1v})=0.985-\frac{2}{7}\times 1.000-\frac{3}{7}\times 0.918-\frac{2}{7}\times 1.000=0.020 \end{aligned}
?D11={2,3}p1?=0,?p2?=21?,?p3?=21?Ent(D11)=?(21?log2?21?+21?log2?21?)=1.000D12={6,7,8}p1?=0,?p2?=31?,?p3?=32?Ent(D12)=?(31?log2?31?+32?log2?32?)=0.918D13={11,12}p1?=0,?p2?=21?,?p3?=21?Ent(D13)=?(21?log2?21?+21?log2?21?)=1.000Gain(D1,AGE)=Ent(D)?v=1∑3?∣D1∣∣D1v∣?Ent(D1v)=0.985?72?×1.000?73?×0.918?72?×1.000=0.020? -
T
E
A
TEA
TEA 有两个可能的取值
{
n
o
r
m
a
l
,
r
e
d
u
c
e
d
}
\{normal,reduced\}
{normal,reduced},对应下标
1
,
2
1,2
1,2
D
11
=
{
3
,
6
,
7
,
8
,
12
}
p
1
=
0
,
?
p
2
=
3
5
,
?
p
3
=
2
5
E
n
t
(
D
1
)
=
?
(
3
5
log
?
2
3
5
+
2
5
log
?
2
2
5
)
=
0.971
D
12
=
{
2
,
11
}
p
1
=
0
,
?
p
2
=
0
,
?
p
3
=
1
E
n
t
(
D
2
)
=
?
(
1
log
?
2
1
)
=
0
G
a
i
n
(
D
1
,
T
E
A
)
=
E
n
t
(
D
1
)
?
∑
v
=
1
2
∣
D
1
v
∣
∣
D
1
∣
E
n
t
(
D
1
v
)
=
0.985
?
5
7
×
0.971
?
2
7
×
0
=
0.291
\begin{aligned} &D^{11}=\{3,6,7,8,12\} \quad p_1=0,\ p_2=\frac{3}{5},\ p_3=\frac{2}{5} \quad Ent(D^1)=-(\frac{3}{5}\log_2{\frac{3}{5}}+\frac{2}{5}\log_2{\frac{2}{5}})=0.971 \\ &D^{12}=\{2,11\} \quad p_1=0,\ p_2=0,\ p_3=1 \quad Ent(D^2)=-(1\log_2{1})=0 \\ &Gain(D^1,TEA)=Ent(D^1)-\sum_{v=1}^2 \frac{|D^{1v}|}{|D^1|}Ent(D^{1v})=0.985-\frac{5}{7}\times 0.971-\frac{2}{7}\times 0=0.291 \end{aligned}
?D11={3,6,7,8,12}p1?=0,?p2?=53?,?p3?=52?Ent(D1)=?(53?log2?53?+52?log2?52?)=0.971D12={2,11}p1?=0,?p2?=0,?p3?=1Ent(D2)=?(1log2?1)=0Gain(D1,TEA)=Ent(D1)?v=1∑2?∣D1∣∣D1v∣?Ent(D1v)=0.985?75?×0.971?72?×0=0.291?
于是,
G
a
i
n
(
D
1
,
T
E
A
)
Gain(D^1,TEA)
Gain(D1,TEA)最大,选它为划分属性,
继续对左分支节点划分,可用属性集
A
=
{
A
G
E
}
A=\{AGE\}
A={AGE},类别为
C
O
N
CON
CON。选它为划分属性,得到
D
111
,
D
112
,
D
113
D^{111},D^{112},D^{113}
D111,D112,D113,此时属性集合为空,将这三个节点设为叶子节点,其中
D
112
D^{112}
D112中
p
2
=
1
3
,
p
3
=
2
3
p_2=\frac{1}{3},p_3=\frac{2}{3}
p2?=31?,p3?=32?,因此将
D
112
D^{112}
D112对应叶子节点标注为
n
o
n
e
none
none类别,其余两个节点中只有一个样本,将叶子节点标记为对应样本类别,返回。考察
D
12
D^{12}
D12,包含样本均属同一类别
n
o
n
e
none
none,则将
D
12
D^{12}
D12标记为
n
o
n
e
none
none。
回到一层,对一层右分支节点划分,可用属性集
A
=
{
A
G
E
,
T
E
A
}
A=\{AGE,TEA\}
A={AGE,TEA},类别为
C
O
N
CON
CON。该节点的信息熵
E
n
t
(
D
1
)
=
0.971
Ent(D^1)=0.971
Ent(D1)=0.971。计算每个属性的信息熵和信息增益,记
p
(
s
o
f
t
)
=
p
1
,
p
(
h
a
r
d
)
=
p
2
,
p
(
n
o
n
e
)
=
p
3
p(soft)=p_1,p(hard)=p_2,p(none)=p_3
p(soft)=p1?,p(hard)=p2?,p(none)=p3?,
-
A
G
E
AGE
AGE 有三个可能取值
{
y
o
u
n
g
,
p
r
e
?
p
r
e
,
p
r
e
}
\{young,pre-pre,pre\}
{young,pre?pre,pre},对应下标
1
,
2
,
3
1,2,3
1,2,3
D
21
=
{
1
}
p
1
=
1
,
?
p
2
=
0
,
?
p
3
=
0
E
n
t
(
D
11
)
=
?
(
1
log
?
2
1
)
=
0.000
D
22
=
{
4
,
5
}
p
1
=
0
,
?
p
2
=
1
2
,
?
p
3
=
1
2
E
n
t
(
D
12
)
=
?
(
1
2
log
?
2
1
2
+
1
2
log
?
2
1
2
)
=
1.000
D
23
=
{
9
,
10
}
p
1
=
0
,
?
p
2
=
0
,
?
p
3
=
1
E
n
t
(
D
13
)
=
?
(
1
log
?
2
1
)
=
0.000
G
a
i
n
(
D
2
,
A
G
E
)
=
E
n
t
(
D
2
)
?
∑
v
=
1
3
∣
D
2
v
∣
∣
D
2
∣
E
n
t
(
D
2
v
)
=
0.971
?
1
5
×
0.000
?
2
5
×
1.000
?
2
5
×
0.000
=
0.571
\begin{aligned} &D^{21}=\{1\} \quad p_1=1,\ p_2=0,\ p_3=0 \quad Ent(D^{11})=-(1\log_2 1)=0.000 \\ &D^{22}=\{4,5\} \quad p_1=0,\ p_2=\frac{1}{2},\ p_3=\frac{1}{2} \quad Ent(D^{12})=-(\frac{1}{2}\log_2{\frac{1}{2}}+\frac{1}{2}\log_2{\frac{1}{2}})=1.000 \\ &D^{23}=\{9,10\} \quad p_1=0,\ p_2=0,\ p_3=1 \quad Ent(D^{13})=-(1\log_2{1})=0.000 \\ &Gain(D^2,AGE)=Ent(D^2)-\sum_{v=1}^3 \frac{|D^{2v}|}{|D^2|}Ent(D^{2v})=0.971-\frac{1}{5}\times 0.000-\frac{2}{5}\times 1.000-\frac{2}{5}\times 0.000=0.571 \end{aligned}
?D21={1}p1?=1,?p2?=0,?p3?=0Ent(D11)=?(1log2?1)=0.000D22={4,5}p1?=0,?p2?=21?,?p3?=21?Ent(D12)=?(21?log2?21?+21?log2?21?)=1.000D23={9,10}p1?=0,?p2?=0,?p3?=1Ent(D13)=?(1log2?1)=0.000Gain(D2,AGE)=Ent(D2)?v=1∑3?∣D2∣∣D2v∣?Ent(D2v)=0.971?51?×0.000?52?×1.000?52?×0.000=0.571? -
T
E
A
TEA
TEA 有两个可能的取值
{
n
o
r
m
a
l
,
r
e
d
u
c
e
d
}
\{normal,reduced\}
{normal,reduced},对应下标
1
,
2
1,2
1,2
D
21
=
{
1
,
5
,
10
}
p
1
=
2
3
,
?
p
2
=
0
,
?
p
3
=
1
3
E
n
t
(
D
1
)
=
?
(
2
3
log
?
2
2
3
+
1
3
log
?
2
1
3
)
=
0.918
D
22
=
{
4
,
9
}
p
1
=
0
,
?
p
2
=
0
,
?
p
3
=
1
E
n
t
(
D
2
)
=
?
(
1
log
?
2
1
)
=
0
G
a
i
n
(
D
2
,
T
E
A
)
=
E
n
t
(
D
2
)
?
∑
v
=
1
2
∣
D
2
v
∣
∣
D
2
∣
E
n
t
(
D
2
v
)
=
0.971
?
3
5
×
0.918
=
0.420
\begin{aligned} &D^{21}=\{1,5,10\} \quad p_1=\frac{2}{3},\ p_2=0,\ p_3=\frac{1}{3} \quad Ent(D^1)=-(\frac{2}{3}\log_2{\frac{2}{3}}+\frac{1}{3}\log_2{\frac{1}{3}})=0.918 \\ &D^{22}=\{4,9\} \quad p_1=0,\ p_2=0,\ p_3=1 \quad Ent(D^2)=-(1\log_2{1})=0 \\ &Gain(D^2,TEA)=Ent(D^2)-\sum_{v=1}^2 \frac{|D^{2v}|}{|D^2|}Ent(D^{2v})=0.971-\frac{3}{5}\times 0.918=0.420 \end{aligned}
?D21={1,5,10}p1?=32?,?p2?=0,?p3?=31?Ent(D1)=?(32?log2?32?+31?log2?31?)=0.918D22={4,9}p1?=0,?p2?=0,?p3?=1Ent(D2)=?(1log2?1)=0Gain(D2,TEA)=Ent(D2)?v=1∑2?∣D2∣∣D2v∣?Ent(D2v)=0.971?53?×0.918=0.420?
于是,
G
a
i
n
(
D
2
,
A
G
E
)
Gain(D^2,AGE)
Gain(D2,AGE)最大,选它为划分属性,
考察
D
21
D^{21}
D21,由于只有一个样本,所以直接将
D
21
D^{21}
D21设置为叶子节点,标记为
s
o
f
t
soft
soft,返回到
D
22
D^{22}
D22.此时可用的属性集
A
=
{
T
E
A
}
A=\{TEA\}
A={TEA},选它为划分属性,此时属性集为空,将这两个节点设置为叶子节点,这两个叶子节点中都只有一个样本,于是标记为对应样本类别,返回。考察
D
23
D^{23}
D23,其中样本全部属于类别
n
o
n
e
none
none,所以直接将
D
23
D^{23}
D23设置为叶子节点,标记为
n
o
n
e
none
none。最终得到决策树
|