
[Artificial Intelligence] Andrew Ng · Machine Learning || chap12 Support Vector Machines (Brief Notes)

12-1 Optimization objective

Alternative view of logistic regression

$h_\theta(x)=\frac{1}{1+e^{-\theta^Tx}}$

If $y=1$, we want $h_\theta(x)\approx1$, $\theta^Tx\gg0$

If $y=0$, we want $h_\theta(x)\approx0$, $\theta^Tx\ll0$

Cost of example: $-\left(y\log h_\theta(x)+(1-y)\log(1-h_\theta(x))\right)=-y\log\frac{1}{1+e^{-\theta^Tx}}-(1-y)\log\left(1-\frac{1}{1+e^{-\theta^Tx}}\right)$
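A minimal Octave sketch of this per-example cost, using made-up values for theta, x, and y:

theta = [-1; 2];                  % hypothetical parameters
x = [1; 1.5];                     % x(1) = 1 is the intercept term
y = 1;                            % label
h = 1 / (1 + exp(-theta' * x));   % h_theta(x), the sigmoid
cost = -(y * log(h) + (1 - y) * log(1 - h))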

Support vector machine

Logistic regression:
$\min_\theta\frac{1}{m}\left[\sum_{i=1}^{m}y^{(i)}\left(-\log h_\theta(x^{(i)})\right)+(1-y^{(i)})\left(-\log(1-h_\theta(x^{(i)}))\right)\right]+\frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$
support vector machine:
$\min_\theta C\sum_{i=1}^{m}\left[y^{(i)}\mathrm{cost}_1(\theta^Tx^{(i)})+(1-y^{(i)})\mathrm{cost}_0(\theta^Tx^{(i)})\right]+\frac{1}{2}\sum_{j=1}^{n}\theta_j^2$
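The lecture defines $\mathrm{cost}_1$ and $\mathrm{cost}_0$ only by their plots: straight-line approximations of the two logistic cost curves above. A common concrete choice (an assumption here, up to scaling) is the hinge form:

$\mathrm{cost}_1(z)=\max(0,\,1-z),\qquad \mathrm{cost}_0(z)=\max(0,\,1+z)$

so $\mathrm{cost}_1(z)=0$ once $z\ge1$ and $\mathrm{cost}_0(z)=0$ once $z\le-1$, which is exactly the margin condition used in 12-2.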

SVM hypothesis
$\min_\theta C\sum_{i=1}^{m}\left[y^{(i)}\mathrm{cost}_1(\theta^Tx^{(i)})+(1-y^{(i)})\mathrm{cost}_0(\theta^Tx^{(i)})\right]+\frac{1}{2}\sum_{j=1}^{n}\theta_j^2$
Hypothesis:

$h_\theta(x)=\begin{cases}1 & \text{if }\theta^Tx\ge0\\0 & \text{otherwise}\end{cases}$
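In Octave this hypothesis is a single comparison (theta and x are placeholder values):

theta = [-1; 2];  x = [1; 0.7];   % hypothetical parameters and example
h = (theta' * x >= 0)             % 1 if theta'*x >= 0, otherwise 0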

12-2 Large Margin Intuition

Support Vector Machine

if $y=1$, we want $\theta^Tx\ge1$ (not just $\ge0$)


if $y=0$, we want $\theta^Tx\le-1$ (not just $<0$)

SVM Decision Boundary

SVM Decision Boundary: Linearly separable case

Large margin classifier in presence of outliers

12-3 The mathematics behind large margin classification (optional)

Vector Inner Product
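The fact this section relies on is $u^Tv=p\cdot\|u\|$, where $p$ is the signed length of the projection of $v$ onto $u$. A quick Octave check with made-up vectors:

u = [2; 1];  v = [1; 3];      % illustrative vectors
inner = u' * v                % 5
p = (u' * v) / norm(u);       % signed projection length of v onto u
p * norm(u)                   % equals inner, up to floating-point error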

SVM Decision Boundary



12-4 Kernels I

Kernel functions

Non-linear Decision Boundary

predict y=1 if $\theta_0+\theta_1x_1+\theta_2x_2+\theta_3x_1x_2+\theta_4x_1^2+\theta_5x_2^2+\cdots\ge0$

$h_\theta(x)=\begin{cases}1 & \text{if }\theta_0+\theta_1x_1+\theta_2x_2+\theta_3x_1x_2+\cdots\ge0\\0 & \text{otherwise}\end{cases}$

Given $x$, compute new features depending on proximity to landmarks $l^{(1)},l^{(2)},l^{(3)}$

Kernels and Similarity

$f_1=\mathrm{similarity}(x,l^{(1)})=\exp\left(-\frac{\|x-l^{(1)}\|^2}{2\sigma^2}\right)=\exp\left(-\frac{\sum_{j=1}^{n}(x_j-l_j^{(1)})^2}{2\sigma^2}\right)$

If $x\approx l^{(1)}$: $f_1\approx1$

If $x$ is far from $l^{(1)}$: $f_1\approx0$
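A minimal Octave sketch of this similarity, checking both limits (the landmark and sigma are made-up):

sigma = 1.0;                                        % kernel width
l1 = [3; 5];                                        % hypothetical landmark
sim = @(x, l) exp(-norm(x - l)^2 / (2 * sigma^2));  % Gaussian similarity
sim([3.1; 5.0], l1)   % x close to l1 -> approximately 1
sim([30; 50], l1)     % x far from l1 -> approximately 0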


12-5 Kernels II

Choosing the landmarks

SVM with Kernels

Given $(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),\cdots,(x^{(m)},y^{(m)})$

choose $l^{(1)}=x^{(1)},\,l^{(2)}=x^{(2)},\,\cdots,\,l^{(m)}=x^{(m)}$

Given example x:

$\begin{array}{l}f_1=\mathrm{similarity}(x,l^{(1)})\\f_2=\mathrm{similarity}(x,l^{(2)})\\\cdots\end{array}$

For training example $(x^{(i)},y^{(i)})\longrightarrow f^{(i)}$
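Putting the pieces together, a sketch of the feature mapping for one example, assuming the training examples are the rows of an m-by-n matrix X (all values here are placeholders):

X = rand(10, 2);  m = size(X, 1);   % hypothetical training set; landmarks = rows
x = rand(2, 1);  sigma = 1.0;       % example to map, kernel width
f = ones(m + 1, 1);                 % f(1) holds the intercept feature f_0 = 1
for i = 1:m
  l = X(i, :)';                     % landmark l^(i) = x^(i)
  f(i + 1) = exp(-norm(x - l)^2 / (2 * sigma^2));
end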


Hypothesis: Given $x$, compute features $f\in\mathbb{R}^{m+1}$

Predict “y=1” if $\theta^Tf\ge0$

Training: $\min_\theta C\sum_{i=1}^{m}\left[y^{(i)}\mathrm{cost}_1(\theta^Tf^{(i)})+(1-y^{(i)})\mathrm{cost}_0(\theta^Tf^{(i)})\right]+\frac{1}{2}\sum_{j=1}^{n}\theta_j^2$
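Under the hinge choice of $\mathrm{cost}_1$/$\mathrm{cost}_0$ assumed in 12-1 above, this objective can be evaluated directly. A sketch on placeholder data, where the rows of F are the vectors $f^{(i)}$:

m = 10;
F = [ones(m, 1), rand(m, m)];         % placeholder feature matrix, rows f^(i)
y = double(rand(m, 1) > 0.5);         % placeholder 0/1 labels
theta = zeros(m + 1, 1);  C = 1.0;
cost1 = @(z) max(0, 1 - z);           % assumed hinge form of cost_1
cost0 = @(z) max(0, 1 + z);           % assumed hinge form of cost_0
z = F * theta;                        % z(i) = theta' * f^(i)
J = C * sum(y .* cost1(z) + (1 - y) .* cost0(z)) ...
    + 0.5 * sum(theta(2:end) .^ 2)    % theta_0 conventionally unregularized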

SVM parameters:

$C\;(=\frac{1}{\lambda})$:

Large C: lower bias, higher variance (small $\lambda$)
Small C: higher bias, lower variance (large $\lambda$)

$\sigma^2$:

Large $\sigma^2$: features $f_i$ vary more smoothly. Higher bias, lower variance.

Small $\sigma^2$: features $f_i$ vary less smoothly. Lower bias, higher variance.

12-6 Using an SVM

Use an SVM software package (e.g. liblinear, libsvm, …) to solve for parameters $\theta$

Need to specify:

Choice of parameter C

Choice of kernel (similarity function):

E.g. No kernel (“linear kernel”):

$\theta_0+\theta_1x_1+\theta_2x_2+\theta_3x_3+\cdots\ge0\;\longrightarrow\;$ use when $n$ is large, $m$ is small

predict “y=1” if $\theta^Tx\ge0$

Gaussian kernel:

$f_i=\exp\left(-\frac{\|x-l^{(i)}\|^2}{2\sigma^2}\right)$, where $l^{(i)}=x^{(i)}$

Need to choose $\sigma^2$

Kernel (similarity) functions:

function f = kernel(x1, x2)
  % Gaussian (RBF) similarity between feature vectors x1 and x2.
  % sigma is shared via a global: run "global sigma; sigma = 1.0;" before use.
  global sigma
  f = exp(-norm(x1 - x2)^2 / (2 * sigma^2));
end
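Usage sketch (values are made-up; sigma is shared through the global, as noted in the function):

global sigma
sigma = 2.0;
kernel([1; 2], [1; 2])   % identical inputs -> 1
kernel([1; 2], [8; 9])   % distant inputs  -> close to 0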

Note: Do perform feature scaling before using the Gaussian kernel

(See 4-3 Gradient descent in practice I: Feature Scaling.)

Other choices of kernel

Note: Not all similarity functions similarity(x, l) make valid kernels. (They need to satisfy a technical condition called “Mercer’s Theorem” so that SVM packages’ optimizations run correctly and do not diverge.)

Many off-the-shelf kernels available:

  • Polynomial kernel: $k(x,l)=(x^Tl)^2,\ (x^Tl+1)^3,\ (x^Tl+5)^4,\ \ldots$ (see the sketch after this list)
  • More esoteric: String kernel, chi-square kernel, histogram intersection kernel
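A polynomial kernel is one line in Octave; the degree and the additive constant are the knobs (the values here are arbitrary):

poly_kernel = @(x, l, c, d) (x' * l + c) ^ d;   % (x'l + constant)^degree
poly_kernel([1; 2], [3; 1], 1, 3)               % (1*3 + 2*1 + 1)^3 = 216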

Multi-class classification
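Many packages handle K classes directly; otherwise, the lecture's recipe is one-vs-all: train one SVM per class and predict the class whose classifier scores highest. A sketch, assuming F (feature matrix), y (labels in 1..K), K, and a new example's features f are already defined, and where train_svm is a hypothetical stand-in for your package's trainer (returning a parameter vector for 0/1 labels):

% One-vs-all: one parameter vector Theta(:, k) per class k = 1..K.
for k = 1:K
  Theta(:, k) = train_svm(F, double(y == k));   % hypothetical trainer: class k vs. rest
end
[~, pred] = max(Theta' * f);                    % class with largest theta' * f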

Logistic regression vs. SVMs

n = number of features ($x\in\mathbb{R}^{n+1}$), m = number of training examples
If n is large (relative to m):
Use logistic regression, or SVM without a kernel (“linear kernel”)

If n is small, m is intermediate:
Use SVM with Gaussian kernel

If n is small, m is large:
Create/add more features, then use logistic regression or SVM without a kernel

A neural network is likely to work well for most of these settings, but may be slower to train.
