KNN Classification Model
1. Concepts
- The k-nearest neighbor algorithm (k-Nearest Neighbor, KNN) classifies samples by measuring the distance between different feature values.
- The role of k: among the K training samples closest to the new (to-be-predicted) sample, the class that appears most often is assigned to the new sample.
- The distance metric used is the Euclidean distance (Euclidean Distance), shown below.
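For two samples $x=(x_1,\dots,x_n)$ and $y=(y_1,\dots,y_n)$ with $n$ features each, the Euclidean distance is:

$$d(x, y) = \sqrt{\sum_{i=1}^{n}\left(x_i - y_i\right)^2}$$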
Notes:
- Different values of k in KNN lead directly to different classification results.
- The n_neighbors parameter represents the k value.
- Model hyperparameter: a parameter of the model that can take different values, where different values directly affect the model's classification or prediction results.
- In KNN the target data does not have to be numeric: the algorithm only computes distances between feature data, never between target values.
Example: classifying movies (action vs. romance)
- Action movies may contain kissing scenes and romance movies may contain fight scenes, so we cannot classify a movie simply by whether fights or kisses occur. However, romance movies have more kissing scenes and action movies have more frequent fight scenes, so the number of times such scenes appear in a movie can be used for classification.
How it works
- There is a sample dataset (the training set) in which every sample is labeled, i.e. we know which class each sample in the set belongs to. When new, unlabeled data arrives, each of its features is compared with the corresponding features of the samples in the training set, and the algorithm extracts the class labels of the most similar (nearest) samples. Typically we only consider the K most similar samples, which is where the K in "K-nearest neighbors" comes from; K is usually an integer no larger than 20. Finally, the class that appears most often among those K most similar samples is chosen as the class of the new data.
The figure below shows the numbers of fight and kiss scenes in six movies. Given a movie we have not seen, how do we determine whether it is a romance or an action movie?
- First, we need to know how many fight scenes and kiss scenes the unknown movie contains.
- Then we compute the distance between the unknown movie and every movie in the sample set.
- With all the distances computed, we sort them in increasing order and take the K closest movies. Assuming k=3, the three nearest movies are California Man, He's Not Really into Dudes, and Beautiful Woman. The K-nearest-neighbor algorithm decides the type of the unknown movie from the types of these three nearest movies: all three are romance movies, so we conclude the unknown movie is a romance movie.
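A minimal from-scratch sketch of this procedure (the scene counts below are assumed for illustration, since the original figure is not reproduced here):

```python
from collections import Counter
import numpy as np

# Scene counts are assumed for illustration; format: [fights, kisses, label]
movies = {
    'California Man':             [3, 104, 'romance'],
    "He's Not Really into Dudes": [2, 100, 'romance'],
    'Beautiful Woman':            [1, 81, 'romance'],
    'Kevin Longblade':            [101, 10, 'action'],
    'Robo Slayer 3000':           [99, 5, 'action'],
    'Amped II':                   [98, 2, 'action'],
}
unknown = np.array([18, 90])  # assumed counts for the unseen movie

# Euclidean distance from the unknown movie to every labeled movie,
# sorted in increasing order
dists = sorted((np.linalg.norm(unknown - np.array(v[:2])), v[2])
               for v in movies.values())

# Majority vote among the k = 3 nearest neighbors
k = 3
labels = [label for _, label in dists[:k]]
print(Counter(labels).most_common(1)[0][0])  # -> romance
```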
2. Finding the optimal k value
Train a model for each candidate k on the adult income data and keep the k with the highest test score (a learning curve):
```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
%matplotlib inline
import matplotlib.pyplot as plt

df = pd.read_csv('./data/adults.txt')
target = df['salary']
feature = df[['age','education_num','occupation','hours_per_week']]
x_train,x_test,y_train,y_test = train_test_split(feature,target,test_size=0.1,random_state=2020)

# One-hot encode the categorical 'occupation' column. This only lines up when
# train and test contain the same occupations; otherwise reindex the test
# dummies to the training columns.
one_hot_train = pd.get_dummies(x_train['occupation'])
one_hot_test = pd.get_dummies(x_test['occupation'])
x_train = pd.concat([x_train,one_hot_train],axis=1).drop(labels='occupation',axis=1)
x_test = pd.concat([x_test,one_hot_test],axis=1).drop(labels='occupation',axis=1)

# Learning curve: test score for every candidate k
scores = []
ks = []
for i in range(5,100):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(x_train,y_train)
    scores.append(knn.score(x_test,y_test))
    ks.append(i)

scores_arr = np.array(scores)
ks_arr = np.array(ks)
plt.plot(ks_arr,scores_arr)
plt.xlabel('k_value')
plt.ylabel('score_value')

# Index of the best score, and the corresponding k
best = scores_arr.argmax()
k_value = ks_arr[best]
```
3. KNN examples
3.1 Iris classification
```python
import sklearn.datasets as datasets
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
feature = iris['data']
target = iris['target']
x_train,x_test,y_train,y_test = train_test_split(feature,target,test_size=0.2,random_state=2020)

knn = KNeighborsClassifier(n_neighbors=6)
knn.fit(x_train,y_train)
y_predict = knn.predict(x_test)
print('Predicted classes:',y_predict)
print('True classes:',y_test)

# Classify one new sample: [sepal length, sepal width, petal length, petal width]
knn.predict([[6.1,3.1,4.7,2.1]])
```
3.2 Predicting whether annual income exceeds $50K
```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

df = pd.read_csv('./data/adults.txt')
target = df['salary']
feature = df[['age','education_num','occupation','hours_per_week']]
x_train,x_test,y_train,y_test = train_test_split(feature,target,test_size=0.1,random_state=2020)

# One-hot encode the categorical 'occupation' column (see the note in section 2)
one_hot_train = pd.get_dummies(x_train['occupation'])
one_hot_test = pd.get_dummies(x_test['occupation'])
x_train = pd.concat([x_train,one_hot_train],axis=1).drop(labels='occupation',axis=1)
x_test = pd.concat([x_test,one_hot_test],axis=1).drop(labels='occupation',axis=1)

# Fit a KNN classifier (default n_neighbors=5) before scoring it
knn = KNeighborsClassifier().fit(x_train,y_train)
knn.score(x_test,y_test)
```
Searching for the optimal k and predicting with it (the data preparation repeats the steps above):

```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
%matplotlib inline
import matplotlib.pyplot as plt

df = pd.read_csv('./data/adults.txt')
target = df['salary']
feature = df[['age','education_num','occupation','hours_per_week']]
x_train,x_test,y_train,y_test = train_test_split(feature,target,test_size=0.1,random_state=2020)

one_hot_train = pd.get_dummies(x_train['occupation'])
one_hot_test = pd.get_dummies(x_test['occupation'])
x_train = pd.concat([x_train,one_hot_train],axis=1).drop(labels='occupation',axis=1)
x_test = pd.concat([x_test,one_hot_test],axis=1).drop(labels='occupation',axis=1)

# Learning curve over candidate k values
scores = []
ks = []
for i in range(5,100):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(x_train,y_train)
    scores.append(knn.score(x_test,y_test))
    ks.append(i)

scores_arr = np.array(scores)
ks_arr = np.array(ks)
plt.plot(ks_arr,scores_arr)
plt.xlabel('k_value')
plt.ylabel('score_value')

# Refit with the best-scoring k, then score and predict on the test set
best = scores_arr.argmax()
k_value = ks_arr[best]
knn = KNeighborsClassifier(n_neighbors=k_value)
knn.fit(x_train,y_train)
score = knn.score(x_test,y_test)
knn.predict(x_test)
```
3.3 Judging dating-site match effectiveness
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier
%matplotlib inline

df = pd.read_csv('./data/datingTestSet.txt',header=None,sep='\s+')
feature = df[[0,1,2]]
target = df[3]
x_train,x_test,y_train,y_test = train_test_split(feature,target,test_size=0.2,random_state=2020)

# KNN is distance based, so features on very different scales must be
# normalized; fit the scaler on the training data only
mm = MinMaxScaler()
m_x_train = mm.fit_transform(x_train)
m_x_test = mm.transform(x_test)

# Learning curve over candidate k values
scores = []
ks = []
for k in range(3,50):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(m_x_train,y_train)
    scores.append(knn.score(m_x_test,y_test))
    ks.append(k)

scores_arr = np.array(scores)
ks_arr = np.array(ks)
plt.plot(ks_arr,scores_arr)
plt.xlabel('k_value')
plt.ylabel('score_value')

# Refit with the best-scoring k
best = scores_arr.argmax()
k = ks_arr[best]
knn = KNeighborsClassifier(n_neighbors=k).fit(m_x_train,y_train)
```
4. Choosing the value of k
4.1 Selecting k with learning curves & cross-validation
- A small k means a more complex model that easily overfits: the estimation error of learning grows, and the prediction becomes very sensitive to the nearest instance points.
- A large k reduces the estimation error but increases the approximation error: training instances far from the input also influence the prediction and can make it wrong. As k grows, model complexity drops.
- In practice, k is usually set to a relatively small value, and cross-validation is typically used to select the optimal k.
- KNN is suited to small-data scenarios, with samples in the thousands to tens of thousands.
4.2 K-fold cross-validation
Purpose:
- Split the training data into different train/validation folds and measure the model's accuracy on each split; the mean accuracy is the cross-validation result. Apply this to each candidate hyperparameter value, and pick the value with the highest accuracy as the hyperparameter used to build the model.
API:

```python
from sklearn.model_selection import cross_val_score

cross_val_score(estimator, X, y, cv)  # returns an array with one score per fold
```
Implementation idea (a minimal sketch follows the list):
1. Split the dataset into k equal folds.
2. Use one fold as test data and the rest as training data.
3. Compute the test accuracy.
4. Repeat steps 2 and 3 with a different test fold each time.
5. Average the accuracies as an estimate of the prediction accuracy on unseen data.
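A minimal from-scratch sketch of these steps on the iris data (illustrative only; the fold count, shuffle seed, and k are arbitrary choices here, not from the original notes):

```python
import numpy as np
import sklearn.datasets as datasets
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X, y = iris['data'], iris['target']

k_folds = 5
idx = np.arange(len(X))
np.random.default_rng(0).shuffle(idx)   # shuffle once so folds are random
folds = np.array_split(idx, k_folds)    # step 1: k (nearly) equal folds

accs = []
for i in range(k_folds):
    test_idx = folds[i]                 # step 2: one fold is the test data
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    model = KNeighborsClassifier(n_neighbors=5)
    model.fit(X[train_idx], y[train_idx])
    accs.append(model.score(X[test_idx], y[test_idx]))  # step 3: accuracy
    # step 4: the loop repeats with a different test fold

print(np.mean(accs))                    # step 5: the mean is the CV estimate
```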
Basic use of cross-validation with KNN:

```python
from sklearn.model_selection import cross_val_score, train_test_split
import sklearn.datasets as datasets
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
feature = iris['data']
target = iris['target']
x_train,x_test,y_train,y_test = train_test_split(feature,target,test_size=0.2,random_state=2020)

knn = KNeighborsClassifier(n_neighbors=5)
# Mean accuracy over 5 folds of the training data
cross_val_score(knn,x_train,y_train,cv=5).mean()
```
Using cross-validation & a learning curve to find the optimal hyperparameter:

```python
from sklearn.model_selection import train_test_split, cross_val_score
import sklearn.datasets as datasets
from sklearn.neighbors import KNeighborsClassifier
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

iris = datasets.load_iris()
feature = iris['data']
target = iris['target']
x_train,x_test,y_train,y_test = train_test_split(feature,target,test_size=0.2,random_state=2020)

# Score each candidate k by its mean cross-validation accuracy
scores = []
ks = []
for k in range(3,20):
    knn = KNeighborsClassifier(n_neighbors=k)
    scores.append(cross_val_score(knn,x_train,y_train,cv=6).mean())
    ks.append(k)

scores_arr = np.array(scores)
ks_arr = np.array(ks)
plt.plot(ks_arr,scores_arr)
plt.xlabel('k_value')
plt.ylabel('score_value')

best = scores_arr.argmax()
k = ks_arr[best]
```
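For reference, sklearn's GridSearchCV can automate this search: it runs cross-validation for every candidate value and keeps the best one. A minimal sketch, as an alternative to the manual loop above (reusing x_train/y_train from it):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Cross-validate every candidate n_neighbors and keep the best one
grid = GridSearchCV(KNeighborsClassifier(),
                    param_grid={'n_neighbors': list(range(3, 20))},
                    cv=6)
grid.fit(x_train, y_train)
print(grid.best_params_, grid.best_score_)  # best k and its mean CV accuracy
```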
4.3 Model selection
Cross-validation can also help us choose between models. Using the iris data, compare and select between KNN and a logistic regression model:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
import sklearn.datasets as datasets
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
feature = iris['data']
target = iris['target']
x_train,x_test,y_train,y_test = train_test_split(feature,target,test_size=0.2,random_state=2020)

knn = KNeighborsClassifier(n_neighbors=5)
print(cross_val_score(knn,x_train,y_train,cv=10).mean())

lr = LogisticRegression(max_iter=1000)  # raise max_iter so the solver converges on iris
print(cross_val_score(lr,x_train,y_train,cv=10).mean())
```
4.4 KFold & cross-validation
The API provided by scikit-learn:

```python
from sklearn.model_selection import KFold

KFold(n_splits, shuffle, random_state)
```

Example:

```python
from numpy import array
from sklearn.model_selection import KFold

data = array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])
kfold = KFold(n_splits=3, shuffle=True, random_state=1)

# split() yields (train_indices, test_indices) for each fold
for train, test in kfold.split(data):
    print('train:%s,test:%s' % (data[train], data[test]))
```
scikit-learn also provides a cross-validation interface with K-Fold support, sklearn.model_selection.cross_validate, but that interface does not shuffle the data itself, so it is usually combined with KFold. If the training data was already shuffled before splitting (for example via train_test_split), cross_val_score can be used directly:

```python
from sklearn.model_selection import KFold, cross_val_score
import sklearn.datasets as datasets
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
feature = iris['data']
target = iris['target']

# Pass the KFold splitter itself as cv so that its shuffling takes effect
n_folds = 5
kf = KFold(n_splits=n_folds, shuffle=True, random_state=42)
knn = KNeighborsClassifier(n_neighbors=5)
scores = cross_val_score(knn, feature, target, cv=kf)
scores.mean()
```
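Since cross_validate is mentioned above but not demonstrated, a minimal sketch: unlike cross_val_score, it returns a dict that also contains fit and score times (reusing knn, feature, target, and kf from the block above):

```python
from sklearn.model_selection import cross_validate

result = cross_validate(knn, feature, target, cv=kf)
print(result['test_score'])         # one accuracy per fold
print(result['test_score'].mean())  # the same mean that cross_val_score gives
```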