Preface:
I work in development and security, and data modeling caught my interest; this is my first time entering such a competition. Model training is involved, but the predictive models are all combinations of existing ones plus parameter tuning, with little touching of network internals. The emphasis is on analyzing the features and finding a suitable regression model.
I have some AI background and would call myself fairly proficient in Python. The competition handed out tutorials; I read through them and wrote up the notes below. Compared with people who specialize in these contests, I'm certainly less professional, but if these notes teach you something, the goal is met.
If you want a quick sense of what data modeling is, I suggest going straight to the Titanic analysis. I did my best to make it complete, but some of the code (especially the machine learning part) is still half-baked rather than fluent. If you're an expert, just smile and move on~
The materials for the Titanic analysis are on my cloud drive.
Download link (extraction code: eand) | First published at https://sleepymonster.cn/
Big Data Analysis: Class Grades Case
Correlation Matrix Heatmap
Correlation coefficient (reference): https://baike.baidu.com/item/%E7%9B%B8%E5%85%B3%E7%B3%BB%E6%95%B0/3109424
The correlation coefficient is a statistical indicator of how closely two variables are related.
It is computed by the product-moment method: starting from each variable's deviations from its own mean, the two deviations are multiplied together to reflect the degree of correlation between the variables.
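To make the product-moment idea concrete, here is a minimal sketch (the arrays x and y are made up purely for illustration) that computes Pearson's r directly from the deviations; pandas' df.corr() below does the same thing for every pair of columns:

import numpy as np

# Made-up sample data, only to illustrate the formula
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

# Product-moment method: multiply the two deviations from the means,
# then normalize by the square root of the product of squared deviations
dx = x - x.mean()
dy = y - y.mean()
r = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())
print(r)                        # close to 1: strong positive correlation
print(np.corrcoef(x, y)[0, 1])  # NumPy's built-in, as a sanity check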
import matplotlib.pyplot as plt
import seaborn as sns

# Heatmap of the pairwise correlations between the DataFrame's numeric columns
fig, ax = plt.subplots(figsize=(10, 10))
sns.heatmap(df.corr(), ax=ax, annot=True, linewidths=0.05, fmt='.2f', cmap='magma')
plt.show()
Bar Chart
Example: visualizing whether candidates have research experience.
This one is very basic, so I'll just drop the code. The result looks rather ugly, though; Excel is worth considering.
# Count candidates without and with research experience
y = np.array([len(df[df.Research == 0]),len(df[df.Research == 1])])
x = np.arange(2)
plt.bar(x,y)
plt.title("Research Experience")
plt.xlabel("Candidates")
plt.ylabel("Frequency")
plt.xticks(x,('Not having research','Having research'))
plt.show()
Example: the lowest, average, and highest TOEFL scores.
y = np.array([df['TOEFL Score'].min(),df['TOEFL Score'].mean(),df['TOEFL Score'].max()])
x = np.arange(3)
plt.bar(x,y)
plt.title('TOEFL Score')
plt.xlabel('Level')
plt.ylabel('TOEFL Score')
plt.xticks(x,('Worst','Average','Best'))
plt.show()
Histogram
Example: plotting a histogram of GRE scores.
df['GRE Score'].plot(kind='hist',bins=200,figsize=(6,6))
plt.title('GRE Score')
plt.xlabel('GRE Score')
plt.ylabel('Frequency')
plt.show()
Scatter Plot
Reference: https://baike.baidu.com/item/%E6%95%A3%E7%82%B9%E5%9B%BE
A scatter plot shows whether two variables are associated and summarizes the distribution pattern of the points.
Example: a scatter plot of university rating against CGPA.
plt.scatter(df['University Rating'],df['CGPA'])
plt.title('CGPA Scores for University ratings')
plt.xlabel('University Rating')
plt.ylabel('CGPA')
plt.show()
Big Data Analysis: UCI Tumor Dataset
Preparation
Dataset source: https://archive.ics.uci.edu/ml/datasets.php
- Determine the ID column
- Determine the outcome (label)
- Determine the indicator (feature) columns
Official scikit-learn site: https://scikit-learn.org/stable/
Dataset
- Import the dataset and process it
import pandas as pd
import numpy as np

# Column names for the Breast Cancer Wisconsin dataset
column_names = ['Sample code number','Clump Thickness','Uniformity of Cell Size','Uniformity of Cell Shape','Marginal Adhesion','Single Epithelial Cell Size','Bare Nuclei','Bland Chromatin','Normal Nucleoli','Mitoses','Class']
data = pd.read_csv('./breast-cancer-wisconsin.data',names=column_names)
print(data)
- Clean the dataset

# '?' marks missing values in this file; replace with NaN, then drop those rows
data = data.replace(to_replace='?',value=np.nan)
print(data)
data = data.dropna(how='any')
print(data.shape)
- Split the dataset

from sklearn.model_selection import train_test_split

# Columns 1-9 are features, column 10 is the class label; 75/25 train/test split
X_train,X_test,y_train,y_test = train_test_split(data[column_names[1:10]],data[column_names[10]],test_size = 0.25,random_state = 33)
print(y_train.value_counts())
print(y_test.value_counts())
Model Training

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier

# Standardize: fit the scaler on the training set and reuse it on the test set
ss = StandardScaler()
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)

# Logistic regression
lr = LogisticRegression()
lr.fit(X_train,y_train)
lr_y_predict = lr.predict(X_test)

# Linear classifier trained with stochastic gradient descent
sgdc = SGDClassifier()
sgdc.fit(X_train,y_train)
sgdc_y_predict = sgdc.predict(X_test)
print(sgdc_y_predict)
Prediction Results
from sklearn.metrics import classification_report
print('Accuracy of LR Classifier:', lr.score(X_test, y_test))
print(classification_report(y_test, lr_y_predict, target_names=['Benign', 'Malignant']))
print('Accuracy of SGD Classifier:',sgdc.score(X_test,y_test))
print(classification_report(y_test,sgdc_y_predict,target_names=['Benign','Malignant']))
Big Data Analysis: UCI Movie Dataset
Reading the CSVs
import warnings
warnings.filterwarnings('ignore')
movie = pd.read_csv('movies.csv')
credit = pd.read_csv('credits.csv')
print(movie.head(1))
print(movie.tail(3))
print(movie.info())
Data Cleaning

import json

# The genres column holds a JSON string; parse it into Python lists of dicts
movie['genres']=movie['genres'].apply(json.loads)
print(movie['genres'].head())
print(list(zip(movie.index,movie['genres']))[:2])

# Flatten each row's genre dicts into a list of genre names
for index, i in zip(movie.index, movie['genres']):
    genresList = [each["name"] for each in i]
    movie.loc[index, 'genres'] = str(genresList)
# Extract the director's name from each movie's crew list
def director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']

# crew is also stored as a JSON string, so parse it before extracting
credit['crew']=credit['crew'].apply(json.loads)
credit['crew']=credit['crew'].apply(director)
credit.rename(columns={'crew':'director'},inplace=True)
fulldf = pd.merge(movie,credit,left_on='id',right_on='movie_id',how='left')
print(fulldf.head(1))
print(fulldf.shape)
fulldf.rename(columns={'title_x':'title'},inplace=True)
fulldf.drop('title_y',axis=1,inplace=True)
# Count the missing values per column
NAs = pd.DataFrame(fulldf.isnull().sum())
NAs[NAs.sum(axis=1)>0].sort_values(by=[0],ascending=False)

# Fill the missing release date and the missing runtimes
fulldf.loc[fulldf['release_date'].isnull(),'title']
fulldf['release_date']=fulldf['release_date'].fillna('2014-06-01')
fulldf['runtime'] = fulldf['runtime'].fillna(fulldf['runtime'].mean())
fulldf['release_year'] = pd.to_datetime(fulldf['release_date'],format='%Y-%m-%d').dt.year
fulldf['release_month'] = pd.to_datetime(fulldf['release_date'],format='%Y-%m-%d').dt.month
Statistics and Plotting

# Turn the stringified genre lists back into real lists of names
fulldf['genres']=fulldf['genres'].str.strip('[]').str.replace(" ","").str.replace("'","")
fulldf['genres']=fulldf['genres'].str.split(',')

# Collect every genre occurrence across all movies
allList=[]
for i in fulldf['genres']:
    allList.extend(i)
gen_list=pd.Series(allList).value_counts()[:10].sort_values(ascending=False)
gen_df = pd.DataFrame(gen_list)
gen_df.rename(columns={0:'Total'},inplace=True)
plt.subplots(figsize=(10,8))
sns.barplot(y=gen_df.index,x='Total',data=gen_df,palette='GnBu_d')
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.xlabel('Total',fontsize=15)
plt.ylabel('Genres',fontsize=15)
plt.title('Top 10 Genres',fontsize=20)
plt.show()
# Deduplicate allList to get the set of unique genres
l=[]
for i in allList:
    if i not in l:
        l.append(i)
year_min = fulldf['release_year'].min()
year_max = fulldf['release_year'].max()

# Genre-by-year count table, initialized to zero
year_genr = pd.DataFrame(index=l,columns=range(year_min,year_max+1))
year_genr.fillna(value=0,inplace=True)

# Walk each movie's genre list and bump the count for its release year
intil_y = np.array(fulldf['release_year'])
z = 0
for i in fulldf['genres']:
    splt_gen = list(i)
    for j in splt_gen:
        year_genr.loc[j,intil_y[z]] = year_genr.loc[j,intil_y[z]]+1
    z+=1
# Keep the 10 most common genres (ranked by the 2006 column) and the last few decades of years
year_genr = year_genr.sort_values(by=2006, ascending=False)
year_genr = year_genr.iloc[0:10, -49:-1]
plt.subplots(figsize=(10,8))
plt.plot(year_genr.T)
plt.title('Genres vs Time',fontsize=20)
plt.xticks(range(1969,2020,5))
plt.legend(year_genr.T)
plt.show()
Big Data Analysis: the Kaggle Titanic Competition
Preparation
train=pd.read_csv('./Data/train.csv')
test=pd.read_csv('./Data/test.csv')
# Concatenate train and test so that features are engineered on both consistently
full=pd.concat([train,test],ignore_index=True,sort=False)
full.head()
Data Cleaning
First, the missing values: Age, Cabin, Embarked, and Fare all have gaps (Survived is the prediction target).
For variables with only a few missing values, impute with the mode or the median.
For variables with many missing values, compare survival between passengers who have the value and those who don't.
Passengers with Cabin data turn out to survive at a much higher rate than those without, so the mere presence or absence of Cabin can serve as a feature.
Age has 263 missing values, and the correlation heatmap shows its correlation with the other variables is low, so it can't be predicted from them directly.
Instead, estimate age from the title in each passenger's name.
full.isnull().sum()

# Embarked: only two values missing, fill with the mode
full.Embarked.mode()
full['Embarked'].fillna('S',inplace=True)

# Fare: one value missing, fill with the median fare of the same class (3rd)
full[full.Fare.isnull()]
full.Fare.fillna(full[full.Pclass==3]['Fare'].median(),inplace=True)

# Cabin: keep only whether a cabin is recorded (1) or not (0)
full.loc[full.Cabin.notnull(),'Cabin']=1
full.loc[full.Cabin.isnull(),'Cabin']=0
pd.pivot_table(full,index=['Cabin'],values=['Survived']).plot.bar(figsize=(8,5))
plt.title('Survival Rate')
cabin=pd.crosstab(full.Cabin,full.Survived)
cabin.rename(index={0:'no cabin',1:'cabin'},columns={0.0:'Dead',1.0:'Survived'},inplace=True)
cabin
cabin.plot.bar(figsize=(8,5))
plt.xticks(rotation=0,size='xx-large')
plt.title('Survived Count')
plt.xlabel('')
plt.legend()
# Extract the title (Mr, Miss, ...) from the Name column
full['Title']=full['Name'].apply(lambda x: x.split(',')[1].split('.')[0].strip())
full.Title.value_counts()
full[(full.Title=='Dr')&(full.Sex=='female')]
nn={'Capt':'Rareman', 'Col':'Rareman','Don':'Rareman','Dona':'Rarewoman',
'Dr':'Rareman','Jonkheer':'Rareman','Lady':'Rarewoman','Major':'Rareman',
'Master':'Master','Miss':'Miss','Mlle':'Rarewoman','Mme':'Rarewoman',
'Mr':'Mr','Mrs':'Mrs','Ms':'Rarewoman','Rev':'Mr','Sir':'Rareman',
'the Countess':'Rarewoman'}
full.Title=full.Title.map(nn)
# PassengerId 797 is a female doctor; the map above filed her under Rareman, so correct her
full.loc[full.PassengerId==797,'Title']='Rarewoman'
full.Title.value_counts()
full[full.Title=='Master']['Sex'].value_counts()
full[full.Title=='Master']['Age'].describe()
# Use 999 as a temporary sentinel for missing ages
full.Age.fillna(999,inplace=True)

# Split 'Miss' into girls and adult women: a known age <= 14 counts as a girl,
# as does a missing age with Parch != 0 (likely travelling with parents)
def girl(aa):
    if (aa.Age!=999)&(aa.Title=='Miss')&(aa.Age<=14):
        return 'Girl'
    elif (aa.Age==999)&(aa.Title=='Miss')&(aa.Parch!=0):
        return 'Girl'
    else:
        return aa.Title
full['Title']=full.apply(girl,axis=1)
full.Title.value_counts()
full[full.Age==999]['Age'].value_counts()
Tit=['Mr','Miss','Mrs','Master','Girl','Rareman','Rarewoman']
for i in Tit:
    # impute with the median of the known ages in each title group (exclude the 999 sentinel)
    full.loc[(full.Age==999)&(full.Title==i),'Age']=full.loc[(full.Title==i)&(full.Age!=999),'Age'].median()
full.info()
Exploratory Visualization
- It is widely believed that women on the Titanic survived at a far higher rate than men
- Plotting age against the number of survivors shows that children under 5 survived at a very high rate
- Passenger class (Pclass) naturally also has a strong relationship with survival
full.groupby(['Title'])[['Age']].mean().plot(kind='bar',figsize=(8,5))
plt.xticks(rotation=0)
plt.show()
pd.crosstab(full.Sex,full.Survived).plot.bar(stacked=True,figsize=(8,5),color=['#4169E1','#FF00FF'])
plt.xticks(rotation=0,size='large')
plt.legend(bbox_to_anchor=(0.55,0.9))
agehist=pd.concat([full[full.Survived==1]['Age'],full[full.Survived==0]['Age']],axis=1)
agehist.columns=['Survived','Dead']
agehist.plot(kind='hist',bins=30,figsize=(15,8),alpha=0.3)
farehist=pd.concat([full[full.Survived==1]['Fare'],full[full.Survived==0]['Fare']],axis=1)
farehist.columns=['Survived','Dead']
farehist.head()
full.groupby(['Title'])[['Survived']].mean().plot(kind='bar',figsize=(10,7))
plt.xticks(rotation=0)
# Survival counts split by sex (rows) and passenger class (columns)
fig,axes=plt.subplots(2,3,figsize=(15,8))
Sex1=['male','female']
for i,ax in zip(Sex1,axes):
    for j,pp in zip(range(1,4),ax):
        PclassSex=full[(full.Sex==i)&(full.Pclass==j)]['Survived'].value_counts().sort_index(ascending=False)
        pp.bar(range(len(PclassSex)),PclassSex,label=(i,'Class'+str(j)))
        pp.set_xticks((0,1))
        pp.set_xticklabels(('Survived','Dead'))
        pp.legend(bbox_to_anchor=(0.6,1.1))
Feature Engineering
The training and test sets are small, so overfitting comes easily.
Build features according to what the exploratory visualization revealed.

# Inspect 5 equal-width Age bins and 5 equal-frequency Fare bins to read off cut points
pd.cut(full.Age,5).value_counts().sort_index()
pd.qcut(full.Fare,5).value_counts().sort_index()
# Encode the bins as ordinal codes 1-5, using the cut points shown above
full.loc[full.Age<=16.136,'AgeCut']=1
full.loc[(full.Age>16.136)&(full.Age<=32.102),'AgeCut']=2
full.loc[(full.Age>32.102)&(full.Age<=48.068),'AgeCut']=3
full.loc[(full.Age>48.068)&(full.Age<=64.034),'AgeCut']=4
full.loc[full.Age>64.034,'AgeCut']=5
full.loc[full.Fare<=7.854,'FareCut']=1
full.loc[(full.Fare>7.854)&(full.Fare<=10.5),'FareCut']=2
full.loc[(full.Fare>10.5)&(full.Fare<=21.558),'FareCut']=3
full.loc[(full.Fare>21.558)&(full.Fare<=41.579),'FareCut']=4
full.loc[full.Fare>41.579,'FareCut']=5
full[['FareCut','Survived']].groupby(['FareCut']).mean().plot.bar(figsize=(8,5))
full.corr()
full[full.Survived.notnull()].pivot_table(index=['Title','Pclass'],values=['Survived']).sort_values('Survived',ascending=False)
full[full.Survived.notnull()].pivot_table(index=['Title','Parch'],values=['Survived']).sort_values('Survived',ascending=False)
# TPP: survival rate grouped by Title, Pclass and Parch
TPP=full[full.Survived.notnull()].pivot_table(index=['Title','Pclass','Parch'],values=['Survived']).sort_values('Survived',ascending=False)
TPP
TPP.plot(kind='bar',figsize=(16,10))
plt.xticks(rotation=40)
plt.axhline(0.8,color='#BA55D3')
plt.axhline(0.5,color='#BA55D3')
plt.annotate('80% survival rate',xy=(30,0.81),xytext=(32,0.85),arrowprops=dict(facecolor='#BA55D3',shrink=0.05))
plt.annotate('50% survival rate',xy=(32,0.51),xytext=(34,0.54),arrowprops=dict(facecolor='#BA55D3',shrink=0.05))
# Code each (Title, Pclass, Parch) group by its training-set survival rate:
# 1 = >=80%, 2 = >=50%, 3 = <50%, 4 = group absent from the training set
Tit=['Girl','Master','Mr','Miss','Mrs','Rareman','Rarewoman']
for i in Tit:
    for j in range(1,4):
        for g in range(0,10):
            if full.loc[(full.Title==i)&(full.Pclass==j)&(full.Parch==g)&(full.Survived.notnull()),'Survived'].mean()>=0.8:
                full.loc[(full.Title==i)&(full.Pclass==j)&(full.Parch==g),'TPP']=1
            elif full.loc[(full.Title==i)&(full.Pclass==j)&(full.Parch==g)&(full.Survived.notnull()),'Survived'].mean()>=0.5:
                full.loc[(full.Title==i)&(full.Pclass==j)&(full.Parch==g),'TPP']=2
            elif full.loc[(full.Title==i)&(full.Pclass==j)&(full.Parch==g)&(full.Survived.notnull()),'Survived'].mean()>=0:
                full.loc[(full.Title==i)&(full.Pclass==j)&(full.Parch==g),'TPP']=3
            else:
                full.loc[(full.Title==i)&(full.Pclass==j)&(full.Parch==g),'TPP']=4
full[full.TPP==4]
# For groups unseen in the training set, fall back on sex and class
full.loc[(full.TPP==4)&(full.Sex=='female')&(full.Pclass!=3),'TPP']=1
full.loc[(full.TPP==4)&(full.Sex=='female')&(full.Pclass==3),'TPP']=2
full.loc[(full.TPP==4)&(full.Sex=='male')&(full.Pclass!=3),'TPP']=2
full.loc[(full.TPP==4)&(full.Sex=='male')&(full.Pclass==3),'TPP']=3
full.TPP.value_counts()
Basic Modeling and Evaluation
Now pick algorithms, run cross-validation, and keep the best result. Candidates:
- K-Nearest Neighbors
- Logistic Regression
- Naive Bayes classifier
- Decision Tree
- Random Forest
- Gradient Boosting Decision Tree
- Support Vector Machine
Since K-Nearest Neighbors and Support Vector Machines are sensitive to the scale of the data, apply standard scaling first.
After the better models are identified, pick one of them for error analysis:
extract the observations it misclassifies, look for patterns in them, and derive new features to raise the overall accuracy.
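An aside before building the predictor matrix: the error analysis just described is done further below with a manual KFold loop, but scikit-learn's cross_val_predict produces the same out-of-fold predictions in one call. A minimal sketch, assuming X and y as constructed in the cells that follow:

from sklearn.model_selection import cross_val_predict
from sklearn.ensemble import GradientBoostingClassifier

# Each row is predicted by a model that never saw it during training
oof_pred = cross_val_predict(GradientBoostingClassifier(), X, y, cv=10)
wrong_index = y[oof_pred != y].index  # indices of the misclassified passengers
print(len(wrong_index))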
predictors=['Cabin','Embarked','Parch','Pclass','Sex','SibSp','Title','AgeCut','TPP','FareCut','Age','Fare']
full_dummies=pd.get_dummies(full[predictors])  # convert the categorical variables into numeric dummies
full_dummies.head()
"Alchemy" (Parameter Tuning)
Tuning parameters: a mixture of time and luck.
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import SVC
models=[KNeighborsClassifier(),LogisticRegression(),GaussianNB(),DecisionTreeClassifier(),RandomForestClassifier(),
GradientBoostingClassifier(),SVC()]
full.shape,full_dummies.shape
X=full_dummies[:891]
y=full.Survived[:891]
test_X=full_dummies[891:]
from sklearn.preprocessing import StandardScaler
scaler=StandardScaler()
# Fit the scaler on the training rows only, then apply the same scaling to the test rows
X_scaled=scaler.fit_transform(X)
test_X_scaled=scaler.transform(test_X)
# Cross-validate every model on the raw features...
names=['KNN','LR','NB','Tree','RF','GDBT','SVM']
for name, model in zip(names,models):
    score=cross_val_score(model,X,y,cv=5)
    print("{}:{},{}".format(name,score.mean(),score))

# ...and again on the standardized features
names=['KNN','LR','NB','Tree','RF','GDBT','SVM']
for name, model in zip(names,models):
    score=cross_val_score(model,X_scaled,y,cv=5)
    print("{}:{},{}".format(name,score.mean(),score))
# Fit a gradient boosting model and inspect which features it relies on
model=GradientBoostingClassifier()
model.fit(X,y)
model.feature_importances_
fi=pd.DataFrame({'importance':model.feature_importances_},index=X.columns)
fi.sort_values('importance',ascending=False)
fi.sort_values('importance',ascending=False).plot.bar(figsize=(11,7))
plt.xticks(rotation=30)
plt.title('Feature Importance',size='x-large')
from sklearn.model_selection import KFold

# shuffle must be enabled for random_state to take effect
kf=KFold(n_splits=10,shuffle=True,random_state=1)
kf.get_n_splits(X)
print(kf)

# Collect the indices of passengers misclassified in out-of-fold predictions
rr=[]
for train_index, val_index in kf.split(X):
    pred=model.fit(X.loc[train_index],y[train_index]).predict(X.loc[val_index])
    rr.append(y[val_index][pred!=y[val_index]].index.values)
whole_index=np.concatenate(rr)
len(whole_index)
full.loc[whole_index].head()
# diff holds the misclassified passengers; look for patterns in them
diff=full.loc[whole_index]
diff.describe()
diff.describe(include=['O'])
diff.groupby(['Title'])['Survived'].agg([('average','mean'),('number','count')])
diff.groupby(['Title','Pclass'])['Survived'].agg([('average','mean'),('number','count')])
diff.groupby(['Title','Pclass','Parch','SibSp'])['Survived'].agg([('average','mean'),('number','count')])
# MPPS: flag the Mr/Miss subgroups that the model keeps getting wrong
full.loc[(full.Title=='Mr')&(full.Pclass==1)&(full.Parch==0)&((full.SibSp==0)|(full.SibSp==1)),'MPPS']=1
full.loc[(full.Title=='Mr')&(full.Pclass!=1)&(full.Parch==0)&(full.SibSp==0),'MPPS']=2
full.loc[(full.Title=='Miss')&(full.Pclass==3)&(full.Parch==0)&(full.SibSp==0),'MPPS']=3
full.MPPS.fillna(4,inplace=True)
full.MPPS.value_counts()
diff[(diff.Title=='Mr')|(diff.Title=='Miss')].groupby(['Title','Survived','Pclass'])[['Fare']].describe().unstack()
full[(full.Title=='Mr')|(full.Title=='Miss')].groupby(['Title','Survived','Pclass'])[['Fare']].describe().unstack()
colormap = plt.cm.viridis
plt.figure(figsize=(12,12))
plt.title('Pearson Correlation of Features', y=1.05, size=20)
sns.heatmap(full[['Cabin','Parch','Pclass','SibSp','AgeCut','TPP','FareCut','Age','Fare','MPPS','Survived']].astype(float).corr(),linewidths=0.1,vmax=1.0, square=True, cmap=colormap, linecolor='white', annot=True)
predictors=['Cabin','Embarked','Parch','Pclass','Sex','SibSp','Title','AgeCut','TPP','FareCut','Age','Fare','MPPS']
full_dummies=pd.get_dummies(full[predictors])
X=full_dummies[:891]
y=full.Survived[:891]
test_X=full_dummies[891:]
scaler=StandardScaler()
X_scaled=scaler.fit_transform(X)
test_X_scaled=scaler.transform(test_X)
from sklearn.model_selection import GridSearchCV

# KNN: search over the number of neighbors
param_grid={'n_neighbors':[1,2,3,4,5,6,7,8,9]}
grid_search=GridSearchCV(KNeighborsClassifier(),param_grid,cv=5)
grid_search.fit(X_scaled,y)
grid_search.best_params_,grid_search.best_score_
# Logistic regression: coarse search over the regularization strength C...
param_grid={'C':[0.01,0.1,1,10]}
grid_search=GridSearchCV(LogisticRegression(),param_grid,cv=5)
grid_search.fit(X_scaled,y)
grid_search.best_params_,grid_search.best_score_
# ...then a finer search around the best C
param_grid={'C':[0.04,0.06,0.08,0.1,0.12,0.14]}
grid_search=GridSearchCV(LogisticRegression(),param_grid,cv=5)
grid_search.fit(X_scaled,y)
grid_search.best_params_,grid_search.best_score_
# SVM: coarse search over C and gamma...
param_grid={'C':[0.01,0.1,1,10],'gamma':[0.01,0.1,1,10]}
grid_search=GridSearchCV(SVC(),param_grid,cv=5)
grid_search.fit(X_scaled,y)
grid_search.best_params_,grid_search.best_score_
# ...then refine around the best values
param_grid={'C':[2,4,6,8,10,12,14],'gamma':[0.008,0.01,0.012,0.015,0.02]}
grid_search=GridSearchCV(SVC(),param_grid,cv=5)
grid_search.fit(X_scaled,y)
grid_search.best_params_,grid_search.best_score_
# GBDT: coarse search over the main tree parameters...
param_grid={'n_estimators':[30,50,80,120,200],'learning_rate':[0.05,0.1,0.5,1],'max_depth':[1,2,3,4,5]}
grid_search=GridSearchCV(GradientBoostingClassifier(),param_grid,cv=5)
grid_search.fit(X_scaled,y)
grid_search.best_params_,grid_search.best_score_
# ...then refine
param_grid={'n_estimators':[100,120,140,160],'learning_rate':[0.05,0.08,0.1,0.12],'max_depth':[3,4]}
grid_search=GridSearchCV(GradientBoostingClassifier(),param_grid,cv=5)
grid_search.fit(X_scaled,y)
grid_search.best_params_,grid_search.best_score_
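GridSearchCV tries every combination, so the cost explodes as grids grow. One alternative (not used in this notebook, just a sketch) is scikit-learn's RandomizedSearchCV, which samples a fixed number of combinations from the same space and often lands on comparable parameters much faster:

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform

# Sample 20 random combinations instead of walking the full grid
param_dist={'n_estimators':randint(50,200),        # integers in [50,200)
            'learning_rate':uniform(0.05,0.45),    # floats in [0.05,0.5]
            'max_depth':randint(1,6)}
rand_search=RandomizedSearchCV(GradientBoostingClassifier(),param_dist,n_iter=20,cv=5,random_state=1)
rand_search.fit(X_scaled,y)
print(rand_search.best_params_,rand_search.best_score_)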
Ensemble Methods
Logistic regression, KNN, SVM, and gradient boosting trees form the first layer; a random forest serves as the second layer.
Overall, judging by cross-validation, the ensembles don't improve much over the single models. Possible reasons:
- the dataset is too small, so the models aren't trained sufficiently
- the sub-models in the ensemble are too strongly correlated
- the ensemble methods themselves may also need parameter tuning
from sklearn.ensemble import BaggingClassifier

# Bagging: 100 logistic regressions fitted on bootstrap samples
bagging=BaggingClassifier(LogisticRegression(C=0.06),n_estimators=100)
from sklearn.ensemble import VotingClassifier

# The tuned base models
clf1=LogisticRegression(C=0.06)
clf2=RandomForestClassifier(n_estimators=500)
clf3=GradientBoostingClassifier(n_estimators=120,learning_rate=0.12,max_depth=4)
clf4=SVC(C=4,gamma=0.015,probability=True)
clf5=KNeighborsClassifier(n_neighbors=8)

# Hard and soft voting, each with and without extra weight on GBDT and SVM
eclf_hard=VotingClassifier(estimators=[('LR',clf1),('RF',clf2),('GDBT',clf3),('SVM',clf4),('KNN',clf5)])
eclfW_hard=VotingClassifier(estimators=[('LR',clf1),('RF',clf2),('GDBT',clf3),('SVM',clf4),('KNN',clf5)],weights=[1,1,2,2,1])
eclf_soft=VotingClassifier(estimators=[('LR',clf1),('RF',clf2),('GDBT',clf3),('SVM',clf4),('KNN',clf5)],voting='soft')
eclfW_soft=VotingClassifier(estimators=[('LR',clf1),('RF',clf2),('GDBT',clf3),('SVM',clf4),('KNN',clf5)],voting='soft',weights=[1,1,2,2,1])
models=[KNeighborsClassifier(n_neighbors=8),LogisticRegression(C=0.06),GaussianNB(),DecisionTreeClassifier(),RandomForestClassifier(n_estimators=500),
GradientBoostingClassifier(n_estimators=120,learning_rate=0.12,max_depth=4),SVC(C=4,gamma=0.015),
eclf_hard,eclf_soft,eclfW_hard,eclfW_soft,bagging]
names=['KNN','LR','NB','CART','RF','GBT','SVM','VC_hard','VC_soft','VCW_hard','VCW_soft','Bagging']
for name,model in zip(names,models):
    score=cross_val_score(model,X_scaled,y,cv=5)
    print("{}: {},{}".format(name,score.mean(),score))
from sklearn.model_selection import StratifiedKFold
n_train=train.shape[0]
n_test=test.shape[0]
kf=StratifiedKFold(n_splits=5,random_state=1,shuffle=True)

# Out-of-fold (OOF) predictions for stacking: every training row is predicted by a
# model that never saw it, and the test predictions are averaged over the five folds
def get_oof(clf,X,y,test_X):
    oof_train=np.zeros((n_train,))
    oof_test_mean=np.zeros((n_test,))
    oof_test_single=np.empty((5,n_test))
    for i, (train_index,val_index) in enumerate(kf.split(X,y)):
        kf_X_train=X[train_index]
        kf_y_train=y[train_index]
        kf_X_val=X[val_index]
        clf.fit(kf_X_train,kf_y_train)
        oof_train[val_index]=clf.predict(kf_X_val)
        oof_test_single[i,:]=clf.predict(test_X)
    oof_test_mean=oof_test_single.mean(axis=0)
    return oof_train.reshape(-1,1), oof_test_mean.reshape(-1,1)
LR_train,LR_test=get_oof(LogisticRegression(C=0.06),X_scaled,y,test_X_scaled)
KNN_train,KNN_test=get_oof(KNeighborsClassifier(n_neighbors=8),X_scaled,y,test_X_scaled)
SVM_train,SVM_test=get_oof(SVC(C=4,gamma=0.015),X_scaled,y,test_X_scaled)
GBDT_train,GBDT_test=get_oof(GradientBoostingClassifier(n_estimators=120,learning_rate=0.12,max_depth=4),X_scaled,y,test_X_scaled)
# Stack the four OOF prediction columns as features for the second-layer model
X_stack=np.concatenate((LR_train,KNN_train,SVM_train,GBDT_train),axis=1)
y_stack=y
X_test_stack=np.concatenate((LR_test,KNN_test,SVM_test,GBDT_test),axis=1)
stack_score=cross_val_score(RandomForestClassifier(n_estimators=1000),X_stack,y_stack,cv=5)
stack_score.mean(),stack_score
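For reference, scikit-learn 0.22 and later ship a StackingClassifier that wraps this whole scheme (out-of-fold first-layer predictions feeding a second-layer model) in a single estimator. A minimal sketch with the same sub-models, assuming a recent scikit-learn version:

from sklearn.ensemble import StackingClassifier

stack=StackingClassifier(
    estimators=[('LR',LogisticRegression(C=0.06)),
                ('KNN',KNeighborsClassifier(n_neighbors=8)),
                ('SVM',SVC(C=4,gamma=0.015)),
                ('GBDT',GradientBoostingClassifier(n_estimators=120,learning_rate=0.12,max_depth=4))],
    final_estimator=RandomForestClassifier(n_estimators=1000),
    cv=5)
stack_cv=cross_val_score(stack,X_scaled,y,cv=5)
print(stack_cv.mean(),stack_cv)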
Prediction
pred=RandomForestClassifier(n_estimators=500).fit(X_stack,y_stack).predict(X_test_stack)

# Kaggle expects integer labels in the submission file
tt=pd.DataFrame({'PassengerId':test.PassengerId,'Survived':pred.astype(int)})
tt.to_csv('G.csv',index=False)