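1. Introduction
In the previous chapters we studied regression and classification algorithms and discussed two ensemble learning approaches that combine them into stronger models: Bagging and Boosting. This chapter covers the last member of the ensemble family, Stacking. In competitions it is sometimes called the "lazy" method, because it can yield a well-performing model without much time spent on hyperparameter tuning. It is also easier to understand than the previous two approaches, since it requires little theory and can simply be applied in practice. Strictly speaking, Stacking is not an algorithm but an elegant, if intricate, strategy for combining models. It can be viewed as a two-layer ensemble: the first layer contains several base classifiers whose predictions (meta-features) are passed to the second layer, and the second-layer classifier, usually logistic regression, fits on those first-layer outputs as features and produces the final prediction. Before introducing Stacking, we first discuss a simplified version called Blending, and then look at Stacking in more depth.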
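2. The Blending ensemble method
- (1) Split the data into a training set and a test set (test_set), then split the training set again into a training set (train_set) and a validation set (val_set);
- (2) Create the first-layer models; they may be homogeneous or heterogeneous;
- (3) Train the models from step 2 on train_set, then use the trained models to predict val_set and test_set, giving val_predict and test_predict1;
- (4) Create the second-layer model and train it with val_predict as its training features (the val_set labels are the targets);
- (5) Use the trained second-layer model to predict on test_predict1, the second-layer test input; the result is the prediction for the whole test set.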
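The five steps above can be sketched roughly as follows. This is a minimal illustration and not the chapter's own code: the synthetic dataset, the choice of three base models, the split ratios and the name blend_accuracy are all assumptions made for the example; only val_predict and test_predict1 follow the naming in the steps.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for a real dataset (assumption for illustration).
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Step 1: split off test_set, then split the remainder into train_set and val_set.
X_full, X_test, y_full, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_full, y_full, test_size=0.3, random_state=0)

# Step 2: first-layer models (heterogeneous in this sketch).
base_models = [
    KNeighborsClassifier(n_neighbors=5),
    DecisionTreeClassifier(max_depth=5, random_state=0),
    LogisticRegression(max_iter=1000),
]

# Step 3: fit on train_set, then predict val_set and test_set.
for model in base_models:
    model.fit(X_train, y_train)
val_predict = np.column_stack([m.predict_proba(X_val)[:, 1] for m in base_models])
test_predict1 = np.column_stack([m.predict_proba(X_test)[:, 1] for m in base_models])

# Step 4: the second-layer model is trained on val_predict with the val_set labels.
blender = LogisticRegression()
blender.fit(val_predict, y_val)

# Step 5: the second-layer prediction on test_predict1 is the final test prediction.
test_pred = blender.predict(test_predict1)
blend_accuracy = float((test_pred == y_test).mean())
print(blend_accuracy)
```

Note that the second layer only ever sees the validation rows, which is why Blending uses less data than Stacking.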

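3. The Stacking ensemble method

- First split the data into a training set and a test set (say 10,000 training rows and 2,500 test rows). The first layer then runs 5-fold cross-validation: in each fold, 8,000 training rows are used for fitting and the remaining 2,000 rows serve as the validation fold (orange in the figure).
- Each fold trains one model on the 8,000 rows (blue), predicts the 2,000-row validation fold, and also predicts the 2,500-row test set. After 5 folds this yields 5 × 2,000 validation predictions (one prediction for every training row) and 5 × 2,500 test predictions.
- The 5 × 2,000 validation predictions are concatenated into a single column of 10,000 rows, denoted A1. The 5 × 2,500 test predictions are combined by a weighted average into a single column of 2,500 rows, denoted B1.
- The above gives the predictions of one base model on the dataset, A1 and B1. With three base models we obtain A1, A2, A3 and B1, B2, B3.
- Next, A1, A2 and A3 are placed side by side to form a 10,000 × 3 matrix used as the training data, and B1, B2 and B3 are combined into a 2,500 × 3 matrix used as the testing data; the second-layer learner is then trained on this data.
- This retraining uses each base model's predictions as a feature (three features). The meta-learner learns how to weight the base learners' predictions (weights w) so that the final prediction is as accurate as possible.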
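The procedure above can be sketched as follows, scaled down to 1,000 training rows and 250 test rows. This is an illustrative sketch under assumptions, not the chapter's own code: the synthetic data, the three base models and the helper out_of_fold are hypothetical, and the fold-wise test predictions are combined with a plain (equal-weight) mean.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for a real dataset (assumption for illustration).
X, y = make_classification(n_samples=1250, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)


def out_of_fold(model, X_train, y_train, X_test, n_splits=5):
    """Hypothetical helper: return (A, B) for one base model.

    A holds one out-of-fold prediction per training row;
    B is the mean of the per-fold predictions on the test set.
    """
    A = np.zeros(len(X_train))
    B_folds = np.zeros((n_splits, len(X_test)))
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for i, (fit_idx, val_idx) in enumerate(kf.split(X_train)):
        fold_model = clone(model).fit(X_train[fit_idx], y_train[fit_idx])
        A[val_idx] = fold_model.predict_proba(X_train[val_idx])[:, 1]
        B_folds[i] = fold_model.predict_proba(X_test)[:, 1]
    return A, B_folds.mean(axis=0)


base_models = [
    KNeighborsClassifier(n_neighbors=5),
    DecisionTreeClassifier(max_depth=5, random_state=0),
    LogisticRegression(max_iter=1000),
]

# One (A, B) pair per base model, stacked column-wise: A1..A3 and B1..B3.
pairs = [out_of_fold(m, X_train, y_train, X_test) for m in base_models]
train_meta = np.column_stack([A for A, _ in pairs])
test_meta = np.column_stack([B for _, B in pairs])

# Second layer: learns how to weight the base models' predictions.
meta_model = LogisticRegression()
meta_model.fit(train_meta, y_train)
stack_accuracy = float((meta_model.predict(test_meta) == y_test).mean())
print(stack_accuracy)
```

Because every training row receives a prediction from a model that never saw it, the second layer can be trained on the full training set rather than on a small hold-out split.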
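Blending versus Stacking. The advantage of Blending:
- It is simpler than Stacking (no k-fold cross-validation is needed to obtain the stacker features).
The disadvantages of Blending:
- It uses much less data (the second layer is trained on a single hold-out split rather than on cross-validated predictions).
- The blender may overfit (most likely a consequence of the small hold-out split).
- Stacking, which relies on repeated cross-validation, tends to be more robust.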
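4. Conclusion
In this chapter we discussed how to combine multiple models with Blending and Stacking. Compared with Bagging and Boosting, these two approaches are simpler and more intuitive, and they often perform well; hence the saying in competitions that it (Stacking) can help you beat the current state-of-the-art academic algorithms. We have now covered all of the ensemble learning approaches. In Chapter 6 we will use several larger case studies to demonstrate the power of ensemble learning.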
Titanic feature engineering (to be completed)
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
%matplotlib inline
import xlwings as xw
train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId 418 non-null int64
Pclass 418 non-null int64
Name 418 non-null object
Sex 418 non-null object
Age 332 non-null float64
SibSp 418 non-null int64
Parch 418 non-null int64
Ticket 418 non-null object
Fare 417 non-null float64
Cabin 91 non-null object
Embarked 418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB
[Figure: pie chart with two slices, 61.62% and 38.38%, matching the shares of Survived = 0 and Survived = 1]

array(['S'], dtype=object)
train_data['Embarked'].value_counts().plot.pie(autopct = '%1.2f%%')

train_data.loc[train_data['Embarked'].isnull(), 'Embarked'] = train_data['Embarked'].dropna().mode().iloc[0]
train_data['Cabin'] = train_data.Cabin.fillna('U0')
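(4) Use a model such as regression or a random forest to predict the missing values. Age is a fairly important feature in this dataset (as an analysis of Age shows), so filling its missing values reasonably accurately matters and can noticeably affect the results. Typically, the rows with complete data are used as the training set for a model that predicts the missing values. For this dataset either a random forest or linear regression could be used; here we use a random forest and select the numeric attributes as features (sklearn models only accept numeric input, so only numeric features are selected for now; in a real application the non-numeric features should be converted to numeric ones first).
Copyright notice: this is an original article by the CSDN blogger 「大树先生的博客」, licensed under CC 4.0 BY-SA; when reposting, please include the link to the original source and this notice. Original link: https://blog.csdn.net/koala_tree/article/details/78725881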
from sklearn.ensemble import RandomForestRegressor
age_df = train_data[['Age','Survived','Fare', 'Parch', 'SibSp', 'Pclass']]
age_df_notnull = age_df.loc[(train_data['Age'].notnull())]
age_df_isnull = age_df.loc[(train_data['Age'].isnull())]
| Age | Survived | Fare | Parch | SibSp | Pclass |
0 | 22.0 | 0 | 7.2500 | 0 | 1 | 3 |
1 | 38.0 | 1 | 71.2833 | 0 | 1 | 1 |
2 | 26.0 | 1 | 7.9250 | 0 | 0 | 3 |
3 | 35.0 | 1 | 53.1000 | 0 | 1 | 1 |
4 | 35.0 | 0 | 8.0500 | 0 | 0 | 3 |
... | ... | ... | ... | ... | ... | ... |
885 | 39.0 | 0 | 29.1250 | 5 | 0 | 3 |
886 | 27.0 | 0 | 13.0000 | 0 | 0 | 2 |
887 | 19.0 | 1 | 30.0000 | 0 | 0 | 1 |
889 | 26.0 | 1 | 30.0000 | 0 | 0 | 1 |
890 | 32.0 | 0 | 7.7500 | 0 | 0 | 3 |
714 rows × 6 columns
array([[ 0. , 7.25 , 0. , 1. , 3. ],
[ 1. , 71.2833, 0. , 1. , 1. ],
[ 1. , 7.925 , 0. , 0. , 3. ],
[ 1. , 30. , 0. , 0. , 1. ],
[ 1. , 30. , 0. , 0. , 1. ],
[ 0. , 7.75 , 0. , 0. , 3. ]])
array([22. , 38. , 26. , 35. , 35. , 54. , 2. , 27. , 14. ,
4. , 58. , 20. , 39. , 14. , 55. , 2. , 31. , 35. ,
34. , 15. , 28. , 8. , 38. , 19. , 40. , 66. , 28. ,
42. , 21. , 18. , 14. , 40. , 27. , 3. , 19. , 18. ,
7. , 21. , 49. , 29. , 65. , 21. , 28.5 , 5. , 11. ,
22. , 38. , 45. , 4. , 29. , 19. , 17. , 26. , 32. ,
16. , 21. , 26. , 32. , 25. , 0.83, 30. , 22. , 29. ,
28. , 17. , 33. , 16. , 23. , 24. , 29. , 20. , 46. ,
26. , 59. , 71. , 23. , 34. , 34. , 28. , 21. , 33. ,
37. , 28. , 21. , 38. , 47. , 14.5 , 22. , 20. , 17. ,
21. , 70.5 , 29. , 24. , 2. , 21. , 32.5 , 32.5 , 54. ,
12. , 24. , 45. , 33. , 20. , 47. , 29. , 25. , 23. ,
19. , 37. , 16. , 24. , 22. , 24. , 19. , 18. , 19. ,
27. , 9. , 36.5 , 42. , 51. , 22. , 55.5 , 40.5 , 51. ,
16. , 30. , 44. , 40. , 26. , 17. , 1. , 9. , 45. ,
28. , 61. , 4. , 1. , 21. , 56. , 18. , 50. , 30. ,
36. , 9. , 1. , 4. , 45. , 40. , 36. , 32. , 19. ,
19. , 3. , 44. , 58. , 42. , 24. , 28. , 34. , 45.5 ,
18. , 2. , 32. , 26. , 16. , 40. , 24. , 35. , 22. ,
30. , 31. , 27. , 42. , 32. , 30. , 16. , 27. , 51. ,
38. , 22. , 19. , 20.5 , 18. , 35. , 29. , 59. , 5. ,
24. , 44. , 8. , 19. , 33. , 29. , 22. , 30. , 44. ,
25. , 24. , 37. , 54. , 29. , 62. , 30. , 41. , 29. ,
30. , 35. , 50. , 3. , 52. , 40. , 36. , 16. , 25. ,
58. , 35. , 25. , 41. , 37. , 63. , 45. , 7. , 35. ,
65. , 28. , 16. , 19. , 33. , 30. , 22. , 42. , 22. ,
26. , 19. , 36. , 24. , 24. , 23.5 , 2. , 50. , 19. ,
0.92, 17. , 30. , 30. , 24. , 18. , 26. , 28. , 43. ,
26. , 24. , 54. , 31. , 40. , 22. , 27. , 30. , 22. ,
36. , 61. , 36. , 31. , 16. , 45.5 , 38. , 16. , 29. ,
41. , 45. , 45. , 2. , 24. , 28. , 25. , 36. , 24. ,
40. , 3. , 42. , 23. , 15. , 25. , 28. , 22. , 38. ,
40. , 29. , 45. , 35. , 30. , 60. , 24. , 25. , 18. ,
19. , 22. , 3. , 22. , 27. , 20. , 19. , 42. , 1. ,
32. , 35. , 18. , 1. , 36. , 17. , 36. , 21. , 28. ,
23. , 24. , 22. , 31. , 46. , 23. , 28. , 39. , 26. ,
21. , 28. , 20. , 34. , 51. , 3. , 21. , 33. , 44. ,
34. , 18. , 30. , 10. , 21. , 29. , 28. , 18. , 28. ,
19. , 32. , 28. , 42. , 17. , 50. , 14. , 21. , 24. ,
64. , 31. , 45. , 20. , 25. , 28. , 4. , 13. , 34. ,
5. , 52. , 36. , 30. , 49. , 29. , 65. , 50. , 48. ,
34. , 47. , 48. , 38. , 56. , 0.75, 38. , 33. , 23. ,
22. , 34. , 29. , 22. , 2. , 9. , 50. , 63. , 25. ,
35. , 58. , 30. , 9. , 21. , 55. , 71. , 21. , 54. ,
25. , 24. , 17. , 21. , 37. , 16. , 18. , 33. , 28. ,
26. , 29. , 36. , 54. , 24. , 47. , 34. , 36. , 32. ,
30. , 22. , 44. , 40.5 , 50. , 39. , 23. , 2. , 17. ,
30. , 7. , 45. , 30. , 22. , 36. , 9. , 11. , 32. ,
50. , 64. , 19. , 33. , 8. , 17. , 27. , 22. , 22. ,
62. , 48. , 39. , 36. , 40. , 28. , 24. , 19. , 29. ,
32. , 62. , 53. , 36. , 16. , 19. , 34. , 39. , 32. ,
25. , 39. , 54. , 36. , 18. , 47. , 60. , 22. , 35. ,
52. , 47. , 37. , 36. , 49. , 49. , 24. , 44. , 35. ,
36. , 30. , 27. , 22. , 40. , 39. , 35. , 24. , 34. ,
26. , 4. , 26. , 27. , 42. , 20. , 21. , 21. , 61. ,
57. , 21. , 26. , 80. , 51. , 32. , 9. , 28. , 32. ,
31. , 41. , 20. , 24. , 2. , 0.75, 48. , 19. , 56. ,
23. , 18. , 21. , 18. , 24. , 32. , 23. , 58. , 50. ,
40. , 47. , 36. , 20. , 32. , 25. , 43. , 40. , 31. ,
70. , 31. , 18. , 24.5 , 18. , 43. , 36. , 27. , 20. ,
14. , 60. , 25. , 14. , 19. , 18. , 15. , 31. , 4. ,
25. , 60. , 52. , 44. , 49. , 42. , 18. , 35. , 18. ,
25. , 26. , 39. , 45. , 42. , 22. , 24. , 48. , 29. ,
52. , 19. , 38. , 27. , 33. , 6. , 17. , 34. , 50. ,
27. , 20. , 30. , 25. , 25. , 29. , 11. , 23. , 23. ,
28.5 , 48. , 35. , 36. , 21. , 24. , 31. , 70. , 16. ,
30. , 19. , 31. , 4. , 6. , 33. , 23. , 48. , 0.67,
28. , 18. , 34. , 33. , 41. , 20. , 36. , 16. , 51. ,
30.5 , 32. , 24. , 48. , 57. , 54. , 18. , 5. , 43. ,
13. , 17. , 29. , 25. , 25. , 18. , 8. , 1. , 46. ,
16. , 25. , 39. , 49. , 31. , 30. , 30. , 34. , 31. ,
11. , 0.42, 27. , 31. , 39. , 18. , 39. , 33. , 26. ,
39. , 35. , 6. , 30.5 , 23. , 31. , 43. , 10. , 52. ,
27. , 38. , 27. , 2. , 1. , 62. , 15. , 0.83, 23. ,
18. , 39. , 21. , 32. , 20. , 16. , 30. , 34.5 , 17. ,
42. , 35. , 28. , 4. , 74. , 9. , 16. , 44. , 18. ,
45. , 51. , 24. , 41. , 21. , 48. , 24. , 42. , 27. ,
31. , 4. , 26. , 47. , 33. , 47. , 28. , 15. , 20. ,
19. , 56. , 25. , 33. , 22. , 28. , 25. , 39. , 27. ,
19. , 26. , 32. ])
X = age_df_notnull.values[:,1:]
Y = age_df_notnull.values[:,0]
RFR = RandomForestRegressor(n_estimators=1000,n_jobs=-1)
RFR.fit(X, Y)
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=1000,
n_jobs=-1, oob_score=False, random_state=None, verbose=0,
PredictAgeData = RFR.predict(age_df_isnull.values[:,1:])
train_data.loc[train_data["Age"].isnull(),"Age"] = PredictAgeData
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 891 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 891 non-null object
Embarked 891 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
Sex | Survived | count |
female | 0 | 81 |
female | 1 | 233 |
male | 0 | 468 |
male | 1 | 109 |

train_data.groupby(['Sex'])['Sex'].count().plot.pie(autopct = '%1.2f%%')

Pclass Sex Survived
1 female 0 3
1 91
male 0 77
1 45
2 female 0 6
1 70
male 0 91
1 17
3 female 0 72
1 72
male 0 300
1 47
Name: Pclass, dtype: int64
sns.violinplot(x="Pclass", y="Age", hue="Survived", data=train_data, split=True)

fig, ax = plt.subplots(1, 2, figsize = (18, 8))
sns.violinplot(x="Pclass", y="Age", hue="Survived", data=train_data, split=True, ax=ax[0])
ax[0].set_title('Pclass and Age vs Survived')
ax[0].set_yticks(range(0, 110, 10))
sns.violinplot(x="Sex", y="Age", hue="Survived", data=train_data, split=True, ax=ax[1])
ax[1].set_title('Sex and Age vs Survived')
ax[1].set_yticks(range(0, 110, 10))

Survived | Sex | Age (count) |
0 | female | 81 |
0 | male | 468 |
1 | female | 233 |
1 | male | 109 |
fig, ax = plt.subplots(1, figsize = (18, 8))
pic = sns.violinplot(x="Survived", y="Age", hue="Sex", data=train_data, split=True)
pic.set_title("Sex and age VS Survived")

train_data.boxplot(column='Age', showfliers=False)

facet = sns.FacetGrid(train_data, hue="Survived",aspect=4)
facet.map(sns.kdeplot, 'Age', fill=True)
facet.set(xlim=(0, train_data['Age'].max()))

train_data["Age_int"] = train_data["Age"].astype(int)
train_data[["Age_int", "Survived"]].groupby(['Age_int'],as_index=True).mean()
Age_int | Survived |
0 | 1.000000 |
1 | 0.714286 |
2 | 0.300000 |
3 | 0.833333 |
4 | 0.700000 |
... | ... |
66 | 0.000000 |
70 | 0.000000 |
71 | 0.000000 |
74 | 0.000000 |
80 | 1.000000 |
71 rows × 1 columns
fig, axis1 = plt.subplots(1,1,figsize=(18,4))
train_data["Age_int"] = train_data["Age"].astype(int)
average_age = train_data[["Age_int", "Survived"]].groupby(['Age_int'],as_index=False).mean()
sns.barplot(x='Age_int', y='Survived', data=average_age)

count 891.000000
mean 29.653886
std 13.738179
min 0.420000
25% 21.000000
50% 28.000000
75% 37.000000
max 80.000000
Name: Age, dtype: float64
bins = [0, 12, 18, 65, 100]
train_data['Age_group'] = pd.cut(train_data['Age'], bins)
by_age = train_data.groupby('Age_group')['Survived'].mean()
(0, 12] 0.506173
(12, 18] 0.466667
(18, 65] 0.364512
(65, 100] 0.125000
Name: Survived, dtype: float64

by_age.plot(kind = 'bar')

(4) Relationship between title and survival: Name
train_data['Title'] = train_data['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
pd.crosstab(train_data['Title'], train_data['Sex'])
Title | female | male |
Capt | 0 | 1 |
Col | 0 | 2 |
Countess | 1 | 0 |
Don | 0 | 1 |
Dr | 1 | 6 |
Jonkheer | 0 | 1 |
Lady | 1 | 0 |
Major | 0 | 2 |
Master | 0 | 40 |
Miss | 182 | 0 |
Mlle | 2 | 0 |
Mme | 1 | 0 |
Mr | 0 | 517 |
Mrs | 125 | 0 |
Ms | 1 | 0 |
Rev | 0 | 6 |
Sir | 0 | 1 |

fig, axis1 = plt.subplots(1,1,figsize=(18,4))
train_data['Name_length'] = train_data['Name'].apply(len)
name_length = train_data[['Name_length','Survived']].groupby(['Name_length'],as_index=False).mean()
sns.barplot(x='Name_length', y='Survived', data=name_length)

(5) Relationship between having siblings aboard and survival: SibSp
sibsp_df = train_data[train_data['SibSp'] != 0]
no_sibsp_df = train_data[train_data['SibSp'] == 0]
sibsp_df['Survived'].value_counts().plot.pie(labels=['No Survived', 'Survived'], autopct = '%1.1f%%')
no_sibsp_df['Survived'].value_counts().plot.pie(labels=['No Survived', 'Survived'], autopct = '%1.1f%%')

(6) Relationship between having parents or children aboard and survival: Parch
parch_df = train_data[train_data['Parch'] != 0]
no_parch_df = train_data[train_data['Parch'] == 0]
parch_df['Survived'].value_counts().plot.pie(labels=['No Survived', 'Survived'], autopct = '%1.1f%%')
no_parch_df['Survived'].value_counts().plot.pie(labels=['No Survived', 'Survived'], autopct = '%1.1f%%')

(7) Relationship between the number of relatives aboard and survival: SibSp & Parch
ax[0].set_title('Parch and Survived')
ax[1].set_title('SibSp and Survived')

train_data['Family_Size'] = train_data['Parch'] + train_data['SibSp'] + 1

(8) Relationship between fare distribution and survival: Fare
train_data['Fare'].hist(bins = 70)
train_data.boxplot(column='Fare', by='Pclass', showfliers=False)

count 891.000000
mean 32.204208
std 49.693429
min 0.000000
25% 7.910400
50% 14.454200
75% 31.000000
max 512.329200
Name: Fare, dtype: float64
fare_not_survived = train_data['Fare'][train_data['Survived'] == 0]
fare_survived = train_data['Fare'][train_data['Survived'] == 1]
average_fare = pd.DataFrame([fare_not_survived.mean(), fare_survived.mean()])
std_fare = pd.DataFrame([fare_not_survived.std(), fare_survived.std()])
average_fare.plot(yerr=std_fare, kind='bar', legend=False)

(9) Relationship between cabin type and survival: Cabin
train_data.loc[train_data.Cabin.isnull(), 'Cabin'] = 'U0'
train_data['Has_Cabin'] = train_data['Cabin'].apply(lambda x: 0 if x == 'U0' else 1)

x = "U0"
0 U0
1 C85
2 U0
3 C123
4 U0
886 U0
887 B42
888 U0
889 C148
890 U0
Name: Cabin, Length: 891, dtype: object
train_data['Cabin'].map(lambda x: re.compile("([a-zA-Z]+)").search(x).group())
0 U
1 C
2 U
3 C
4 U
886 U
887 B
888 U
889 C
890 U
Name: Cabin, Length: 891, dtype: object
train_data['CabinLetter'] = train_data['Cabin'].map(lambda x: re.compile("([a-zA-Z]+)").search(x).group())
0 U
1 C
2 U
3 C
4 U
886 U
887 B
888 U
889 C
890 U
Name: CabinLetter, Length: 891, dtype: object
array([0, 1, 0, 1, 0, 0, 2, 0, 0, 0, 3, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4,
0, 5, 0, 0, 0, 1, 0, 0, 0, 6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 6, 1, 0, 0, 0, 0, 0, 6, 1, 0, 0, 0,
7, 0, 0, 0, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1, 0, 0, 0, 2, 0, 0, 0, 5, 4, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0,
1, 0, 0, 0, 0, 0, 0, 0, 6, 0, 0, 0, 0, 2, 4, 0, 0, 0, 7, 0, 0, 0,
0, 0, 0, 0, 4, 1, 0, 6, 0, 0, 0, 0, 0, 0, 0, 0, 7, 0, 0, 1, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 6, 0, 0, 0, 5, 0,
0, 1, 0, 0, 0, 0, 0, 7, 0, 5, 0, 0, 0, 0, 0, 0, 0, 7, 6, 6, 0, 0,
0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 5, 0, 0, 0, 0, 0, 4, 0, 0, 4, 0,
0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 1, 0, 0, 4, 0, 0, 3, 1, 0, 0, 0, 0, 6, 0, 0, 0, 0, 2, 6,
0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0,
0, 0, 0, 0, 0, 6, 4, 0, 0, 0, 0, 1, 1, 6, 0, 0, 0, 2, 0, 1, 0, 1,
0, 2, 1, 6, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 1, 0, 4, 0, 6,
0, 1, 1, 0, 0, 0, 1, 2, 0, 8, 7, 1, 0, 0, 0, 7, 0, 0, 0, 0, 0, 1,
0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 6, 2, 0, 0, 0,
0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6, 0, 0, 4, 3, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 1, 0, 0, 0, 2, 6, 0, 0, 1, 0,
0, 0, 0, 0, 0, 5, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 2, 4, 0, 0, 2, 0,
2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0,
6, 0, 1, 6, 0, 0, 0, 0, 1, 0, 0, 0, 4, 0, 1, 0, 0, 0, 0, 0, 6, 1,
0, 0, 0, 0, 0, 0, 2, 0, 0, 4, 7, 0, 0, 0, 6, 0, 0, 6, 0, 0, 0, 1,
0, 0, 0, 0, 0, 0, 0, 0, 6, 0, 0, 6, 6, 0, 0, 0, 1, 0, 0, 0, 0, 0,
1, 0, 0, 0, 0, 0, 5, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
2, 0, 0, 0, 0, 2, 0, 0, 0, 1, 0, 5, 0, 2, 0, 6, 0, 0, 0, 4, 0, 0,
0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
0, 0, 7, 0, 0, 4, 0, 0, 0, 4, 0, 4, 0, 0, 5, 0, 6, 0, 0, 0, 0, 0,
0, 0, 0, 6, 0, 0, 0, 4, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4,
0, 0, 2, 0, 0, 0, 0, 0, 0, 1, 0, 6, 0, 0, 0, 0, 0, 0, 0, 6, 0, 4,
0, 0, 0, 0, 0, 0, 0, 6, 6, 0, 0, 0, 0, 0, 0, 0, 1, 7, 1, 2, 0, 0,
0, 0, 0, 2, 0, 0, 1, 1, 1, 0, 0, 7, 1, 2, 0, 0, 0, 0, 0, 0, 2, 0,
0, 0, 0, 0, 6, 0, 0, 0, 0, 0, 0, 6, 0, 0, 4, 1, 6, 0, 0, 6, 0, 0,
4, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 6, 0, 0, 0, 6, 0, 4, 0, 0, 0, 0,
0, 0, 2, 0, 0, 0, 7, 0, 0, 6, 0, 6, 4, 0, 0, 0, 0, 0, 0, 6, 0, 0,
0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 6, 0, 0, 0, 5, 0, 0, 2, 0, 0, 0, 0,
0, 6, 0, 0, 0, 0, 6, 0, 0, 2, 0, 0, 0, 0, 0, 6, 0, 0, 0, 0, 0, 2,
0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 4, 0, 0, 0, 2,
0, 0, 0, 0, 4, 0, 0, 0, 0, 5, 0, 0, 0, 4, 6, 0, 0, 0, 0, 0, 0, 1,
0, 0, 0, 0, 0, 0, 0, 6, 0, 1, 0], dtype=int64)
train_data['CabinLetter'] = train_data['Cabin'].map(lambda x: re.compile("([a-zA-Z]+)").search(x).group())
train_data['CabinLetter'] = pd.factorize(train_data['CabinLetter'])[0]

(10) Relationship between port of embarkation and survival: Embarked
sns.countplot(x='Embarked', hue='Survived', data=train_data)
plt.title('Embarked and Survived')

sns.catplot(x='Embarked', y='Survived', data=train_data, kind='point', height=3, aspect=2)
plt.title('Embarked and Survived rate')

According to historical records, the Titanic carried about 2,224 passengers and crew. The training data covers only 891 passengers. If these were sampled at random from the 2,224 people on board, the sample is large enough that, by the central limit theorem, our analysis should be reasonably representative; if the sample was not drawn at random, the conclusions may be less reliable.
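(11) Other features that may be related to survival
Some features in the dataset have not been analyzed in detail, namely Ticket (ticket number) and Cabin (cabin number). Differences in these may affect where a passenger was located on the ship and therefore the order of escape. However, Cabin has many missing values and Ticket has a large number of distinct values, which makes patterns hard to find, so in the later model-ensembling stage we leave it to the models to decide how important these features are.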
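4. Variable transformation
1. Dummy Variables
Dummy variables encode categorical or binary variables. They are a good fit when a qualitative variable takes a small number of distinct values that each occur frequently. Take Embarked as an example: it contains only the three values 'S', 'C' and 'Q', and can be converted to dummies with the code below: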
0 S
1 C
2 S
3 S
4 S
886 S
887 S
888 S
889 C
890 Q
Name: Embarked, Length: 891, dtype: object
| C | Q | S |
0 | 0 | 0 | 1 |
1 | 1 | 0 | 0 |
2 | 0 | 0 | 1 |
3 | 0 | 0 | 1 |
4 | 0 | 0 | 1 |
... | ... | ... | ... |
886 | 0 | 0 | 1 |
887 | 0 | 0 | 1 |
888 | 0 | 0 | 1 |
889 | 1 | 0 | 0 |
890 | 0 | 1 | 0 |
891 rows × 3 columns
embark_dummies = pd.get_dummies(train_data['Embarked'])
train_data = train_data.join(embark_dummies)
train_data.drop(['Embarked'], axis=1,inplace=True)
embark_dummies = train_data[['S', 'C', 'Q']]
2. Factorizing
0 U0
1 C85
2 U0
3 C123
4 U0
886 U0
887 B42
888 U0
889 C148
890 U0
Name: Cabin, Length: 891, dtype: object
train_data['Cabin'].map( lambda x : re.compile("([a-zA-Z]+)").search(x).group())
0 U
1 C
2 U
3 C
4 U
886 U
887 B
888 U
889 C
890 U
Name: Cabin, Length: 891, dtype: object
train_data.loc[train_data['Cabin'].isnull(), 'Cabin'] = 'U0'
train_data['CabinLetter'] = train_data['Cabin'].map( lambda x : re.compile("([a-zA-Z]+)").search(x).group())
train_data['CabinLetter'] = pd.factorize(train_data['CabinLetter'])[0]
0 U0
1 C85
2 U0
3 C123
4 U0
5 U0
6 E46
7 U0
8 U0
9 U0
Name: Cabin, dtype: object
array([0, 1, 0, 1, 0, 0, 2, 0, 0, 0], dtype=int64)
0 0
1 1
2 0
3 1
4 0
5 0
6 2
7 0
8 0
9 0
Name: CabinLetter, dtype: int64
1. Scaling
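Scaling maps values spanning a large range into a small one (typically -1 to 1, or 0 to 1). We often need to scale numeric features to comparable ranges; otherwise features with large ranges effectively receive more weight. For example, Age may only range over 0-100 while income may range over 0-10,000,000, which can distort the results of models that are sensitive to the magnitude of their inputs.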
from sklearn import preprocessing
assert np.size(train_data['Age']) == 891
scaler = preprocessing.StandardScaler()
train_data['Age_scaled'] = scaler.fit_transform(train_data['Age'].values.reshape(-1, 1))
0 -0.557438
1 0.607854
2 -0.266115
3 0.389361
4 0.389361
886 -0.193284
887 -0.775930
888 -0.320543
889 -0.266115
890 0.170869
Name: Age_scaled, Length: 891, dtype: float64
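Note that StandardScaler standardizes to zero mean and unit variance rather than mapping into a fixed interval. As a small side-by-side sketch, assuming a handful of made-up age values (not taken from the dataset), MinMaxScaler can be used when a fixed 0 to 1 range is wanted:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up ages for illustration only; shape (n_samples, 1) as sklearn expects.
ages = np.array([[2.0], [22.0], [38.0], [54.0], [80.0]])

# Min-max scaling: (x - min) / (max - min), so values land in [0, 1].
age_minmax = MinMaxScaler(feature_range=(0, 1)).fit_transform(ages)

# Standardization: (x - mean) / std, so values are centered on 0 and unbounded.
age_standard = StandardScaler().fit_transform(ages)

print(age_minmax.ravel())
print(age_standard.ravel())
```

Which of the two is preferable depends on the model; standardization is the one used in this notebook.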
2. Binning
0 7.2500
1 71.2833
2 7.9250
3 53.1000
4 8.0500
886 13.0000
887 30.0000
888 23.4500
889 30.0000
890 7.7500
Name: Fare, Length: 891, dtype: float64
train_data['Fare_bin'] = pd.qcut(train_data['Fare'], 5)
0 (-0.001, 7.854]
1 (39.688, 512.329]
2 (7.854, 10.5]
3 (39.688, 512.329]
4 (7.854, 10.5]
Name: Fare_bin, dtype: category
Categories (5, interval[float64]): [(-0.001, 7.854] < (7.854, 10.5] < (10.5, 21.679] < (21.679, 39.688] < (39.688, 512.329]]
train_data['Fare_bin_id'] = pd.factorize(train_data['Fare_bin'])[0]
fare_bin_dummies_df = pd.get_dummies(train_data['Fare_bin']).rename(columns=lambda x: 'Fare_' + str(x))
train_data = pd.concat([train_data, fare_bin_dummies_df], axis=1)
0 0
1 1
2 2
3 1
4 2
886 3
887 4
888 4
889 4
890 0
Name: Fare_bin_id, Length: 891, dtype: int64
Fare_bin | Fare_(-0.001, 7.854] | Fare_(7.854, 10.5] | Fare_(10.5, 21.679] | Fare_(21.679, 39.688] | Fare_(39.688, 512.329] |
0 | 1 | 0 | 0 | 0 | 0 |
1 | 0 | 0 | 0 | 0 | 1 |
2 | 0 | 1 | 0 | 0 | 0 |
3 | 0 | 0 | 0 | 0 | 1 |
4 | 0 | 1 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... |
886 | 0 | 0 | 1 | 0 | 0 |
887 | 0 | 0 | 0 | 1 | 0 |
888 | 0 | 0 | 0 | 1 | 0 |
889 | 0 | 0 | 0 | 1 | 0 |
890 | 1 | 0 | 0 | 0 | 0 |
891 rows × 5 columns
5. Feature engineering
train_df_org = pd.read_csv('train.csv')
test_df_org = pd.read_csv('test.csv')
test_df_org['Survived'] = 0
combined_train_test = pd.concat([train_df_org, test_df_org])
PassengerId = test_df_org['PassengerId']
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 12 columns):
PassengerId 418 non-null int64
Pclass 418 non-null int64
Name 418 non-null object
Sex 418 non-null object
Age 332 non-null float64
SibSp 418 non-null int64
Parch 418 non-null int64
Ticket 418 non-null object
Fare 417 non-null float64
Cabin 91 non-null object
Embarked 418 non-null object
Survived 418 non-null int64
dtypes: float64(2), int64(5), object(5)
memory usage: 39.3+ KB
0 892
1 893
2 894
3 895
4 896
413 1305
414 1306
415 1307
416 1308
417 1309
Name: PassengerId, Length: 418, dtype: int64
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1309 entries, 0 to 417
Data columns (total 12 columns):
Age 1046 non-null float64
Cabin 295 non-null object
Embarked 1307 non-null object
Fare 1308 non-null float64
Name 1309 non-null object
Parch 1309 non-null int64
PassengerId 1309 non-null int64
Pclass 1309 non-null int64
Sex 1309 non-null object
SibSp 1309 non-null int64
Survived 1309 non-null int64
Ticket 1309 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 132.9+ KB
(1) Embarked
combined_train_test['Embarked'] = combined_train_test['Embarked'].fillna(combined_train_test['Embarked'].mode().iloc[0])
combined_train_test['Embarked'] = pd.factorize(combined_train_test['Embarked'])[0]
0 0
1 1
2 0
3 0
4 0
413 0
414 1
415 0
416 0
417 1
Name: Embarked, Length: 1309, dtype: int64
emb_dummies_df = pd.get_dummies(combined_train_test['Embarked'], prefix=combined_train_test[['Embarked']].columns[0])
combined_train_test = pd.concat([combined_train_test, emb_dummies_df], axis=1)
| Embarked_0 | Embarked_1 | Embarked_2 |
0 | 1 | 0 | 0 |
1 | 0 | 1 | 0 |
2 | 1 | 0 | 0 |
3 | 1 | 0 | 0 |
4 | 1 | 0 | 0 |
... | ... | ... | ... |
413 | 1 | 0 | 0 |
414 | 0 | 1 | 0 |
415 | 1 | 0 | 0 |
416 | 1 | 0 | 0 |
417 | 0 | 1 | 0 |
1309 rows × 3 columns
(2) Sex
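Definition of one-hot encoding: one-hot encoding, also called one-of-N encoding, uses an N-bit state register to encode N states; each state has its own register bit, and at any time exactly one bit is active. For details see: https://blog.csdn.net/qq_15192373/article/details/89552498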
combined_train_test['Sex'] = pd.factorize(combined_train_test['Sex'])[0]
sex_dummies_df = pd.get_dummies(combined_train_test['Sex'], prefix=combined_train_test[['Sex']].columns[0])
combined_train_test = pd.concat([combined_train_test, sex_dummies_df], axis=1)
(3) Name
combined_train_test['Title'] = combined_train_test['Name'].map(lambda x: re.compile(r", (.*?)\.").findall(x)[0])
combined_train_test['Name'].map(lambda x: re.compile(r", (.*?)\.").findall(x)[0])
0 Mr
1 Mrs
2 Miss
3 Mrs
4 Mr
413 Mr
414 Dona
415 Mr
416 Mr
417 Master
Name: Name, Length: 1309, dtype: object
title_Dict = {}
title_Dict.update(dict.fromkeys(['Capt', 'Col', 'Major', 'Dr', 'Rev'], 'Officer'))
title_Dict.update(dict.fromkeys(['Don', 'Sir', 'the Countess', 'Dona', 'Lady'], 'Royalty'))
title_Dict.update(dict.fromkeys(['Mme', 'Ms', 'Mrs'], 'Mrs'))
title_Dict.update(dict.fromkeys(['Mlle', 'Miss'], 'Miss'))
title_Dict.update(dict.fromkeys(['Mr'], 'Mr'))
title_Dict.update(dict.fromkeys(['Master','Jonkheer'], 'Master'))
combined_train_test['Title'] = combined_train_test['Title'].map(title_Dict)
{'Capt': 'Officer',
'Col': 'Officer',
'Major': 'Officer',
'Dr': 'Officer',
'Rev': 'Officer',
'Don': 'Royalty',
'Sir': 'Royalty',
'the Countess': 'Royalty',
'Dona': 'Royalty',
'Lady': 'Royalty',
'Mme': 'Mrs',
'Ms': 'Mrs',
'Mrs': 'Mrs',
'Mlle': 'Miss',
'Miss': 'Miss',
'Mr': 'Mr',
'Master': 'Master',
'Jonkheer': 'Master'}
(array([0, 1, 2, ..., 0, 0, 3], dtype=int64),
Index(['Mr', 'Mrs', 'Miss', 'Master', 'Royalty', 'Officer'], dtype='object'))
combined_train_test['Title'] = pd.factorize(combined_train_test['Title'])[0]
title_dummies_df = pd.get_dummies(combined_train_test['Title'], prefix=combined_train_test[['Title']].columns[0])
combined_train_test = pd.concat([combined_train_test, title_dummies_df], axis=1)
combined_train_test['Name_length'] = combined_train_test['Name'].apply(len)
(1309, 25)
(4) Fare
| Fare |
0 | 7.2500 |
1 | 71.2833 |
2 | 7.9250 |
3 | 53.1000 |
4 | 8.0500 |
... | ... |
413 | 8.0500 |
414 | 108.9000 |
415 | 7.2500 |
416 | 8.0500 |
417 | 22.3583 |
1309 rows × 1 columns
| Age | Cabin | Embarked | Fare | Name | Parch | PassengerId | Sex | SibSp | Survived | ... | Sex_0 | Sex_1 | Title | Title_0 | Title_1 | Title_2 | Title_3 | Title_4 | Title_5 | Name_length |
0 | 709 | 709 | 709 | 709 | 709 | 709 | 709 | 709 | 709 | 709 | ... | 709 | 709 | 709 | 709 | 709 | 709 | 709 | 709 | 709 | 709 |
1 | 323 | 323 | 323 | 323 | 323 | 323 | 323 | 323 | 323 | 323 | ... | 323 | 323 | 323 | 323 | 323 | 323 | 323 | 323 | 323 | 323 |
2 | 709 | 709 | 709 | 709 | 709 | 709 | 709 | 709 | 709 | 709 | ... | 709 | 709 | 709 | 709 | 709 | 709 | 709 | 709 | 709 | 709 |
3 | 323 | 323 | 323 | 323 | 323 | 323 | 323 | 323 | 323 | 323 | ... | 323 | 323 | 323 | 323 | 323 | 323 | 323 | 323 | 323 | 323 |
4 | 709 | 709 | 709 | 709 | 709 | 709 | 709 | 709 | 709 | 709 | ... | 709 | 709 | 709 | 709 | 709 | 709 | 709 | 709 | 709 | 709 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
413 | 709 | 709 | 709 | 709 | 709 | 709 | 709 | 709 | 709 | 709 | ... | 709 | 709 | 709 | 709 | 709 | 709 | 709 | 709 | 709 | 709 |
414 | 323 | 323 | 323 | 323 | 323 | 323 | 323 | 323 | 323 | 323 | ... | 323 | 323 | 323 | 323 | 323 | 323 | 323 | 323 | 323 | 323 |
415 | 709 | 709 | 709 | 709 | 709 | 709 | 709 | 709 | 709 | 709 | ... | 709 | 709 | 709 | 709 | 709 | 709 | 709 | 709 | 709 | 709 |
416 | 709 | 709 | 709 | 709 | 709 | 709 | 709 | 709 | 709 | 709 | ... | 709 | 709 | 709 | 709 | 709 | 709 | 709 | 709 | 709 | 709 |
417 | 709 | 709 | 709 | 709 | 709 | 709 | 709 | 709 | 709 | 709 | ... | 709 | 709 | 709 | 709 | 709 | 709 | 709 | 709 | 709 | 709 |
1309 rows × 24 columns
combined_train_test['Fare'] = combined_train_test['Fare'].fillna(combined_train_test.groupby('Pclass')['Fare'].transform('mean'))
combined_train_test['Group_Ticket'] = combined_train_test['Fare'].groupby(by=combined_train_test['Ticket']).transform('count')
combined_train_test['Fare'] = combined_train_test['Fare'] / combined_train_test['Group_Ticket']
0 1
1 2
2 1
3 2
4 1
413 1
414 3
415 1
416 1
417 3
Name: Group_Ticket, Length: 1309, dtype: int64
combined_train_test.drop(['Group_Ticket'], axis=1, inplace=True)
combined_train_test['Fare_bin'] = pd.qcut(combined_train_test['Fare'], 5)
combined_train_test['Fare_bin_id'] = pd.factorize(combined_train_test['Fare_bin'])[0]
fare_bin_dummies_df = pd.get_dummies(combined_train_test['Fare_bin_id']).rename(columns=lambda x: 'Fare_' + str(x))
combined_train_test = pd.concat([combined_train_test, fare_bin_dummies_df], axis=1)
combined_train_test.drop(['Fare_bin'], axis=1, inplace=True)
| Age | Cabin | Embarked | Fare | Name | Parch | PassengerId | Pclass | Sex | SibSp | ... | Title_3 | Title_4 | Title_5 | Name_length | Fare_bin_id | Fare_0 | Fare_1 | Fare_2 | Fare_3 | Fare_4 |
0 | 22.0 | NaN | 0 | 7.250000 | Braund, Mr. Owen Harris | 0 | 1 | 3 | 0 | 1 | ... | 0 | 0 | 0 | 23 | 0 | 1 | 0 | 0 | 0 | 0 |
1 | 38.0 | C85 | 1 | 35.641650 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 0 | 2 | 1 | 1 | 1 | ... | 0 | 0 | 0 | 51 | 1 | 0 | 1 | 0 | 0 | 0 |
2 | 26.0 | NaN | 0 | 7.925000 | Heikkinen, Miss. Laina | 0 | 3 | 3 | 1 | 0 | ... | 0 | 0 | 0 | 22 | 2 | 0 | 0 | 1 | 0 | 0 |
3 | 35.0 | C123 | 0 | 26.550000 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0 | 4 | 1 | 1 | 1 | ... | 0 | 0 | 0 | 44 | 1 | 0 | 1 | 0 | 0 | 0 |
4 | 35.0 | NaN | 0 | 8.050000 | Allen, Mr. William Henry | 0 | 5 | 3 | 0 | 0 | ... | 0 | 0 | 0 | 24 | 2 | 0 | 0 | 1 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
413 | NaN | NaN | 0 | 8.050000 | Spector, Mr. Woolf | 0 | 1305 | 3 | 0 | 0 | ... | 0 | 0 | 0 | 18 | 2 | 0 | 0 | 1 | 0 | 0 |
414 | 39.0 | C105 | 1 | 36.300000 | Oliva y Ocana, Dona. Fermina | 0 | 1306 | 1 | 1 | 0 | ... | 0 | 1 | 0 | 28 | 1 | 0 | 1 | 0 | 0 | 0 |
415 | 38.5 | NaN | 0 | 7.250000 | Saether, Mr. Simon Sivertsen | 0 | 1307 | 3 | 0 | 0 | ... | 0 | 0 | 0 | 28 | 0 | 1 | 0 | 0 | 0 | 0 |
416 | NaN | NaN | 0 | 8.050000 | Ware, Mr. Frederick | 0 | 1308 | 3 | 0 | 0 | ... | 0 | 0 | 0 | 19 | 2 | 0 | 0 | 1 | 0 | 0 |
417 | NaN | NaN | 1 | 7.452767 | Peter, Master. Michael J | 1 | 1309 | 3 | 0 | 1 | ... | 1 | 0 | 0 | 24 | 0 | 1 | 0 | 0 | 0 | 0 |
1309 rows × 31 columns
(5) Pclass
combined_train_test['Pclass'].value_counts().plot.pie(autopct = '%1.2f%%')

from sklearn.preprocessing import LabelEncoder
def pclass_fare_category(df, pclass1_mean_fare, pclass2_mean_fare, pclass3_mean_fare):
    # Label a passenger as Low/High fare relative to the mean fare of their own class
    if df['Pclass'] == 1:
        if df['Fare'] <= pclass1_mean_fare:
            return 'Pclass1_Low'
        return 'Pclass1_High'
    elif df['Pclass'] == 2:
        if df['Fare'] <= pclass2_mean_fare:
            return 'Pclass2_Low'
        return 'Pclass2_High'
    elif df['Pclass'] == 3:
        if df['Fare'] <= pclass3_mean_fare:
            return 'Pclass3_Low'
        return 'Pclass3_High'

# Mean fare within each passenger class
Pclass1_mean_fare = combined_train_test['Fare'].groupby(by=combined_train_test['Pclass']).mean().get([1]).values[0]
Pclass2_mean_fare = combined_train_test['Fare'].groupby(by=combined_train_test['Pclass']).mean().get([2]).values[0]
Pclass3_mean_fare = combined_train_test['Fare'].groupby(by=combined_train_test['Pclass']).mean().get([3]).values[0]

# Build the combined Pclass/Fare category row by row
combined_train_test['Pclass_Fare_Category'] = combined_train_test.apply(pclass_fare_category, args=(Pclass1_mean_fare, Pclass2_mean_fare, Pclass3_mean_fare), axis=1)

# Fit the encoder on the six labels (LabelEncoder assigns integer codes in sorted label order)
pclass_level = LabelEncoder()
pclass_level.fit(np.array(['Pclass1_Low', 'Pclass1_High', 'Pclass2_Low', 'Pclass2_High', 'Pclass3_Low', 'Pclass3_High']))

# Convert the labels to integer codes, then one-hot encode the codes
combined_train_test['Pclass_Fare_Category'] = pclass_level.transform(combined_train_test['Pclass_Fare_Category'])
pclass_dummies_df = pd.get_dummies(combined_train_test['Pclass_Fare_Category']).rename(columns=lambda x: 'Pclass_' + str(x))
combined_train_test = pd.concat([combined_train_test, pclass_dummies_df], axis=1)
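Integer (label) encoding: a) use sklearn's LabelEncoder() to convert the categories into numeric codes; b) use pd.factorize().
One-hot encoding: produces an (n_examples × n_classes) matrix of 0s and 1s in which each sample corresponds to exactly one label. a) use get_dummies from pandas;
b) use sklearn's OneHotEncoder() or LabelBinarizer(). LabelEncoder() on its own only yields integer codes, so it is usually paired with OneHotEncoder().
We also factorize the Pclass feature: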
array([3, 1, 3, ..., 3, 3, 3], dtype=int64)
combined_train_test['Pclass'] = pd.factorize(combined_train_test['Pclass'])[0]
array([0, 1, 0, ..., 0, 0, 0], dtype=int64)
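The difference between the encodings described above can be seen on a tiny example. This is a minimal sketch on a made-up five-row column (the values are illustrative, not taken from the Titanic data); it only uses `pd.factorize`, `LabelEncoder` and `pd.get_dummies`, which the text already mentions.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy categorical column (illustrative values only).
pclass = pd.Series([3, 1, 3, 2, 1], name="Pclass")

# pd.factorize numbers the categories in order of first appearance: 3 -> 0, 1 -> 1, 2 -> 2.
factorized_codes, uniques = pd.factorize(pclass)

# LabelEncoder numbers the categories in sorted order: 1 -> 0, 2 -> 1, 3 -> 2.
label_codes = LabelEncoder().fit_transform(pclass)

# One-hot encoding: one indicator column per category, exactly one 1 in every row.
one_hot = pd.get_dummies(pclass, prefix="Pclass").astype(int)

print(list(factorized_codes))
print(list(label_codes))
print(one_hot)
```

Note that the two integer encodings generally assign different codes to the same category, which is why the factorized Pclass values above start with 0 for third class.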
(6) Parch and SibSp
def family_size_category(family_size):
    # Bucket the family size: 1 -> Single, 2 to 4 -> Small_Family, 5 or more -> Large_Family
    if family_size <= 1:
        return 'Single'
    elif family_size <= 4:
        return 'Small_Family'
    else:
        return 'Large_Family'

# Family size = parents/children + siblings/spouses + the passenger
combined_train_test['Family_Size'] = combined_train_test['Parch'] + combined_train_test['SibSp'] + 1
combined_train_test['Family_Size_Category'] = combined_train_test['Family_Size'].map(family_size_category)

# Integer-encode the three categories
le_family = LabelEncoder()
le_family.fit(np.array(['Single', 'Small_Family', 'Large_Family']))
combined_train_test['Family_Size_Category'] = le_family.transform(combined_train_test['Family_Size_Category'])

# One-hot encode; the prefix matches the Family_Size_Category_0/1/2 columns that appear in the later tables
family_size_dummies_df = pd.get_dummies(combined_train_test['Family_Size_Category'],
                                        prefix='Family_Size_Category')
combined_train_test = pd.concat([combined_train_test, family_size_dummies_df], axis=1)
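As an aside, the same three buckets can be produced without a Python-level function. The sketch below is an alternative formulation, not the code used in this chapter; it assumes the same thresholds (1 and 4) and uses `pd.cut` with right-closed bins on a made-up Series.

```python
import pandas as pd

# Illustrative family sizes (hypothetical values).
family_size = pd.Series([1, 2, 4, 5, 11])

# Right-closed bins (0, 1], (1, 4], (4, inf) reproduce the <= 1 and <= 4 thresholds.
category = pd.cut(
    family_size,
    bins=[0, 1, 4, float("inf")],
    labels=["Single", "Small_Family", "Large_Family"],
)

print(category.tolist())
```

The result is an ordered categorical, so it can be passed to `pd.get_dummies` directly if one-hot columns are wanted.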
(7) Age
missing_age_df = pd.DataFrame(combined_train_test[
['Age', 'Embarked', 'Sex', 'Title', 'Name_length', 'Family_Size', 'Family_Size_Category','Fare', 'Fare_bin_id', 'Pclass']])
missing_age_train = missing_age_df[missing_age_df['Age'].notnull()]
missing_age_test = missing_age_df[missing_age_df['Age'].isnull()]
| Age | Embarked | Sex | Title | Name_length | Family_Size | Family_Size_Category | Fare | Fare_bin_id | Pclass |
5 | NaN | 2 | 0 | 0 | 16 | 1 | 1 | 8.4583 | 2 | 0 |
17 | NaN | 0 | 0 | 0 | 28 | 1 | 1 | 13.0000 | 3 | 2 |
19 | NaN | 1 | 1 | 1 | 23 | 1 | 1 | 7.2250 | 4 | 0 |
26 | NaN | 1 | 0 | 0 | 23 | 1 | 1 | 7.2250 | 4 | 0 |
28 | NaN | 2 | 1 | 2 | 29 | 1 | 1 | 7.8792 | 0 | 0 |
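To fill in the missing Age values we build a prediction model. We can predict with several models and then fuse their outputs to improve accuracy.
1) Bagging + decision trees = random forest
2) AdaBoost + decision trees = boosted trees
3) Gradient Boosting + decision trees = GBDT
Putting it together: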
from sklearn import ensemble
from sklearn import model_selection
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import RandomForestRegressor

def fill_missing_age(missing_age_train, missing_age_test):
    # Work on a copy so that adding columns to a slice does not raise SettingWithCopyWarning
    missing_age_test = missing_age_test.copy()
    missing_age_X_train = missing_age_train.drop(['Age'], axis=1)
    missing_age_Y_train = missing_age_train['Age']
    missing_age_X_test = missing_age_test.drop(['Age'], axis=1)

    # Model 1: gradient boosting (the grid has a single point, so GridSearchCV only runs 10-fold CV)
    gbm_reg = GradientBoostingRegressor(random_state=42)
    gbm_reg_param_grid = {'n_estimators': [2000], 'max_depth': [4], 'learning_rate': [0.01], 'max_features': [3]}
    gbm_reg_grid = model_selection.GridSearchCV(gbm_reg, gbm_reg_param_grid, cv=10, n_jobs=25, verbose=1, scoring='neg_mean_squared_error')
    gbm_reg_grid.fit(missing_age_X_train, missing_age_Y_train)
    print('Age feature Best GB Params:' + str(gbm_reg_grid.best_params_))
    print('Age feature Best GB Score:' + str(gbm_reg_grid.best_score_))
    print('GB Train Error for "Age" Feature Regressor:' + str(gbm_reg_grid.score(missing_age_X_train, missing_age_Y_train)))
    missing_age_test.loc[:, 'Age_GB'] = gbm_reg_grid.predict(missing_age_X_test)

    # Model 2: random forest
    rf_reg = RandomForestRegressor()
    rf_reg_param_grid = {'n_estimators': [200], 'max_depth': [5], 'random_state': [0]}
    rf_reg_grid = model_selection.GridSearchCV(rf_reg, rf_reg_param_grid, cv=10, n_jobs=25, verbose=1, scoring='neg_mean_squared_error')
    rf_reg_grid.fit(missing_age_X_train, missing_age_Y_train)
    print('Age feature Best RF Params:' + str(rf_reg_grid.best_params_))
    print('Age feature Best RF Score:' + str(rf_reg_grid.best_score_))
    print('RF Train Error for "Age" Feature Regressor' + str(rf_reg_grid.score(missing_age_X_train, missing_age_Y_train)))
    missing_age_test.loc[:, 'Age_RF'] = rf_reg_grid.predict(missing_age_X_test)

    print('shape1', missing_age_test['Age'].shape, missing_age_test[['Age_GB', 'Age_RF']].mode(axis=1).shape)
    # Note: np.mean over a list of two Series returns one scalar (the mean of all predictions),
    # so every missing Age receives the same value, as the output below shows.
    # missing_age_test[['Age_GB', 'Age_RF']].mean(axis=1) would average the two models row by row.
    missing_age_test.loc[:, 'Age'] = np.mean([missing_age_test['Age_GB'], missing_age_test['Age_RF']])
    missing_age_test.drop(['Age_GB', 'Age_RF'], axis=1, inplace=True)
    return missing_age_test

# Select the 'Age' column so that a Series, not a whole DataFrame, is assigned to the missing entries
combined_train_test.loc[(combined_train_test.Age.isnull()), 'Age'] = fill_missing_age(missing_age_train, missing_age_test)['Age']
Fitting 10 folds for each of 1 candidates, totalling 10 fits
[Parallel(n_jobs=25)]: Using backend LokyBackend with 25 concurrent workers.
[Parallel(n_jobs=25)]: Done 5 out of 10 | elapsed: 14.3s remaining: 14.3s
[Parallel(n_jobs=25)]: Done 10 out of 10 | elapsed: 17.1s finished
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\model_selection\_search.py:814: DeprecationWarning: The default of the `iid` parameter will change from True to False in version 0.22 and will be removed in 0.24. This will change numeric results when test-set sizes are unequal.
Age feature Best GB Params:{'learning_rate': 0.01, 'max_depth': 4, 'max_features': 3, 'n_estimators': 2000}
Age feature Best GB Score:-130.2956775989383
GB Train Error for "Age" Feature Regressor:-64.65669617233556
5 35.773942
17 31.489153
19 34.113840
26 28.621281
Name: Age_GB, dtype: float64
Fitting 10 folds for each of 1 candidates, totalling 10 fits
[Parallel(n_jobs=25)]: Using backend LokyBackend with 25 concurrent workers.
[Parallel(n_jobs=25)]: Done 5 out of 10 | elapsed: 7.8s remaining: 7.8s
[Parallel(n_jobs=25)]: Done 10 out of 10 | elapsed: 10.1s finished
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\model_selection\_search.py:814: DeprecationWarning: The default of the `iid` parameter will change from True to False in version 0.22 and will be removed in 0.24. This will change numeric results when test-set sizes are unequal.
Age feature Best RF Params:{'max_depth': 5, 'n_estimators': 200, 'random_state': 0}
Age feature Best RF Score:-119.09495605170706
RF Train Error for "Age" Feature Regressor-96.06031484477619
5 33.459421
17 33.076798
19 34.855942
26 28.146718
Name: Age_RF, dtype: float64
shape1 (263,) (263, 2)
5 30.000675
17 30.000675
19 30.000675
26 30.000675
Name: Age, dtype: float64
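The identical value 30.000675 in every row above follows from how `np.mean` treats a list of two Series: without an `axis` argument it averages all entries into one scalar. The sketch below contrasts that with a row-wise average on a made-up frame (the numbers and the three-row index are illustrative assumptions, not the model outputs shown above).

```python
import numpy as np
import pandas as pd

# Hypothetical predictions from two regressors for three passengers.
preds = pd.DataFrame(
    {"Age_GB": [35.0, 31.0, 28.0], "Age_RF": [33.0, 33.0, 28.0]},
    index=[5, 17, 26],
)

# No axis given: the (2, 3) array is flattened and a single scalar comes back.
grand_mean = np.mean([preds["Age_GB"], preds["Age_RF"]])

# Row-wise average: one blended prediction per passenger.
row_mean = preds[["Age_GB", "Age_RF"]].mean(axis=1)

print(grand_mean)
print(row_mean)
```

If a per-passenger blend of the two models is the goal, the row-wise form is the one that achieves it; the scalar form is equivalent to imputing a constant.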
(8) Ticket
combined_train_test['Ticket_Letter'] = combined_train_test['Ticket'].str.split().str[0]
combined_train_test['Ticket_Letter'].apply(lambda x: 'U0' if x.isnumeric() else x)
0 A/5
1 PC
2 STON/O2.
3 U0
4 U0
413 A.5.
414 PC
415 SOTON/O.Q.
416 U0
417 U0
Name: Ticket_Letter, Length: 1309, dtype: object
combined_train_test['Ticket_Letter'] = combined_train_test['Ticket'].str.split().str[0]
combined_train_test['Ticket_Letter'] = combined_train_test['Ticket_Letter'].apply(lambda x: 'U0' if x.isnumeric() else x)
combined_train_test['Ticket_Letter'] = pd.factorize(combined_train_test['Ticket_Letter'])[0]
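The three Ticket steps above can be checked on a handful of ticket strings. This is a small sketch on a hand-made Series; the sample tickets are copied from rows shown in this chapter, and `Series.where` is used in place of `apply` purely as a stylistic choice.

```python
import pandas as pd

tickets = pd.Series(["A/5 21171", "PC 17599", "113803", "STON/O2. 3101282", "373450"])

# First whitespace-separated token of each ticket.
prefix = tickets.str.split().str[0]

# Purely numeric tickets have no letter prefix, so mark them with the placeholder 'U0'.
prefix = prefix.where(~prefix.str.isnumeric(), "U0")

# Integer codes in order of first appearance.
codes, uniques = pd.factorize(prefix)

print(prefix.tolist())
print(codes.tolist())
```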
(9) Cabin
combined_train_test.loc[combined_train_test.Cabin.isnull(), 'Cabin'] = 'U0'
combined_train_test['Cabin'] = combined_train_test['Cabin'].apply(lambda x: 0 if x == 'U0' else 1)
Correlation = pd.DataFrame(combined_train_test[
['Embarked', 'Sex', 'Title', 'Name_length', 'Family_Size', 'Family_Size_Category','Fare', 'Fare_bin_id', 'Pclass',
'Pclass_Fare_Category', 'Age', 'Ticket_Letter', 'Cabin']])
| Embarked | Sex | Title | Name_length | Family_Size | Family_Size_Category | Fare | Fare_bin_id | Pclass | Pclass_Fare_Category | Age | Ticket_Letter | Cabin |
0 | 0 | 0 | 0 | 23 | 2 | 2 | 7.250000 | 0 | 0 | 5 | 22.000000 | 0 | 0 |
1 | 1 | 1 | 1 | 51 | 2 | 2 | 35.641650 | 1 | 1 | 0 | 38.000000 | 1 | 1 |
2 | 0 | 1 | 2 | 22 | 1 | 1 | 7.925000 | 2 | 0 | 4 | 26.000000 | 2 | 0 |
3 | 0 | 1 | 1 | 44 | 2 | 2 | 26.550000 | 1 | 1 | 1 | 35.000000 | 3 | 1 |
4 | 0 | 0 | 0 | 24 | 1 | 1 | 8.050000 | 2 | 0 | 4 | 35.000000 | 3 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
413 | 0 | 0 | 0 | 18 | 1 | 1 | 8.050000 | 2 | 0 | 4 | 30.000675 | 24 | 0 |
414 | 1 | 1 | 4 | 28 | 1 | 1 | 36.300000 | 1 | 1 | 0 | 39.000000 | 1 | 1 |
415 | 0 | 0 | 0 | 28 | 1 | 1 | 7.250000 | 0 | 0 | 5 | 38.500000 | 21 | 0 |
416 | 0 | 0 | 0 | 19 | 1 | 1 | 8.050000 | 2 | 0 | 4 | 30.000675 | 3 | 0 |
417 | 1 | 0 | 3 | 24 | 3 | 2 | 7.452767 | 0 | 0 | 4 | 30.000675 | 3 | 0 |
1309 rows × 13 columns
colormap = plt.cm.viridis
plt.title('Pearson Correlation of Features', y=1.05, size=15)
sns.heatmap(Correlation.astype(float).corr(),linewidths=0.1,vmax=1.0, square=True, cmap=colormap, linecolor='white', annot=True)

# `height` replaces the older `size` argument, which seaborn renamed in version 0.9
g = sns.pairplot(combined_train_test[['Survived', 'Pclass', 'Sex', 'Age', 'Fare', 'Embarked',
                                      'Family_Size', 'Title', 'Ticket_Letter']],
                 hue='Survived', palette='seismic', height=1.2, diag_kind='kde',
                 diag_kws=dict(shade=True), plot_kws=dict(s=10))

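1. Standardizing some of the features
from sklearn import preprocessing
# Fit a scaler that rescales each column to zero mean and unit variance
scale_age_fare = preprocessing.StandardScaler().fit(combined_train_test[['Age','Fare', 'Name_length']])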
StandardScaler(copy=True, with_mean=True, with_std=True)
combined_train_test[['Age','Fare', 'Name_length']] = scale_age_fare.transform(combined_train_test[['Age','Fare', 'Name_length']])
combined_train_test[['Age','Fare', 'Name_length']]
| Age | Fare | Name_length |
0 | -0.613832 | -0.554177 | -0.434672 |
1 | 0.628562 | 1.541869 | 2.511806 |
2 | -0.303234 | -0.504344 | -0.539904 |
3 | 0.395613 | 0.870667 | 1.775186 |
4 | 0.395613 | -0.495116 | -0.329441 |
... | ... | ... | ... |
413 | 0.007417 | -0.495116 | -0.960829 |
414 | 0.706211 | 1.590472 | 0.091485 |
415 | 0.667387 | -0.554177 | 0.091485 |
416 | 0.007417 | -0.495116 | -0.855598 |
417 | 0.007417 | -0.539208 | -0.329441 |
1309 rows × 3 columns
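What `StandardScaler` does to each column can be verified by hand: it subtracts the column mean and divides by the population standard deviation (`ddof=0`). The following is a minimal sketch on a made-up four-row frame; the numbers are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical raw values for two numeric features.
df = pd.DataFrame({"Age": [22.0, 38.0, 26.0, 35.0], "Fare": [7.25, 71.28, 7.92, 53.10]})

scaler = StandardScaler().fit(df[["Age", "Fare"]])
scaled = pd.DataFrame(scaler.transform(df[["Age", "Fare"]]), columns=["Age", "Fare"], index=df.index)

# The same transformation written out explicitly.
manual = (df - df.mean()) / df.std(ddof=0)

print(scaled.round(6))
print(np.allclose(scaled.values, manual.values))
```

After the transformation every column has mean 0 and standard deviation 1, which is why the Age, Fare and Name_length values in the table above are centred on zero.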
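2. Dropping unneeded features
# Use .copy(): a plain assignment would only alias the same DataFrame, so the in-place drop below would also alter the backup
combined_data_backup = combined_train_test.copy()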
| Age | Cabin | Embarked | Fare | Name | Parch | PassengerId | Pclass | Sex | SibSp | Survived | Ticket | Embarked_0 | Embarked_1 | Embarked_2 | Sex_0 | Sex_1 | Title | Title_0 | Title_1 | Title_2 | Title_3 | Title_4 | Title_5 | Name_length | Fare_bin_id | Fare_0 | Fare_1 | Fare_2 | Fare_3 | Fare_4 | Pclass_Fare_Category | Pclass_0 | Pclass_1 | Pclass_2 | Pclass_3 | Pclass_4 | Pclass_5 | Family_Size | Family_Size_Category | Family_Size_Category_0 | Family_Size_Category_1 | Family_Size_Category_2 | Ticket_Letter |
0 | -0.613832 | 0 | 0 | -0.554177 | Braund, Mr. Owen Harris | 0 | 1 | 0 | 0 | 1 | 0 | A/5 21171 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | -0.434672 | 0 | 1 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 2 | 0 | 0 | 1 | 0 |
1 | 0.628562 | 1 | 1 | 1.541869 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 0 | 2 | 1 | 1 | 1 | 1 | PC 17599 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 2.511806 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 2 | 2 | 0 | 0 | 1 | 1 |
2 | -0.303234 | 0 | 0 | -0.504344 | Heikkinen, Miss. Laina | 0 | 3 | 0 | 1 | 0 | 1 | STON/O2. 3101282 | 1 | 0 | 0 | 0 | 1 | 2 | 0 | 0 | 1 | 0 | 0 | 0 | -0.539904 | 2 | 0 | 0 | 1 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 2 |
3 | 0.395613 | 1 | 0 | 0.870667 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0 | 4 | 1 | 1 | 1 | 1 | 113803 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1.775186 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 2 | 2 | 0 | 0 | 1 | 3 |
4 | 0.395613 | 0 | 0 | -0.495116 | Allen, Mr. William Henry | 0 | 5 | 0 | 0 | 0 | 0 | 373450 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | -0.329441 | 2 | 0 | 0 | 1 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 3 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
413 | 0.007417 | 0 | 0 | -0.495116 | Spector, Mr. Woolf | 0 | 1305 | 0 | 0 | 0 | 0 | A.5. 3236 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | -0.960829 | 2 | 0 | 0 | 1 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 24 |
414 | 0.706211 | 1 | 1 | 1.590472 | Oliva y Ocana, Dona. Fermina | 0 | 1306 | 1 | 1 | 0 | 0 | PC 17758 | 0 | 1 | 0 | 0 | 1 | 4 | 0 | 0 | 0 | 0 | 1 | 0 | 0.091485 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 |
415 | 0.667387 | 0 | 0 | -0.554177 | Saether, Mr. Simon Sivertsen | 0 | 1307 | 0 | 0 | 0 | 0 | SOTON/O.Q. 3101262 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0.091485 | 0 | 1 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 21 |
416 | 0.007417 | 0 | 0 | -0.495116 | Ware, Mr. Frederick | 0 | 1308 | 0 | 0 | 0 | 0 | 359309 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | -0.855598 | 2 | 0 | 0 | 1 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 3 |
417 | 0.007417 | 0 | 1 | -0.539208 | Peter, Master. Michael J | 1 | 1309 | 0 | 0 | 1 | 0 | 2668 | 0 | 1 | 0 | 1 | 0 | 3 | 0 | 0 | 0 | 1 | 0 | 0 | -0.329441 | 0 | 1 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 1 | 0 | 3 | 2 | 0 | 0 | 1 | 3 |
1309 rows × 44 columns
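When keeping a backup before an in-place `drop`, it matters whether the backup is a real copy. A plain assignment only binds a second name to the same DataFrame object, while `.copy()` takes an independent snapshot. The sketch below shows the difference on a toy frame (the column names and values are illustrative assumptions).

```python
import pandas as pd

df = pd.DataFrame({"Name": ["a", "b"], "Age": [22.0, 38.0]})

alias = df           # second name for the same object, not a backup
backup = df.copy()   # independent snapshot of the data

# An in-place drop changes the object that both `df` and `alias` refer to.
df.drop(["Name"], axis=1, inplace=True)

print(list(alias.columns))
print(list(backup.columns))
```

Only `backup` still contains the dropped column afterwards.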
combined_train_test[['PassengerId', 'Embarked', 'Sex', 'Name', 'Title', 'Fare_bin_id', 'Pclass_Fare_Category',
'Parch', 'SibSp', 'Family_Size_Category', 'Ticket']]
| PassengerId | Embarked | Sex | Name | Title | Fare_bin_id | Pclass_Fare_Category | Parch | SibSp | Family_Size_Category | Ticket |
0 | 1 | 0 | 0 | Braund, Mr. Owen Harris | 0 | 0 | 5 | 0 | 1 | 2 | A/5 21171 |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 1 | 1 | 0 | 0 | 1 | 2 | PC 17599 |
2 | 3 | 0 | 1 | Heikkinen, Miss. Laina | 2 | 2 | 4 | 0 | 0 | 1 | STON/O2. 3101282 |
3 | 4 | 0 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 1 | 1 | 1 | 0 | 1 | 2 | 113803 |
4 | 5 | 0 | 0 | Allen, Mr. William Henry | 0 | 2 | 4 | 0 | 0 | 1 | 373450 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
413 | 1305 | 0 | 0 | Spector, Mr. Woolf | 0 | 2 | 4 | 0 | 0 | 1 | A.5. 3236 |
414 | 1306 | 1 | 1 | Oliva y Ocana, Dona. Fermina | 4 | 1 | 0 | 0 | 0 | 1 | PC 17758 |
415 | 1307 | 0 | 0 | Saether, Mr. Simon Sivertsen | 0 | 0 | 5 | 0 | 0 | 1 | SOTON/O.Q. 3101262 |
416 | 1308 | 0 | 0 | Ware, Mr. Frederick | 0 | 2 | 4 | 0 | 0 | 1 | 359309 |
417 | 1309 | 1 | 0 | Peter, Master. Michael J | 3 | 0 | 4 | 1 | 1 | 2 | 2668 |
1309 rows × 11 columns
combined_train_test.drop(['PassengerId', 'Embarked', 'Sex', 'Name', 'Title', 'Fare_bin_id', 'Pclass_Fare_Category',
'Parch', 'SibSp', 'Family_Size_Category', 'Ticket'],axis=1,inplace=True)
3. Splitting the data back into training and test sets:
| Age | Cabin | Fare | Pclass | Survived | Embarked_0 | Embarked_1 | Embarked_2 | Sex_0 | Sex_1 | Title_0 | Title_1 | Title_2 | Title_3 | Title_4 | Title_5 | Name_length | Fare_0 | Fare_1 | Fare_2 | Fare_3 | Fare_4 | Pclass_0 | Pclass_1 | Pclass_2 | Pclass_3 | Pclass_4 | Pclass_5 | Family_Size | Family_Size_Category_0 | Family_Size_Category_1 | Family_Size_Category_2 | Ticket_Letter |
0 | -0.613832 | 0 | -0.554177 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | -0.434672 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 0 | 0 | 1 | 0 |
1 | 0.628562 | 1 | 1.541869 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 2.511806 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 1 | 1 |
2 | -0.303234 | 0 | -0.504344 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | -0.539904 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 2 |
3 | 0.395613 | 1 | 0.870667 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1.775186 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 1 | 3 |
4 | 0.395613 | 0 | -0.495116 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | -0.329441 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 3 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
886 | -0.225584 | 0 | -0.129677 | 2 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | -0.645135 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 3 |
887 | -0.846781 | 1 | 1.125368 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0.091485 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 3 |
888 | 0.007417 | 0 | -0.656611 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1.354261 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 4 | 0 | 0 | 1 | 15 |
889 | -0.303234 | 1 | 1.125368 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | -0.645135 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 3 |
890 | 0.162664 | 0 | -0.517264 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | -0.855598 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 3 |
891 rows × 33 columns
(1309, 33)
train_data = combined_train_test[:891]
test_data = combined_train_test[891:]
titanic_train_data_X = train_data.drop(['Survived'],axis=1)
titanic_train_data_Y = train_data['Survived']
titanic_test_data_X = test_data.drop(['Survived'],axis=1)
(891, 32)
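The split relies on the combined frame keeping the training rows first, followed by the test rows, so a positional slice at the training row count recovers both parts. Below is a minimal sketch of the same pattern on a toy frame; `n_train = 3` stands in for 891, and the column values are illustrative.

```python
import numpy as np
import pandas as pd

# Toy combined frame: 3 labelled training rows followed by 2 unlabelled test rows.
combined = pd.DataFrame({
    "Survived": [0.0, 1.0, 1.0, np.nan, np.nan],
    "Age": [-0.61, 0.63, -0.30, 0.01, 0.71],
    "Fare": [-0.55, 1.54, -0.50, -0.50, 1.59],
})
n_train = 3  # plays the role of 891 in the text

# Positional slices: training rows first, test rows after.
train = combined.iloc[:n_train]
test = combined.iloc[n_train:]

# Separate the features from the target.
X_train = train.drop(columns=["Survived"])
y_train = train["Survived"]
X_test = test.drop(columns=["Survived"])

print(X_train.shape, y_train.shape, X_test.shape)
```

The feature matrices end up with one column fewer than the combined frame, which matches the drop from 33 to 32 columns reported above.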
6. Model ensembling and testing
(1) Use several different models to screen the features and select the more important ones: