Kaggle exercise 1: Titanic Disaster [data]
The competition is simple: use machine learning to create a model that predicts which passengers survived the Titanic shipwreck. While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others. In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e., name, age, gender, socio-economic class, etc.).
Data exploration and analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
data_train = pd.read_csv('D:/数据分析/实战项目/Kaggle_Titanic-master/train.csv')
data_train.head(10)
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
| 5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
| 6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
| 7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.0750 | NaN | S |
| 8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S |
| 9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C |
Column descriptions: PassengerId – passenger ID; Survived – whether the passenger survived; Pclass – ticket class; SibSp – number of siblings/spouses aboard; Parch – number of parents/children aboard; Fare – ticket fare; Cabin – cabin number; Embarked – port of embarkation.
data_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
The summary shows that several columns contain missing values: Age, Cabin, and Embarked. Embarked is missing only two entries, which is mild relative to the whole set, so it can be filled based on the overall distribution. Cabin and Age have far more gaps; Age can be predicted from the other features. Next, look at the descriptive statistics of the numeric columns.
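For reference, the per-column missing counts can be tallied in one line (a quick check, not part of the original output):

print(data_train.isnull().sum())  # NaN count for every column at once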
data_train.describe()
| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.000000 | 891.000000 | 891.000000 | 714.000000 | 891.000000 | 891.000000 | 891.000000 |
| mean | 446.000000 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.000000 | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 223.500000 | 0.000000 | 2.000000 | 20.125000 | 0.000000 | 0.000000 | 7.910400 |
| 50% | 446.000000 | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
| 75% | 668.500000 | 1.000000 | 3.000000 | 38.000000 | 1.000000 | 0.000000 | 31.000000 |
| max | 891.000000 | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 |
The mean age is 29.7, the overall survival rate is 0.38, and second- and third-class passengers far outnumber first-class ones.
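To back up the class-imbalance observation, counting tickets per class is enough (a quick check, not shown in the original run):

print(data_train['Pclass'].value_counts().sort_index())  # passengers per ticket class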
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
ax1, ax2, ax3, ax4, ax5, ax6 = axes.flatten()
# plot value_counts directly so the bar heights stay aligned with their labels
counts = data_train['Survived'].value_counts()
ax1.bar(counts.index, counts.values)
ax1.set_title('Survived vs. not survived')
a = pd.DataFrame()
a['Survived'] = data_train.Sex[data_train.Survived==1].value_counts()
a['Not survived'] = data_train.Sex[data_train.Survived==0].value_counts()
a.plot(kind = 'bar',stacked = True,ax = ax2)
ax2.set_title('Survival by sex')
# hist() can choke on NaN ages, so drop them first
ax3.hist(data_train.Age.dropna(), color='darkorange')
ax3.hist(data_train.Age[data_train.Survived == 1].dropna())
ax3.set_title('Survival by age')
a = pd.DataFrame()
a['Survived'] = data_train.Pclass[data_train.Survived==1].value_counts()
a['Not survived'] = data_train.Pclass[data_train.Survived==0].value_counts()
a.plot(kind = 'bar',stacked = True,ax = ax4)
ax4.set_title('Survival by ticket class')
a = pd.DataFrame()
a['Survived'] = data_train.Embarked[data_train.Survived==1].value_counts()
a['Not survived'] = data_train.Embarked[data_train.Survived==0].value_counts()
a.plot(kind = 'bar',stacked = True,ax = ax5)
ax5.set_title('Survival by port of embarkation')
import seaborn as sns
sns.heatmap(data_train.drop(columns = 'Survived').corr(),annot=True,ax = ax6, vmax=0.5, square=True, cmap="Blues")
ax6.set_title('Feature correlation heatmap')
From these simple summaries:
1. Survivors are a minority; most passengers did not survive.
2. Male passengers outnumber female passengers, yet more of the women were rescued. Most passengers are young adults aged roughly 20–40, while children and the elderly make up a disproportionate share of the survivors.
3. The higher the ticket class, the higher the survival probability; the port of embarkation shows little relationship with survival.
4. The features are largely independent of one another; from the correlation coefficients there is no sign of multicollinearity (see the VIF sketch below).

The obvious attributes are handled; Parch, SibSp, and the heavily missing Cabin column have not been examined yet.
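The multicollinearity claim in point 4 can be checked more formally with variance inflation factors. A minimal sketch over the numeric columns, assuming statsmodels is available (it is not otherwise used in this notebook); rows with missing Age are dropped first:

from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm

num = data_train[['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']].dropna()
X = sm.add_constant(num)
for i, col in enumerate(num.columns, start=1):  # index 0 is the constant term
    print(col, variance_inflation_factor(X.values, i))  # VIF above ~5-10 would flag collinearity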
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(15, 4))
a = pd.DataFrame()
a['Survived'] = data_train.Parch[data_train.Survived == 1].value_counts()
a['Not survived'] = data_train.Parch[data_train.Survived == 0].value_counts()
a.plot(kind='bar', stacked=True, ax=ax1)
ax1.set_title('Parch vs. survival')
ax1.set_xlabel('Parents/children aboard')
ax1.set_ylabel('Passengers')
b = pd.DataFrame()
b['Survived'] = data_train.SibSp[data_train.Survived == 1].value_counts()
b['Not survived'] = data_train.SibSp[data_train.Survived == 0].value_counts()
b.plot(kind='bar', ax=ax2, stacked=True)
ax2.set_title('SibSp vs. survival')
ax2.set_xlabel('Siblings/spouses aboard')
ax2.set_ylabel('Passengers')
c = pd.DataFrame()
c['Cabin missing'] = data_train.Survived[pd.isna(data_train.Cabin)].value_counts()
c['Cabin recorded'] = data_train.Survived[pd.notna(data_train.Cabin)].value_counts()
c.plot(kind = 'bar',ax=ax3,stacked=True)
ax3.set_title('Cabin recorded vs. survival')
The plots show that passengers with no parents or children aboard are the most numerous among the survivors, but those with one or two have the highest survival ratio; SibSp follows the same pattern. This is consistent with the correlation analysis above: Parch and SibSp are somewhat correlated, though not strongly. Passengers whose Cabin is recorded (non-null) appear to survive at a higher rate — perhaps a registered cabin made a passenger easier to locate during the rescue.
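To put numbers behind what the bars suggest, the survival rate per group can be computed directly (a quick check, not in the original run):

print(data_train.groupby('Parch')['Survived'].mean())   # survival rate by parents/children aboard
print(data_train.groupby('SibSp')['Survived'].mean())   # survival rate by siblings/spouses aboard
print(data_train.groupby(data_train['Cabin'].notna())['Survived'].mean())  # by Cabin recorded or not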
Data preprocessing
Age and Cabin contain missing values, which can seriously hurt model performance. The options are to drop the affected rows or to fill them with a mean or a predicted value. The sample is already small, so dropping rows is off the table: Age will be filled with a mean or predicted value, and Cabin will be reduced to a binary missing/recorded indicator.
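For comparison, the two simplest Age fills look like this (the notebook goes on to use a random-forest prediction instead):

age_mean_filled = data_train['Age'].fillna(data_train['Age'].mean())      # mean fill
age_median_filled = data_train['Age'].fillna(data_train['Age'].median())  # median fill, more robust to the long tail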
# reduce Cabin to a recorded/missing flag; int dtype keeps the downstream models happy
data_train['Cabin'] = data_train['Cabin'].notna().astype(int)
data_train.Sex.value_counts()
data_train['Sex'].replace({'male': 0, 'female': 1}, inplace=True)
data_train.head()
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | 0 | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | 0 | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 1 | 38.0 | 1 | 0 | PC 17599 | 71.2833 | 1 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | 1 | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | 0 | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 1 | 35.0 | 1 | 0 | 113803 | 53.1000 | 1 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | 0 | 35.0 | 0 | 0 | 373450 | 8.0500 | 0 | S |
print(data_train.Embarked.value_counts())
print(data_train.Embarked.count())
# fill the two missing ports with the most frequent value, 'S'
data_train.loc[data_train.Embarked.isna(),'Embarked'] = 'S'
print(data_train.Embarked.value_counts())
data_train.Embarked.replace({'S':0,'C':1,'Q':2},inplace=True)
data_train.head(5)
S 644
C 168
Q 77
Name: Embarked, dtype: int64
889
S 646
C 168
Q 77
Name: Embarked, dtype: int64
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | 0 | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | 0 | 0 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 1 | 38.0 | 1 | 0 | PC 17599 | 71.2833 | 1 | 1 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | 1 | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | 0 | 0 |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 1 | 35.0 | 1 | 0 | 113803 | 53.1000 | 1 | 0 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | 0 | 35.0 | 0 | 0 | 373450 | 8.0500 | 0 | 0 |
from sklearn.ensemble import RandomForestRegressor
# predict the missing ages from the other numeric features
df = data_train[['Age','Sex','Fare', 'Parch', 'SibSp', 'Pclass','Cabin']]
samples_with_age = df[data_train.Age.notna()]
print(samples_with_age.head(3))
samples_without_age = df[data_train.Age.isna()]
print(samples_without_age.head(3))
# X_age/y_age rather than X_train/y_train, to avoid shadowing the split variables used later
X_age = samples_with_age.drop(columns = 'Age')
y_age = samples_with_age['Age']
model = RandomForestRegressor(random_state=0, n_estimators=200)
model.fit(X_age, y_age)
y_pre = model.predict(samples_without_age.drop(columns = 'Age'))
data_train.loc[data_train.Age.isnull(), 'Age'] = y_pre
Age Sex Fare Parch SibSp Pclass Cabin
0 22.0 0 7.2500 0 1 3 0
1 38.0 1 71.2833 0 1 1 1
2 26.0 1 7.9250 0 0 3 0
Age Sex Fare Parch SibSp Pclass Cabin
5 NaN 0 8.4583 0 0 3 0
17 NaN 0 13.0000 0 0 2 0
19 NaN 1 7.2250 0 0 3 0
data_train.head(10)
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | 0 | 22.000000 | 1 | 0 | A/5 21171 | 7.2500 | 0 | 0 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 1 | 38.000000 | 1 | 0 | PC 17599 | 71.2833 | 1 | 1 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | 1 | 26.000000 | 0 | 0 | STON/O2. 3101282 | 7.9250 | 0 | 0 |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 1 | 35.000000 | 1 | 0 | 113803 | 53.1000 | 1 | 0 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | 0 | 35.000000 | 0 | 0 | 373450 | 8.0500 | 0 | 0 |
| 5 | 6 | 0 | 3 | Moran, Mr. James | 0 | 23.127944 | 0 | 0 | 330877 | 8.4583 | 0 | 2 |
| 6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | 0 | 54.000000 | 0 | 0 | 17463 | 51.8625 | 1 | 0 |
| 7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | 0 | 2.000000 | 3 | 1 | 349909 | 21.0750 | 0 | 0 |
| 8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | 1 | 27.000000 | 0 | 2 | 347742 | 11.1333 | 0 | 0 |
| 9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | 1 | 14.000000 | 1 | 0 | 237736 | 30.0708 | 0 | 1 |
At this point the basic preprocessing is largely complete and all columns that needed numeric encoding have been converted. Next, probe the numeric data for outliers with the simplest tool, a box plot.
data_train[['Pclass','Sex','Age','SibSp','Parch','Fare','Cabin','Embarked']].boxplot()
Overall the data look reasonable. A few fares above 500 are far beyond the typical range, but they could simply be extravagant purchases or special cases, so they are acceptable.
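The eyeball judgment can be made slightly more precise with the usual 1.5×IQR fences; a sketch for Fare (the 1.5 multiplier is just the conventional default):

q1, q3 = data_train['Fare'].quantile([0.25, 0.75])
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = data_train[(data_train['Fare'] < low) | (data_train['Fare'] > high)]
print(len(outliers), 'fares fall outside the IQR fences')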
With preprocessing finished, it is time to build models. Start with plain logistic regression as a baseline.
Modeling
data = data_train.drop(columns=['PassengerId','Name','Ticket'])
Splitting the dataset:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
data.drop(columns='Survived'), data['Survived'], random_state=1, test_size=0.2)
Logistic regression model
from sklearn.linear_model import LogisticRegression
# lbfgs can hit its iteration limit on unscaled features (see the warning below);
# raising max_iter or standardizing the features would address it
model5 = LogisticRegression()
model5.fit(X_train,y_train)
score = model5.score(X_test,y_test)
print(score)
print(model5.coef_)
0.7988826815642458
[[-9.02910417e-01 2.45051490e+00 -4.13729185e-02 -3.87050276e-01
5.63331782e-02 -1.36826728e-03 1.13677853e+00 3.03292337e-01]]
E:\anaconda\install\lib\site-packages\sklearn\linear_model\_logistic.py:763: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
Would a tree-based model give a different result? Try AdaBoost.
from sklearn.ensemble import AdaBoostClassifier
model1 = AdaBoostClassifier(random_state=1)
model1.fit(X_train, y_train)
y_pred = model1.predict(X_test)
score = model1.score(X_test, y_test)
print(score)
0.770949720670391
These models are run with default parameters; once each candidate has produced a result, grid search can be used to find suitable parameters.
GBDT model
from sklearn.ensemble import GradientBoostingClassifier
model2 = GradientBoostingClassifier(random_state=1)
model2.fit(X_train,y_train)
y_pred = model2.predict(X_test)
score = model2.score(X_test,y_test)
print(score)
0.7821229050279329
XGBoost model
# cast the 0/1 Cabin flag to bool so XGBoost accepts the column without complaint
X_train.Cabin.replace({1:True,0:False},inplace=True)
X_test.Cabin.replace({1:True,0:False},inplace=True)
from xgboost import XGBClassifier
model3 = XGBClassifier()
model3.fit(X_train,y_train)
score = model3.score(X_test,y_test)
print(score)
[18:47:51] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.4.0/src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
0.7821229050279329
E:\anaconda\install\lib\site-packages\xgboost\sklearn.py:1146: UserWarning: The use of label encoder in XGBClassifier is deprecated and will be removed in a future release. To remove this warning, do the following: 1) Pass option use_label_encoder=False when constructing XGBClassifier object; and 2) Encode your labels (y) as integers starting with 0, i.e. 0, 1, 2, ..., [num_class - 1].
warnings.warn(label_encoder_deprecation_msg, UserWarning)
LightGBM model
from lightgbm import LGBMClassifier
model4 = LGBMClassifier()
model4.fit(X_train,y_train)
score = model4.score(X_test,y_test)
print(score)
0.776536312849162
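A single 20% hold-out is a noisy yardstick for differences of a point or two. A steadier comparison is k-fold cross-validation over the full training frame; a sketch reusing the five estimators above (scores will differ somewhat from the hold-out numbers printed so far):

from sklearn.model_selection import cross_val_score

X_all = data.drop(columns='Survived')
y_all = data['Survived']
for name, m in [('LogisticRegression', model5), ('AdaBoost', model1),
                ('GBDT', model2), ('XGBoost', model3), ('LightGBM', model4)]:
    scores = cross_val_score(m, X_all, y_all, cv=5, scoring='accuracy')  # refits a clone per fold
    print(f'{name}: {scores.mean():.3f} +/- {scores.std():.3f}')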
Grid search for the best GBDT parameters
from sklearn.model_selection import GridSearchCV
parameters = {'max_depth':[1,3,5,7],'n_estimators':[50,100,200,500],'learning_rate':[0.01,0.02,0.05,0.1,0.2]}
model = GradientBoostingClassifier()
# this is classification, so score on accuracy rather than r2
grid_search = GridSearchCV(model,parameters,scoring = 'accuracy',cv=5)
grid_search.fit(X_train,y_train)
grid_search.best_params_
{'learning_rate': 0.02, 'max_depth': 5, 'n_estimators': 100}
model2 = GradientBoostingClassifier(learning_rate= 0.02, max_depth= 5, n_estimators=100,random_state=1)
model2.fit(X_train,y_train)
y_pred = model2.predict(X_test)
score = model2.score(X_test,y_test)
print(score)
0.7821229050279329
After all this processing, is the best performer really plain logistic regression? Laugh-cry. Anyway, let's go ahead and predict the test set.
data_test = pd.read_csv('D:/数据分析/实战项目/Kaggle_Titanic-master/test.csv')
data_test.head(10)
| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
| 1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0000 | NaN | S |
| 2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
| 3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
| 4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |
| 5 | 897 | 3 | Svensson, Mr. Johan Cervin | male | 14.0 | 0 | 0 | 7538 | 9.2250 | NaN | S |
| 6 | 898 | 3 | Connolly, Miss. Kate | female | 30.0 | 0 | 0 | 330972 | 7.6292 | NaN | Q |
| 7 | 899 | 2 | Caldwell, Mr. Albert Francis | male | 26.0 | 1 | 1 | 248738 | 29.0000 | NaN | S |
| 8 | 900 | 3 | Abrahim, Mrs. Joseph (Sophie Halaut Easu) | female | 18.0 | 0 | 0 | 2657 | 7.2292 | NaN | C |
| 9 | 901 | 3 | Davies, Mr. John Samuel | male | 21.0 | 2 | 0 | A/4 48871 | 24.1500 | NaN | S |
Apply the same preprocessing to the test set: map Sex to 0/1, fill the missing Age values with model predictions, reduce Cabin to a present/absent flag, and encode Embarked as integers.
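Since the train and test sets need identical transforms, it would be cleaner to collect them in one helper so the two code paths cannot drift apart. A hypothetical refactor (encode_features is not in the original notebook; Age imputation stays separate because it needs the fitted regressor):

def encode_features(df):
    df = df.copy()
    df['Sex'] = df['Sex'].replace({'male': 0, 'female': 1})
    df['Embarked'] = df['Embarked'].fillna('S').replace({'S': 0, 'C': 1, 'Q': 2})
    df['Cabin'] = df['Cabin'].notna().astype(int)    # recorded/missing flag
    df['Fare'] = df['Fare'].fillna(df['Fare'].mean())
    return df

The cells below keep the original step-by-step form.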
data_test.Sex.replace({'male':0,'female':1},inplace = True)
print(data_test.Embarked.value_counts())
data_test.Embarked.replace({'S':0,'C':1,'Q':2},inplace=True)
S 270
C 102
Q 46
Name: Embarked, dtype: int64
print(data_test.Cabin.isna().value_counts())
# same recorded/missing flag as on the training set
data_test['Cabin'] = data_test['Cabin'].notna().astype(int)
data_test
True 327
False 91
Name: Cabin, dtype: int64
| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | Kelly, Mr. James | 0 | 34.5 | 0 | 0 | 330911 | 7.8292 | 0 | 2 |
| 1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | 1 | 47.0 | 1 | 0 | 363272 | 7.0000 | 0 | 0 |
| 2 | 894 | 2 | Myles, Mr. Thomas Francis | 0 | 62.0 | 0 | 0 | 240276 | 9.6875 | 0 | 2 |
| 3 | 895 | 3 | Wirz, Mr. Albert | 0 | 27.0 | 0 | 0 | 315154 | 8.6625 | 0 | 0 |
| 4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | 1 | 22.0 | 1 | 1 | 3101298 | 12.2875 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 413 | 1305 | 3 | Spector, Mr. Woolf | 0 | NaN | 0 | 0 | A.5. 3236 | 8.0500 | 0 | 0 |
| 414 | 1306 | 1 | Oliva y Ocana, Dona. Fermina | 1 | 39.0 | 0 | 0 | PC 17758 | 108.9000 | 1 | 1 |
| 415 | 1307 | 3 | Saether, Mr. Simon Sivertsen | 0 | 38.5 | 0 | 0 | SOTON/O.Q. 3101262 | 7.2500 | 0 | 0 |
| 416 | 1308 | 3 | Ware, Mr. Frederick | 0 | NaN | 0 | 0 | 359309 | 8.0500 | 0 | 0 |
| 417 | 1309 | 3 | Peter, Master. Michael J | 0 | NaN | 1 | 1 | 2668 | 22.3583 | 0 | 1 |

418 rows × 11 columns
# assign through .loc to avoid the chained-assignment SettingWithCopyWarning;
# mean() already skips NaN, so no notna() mask is needed
data_test.loc[data_test.Fare.isna(), 'Fare'] = data_test.Fare.mean()
data_test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 418 non-null int64
1 Pclass 418 non-null int64
2 Name 418 non-null object
3 Sex 418 non-null int64
4 Age 332 non-null float64
5 SibSp 418 non-null int64
6 Parch 418 non-null int64
7 Ticket 418 non-null object
8 Fare 418 non-null float64
9 Cabin 418 non-null object
10 Embarked 418 non-null int64
dtypes: float64(2), int64(6), object(3)
memory usage: 36.0+ KB
print(data_test.Age.isna().value_counts())
False 332
True 86
Name: Age, dtype: int64
age_with = data_test[data_test.Age.notna()]
age_without = data_test[data_test.Age.isna()]
# note: this refits a fresh regressor on the test set's own rows (and a slightly
# different feature set, Embarked included) rather than reusing the train-set age model
model = RandomForestRegressor(random_state=1)
X_age = age_with.drop(columns=['Age', 'Name', 'PassengerId', 'Ticket'])
y_age = age_with['Age']
model.fit(X_age, y_age)
# assign through .loc to avoid the chained-assignment warning
data_test.loc[data_test.Age.isna(), 'Age'] = model.predict(
    age_without.drop(columns=['Age', 'Name', 'PassengerId', 'Ticket']))
data_test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 418 non-null int64
1 Pclass 418 non-null int64
2 Name 418 non-null object
3 Sex 418 non-null int64
4 Age 418 non-null float64
5 SibSp 418 non-null int64
6 Parch 418 non-null int64
7 Ticket 418 non-null object
8 Fare 418 non-null float64
9 Cabin 418 non-null object
10 Embarked 418 non-null int64
dtypes: float64(2), int64(6), object(3)
memory usage: 36.0+ KB
pred_test = model2.predict(data_test.drop(columns = ['PassengerId','Name','Ticket']))
result = pd.DataFrame()
result['PassengerId'] = data_test['PassengerId']
result['Survived'] = list(pred_test)
result.to_csv('D:/数据分析/实战项目/Kaggle_Titanic-master/test_result.csv',index = False)
The final submission scored 0.78467 on Kaggle. That wraps up this first Kaggle exercise — still a long way behind the leaderboard veterans. The main gap is unfamiliarity with which models suit the problem. The feature engineering also needs much more work, including getting the order of operations right: handle outliers and text data first, then decide whether to drop, predict, or otherwise fill the missing values. There is more signal left in the existing attributes too — for example, Name could be used to group passengers into families whose members were more likely to survive together — and the modeling side could continue with other models or ensembles.
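As a starting point for the family idea above, the honorific can be pulled out of Name with a regex (a sketch; 'Braund, Mr. Owen Harris' yields 'Mr'):

titles = data_train['Name'].str.extract(r',\s*([^.]+)\.')[0].str.strip()  # text between the comma and the first period
print(titles.value_counts())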