Kaggle exercise 1: Titanic Disaster [data]
The competition is simple: use machine learning to create a model that predicts which passengers survived the Titanic shipwreck. While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others. In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e., name, age, gender, socio-economic class, etc.).
Data exploration and analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
data_train = pd.read_csv('D:/数据分析/实战项目/Kaggle_Titanic-master/train.csv')
data_train.head(10)
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
| 5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
| 6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
| 7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.0750 | NaN | S |
| 8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S |
| 9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C |
Column descriptions: PassengerId – passenger ID; Survived – whether the passenger survived; Pclass – ticket class; SibSp – number of siblings/spouses aboard; Parch – number of parents/children aboard; Fare – ticket fare; Cabin – cabin number; Embarked – port of embarkation.
data_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
The summary shows that several columns contain missing values: Age, Cabin, and Embarked. Embarked is missing only two entries, which is mild relative to the whole set, so it can be filled based on the overall distribution. Cabin and Age have far more gaps; Age can be predicted from the other features. Next, look at the descriptive statistics of the numeric columns.
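For reference, the per-column missing counts can be tallied in one line (a quick check, not part of the original output):

print(data_train.isnull().sum())  # NaN count for every column at once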
data_train.describe()
| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.000000 | 891.000000 | 891.000000 | 714.000000 | 891.000000 | 891.000000 | 891.000000 |
| mean | 446.000000 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.000000 | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 223.500000 | 0.000000 | 2.000000 | 20.125000 | 0.000000 | 0.000000 | 7.910400 |
| 50% | 446.000000 | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
| 75% | 668.500000 | 1.000000 | 3.000000 | 38.000000 | 1.000000 | 0.000000 | 31.000000 |
| max | 891.000000 | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 |
The mean age is 29.7, the overall survival rate is 0.38, and second- and third-class passengers far outnumber first-class ones.
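To back up the class-imbalance observation, counting tickets per class is enough (a quick check, not shown in the original run):

print(data_train['Pclass'].value_counts().sort_index())  # passengers per ticket class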
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
ax1, ax2, ax3, ax4, ax5, ax6 = axes.flatten()
# plot value_counts directly so the bar heights stay aligned with their labels
counts = data_train['Survived'].value_counts()
ax1.bar(counts.index, counts.values)
ax1.set_title('Survived vs. not survived')
a = pd.DataFrame()
a['Survived'] = data_train.Sex[data_train.Survived==1].value_counts()
a['Not survived'] = data_train.Sex[data_train.Survived==0].value_counts()
a.plot(kind = 'bar',stacked = True,ax = ax2)
ax2.set_title('Survival by sex')
# hist() can choke on NaN ages, so drop them first
ax3.hist(data_train.Age.dropna(), color='darkorange')
ax3.hist(data_train.Age[data_train.Survived == 1].dropna())
ax3.set_title('Survival by age')
a = pd.DataFrame()
a['Survived'] = data_train.Pclass[data_train.Survived==1].value_counts()
a['Not survived'] = data_train.Pclass[data_train.Survived==0].value_counts()
a.plot(kind = 'bar',stacked = True,ax = ax4)
ax4.set_title('Survival by ticket class')
a = pd.DataFrame()
a['Survived'] = data_train.Embarked[data_train.Survived==1].value_counts()
a['Not survived'] = data_train.Embarked[data_train.Survived==0].value_counts()
a.plot(kind = 'bar',stacked = True,ax = ax5)
ax5.set_title('Survival by port of embarkation')
import seaborn as sns
sns.heatmap(data_train.drop(columns = 'Survived').corr(),annot=True,ax = ax6, vmax=0.5, square=True, cmap="Blues")
ax6.set_title('Feature correlation heatmap')
From these simple summaries:
1. Survivors are a minority; most passengers did not survive.
2. Male passengers outnumber female passengers, yet more of the women were rescued. Most passengers are young adults aged roughly 20–40, while children and the elderly make up a disproportionate share of the survivors.
3. The higher the ticket class, the higher the survival probability; the port of embarkation shows little relationship with survival.
4. The features are largely independent of one another; from the correlation coefficients there is no sign of multicollinearity (see the VIF sketch below).

The obvious attributes are handled; Parch, SibSp, and the heavily missing Cabin column have not been examined yet.
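The multicollinearity claim in point 4 can be checked more formally with variance inflation factors. A minimal sketch over the numeric columns, assuming statsmodels is available (it is not otherwise used in this notebook); rows with missing Age are dropped first:

from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm

num = data_train[['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']].dropna()
X = sm.add_constant(num)
for i, col in enumerate(num.columns, start=1):  # index 0 is the constant term
    print(col, variance_inflation_factor(X.values, i))  # VIF above ~5-10 would flag collinearity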
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(15, 4))
a = pd.DataFrame()
a['Survived'] = data_train.Parch[data_train.Survived == 1].value_counts()
a['Not survived'] = data_train.Parch[data_train.Survived == 0].value_counts()
a.plot(kind='bar', stacked=True, ax=ax1)
ax1.set_title('Parch vs. survival')
ax1.set_xlabel('Parents/children aboard')
ax1.set_ylabel('Passengers')
b = pd.DataFrame()
b['Survived'] = data_train.SibSp[data_train.Survived == 1].value_counts()
b['Not survived'] = data_train.SibSp[data_train.Survived == 0].value_counts()
b.plot(kind='bar', ax=ax2, stacked=True)
ax2.set_title('SibSp vs. survival')
ax2.set_xlabel('Siblings/spouses aboard')
ax2.set_ylabel('Passengers')
c = pd.DataFrame()
c['Cabin missing'] = data_train.Survived[pd.isna(data_train.Cabin)].value_counts()
c['Cabin recorded'] = data_train.Survived[pd.notna(data_train.Cabin)].value_counts()
c.plot(kind = 'bar',ax=ax3,stacked=True)
ax3.set_title('Cabin recorded vs. survival')
The plots show that passengers with no parents or children aboard are the most numerous among the survivors, but those with one or two have the highest survival ratio; SibSp follows the same pattern. This is consistent with the correlation analysis above: Parch and SibSp are somewhat correlated, though not strongly. Passengers whose Cabin is recorded (non-null) appear to survive at a higher rate — perhaps a registered cabin made a passenger easier to locate during the rescue.
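To put numbers behind what the bars suggest, the survival rate per group can be computed directly (a quick check, not in the original run):

print(data_train.groupby('Parch')['Survived'].mean())   # survival rate by parents/children aboard
print(data_train.groupby('SibSp')['Survived'].mean())   # survival rate by siblings/spouses aboard
print(data_train.groupby(data_train['Cabin'].notna())['Survived'].mean())  # by Cabin recorded or not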
Data preprocessing
Age and Cabin contain missing values, which can seriously hurt model performance. The options are to drop the affected rows or to fill them with a mean or a predicted value. The sample is already small, so dropping rows is off the table: Age will be filled with a mean or predicted value, and Cabin will be reduced to a binary missing/recorded indicator.
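For comparison, the two simplest Age fills look like this (the notebook goes on to use a random-forest prediction instead):

age_mean_filled = data_train['Age'].fillna(data_train['Age'].mean())      # mean fill
age_median_filled = data_train['Age'].fillna(data_train['Age'].median())  # median fill, more robust to the long tail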
# reduce Cabin to a recorded/missing flag; int dtype keeps the downstream models happy
data_train['Cabin'] = data_train['Cabin'].notna().astype(int)
data_train.Sex.value_counts()
data_train['Sex'].replace({'male': 0, 'female': 1}, inplace=True)
data_train.head()
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | 0 | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | 0 | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 1 | 38.0 | 1 | 0 | PC 17599 | 71.2833 | 1 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | 1 | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | 0 | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 1 | 35.0 | 1 | 0 | 113803 | 53.1000 | 1 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | 0 | 35.0 | 0 | 0 | 373450 | 8.0500 | 0 | S |
print(data_train.Embarked.value_counts())
print(data_train.Embarked.count())
# fill the two missing ports with the most frequent value, 'S'
data_train.loc[data_train.Embarked.isna(),'Embarked'] = 'S'
print(data_train.Embarked.value_counts())
data_train.Embarked.replace({'S':0,'C':1,'Q':2},inplace=True)
data_train.head(5)
S 644
C 168
Q 77
Name: Embarked, dtype: int64
889
S 646
C 168
Q 77
Name: Embarked, dtype: int64
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | 0 | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | 0 | 0 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 1 | 38.0 | 1 | 0 | PC 17599 | 71.2833 | 1 | 1 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | 1 | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | 0 | 0 |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 1 | 35.0 | 1 | 0 | 113803 | 53.1000 | 1 | 0 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | 0 | 35.0 | 0 | 0 | 373450 | 8.0500 | 0 | 0 |
from sklearn.ensemble import RandomForestRegressor
# predict the missing ages from the other numeric features
df = data_train[['Age','Sex','Fare', 'Parch', 'SibSp', 'Pclass','Cabin']]
samples_with_age = df[data_train.Age.notna()]
print(samples_with_age.head(3))
samples_without_age = df[data_train.Age.isna()]
print(samples_without_age.head(3))
# X_age/y_age rather than X_train/y_train, to avoid shadowing the split variables used later
X_age = samples_with_age.drop(columns = 'Age')
y_age = samples_with_age['Age']
model = RandomForestRegressor(random_state=0, n_estimators=200)
model.fit(X_age, y_age)
y_pre = model.predict(samples_without_age.drop(columns = 'Age'))
data_train.loc[data_train.Age.isnull(), 'Age'] = y_pre
Age Sex Fare Parch SibSp Pclass Cabin
0 22.0 0 7.2500 0 1 3 0
1 38.0 1 71.2833 0 1 1 1
2 26.0 1 7.9250 0 0 3 0
Age Sex Fare Parch SibSp Pclass Cabin
5 NaN 0 8.4583 0 0 3 0
17 NaN 0 13.0000 0 0 2 0
19 NaN 1 7.2250 0 0 3 0
data_train.head(10)
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | 0 | 22.000000 | 1 | 0 | A/5 21171 | 7.2500 | 0 | 0 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 1 | 38.000000 | 1 | 0 | PC 17599 | 71.2833 | 1 | 1 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | 1 | 26.000000 | 0 | 0 | STON/O2. 3101282 | 7.9250 | 0 | 0 |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 1 | 35.000000 | 1 | 0 | 113803 | 53.1000 | 1 | 0 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | 0 | 35.000000 | 0 | 0 | 373450 | 8.0500 | 0 | 0 |
| 5 | 6 | 0 | 3 | Moran, Mr. James | 0 | 23.127944 | 0 | 0 | 330877 | 8.4583 | 0 | 2 |
| 6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | 0 | 54.000000 | 0 | 0 | 17463 | 51.8625 | 1 | 0 |
| 7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | 0 | 2.000000 | 3 | 1 | 349909 | 21.0750 | 0 | 0 |
| 8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | 1 | 27.000000 | 0 | 2 | 347742 | 11.1333 | 0 | 0 |
| 9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | 1 | 14.000000 | 1 | 0 | 237736 | 30.0708 | 0 | 1 |
At this point the basic preprocessing is largely complete and all columns that needed numeric encoding have been converted. Next, probe the numeric data for outliers with the simplest tool, a box plot.
data_train[['Pclass','Sex','Age','SibSp','Parch','Fare','Cabin','Embarked']].boxplot()
Overall the data look reasonable. A few fares above 500 are far beyond the typical range, but they could simply be extravagant purchases or special cases, so they are acceptable.
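The eyeball judgment can be made slightly more precise with the usual 1.5×IQR fences; a sketch for Fare (the 1.5 multiplier is just the conventional default):

q1, q3 = data_train['Fare'].quantile([0.25, 0.75])
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = data_train[(data_train['Fare'] < low) | (data_train['Fare'] > high)]
print(len(outliers), 'fares fall outside the IQR fences')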
With preprocessing finished, it is time to build models. Start with plain logistic regression as a baseline.
Modeling
data = data_train.drop(columns=['PassengerId','Name','Ticket'])
Splitting the dataset:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
data.drop(columns='Survived'), data['Survived'], random_state=1, test_size=0.2)
Logistic regression model
from sklearn.linear_model import LogisticRegression
# lbfgs can hit its iteration limit on unscaled features (see the warning below);
# raising max_iter or standardizing the features would address it
model5 = LogisticRegression()
model5.fit(X_train,y_train)
score = model5.score(X_test,y_test)
print(score)
print(model5.coef_)
0.7988826815642458
[[-9.02910417e-01 2.45051490e+00 -4.13729185e-02 -3.87050276e-01
5.63331782e-02 -1.36826728e-03 1.13677853e+00 3.03292337e-01]]
E:\anaconda\install\lib\site-packages\sklearn\linear_model\_logistic.py:763: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
Would a tree-based model give a different result? Try AdaBoost.
from sklearn.ensemble import AdaBoostClassifier
model1 = AdaBoostClassifier(random_state=1)
model1.fit(X_train, y_train)
y_pred = model1.predict(X_test)
score = model1.score(X_test, y_test)
print(score)
0.770949720670391
These models are run with default parameters; once each candidate has produced a result, grid search can be used to find suitable parameters.
GBDT model
from sklearn.ensemble import GradientBoostingClassifier
model2 = GradientBoostingClassifier(random_state=1)
model2.fit(X_train,y_train)
y_pred = model2.predict(X_test)
score = model2.score(X_test,y_test)
print(score)
0.7821229050279329
XGBoost model
# cast the 0/1 Cabin flag to bool so XGBoost accepts the column without complaint
X_train.Cabin.replace({1:True,0:False},inplace=True)
X_test.Cabin.replace({1:True,0:False},inplace=True)
from xgboost import XGBClassifier
model3 = XGBClassifier()
model3.fit(X_train,y_train)
score = model3.score(X_test,y_test)
print(score)
[18:47:51] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.4.0/src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
0.7821229050279329
E:\anaconda\install\lib\site-packages\xgboost\sklearn.py:1146: UserWarning: The use of label encoder in XGBClassifier is deprecated and will be removed in a future release. To remove this warning, do the following: 1) Pass option use_label_encoder=False when constructing XGBClassifier object; and 2) Encode your labels (y) as integers starting with 0, i.e. 0, 1, 2, ..., [num_class - 1].
warnings.warn(label_encoder_deprecation_msg, UserWarning)
LightGBM model
from lightgbm import LGBMClassifier
model4 = LGBMClassifier()
model4.fit(X_train,y_train)
score = model4.score(X_test,y_test)
print(score)
0.776536312849162
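A single 20% hold-out is a noisy yardstick for differences of a point or two. A steadier comparison is k-fold cross-validation over the full training frame; a sketch reusing the five estimators above (scores will differ somewhat from the hold-out numbers printed so far):

from sklearn.model_selection import cross_val_score

X_all = data.drop(columns='Survived')
y_all = data['Survived']
for name, m in [('LogisticRegression', model5), ('AdaBoost', model1),
                ('GBDT', model2), ('XGBoost', model3), ('LightGBM', model4)]:
    scores = cross_val_score(m, X_all, y_all, cv=5, scoring='accuracy')  # refits a clone per fold
    print(f'{name}: {scores.mean():.3f} +/- {scores.std():.3f}')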
Grid search for the best GBDT parameters
from sklearn.model_selection import GridSearchCV
parameters = {'max_depth':[1,3,5,7],'n_estimators':[50,100,200,500],'learning_rate':[0.01,0.02,0.05,0.1,0.2]}
model = GradientBoostingClassifier()
# this is classification, so score on accuracy rather than r2
grid_search = GridSearchCV(model,parameters,scoring = 'accuracy',cv=5)
grid_search.fit(X_train,y_train)
grid_search.best_params_
{'learning_rate': 0.02, 'max_depth': 5, 'n_estimators': 100}
model2 = GradientBoostingClassifier(learning_rate= 0.02, max_depth= 5, n_estimators=100,random_state=1)
model2.fit(X_train,y_train)
y_pred = model2.predict(X_test)
score = model2.score(X_test,y_test)
print(score)
0.7821229050279329
After all this processing, is the best performer really plain logistic regression? Laugh-cry. Anyway, let's go ahead and predict the test set.
data_test = pd.read_csv('D:/数据分析/实战项目/Kaggle_Titanic-master/test.csv')
data_test.head(10)
| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
| 1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0000 | NaN | S |
| 2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
| 3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
| 4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |
| 5 | 897 | 3 | Svensson, Mr. Johan Cervin | male | 14.0 | 0 | 0 | 7538 | 9.2250 | NaN | S |
| 6 | 898 | 3 | Connolly, Miss. Kate | female | 30.0 | 0 | 0 | 330972 | 7.6292 | NaN | Q |
| 7 | 899 | 2 | Caldwell, Mr. Albert Francis | male | 26.0 | 1 | 1 | 248738 | 29.0000 | NaN | S |
| 8 | 900 | 3 | Abrahim, Mrs. Joseph (Sophie Halaut Easu) | female | 18.0 | 0 | 0 | 2657 | 7.2292 | NaN | C |
| 9 | 901 | 3 | Davies, Mr. John Samuel | male | 21.0 | 2 | 0 | A/4 48871 | 24.1500 | NaN | S |
Apply the same preprocessing to the test set: map Sex to 0/1, fill the missing Age values with model predictions, reduce Cabin to a present/absent flag, and encode Embarked as integers.
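Since the train and test sets need identical transforms, it would be cleaner to collect them in one helper so the two code paths cannot drift apart. A hypothetical refactor (encode_features is not in the original notebook; Age imputation stays separate because it needs the fitted regressor):

def encode_features(df):
    df = df.copy()
    df['Sex'] = df['Sex'].replace({'male': 0, 'female': 1})
    df['Embarked'] = df['Embarked'].fillna('S').replace({'S': 0, 'C': 1, 'Q': 2})
    df['Cabin'] = df['Cabin'].notna().astype(int)    # recorded/missing flag
    df['Fare'] = df['Fare'].fillna(df['Fare'].mean())
    return df

The cells below keep the original step-by-step form.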
data_test.Sex.replace({'male':0,'female':1},inplace = True)
print(data_test.Embarked.value_counts())
data_test.Embarked.replace({'S':0,'C':1,'Q':2},inplace=True)
S 270
C 102
Q 46
Name: Embarked, dtype: int64
print(data_test.Cabin.isna().value_counts())
# same recorded/missing flag as on the training set
data_test['Cabin'] = data_test['Cabin'].notna().astype(int)
data_test
True 327
False 91
Name: Cabin, dtype: int64
| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | Kelly, Mr. James | 0 | 34.5 | 0 | 0 | 330911 | 7.8292 | 0 | 2 |
| 1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | 1 | 47.0 | 1 | 0 | 363272 | 7.0000 | 0 | 0 |
| 2 | 894 | 2 | Myles, Mr. Thomas Francis | 0 | 62.0 | 0 | 0 | 240276 | 9.6875 | 0 | 2 |
| 3 | 895 | 3 | Wirz, Mr. Albert | 0 | 27.0 | 0 | 0 | 315154 | 8.6625 | 0 | 0 |
| 4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | 1 | 22.0 | 1 | 1 | 3101298 | 12.2875 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 413 | 1305 | 3 | Spector, Mr. Woolf | 0 | NaN | 0 | 0 | A.5. 3236 | 8.0500 | 0 | 0 |
| 414 | 1306 | 1 | Oliva y Ocana, Dona. Fermina | 1 | 39.0 | 0 | 0 | PC 17758 | 108.9000 | 1 | 1 |
| 415 | 1307 | 3 | Saether, Mr. Simon Sivertsen | 0 | 38.5 | 0 | 0 | SOTON/O.Q. 3101262 | 7.2500 | 0 | 0 |
| 416 | 1308 | 3 | Ware, Mr. Frederick | 0 | NaN | 0 | 0 | 359309 | 8.0500 | 0 | 0 |
| 417 | 1309 | 3 | Peter, Master. Michael J | 0 | NaN | 1 | 1 | 2668 | 22.3583 | 0 | 1 |

418 rows × 11 columns
# assign through .loc to avoid the chained-assignment SettingWithCopyWarning;
# mean() already skips NaN, so no notna() mask is needed
data_test.loc[data_test.Fare.isna(), 'Fare'] = data_test.Fare.mean()
data_test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 418 non-null int64
1 Pclass 418 non-null int64
2 Name 418 non-null object
3 Sex 418 non-null int64
4 Age 332 non-null float64
5 SibSp 418 non-null int64
6 Parch 418 non-null int64
7 Ticket 418 non-null object
8 Fare 418 non-null float64
9 Cabin 418 non-null object
10 Embarked 418 non-null int64
dtypes: float64(2), int64(6), object(3)
memory usage: 36.0+ KB
print(data_test.Age.isna().value_counts())
False 332
True 86
Name: Age, dtype: int64
age_with = data_test[data_test.Age.notna()]
age_without = data_test[data_test.Age.isna()]
# note: this refits a fresh regressor on the test set's own rows (and a slightly
# different feature set, Embarked included) rather than reusing the train-set age model
model = RandomForestRegressor(random_state=1)
X_age = age_with.drop(columns=['Age', 'Name', 'PassengerId', 'Ticket'])
y_age = age_with['Age']
model.fit(X_age, y_age)
# assign through .loc to avoid the chained-assignment warning
data_test.loc[data_test.Age.isna(), 'Age'] = model.predict(
    age_without.drop(columns=['Age', 'Name', 'PassengerId', 'Ticket']))
data_test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 418 non-null int64
1 Pclass 418 non-null int64
2 Name 418 non-null object
3 Sex 418 non-null int64
4 Age 418 non-null float64
5 SibSp 418 non-null int64
6 Parch 418 non-null int64
7 Ticket 418 non-null object
8 Fare 418 non-null float64
9 Cabin 418 non-null object
10 Embarked 418 non-null int64
dtypes: float64(2), int64(6), object(3)
memory usage: 36.0+ KB
pred_test = model2.predict(data_test.drop(columns = ['PassengerId','Name','Ticket']))
result = pd.DataFrame()
result['PassengerId'] = data_test['PassengerId']
result['Survived'] = list(pred_test)
result.to_csv('D:/数据分析/实战项目/Kaggle_Titanic-master/test_result.csv',index = False)
The final submission scored 0.78467 on Kaggle. That wraps up this first Kaggle exercise — still a long way behind the leaderboard veterans. The main gap is unfamiliarity with which models suit the problem. The feature engineering also needs much more work, including getting the order of operations right: handle outliers and text data first, then decide whether to drop, predict, or otherwise fill the missing values. There is more signal left in the existing attributes too — for example, Name could be used to group passengers into families whose members were more likely to survive together — and the modeling side could continue with other models or ensembles.
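As a starting point for the family idea above, the honorific can be pulled out of Name with a regex (a sketch; 'Braund, Mr. Owen Harris' yields 'Mr'):

titles = data_train['Name'].str.extract(r',\s*([^.]+)\.')[0].str.strip()  # text between the comma and the first period
print(titles.value_counts())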