
Kaggle exercise 1: Titanic Disaster [data]

The competition is simple: use machine learning to create a model that predicts which passengers survived the Titanic shipwreck.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (ie name, age, gender, socio-economic class, etc).

Data exploration and analysis

# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.rcParams['font.sans-serif'] = ['SimSun']  # render CJK labels correctly
plt.rcParams['axes.unicode_minus'] = False    # keep minus signs from showing as boxes
# Read the data
data_train = pd.read_csv('D:/数据分析/实战项目/Kaggle_Titanic-master/train.csv')
data_train.head(10)
   PassengerId  Survived  Pclass  Name                                                Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                             male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...   female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                              female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)        female  35.0  1      0      113803            53.1000  C123   S
4  5            0         3       Allen, Mr. William Henry                            male    35.0  0      0      373450            8.0500   NaN    S
5  6            0         3       Moran, Mr. James                                    male    NaN   0      0      330877            8.4583   NaN    Q
6  7            0         1       McCarthy, Mr. Timothy J                             male    54.0  0      0      17463             51.8625  E46    S
7  8            0         3       Palsson, Master. Gosta Leonard                      male    2.0   3      1      349909            21.0750  NaN    S
8  9            1         3       Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)   female  27.0  0      2      347742            11.1333  NaN    S
9  10           1         2       Nasser, Mrs. Nicholas (Adele Achem)                 female  14.0  1      0      237736            30.0708  NaN    C

Column meanings: PassengerId – passenger ID; Survived – whether the passenger survived; Pclass – ticket class; SibSp – number of siblings/spouses aboard; Parch – number of parents/children aboard; Fare – ticket fare; Cabin – cabin number; Embarked – port of embarkation.

# First look at the overall data info
data_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

From these statistics, several columns have missing values: Age, Cabin and Embarked. Embarked is missing only two entries, negligible relative to the whole, so it can be filled from the overall distribution. Cabin and Age have far more gaps; Age can be predicted from the other columns. Next, look at the summary statistics of the numeric columns.

data_train.describe()
       PassengerId  Survived    Pclass      Age         SibSp       Parch       Fare
count  891.000000   891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean   446.000000   0.383838    2.308642    29.699118   0.523008    0.381594    32.204208
std    257.353842   0.486592    0.836071    14.526497   1.102743    0.806057    49.693429
min    1.000000     0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
25%    223.500000   0.000000    2.000000    20.125000   0.000000    0.000000    7.910400
50%    446.000000   0.000000    3.000000    28.000000   0.000000    0.000000    14.454200
75%    668.500000   1.000000    3.000000    38.000000   1.000000    0.000000    31.000000
max    891.000000   1.000000    3.000000    80.000000   8.000000    6.000000    512.329200

The average age is 29.7 and the overall survival rate is 0.38; second- and third-class tickets far outnumber first-class.

fig, axes = plt.subplots(2, 3, figsize=(15, 10))
ax1, ax2, ax3, ax4, ax5, ax6 = axes.flatten()
# Survived vs. not survived counts
counts = data_train['Survived'].value_counts().sort_index()
ax1.bar(counts.index, counts.values)
ax1.set_title('Survived vs. not survived')
# Survival by sex
a = pd.DataFrame()
a['survived'] = data_train.Sex[data_train.Survived == 1].value_counts()
a['not survived'] = data_train.Sex[data_train.Survived == 0].value_counts()
a.plot(kind='bar', stacked=True, ax=ax2)
ax2.set_title('Survival by sex')
# Survival by age
ax3.hist(data_train.Age, color='darkorange')
ax3.hist(data_train.Age[data_train.Survived == 1])
ax3.set_title('Survival by age')
# Share of survivors within each ticket class
a = pd.DataFrame()
a['survived'] = data_train.Pclass[data_train.Survived == 1].value_counts()
a['not survived'] = data_train.Pclass[data_train.Survived == 0].value_counts()
a.plot(kind='bar', stacked=True, ax=ax4)
ax4.set_title('Survival by ticket class')
# Survival by port of embarkation
a = pd.DataFrame()
a['survived'] = data_train.Embarked[data_train.Survived == 1].value_counts()
a['not survived'] = data_train.Embarked[data_train.Survived == 0].value_counts()
a.plot(kind='bar', stacked=True, ax=ax5)
ax5.set_title('Survival by port of embarkation')
# Correlation heatmap of the features
import seaborn as sns
sns.heatmap(data_train.drop(columns='Survived').corr(), annot=True, ax=ax6, vmax=0.5, square=True, cmap="Blues")
ax6.set_title('Correlation matrix of the features')

From this simple exploratory analysis:
1. Survivors are the minority; most passengers did not survive.
2. There were more male than female passengers, but a larger share of women survived. Most passengers were aged 20–40, while children and the elderly make up a larger share of the survivors.
3. The higher the ticket class, the higher the survival rate; the port of embarkation matters little.
4. The features are largely independent of each other; the correlation matrix shows no multicollinearity.
This covers the obvious, simple attributes; the siblings/parents columns and the heavily missing Cabin column have not been examined yet.
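The proportion claims in points 2 and 3 can be checked numerically with a quick groupby. A sketch on a tiny made-up frame (run the same two lines on data_train for the real rates):

```python
import pandas as pd

# Made-up mini sample, not the real data.
df = pd.DataFrame({
    'Sex':      ['male', 'female', 'female', 'male', 'female', 'male'],
    'Pclass':   [3, 1, 3, 1, 2, 3],
    'Survived': [0, 1, 1, 0, 1, 0],
})
by_sex = df.groupby('Sex')['Survived'].mean()       # survival rate per sex
by_class = df.groupby('Pclass')['Survived'].mean()  # survival rate per ticket class
print(by_sex)
print(by_class)
```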

# Does the number of parents/children aboard help survival?
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(15, 4))
a = pd.DataFrame()
a['survived'] = data_train.Parch[data_train.Survived == 1].value_counts()
a['not survived'] = data_train.Parch[data_train.Survived == 0].value_counts()
a.plot(kind='bar', stacked=True, ax=ax1)
ax1.set_title('Parch vs. survival')
ax1.set_xlabel('Number of parents/children aboard')
ax1.set_ylabel('Passengers')
# Does the number of siblings/spouses aboard help survival?
b = pd.DataFrame()
b['survived'] = data_train.SibSp[data_train.Survived == 1].value_counts()
b['not survived'] = data_train.SibSp[data_train.Survived == 0].value_counts()
b.plot(kind='bar', ax=ax2, stacked=True)
ax2.set_title('SibSp vs. survival')
ax2.set_xlabel('Number of siblings/spouses aboard')
ax2.set_ylabel('Passengers')
# Cabin (missing vs. recorded) vs. survival
c = pd.DataFrame()
c['missing'] = data_train.Survived[pd.isna(data_train.Cabin)].value_counts()
c['recorded'] = data_train.Survived[pd.notna(data_train.Cabin)].value_counts()
c.plot(kind='bar', ax=ax3, stacked=True)
ax3.set_title('Cabin missing vs. survival')

From these plots, passengers with zero parents/children aboard account for the largest absolute number of survivors, but those with 1–2 parents/children aboard have the highest survival proportion. SibSp behaves the same way as Parch, which is consistent with the correlation analysis above: the two are somewhat correlated, though not strongly. Passengers with a recorded (non-missing) Cabin appear to survive at a higher rate; perhaps having a registered cabin made them easier to locate and rescue.
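One way to act on this observation (not done in the original analysis, just a possible extension) is to fold SibSp and Parch into a single family-size feature:

```python
import pandas as pd

# Hypothetical derived features; the names FamilySize/IsAlone are ours.
df = pd.DataFrame({'SibSp': [1, 0, 3], 'Parch': [0, 0, 1]})
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1     # +1 for the passenger themself
df['IsAlone'] = (df['FamilySize'] == 1).astype(int)  # 1 if travelling alone
print(df)
```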

Data preprocessing

Age and Cabin contain missing values, which can hurt model performance considerably.
The options are dropping the rows with missing values, or filling them (with a mean or a predicted value).
Since the sample is already small, dropping rows is not an option, so we fill: Age can be filled with the mean or a predicted value, and Cabin can be reduced to missing vs. recorded.
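Before the model-based fill used below, it is worth noting the simpler baseline: filling Age with a single statistic such as the median. A sketch with synthetic ages:

```python
import numpy as np
import pandas as pd

# Synthetic ages; on the real data this would be data_train['Age'].
age = pd.Series([22.0, np.nan, 26.0, 38.0, np.nan])
age_filled = age.fillna(age.median())  # median ignores the NaNs
print(age_filled.tolist())
```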

# Encode Cabin: recorded -> 1, missing -> 0
data_train.loc[data_train.Cabin.notnull(), 'Cabin'] = 1
data_train.loc[data_train.Cabin.isnull(), 'Cabin'] = 0
# Handle the non-numeric columns before predicting Age
# Encode Sex
data_train.Sex.value_counts()
data_train['Sex'].replace({'male': 0, 'female': 1}, inplace=True)
data_train.head()
   PassengerId  Survived  Pclass  Name                                                Sex  Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                             0    22.0  1      0      A/5 21171         7.2500   0      S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...   1    38.0  1      0      PC 17599          71.2833  1      C
2  3            1         3       Heikkinen, Miss. Laina                              1    26.0  0      0      STON/O2. 3101282  7.9250   0      S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)        1    35.0  1      0      113803            53.1000  1      S
4  5            0         3       Allen, Mr. William Henry                            0    35.0  0      0      373450            8.0500   0      S
# Embarked has two missing values; small relative to the sample, so fill with the mode
print(data_train.Embarked.value_counts())
print(data_train.Embarked.count())
data_train.loc[data_train.Embarked.isna(),'Embarked'] = 'S'
print(data_train.Embarked.value_counts())
data_train.Embarked.replace({'S':0,'C':1,'Q':2},inplace=True)
data_train.head(5)
S    644
C    168
Q     77
Name: Embarked, dtype: int64
889
S    646
C    168
Q     77
Name: Embarked, dtype: int64
   PassengerId  Survived  Pclass  Name                                                Sex  Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                             0    22.0  1      0      A/5 21171         7.2500   0      0
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...   1    38.0  1      0      PC 17599          71.2833  1      1
2  3            1         3       Heikkinen, Miss. Laina                              1    26.0  0      0      STON/O2. 3101282  7.9250   0      0
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)        1    35.0  1      0      113803            53.1000  1      0
4  5            0         3       Allen, Mr. William Henry                            0    35.0  0      0      373450            8.0500   0      0
# Fill missing Age by predicting it with a random forest
from sklearn.ensemble import RandomForestRegressor
df = data_train[['Age', 'Sex', 'Fare', 'Parch', 'SibSp', 'Pclass', 'Cabin']]
samples_with_age = df[pd.notna(data_train.Age)]
print(samples_with_age.head(3))
samples_without_age = df[pd.isna(data_train.Age)]
print(samples_without_age.head(3))
X_train = samples_with_age.drop(columns='Age')
y_train = samples_with_age['Age']
X_test = samples_without_age.drop(columns='Age')
model = RandomForestRegressor(random_state=0, n_estimators=200)
model.fit(X_train, y_train)
y_pre = model.predict(X_test)
data_train.loc[data_train.Age.isnull(), 'Age'] = y_pre

    Age  Sex     Fare  Parch  SibSp  Pclass Cabin
0  22.0    0   7.2500      0      1       3     0
1  38.0    1  71.2833      0      1       1     1
2  26.0    1   7.9250      0      0       3     0
    Age  Sex     Fare  Parch  SibSp  Pclass Cabin
5   NaN    0   8.4583      0      0       3     0
17  NaN    0  13.0000      0      0       2     0
19  NaN    1   7.2250      0      0       3     0
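The same Age-prediction steps get repeated on the test set further down; wrapping them in a helper avoids the duplication. A sketch, where the function name fill_age is ours and the demo frame is synthetic:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def fill_age(df, feature_cols, random_state=0):
    """Return a copy of df with missing Age predicted from feature_cols."""
    out = df.copy()
    missing_mask = out['Age'].isna()
    if missing_mask.sum() == 0:
        return out
    known = out[~missing_mask]
    model = RandomForestRegressor(random_state=random_state, n_estimators=200)
    model.fit(known[feature_cols], known['Age'])
    out.loc[missing_mask, 'Age'] = model.predict(out.loc[missing_mask, feature_cols])
    return out

# Tiny synthetic demo frame.
demo = pd.DataFrame({
    'Age':    [20.0, 40.0, np.nan, 30.0],
    'Fare':   [7.0, 70.0, 8.0, 30.0],
    'Pclass': [3, 1, 3, 2],
})
filled = fill_age(demo, ['Fare', 'Pclass'])
print(filled)
```

With a helper like this, data_train and data_test could each be filled with a single call.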
data_train.head(10)
   PassengerId  Survived  Pclass  Name                                                Sex  Age        SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                             0    22.000000  1      0      A/5 21171         7.2500   0      0
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...   1    38.000000  1      0      PC 17599          71.2833  1      1
2  3            1         3       Heikkinen, Miss. Laina                              1    26.000000  0      0      STON/O2. 3101282  7.9250   0      0
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)        1    35.000000  1      0      113803            53.1000  1      0
4  5            0         3       Allen, Mr. William Henry                            0    35.000000  0      0      373450            8.0500   0      0
5  6            0         3       Moran, Mr. James                                    0    23.127944  0      0      330877            8.4583   0      2
6  7            0         1       McCarthy, Mr. Timothy J                             0    54.000000  0      0      17463             51.8625  1      0
7  8            0         3       Palsson, Master. Gosta Leonard                      0    2.000000   3      1      349909            21.0750  0      0
8  9            1         3       Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)   1    27.000000  0      2      347742            11.1333  0      0
9  10           1         2       Nasser, Mrs. Nicholas (Adele Achem)                 1    14.000000  1      0      237736            30.0708  0      1

At this point the basic data processing is essentially done and the columns that needed numeric encoding have been encoded. Next, check the numeric columns for outliers, using a simple boxplot.

data_train[['Pclass','Sex','Age','SibSp','Parch','Fare','Cabin','Embarked']].boxplot()

Overall the data look normal. Fare does include values around 500, far above the typical range, but that could just be very wealthy passengers or special circumstances, so it is acceptable.
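The eyeballed judgement can be backed up with the usual 1.5×IQR rule for flagging extreme values; a sketch on a synthetic fare series:

```python
import pandas as pd

# Synthetic fares including one extreme value.
fare = pd.Series([7.25, 8.05, 13.0, 26.0, 512.33])
q1, q3 = fare.quantile(0.25), fare.quantile(0.75)
upper = q3 + 1.5 * (q3 - q1)  # classic outlier threshold
outliers = fare[fare > upper]
print(upper, outliers.tolist())
```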

With preprocessing done, we move on to modeling; a simple logistic regression serves as the baseline.

Modeling

data = data_train.drop(columns=['PassengerId','Name','Ticket'])

Split the dataset:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(columns='Survived'), data['Survived'], random_state=1, test_size=0.2)

Logistic regression model:

from sklearn.linear_model import LogisticRegression
model5 = LogisticRegression()
model5.fit(X_train,y_train)
score = model5.score(X_test,y_test)
print(score)
print(model5.coef_)
0.7988826815642458
[[-9.02910417e-01  2.45051490e+00 -4.13729185e-02 -3.87050276e-01
   5.63331782e-02 -1.36826728e-03  1.13677853e+00  3.03292337e-01]]


E:\anaconda\install\lib\site-packages\sklearn\linear_model\_logistic.py:763: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
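The ConvergenceWarning above means what it says: either scale the features or raise max_iter. A sketch of both fixes together, on synthetic data with mixed feature scales (swap in X_train/y_train on the real problem):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Titanic features, with very different scales.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3)) * np.array([1.0, 100.0, 1.0])
y = (X[:, 0] + X[:, 1] / 100.0 > 0).astype(int)

# Scaling evens out the feature magnitudes; max_iter gives lbfgs headroom.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X, y)
print(clf.score(X, y))
```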

Would a tree-based model behave differently? Try AdaBoost:

from sklearn.ensemble import AdaBoostClassifier
model1 = AdaBoostClassifier(random_state=1)
model1.fit(X_train, y_train)
y_pred = model1.predict(X_test)
score = model1.score(X_test, y_test)
print(score)
0.770949720670391

These models all use default parameters; once several models have produced results, grid search can be used to tune the parameters.

GBDT (gradient boosting) model:

from sklearn.ensemble import GradientBoostingClassifier
model2 = GradientBoostingClassifier(random_state=1)
model2.fit(X_train,y_train)
y_pred = model2.predict(X_test)
score = model2.score(X_test,y_test)
print(score)
0.7821229050279329

XGBoost model:

# Cast the object-dtype Cabin column to bool so XGBoost accepts it
X_train.Cabin.replace({1: True, 0: False}, inplace=True)
X_test.Cabin.replace({1: True, 0: False}, inplace=True)
from xgboost import XGBClassifier
model3 = XGBClassifier()
model3.fit(X_train,y_train)
score = model3.score(X_test,y_test)
print(score)
[18:47:51] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.4.0/src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
0.7821229050279329


E:\anaconda\install\lib\site-packages\xgboost\sklearn.py:1146: UserWarning: The use of label encoder in XGBClassifier is deprecated and will be removed in a future release. To remove this warning, do the following: 1) Pass option use_label_encoder=False when constructing XGBClassifier object; and 2) Encode your labels (y) as integers starting with 0, i.e. 0, 1, 2, ..., [num_class - 1].
  warnings.warn(label_encoder_deprecation_msg, UserWarning)

LightGBM model:

from lightgbm import LGBMClassifier
model4 = LGBMClassifier()
model4.fit(X_train,y_train)
score = model4.score(X_test,y_test)
print(score)
0.776536312849162
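The single 80/20 split behind the scores above is noisy on only 891 rows; cross-validation gives a steadier comparison. A sketch on a synthetic classification task (with the real data, pass data.drop(columns='Survived') and data['Survived'] instead):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary task standing in for the Titanic frame.
X, y = make_classification(n_samples=300, n_features=8, random_state=1)
for name, model in [('logreg', LogisticRegression(max_iter=1000)),
                    ('gbdt', GradientBoostingClassifier(random_state=1))]:
    scores = cross_val_score(model, X, y, cv=5)  # 5-fold accuracy
    print(name, round(scores.mean(), 3))
```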

Grid-search the best GBDT parameters:

from sklearn.model_selection import GridSearchCV
parameters = {'max_depth': [1, 3, 5, 7], 'n_estimators': [50, 100, 200, 500], 'learning_rate': [0.01, 0.02, 0.05, 0.1, 0.2]}
model = GradientBoostingClassifier()
# score with accuracy: 'r2' is a regression metric and is inappropriate for a classifier
grid_search = GridSearchCV(model, parameters, scoring='accuracy', cv=5)
grid_search.fit(X_train, y_train)
grid_search.best_params_
{'learning_rate': 0.02, 'max_depth': 5, 'n_estimators': 100}
model2 = GradientBoostingClassifier(learning_rate= 0.02, max_depth= 5, n_estimators=100,random_state=1)
model2.fit(X_train,y_train)
y_pred = model2.predict(X_test)
score = model2.score(X_test,y_test)
print(score)
0.7821229050279329

After all that work, the best performer is still plain logistic regression... Anyway, let's predict on the test set.

data_test = pd.read_csv('D:/数据分析/实战项目/Kaggle_Titanic-master/test.csv')
data_test.head(10)
   PassengerId  Pclass  Name                                           Sex     Age   SibSp  Parch  Ticket     Fare     Cabin  Embarked
0  892          3       Kelly, Mr. James                               male    34.5  0      0      330911     7.8292   NaN    Q
1  893          3       Wilkes, Mrs. James (Ellen Needs)               female  47.0  1      0      363272     7.0000   NaN    S
2  894          2       Myles, Mr. Thomas Francis                      male    62.0  0      0      240276     9.6875   NaN    Q
3  895          3       Wirz, Mr. Albert                               male    27.0  0      0      315154     8.6625   NaN    S
4  896          3       Hirvonen, Mrs. Alexander (Helga E Lindqvist)   female  22.0  1      1      3101298    12.2875  NaN    S
5  897          3       Svensson, Mr. Johan Cervin                     male    14.0  0      0      7538       9.2250   NaN    S
6  898          3       Connolly, Miss. Kate                           female  30.0  0      0      330972     7.6292   NaN    Q
7  899          2       Caldwell, Mr. Albert Francis                   male    26.0  1      1      248738     29.0000  NaN    S
8  900          3       Abrahim, Mrs. Joseph (Sophie Halaut Easu)      female  18.0  0      0      2657       7.2292   NaN    C
9  901          3       Davies, Mr. John Samuel                        male    21.0  2      0      A/4 48871  24.1500  NaN    S

Apply the same preprocessing to the test set: map Sex to 0/1, fill missing Age with predicted values, reduce Cabin to recorded/missing, and encode Embarked as integers.

data_test.Sex.replace({'male':0,'female':1},inplace = True)
print(data_test.Embarked.value_counts())
data_test.Embarked.replace({'S':0,'C':1,'Q':2},inplace=True)

S    270
C    102
Q     46
Name: Embarked, dtype: int64
print(data_test.Cabin.isna().value_counts())
# Fill the non-missing entries first; otherwise everything ends up non-missing and becomes 1
data_test.loc[data_test.Cabin.notna(), 'Cabin'] = 1
data_test.loc[data_test.Cabin.isna(), 'Cabin'] = 0
data_test
True     327
False     91
Name: Cabin, dtype: int64
     PassengerId  Pclass  Name                                           Sex  Age   SibSp  Parch  Ticket              Fare      Cabin  Embarked
0    892          3       Kelly, Mr. James                               0    34.5  0      0      330911              7.8292    0      2
1    893          3       Wilkes, Mrs. James (Ellen Needs)               1    47.0  1      0      363272              7.0000    0      0
2    894          2       Myles, Mr. Thomas Francis                      0    62.0  0      0      240276              9.6875    0      2
3    895          3       Wirz, Mr. Albert                               0    27.0  0      0      315154              8.6625    0      0
4    896          3       Hirvonen, Mrs. Alexander (Helga E Lindqvist)   1    22.0  1      1      3101298             12.2875   0      0
..   ...          ...     ...                                            ...  ...   ...    ...    ...                 ...       ...    ...
413  1305         3       Spector, Mr. Woolf                             0    NaN   0      0      A.5. 3236           8.0500    0      0
414  1306         1       Oliva y Ocana, Dona. Fermina                   1    39.0  0      0      PC 17758            108.9000  1      1
415  1307         3       Saether, Mr. Simon Sivertsen                   0    38.5  0      0      SOTON/O.Q. 3101262  7.2500    0      0
416  1308         3       Ware, Mr. Frederick                            0    NaN   0      0      359309              8.0500    0      0
417  1309         3       Peter, Master. Michael J                       0    NaN   1      1      2668                22.3583   0      1

418 rows × 11 columns

# Fill the missing Fare with the mean fare (use .loc to avoid a SettingWithCopyWarning)
data_test.loc[data_test.Fare.isna(), 'Fare'] = data_test.Fare.mean()
data_test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    int64  
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Ticket       418 non-null    object 
 8   Fare         418 non-null    float64
 9   Cabin        418 non-null    object 
 10  Embarked     418 non-null    int64  
dtypes: float64(2), int64(6), object(3)
memory usage: 36.0+ KB
print(data_test.Age.isna().value_counts())
False    332
True      86
Name: Age, dtype: int64
age_with = data_test[data_test.Age.notna()]
age_without = data_test[data_test.Age.isna()]
model = RandomForestRegressor(random_state=1)
X_train = age_with.drop(columns=['Age', 'Name', 'PassengerId', 'Ticket'])
y_train = age_with['Age']
model.fit(X_train, y_train)
# assign with .loc to avoid writing to a copy (SettingWithCopyWarning)
data_test.loc[data_test.Age.isna(), 'Age'] = model.predict(
    age_without.drop(columns=['Age', 'Name', 'PassengerId', 'Ticket']))
data_test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    int64  
 4   Age          418 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Ticket       418 non-null    object 
 8   Fare         418 non-null    float64
 9   Cabin        418 non-null    object 
 10  Embarked     418 non-null    int64  
dtypes: float64(2), int64(6), object(3)
memory usage: 36.0+ KB
# Predict on the test set with the tuned GBDT model
pred_test = model2.predict(data_test.drop(columns=['PassengerId', 'Name', 'Ticket']))
result = pd.DataFrame()
result['PassengerId'] = data_test['PassengerId']
result['Survived'] = list(pred_test)
result.to_csv('D:/数据分析/实战项目/Kaggle_Titanic-master/test_result.csv', index=False)

The final score submitted to Kaggle was 0.78467.
That concludes this first Kaggle exercise. There is still a long way to go compared with the top entries; the main gap is unfamiliarity with which models suit the problem. Feature engineering also deserves much more work, including the order of operations: outliers and text data should be handled first, and only then should missing values be dropped, predicted, or otherwise filled.
There is also more information to mine from the features, e.g. using names to infer family membership and whether that affects survival; model optimization can also continue with other models or ensembles...
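As a starting point for the name idea, both the surname (a family-grouping key) and the title can be pulled out of Name with string operations; a sketch on a few sample names from the data:

```python
import pandas as pd

# A few names copied from the training data.
names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
    'Heikkinen, Miss. Laina',
    'Palsson, Master. Gosta Leonard',
])
surnames = names.str.split(',').str[0]                                  # family grouping key
titles = names.str.extract(r',\s*([^.]+)\.', expand=False).str.strip()  # Mr/Mrs/Miss/...
print(surnames.tolist())
print(titles.tolist())
```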

Published: 2021-07-17 11:54:48 · Updated: 2021-07-17 11:56:41