Kaggle exercise 1: Titanic Disaster [data]
The competition is simple: use machine learning to create a model that predicts which passengers survived the Titanic shipwreck. While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others. In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e., name, age, gender, socio-economic class, etc.).
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.rcParams['font.sans-serif'] = ['SimSun']
plt.rcParams['axes.unicode_minus'] = False
data_train = pd.read_csv('D:/数据分析/实战项目/Kaggle_Titanic-master/train.csv')
# Preview the first ten rows
data_train.head(10)
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.0750 | NaN | S |
8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S |
9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C |
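Column descriptions: PassengerId: passenger ID; Survived: whether the passenger survived; Pclass: ticket class; SibSp: number of siblings or spouses aboard; Parch: number of parents or children aboard; Fare: ticket price; Cabin: cabin number; Embarked: port of embarkation.
data_train.info()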
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
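fig, axes = plt.subplots(2, 3, figsize=(15, 10))
ax1, ax2, ax3, ax4, ax5, ax6 = axes.flatten()
# ax1: overall counts of Survived (0 = did not survive, 1 = survived)
data_train.Survived.value_counts().plot(kind='bar', ax=ax1)
# ax2: survival counts by sex
a = pd.DataFrame()
a['Survived'] = data_train.Sex[data_train.Survived == 1].value_counts()
a['Did not survive'] = data_train.Sex[data_train.Survived == 0].value_counts()
a.plot(kind='bar', stacked=True, ax=ax2)
# ax3: age distribution of all passengers (orange) with survivors overlaid
ax3.hist(data_train.Age.dropna(), color='darkorange')
ax3.hist(data_train.Age[data_train.Survived == 1].dropna())
# ax4: survival counts by ticket class
a = pd.DataFrame()
a['Survived'] = data_train.Pclass[data_train.Survived == 1].value_counts()
a['Did not survive'] = data_train.Pclass[data_train.Survived == 0].value_counts()
a.plot(kind='bar', stacked=True, ax=ax4)
# ax5: survival counts by port of embarkation
a = pd.DataFrame()
a['Survived'] = data_train.Embarked[data_train.Survived == 1].value_counts()
a['Did not survive'] = data_train.Embarked[data_train.Survived == 0].value_counts()
a.plot(kind='bar', stacked=True, ax=ax5)
# ax6: correlations among the numeric features; numeric_only (pandas >= 1.5) skips the text columns
import seaborn as sns
sns.heatmap(data_train.drop(columns='Survived').corr(numeric_only=True), annot=True, ax=ax6, vmax=0.5, square=True, cmap="Blues")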
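This simple statistical analysis shows that:
1. Survivors are the minority; most passengers did not survive.
2. There were more male passengers than female passengers, but a larger share of the women survived. Most passengers were young or middle-aged adults in the [20, 40] age range, while children and the elderly make up a relatively large share of the survivors.
3. The higher the ticket class, the higher the chance of survival; survival shows little relation to the port of embarkation.
4. The features are only weakly correlated with one another, so the correlation coefficients show no sign of serious multicollinearity.
Only the obvious, simple attributes have been handled so far. The parent/child and sibling/spouse counts, and Cabin with its many missing values, have not been examined yet.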
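The stacked bars show counts, which makes groups of different sizes hard to compare. A minimal sketch of turning counts into per-group survival rates with groupby is given here. The toy rows and the helper name survival_rate are made up for illustration and are not taken from train.csv.

```python
import pandas as pd

# Toy data for illustration only; these rows are NOT taken from train.csv
toy = pd.DataFrame({
    "Sex": ["male", "male", "male", "female", "female", "female"],
    "Pclass": [3, 3, 1, 1, 2, 3],
    "Survived": [0, 0, 1, 1, 1, 0],
})


def survival_rate(df, column):
    """Share of survivors within each category of `column` (hypothetical helper)."""
    # Survived is 0/1, so its mean within a group is the survival rate
    return df.groupby(column)["Survived"].mean()


rate_by_sex = survival_rate(toy, "Sex")
rate_by_class = survival_rate(toy, "Pclass")
print(rate_by_sex)
print(rate_by_class)
```

On the real data the same call would be survival_rate(data_train, 'Sex'), assuming data_train is loaded as above.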
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(15, 4))
# ax1: survival counts by number of parents/children aboard
a = pd.DataFrame()
a['Survived'] = data_train.Parch[data_train.Survived == 1].value_counts()
a['Did not survive'] = data_train.Parch[data_train.Survived == 0].value_counts()
a.plot(kind='bar', stacked=True, ax=ax1)
# ax2: survival counts by number of siblings/spouses aboard
b = pd.DataFrame()
b['Survived'] = data_train.SibSp[data_train.Survived == 1].value_counts()
b['Did not survive'] = data_train.SibSp[data_train.Survived == 0].value_counts()
b.plot(kind='bar', ax=ax2, stacked=True)
# ax3: survival outcome (0/1) split by whether Cabin is recorded
c = pd.DataFrame()
c['Cabin missing'] = data_train.Survived[pd.isna(data_train.Cabin)].value_counts()
c['Cabin present'] = data_train.Survived[pd.notna(data_train.Cabin)].value_counts()
c.plot(kind='bar', ax=ax3, stacked=True)
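The attributes with missing values include Age and Cabin, and missing values can badly hurt a model's predictions. The usual options are to drop the rows that contain them, or to fill them in with the mean or with a predicted value. The sample is already small, so dropping rows is not an option and the values have to be filled: Age can be filled with the mean or with a predicted value, and Cabin can be reduced to missing versus present.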
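The three options just described can be compared side by side on a tiny example. This is only a sketch on made-up values (the toy frame is not from the competition data), meant to show what each choice does to the rows.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration; the values are made up
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0, np.nan, 38.0],
    "Cabin": ["C85", np.nan, np.nan, "E46", np.nan],
})

# Option 1: drop every row that has any missing value (loses most of this sample)
dropped = toy.dropna()

# Option 2: fill Age with the mean of the observed ages
age_mean_filled = toy["Age"].fillna(toy["Age"].mean())

# Option 3: reduce Cabin to a flag, 1 if recorded and 0 if missing
cabin_flag = toy["Cabin"].notna().astype(int)

print(len(toy), "rows before dropping,", len(dropped), "after")
print(age_mean_filled.tolist())
print(cabin_flag.tolist())
```

Dropping keeps only the rows where both columns are present, which is why it is unattractive on a dataset this small.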
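# Encode Cabin as an integer flag: 1 if a cabin is recorded, 0 if it is missing
data_train['Cabin'] = data_train.Cabin.notna().astype(int)
# Encode Sex as 0/1; assigning the result back avoids an inplace edit on a column selection
data_train['Sex'] = data_train['Sex'].map({'male': 0, 'female': 1})
data_train.Embarked.value_counts()
S 644
C 168
Q 77
Name: Embarked, dtype: int64
# Fill the two missing Embarked values with the most common port, 'S'
data_train.loc[data_train.Embarked.isna(), 'Embarked'] = 'S'
data_train.Embarked.value_counts()
S 646
C 168
Q 77
Name: Embarked, dtype: int64
from sklearn.preprocessing import LabelEncoder
# Embarked has to be numeric for the models below; LabelEncoder assigns codes alphabetically (C=0, Q=1, S=2)
data_train['Embarked'] = LabelEncoder().fit_transform(data_train['Embarked'])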
from sklearn.ensemble import RandomForestRegressor
# Predict the missing ages from the other numeric features
df = data_train[['Age','Sex','Fare', 'Parch', 'SibSp', 'Pclass','Cabin']]
samples_with_age = df[pd.notna(data_train.Age)]
samples_without_age = df[pd.isna(data_train.Age)]
X_train = samples_with_age.drop(columns='Age')
y_train = samples_with_age['Age']
X_test = samples_without_age.drop(columns='Age')
model = RandomForestRegressor(random_state=0, n_estimators=200)
# Fit on the rows with a known age before predicting the unknown ones
model.fit(X_train, y_train)
y_pre = model.predict(X_test)
data_train.loc[data_train.Age.isnull(), 'Age'] = y_pre
Age Sex Fare Parch SibSp Pclass Cabin
0 22.0 0 7.2500 0 1 3 0
1 38.0 1 71.2833 0 1 1 1
2 26.0 1 7.9250 0 0 3 0
Age Sex Fare Parch SibSp Pclass Cabin
5 NaN 0 8.4583 0 0 3 0
17 NaN 0 13.0000 0 0 2 0
19 NaN 1 7.2250 0 0 3 0
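The age-filling step above has to be repeated for the test set later, so it may be convenient to wrap it in a function. The sketch below is one possible way to do that; the name impute_with_model and the synthetic data are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def impute_with_model(df, target, features, random_state=0):
    """Return a copy of df with missing `target` values predicted from `features`.

    Hypothetical helper: fits a random forest on the rows where `target`
    is known and predicts it for the rows where it is missing.
    """
    out = df.copy()
    missing = out[target].isna()
    if missing.any() and (~missing).any():
        model = RandomForestRegressor(n_estimators=50, random_state=random_state)
        model.fit(out.loc[~missing, features], out.loc[~missing, target])
        out.loc[missing, target] = model.predict(out.loc[missing, features])
    return out


# Synthetic data for illustration only
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Pclass": rng.integers(1, 4, size=40),
    "Fare": rng.uniform(5, 100, size=40),
})
toy["Age"] = 50 - 8 * toy["Pclass"] + rng.normal(0, 2, size=40)
toy.loc[[3, 10, 25], "Age"] = np.nan  # knock out a few ages

filled = impute_with_model(toy, "Age", ["Pclass", "Fare"])
print(filled["Age"].isna().sum())
```

The function returns a copy, so the original frame keeps its missing values and the two can be compared.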
data = data_train.drop(columns=['PassengerId','Name','Ticket'])
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
data.drop(columns='Survived'), data['Survived'], random_state=1, test_size=0.2)
from sklearn.linear_model import LogisticRegression
model5 = LogisticRegression()
# The model has to be fitted before it can be scored
model5.fit(X_train, y_train)
score = model5.score(X_test,y_test)
model5.coef_
[[-9.02910417e-01 2.45051490e+00 -4.13729185e-02 -3.87050276e-01
5.63331782e-02 -1.36826728e-03 1.13677853e+00 3.03292337e-01]]
from sklearn.ensemble import AdaBoostClassifier
model1 = AdaBoostClassifier(random_state=1)
model1.fit(X_train, y_train)
y_pred = model1.predict(X_test)
score = model1.score(X_test, y_test)
from sklearn.ensemble import GradientBoostingClassifier
model2 = GradientBoostingClassifier(random_state=1)
model2.fit(X_train, y_train)  # each model must be fitted before predict/score
y_pred = model2.predict(X_test)
score = model2.score(X_test,y_test)
from xgboost import XGBClassifier
model3 = XGBClassifier()
model3.fit(X_train, y_train)
score = model3.score(X_test,y_test)
from lightgbm import LGBMClassifier
model4 = LGBMClassifier()
model4.fit(X_train, y_train)
score = model4.score(X_test,y_test)
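Each score above comes from a single 80/20 split, so small differences between models may just reflect which rows landed in the test part. A hedged sketch of comparing models with 5-fold cross-validation instead is shown below. It uses synthetic data from make_classification as a stand-in for the prepared features, and only scikit-learn models so that it runs without xgboost or lightgbm installed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the prepared Titanic features (8 columns, binary target)
X, y = make_classification(n_samples=200, n_features=8, random_state=1)

candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "AdaBoost": AdaBoostClassifier(random_state=1),
    "GradientBoosting": GradientBoostingClassifier(random_state=1),
}

# Mean accuracy over 5 folds for each candidate
cv_scores = {}
for name, clf in candidates.items():
    scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
    cv_scores[name] = scores.mean()

# Print the candidates from best to worst mean accuracy
for name, mean_acc in sorted(cv_scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {mean_acc:.3f}")
```

With the real data, X and y would be data.drop(columns='Survived') and data['Survived'].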
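from sklearn.model_selection import GridSearchCV
parameters = {'max_depth':[1,3,5,7],'n_estimators':[50,100,200,500],'learning_rate':[0.01,0.02,0.05,0.1,0.2]}
model = GradientBoostingClassifier()
# Accuracy is the competition metric; 'r2' is meant for regression, not classification
grid_search = GridSearchCV(model, parameters, scoring='accuracy', cv=5)
grid_search.fit(X_train, y_train)
grid_search.best_params_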
{'learning_rate': 0.02, 'max_depth': 5, 'n_estimators': 100}
# Refit with the best parameters found by the grid search
model2 = GradientBoostingClassifier(learning_rate=0.02, max_depth=5, n_estimators=100, random_state=1)
model2.fit(X_train, y_train)
y_pred = model2.predict(X_test)
score = model2.score(X_test,y_test)
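data_test = pd.read_csv('D:/数据分析/实战项目/Kaggle_Titanic-master/test.csv')
# Apply the same preprocessing as for the training set
data_test['Sex'] = data_test['Sex'].map({'male': 0, 'female': 1})
data_test.Embarked.value_counts()
S 270
C 102
Q 46
Name: Embarked, dtype: int64
# Embarked has no missing values here; use the same alphabetical codes as LabelEncoder (C=0, Q=1, S=2)
data_test['Embarked'] = data_test['Embarked'].map({'C': 0, 'Q': 1, 'S': 2})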
data_test.loc[data_test.Cabin.notna(), 'Cabin'] = 1
data_test.loc[data_test.Cabin.isna(), 'Cabin'] = 0
True 327
False 91
Name: Cabin, dtype: int64
# Fill missing Fare values with the mean of the observed fares (mean() skips NaN)
data_test['Fare'] = data_test['Fare'].fillna(data_test['Fare'].mean())
data_test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 418 non-null int64
1 Pclass 418 non-null int64
2 Name 418 non-null object
3 Sex 418 non-null int64
4 Age 332 non-null float64
5 SibSp 418 non-null int64
6 Parch 418 non-null int64
7 Ticket 418 non-null object
8 Fare 418 non-null float64
9 Cabin 418 non-null object
10 Embarked 418 non-null int64
dtypes: float64(2), int64(6), object(3)
memory usage: 36.0+ KB
False 332
True 86
Name: Age, dtype: int64
age_with = data_test[data_test.Age.notna()]
age_without = data_test[data_test.Age.isna()]
model = RandomForestRegressor(random_state=1)
X_train = age_with.drop(columns=['Age', 'Name', 'PassengerId', 'Ticket'])
y_train = age_with['Age']
model.fit(X_train, y_train)
# Write the predictions back with .loc to avoid chained assignment
data_test.loc[data_test.Age.isna(), 'Age'] = model.predict(
    age_without.drop(columns=['Age', 'Name', 'PassengerId', 'Ticket']))
data_test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 418 non-null int64
1 Pclass 418 non-null int64
2 Name 418 non-null object
3 Sex 418 non-null int64
4 Age 418 non-null float64
5 SibSp 418 non-null int64
6 Parch 418 non-null int64
7 Ticket 418 non-null object
8 Fare 418 non-null float64
9 Cabin 418 non-null object
10 Embarked 418 non-null int64
dtypes: float64(2), int64(6), object(3)
memory usage: 36.0+ KB
pred_test = model2.predict(data_test.drop(columns = ['PassengerId','Name','Ticket']))
result = pd.DataFrame()
result['PassengerId'] = data_test['PassengerId']
result['Survived'] = list(pred_test)
result.to_csv('D:/数据分析/实战项目/Kaggle_Titanic-master/test_result.csv',index = False)
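The final submission scored 0.78467 on Kaggle. That wraps up my first Kaggle exercise. There is still a long way to go compared with the top competitors; my main weakness is that I am not yet familiar with which models suit the problem. Feature engineering also needs much more work, and the order of operations matters: outliers and text data should be handled first, and only then should missing values be dropped, filled with predictions, or treated some other way. The features probably hold more information worth mining, for example using names to tell whether passengers belong to the same family and whether that affects the chance of survival. The modelling can be improved further as well, by trying other models or ensembles.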
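As a starting point for the name-based idea, here is a small sketch that pulls a surname and a title out of Name and adds a family-size count. The four rows are copied from the preview of train.csv shown earlier; the new column names Surname, Title and FamilySize are my own choice, and whether these features actually improve the score is untested here.

```python
import pandas as pd

# Names and counts copied from the first rows of train.csv shown earlier
sample = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Heikkinen, Miss. Laina",
        "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
        "Palsson, Master. Gosta Leonard",
    ],
    "SibSp": [1, 0, 1, 3],
    "Parch": [0, 0, 0, 1],
})

# The surname is everything before the first comma
sample["Surname"] = sample["Name"].str.split(",").str[0].str.strip()
# The title sits between the comma and the next period, e.g. "Mr" or "Miss"
sample["Title"] = sample["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
# Family size counts the passenger plus the relatives aboard
sample["FamilySize"] = sample["SibSp"] + sample["Parch"] + 1

print(sample[["Surname", "Title", "FamilySize"]])
```

Passengers sharing a Surname and a FamilySize greater than 1 could then be grouped as a rough guess at family membership, though unrelated passengers with the same surname would be merged by this rule.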