数据清洗及特征处理
依旧先导入numpy,pandas和数据
import pandas as pd
import numpy as np
df = pd.read_csv('train.csv')
df.head(3)
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
---|
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
---|
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
---|
BTW,我已经解决了相对路径无法导入的情况,可以分享一下解决方法,也许能帮到其他人。 【原因】:工作环境与数据不在同一目录 使用print(os.getcwd())可以查看当前工作环境,工作环境!=ipynb文件储存位置只有当工作环境和你的数据文件在同一目录才可以使用相对路径。 【解决方法】:使之相同,这里提供两种解决思路 1.在数据所在目录用anaconda3的CMD.exe Prompt打开jupyter notebook。 2.重新设置jupyter notebook默认工作环境。
什么是数据清洗
一般来说,数据中存在很多不好的点,比如缺失数据,异常数据等等。这一章我们还做不到清洗数据到完美,或者是说贴近实际。但是我们可以把其清洗到适合建模。i.e.我们只能将数据初步清洗到可以建模,但在这个数据下模型并不准确。
2.1 缺失值观察与处理
什么是缺失数据? 首先在pandas中,数据中的缺失值用浮点值NaN(Not a Number)表示,也称哨兵值。 在pandas代码里,我们将缺失值表示为NA(Not Available) 在后文中会详细介绍pandas自带的几种处理缺失数据的函数
2.1.1 检查缺失值
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
df.isnull().sum()
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
以上是两种寻找缺失值的方法
df[['Age','Cabin','Embarked']].head(3)
| Age | Cabin | Embarked |
---|
0 | 22.0 | NaN | S |
---|
1 | 38.0 | C85 | C |
---|
2 | 26.0 | NaN | S |
---|
2.1.2 缺失值的处理
方法 | 说明 |
---|
dropna | 过滤缺失值,可以设置容忍度(一行最多缺失n个),只关注特定列的缺失值进行筛选 | fillna | 用指定值或插值方法填充缺失数据 | isnull | 返回一个含有布尔值的对象,返回的对象与isnull方法所判断的对象(源对象)类型相同 | notnull | isnull的否定 |
下面给出三种类似的处理方法,缺失数据均用0替换
df[df['Age'] == None ]=0
df.head(3)
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
---|
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
---|
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
---|
df[df['Age'].isnull()]=0
df.head(3)
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
---|
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
---|
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
---|
df[df['Age'] == np.nan]=0
df.head(3)
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
---|
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
---|
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
---|
这三种方法都是类似思路,只是在寻找缺失值上有差异。个人认为,np所理解NaN更接近“缺失值”这个定义,而None和Null的定义我个人是含糊不清模棱两可的,加之害怕会过滤不干净,所以还是尽量用np.nan吧。
下面给出两种我感觉会非常常用的处理缺失数据的方法
df.dropna().head(3)
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
---|
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
---|
5 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0 | 0 | 0 | 0.0000 | 0 | 0 |
---|
df.fillna(0).head(3)
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | 0 | S |
---|
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
---|
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | 0 | S |
---|
这里就顺便附上pandas官方提供的说明性质的文档的网址: https://pandas.pydata.org/pandas-docs/stable/reference/index.html
2.2重复值的观察与处理
2.2.1 查看数据中的重复值
df[df.duplicated()]
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|
17 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0 | 0 | 0 | 0.0 | 0 | 0 |
---|
19 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0 | 0 | 0 | 0.0 | 0 | 0 |
---|
26 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0 | 0 | 0 | 0.0 | 0 | 0 |
---|
28 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0 | 0 | 0 | 0.0 | 0 | 0 |
---|
29 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0 | 0 | 0 | 0.0 | 0 | 0 |
---|
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
---|
859 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0 | 0 | 0 | 0.0 | 0 | 0 |
---|
863 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0 | 0 | 0 | 0.0 | 0 | 0 |
---|
868 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0 | 0 | 0 | 0.0 | 0 | 0 |
---|
878 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0 | 0 | 0 | 0.0 | 0 | 0 |
---|
888 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0 | 0 | 0 | 0.0 | 0 | 0 |
---|
176 rows × 12 columns
2.2.2 处理数据中的重复值
df = df.drop_duplicates()
df.head(4)
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
---|
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
---|
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
---|
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
---|
这是清除数据中的重复值
2.2.3 将清除后的数据保存
df.to_csv('test_clear.csv')
2.3 特征观察及处理
特征的分类: 1.数值型特征(离散型,连续型) 2.文本类特征 数值型特征一般可以直接用于模型的训练,有时为了保证模型的稳定性和耐操性(鲁棒性:robustness)会将连续数据离散化再处理 文本型特折需要转成数值型特征再用于建模分析
2.3.1 对年龄分箱(离散化)处理
分箱操作: 一般的,连续数据的性质可能会呈现离散型(类似分段函数,在各个区间内连续,区间连接处不连续)将连续变量离散化成多个区间,这就是分箱操作,箱子就是各个区间。可以增强模型的适应性。
df['AgeBand'] = pd.cut(df['Age'],5,labels=[1,2,3,4,5])
df.head()
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | AgeBand |
---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | 2 |
---|
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 3 |
---|
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | 2 |
---|
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | 3 |
---|
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | 3 |
---|
df.to_csv('test_ave.csv')
df['AgeBand'] = pd.cut(df['Age'],[0,5,15,30,50,80],labels = [1,2,3,4,5])
df.head()
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | AgeBand |
---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | 3 |
---|
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 4 |
---|
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | 3 |
---|
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | 4 |
---|
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | 4 |
---|
df.to_csv('test_cut.csv')
df['AgeBand'] = pd.qcut(df['Age'],[0,0.1,0.3,0.5,0.7,0.9],labels = [1,2,3,4,5])
df.head()
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | AgeBand |
---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | 2 |
---|
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 5 |
---|
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | 3 |
---|
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | 4 |
---|
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | 4 |
---|
df.to_csv('test_pr.csv')
【注释】 1.cut规定的是每个箱子代表的区间 2.qcut规定的是每个箱子中对象个数占数据集整体的比例,并且是有序的。例如前10%,10%-20%,20%-40%,40%-60%,前两组所占比例相同,箱子中对象个数也就相同。 3.两种方法都可以只规定部分对象所属的箱子,不需要全部对象都在箱子中。 **【cut】**https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html **【qcut】**https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.qcut.html
2.3.2 对文本变量进行转换
(1)查看文本变量名称种类数量
df['Sex'].value_counts()
male 453
female 261
0 1
Name: Sex, dtype: int64
(2)将文本变量转换为数字
df['Sex_num'] = df['Sex'].replace(['male','female'],[1,2])
df.head()
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | AgeBand | Sex_num |
---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | 2 | 1 |
---|
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 5 | 2 |
---|
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | 3 | 2 |
---|
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | 4 | 2 |
---|
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | 4 | 1 |
---|
df['Sex_num'] = df['Sex'].map({'male':1,'female':2})
df.head()
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | AgeBand | Sex_num |
---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | 2 | 1.0 |
---|
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 5 | 2.0 |
---|
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | 3 | 2.0 |
---|
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | 4 | 2.0 |
---|
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | 4 | 1.0 |
---|
from sklearn.preprocessing import LabelEncoder
for feat in ['Cabin','Ticket']:
lbl = LabelEncoder()
label_dict = dict(zip(df[feat].unique(),range(df[feat].nunique())))
df[feat + '_labelEncode'] = df[feat].map(label_dict)
df[feat + '_labelEncode'] = lbl.fit_transform(df[feat].astype(str))
df.head()
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | AgeBand | Sex_num | Cabin_labelEncode | Ticket_labelEncode |
---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | 2 | 1.0 | 135 | 409 |
---|
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 5 | 2.0 | 74 | 472 |
---|
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | 3 | 2.0 | 135 | 533 |
---|
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | 4 | 2.0 | 50 | 41 |
---|
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | 4 | 1.0 | 135 | 374 |
---|
(3)将类别文本转换为one-hot编码
for feat in ['Age','Embarked']:
x = pd.get_dummies(df[feat],prefix=feat)
df = pd.concat([df,x],axis=1)
df.head()
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | ... | Age_66.0 | Age_70.0 | Age_70.5 | Age_71.0 | Age_74.0 | Age_80.0 | Embarked_0 | Embarked_C | Embarked_Q | Embarked_S |
---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
---|
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
---|
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
---|
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
---|
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
---|
5 rows × 202 columns
2.3.3 从Name特征里提取出Titles特征(Mr. Mrs. Miss)
df['Title'] = df.Name.str.extract('([A-Za-z]+)\.', expand=False)
df.head()
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | ... | Age_70.0 | Age_70.5 | Age_71.0 | Age_74.0 | Age_80.0 | Embarked_0 | Embarked_C | Embarked_Q | Embarked_S | Title |
---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | Mr |
---|
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | Mrs |
---|
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | Miss |
---|
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | Mrs |
---|
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | Mr |
---|
5 rows × 203 columns
df.to_csv('test_fin.csv')
数据重构笔记
|