1. Finding missing values
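df.isnull()                           # Boolean mask: True where a value is missing
df.isnull().any(axis=0)               # per column: does it contain any missing value?
df.isnull().any(axis=1)               # per row: does it contain any missing value?
df.isnull().sum()                     # number of missing values in each column
df.isnull().sum(axis=1)               # number of missing values in each row
df.isnull().sum().sum()               # total number of missing values
df['column_name'].isnull().sum(axis=0)  # number of missing values in one column
df[df.isnull().values==True]          # rows that contain missing values
df[df.isnull().T.any()]               # rows that contain missing values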
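Comparison: the last two expressions both select the rows that contain missing values. The first indexes with a 2-D Boolean array, so a row with several missing values is returned once per missing value; the second returns each such row once.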
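To make the calls above concrete, here is a minimal sketch on a small made-up DataFrame. The data is hypothetical; the column names `Attr_A`, `Attr_B` and `sum` are borrowed from later examples in these notes.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, not taken from the original notes.
df = pd.DataFrame({
    'Attr_A': [1.0, np.nan, 3.0, 4.0],
    'Attr_B': [np.nan, np.nan, 6.0, 8.0],
    'sum':    [1.0, np.nan, 9.0, 12.0],
})

per_column = df.isnull().sum()          # missing values in each column
per_row = df.isnull().sum(axis=1)       # missing values in each row
total = int(df.isnull().sum().sum())    # missing values in the whole frame

# Each row with at least one missing value, listed once.
rows_once = df[df.isnull().any(axis=1)]

# Indexing with the 2-D Boolean array lists a row once per missing value.
rows_repeated = df[df.isnull().values == True]

print(per_column)
print(rows_once.index.tolist())
print(rows_repeated.index.tolist())
```

In this sample, row 1 has three missing values, so it appears three times in `rows_repeated` but only once in `rows_once`.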
2. Dropping missing values
pandas.dropna() official documentation
DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
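- axis: 0 drops rows, 1 drops columns
- how: whether a row or column is removed when it has at least one NA value, or only when all of its values are NA
  - any: drop the row/column if it contains any missing value
  - all: drop the row/column only if all of its values are missing
- thresh: int, keep only the rows/columns that have at least this many non-NA values
- subset: list of labels; when dropping rows, only these columns are checked for missing values
- inplace: whether to modify the original DataFrame instead of returning a new one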
df.dropna(thresh=2)                  # keep rows with at least 2 non-missing values
df.dropna(subset=['Attr_B','sum'])   # drop rows with a missing value in Attr_B or sum
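The effect of `how`, `thresh` and `subset` is easiest to see side by side. The sketch below uses hypothetical sample data and compares which row labels survive each call.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, not taken from the original notes.
df = pd.DataFrame({
    'Attr_A': [1.0, np.nan, 3.0, np.nan, 5.0],
    'Attr_B': [2.0, np.nan, np.nan, 8.0, np.nan],
    'sum':    [3.0, np.nan, np.nan, 8.0, 5.0],
})

any_kept = df.dropna(how='any')                    # only fully populated rows remain
all_kept = df.dropna(how='all')                    # only the all-missing row is removed
thresh_kept = df.dropna(thresh=2)                  # rows with >= 2 non-missing values remain
subset_kept = df.dropna(subset=['Attr_B', 'sum'])  # only Attr_B and sum are checked

for name, result in [('any', any_kept), ('all', all_kept),
                     ('thresh=2', thresh_kept), ('subset', subset_kept)]:
    print(name, result.index.tolist())
```

None of these calls changes `df` itself, because `inplace` defaults to False.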
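3. Filling missing values
- Fill with a separate category label, the mean, or the median
- Fill with the most probable value, for example by K-nearest-neighbors imputation
pandas.fillna() official documentation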
DataFrame.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None)
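- value: the value to fill with (a scalar, or a dict/Series/DataFrame giving one value per column)
- method: {'backfill', 'bfill', 'pad', 'ffill', None}, default None (deprecated since pandas 2.1 in favor of DataFrame.ffill() and DataFrame.bfill())
  - backfill/bfill: fill with the next valid value
  - pad/ffill: fill with the previous valid value
- axis: 0 fills along the index (down each column), 1 fills along the columns (across each row)
- limit: int, maximum number of missing values to fill; with a method, the maximum number of consecutive missing values
- downcast: dict of column -> dtype to downcast to where possible, or the string 'infer'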
Filling with a fixed value
df.fillna(999)                    # replace every missing value with 999
df['sum']=df['sum'].fillna(999)   # fill a single column only
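The fill value does not have to be a hard-coded constant. As a minimal sketch of filling with the mean, the median, or a separate category label, assuming hypothetical sample data and a made-up `label` column:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the 'label' column is invented for illustration.
df = pd.DataFrame({
    'Attr_A': [1.0, np.nan, 3.0, 8.0],
    'label':  ['x', None, 'y', 'x'],
})

# Statistics are computed over the non-missing values only.
mean_filled = df['Attr_A'].fillna(df['Attr_A'].mean())      # mean of 1, 3, 8
median_filled = df['Attr_A'].fillna(df['Attr_A'].median())  # median of 1, 3, 8

# A dedicated label keeps "missing" distinguishable from the real categories.
label_filled = df['label'].fillna('unknown')

print(mean_filled.tolist())
print(median_filled.tolist())
print(label_filled.tolist())
```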
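Filling with neighboring values from the DataFrame
df.fillna(method='bfill',axis=0)   # take the next valid value below, in the same column
df.fillna(method='ffill',axis=0)   # take the previous valid value above, in the same column
df.fillna(method='bfill',axis=1)   # take the next valid value to the right, in the same row
df.fillna(method='ffill',axis=1)   # take the previous valid value to the left, in the same row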
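Recent pandas releases (2.1 and later) deprecate the `method` argument of `fillna`, so the same fills can be written with `DataFrame.ffill()` and `DataFrame.bfill()`. The sketch below assumes hypothetical sample data and shows the direction of each fill as well as the `limit` parameter.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, not taken from the original notes.
df = pd.DataFrame({
    'Attr_A': [1.0, np.nan, np.nan, 4.0],
    'Attr_B': [np.nan, 2.0, np.nan, np.nan],
})

forward = df.ffill()          # previous valid value in the same column
backward = df.bfill()         # next valid value in the same column
across = df.ffill(axis=1)     # previous valid value in the same row
limited = df.ffill(limit=1)   # fill at most one consecutive gap per run

print(forward)
print(backward)
print(across)
print(limited)
```

A gap with no valid value in the fill direction stays missing. Here the first entry of `Attr_B` is still NaN after `ffill()`, so a directional fill may need a fixed-value fill afterwards.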
Filling missing values with sklearn
sklearn.impute.SimpleImputer official documentation
import numpy as np
from sklearn.impute import SimpleImputer
sums=df['sum'].values.reshape(-1,1)   # SimpleImputer expects a 2-D array
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
df['sum']=imp_mean.fit_transform(sums).ravel()   # flatten back to one column
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')            # column mean
imp_median = SimpleImputer(missing_values=np.nan, strategy='median')        # column median
imp_mode = SimpleImputer(missing_values=np.nan, strategy='most_frequent')   # column mode
imp_const = SimpleImputer(missing_values=np.nan, strategy='constant',fill_value=0)   # fixed value 0
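SimpleImputer can also process several columns at once, computing the statistic separately for each column. A minimal sketch comparing the four strategies on hypothetical sample data (the `results` dictionary is only a convenience for this example):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical sample data, not taken from the original notes.
df = pd.DataFrame({
    'Attr_A': [1.0, np.nan, 3.0, 8.0],
    'Attr_B': [2.0, 2.0, np.nan, 5.0],
})

results = {}
for strategy in ['mean', 'median', 'most_frequent']:
    imputer = SimpleImputer(missing_values=np.nan, strategy=strategy)
    # fit_transform returns a NumPy array, so the column names are restored here.
    results[strategy] = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

const_imputer = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=0)
results['constant'] = pd.DataFrame(const_imputer.fit_transform(df), columns=df.columns)

for strategy, filled in results.items():
    print(strategy)
    print(filled)
```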
Filling missing values with KNN
sklearn.impute.KNNImputer official documentation
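from sklearn.impute import KNNImputer
sums=df['sum'].values.reshape(-1,1)   # KNNImputer expects a 2-D array
imp_knn = KNNImputer(n_neighbors=2)
# Note: with a single column there are no other features to measure distance on, so this in effect fills with the column mean
df['sum']=imp_knn.fit_transform(sums).ravel()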
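KNN imputation is more informative when the imputer sees other columns too, since it finds neighbors by comparing rows on the features they both have. A minimal sketch on hypothetical sample data, with the default uniform weighting:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical sample data, not taken from the original notes.
df = pd.DataFrame({
    'Attr_A': [1.0, 2.0, 3.0, 10.0],
    'Attr_B': [1.0, 2.0, 3.0, 10.0],
    'sum':    [2.0, np.nan, 6.0, 20.0],
})

imp_knn = KNNImputer(n_neighbors=2)
# Pass all columns so that Attr_A and Attr_B determine which rows are nearest.
filled = pd.DataFrame(imp_knn.fit_transform(df), columns=df.columns)

print(filled)
print(df['sum'].mean())   # column mean, for comparison with the KNN fill
```

Row 1 is closest to rows 0 and 2 on `Attr_A` and `Attr_B`, so its missing `sum` becomes the average of their `sum` values rather than the column mean, which the distant row 3 pulls upward.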