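[Reference: New York City Taxi Fare Prediction | Kaggle]
[Reference: Exploring New York City Taxi Big Data, Based on the Kaggle Competition | @Irene's blog, CSDN] (the source drawn on most heavily). Code: [Reference: 2. Machine Learning in Practice: New York Taxi Fare Prediction | bilibili]
[Reference: Kaggle: New York City Taxi Fare Prediction | qq_28584559's blog, CSDN]
Code: [Reference: 机器学习/Kaggle/出租车车费预测/版本一车费预测实战.ipynb · myaijarvis/AI, Gitee (OSChina)]
This one is also good: [Reference: Cleansing+EDA+Modelling(LGBM + XGBoost starters) | Kaggle]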
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
1. Data import
train = pd.read_csv("data/train.csv", nrows=1000000)
test = pd.read_csv("data/test.csv")
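2. Data inspection and cleaning
Inspect and clean as we go.
2.1 Overview
The two summary tables above show outliers (look at the min and max values); these are handled in the following sections.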
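Summary tables of this kind are typically produced with DataFrame.describe(). As a minimal sketch, using a small made-up frame (the values are illustrative and not taken from train.csv), this shows how the min and max rows expose out-of-range values:

```python
import pandas as pd

# Hypothetical toy data, not from the competition files:
# one negative fare and one impossible latitude.
toy = pd.DataFrame({
    "fare_amount": [4.5, 16.9, -3.0, 7.7],
    "pickup_latitude": [40.72, 40.71, 401.08, 40.76],
})

summary = toy.describe()
# The min and max rows are where out-of-range values show up first.
print(summary.loc[["min", "max"]])
```

The same call on train and test gives the per-column count, mean, std, min, quartiles, and max.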
2.2 Missing values
train.isnull().sum().sort_values(ascending=False)
test.isnull().sum().sort_values(ascending=False)
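# Drop every row that has at least one missing value (axis must be passed by keyword in recent pandas)
train.drop(train[train.isnull().any(axis=1)].index, axis=0, inplace=True)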
train.shape
(999990, 8)
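Dropping rows by index as above is equivalent to calling dropna(). A minimal sketch on a made-up frame (the column values are hypothetical) that counts the affected rows first and then removes them:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the second row is missing its dropoff coordinates.
toy = pd.DataFrame({
    "fare_amount": [4.5, 16.9, 5.7],
    "dropoff_longitude": [-73.98, np.nan, -73.99],
    "dropoff_latitude": [40.75, np.nan, 40.76],
})

# Number of rows with at least one missing value.
n_missing_rows = int(toy.isnull().any(axis=1).sum())

# dropna() removes exactly those rows and returns a new frame.
cleaned = toy.dropna()
print(n_missing_rows, cleaned.shape)
```

Counting before dropping makes it easy to confirm that only a handful of rows is lost, as is the case here (10 out of 1,000,000).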
2.3 Outliers
Check the fare, fare_amount
train['fare_amount'].describe()
from collections import Counter
Counter(train['fare_amount']<0)
Counter({False: 999952, True: 38})
train.drop(train[train['fare_amount']<0].index, axis=0, inplace=True)
train[train.fare_amount<100].fare_amount.hist(bins=100, figsize=(14,3))
plt.xlabel('fare $USD')
plt.title("Histogram")
Check the passenger count, passenger_count
train['passenger_count'].describe()
train[train['passenger_count']>6]
train.drop(train[train['passenger_count']>6].index, axis=0, inplace=True)
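Check the pickup longitude and latitude
- Latitude range: -90 to 90
- Longitude range: -180 to 180
A quick Google search confirms that latitude ranges from -90 to 90 and longitude from -180 to 180. The summary statistics below clearly contain values outside these ranges, so we delete them. After the deletions, the training set has 999,928 records.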
train['pickup_latitude'].describe()
train[train['pickup_latitude']<-90]
train[train['pickup_latitude']>90]
train.drop(train[(train['pickup_latitude']<-90) | (train['pickup_latitude']>90)].index, axis=0, inplace=True)
train['pickup_longitude'].describe()
train[train['pickup_longitude']<-180]
train.drop(train[train['pickup_longitude']<-180].index, axis=0, inplace=True)
Check the dropoff longitude and latitude
train.drop(train[(train['dropoff_latitude']<-90) | (train['dropoff_latitude']>90)].index, axis=0, inplace=True)
train.drop(train[(train['dropoff_longitude']<-180) | (train['dropoff_longitude']>180)].index, axis=0, inplace=True)
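The four range checks can also be expressed as a single boolean mask. The sketch below is one possible way to do it; the helper name valid_coordinates and the toy values are assumptions for illustration, not part of the original notebook:

```python
import pandas as pd

def valid_coordinates(df, lat_cols, lon_cols):
    """Return a boolean mask that is True where every coordinate is in range."""
    mask = pd.Series(True, index=df.index)
    for col in lat_cols:
        mask &= df[col].between(-90, 90)    # inclusive on both ends
    for col in lon_cols:
        mask &= df[col].between(-180, 180)
    return mask

# Hypothetical toy data: row 1 has a bad latitude, row 2 a bad longitude.
toy = pd.DataFrame({
    "pickup_latitude": [40.72, 401.08, 40.76],
    "pickup_longitude": [-73.84, -73.95, -736.55],
    "dropoff_latitude": [40.71, 40.75, 40.74],
    "dropoff_longitude": [-73.84, -73.99, -73.98],
})

mask = valid_coordinates(
    toy,
    lat_cols=["pickup_latitude", "dropoff_latitude"],
    lon_cols=["pickup_longitude", "dropoff_longitude"],
)
filtered = toy[mask]
print(filtered)
```

Note that between() evaluates to False for missing values, so rows with NaN coordinates are filtered out as well.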
2.4 Data types
train.dtypes
for dataset in [train, test]:
    dataset['key'] = pd.to_datetime(dataset['key'])
    dataset['pickup_datetime'] = pd.to_datetime(dataset['pickup_datetime'])
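Processing the date data
Now for the EDA. These are the questions I want to answer: Does the passenger count affect the fare? Do the pickup date and time affect the fare? Does the day of the week affect the fare? Does the trip distance affect the fare?
Split the date into: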
- year
- month
- day
- hour
- day of week
dataset.shape
(9914, 7)
dataset.head()
type(dataset['pickup_datetime'][0])
pandas._libs.tslibs.timestamps.Timestamp
for dataset in [train, test]:
    dataset['year'] = dataset['pickup_datetime'].dt.year
    dataset['month'] = dataset['pickup_datetime'].dt.month
    dataset['day'] = dataset['pickup_datetime'].dt.day
    dataset['hour'] = dataset['pickup_datetime'].dt.hour
    dataset['day of week'] = dataset['pickup_datetime'].dt.dayofweek
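When reading the resulting columns it helps to know that dt.dayofweek numbers the days from Monday = 0 to Sunday = 6. A small sketch with two made-up timestamps (hypothetical values, chosen only because their weekdays are easy to check):

```python
import pandas as pd

# Hypothetical timestamps: 2009-06-15 was a Monday, 2010-01-05 a Tuesday.
toy = pd.DataFrame({
    "pickup_datetime": pd.to_datetime(["2009-06-15 17:26:21", "2010-01-05 16:52:16"]),
})

parts = pd.DataFrame({
    "year": toy["pickup_datetime"].dt.year,
    "month": toy["pickup_datetime"].dt.month,
    "day": toy["pickup_datetime"].dt.day,
    "hour": toy["pickup_datetime"].dt.hour,
    "day of week": toy["pickup_datetime"].dt.dayofweek,  # Monday = 0, Sunday = 6
})
print(parts)
```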
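Computing the distance from latitude and longitude
With the pickup timestamp field pickup_datetime split into year, month, day, hour, and day of week, the next step is to compute the distance between the pickup point and the dropoff point using the Haversine formula. Given the latitudes and longitudes of two points, the formula yields the distance between them along the surface of a sphere (see https://en.wikipedia.org/wiki/Haversine_formula ). The haversine function itself is haversine(θ) = sin²(θ/2). The calculation uses the latitudes, the longitudes, and the Earth's radius R (mean radius = 6,371 km). The result is shown in the figure.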
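Written out in full, the distance computation takes the form below. The symbols are a notational assumption rather than something given in the source: φ denotes latitude and λ longitude (both in radians), Δφ and Δλ are the differences between the two points, and R is the sphere's radius.

```latex
\begin{aligned}
a &= \sin^2\!\left(\frac{\Delta\varphi}{2}\right)
   + \cos\varphi_1 \,\cos\varphi_2 \,\sin^2\!\left(\frac{\Delta\lambda}{2}\right), \\
d &= 2R \,\operatorname{atan2}\!\left(\sqrt{a},\ \sqrt{1-a}\right)
\end{aligned}
```

The first term is the haversine of the latitude difference, the second is the haversine of the longitude difference scaled by the cosines of the two latitudes, and d is the resulting great-circle distance in the same unit as R.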
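def distance(lat1, long1, lat2, long2):
    '''
    Add a Haversine distance column, H_Distance (in km), to both train and test.

    :param lat1: column name of the pickup latitude (str); likewise below
    :param long1: column name of the pickup longitude
    :param lat2: column name of the dropoff latitude
    :param long2: column name of the dropoff longitude
    :return: distances for the last dataset processed (test), as a pandas Series
    '''
    R = 6371  # mean Earth radius in km
    for i in [train, test]:
        phi1 = np.radians(i[lat1])
        phi2 = np.radians(i[lat2])
        delta_phi = np.radians(i[lat2] - i[lat1])
        delta_lambda = np.radians(i[long2] - i[long1])
        # Haversine formula
        a = np.sin(delta_phi / 2.0) ** 2 + np.cos(phi1) * np.cos(phi2) * np.sin(delta_lambda / 2.0) ** 2
        c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))
        d = R * c
        i['H_Distance'] = d
    return d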
distance('pickup_latitude','pickup_longitude','dropoff_latitude','dropoff_longitude')
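As a sanity check on the formula, here is a standalone variant that takes coordinates directly and returns the distance instead of modifying train and test. The function name haversine_km is an assumption for illustration; the arithmetic is the same as in distance() above.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, as used in the text

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in decimal degrees.

    Accepts scalars, NumPy arrays, or pandas Series.
    """
    phi1 = np.radians(lat1)
    phi2 = np.radians(lat2)
    delta_phi = np.radians(lat2 - lat1)
    delta_lambda = np.radians(lon2 - lon1)
    a = np.sin(delta_phi / 2.0) ** 2 + np.cos(phi1) * np.cos(phi2) * np.sin(delta_lambda / 2.0) ** 2
    return 2 * EARTH_RADIUS_KM * np.arctan2(np.sqrt(a), np.sqrt(1 - a))

# Identical points are 0 km apart; one degree of latitude is R * pi / 180 km.
same_point = haversine_km(40.0, -74.0, 40.0, -74.0)
one_degree = haversine_km(0.0, 0.0, 1.0, 0.0)
print(same_point, one_degree)
```

Because it only uses NumPy ufuncs, the same function can be applied to whole columns, for example haversine_km(train['pickup_latitude'], train['pickup_longitude'], train['dropoff_latitude'], train['dropoff_longitude']).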
train[(train['H_Distance']==0) & (train['fare_amount']==0)]
train.drop(train[(train['H_Distance']==0) & (train['fare_amount']==0)].index, axis=0, inplace=True)
len(train[(train['H_Distance']==0) & (train['fare_amount']!=0)])
28477
train.drop(train[(train['H_Distance']==0) & (train['fare_amount']!=0)].index, axis=0, inplace=True)
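Improvements
[Reference: Cleansing+EDA+Modelling(LGBM + XGBoost starters) | Kaggle]
New field: fare per kilometer
Compute the fare per kilometer from the distance and the fare. H_Distance is in kilometers (R = 6371 km), so the ratio is a per-kilometer rate, and the column and axis label are named accordingly.
train['fare_per_km'] = train.fare_amount / train.H_Distance
train.fare_per_km.describe()
train.pivot_table('fare_per_km', index='hour', columns='year').plot(figsize=(14, 6))
plt.ylabel('Fare $USD/km')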
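Since H_Distance is computed with R = 6371 km, dividing the fare by it gives a rate per kilometer. If a per-mile figure is wanted, for example to compare with the referenced Kaggle notebook, the rate has to be multiplied by the number of kilometers in a mile. A minimal sketch; the helper name is hypothetical:

```python
KM_PER_MILE = 1.609344  # exact definition of the international mile

def per_km_to_per_mile(rate_per_km):
    """Convert a rate in $/km to $/mile."""
    return rate_per_km * KM_PER_MILE

# A rate of 2 $/km corresponds to about 3.22 $/mile.
rate_per_mile = per_km_to_per_mile(2.0)
print(round(rate_per_mile, 2))
```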
3. Model training and prediction
3.1 Feature selection
Select the feature columns and build the training data.
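# Features (columns 3-13): pickup/dropoff coordinates, passenger_count, date parts, H_Distance
X_train = train.iloc[:, [3,4,5,6,7,8,9,10,11,12,13]]
# Target (column 1): fare_amount, selected as a 1-D Series, the shape scikit-learn expects for y
y_train = train.iloc[:, 1]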
3.2 Model selection, training, prediction, and evaluation
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor()
rf.fit(X_train, y_train)
rf_predict = rf.predict(test.iloc[:, [2,3,4,5,6,7,8,9,10,11,12]])
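The test set has no labels, so the steps above do not yield an evaluation score on their own. One common approach, sketched here under assumptions, is to hold out part of the training data and compute the RMSE (the metric the competition uses) on it. The data below is synthetic and hypothetical, and n_estimators=50 is chosen only to keep the example fast:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical synthetic data: the fare grows roughly linearly with distance.
distance_km = rng.uniform(0.5, 20.0, size=500)
fare = 2.5 + 1.6 * distance_km + rng.normal(0.0, 1.0, size=500)
X = distance_km.reshape(-1, 1)

# Hold out 20% of the rows for validation.
X_fit, X_val, y_fit, y_val = train_test_split(X, fare, test_size=0.2, random_state=0)

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X_fit, y_fit)

# RMSE on the held-out rows.
rmse = float(np.sqrt(mean_squared_error(y_val, model.predict(X_val))))
print(f"validation RMSE: {rmse:.2f}")
```

With the real data, X and fare would be replaced by X_train and y_train, and the model can be refit on all rows before predicting on test.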
3.3 Generate the results and submit
submission = pd.read_csv("data/sample_submission.csv")
submission.head()
submission['fare_amount'] = rf_predict
submission.to_csv("submission_1.csv", index=False)
submission.head()