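1 Project Description
The goal of this competition is to predict the place where a person will check in. For the competition, Facebook created an artificial world containing roughly 100,000 places in a 10 km by 10 km area (100 km² in total). For a given set of coordinates, your task is to predict the user's next check-in place from the user's location, the location accuracy, and the timestamp. The data was fabricated to resemble location signals coming from mobile devices. Note: you may only use the provided data to make predictions.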
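2 Dataset Overview
File description: train.csv, test.csv
row_id: ID of the check-in event
x y: coordinates
accuracy: location accuracy
time: timestamp
place_id: the place that was checked in to; this is the target you need to predict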
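To make the column layout concrete, here is a minimal sketch that builds a tiny frame with the same columns as train.csv. All values (and the names `sample` and `feature_columns`) are invented for illustration and are not taken from the competition data.

```python
import pandas as pd

# Hypothetical rows that mimic the train.csv schema; the values are made up.
sample = pd.DataFrame(
    {
        "row_id": [0, 1, 2],
        "x": [2.10, 2.31, 2.45],
        "y": [2.05, 2.40, 2.20],
        "accuracy": [54, 13, 74],
        "time": [470702, 186555, 322648],
        "place_id": [1001, 1002, 1001],
    }
)

# Candidate features are everything except the identifier and the target.
feature_columns = [c for c in sample.columns if c not in ("row_id", "place_id")]
print(sample.dtypes)
print(feature_columns)
```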
3 Step-by-Step Analysis
import pandas as pd
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
3.1 Read the data
data=pd.read_csv("./data/train.csv")
data.head(5)
data.shape
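3.2 Basic data processing
3.2.1 Narrow the data range
# .copy() gives an independent frame, so adding columns later does not trigger SettingWithCopyWarning
partial_data=data.query("x>2.0 & x<2.5 & y>2.0 & y<2.5").copy()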
partial_data.head()
partial_data.shape
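As a rough illustration of what the `query` call does, the sketch below applies the same bounds to a small invented frame and compares the result with an explicit boolean mask. The frame `demo` and its values are assumptions made only for this example.

```python
import pandas as pd

# Invented coordinates; only the filtering logic matters here.
demo = pd.DataFrame({"x": [1.9, 2.1, 2.4, 2.6], "y": [2.2, 2.3, 2.6, 2.1]})

# String-based filter, as used in the text.
by_query = demo.query("x > 2.0 & x < 2.5 & y > 2.0 & y < 2.5")

# Equivalent boolean mask built from column comparisons.
mask = (demo["x"] > 2.0) & (demo["x"] < 2.5) & (demo["y"] > 2.0) & (demo["y"] < 2.5)
by_mask = demo[mask]

print(by_query)
```

Both forms keep only the rows whose x and y fall strictly inside the 2.0 to 2.5 window; which one to use is mostly a readability choice.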
3.2.2 Select time features
partial_data["time"].head()
time=pd.to_datetime(partial_data["time"],unit="s")
time.head()
# time is a Series, so its datetime fields are reached through the .dt accessor
partial_data["hour"]=time.dt.hour
partial_data["day"]=time.dt.day
partial_data["weekday"]=time.dt.weekday
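The time-feature step can be sketched on a few hand-picked timestamps. This example assumes, as the text does, that the values are seconds since the Unix epoch; the three timestamps are invented.

```python
import pandas as pd

# Three invented timestamps in seconds: 0 s, 1 h, and 25 h after the epoch.
raw = pd.Series([0, 3600, 90000])
ts = pd.to_datetime(raw, unit="s")

# A datetime Series exposes its components through the .dt accessor.
features = pd.DataFrame(
    {"hour": ts.dt.hour, "day": ts.dt.day, "weekday": ts.dt.weekday}
)
print(features)
```

Here `weekday` counts from Monday as 0, so 1970-01-01 (a Thursday) maps to 3.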
3.2.3 Remove places with few check-ins
place_count=partial_data.groupby("place_id").count()
place_count=place_count[place_count["row_id"]>3]
place_count.head()
partial_data=partial_data[partial_data["place_id"].isin(place_count.index)]
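The same filter can also be written with a per-row count, which avoids the intermediate table. This is a sketch of an alternative formulation on invented data, not the approach used in the text; the names `demo`, `min_checkins`, and `filtered` are hypothetical.

```python
import pandas as pd

# Invented check-ins: place 10 appears 5 times, place 20 twice, place 30 once.
demo = pd.DataFrame(
    {
        "row_id": range(8),
        "place_id": [10, 10, 10, 10, 20, 20, 30, 10],
    }
)

min_checkins = 3  # same threshold as the text: keep places with more than 3 check-ins

# transform("count") returns, for every row, the number of check-ins at that row's place.
counts = demo.groupby("place_id")["row_id"].transform("count")
filtered = demo[counts > min_checkins]
print(filtered)
```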
3.2.4 Choose the features and the target
x=partial_data[["x","y","accuracy","hour","day","weekday"]]
y=partial_data["place_id"]
3.2.5 Split the dataset
x_train,x_test,y_train,y_test=train_test_split(x,y,random_state=2,test_size=0.25)
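To see what `test_size=0.25` means in practice, the following sketch splits 100 invented samples with the same settings and prints the resulting shapes. The arrays `X_demo` and `y_demo` are placeholders, not competition data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 invented samples with 2 features each, and 4 arbitrary class labels.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100) % 4

# Same settings as in the text: 25% held out, fixed seed for reproducibility.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=2
)
print(X_tr.shape, X_te.shape)
```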
3.3 Feature engineering: feature preprocessing
transfer=StandardScaler()
x_train=transfer.fit_transform(x_train)
# Reuse the mean and standard deviation learned from the training set; do not refit on the test set
x_test=transfer.transform(x_test)
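The difference between fitting on the training set and merely transforming the test set is easy to check on a one-column toy example. The numbers below are invented; the point is that the test data is scaled with the training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[2.0], [4.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and standard deviation from train only
test_scaled = scaler.transform(test)        # applies the training statistics unchanged

print(scaler.mean_, scaler.scale_)
print(test_scaled.ravel())
```

A test value equal to the training mean (2.0) maps to 0, which would not generally hold if the scaler were refit on the test set.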
3.4 Machine learning
3.4.1 Instantiate an estimator
estimator=KNeighborsClassifier()
3.4.2 Grid search with cross-validation
param_grid={"n_neighbors":[3,5,7,9]}
estimator=GridSearchCV(estimator=estimator,param_grid=param_grid,cv=10,n_jobs=4)
3.4.3 Train the model
estimator.fit(x_train,y_train)
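One possible variation, shown here only as a sketch on synthetic data, is to wrap the scaler and the classifier in a `Pipeline` so that the scaler is refit inside each cross-validation fold. The text does not do this; the two synthetic clusters, the step names `scale` and `knn`, and `cv=5` are assumptions chosen for the example.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two well-separated synthetic clusters standing in for two places.
X_demo = np.vstack(
    [
        rng.normal(loc=0.0, scale=0.5, size=(40, 2)),
        rng.normal(loc=5.0, scale=0.5, size=(40, 2)),
    ]
)
y_demo = np.array([0] * 40 + [1] * 40)

# Scaling lives inside the pipeline, so each fold scales with its own training part.
pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])

# Pipeline parameters are addressed as <step name>__<parameter name>.
search = GridSearchCV(pipe, param_grid={"knn__n_neighbors": [3, 5, 7, 9]}, cv=5)
search.fit(X_demo, y_demo)

print(search.best_params_, search.best_score_)
```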
3.5 Model evaluation
3.5.1 Accuracy output
score_ret=estimator.score(x_test,y_test)
print("Model accuracy:\n",score_ret)
3.5.2 Predictions
y_predict=estimator.predict(x_test)
print("Predictions:\n",y_predict)
3.5.3 Other outputs
print("Best estimator:\n",estimator.best_estimator_)
print("Best cross-validation score:\n",estimator.best_score_)
print("All results:\n",estimator.cv_results_)
Best estimator:
KNeighborsClassifier()
Best cross-validation score:
0.36103403905372417
All results:
{'mean_fit_time': array([0.23680475, 0.24250968, 0.27274141, 0.28223894]), 'std_fit_time': array([0.04789135, 0.05988373, 0.05860585, 0.04193857]), 'mean_score_time': array([0.54308534, 0.54496615, 0.72975917, 0.7652776 ]), 'std_score_time': array([0.02659151, 0.07117335, 0.08149376, 0.08075544]), 'param_n_neighbors': masked_array(data=[3, 5, 7, 9],
mask=[False, False, False, False],
fill_value='?',
dtype=object), 'params': [{'n_neighbors': 3}, {'n_neighbors': 5}, {'n_neighbors': 7}, {'n_neighbors': 9}], 'split0_test_score': array([0.34725698, 0.36438884, 0.36612127, 0.3599615 ]), 'split1_test_score': array([0.34687199, 0.35938402, 0.35919153, 0.35688162]), 'split2_test_score': array([0.35052936, 0.35899904, 0.3574591 , 0.35784408]), 'split3_test_score': array([0.34898941, 0.36458133, 0.36246391, 0.35899904]), 'split4_test_score': array([0.35264678, 0.36381136, 0.36458133, 0.35803657]), 'split5_test_score': array([0.34513956, 0.3572666 , 0.35688162, 0.35514918]), 'split6_test_score': array([0.35437921, 0.36477382, 0.36554379, 0.36997113]), 'split7_test_score': array([0.35264678, 0.35880654, 0.3626564 , 0.36323388]), 'split8_test_score': array([0.34501348, 0.35618021, 0.35463997, 0.35309973]), 'split9_test_score': array([0.35021178, 0.36214863, 0.3567578 , 0.35309973]), 'mean_test_score': array([0.34936853, 0.36103404, 0.36062967, 0.35862765]), 'std_test_score': array([0.00310395, 0.00310417, 0.00392977, 0.00478577]), 'rank_test_score': array([4, 1, 2, 3])}
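The raw `cv_results_` dictionary is hard to read, so it can help to view it as a table. The sketch below rebuilds a small summary from the numbers printed above; the frame name `summary` and the choice of columns are assumptions for illustration.

```python
import pandas as pd

# Values copied from the cv_results_ dump above.
summary = pd.DataFrame(
    {
        "n_neighbors": [3, 5, 7, 9],
        "mean_test_score": [0.34936853, 0.36103404, 0.36062967, 0.35862765],
        "std_test_score": [0.00310395, 0.00310417, 0.00392977, 0.00478577],
        "rank_test_score": [4, 1, 2, 3],
    }
).sort_values("rank_test_score")

print(summary.to_string(index=False))

# The top-ranked row gives the selected number of neighbors.
best_k = int(summary["n_neighbors"].iloc[0])
print(best_k)
```

With a fitted search object, `pd.DataFrame(estimator.cv_results_)` produces the full table directly. The best setting here is n_neighbors=5, which is also the default for `KNeighborsClassifier`, and that is why the best estimator prints with no arguments.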