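2021 CCF BDCI Book Recommendation System Competition baseline: ItemCF
Competition link
This is a book recommender built with the most basic item-based collaborative filtering (ItemCF) algorithm.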
1. Imports
import random
import numpy as np
import pandas as pd
import math
from operator import itemgetter
import logging
logging.basicConfig(level=logging.INFO, format='%(asctime)-15s %(levelname)s: %(message)s')
2. Load the data
path = '/Users/Desktop/比赛/图书推荐系统'
train = pd.read_csv(path + '/dataset/train_dataset.csv')
test = pd.read_csv(path + '/dataset/test_dataset.csv')
sub = pd.read_csv(path + '/dataset/submission.csv')
logging.info("Finished loading data")
data = train.copy()
# Treat every recorded interaction as implicit feedback with a rating of 1
data['rating'] = 1
data.head(5)
data.pivot(index='user_id', columns='item_id', values='rating')
3. Split the dataset
trainSet, testSet = {}, {}
trainSet_len, testSet_len = 0, 0
pivot = 0.75
# Send each interaction to trainSet with probability `pivot`, otherwise to testSet
for ele in data.itertuples():
    user, item, rating = getattr(ele, 'user_id'), getattr(ele, 'item_id'), getattr(ele, 'rating')
    if random.random() < pivot:
        trainSet.setdefault(user, {})
        trainSet[user][item] = rating
        trainSet_len += 1
    else:
        testSet.setdefault(user, {})
        testSet[user][item] = rating
        testSet_len += 1
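The split above draws from the global random state, so the train/test partition changes on every run. A minimal sketch of a reproducible variant follows; the helper name `split_interactions` and the seed value 42 are assumptions for illustration and do not appear in the original notebook.

```python
import random

def split_interactions(interactions, pivot=0.75, seed=42):
    """Randomly split (user, item, rating) triples into two nested dicts.

    `split_interactions` and `seed` are illustrative assumptions, not part
    of the original notebook.
    """
    rng = random.Random(seed)  # private generator, so the split is repeatable
    train_set, test_set = {}, {}
    for user, item, rating in interactions:
        target = train_set if rng.random() < pivot else test_set
        target.setdefault(user, {})[item] = rating
    return train_set, test_set

# Toy interactions standing in for the rows of `data`
toy = [(u, i, 1) for u in range(5) for i in range(4)]
train_a, test_a = split_interactions(toy)
train_b, test_b = split_interactions(toy)  # same seed, same partition
```

With the notebook's DataFrame, the same helper could be called as `split_interactions(zip(data['user_id'], data['item_id'], data['rating']))`.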
4. Compute item similarity
# Count how many users interacted with each item (item popularity)
item_popular = {}
for user, items in trainSet.items():
    for item in items:
        if item not in item_popular:
            item_popular[item] = 0
        item_popular[item] += 1
item_count = len(item_popular)
print('Total item number = %d' % item_count)
print('Build user co-rated items matrix ...')
# Count, for every ordered pair of distinct items, the users who interacted with both
item_sim_matrix = {}
for user, items in trainSet.items():
    for m1 in items:
        for m2 in items:
            if m1 == m2:
                continue
            item_sim_matrix.setdefault(m1, {})
            item_sim_matrix[m1].setdefault(m2, 0)
            item_sim_matrix[m1][m2] += 1
# Normalize each co-occurrence count by the geometric mean of the two items' popularity
for m1, related_items in item_sim_matrix.items():
    for m2, count in related_items.items():
        if item_popular[m1] == 0 or item_popular[m2] == 0:
            item_sim_matrix[m1][m2] = 0
        else:
            item_sim_matrix[m1][m2] = count / math.sqrt(item_popular[m1] * item_popular[m2])
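The similarity computed in this step divides the number of users two items share by the square root of the product of the two items' popularity. The sketch below reproduces that calculation on toy data so the numbers can be checked by hand; the function name `item_similarity` and the users and items are hypothetical.

```python
import math
from collections import defaultdict
from itertools import permutations

def item_similarity(train_set):
    """Co-occurrence counts normalized by the geometric mean of item popularity."""
    popularity = defaultdict(int)
    co_counts = defaultdict(lambda: defaultdict(int))
    for items in train_set.values():
        for item in items:
            popularity[item] += 1
        for i, j in permutations(items, 2):  # ordered pairs with i != j
            co_counts[i][j] += 1
    return {
        i: {j: c / math.sqrt(popularity[i] * popularity[j]) for j, c in related.items()}
        for i, related in co_counts.items()
    }

# Hypothetical toy data: three users, three items
toy_train = {
    "u1": {"A": 1, "B": 1},
    "u2": {"A": 1, "B": 1, "C": 1},
    "u3": {"B": 1, "C": 1},
}
sim = item_similarity(toy_train)
```

Here A and B are shared by two users and have popularity 2 and 3, so `sim["A"]["B"]` is 2/sqrt(6), roughly 0.816. A and C are shared by one user and both have popularity 2, so `sim["A"]["C"]` is 0.5.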
5. Generate recommendation lists
user_lst = test['user_id'].tolist()
k = 198
n = 10
result = []
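for user in user_lst:
    rank = {}
    # Users whose interactions all fell into testSet have no history here
    watched_items = trainSet.get(user, {})
    for item, rating in watched_items.items():
        # Take the k items most similar to each item in the user's history
        neighbours = sorted(item_sim_matrix.get(item, {}).items(), key=itemgetter(1), reverse=True)[:k]
        for related_item, w in neighbours:
            if related_item in watched_items:
                continue
            rank.setdefault(related_item, 0)
            rank[related_item] += w * float(rating)
    # Keep the n highest-scoring candidates as (item_id, score) pairs
    rec_items = sorted(rank.items(), key=itemgetter(1), reverse=True)[:n]
    for i in list(rec_items):
        result.append(i)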
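Because the split in step 3 is random, a user from the test file may have little or no history in trainSet, and the loop above can then yield fewer than n items for that user. One possible way to handle this, sketched below, is to pad the list with the most popular unseen items. The `recommend` helper and the popularity fallback are assumptions added for illustration, not the method used in the original notebook.

```python
from operator import itemgetter

def recommend(user, train_set, sim_matrix, popular_items, k=198, n=10):
    """Return up to n item ids for `user`, padded with popular items if needed.

    The popularity fallback is an assumption added for illustration.
    """
    seen = train_set.get(user, {})
    rank = {}
    for item, rating in seen.items():
        neighbours = sorted(sim_matrix.get(item, {}).items(),
                            key=itemgetter(1), reverse=True)[:k]
        for related_item, w in neighbours:
            if related_item not in seen:
                rank[related_item] = rank.get(related_item, 0.0) + w * float(rating)
    recs = [item for item, _ in sorted(rank.items(), key=itemgetter(1), reverse=True)[:n]]
    for item in popular_items:  # expected order: most popular first
        if len(recs) >= n:
            break
        if item not in seen and item not in recs:
            recs.append(item)
    return recs

# Hypothetical toy inputs
toy_train = {"u1": {"A": 1}, "u2": {"B": 1}}
toy_sim = {"A": {"B": 0.8, "C": 0.5}, "B": {"A": 0.8}, "C": {"A": 0.5}}
toy_popular = ["B", "A", "C", "D"]
recs_u1 = recommend("u1", toy_train, toy_sim, toy_popular, n=3)   # history: A
recs_new = recommend("u9", toy_train, toy_sim, toy_popular, n=3)  # no history
```

With the notebook's variables, the popularity list could be built as `sorted(item_popular, key=item_popular.get, reverse=True)`.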
6. Generate the submission file
# Keep only the item id from each (item_id, score) pair
r = []
for i in result:
    r.append(i[0])
sub['item_id'] = r
sub
sub.to_csv(path + '/result/ItemCF.csv')
Online score: 0.02109538784
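The held-out testSet built in step 3 is never used in the steps above, yet it can give a rough offline estimate before submitting. A minimal sketch of a precision/recall check follows; the `precision_recall` helper and the toy inputs are hypothetical. The competition's online metric is not stated here, so these offline numbers are not directly comparable to the online score.

```python
def precision_recall(recommendations, test_set):
    """Compute precision and recall of top-n lists against held-out items.

    `recommendations` maps user -> list of recommended item ids;
    `test_set` maps user -> {item: rating}, as built in the split step.
    """
    hits = rec_total = test_total = 0
    for user, held_out in test_set.items():
        recs = recommendations.get(user, [])
        hits += sum(1 for item in recs if item in held_out)
        rec_total += len(recs)
        test_total += len(held_out)
    precision = hits / rec_total if rec_total else 0.0
    recall = hits / test_total if test_total else 0.0
    return precision, recall

# Hypothetical toy inputs: 2 of 4 recommendations hit, 2 of 3 held-out items found
toy_recs = {"u1": ["B", "C"], "u2": ["A", "D"]}
toy_test = {"u1": {"B": 1}, "u2": {"C": 1, "D": 1}}
precision, recall = precision_recall(toy_recs, toy_test)
```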