Machine Learning homework project at EPFL
We study the problem of recommending items to users with a collaborative-filtering approach. In particular, we are given 10000 users, 1000 movies, and a subset of their ratings, and we would like to recommend new movies by predicting the missing ratings. To this end, we implement 16 different models. A modified version of Singular Value Decomposition (SVD++) performs best among these models, achieving a score of 0.97738 on Kaggle's validation set.
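For illustration, here is a minimal sketch of training SVD++ with the Surprise library; the toy ratings and hyperparameters are illustrative assumptions, not the configuration used in this project.

```python
from surprise import SVDpp, Dataset, Reader
import pandas as pd

# Toy ratings with columns user, item, rating on a 1-5 scale (illustrative).
df = pd.DataFrame({"user": [1, 1, 2], "item": [10, 20, 10],
                   "rating": [4.0, 3.0, 5.0]})
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(df[["user", "item", "rating"]], reader)

# SVD++ extends plain SVD with implicit-feedback terms; these
# hyperparameters are illustrative, not our tuned values.
algo = SVDpp(n_factors=20, n_epochs=20)
algo.fit(data.build_full_trainset())
print(algo.predict(1, 20).est)  # estimated rating of item 20 by user 1
```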
To improve our predictions even further, we implement a ridge regression model that predicts ratings from the predictions of 13 of these models. This blending yields a slight improvement over SVD++, lowering our score to 0.97368 on Kaggle's validation set.
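As a sketch of the blending step: assuming the base models' predictions on a held-out set are stacked as columns of a matrix `X` and `y` holds the true ratings, the ridge blend could look like the following (the helper name and `alpha` are illustrative, not our tuned values).

```python
import numpy as np
from sklearn.linear_model import Ridge

def blend(X_train, y_train, X_test, alpha=0.1):
    """Fit ridge regression on base-model predictions, then blend.

    X_train, X_test: one column per base model; y_train: true ratings.
    alpha is an illustrative regularization strength.
    """
    reg = Ridge(alpha=alpha)
    reg.fit(X_train, y_train)
    # Ratings lie on a 1-5 scale, so clip the blended predictions.
    return np.clip(reg.predict(X_test), 1.0, 5.0)
```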
Install a Python environment. For all our tests, an Anaconda environment was used.
Install the custom Surprise library.
Install the requirements for the project (e.g., with pip and the provided requirements file).
Data Sets:
Downloading the prediction dumps
In order to speed up the code, we provide pickle dumps of all the models used for the blending. The folders containing the dumps can be downloaded from SWITCHdrive at: https://drive.switch.ch/index.php/s/1zqfWRX7Xbgsybn.
The folders “predictions” and “test” should be placed in the main folder. The resulting folder tree should look like this:
Recommender System
├── code
├── data
├── figures
├── predictions
└── test
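For context, here is a minimal sketch of how such dumps can be written and read with pickle; the paths and helper names are illustrative, not the project's actual code.

```python
import os
import pickle

def dump_predictions(predictions, path="predictions/svdpp.pkl"):
    """Serialize a model's predictions to disk (path is illustrative)."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(predictions, f)

def load_predictions(path="predictions/svdpp.pkl"):
    """Reload previously dumped predictions."""
    with open(path, "rb") as f:
        return pickle.load(f)
```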
When run.py is run, it automatically creates a Kaggle submission called “final_submission.csv” in the data folder. This is the submission that scores 0.97368 on the Kaggle public leaderboard.
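For reference, a minimal sketch of writing such a submission file is shown below; the “rX_cY” Id format is an assumption based on the course's data format, and `write_submission` is a hypothetical helper, not run.py's actual code.

```python
import csv

def write_submission(predictions, path="data/final_submission.csv"):
    """Write (user, movie, rating) triples in the assumed Kaggle format.

    The "rX_cY" Id format (row = user, column = movie) is an assumption.
    """
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Id", "Prediction"])
        for user, movie, rating in predictions:
            writer.writerow(["r{}_c{}".format(user, movie), rating])
```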
The code takes approximately 4 hours on an Intel i7-6700HQ when executed from scratch (no pickle dumps provided) and less than 2 minutes when all the model predictions have already been dumped (only the blending runs). The algorithms are not parallelized, so only a fraction of the CPU's total computational power is used. No GPU is required, as the code does not use one. RAM usage stays below 4 GB; in our tests, the code ran without issues on an 8 GB system.
The per-model running times below are in minutes, measured with the data split into 93.3% training and 6.7% testing. Models not listed take less than one minute.
matrix_factorization_SGD : 13
matrix_factorization_ALS : 13
KNN_user : 50
KNN_item : 5
slope_one : 4
svd : 3
svdpp : 140
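For reference, a minimal sketch of a random split of this kind is shown below; `split_ratings` and the seed are illustrative, not the project's actual splitting code.

```python
import numpy as np

def split_ratings(df, test_frac=0.067, seed=42):
    """Randomly hold out ~6.7% of the ratings for testing.

    df is assumed to be a pandas DataFrame of ratings; the seed and the
    helper name are illustrative assumptions.
    """
    rng = np.random.RandomState(seed)
    mask = rng.rand(len(df)) < test_frac
    return df[~mask], df[mask]  # (train, test)
```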