This project works with the Last.fm dataset provided by Oscar Celma. It consists of user-artist-plays tuples that were collected from the Last.fm API. This small project developed from my interest in music and from a need to understand item-based collaborative filtering as part of my independent study on recommender systems. The project uses NumPy, Pandas, SciPy, scikit-learn, and fuzzywuzzy, and it introduced me to a variety of new modules and methods. Let's take a look at a few code snippets to understand what this project is all about.
First, we load the user-data file, which contains users, the artists each user listens to, and the number of plays each user has for each artist. We also load the user-profile data, which gives information such as each user's country of origin and age.
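As a rough sketch of this loading step (not the original snippet), the two files can be read with `pandas.read_csv`. The tab separator, the column names, and the tiny in-memory samples below are all assumptions made for illustration; the real files are much larger and may carry extra columns.

```python
import io
import pandas as pd

# Toy stand-ins for the two files (assumed tab-separated, no header row).
USER_DATA_TSV = (
    "u1\tsnoop dogg\t120\n"
    "u1\tdr. dre\t80\n"
    "u2\tsnoop dogg\t45\n"
    "u3\tradiohead\t300\n"
)
USER_PROFILE_TSV = (
    "u1\tUnited States\t24\n"
    "u2\tUnited States\t31\n"
    "u3\tGermany\t27\n"
)

def load_user_data(source):
    """Read user-artist-plays rows; the column names are hypothetical."""
    return pd.read_csv(source, sep="\t", header=None,
                       names=["user", "artist", "plays"])

def load_user_profiles(source):
    """Read user profile rows; the column names are hypothetical."""
    return pd.read_csv(source, sep="\t", header=None,
                       names=["user", "country", "age"])

# `source` can be a file path on disk or, as here, an in-memory buffer.
user_data = load_user_data(io.StringIO(USER_DATA_TSV))
user_profiles = load_user_profiles(io.StringIO(USER_PROFILE_TSV))
```

Passing a path instead of the `StringIO` buffer reads the same layout from disk.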
Next, we calculate the total plays each artist gets from all users combined, to get an understanding of which artists are really popular. We need this because we then set a popularity threshold and only consider artists above it for our recommendations. We take the top 3% of artists, setting the threshold at 40000 total plays.
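The aggregation and threshold described above might look like the following minimal sketch. The column names, the toy data, and the tiny threshold are assumptions for illustration; on the full dataset the post uses 40000. Treating the threshold as inclusive (`>=`) is also an assumption.

```python
import pandas as pd

# Toy user-artist-plays data (hypothetical values).
user_data = pd.DataFrame({
    "user":   ["u1", "u1", "u2", "u2", "u3", "u3"],
    "artist": ["a", "b", "a", "c", "a", "b"],
    "plays":  [100, 10, 200, 5, 50, 30],
})

# Total plays per artist, summed over all users.
artist_plays = (
    user_data.groupby("artist")["plays"]
    .sum()
    .rename("total_artist_plays")
    .reset_index()
)

# One way to inspect where a "top 3%" cutoff falls in the distribution.
cutoff_97 = artist_plays["total_artist_plays"].quantile(0.97)

# Fixed threshold; 40 is a stand-in sized for the toy data above.
POPULARITY_THRESHOLD = 40

popular_artists = artist_plays[
    artist_plays["total_artist_plays"] >= POPULARITY_THRESHOLD
]

# Keep only the rows that belong to popular artists.
popular_user_data = user_data[
    user_data["artist"].isin(popular_artists["artist"])
]
```

Looking at a few quantiles first, then fixing a round threshold, keeps the filter easy to explain and reproduce.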
After we've obtained our subsetted user-artist data, we restrict it to users from the United States to reduce the complexity of the project and obtain a narrower result. We then reshape the data into a wide matrix of plays, with artists on one axis and users on the other. Once we have this wide matrix, we fit a nearest neighbors model to find the artists closest to the artist of interest, using each artist's vector of plays. We calculate the distance between artists using cosine distance. Once we have these distances, we pick the closest 10 to make our recommendations. Here's a snippet of how sklearn is used.
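The original snippet is not reproduced here, so what follows is only a sketch of the steps just described, under stated assumptions: the column names, the toy data, and the helper `recommend` are hypothetical, and artists are placed on the rows of the wide matrix. It uses `NearestNeighbors` from scikit-learn with `metric="cosine"` on a SciPy sparse matrix.

```python
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Toy data (hypothetical values and column names).
plays = pd.DataFrame({
    "user":   ["u1", "u1", "u2", "u2", "u3", "u3", "u4", "u4"],
    "artist": ["snoop dogg", "dr. dre", "snoop dogg", "dr. dre",
               "radiohead", "muse", "radiohead", "snoop dogg"],
    "plays":  [120, 80, 60, 40, 300, 150, 90, 10],
})
profiles = pd.DataFrame({
    "user":    ["u1", "u2", "u3", "u4"],
    "country": ["United States", "United States", "United States", "Germany"],
})

# Keep only users from the United States.
us_plays = plays.merge(profiles, on="user")
us_plays = us_plays[us_plays["country"] == "United States"]

# Wide matrix: one row per artist, one column per user, zeros where unplayed.
wide = us_plays.pivot_table(index="artist", columns="user",
                            values="plays", aggfunc="sum", fill_value=0)
matrix = csr_matrix(wide.values.astype(float))

# Brute-force nearest neighbors under cosine distance.
model = NearestNeighbors(metric="cosine", algorithm="brute")
model.fit(matrix)

def recommend(artist, k=10):
    """Return up to k (artist, cosine distance) pairs closest to `artist`."""
    k = min(k, matrix.shape[0] - 1)
    row = matrix[wide.index.get_loc(artist)]
    # Ask for one extra neighbor because the artist matches itself.
    distances, indices = model.kneighbors(row, n_neighbors=k + 1)
    neighbours = zip(wide.index[indices.ravel()], distances.ravel())
    return [(name, float(d)) for name, d in neighbours if name != artist][:k]
```

Calling `recommend("snoop dogg")` returns the other artists ordered by cosine distance. Cosine distance compares the direction of the play vectors rather than their magnitude, so an artist with fewer total plays can still be a close neighbor if the same users listen to both.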
Here is a sample of the recommendations we observe when Snoop Dogg is the user's selected artist.
We can see that these recommendations are very good! We could also work with binary counts (whether a user has played an artist at all) instead of total artist plays. Another feature that could be introduced is fuzzy matching, so that artist names typed with different characters are recognized as well. This implementation has some limitations. It needs to maintain a matrix of item similarities. The recommendations tend to be the expected ones rather than the bold predictions a user-based approach could make. Recommendations are made only for popular artists, so it would be interesting to build a recommender for low-profile artists. Lastly, scaling such a solution to the larger dataset would need a lot of optimization.
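The two extensions mentioned above, binary counts and fuzzy name matching, could be sketched as follows. This is an illustration under assumptions, not part of the project: the toy matrix and the helper `match_artist` are hypothetical, and the standard library's `difflib` is swapped in for fuzzywuzzy so the example has no extra dependency.

```python
import difflib
import pandas as pd

# Toy wide matrix of plays: artists on rows, users on columns (hypothetical).
wide = pd.DataFrame(
    [[120, 0, 3], [0, 45, 0]],
    index=["snoop dogg", "radiohead"],
    columns=["u1", "u2", "u3"],
)

# Binary counts: 1 if the user has played the artist at all, else 0.
binary = (wide > 0).astype(int)

def match_artist(query, names, cutoff=0.6):
    """Return the known artist name closest to `query`, or None.

    `cutoff` is a similarity ratio in [0, 1]; 0.6 is difflib's default.
    """
    lowered = {name.lower(): name for name in names}
    hits = difflib.get_close_matches(query.lower(), list(lowered),
                                     n=1, cutoff=cutoff)
    return lowered[hits[0]] if hits else None

# A slightly misspelled query still resolves to a known artist.
resolved = match_artist("Snoop Dog", wide.index)
```

The binary matrix can be passed to the same nearest neighbors model in place of the raw plays, which stops a few heavy listeners from dominating an artist's vector.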
Credit goes to Nick Becker, whose blog post on music recommenders inspired this project.