About
As a Computer Science student who is deeply interested in applying Machine Learning algorithms to surface the most relevant and helpful content for users,
and a film lover who enjoys analyzing classics from every angle, I wanted to create a recommendation system that helps users pick the perfect film to watch
for any occasion.
I wanted to build an application that gives users the most fitting movie recommendations based on whatever information is available, and ideally that
information can be as minimal and conceptual as possible: a brief description of the movie they are looking for, a movie
they have already watched and loved, or simply a request for the top movies in each genre. Naturally, datasets on how users classify, label, and rank films are of particular interest to me.
The source code for this app is on my GitHub:
Data Processing + ML Algorithm: https://github.com/Cadey-chen/movie_recommender
Django Web Application: https://github.com/Cadey-chen/movie_rec_app
Everything started with pre-processing movie datasets containing metadata such as movie title, year, ratings, keywords, and
descriptions. I found two amazing TMDB movie datasets on Kaggle, which I pre-processed and used to train my ML model:
1. The Movies Dataset by Rounak Banik: https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset
2. Full TMDB Movies Dataset 2024 (1M Movies) by asaniczka: https://www.kaggle.com/datasets/asaniczka/tmdb-movies-dataset-2023-930k-movies
In the data pre-processing stage, much of the work went into deciding which samples are relevant to this recommender and should be kept,
since I wanted it to give the most relevant and high-quality recommendations. The rest went into converting each
column into a data type that can be easily accessed and operated on during training.
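As a rough illustration of this kind of filtering and type conversion, here is a minimal sketch using only the Python standard library. The column names (title, overview, vote_count, vote_average, genres), the vote threshold, and the sample rows are all assumptions made up for the example. They are not the actual rules or data used in this project.

```python
import ast

# Hypothetical threshold: the actual filtering rule is not stated in this write-up.
MIN_VOTES = 50


def parse_names(raw):
    """Turn a stringified list of {'id': ..., 'name': ...} dicts into a list of names."""
    try:
        items = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return []  # malformed cell: treat as "no values"
    if not isinstance(items, list):
        return []
    return [item["name"] for item in items if isinstance(item, dict) and "name" in item]


def clean_rows(rows, min_votes=MIN_VOTES):
    """Keep rows with a title, an overview, and enough votes; convert column types."""
    cleaned = []
    for row in rows:
        title = (row.get("title") or "").strip()
        overview = (row.get("overview") or "").strip()
        try:
            votes = int(float(row.get("vote_count") or 0))
            rating = float(row.get("vote_average") or 0.0)
        except ValueError:
            continue  # skip rows whose numeric columns are malformed
        if not title or not overview or votes < min_votes:
            continue
        cleaned.append({
            "title": title,
            "overview": overview,
            "vote_count": votes,
            "vote_average": rating,
            "genres": parse_names(row.get("genres") or "[]"),
        })
    return cleaned


# Made-up rows for illustration only (as they might come out of csv.DictReader).
sample_rows = [
    {"title": "Example Film A", "overview": "Two rivals cross paths.",
     "vote_count": "1200", "vote_average": "7.4",
     "genres": "[{'id': 1, 'name': 'Drama'}, {'id': 2, 'name': 'Crime'}]"},
    {"title": "Example Film B", "overview": "",
     "vote_count": "3", "vote_average": "5.0", "genres": "[]"},
    {"title": "Example Film C", "overview": "A road trip.",
     "vote_count": "not a number", "vote_average": "6.1", "genres": "[]"},
]
cleaned = clean_rows(sample_rows)
```

Only the first sample row survives here: the second has no overview and too few votes, and the third has a vote count that cannot be parsed. The surviving row carries its genres as a plain list of names, which is easier to compare later than the raw string.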
To understand how these movies relate to one another, I applied the Term Frequency-Inverse Document Frequency (TF-IDF) vectorizer and
cosine similarity to calculate the similarity between each movie's plot description and keywords string. In this process, I found the most
challenging part to be accurately representing the overall tone and style of a movie with the given data, since relying solely on the plot description
and keywords string can sometimes lead to ill-suited results because of a single specific word (such as an object that is important to the plot,
or a character name that two movies happen to share). To mitigate this, the model also takes other metadata into consideration, such as genres, production
country, and ratings similarity, which makes the results more general and less dependent on any single word.
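To make the idea concrete, here is a small from-scratch sketch of TF-IDF weighting, cosine similarity, and a blend with genre overlap. It uses only the standard library so it runs anywhere. A real project would more likely use a library vectorizer. The smoothed IDF formula, the Jaccard genre overlap, the 0.7 text weight, and the three sample movies are all assumptions for illustration, not the exact method or data behind this app.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(documents):
    """Return one L2-normalized {term: weight} dict per document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Smoothed IDF, ln((1 + N) / (1 + df)) + 1, so no weight is zero or negative.
        vec = {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0)
               for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else {})
    return vectors


def cosine(a, b):
    """For L2-normalized sparse vectors, cosine similarity is the dot product."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller vector
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def jaccard(a, b):
    """Share of labels two movies have in common (0.0 if both lists are empty)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def combined_similarity(text_sim, genres_a, genres_b, text_weight=0.7):
    """Blend text similarity with genre overlap; the 0.7 weight is a made-up default."""
    return text_weight * text_sim + (1.0 - text_weight) * jaccard(genres_a, genres_b)


# Made-up movies for illustration: (title, description plus keywords, genres).
movies = [
    ("Film A", "a detective hunts a killer in a rainy city", ["Crime", "Thriller"]),
    ("Film B", "a detective chases a killer through the city at night", ["Crime", "Thriller"]),
    ("Film C", "a family of robots learns to cook pasta", ["Comedy", "Family"]),
]
vectors = tfidf_vectors([text for _, text, _ in movies])
sim_ab = cosine(vectors[0], vectors[1])
sim_ac = cosine(vectors[0], vectors[2])
score_ab = combined_similarity(sim_ab, movies[0][2], movies[1][2])
score_ac = combined_similarity(sim_ac, movies[0][2], movies[2][2])
```

In this toy data, Film A scores closer to Film B than to Film C on text alone, and the genre term widens that gap. The blend matters more in the opposite case, where two unrelated movies share one rare word: the genre mismatch then pulls their combined score back down.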
I fetch the movie poster images through the TMDB API. This product uses the TMDB API but is not endorsed or certified by TMDB.
For the full-stack application itself, I used Django with a PostgreSQL database and built the user interface with HTML and Bootstrap 5.
I designed the UI/UX in Figma; the design concept is a dark background with light gradient colors to emulate a cool-toned movie screen.
I used threads to process such a large dataset and to load the associated movie poster images for users.
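One common way to load many posters concurrently is a thread pool, since the work is mostly waiting on network responses. The sketch below is an assumption about how this could look, not the app's actual code. The function names, the worker count, and the w500 image size are hypothetical choices, and the lookup function is passed in so the example runs without a network connection or an API key.

```python
from concurrent.futures import ThreadPoolExecutor

# TMDB image host with an assumed size segment (w500); other sizes exist.
TMDB_IMAGE_BASE = "https://image.tmdb.org/t/p/w500"


def poster_url(poster_path):
    """Build a full image URL from a poster path such as '/abc.jpg'."""
    return f"{TMDB_IMAGE_BASE}{poster_path}" if poster_path else None


def load_posters(movie_ids, fetch_poster_path, max_workers=8):
    """Resolve poster URLs for many movies concurrently.

    fetch_poster_path is any callable that maps a movie id to a poster path
    (or None). In a real app it would call the TMDB API; here it is injected
    so the sketch stays runnable offline.
    """
    def resolve(movie_id):
        try:
            return movie_id, poster_url(fetch_poster_path(movie_id))
        except Exception:
            # One failed lookup should not break the whole page of results.
            return movie_id, None

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(resolve, movie_ids))


# Stand-in for a real API lookup, using made-up ids and paths.
fake_paths = {101: "/example-a.jpg", 102: None}
posters = load_posters([101, 102, 103], lambda movie_id: fake_paths[movie_id])
```

Movies with no poster, or whose lookup fails (id 103 above is missing from the stand-in data), map to None, so the page can fall back to a placeholder image instead of raising an error.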