INTRODUCTION
RESOURCES
DATA ANALYSIS
FEATURE ENGINEERING
BUILDING RECOMMENDATION MODEL
INTERACTIVE SEARCH WITH WIDGETS
COLLABORATIVE FILTERING WITH KNN
ENHANCED SCORING FOR SIMILAR USERS’ RECOMMENDATION
ENCAPSULATING LOGIC
FULL INTERACTIVE PIPELINE FOR FINAL RECOMMENDATION
CONCLUSION

1. INTRODUCTION

In this project, we developed a Movie Recommendation System using Python, Natural Language Processing (NLP), and Machine Learning techniques. The system combines content-based matching on movie titles with collaborative filtering on user ratings to recommend movies similar to the one you like.

2. RESOURCES

2.1 INTRODUCTION TO DATASET

This project utilises the MovieLens 25M dataset, which comprises detailed information about movies and user ratings. The dataset includes two primary CSV files:

movies.csv: This file contains metadata for 62,000+ movies, including movie titles and their associated genres. Each entry provides a unique movie ID used for linking with other files.
ratings.csv: This file includes over 25 million user ratings for various movies. Each record consists of a user ID, movie ID, rating (from 0.5 to 5.0), and a timestamp, enabling detailed analysis of user preferences and recommendation modelling.
DATASET LINK: https://drive.google.com/drive/folders/1HhY3Ct851iv_XXNYXwqpdpolAJG_nHJF

COLAB LINK: https://colab.research.google.com/drive/12jvCZYsh3W-tLRJPEIp32ic3ttq2P3ZO

3. DATA ANALYSIS

3.1 ATTRIBUTE DESCRIPTION

For movies.csv
This file contains metadata about movies used for content-based recommendations.

movieId: A unique identifier for each movie. It is used as a key to link the movie data with the ratings in ratings.csv.
title: The title of the movie, typically including the year of release in parentheses.
genres: A pipe-separated (|) list of genres (e.g., Action|Comedy) that classify the type of movie. This information is useful for genre-based filtering or similarity measures.
clean_title (generated during preprocessing): A processed version of the title with all special characters removed, used for more reliable text-based matching with techniques like TF-IDF.

For ratings.csv
This file contains user interactions with the movies in the form of ratings, enabling collaborative filtering.
- userId: A unique identifier for each user who rated a movie. Used to group and analyze user behavior.
- movieId: A foreign key that links each rating to a specific movie in movies.csv.
- rating: The numerical rating provided by the user, typically ranging from 0.5 to 5.0 in 0.5 increments. This is the core signal used in collaborative filtering models.
- timestamp: Represents the time when the rating was given. Though not always used, it can be helpful in building time-aware or session-based recommendation systems.

3.2 EXPLORATORY DATA ANALYSIS (EDA)

Necessary imports:

Next, we read the csv (movies.csv) and print its first 5 rows:
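
The imports and loading step could look roughly like the sketch below. It assumes pandas is the library used for loading, and it reads a few MovieLens-style rows from an in-memory string so that the example runs on its own; in the notebook the same call would point at movies.csv. The sample rows and the variable name sample_csv are illustrative only.

```python
import io

import pandas as pd

# A few MovieLens-style rows standing in for movies.csv (illustrative sample).
sample_csv = io.StringIO(
    "movieId,title,genres\n"
    "1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy\n"
    "2,Jumanji (1995),Adventure|Children|Fantasy\n"
    "3,Grumpier Old Men (1995),Comedy|Romance\n"
)

# In the notebook this would be: movies = pd.read_csv("movies.csv")
movies = pd.read_csv(sample_csv)

# Show the first five rows (the sample only has three).
print(movies.head())
```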

Next, with "Recommended Plots" toggled on, Colab detects that movies is a DataFrame and:

Analyses data types
Checks for unique value counts
Suggests appropriate plot types based on column properties

Next, we remove all special characters from movie titles so that they consist only of letters, numbers, and spaces. This makes the titles cleaner and more consistent for further analysis like:

Search and filtering
Text vectorisation in recommendation systems
Avoiding mismatches during comparisons

Function clean_title(title):
This function takes a movie title as input and applies a regular expression using re.sub(), replacing every matched character with an empty string.

Regular expression: "[^a-zA-Z0-9 ]" means “everything that is NOT a letter, number, or space”.
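
A minimal sketch of this function, using the regular expression quoted above, might look like this:

```python
import re


def clean_title(title):
    # Drop every character that is not an ASCII letter, a digit, or a space.
    return re.sub("[^a-zA-Z0-9 ]", "", title)


print(clean_title("Toy Story (1995)"))  # Toy Story 1995
```

In the notebook, the function would typically be applied to the whole column, for example with movies["title"].apply(clean_title). Note that this pattern also removes accented letters, since they fall outside a-z and A-Z.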

4. FEATURE ENGINEERING: Text Vectorisation for Title Similarity

We use TF-IDF to represent movie titles numerically, weighting each term by how distinctive it is across all titles.

Inference:
We used TfidfVectorizer with n-grams (1,2) to convert each cleaned title into a numerical vector. TF-IDF helps emphasize distinctive terms in a title while minimizing common, less informative words. By using both unigrams and bigrams, we capture not only individual words but also word combinations, which enhances contextual understanding. This vector representation allows us to mathematically compute similarity between titles.
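
The vectorisation step can be sketched as follows, assuming scikit-learn's TfidfVectorizer with ngram_range=(1, 2) and otherwise default settings. The three cleaned titles are a made-up sample; in the notebook the input would be the clean_title column.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative cleaned titles; the real input is movies["clean_title"].
clean_titles = ["Toy Story 1995", "Toy Story 2 1999", "Jumanji 1995"]

# ngram_range=(1, 2) builds features from single words and from word pairs.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))

# One row per title, one column per unigram or bigram in the vocabulary.
tfidf = vectorizer.fit_transform(clean_titles)

print(tfidf.shape)
print(sorted(vectorizer.vocabulary_))
```

The printed vocabulary contains both single words such as "toy" and pairs such as "toy story", which is what lets a query match on word order as well as on individual words.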

5. BUILDING TITLE BASED RECOMMENDATION MODEL

We now build a content-based filtering model using cosine similarity to find similar movie titles.

Inference:
Here, we implemented a function to find the top 5 similar movies based on the cosine similarity between the input movie title and the TF-IDF vectors of all titles. This content-based approach is especially useful for cold-start situations where user rating data may not yet exist. It's a quick and effective way to surface movies whose titles closely match what the user inputs.
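
A title search along these lines could be sketched as below. The function name search, the top_n parameter, and the four sample movies are assumptions made for illustration; the sketch relies on scikit-learn's cosine_similarity and NumPy's argsort.

```python
import re

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Tiny illustrative catalogue standing in for movies.csv.
movies = pd.DataFrame({
    "movieId": [1, 2, 3, 4],
    "title": ["Toy Story (1995)", "Toy Story 2 (1999)", "Jumanji (1995)", "Heat (1995)"],
})
movies["clean_title"] = movies["title"].apply(
    lambda t: re.sub("[^a-zA-Z0-9 ]", "", t)
)

vectorizer = TfidfVectorizer(ngram_range=(1, 2))
tfidf = vectorizer.fit_transform(movies["clean_title"])


def search(title, top_n=5):
    # Clean the query the same way the titles were cleaned.
    query = re.sub("[^a-zA-Z0-9 ]", "", title)
    query_vec = vectorizer.transform([query])
    # Cosine similarity between the query and every title vector.
    similarity = cosine_similarity(query_vec, tfidf).flatten()
    # Indices of the highest-scoring titles, best match first.
    top = np.argsort(similarity)[::-1][:top_n]
    return movies.iloc[top]


results = search("Toy Story")
print(results[["movieId", "title"]])
```

On this sample the two Toy Story entries come out on top, because they are the only titles that share terms with the query.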

6. INTERACTIVE TITLE-BASED SEARCH WITH WIDGETS

We then add an interactive search using ipywidgets to look up movies by title and show similar results.

Inference:
To improve usability, we introduced an interactive widget (ipywidgets.Text) that listens to user input and dynamically updates recommendations in real time. This front-end component provides a more engaging and intuitive experience, allowing users to instantly explore similar movies as they type a title. It bridges the gap between the model and the end-user.
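
One way to wire this up is sketched below, assuming a Jupyter or Colab environment with ipywidgets installed. The helper name make_search_box, its search_fn argument (any function that maps a typed title to a result, such as a title search), and the min_chars threshold are hypothetical names chosen for this example.

```python
def make_search_box(search_fn, min_chars=3):
    """Build a text box that shows search_fn(text) as the user types."""
    # Imported lazily because these packages are only needed inside a notebook.
    import ipywidgets as widgets
    from IPython.display import display

    text_box = widgets.Text(value="", description="Movie Title:")
    output = widgets.Output()

    def on_type(change):
        # Runs every time the text box value changes.
        with output:
            output.clear_output()
            title = change["new"]
            # Skip very short inputs to avoid noisy results.
            if len(title) >= min_chars:
                display(search_fn(title))

    # Call on_type whenever the "value" trait of the text box changes.
    text_box.observe(on_type, names="value")
    return widgets.VBox([text_box, output])
```

In a notebook cell, evaluating make_search_box(search) as the last expression would render the text box with the results area underneath it.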

7. COLLABORATIVE FILTERING WITH KNN (User-Based Recommendations)

We then introduce collaborative filtering using Nearest Neighbours on filtered rating data.

Inference:
This segment introduces collaborative filtering, which suggests movies based on user behavior rather than content. We filtered out users and movies with low activity to improve reliability. Using a pivot table to create a user-movie rating matrix, we then applied KNN to find users with similar preferences. This approach identifies communities of similar users and leverages their ratings to make personalized recommendations.
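
The steps above can be sketched on a toy ratings table as follows. The activity thresholds, the choice to fill unrated movies with 0, the cosine metric, and the helper name similar_users are all assumptions made for this example, not necessarily the settings used in the notebook.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Toy ratings standing in for ratings.csv: users 1-2 and users 3-4 have similar tastes.
ratings = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "movieId": [10, 20, 30, 10, 20, 30, 10, 20, 30, 10, 20, 30],
    "rating":  [5.0, 4.5, 1.0, 5.0, 4.0, 1.5, 1.0, 2.0, 5.0, 0.5, 1.5, 4.5],
})

# Keep only users and movies with enough ratings (thresholds are assumptions).
min_user_ratings, min_movie_ratings = 3, 3
user_counts = ratings["userId"].value_counts()
movie_counts = ratings["movieId"].value_counts()
active = ratings[
    ratings["userId"].isin(user_counts[user_counts >= min_user_ratings].index)
    & ratings["movieId"].isin(movie_counts[movie_counts >= min_movie_ratings].index)
]

# Rows are users, columns are movies; unrated movies are filled with 0.
user_movie = active.pivot_table(
    index="userId", columns="movieId", values="rating"
).fillna(0)

# Brute-force nearest neighbours on cosine distance between rating vectors.
knn = NearestNeighbors(metric="cosine", algorithm="brute")
knn.fit(user_movie.values)


def similar_users(user_id, k=1):
    row = user_movie.loc[[user_id]].values
    # Ask for k + 1 neighbours because the closest match is the user itself.
    _, indices = knn.kneighbors(row, n_neighbors=k + 1)
    neighbours = user_movie.index[indices.flatten()]
    return [u for u in neighbours if u != user_id][:k]


print(similar_users(1))
```

On the full 25M dataset a dense pivot table like this is likely to be too large for memory, so a sparse user-movie matrix would probably be needed in practice.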

8. ENHANCED SCORING FOR SIMILAR USERS’ RECOMMENDATION

We now adjust the recommendations using the ratio of ratings from similar users to ratings from all users.

Inference:
Not all recommendations by similar users are equally useful. To address this, we created a scoring system based on how frequently a movie was rated highly by similar users compared to all users. This ratio helps prioritize movies that are uniquely appreciated by the target user group. It adds a layer of personalization and filters out universally liked but less relevant suggestions.
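
A sketch of this ratio on toy data is shown below. The function name score_recommendations, the min_rating threshold of 4.0, and the sample ratings are assumptions for illustration; the score is the share of similar users who rated a movie highly divided by the share of all users who did.

```python
import pandas as pd

# Toy ratings: movie 10 is liked by almost everyone, movie 20 only by users 1 and 2.
ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6],
    "movieId": [10, 20, 10, 20, 10, 30, 10, 30, 10, 30, 10],
    "rating":  [5.0, 5.0, 4.5, 4.5, 5.0, 4.0, 4.5, 5.0, 5.0, 4.5, 3.0],
})


def score_recommendations(ratings, similar_user_ids, min_rating=4.0):
    liked = ratings[ratings["rating"] >= min_rating]

    # Share of similar users who rated each movie highly.
    similar_liked = liked[liked["userId"].isin(similar_user_ids)]
    similar_share = similar_liked["movieId"].value_counts() / len(set(similar_user_ids))

    # Share of all users who rated those same movies highly.
    all_liked = liked[liked["movieId"].isin(similar_share.index)]
    all_share = all_liked["movieId"].value_counts() / ratings["userId"].nunique()

    scores = pd.concat([similar_share, all_share], axis=1, keys=["similar", "all"])
    # A score above 1 means the movie is liked more by similar users than overall.
    scores["score"] = scores["similar"] / scores["all"]
    return scores.sort_values("score", ascending=False)


scores = score_recommendations(ratings, similar_user_ids=[1, 2])
print(scores)
```

Here movie 20 ranks above movie 10: both are liked by every similar user, but movie 10 is also liked by most other users, so its ratio is lower.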

9. ENCAPSULATING LOGIC IN A FUNCTION

Next, we create a reusable function for generating recommendations based on similarity scores.

Inference:
We encapsulated the scoring logic into a reusable function find_similar_movies(movie_id). Wrapping complex logic in a function promotes clean code, makes the model reusable, and separates concerns, which is a good software engineering practice. This also allows us to plug in any movie ID and receive optimized recommendations without repeating the filtering and scoring logic.

10. FULL INTERACTIVE PIPELINE FOR FINAL RECOMMENDATION

We wrap up with a fully interactive widget that performs the search and recommends similar movies.

Inference:
In the final section, we combined the title search and collaborative recommendation systems into a seamless user interface. As the user types a movie title, the app fetches its ID, runs the find_similar_movies function, and displays a ranked list of recommendations. This completes the interactive recommendation engine and allows real-time, user-friendly movie discovery.

11. CONCLUSION

By combining content-based filtering (via TF-IDF and cosine similarity) with collaborative filtering (via KNN and user rating behaviour), we built a hybrid recommendation system that delivers relevant movie suggestions based on both textual similarity and user preferences. The integration of interactivity using ipywidgets further enhances user experience, making this a robust prototype for real-world movie recommendation engines.

This project not only showcases the application of machine learning and NLP but also emphasises the importance of clean data, feature engineering, and interactivity in delivering meaningful insights.

🎬 Movie Recommendation System Using Machine Learning
