What is a Recommender System?
Recommender systems are software or algorithms that provide personalized recommendations based on user behaviour and preferences. They are used in various platforms to improve user experience and engagement. There are different types of recommender systems, including collaborative filtering, content-based filtering, hybrid systems, matrix factorization, deep learning-based systems, context-aware systems, and reinforcement learning-based systems. The effectiveness of these systems depends on data quality, algorithm choice, and their ability to handle scalability and new user/item scenarios.
Types of recommender systems:
Collaborative Filtering: Recommends items based on the preferences and behaviour of other users (a tiny item-item sketch follows this list)
Content-Based Filtering: Recommends items based on their characteristics and the user's historical preferences
Hybrid Recommender Systems: Combines collaborative and content-based filtering techniques
Matrix Factorization: Factors user-item interaction data to understand underlying patterns
Deep Learning-Based Recommenders: Uses neural networks to capture complex patterns in user behaviour and content
Context-Aware Recommenders: Considers additional contextual information for more relevant recommendations
Reinforcement Learning-Based Recommenders: Optimizes recommendation strategies using user feedback
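Since collaborative filtering leads this list, here is a tiny, self-contained sketch of the item-item flavour. It is my own toy example, not from any particular library or dataset: the ratings matrix and its values are made up purely to show the idea.

# Toy item-item collaborative filtering: rows are users, columns are items, 0 = not rated.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

ratings = np.array([
    [5, 4, 0, 1],   # user 1
    [4, 5, 1, 0],   # user 2
    [1, 0, 5, 4],   # user 3
    [0, 1, 4, 5],   # user 4
])

# Similarity between items (columns), based on how users rated them together.
item_similarity = cosine_similarity(ratings.T)

# Items most similar to item 0, excluding item 0 itself.
most_similar = np.argsort(item_similarity[0])[::-1][1:]
print(most_similar)   # item 1 comes first, since users rated items 0 and 1 alike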
Applications of recommender systems:
E-commerce
Streaming services
Social media
Effectiveness of recommender systems:
Depends on data quality, algorithm choice, and how well issues like the cold start problem and scalability are handled (a tiny sketch of a common cold-start fallback follows below)
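To illustrate the cold start point, here is a minimal, purely illustrative sketch of one common fallback: recommending globally popular items to a user who has no history yet. The user names, titles and popularity counts are invented for the example.

# Hypothetical viewing history and global popularity counts (made-up data).
user_history = {
    "alice": ["Inception", "Interstellar"],
    "bob": [],                      # brand-new user: the cold start case
}
popularity = {"Avatar": 950, "Inception": 900, "Titanic": 870, "Up": 640}

def recommend_for(user, k=3):
    seen = set(user_history.get(user, []))
    # Rank titles by global popularity, most popular first.
    ranked = sorted(popularity, key=popularity.get, reverse=True)
    if not seen:
        # No history yet, so fall back to the most popular titles.
        return ranked[:k]
    # With history, a real system would plug in collaborative or content-based scores here.
    return [t for t in ranked if t not in seen][:k]

print(recommend_for("bob"))    # ['Avatar', 'Inception', 'Titanic']
print(recommend_for("alice"))  # ['Avatar', 'Titanic', 'Up']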
Project Flow
For now, let's focus on the data, pre-processing and model building, along with a simple website.
The dataset used is the TMDB 5000 Movie Dataset from Kaggle:
Let's do some simple coding:
Import necessary packages:
import numpy as np
import pandas as pd
import nltk
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem.porter import PorterStemmer
from sklearn.metrics.pairwise import cosine_similarity
import pickle
Read the dataset files, which are in CSV format:
credits = pd.read_csv('tmdb_5000_credits.csv')
movies = pd.read_csv('tmdb_5000_movies.csv')
To read the contents inside the dataset:
movies.head() displays the first few rows; here movies is the name of the DataFrame holding the dataset.
To make the dataset easier to work with, we merge the two datasets on a common column, which here is title:
movies = movies.merge(credits,on='title')
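A small housekeeping step the later code implicitly relies on: keeping only the columns used afterwards and dropping rows with missing values. This is a sketch, and the column names assume the standard TMDB 5000 files (movie_id comes from the credits file):

# Keep only the columns used later; drop the few rows with a missing overview
# so the tags can be built from it further down.
movies = movies[['movie_id', 'title', 'overview', 'genres', 'keywords', 'cast', 'crew']]
movies.dropna(inplace=True)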
Now we have to read these list-like columns in a better format. Each value is a stringified list of dictionaries, which ast.literal_eval can parse back into Python objects:

import ast

ast.literal_eval('[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]')

Using this, a small helper extracts just the names:

def convert(text):
    L = []
    for i in ast.literal_eval(text):
        L.append(i['name'])
    return L
Now, using the above function, we convert all of these columns into properly formatted lists of names.
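In practice, that step looks roughly like the following sketch. It also parses the cast column here, so that the slice to its first three names further below operates on a list rather than a raw string:

movies['genres'] = movies['genres'].apply(convert)      # e.g. ['Action', 'Adventure', ...]
movies['keywords'] = movies['keywords'].apply(convert)
movies['cast'] = movies['cast'].apply(convert)           # parsed here, trimmed to 3 names below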
Next, the cast and crew columns contain many entries, so to simplify the reading we keep only the top 3 cast members. Either apply a convert3 helper to the raw column, or, since cast has already been parsed, slice the list to its first three names:
def convert3(text):
    L = []
    counter = 0
    for i in ast.literal_eval(text):
        if counter < 3:
            L.append(i['name'])
            counter += 1
    return L
movies['cast'] = movies['cast'].apply(lambda x:x[0:3])
Now, from the crew column, we extract only the director:
def fetch_director(text):
    L = []
    for i in ast.literal_eval(text):
        if i['job'] == 'Director':
            L.append(i['name'])
    return L
movies['crew'] = movies['crew'].apply(fetch_director)
Removing spaces inside names, so that multi-word names (e.g. Sam Worthington becomes SamWorthington) are treated as single tokens:
movies['genres'] = movies['genres'].apply(lambda x: [i.replace(" ", "") for i in x])
movies['keywords'] = movies['keywords'].apply(lambda x: [i.replace(" ", "") for i in x])
movies['cast'] = movies['cast'].apply(lambda x: [i.replace(" ", "") for i in x])
movies['crew'] = movies['crew'].apply(lambda x: [i.replace(" ", "") for i in x])
Creating tags: apart from the movie_id and title columns, the remaining columns are merged into a single tags column, giving one plain, easy-to-read paragraph per movie.
# overview is a plain string while the other columns are lists, so split it into words first
movies['overview'] = movies['overview'].apply(lambda x: x.split())

movies['tags'] = movies['overview'] + movies['genres'] + movies['keywords'] + movies['cast'] + movies['crew']
new_df = movies[['movie_id', 'title', 'tags']].copy()
new_df['tags'] = new_df['tags'].apply(lambda x: " ".join(x))
new_df['tags'] = new_df['tags'].apply(lambda x: x.lower())
Next, turn the tags into vectors with a bag-of-words model. CountVectorizer keeps the 5,000 most frequent words and drops English stop words; before fitting it, we stem each word with NLTK's PorterStemmer so that variants such as "loved", "loving" and "love" count as one token:

from sklearn.feature_extraction.text import CountVectorizer
import nltk
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()

def stem(text):
    y = []
    for i in text.split():
        y.append(ps.stem(i))
    return " ".join(y)

Using stem, then vectorizing and computing the cosine similarity between every pair of movies:

new_df['tags'] = new_df['tags'].apply(stem)

cv = CountVectorizer(max_features=5000, stop_words='english')
vectors = cv.fit_transform(new_df['tags']).toarray()
cv.get_feature_names_out()   # the vocabulary learned by the vectorizer

from sklearn.metrics.pairwise import cosine_similarity
similarity = cosine_similarity(vectors)

# the five movies most similar to the movie at index 0 (index 0 itself is skipped)
sorted(list(enumerate(similarity[0])), reverse=True, key=lambda x: x[1])[1:6]

The recommend function looks up a title, sorts all movies by their similarity to it and prints the top five. Finally, the processed data is pickled for the front end:

def recommend(movie):
    movie_index = new_df[new_df['title'] == movie].index[0]
    distances = similarity[movie_index]
    movie_list = sorted(list(enumerate(distances)), reverse=True, key=lambda x: x[1])[1:6]
    for i in movie_list:
        print(new_df.iloc[i[0]].title)

import pickle
pickle.dump(new_df, open('movies.pkl', 'wb'))
pickle.dump(similarity, open('similarity.pkl', 'wb'))
pickle.dump(new_df.to_dict(), open('movie_dict.pkl', 'wb'))
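As a quick sanity check, you can call recommend with any title present in the dataset; the exact five titles returned will depend on your preprocessing:

recommend('Avatar')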
Last is the front end, built with the simple Streamlit library:
import streamlit as st
import pickle
import pandas as pd

def recommend(movie):
    # Find the selected movie's index, take its similarity row, and return the five closest titles
    movie_index = movies[movies['title'] == movie].index[0]
    distances = similarity[movie_index]
    movie_list = sorted(list(enumerate(distances)), reverse=True, key=lambda x: x[1])[1:6]
    recommended_movies = []
    for i in movie_list:
        recommended_movies.append(movies.iloc[i[0]].title)
    return recommended_movies

# Load the data pickled during model building
movies_dict = pickle.load(open('movie_dict.pkl', 'rb'))
movies = pd.DataFrame(movies_dict)
similarity = pickle.load(open('similarity.pkl', 'rb'))

st.title('Movie Recommendation System')

selected_movie_name = st.selectbox(
    'Which movie do you want to watch?',
    movies['title'].values)

if st.button('Recommend'):
    recommendations = recommend(selected_movie_name)
    for i in recommendations:
        st.write(i)
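If you save this script as, say, app.py (any filename works), you can start the app from the terminal with streamlit run app.py, which opens the recommender in your browser.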