Movie Recommender System
This notebook demonstrates how Pinecone's similarity search as a service helps you build a simple Movie Recommender System. There are three parts to this recommender system:
- A dataset containing movie ratings
- Two deep learning models for embedding movies and users
- A vector index to perform similarity search on those embeddings
The architecture of our recommender system is shown below. We have two models, a user model and a movie model, which generate embeddings for users and movies. The two models are trained so that the proximity between a user and a movie in the shared vector space depends on the rating the user gave that movie: a highly rated movie ends up close to the user, and a poorly rated one ends up farther away. As a result, users with similar movie preferences, and the movies they rated highly, end up close together in the vector space. A similarity search in this space for a given user therefore returns new recommendations based on movie preferences shared with other users.
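To make the training idea concrete, here is a minimal sketch in plain Python. It is not the notebook's Keras models: it assumes 2-dimensional vectors, a dot-product score, ratings rescaled to the range 0 to 1, and plain stochastic gradient descent on squared error. All IDs and numbers are hypothetical.

```python
# Toy illustration only: learn small user and movie vectors so that their
# dot product approximates the rating the user gave the movie.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def train(ratings, users, movies, lr=0.05, epochs=500):
    """Fit embeddings by stochastic gradient descent on squared error."""
    for _ in range(epochs):
        for user_id, movie_id, rating in ratings:
            u, m = users[user_id], movies[movie_id]
            err = dot(u, m) - rating
            # Both updates use the vectors as they were before this step.
            users[user_id] = [ui - lr * err * mi for ui, mi in zip(u, m)]
            movies[movie_id] = [mi - lr * err * ui for ui, mi in zip(u, m)]
    return users, movies

# Hypothetical ratings (rescaled to 0..1) and fixed, non-random start vectors.
ratings = [("u1", "hero_film", 1.0), ("u1", "drama", 0.2),
           ("u2", "hero_film", 0.9), ("u2", "drama", 0.1)]
users = {"u1": [0.1, 0.2], "u2": [0.2, 0.1]}
movies = {"hero_film": [0.1, 0.1], "drama": [0.1, -0.1]}

users, movies = train(ratings, users, movies)
score_high = dot(users["u1"], movies["hero_film"])  # pair with a high rating
score_low = dot(users["u1"], movies["drama"])       # pair with a low rating
```

After training, `score_high` exceeds `score_low`, so under this similarity measure the highly rated movie sits closer to the user than the poorly rated one.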

Install Dependencies
Successfully installed datasets-2.4.0 dnspython-2.2.1 huggingface-hub-0.9.1 loguru-0.6.0 multiprocess-0.70.13 pinecone-client-2.0.13 responses-0.18.0 tokenizers-0.12.1 transformers-4.21.2 urllib3-1.25.11 xxhash-3.0.0
Load the Dataset
We will use a subset of the MovieLens 25M Dataset in this project. This subset contains ~1M user ratings provided by over 30k unique users for the ~10k most recent movies in the MovieLens 25M Dataset. The subset is available on Hugging Face Datasets.
Initialize Embedding Models
The user_model and movie_model are trained using TensorFlow Keras. The user_model transforms a given user_id into a 32-dimensional embedding in the same vector space as the movies, representing the user’s movie preference. The movie recommendations are then fetched based on proximity to the user’s location in the multi-dimensional space.
Similarly, the movie_model transforms a given movie_id into a 32-dimensional embedding in the same vector space as other similar movies — making it possible to find movies similar to a given movie.
Create Pinecone Index
To create our vector index, we first need to initialize our connection to Pinecone. This requires a free API key.
Next we create a new index called "movie-emb". The name itself isn't important.
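A minimal sketch of both steps, assuming the 2.x `pinecone-client` API (`pinecone.init`, `pinecone.list_indexes`, `pinecone.create_index`, `pinecone.Index`). The `environment` argument, the cosine metric, and the helper name `connect_index` are assumptions rather than details from this notebook; the dimension of 32 matches the embedding size of the models above.

```python
INDEX_NAME = "movie-emb"
EMBEDDING_DIM = 32  # must match the size of the vectors produced by the models

def connect_index(api_key, environment, metric="cosine"):
    """Connect to Pinecone and return a handle to the index, creating it if needed."""
    import pinecone  # imported here so the sketch can be read without the package

    pinecone.init(api_key=api_key, environment=environment)
    # Only create the index on the first run; later runs reuse it.
    if INDEX_NAME not in pinecone.list_indexes():
        pinecone.create_index(INDEX_NAME, dimension=EMBEDDING_DIM, metric=metric)
    return pinecone.Index(INDEX_NAME)

# Example call (placeholder values, replace with your own credentials):
# index = connect_index(api_key="YOUR_API_KEY", environment="YOUR_ENVIRONMENT")
```

Checking `list_indexes()` first makes the cell safe to re-run, since creating an index that already exists would otherwise raise an error.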
Create Movie Embeddings
We will be creating movie embeddings using the pretrained movie_model. All of the movie embeddings will be upserted to the new "movie-emb" index in Pinecone.
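The upsert is usually done in batches rather than in one request. The sketch below shows only the batching logic; the batch size of 64, the helper names, and the zero-valued stand-in vectors are assumptions. In the real notebook the embeddings would come from movie_model.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def to_upsert_tuples(movie_ids, embeddings):
    """Pair each ID (as a string) with its embedding as a plain list of floats."""
    return [(str(movie_id), [float(x) for x in emb])
            for movie_id, emb in zip(movie_ids, embeddings)]

# Hypothetical stand-in data: 10,269 IDs with 32-dimensional zero vectors.
movie_ids = list(range(10269))
embeddings = [[0.0] * 32 for _ in movie_ids]

batches = [to_upsert_tuples(ids, embs)
           for ids, embs in zip(batched(movie_ids, 64), batched(embeddings, 64))]

# With a real index handle, each batch would then be sent with:
#     index.upsert(vectors=batch)
```

With 10,269 vectors and a batch size of 64, this produces 161 batches, the last of which holds the remaining 29 vectors.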
After the upsert, the index statistics show 10,269 stored vectors:
{'dimension': 32,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 10269}},
 'total_vector_count': 10269}
Get Recommendations
We now have movie embeddings stored in Pinecone. To get recommendations we can do two things:
- Get a user embedding via a user embedding model and our user_ids, and retrieve the movie embeddings (from Pinecone) that are most similar.
- Use an existing movie embedding to retrieve other similar movies.
Both use the same approach; the only differences are the source of the data (user vs. movie) and the embedding model (user vs. movie).
We will start with the first task.
We will start by looking at a user's top-rated movies. We can find this information in the movies dataframe by filtering for the ratings of a specific user (by their user_id) and ordering them by rating score.
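The filter-and-sort step can be sketched in plain Python over a list of row dictionaries. The notebook itself works on a dataframe, so this is only an illustration; the column names `user_id`, `title`, and `rating`, the `limit` parameter, and the sample rows are hypothetical.

```python
def top_movies_user_rated(rows, user_id, limit=15):
    """Return one user's rated movies, ordered from highest to lowest rating."""
    rated = [row for row in rows if row["user_id"] == user_id]
    # sort() is stable, so movies with equal ratings keep their original order.
    rated.sort(key=lambda row: row["rating"], reverse=True)
    return rated[:limit]

# Hypothetical rows; the real dataset's column names and values may differ.
rows = [
    {"user_id": 3, "title": "Movie A", "rating": 2.5},
    {"user_id": 3, "title": "Movie B", "rating": 4.5},
    {"user_id": 7, "title": "Movie C", "rating": 5.0},
    {"user_id": 3, "title": "Movie D", "rating": 2.5},
    {"user_id": 3, "title": "Movie E", "rating": 4.0},
]

top = top_movies_user_rated(rows, user_id=3)
top_titles = [row["title"] for row in top]
top_ratings = [row["rating"] for row in top]
```

On a pandas dataframe the same idea would be a boolean mask on the user column followed by `sort_values` on the rating column with `ascending=False`.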
After this, we can define a function called display_posters that will take a list of movie posters (like those returned by top_movies_user_rated) and display them in the notebook.
Let's take a look at user 3's top-rated movies:
[4.5, 4.0, 4.0, 2.5, 2.5]
User 3 has rated these five movies, giving good scores to Big Hero 6, Civil War, and Avengers. They seem less enthusiastic about more sci-fi-oriented films like Arrival and The Martian.
Now let's see how to make some movie recommendations for this user.
Start by defining the get_recommendations function. Given a specific user_id, this uses the user_model to create a user embedding (xq). It then retrieves the most similar movie vectors from Pinecone (xc), and extracts the relevant movie posters so we can display them later.
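The flow of such a function can be sketched without any external service. In this stand-in, a dictionary lookup replaces the user_model call, a brute-force dot-product scan replaces the Pinecone query, and the IDs, vectors, and poster file names are all hypothetical. The dot-product metric is also an assumption.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def query_top_k(index, xq, top_k):
    """Brute-force stand-in for a vector index query: score every stored vector."""
    scored = [(movie_id, dot(xq, vec)) for movie_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

def get_recommendations(user_id, user_embeddings, index, posters, top_k=2):
    """Embed the user, retrieve the closest movies, and return their posters."""
    xq = user_embeddings[user_id]        # stands in for the user_model prediction
    xc = query_top_k(index, xq, top_k)   # stands in for the Pinecone query
    return [posters[movie_id] for movie_id, _score in xc]

# Hypothetical 2-dimensional data for illustration.
user_embeddings = {3: [0.9, 0.1]}
index = {"tt001": [1.0, 0.0], "tt002": [0.0, 1.0], "tt003": [0.8, 0.3]}
posters = {"tt001": "poster_1.jpg", "tt002": "poster_2.jpg", "tt003": "poster_3.jpg"}

recommended = get_recommendations(3, user_embeddings, index, posters)
```

Here `recommended` contains the posters for `tt001` and `tt003`, the two movies whose vectors point in nearly the same direction as the user's vector.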
Recommendations for User
That looks good: the top results match the user's three favorite movies. After these we see a lot of Marvel superhero films, which user 3 will probably enjoy, judging by their current ratings.
Let's look at another user. This time we choose user 128.
[4.5, 4.5, 4.5, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0]
Because this user seems to like everything, they also get recommended a mix of different things...
[5.0, 4.0, 3.5, 3.5, 3.5, 3.0, 1.0]
We can see more of a trend towards action films with this user, so we can expect to see similar action-focused recommendations.
Find Similar Movies
Now let's see how to find some similar movies.
Start by defining the get_similar_movies function. Given a specific imdb_id, we query directly using the pre-existing embedding for that ID stored in Pinecone.
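The key difference from the user case is that no model is called: the query vector is the one already stored under the given ID. A minimal in-memory sketch, assuming cosine similarity and using hypothetical IDs and 2-dimensional vectors in place of the real index:

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def get_similar_movies(imdb_id, index, top_k=3):
    """Look up the stored embedding for imdb_id and rank all movies against it."""
    xq = index[imdb_id]  # reuse the stored vector instead of re-embedding
    scored = [(other_id, cosine(xq, vec)) for other_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Hypothetical stored embeddings.
index = {"tt_a": [1.0, 0.2], "tt_b": [0.9, 0.3], "tt_c": [-0.5, 1.0]}

similar = get_similar_movies("tt_a", index)
similar_ids = [movie_id for movie_id, _score in similar]
```

Because the query vector is itself stored in the index, the first hit is always the queried movie, with a similarity of 1 up to floating-point rounding. Drop the first result if only other movies should be shown.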
Now we have Avengers: Infinity War. Let's find movies that are similar to this movie.
The top results closely match Avengers: Infinity War, with the most similar movie being the movie itself. After that we see a lot of other Marvel superhero films.
Let's try another movie, this time a cartoon.
The result quality is good again, with the top results returning plenty of cartoons.