Product Recommender using Collaborative Filtering and LanceDB
We are going to use LanceDB and collaborative filtering to recommend products based on a user's past purchase history. This example uses the Instacart dataset.
Credentials
Copy and paste the project name and the API key from your project page. These will be used later to connect to LanceDB Cloud.
You can also set the LANCEDB_API_KEY as an environment variable. More details can be found here.
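A minimal sketch of reading the key from the environment, assuming LANCEDB_API_KEY has already been exported. The helper name `cloud_connection_args` and the project name in the usage comment are hypothetical, and the `db://` URI form and `api_key` argument are assumptions about `lancedb.connect` that should be checked against the LanceDB version you install.

```python
import os


def cloud_connection_args(project_name, env=None):
    """Build keyword arguments for a LanceDB Cloud connection (hypothetical helper)."""
    env = os.environ if env is None else env
    api_key = env.get("LANCEDB_API_KEY")
    if not api_key:
        raise RuntimeError("LANCEDB_API_KEY is not set")
    # Assumption: LanceDB Cloud databases are addressed with a db:// URI.
    return {"uri": f"db://{project_name}", "api_key": api_key}


# Usage (not executed here, requires the lancedb package):
#   import lancedb
#   db = lancedb.connect(**cloud_connection_args("your-project-name"))
```

Failing early when the variable is missing avoids a less obvious authentication error later in the notebook.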
Get dataset
Download and unzip the dataset from the LanceDB S3 bucket.
--2024-01-23 03:30:37-- http://vectordb-recipes.s3.us-west-2.amazonaws.com/product-recommender.zip
Resolving vectordb-recipes.s3.us-west-2.amazonaws.com (vectordb-recipes.s3.us-west-2.amazonaws.com)... 3.5.84.12, 3.5.84.155, 3.5.84.131, ...
Connecting to vectordb-recipes.s3.us-west-2.amazonaws.com (vectordb-recipes.s3.us-west-2.amazonaws.com)|3.5.84.12|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 411510857 (392M) [application/zip]
Saving to: ‘product-recommender.zip’
product-recommender 100%[===================>] 392.45M 22.5MB/s in 19s
2024-01-23 03:30:56 (20.8 MB/s) - ‘product-recommender.zip’ saved [411510857/411510857]
Archive: product-recommender.zip
creating: product-recommender/
inflating: __MACOSX/._product-recommender
inflating: product-recommender/order_products__prior.csv.zip
inflating: __MACOSX/product-recommender/._order_products__prior.csv.zip
inflating: product-recommender/order_products__train.csv.zip
inflating: __MACOSX/product-recommender/._order_products__train.csv.zip
inflating: product-recommender/orders.csv.zip
inflating: __MACOSX/product-recommender/._orders.csv.zip
inflating: product-recommender/products.csv.zip
inflating: __MACOSX/product-recommender/._products.csv.zip
inflating: product-recommender/instacart-market-basket-analysis.zip
inflating: __MACOSX/product-recommender/._instacart-market-basket-analysis.zip
Install dependencies:
Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (1.23.5)
Requirement already satisfied: pandas in /usr/local/lib/python3.10/dist-packages (1.5.3)
Requirement already satisfied: scipy in /usr/local/lib/python3.10/dist-packages (1.11.4)
Requirement already satisfied: kaggle in /usr/local/lib/python3.10/dist-packages (1.5.16)
Collecting implicit
Downloading implicit-0.7.2-cp310-cp310-manylinux2014_x86_64.whl (8.9 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 8.9/8.9 MB 4.6 MB/s eta 0:00:00
Requirement already satisfied: torch in /usr/local/lib/python3.10/dist-packages (2.1.0+cu121)
Collecting lancedb
Downloading lancedb-0.5.0-py3-none-any.whl (87 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 87.4/87.4 kB 10.7 MB/s eta 0:00:00
Requirement already satisfied: python-dateutil>=2.8.1 in /usr/local/lib/python3.10/dist-packages (from pandas) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas) (2023.3.post1)
Requirement already satisfied: six>=1.10 in /usr/local/lib/python3.10/dist-packages (from kaggle) (1.16.0)
Requirement already satisfied: certifi in /usr/local/lib/python3.10/dist-packages (from kaggle) (2023.11.17)
Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from kaggle) (2.31.0)
Requirement already satisfied: tqdm in /usr/local/lib/python3.10/dist-packages (from kaggle) (4.66.1)
Requirement already satisfied: python-slugify in /usr/local/lib/python3.10/dist-packages (from kaggle) (8.0.1)
Requirement already satisfied: urllib3 in /usr/local/lib/python3.10/dist-packages (from kaggle) (2.0.7)
Requirement already satisfied: bleach in /usr/local/lib/python3.10/dist-packages (from kaggle) (6.1.0)
Requirement already satisfied: threadpoolctl in /usr/local/lib/python3.10/dist-packages (from implicit) (3.2.0)
Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from torch) (3.13.1)
Requirement already satisfied: typing-extensions in /usr/local/lib/python3.10/dist-packages (from torch) (4.5.0)
Requirement already satisfied: sympy in /usr/local/lib/python3.10/dist-packages (from torch) (1.12)
Requirement already satisfied: networkx in /usr/local/lib/python3.10/dist-packages (from torch) (3.2.1)
Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from torch) (3.1.3)
Requirement already satisfied: fsspec in /usr/local/lib/python3.10/dist-packages (from torch) (2023.6.0)
Requirement already satisfied: triton==2.1.0 in /usr/local/lib/python3.10/dist-packages (from torch) (2.1.0)
Collecting deprecation (from lancedb)
Downloading deprecation-2.1.0-py2.py3-none-any.whl (11 kB)
Collecting pylance==0.9.6 (from lancedb)
Downloading pylance-0.9.6-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.6 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 18.6/18.6 MB 14.4 MB/s eta 0:00:00
Collecting ratelimiter~=1.0 (from lancedb)
Downloading ratelimiter-1.2.0.post0-py3-none-any.whl (6.6 kB)
Collecting retry>=0.9.2 (from lancedb)
Downloading retry-0.9.2-py2.py3-none-any.whl (8.0 kB)
Requirement already satisfied: pydantic>=1.10 in /usr/local/lib/python3.10/dist-packages (from lancedb) (1.10.13)
Requirement already satisfied: attrs>=21.3.0 in /usr/local/lib/python3.10/dist-packages (from lancedb) (23.2.0)
Collecting semver>=3.0 (from lancedb)
Downloading semver-3.0.2-py3-none-any.whl (17 kB)
Requirement already satisfied: cachetools in /usr/local/lib/python3.10/dist-packages (from lancedb) (5.3.2)
Requirement already satisfied: pyyaml>=6.0 in /usr/local/lib/python3.10/dist-packages (from lancedb) (6.0.1)
Requirement already satisfied: click>=8.1.7 in /usr/local/lib/python3.10/dist-packages (from lancedb) (8.1.7)
Collecting overrides>=0.7 (from lancedb)
Downloading overrides-7.6.0-py3-none-any.whl (17 kB)
Collecting pyarrow>=12 (from pylance==0.9.6->lancedb)
Downloading pyarrow-15.0.0-cp310-cp310-manylinux_2_28_x86_64.whl (38.3 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 38.3/38.3 MB 8.4 MB/s eta 0:00:00
Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests->kaggle) (3.3.2)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->kaggle) (3.6)
Requirement already satisfied: decorator>=3.4.2 in /usr/local/lib/python3.10/dist-packages (from retry>=0.9.2->lancedb) (4.4.2)
Collecting py<2.0.0,>=1.4.26 (from retry>=0.9.2->lancedb)
Downloading py-1.11.0-py2.py3-none-any.whl (98 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 98.7/98.7 kB 13.8 MB/s eta 0:00:00
Requirement already satisfied: webencodings in /usr/local/lib/python3.10/dist-packages (from bleach->kaggle) (0.5.1)
Requirement already satisfied: packaging in /usr/local/lib/python3.10/dist-packages (from deprecation->lancedb) (23.2)
Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2->torch) (2.1.3)
Requirement already satisfied: text-unidecode>=1.3 in /usr/local/lib/python3.10/dist-packages (from python-slugify->kaggle) (1.3)
Requirement already satisfied: mpmath>=0.19 in /usr/local/lib/python3.10/dist-packages (from sympy->torch) (1.3.0)
Installing collected packages: ratelimiter, semver, pyarrow, py, overrides, deprecation, retry, pylance, implicit, lancedb
Attempting uninstall: pyarrow
Found existing installation: pyarrow 10.0.1
Uninstalling pyarrow-10.0.1:
Successfully uninstalled pyarrow-10.0.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
ibis-framework 7.1.0 requires pyarrow<15,>=2, but you have pyarrow 15.0.0 which is incompatible.
Successfully installed deprecation-2.1.0 implicit-0.7.2 lancedb-0.5.0 overrides-7.6.0 py-1.11.0 pyarrow-15.0.0 pylance-0.9.6 ratelimiter-1.2.0.post0 retry-0.9.2 semver-3.0.2
First, let's import all the required modules for this example.
We must now extract the zip files.
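One way to do the extraction with the standard library is sketched below. The helper name `extract_all_zips` is hypothetical, and the sketch assumes every archive should be unpacked into the folder it sits in; the notebook's own cell may do this differently.

```python
import zipfile
from pathlib import Path


def extract_all_zips(folder):
    """Extract every .zip archive found in `folder` into that same folder."""
    folder = Path(folder)
    extracted = []
    for archive in sorted(folder.glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(folder)
            extracted.extend(zf.namelist())
    return extracted  # names of the files that were unpacked


# Usage (folder name taken from the download step):
#   extract_all_zips("product-recommender")
```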
Now we can move on to loading the dataset. We'll first read the CSV files and create dataframes.
Since there isn't a user rating attribute, we'll gather "confidence" data by looking at the frequency of each item purchased by a user, and store this in the data dataframe.
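The idea can be sketched without any dependencies: count how often each (user, product) pair occurs and treat the count as the confidence value. The helper name and the ids below are made up for illustration; the notebook itself works on pandas dataframes.

```python
from collections import Counter


def purchase_confidence(purchases):
    """Count how often each (user_id, product_id) pair occurs."""
    return Counter(purchases)


# Toy purchase log: one (user_id, product_id) tuple per purchased item.
purchases = [(1, 10), (1, 10), (1, 20), (2, 10)]
confidence = purchase_confidence(purchases)
# confidence[(1, 10)] is 2 because user 1 bought product 10 twice.
```

With pandas, the same counts would typically come from a `groupby` on the user and product columns followed by `size()`.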
Let's create a couple of test users to examine the recommendations later:
- 1st test user: buys 50 sodas: Zero Calorie Cola
- 2nd test user: buys organic produce: Organic Whole Milk and Organic Blackberries
13863749
In the next step, we will extract user and product unique ids, in order to create a CSR (Compressed Sparse Row) matrix. This will allow us to perform collaborative filtering.
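To make the CSR layout concrete, here is a dependency-free sketch that maps the unique ids to row and column positions and fills the three CSR arrays by hand. The function name and toy values are illustrative assumptions; in practice a library such as SciPy builds the matrix for you.

```python
def build_csr(triples):
    """Build CSR arrays from (user_id, product_id, value) triples."""
    users = sorted({u for u, _, _ in triples})
    products = sorted({p for _, p, _ in triples})
    user_index = {u: i for i, u in enumerate(users)}
    product_index = {p: j for j, p in enumerate(products)}

    # Group the non-zero entries by row (one row per user).
    rows = [[] for _ in users]
    for u, p, value in triples:
        rows[user_index[u]].append((product_index[p], value))

    data, indices, indptr = [], [], [0]
    for row in rows:
        for col, value in sorted(row):
            indices.append(col)  # column position of this value
            data.append(value)   # the stored non-zero value
        indptr.append(len(indices))  # where the next row starts
    return data, indices, indptr, users, products


triples = [(7, 100, 2), (7, 300, 1), (9, 200, 5)]
data, indices, indptr, users, products = build_csr(triples)
```

Row `i` of the matrix occupies `data[indptr[i]:indptr[i + 1]]`, which is why only the non-zero purchases need to be stored. The three arrays have the form accepted by `scipy.sparse.csr_matrix((data, indices, indptr), shape=...)`.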
Let's now create a recommender model using the implicit library. The model is based on the algorithms described in the paper "Collaborative Filtering for Implicit Feedback Datasets", with performance optimizations described in "Applications of the Conjugate Gradient Method for Implicit Feedback Collaborative Filtering".
Note: this step will take about 17 minutes with the current parameter setup.
/usr/local/lib/python3.10/dist-packages/implicit/cpu/als.py:95: RuntimeWarning: OpenBLAS is configured to use 2 threads. It is highly recommended to disable its internal threadpool by setting the environment variable 'OPENBLAS_NUM_THREADS=1' or by calling 'threadpoolctl.threadpool_limits(1, "blas")'. Having OpenBLAS use a threadpool can lead to severe performance issues here.
check_blas_config()
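The warning above names an environment variable that addresses it. A small sketch of setting it from Python follows, assuming the cell runs before NumPy or implicit is imported, since OpenBLAS typically reads the variable when it is loaded.

```python
import os

# Limit OpenBLAS to one thread, as the RuntimeWarning recommends.
# Assumption: this runs before numpy/implicit are imported in the session.
os.environ["OPENBLAS_NUM_THREADS"] = "1"
```

If the libraries are already imported, restarting the runtime or using the `threadpoolctl` call mentioned in the warning is the alternative.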
Let's now evaluate the model.
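As background for reading a precision score, here is a sketch of the textbook precision-at-k definition: the share of the top k recommendations that the user actually interacted with. The function and toy values are illustrative, and the implicit library's own evaluation may normalize differently, so treat this as a conceptual aid rather than a reimplementation.

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended items that are in the relevant set."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k


# Two of the four recommended items ("b" and "d") were actually bought.
score = precision_at_k(["a", "b", "c", "d"], {"b", "d", "x"}, k=4)
```

The metrics reported by the notebook's own evaluation run are shown next.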
{'precision': 0.2742377453615933,
 'map': 0.04506404325620732,
 'ndcg': 0.1449554399501384,
 'auc': 0.6549935260418878}
From the model, we'll be able to retrieve item and user factors, which we can later store in LanceDB as vector embeddings.
array([[ 4.18832153e-03,  3.25558195e-03, -1.20758591e-02, ...,
         4.26774239e-03,  1.29361469e-02,  3.41003831e-03],
       [ 4.08880366e-03,  1.89150311e-03,  3.25225573e-03, ...,
         3.79333482e-03,  3.88169941e-03,  4.75861132e-03]], dtype=float32)
array([[-0.48312342, -0.16332878, -0.27058715, ..., -1.1462147 ,
         1.3886684 ,  1.3426281 ],
       [-0.48055026, -1.076108  ,  1.2871186 , ...,  0.736535  ,
        -1.2620703 , -0.16571261]], dtype=float32)
Let's save the data and create an empty LanceDB Table using a Pydantic model.
A Table is designed to store large numbers of columns and huge quantities of data! For those interested, LanceDB is columnar and stores its data in Lance, an open data format.
Let's now store our item factors into the table via the vector column of product_entries.
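A rough sketch of how such rows could be assembled before adding them to the table, pairing each product with its factor vector. The helper name, the `product_id` and `product_name` field names, and the toy ids and vectors are assumptions for illustration; only the `vector` column and the `product_entries` name come from the text above.

```python
def build_product_entries(product_ids, product_names, item_factors):
    """Pair each product with its factor vector as a list of row dicts."""
    if not (len(product_ids) == len(product_names) == len(item_factors)):
        raise ValueError("inputs must have the same length")
    return [
        {
            "product_id": pid,
            "product_name": name,
            # Plain Python floats keep the rows easy to serialize.
            "vector": [float(x) for x in vec],
        }
        for pid, name, vec in zip(product_ids, product_names, item_factors)
    ]


# Toy example with two-dimensional factors (real factors are much longer).
product_entries = build_product_entries(
    [1, 2],
    ["Zero Calorie Cola", "Organic Whole Milk"],
    [[0.1, 0.2], [0.3, 0.4]],
)
```

Checking the lengths up front guards against silently misaligning products and vectors, which `zip` alone would not report.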
Let's create an ANN index in order to speed up retrieval. This might take a while.
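For intuition, an ANN index approximates the exhaustive search sketched below, which scores every stored vector against the query and keeps the best matches. The function name and toy vectors are hypothetical, and the dot product is an assumed similarity measure here, chosen because matrix-factorization models score a user-item pair by the dot product of their factors.

```python
def top_n_by_dot_product(query, entries, n):
    """Exhaustively score every entry against the query and return the top n names."""
    scored = []
    for entry in entries:
        score = sum(q * v for q, v in zip(query, entry["vector"]))
        scored.append((score, entry["product_name"]))
    # Highest score first; an ANN index avoids scanning every entry like this.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:n]]


entries = [
    {"product_name": "A", "vector": [0.9, 0.1]},
    {"product_name": "B", "vector": [0.1, 0.9]},
    {"product_name": "C", "vector": [0.5, 0.5]},
]
nearest = top_n_by_dot_product([1.0, 0.0], entries, n=2)
```

The exhaustive scan costs time proportional to the number of stored products per query, which is the cost the index is meant to reduce in exchange for approximate results.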
This is a helper method for analysing recommendations later. It returns the top N products that someone bought in the past (based on product quantity).
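A dependency-free sketch of such a helper is shown below. The function name, the tuple layout of the purchase log, and the quantities for the second user are made up for illustration; the notebook's own helper operates on its dataframes.

```python
from collections import Counter


def top_purchased(purchases, user_id, n):
    """Return the n products a user bought most, as (product_name, quantity) pairs."""
    counts = Counter()
    for uid, product_name, quantity in purchases:
        if uid == user_id:
            counts[product_name] += quantity
    return counts.most_common(n)


# Toy purchase log: (user_id, product_name, quantity).
purchases = [
    (1, "Zero Calorie Cola", 50),
    (2, "Organic Whole Milk", 3),
    (2, "Organic Blackberries", 2),
    (2, "Organic Whole Milk", 1),
]
favourites = top_purchased(purchases, user_id=2, n=1)
```

Comparing this list with the recommendations for the same user gives a quick sanity check on whether the model's suggestions resemble past behaviour.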
Let's retrieve our test users so we can query for recommendations.