Build a Multimodal Shopping Agent with Voyage AI and Pixeltable

What happens when a customer searches for "birthday gift for a 6-year-old who loves math and dinosaurs"? Traditional keyword search can fail, because no product listing contains those exact words. Semantic search understands meaning, not just keywords, but meaning alone isn't always enough. Sometimes the customer cares about what a product looks like, not what the description says. And even good search results can be sharpened by reranking them against the original question.

In this tutorial, you'll build a shopping assistant that handles all three: text intent, visual style, and reranked precision. We'll start simple and layer on capabilities, so you can see each piece do its job before we combine them.

┌────────────────────────────────────────────────────────────────────────────────┐
│  DATA              EMBED             SEARCH            AGENT         RERANK    │
│                                                                                │
│  ┌────────┐      ┌──────────┐      ┌──────────┐     ┌──────────┐  ┌────────┐   │
│  │Products│─────▶│ Voyage   │─────▶│Similarity│────▶│ LLM Tool │─▶│Rerank  │   │
│  │  Table │      │Embeddings│      │  Search  │     │ Calling  │  │  2.5   │   │
│  └────────┘      └──────────┘      └──────────┘     └──────────┘  └────────┘   │
│      │                │                  │                │             │      │
│  Amazon           Text +            Query by          Agent picks   Reorder    │
│  product data     image vectors     text              the right     results    │
│                                                       search tool   by fit     │
│                                                                                │
└────────────────────────────────────────────────────────────────────────────────┘

In this notebook

  1. Load product data: Import Amazon listings into a Pixeltable table with images, descriptions, and metadata
  2. Add embedding indexes: Build text and multimodal indexes using Voyage AI models
  3. Search for similar items: Run text similarity and cross-modal (text-to-image) search
  4. Build reusable query functions: Turn ad-hoc queries into persistent, orchestrated workflows
  5. Build a multimodal agent: Wire both search tools into an LLM that picks the right one per question
  6. Improve results with reranking: Use Voyage AI's reranker to sharpen answer quality

Component                 Technology                         Role in this notebook
Embeddings                Voyage AI voyage-3.5               Semantic text retrieval
Reranker                  Voyage AI rerank-2.5               Reorders candidates for precision
Multimodal Search         Voyage AI voyage-multimodal-3.5    Text-to-image retrieval
AI Data Infrastructure    Pixeltable                         Stores data, manages indexes, orchestrates pipelines

About Pixeltable

Pixeltable is AI data infrastructure where the table is the unit of work. A few things worth knowing going in:

  • Incremental by default. Add new data, and only the new rows get processed. Embeddings, transformations, and model calls don't re-run on data that's already been handled.
  • Embedding indexes and similarity search are first-class table operations. You don't bolt on a separate vector database; retrieval lives alongside your data.
  • Images, video, audio, and documents live in table columns right next to structured data like strings, numbers, and JSON.
  • Inputs, intermediates, and outputs all stay queryable. Nothing disappears into a pipeline. Every step is inspectable as table data.

Setup

Install the required packages, set up API keys, and configure the environment.

[ ]
[1]
[2]

Pixeltable uses directories to organize tables. We'll start with a fresh directory for this tutorial.

[ ]
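A sketch of what this cell might contain, using Pixeltable's directory API (the directory name is taken from the table path shown later in the notebook):

```python
import pixeltable as pxt

# Drop any leftover directory from a previous run, then start fresh.
# force=True removes the directory along with any tables inside it.
pxt.drop_dir('ecommerce_search', force=True)
pxt.create_dir('ecommerce_search')
```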

You can view all of your directories:

[ ]

Load Amazon Product Data

We'll use a pre-processed subset of the Amazon Product Dataset 2020, which contains real product listings with rich metadata including:

  • Product names and descriptions
  • Categories and specifications
  • Pricing information
  • One image URL per row

The dataset contains ~1,800 rows from 500 products, with each product having 1-7 images.

The dataset URL uses GitHub raw content for reproducibility:

[4]

We read the parquet file into a Pandas dataframe and pass it as the source for a new Pixeltable table. on_error='ignore' tells Pixeltable to skip rows where Amazon image URLs have expired rather than failing the entire load.

[5]
Created table 'products'.
Inserted 1779 rows with 10 errors across 2 columns (products.None, products.Image) in 11.99 s (148.40 rows/s)
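The load step above might look like the following sketch. The actual parquet URL isn't shown in this notebook export, so a placeholder stands in for it:

```python
import pandas as pd
import pixeltable as pxt

# Placeholder: the notebook points this at a GitHub raw-content parquet URL.
parquet_url = '<github-raw-parquet-url>'
df = pd.read_parquet(parquet_url)

# One call stores structured columns and image references together.
# on_error='ignore' skips rows whose Amazon image URLs have expired
# instead of failing the whole load.
products = pxt.create_table(
    'ecommerce_search/products',
    source=df,
    on_error='ignore',
)
```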

In a typical setup, you'd load this data into Pandas, push it to a database, then separately send images to object storage. Here, one create_table call handles structured data and images together. The 10 errors in the output are rows where Amazon image URLs have expired. Pixeltable logged them and kept going.

Pixeltable can work with files from anywhere: local paths, URLs, or cloud storage. It stores file references and only downloads to disk on access. Learn more about working with remote files, tables and data operations, and the type system.

We can see the table created in our directory:

[6]
['ecommerce_search/products']

You can clean and filter data directly in the table before applying models like embeddings. Here, we need to remove rows where About_Product is None or empty, because these won't produce useful embeddings.

[ ]
37 rows deleted.
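The cleanup cell might look like this sketch, assuming the fluent `where().delete()` form:

```python
import pixeltable as pxt

products = pxt.get_table('ecommerce_search/products')

# Remove rows whose About_Product is missing or empty; they would
# produce useless embeddings. (== None is the expression idiom here.)
products.where(
    (products.About_Product == None) | (products.About_Product == '')
).delete()
```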

Several rows were deleted. Let's count the remaining rows we can work with:

[8]
1742

Pixeltable queries work like you'd expect from SQL:

  • select() picks columns, refined with order_by() and where()
  • collect() executes the query and returns results
  • limit(n), head(n), and tail(n) control how many rows come back
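A sketch of these query patterns. Product_Name appears elsewhere in the notebook; Selling_Price is an assumed column name based on the dataset's pricing metadata:

```python
import pixeltable as pxt

products = pxt.get_table('ecommerce_search/products')

# Pick columns, sort by price (descending), and pull back five rows.
products.select(
    products.Product_Name, products.Selling_Price
).order_by(products.Selling_Price, asc=False).limit(5).collect()

# Keyword filtering combines where() with contains().
products.where(
    products.Product_Name.contains('dinosaur')
).limit(5).collect()
```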

Let's see what products we're working with:

[9]

We can also filter rows with a keyword match using where() and contains():

[10]

There are definitely some dinosaurs in this product catalog. And there are duplicate items, which is par for the course with Amazon since there can be multiple listings for identical items in their marketplace.

Add Text & Multimodal Embeddings for Search

In Pixeltable, an embedding index can be attached to any column. We'll add two to the same table using separate Voyage AI models. Why two?

  • Each captures a different signal. Text embeddings encode what the listing says; image embeddings encode what the product looks like.
  • You can query each one independently. Search descriptions for "dinosaur figurine" or search images for "cuddly plush toy." Different retrieval paths for different questions.
  • You can add more indexes later without touching the existing ones, so you can compare models or methods side by side.

First, text embeddings on the About_Product column:

[11]

Now add an image embedding index on the same table. These take a bit longer to generate because Voyage AI processes the full image for each row.

[12]
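The two index cells above might look like the sketch below. The exact names and location of the Voyage AI embedding UDFs inside pixeltable.functions are assumptions here and may differ in your Pixeltable version; the index names match the txt_idx / img_idx used later:

```python
import pixeltable as pxt
# Assumption: Pixeltable ships Voyage AI embedding UDFs under this module.
from pixeltable.functions.voyageai import embeddings, multimodal_embeddings

products = pxt.get_table('ecommerce_search/products')

# Text index over the product descriptions.
products.add_embedding_index(
    'About_Product',
    idx_name='txt_idx',
    string_embed=embeddings.using(model='voyage-3.5'),
)

# Cross-modal index over the product images: one model embeds both
# text queries and images into the same vector space.
products.add_embedding_index(
    'Image',
    idx_name='img_idx',
    string_embed=multimodal_embeddings.using(model='voyage-multimodal-3.5'),
    image_embed=multimodal_embeddings.using(model='voyage-multimodal-3.5'),
)
```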

We can see both of our new indexes now in our table schema:

[13]

What just happened

Pixeltable sent each product description and image to the Voyage AI APIs, stored the resulting vectors, and built two searchable indexes.

What we didn't set up: a separate vector database, an ETL pipeline to sync product data with an embedding store, a cron job to keep them in agreement. The indexes live in the same table as the data.

Whenever you insert new products into this table later, everything updates incrementally:

  1. The Voyage AI APIs compute embeddings for only the new products
  2. The index adds the new vectors
  3. We can search right away without re-indexing or re-processing any existing rows in our table

Search for Similar Items

Now that we have indexes, we can search. Pixeltable provides a .similarity() method on any indexed column. You pass it a query, and it works like any other expression in order_by, select, and where.

We have two indexes, so we have two ways to search:

  • Text search (txt_idx): pass a text query, find products whose descriptions are semantically similar
  • Image search (img_idx): pass a text query, find products whose photos match visually

Let's try both.

We'll start with a text query on the product descriptions. For text queries, pass the string= argument.

[14]

We'll set a similarity threshold to filter out weak matches. This is optional, but it keeps results focused. You can adjust or remove it depending on how broad you want your search to be.

[15]

Now we compose a query from the similarity expression. These results aren't stored; they're for exploration.

[16]
[17]
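Put together, the text-search cells above might look like this sketch (the query string and the 0.5 threshold are illustrative values):

```python
import pixeltable as pxt

products = pxt.get_table('ecommerce_search/products')

# Similarity against the text index; string= marks this as a text query.
sim = products.About_Product.similarity(
    string='educational dinosaur toys for kids', idx='txt_idx'
)

SIMILARITY_THRESHOLD = 0.5  # optional: filter out weak matches

(
    products.where(sim > SIMILARITY_THRESHOLD)
    .order_by(sim, asc=False)
    .select(products.Product_Name, products.About_Product, score=sim)
    .limit(5)
    .collect()
)
```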

Now for the cross-modal search. This is where Voyage AI's voyage-multimodal-3.5 model comes in: we pass a text query and search across images. The model maps both modalities into the same embedding space, so "plush, cuddly stuffed dinosaur" finds products that look cuddly, even if the listing never uses that word.

[18]
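The cross-modal cell might look like this sketch: the same .similarity() call, but against the image index, so a text query is matched to product photos:

```python
import pixeltable as pxt

products = pxt.get_table('ecommerce_search/products')

# Text query, image index: voyage-multimodal-3.5 embeds both modalities
# into one space, so this retrieves by visual appearance.
sim = products.Image.similarity(
    string='plush, cuddly stuffed dinosaur', idx='img_idx'
)

products.order_by(sim, asc=False).select(
    products.Product_Name, products.Image, score=sim
).limit(5).collect()
```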

Build Reusable Retrieval Queries

Ad-hoc queries work well for exploration, but we can also make search reusable and persistent. In this section, we'll define @pxt.query functions for text and image search, then wire them into a dedicated table so results are computed automatically on every insert.

This gives us two things: a standalone workflow for testing and inspecting search results, and reusable query functions that we'll later hand to an agent as tools.

The pattern has three components:

  1. Query functions (@pxt.query) that turn the search-and-rank pattern into reusable functions. We'll create one for text_search and another for image_search.

  2. A dedicated table with a column for query strings. This gives us a persistent data structure to store queries as rows.

  3. Computed columns that call the query functions. This is the orchestration layer: Pixeltable runs the search functions automatically whenever a new row is inserted.

We define two query functions: text_search (uses txt_idx) and image_search (uses img_idx). Each takes a query string and returns ranked results.

[19]
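A sketch of the two query functions (Selling_Price is an assumed column name; the top-5 limit is illustrative):

```python
import pixeltable as pxt

products = pxt.get_table('ecommerce_search/products')

@pxt.query
def text_search(query_text: str):
    """Top products whose descriptions are semantically close to the query."""
    sim = products.About_Product.similarity(string=query_text, idx='txt_idx')
    return (
        products.order_by(sim, asc=False)
        .select(products.Product_Name, products.Selling_Price, score=sim)
        .limit(5)
    )

@pxt.query
def image_search(query_text: str):
    """Top products whose photos visually match the text query."""
    sim = products.Image.similarity(string=query_text, idx='img_idx')
    return (
        products.order_by(sim, asc=False)
        .select(products.Product_Name, products.Selling_Price, score=sim)
        .limit(5)
    )
```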

Next, we make a fresh table for this workflow to store our search queries for now, and our search results in the next step.

[20]
Created table 'searches_live'.
[21]

Now we add computed columns that call our query functions. This is the orchestration: Pixeltable runs them automatically on every insert.

[22]
Added 0 column values with 0 errors in 0.01 s
Added 0 column values with 0 errors in 0.01 s
No rows affected.
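These two steps might look like the sketch below, assuming the query-string column is named query_text and that text_search / image_search are the @pxt.query functions defined above:

```python
import pixeltable as pxt

# A persistent table whose rows are search queries.
searches = pxt.create_table(
    'ecommerce_search/searches_live', {'query_text': pxt.String}
)

# Orchestration: each computed column calls a query function, so both
# searches run automatically whenever a new query row is inserted.
searches.add_computed_column(text_results=text_search(searches.query_text))
searches.add_computed_column(image_results=image_search(searches.query_text))
```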

The table is empty for now, but it's an orchestrated workflow waiting for data. As soon as we insert a query string, Pixeltable runs the search functions automatically.

[23]
0

Insert queries, and both search functions run automatically.

[24]
Inserted 3 rows with 0 errors in 2.24 s (1.34 rows/s)
3 rows inserted.
[26]

The output looks different from the similarity results earlier. That's a shape difference:

  • Before: one row per product, scalar columns (name, image, score), clean table rendering.
  • Now: one row per query, with search results stored as a JSON list inside a single column. We extract fields with [0:5].Product_Name, which gives arrays instead of individual rows.

Less pretty in a notebook, but more useful in practice: this JSON structure is machine-readable, which is exactly what an agent or application needs downstream.

[29]
[30]

At this point we have the building blocks: a product table with two embedding indexes, similarity search that works across text and images, and reusable query functions wired into an orchestrated workflow. Next, we'll hand these query functions to an LLM as tools and let it decide how to search.

Build a Multimodal Agent

Some searches might be solved by reading the product text description. Others are about what a product looks like, not what the listing says. By giving the agent two tools (each powered by Voyage AI's embeddings), it can choose the right retrieval path based on the question.

The flow:

  1. The LLM receives a question and decides to call a search tool: either text or image.
  2. Pixeltable executes the search and stores the results.
  3. A second LLM call assembles a final answer from what the tool found.

Every step is a computed column, so the whole chain runs automatically when you insert a new question.

The agent uses OpenAI for the LLM calls. Voyage AI handles the retrieval; OpenAI handles the reasoning.

[31]

pxt.tools() takes our query functions and generates the tool schema that OpenAI's API expects. When we pass this to chat_completions in the next step, the LLM can decide to call text_search or image_search, and Pixeltable handles the execution and stores the results.

[32]

The agent table has one input column: the question. Everything else is computed.

[33]
Created table 'mm_agent'.

The system prompt tells the agent what tools it has and when to use each one:

[34]
[35]
Added 0 column values with 0 errors in 0.01 s
Added 0 column values with 0 errors in 0.01 s
No rows affected.
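The tool-calling cells above might look like this sketch. The system-prompt wording, the gpt-4o-mini model name, and the column names are illustrative; text_search and image_search are the query functions defined earlier:

```python
import pixeltable as pxt
from pixeltable.functions.openai import chat_completions, invoke_tools

agent = pxt.get_table('ecommerce_search/mm_agent')  # one input column: question

# Generate an OpenAI-compatible tool schema from the query functions.
tools = pxt.tools(text_search, image_search)

# Illustrative system prompt; the notebook's actual wording isn't shown.
SYSTEM_PROMPT = (
    'You are a shopping assistant. Use text_search for questions about what '
    'a product does or says; use image_search for how a product looks.'
)

messages = [
    {'role': 'system', 'content': SYSTEM_PROMPT},
    {'role': 'user', 'content': agent.question},
]

# First LLM call: the model decides which search tool to invoke...
agent.add_computed_column(
    tool_choice=chat_completions(
        model='gpt-4o-mini', messages=messages, tools=tools
    )
)
# ...and Pixeltable executes the chosen tool, storing its results.
agent.add_computed_column(tool_output=invoke_tools(tools, agent.tool_choice))
```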

The agent's tool results will come back as structured data, but we need to assemble them into a prompt for a second LLM call that produces the final answer. This UDF collects the product names and prices from the tool output and formats them into a concise context block.

[37]
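The assembly logic might look like the plain-Python sketch below; in the notebook it would be decorated with @pxt.udf so it can run as a computed column. The tool-output shape (a dict mapping tool name to a list of result rows) and the field names are assumptions:

```python
def assemble_context(tool_output: dict) -> str:
    """Format tool results into a compact context block for the answer LLM."""
    lines = []
    for results in (tool_output or {}).values():
        for item in results or []:
            name = item.get('Product_Name', 'Unknown product')
            price = item.get('Selling_Price', 'n/a')
            lines.append(f'- {name} (price: {price})')
    if not lines:
        return 'No search results.'
    return 'Products found:\n' + '\n'.join(lines)
```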

Now we close the loop with two more computed columns: one that assembles the answer prompt from tool results, and one that sends that prompt to the LLM for a final answer. When a question is inserted, Pixeltable runs the entire chain automatically.

[38]
Added 0 column values with 0 errors in 0.01 s
Added 0 column values with 0 errors in 0.01 s
No rows affected.
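Those two columns might look like this sketch. assemble_context stands in for the UDF from the previous cell, and the model name and prompt wording are illustrative:

```python
import pixeltable as pxt
from pixeltable.functions.openai import chat_completions

agent = pxt.get_table('ecommerce_search/mm_agent')

# Assemble the answer prompt from the stored tool results.
agent.add_computed_column(answer_prompt=assemble_context(agent.tool_output))

# Second LLM call: produce the final recommendation from that context.
# (Assumes + concatenation is supported on string expressions.)
answer_messages = [
    {'role': 'system', 'content': 'Answer using only the products provided.'},
    {'role': 'user', 'content': agent.question + '\n\n' + agent.answer_prompt},
]
agent.add_computed_column(
    response=chat_completions(model='gpt-4o-mini', messages=answer_messages)
)

# Pull the answer text out of the OpenAI response structure.
agent.add_computed_column(answer=agent.response.choices[0].message.content)
```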
[39]

Look at the Computed With column in the schema above. The table is the workflow documentation. You can see the entire agent pipeline at a glance: LLM call, tool invocation, answer assembly, final response. If someone new joins the project, the schema tells them what happens and in what order.

The pipeline is ready. Let's give it some questions.

[40]
Inserted 3 rows with 0 errors in 10.34 s (0.29 rows/s)
3 rows inserted.

One insert, and Pixeltable ran the full chain: an LLM call that chose to invoke the text search tool, the tool execution itself, answer prompt assembly, and a second LLM call to produce the final recommendation. Every intermediate result is stored in the table. We can inspect any step.

[41]

Try inserting your own questions. The pipeline is live: every new row triggers the full chain, from tool selection through search to the final answer.

[42]
Inserted 1 row with 0 errors in 0.86 s (1.17 rows/s)
[43]
Inserted 1 row with 0 errors in 0.87 s (1.14 rows/s)

Improve Results with Reranking

Embedding search is fast and broad. It casts a wide net. But the top results aren't always the best fit for the specific question. Voyage AI's rerank-2.5 model solves this with a two-stage retrieval pattern:

  • Retrieve broadly. Use embeddings to pull back a generous set of candidates (top 15).
  • Score precisely. Pass those candidates through the reranker, which reads each one against the original query and reorders them by relevance.

The result is sharper answers with the same embedding infrastructure we already have. We'll add reranking as computed columns, first to our searches table for inspection, then to the agent.
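Outside Pixeltable, the two-stage pattern can be sketched directly with the Voyage AI Python client (illustrative; requires a VOYAGE_API_KEY, and the candidate list comes from stage one):

```python
import voyageai

vo = voyageai.Client()  # reads VOYAGE_API_KEY from the environment

query = 'birthday gift for a 6-year-old who loves math and dinosaurs'

# Stage 1 output: the top-15 description strings from embedding search.
candidates = [...]  # placeholder for the retrieved descriptions

# Stage 2: the reranker reads each candidate against the query
# and returns them reordered by relevance.
reranking = vo.rerank(query, candidates, model='rerank-2.5', top_k=5)
for result in reranking.results:
    print(result.relevance_score, result.document)
```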

To feed the reranker, we need a richer version of our text search that retrieves more candidates and includes the full product description. We also need two helper UDFs: one to extract the description strings the reranker expects, and another to format reranked results for display.
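The two helpers might look like the plain-Python sketches below; in the notebook each would be wrapped with @pxt.udf. Field names and the rank-order representation are assumptions:

```python
def extract_descriptions(candidates: list) -> list:
    """Pull the raw description strings the reranker expects."""
    return [c.get('About_Product', '') for c in candidates or []]

def format_reranked(candidates: list, rerank_order: list) -> list:
    """Reorder candidate product names by the reranker's ranking.

    rerank_order is a list of candidate indices, best match first,
    as returned by a reranking call.
    """
    return [candidates[i]['Product_Name'] for i in rerank_order]
```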

[44]
[45]
[46]
[47]
[48]
Added 5 column values with 0 errors in 1.59 s (3.14 rows/s)
Added 5 column values with 0 errors in 0.60 s (8.34 rows/s)
5 rows updated.
[49]

Let's test it. We'll insert a query and compare the embedding-only top 5 against the reranked top 5 to see whether reranking surfaces more relevant products.

[50]
Inserted 1 row with 0 errors in 1.11 s (0.90 rows/s)
1 row inserted.
[51]

Compare the two columns. Embedding search found "dinosaur" products broadly; the reranker recognized which ones are actually about math-and-dinosaur toys for kids. That's the difference between "related to the topic" and "answers the question."

Now we can wire reranking into the multimodal agent. We need one more UDF to assemble a prompt from reranked evidence, and then three computed columns that retrieve candidates, rerank them, and generate an answer grounded in the reranked results rather than raw embedding scores.

[52]

Wire it up:

[53]
Added 3 column values with 0 errors in 1.05 s (2.86 rows/s)
Added 3 column values with 0 errors in 0.39 s (7.64 rows/s)
Added 3 column values with 0 errors in 4.08 s (0.74 rows/s)
3 rows updated.
[54]

Compare the answer and answer_reranked columns. The embedding-only answer recommends products based on broad similarity. The reranked answer is grounded in evidence that was scored specifically against the question. Same agent, same tools, better retrieval feeding the final LLM call.

Summary

You built a shopping agent that combines three Voyage AI capabilities inside a single Pixeltable workflow:

  • Semantic text search with voyage-3.5 embeddings finds products by what their descriptions mean, not just what they say.
  • Multimodal image search with voyage-multimodal-3.5 finds products by what they look like, using a text query to search across images.
  • Reranking with rerank-2.5 takes the broad results from embedding search and reorders them by how well they actually fit the question.

The key Pixeltable idea running through all of this: data, retrieval, and orchestration live in tables with computed columns. That means every step, from raw product data to embedding vectors to agent answers, stays incremental and queryable. Insert a new question, and the whole chain runs. Insert new products, and the indexes update without reprocessing what's already there.

Without Pixeltable, building this means stitching together a database, a vector store, an orchestration framework, and custom glue code to keep them synchronized. Here, every component lives in one system: the data, the embeddings, the search indexes, the agent logic, and the results. When something goes wrong, you can query any step.

From here, you could swap in a different embedding model, add product categories as filters, or point the agent at your own catalog. The table is live; everything you've built persists and updates incrementally. Try extending the notebook and see what works.

Learn More