LanceDB Materialized Views

Materialized Views

feature-engineeringagentsllmsvector-databaselancedbgptopenaiAImultimodal-aitutorialsmachine-learningembeddingsfine-tuningdeep-learninggpt-4-visionllama-indexragmultimodallangchainlancedb-recipes

alph-notebooks/lancedb-recipes / materialized-views.ipynb

Export

Run Notebooks

Contents

No cells yet

Add cells to see them here

In this part of the tutorial, we'll create materialized views: precomputed queries stored as physical tables. Some reasons you might want materialized views:

Training on data only from a certain country
Building different models on shirts, pants, dresses, and coats
Using only a subset of data to test your training pipeline, then swapping in the whole dataset later

LanceDB makes materialized views super easy to create and keep updated; let's jump in!

[ ]

1. Data Ingestion

We'll use the same fashion-product-images dataset as in the Feature Engineering 101 tutorial. All the data loading is the same as that tutorial, but we'll also grab the articleType, and we'll make one important change while creating the Lance table:

Enable Stable Row IDs

In classic Lance tables, compaction may cause row IDs to change. This causes problems for materialized views, which are based on the row IDs. So, when creating our table, we will use the new_table_enable_stable_row_ids option:

	db.create_table(
    name='my_table',
    data=my_data,
    storage_options={'new_table_enable_stable_row_ids': 'true'}
)

[ ]

2. Materialized Views

Imagine that we work at a clothing company, and table represents all the styles of clothing that we sell. Then let's say that we're building a model that focuses on shirts, so we only care about shirts. We could run a query like this every time:

[ ]

But this is unwieldy, and could become expensive if the original table got too big. Instead, we'll create a materialized view so we only have what we need. First we create the view, then we refresh it so it has current data, like so:

[ ]

Like in the 101 demo, we are using db.local_ray_context() to say "run on a Ray instance on this computer." It will be much slower than running on multiple remote machines, but will simplify setup for this tutorial.

Updating our materialized view

The real value of materialized views is the ability to "view" the source table - to reflect changes in it! Let's assume that our source table had 500 more clothing items added to it:

[ ]

Now to see all the new Shirts reflected in the shirts table, we can simply refresh() our materialized view. This will pick up all the new Shirts from the original table. Importantly, it will only process these 500 new rows! For our demo, that's not a big difference, but imagine a table with 10 years' worth of data that adds new data every day: only processing the new day's data will take about 0.02% of the time compared to reprocessing the whole table.

[ ]

Custom columns

You can customize your materialized views by renaming columns and using sql expressions (DataFusion dialect) to create new columns. Say we want to find all Jeans, but our department uses "clothingType" instead of "articleType", and we want a simple flag to separate out blue jeans vs other jeans:

[ ]

Wrapping up

This short demo shows the powerful views you can create very simply with LanceDB Materialized Views. For more, check out our docs. In the next section, we'll go back to working with backfills, focusing on connecting to remote Ray clusters to do feature engineering at production scale.