Feature Engineering 101
Feature Engineering with LanceDB and Geneva
This notebook will focus on the crucial process of feature engineering. We'll start with a raw dataset of fashion products, ingest it into LanceDB, and then enrich the data with meaningful features that could power a search engine or train a model.
We will cover the following steps:
- Data Ingestion: Downloading a fashion dataset and loading it into a LanceDB table.
- Declarative Feature Engineering: Using Geneva to define and compute features on-the-fly.
- Embedding Generation: Creating vector embeddings for both images and text to enable semantic search.
- Updating: Adding more raw data to our table and rerunning our backfills on only the new data.
Note about Colab
This notebook runs on Google Colab, even on the free tier, but it will be slow because it has to start a local Ray cluster and execute multiple ML models on its workers. We recommend downloading this notebook and running it locally. If you do run it on Colab, we recommend:
- using a GPU instance (Runtime -> Change runtime type)
- running on only 100 rows
- not drawing conclusions about speed from this notebook. It is meant as a demo of the basic workflow of feature engineering with LanceDB, not a benchmark.
1. Data Ingestion
First, let's download our dataset. We're using a small version of the Fashion Product Images dataset from Kaggle. This dataset contains images and metadata for a variety of fashion products.
Set Scale based on your environment
This tutorial uses Ray locally to build features, which means the scale of concurrent jobs will be limited to the system you're working on. These parameters are good defaults, but feel free to adjust them if you'd like.
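A minimal sketch of the kind of scale parameters this step sets. The names and values are illustrative tutorial-level Python constants, not Geneva settings; adjust them to your machine.

```python
# Illustrative scale parameters for this tutorial (not Geneva settings).
NUM_ROWS = 1000        # rows to ingest; use ~100 on free-tier Colab
BATCH_SIZE = 100       # rows per write batch during ingestion
CHECKPOINT_SIZE = 100  # rows processed per backfill checkpoint
CONCURRENCY = 4        # parallel Ray workers for backfills
```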
Now, let's load the data into a LanceDB table. We'll read the CSV file with the product metadata, and for each product, we'll also load the corresponding image from the images directory. We'll then create a LanceDB table and add the data to it in batches. LanceDB can store objects (images in this case) along with vector embeddings and metadata.
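The loading step described above could be sketched like this. The file names (`styles.csv`, an `images/` directory with `<id>.jpg` files), the database path, and the table name are assumptions about the dataset layout, not confirmed by the source.

```python
import csv
import os

def iter_product_batches(csv_path, images_dir, batch_size, limit=None):
    """Yield lists of row dicts pairing product metadata with raw image bytes."""
    batch = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for i, row in enumerate(csv.DictReader(f)):
            if limit is not None and i >= limit:
                break
            # Assumed convention: each product's image is named "<id>.jpg".
            image_path = os.path.join(images_dir, f"{row['id']}.jpg")
            if not os.path.exists(image_path):
                continue  # skip products whose image is missing
            with open(image_path, "rb") as img:
                row["image"] = img.read()
            batch.append(row)
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:
        yield batch

def ingest(csv_path, images_dir, batch_size=100):
    """Create a LanceDB table and add the data in batches.
    Hedged sketch: db path and table name are illustrative."""
    import lancedb  # imported lazily so the batching helper runs without LanceDB
    db = lancedb.connect("./fashion_db")
    batches = iter_product_batches(csv_path, images_dir, batch_size)
    table = db.create_table("products", data=next(batches))  # first batch defines the schema
    for batch in batches:
        table.add(batch)
    return table
```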
2. Feature Engineering with Geneva
Now that we have our data in a LanceDB table, we can start engineering features. We'll use Geneva to create new features for our products.
Defining Geneva UDFs
Geneva uses Python User Defined Functions (UDFs) to define features as columns in a Lance dataset. Adding a feature is straightforward:
- Prototype your Python function in your favorite environment.
- Wrap the function with a small UDF decorator.
- Register the UDF as a virtual column using Table.add_columns().
- Trigger a backfill operation.
UDFs can work on one row or a batch of rows at a time, and can be stateful (e.g. a model is set up once on first use and reused on subsequent calls) or stateless. Read more about Geneva UDFs here.
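The stateful, batch-oriented pattern can be illustrated with a plain Python callable. This is a minimal sketch of the shape such a UDF takes; wrapping it with Geneva's UDF decorator is assumed, not shown, and the class name is hypothetical.

```python
# Sketch of a stateful batch UDF as a plain Python callable.
# Expensive one-time setup (e.g. loading a model) goes in __init__ and is
# reused across batches; __call__ processes one batch of inputs.
class WordCountUDF:
    def __init__(self):
        self.calls = 0  # stands in for a loaded model held across calls

    def __call__(self, descriptions: list[str]) -> list[int]:
        self.calls += 1
        return [len(d.split()) for d in descriptions]

udf_instance = WordCountUDF()
counts = udf_instance(["red cotton shirt", "blue denim jeans with zip"])
# counts == [3, 5]
```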
Simple Feature Extraction
Let's start with a simple feature: extracting color tags from the product description. We'll define a UDF that takes the product description as input and returns a comma-separated string of the colors it finds.
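A minimal sketch of such an extractor as a plain Python function. The color vocabulary here is a made-up sample, and registering the function as a Geneva UDF is a later step; this is just the feature logic.

```python
import re

# Hypothetical color vocabulary; extend as needed for your catalog.
COLORS = ["black", "white", "red", "blue", "green", "navy", "grey", "pink",
          "yellow", "purple", "brown", "orange", "beige", "maroon"]
_PATTERN = re.compile(r"\b(" + "|".join(COLORS) + r")\b")

def extract_colors(description: str) -> str:
    """Return a comma-separated string of colors found in the description."""
    if not description:
        return ""
    found = []
    for match in _PATTERN.findall(description.lower()):
        if match not in found:  # keep first-seen order, drop duplicates
            found.append(match)
    return ",".join(found)

print(extract_colors("Navy Blue slim fit shirt with white buttons"))
# navy,blue,white
```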
Adding a Computed Column
Now that we've defined our feature-generating UDF, we can register it on our table as a computed column. Computed columns are populated when you trigger a backfill operation.
Let's inspect the table schema to see our newly registered UDF.
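This registration step could look roughly like the sketch below. The `udf` wrapper, the `add_columns` mapping, and the column name `colors` follow the pattern described in the text, but the exact decorator arguments and signatures may differ across Geneva versions, so treat this as an assumption to verify.

```python
def register_color_column(table, fn):
    """Register a plain Python function `fn` (e.g. the color extractor) as a
    virtual column on a Geneva-backed table. Hedged sketch: exact Geneva
    signatures may differ by version."""
    from geneva import udf  # imported lazily so this sketch loads without Geneva

    colors_udf = udf(fn)                        # wrap the plain function as a Geneva UDF
    table.add_columns({"colors": colors_udf})   # register the virtual column; no compute yet
    print(table.schema)                         # the new column should appear in the schema
    return table
```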
Backfilling Features
Triggering backfill creates a distributed job to run the UDF and populate the column values in your LanceDB table. The Geneva framework simplifies several aspects of distributed execution.
Checkpoints: each batch of UDF execution is checkpointed, so partial results are not lost if a job fails. Jobs can resume and avoid most of the cost of recomputing values.
backfill() accepts various parameters to customize the scale of your workload; here we'll use:
- checkpoint_size - the number of rows that are processed before writing a checkpoint
- concurrency - how many nodes are used for parallelization
Here, we'll use db.local_ray_context() to run on a local Ray instance, so we don't need to set up a Ray cluster, but you can also use the same setup and run distributed jobs remotely on Ray clusters.
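Putting the pieces together, the trigger might look like this. The parameter spellings follow the list above and the use of `local_ray_context()` as a context manager is an assumption; check both against your installed Geneva version.

```python
CHECKPOINT_SIZE = 100  # rows processed before each checkpoint (illustrative)
CONCURRENCY = 4        # parallel workers (illustrative)

def run_backfill(db, table, column="colors"):
    """Hedged sketch of triggering a backfill on a local Ray cluster."""
    # Swap local_ray_context() for a remote Ray cluster context in production.
    with db.local_ray_context():
        table.backfill(
            column,
            checkpoint_size=CHECKPOINT_SIZE,
            concurrency=CONCURRENCY,
        )
```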
Let's take a look at our enriched data.
3. Embedding Generation
Now that we have our text-based features, let's create some vector embeddings. Embeddings are numerical representations of data that capture its semantic meaning. We'll also create some captions to describe in words what each image shows.
Image Embeddings
We'll use a pretrained CLIP model to generate embeddings for our product images. We'll define a UDF that takes a batch of image bytes as input, preprocesses them, and then uses the CLIP model to generate embeddings.
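A sketch of that batch UDF as a stateful callable, assuming standard Hugging Face `transformers` APIs (`CLIPModel`, `CLIPProcessor`) and the `openai/clip-vit-base-patch32` checkpoint; the Geneva UDF wiring is omitted.

```python
class ClipImageEmbedding:
    """Stateful batch UDF sketch: load CLIP once, embed batches of image bytes."""

    def __init__(self, model_name="openai/clip-vit-base-patch32"):
        from transformers import CLIPModel, CLIPProcessor
        self.model = CLIPModel.from_pretrained(model_name)
        self.processor = CLIPProcessor.from_pretrained(model_name)

    def __call__(self, images: list[bytes]) -> list[list[float]]:
        import io
        import torch
        from PIL import Image
        pils = [Image.open(io.BytesIO(b)).convert("RGB") for b in images]
        inputs = self.processor(images=pils, return_tensors="pt")
        with torch.no_grad():
            feats = self.model.get_image_features(**inputs)
        # L2-normalize so cosine similarity reduces to a dot product
        feats = feats / feats.norm(dim=-1, keepdim=True)
        return feats.tolist()
```

Loading the model in `__init__` means each Ray worker pays the setup cost once, not per batch.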
Captions
We'll also use a pretrained BLIP model, which takes an image and returns a caption of up to 30 words.
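A captioner UDF could be sketched the same way, assuming standard Hugging Face BLIP usage and the `Salesforce/blip-image-captioning-base` checkpoint; note that `max_new_tokens=30` caps tokens, which only approximates the 30-word limit described above.

```python
class BlipCaptioner:
    """Stateful batch UDF sketch: load BLIP once, caption batches of image bytes."""

    def __init__(self, model_name="Salesforce/blip-image-captioning-base"):
        from transformers import BlipForConditionalGeneration, BlipProcessor
        self.processor = BlipProcessor.from_pretrained(model_name)
        self.model = BlipForConditionalGeneration.from_pretrained(model_name)

    def __call__(self, images: list[bytes]) -> list[str]:
        import io
        from PIL import Image
        pils = [Image.open(io.BytesIO(b)).convert("RGB") for b in images]
        inputs = self.processor(images=pils, return_tensors="pt")
        # max_new_tokens approximates the ~30-word caption cap
        out = self.model.generate(**inputs, max_new_tokens=30)
        return [self.processor.decode(ids, skip_special_tokens=True) for ids in out]
```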
Adding and Backfilling Embedding and Caption Columns
Now, let's add our new generators as virtual columns and then backfill them.
Features on Features
Of course, feature engineering workflows often include chains of features: features that depend on other features we've already computed! Let's make some text embeddings for those captions we just generated.
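One way to sketch this chained feature: embed the caption column (itself a computed column) with CLIP's text tower, so the text vectors share a space with the image vectors. Model name and Hugging Face calls are standard; the Geneva wiring and column names are assumptions.

```python
class ClipTextEmbedding:
    """Chained-feature sketch: the input column is the generated caption,
    which was itself computed by an earlier UDF."""

    def __init__(self, model_name="openai/clip-vit-base-patch32"):
        from transformers import CLIPModel, CLIPProcessor
        self.model = CLIPModel.from_pretrained(model_name)
        self.processor = CLIPProcessor.from_pretrained(model_name)

    def __call__(self, captions: list[str]) -> list[list[float]]:
        import torch
        inputs = self.processor(
            text=captions, return_tensors="pt", padding=True, truncation=True
        )
        with torch.no_grad():
            feats = self.model.get_text_features(**inputs)
        # Normalize to match the image embeddings' scale
        feats = feats / feats.norm(dim=-1, keepdim=True)
        return feats.tolist()
```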
4. Updating
Let's add some new clothes to our table and rerun the backfills to compute all our derived features. Geneva will only recompute the backfills on the new rows. This doesn't save much time in this tutorial, but it absolutely does in production: imagine adding new data daily; you wouldn't want to recompute your costly features over all of your data every day!
We'll add another batch of data half the size of the first. As we rerun the backfills below, you'll notice the progress bars only process the new data. For example, if our dataset was originally 100 rows and we add 50 more, the progress bars will show "100/150", reflecting that the original 100 rows have already been computed.
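The update step could be sketched as follows. The column names are the hypothetical ones used in the sketches above, and the `add`/`backfill`/`local_ray_context` calls follow the patterns already described; verify them against your Geneva version.

```python
def update_and_backfill(db, table, new_rows,
                        columns=("colors", "image_vector", "caption", "caption_vector")):
    """Hedged sketch: append new raw rows, then rerun each backfill.
    Thanks to Geneva's checkpointing, only rows whose virtual-column values
    are missing (the new rows) get recomputed."""
    table.add(new_rows)              # ingest the new raw data
    with db.local_ray_context():
        for col in columns:
            table.backfill(col)      # skips rows that were already computed
```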
5. Wrapping up
That's the basics of feature engineering! Next, check out our other feature engineering tutorials, or if you're ready to start building features in production, read about Execution Contexts.