AstraDB Haystack Integration
Introduction
In this notebook, you'll learn how to use AstraDB as a data source in your Haystack pipelines.
Prerequisites
You'll need an OpenAI API key to follow along. (Haystack is model-agnostic, so feel free to use a different model provider if you'd prefer!)
You'll need the following pieces of information in order to use the Haystack extension. The tutorials linked below will show you how to create an AstraDB database and where to find each value.
- API Endpoint
- Token
- Astra keyspace
- Astra collection name
Follow the first step in this tutorial to create a free AstraDB database and save your database ID, application token, keyspace, and database region.
Follow these steps to create a collection. Save the name of your collection.
Choose the number of dimensions that matches the embedding model you plan on using. For this example, we'll use a 384-dimension model, sentence-transformers/all-MiniLM-L6-v2.
Next, install our dependencies.
Here you'll enter your credentials. In production code, use environment variables for sensitive values such as the application token, so they aren't committed to source control.
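For example, you might read the values from environment variables like these (the variable names here are illustrative; match them to your own setup):

```python
import os

# Illustrative environment variable names -- adjust to your own setup.
ASTRA_DB_API_ENDPOINT = os.environ.get("ASTRA_DB_API_ENDPOINT")
ASTRA_DB_APPLICATION_TOKEN = os.environ.get("ASTRA_DB_APPLICATION_TOKEN")
ASTRA_DB_KEYSPACE = os.environ.get("ASTRA_DB_KEYSPACE", "default_keyspace")
ASTRA_DB_COLLECTION = os.environ.get("ASTRA_DB_COLLECTION", "haystack_demo")

# Fail fast if the required secrets are missing.
missing = [
    name
    for name, value in {
        "ASTRA_DB_API_ENDPOINT": ASTRA_DB_API_ENDPOINT,
        "ASTRA_DB_APPLICATION_TOKEN": ASTRA_DB_APPLICATION_TOKEN,
    }.items()
    if not value
]
if missing:
    print(f"Set these environment variables first: {', '.join(missing)}")
```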
Next, we'll create a Haystack pipeline that generates the embeddings and writes the documents into the AstraDocumentStore.
Batches: 0%| | 0/1 [00:00<?, ?it/s]
WARNING:astra_haystack.document_store:No documents written. Argument policy set to SKIP
3
Next, we'll make a RAG pipeline so we can query our documents.
The output should be something like this:
{'answer_builder': {'answers': [GeneratedAnswer(data='There are over 7,000 languages spoken around the world today.', query='How many languages are there in the world today?', documents=[Document(id=cfe93bc1c274908801e6670440bf2bbba54fad792770d57421f85ffa2a4fcc94, content: 'There are over 7,000 languages spoken around the world today.', score: 0.9267925, embedding: vector of size 384), Document(id=6f20658aeac3c102495b198401c1c0c2bd71d77b915820304d4fbc324b2f3cdb, content: 'Elephants have been observed to behave in a way that indicates a high level of self-awareness, such ...', score: 0.5357444, embedding: vector of size 384)], meta={'model': 'gpt-4o-mini-2024-07-18', 'index': 0, 'finish_reason': 'stop', 'usage': {'completion_tokens': 14, 'prompt_tokens': 83, 'total_tokens': 97}})]}}
Now you know how to use AstraDB as a data source for your Haystack pipeline. Thanks for reading! To learn more about Haystack, join us on Discord or sign up for our monthly newsletter.