Apify Haystack RAG
Crawl Website Content for Question Answering with Apify
Author: Jiri Spilka (Apify)
In this tutorial, we'll use the apify-haystack integration to run the Website Content Crawler, which crawls the Haystack website and extracts its text content. Then, we'll use the OpenAIDocumentEmbedder to compute text embeddings and the InMemoryDocumentStore to store the documents in a temporary in-memory database. The last step is a retrieval-augmented generation (RAG) pipeline that answers users' questions from the scraped data.
Install dependencies
Set up the API keys
You need an Apify account and an APIFY_API_TOKEN.
You also need an OpenAI account and an OPENAI_API_KEY.
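A minimal sketch of how the keys might be set in a notebook session. The helper name `set_key_if_missing` is hypothetical and not part of any library. It prompts only when the variable is not already set, and the prompt function is injectable so the helper can be exercised without a terminal.

```python
import os
from getpass import getpass


def set_key_if_missing(name: str, prompt=getpass) -> str:
    """Store a secret in os.environ, prompting only if it is not already set."""
    if not os.environ.get(name):
        # getpass hides the typed value, so the key does not end up in the output.
        os.environ[name] = prompt(f"Enter YOUR {name}")
    return os.environ[name]
```

Calling `set_key_if_missing("APIFY_API_TOKEN")` and then `set_key_if_missing("OPENAI_API_KEY")` would prompt for each key in turn.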
Enter YOUR APIFY_API_TOKEN··········
Enter YOUR OPENAI_API_KEY··········
Use the Website Content Crawler to scrape data from the Haystack documentation
Now, let us call the Website Content Crawler using the Haystack component ApifyDatasetFromActorCall. First, we need to define the parameters for the Website Content Crawler, and then specify which data to save into the vector database.
The actor_id and a detailed description of the input parameters (variable run_input) can be found on the Website Content Crawler input page.
For this example, we will define startUrls and limit the number of crawled pages to five.
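The two parameters could be written as follows. The Actor ID `apify/website-content-crawler` and the `maxCrawlPages` key are assumptions based on the crawler's public input schema, so verify them against the input page before relying on them. The start URL is the Haystack website that appears in the crawl output later in this tutorial.

```python
# Assumed identifier of the Website Content Crawler on the Apify platform.
actor_id = "apify/website-content-crawler"

run_input = {
    # Stop after five pages (assumed input key: maxCrawlPages).
    "maxCrawlPages": 5,
    # Each start URL is an object with a "url" field.
    "startUrls": [{"url": "https://haystack.deepset.ai/"}],
}
```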
Next, we need to define a dataset mapping function, which requires knowing the output format of the Website Content Crawler. Typically, it is a JSON array of objects that looks like this (truncated for brevity):
[
  {
    "url": "https://haystack.deepset.ai/overview/quick-start",
    "text": "Haystack is an open-source AI framework to build custom production-grade LLM ..."
  },
  {
    "url": "https://haystack.deepset.ai/cookbook",
    "text": "You can use these examples as guidelines on how to make use of different mod... "
  }
]
We will convert this JSON to a Haystack Document using the dataset_mapping_function as follows:
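One way to write the mapping is sketched below. It assumes the `haystack-ai` package is installed and that each dataset item carries the `text` and `url` fields shown above. The helper `item_to_document_fields` is a hypothetical name introduced here to keep the field selection separate from the Haystack import.

```python
def item_to_document_fields(dataset_item: dict) -> dict:
    """Pick the fields we keep: page text as content, page URL as metadata."""
    return {
        "content": dataset_item.get("text"),
        "meta": {"url": dataset_item.get("url")},
    }


def dataset_mapping_function(dataset_item: dict):
    """Convert one crawler result into a Haystack Document."""
    # Imported here so the field-selection helper above has no dependencies.
    from haystack import Document

    return Document(**item_to_document_fields(dataset_item))
```

Any other fields returned by the crawler are dropped, which keeps the stored documents small.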
And the definition of the ApifyDatasetFromActorCall:
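A sketch of that definition, assuming the `apify-haystack` package is installed and that `ApifyDatasetFromActorCall` accepts the `actor_id`, `run_input`, and `dataset_mapping_function` keyword arguments (check the package documentation for your version). The wrapper `build_document_loader` is a hypothetical name.

```python
from typing import Callable


def build_document_loader(actor_id: str, run_input: dict, mapping_fn: Callable):
    """Create the component that runs an Apify Actor and maps its dataset items."""
    # Requires the apify-haystack package; the Apify token is expected
    # in the APIFY_API_TOKEN environment variable.
    from apify_haystack import ApifyDatasetFromActorCall

    return ApifyDatasetFromActorCall(
        actor_id=actor_id,
        run_input=run_input,
        dataset_mapping_function=mapping_fn,
    )
```

Creating the component does not start the crawl. The Actor runs only when the component's `run()` method is called.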
Before actually running the Website Content Crawler, we need to define the document embedder and the document store:
After that, we can call the Website Content Crawler and print the scraped data:
{'documents': [Document(id=3650d4d2050c97d0b20d6bb9202eb72494e2dc6ad0222a7e4a7bad038780ab31, content: 'Haystack | Haystack
Multimodal
AI
Architect a next generation AI app around all modalities, not just...', meta: {'url': 'https://haystack.deepset.ai/'}, embedding: vector of size 1536), Document(id=a441728f7b8c8f7541304f23be229372f526306c6d39f634fecf245923d2f239, content: 'What is Haystack? | Haystack
Haystack is an open-source AI orchestration framework built by deepset ...', meta: {'url': 'https://haystack.deepset.ai/overview/intro'}, embedding: vector of size 1536), Document(id=82282e7eb3115bf0e8efbaaa4de70fd68bcd1bebf25218a68973c3441ff9638f, content: 'Demos | Haystack
Check out demos built with Haystack!
AutoQuizzer
Try out our AutoQuizzer demo built...', meta: {'url': 'https://haystack.deepset.ai/overview/demo'}, embedding: vector of size 1536), Document(id=55f775825a43a52c8f51f4ba08713389a652e05eb992ed15d7c18bbe68bbe38a, content: 'Get Started | Haystack
Haystack is an open-source AI framework to build custom production-grade LLM ...', meta: {'url': 'https://haystack.deepset.ai/overview/quick-start'}, embedding: vector of size 1536), Document(id=1b7ed59f60d536b9e1903b9c66f86e942bfed9bab5ae9f32dcecc6645b95daab, content: '🧑🍳 Cookbook | Haystack
You can use these examples as guidelines on how to make use of different mod...', meta: {'url': 'https://haystack.deepset.ai/cookbook'}, embedding: vector of size 1536)]}
Compute the embeddings and store them in the database:
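The embedding and storage step can be sketched as a small function. It assumes `embedder` behaves like Haystack's `OpenAIDocumentEmbedder` (its `run` method returns a dictionary with a `documents` key) and `document_store` behaves like `InMemoryDocumentStore` (its `write_documents` method returns the number of documents written). The function name `embed_and_store` is hypothetical.

```python
def embed_and_store(documents, embedder, document_store) -> int:
    """Embed the documents, write them to the store, and return the count written."""
    # The embedder returns the same documents with their embedding field filled in.
    embedded = embedder.run(documents=documents)["documents"]
    return document_store.write_documents(embedded)
```

In this tutorial, it would be called with the list stored under the crawler's `documents` key, an `OpenAIDocumentEmbedder()`, and an `InMemoryDocumentStore()`.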
Calculating embeddings: 1it [00:01, 1.07s/it]
5
Retrieval and LLM generative pipeline
Once we have the crawled data in the database, we can set up a classic retrieval-augmented generation (RAG) pipeline. Refer to the RAG Haystack tutorial for details.
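A sketch of such a pipeline, assuming a recent Haystack 2.x release with `haystack-ai` installed and `OPENAI_API_KEY` set. The prompt wording and the function name `build_rag_pipeline` are illustrative choices, not taken from the tutorial.

```python
# Jinja2 template: the retrieved documents and the question are filled in at run time.
PROMPT_TEMPLATE = """
Given the following information, answer the question.

Context:
{% for document in documents %}
    {{ document.content }}
{% endfor %}

Question: {{ question }}
Answer:
"""


def build_rag_pipeline(document_store):
    """Wire embedder -> retriever -> prompt builder -> chat generator."""
    # Imported inside the function so the template can be reused on its own.
    from haystack import Pipeline
    from haystack.components.builders import ChatPromptBuilder
    from haystack.components.embedders import OpenAITextEmbedder
    from haystack.components.generators.chat import OpenAIChatGenerator
    from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
    from haystack.dataclasses import ChatMessage

    pipeline = Pipeline()
    pipeline.add_component("embedder", OpenAITextEmbedder())
    pipeline.add_component(
        "retriever", InMemoryEmbeddingRetriever(document_store=document_store)
    )
    pipeline.add_component(
        "prompt_builder",
        ChatPromptBuilder(template=[ChatMessage.from_user(PROMPT_TEMPLATE)]),
    )
    pipeline.add_component("llm", OpenAIChatGenerator())

    # The question embedding drives retrieval; retrieved documents feed the prompt.
    pipeline.connect("embedder.embedding", "retriever.query_embedding")
    pipeline.connect("retriever.documents", "prompt_builder.documents")
    pipeline.connect("prompt_builder.prompt", "llm.messages")
    return pipeline
```

The question is embedded with `OpenAITextEmbedder` rather than the document embedder, because the retriever compares a single query vector against the stored document vectors.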
Initializing pipeline...
<haystack.core.pipeline.pipeline.Pipeline object at 0x79d0f361ea90>
🚅 Components
  - embedder: OpenAITextEmbedder
  - retriever: InMemoryEmbeddingRetriever
  - prompt_builder: ChatPromptBuilder
  - llm: OpenAIChatGenerator
🛤️ Connections
  - embedder.embedding -> retriever.query_embedding (List[float])
  - retriever.documents -> prompt_builder.documents (List[Document])
  - prompt_builder.prompt -> llm.messages (List[ChatMessage])
Now, you can ask questions about Haystack and get answers grounded in the crawled pages:
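Querying the pipeline might look like the sketch below. It assumes the component names `embedder`, `prompt_builder`, and `llm` from the pipeline summary above, and that each reply is a `ChatMessage` exposing its content as `.text` (older Haystack 2.x releases used `.content`). The function name `ask` is hypothetical.

```python
def ask(pipeline, question: str) -> str:
    """Run the RAG pipeline for one question and return the first reply's text."""
    result = pipeline.run(
        {
            # The same question is embedded for retrieval and inserted into the prompt.
            "embedder": {"text": question},
            "prompt_builder": {"question": question},
        }
    )
    return result["llm"]["replies"][0].text
```

Printing the question followed by `ask(pipeline, question)` gives output in the shape shown next.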
question: What is haystack?
answer: Haystack is an open-source AI orchestration framework developed by deepset that enables Python developers to create real-world applications using large language models (LLMs). It provides tools for building various types of applications, including autonomous agents, multi-modal apps, and scalable retrieval-augmented generation (RAG) systems. Haystack's modular architecture allows users to customize components, experiment with state-of-the-art methods, and manage their technology stack effectively. It caters to developers at all levels, from prototyping to full-scale deployment, and is supported by a community that values open-source collaboration. Haystack can be utilized directly in Python or through a visual interface called deepset Studio.