Performing RAG over PDFs with Weaviate and Docling
A recipe 🧑🍳 🐥 💚
By Mary Newhauser, MLE @ Weaviate
This is a code recipe that uses Weaviate to perform RAG over PDF documents parsed by Docling.
In this notebook, we accomplish the following:
- Parse the top machine learning papers on arXiv using Docling
- Perform hierarchical chunking of the documents using Docling
- Generate text embeddings with OpenAI
- Perform RAG using Weaviate
To run this notebook, you'll need:
- An OpenAI API key
- Access to one or more GPUs
Note: For best results, use GPU acceleration to run this notebook. Here are two options for running it:
- Locally on a MacBook with an Apple Silicon chip. Converting all documents in the notebook takes ~2 minutes on a MacBook M2 thanks to Docling's use of MPS accelerators.
- On Google Colab. Converting all documents in the notebook takes ~8 minutes on a Google Colab T4 GPU.
Install Docling and Weaviate client
Note: If Colab prompts you to restart the session after running the cell below, click "restart" and proceed with running the rest of the notebook.
🐥 Part 1: Docling
Part of what makes Docling so remarkable is that it can run on commodity hardware, so this notebook can be run on a local machine with GPU acceleration. If you're using a MacBook with an Apple Silicon chip, Docling works with Metal Performance Shaders (MPS). MPS provides out-of-the-box GPU acceleration on macOS, integrates with PyTorch and TensorFlow, offers energy-efficient performance on Apple Silicon, and is broadly compatible with Metal-supported GPUs.
The code below checks to see if a GPU is available, either via CUDA or MPS.
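A minimal sketch of such a check, assuming PyTorch is the library being probed. The helper name detect_device and the exact messages are illustrative, and the sketch falls back to CPU if PyTorch is not installed or no accelerator is found.

```python
def detect_device() -> str:
    """Return "cuda", "mps", or "cpu" depending on what PyTorch can see."""
    try:
        import torch  # imported lazily so the check degrades gracefully
    except ImportError:
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    # torch.backends.mps may be absent on older PyTorch builds, so guard the lookup
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


# Illustrative status messages, one per device type
MESSAGES = {
    "cuda": "CUDA GPU is enabled.",
    "mps": "MPS GPU is enabled.",
    "cpu": "No GPU detected; falling back to CPU.",
}

device = detect_device()
print(MESSAGES[device])
```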
MPS GPU is enabled.
Here, we've collected 10 influential machine learning papers published as PDFs on arXiv. Because Docling does not yet have title extraction for PDFs, we manually add the titles in a corresponding list.
Note: Converting all 10 papers should take around 8 minutes with a T4 GPU.
Convert PDFs to Docling documents
Here we use Docling's .convert_all() to parse a batch of PDFs. The result is a list of Docling documents that we can use for text extraction.
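The batch conversion can be sketched roughly as follows. The wrapper name convert_pdfs and its optional converter argument are our own additions for illustration. The sketch assumes Docling's DocumentConverter, whose convert_all() yields one result per source with the parsed document on its .document attribute.

```python
from typing import Any, Iterable, List, Optional


def convert_pdfs(sources: Iterable[str], converter: Optional[Any] = None) -> List[Any]:
    """Convert PDF paths or URLs into a list of Docling documents."""
    if converter is None:
        # Lazy import so the module loads even where Docling is not installed
        from docling.document_converter import DocumentConverter

        converter = DocumentConverter()
    # convert_all yields one conversion result per source, in order
    return [result.document for result in converter.convert_all(list(sources))]
```

Calling convert_pdfs with the list of arXiv PDF URLs would return the documents in the same order as the titles list, which matters later when we pair each document with its title.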
Note: Please ignore the ERR# message.
Fetching 9 files: 100%|██████████| 9/9 [00:00<00:00, 84072.91it/s]
ERR#: COULD NOT CONVERT TO RS THIS TABLE TO COMPUTE SPANS
Post-process extracted document data
Perform hierarchical chunking on documents
We use Docling's HierarchicalChunker() to perform hierarchy-aware chunking of our list of documents. This is meant to preserve some of the structure and relationships within the document, which enables more accurate and relevant retrieval in our RAG pipeline.
Because we're splitting the documents into chunks, we'll concatenate the article title to the beginning of each chunk for additional context.
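As a sketch of the chunk-and-prefix step, the helper below (chunk_with_titles is a hypothetical name) walks each document, asks a chunker for its chunks, and prepends the title to each chunk's text. It assumes the chunker exposes a chunk(doc) method that yields objects with a .text attribute, which is how we understand Docling's HierarchicalChunker to behave.

```python
from typing import Any, List, Sequence, Tuple


def chunk_with_titles(
    docs: Sequence[Any], titles: Sequence[str], chunker: Any
) -> Tuple[List[str], List[str]]:
    """Return (texts, chunk_titles): one title-prefixed text and one title per chunk."""
    texts: List[str] = []
    chunk_titles: List[str] = []
    for doc, title in zip(docs, titles):
        for chunk in chunker.chunk(doc):
            # Prefix the title so every chunk carries its source paper as context
            texts.append(f"{title} {chunk.text}")
            chunk_titles.append(title)
    return texts, chunk_titles
```

With Docling, the chunker argument would be an instance of HierarchicalChunker (for example, imported from docling_core.transforms.chunker), and docs would be the converted documents from the previous step.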
💚 Part 2: Weaviate
Create and configure an embedded Weaviate collection
We'll be using the OpenAI API both to generate the text embeddings and as the generative model in our RAG pipeline. The code below fetches your API key dynamically, depending on whether you're running this notebook in Google Colab or as a regular Jupyter notebook. All you need to do is replace openai_api_key_var with the name of your environment variable or Colab secret that holds the API key.
If you're running this notebook in Google Colab, make sure you add your API key as a secret.
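One way to sketch this lookup: try Colab's secrets first, and fall back to an environment variable otherwise. The helper name get_api_key and the example variable name OPENAI_API_KEY are assumptions, so substitute whatever name you actually use.

```python
import os


def get_api_key(var_name: str) -> str:
    """Fetch an API key from Colab secrets if available, else from the environment."""
    try:
        from google.colab import userdata  # only importable inside Google Colab
    except ImportError:
        key = os.environ.get(var_name)
    else:
        key = userdata.get(var_name)
    if not key:
        raise RuntimeError(f"No API key found under the name {var_name!r}")
    return key


# Replace with the name of your own environment variable or Colab secret
openai_api_key_var = "OPENAI_API_KEY"
```

Calling get_api_key(openai_api_key_var) then returns the key in either environment, or raises a clear error if nothing is set.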
Embedded Weaviate allows you to spin up a Weaviate instance directly from your application code, without having to use a Docker container. If you're interested in other deployment methods, like using Docker-Compose or Kubernetes, check out this page in the Weaviate docs.
Wrangle data into an acceptable format for Weaviate
Transform our data from lists to a list of dictionaries for insertion into our Weaviate collection.
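A small sketch of this reshaping step, assuming two parallel lists (one of chunk texts, one of titles) and the property names text and title that appear in the query results. The helper name to_rows is hypothetical.

```python
from typing import Dict, List, Sequence


def to_rows(texts: Sequence[str], titles: Sequence[str]) -> List[Dict[str, str]]:
    """Zip parallel lists into one dictionary per chunk."""
    if len(texts) != len(titles):
        raise ValueError("texts and titles must have the same length")
    return [{"text": text, "title": title} for text, title in zip(texts, titles)]


rows = to_rows(
    ["BERT: ... We introduce a new language representation model"],
    ["BERT: ..."],
)
print(rows[0].keys())
```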
Insert data into Weaviate and generate embeddings
Embeddings will be generated upon insertion to our Weaviate collection.
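The insertion itself can be sketched as below. This assumes the v4 Weaviate Python client, where a collection object exposes data.insert_many() and the returned result carries has_errors and errors. The wrapper name insert_rows is our own.

```python
from typing import Any, Dict, List


def insert_rows(collection: Any, rows: List[Dict[str, str]]) -> int:
    """Insert all rows in one call and fail loudly if any object was rejected."""
    result = collection.data.insert_many(rows)
    if result.has_errors:
        # errors maps the index of each failed row to its error details
        raise RuntimeError(f"{len(result.errors)} objects failed to insert")
    return len(rows)
```

Because the collection is configured with a vectorizer, no embedding vectors are passed here: Weaviate computes them as the objects arrive.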
Query the data
Here, we perform a simple similarity search to return the most similar embedded chunks to our search query.
{'text': 'BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding A distinctive feature of BERT is its unified architecture across different tasks. There is mini-', 'title': 'BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding'}
0.6578550338745117
{'text': 'BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding We introduce a new language representation model called BERT , which stands for B idirectional E ncoder R epresentations from T ransformers. Unlike recent language representation models (Peters et al., 2018a; Radford et al., 2018), BERT is designed to pretrain deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be finetuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial taskspecific architecture modifications.', 'title': 'BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding'}
0.6696287989616394
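The number printed after each result reads as a vector distance: the closer match has the smaller value. As a conceptual sketch only (toy vectors, not Weaviate's implementation, and assuming cosine distance as the metric), ranking by distance works like this:

```python
import math
from typing import List, Sequence, Tuple


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity; 0.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def nearest(
    query_vec: Sequence[float], vectors: Sequence[Sequence[float]], limit: int = 2
) -> List[Tuple[float, int]]:
    """Return the (distance, index) pairs of the closest vectors, nearest first."""
    scored = [(cosine_distance(query_vec, vec), i) for i, vec in enumerate(vectors)]
    return sorted(scored)[:limit]


# Toy two-dimensional "embeddings"; real embeddings have many more dimensions
print(nearest([1.0, 0.0], [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]))
```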
Perform RAG on parsed articles
Weaviate's generate module allows you to perform RAG over your embedded data without having to use a separate framework.
We specify a prompt that includes the field we want to search through in the database (in this case it's text), a query that includes our search term, and the number of retrieved results to use in the generation.
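A hedged sketch of the generation call, assuming the v4 Weaviate Python client, where generate.near_text() accepts a grouped_task prompt and the response exposes the answer as .generated. The wrapper name rag_answer and the default limit of 3 are our own choices.

```python
from typing import Any


def rag_answer(collection: Any, query: str, prompt: str, limit: int = 3) -> str:
    """Retrieve the closest chunks for `query` and generate one answer from them."""
    response = collection.generate.near_text(
        query=query,  # search term used for retrieval
        limit=limit,  # number of retrieved chunks passed to the model
        grouped_task=prompt,  # one prompt applied to all retrieved chunks together
        return_properties=["text", "title"],
    )
    return response.generated
```

For example, rag_answer(collection, "bert", "Explain how {text} works, using only the retrieved context.") would retrieve the three closest chunks and return a single generated answer. The prompt wording here is only an example.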
We can see that our RAG pipeline performs relatively well for simple queries, especially given the small size of the dataset. Scaling this method to convert a larger sample of PDFs would require more compute (GPUs) and a more advanced deployment of Weaviate (like Docker, Kubernetes, or Weaviate Cloud). Still, this notebook demonstrates that Docling is a robust and powerful open source tool for converting PDFs to structured data.
To take this solution to the next level, consider:
- Experimenting with different chunking techniques
- Using a RAG framework like DSPy, LlamaIndex, or LangChain
- Implementing advanced RAG techniques, like Agentic RAG