Longer ≠ Better: Why RAG Still Matters
Retrieval-Augmented Generation (RAG) emerged as a solution to early large language models' limited context windows, allowing selective retrieval of relevant information when token constraints prevented processing entire datasets. Now that models like Gemini 1.5 can handle millions of tokens, we can test whether RAG is still a necessary tool for providing context, or whether a long-context LLM can do the job on its own.
Background
- RAG was developed as a workaround for token constraints in LLMs
- RAG allowed selective information retrieval to avoid context window limitations
- New models like Gemini 1.5 can handle millions of tokens
- As token limits increase, the need for selective retrieval diminishes
- Future applications may process massive datasets without external databases
- RAG may become obsolete as models handle more information directly
Let's test how well models with large context windows perform compared to RAG.
Architecture
- RAG: We're using Elasticsearch with semantic text search enabled, and the retrieved results are supplied as context to the LLM, in this case Gemini.
- LLM: We're providing the full document set as context to the LLM, in this case Gemini, with a maximum context of 1M tokens.
Methodology
To compare the performance of RAG against a full-context LLM, we're going to work with a mix of technical articles and documentation. For the full-context approach, all articles and documentation will be provided to the LLM as context.
To judge whether an answer is correct or not, we're going to ask both systems ***What is the title of the article?***. For this we're going to run 2 sets of tests:
- Run a textual query to find an extract of a document and identify which article it belongs to, then compare RAG and LLM performance
- Run a semantic query to find a semantically equivalent sentence from a document, then compare RAG and LLM performance
To compare both technologies we're going to measure:
- Accuracy
- Time
- Cost
Setup
We set up the Python libraries we're going to use:
- Elasticsearch - To run queries to Elasticsearch
- Langchain - Interface to LLM
We also set up the API keys needed to work with both components.
Importing libraries, creating the Elasticsearch client, and defining the LLM and OpenAI API keys
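A minimal sketch of how the keys might be collected; the environment variable names here are assumptions, so adjust them to your deployment:

```python
import os
from getpass import getpass

def get_key(name: str) -> str:
    """Read an API key from the environment, falling back to an interactive prompt."""
    return os.environ.get(name) or getpass(f"{name}: ")

# Hypothetical variable names; adjust to your environment.
# es_api_key = get_key("ELASTIC_API_KEY")
# llm_api_key = get_key("OPENAI_API_KEY")
```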
Elasticsearch client
Defining LLM
Function to calculate cost of LLM
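A sketch of what such a cost function might look like. The per-million-token prices below are placeholders, not current Gemini pricing; check your provider's price list before relying on the numbers:

```python
def llm_cost(input_tokens: int, output_tokens: int,
             in_price_per_m: float = 0.15,
             out_price_per_m: float = 0.60) -> float:
    """Estimate request cost in USD from token counts.

    The default per-million-token prices are illustrative placeholders;
    substitute the real rates for your model.
    """
    return (input_tokens * in_price_per_m + output_tokens * out_price_per_m) / 1_000_000
```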
1. Index working files
For this test, we're going to index 303 documents: a mix of technical articles and documentation. These documents will be the source of information for both tests.
Create and populate index
To implement RAG, we include a semantic_text field in the mappings, alongside the regular text field, so we can run semantic queries in Elasticsearch.
We also push the documents to the "technical-articles" index.
Creating index
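A sketch of the mappings under these assumptions: the field names are illustrative, and the semantic_text field relies on the cluster's default inference endpoint unless one is configured explicitly:

```python
INDEX_NAME = "technical-articles"

# A plain text field for lexical (match_phrase) queries plus a
# semantic_text field for semantic queries. Field names are assumptions.
mappings = {
    "properties": {
        "title": {"type": "text"},
        "text": {"type": "text"},
        "semantic_text": {"type": "semantic_text"},
    }
}

# With an Elasticsearch client in hand:
# client.indices.create(index=INDEX_NAME, mappings=mappings)
```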
Populating index
Indexing documents using the Bulk API to Elasticsearch
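The bulk step can be sketched as a generator of actions fed to the client's bulk helper. The document shape (a dict with "title" and "text") is an assumption:

```python
def build_actions(docs, index_name="technical-articles"):
    """Yield one bulk action per document.

    Assumes each doc is a dict with "title" and "text"; the semantic_text
    field is populated from the article body at index time.
    """
    for doc in docs:
        yield {
            "_index": index_name,
            "_source": {
                "title": doc["title"],
                "text": doc["text"],
                "semantic_text": doc["text"],
            },
        }

# With the official client:
# from elasticsearch import helpers
# helpers.bulk(client, build_actions(documents))
```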
2. Run Comparisons
Test 1: Textual Query
Query to retrieve semantic search results from Elasticsearch
We extract a paragraph from the article *Elasticsearch in JavaScript the proper way, part II* and use it as input to retrieve results from Elasticsearch.
Results will be stored in the results variable.
RAG strategy (Textual)
Executing Match Phrase Search
This is the query we're going to use to retrieve the results from Elasticsearch using match phrase search capabilities. We will pass the query_str as input to the match phrase search.
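A sketch of how that query might be built; the field name "text" is an assumption about the mapping:

```python
def phrase_query(query_str: str, field: str = "text") -> dict:
    """Build a match_phrase query for an exact extract; field name is assumed."""
    return {"query": {"match_phrase": {field: query_str}}}

# results = client.search(index="technical-articles", body=phrase_query(query_str))
```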
This template gives the LLM the instructions to answer the question and the context to do so. At the end of the prompt we're asking for the title of the article.
The prompt template will be the same for all tests.
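An illustrative version of such a template; the exact wording used in the notebook may differ:

```python
# Illustrative prompt template: instructions, the retrieved context,
# and the question asked at the end.
PROMPT_TEMPLATE = """You are a helpful assistant. Use only the context below
to answer the question.

Context:
{context}

Question: What is the title of the article?
Answer:"""

def build_prompt(context: str) -> str:
    """Fill the template with the retrieved (or full) document context."""
    return PROMPT_TEMPLATE.format(context=context)
```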
Run results through LLM
Results from Elasticsearch will be provided as context to the LLM for us to get the result we need.
LLM strategy (Textual)
Match all query
To provide context to the LLM, we retrieve all the indexed documents from Elasticsearch. Since the maximum context is 1 million tokens, all 303 documents fit.
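A sketch of the match_all query used to pull every document for the full-context prompt; the size parameter matches the 303 test documents:

```python
def match_all_query(size: int = 303) -> dict:
    """Fetch every indexed document; size defaults to the 303 test documents."""
    return {"query": {"match_all": {}}, "size": size}

# full_context_docs = client.search(index="technical-articles", body=match_all_query())
```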
Run results through LLM
As in the previous step, we're going to provide the context to the LLM and ask for the answer.
Test 2: Semantic Query
RAG strategy (Non-textual)
For the second test, we're going to use a semantic query to retrieve results from Elasticsearch. For that, we built a short synopsis of the *Elasticsearch in JavaScript the proper way, part II* article as query_str and provided it as input to RAG.
Executing semantic search
This is the query we're going to use to retrieve the results from Elasticsearch using semantic search capabilities. We will pass the query_str as input to the semantic search.
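A sketch of the semantic query; the semantic_text field name is the same mapping assumption as before:

```python
def semantic_search_query(query_str: str, field: str = "semantic_text") -> dict:
    """Build a semantic query against the semantic_text field (name assumed)."""
    return {"query": {"semantic": {"field": field, "query": query_str}}}

# results = client.search(index="technical-articles", body=semantic_search_query(query_str))
```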
Run results through LLM
Now results from Elasticsearch will be provided as context to the LLM for us to get the result we need.
LLM strategy (Non-textual)
Match all query
To provide context to the LLM, we retrieve all the indexed documents from Elasticsearch. Since the maximum context is 1 million tokens, all 303 documents fit.
Run results through LLM
As in the previous step, we're going to provide the context to the LLM and ask for the answer.
3. Printing results
Printing results
Now we're going to print the results of both tests in a dataframe.
Printing charts
And for better visualization of the results, we're going to plot a bar chart with the number of tokens sent and the response time for each strategy.
Clean resources
As an optional step, we're going to delete the index from Elasticsearch.
Comments on Textual query
On RAG
- RAG was able to find the correct result
- The time to run a full context was similar to LLM with partial context
On LLM
- LLM was unable to find the correct result
- Time to provide a result was much longer than RAG
- Pricing is much higher than with RAG
- If we're using a self-managed LLM, the hardware must be considerably more powerful than with a RAG approach