
Longer ≠ Better: Why RAG Still Matters

Retrieval-Augmented Generation (RAG) emerged as a solution to early large language models' context window limitations: it retrieves only the most relevant information when token constraints prevent processing entire datasets. Now that models like Gemini 1.5 can handle millions of tokens, we can test whether RAG is still a necessary tool for providing context in the era of million-token context windows.

Background

  • RAG was developed as a workaround for token constraints in LLMs
  • RAG allowed selective information retrieval to avoid context window limitations
  • New models like Gemini 1.5 can handle millions of tokens
  • As token limits increase, the need for selective retrieval diminishes
  • Future applications may process massive datasets without external databases
  • RAG may become obsolete as models handle more information directly

Let's test how models with large token contexts compare to RAG.

Architecture

  • RAG: We're using Elasticsearch with semantic text search enabled, and the retrieved results are supplied to the LLM, in this case Gemini, as context.

  • LLM: We're providing the full document set as context to the LLM, in this case Gemini, up to its maximum context of 1M tokens.

Methodology

To compare the performance of RAG against full-context LLM prompting, we're going to work with a mix of technical articles and documentation. For the full-context strategy, all articles and documentation will be provided to the LLM as context.

To determine whether an answer is correct or not, we're going to ask both systems: ***What is the title of the article?*** For this we're going to run 2 sets of tests:

  1. Run a textual query to find an extract of a document and identify which document it belongs to. Compare RAG and LLM performance.
  2. Run a semantic query to find a semantically equivalent sentence from a document. Compare RAG and LLM performance.

To compare both technologies we're going to measure:

  • Accuracy
  • Time
  • Cost

Setup

We set up the Python libraries we're going to use:

  • Elasticsearch - To run queries against Elasticsearch
  • LangChain - Interface to the LLM

We also set the API keys needed to work with both components.


Import libraries, set up Elasticsearch, define the LLM, and set the API keys

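The original code cell is not included here, but the setup might look like the sketch below. The `require_env` helper and the exact environment variable names are our assumptions, not the notebook's actual code, and the third-party packages would be installed with `pip install elasticsearch langchain langchain-google-genai`:

```python
import os
from getpass import getpass

# Third-party clients used throughout the notebook
# (install with: pip install elasticsearch langchain langchain-google-genai):
# from elasticsearch import Elasticsearch, helpers
# from langchain_google_genai import ChatGoogleGenerativeAI


def require_env(name: str) -> str:
    """Return an environment variable, prompting for it once if it is unset."""
    value = os.environ.get(name)
    if not value:
        value = getpass(f"{name}: ")
        os.environ[name] = value
    return value
```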

Elasticsearch client

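A minimal client sketch, assuming the endpoint and API key are stored in environment variables (the variable names here are placeholders):

```python
import os

from elasticsearch import Elasticsearch

# Placeholder variable names -- adjust to your deployment
es_client = Elasticsearch(
    hosts=[os.environ["ELASTICSEARCH_ENDPOINT"]],
    api_key=os.environ["ELASTICSEARCH_API_KEY"],
)

print(es_client.info())  # quick connectivity check
```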

Defining LLM

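Defining the LLM might look like this; the exact model name is an assumption based on the 1M-token context mentioned above:

```python
from langchain_google_genai import ChatGoogleGenerativeAI

# Gemini 1.5 Pro supports a context window of up to ~1M tokens;
# temperature 0 keeps answers deterministic for comparison purposes.
llm = ChatGoogleGenerativeAI(model="gemini-1.5-pro", temperature=0)
```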

Function to calculate cost of LLM

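A cost helper could be as simple as the sketch below. The per-token prices are assumptions for illustration only; check current Gemini pricing before relying on them:

```python
# Assumed prices (USD per 1M tokens) -- illustrative only,
# check current Gemini pricing before relying on these numbers.
INPUT_PRICE_PER_1M_TOKENS = 1.25
OUTPUT_PRICE_PER_1M_TOKENS = 5.00


def calculate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of a single LLM call from its token counts."""
    return (
        input_tokens / 1_000_000 * INPUT_PRICE_PER_1M_TOKENS
        + output_tokens / 1_000_000 * OUTPUT_PRICE_PER_1M_TOKENS
    )
```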

1. Index working files

For this test, we're going to index 303 documents, a mix of technical articles and documentation. These documents will be the source of information for both tests.

Create and populate index

To implement RAG, we include a semantic_text field in the mappings, alongside the regular text field, so we can run semantic queries in Elasticsearch.

We also push the documents to the "technical-articles" index.

Creating index

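The mapping might look like the following sketch. The field names are assumptions; the key point is the `semantic_text` field, which Elasticsearch chunks and embeds automatically, populated from the regular text field via `copy_to`:

```python
INDEX_NAME = "technical-articles"

mappings = {
    "properties": {
        "title": {"type": "text"},
        # The regular text field, mirrored into the semantic field below
        "text": {"type": "text", "copy_to": "semantic_field"},
        # semantic_text handles chunking and embedding with the
        # default inference endpoint
        "semantic_field": {"type": "semantic_text"},
    }
}
```

The index itself is then created with `es_client.indices.create(index=INDEX_NAME, mappings=mappings)`.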

Populating index

Indexing documents into Elasticsearch using the Bulk API

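A sketch of the bulk step, assuming each document is a dict with `title` and `text` keys (the helper name is ours):

```python
def build_bulk_actions(documents, index_name="technical-articles"):
    """Yield one bulk-index action per document dict."""
    for doc in documents:
        yield {
            "_index": index_name,
            "_source": {"title": doc["title"], "text": doc["text"]},
        }
```

The actions are then sent with `helpers.bulk(es_client, build_bulk_actions(documents))`.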

2. Run Comparisons

Test 1: Textual Query

Query to retrieve search results from Elasticsearch


We extract a paragraph from the Elasticsearch in JavaScript the proper way, part II article and use it as input to retrieve the results from Elasticsearch.

Results will be stored in the results variable.
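The excerpt below is a hypothetical stand-in for the actual paragraph the notebook extracts from the article:

```python
# Hypothetical stand-in for the real paragraph extracted from
# "Elasticsearch in JavaScript the proper way, part II"
query_str = (
    "We will now index our documents from the JavaScript client, "
    "sending them to Elasticsearch in batches."
)
```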

RAG strategy (Textual)

Executing Match Phrase Search

This is the query we're going to use to retrieve the results from Elasticsearch using match phrase search capabilities. We will pass the query_str as input to the match phrase search.

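The match-phrase query can be sketched as a small builder function; the function name and the `size` default are our assumptions:

```python
def build_match_phrase_query(query_str: str, size: int = 3) -> dict:
    """Build a match_phrase query over the indexed article text."""
    return {
        "query": {"match_phrase": {"text": query_str}},
        "_source": ["title", "text"],
        "size": size,
    }
```

The search itself then runs as `results = es_client.search(index="technical-articles", body=build_match_phrase_query(query_str))`.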

This template gives the LLM the instructions to answer the question and the context to do so. At the end of the prompt we're asking for the title of the article.

The prompt template will be the same for all tests.

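The wording below is hypothetical, but it shows the shape described here: instructions, the supplied context, and the fixed question at the end:

```python
# Hypothetical wording -- the notebook's actual prompt may differ, but the
# shape (instructions + context + fixed question) is the same for every test.
PROMPT_TEMPLATE = """You are a helpful assistant. Answer the question using
only the context below. If the answer is not in the context, say so.

Context:
{context}

Question: What is the title of the article the context belongs to?
"""


def build_prompt(context: str) -> str:
    """Fill the shared template with retrieved or full-index context."""
    return PROMPT_TEMPLATE.format(context=context)
```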

Run results through LLM

Results from Elasticsearch will be provided as context to the LLM for us to get the result we need.

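Turning Elasticsearch hits into an LLM context string can be sketched as follows (helper name ours); the answer is then produced by filling the prompt template with this string and calling `llm.invoke(...)`:

```python
def hits_to_context(results: dict) -> str:
    """Concatenate the retrieved documents into one context string."""
    return "\n\n".join(
        f"Title: {hit['_source']['title']}\n{hit['_source']['text']}"
        for hit in results["hits"]["hits"]
    )
```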

LLM strategy (Textual)

Match all query

To provide context to the LLM, we're going to get it from the indexed documents in Elasticsearch. Since the maximum context is 1 million tokens, this amounts to all 303 documents.

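The match-all query can be sketched as a plain dict; `size` is set well above the 303 documents in the index:

```python
# Fetch every indexed document so the whole corpus can be sent as context;
# 10_000 comfortably covers the 303-document index.
match_all_query = {
    "query": {"match_all": {}},
    "_source": ["title", "text"],
    "size": 10_000,
}
```

The documents are then fetched with `all_docs = es_client.search(index="technical-articles", body=match_all_query)`.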

Run results through LLM

As in the previous step, we're going to provide the context to the LLM and ask for the answer.


Test 2: Semantic Query

RAG strategy (Non-textual)


For the second test, we're going to use a semantic query to retrieve the results from Elasticsearch. For that, we built a short synopsis of the Elasticsearch in JavaScript the proper way, part II article as query_str and provided it as input to RAG.

Executing semantic search

This is the query we're going to use to retrieve the results from Elasticsearch using semantic search capabilities. We will pass the query_str as input to the semantic search.

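A sketch of the semantic query, assuming the index's semantic_text field is named `semantic_field` (a placeholder name):

```python
def build_semantic_query(query_str: str, size: int = 3) -> dict:
    """Build a semantic query against the index's semantic_text field."""
    return {
        "query": {
            "semantic": {
                "field": "semantic_field",  # placeholder field name
                "query": query_str,
            }
        },
        "_source": ["title", "text"],
        "size": size,
    }
```

As before, the search runs as `results = es_client.search(index="technical-articles", body=build_semantic_query(query_str))`.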

Run results through LLM

Now results from Elasticsearch will be provided as context to the LLM for us to get the result we need.


LLM strategy (Non-textual)

Match all query

To provide context to the LLM, we're going to get it from the indexed documents in Elasticsearch. Since the maximum context is 1 million tokens, this amounts to all 303 documents.


Run results through LLM

As in the previous step, we're going to provide the context to the LLM and ask for the answer.


3. Printing results

Now we're going to print the results of both tests in a dataframe.

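Tabulating the metrics can be sketched with pandas; the helper and column names are our assumptions:

```python
import pandas as pd


def results_dataframe(records: list) -> pd.DataFrame:
    """One row per strategy: tokens sent, response time, and estimated cost."""
    return pd.DataFrame.from_records(
        records, columns=["strategy", "tokens_sent", "time_s", "cost_usd"]
    )
```

Each record is a dict such as `{"strategy": "RAG (textual)", "tokens_sent": 1200, "time_s": 2.5, "cost_usd": 0.002}` (illustrative numbers only); the real values come from the runs above.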

Printing charts

And for better visualization of the results, we're going to print a bar chart with the number of tokens sent and the response time by strategy.

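The charts can be sketched with matplotlib (function name ours): tokens sent and response time, one bar per strategy:

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import matplotlib.pyplot as plt


def plot_comparison(df):
    """Side-by-side bar charts: tokens sent and response time per strategy."""
    fig, (ax_tokens, ax_time) = plt.subplots(1, 2, figsize=(10, 4))
    ax_tokens.bar(df["strategy"], df["tokens_sent"])
    ax_tokens.set_title("Tokens sent")
    ax_time.bar(df["strategy"], df["time_s"])
    ax_time.set_title("Response time (s)")
    for ax in (ax_tokens, ax_time):
        ax.tick_params(axis="x", rotation=45)
    fig.tight_layout()
    return fig
```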

Clean resources

As an optional step, we're going to delete the index from Elasticsearch.

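Cleanup is a single call on the Elasticsearch client created earlier:

```python
# Optional cleanup: delete the experiment's index
es_client.indices.delete(index="technical-articles", ignore_unavailable=True)
```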

Comments on Textual query

On RAG

  1. RAG was able to find the correct result.
  2. The retrieval step added little overhead: total time was similar to an LLM call with only the partial context.

On LLM

  1. The LLM was unable to find the correct result.
  2. The time to produce a result was much longer than with RAG.
  3. The cost is much higher than with RAG.
  4. If we are using a self-managed LLM, the hardware must be much more powerful than with a RAG approach.