Longer ≠ Better: Why RAG Still Matters
Retrieval-Augmented Generation (RAG) emerged as a solution to early large language models' limited context windows, allowing selective retrieval of relevant information when token constraints prevented processing entire datasets. Now that models like Gemini 1.5 can handle millions of tokens, we can test whether RAG is still a necessary tool for providing context, or whether a long-context LLM can do the job on its own.
Background
- RAG was developed as a workaround for token constraints in LLMs
- RAG allowed selective information retrieval to avoid context window limitations
- New models like Gemini 1.5 can handle millions of tokens
- As token limits increase, the need for selective retrieval diminishes
- Future applications may process massive datasets without external databases
- RAG may become obsolete as models handle more information directly
Let's test how well models with large context windows perform compared to RAG.
Architecture
- RAG: We're using Elasticsearch with semantic text search enabled, and the retrieved results are supplied as context to the LLM, in this case Gemini.
- LLM: We're providing the full document set as context to the LLM, in this case Gemini, with a maximum context of 1M tokens.
Methodology
To compare the performance of RAG against a full-context LLM, we're going to work with a mix of technical articles and documentation. For the full-context approach, all articles and documentation will be provided to the LLM as context.
To judge whether an answer is correct or not, we're going to ask both systems ***What is the title of the article?***. For this we're going to run 2 sets of tests:
- Run a textual query to find an extract of a document and identify which article it belongs to, then compare RAG and LLM performance
- Run a semantic query to find a semantically equivalent sentence from a document, then compare RAG and LLM performance
To compare both technologies we're going to measure:
- Accuracy
- Time
- Cost
Setup
We set up the Python libraries we're going to use:
- Elasticsearch - To run queries to Elasticsearch
- Langchain - Interface to LLM
We also set up the API keys needed to work with both components.
Importing libraries, creating the Elasticsearch client, and defining the LLM and OpenAI API keys
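A minimal sketch of how the keys might be collected; the environment variable names here are assumptions, so adjust them to your deployment:

```python
import os
from getpass import getpass

def get_key(name: str) -> str:
    """Read an API key from the environment, falling back to an interactive prompt."""
    return os.environ.get(name) or getpass(f"{name}: ")

# Hypothetical variable names; adjust to your environment.
# es_api_key = get_key("ELASTIC_API_KEY")
# llm_api_key = get_key("OPENAI_API_KEY")
```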
Elasticsearch client
Defining LLM
Function to calculate cost of LLM
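A sketch of what such a cost function might look like. The per-million-token prices below are placeholders, not current Gemini pricing; check your provider's price list before relying on the numbers:

```python
def llm_cost(input_tokens: int, output_tokens: int,
             in_price_per_m: float = 0.15,
             out_price_per_m: float = 0.60) -> float:
    """Estimate request cost in USD from token counts.

    The default per-million-token prices are illustrative placeholders;
    substitute the real rates for your model.
    """
    return (input_tokens * in_price_per_m + output_tokens * out_price_per_m) / 1_000_000
```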
1. Index working files
For this test, we're going to index 303 documents: a mix of technical articles and documentation. These documents will be the source of information for both tests.
Create and populate index
To implement RAG, we include a semantic_text field in the mappings, alongside the regular text field, so we can run semantic queries in Elasticsearch.
We also push the documents to the "technical-articles" index.
Creating index
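A sketch of the mappings under these assumptions: the field names are illustrative, and the semantic_text field relies on the cluster's default inference endpoint unless one is configured explicitly:

```python
INDEX_NAME = "technical-articles"

# A plain text field for lexical (match_phrase) queries plus a
# semantic_text field for semantic queries. Field names are assumptions.
mappings = {
    "properties": {
        "title": {"type": "text"},
        "text": {"type": "text"},
        "semantic_text": {"type": "semantic_text"},
    }
}

# With an Elasticsearch client in hand:
# client.indices.create(index=INDEX_NAME, mappings=mappings)
```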
Populating index
Indexing documents using the Bulk API to Elasticsearch
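The bulk step can be sketched as a generator of actions fed to the client's bulk helper. The document shape (a dict with "title" and "text") is an assumption:

```python
def build_actions(docs, index_name="technical-articles"):
    """Yield one bulk action per document.

    Assumes each doc is a dict with "title" and "text"; the semantic_text
    field is populated from the article body at index time.
    """
    for doc in docs:
        yield {
            "_index": index_name,
            "_source": {
                "title": doc["title"],
                "text": doc["text"],
                "semantic_text": doc["text"],
            },
        }

# With the official client:
# from elasticsearch import helpers
# helpers.bulk(client, build_actions(documents))
```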
2. Run Comparisons
Test 1: Textual Query
Query to retrieve semantic search results from Elasticsearch
We extract a paragraph from the article *Elasticsearch in JavaScript the proper way, part II* and use it as input to retrieve results from Elasticsearch.
Results will be stored in the results variable.
RAG strategy (Textual)
Executing Match Phrase Search
This is the query we're going to use to retrieve the results from Elasticsearch using match phrase search capabilities. We will pass the query_str as input to the match phrase search.
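A sketch of how that query might be built; the field name "text" is an assumption about the mapping:

```python
def phrase_query(query_str: str, field: str = "text") -> dict:
    """Build a match_phrase query for an exact extract; field name is assumed."""
    return {"query": {"match_phrase": {field: query_str}}}

# results = client.search(index="technical-articles", body=phrase_query(query_str))
```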
This template gives the LLM the instructions to answer the question and the context to do so. At the end of the prompt we're asking for the title of the article.
The prompt template will be the same for all tests.
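An illustrative version of such a template; the exact wording used in the notebook may differ:

```python
# Illustrative prompt template: instructions, the retrieved context,
# and the question asked at the end.
PROMPT_TEMPLATE = """You are a helpful assistant. Use only the context below
to answer the question.

Context:
{context}

Question: What is the title of the article?
Answer:"""

def build_prompt(context: str) -> str:
    """Fill the template with the retrieved (or full) document context."""
    return PROMPT_TEMPLATE.format(context=context)
```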
Run results through LLM
Results from Elasticsearch will be provided as context to the LLM for us to get the result we need.
LLM strategy (Textual)
Match all query
To provide context to the LLM, we retrieve all the indexed documents from Elasticsearch. Since the maximum context is 1 million tokens, all 303 documents fit.
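A sketch of the match_all query used to pull every document for the full-context prompt; the size parameter matches the 303 test documents:

```python
def match_all_query(size: int = 303) -> dict:
    """Fetch every indexed document; size defaults to the 303 test documents."""
    return {"query": {"match_all": {}}, "size": size}

# full_context_docs = client.search(index="technical-articles", body=match_all_query())
```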
Run results through LLM
As in the previous step, we're going to provide the context to the LLM and ask for the answer.
Test 2: Semantic Query
RAG strategy (Non-textual)
For the second test, we're going to use a semantic query to retrieve results from Elasticsearch. For that, we built a short synopsis of the *Elasticsearch in JavaScript the proper way, part II* article as query_str and provided it as input to RAG.
Executing semantic search
This is the query we're going to use to retrieve the results from Elasticsearch using semantic search capabilities. We will pass the query_str as input to the semantic search.
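A sketch of the semantic query; the semantic_text field name is the same mapping assumption as before:

```python
def semantic_search_query(query_str: str, field: str = "semantic_text") -> dict:
    """Build a semantic query against the semantic_text field (name assumed)."""
    return {"query": {"semantic": {"field": field, "query": query_str}}}

# results = client.search(index="technical-articles", body=semantic_search_query(query_str))
```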
Run results through LLM
Now results from Elasticsearch will be provided as context to the LLM for us to get the result we need.
LLM strategy (Non-textual)
Match all query
To provide context to the LLM, we retrieve all the indexed documents from Elasticsearch. Since the maximum context is 1 million tokens, all 303 documents fit.
Run results through LLM
As in the previous step, we're going to provide the context to the LLM and ask for the answer.
3. Printing results
Printing results
Now we're going to print the results of both tests in a dataframe.
Printing charts
And for better visualization of the results, we're going to plot a bar chart with the number of tokens sent and the response time for each strategy.
Clean resources
As an optional step, we're going to delete the index from Elasticsearch.
Comments on Textual query
On RAG
- RAG was able to find the correct result
- The time to run a full context was similar to LLM with partial context
On LLM
- LLM was unable to find the correct result
- Time to provide a result was much longer than RAG
- Pricing is much higher than with RAG
- If we're using a self-managed LLM, the hardware must be considerably more powerful than with a RAG approach