MongoDB RAG Chunking Strategies

RAG Series Part 3: Choosing the right chunking strategy for RAG

In this notebook, we will explore and evaluate different chunking techniques for RAG.

Step 1: Install required libraries

Step 2: Setup pre-requisites

Set the MongoDB connection string. Follow the steps here to get the connection string from the Atlas UI.
Set the OpenAI API key. Follow the steps here to obtain an API key.
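
As a minimal sketch of this step, the helper below reads a secret from an environment variable and only prompts for it when the variable is unset. The helper name `get_secret` and the variable name `MONGODB_URI` are assumptions made for illustration; they are not taken from the notebook's own cells.

```python
import getpass
import os


def get_secret(env_var: str, prompt: str) -> str:
    """Return the value of `env_var`, prompting (without echo) if it is unset."""
    value = os.environ.get(env_var)
    if not value:
        # getpass keeps the secret out of the notebook output.
        value = getpass.getpass(prompt)
        os.environ[env_var] = value
    return value


# Usage (hypothetical variable names; prompts only when a variable is not set):
# MONGODB_URI = get_secret("MONGODB_URI", "Enter your MongoDB connection string: ")
# OPENAI_API_KEY = get_secret("OPENAI_API_KEY", "Enter your OpenAI API key: ")
```

Reading from the environment first keeps credentials out of the saved notebook when it is shared or exported.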

Step 3: Load the dataset

Step 4: Define chunking functions
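
To make the strategy names used later in this notebook concrete, here is a dependency-free sketch of two of them: fixed-size chunking with optional overlap, and recursive splitting on a hierarchy of separators. These are not the notebook's own functions. As simplifying assumptions, `fixed_token_chunks` counts whitespace-separated words rather than model tokens, and `recursive_chunks` measures length in characters and omits overlap to stay short.

```python
from typing import List, Sequence


def fixed_token_chunks(text: str, chunk_size: int, chunk_overlap: int = 0) -> List[str]:
    """Split text into windows of `chunk_size` whitespace tokens.

    Consecutive windows share `chunk_overlap` tokens.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    tokens = text.split()
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


def recursive_chunks(
    text: str, chunk_size: int, separators: Sequence[str] = ("\n\n", "\n", " ")
) -> List[str]:
    """Split on the coarsest separator first, recursing into pieces that are too long."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # No separator left: fall back to a hard cut every `chunk_size` characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces into one chunk
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry with the finer separators.
            chunks.extend(recursive_chunks(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


words = "a b c d e f g h i j"
no_overlap = fixed_token_chunks(words, chunk_size=4)
with_overlap = fixed_token_chunks(words, chunk_size=4, chunk_overlap=2)

sample = "para one is here.\n\npara two is a bit longer than the first.\n\nshort"
recursive = recursive_chunks(sample, chunk_size=30)
```

A language-aware splitter such as the "Recursive Python splitter" follows the same recursive idea with a separator list tuned to source code (for example class and function boundaries) instead of paragraph and line breaks.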

Step 5: Generate the evaluation dataset

Filename and doc_id are the same for all nodes.                 
Generating: 100%|██████████| 10/10 [01:16<00:00,  7.68s/it]

Step 6: Evaluate chunking strategies
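
The output in this step suggests a grid: for each chunk size and each strategy, the collection is emptied, the new chunks are ingested, and the evaluation set is scored. The sketch below shows that control flow only. `run_grid`, `reset_collection`, `InMemoryCollection`, `by_words`, and the `score` callable are all hypothetical names introduced here; in particular, `score` stands in for whatever computes `context_precision` and `context_recall`, and embedding and vector search are left out entirely.

```python
from types import SimpleNamespace
from typing import Callable, Dict, List, Sequence, Tuple

Chunker = Callable[[str, int], List[str]]
Metrics = Dict[str, float]


def reset_collection(collection) -> int:
    """Remove all documents so one strategy's chunks cannot leak into the next run."""
    return collection.delete_many({}).deleted_count


def run_grid(
    docs: Sequence[str],
    chunk_sizes: Sequence[int],
    strategies: Dict[str, Chunker],
    collection,
    score: Callable[[object], Metrics],
) -> Dict[Tuple[int, str], Metrics]:
    """Score every (chunk size, strategy) pair against a freshly emptied collection."""
    results = {}
    for chunk_size in chunk_sizes:
        for name, chunker in strategies.items():
            reset_collection(collection)
            chunks = [c for doc in docs for c in chunker(doc, chunk_size)]
            if chunks:  # pymongo's insert_many rejects an empty list
                collection.insert_many([{"text": c} for c in chunks])
            results[(chunk_size, name)] = score(collection)
    return results


class InMemoryCollection:
    """Stand-in exposing only the two pymongo Collection methods used above."""

    def __init__(self) -> None:
        self.docs: List[dict] = []

    def delete_many(self, query: dict) -> SimpleNamespace:
        deleted = len(self.docs)
        self.docs = []
        return SimpleNamespace(deleted_count=deleted)

    def insert_many(self, documents: List[dict]) -> None:
        self.docs.extend(documents)


def by_words(text: str, size: int) -> List[str]:
    """Toy chunker: fixed windows of `size` whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


demo = run_grid(
    docs=["one two three four five six"],
    chunk_sizes=[2, 3],
    strategies={"Fixed token without overlap": by_words},
    collection=InMemoryCollection(),
    # Placeholder metric: counts stored chunks instead of calling an LLM judge.
    score=lambda coll: {"n_chunks": float(len(coll.docs))},
)
```

Emptying the collection before each run matters: without it, chunks from an earlier strategy would still be retrievable and would contaminate the scores of the next one.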

CHUNK SIZE: 100
------ Fixed token without overlap ------
Deleting existing documents in the collection evals.chunking
Deletion complete
Getting contexts for evaluation set

100%|██████████| 10/10 [00:01<00:00,  5.22it/s]

Running evals

Evaluating: 100%|██████████| 20/20 [00:23<00:00,  1.17s/it]

Result: {'context_precision': 0.8583, 'context_recall': 0.7833}
------ Fixed token with overlap ------
Deleting existing documents in the collection evals.chunking
Deletion complete
Getting contexts for evaluation set

100%|██████████| 10/10 [00:01<00:00,  5.12it/s]

Running evals

Evaluating: 100%|██████████| 20/20 [00:18<00:00,  1.09it/s]

Result: {'context_precision': 0.9000, 'context_recall': 0.9500}
------ Recursive with overlap ------
Deleting existing documents in the collection evals.chunking
Deletion complete
Getting contexts for evaluation set

100%|██████████| 10/10 [00:02<00:00,  4.93it/s]

Running evals

Evaluating: 100%|██████████| 20/20 [00:22<00:00,  1.10s/it]

Result: {'context_precision': 0.9000, 'context_recall': 0.9833}
------ Recursive Python splitter with overlap ------
Deleting existing documents in the collection evals.chunking
Deletion complete
Getting contexts for evaluation set

100%|██████████| 10/10 [00:02<00:00,  4.90it/s]

Running evals

Evaluating: 100%|██████████| 20/20 [00:22<00:00,  1.15s/it]

Result: {'context_precision': 0.9833, 'context_recall': 0.9833}
CHUNK SIZE: 200
------ Fixed token without overlap ------
Deleting existing documents in the collection evals.chunking
Deletion complete
Getting contexts for evaluation set

100%|██████████| 10/10 [00:02<00:00,  4.94it/s]

Running evals

Evaluating: 100%|██████████| 20/20 [00:21<00:00,  1.09s/it]

Result: {'context_precision': 0.9000, 'context_recall': 0.9000}
------ Fixed token with overlap ------
Deleting existing documents in the collection evals.chunking
Deletion complete
Getting contexts for evaluation set

100%|██████████| 10/10 [00:01<00:00,  5.10it/s]

Running evals

Evaluating: 100%|██████████| 20/20 [00:20<00:00,  1.03s/it]

Result: {'context_precision': 1.0000, 'context_recall': 0.9383}
------ Recursive with overlap ------
Deleting existing documents in the collection evals.chunking
Deletion complete
Getting contexts for evaluation set

100%|██████████| 10/10 [00:01<00:00,  5.13it/s]

Running evals

Evaluating: 100%|██████████| 20/20 [00:22<00:00,  1.12s/it]

Result: {'context_precision': 0.9000, 'context_recall': 0.9008}
------ Recursive Python splitter with overlap ------
Deleting existing documents in the collection evals.chunking
Deletion complete
Getting contexts for evaluation set

100%|██████████| 10/10 [00:02<00:00,  4.75it/s]

Running evals

Evaluating: 100%|██████████| 20/20 [00:21<00:00,  1.10s/it]

Result: {'context_precision': 1.0000, 'context_recall': 0.8583}
CHUNK SIZE: 500
------ Fixed token without overlap ------
Deleting existing documents in the collection evals.chunking
Deletion complete
Getting contexts for evaluation set

100%|██████████| 10/10 [00:02<00:00,  4.99it/s]

Running evals

Evaluating: 100%|██████████| 20/20 [00:22<00:00,  1.11s/it]

Result: {'context_precision': 0.8833, 'context_recall': 0.9500}
------ Fixed token with overlap ------
Deleting existing documents in the collection evals.chunking
Deletion complete
Getting contexts for evaluation set

100%|██████████| 10/10 [00:02<00:00,  4.77it/s]

Running evals

Evaluating: 100%|██████████| 20/20 [00:17<00:00,  1.15it/s]

Result: {'context_precision': 0.7000, 'context_recall': 0.9000}
------ Recursive with overlap ------
Deleting existing documents in the collection evals.chunking
Deletion complete
Getting contexts for evaluation set

100%|██████████| 10/10 [00:02<00:00,  4.65it/s]

Running evals

Evaluating: 100%|██████████| 20/20 [00:20<00:00,  1.02s/it]

Result: {'context_precision': 0.5667, 'context_recall': 0.8236}
------ Recursive Python splitter with overlap ------
Deleting existing documents in the collection evals.chunking
Deletion complete
Getting contexts for evaluation set

100%|██████████| 10/10 [00:01<00:00,  5.11it/s]

Running evals

Evaluating: 100%|██████████| 20/20 [00:15<00:00,  1.30it/s]

Result: {'context_precision': 0.6000, 'context_recall': 0.8800}
CHUNK SIZE: 1000
------ Fixed token without overlap ------
Deleting existing documents in the collection evals.chunking
Deletion complete
Getting contexts for evaluation set

100%|██████████| 10/10 [00:01<00:00,  5.18it/s]

Running evals

Evaluating: 100%|██████████| 20/20 [00:18<00:00,  1.08it/s]

Result: {'context_precision': 0.9000, 'context_recall': 0.8909}
------ Fixed token with overlap ------
Deleting existing documents in the collection evals.chunking
Deletion complete
Getting contexts for evaluation set

100%|██████████| 10/10 [00:02<00:00,  4.27it/s]

Running evals

Evaluating: 100%|██████████| 20/20 [00:17<00:00,  1.15it/s]

Result: {'context_precision': 0.7833, 'context_recall': 0.8909}
------ Recursive with overlap ------
Deleting existing documents in the collection evals.chunking
Deletion complete
Getting contexts for evaluation set

100%|██████████| 10/10 [00:03<00:00,  2.64it/s]

Running evals

Evaluating: 100%|██████████| 20/20 [00:19<00:00,  1.02it/s]

Result: {'context_precision': 0.7833, 'context_recall': 0.8800}
------ Recursive Python splitter with overlap ------
Deleting existing documents in the collection evals.chunking
Deletion complete
Getting contexts for evaluation set

100%|██████████| 10/10 [00:02<00:00,  4.64it/s]

Running evals

Evaluating: 100%|██████████| 20/20 [00:19<00:00,  1.01it/s]

Result: {'context_precision': 0.8000, 'context_recall': 0.8709}
------ Semantic chunking ------
Deleting existing documents in the collection evals.chunking
Deletion complete
Getting contexts for evaluation set

100%|██████████| 10/10 [00:02<00:00,  4.69it/s]

Running evals

Evaluating: 100%|██████████| 20/20 [00:23<00:00,  1.16s/it]

Result: {'context_precision': 0.9000, 'context_recall': 0.8187}
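
To compare the runs above side by side, the sketch below copies the reported scores into a dictionary and ranks the configurations. Two choices here are assumptions rather than part of the notebook: ranking by the unweighted mean of context precision and context recall, and keying semantic chunking with a chunk size of `None` because it is reported once, after the CHUNK SIZE: 1000 block.

```python
from typing import Dict, Optional, Tuple

Key = Tuple[Optional[int], str]

FT = "Fixed token without overlap"
FTO = "Fixed token with overlap"
REC = "Recursive with overlap"
PY = "Recursive Python splitter with overlap"
SEM = "Semantic chunking"

# (chunk_size, strategy) -> (context_precision, context_recall), copied from the output.
RESULTS: Dict[Key, Tuple[float, float]] = {
    (100, FT): (0.8583, 0.7833),
    (100, FTO): (0.9000, 0.9500),
    (100, REC): (0.9000, 0.9833),
    (100, PY): (0.9833, 0.9833),
    (200, FT): (0.9000, 0.9000),
    (200, FTO): (1.0000, 0.9383),
    (200, REC): (0.9000, 0.9008),
    (200, PY): (1.0000, 0.8583),
    (500, FT): (0.8833, 0.9500),
    (500, FTO): (0.7000, 0.9000),
    (500, REC): (0.5667, 0.8236),
    (500, PY): (0.6000, 0.8800),
    (1000, FT): (0.9000, 0.8909),
    (1000, FTO): (0.7833, 0.8909),
    (1000, REC): (0.7833, 0.8800),
    (1000, PY): (0.8000, 0.8709),
    (None, SEM): (0.9000, 0.8187),  # assumed to be independent of chunk size
}


def mean_score(metrics: Tuple[float, float]) -> float:
    """Unweighted mean of precision and recall (an illustrative ranking choice)."""
    return sum(metrics) / len(metrics)


ranked = sorted(RESULTS.items(), key=lambda item: mean_score(item[1]), reverse=True)
best_key, best_metrics = ranked[0]

for (size, name), (precision, recall) in ranked[:3]:
    print(f"{name} @ {size}: precision={precision:.4f}, recall={recall:.4f}")
```

Under this ranking the top configuration is the recursive Python splitter with overlap at chunk size 100. The evaluation set contains only 10 questions, so small differences between configurations should be read with caution.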