RAG Series Part 3: Choosing the right chunking strategy for RAG

In this notebook, we will explore and evaluate different chunking techniques for RAG.

Step 1: Install required libraries


Step 2: Set up prerequisites

  • Set the MongoDB connection string. Follow the steps here to get the connection string from the Atlas UI.

  • Set the OpenAI API key. Follow the steps here to obtain an API key.
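
As a rough illustration of this step, the sketch below reads each secret from an environment variable and falls back to a hidden prompt when the variable is not set. The helper name `read_secret` and the variable name `MONGODB_URI` are assumptions made for this example, not names taken from the notebook; `OPENAI_API_KEY` is the variable the OpenAI Python client reads by default.

```python
import getpass
import os


def read_secret(env_var: str, prompt: str) -> str:
    """Return the value of env_var, or ask for it without echoing the input."""
    value = os.environ.get(env_var)
    if not value:
        # getpass hides what is typed, so the secret is not displayed
        # in the notebook output or the terminal scrollback.
        value = getpass.getpass(prompt)
    return value


# Typical use in a notebook cell (left commented out so nothing prompts on import):
# MONGODB_URI = read_secret("MONGODB_URI", "Enter your MongoDB connection string: ")
# os.environ["OPENAI_API_KEY"] = read_secret("OPENAI_API_KEY", "Enter your OpenAI API key: ")
```

Reading from the environment first keeps secrets out of the saved notebook, which matters if the notebook is shared or committed.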


Step 3: Load the dataset

3

Step 4: Define chunking functions
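
The strategy names that appear in the evaluation output (fixed token with and without overlap, recursive splitting) can be illustrated with a dependency-free sketch. This is not the notebook's code, which is not shown here and most likely relies on a library splitter. The function names `chunk_fixed` and `chunk_recursive` are hypothetical, whitespace-separated words stand in for model tokens, and the recursive version measures characters and omits overlap to stay short.

```python
from typing import List, Sequence


def chunk_fixed(tokens: Sequence[str], chunk_size: int, overlap: int = 0) -> List[List[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks


def chunk_recursive(text: str, max_chars: int,
                    separators: Sequence[str] = ("\n\n", "\n", " ")) -> List[str]:
    """Split on the coarsest separator present, recursing into oversized pieces."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    for i, sep in enumerate(separators):
        if sep not in text:
            continue
        chunks, current = [], ""
        for piece in text.split(sep):
            candidate = piece if not current else current + sep + piece
            if len(candidate) <= max_chars:
                current = candidate  # keep merging small pieces together
                continue
            if current:
                chunks.append(current)
                current = ""
            if len(piece) <= max_chars:
                current = piece
            else:
                # The piece is still too long: retry with finer separators.
                chunks.extend(chunk_recursive(piece, max_chars, separators[i + 1:]))
        if current:
            chunks.append(current)
        return chunks
    # No separator left: fall back to a hard cut every max_chars characters.
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


words = "a b c d e f g h i j".split()
print(chunk_fixed(words, 4))             # windows without overlap
print(chunk_fixed(words, 4, overlap=2))  # each window repeats 2 tokens
print(chunk_recursive("aaa bbb\n\nccc ddd eee", 7))
```

Overlap trades a larger index for a lower chance that a relevant sentence is cut in half at a chunk boundary. Recursive splitting instead tries to place boundaries at paragraph or line breaks before falling back to finer cuts.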


Step 5: Generate the evaluation dataset

Filename and doc_id are the same for all nodes.                 
Generating: 100%|██████████| 10/10 [01:16<00:00,  7.68s/it]
10

Step 6: Evaluate chunking strategies
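
The results logged in this step report two retrieval metrics, context_precision and context_recall, whose names match metrics in the Ragas library. As a simplified sketch of how such scores are commonly defined: context precision averages precision@k over the ranks that hold a relevant chunk, and context recall is the share of ground-truth statements that the retrieved context supports. In a library these relevance verdicts typically come from an LLM judge. Here they are passed in as booleans, and the function names below are chosen for this example.

```python
from typing import Sequence


def context_precision(relevant_at_rank: Sequence[bool]) -> float:
    """Mean of precision@k over the ranks k where the retrieved chunk is relevant."""
    hits = 0
    total = 0.0
    for k, is_relevant in enumerate(relevant_at_rank, start=1):
        if is_relevant:
            hits += 1
            total += hits / k  # precision@k at this relevant position
    return total / hits if hits else 0.0


def context_recall(statement_supported: Sequence[bool]) -> float:
    """Fraction of ground-truth statements supported by the retrieved context."""
    if not statement_supported:
        return 0.0
    return sum(statement_supported) / len(statement_supported)


# Three retrieved chunks, with relevant ones at ranks 1 and 3.
print(round(context_precision([True, False, True]), 4))
# Three ground-truth statements, two of them found in the retrieved context.
print(round(context_recall([True, True, False]), 4))
```

Under this definition, precision rewards placing relevant chunks early in the ranking, while recall only asks whether the needed information was retrieved at all.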

CHUNK SIZE: 100
------ Fixed token without overlap ------
Deleting existing documents in the collection evals.chunking
Deletion complete
Getting contexts for evaluation set
100%|██████████| 10/10 [00:01<00:00,  5.22it/s]
Running evals
Evaluating: 100%|██████████| 20/20 [00:23<00:00,  1.17s/it]
Result: {'context_precision': 0.8583, 'context_recall': 0.7833}
------ Fixed token with overlap ------
Deleting existing documents in the collection evals.chunking
Deletion complete
Getting contexts for evaluation set
100%|██████████| 10/10 [00:01<00:00,  5.12it/s]
Running evals
Evaluating: 100%|██████████| 20/20 [00:18<00:00,  1.09it/s]
Result: {'context_precision': 0.9000, 'context_recall': 0.9500}
------ Recursive with overlap ------
Deleting existing documents in the collection evals.chunking
Deletion complete
Getting contexts for evaluation set
100%|██████████| 10/10 [00:02<00:00,  4.93it/s]
Running evals
Evaluating: 100%|██████████| 20/20 [00:22<00:00,  1.10s/it]
Result: {'context_precision': 0.9000, 'context_recall': 0.9833}
------ Recursive Python splitter with overlap ------
Deleting existing documents in the collection evals.chunking
Deletion complete
Getting contexts for evaluation set
100%|██████████| 10/10 [00:02<00:00,  4.90it/s]
Running evals
Evaluating: 100%|██████████| 20/20 [00:22<00:00,  1.15s/it]
Result: {'context_precision': 0.9833, 'context_recall': 0.9833}
CHUNK SIZE: 200
------ Fixed token without overlap ------
Deleting existing documents in the collection evals.chunking
Deletion complete
Getting contexts for evaluation set
100%|██████████| 10/10 [00:02<00:00,  4.94it/s]
Running evals
Evaluating: 100%|██████████| 20/20 [00:21<00:00,  1.09s/it]
Result: {'context_precision': 0.9000, 'context_recall': 0.9000}
------ Fixed token with overlap ------
Deleting existing documents in the collection evals.chunking
Deletion complete
Getting contexts for evaluation set
100%|██████████| 10/10 [00:01<00:00,  5.10it/s]
Running evals
Evaluating: 100%|██████████| 20/20 [00:20<00:00,  1.03s/it]
Result: {'context_precision': 1.0000, 'context_recall': 0.9383}
------ Recursive with overlap ------
Deleting existing documents in the collection evals.chunking
Deletion complete
Getting contexts for evaluation set
100%|██████████| 10/10 [00:01<00:00,  5.13it/s]
Running evals
Evaluating: 100%|██████████| 20/20 [00:22<00:00,  1.12s/it]
Result: {'context_precision': 0.9000, 'context_recall': 0.9008}
------ Recursive Python splitter with overlap ------
Deleting existing documents in the collection evals.chunking
Deletion complete
Getting contexts for evaluation set
100%|██████████| 10/10 [00:02<00:00,  4.75it/s]
Running evals
Evaluating: 100%|██████████| 20/20 [00:21<00:00,  1.10s/it]
Result: {'context_precision': 1.0000, 'context_recall': 0.8583}
CHUNK SIZE: 500
------ Fixed token without overlap ------
Deleting existing documents in the collection evals.chunking
Deletion complete
Getting contexts for evaluation set
100%|██████████| 10/10 [00:02<00:00,  4.99it/s]
Running evals
Evaluating: 100%|██████████| 20/20 [00:22<00:00,  1.11s/it]
Result: {'context_precision': 0.8833, 'context_recall': 0.9500}
------ Fixed token with overlap ------
Deleting existing documents in the collection evals.chunking
Deletion complete
Getting contexts for evaluation set
100%|██████████| 10/10 [00:02<00:00,  4.77it/s]
Running evals
Evaluating: 100%|██████████| 20/20 [00:17<00:00,  1.15it/s]
Result: {'context_precision': 0.7000, 'context_recall': 0.9000}
------ Recursive with overlap ------
Deleting existing documents in the collection evals.chunking
Deletion complete
Getting contexts for evaluation set
100%|██████████| 10/10 [00:02<00:00,  4.65it/s]
Running evals
Evaluating: 100%|██████████| 20/20 [00:20<00:00,  1.02s/it]
Result: {'context_precision': 0.5667, 'context_recall': 0.8236}
------ Recursive Python splitter with overlap ------
Deleting existing documents in the collection evals.chunking
Deletion complete
Getting contexts for evaluation set
100%|██████████| 10/10 [00:01<00:00,  5.11it/s]
Running evals
Evaluating: 100%|██████████| 20/20 [00:15<00:00,  1.30it/s]
Result: {'context_precision': 0.6000, 'context_recall': 0.8800}
CHUNK SIZE: 1000
------ Fixed token without overlap ------
Deleting existing documents in the collection evals.chunking
Deletion complete
Getting contexts for evaluation set
100%|██████████| 10/10 [00:01<00:00,  5.18it/s]
Running evals
Evaluating: 100%|██████████| 20/20 [00:18<00:00,  1.08it/s]
Result: {'context_precision': 0.9000, 'context_recall': 0.8909}
------ Fixed token with overlap ------
Deleting existing documents in the collection evals.chunking
Deletion complete
Getting contexts for evaluation set
100%|██████████| 10/10 [00:02<00:00,  4.27it/s]
Running evals
Evaluating: 100%|██████████| 20/20 [00:17<00:00,  1.15it/s]
Result: {'context_precision': 0.7833, 'context_recall': 0.8909}
------ Recursive with overlap ------
Deleting existing documents in the collection evals.chunking
Deletion complete
Getting contexts for evaluation set
100%|██████████| 10/10 [00:03<00:00,  2.64it/s]
Running evals
Evaluating: 100%|██████████| 20/20 [00:19<00:00,  1.02it/s]
Result: {'context_precision': 0.7833, 'context_recall': 0.8800}
------ Recursive Python splitter with overlap ------
Deleting existing documents in the collection evals.chunking
Deletion complete
Getting contexts for evaluation set
100%|██████████| 10/10 [00:02<00:00,  4.64it/s]
Running evals
Evaluating: 100%|██████████| 20/20 [00:19<00:00,  1.01it/s]
Result: {'context_precision': 0.8000, 'context_recall': 0.8709}
------ Semantic chunking ------
Deleting existing documents in the collection evals.chunking
Deletion complete
Getting contexts for evaluation set
100%|██████████| 10/10 [00:02<00:00,  4.69it/s]
Running evals
Evaluating: 100%|██████████| 20/20 [00:23<00:00,  1.16s/it]
Result: {'context_precision': 0.9000, 'context_recall': 0.8187}
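
To compare the runs side by side, the scores from the log above can be collected into one table and sorted. The sketch below copies the logged values as they appear. Ranking by the plain mean of the two metrics is an assumption made for illustration, not a method stated in the notebook. The semantic chunking run is logged after the 1000 block with no chunk size of its own, so it is stored here with `None`.

```python
# (chunk_size, strategy) -> (context_precision, context_recall), copied from the log.
RESULTS = {
    (100, "Fixed token without overlap"): (0.8583, 0.7833),
    (100, "Fixed token with overlap"): (0.9000, 0.9500),
    (100, "Recursive with overlap"): (0.9000, 0.9833),
    (100, "Recursive Python splitter with overlap"): (0.9833, 0.9833),
    (200, "Fixed token without overlap"): (0.9000, 0.9000),
    (200, "Fixed token with overlap"): (1.0000, 0.9383),
    (200, "Recursive with overlap"): (0.9000, 0.9008),
    (200, "Recursive Python splitter with overlap"): (1.0000, 0.8583),
    (500, "Fixed token without overlap"): (0.8833, 0.9500),
    (500, "Fixed token with overlap"): (0.7000, 0.9000),
    (500, "Recursive with overlap"): (0.5667, 0.8236),
    (500, "Recursive Python splitter with overlap"): (0.6000, 0.8800),
    (1000, "Fixed token without overlap"): (0.9000, 0.8909),
    (1000, "Fixed token with overlap"): (0.7833, 0.8909),
    (1000, "Recursive with overlap"): (0.7833, 0.8800),
    (1000, "Recursive Python splitter with overlap"): (0.8000, 0.8709),
    (None, "Semantic chunking"): (0.9000, 0.8187),  # no chunk size applies
}


def rank(results):
    """Sort runs from best to worst by the mean of precision and recall."""
    return sorted(results.items(), key=lambda item: sum(item[1]) / 2, reverse=True)


for (size, name), (precision, recall) in rank(RESULTS):
    label = "n/a" if size is None else str(size)
    print(f"{label:>5}  {name:<40} precision={precision:.4f}  recall={recall:.4f}")
```

With this ranking, the recursive Python splitter with overlap at chunk size 100 comes out on top and the recursive splitter with overlap at chunk size 500 comes last. The evaluation set has only 10 questions, so small gaps between runs may not be meaningful.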