RAG Chunking Strategies
RAG Series Part 3: Choosing the right chunking strategy for RAG
In this notebook, we will explore and evaluate different chunking techniques for RAG.
Step 1: Install required libraries
Step 3: Load the dataset
3
Step 4: Define chunking functions
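The evaluation output later in this notebook names "Fixed token without overlap" and "Fixed token with overlap" among the strategies compared. The notebook's own chunking functions are not shown here, so the following is only a minimal sketch of the idea in plain Python. The function name `fixed_token_chunks`, its parameters, and the use of whitespace-separated words as "tokens" are assumptions made for illustration; a real pipeline would most likely use a model tokenizer and a splitter from a library.

```python
from typing import List


def fixed_token_chunks(
    tokens: List[str], chunk_size: int, chunk_overlap: int = 0
) -> List[List[str]]:
    """Split a token list into windows of at most `chunk_size` tokens.

    Consecutive windows share `chunk_overlap` tokens. This is a hypothetical
    helper for illustration, not the notebook's implementation.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end, so no trailing chunk
        # is made up purely of overlap tokens.
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Assumption: whitespace-separated words stand in for tokens.
words = "a b c d e f g h i j".split()
without_overlap = fixed_token_chunks(words, chunk_size=4)
with_overlap = fixed_token_chunks(words, chunk_size=4, chunk_overlap=1)
print(without_overlap)
print(with_overlap)
```

Without overlap, each token lands in exactly one chunk. With overlap, the last token of one chunk is repeated at the start of the next, which can help keep a sentence that straddles a boundary retrievable from either side, at the cost of storing some text twice.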
Step 5: Generate the evaluation dataset
Filename and doc_id are the same for all nodes. Generating: 100%|██████████| 10/10 [01:16<00:00, 7.68s/it]
10
Step 6: Evaluate chunking strategies
CHUNK SIZE: 100
------ Fixed token without overlap ------
Deleting existing documents in the collection evals.chunking
Deletion complete
Getting contexts for evaluation set
100%|██████████| 10/10 [00:01<00:00, 5.22it/s]
Running evals
Evaluating: 100%|██████████| 20/20 [00:23<00:00, 1.17s/it]
Result: {'context_precision': 0.8583, 'context_recall': 0.7833}
------ Fixed token with overlap ------
Deleting existing documents in the collection evals.chunking
Deletion complete
Getting contexts for evaluation set
100%|██████████| 10/10 [00:01<00:00, 5.12it/s]
Running evals
Evaluating: 100%|██████████| 20/20 [00:18<00:00, 1.09it/s]
Result: {'context_precision': 0.9000, 'context_recall': 0.9500}
------ Recursive with overlap ------
Deleting existing documents in the collection evals.chunking
Deletion complete
Getting contexts for evaluation set
100%|██████████| 10/10 [00:02<00:00, 4.93it/s]
Running evals
Evaluating: 100%|██████████| 20/20 [00:22<00:00, 1.10s/it]
Result: {'context_precision': 0.9000, 'context_recall': 0.9833}
------ Recursive Python splitter with overlap ------
Deleting existing documents in the collection evals.chunking
Deletion complete
Getting contexts for evaluation set
100%|██████████| 10/10 [00:02<00:00, 4.90it/s]
Running evals
Evaluating: 100%|██████████| 20/20 [00:22<00:00, 1.15s/it]
Result: {'context_precision': 0.9833, 'context_recall': 0.9833}
CHUNK SIZE: 200
------ Fixed token without overlap ------
Deleting existing documents in the collection evals.chunking
Deletion complete
Getting contexts for evaluation set
100%|██████████| 10/10 [00:02<00:00, 4.94it/s]
Running evals
Evaluating: 100%|██████████| 20/20 [00:21<00:00, 1.09s/it]
Result: {'context_precision': 0.9000, 'context_recall': 0.9000}
------ Fixed token with overlap ------
Deleting existing documents in the collection evals.chunking
Deletion complete
Getting contexts for evaluation set
100%|██████████| 10/10 [00:01<00:00, 5.10it/s]
Running evals
Evaluating: 100%|██████████| 20/20 [00:20<00:00, 1.03s/it]
Result: {'context_precision': 1.0000, 'context_recall': 0.9383}
------ Recursive with overlap ------
Deleting existing documents in the collection evals.chunking
Deletion complete
Getting contexts for evaluation set
100%|██████████| 10/10 [00:01<00:00, 5.13it/s]
Running evals
Evaluating: 100%|██████████| 20/20 [00:22<00:00, 1.12s/it]
Result: {'context_precision': 0.9000, 'context_recall': 0.9008}
------ Recursive Python splitter with overlap ------
Deleting existing documents in the collection evals.chunking
Deletion complete
Getting contexts for evaluation set
100%|██████████| 10/10 [00:02<00:00, 4.75it/s]
Running evals
Evaluating: 100%|██████████| 20/20 [00:21<00:00, 1.10s/it]
Result: {'context_precision': 1.0000, 'context_recall': 0.8583}
CHUNK SIZE: 500
------ Fixed token without overlap ------
Deleting existing documents in the collection evals.chunking
Deletion complete
Getting contexts for evaluation set
100%|██████████| 10/10 [00:02<00:00, 4.99it/s]
Running evals
Evaluating: 100%|██████████| 20/20 [00:22<00:00, 1.11s/it]
Result: {'context_precision': 0.8833, 'context_recall': 0.9500}
------ Fixed token with overlap ------
Deleting existing documents in the collection evals.chunking
Deletion complete
Getting contexts for evaluation set
100%|██████████| 10/10 [00:02<00:00, 4.77it/s]
Running evals
Evaluating: 100%|██████████| 20/20 [00:17<00:00, 1.15it/s]
Result: {'context_precision': 0.7000, 'context_recall': 0.9000}
------ Recursive with overlap ------
Deleting existing documents in the collection evals.chunking
Deletion complete
Getting contexts for evaluation set
100%|██████████| 10/10 [00:02<00:00, 4.65it/s]
Running evals
Evaluating: 100%|██████████| 20/20 [00:20<00:00, 1.02s/it]
Result: {'context_precision': 0.5667, 'context_recall': 0.8236}
------ Recursive Python splitter with overlap ------
Deleting existing documents in the collection evals.chunking
Deletion complete
Getting contexts for evaluation set
100%|██████████| 10/10 [00:01<00:00, 5.11it/s]
Running evals
Evaluating: 100%|██████████| 20/20 [00:15<00:00, 1.30it/s]
Result: {'context_precision': 0.6000, 'context_recall': 0.8800}
CHUNK SIZE: 1000
------ Fixed token without overlap ------
Deleting existing documents in the collection evals.chunking
Deletion complete
Getting contexts for evaluation set
100%|██████████| 10/10 [00:01<00:00, 5.18it/s]
Running evals
Evaluating: 100%|██████████| 20/20 [00:18<00:00, 1.08it/s]
Result: {'context_precision': 0.9000, 'context_recall': 0.8909}
------ Fixed token with overlap ------
Deleting existing documents in the collection evals.chunking
Deletion complete
Getting contexts for evaluation set
100%|██████████| 10/10 [00:02<00:00, 4.27it/s]
Running evals
Evaluating: 100%|██████████| 20/20 [00:17<00:00, 1.15it/s]
Result: {'context_precision': 0.7833, 'context_recall': 0.8909}
------ Recursive with overlap ------
Deleting existing documents in the collection evals.chunking
Deletion complete
Getting contexts for evaluation set
100%|██████████| 10/10 [00:03<00:00, 2.64it/s]
Running evals
Evaluating: 100%|██████████| 20/20 [00:19<00:00, 1.02it/s]
Result: {'context_precision': 0.7833, 'context_recall': 0.8800}
------ Recursive Python splitter with overlap ------
Deleting existing documents in the collection evals.chunking
Deletion complete
Getting contexts for evaluation set
100%|██████████| 10/10 [00:02<00:00, 4.64it/s]
Running evals
Evaluating: 100%|██████████| 20/20 [00:19<00:00, 1.01it/s]
Result: {'context_precision': 0.8000, 'context_recall': 0.8709}
------ Semantic chunking ------
Deleting existing documents in the collection evals.chunking
Deletion complete
Getting contexts for evaluation set
100%|██████████| 10/10 [00:02<00:00, 4.69it/s]
Running evals
Evaluating: 100%|██████████| 20/20 [00:23<00:00, 1.16s/it]
Result: {'context_precision': 0.9000, 'context_recall': 0.8187}
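The evaluation output above follows a regular pattern: a `CHUNK SIZE:` line, a dashed strategy header, and a `Result:` line holding a dictionary. To compare runs side by side, that output can be parsed into rows. The sketch below is not part of the original notebook; `parse_eval_log`, `SAMPLE_LOG`, and the choice to rank by the mean of the two metrics are assumptions made for illustration. The sample lines are copied from the first two runs above.

```python
import ast
import re
from typing import List, Optional, Tuple

# (chunk_size, strategy, context_precision, context_recall)
Row = Tuple[Optional[int], str, float, float]


def parse_eval_log(log: str) -> List[Row]:
    """Extract one row per 'Result:' line from the printed evaluation log."""
    rows: List[Row] = []
    chunk_size: Optional[int] = None
    strategy = ""
    for raw in log.splitlines():
        line = raw.strip()
        size_match = re.match(r"CHUNK SIZE:\s*(\d+)", line)
        header_match = re.match(r"-+\s*(.+?)\s*-+$", line)
        if size_match:
            chunk_size = int(size_match.group(1))
        elif header_match:
            strategy = header_match.group(1)
        elif line.startswith("Result:"):
            # The payload is a Python dict literal, so literal_eval is enough.
            scores = ast.literal_eval(line[len("Result:"):].strip())
            rows.append(
                (
                    chunk_size,
                    strategy,
                    scores["context_precision"],
                    scores["context_recall"],
                )
            )
    return rows


SAMPLE_LOG = """\
CHUNK SIZE: 100
------ Fixed token without overlap ------
Running evals
Result: {'context_precision': 0.8583, 'context_recall': 0.7833}
------ Fixed token with overlap ------
Running evals
Result: {'context_precision': 0.9000, 'context_recall': 0.9500}
"""

rows = parse_eval_log(SAMPLE_LOG)
# Assumed ranking criterion: the unweighted mean of precision and recall.
best = max(rows, key=lambda row: (row[2] + row[3]) / 2)
print(rows)
print(best)
```

Because the semantic chunking run is printed after the chunk size 1000 block, this parser would attribute it to chunk size 1000 even if that strategy does not depend on chunk size. Note also that each evaluation set here has only 10 samples, so small differences between scores should be read with caution.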