Ragas Evaluation
RAG Series Part 2: How to evaluate your RAG application
This notebook shows how to evaluate a RAG application using the RAGAS framework.
Step 1: Install required libraries
- **datasets**: Python library to access datasets on the Hugging Face Hub
- **ragas**: Python library for the RAGAS framework
- **langchain**: Python library to develop LLM applications using LangChain
- **langchain-mongodb**: Python package to use MongoDB Atlas Vector Search with LangChain
- **langchain-openai**: Python package to use OpenAI models in LangChain
- **pymongo**: Python driver for interacting with MongoDB
- **pandas**: Python library for data analysis, exploration, and manipulation
- **tqdm**: Python module to show a progress meter for loops
- **matplotlib, seaborn**: Python libraries for data visualization
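The install cell isn't shown in this export; the packages listed above can be installed in one command (version pins omitted here):

```shell
pip install -qU datasets ragas langchain langchain-mongodb langchain-openai \
    pymongo pandas tqdm matplotlib seaborn
```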
Step 2: Set up prerequisites
Enter your OpenAI API Key:········
Enter your MongoDB connection string:········
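The setup cells aren't shown in this export. A minimal sketch of this step, assuming the secrets are kept in environment variables (the variable names here are assumptions, not the notebook's exact ones):

```python
import getpass
import os

def init_secrets() -> None:
    """Prompt for any secrets that are not already set in the environment."""
    # OPENAI_API_KEY and MONGODB_URI are assumed names; adjust to your setup.
    prompts = {
        "OPENAI_API_KEY": "Enter your OpenAI API Key:",
        "MONGODB_URI": "Enter your MongoDB connection string:",
    }
    for var, prompt in prompts.items():
        if var not in os.environ:
            # getpass hides the input, which is why the prompts above show dots.
            os.environ[var] = getpass.getpass(prompt)
```

Storing the values in `os.environ` lets downstream clients such as `langchain-openai` pick them up automatically.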
Step 3: Download the evaluation dataset
232
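The download cells aren't visible in this export, and neither is the dataset identifier, so the sketch below leaves it as a placeholder. A typical pattern is to load the Hugging Face dataset and flatten it into a DataFrame for inspection; the `232` above is presumably the number of usable examples:

```python
import pandas as pd

def to_eval_df(records) -> pd.DataFrame:
    """Flatten evaluation records (e.g. a `datasets` split) into a DataFrame."""
    df = pd.DataFrame(records)
    # Keep only rows that actually have a question.
    df = df.dropna(subset=["question"])
    return df.reset_index(drop=True)

# In the notebook this would be fed by something like:
# from datasets import load_dataset
# data = load_dataset("<dataset-id>", split="train")  # dataset id not shown in the export
# df = to_eval_df(data)
# len(df)
```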
Step 4: Create reference document chunks
3795
"Figgis had problems because permits were not issued for some street scenes. This caused him to film some scenes on the Las Vegas strip in one take to avoid the police, which Figgis said benefited production and the authenticity of the acting, remarking "I've always hated the convention of shooting on a street, and then having to stop the traffic, and then having to tell the actors, 'Well, there's meant to be traffic here, so you're going to have to shout.' And they're shouting, but it's quiet and they feel really stupid, because it's unnatural. You put them up against a couple of trucks, with it all happening around them, and their voices become great". Filming took place over 28 days."
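The chunking cells aren't shown in this export (the `3795` above is the resulting chunk count, and the quoted text is a sample chunk). As a minimal, pure-Python stand-in for a proper text splitter such as LangChain's `RecursiveCharacterTextSplitter`, a fixed-size chunker with overlap looks like this; the size and overlap values are illustrative, not the notebook's settings:

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 150) -> list[str]:
    """Naive fixed-size chunker with overlap (a stand-in for a real text splitter).

    Overlap preserves context that would otherwise be cut at chunk boundaries,
    which helps retrieval match questions whose answer spans a boundary.
    """
    step = chunk_size - overlap
    return [text[i : i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]
```

In practice a recursive splitter that respects paragraph and sentence boundaries produces better chunks than this character-count approach.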
Step 5: Create embeddings and ingest them into MongoDB
Getting embeddings for the text-embedding-ada-002 model
Finished getting embeddings for the text-embedding-ada-002 model
Inserting embeddings for the text-embedding-ada-002 model
Finished inserting embeddings for the text-embedding-ada-002 model
Getting embeddings for the text-embedding-3-small model
Finished getting embeddings for the text-embedding-3-small model
Inserting embeddings for the text-embedding-3-small model
Finished inserting embeddings for the text-embedding-3-small model
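The embedding cells aren't shown in this export. A sketch of the step, batching the chunks to stay within OpenAI API limits and inserting the results into a MongoDB collection; the helper names are assumptions, and the batch size of 128 is an assumption consistent with the 3,795 chunks splitting into 30 batches:

```python
def batched(items, batch_size):
    """Yield successive batches so each embedding request stays within API limits."""
    for i in range(0, len(items), batch_size):
        yield items[i : i + batch_size]

def embed_and_ingest(texts, model, collection, batch_size=128):
    """Embed `texts` with an OpenAI embedding model and insert them into MongoDB."""
    # Imported lazily so this sketch can be read without the openai package installed.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    print(f"Getting embeddings for the {model} model")
    docs = []
    for batch in batched(texts, batch_size):
        response = client.embeddings.create(model=model, input=batch)
        docs.extend(
            {"text": text, "embedding": item.embedding}
            for text, item in zip(batch, response.data)
        )
    print(f"Finished getting embeddings for the {model} model")
    print(f"Inserting embeddings for the {model} model")
    collection.insert_many(docs)
    print(f"Finished inserting embeddings for the {model} model")
```

Running this once per model (here `text-embedding-ada-002` and `text-embedding-3-small`), ideally into separate collections, produces the log lines shown above and sets up the side-by-side retrieval comparison in the next step.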
Step 6: Compare embedding models for retrieval
Failed to parse output. Returning None.
Result for the text-embedding-ada-002 model: {'context_precision': 0.9310, 'context_recall': 0.8561}
Result for the text-embedding-3-small model: {'context_precision': 0.9116, 'context_recall': 0.8826}
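The evaluation cells aren't shown in this export. Retrieval quality is scored by gathering, for each of the 232 questions, the contexts retrieved from MongoDB Atlas Vector Search, then handing them to RAGAS. A sketch assuming the ragas v0.1-style metrics API (the helper names are assumptions; the retrieval itself, via a `langchain-mongodb` vector store retriever, is not shown):

```python
def build_retrieval_dataset(questions, ground_truths, retrieved_contexts):
    """Assemble the columns RAGAS expects for retrieval metrics."""
    assert len(questions) == len(ground_truths) == len(retrieved_contexts)
    return {
        "question": questions,
        "ground_truth": ground_truths,
        "contexts": retrieved_contexts,  # list[list[str]]: one context list per question
    }

def evaluate_retrieval(data: dict):
    """Score retrieved contexts with RAGAS context precision and recall (sketch)."""
    # Imported lazily; requires ragas plus an OpenAI key for the judge LLM.
    from datasets import Dataset
    from ragas import evaluate
    from ragas.metrics import context_precision, context_recall

    return evaluate(
        dataset=Dataset.from_dict(data),
        metrics=[context_precision, context_recall],
    )
```

Context precision asks whether the relevant chunks are ranked highly; context recall asks whether the retrieved chunks cover the ground-truth answer. The results above show ada-002 slightly ahead on precision and 3-small ahead on recall.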
Step 7: Compare completion models for generation
No statements were generated from the answer. (logged 3 times)
Result for the gpt-3.5-turbo-1106 model: {'faithfulness': 0.9671, 'answer_relevancy': 0.9105}
No statements were generated from the answer. (logged 5 times)
Result for the gpt-3.5-turbo model: {'faithfulness': 0.9714, 'answer_relevancy': 0.9087}
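The generation cells aren't shown in this export. For each completion model, the answer is generated from the retrieved contexts and then scored with RAGAS faithfulness and answer relevancy. A sketch assuming the ragas v0.1-style API; the prompt template and helper names are illustrative, not the notebook's:

```python
def format_rag_prompt(question: str, contexts: list) -> str:
    """Build a completion prompt from retrieved context (illustrative template)."""
    context_block = "\n\n".join(contexts)
    return (
        "Answer the question based only on the context below.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {question}"
    )

def evaluate_generation(questions, answers, contexts):
    """Score generated answers with RAGAS faithfulness and answer relevancy (sketch)."""
    # Imported lazily; requires ragas plus an OpenAI key for the judge LLM.
    from datasets import Dataset
    from ragas import evaluate
    from ragas.metrics import answer_relevancy, faithfulness

    dataset = Dataset.from_dict(
        {"question": questions, "answer": answers, "contexts": contexts}
    )
    return evaluate(dataset=dataset, metrics=[faithfulness, answer_relevancy])
```

Faithfulness checks that every claim in the answer is supported by the retrieved context; answer relevancy checks that the answer actually addresses the question. The "No statements were generated" warnings above come from RAGAS when it cannot decompose an answer into claims, and those rows are skipped rather than scored.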
Step 8: Measure overall performance of the RAG application
Overall metrics: {'answer_similarity': 0.8873, 'answer_correctness': 0.5922}
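The cells for this step aren't shown in this export. End-to-end quality compares the generated answer against the ground truth, so the dataset now needs all four columns. A sketch assuming the ragas v0.1-style API; `summarize_scores` is an assumed helper for reproducing the rounded printout above:

```python
def evaluate_overall(questions, answers, contexts, ground_truths):
    """End-to-end scoring with RAGAS answer similarity and correctness (sketch)."""
    # Imported lazily; requires ragas plus an OpenAI key for the judge LLM.
    from datasets import Dataset
    from ragas import evaluate
    from ragas.metrics import answer_correctness, answer_similarity

    dataset = Dataset.from_dict(
        {
            "question": questions,
            "answer": answers,
            "contexts": contexts,
            "ground_truth": ground_truths,
        }
    )
    return evaluate(dataset=dataset, metrics=[answer_similarity, answer_correctness])

def summarize_scores(scores: dict, ndigits: int = 4) -> dict:
    """Round metric values for reporting."""
    return {k: round(v, ndigits) for k, v in scores.items()}
```

Answer similarity measures semantic closeness between the generated answer and the ground truth, while answer correctness also penalizes factual mismatches, which is why it comes out much lower (0.5922 vs. 0.8873) on the same runs.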
Step 9: Track performance over time
InsertOneResult(ObjectId('66132f1305da5dc970ad919c'), acknowledged=True)
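The tracking cells aren't shown in this export, but the `InsertOneResult` above indicates each evaluation run is written back to MongoDB. A sketch, assuming a run record with a timestamp so results can be queried and plotted over time (the field and helper names are assumptions):

```python
from datetime import datetime, timezone

def build_run_record(metrics: dict, model: str) -> dict:
    """Attach metadata so each evaluation run can be compared over time."""
    return {
        "model": model,
        "metrics": metrics,
        "timestamp": datetime.now(timezone.utc),
    }

def log_run(collection, metrics: dict, model: str):
    """Insert the run record into a MongoDB collection; returns an InsertOneResult."""
    return collection.insert_one(build_run_record(metrics, model))
```

Querying this collection sorted by `timestamp` then gives a history of metric values per model, which is what makes regressions visible when you swap embedding models, prompts, or completion models.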