Ragas Evaluation
RAG Series Part 2: How to evaluate your RAG application
This notebook shows how to evaluate a RAG application using the RAGAS framework.
Step 1: Install required libraries
- **datasets**: Python library to access datasets on the Hugging Face Hub
- **ragas**: Python library for the RAGAS framework
- **langchain**: Python library to develop LLM applications using LangChain
- **langchain-mongodb**: Python package to use MongoDB Atlas Vector Search with LangChain
- **langchain-openai**: Python package to use OpenAI models in LangChain
- **pymongo**: Python driver for interacting with MongoDB
- **pandas**: Python library for data analysis, exploration, and manipulation
- **tqdm**: Python module to show a progress meter for loops
- **matplotlib, seaborn**: Python libraries for data visualization
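The install cell isn't shown in this export; the packages listed above can be installed in one command (version pins omitted here):

```shell
pip install -qU datasets ragas langchain langchain-mongodb langchain-openai \
    pymongo pandas tqdm matplotlib seaborn
```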
Step 2: Set up prerequisites
Enter your OpenAI API Key:········
Enter your MongoDB connection string:········
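The setup cells aren't shown in this export. A minimal sketch of this step, assuming the secrets are kept in environment variables (the variable names here are assumptions, not the notebook's exact ones):

```python
import getpass
import os

def init_secrets() -> None:
    """Prompt for any secrets that are not already set in the environment."""
    # OPENAI_API_KEY and MONGODB_URI are assumed names; adjust to your setup.
    prompts = {
        "OPENAI_API_KEY": "Enter your OpenAI API Key:",
        "MONGODB_URI": "Enter your MongoDB connection string:",
    }
    for var, prompt in prompts.items():
        if var not in os.environ:
            # getpass hides the input, which is why the prompts above show dots.
            os.environ[var] = getpass.getpass(prompt)
```

Storing the values in `os.environ` lets downstream clients such as `langchain-openai` pick them up automatically.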
Step 3: Download the evaluation dataset
232
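The download cells aren't visible in this export, and neither is the dataset identifier, so the sketch below leaves it as a placeholder. A typical pattern is to load the Hugging Face dataset and flatten it into a DataFrame for inspection; the `232` above is presumably the number of usable examples:

```python
import pandas as pd

def to_eval_df(records) -> pd.DataFrame:
    """Flatten evaluation records (e.g. a `datasets` split) into a DataFrame."""
    df = pd.DataFrame(records)
    # Keep only rows that actually have a question.
    df = df.dropna(subset=["question"])
    return df.reset_index(drop=True)

# In the notebook this would be fed by something like:
# from datasets import load_dataset
# data = load_dataset("<dataset-id>", split="train")  # dataset id not shown in the export
# df = to_eval_df(data)
# len(df)
```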
Step 4: Create reference document chunks
3795
"Figgis had problems because permits were not issued for some street scenes. This caused him to film some scenes on the Las Vegas strip in one take to avoid the police, which Figgis said benefited production and the authenticity of the acting, remarking "I've always hated the convention of shooting on a street, and then having to stop the traffic, and then having to tell the actors, 'Well, there's meant to be traffic here, so you're going to have to shout.' And they're shouting, but it's quiet and they feel really stupid, because it's unnatural. You put them up against a couple of trucks, with it all happening around them, and their voices become great". Filming took place over 28 days."
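The chunking cells aren't shown in this export (the `3795` above is the resulting chunk count, and the quoted text is a sample chunk). As a minimal, pure-Python stand-in for a proper text splitter such as LangChain's `RecursiveCharacterTextSplitter`, a fixed-size chunker with overlap looks like this; the size and overlap values are illustrative, not the notebook's settings:

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 150) -> list[str]:
    """Naive fixed-size chunker with overlap (a stand-in for a real text splitter).

    Overlap preserves context that would otherwise be cut at chunk boundaries,
    which helps retrieval match questions whose answer spans a boundary.
    """
    step = chunk_size - overlap
    return [text[i : i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]
```

In practice a recursive splitter that respects paragraph and sentence boundaries produces better chunks than this character-count approach.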
Step 5: Create embeddings and ingest them into MongoDB
Getting embeddings for the text-embedding-ada-002 model
Finished getting embeddings for the text-embedding-ada-002 model
Inserting embeddings for the text-embedding-ada-002 model
Finished inserting embeddings for the text-embedding-ada-002 model
Getting embeddings for the text-embedding-3-small model
Finished getting embeddings for the text-embedding-3-small model
Inserting embeddings for the text-embedding-3-small model
Finished inserting embeddings for the text-embedding-3-small model
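The embedding cells aren't shown in this export. A sketch of the step, batching the chunks to stay within OpenAI API limits and inserting the results into a MongoDB collection; the helper names are assumptions, and the batch size of 128 is an assumption consistent with the 3,795 chunks splitting into 30 batches:

```python
def batched(items, batch_size):
    """Yield successive batches so each embedding request stays within API limits."""
    for i in range(0, len(items), batch_size):
        yield items[i : i + batch_size]

def embed_and_ingest(texts, model, collection, batch_size=128):
    """Embed `texts` with an OpenAI embedding model and insert them into MongoDB."""
    # Imported lazily so this sketch can be read without the openai package installed.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    print(f"Getting embeddings for the {model} model")
    docs = []
    for batch in batched(texts, batch_size):
        response = client.embeddings.create(model=model, input=batch)
        docs.extend(
            {"text": text, "embedding": item.embedding}
            for text, item in zip(batch, response.data)
        )
    print(f"Finished getting embeddings for the {model} model")
    print(f"Inserting embeddings for the {model} model")
    collection.insert_many(docs)
    print(f"Finished inserting embeddings for the {model} model")
```

Running this once per model (here `text-embedding-ada-002` and `text-embedding-3-small`), ideally into separate collections, produces the log lines shown above and sets up the side-by-side retrieval comparison in the next step.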
Step 6: Compare embedding models for retrieval
Failed to parse output. Returning None.
Result for the text-embedding-ada-002 model: {'context_precision': 0.9310, 'context_recall': 0.8561}
Result for the text-embedding-3-small model: {'context_precision': 0.9116, 'context_recall': 0.8826}
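The evaluation cells aren't shown in this export. Retrieval quality is scored by gathering, for each of the 232 questions, the contexts retrieved from MongoDB Atlas Vector Search, then handing them to RAGAS. A sketch assuming the ragas v0.1-style metrics API (the helper names are assumptions; the retrieval itself, via a `langchain-mongodb` vector store retriever, is not shown):

```python
def build_retrieval_dataset(questions, ground_truths, retrieved_contexts):
    """Assemble the columns RAGAS expects for retrieval metrics."""
    assert len(questions) == len(ground_truths) == len(retrieved_contexts)
    return {
        "question": questions,
        "ground_truth": ground_truths,
        "contexts": retrieved_contexts,  # list[list[str]]: one context list per question
    }

def evaluate_retrieval(data: dict):
    """Score retrieved contexts with RAGAS context precision and recall (sketch)."""
    # Imported lazily; requires ragas plus an OpenAI key for the judge LLM.
    from datasets import Dataset
    from ragas import evaluate
    from ragas.metrics import context_precision, context_recall

    return evaluate(
        dataset=Dataset.from_dict(data),
        metrics=[context_precision, context_recall],
    )
```

Context precision asks whether the relevant chunks are ranked highly; context recall asks whether the retrieved chunks cover the ground-truth answer. The results above show ada-002 slightly ahead on precision and 3-small ahead on recall.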
Step 7: Compare completion models for generation
No statements were generated from the answer. (logged 3 times)
Result for the gpt-3.5-turbo-1106 model: {'faithfulness': 0.9671, 'answer_relevancy': 0.9105}
No statements were generated from the answer. (logged 5 times)
Result for the gpt-3.5-turbo model: {'faithfulness': 0.9714, 'answer_relevancy': 0.9087}
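The generation cells aren't shown in this export. For each completion model, the answer is generated from the retrieved contexts and then scored with RAGAS faithfulness and answer relevancy. A sketch assuming the ragas v0.1-style API; the prompt template and helper names are illustrative, not the notebook's:

```python
def format_rag_prompt(question: str, contexts: list) -> str:
    """Build a completion prompt from retrieved context (illustrative template)."""
    context_block = "\n\n".join(contexts)
    return (
        "Answer the question based only on the context below.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {question}"
    )

def evaluate_generation(questions, answers, contexts):
    """Score generated answers with RAGAS faithfulness and answer relevancy (sketch)."""
    # Imported lazily; requires ragas plus an OpenAI key for the judge LLM.
    from datasets import Dataset
    from ragas import evaluate
    from ragas.metrics import answer_relevancy, faithfulness

    dataset = Dataset.from_dict(
        {"question": questions, "answer": answers, "contexts": contexts}
    )
    return evaluate(dataset=dataset, metrics=[faithfulness, answer_relevancy])
```

Faithfulness checks that every claim in the answer is supported by the retrieved context; answer relevancy checks that the answer actually addresses the question. The "No statements were generated" warnings above come from RAGAS when it cannot decompose an answer into claims, and those rows are skipped rather than scored.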
Step 8: Measure overall performance of the RAG application
Overall metrics: {'answer_similarity': 0.8873, 'answer_correctness': 0.5922}
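The cells for this step aren't shown in this export. End-to-end quality compares the generated answer against the ground truth, so the dataset now needs all four columns. A sketch assuming the ragas v0.1-style API; `summarize_scores` is an assumed helper for reproducing the rounded printout above:

```python
def evaluate_overall(questions, answers, contexts, ground_truths):
    """End-to-end scoring with RAGAS answer similarity and correctness (sketch)."""
    # Imported lazily; requires ragas plus an OpenAI key for the judge LLM.
    from datasets import Dataset
    from ragas import evaluate
    from ragas.metrics import answer_correctness, answer_similarity

    dataset = Dataset.from_dict(
        {
            "question": questions,
            "answer": answers,
            "contexts": contexts,
            "ground_truth": ground_truths,
        }
    )
    return evaluate(dataset=dataset, metrics=[answer_similarity, answer_correctness])

def summarize_scores(scores: dict, ndigits: int = 4) -> dict:
    """Round metric values for reporting."""
    return {k: round(v, ndigits) for k, v in scores.items()}
```

Answer similarity measures semantic closeness between the generated answer and the ground truth, while answer correctness also penalizes factual mismatches, which is why it comes out much lower (0.5922 vs. 0.8873) on the same runs.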
Step 9: Track performance over time
InsertOneResult(ObjectId('66132f1305da5dc970ad919c'), acknowledged=True)
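The tracking cells aren't shown in this export, but the `InsertOneResult` above indicates each evaluation run is written back to MongoDB. A sketch, assuming a run record with a timestamp so results can be queried and plotted over time (the field and helper names are assumptions):

```python
from datetime import datetime, timezone

def build_run_record(metrics: dict, model: str) -> dict:
    """Attach metadata so each evaluation run can be compared over time."""
    return {
        "model": model,
        "metrics": metrics,
        "timestamp": datetime.now(timezone.utc),
    }

def log_run(collection, metrics: dict, model: str):
    """Insert the run record into a MongoDB collection; returns an InsertOneResult."""
    return collection.insert_one(build_run_record(metrics, model))
```

Querying this collection sorted by `timestamp` then gives a history of metric values per model, which is what makes regressions visible when you swap embedding models, prompts, or completion models.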