Eval
There are broadly two types of evaluation:
- Offline
- Online

Offline Evaluation
This is where you build the models, eval models, LLMs etc. and compute everything in a controlled environment, against a fixed benchmark: ground-truth answers, reference contexts and so on. Metrics such as Hit Rate, NDCG, MRR, Precision@K and Recall@K are used here to check the effectiveness of the Embedding Model + Vector DB + Rerankers + Chunking Strategy.
On the other hand, Fluency, Complexity, Perplexity, BERTScore, BLEU, ROUGE, METEOR, LLM-as-Judge, Groundedness, Hallucination Rate, Toxicity, Context Adherence, Faithfulness etc. are used to judge the quality of the LLM and its response. You can also use your own rubric-fine-tuned models as the LLM-as-Judge.
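The retrieval-side metrics above are simple to compute once you have ranked results and ground-truth relevance labels. A minimal self-contained sketch (binary relevance assumed; not the repo's implementation):

```python
import math

def hit_rate(retrieved, relevant, k):
    # 1 if any relevant doc appears in the top-k retrieved list, else 0
    return int(any(doc in relevant for doc in retrieved[:k]))

def mrr(retrieved, relevant):
    # reciprocal rank of the first relevant document (0 if none found)
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def precision_at_k(retrieved, relevant, k):
    return sum(doc in relevant for doc in retrieved[:k]) / k

def recall_at_k(retrieved, relevant, k):
    return sum(doc in relevant for doc in retrieved[:k]) / len(relevant)

def ndcg_at_k(retrieved, relevant, k):
    # binary gains discounted by log2(rank + 1), normalised by the ideal ranking
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc in enumerate(retrieved[:k], start=1)
              if doc in relevant)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0

retrieved = ["d3", "d1", "d7", "d2"]   # ranked retriever output (toy data)
relevant = {"d1", "d2"}                # ground-truth relevant docs
print(mrr(retrieved, relevant))        # first hit at rank 2 -> 0.5
print(recall_at_k(retrieved, relevant, 4))  # both relevant docs found -> 1.0
```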
Online Evaluation
This is where we do live evaluation. There is no ground truth; instead, pipelines are in place across all the components to surface the system's shortcomings. There are multiple choke / failure points. For example:
- The LLM itself is slow, low quality, toxic or biased, prone to prompt injections, and may divulge sensitive info
- Context retrieval models are slow or not good enough, so you need a different chunk size, a different chunking strategy, etc.
- Vector DBs use ANN instead of exact cosine similarity, so they have their own issues and you end up adding re-rankers. On the other hand, not all tasks need semantic search, so sometimes you need syntactic (keyword) search instead
- API failure rate, latency, time to response, throughput, load resistance, etc. are checked here
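To make the ANN point concrete: ANN indexes approximate the exact nearest-neighbour search below, which scores every document per query and is too slow at scale. A toy brute-force version (illustrative only; real vectors come from an embedding model):

```python
import math

def cosine(a, b):
    # exact cosine similarity between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def brute_force_top_k(query_vec, doc_vecs, k=2):
    # score every document: O(N) per query, which is exactly what ANN avoids
    scored = sorted(doc_vecs.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]

docs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.7, 0.7]}
print(brute_force_top_k([1.0, 0.1], docs, k=2))  # -> ['a', 'c']
```

ANN trades a little of this exactness for speed, which is why a re-ranker over the ANN candidates often recovers the lost precision.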
Types of Evaluation Metrics:
In this repo you'll find six classes, of which four are actively working and two are abstract. Together the working classes implement more than 50 metrics. They are as follows:
IOGuards
Guards to protect the model from prompt injections and to catch divulged sensitive data, bias, toxicity, polarity, harmful output, sentiment, etc. across the Query, Context and Response.
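The `*_regex` guards in the sample output below can be pictured as a lookup table of PII patterns run over each of query, context and response. A minimal sketch (these three patterns are illustrative only; production guards need far broader coverage):

```python
import re

# hypothetical patterns for demonstration, not the repo's actual rule set
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def regex_guard(text):
    # return every PII-like match found, keyed by category; empty dict = clean
    return {name: pat.findall(text)
            for name, pat in PII_PATTERNS.items()
            if pat.findall(text)}

print(regex_guard("email me at jane.doe@example.com or call 555-123-4567"))
print(regex_guard("nothing sensitive here"))  # -> {}
```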
TextStat
These mostly score the Response, answering questions like: how complex is the output, how understandable is it, how fluent is it, which grade of student could understand it, etc.
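As an example of what such readability scores measure, the Flesch Reading Ease formula is 206.835 − 1.015·(words/sentences) − 84.6·(syllables/word); higher means easier. A rough pure-Python sketch (the syllable counter is a crude vowel-group heuristic, so scores only approximate library output):

```python
import re

def count_syllables(word):
    # crude heuristic: each run of consecutive vowels counts as one syllable
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / sentences)
            - 84.6 * (syllables / len(words)))

print(flesch_reading_ease("The cat sat on the mat."))  # easy text, high score
```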
ComparisonMetrics
Mostly used for summarisation. These are Query-Context, Query-Response and Context-Response based metrics: they tell you the answer quality relative to the query and the context. Metrics include Hallucination, Contradiction, BERTScore, ROUGE, METEOR, etc.
Within it there are also string-matching based ones using BM25, Levenshtein distance, fuzzy scores, etc. (LSH and other hashing methods still need to be added.)
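Levenshtein distance, the core of several of those string-matching metrics, counts the minimum number of single-character insertions, deletions and substitutions to turn one string into another. A standard dynamic-programming sketch (libraries compute the same thing much faster):

```python
def levenshtein(a, b):
    # dynamic programming over edit operations, row by row
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # classic example -> 3
```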
AppMetrics
Checks overall app usage: failure rate, latency, time to response, time to fetch the context, time for the LLM to answer, etc. Tools like New Relic and Prometheus can be integrated directly with Flask or FastAPI; Streamlit is not supported by them, so I have written my own decorators for timing.
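A timing decorator of the kind mentioned above can be as small as this (a minimal sketch, not the repo's actual decorator; the `latencies` attribute is an assumed convention for collecting samples):

```python
import functools
import time

def timed(fn):
    # record the wall-clock latency of every call on the wrapped function
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            wrapper.latencies.append(time.perf_counter() - start)
    wrapper.latencies = []  # one sample per call, in seconds
    return wrapper

@timed
def fetch_context(query):
    # stand-in for the real retrieval call
    return f"context for {query}"

fetch_context("q1")
fetch_context("q2")
print(len(fetch_context.latencies))  # -> 2
```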
LLMasJudge
These are nothing but a wrapper: you use a prompt and send the question, answer and context to an LLM to get a verdict. In offline evaluation you can also ask for reasoning steps and compare them against ground truth or human judgements. This can be used for any task. Models such as Prometheus-2, PHUDGE and JudgeLM have been fine-tuned specifically for judging.
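Since the class is "nothing but a wrapper", its shape is roughly the following (prompt wording and the `llm_call` callable are hypothetical; any model client that maps a prompt string to a response string would slot in):

```python
# hypothetical judging prompt; real rubrics are usually far more detailed
JUDGE_PROMPT = """You are an impartial judge. Rate the ANSWER to the QUESTION
using the CONTEXT, on a 1-5 scale, and explain your reasoning step by step.

QUESTION: {question}
CONTEXT: {context}
ANSWER: {answer}

Respond as: SCORE: <1-5> REASONING: <steps>"""

def judge(question, context, answer, llm_call):
    # llm_call: any callable taking a prompt string and returning text,
    # e.g. a local fine-tuned judge model or a hosted API client
    prompt = JUDGE_PROMPT.format(question=question,
                                 context=context,
                                 answer=answer)
    return llm_call(prompt)

# usage with a stub in place of a real model
verdict = judge("What is 2+2?", "Basic arithmetic.", "4",
                lambda p: "SCORE: 5 REASONING: correct and concise")
print(verdict)
```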
TraditionalPipelines
You use traditional pipelines to extract topics from the Query, Context and Response, then compare whether they all talk about the same thing. You can also add POS tagging and other classification tasks for your use case.
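The "do they talk about the same thing" comparison can be approximated, at its simplest, by Jaccard overlap of content words (a toy sketch with a tiny stopword list; real pipelines use proper topic models or keyword extractors):

```python
# minimal stopword list for illustration only
STOPWORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}

def topic_terms(text):
    # keep lowercase alphabetic content words as a crude topic signature
    return {w for w in text.lower().split()
            if w.isalpha() and w not in STOPWORDS}

def jaccard(a, b):
    # overlap of the two topic signatures: 0 = disjoint, 1 = identical
    ta, tb = topic_terms(a), topic_terms(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

print(jaccard("the capital of france", "france capital city"))
```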
Requirements
Tested with: Python 3.9
Steps:
pip install -r requirements.txt
pip install -U evaluate   (without it, some old metrics won't work)
streamlit run eval_rag_app.py
Launching the app shows an st.spinner while the models load; expect output like the following:
[nltk_data] Downloading package vader_lexicon to
[nltk_data] /Users/mahkumar/nltk_data...
[nltk_data] Package vader_lexicon is already up-to-date!
Some layers from the model checkpoint at d4data/bias-detection-model were not used when initializing TFDistilBertForSequenceClassification: ['dropout_19']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at d4data/bias-detection-model and are newly initialized: ['dropout_39']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
/Users/mahkumar/anaconda3/envs/py39/lib/python3.9/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
warnings.warn(
WARNING:evaluate_modules.metrics.evaluate-metric--bleurt.98e148b2f8c4a88aba5037e4e0e90c9fd9ec35dc37a054ded8cfef0fa801ffab.bleurt:Using default BLEURT-Base checkpoint for sequence maximum length 128. You can use a bigger model for better results with e.g.: evaluate.load('bleurt', 'bleurt-large-512').
INFO:tensorflow:Reading checkpoint /Users/mahkumar/.cache/huggingface/metrics/bleurt/default/downloads/extracted/3ab93262e863625b5602d5c988317eca1e3022de221c7e6e9b88b58fca9ee841/bleurt-base-128.
INFO:tensorflow:Config file found, reading.
INFO:tensorflow:Will load checkpoint bert_custom
INFO:tensorflow:Loads full paths and checks that files exists.
INFO:tensorflow:... name:bert_custom
INFO:tensorflow:... vocab_file:vocab.txt
INFO:tensorflow:... bert_config_file:bert_config.json
INFO:tensorflow:... do_lower_case:True
INFO:tensorflow:... max_seq_length:128
INFO:tensorflow:Creating BLEURT scorer.
INFO:tensorflow:Creating WordPiece tokenizer.
INFO:tensorflow:WordPiece tokenizer instantiated.
INFO:tensorflow:Creating Eager Mode predictor.
INFO:tensorflow:Loading model.
INFO:tensorflow:BLEURT initialized.
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/mahkumar/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /Users/mahkumar/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/mahkumar/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
{'guards': {'query_injection': [{'label': 'SAFE', 'score': 0.9999986886978149}],
    'context_injection': [{'label': 'SAFE', 'score': 0.9999991655349731}],
    'query_bias': [{'label': 'Biased', 'score': 0.6330747604370117}],
    'context_bias': [{'label': 'Non-biased', 'score': 0.5858706831932068}],
    'response_bias': [{'label': 'Biased', 'score': 0.5588837265968323}],
    'query_regex': {},
    'context_regex': {},
    'response_regex': {},
    'query_toxicity': [{'label': 'toxic', 'score': 0.9225953817367554}],
    'context_toxicity': [{'label': 'toxic', 'score': 0.9640267491340637}],
    'response_toxicity': [{'label': 'non-toxic', 'score': 0.9988303780555725}],
    'query_sentiment': {'neg': 0.701, 'neu': 0.299, 'pos': 0.0, 'compound': -0.6908},
    'query_polarity': [{'negative': 0.98, 'other': 0.01, 'neutral': 0.01, 'positive': 0.0}],
    'context_polarity': [{'negative': 0.96, 'other': 0.03, 'neutral': 0.01, 'positive': 0.0}],
    'response_polarity': [{'negative': 0.7, 'other': 0.19, 'neutral': 0.1, 'positive': 0.02}],
    'query_response_hallucination': {'entailment': 1.5, 'neutral': 97.8, 'contradiction': 0.7},
    'context_response_hallucination': {'entailment': 10.3, 'neutral': 79.7, 'contradiction': 10.0},
    'harmful_query': False,
    'harmful_context': False,
    'harmful_response': False,
    'refusal_response': False},
 'reference_based_metrics': {
    'query_response_bertscore': {'precision': [0.8446345925331116],
        'recall': [0.8695610761642456],
        'f1': [0.8569165468215942],
        'hashcode': 'roberta-large_L17_no-idf_version=0.3.12(hug_trans=4.41.0)'},
    'query_response_rouge': {'rouge1': [0.125], 'rouge2': [0.0], 'rougeL': [0.125], 'rougeLsum': [0.125]},
    'query_response_bleu': {'bleu': 0.0,
        'precisions': [0.07692307692307693, 0.0, 0.0, 0.0],
        'brevity_penalty': 1.0,
        'length_ratio': 3.25,
        'translation_length': 13,
        'reference_length': 4},
    'query_response_bleurt': {'scores': [-1.2369751930236816]},
    'query_response_meteor': {'meteor': 0.10204081632653061},
    'context_response_bertscore': {'precision': [0.8416609168052673],
        'recall': [0.8466401696205139],
        'f1': [0.8441432118415833],
        'hashcode': 'roberta-large_L17_no-idf_version=0.3.12(hug_trans=4.41.0)'},
    'context_response_rouge': {'rouge1': [0.09523809523809525], 'rouge2': [0.0], 'rougeL': [0.09523809523809525], 'rougeLsum': [0.09523809523809525]},
    'context_response_bleu': {'bleu': 0.0,
        'precisions': [0.07692307692307693, 0.0, 0.0, 0.0],
        'brevity_penalty': 1.0,
        'length_ratio': 1.625,
        'translation_length': 13,
        'reference_length': 8},
    'context_response_bleurt': {'scores': [-1.3652904033660889]},
    'context_response_meteor': {'meteor': 0.05319148936170213}},
 'string_similarities': {'query_response_fuzz_q_ratio': 33,
    'query_response_fuzz_partial_ratio': 52,
    'query_response_fuzz_partial_token_set_ratio': 100,
    'query_response_fuzz_partial_token_sort_ratio': 52,
    'query_response_fuzz_token_set_ratio': 38,
    'query_response_fuzz_token_sort_ratio': 38,
    'query_response_levenshtein_distance': 49,
    'query_response_bm_25_scores': array([-0.27465307]),
    'context_response_fuzz_q_ratio': 40,
    'context_response_fuzz_partial_ratio': 40,
    'context_response_fuzz_partial_token_set_ratio': 100,
    'context_response_fuzz_partial_token_sort_ratio': 50,
    'context_response_fuzz_token_set_ratio': 42,
    'context_response_fuzz_token_sort_ratio': 42,
    'context_response_levenshtein_distance': 47,
    'context_response_bm_25_scores': array([-0.27465307])},
 'response_text_stats': {'result_flesch_reading_ease': 90.77,
    'result_flesch_kincaid_grade': 2.1,
    'result_smog_index': 0.0,
    'result_coleman_liau_index': 3.82,
    'result_automated_readability_index': 2.0,
    'result_dale_chall_readability_score': 6.57,
    'result_difficult_words': 2,
    'result_linsear_write_formula': 2.0,
    'result_gunning_fog': 2.4,
    'result_text_standard': '1st and 2nd grade',
    'result_fernandez_huerta': 122.72,
    'result_szigriszt_pazos': 122.96,
    'result_gutierrez_polini': 51.88,
    'result_crawford': -0.8,
    'result_gulpease_index': 95.7,
    'result_osman': 89.92}}