How to evaluate LLMs
[2]
Type your API Key··········
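The masked prompt above comes from a key-entry cell. A minimal sketch of that setup, assuming the v1 `mistralai` Python client; the `run_mistral` helper name is ours, introduced for the later examples:

```python
from getpass import getpass

from mistralai import Mistral

api_key = getpass("Type your API Key")  # masked input, as shown above
client = Mistral(api_key=api_key)

def run_mistral(user_message: str, model: str = "mistral-large-latest") -> str:
    """Send one user message to a Mistral model and return the reply text."""
    response = client.chat.complete(
        model=model,
        messages=[{"role": "user", "content": user_message}],
    )
    return response.choices[0].message.content
```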
Example 1: Information extraction benchmark with accuracy
Evaluation data
[3]
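The cell above holds the benchmark data. The notebook's actual dataset is not reproduced here, so the following is a hypothetical stand-in with the same shape: each test case pairs an input text with a golden answer.

```python
# Hypothetical evaluation data: each case pairs an input document with the
# golden (expected) structured answer the model should extract.
eval_data = [
    {
        "text": "Patient: Jane Doe, 34, reports mild headache for two days.",
        "golden_answer": {"name": "Jane Doe", "age": 34},
    },
    {
        "text": "Patient: John Roe, 58, presents with a persistent cough.",
        "golden_answer": {"name": "John Roe", "age": 58},
    },
]
```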
How to evaluate?
- Step 1: Define prompt template
[6]
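One way to write such a template is to ask for strict JSON, so the response can be compared to the golden answer mechanically; the field names follow the hypothetical data above.

```python
def build_prompt(text: str) -> str:
    """Prompt template: request strict JSON so the answer is machine-checkable."""
    return f"""Extract the patient's name and age from the medical note below.
Return only a JSON object with the keys "name" (string) and "age" (integer).

Medical note:
{text}
"""
```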
- Step 2: Define how we compare the model response with the golden answer
[7]
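A sketch of an exact-match comparison, assuming the JSON output format requested above; unparseable responses count as misses.

```python
import json

def compare_with_golden(response: str, golden_answer: dict) -> bool:
    """Return True if the model's JSON answer exactly matches the gold answer."""
    try:
        predicted = json.loads(response)
    except json.JSONDecodeError:
        return False  # malformed output is scored as incorrect
    return predicted == golden_answer
```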
- Step 3: Calculate accuracy rate across test cases
[8]
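A sketch of the accuracy loop over the test cases; on the run recorded below it printed 100.0, i.e., every case matched.

```python
def accuracy(cases: list[dict]) -> float:
    """Percentage of test cases whose parsed answer matches the gold answer."""
    hits = 0
    for case in cases:
        response = run_mistral(build_prompt(case["text"]))
        hits += compare_with_golden(response, case["golden_answer"])
    return 100.0 * hits / len(cases)

print(accuracy(eval_data))
```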
100.0
Example 2: Evaluate code generation
[10]
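A hypothetical HumanEval-style benchmark entry: a natural-language task plus a unit test the generated code must pass. The notebook's actual tasks may differ.

```python
# Hypothetical code-generation benchmark: one task description per entry,
# with a matching unit test used to check the generated solution.
python_prompts = [
    "Write a Python function `add(a, b)` that returns the sum of a and b."
]
test_cases = ["assert add(2, 3) == 5"]
```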
- Step 1: Define prompt template
[11]
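A sketch of a template that asks for bare code, so the model's output can be executed directly against the unit test.

```python
def build_code_prompt(task: str) -> str:
    """Prompt template: request plain code with no prose or markdown fences."""
    return f"""Solve the following task.
Return only the Python code, with no explanation and no markdown fences.

Task:
{task}
"""
```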
- Step 2: Decide how we evaluate the code generation
[13]
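The pass@1 result further below matches HuggingFace's `evaluate` library and its `code_eval` metric, so a sketch along those lines; note that the metric refuses to execute model-generated code unless explicitly allowed.

```python
import os

import evaluate

# code_eval runs untrusted generated code, so it requires an explicit opt-in.
os.environ["HF_ALLOW_CODE_EVAL"] = "1"

code_eval = evaluate.load("code_eval")
```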
[14]
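A sketch of running the metric on one generated candidate. `code_eval.compute` returns the pass@k scores together with per-task results, which is the tuple printed below.

```python
candidate = run_mistral(build_code_prompt(python_prompts[0]))

# code_eval expects one list of candidate solutions per reference test.
result = code_eval.compute(
    references=test_cases,
    predictions=[[candidate]],
    k=[1],
)
result
```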
/usr/lib/python3.10/multiprocessing/popen_fork.py:66: RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.
  self.pid = os.fork()
({'pass@1': 1.0},
 defaultdict(list,
             {0: [(0,
                   {'task_id': 0,
                    'passed': True,
                    'result': 'passed',
                    'completion_id': 0})]}))
- Step 3: Calculate accuracy rate across test cases
[15]
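Unpacking the tuple from the previous cell gives the headline score, the fraction of tasks solved on the first attempt:

```python
pass_at_k, detailed_results = result
pass_at_k
```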
{'pass@1': 1.0}
Example 3: Evaluate summary generation with an LLM
[16]
- Step 1: Generate a summary of the given news article
[20]
[21]
[22]
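The three cells above cover loading the article, building the prompt, and calling the model. A sketch under those assumptions; `news` is a hypothetical placeholder, and the summarizing model is assumed to be smaller than the judge used in Step 3.

```python
news = (
    "Hypothetical placeholder for the news article used in the notebook; "
    "substitute any article text here."
)

summary_prompt = f"""Summarize the following news article in 2-3 sentences.

Article:
{news}
"""

summary = run_mistral(summary_prompt, model="mistral-small-latest")
print(summary)
```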
- Step 2: Define evaluation metrics and rubrics
[23]
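A sketch of rubric definitions whose 1-5 JSON scores line up with the outputs further below; the rubric wording is ours, not the notebook's.

```python
# One rubric per metric; the judge is asked to return each score as JSON,
# e.g. {"relevancy": 4}, so the results are easy to parse and aggregate.
rubrics = {
    "relevancy": (
        "Score 1-5 how well the summary captures the key points of the "
        "article: 5 = all main points, no irrelevant content; 1 = misses them."
    ),
    "readability": (
        "Score 1-5 how clear and fluent the summary is: "
        "5 = effortless to read; 1 = hard to follow."
    ),
}
```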
- Step 3: Employ a more powerful LLM (e.g., Mistral Large) as a judge
[24]
[25]
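A sketch of the judging step, assuming the judge model returns the bare JSON object it is asked for, as in the two outputs printed below.

```python
def judge(metric: str, rubric: str, article: str, summary: str) -> str:
    """Ask a stronger model to grade the summary against a single rubric."""
    prompt = f"""You are grading a news summary.

Rubric for "{metric}": {rubric}

Article:
{article}

Summary:
{summary}

Return only a JSON object of the form {{"{metric}": <score 1-5>}}."""
    return run_mistral(prompt, model="mistral-large-latest")

for metric, rubric in rubrics.items():
    print(judge(metric, rubric, news, summary))
```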
{"relevancy": 4}
{"readability": 3}