Evaluation


How to evaluate LLMs

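The notebook starts by reading your Mistral API key interactively and building a client. A minimal setup sketch, assuming the v1 `mistralai` Python client; the `run_mistral` helper and the default model name are assumptions reused in the sketches below:

```python
from getpass import getpass

from mistralai import Mistral

# Prompt for the key instead of hard-coding it in the notebook.
api_key = getpass("Type your API Key")
client = Mistral(api_key=api_key)

def run_mistral(user_message: str, model: str = "mistral-small-latest") -> str:
    """Send one user message and return the model's text reply."""
    response = client.chat.complete(
        model=model,
        messages=[{"role": "user", "content": user_message}],
    )
    return response.choices[0].message.content
```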

Example 1: Information extraction benchmark with accuracy

Evaluation data

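The benchmark needs test cases that pair an input text with a golden answer. The original data isn't reproduced here; as an illustrative stand-in, assume a small list of medical notes with structured golden answers (the field names below are hypothetical):

```python
# Hypothetical evaluation set: each case pairs a raw note with the
# golden structured answer the model is expected to extract.
eval_cases = [
    {
        "text": "Patient is a 45-year-old male presenting with hypertension.",
        "golden_answer": {"age": 45, "gender": "male", "diagnosis": "hypertension"},
    },
    {
        "text": "A 60-year-old female was diagnosed with type 2 diabetes.",
        "golden_answer": {"age": 60, "gender": "female", "diagnosis": "type 2 diabetes"},
    },
]
```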

How to evaluate?

  • Step 1: Define prompt template
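One way to phrase the extraction prompt; the wording and the requested JSON schema are assumptions matching the hypothetical data above:

```python
def build_extraction_prompt(text: str) -> str:
    # Ask for strict JSON so the response can be parsed and compared
    # field-by-field against the golden answer.
    return f"""
Extract the patient's age, gender, and diagnosis from the medical note
below. Return only a JSON object with the keys "age" (integer),
"gender", and "diagnosis".

Medical note:
{text}
"""
```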
  • Step 2: Define how we compare the model response with the golden answer
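Because the golden answer is structured, one simple option is to parse the response as JSON and require an exact match. A sketch:

```python
import json

def compare(response: str, golden_answer: dict) -> bool:
    """Parse the model response as JSON and require an exact field match."""
    try:
        parsed = json.loads(response)
    except json.JSONDecodeError:
        return False  # unparseable output counts as a miss
    return parsed == golden_answer
```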
  • Step 3: Calculate accuracy rate across test cases
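Accuracy is then the fraction of test cases where the parsed response matches the golden answer, reported as a percentage; the output below shows a perfect score on this run. A sketch combining the helpers above:

```python
def accuracy(cases: list[dict]) -> float:
    # Run the model on every case and count exact matches.
    correct = sum(
        compare(run_mistral(build_extraction_prompt(c["text"])), c["golden_answer"])
        for c in cases
    )
    return 100.0 * correct / len(cases)

print(accuracy(eval_cases))
```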
100.0

Example 2: Evaluate code generation

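As in Example 1, the task data is sketched rather than reproduced: a HumanEval-style problem consisting of a function stub for the model to complete plus a reference test the completion must pass (the problem below is a hypothetical stand-in):

```python
# Hypothetical HumanEval-style task.
python_prompt = '''
def add(a: int, b: int) -> int:
    """Return the sum of a and b."""
'''

# Reference test: runs against whatever code the model generates.
test_case = '''
def check(candidate):
    assert candidate(2, 3) == 5
    assert candidate(-1, 1) == 0

check(add)
'''
```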
  • Step 1: Define prompt template
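The prompt asks for bare code so the response can be executed directly against the tests; the exact wording is an assumption:

```python
def build_code_prompt(stub: str) -> str:
    # Keeping the response to bare code avoids stripping markdown fences later.
    return f"""
Complete the following Python function. Return only the full function
definition, with no explanation and no markdown formatting.

{stub}
"""

generated_code = run_mistral(build_code_prompt(python_prompt))
```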
  • Step 2: Decide how to evaluate the generated code
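The notebook uses the `code_eval` metric from Hugging Face's `evaluate` library, which executes each candidate completion against the reference tests. Because it runs model-generated code, the metric is gated behind the `HF_ALLOW_CODE_EVAL` environment variable:

```python
import os

from evaluate import load

# code_eval executes model-generated code, so it must be enabled explicitly.
os.environ["HF_ALLOW_CODE_EVAL"] = "1"

code_eval = load("code_eval")
```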
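Running the metric on the single task above returns the aggregate pass@k plus per-completion details, as shown in the output below. A sketch:

```python
# One task: a list of candidate completions and one reference test.
pass_at_k, results = code_eval.compute(
    references=[test_case],
    predictions=[[generated_code]],
    k=[1],
)
print((pass_at_k, results))
```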
({'pass@1': 1.0},
 defaultdict(list,
             {0: [(0,
                   {'task_id': 0,
                    'passed': True,
                    'result': 'passed',
                    'completion_id': 0})]}))
  • Step 3: Calculate accuracy rate across test cases
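With several tasks, the same call aggregates pass@1 across all of them: one list of candidate completions and one reference test per task. A sketch, using the single hypothetical task from above:

```python
# Generate one completion per task, then score them all in one call.
tasks = [{"stub": python_prompt, "test": test_case}]  # extend with more tasks

predictions = [[run_mistral(build_code_prompt(t["stub"]))] for t in tasks]
references = [t["test"] for t in tasks]

pass_at_k, _ = code_eval.compute(references=references, predictions=predictions, k=[1])
print(pass_at_k)
```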
{'pass@1': 1.0}

Example 3: Evaluate summary generation with an LLM

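The news article itself isn't reproduced here; assume a plain string holding the full text:

```python
# Placeholder; the notebook uses a real news article here.
news = "..."  # full article text goes here
```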
  • Step 1: Generate summary for the given news
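Summary generation is a single chat call; the instruction wording is an assumption:

```python
def summarize(article: str, model: str = "mistral-small-latest") -> str:
    # Keep the instruction simple; the judge below scores the result.
    prompt = f"Summarize the following news article in a short paragraph:\n\n{article}"
    return run_mistral(prompt, model=model)

summary = summarize(news)
```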
  • Step 2: Define evaluation metrics and rubrics
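Each metric gets a short rubric telling the judge what a given score means. The two metrics and the 1-4 scale follow the outputs shown below; the rubric wording is an assumption:

```python
# One rubric per metric; the judge is asked to score on a 1-4 scale.
rubrics = {
    "relevancy": (
        "How well does the summary capture the key points of the article? "
        "1 = misses the main points, 4 = covers all key points faithfully."
    ),
    "readability": (
        "How clear and well-written is the summary? "
        "1 = hard to follow, 4 = fluent and easy to read."
    ),
}
```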
  • Step 3: Employ a more powerful LLM (e.g., Mistral Large) as a judge
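The judge prompt asks a stronger model (Mistral Large) to grade the summary against one rubric at a time and reply with a single JSON object, which yields the scores shown below. A sketch:

```python
def judge(article: str, summary: str, metric: str, rubric: str,
          model: str = "mistral-large-latest") -> str:
    # Score one metric at a time; asking for bare JSON keeps parsing trivial.
    prompt = f"""
You are grading a news summary on "{metric}".

Rubric (score 1-4): {rubric}

Article:
{article}

Summary:
{summary}

Respond with only a JSON object of the form {{"{metric}": <score>}}.
"""
    return run_mistral(prompt, model=model)

for metric, rubric in rubrics.items():
    print(judge(news, summary, metric, rubric))
```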
{"relevancy": 4}
{"readability": 3}