How to evaluate LLMs
[2]
Type your API Key··········
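The masked prompt above comes from a key-entry cell. A minimal sketch of that setup, assuming the v1 `mistralai` Python client; the `run_mistral` helper name is ours, introduced for the later examples:

```python
from getpass import getpass

from mistralai import Mistral

api_key = getpass("Type your API Key")  # masked input, as shown above
client = Mistral(api_key=api_key)

def run_mistral(user_message: str, model: str = "mistral-large-latest") -> str:
    """Send one user message to a Mistral model and return the reply text."""
    response = client.chat.complete(
        model=model,
        messages=[{"role": "user", "content": user_message}],
    )
    return response.choices[0].message.content
```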
Example 1: Information extraction benchmark with accuracy
Evaluation data
[3]
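The cell above holds the benchmark data. The notebook's actual dataset is not reproduced here, so the following is a hypothetical stand-in with the same shape: each test case pairs an input text with a golden answer.

```python
# Hypothetical evaluation data: each case pairs an input document with the
# golden (expected) structured answer the model should extract.
eval_data = [
    {
        "text": "Patient: Jane Doe, 34, reports mild headache for two days.",
        "golden_answer": {"name": "Jane Doe", "age": 34},
    },
    {
        "text": "Patient: John Roe, 58, presents with a persistent cough.",
        "golden_answer": {"name": "John Roe", "age": 58},
    },
]
```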
How to evaluate?
- Step 1: Define prompt template
[6]
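One way to write such a template is to ask for strict JSON, so the response can be compared to the golden answer mechanically; the field names follow the hypothetical data above.

```python
def build_prompt(text: str) -> str:
    """Prompt template: request strict JSON so the answer is machine-checkable."""
    return f"""Extract the patient's name and age from the medical note below.
Return only a JSON object with the keys "name" (string) and "age" (integer).

Medical note:
{text}
"""
```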
- Step 2: Define how we compare the model response with the golden answer
[7]
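A sketch of an exact-match comparison, assuming the JSON output format requested above; unparseable responses count as misses.

```python
import json

def compare_with_golden(response: str, golden_answer: dict) -> bool:
    """Return True if the model's JSON answer exactly matches the gold answer."""
    try:
        predicted = json.loads(response)
    except json.JSONDecodeError:
        return False  # malformed output is scored as incorrect
    return predicted == golden_answer
```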
- Step 3: Calculate accuracy rate across test cases
[8]
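A sketch of the accuracy loop over the test cases; on the run recorded below it printed 100.0, i.e., every case matched.

```python
def accuracy(cases: list[dict]) -> float:
    """Percentage of test cases whose parsed answer matches the gold answer."""
    hits = 0
    for case in cases:
        response = run_mistral(build_prompt(case["text"]))
        hits += compare_with_golden(response, case["golden_answer"])
    return 100.0 * hits / len(cases)

print(accuracy(eval_data))
```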
100.0
Example 2: Evaluate code generation
[10]
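A hypothetical HumanEval-style benchmark entry: a natural-language task plus a unit test the generated code must pass. The notebook's actual tasks may differ.

```python
# Hypothetical code-generation benchmark: one task description per entry,
# with a matching unit test used to check the generated solution.
python_prompts = [
    "Write a Python function `add(a, b)` that returns the sum of a and b."
]
test_cases = ["assert add(2, 3) == 5"]
```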
- Step 1: Define prompt template
[11]
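A sketch of a template that asks for bare code, so the model's output can be executed directly against the unit test.

```python
def build_code_prompt(task: str) -> str:
    """Prompt template: request plain code with no prose or markdown fences."""
    return f"""Solve the following task.
Return only the Python code, with no explanation and no markdown fences.

Task:
{task}
"""
```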
- Step 2: Decide how we evaluate the code generation
[13]
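The pass@1 result further below matches HuggingFace's `evaluate` library and its `code_eval` metric, so a sketch along those lines; note that the metric refuses to execute model-generated code unless explicitly allowed.

```python
import os

import evaluate

# code_eval runs untrusted generated code, so it requires an explicit opt-in.
os.environ["HF_ALLOW_CODE_EVAL"] = "1"

code_eval = evaluate.load("code_eval")
```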
[14]
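A sketch of running the metric on one generated candidate. `code_eval.compute` returns the pass@k scores together with per-task results, which is the tuple printed below.

```python
candidate = run_mistral(build_code_prompt(python_prompts[0]))

# code_eval expects one list of candidate solutions per reference test.
result = code_eval.compute(
    references=test_cases,
    predictions=[[candidate]],
    k=[1],
)
result
```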
/usr/lib/python3.10/multiprocessing/popen_fork.py:66: RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.
  self.pid = os.fork()
({'pass@1': 1.0},
 defaultdict(list,
             {0: [(0,
                   {'task_id': 0,
                    'passed': True,
                    'result': 'passed',
                    'completion_id': 0})]}))
- Step 3: Calculate accuracy rate across test cases
[15]
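Unpacking the tuple from the previous cell gives the headline score, the fraction of tasks solved on the first attempt:

```python
pass_at_k, detailed_results = result
pass_at_k
```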
{'pass@1': 1.0}
Example 3: Evaluate summary generation with an LLM
[16]
- Step 1: Generate a summary of the given news article
[20]
[21]
[22]
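The three cells above cover loading the article, building the prompt, and calling the model. A sketch under those assumptions; `news` is a hypothetical placeholder, and the summarizing model is assumed to be smaller than the judge used in Step 3.

```python
news = (
    "Hypothetical placeholder for the news article used in the notebook; "
    "substitute any article text here."
)

summary_prompt = f"""Summarize the following news article in 2-3 sentences.

Article:
{news}
"""

summary = run_mistral(summary_prompt, model="mistral-small-latest")
print(summary)
```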
- Step 2: Define evaluation metrics and rubrics
[23]
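A sketch of rubric definitions whose 1-5 JSON scores line up with the outputs further below; the rubric wording is ours, not the notebook's.

```python
# One rubric per metric; the judge is asked to return each score as JSON,
# e.g. {"relevancy": 4}, so the results are easy to parse and aggregate.
rubrics = {
    "relevancy": (
        "Score 1-5 how well the summary captures the key points of the "
        "article: 5 = all main points, no irrelevant content; 1 = misses them."
    ),
    "readability": (
        "Score 1-5 how clear and fluent the summary is: "
        "5 = effortless to read; 1 = hard to follow."
    ),
}
```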
- Step 3: Employ a more powerful LLM (e.g., Mistral Large) as a judge
[24]
[25]
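A sketch of the judging step, assuming the judge model returns the bare JSON object it is asked for, as in the two outputs printed below.

```python
def judge(metric: str, rubric: str, article: str, summary: str) -> str:
    """Ask a stronger model to grade the summary against a single rubric."""
    prompt = f"""You are grading a news summary.

Rubric for "{metric}": {rubric}

Article:
{article}

Summary:
{summary}

Return only a JSON object of the form {{"{metric}": <score 1-5>}}."""
    return run_mistral(prompt, model="mistral-large-latest")

for metric, rubric in rubrics.items():
    print(judge(metric, rubric, news, summary))
```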
{"relevancy": 4}
{"readability": 3}