Evals With Synthetic Data
Creating Evals with synthetic data and measuring hallucinations
When you deploy Llama for your use case, it is good practice to have Evals for it. While human-annotated Evals are ideal, this notebook shows a strategy for how one might bootstrap them using synthetic data. However, the generated Evals still require validation by a human to make sure that your production use case can rely on them. The notebook also shows how one could accurately measure hallucinations with Llama without using the LLM-As-A-Judge methodology.
Overall idea
Let's assume we have a use case for generating a summarization report from a given context, which is a pretty common use case for LLMs. Both the context and the report contain a lot of factual information, and we want to make sure the generated report is not hallucinating.
Since it's not trivial to find an open source dataset for this, the idea is to take synthetic tabular data and use Llama with prompt engineering to generate a story (context) for every row of the tabular data. We then ask Llama to summarize the generated context as a report in a specific format, again via prompt engineering. Finally, we check the factual accuracy of the generated report with Llama by converting the problem into a QA task, using the tabular data as the ground truth.
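The row-to-story and context-to-report steps above can be sketched as simple prompt builders. This is a minimal illustration, not the notebook's exact prompts; the field names, tag names, and wording are assumptions for the example.

```python
# Hypothetical sketch of the workflow: turn one row of tabular data into a
# story-generation prompt, then turn the generated story into a report prompt.
def story_prompt(row: dict) -> str:
    """Ask the model to write a narrative context grounded in one table row."""
    facts = "\n".join(f"- {k}: {v}" for k, v in row.items())
    return (
        "Write a progress report story for the student described below. "
        "Use every fact exactly as given; do not invent new numbers.\n"
        f"{facts}"
    )

def report_prompt(context: str, tags: list) -> str:
    """Ask the model to extract a tagged summary report from the story."""
    tag_spec = ", ".join(f"<{t}>" for t in tags)
    return (
        f"Summarize the context below into a report using only these tags: {tag_spec}.\n"
        f"Context:\n{context}"
    )

row = {"student_id": 3040587, "degree_type": "Sci&Tech", "salary": 5000}
prompt = story_prompt(row)
```

In the notebook, `prompt` would be sent to Llama, and the returned story would then be fed to `report_prompt` to produce the tagged report.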
To generate the synthetic tabular data for this approach, we use an open source tool, Synthetic Data Vault (SDV).
The overall workflow is shown in the below diagram

Synthetic Data Vault installation
SDV ships with a number of single-table datasets. We use the student_placements dataset in this notebook.
Generate synthetic data from real data
Load pre-generated synthetic tabular data
Synthetic Data Generation with Llama-3.3-70B-Instruct
In this section, we use Llama-3.3-70B-Instruct to create a story from the tabular data and then generate an extractive summary report from the generated context.
You could try Llama-3.1-8B-Instruct, but we have seen better results with the 70B model for generating synthetic data.
Alternate approach
In the section below, we treat the tabular data as the ground truth and generate all the contexts and reports from the table. Another approach is to use a couple of examples for few-shot prompting, have Llama generate new contexts and stories while varying the factual information, and then use Llama to create the ground-truth tabular data.
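The alternate approach above could be sketched as a few-shot prompt builder. The wording and the example pairs are placeholders, not the notebook's actual prompts.

```python
# Hypothetical few-shot prompt: seed Llama with two hand-written
# (context, report) pairs, then ask it to produce a new pair with
# different factual values. Ground-truth tabular data would then be
# extracted from the generated text in a later step.
def few_shot_prompt(examples: list) -> str:
    shots = "\n\n".join(
        f"Context:\n{ctx}\nReport:\n{rep}" for ctx, rep in examples
    )
    return (
        "Here are example context/report pairs:\n\n"
        f"{shots}\n\n"
        "Now write a new context and report in the same format, "
        "varying the factual information."
    )
```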
Generate 12 examples of synthetic data using this loop
Why 12? We will use 2 examples for few-shot prompting and the remaining 10 for Evals.
In practice, you want the number of data points to be much higher for your production application
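The split and file layout can be sketched as follows. The row contents and file-writing details are placeholders; in the notebook each file would also hold the generated context and report, which is why the accuracy check later reads `generated_data/data_2.json` through `data_11.json`.

```python
import json
import os

# 12 synthetic rows: the first 2 become few-shot examples,
# the remaining 10 are held out as the Eval set.
NUM_EXAMPLES = 12
rows = [{"student_id": i} for i in range(NUM_EXAMPLES)]  # placeholder rows
few_shot, eval_rows = rows[:2], rows[2:]

os.makedirs("generated_data", exist_ok=True)
for i, row in enumerate(eval_rows, start=2):
    with open(f"generated_data/data_{i}.json", "w") as f:
        json.dump(row, f)
```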
Example Context & Report
By manual inspection, we see that Llama has created a well-structured context and the corresponding report. We also see that all the factual information is correct.
Context is
-------------------------
### Progress Report for Student 3040587

#### <academic_background>
Student 3040587 has a strong academic foundation, with a high school percentage of 66.62% in Science. He also holds a degree in Science and Technology with a percentage of 75.76%. His second-year percentage is 75.01%, demonstrating his consistent academic performance.

#### <career_aspirations>
With a specialization in Marketing and Finance, Student 3040587 aspires to pursue a career in the finance sector, leveraging his skills in market analysis and financial planning. His career goal is to become a financial analyst, with a focus on investment banking.

#### <salary_expectations>
Student 3040587 expects a starting salary of 5000 tokens per annum, considering his one year of work experience and academic achievements. He is confident that his skills and knowledge will enable him to secure a job with a reputable company.

#### <placement_status>
Student 3040587 has been successfully placed, with an employability percentage of 85.98%. His placement is a testament to his hard work and dedication to his studies, as well as his relevant work experience.

#### <course_details>
Student 3040587 is currently pursuing an MBA with a specialization in Marketing and Finance, with a course duration of 3 years. He has completed one year of the course, with an MBA percentage of 58.37%.

#### <story_behind_the_numbers>
Behind the numbers, Student 3040587's story is one of perseverance and determination. Despite facing challenges in his academic journey, he has consistently worked hard to achieve his goals. His work experience has equipped him with the skills and knowledge required to succeed in the finance sector. With his strong academic background, career aspirations, and relevant work experience, Student 3040587 is poised to achieve great things in his future career.

### End of Report 3040587

Report is
-------------------------
Summary Report:
<student_id> Student 3040587 <student_id>
<salary> Student has a realistic salary expectation of 5000 tokens per annum <salary>
<degree_type> Student has a degree in Science and Technology <degree_type>
<mba_spec> Student has a specialization in Marketing and Finance <mba_spec>
<duration> Student has a degree duration of 3 years <duration>
<employability_perc> Student has a 85.98% employability percentage <employability_perc>
Important: verification by a human!
At this point, you ideally need a human to look at the synthetic data you have generated and fix any errors in the formatting or factual information, or at least be aware of the number of errors in the dataset.
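To speed up this human review, the tagged sections of a report can be pulled out programmatically so each fact can be eyeballed against the table. This helper is a hypothetical sketch, not part of the notebook; note that in the example report above, each tag appears in the same form on both sides of its section (e.g. `<salary> ... <salary>`), which is what the regex assumes.

```python
import re

def extract_sections(report: str, tags: list) -> dict:
    """Map each tag to the text between its opening and closing marker,
    or None if the tag is missing from the report."""
    sections = {}
    for tag in tags:
        m = re.search(rf"<{tag}>(.*?)<{tag}>", report, re.DOTALL)
        sections[tag] = m.group(1).strip() if m else None
    return sections
```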
Measuring Hallucinations
The usual method of measuring hallucinations uses the LLM-As-A-Judge methodology; DeepEval's hallucination metric is one example. This approach relies on a powerful LLM acting as the ground truth.
The section below shows a way to measure hallucinations using the ground truth data we already have (the tabular data). The methodology is to make use of the tags we added to the report and have Llama answer simple questions about the corresponding sections. Llama compares the answers with the ground truth and generates a list of boolean values, which is then used to measure the accuracy of the factual information in the report. If your report has a well-defined structure, using QA to measure hallucinations can be highly effective and cost-efficient.
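One way to frame each per-tag check is a small QA prompt that hands Llama the tagged section and the ground-truth value and asks for the `[True, None]` / `[False, ...]` verdict seen in the output below. The wording here is illustrative, not the notebook's exact prompt.

```python
# Hypothetical QA prompt for one tagged fact: the model reads the section,
# compares it with the ground-truth table value, and emits a boolean verdict.
def qa_check_prompt(tag: str, section: str, ground_truth) -> str:
    return (
        f"Question: what is the {tag} stated in this text?\n"
        f"Text: {section}\n"
        f"Ground truth: {ground_truth}\n"
        "Answer with [True, None] if they match, otherwise with "
        "[False, report says <report value> and ground truth says <ground truth value>]."
    )
```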
Checking accuracy of generated report in generated_data/data_2.json
<answer>
student_id: [True, None]
degree_type: [True, None]
salary: [True, None]
mba_spec: [True, None]
duration: [True, None]
employability_perc: [True, None]
</answer>
Checking accuracy of generated report in generated_data/data_3.json
<answer>
student_id: [True, None]
degree_type: [True, None]
salary: [True, None]
mba_spec: [True, None]
duration: [True, None]
employability_perc: [True, None]
</answer>
Checking accuracy of generated report in generated_data/data_4.json
<answer>
student_id: [True, None]
degree_type: [True, None]
salary: [True, None]
mba_spec: [True, None]
duration: [True, None]
employability_perc: [True, None]
</answer>
Checking accuracy of generated report in generated_data/data_5.json
<answer>
student_id: [True, None]
degree_type: [True, None]
salary: [True, None]
mba_spec: [True, None]
duration: [True, None]
employability_perc: [True, None]
</answer>
Checking accuracy of generated report in generated_data/data_6.json
<answer>
student_id: [False, report shows Student ID is not mentioned in the data and ground truth says 6180804]
degree_type: [True, None]
salary: [True, None]
mba_spec: [True, None]
duration: [True, None]
employability_perc: [True, None]
</answer>
Checking accuracy of generated report in generated_data/data_7.json
<answer>
student_id: [True, None]
degree_type: [False, report says Mkt&Fin and ground truth says Sci&Tech]
salary: [True, None]
mba_spec: [True, None]
duration: [True, None]
employability_perc: [True, None]
</answer>
Checking accuracy of generated report in generated_data/data_8.json
<answer>
student_id: [True, None]
degree_type: [True, None]
salary: [True, None]
mba_spec: [True, None]
duration: [True, None]
employability_perc: [True, None]
</answer>
Checking accuracy of generated report in generated_data/data_9.json
<answer>
student_id: [True, None]
degree_type: [True, None]
salary: [True, None]
mba_spec: [True, None]
duration: [True, None]
employability_perc: [True, None]
</answer>
Checking accuracy of generated report in generated_data/data_10.json
<answer>
student_id: [True, None]
degree_type: [True, None]
salary: [True, None]
mba_spec: [True, None]
duration: [True, None]
employability_perc: [True, None]
</answer>
Checking accuracy of generated report in generated_data/data_11.json
<answer>
student_id: [True, None]
degree_type: [False, report says Commerce and ground truth says Comm&Mgmt]
salary: [False, report says 500 and ground truth says NaN]
mba_spec: [True, None]
duration: [True, None]
employability_perc: [True, None]
</answer>
Accuracy of factual information generation is : 0.9333
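The final accuracy number follows directly from the boolean verdicts: 10 eval files with 6 tagged facts each gives 60 checks, of which 4 were flagged False in this run.

```python
# Reproducing the accuracy arithmetic from the run above:
# 60 per-fact checks, 4 of them False.
results = [False] * 4 + [True] * 56
accuracy = round(sum(results) / len(results), 4)
print(accuracy)  # 0.9333
```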
Conclusion & Next Steps
- Creating Evals for summarization is important
- Llama can be used to create evals given few samples of ground truth
- Using simple QA to measure hallucinations can be an effective strategy to be confident that important factual information is being verified