
Fine-tuning LLMs for Function Calling with xLAM Dataset

Authored by: Behrooz Azarkhalili

This notebook demonstrates how to fine-tune language models for function calling using the xLAM dataset from Salesforce and the QLoRA (Quantized Low-Rank Adaptation) technique. We'll work with popular models such as Llama 3, Qwen2, and Mistral.

What is Function Calling?

Function calling enables language models to interact with external tools and APIs by generating structured function invocations. Instead of just generating text, the model learns to call specific functions with the right parameters based on user requests.

What You'll Learn:

  • Data Processing: How to format the xLAM dataset for function calling training
  • Model Fine-tuning: Using QLoRA for memory-efficient training on consumer GPUs
  • Evaluation: Testing the fine-tuned models with example prompts
  • Multi-model Support: Working with different model architectures

Key Benefits:

  • Memory Efficient: QLoRA enables training on 16-24GB GPUs
  • Production Ready: Modular code with proper error handling
  • Flexible Architecture: Easy to adapt for different models and datasets
  • Universal Support: Works with Llama, Qwen, Mistral, Gemma, Phi, and more

Hardware Requirements:

  • GPU: 16GB+ VRAM (24GB recommended for larger models)
  • RAM: 32GB+ system memory
  • Storage: 50GB+ free space for models and datasets

Software Dependencies: The notebook will install required packages automatically, including:

  • transformers, peft, bitsandbytes, trl, datasets, accelerate

For detailed methodology and results, see: Function Calling: Fine-tuning Llama 3 and Qwen2 on xLAM

[1]

Basic Setup and Imports

Let's start with the essential imports and basic setup for our notebook.

[2]
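
The code for this cell isn't reproduced in the page; below is a minimal sketch, using only standard torch calls, that would print diagnostics like the output that follows:

```python
import torch

# Report the PyTorch build and, when a GPU is available, its name and memory.
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}")
    print(f"VRAM: {props.total_memory / 1024**3:.1f} GB")
```
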
PyTorch version: 2.8.0+cu128
CUDA available: True
GPU: NVIDIA H100 NVL
VRAM: 100.0 GB

Hugging Face Authentication Setup

Next, we'll set up authentication with HuggingFace Hub. This allows us to download models and datasets, and optionally upload our fine-tuned models.

[3]
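
A minimal sketch of the authentication cell, assuming the token is supplied via the HF_TOKEN environment variable (the variable name is a common convention, not confirmed by the notebook):

```python
import os
from huggingface_hub import login

# Use a token from the environment when present; otherwise prompt interactively.
token = os.environ.get("HF_TOKEN")
if token:
    login(token=token)
else:
    login()
print("✅ Successfully authenticated with HuggingFace!")
```
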
✅ Successfully authenticated with HuggingFace!

Model Configuration Classes

We'll create two configuration classes to organize our settings:

  1. ModelConfig: Stores model-specific settings like tokenizer configuration
  2. TrainingConfig: Stores training parameters like learning rate and batch size
[4]
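
A sketch of how the two classes might be laid out as dataclasses; the exact field names and default values are assumptions, chosen to match the configuration output shown later in the notebook:

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    """Model-specific settings (field names are illustrative)."""
    model_name: str
    pad_token: str
    pad_token_id: int
    eos_token: str
    eos_token_id: int
    output_dir: str

@dataclass
class TrainingConfig:
    """Training hyperparameters (values are typical QLoRA starting points)."""
    learning_rate: float = 2e-4
    per_device_train_batch_size: int = 4
    gradient_accumulation_steps: int = 4
    num_train_epochs: int = 1
    max_seq_length: int = 2048
```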

Automatic Model Configuration

This function automatically detects the model's tokenizer settings and creates a proper configuration. It handles different model architectures (Llama, Qwen, Mistral, etc.) and their specific token requirements.

[5]
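
A hedged sketch of what such a function could look like; auto_configure_model() is named later in the notebook, but the body below is a reconstruction from the printed output:

```python
from transformers import AutoConfig, AutoTokenizer

def auto_configure_model(model_name: str) -> ModelConfig:
    """Inspect a model's config and tokenizer and build a ModelConfig."""
    print(f"🔍 Loading model configuration: {model_name}")
    hf_config = AutoConfig.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    print(f"📊 Model: {hf_config.model_type}, vocab_size: {hf_config.vocab_size:,}")

    # Many instruct models (e.g. Llama 3) ship without a pad token; reuse EOS.
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    # Derive an output directory such as ./Meta_Llama_3_8B_Instruct_xLAM
    output_dir = "./" + model_name.split("/")[-1].replace("-", "_").replace(".", "_") + "_xLAM"
    return ModelConfig(
        model_name=model_name,
        pad_token=tokenizer.pad_token,
        pad_token_id=tokenizer.pad_token_id,
        eos_token=tokenizer.eos_token,
        eos_token_id=tokenizer.eos_token_id,
        output_dir=output_dir,
    )
```
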
[6]
✅ Configuration system ready!
💡 Supports Llama, Qwen, Mistral, Gemma, Phi, and more

Hardware Detection and Setup

Let's detect our hardware capabilities and configure optimal settings. We'll check for bfloat16 support and set up the best attention mechanism for our GPU.

[ ]
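
A sketch of the hardware detection logic described above; detect_hardware is a hypothetical name, and the flash-attn fallback behavior is an assumption:

```python
import torch

def detect_hardware() -> tuple[torch.dtype, str]:
    """Pick a compute dtype and attention backend based on the GPU."""
    if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
        compute_dtype = torch.bfloat16
        # flash_attention_2 requires the flash-attn package and an
        # Ampere-or-newer GPU; "sdpa" is a safe fallback otherwise.
        attn_implementation = "flash_attention_2"
    else:
        compute_dtype = torch.float16
        attn_implementation = "sdpa"
    return compute_dtype, attn_implementation

compute_dtype, attn_implementation = detect_hardware()
```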

Tokenizer Setup Function

Now let's create a function to set up our tokenizer with the right configuration from our model settings.
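
A sketch of such a setup function; setup_tokenizer is a hypothetical name, and right-padding is a common choice for causal-LM fine-tuning rather than something the notebook confirms:

```python
from transformers import AutoTokenizer, PreTrainedTokenizerBase

def setup_tokenizer(config: ModelConfig) -> PreTrainedTokenizerBase:
    """Load the tokenizer and apply the pad/EOS settings from ModelConfig."""
    tokenizer = AutoTokenizer.from_pretrained(config.model_name)
    tokenizer.pad_token = config.pad_token
    # Right padding avoids masking issues during causal-LM training.
    tokenizer.padding_side = "right"
    return tokenizer
```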

[8]
📊 Hardware Configuration Complete:
   • Compute dtype: torch.bfloat16
   • Attention implementation: flash_attention_2
   • Device: NVIDIA H100 NVL

Dataset Processing

Now we'll work with the xLAM dataset from Salesforce. This dataset contains about 60,000 examples of function calling conversations that we'll use to train our model.

Key Functions:

  • process_xlam_sample(): Converts a single dataset example into the training format with special tags (<user>, <tools>, <calls>) and an EOS token (see the sketch after this list)
  • load_and_process_xlam_dataset(): Loads the complete xLAM dataset (60K samples) from Hugging Face and processes all samples with multiprocessing for efficiency
  • preview_dataset_sample(): Displays a formatted preview of a processed dataset sample, along with basic statistics, for inspection
[9]
[10]
[11]
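
A sketch of the per-sample formatter named above, assuming the xLAM columns are query, tools, and answers:

```python
def process_xlam_sample(sample: dict, eos_token: str) -> dict:
    """Wrap one xLAM example in the <user>/<tools>/<calls> tag format."""
    text = (
        f"<user>{sample['query']}</user>\n\n"
        f"<tools>{sample['tools']}</tools>\n\n"
        f"<calls>{sample['answers']}</calls>{eos_token}"
    )
    return {"text": text}
```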

Loading and Processing the Dataset

Now let's add functions to load the xLAM dataset and process it into the format our model needs for training.

[12]
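
A sketch of the loader, assuming the Salesforce/xlam-function-calling-60k dataset ID and the formatter from the previous step:

```python
from datasets import load_dataset

def load_and_process_xlam_dataset(eos_token: str, num_proc: int = 4):
    """Load the xLAM dataset and format every sample for training."""
    dataset = load_dataset("Salesforce/xlam-function-calling-60k", split="train")
    # Multiprocessing speeds up formatting of the ~60K samples.
    return dataset.map(
        lambda sample: process_xlam_sample(sample, eos_token),
        num_proc=num_proc,
        remove_columns=dataset.column_names,
    )

dataset = load_and_process_xlam_dataset(eos_token="<|eot_id|>")  # EOS is model-specific
```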

QLoRA Training Setup

QLoRA (Quantized Low-Rank Adaptation) allows us to fine-tune large language models efficiently. It uses 4-bit quantization to reduce memory usage while maintaining training quality.

[13]
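
A sketch of the standard QLoRA quantization setup with bitsandbytes; compute_dtype is assumed to come from the hardware detection step:

```python
from transformers import BitsAndBytesConfig

# 4-bit NF4 quantization with double quantization: the usual QLoRA recipe.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=True,
)
```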

LoRA Configuration

LoRA (Low-Rank Adaptation) is the key technique that makes efficient fine-tuning possible. Instead of updating all model parameters, LoRA adds small trainable matrices to specific layers while keeping the base model frozen.
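
A sketch of a typical LoRA configuration with peft; the rank, alpha, and target modules below are common starting points, not values confirmed by the notebook:

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                 # rank of the trainable low-rank update matrices
    lora_alpha=32,        # scaling factor applied to the updates
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    # Attention projections are the usual targets; MLP layers can be added too.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
```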

Training Execution

Now we'll create the main training function that puts everything together. This function configures the training arguments and executes the fine-tuning process using TRL's SFTTrainer.

[14]
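
A sketch of the training step with TRL, assuming a recent trl release where SFTConfig carries the training arguments; the hyperparameters are illustrative:

```python
from transformers import AutoModelForCausalLM
from trl import SFTConfig, SFTTrainer

# Load the 4-bit quantized base model (bnb_config and attn_implementation
# come from the earlier setup cells).
model = AutoModelForCausalLM.from_pretrained(
    model_config.model_name,
    quantization_config=bnb_config,
    attn_implementation=attn_implementation,
    device_map="auto",
)

training_args = SFTConfig(
    output_dir=model_config.output_dir,
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    logging_steps=10,
    bf16=True,
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    peft_config=lora_config,  # SFTTrainer attaches the LoRA adapters itself
)
trainer.train()
trainer.save_model(model_config.output_dir)
```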

🎯 Universal Model Selection

Choose any model for fine-tuning! This notebook supports a wide range of popular models. Simply uncomment the model you want to use or specify your own.

📋 Quick Model Selection

Uncomment one of these popular models or specify your own:

Why Llama-3-8B-Instruct as the default?

  • Proven Performance: Excellent function calling capabilities and instruction following
  • Optimal Size: 8B parameters provide a great balance between performance and resource usage
[15]
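
A sketch of the selection cell; the alternative repo IDs below are examples of models this pipeline targets:

```python
# Uncomment exactly one model, or set your own Hugging Face repo ID:
MODEL_NAME = "meta-llama/Meta-Llama-3-8B-Instruct"
# MODEL_NAME = "Qwen/Qwen2-7B-Instruct"
# MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.3"
# MODEL_NAME = "google/gemma-2-9b-it"
# MODEL_NAME = "microsoft/Phi-3-mini-4k-instruct"

print(f"🎯 Selected Model: {MODEL_NAME}\n")
model_config = auto_configure_model(MODEL_NAME)
```
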
🎯 Selected Model: meta-llama/Meta-Llama-3-8B-Instruct

🔧 Auto-configuring everything for meta-llama/Meta-Llama-3-8B-Instruct...
🔍 Loading model configuration: meta-llama/Meta-Llama-3-8B-Instruct
📊 Model: llama, vocab_size: 128,256
✅ Configured - pad: '<|eot_id|>' (ID: 128009), eos: '<|eot_id|>' (ID: 128009)

🎉 Ready to fine-tune! Everything configured automatically:
   ✅ Model type: llama
   ✅ Vocabulary: 128,256 tokens
   ✅ Pad token: '<|eot_id|>' (ID: 128009)
   ✅ Output dir: ./Meta_Llama_3_8B_Instruct_xLAM

🚀 Configuration complete for meta-llama/Meta-Llama-3-8B-Instruct!
[ ]

Model Loading for Inference

After training is complete, we need to load the trained model for inference. This function loads the base model with quantization and applies the trained LoRA adapters.

[19]
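
A sketch of the inference loader; load_trained_model is a hypothetical name, and bnb_config is reused from the QLoRA setup:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

def load_trained_model(base_model_name: str, adapter_dir: str):
    """Reload the quantized base model and attach the trained LoRA adapters."""
    base_model = AutoModelForCausalLM.from_pretrained(
        base_model_name,
        quantization_config=bnb_config,
        device_map="auto",
    )
    model = PeftModel.from_pretrained(base_model, adapter_dir)
    model.eval()  # disable dropout for inference
    return model
```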

Text Generation for Function Calls

Now let's create the function that generates responses from our fine-tuned model. This handles tokenization, generation parameters, and decoding.

[20]
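
A sketch of the generation helper; generate_response is a hypothetical name, and greedy decoding is an assumption:

```python
import torch

def generate_response(model, tokenizer, prompt: str, max_new_tokens: int = 256) -> str:
    """Tokenize a prompt, generate deterministically, and decode the response."""
    print("🎯 Generating response for prompt...")
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,
            pad_token_id=tokenizer.pad_token_id,
        )
    new_tokens = outputs.shape[1] - inputs["input_ids"].shape[1]
    print(f"✅ Generation completed!\n📊 Generated {new_tokens} new tokens")
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```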

Testing Function Calling Capabilities

This function provides a comprehensive test suite to evaluate our fine-tuned model with different types of function calling scenarios.

[21]
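
A sketch of the test harness, reconstructed from the output below; note that each prompt stops at the opening <tools> tag, so the fine-tuned model must produce both the tool schema and the <calls> block itself:

```python
def test_function_calling(model, tokenizer):
    """Run a small suite of prompts covering different function-calling scenarios."""
    test_cases = {
        "Mathematical Function": "Check if the numbers 8 and 1233 are powers of two.",
        "Weather Query": "What's the weather like in New York today?",
        "Data Processing": "Calculate the average of these numbers: 10, 20, 30, 40, 50",
    }
    print("🧪 Testing function calling capabilities...")
    for i, (name, query) in enumerate(test_cases.items(), start=1):
        print("\n" + "=" * 60 + f"\nTest Case {i}: {name}\n" + "=" * 60)
        # The prompt ends at <tools>; the model completes schema and calls.
        prompt = f"<user>{query}</user>\n\n<tools>"
        response = generate_response(model, tokenizer, prompt)
        print("\n🔍 Complete Response:\n" + "-" * 40)
        print(response)
        print("-" * 40)
    print("\n✅ All test cases completed!")
```
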
[ ]
[23]
🧪 Testing function calling capabilities...

============================================================
Test Case 1: Mathematical Function
============================================================
🎯 Generating response for prompt...
📝 Input: <user>Check if the numbers 8 and 1233 are powers of two.</user>

<tools>
✅ Generation completed!
📊 Generated 90 new tokens

🔍 Complete Response:
----------------------------------------
<user>Check if the numbers 8 and 1233 are powers of two.</user>

<tools>{'name': 'is_power_of_two', 'description': 'Checks if a number is a power of two.', 'parameters': {'num': {'description': 'The number to check.', 'type': 'int'}}}</tools>

<calls>{'name': 'is_power_of_two', 'arguments': {'num': 8}}
{'name': 'is_power_of_two', 'arguments': {'num': 1233}}</calls>
----------------------------------------

============================================================
Test Case 2: Weather Query
============================================================
🎯 Generating response for prompt...
📝 Input: <user>What's the weather like in New York today?</user>

<tools>
✅ Generation completed!
📊 Generated 105 new tokens

🔍 Complete Response:
----------------------------------------
<user>What's the weather like in New York today?</user>

<tools>{'name':'realtime_weather_api', 'description': 'Fetches current weather information based on the provided query parameter.', 'parameters': {'q': {'description': 'Query parameter used to specify the location for which weather data is required. It can be in various formats such as:', 'type':'str', 'default': '53.1,-0.13'}}}</tools>

<calls>{'name':'realtime_weather_api', 'arguments': {'q': 'New York'}}</calls>
----------------------------------------

============================================================
Test Case 3: Data Processing
============================================================
🎯 Generating response for prompt...
📝 Input: <user>Calculate the average of these numbers: 10, 20, 30, 40, 50</user>

<tools>
✅ Generation completed!
📊 Generated 81 new tokens

🔍 Complete Response:
----------------------------------------
<user>Calculate the average of these numbers: 10, 20, 30, 40, 50</user>

<tools>{'name': 'average', 'description': 'Calculates the arithmetic mean of a list of numbers.', 'parameters': {'numbers': {'description': 'The list of numbers.', 'type': 'List[float]'}}}</tools>

<calls>{'name': 'average', 'arguments': {'numbers': [10, 20, 30, 40, 50]}}</calls>
----------------------------------------

✅ All test cases completed!

🎉 Conclusion and Next Steps


📊 Summary

This notebook demonstrated a complete, production-ready, universal pipeline for fine-tuning language models for function calling capabilities using:

  • 🎯 Universal Model Support: Works with any model - just change the MODEL_NAME variable
  • 🔧 Intelligent Configuration: Automatic token detection using auto_configure_model()
  • ⚡ QLoRA Efficiency: Memory-efficient training on consumer GPUs (16-24GB)
  • 📋 Comprehensive Testing: Automated evaluation and interactive testing capabilities

🚀 Key Improvements Made

Universal Compatibility

  • ✅ Multi-Model Support: Works with Llama, Qwen, Mistral, Gemma, Phi, DeepSeek, Yi, and more
  • ✅ Smart Token Detection: Automatically finds pad/EOS tokens from any model's tokenizer
  • ✅ Error Prevention: Validates configurations and provides helpful error messages
  • ✅ Flexible Architecture: Easy to add new models without code changes

Code Quality

  • ✅ Type Hints: Full type annotations for better IDE support and error catching
  • ✅ Docstrings: Comprehensive documentation for all functions
  • ✅ Error Handling: Robust error handling with informative messages
  • ✅ Modular Design: Clean separation of concerns and reusable components

User Experience

  • ✅ One-Line Model Selection: Simply change the MODEL_NAME variable
  • ✅ Automatic Configuration: Everything extracted from transformers automatically
  • ✅ Clear Progress Indicators: Emojis and detailed logging throughout
  • ✅ Production Ready: Code suitable for research and deployment

🔄 Next Steps and Extensions

Model Improvements

  1. Try Different Models: Simply change the MODEL_NAME variable and re-run
  2. Hyperparameter Tuning: Experiment with different LoRA ranks, learning rates
  3. Extended Training: Try multi-epoch training for better convergence

Evaluation Enhancements

  1. Quantitative Metrics: Add BLEU, ROUGE, or custom function calling accuracy
  2. Benchmark Datasets: Test on additional function calling benchmarks
  3. Multi-Model Comparison: Compare performance across different model families

Deployment Options

  1. Model Serving: Deploy with FastAPI, TensorRT, or vLLM
  2. Integration: Connect with real APIs and function execution environments
  3. Optimization: Implement model quantization and pruning for production

Additional Features

  1. Multi-turn Conversations: Extend to handle conversation context
  2. Tool Selection: Improve tool selection and reasoning capabilities
  3. Error Recovery: Add error handling and recovery mechanisms

📚 Resources and References

๐ŸŽ–๏ธ Achievement Unlocked

๐Ÿ† Universal Function Calling Fine-tuning Master!

You now have a production-ready system that can fine-tune virtually any open-source language model for function calling with just a single line change!


Happy Fine-tuning! 🚀 Try different models, share your results, and contribute back to the community!