
Fine-tuning LLMs for Function Calling with xLAM Dataset

Authored by: Behrooz Azarkhalili

This notebook demonstrates how to fine-tune language models for function calling using the xLAM dataset from Salesforce and the QLoRA (Quantized Low-Rank Adaptation) technique. We'll work with popular models such as Llama 3, Qwen2, and Mistral.

What is Function Calling?

Function calling enables language models to interact with external tools and APIs by generating structured function invocations. Instead of just generating text, the model learns to call specific functions with the right parameters based on user requests.

What You'll Learn:

  • Data Processing: How to format the xLAM dataset for function calling training
  • Model Fine-tuning: Using QLoRA for memory-efficient training on consumer GPUs
  • Evaluation: Testing the fine-tuned models with example prompts
  • Multi-model Support: Working with different model architectures

Key Benefits:

  • Memory Efficient: QLoRA enables training on 16-24GB GPUs
  • Production Ready: Modular code with proper error handling
  • Flexible Architecture: Easy to adapt for different models and datasets
  • Universal Support: Works with Llama, Qwen, Mistral, Gemma, Phi, and more

Hardware Requirements:

  • GPU: 16GB+ VRAM (24GB recommended for larger models)
  • RAM: 32GB+ system memory
  • Storage: 50GB+ free space for models and datasets

Software Dependencies: The notebook will install required packages automatically, including:

  • transformers, peft, bitsandbytes, trl, datasets, accelerate

For detailed methodology and results, see: Function Calling: Fine-tuning Llama 3 and Qwen2 on xLAM

[1]

Basic Setup and Imports

Let's start with the essential imports and basic setup for our notebook.

[2]
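
The code for this cell isn't reproduced in the page; below is a minimal sketch, using only standard torch calls, that would print diagnostics like the output that follows:

```python
import torch

# Report the PyTorch build and, when a GPU is available, its name and memory.
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}")
    print(f"VRAM: {props.total_memory / 1024**3:.1f} GB")
```
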
PyTorch version: 2.8.0+cu128
CUDA available: True
GPU: NVIDIA H100 NVL
VRAM: 100.0 GB

Hugging Face Authentication Setup

Next, we'll set up authentication with HuggingFace Hub. This allows us to download models and datasets, and optionally upload our fine-tuned models.

[3]
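
A minimal sketch of the authentication cell, assuming the token is supplied via the HF_TOKEN environment variable (the variable name is a common convention, not confirmed by the notebook):

```python
import os
from huggingface_hub import login

# Use a token from the environment when present; otherwise prompt interactively.
token = os.environ.get("HF_TOKEN")
if token:
    login(token=token)
else:
    login()
print("✅ Successfully authenticated with HuggingFace!")
```
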
✅ Successfully authenticated with HuggingFace!

Model Configuration Classes

We'll create two configuration classes to organize our settings:

  1. ModelConfig: Stores model-specific settings like tokenizer configuration
  2. TrainingConfig: Stores training parameters like learning rate and batch size
[4]
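
A sketch of how the two classes might be laid out as dataclasses; the exact field names and default values are assumptions, chosen to match the configuration output shown later in the notebook:

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    """Model-specific settings (field names are illustrative)."""
    model_name: str
    pad_token: str
    pad_token_id: int
    eos_token: str
    eos_token_id: int
    output_dir: str

@dataclass
class TrainingConfig:
    """Training hyperparameters (values are typical QLoRA starting points)."""
    learning_rate: float = 2e-4
    per_device_train_batch_size: int = 4
    gradient_accumulation_steps: int = 4
    num_train_epochs: int = 1
    max_seq_length: int = 2048
```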

Automatic Model Configuration

This function automatically detects the model's tokenizer settings and creates a proper configuration. It handles different model architectures (Llama, Qwen, Mistral, etc.) and their specific token requirements.

[5]
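
A hedged sketch of what such a function could look like; auto_configure_model() is named later in the notebook, but the body below is a reconstruction from the printed output:

```python
from transformers import AutoConfig, AutoTokenizer

def auto_configure_model(model_name: str) -> ModelConfig:
    """Inspect a model's config and tokenizer and build a ModelConfig."""
    print(f"🔍 Loading model configuration: {model_name}")
    hf_config = AutoConfig.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    print(f"📊 Model: {hf_config.model_type}, vocab_size: {hf_config.vocab_size:,}")

    # Many instruct models (e.g. Llama 3) ship without a pad token; reuse EOS.
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    # Derive an output directory such as ./Meta_Llama_3_8B_Instruct_xLAM
    output_dir = "./" + model_name.split("/")[-1].replace("-", "_").replace(".", "_") + "_xLAM"
    return ModelConfig(
        model_name=model_name,
        pad_token=tokenizer.pad_token,
        pad_token_id=tokenizer.pad_token_id,
        eos_token=tokenizer.eos_token,
        eos_token_id=tokenizer.eos_token_id,
        output_dir=output_dir,
    )
```
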
[6]
✅ Configuration system ready!
💡 Supports Llama, Qwen, Mistral, Gemma, Phi, and more

Hardware Detection and Setup

Let's detect our hardware capabilities and configure optimal settings. We'll check for bfloat16 support and set up the best attention mechanism for our GPU.

[ ]
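
A sketch of the hardware detection logic described above; detect_hardware is a hypothetical name, and the flash-attn fallback behavior is an assumption:

```python
import torch

def detect_hardware() -> tuple[torch.dtype, str]:
    """Pick a compute dtype and attention backend based on the GPU."""
    if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
        compute_dtype = torch.bfloat16
        # flash_attention_2 requires the flash-attn package and an
        # Ampere-or-newer GPU; "sdpa" is a safe fallback otherwise.
        attn_implementation = "flash_attention_2"
    else:
        compute_dtype = torch.float16
        attn_implementation = "sdpa"
    return compute_dtype, attn_implementation

compute_dtype, attn_implementation = detect_hardware()
```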

Tokenizer Setup Function

Now let's create a function to set up our tokenizer with the right configuration from our model settings.
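
A sketch of such a setup function; setup_tokenizer is a hypothetical name, and right-padding is a common choice for causal-LM fine-tuning rather than something the notebook confirms:

```python
from transformers import AutoTokenizer, PreTrainedTokenizerBase

def setup_tokenizer(config: ModelConfig) -> PreTrainedTokenizerBase:
    """Load the tokenizer and apply the pad/EOS settings from ModelConfig."""
    tokenizer = AutoTokenizer.from_pretrained(config.model_name)
    tokenizer.pad_token = config.pad_token
    # Right padding avoids masking issues during causal-LM training.
    tokenizer.padding_side = "right"
    return tokenizer
```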

[8]
📊 Hardware Configuration Complete:
   • Compute dtype: torch.bfloat16
   • Attention implementation: flash_attention_2
   • Device: NVIDIA H100 NVL

Dataset Processing

Now we'll work with the xLAM dataset from Salesforce. This dataset contains about 60,000 examples of function calling conversations that we'll use to train our model.

Key Functions:

  • process_xlam_sample(): Converts a single dataset example into the training format with special tags (<user>, <tools>, <calls>) and an EOS token (see the sketch after this list)
  • load_and_process_xlam_dataset(): Loads the complete xLAM dataset (60K samples) from Hugging Face and processes all samples with multiprocessing for efficiency
  • preview_dataset_sample(): Displays a formatted preview of a processed dataset sample, along with basic statistics, for inspection
[9]
[10]
[11]
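
A sketch of the per-sample formatter named above, assuming the xLAM columns are query, tools, and answers:

```python
def process_xlam_sample(sample: dict, eos_token: str) -> dict:
    """Wrap one xLAM example in the <user>/<tools>/<calls> tag format."""
    text = (
        f"<user>{sample['query']}</user>\n\n"
        f"<tools>{sample['tools']}</tools>\n\n"
        f"<calls>{sample['answers']}</calls>{eos_token}"
    )
    return {"text": text}
```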

Loading and Processing the Dataset

Now let's add functions to load the xLAM dataset and process it into the format our model needs for training.

[12]
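
A sketch of the loader, assuming the Salesforce/xlam-function-calling-60k dataset ID and the formatter from the previous step:

```python
from datasets import load_dataset

def load_and_process_xlam_dataset(eos_token: str, num_proc: int = 4):
    """Load the xLAM dataset and format every sample for training."""
    dataset = load_dataset("Salesforce/xlam-function-calling-60k", split="train")
    # Multiprocessing speeds up formatting of the ~60K samples.
    return dataset.map(
        lambda sample: process_xlam_sample(sample, eos_token),
        num_proc=num_proc,
        remove_columns=dataset.column_names,
    )

dataset = load_and_process_xlam_dataset(eos_token="<|eot_id|>")  # EOS is model-specific
```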

QLoRA Training Setup

QLoRA (Quantized Low-Rank Adaptation) allows us to fine-tune large language models efficiently. It uses 4-bit quantization to reduce memory usage while maintaining training quality.

[13]
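
A sketch of the standard QLoRA quantization setup with bitsandbytes; compute_dtype is assumed to come from the hardware detection step:

```python
from transformers import BitsAndBytesConfig

# 4-bit NF4 quantization with double quantization: the usual QLoRA recipe.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=True,
)
```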

LoRA Configuration

LoRA (Low-Rank Adaptation) is the key technique that makes efficient fine-tuning possible. Instead of updating all model parameters, LoRA adds small trainable matrices to specific layers while keeping the base model frozen.
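
A sketch of a typical LoRA configuration with peft; the rank, alpha, and target modules below are common starting points, not values confirmed by the notebook:

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                 # rank of the trainable low-rank update matrices
    lora_alpha=32,        # scaling factor applied to the updates
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    # Attention projections are the usual targets; MLP layers can be added too.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
```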

Training Execution

Now we'll create the main training function that puts everything together. This function configures the training arguments and executes the fine-tuning process using TRL's SFTTrainer.

[14]
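
A sketch of the training step with TRL, assuming a recent trl release where SFTConfig carries the training arguments; the hyperparameters are illustrative:

```python
from transformers import AutoModelForCausalLM
from trl import SFTConfig, SFTTrainer

# Load the 4-bit quantized base model (bnb_config and attn_implementation
# come from the earlier setup cells).
model = AutoModelForCausalLM.from_pretrained(
    model_config.model_name,
    quantization_config=bnb_config,
    attn_implementation=attn_implementation,
    device_map="auto",
)

training_args = SFTConfig(
    output_dir=model_config.output_dir,
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    logging_steps=10,
    bf16=True,
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    peft_config=lora_config,  # SFTTrainer attaches the LoRA adapters itself
)
trainer.train()
trainer.save_model(model_config.output_dir)
```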

🎯 Universal Model Selection

Choose any model for fine-tuning! This notebook supports a wide range of popular models. Simply uncomment the model you want to use or specify your own.

📋 Quick Model Selection

Uncomment one of these popular models or specify your own:

Why Llama-3-8B-Instruct as the default?

  • Proven Performance: Excellent function calling capabilities and instruction following
  • Optimal Size: 8B parameters provide a great balance between performance and resource usage
[15]
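
A sketch of the selection cell; the alternative repo IDs below are examples of models this pipeline targets:

```python
# Uncomment exactly one model, or set your own Hugging Face repo ID:
MODEL_NAME = "meta-llama/Meta-Llama-3-8B-Instruct"
# MODEL_NAME = "Qwen/Qwen2-7B-Instruct"
# MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.3"
# MODEL_NAME = "google/gemma-2-9b-it"
# MODEL_NAME = "microsoft/Phi-3-mini-4k-instruct"

print(f"🎯 Selected Model: {MODEL_NAME}\n")
model_config = auto_configure_model(MODEL_NAME)
```
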
🎯 Selected Model: meta-llama/Meta-Llama-3-8B-Instruct

🔧 Auto-configuring everything for meta-llama/Meta-Llama-3-8B-Instruct...
🔍 Loading model configuration: meta-llama/Meta-Llama-3-8B-Instruct
📊 Model: llama, vocab_size: 128,256
✅ Configured - pad: '<|eot_id|>' (ID: 128009), eos: '<|eot_id|>' (ID: 128009)

🎉 Ready to fine-tune! Everything configured automatically:
   ✅ Model type: llama
   ✅ Vocabulary: 128,256 tokens
   ✅ Pad token: '<|eot_id|>' (ID: 128009)
   ✅ Output dir: ./Meta_Llama_3_8B_Instruct_xLAM

🚀 Configuration complete for meta-llama/Meta-Llama-3-8B-Instruct!
[ ]

Model Loading for Inference

After training is complete, we need to load the trained model for inference. This function loads the base model with quantization and applies the trained LoRA adapters.

[19]
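
A sketch of the inference loader; load_trained_model is a hypothetical name, and bnb_config is reused from the QLoRA setup:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

def load_trained_model(base_model_name: str, adapter_dir: str):
    """Reload the quantized base model and attach the trained LoRA adapters."""
    base_model = AutoModelForCausalLM.from_pretrained(
        base_model_name,
        quantization_config=bnb_config,
        device_map="auto",
    )
    model = PeftModel.from_pretrained(base_model, adapter_dir)
    model.eval()  # disable dropout for inference
    return model
```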

Text Generation for Function Calls

Now let's create the function that generates responses from our fine-tuned model. This handles tokenization, generation parameters, and decoding.

[20]
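
A sketch of the generation helper; generate_response is a hypothetical name, and greedy decoding is an assumption:

```python
import torch

def generate_response(model, tokenizer, prompt: str, max_new_tokens: int = 256) -> str:
    """Tokenize a prompt, generate deterministically, and decode the response."""
    print("🎯 Generating response for prompt...")
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,
            pad_token_id=tokenizer.pad_token_id,
        )
    new_tokens = outputs.shape[1] - inputs["input_ids"].shape[1]
    print(f"✅ Generation completed!\n📊 Generated {new_tokens} new tokens")
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```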

Testing Function Calling Capabilities

This function provides a comprehensive test suite to evaluate our fine-tuned model with different types of function calling scenarios.

[21]
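
A sketch of the test harness, reconstructed from the output below; note that each prompt stops at the opening <tools> tag, so the fine-tuned model must produce both the tool schema and the <calls> block itself:

```python
def test_function_calling(model, tokenizer):
    """Run a small suite of prompts covering different function-calling scenarios."""
    test_cases = {
        "Mathematical Function": "Check if the numbers 8 and 1233 are powers of two.",
        "Weather Query": "What's the weather like in New York today?",
        "Data Processing": "Calculate the average of these numbers: 10, 20, 30, 40, 50",
    }
    print("🧪 Testing function calling capabilities...")
    for i, (name, query) in enumerate(test_cases.items(), start=1):
        print("\n" + "=" * 60 + f"\nTest Case {i}: {name}\n" + "=" * 60)
        # The prompt ends at <tools>; the model completes schema and calls.
        prompt = f"<user>{query}</user>\n\n<tools>"
        response = generate_response(model, tokenizer, prompt)
        print("\n🔍 Complete Response:\n" + "-" * 40)
        print(response)
        print("-" * 40)
    print("\n✅ All test cases completed!")
```
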
[ ]
[23]
🧪 Testing function calling capabilities...

============================================================
Test Case 1: Mathematical Function
============================================================
🎯 Generating response for prompt...
📝 Input: <user>Check if the numbers 8 and 1233 are powers of two.</user>

<tools>
✅ Generation completed!
📊 Generated 90 new tokens

🔍 Complete Response:
----------------------------------------
<user>Check if the numbers 8 and 1233 are powers of two.</user>

<tools>{'name': 'is_power_of_two', 'description': 'Checks if a number is a power of two.', 'parameters': {'num': {'description': 'The number to check.', 'type': 'int'}}}</tools>

<calls>{'name': 'is_power_of_two', 'arguments': {'num': 8}}
{'name': 'is_power_of_two', 'arguments': {'num': 1233}}</calls>
----------------------------------------

============================================================
Test Case 2: Weather Query
============================================================
🎯 Generating response for prompt...
📝 Input: <user>What's the weather like in New York today?</user>

<tools>
✅ Generation completed!
📊 Generated 105 new tokens

🔍 Complete Response:
----------------------------------------
<user>What's the weather like in New York today?</user>

<tools>{'name':'realtime_weather_api', 'description': 'Fetches current weather information based on the provided query parameter.', 'parameters': {'q': {'description': 'Query parameter used to specify the location for which weather data is required. It can be in various formats such as:', 'type':'str', 'default': '53.1,-0.13'}}}</tools>

<calls>{'name':'realtime_weather_api', 'arguments': {'q': 'New York'}}</calls>
----------------------------------------

============================================================
Test Case 3: Data Processing
============================================================
🎯 Generating response for prompt...
📝 Input: <user>Calculate the average of these numbers: 10, 20, 30, 40, 50</user>

<tools>
✅ Generation completed!
📊 Generated 81 new tokens

🔍 Complete Response:
----------------------------------------
<user>Calculate the average of these numbers: 10, 20, 30, 40, 50</user>

<tools>{'name': 'average', 'description': 'Calculates the arithmetic mean of a list of numbers.', 'parameters': {'numbers': {'description': 'The list of numbers.', 'type': 'List[float]'}}}</tools>

<calls>{'name': 'average', 'arguments': {'numbers': [10, 20, 30, 40, 50]}}</calls>
----------------------------------------

✅ All test cases completed!

🎉 Conclusion and Next Steps


📊 Summary

This notebook demonstrated a complete, production-ready, universal pipeline for fine-tuning language models for function calling capabilities using:

  • 🎯 Universal Model Support: Works with any model - just change the MODEL_NAME variable
  • 🔧 Intelligent Configuration: Automatic token detection using auto_configure_model()
  • ⚡ QLoRA Efficiency: Memory-efficient training on consumer GPUs (16-24GB)
  • 📋 Comprehensive Testing: Automated evaluation and interactive testing capabilities

🚀 Key Improvements Made

Universal Compatibility

  • ✅ Multi-Model Support: Works with Llama, Qwen, Mistral, Gemma, Phi, DeepSeek, Yi, and more
  • ✅ Smart Token Detection: Automatically finds pad/EOS tokens from any model's tokenizer
  • ✅ Error Prevention: Validates configurations and provides helpful error messages
  • ✅ Flexible Architecture: Easy to add new models without code changes

Code Quality

  • ✅ Type Hints: Full type annotations for better IDE support and error catching
  • ✅ Docstrings: Comprehensive documentation for all functions
  • ✅ Error Handling: Robust error handling with informative messages
  • ✅ Modular Design: Clean separation of concerns and reusable components

User Experience

  • ✅ One-Line Model Selection: Simply change the MODEL_NAME variable
  • ✅ Automatic Configuration: Everything extracted from transformers automatically
  • ✅ Clear Progress Indicators: Emojis and detailed logging throughout
  • ✅ Production Ready: Code suitable for research and deployment

🔄 Next Steps and Extensions

Model Improvements

  1. Try Different Models: Simply change the MODEL_NAME variable and re-run
  2. Hyperparameter Tuning: Experiment with different LoRA ranks, learning rates
  3. Extended Training: Try multi-epoch training for better convergence

Evaluation Enhancements

  1. Quantitative Metrics: Add BLEU, ROUGE, or custom function calling accuracy
  2. Benchmark Datasets: Test on additional function calling benchmarks
  3. Multi-Model Comparison: Compare performance across different model families

Deployment Options

  1. Model Serving: Deploy with FastAPI, TensorRT, or vLLM
  2. Integration: Connect with real APIs and function execution environments
  3. Optimization: Implement model quantization and pruning for production

Additional Features

  1. Multi-turn Conversations: Extend to handle conversation context
  2. Tool Selection: Improve tool selection and reasoning capabilities
  3. Error Recovery: Add error handling and recovery mechanisms

📚 Resources and References

๐ŸŽ–๏ธ Achievement Unlocked

๐Ÿ† Universal Function Calling Fine-tuning Master!

You now have a production-ready system that can fine-tune virtually any open-source language model for function calling with just a single line change!


Happy Fine-tuning! 🚀 Try different models, share your results, and contribute back to the community!