Phi-3 Inference and Fine-Tuning
This Phi-3 Fine Tuning Notebook provides instructions on how to:
- Fine-tune the Phi-3 mini model using QLoRA and LoRA techniques
- Quantize the Phi-3 mini model using bitsandbytes and GPTQ for efficient memory usage
- Run inference with the Phi-3 mini model using Hugging Face's Transformers library

Each section in this notebook is designed to be run independently, allowing you to focus on specific tasks as needed.
Fine-Tuning Phi-3
Welcome to the guide on fine-tuning the Phi-3 model. Phi-3 is a powerful language model developed by Microsoft, designed to generate human-like text based on the input it receives. Fine-tuning is a process that involves training a pre-trained model (like Phi-3) on a specific task, allowing it to adapt its pre-learned knowledge to the new task.
In this guide, we will walk you through the steps of fine-tuning the Phi-3 model. This process can help improve the model's performance on specific tasks or domains that were not covered in its original training data.
The fine-tuning process involves several steps, including setting up the environment, loading the pre-trained model, preparing the training data, and finally, training the model on the new data.
By the end of this guide, you should have a good understanding of how to fine-tune the Phi-3 model for your specific needs. Let's get started!
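The steps above (loading the pre-trained model, preparing it for training, and attaching trainable adapters) can be sketched as follows. This is a minimal QLoRA-style sketch, not the notebook's exact code: it assumes the `transformers`, `peft`, and `bitsandbytes` packages are installed, a CUDA GPU is available, and the public `microsoft/Phi-3-mini-4k-instruct` checkpoint is used; the LoRA hyperparameters and target module names are illustrative choices.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_id = "microsoft/Phi-3-mini-4k-instruct"

# Load the frozen base weights in 4-bit NF4 so they fit in limited GPU memory (QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Attach small trainable LoRA matrices; only these are updated during fine-tuning.
peft_config = LoraConfig(
    r=16,                                  # adapter rank (illustrative)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["qkv_proj", "o_proj"],  # assumed Phi-3 attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()

# From here, train with transformers.Trainer (or trl's SFTTrainer) on your dataset.
```

For plain LoRA rather than QLoRA, drop the `quantization_config` argument and load the base model in 16-bit precision instead.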
Inference
This section demonstrates how to run inference using the Phi-3 mini model with Hugging Face's Transformers library, specifically the 16-bit version.
Using the original model (16-bit version)
Running the model in 16-bit precision requires roughly 7.4 GB of GPU RAM.
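A minimal inference sketch with the 16-bit model is shown below. It assumes `transformers` is installed, a GPU with enough free memory is available, and the public `microsoft/Phi-3-mini-4k-instruct` checkpoint; the prompt is illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "microsoft/Phi-3-mini-4k-instruct"

# Load the model in 16-bit precision and place it on the available GPU(s).
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Phi-3 is an instruct model, so prompts are passed in chat-message format.
messages = [{"role": "user", "content": "Explain fine-tuning in one sentence."}]
output = pipe(messages, max_new_tokens=64)
print(output[0]["generated_text"])
```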
Code Generation
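Phi-3 mini can also be prompted for code generation. The following self-contained sketch (assuming `transformers` is installed and a GPU is available; the prompt and generation settings are illustrative) asks the model to write a small Python function:

```python
from transformers import pipeline

# Build a text-generation pipeline directly from the model id.
pipe = pipeline(
    "text-generation",
    model="microsoft/Phi-3-mini-4k-instruct",
    device_map="auto",
)

# Ask for code; the response arrives as ordinary generated text.
prompt = [{"role": "user", "content": "Write a Python function that reverses a string."}]
result = pipe(prompt, max_new_tokens=128)
print(result[0]["generated_text"])
```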
Quantization
When the Phi-3 model is fine-tuned using Hugging Face's Transformers and subsequently quantized with 4-bit GPTQ, it requires 2.7 GB of GPU RAM.
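A sketch of 4-bit GPTQ quantization via Transformers' `GPTQConfig` is shown below. It assumes the `optimum` and `auto-gptq` packages are installed in addition to `transformers`, and uses the `"c4"` calibration-dataset shortcut that `GPTQConfig` supports; the output directory name is an illustrative choice.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "microsoft/Phi-3-mini-4k-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# GPTQ needs a tokenizer and a calibration dataset to measure quantization error.
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

# Quantization runs during loading and can take a while on a single GPU.
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=gptq_config, device_map="auto"
)

# Save the quantized weights so later loads skip the calibration step.
model.save_pretrained("phi-3-mini-gptq-4bit")
tokenizer.save_pretrained("phi-3-mini-gptq-4bit")
```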
"Bitsandbytes NF4" is a specific configuration or method within the Bitsandbytes library, which is used for quantization. Quantization is a process that reduces the numerical precision of the weights in a model to make it smaller and faster.