
Fine tuning StarCoder2 with NeMo Framework

Introduction

StarCoder2 is a newly improved open coding assistant model. The 15B variant outperforms leading open code LLMs on popular programming benchmarks and delivers superior performance in its class. Notably, with a context length of 16,000 tokens, StarCoder2 can handle a larger code base and more elaborate coding instructions, gain a better understanding of code structure, and provide improved code documentation. The model is the outcome of a collaboration among BigCode, ServiceNow, and NVIDIA, and it is more powerful than the original StarCoder, with a doubled context window. With the NVIDIA NeMo framework, you can customize StarCoder2 to fit your use case and deploy an optimized model on your NVIDIA GPU.

In this tutorial, we'll go over a popular Parameter-Efficient Fine-Tuning (PEFT) customization technique, Low-Rank Adaptation (LoRA), which enables the StarCoder2 model to learn a new coding language or coding style.

Note that the 15B StarCoder2 model takes about 30 GB of disk space and requires more than 80 GB of GPU memory when performing PEFT on a single GPU. The verified hardware configuration for this notebook and the subsequent inference notebook is therefore a single-node machine with 8x 80 GB NVIDIA GPUs.

Download the base model

For all of our customization and deployment processes, we'll start from a pre-trained StarCoder2 model, which you can download from Hugging Face (HF). Before doing that, make sure you have registered an HF account, consented to the StarCoder2 terms, and installed the libraries and Python packages required to download the model.

[ ]
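The setup cell might look something like the following sketch, which assumes the `huggingface_hub` client and `git-lfs` are the tools you use for authentication and large-file cloning (adjust package-manager commands to your environment):

```shell
# Install git-lfs (needed to clone large model weights) and the HF Hub client
apt-get update && apt-get install -y git-lfs
git lfs install
pip install --upgrade huggingface_hub

# Authenticate with the access token created under your HF account settings
huggingface-cli login
```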

Once done, clone the model from HF with:

[ ]
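A minimal sketch of the clone step, assuming the model lives at the `bigcode/starcoder2-15b` repository on Hugging Face (cloning requires that you have already accepted the model terms):

```shell
# Clone the StarCoder2-15B weights; git-lfs pulls the large checkpoint files
git clone https://huggingface.co/bigcode/starcoder2-15b
```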

Getting NeMo Framework

NVIDIA NeMo Framework is a generative AI framework built for researchers and PyTorch developers working on large language models (LLMs), multimodal models (MM), automatic speech recognition (ASR), and text-to-speech synthesis (TTS). The primary objective of NeMo is to provide a scalable framework for researchers and developers from industry and academia to more easily implement and design new generative AI models by being able to leverage existing code and pretrained models.

If you haven't already, you can pull a container that includes the version of NeMo Framework and all dependencies needed for this notebook with the following: docker pull nvcr.io/nvidia/nemo:24.01.starcoder2

The best way to run this notebook is from within the container. You can do that by launching the container with the following command:

[ ]
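One possible launch command is sketched below; the mount paths, shared-memory size, and port are placeholders you should adapt to your machine:

```shell
# Launch the NeMo container interactively with all GPUs visible,
# mounting the current directory as the working directory
docker run --gpus all -it --rm \
  --shm-size=16g \
  -p 8888:8888 \
  -v ${PWD}:/workspace \
  -w /workspace \
  nvcr.io/nvidia/nemo:24.01.starcoder2
```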

From within the container, start the Jupyter server with:

[ ]
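A typical way to start the server inside the container (the port must match whatever you published in the `docker run` command):

```shell
# Listen on all interfaces so the notebook is reachable from the host
jupyter lab --ip 0.0.0.0 --port 8888 --allow-root --no-browser
```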

We will need to convert the newly downloaded HF model to the .nemo format to perform PEFT. First, let's upgrade the transformers package to the latest version.

[ ]
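The upgrade step is a single pip command:

```shell
# Upgrade transformers so the checkpoint loader recognizes StarCoder2
pip install --upgrade transformers
```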

Next, execute the following command to convert the StarCoder2 checkpoint into a .nemo file. Please make sure all the arguments point to the correct paths.

[ ]
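The conversion command might look like the sketch below. The converter script name and its flags vary across NeMo versions, so verify the actual path under `/opt/NeMo/scripts` in your container before running; the paths shown are placeholders:

```shell
# Script path and flag names are assumptions -- check your NeMo version
python /opt/NeMo/scripts/checkpoint_converters/convert_starcoder2_hf_to_nemo.py \
  --input_name_or_path ./starcoder2-15b \
  --output_path ./starcoder2-15b.nemo
```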

Data Preparation

Next, we'll prepare the data for our LoRA fine-tuning. We'll use the Alpaca Python Code Instructions dataset and train our model to enhance its instruction-following ability for generating Python code.

Let's download the Alpaca Python Code Instructions dataset from Hugging Face:

[ ]
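A sketch of the download step using the `datasets` library. The dataset ID below (`iamtarun/python_code_instructions_18k_alpaca`) is an assumption — it is a commonly used Alpaca-style Python code instruction set on Hugging Face, but substitute the ID the tutorial intends if it differs:

```python
from datasets import load_dataset

# Dataset ID is an assumption; replace with the repo you intend to use
ds = load_dataset("iamtarun/python_code_instructions_18k_alpaca", split="train")

# Inspect one record to see the original instruction-data format
print(ds[0])
```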

Finally, the following code snippets convert the dataset into the JSONL format that NeMo uses by default for PEFT. At the same time, we reformat the data into a list of (prompt, completion) pairs that the model can handle appropriately. Refer to the printout for the original code-instruction data format.

[ ]
[ ]
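The reformatting step can be sketched as below. The Alpaca-style field names (`instruction`, `input`, `output`) and the exact prompt template are assumptions; NeMo's PEFT data loader expects one JSON object per line with `input` and `output` keys:

```python
import json

def to_prompt_completion(example):
    """Convert one Alpaca-style record into a (prompt, completion) pair."""
    prompt = example["instruction"]
    if example.get("input"):
        prompt += "\n" + example["input"]
    # NeMo's default PEFT format uses "input" and "output" keys
    return {"input": prompt + "\nOutput:", "output": example["output"]}

def write_jsonl(records, path):
    """Write records as one JSON object per line (JSONL)."""
    with open(path, "w") as f:
        for rec in records:
            f.write(json.dumps(to_prompt_completion(rec)) + "\n")

# Example usage with a toy record
sample = {"instruction": "Write a function that adds two numbers.",
          "input": "",
          "output": "def add(a, b):\n    return a + b"}
write_jsonl([sample], "train.jsonl")
```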

Here's an example of what the data looks like after reformatting:

[ ]

LoRA Configuration And PEFT

Step 1: Start NeMo Container

If the container is not already running, launch it with the following command:

[ ]

Step 2: Run PEFT

The megatron_gpt_peft_tuning_config.yaml file configures the parameters for PEFT training jobs in NeMo using the LoRA technique for language model tuning. Point restore_from_path to the just-converted .nemo file, and point the dataset paths to the train and validation JSONL files.

[ ]
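The training launch might look like the sketch below. The script path follows the NeMo 24.01 layout, and the Hydra override names come from the tuning config; paths, step counts, and batch settings are placeholders to adapt:

```shell
# Paths and hyperparameters are placeholders -- adjust to your setup
torchrun --nproc_per_node=8 \
  /opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_peft_tuning.py \
  trainer.devices=8 \
  trainer.max_steps=1000 \
  model.restore_from_path=./starcoder2-15b.nemo \
  model.peft.peft_scheme=lora \
  model.data.train_ds.file_names=[./train.jsonl] \
  model.data.validation_ds.file_names=[./validation.jsonl]
```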

Note: For running PEFT on multiple nodes (for example, on a Slurm cluster), replace torchrun --nproc_per_node=8 with python.

Step 3: Merge The Adapted Weights

Once PEFT is finished, we'll need to merge the weights of the base model with the weights of the adapter. If you're using the NeMo Framework container, you'll find a script for this at /opt/NeMo/scripts/nlp_language_modeling/merge_lora_weights/merge.py. Otherwise, you can download the standalone script from GitHub at https://raw.githubusercontent.com/NVIDIA/NeMo/main/scripts/nlp_language_modeling/merge_lora_weights/merge.py. To run the merge script, you'll need the paths to the pretrained .nemo model and the trained adapter .nemo model, as well as a path to save the merged model. Please modify these according to your local environment.

[ ]
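An invocation of the merge script might be sketched as follows. The Hydra argument names are assumptions based on the script's config and may differ across NeMo versions; all paths are placeholders:

```shell
# Argument names and paths are assumptions -- verify against merge.py
python /opt/NeMo/scripts/nlp_language_modeling/merge_lora_weights/merge.py \
  trainer.accelerator=gpu \
  gpt_model_file=./starcoder2-15b.nemo \
  lora_model_path=./lora_adapter.nemo \
  merged_model_path=./starcoder2-15b-lora-merged.nemo
```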

Step 4: Run Evaluation

Run evaluation using megatron_gpt_peft_eval.py.

Set the appropriate model checkpoint path, test file path, batch sizes, number of tokens to generate, etc., and run evaluation on the test file.

[ ]
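The evaluation command might look like the sketch below. The override names follow the eval script's Hydra config but should be verified against your container version; all paths are placeholders:

```shell
# Paths and argument names are assumptions -- check megatron_gpt_peft_eval.py
python /opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_peft_eval.py \
  model.restore_from_path=./starcoder2-15b-lora-merged.nemo \
  model.data.test_ds.file_names=[./test.jsonl] \
  model.data.test_ds.global_batch_size=8 \
  model.data.test_ds.micro_batch_size=1 \
  model.data.test_ds.tokens_to_generate=128 \
  model.data.test_ds.output_file_path_prefix=./peft_results \
  inference.greedy=True
```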

Check the output from the result file:

[ ]

Note: This is only a sample output (based on a toy LoRA example), and your output may vary. Performance can be further improved by fine-tuning the model for more steps.