02 LLM Finetuning

Finetuning LLM for Triplet Prediction
with NVIDIA NIM microservice

Welcome To Your Cloud Environment! This interactive web application, which you're currently using to run Python code, is more than just a simple interface. When you access this Jupyter Notebook, an instance on a cloud platform is allocated to you by the NVIDIA Deep Learning Institute (DLI). This forms your base cloud environment, essentially a blank canvas for further setup, and includes:

  • A dedicated CPU, and possibly a GPU, for processing.
  • A pre-installed base operating system.
  • A pre-installation of packages necessary to run the lab.

Learning Objectives

Fine-Tuning a Smaller LLM for Accurate Triplet Predictions

In this tutorial, we will fine-tune a smaller Large Language Model (LLM) for more accurate triplet predictions using NVIDIA NeMo and NVIDIA Inference Microservices (NIM).

Introduction to NVIDIA NeMo and NIM

NVIDIA NeMo is a scalable, cloud-native generative AI framework designed for researchers and developers working with Large Language Models, Multimodal AI, and Speech AI (e.g., Automatic Speech Recognition and Text-to-Speech). It allows users to efficiently create, customize, and deploy generative AI models by leveraging existing code and pre-trained model checkpoints.

NVIDIA Inference Microservices (NIM) is a suite of microservices that enables fast and seamless deployment of AI models. NIM can be used on-premises or in DGX Cloud, allowing users to transition models to self-managed hosting with minimal code changes. These microservices are designed to scale dynamically based on load and run efficiently on GPUs.

Why Fine-Tune a Smaller LLM?

Large Language Models (LLMs) are trained for a wide range of tasks. However, for this specific use case, we only need the model to predict triplets from given text. Instead of deploying a large LLM, we use LLM distillation to train a smaller, more efficient model that retains the accuracy of a larger model while consuming fewer computational resources.

LLM Distillation is a process where a large LLM (teacher model) is used to train a smaller LLM (student model). The smaller model learns by replicating the teacher’s output, achieving similar accuracy with reduced computational overhead.

While teacher models provide high accuracy, they are resource-intensive. Deploying them for a single task is often inefficient. Instead, a fine-tuned student model offers significantly better throughput while meeting business-related performance KPIs.

Tutorial Overview

In this tutorial, we will fine-tune the LLaMa-3 8B model using NVIDIA NeMo and deploy it with NVIDIA NIM. We will cover the following:

  • Dataset Preparation: How to collect and preprocess data for LLM distillation
  • Fine-Tuning LLaMa-3 8B: Setting up and fine-tuning the model using NVIDIA NeMo (the model is pre-downloaded, and the necessary Python scripts are provided)
  • Deploying the Fine-Tuned Model: Using NVIDIA NIM for efficient model deployment
  • Querying the Deployed Model: Interacting with the model to make predictions
  • Enhancing Accuracy: Additional techniques to improve model performance

The complete process of fine-tuning and deployment is summarized in the image below:

Importing necessary modules

[ ]
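The empty cell above holds the module imports. As a sketch, the standard-library modules this pipeline plausibly needs are listed below (the exact list is an assumption, not the original cell's contents):

```python
# Standard-library modules commonly needed by this data pipeline
# (assumed; the original import cell is not shown).
import os        # directory traversal and path handling
import re        # whitespace cleanup in clean_text
import json      # reading raw SEC JSON and triples files
import random    # shuffling before the train/val/test split
```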

This step defines the directories and output file paths used in the data processing pipeline:

  1. TRIPLES_DIR: Path to the JSON files containing triples for the corresponding files in RAW_JSON_DIR.
  2. RAW_JSON_DIR: Path to the raw JSON files that contain unprocessed SEC data.
  3. OUTPUT_JSONL: Path where the processed data is saved in JSON Lines (JSONL) format, with each line representing a separate JSON object.
[ ]
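The three paths above could be defined along these lines (the directory names are hypothetical placeholders; use the paths provided in the lab):

```python
import os

# Hypothetical paths for illustration; the actual lab paths may differ.
TRIPLES_DIR = os.path.join("data", "triples")             # per-file triples JSON
RAW_JSON_DIR = os.path.join("data", "raw_json")           # unprocessed SEC filings
OUTPUT_JSONL = os.path.join("data", "sec_triples.jsonl")  # combined JSONL output
```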

1. Dataset for Distillation

Here's a concise explanation of the code:

  1. clean_text function: Cleans the input text by removing extra spaces, tabs, and newlines.

  2. read_raw_json_item_1 function: Reads the item_1 key from a specified raw SEC JSON file. It returns an empty string if the file is not found or if there's an error reading the file.

  3. process_triples_and_raw_json function:

    • It processes files in the triples_dir directory (which should contain JSON files with triples data).
    • For each triple file, it:
      • Reads the corresponding raw JSON file (based on the filename field in the triples JSON).
      • Cleans the text in the item_1 field of the raw JSON.
      • Cleans and processes the item_1a field from the triples file (treated as triples).
      • Creates a JSONL entry with input as the cleaned item_1 text and output as the cleaned triples.
      • Writes each JSONL entry to the output_file.
[ ]
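The three functions described above could be sketched as follows. The function names come from the text; the bodies are an illustrative reconstruction, not the original cell:

```python
import json
import os
import re

def clean_text(text: str) -> str:
    """Collapse extra spaces, tabs, and newlines into single spaces."""
    return re.sub(r"\s+", " ", text).strip()

def read_raw_json_item_1(raw_json_dir: str, filename: str) -> str:
    """Read the item_1 key from a raw SEC JSON file; return '' on any error."""
    try:
        with open(os.path.join(raw_json_dir, filename), encoding="utf-8") as f:
            return json.load(f).get("item_1", "")
    except (OSError, json.JSONDecodeError):
        return ""

def process_triples_and_raw_json(triples_dir, raw_json_dir, output_file):
    """Pair each triples file with its raw filing and emit JSONL records."""
    with open(output_file, "w", encoding="utf-8") as out:
        for name in sorted(os.listdir(triples_dir)):
            with open(os.path.join(triples_dir, name), encoding="utf-8") as f:
                triples = json.load(f)
            # Look up the raw filing named by the triples file, then clean it.
            raw_text = clean_text(
                read_raw_json_item_1(raw_json_dir, triples.get("filename", name))
            )
            entry = {"input": raw_text,
                     "output": clean_text(str(triples.get("item_1a", "")))}
            out.write(json.dumps(entry) + "\n")
```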

The code below defines a function that splits a JSONL file into training, validation, and test datasets, then writes the resulting data into separate files:

[ ]
[ ]
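A minimal sketch of such a split function is shown below. The 80/10/10 proportions and the fixed seed are assumptions for illustration, not the lab's exact values:

```python
import json
import random

def split_jsonl(input_path, train_path, val_path, test_path,
                train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle JSONL records and write train/val/test files.

    The 80/10/10 split and the seed are illustrative assumptions.
    """
    with open(input_path, encoding="utf-8") as f:
        lines = [ln for ln in f if ln.strip()]
    random.Random(seed).shuffle(lines)     # deterministic shuffle
    n = len(lines)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    for path, chunk in [(train_path, lines[:n_train]),
                        (val_path, lines[n_train:n_train + n_val]),
                        (test_path, lines[n_train + n_val:])]:
        with open(path, "w", encoding="utf-8") as out:
            out.writelines(chunk)
```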

Apply an additional cleaning function, since malformed/bad JSON entries in the JSONL file are often found to halt training midway.

[ ]
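A simple cleaning pass of this kind can be sketched as follows: parse every line, keep only well-formed records with both required keys, and drop the rest (the function name and return value are assumptions):

```python
import json

def drop_malformed_lines(input_path, output_path):
    """Keep only lines that parse as JSON objects with both required keys."""
    kept = dropped = 0
    with open(input_path, encoding="utf-8") as f, \
         open(output_path, "w", encoding="utf-8") as out:
        for line in f:
            try:
                rec = json.loads(line)
                assert "input" in rec and "output" in rec
            except (json.JSONDecodeError, AssertionError):
                dropped += 1          # skip malformed or incomplete records
                continue
            out.write(json.dumps(rec) + "\n")
            kept += 1
    return kept, dropped
```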

In knowledge distillation, models are fine-tuned using labeled datasets to improve their performance on specific tasks. Task-specific fine-tuning enhances response quality and helps overcome the limitations of the student model. During this process, the model is trained over multiple iterations on labeled data to refine its predictions.

For fine-tuning with NVIDIA NeMo, labeled data must be provided in JSON Lines (JSONL) format. JSONL is a convenient format for storing structured data, allowing for efficient processing of records one at a time.

Typically, the following format is used when fine-tuning with NeMo:

{"input": "Sample input text", "output": "Expected model response"}
{"input": "Another example input", "output": "Corresponding expected output"}

In the case of fine-tuning the model for triplet extraction, the labeled data looks as given below:

{"input": "ITEM 1. BUSINESS ImageWare Systems, Inc., a Delaware corporation, has its principal place of business at 11440 West Bernardo Court, Suite 300, San Diego, California 92127. We maintain a corporate website at www.iwsinc.com. Our common stock, par value $0.01 per share (\u201cCommon Stock\u201d), is currently listed for quotation on the OTCQB marketplace under the symbol \u201cIWSY\u201d. As used in this Annual Report, \u201cwe\u201d, \u201cus\u201d, \u201cour\u201d, \u201cImageWare\u201d, \u201cImageWare Systems\u201d or the \u201cCompany\u201d refers to ImageWare Systems, Inc. and all of its subsidiaries. Overview ImageWare Systems, Inc. (\u201cImageWare,\u201d the \u201cCompany,\u201d \u201cwe,\u201d \u201cour\u201d) provides defense-grade biometric identification and authentication solutions to safeguard your data, products, services or facilities. We are experts in biometric authentication and considered a preeminent patent holder of multimodal biometrics IP, having many of the most-cited patents in the industry. Our patented IWS Biometric Engine\u00ae is one of the most accurate and fastest biometrics matching engines in the industry, capable of our patented biometrics fusion. Part of our heritage is in law enforcement, having built the first statewide digital booking platform for United States local law enforcement in the late 1990\u2019s - and having more than three decades of experience in the challenging government sector creating biometric smart cards and logical access for millions of individuals. We are a \u201cbiometrics first\u201d company, leveraging unique human characteristics to provide unparalleled accuracy for identification while protecting your identity. The Company\u2019s products also provide law enforcement and public safety sector with integrated biographic, mugshot, SMT, and fingerprint capture for booking, in addition to investigative capabilities. 
The Company also provides comprehensive authentication security software using biometrics to secure physical and logical access to facilities, computer networks or Internet sites. Biometric technology is now an integral part of all markets that the Company addresses, and every product leverages our patented IWS Biometric Engine\u00ae. The IWS Biometric Engine\u00ae is a patented biometric identity and authentication database built for multi-biometric enrollment, management and authentication. It is hardware agnostic and can utilize different types of biometric algorithms. It allows different types of biometrics to be operated at the same time on a seamlessly integrated platform. It is also offered as a Software Development Kit (\u201cSDK\u201d), enabling developers and system integrators to implement biometric solutions or integrate biometric capabilities, into existing applications. Our secure credential solutions empower customers to design and create smart digital identification wristbands and badges for access control systems. We develop, sell and support software and design systems that utilize digital imaging and biometrics for photo identification cards, credentials and identification systems. Our products in this market consist of IWS EPI Suite and IWS EPI Builder. These products allow for production of digital identification badges and related databases and records and can be used by, among others, schools, airports, hospitals, corporations and governments. 
.....................", "output": "[['IMAGEWARE SYSTEMS INC', 'COMP', 'Have', 'Insufficient Cash Resources', 'CONCEPT'], ['IMAGEWARE SYSTEMS INC', 'COMP', 'Need', 'Additional Capital', 'FIN_INSTRUMENT'], ['IMAGEWARE SYSTEMS INC', 'COMP', 'Operate_In', 'Identity Management Solutions Industry', 'SECTOR'], ['IMAGEWARE SYSTEMS INC', 'COMP', 'Face', 'Competition', 'CONCEPT'], ['IMAGEWARE SYSTEMS INC', 'COMP', 'Face', 'Fluctuating Operating Results', 'CONCEPT'], ['IMAGEWARE SYSTEMS INC', 'COMP', 'Depends_Upon', 'Large System Sales', 'PRODUCT'], ['IMAGEWARE SYSTEMS INC', 'COMP', 'Have', 'Lengthy Sales Cycle', 'CONCEPT'], ['IMAGEWARE SYSTEMS INC', 'COMP', 'Have', 'Negative Working Capital', 'ECON_INDICATOR'], ['IMAGEWARE SYSTEMS INC', 'COMP', 'Sell', 'Products to Government Agencies', 'GPE'], ['IMAGEWARE SYSTEMS INC', 'COMP', 'Rely_On', 'Systems Integrators', 'ORG'], ['Systems Integrators', 'ORG', 'Perform', 'Adequately', 'VERB'], ['IMAGEWARE SYSTEMS INC', 'COMP', 'Have', 'Accumulated Deficit', 'ECON_INDICATOR'], ['IMAGEWARE SYSTEMS INC', 'COMP', ' experience', 'Fluctuations in Operating Results', 'CONCEPT'], ['IMAGEWARE SYSTEMS INC', 'COMP', 'Subject_To', 'Penny Stock Regulations', 'FIN_INSTRUMENT'], ['Penny Stock Regulations', 'FIN_INSTRUMENT', 'Impose', 'Additional Sales Practice Requirements', 'CONCEPT'], ['IMAGEWARE SYSTEMS INC', 'COMP', 'Have', 'Foreign Operations', 'COMP'], ['Foreign Operations', 'COMP', 'Expose', 'Foreign Political Risks', 'CONCEPT'], ['Foreign Operations', 'COMP', 'Expose', 'Foreign Economic Risks', 'CONCEPT'], ['Foreign Operations', 'COMP', 'Expose', 'Foreign Legal Risks', 'CONCEPT'], ['Foreign Operations', 'COMP', 'Expose', 'Foreign Currency Exchange Rates', 'CONCEPT'], ['IMAGEWARE SYSTEMS INC', 'COMP', 'Have', 'Foreign Operations', 'COMP'], ['Foreign Operations', 'COMP', 'Affect', 'Results', 'CONCEPT'], ['IMAGEWARE SYSTEMS INC', 'COMP', 'Subject_To', 'Income Taxes', 'CONCEPT'], ['Income Taxes', 'CONCEPT', 'Requires', 'Significant Judgments', 
'CONCEPT'], ['IMAGEWARE SYSTEMS INC', 'COMP', 'Subject_To', 'Income Taxes', 'CONCEPT'], ['Income Taxes', 'CONCEPT', 'Subject_To', 'Examinations', 'CONCEPT'], ['IMAGEWARE SYSTEMS INC', 'COMP', 'Exposed_To', 'Foreign Political Risks', 'CONCEPT'], ['IMAGEWARE SYSTEMS INC', 'COMP', 'Exposed_To', 'Foreign Economic Risks', 'CONCEPT'], ['IMAGEWARE SYSTEMS INC', 'COMP', 'Exposed_To', 'Foreign Legal Risks', 'CONCEPT'], ['IMAGEWARE SYSTEMS INC', 'COMP', 'Exposed_To', 'Foreign Currency Exchange Rates', 'CONCEPT'], ['IMAGEWARE SYSTEMS INC', 'COMP', 'Have', 'Foreign Operations', 'COMP'], ['Foreign Operations', 'COMP', 'Affect', 'Results', 'CONCEPT'], ['IMAGEWARE SYSTEMS INC', 'COMP', 'Face', 'Penny Stock Rules', 'FIN_INSTRUMENT'], ['Penny Stock Rules', 'FIN_INSTRUMENT', 'Affect', 'Market Liquidity', 'CONCEPT'], ['IMAGEWARE SYSTEMS INC', 'COMP', 'Face', 'Volatility', 'CONCEPT'], ['Volatility', 'CONCEPT', 'Affect', 'Investment Value', 'CONCEPT'], ['IMAGEWARE SYSTEMS INC', 'COMP', 'Face', 'Fluctuations', 'CONCEPT'], ['Fluctuations', 'CONCEPT', 'Cause', 'Decline in Value', 'CONCEPT'], ['IMAGEWARE SYSTEMS INC', 'COMP', 'Have', 'Common Stock', 'FIN_INSTRUMENT'], ['IMAGEWARE SYSTEMS INC', 'COMP', 'Face', 'Specific Factors', 'CONCEPT'], ['Specific Factors', 'CONCEPT', 'Affect', 'Market Price', 'CONCEPT'], ...]"}

In our case, the "input" key is the text given as input, and the "output" key is the triplets predicted by the teacher model. To construct the current dataset, we used Mixtral 8x7B as the teacher and the SEC-10 dataset as the source text.


2. Finetuning

Llama 3 is an open-source large language model by Meta that delivers state-of-the-art performance on popular industry benchmarks. It has been pretrained on over 15 trillion tokens and supports an 8K token context length. It is available in two sizes, 8B and 70B, and each size has two variants: base pretrained and instruction tuned.

Low-Rank Adaptation (LoRA) has emerged as a popular Parameter-Efficient Fine-Tuning (PEFT) technique that tunes a very small number of additional parameters as compared to full fine-tuning, thereby reducing the compute required.

NVIDIA NeMo Framework provides tools to perform LoRA on Llama 3 to fit your use case, which can then be deployed using NVIDIA NIM for optimized inference on NVIDIA GPUs.

This notebook shows how to perform LoRA PEFT on Llama 3 8B Instruct using SEC-10 with NeMo Framework.

Download the base model

The first set of commands creates a directory to store the Llama-3-8B-Instruct model file if it doesn’t already exist. The second set of commands downloads the model file (8b_instruct_nemo_bf16.nemo) from NVIDIA's NGC server using requests and saves it in the newly created directory. The third set of commands verifies the successful download by listing the contents of the directory.

[ ]
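The three steps above can be sketched as follows. The directory matches the lab layout mentioned later in this notebook; the download URL is a placeholder, and this sketch uses the standard-library urllib rather than the requests package used in the original cell:

```python
import os
import urllib.request

# Directory follows the lab's layout; the URL is a placeholder.
MODEL_DIR = "/workspace/model"
MODEL_FILE = os.path.join(MODEL_DIR, "8b_instruct_nemo_bf16.nemo")
MODEL_URL = "<NGC_DOWNLOAD_URL>"  # use the URL given in the lab

def download(url: str, dest: str, chunk_size: int = 1 << 20) -> None:
    """Create the target directory, then stream the file to disk in 1 MiB chunks."""
    os.makedirs(os.path.dirname(dest), exist_ok=True)   # step 1: create directory
    with urllib.request.urlopen(url, timeout=60) as r, open(dest, "wb") as f:
        while chunk := r.read(chunk_size):              # step 2: streamed download
            f.write(chunk)

# download(MODEL_URL, MODEL_FILE)   # uncomment to run with a real URL
# print(os.listdir(MODEL_DIR))      # step 3: verify the download
```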

Check GPU availability for training

The command docker exec containerB nvidia-smi runs the nvidia-smi tool inside the containerB container to display GPU status. Ensure that the container has GPU access (--gpus all) and that the NVIDIA drivers are installed.

[ ]

The NeMo Framework (the current environment) includes a high-level Python script for fine-tuning, megatron_gpt_finetuning.py, that abstracts away some of the lower-level API calls. Once you have downloaded your model and prepared the dataset, LoRA fine-tuning with NeMo is essentially just running this script!

For this demonstration, the training run is capped at 20 max steps, and validation is carried out every 10 steps. You may increase the steps to 10,000+ in practical scenarios, but we have limited the steps here in the interest of time.

This will create a LoRA adapter, a file named megatron_gpt_peft_lora_tuning.nemo in /workspace/model/Meta-Llama-3-8B-Instruct-Sec-LoRA. We'll use this later.

trainer.max_steps is capped at 20 iterations to save time and keep this as a learning example. Typically, fine-tuning is done on an 8xH100-class setup and often requires 10,000+ steps.

The peft.peft_scheme parameter determines the technique being used. In this case, we use LoRA, but NeMo Framework supports other techniques as well, such as P-tuning, Adapters, and IA3. For more information, refer to the PEFT support matrix. For example, for P-tuning, simply set model.peft.peft_scheme="ptuning" instead of "lora".

[ ]
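The actual invocation lives in the cell above; as a rough sketch, a launch command along these lines is typical. All paths and most flag values here are assumptions based on the surrounding text, not the lab's exact configuration:

```python
# Assemble a typical launch command for the NeMo fine-tuning script.
# Paths and most values are illustrative assumptions.
overrides = {
    "trainer.devices": 1,
    "trainer.max_steps": 20,                 # capped for this demo
    "trainer.val_check_interval": 10,        # validate every 10 steps
    "model.restore_from_path": "/workspace/model/8b_instruct_nemo_bf16.nemo",
    "model.peft.peft_scheme": "lora",        # or "ptuning", "adapter", "ia3"
    "model.data.train_ds.file_names": "[train.jsonl]",
    "model.data.validation_ds.file_names": "[val.jsonl]",
}
cmd = ["python", "megatron_gpt_finetuning.py"] + [
    f"{k}={v}" for k, v in overrides.items()
]
print(" ".join(cmd))
```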

Transfer the fine-tuned LoRA adapter to the directory from which NIM can load it and make the model available for inference

[ ]
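A transfer of this kind can be sketched with a simple copy helper. The source path comes from the text above; the target model name is a hypothetical example:

```python
import os
import shutil

def stage_adapter(src: str, peft_store: str, model_name: str) -> str:
    """Copy a .nemo LoRA adapter into <peft_store>/<model_name>/ for NIM to load."""
    dest_dir = os.path.join(peft_store, model_name)
    os.makedirs(dest_dir, exist_ok=True)
    return shutil.copy(src, dest_dir)

# Example (paths are assumptions based on the text above):
# stage_adapter(
#     "/workspace/model/Meta-Llama-3-8B-Instruct-Sec-LoRA/megatron_gpt_peft_lora_tuning.nemo",
#     os.environ.get("LOCAL_PEFT_DIRECTORY", "./loras"),
#     "llama3-8b-sec-triplets",
# )
```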

3. Deploy LoRA Inference Adapters with NVIDIA NIM

Run the container with the following environment variables


The steps given below are just for your information and do not need to be executed right now, as we have already set up the environment for you


Details on how this container was run

  1. Download the example LoRA adapters.

The following steps assume that you have authenticated with NGC and downloaded the CLI tool, as listed in the Requirements section.

# Set path to your LoRA model store
export LOCAL_PEFT_DIRECTORY="$(pwd)/loras"
mkdir -p $LOCAL_PEFT_DIRECTORY
pushd $LOCAL_PEFT_DIRECTORY

# downloading NeMo-format loras
ngc registry model download-version "nim/meta/llama3-8b-instruct-lora:nemo-math-v1"
ngc registry model download-version "nim/meta/llama3-8b-instruct-lora:nemo-squad-v1"

popd
chmod -R 777 $LOCAL_PEFT_DIRECTORY
  2. Prepare the LoRA model store.

After training is complete, the LoRA model checkpoint will be created at ./results/Meta-Llama-3-8B-Instruct/checkpoints/megatron_gpt_peft_lora_tuning.nemo, assuming the default paths in the first notebook weren't modified.

To ensure the model store is organized as expected, create a folder named llama3-8b-pubmed-qa, and move your .nemo checkpoint there.

mkdir -p $LOCAL_PEFT_DIRECTORY/llama3-8b-pubmed-qa

# Ensure the source path is correct
cp ./results/Meta-Llama-3-8B-Instruct/checkpoints/megatron_gpt_peft_lora_tuning.nemo $LOCAL_PEFT_DIRECTORY/llama3-8b-pubmed-qa

Ensure that the LoRA model store directory follows this structure: the model name(s) should be sub-folder(s) containing the .nemo file(s).

<$LOCAL_PEFT_DIRECTORY>
├── llama3-8b-instruct-lora_vnemo-math-v1
│   └── llama3_8b_math.nemo
├── llama3-8b-instruct-lora_vnemo-squad-v1
│   └── llama3_8b_squad.nemo
└── llama3-8b-pubmed-qa
    └── megatron_gpt_peft_lora_tuning.nemo

The last one was just trained on the PubmedQA dataset in the previous notebook.

  3. Set up NIM.

From your host OS environment, start the NIM docker container while mounting the LoRA model store, as follows:

# Set these configurations
export NGC_API_KEY=<YOUR_NGC_API_KEY>
export NIM_PEFT_REFRESH_INTERVAL=3600  # (in seconds) will check NIM_PEFT_SOURCE for newly added models in this interval
export NIM_CACHE_PATH=</path/to/NIM-model-store-cache>  # Model artifacts (in container) are cached in this directory
mkdir -p $NIM_CACHE_PATH
chmod -R 777 $NIM_CACHE_PATH

export NIM_PEFT_SOURCE=/home/nvs/loras # Path to LoRA models internal to the container
export CONTAINER_NAME=meta-llama3-8b-instruct

docker run -it --rm --name=$CONTAINER_NAME \
    --runtime=nvidia \
    --gpus all \
    --shm-size=16GB \
    -e NGC_API_KEY \
    -e NIM_PEFT_SOURCE \
    -e NIM_PEFT_REFRESH_INTERVAL \
    -v $NIM_CACHE_PATH:/opt/nim/.cache \
    -v $LOCAL_PEFT_DIRECTORY:$NIM_PEFT_SOURCE \
    -p 8000:8000 \
    nvcr.io/nim/meta/llama3-8b-instruct:1.0.0

The first time you run the command, it will download the model and cache it in $NIM_CACHE_PATH so subsequent deployments are even faster. There are several options to configure NIM other than the ones listed above. You can find a full list in the NIM configuration documentation.

To help interface with this framework, the langchain-nvidia-ai-endpoints package provides connectors like ChatNVIDIA and NVIDIAEmbeddings for the raw endpoints. These will be used throughout the course to power our RAG pipeline!


4. Querying LoRA for Inference

Check available LoRA models

Once the NIM server is up and running, check the available models as follows:

[ ]
[ ]
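NIM exposes an OpenAI-compatible API, so the model list can be fetched from the /v1/models endpoint. This is a sketch using the standard-library urllib; the base URL matches the port mapping from the docker run command above:

```python
import json
import urllib.request

def model_ids(payload: dict) -> list[str]:
    """Extract model ids from an OpenAI-style /v1/models response."""
    return [m["id"] for m in payload.get("data", [])]

def list_models(base_url: str = "http://localhost:8000") -> list[str]:
    """Query the NIM server's OpenAI-compatible /v1/models endpoint."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=10) as r:
        return model_ids(json.load(r))

# With the server running, the list should include the base model plus
# every LoRA adapter found in NIM_PEFT_SOURCE:
# print(list_models())
```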

Query the LoRA

Create a prompt template; ideally, this should be the same as the training template.

[ ]
[ ]
[ ]
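A query against the deployed LoRA can be sketched as an OpenAI-style completion request, where the adapter is selected by passing its model-store folder name in the model field. The helper names, prompt, and parameter values below are illustrative assumptions:

```python
import json
import urllib.request

def build_payload(prompt: str, model: str, max_tokens: int = 512) -> dict:
    """OpenAI-style completion request; the LoRA adapter is selected via `model`."""
    return {"model": model, "prompt": prompt,
            "max_tokens": max_tokens,
            "temperature": 0.0}  # deterministic output suits extraction tasks

def query_lora(prompt: str, model: str,
               base_url: str = "http://localhost:8000") -> str:
    """POST the request to the NIM completions endpoint and return the text."""
    body = json.dumps(build_payload(prompt, model)).encode("utf-8")
    req = urllib.request.Request(f"{base_url}/v1/completions", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=120) as r:
        return json.load(r)["choices"][0]["text"]

# Example (requires the NIM server from section 3 to be running):
# print(query_lora("Extract triplets from: <SEC item 1 text>", "llama3-8b-pubmed-qa"))
```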