Inference


Accelerated Inference With PEFT'd StarCoder2

In the previous notebook, we showed how to parameter-efficiently fine-tune the StarCoder2 model on a custom dataset of code (instruction, completion) pairs, choosing LoRA as the PEFT algorithm and fine-tuning for 50 iterations. In this notebook, the goal is to demonstrate how to compile the fine-tuned .nemo model into optimized TensorRT-LLM engines. The converted model engine can perform accelerated inference locally or be deployed to Triton Inference Server.

Export Model Via TensorRT-LLM

NVIDIA TensorRT-LLM is an open-source library that accelerates and optimizes the inference performance of the latest LLMs on supported AI platforms. The NVIDIA NeMo framework offers TensorRT-LLM as a user-friendly tool for compiling .nemo models into optimized engines. To start, let's create a folder where the exported model files will be saved.

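A minimal sketch of this step using only the standard library; the directory name `starcoder2_trt_llm` is an illustrative choice, not mandated by NeMo:

```python
# Create a local directory to hold the exported TensorRT-LLM engine files.
# "starcoder2_trt_llm" is a hypothetical path; adjust it to your environment.
from pathlib import Path

engine_dir = Path("starcoder2_trt_llm")
engine_dir.mkdir(parents=True, exist_ok=True)
print(engine_dir.resolve())
```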

Next, we need to create an instance of the TensorRTLLM class and call the TensorRTLLM.export() function with the nemo_checkpoint_path pointing to the LoRA fine-tuned .nemo checkpoint.

After the export completes, a few files will be stored in the folder we just created: an engine file that holds the weights and the compiled execution graph of the model, a tokenizer.model file containing the tokenizer information, and a config.json file recording metadata about the model (along with model.cache, which caches some operations to speed up re-compiling the model in the future).

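A hedged sketch of the export call, assuming the NeMo export interface (`nemo.export.TensorRTLLM`); the directory and checkpoint paths are illustrative, and keyword names may differ slightly across NeMo versions:

```python
# Export the LoRA fine-tuned .nemo checkpoint into TensorRT-LLM engines.
# Paths below are hypothetical placeholders for your own environment.
from nemo.export import TensorRTLLM

trt_llm_exporter = TensorRTLLM(model_dir="starcoder2_trt_llm")
trt_llm_exporter.export(
    nemo_checkpoint_path="starcoder2_lora.nemo",  # the PEFT'd checkpoint from the previous notebook
    model_type="starcoder",
    n_gpus=1,
)
```

The `model_dir` passed to the constructor is the folder created above; after `export()` returns, it contains the engine, tokenizer, and config files described in the text.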

After the fine-tuned model is exported into TensorRT-LLM optimized engines, we can perform accelerated inference.

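A sketch of local accelerated inference, again assuming the NeMo `TensorRTLLM` interface; pointing the constructor at the export directory loads the previously built engine, and the prompt is illustrative:

```python
# Load the exported engine and generate a code completion locally.
# Directory name and prompt are hypothetical examples.
from nemo.export import TensorRTLLM

trt_llm_exporter = TensorRTLLM(model_dir="starcoder2_trt_llm")
output = trt_llm_exporter.forward(["def fibonacci(n):"])
print(output)
```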

Another code generation example:

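The same pattern with a different illustrative prompt (an instruction-style input, matching the (instruction, completion) pairs used during fine-tuning):

```python
# A second generation example against the same exported engine.
# Prompt and paths are hypothetical.
from nemo.export import TensorRTLLM

trt_llm_exporter = TensorRTLLM(model_dir="starcoder2_trt_llm")
output = trt_llm_exporter.forward(
    ["Write a Python function that reverses a string."]
)
print(output)
```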

Deploy Model Using Triton Inference Server

Lastly, we can easily deploy the fine-tuned model as a service using Triton Inference Server:

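A hedged sketch of the deployment step, assuming the NeMo deploy interface (`nemo.deploy.DeployPyTriton`); the model name and engine directory are illustrative placeholders:

```python
# Serve the exported engine behind Triton Inference Server.
# "starcoder2_trt_llm" and "starcoder2" are hypothetical names.
from nemo.deploy import DeployPyTriton
from nemo.export import TensorRTLLM

trt_llm_exporter = TensorRTLLM(model_dir="starcoder2_trt_llm")
nm = DeployPyTriton(model=trt_llm_exporter, triton_model_name="starcoder2")
nm.deploy()  # register the model with Triton
nm.serve()   # blocks, serving inference requests until interrupted
```

Once serving, clients can send prompts to the Triton endpoint under the registered model name instead of running the engine in-process.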