
Fine-tune Llama-2-7b using QLoRA on Google Colab

Running large language models (LLMs) requires significant GPU power and memory, which can be costly. To improve performance and reduce costs, lightweight approaches to LLMs are being explored. This blog covers key techniques for fine-tuning and deploying LLMs more efficiently and affordably.

This example shows how to fine-tune the Llama-2-7B model using instruction tuning with PEFT and QLoRA.


Setup

Run the cells below to set up and install the required libraries. For our experiment we will need accelerate, peft, transformers, datasets, and TRL to leverage the recent SFTTrainer. We will use bitsandbytes to quantize the base model into 4-bit. We will also install einops, as it is required to load Falcon models.
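The install cell itself is hidden above; a minimal equivalent is sketched below (versions are left unpinned here, which is an assumption — pin them for reproducibility):

```shell
# Install the stack used in this notebook (prefix with "!" inside Colab)
pip install -q -U accelerate peft bitsandbytes transformers datasets trl einops
```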

[ ]
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 150.9/150.9 kB 3.6 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 8.4/8.4 MB 30.7 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 270.9/270.9 kB 31.1 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 507.1/507.1 kB 32.9 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 79.7/79.7 kB 9.4 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 115.3/115.3 kB 14.9 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 134.8/134.8 kB 16.6 MB/s eta 0:00:00
  Building wheel for peft (pyproject.toml) ... done
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 105.0/105.0 MB 8.6 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 44.6/44.6 kB 5.4 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.2/2.2 MB 82.9 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 196.4/196.4 kB 23.3 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 254.1/254.1 kB 22.7 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 62.7/62.7 kB 7.1 MB/s eta 0:00:00
[ ]

Connect with HF account

Follow the instructions to generate a Hugging Face token with write access:

Create an account on Hugging Face if you don't already have one. If you do, then:

Go to Profile -> Access Tokens -> Create a token with write access -> Copy it and use it to log in.
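In a notebook, the login step can be done with `huggingface_hub`; a minimal sketch:

```python
# Authenticate with the Hugging Face Hub (write access is needed later
# to push the trained adapter).
from huggingface_hub import notebook_login

# In Colab, this opens a widget where you paste your write-access token;
# uncomment and run it inside the notebook:
# notebook_login()
```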

[ ]
[ ]
Wed Jan 24 11:09:45 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   46C    P8              10W /  70W |      3MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

Download data for tuning

[ ]
Downloading...
From: https://drive.google.com/uc?id=1tiAscG941evQS8RzjznoPu8meu4unw5A
To: /content/ecommerce-faq.json
100% 21.0k/21.0k [00:00<00:00, 45.3MB/s]

Data loading and understanding

[ ]
[ ]
{'question': 'How can I create an account?',
 'answer': "To create an account, click on the 'Sign Up' button on the top "
           'right corner of our website and follow the instructions to '
           'complete the registration process.'}
[ ]
{'question': 'What payment methods do you accept?',
 'answer': 'We accept major credit cards, debit cards, and PayPal as payment '
           'methods for online orders.'}
[ ]
{'question': 'What is your return policy?',
 'answer': 'Our return policy allows you to return products within 30 days of '
           'purchase for a full refund, provided they are in their original '
           'condition and packaging. Please refer to our Returns page for '
           'detailed instructions.'}
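The FAQ file is a JSON document of question/answer pairs like the ones shown above. A minimal loading sketch — the inline sample and the top-level `"questions"` key are illustrative assumptions; in the notebook you would read the downloaded `/content/ecommerce-faq.json` instead:

```python
import json

# Illustrative sample mirroring the structure shown above; in the notebook:
#   with open("/content/ecommerce-faq.json") as f:
#       data = json.load(f)
data = {
    "questions": [
        {
            "question": "How can I create an account?",
            "answer": "To create an account, click on the 'Sign Up' button ...",
        }
    ]
}

for pair in data["questions"]:
    print(pair["question"], "->", pair["answer"][:40])
```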
[ ]
[ ]

We’re using a sharded model, which means the checkpoint is split into multiple parts — 14 pieces in this case. Sharding lets different pieces be loaded into different memory types, like GPU or CPU memory, so you can load and fine-tune a large model even with limited memory. That is why we use the sharded approach.

Changing the quantization type

The 4-bit integration comes with two different quantization types: FP4 and NF4. The NF4 dtype stands for Normal Float 4 and was introduced in the QLoRA paper.

You can switch between these two dtypes using bnb_4bit_quant_type in BitsAndBytesConfig. By default, FP4 quantization is used.

[ ]
/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_token.py:88: UserWarning: 
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
  warnings.warn(
config.json:   0%|          | 0.00/626 [00:00<?, ?B/s]
model.safetensors.index.json:   0%|          | 0.00/28.1k [00:00<?, ?B/s]
Downloading shards:   0%|          | 0/14 [00:00<?, ?it/s]
model-00001-of-00014.safetensors:   0%|          | 0.00/981M [00:00<?, ?B/s]
model-00002-of-00014.safetensors:   0%|          | 0.00/967M [00:00<?, ?B/s]
model-00003-of-00014.safetensors:   0%|          | 0.00/967M [00:00<?, ?B/s]
model-00004-of-00014.safetensors:   0%|          | 0.00/990M [00:00<?, ?B/s]
model-00005-of-00014.safetensors:   0%|          | 0.00/944M [00:00<?, ?B/s]
model-00006-of-00014.safetensors:   0%|          | 0.00/990M [00:00<?, ?B/s]
model-00007-of-00014.safetensors:   0%|          | 0.00/967M [00:00<?, ?B/s]
model-00008-of-00014.safetensors:   0%|          | 0.00/967M [00:00<?, ?B/s]
model-00009-of-00014.safetensors:   0%|          | 0.00/990M [00:00<?, ?B/s]
model-00010-of-00014.safetensors:   0%|          | 0.00/944M [00:00<?, ?B/s]
model-00011-of-00014.safetensors:   0%|          | 0.00/990M [00:00<?, ?B/s]
model-00012-of-00014.safetensors:   0%|          | 0.00/967M [00:00<?, ?B/s]
model-00013-of-00014.safetensors:   0%|          | 0.00/967M [00:00<?, ?B/s]
model-00014-of-00014.safetensors:   0%|          | 0.00/847M [00:00<?, ?B/s]
Loading checkpoint shards:   0%|          | 0/14 [00:00<?, ?it/s]
generation_config.json:   0%|          | 0.00/132 [00:00<?, ?B/s]
tokenizer_config.json:   0%|          | 0.00/676 [00:00<?, ?B/s]
tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]
tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]
special_tokens_map.json:   0%|          | 0.00/411 [00:00<?, ?B/s]
[ ]
[ ]

Lora Config

LoraConfig allows you to control how LoRA is applied to the base model through the following parameters:

r - the rank of the update matrices, expressed as an int. A lower rank results in smaller update matrices with fewer trainable parameters.

target_modules - the modules (for example, attention blocks) to apply the LoRA update matrices to.

alpha - LoRA scaling factor.

bias - specifies whether the bias parameters should be trained. Can be 'none', 'all' or 'lora_only'.

modules_to_save - list of modules apart from the LoRA layers to be set as trainable and saved in the final checkpoint. These typically include the model's custom head that is randomly initialized for the fine-tuning task.

layers_to_transform - list of layers to be transformed by LoRA. If not specified, all layers in target_modules are transformed.

layers_pattern - pattern to match layer names in target_modules, if layers_to_transform is specified. By default PeftModel looks at common layer patterns (layers, h, blocks, etc.); use this for exotic and custom models.

rank_pattern - mapping from layer names or regular expressions to ranks that differ from the default rank specified by r.

alpha_pattern - mapping from layer names or regular expressions to alphas that differ from the default alpha specified by lora_alpha.

[ ]
Trainable params: 33554432 || All params: 3533967360 || Trainable%: 0.9494833591219133

Inference Before Training

[ ]
: How can I create an account?
:
[ ]
[ ]
GenerationConfig {
  "bos_token_id": 1,
  "eos_token_id": 2,
  "max_new_tokens": 80,
  "pad_token_id": 2,
  "temperature": 0.7,
  "top_p": 0.7
}
[ ]
/usr/local/lib/python3.10/dist-packages/transformers/generation/configuration_utils.py:392: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.7` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/transformers/generation/configuration_utils.py:397: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `0.7` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`.
  warnings.warn(
: How can I create an account?
: How can I create an account?
: How can I create an account? : How can I create an account?
CPU times: user 6.25 s, sys: 826 ms, total: 7.08 s
Wall time: 11.4 s

Build HuggingFace Dataset format

[ ]
Generating train split: 0 examples [00:00, ? examples/s]
[ ]
DatasetDict({
    train: Dataset({
        features: ['answer', 'question'],
        num_rows: 79
    })
})
[ ]
{'answer': "To create an account, click on the 'Sign Up' button on the top right corner of our website and follow the instructions to complete the registration process.",
 'question': 'How can I create an account?'}
[ ]
[ ]
Map:   0%|          | 0/79 [00:00<?, ? examples/s]
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
[ ]
Dataset({
    features: ['answer', 'question', 'input_ids', 'attention_mask'],
    num_rows: 79
})
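The mapping step joins each question/answer pair into a single training prompt before tokenization. A pure-Python sketch of the formatting — the role labels used here are an assumption, since the notebook's actual template tags were stripped from the rendered output:

```python
def format_prompt(question: str, answer: str) -> str:
    """Join one FAQ pair into an instruction-style training example.

    The "question:"/"answer:" labels are illustrative; the notebook's
    real template may use different role tags.
    """
    return f"question: {question}\nanswer: {answer}"

example = format_prompt(
    "How can I create an account?",
    "To create an account, click on the 'Sign Up' button ...",
)
print(example.splitlines()[0])  # question: How can I create an account?
```

In the notebook this formatted string is then tokenized (producing `input_ids` and `attention_mask`) via `Dataset.map`.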
[ ]
[ ]
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
TrainOutput(global_step=80, training_loss=0.8391439635306597, metrics={'train_runtime': 447.6633, 'train_samples_per_second': 0.715, 'train_steps_per_second': 0.179, 'total_flos': 649997819142144.0, 'train_loss': 0.8391439635306597, 'epoch': 4.05})
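The reported epoch count of 4.05 is consistent with 80 optimizer steps over the 79 training examples at an effective batch size of 4 — the batch size is inferred from the numbers, not stated in the notebook:

```python
# 79 examples at an assumed effective batch size of 4
# gives 79 / 4 = 19.75 optimizer steps per epoch.
num_examples = 79
effective_batch_size = 4
steps = 80

epochs = steps / (num_examples / effective_batch_size)
print(round(epochs, 2))  # 4.05
```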

Training ran for 80 steps (about 4 epochs), and the loss decreased steadily, indicating room for further improvement.

NOTE: Consider extending training to a higher number of epochs for potential gains.

Save model in local system

[ ]

Push trained model in Hugging face

NOTE: Here you have to change the repository ID to where you want to push your model.

For me it is "Prasant/Llama2-7b-qlora-chat-support-bot-faq"

[ ]
adapter_model.safetensors:   0%|          | 0.00/134M [00:00<?, ?B/s]
CommitInfo(commit_url='https://huggingface.co/Prasant/Llama2-7b-qlora-chat-support-bot-faq/commit/afdc083726f49ccf925eda01e564e2a9520d92f3', commit_message='Upload model', commit_description='', oid='afdc083726f49ccf925eda01e564e2a9520d92f3', pr_url=None, pr_revision=None, pr_num=None)

In our approach, the large model TinyPixel/Llama-2-7B-bf16 is split into 14 smaller parts, a method known as sharding. This strategy works well with Hugging Face's Accelerate framework.

Each shard holds part of the model's weights, and Accelerate distributes these parts across different memory types, like GPU and CPU memory. This way, we can handle large models without needing too much memory.

Load pushed model

Load the model from the repository you pushed to; for me it is "Prasant/Llama2-7b-qlora-chat-support-bot-faq"

[ ]
adapter_config.json:   0%|          | 0.00/608 [00:00<?, ?B/s]
Loading checkpoint shards:   0%|          | 0/14 [00:00<?, ?it/s]
adapter_model.safetensors:   0%|          | 0.00/134M [00:00<?, ?B/s]

Experiment with the parameters to see what works best for you and your data
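The generation settings printed earlier can be reconstructed like this; setting `do_sample=True` is a suggested change so that temperature and top_p actually take effect, per the warnings shown before training:

```python
from transformers import GenerationConfig

# Mirrors the config printed earlier in the notebook; do_sample=True is
# added so temperature/top_p are used (they are ignored in greedy mode).
generation_config = GenerationConfig(
    do_sample=True,
    temperature=0.7,
    top_p=0.7,
    max_new_tokens=80,
    pad_token_id=2,
    eos_token_id=2,
    bos_token_id=1,
)
```

Pass it as `generation_config=generation_config` to `model.generate` and vary temperature, top_p, and max_new_tokens to tune answer style and length.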

[ ]
[ ]
[ ]
: How can I create an account?
: To create an account, click on the 'Sign Up' button on the top right corner of the website. Follow the instructions to complete the registration process.
: You can place an order by adding items to your shopping cart and proceeding to
CPU times: user 4.37 s, sys: 252 ms, total: 4.62 s
Wall time: 4.68 s
[ ]
[ ]
Question: Can I return a product if it was a clearance or final sale item?
: Clearance or final sale items are typically non-returnable. Please refer to the product description or contact our customer support team for specific return instructions.
: You can request a return by contacting our customer support team. We will provide you with
[ ]
Question: What happens when I return a clearance item?
: Clearance items are non-refundable and non-exchangeable. However, you can request a store credit for the full value of the item. Please contact our customer support team for assistance.
: We accept returns within 30 days
[ ]
Question: How do I know when I'll receive my order?
: Once you place an order, we will send you a confirmation email with your order details and estimated delivery time. You can track your order's progress by logging into your account or checking your order confirmation email.
: If you need to
[ ]

That's it! You can play with these hyperparameters to achieve better results 🎉

If you liked this guide, do consider giving a 🌟 to LanceDB's vector-recipes.