
GPT from Scratch using Lightning AI and Lance!

This notebook follows the code I wrote for my talk at the Lightning AI meetup in London on 8th November. It implements a GPT model from scratch (including all the modules like CausalAttention, MultiHeadedAttention, and FFN), binds everything together in a Lightning module, and trains it with the Lightning Trainer.

Notes on Data Tokenization and Lance

I am using Lance to load the training data. It is a modern columnar data format for ML and LLMs, implemented in Rust.

The problem was that, with limited memory and compute, we can't load the entire TinyStories dataset (about 2.2 GB) into memory and tokenize it in one go. The solution was to pre-tokenize the dataset, convert it into a PyArrow table, and save it in the Lance format.

Lance essentially allows us to load only the rows at specific indices at any given moment, instead of loading the entire dataset and maxing out the memory.
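
As a minimal illustration of that random access (the file name `tinystories.lance` is an assumption; we create it below):

```python
import lance

# Open the dataset lazily -- this does not read the data into memory.
ds = lance.dataset("tinystories.lance")

# Fetch only the rows at the given indices, returned as a PyArrow table.
batch = ds.take([0, 42, 1337])
print(batch.num_rows)  # 3
```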

If you want to play with Lance and check out its other use cases, see its repository: https://github.com/lancedb/lance/

Installing dependencies

[1]
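
The install cell isn't reproduced here; based on the imports used later, it is likely something along these lines (package names are my assumption; `pylance` is the PyPI name for Lance's Python bindings):

```python
!pip install pylance pyarrow lightning transformers datasets
```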

Importing libraries

[2]
/opt/conda/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.23.5
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
Downloading (…)olve/main/vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]
Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]
Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]
Downloading (…)lve/main/config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

Dataset Creation

We load the TinyStories dataset, tokenize 100K sentences from it, and save the tokens as a PyArrow table in a Lance file.
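
A hedged sketch of that step, assuming the GPT-2 tokenizer from `transformers` and a single `tokens` column (column name, file name, and sample count are illustrative):

```python
import lance
import pyarrow as pa
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
dataset = load_dataset("roneneldan/TinyStories", split="train")

# Tokenize a slice of the dataset and flatten it into one long token stream.
tokens = []
for sample in dataset.select(range(100_000)):
    tokens.extend(tokenizer.encode(sample["text"]))

# Wrap the tokens in a single-column PyArrow table and write it in Lance format.
table = pa.table({"tokens": pa.array(tokens, type=pa.int64())})
lance.write_dataset(table, "tinystories.lance")
```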

[3]
Downloading and preparing dataset text/roneneldan--TinyStories to /root/.cache/huggingface/datasets/text/roneneldan--TinyStories-e7877524f0320955/0.0.0/4b86d314f7236db91f0a0f5cda32d4375445e64c5eda2692655dd99c2dac68e8...
Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]
Downloading data:   0%|          | 0.00/2.23G [00:00<?, ?B/s]
Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]
Dataset text downloaded and prepared to /root/.cache/huggingface/datasets/text/roneneldan--TinyStories-e7877524f0320955/0.0.0/4b86d314f7236db91f0a0f5cda32d4375445e64c5eda2692655dd99c2dac68e8. Subsequent calls will reuse this data.
  0%|          | 0/1 [00:00<?, ?it/s]
[4]
  0%|          | 0/1000 [00:00<?, ?it/s]
Total tokens in tokenized dataset: 31,603

Model and Training

[5]
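
The config cell isn't shown here; below is a sketch of what a Config class like the one referenced later might hold. Every value is an assumption, loosely back-solved from the parameter counts in the training log (e.g. 50257 × 256 ≈ 12.9 M token-embedding params, 12 blocks × 12 × 256² ≈ 9.4 M backbone params):

```python
from dataclasses import dataclass

@dataclass
class Config:
    vocab_size: int = 50257   # GPT-2 tokenizer vocabulary
    block_size: int = 224     # context length (57.3 K positional params / 256)
    n_embed: int = 256        # embedding dimension
    n_heads: int = 8          # attention heads per block
    n_blocks: int = 12        # transformer blocks
    dropout: float = 0.1
    lr: float = 3e-4          # illustrative learning rate
    epochs: int = 50
```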

Attention - CausalAttentionHead

[6]
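
The cell itself isn't reproduced; here is a minimal sketch of a single causal attention head under the Config assumptions above, masking future positions with a lower-triangular buffer:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalAttentionHead(nn.Module):
    def __init__(self, config):
        super().__init__()
        head_size = config.n_embed // config.n_heads
        self.key = nn.Linear(config.n_embed, head_size, bias=False)
        self.query = nn.Linear(config.n_embed, head_size, bias=False)
        self.value = nn.Linear(config.n_embed, head_size, bias=False)
        # Lower-triangular mask: position t may only attend to positions <= t.
        self.register_buffer(
            "mask", torch.tril(torch.ones(config.block_size, config.block_size))
        )
        self.dropout = nn.Dropout(config.dropout)

    def forward(self, x):
        B, T, C = x.shape
        k, q, v = self.key(x), self.query(x), self.value(x)
        # Scaled dot-product scores, with future positions masked out.
        scores = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5
        scores = scores.masked_fill(self.mask[:T, :T] == 0, float("-inf"))
        attn = self.dropout(F.softmax(scores, dim=-1))
        return attn @ v  # (B, T, head_size)
```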

MultiHeadedAttention

[7]
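
A sketch of multi-headed attention as the straightforward composition of the heads above: run `n_heads` causal heads in parallel, concatenate, and project back to the embedding dimension.

```python
class MultiHeadedAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.heads = nn.ModuleList(
            [CausalAttentionHead(config) for _ in range(config.n_heads)]
        )
        self.proj = nn.Linear(config.n_embed, config.n_embed)
        self.dropout = nn.Dropout(config.dropout)

    def forward(self, x):
        # Each head returns (B, T, head_size); concatenation restores (B, T, n_embed).
        out = torch.cat([head(x) for head in self.heads], dim=-1)
        return self.dropout(self.proj(out))
```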

FeedForward Network

[8]
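
A sketch of the position-wise feed-forward network, using the conventional 4× expansion (the activation choice is an assumption):

```python
class FeedForward(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(config.n_embed, 4 * config.n_embed),  # expand 4x
            nn.GELU(),
            nn.Linear(4 * config.n_embed, config.n_embed),  # project back
            nn.Dropout(config.dropout),
        )

    def forward(self, x):
        return self.net(x)
```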

One Single Block of the GPT model

[9]
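
A sketch of one block, combining the attention and FFN modules with pre-layer-norm residual connections (the pre-norm arrangement is an assumption):

```python
class Block(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.attn = MultiHeadedAttention(config)
        self.ffn = FeedForward(config)
        self.ln1 = nn.LayerNorm(config.n_embed)
        self.ln2 = nn.LayerNorm(config.n_embed)

    def forward(self, x):
        x = x + self.attn(self.ln1(x))  # residual around attention
        x = x + self.ffn(self.ln2(x))   # residual around the MLP
        return x
```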

Entire GPT model, end-to-end

[10]
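
A sketch of the full model as a LightningModule. The submodule names match the ones in the training log below (`token_embedding`, `positional_embedding`, `backbone`, `lm_head`); the loss wiring and optimizer choice are assumptions.

```python
import lightning as L

class GPT(L.LightningModule):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.token_embedding = nn.Embedding(config.vocab_size, config.n_embed)
        self.positional_embedding = nn.Embedding(config.block_size, config.n_embed)
        self.backbone = nn.Sequential(*[Block(config) for _ in range(config.n_blocks)])
        self.lm_head = nn.Linear(config.n_embed, config.vocab_size)

    def forward(self, idx):
        B, T = idx.shape
        pos = torch.arange(T, device=idx.device)
        x = self.token_embedding(idx) + self.positional_embedding(pos)
        return self.lm_head(self.backbone(x))  # (B, T, vocab_size) logits

    def training_step(self, batch, batch_idx):
        x, y = batch  # y is x shifted left by one token
        logits = self(x)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
        self.log("train_loss", loss, prog_bar=True)
        return loss

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=self.config.lr)
```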

GPTDataset

For fast and efficient data loading, thanks to Lance!

[11]
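
A hedged sketch of how such a dataset can use Lance's random access: each item is a `block_size + 1` window of the token stream, fetched with `take` so that only that window is ever read from disk.

```python
import lance
import numpy as np
import torch
from torch.utils.data import Dataset

class GPTDataset(Dataset):
    def __init__(self, lance_path, block_size):
        self.ds = lance.dataset(lance_path)
        self.block_size = block_size
        self.num_tokens = self.ds.count_rows()

    def __len__(self):
        return self.num_tokens - self.block_size - 1

    def __getitem__(self, idx):
        # Read only the window we need; the rest of the file stays on disk.
        indices = np.arange(idx, idx + self.block_size + 1).tolist()
        window = self.ds.take(indices).column("tokens").to_numpy()
        x = torch.from_numpy(window[:-1].astype(np.int64))  # inputs
        y = torch.from_numpy(window[1:].astype(np.int64))   # next-token targets
        return x, y
```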

Finally, let's train the model!

We'll train the model for 50 epochs, which should take ~5 hours. Change the number of epochs and other hyperparameters in the Config class if you are training longer or on more powerful hardware.
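
A minimal sketch of the training wiring, using the pieces defined above (batch size and `accelerator` setting are assumptions):

```python
from torch.utils.data import DataLoader

config = Config()
model = GPT(config)

train_ds = GPTDataset("tinystories.lance", config.block_size)
train_loader = DataLoader(train_ds, batch_size=32, shuffle=True)

# Lightning handles device placement, the training loop, and checkpointing.
trainer = L.Trainer(max_epochs=config.epochs, accelerator="auto")
trainer.fit(model, train_loader)
```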

[12]
INFO: GPU available: True (cuda), used: True
INFO: TPU available: False, using: 0 TPU cores
INFO: IPU available: False, using: 0 IPUs
INFO: HPU available: False, using: 0 HPUs
WARNING: Missing logger folder: /kaggle/working/lightning_logs
INFO: LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO: 
  | Name                 | Type       | Params
----------------------------------------------------
0 | token_embedding      | Embedding  | 12.9 M
1 | positional_embedding | Embedding  | 57.3 K
2 | backbone             | Sequential | 9.4 M 
3 | lm_head              | Linear     | 12.9 M
----------------------------------------------------
35.3 M    Trainable params
0         Non-trainable params
35.3 M    Total params
141.128   Total estimated model params size (MB)
/opt/conda/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:441: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=3` in the `DataLoader` to improve performance.
Training: |          | 0/? [00:00<?, ?it/s]
INFO: `Trainer.fit` stopped: `max_epochs=50` reached.

Generate some text!

Let's see how much our model has learned from the text data by asking it to generate some text given a prompt.
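
A sketch of a simple autoregressive sampling loop (the `generate` helper, multinomial sampling, and the prompt are my assumptions, not the notebook's exact code):

```python
@torch.no_grad()
def generate(model, tokenizer, prompt, max_new_tokens=200):
    model.eval()
    idx = torch.tensor([tokenizer.encode(prompt)], device=model.device)
    for _ in range(max_new_tokens):
        # Crop context to block_size and take the logits for the last position.
        logits = model(idx[:, -model.config.block_size:])[:, -1, :]
        probs = F.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_token], dim=1)
    return tokenizer.decode(idx[0].tolist())

print(generate(model, tokenizer, "My cat is"))
```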

[13]
My cat is very special. She named her angel Lily.One day, Anna and Lily went to the park with her mom. They saw a big slide, a swing, and a sandbox. Anna wanted to play with everything. Anna has a swing, but they were playing in the slide, but they all lived happily in the slide. She asked her mom, "Can I go on the slide, mom?""Yes, Anna. She likes to play too," her mom said.Anna nodded and slid down. She laughed and said, "Whee! That was fun, Lily!"Anna was fun, "Look, you can fly like a real angel!"Then she went to the sandbox. She pushed Lily on top and said, "Look, Lily, Lily. She is my angel!"Anna was having a lot of fun. She put Lily on top and said, "You are the castle, Lily!"Anna was having a lot of fun. But she did not see the unknown boy who came to the sandbox. He was bigger than Anna and wanted to take her. He saw Lily from the castle and