Nucleotide Transformer Dna Sequence Modelling With Peft


If you're opening this notebook on Colab, you will probably need to install 🤗 Transformers as well as some other libraries. Uncomment the following cell and run it.

[ ]

If you're opening this notebook locally, make sure your environment has the latest versions of those libraries installed.

To be able to share your model with the community and generate results like the one shown in the picture below via the inference API, there are a few more steps to follow.

First, you have to log in to the Hugging Face Hub:

[ ]
Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful

Then you need to install Git-LFS. Uncomment the following instructions:

[ ]
Reading package lists... Done
Building dependency tree       
Reading state information... Done
git-lfs is already the newest version (2.9.2-1).
0 upgraded, 0 newly installed, 0 to remove and 13 not upgraded.

Fine-tuning the Nucleotide Transformer with LoRA

The Nucleotide Transformer paper (Dalla-Torre et al., 2023) introduces 4 genomics foundation models developed by InstaDeep. These transformers, of various sizes and trained on different datasets, produce powerful representations of DNA sequences that can be used to tackle a very diverse set of problems, such as chromatin accessibility, deleteriousness prediction, and promoter and enhancer prediction. These representations can be extracted from the transformer and used as proxies for the DNA sequences (this is called probing), or the transformer can be trained further on a specific task (this is called fine-tuning).

Figure_1.png

This notebook allows you to fine-tune one of these models.

LoRA: Low-Rank Adaptation of Large Language Models is one of the state-of-the-art parameter-efficient fine-tuning methods and is explained in detail in this blog post. Any transformer model can be fine-tuned with this method with very little effort using the 🤗 PEFT library together with 🤗 Transformers, which is why it is used in this notebook instead of the IA³ technique presented in the original paper.
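To make the idea concrete, here is a minimal pure-Python sketch of the low-rank update that LoRA adds to a frozen linear layer: the output becomes `W x + (alpha / r) * B (A x)`, where only `A` and `B` are trained. The helper names, the toy matrices and the `alpha` value are illustrative assumptions; this is not the PEFT implementation.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, r):
    """Frozen layer output plus the scaled low-rank correction B(Ax)."""
    base = matvec(W, x)                  # frozen pre-trained weights
    low_rank = matvec(B, matvec(A, x))   # A has shape r x d_in, B has shape d_out x r
    scale = alpha / r
    return [b + scale * l for b, l in zip(base, low_rank)]


# Toy example with d_in = d_out = 2 and rank r = 1 (all values are made up).
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]
B_init = [[0.0], [0.0]]      # with B at zero the adapter leaves the layer unchanged
B_trained = [[1.0], [0.0]]   # hypothetical values after some training
x = [1.0, 2.0]

out_init = lora_forward(W, A, B_init, x, alpha=2.0, r=1)
out_trained = lora_forward(W, A, B_trained, x, alpha=2.0, r=1)
print(out_init, out_trained)
```

With a rank of 1, each adapted 1280 x 1280 projection only needs two vectors of length 1280, which is why the number of trainable parameters stays small.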

The model we are going to use is the 500M Human Ref model, a 500M-parameter transformer pre-trained on the human reference genome, following the training methodology presented in the Nucleotide Transformer paper. It is one of the 4 models introduced, all available on the InstaDeep Hugging Face page:

| Model name          | Num layers | Num parameters | Training dataset       |
|---------------------|------------|----------------|------------------------|
| `500M Human Ref`    | 24         | 500M           | Human reference genome |
| `500M 1000G`        | 24         | 500M           | 1000G genomes          |
| `2.5B 1000G`        | 32         | 2.5B           | 1000G genomes          |
| `2.5B Multispecies` | 32         | 2.5B           | Multi-species dataset  |

Note that even though the fine-tuning is done with a parameter-efficient method, using the larger checkpoints will still require more GPU memory and lead to longer fine-tuning times.

In the following, we showcase the Nucleotide Transformer's ability to classify genomic sequences for two of the most basic genomic motifs: promoters and enhancer types. Both are classification tasks, but the enhancer types task is much more challenging because it has 3 classes.

These two tasks are very basic, but the Nucleotide Transformer models have been shown to match or beat state-of-the-art models on much more complex tasks, such as DeepSEA, which predicts 919 chromatin profiles from a diverse set of human cells and tissues from a single DNA sequence, or DeepSTARR, which predicts an enhancer's activity.

Importing required packages and setting up PEFT model

Import and install

[ ]
[ ]

Prepare and create the model for fine-tuning

The nucleotide transformer will be fine-tuned on two classification tasks: promoter and enhancer types classification. The AutoModelForSequenceClassification module automatically loads the model and adds a simple classification head on top of the final embeddings.
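As a hedged sketch of this step, the helper below loads the checkpoint named in the log output of this notebook with a freshly initialized classification head. The function name `load_classifier` is hypothetical, and the import is kept inside the function so the definition can be read without the library installed.

```python
MODEL_NAME = "InstaDeepAI/nucleotide-transformer-500m-human-ref"


def load_classifier(num_labels, model_name=MODEL_NAME):
    """Load the pre-trained encoder with a new classification head.

    num_labels is 2 for promoter prediction and 3 for enhancer types.
    """
    # Lazy import: only needed when the function is actually called.
    from transformers import AutoModelForSequenceClassification

    return AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=num_labels
    )
```

Calling `load_classifier(2)` downloads the weights and is expected to emit the "newly initialized" warning about the classifier weights, since the head is not part of the pre-trained checkpoint.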

[ ]
Some weights of the model checkpoint at InstaDeepAI/nucleotide-transformer-500m-human-ref were not used when initializing EsmForSequenceClassification: ['lm_head.layer_norm.weight', 'lm_head.bias', 'lm_head.dense.bias', 'lm_head.decoder.weight', 'lm_head.dense.weight', 'lm_head.layer_norm.bias']
- This IS expected if you are initializing EsmForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing EsmForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of EsmForSequenceClassification were not initialized from the model checkpoint at InstaDeepAI/nucleotide-transformer-500m-human-ref and are newly initialized: ['classifier.out_proj.bias', 'classifier.dense.weight', 'classifier.out_proj.weight', 'classifier.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

The LoRA parameters are now added to the model, and the parameters that will be finetuned are indicated.
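A possible configuration is sketched below. The rank of 1, the dropout of 0.1 and the `query` and `value` target modules are read off the model printout in this notebook; the `lora_alpha` value and the function name `add_lora_adapters` are assumptions made for illustration.

```python
def add_lora_adapters(model, rank=1, lora_alpha=32, lora_dropout=0.1):
    """Wrap a sequence-classification model with LoRA adapters."""
    # Lazy import: only needed when the function is actually called.
    from peft import LoraConfig, TaskType, get_peft_model

    config = LoraConfig(
        task_type=TaskType.SEQ_CLS,          # keeps the classification head trainable
        inference_mode=False,
        r=rank,                              # rank of the low-rank matrices
        lora_alpha=lora_alpha,               # scaling factor (assumed value)
        lora_dropout=lora_dropout,
        target_modules=["query", "value"],   # attention projections to adapt
    )
    peft_model = get_peft_model(model, config)
    peft_model.print_trainable_parameters()  # prints the trainable/total summary
    return peft_model
```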

[ ]
[ ]
trainable params: 3407364 || all params: 482205925 || trainable%: 0.7066201021897439
PeftModelForSequenceClassification(
  (base_model): LoraModel(
    (model): EsmForSequenceClassification(
      (esm): EsmModel(
        (embeddings): EsmEmbeddings(
          (word_embeddings): Embedding(4105, 1280, padding_idx=1)
          (dropout): Dropout(p=0.0, inplace=False)
          (position_embeddings): Embedding(1002, 1280, padding_idx=1)
        )
        (encoder): EsmEncoder(
          (layer): ModuleList(
            (0-23): 24 x EsmLayer(
              (attention): EsmAttention(
                (self): EsmSelfAttention(
                  (query): Linear(
                    in_features=1280, out_features=1280, bias=True
                    (lora_dropout): ModuleDict(
                      (default): Dropout(p=0.1, inplace=False)
                    )
                    (lora_A): ModuleDict(
                      (default): Linear(in_features=1280, out_features=1, bias=False)
                    )
                    (lora_B): ModuleDict(
                      (default): Linear(in_features=1, out_features=1280, bias=False)
                    )
                    (lora_embedding_A): ParameterDict()
                    (lora_embedding_B): ParameterDict()
                  )
                  (key): Linear(in_features=1280, out_features=1280, bias=True)
                  (value): Linear(
                    in_features=1280, out_features=1280, bias=True
                    (lora_dropout): ModuleDict(
                      (default): Dropout(p=0.1, inplace=False)
                    )
                    (lora_A): ModuleDict(
                      (default): Linear(in_features=1280, out_features=1, bias=False)
                    )
                    (lora_B): ModuleDict(
                      (default): Linear(in_features=1, out_features=1280, bias=False)
                    )
                    (lora_embedding_A): ParameterDict()
                    (lora_embedding_B): ParameterDict()
                  )
                  (dropout): Dropout(p=0.0, inplace=False)
                )
                (output): EsmSelfOutput(
                  (dense): Linear(in_features=1280, out_features=1280, bias=True)
                  (dropout): Dropout(p=0.0, inplace=False)
                )
                (LayerNorm): LayerNorm((1280,), eps=1e-12, elementwise_affine=True)
              )
              (intermediate): EsmIntermediate(
                (dense): Linear(in_features=1280, out_features=5120, bias=True)
              )
              (output): EsmOutput(
                (dense): Linear(in_features=5120, out_features=1280, bias=True)
                (dropout): Dropout(p=0.0, inplace=False)
              )
              (LayerNorm): LayerNorm((1280,), eps=1e-12, elementwise_affine=True)
            )
          )
          (emb_layer_norm_after): LayerNorm((1280,), eps=1e-12, elementwise_affine=True)
        )
        (contact_head): EsmContactPredictionHead(
          (regression): Linear(in_features=480, out_features=1, bias=True)
          (activation): Sigmoid()
        )
      )
      (classifier): ModulesToSaveWrapper(
        (original_module): EsmClassificationHead(
          (dense): Linear(in_features=1280, out_features=1280, bias=True)
          (dropout): Dropout(p=0.0, inplace=False)
          (out_proj): Linear(in_features=1280, out_features=2, bias=True)
        )
        (modules_to_save): ModuleDict(
          (default): EsmClassificationHead(
            (dense): Linear(in_features=1280, out_features=1280, bias=True)
            (dropout): Dropout(p=0.0, inplace=False)
            (out_proj): Linear(in_features=1280, out_features=2, bias=True)
          )
        )
      )
    )
  )
)
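The trainable-parameter count printed above can be checked by hand from the layer shapes in the printout. The sketch below does the arithmetic; note that the printed total only matches if the classification head is counted twice (once as `original_module`, once as the `modules_to_save` copy). That double counting is inferred from the numbers, not confirmed from the library, and the function names are hypothetical.

```python
HIDDEN = 1280      # hidden size, from the printout
NUM_LAYERS = 24    # number of EsmLayer blocks
RANK = 1           # LoRA rank (out_features of lora_A)


def lora_params(hidden=HIDDEN, rank=RANK, layers=NUM_LAYERS, modules_per_layer=2):
    """Parameters in the lora_A and lora_B matrices of query and value."""
    per_module = hidden * rank + rank * hidden   # A is r x d, B is d x r, no bias
    return per_module * modules_per_layer * layers


def head_params(num_labels, hidden=HIDDEN):
    """Parameters in one EsmClassificationHead (dense + out_proj, with biases)."""
    dense = hidden * hidden + hidden
    out_proj = hidden * num_labels + num_labels
    return dense + out_proj


def reported_trainable(num_labels):
    """Assumption: the printed count includes the head twice (see lead-in)."""
    return lora_params() + 2 * head_params(num_labels)


promoter_trainable = reported_trainable(2)
percentage = 100 * promoter_trainable / 482205925   # total taken from the printout
print(lora_params(), head_params(2), promoter_trainable, percentage)
```

The LoRA matrices themselves account for only 122,880 parameters; the rest of the trainable count comes from the classification head. With three labels the same arithmetic gives 3,409,926, the count printed for the enhancer model later in this notebook.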

First task: Promoter prediction

Promoter prediction is a sequence classification problem, in which the DNA sequence is predicted to be either a promoter or not.

A promoter is a region of DNA where transcription of a gene is initiated. Promoters are a vital component of expression vectors because they control the binding of RNA polymerase to DNA. RNA polymerase transcribes DNA to mRNA, which is ultimately translated into a functional protein.

This task was introduced in DeePromoter, where a set of TATA and non-TATA promoters was gathered. A negative sequence was generated from each promoter by randomly sampling subsets of the sequence, to guarantee that some obvious motifs were present in both the positive and the negative datasets.
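A hedged sketch of how the data could be loaded is given below. The dataset name and the `promoter_all` configuration appear in the cache paths logged by this notebook; the 5% validation fraction is inferred from the split sizes shown in the tokenization output (50,612 and 2,664 examples), and the seed and function name are assumptions. Depending on the installed `datasets` version, script-based datasets may need extra arguments to load.

```python
DATASET_NAME = "InstaDeepAI/nucleotide_transformer_downstream_tasks"


def load_task_splits(task_name="promoter_all", validation_fraction=0.05, seed=42):
    """Return (train, validation, test) datasets for one downstream task."""
    # Lazy import: only needed when the function is actually called.
    from datasets import load_dataset

    train_full = load_dataset(DATASET_NAME, task_name, split="train")
    test = load_dataset(DATASET_NAME, task_name, split="test")

    # Hold out part of the training set for validation.
    split = train_full.train_test_split(test_size=validation_fraction, seed=seed)
    return split["train"], split["test"], test
```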

Dataset loading and preparation

[ ]
WARNING:datasets.builder:Found cached dataset nucleotide_transformer_downstream_tasks (/root/.cache/huggingface/datasets/InstaDeepAI___nucleotide_transformer_downstream_tasks/promoter_all/0.0.0/4a78b0644424e03fb4f26af3966a46e57ea50a1132ab8bb2f63b7808ce6a8772)
WARNING:datasets.builder:Found cached dataset nucleotide_transformer_downstream_tasks (/root/.cache/huggingface/datasets/InstaDeepAI___nucleotide_transformer_downstream_tasks/promoter_all/0.0.0/4a78b0644424e03fb4f26af3966a46e57ea50a1132ab8bb2f63b7808ce6a8772)
[ ]

Let us have a look at the data. If we extract the last sequence of the dataset, we see that it is indeed a promoter, as its label is 1. Furthermore, we can also see that it is a TATA promoter, as the TATA motif is present at position 221 of the sequence (0-based index)!
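The motif check can be sketched with a plain string search. The helper name `find_motif` and the toy sequence are made up for illustration; they are not taken from the dataset.

```python
def find_motif(sequence, motif="TATA"):
    """Return the 0-based index of the first occurrence of motif, or -1 if absent."""
    return sequence.upper().find(motif)


toy_sequence = "GGCTATAAAGC"   # made-up example, not from the dataset
position = find_motif(toy_sequence)
print(f"TATA motif found at index {position}")
```

Because `str.find` returns a 0-based index, an index of 221 corresponds to the 222nd nucleotide when counting from 1.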

[ ]
The DNA sequence is CACACCAGACAAAATTTGGTTAATTTGCGCCCAATATTCATTACTTTGACCTAACCTTTGTTCTGAAGGCCGTGTACAAGGACAAGGCCCTGAGATTATTGCAACAGTAACTTGAAAAACTTTCAGAAGTCTATTCTGTAGGATTAAAGGAATGCTGAGACTATTCAAGTTTGAAGTCCTGGGGGTGGGGAAAAATAAAAAACCTGTGCTAGAAAGCTTAGTATAGCATGTAACTTTAGAGTCCTGTGGAGTCCTGAGTCTCCCACAGACCAGAACAGTCATTTAAAAGTTTTCAGGAAA.
Its associated label is label 1.
This promoter is a TATA promoter, as the TATA motif is present at the 221th nucleotide.

Tokenizing the datasets

All inputs to neural nets must be numerical. The process of converting strings into numerical indices suitable for a neural net is called tokenization.
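As an illustration of the idea, the sketch below splits a DNA sequence into non-overlapping 6-mers and maps them to integer ids. The Nucleotide Transformer tokenizer is 6-mer based, and the 4105-row embedding table in the model printout is consistent with 4^6 = 4096 six-mers plus a few extra tokens. The vocabulary order, the special tokens and the handling of leftover nucleotides here are simplifying assumptions, so the ids do not match the real tokenizer, which is loaded from the checkpoint in practice.

```python
from itertools import product

K = 6
NUCLEOTIDES = "ACGT"


def build_vocab(k=K):
    """Toy vocabulary: two special tokens, all k-mers, then single nucleotides."""
    tokens = ["<pad>", "<unk>"]
    tokens += ["".join(kmer) for kmer in product(NUCLEOTIDES, repeat=k)]
    tokens += list(NUCLEOTIDES)
    return {token: index for index, token in enumerate(tokens)}


def tokenize(sequence, k=K):
    """Split into non-overlapping k-mers; leftover nucleotides become single tokens."""
    full = len(sequence) - len(sequence) % k
    kmers = [sequence[i:i + k] for i in range(0, full, k)]
    return kmers + list(sequence[full:])


def encode(sequence, vocab, k=K):
    """Map each token to its integer id, falling back to <unk>."""
    return [vocab.get(token, vocab["<unk>"]) for token in tokenize(sequence, k)]


vocab = build_vocab()
tokens = tokenize("ACGTACGTAC")
ids = encode("AAAAAAC", vocab)
print(tokens, ids, len(vocab))
```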

[ ]
Downloading (…)okenizer_config.json:   0%|          | 0.00/129 [00:00<?, ?B/s]
Downloading (…)solve/main/vocab.txt: 0.00B [00:00, ?B/s]
Downloading (…)cial_tokens_map.json:   0%|          | 0.00/101 [00:00<?, ?B/s]
[ ]
[ ]
[ ]
Map:   0%|          | 0/50612 [00:00<?, ? examples/s]
Map:   0%|          | 0/2664 [00:00<?, ? examples/s]
Map:   0%|          | 0/5920 [00:00<?, ? examples/s]

Fine-tuning and evaluation

We initialize our TrainingArguments. These control the various training hyperparameters, and will be passed to our Trainer.

The hyperparameters used for the IA³ method in the paper do not give good performance with LoRA. The main reason is that LoRA introduces more trainable parameters and therefore requires a smaller learning rate. Here we use a learning rate of 5 × 10⁻⁴, which enables us to get close to the paper's performance.
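A minimal sketch of such a configuration follows. The learning rate, the batch size of 8 and the 1000 training steps are taken from the text of this notebook; the output directory, evaluation batch size and logging interval are assumptions, as is the helper name.

```python
training_kwargs = {
    "output_dir": "nt-500m-lora-promoter",   # hypothetical directory name
    "learning_rate": 5e-4,                   # value quoted in the text
    "per_device_train_batch_size": 8,        # batch size quoted in the text
    "per_device_eval_batch_size": 64,        # assumption
    "max_steps": 1000,                       # number of steps quoted in the text
    "logging_steps": 100,                    # assumption
}


def build_training_arguments(kwargs):
    """Turn the plain dictionary into a TrainingArguments object."""
    # Lazy import: only needed when the function is actually called.
    from transformers import TrainingArguments

    return TrainingArguments(**kwargs)
```

Keeping the values in a plain dictionary makes it easy to reuse them for the second task and to change only the learning rate or batch size.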

[ ]

Next, we define the metric we will use to evaluate our models and write a compute_metrics function. For this task the metric is the F1 score, which we can load from the scikit-learn library.
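To show what the metric computes, here is a dependency-free sketch of a binary F1 score and of a `compute_metrics` function that turns logits into predictions. The notebook itself loads the metric from scikit-learn; the hand-written `binary_f1`, the `argmax` helper and the toy inputs are illustrative assumptions.

```python
def argmax(row):
    """Index of the largest value in a sequence of scores."""
    return max(range(len(row)), key=row.__getitem__)


def binary_f1(labels, predictions, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == positive and p == positive)
    fp = sum(1 for y, p in pairs if y != positive and p == positive)
    fn = sum(1 for y, p in pairs if y == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def compute_metrics(eval_pred):
    """Expects a (logits, labels) pair, as passed by the Trainer at evaluation time."""
    logits, labels = eval_pred
    predictions = [argmax(row) for row in logits]
    return {"f1_score": binary_f1(list(labels), predictions)}


# Toy check: 2 true positives, 1 false positive, 0 false negatives.
toy_logits = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7], [0.2, 0.8]]
toy_labels = [1, 0, 0, 1]
toy_result = compute_metrics((toy_logits, toy_labels))
print(toy_result)
```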

[ ]
[ ]

We can now finetune our model by just calling the train method:

[ ]
/usr/local/lib/python3.10/dist-packages/transformers/optimization.py:411: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(

Note that the finetuning is done with a small batch size (8). The training time can be reduced by increasing the batch size, as it leverages parallelism in the GPU.

Validation F1 score

[ ]
[ ]
Output

F1 score on the test dataset

[ ]
F1 score on the test dataset: 0.9370581134678393

For the promoter prediction task, we reproduced the experiment carried out in the article by adapting the learning rate to the LoRA method. An F1 score of 0.937 is obtained after just 1000 training steps. To get closer to the 0.954 score obtained in the Nucleotide Transformer paper after 10,000 training steps, we would most likely need to train for longer.

Second task: Enhancer prediction

In this section, we fine-tune the Nucleotide Transformer model on enhancer type prediction, which consists of classifying a DNA sequence as a strong enhancer, a weak enhancer or a non-enhancer.

In genetics, an enhancer is a short (50–1500 bp) region of DNA that can be bound by proteins (activators) to increase the likelihood that transcription of a particular gene will occur.

The paper "A deep learning framework for enhancer prediction using word embedding and sequence generation" introduced the dataset used here by augmenting an original set of enhancers with 6000 synthetic enhancers and 6000 synthetic non-enhancers produced through a generative model.

Dataset loading and preparation

[ ]
Downloading builder script: 0.00B [00:00, ?B/s]
Downloading readme: 0.00B [00:00, ?B/s]
Downloading and preparing dataset nucleotide_transformer_downstream_tasks/enhancers_types to /root/.cache/huggingface/datasets/InstaDeepAI___nucleotide_transformer_downstream_tasks/enhancers_types/0.0.0/4a78b0644424e03fb4f26af3966a46e57ea50a1132ab8bb2f63b7808ce6a8772...
Downloading data:   0%|          | 0.00/3.10M [00:00<?, ?B/s]
Downloading data:   0%|          | 0.00/83.2k [00:00<?, ?B/s]
Generating train split: 0 examples [00:00, ? examples/s]
Generating test split: 0 examples [00:00, ? examples/s]
Dataset nucleotide_transformer_downstream_tasks downloaded and prepared to /root/.cache/huggingface/datasets/InstaDeepAI___nucleotide_transformer_downstream_tasks/enhancers_types/0.0.0/4a78b0644424e03fb4f26af3966a46e57ea50a1132ab8bb2f63b7808ce6a8772. Subsequent calls will reuse this data.
WARNING:datasets.builder:Found cached dataset nucleotide_transformer_downstream_tasks (/root/.cache/huggingface/datasets/InstaDeepAI___nucleotide_transformer_downstream_tasks/enhancers_types/0.0.0/4a78b0644424e03fb4f26af3966a46e57ea50a1132ab8bb2f63b7808ce6a8772)
[ ]
[ ]

Tokenizing the datasets

[ ]
[ ]
Map:   0%|          | 0/13471 [00:00<?, ? examples/s]
Map:   0%|          | 0/1497 [00:00<?, ? examples/s]
Map:   0%|          | 0/400 [00:00<?, ? examples/s]

Fine-tuning and evaluation

[ ]
Downloading (…)lve/main/config.json:   0%|          | 0.00/706 [00:00<?, ?B/s]
Downloading pytorch_model.bin:   0%|          | 0.00/1.94G [00:00<?, ?B/s]
Some weights of the model checkpoint at InstaDeepAI/nucleotide-transformer-500m-human-ref were not used when initializing EsmForSequenceClassification: ['lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.layer_norm.weight', 'lm_head.decoder.weight', 'lm_head.bias']
- This IS expected if you are initializing EsmForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing EsmForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of EsmForSequenceClassification were not initialized from the model checkpoint at InstaDeepAI/nucleotide-transformer-500m-human-ref and are newly initialized: ['classifier.out_proj.bias', 'classifier.dense.weight', 'classifier.out_proj.weight', 'classifier.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[ ]
[ ]
trainable params: 3409926 || all params: 482208487 || trainable%: 0.7071476533344383
PeftModelForSequenceClassification(
  (base_model): LoraModel(
    (model): EsmForSequenceClassification(
      (esm): EsmModel(
        (embeddings): EsmEmbeddings(
          (word_embeddings): Embedding(4105, 1280, padding_idx=1)
          (dropout): Dropout(p=0.0, inplace=False)
          (position_embeddings): Embedding(1002, 1280, padding_idx=1)
        )
        (encoder): EsmEncoder(
          (layer): ModuleList(
            (0-23): 24 x EsmLayer(
              (attention): EsmAttention(
                (self): EsmSelfAttention(
                  (query): Linear(
                    in_features=1280, out_features=1280, bias=True
                    (lora_dropout): ModuleDict(
                      (default): Dropout(p=0.1, inplace=False)
                    )
                    (lora_A): ModuleDict(
                      (default): Linear(in_features=1280, out_features=1, bias=False)
                    )
                    (lora_B): ModuleDict(
                      (default): Linear(in_features=1, out_features=1280, bias=False)
                    )
                    (lora_embedding_A): ParameterDict()
                    (lora_embedding_B): ParameterDict()
                  )
                  (key): Linear(in_features=1280, out_features=1280, bias=True)
                  (value): Linear(
                    in_features=1280, out_features=1280, bias=True
                    (lora_dropout): ModuleDict(
                      (default): Dropout(p=0.1, inplace=False)
                    )
                    (lora_A): ModuleDict(
                      (default): Linear(in_features=1280, out_features=1, bias=False)
                    )
                    (lora_B): ModuleDict(
                      (default): Linear(in_features=1, out_features=1280, bias=False)
                    )
                    (lora_embedding_A): ParameterDict()
                    (lora_embedding_B): ParameterDict()
                  )
                  (dropout): Dropout(p=0.0, inplace=False)
                )
                (output): EsmSelfOutput(
                  (dense): Linear(in_features=1280, out_features=1280, bias=True)
                  (dropout): Dropout(p=0.0, inplace=False)
                )
                (LayerNorm): LayerNorm((1280,), eps=1e-12, elementwise_affine=True)
              )
              (intermediate): EsmIntermediate(
                (dense): Linear(in_features=1280, out_features=5120, bias=True)
              )
              (output): EsmOutput(
                (dense): Linear(in_features=5120, out_features=1280, bias=True)
                (dropout): Dropout(p=0.0, inplace=False)
              )
              (LayerNorm): LayerNorm((1280,), eps=1e-12, elementwise_affine=True)
            )
          )
          (emb_layer_norm_after): LayerNorm((1280,), eps=1e-12, elementwise_affine=True)
        )
        (contact_head): EsmContactPredictionHead(
          (regression): Linear(in_features=480, out_features=1, bias=True)
          (activation): Sigmoid()
        )
      )
      (classifier): ModulesToSaveWrapper(
        (original_module): EsmClassificationHead(
          (dense): Linear(in_features=1280, out_features=1280, bias=True)
          (dropout): Dropout(p=0.0, inplace=False)
          (out_proj): Linear(in_features=1280, out_features=3, bias=True)
        )
        (modules_to_save): ModuleDict(
          (default): EsmClassificationHead(
            (dense): Linear(in_features=1280, out_features=1280, bias=True)
            (dropout): Dropout(p=0.0, inplace=False)
            (out_proj): Linear(in_features=1280, out_features=3, bias=True)
          )
        )
      )
    )
  )
)

We initialize our TrainingArguments. These control the various training hyperparameters, and will be passed to our Trainer.

We keep the same hyperparameters as for the promoter task, i.e., the same as in the paper except for a learning rate of 5 × 10⁻⁴, which enables us to get close to the paper's performance.

[ ]

Here, the metric used to evaluate the model is the Matthews Correlation Coefficient (MCC), which is more relevant than accuracy when the classes in the dataset are imbalanced. We can load a predefined function from the scikit-learn library.
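For intuition, the sketch below computes a multiclass MCC directly from label counts. It follows the standard multiclass generalization of the coefficient, which is understood to be what scikit-learn's `matthews_corrcoef` computes; the function name and the toy labels are illustrative assumptions, and the notebook itself relies on the scikit-learn function.

```python
from math import sqrt


def multiclass_mcc(labels, predictions):
    """Matthews Correlation Coefficient for any number of classes."""
    classes = sorted(set(labels) | set(predictions))
    s = len(labels)                                            # number of samples
    c = sum(1 for y, p in zip(labels, predictions) if y == p)  # correct predictions
    t = [sum(1 for y in labels if y == k) for k in classes]       # true count per class
    p = [sum(1 for q in predictions if q == k) for k in classes]  # predicted count per class

    numerator = c * s - sum(pk * tk for pk, tk in zip(p, t))
    denominator = sqrt(
        (s * s - sum(pk * pk for pk in p)) * (s * s - sum(tk * tk for tk in t))
    )
    # By convention the score is 0 when the denominator vanishes
    # (for example when every sample is assigned to the same class).
    return numerator / denominator if denominator else 0.0


toy_labels = [0, 0, 1, 1, 2, 2]
perfect = multiclass_mcc(toy_labels, toy_labels)
constant = multiclass_mcc(toy_labels, [0, 0, 0, 0, 0, 0])
partial = multiclass_mcc(toy_labels, [0, 1, 1, 1, 2, 0])
print(perfect, constant, partial)
```

A constant prediction scores 0 regardless of how frequent the predicted class is, which is why MCC is more informative than accuracy on imbalanced data.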

[ ]
[ ]

We can now finetune our model by just calling the train method:

[ ]
/usr/local/lib/python3.10/dist-packages/transformers/optimization.py:411: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(

As with the first task, the training time can be greatly reduced by increasing the batch size.

Validation MCC score

[ ]
[ ]
Output

MCC on the test dataset

[ ]
MCC score on the test dataset: 0.3962247896674896

For the enhancer types prediction task, we obtain an MCC of about 0.40 after 1000 training steps, which is already slightly above the baseline against which the Nucleotide Transformer is compared (0.395). It is still, however, 8.5 percentage points below the performance (0.485) reported in the Nucleotide Transformer paper. To match the paper's results more closely, it will probably be necessary to increase the number of training steps. Also note that the paper used a parameter-efficient fine-tuning method called IA³, whereas this notebook uses LoRA, which differs in several respects.