Nucleotide Transformer Dna Sequence Modelling


If you're opening this notebook on Colab, you will probably need to install 🤗 Transformers as well as some other libraries. Uncomment the following cell and run it.

[1]

If you're opening this notebook locally, make sure your environment has the latest version of those libraries installed.

To be able to share your model with the community and generate results like the one shown in the picture below via the inference API, there are a few more steps to follow.

First, you have to log in to the Hugging Face Hub:

[ ]

Then you need to install Git-LFS. Uncomment the following instructions:

[ ]
Reading package lists... Done
Building dependency tree       
Reading state information... Done
git-lfs is already the newest version (2.9.2-1).
0 upgraded, 0 newly installed, 0 to remove and 16 not upgraded.

Fine-tuning the Nucleotide Transformer

The Nucleotide Transformer paper (Dalla-Torre et al., 2023) introduces 4 genomics foundation models developed by InstaDeep. These transformers, of various sizes and trained on different datasets, produce powerful representations of DNA sequences that can be used to tackle a very diverse set of problems, such as chromatin accessibility, deleteriousness prediction, and promoter and enhancer prediction. These representations can be extracted from the transformer and used as proxies for the DNA sequences (this is called probing), or the transformer can be trained further on a specific task (this is called fine-tuning).

Figure_1.png

This notebook allows you to fine-tune these models.

The model we are going to use is the 500M Human Ref model, a 500M-parameter transformer pre-trained on the human reference genome following the training methodology presented in the Nucleotide Transformer paper. It is one of the 4 models introduced, all available on the InstaDeep Hugging Face page:

| Model name          | Num layers | Num parameters | Training dataset       |
|---------------------|------------|----------------|------------------------|
| `500M Human Ref`    | 24         | 500M           | Human reference genome |
| `500M 1000G`        | 24         | 500M           | 1000G genomes          |
| `2.5B 1000G`        | 32         | 2.5B           | 1000G genomes          |
| `2.5B Multispecies` | 32         | 2.5B           | Multi-species dataset  |

Note that using the larger models will require more GPU memory and lead to longer fine-tuning times.

In the following, we showcase the Nucleotide Transformer's ability to classify genomic sequences on two of the most basic genomic motifs: promoters and enhancer types. Both are classification tasks, but the enhancer types task is much more challenging because it has 3 classes.

These two tasks are still very basic, but the Nucleotide Transformer models have been shown to match or beat state-of-the-art models on much more complex tasks, such as DeepSEA, which predicts 919 chromatin profiles across a diverse set of human cells and tissues from a single DNA sequence, or DeepSTARR, which predicts an enhancer's activity.

Importing required packages

Import and install

[2]
[5]

Prepare and create the model for fine-tuning

The Nucleotide Transformer will be fine-tuned on two classification tasks: promoter classification and enhancer type classification. The AutoModelForSequenceClassification class automatically loads the model and adds a simple classification head on top of the final embeddings.
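As a rough sketch of what this loading step can look like: the checkpoint name below is the one that appears in the model-loading log later in this notebook, while the `NUM_LABELS` mapping and the `load_classifier` helper are hypothetical names introduced here for illustration only.

```python
# Checkpoint name as printed in the model-loading log later in this notebook.
MODEL_NAME = "InstaDeepAI/nucleotide-transformer-500m-human-ref"

# Number of output classes per task (hypothetical mapping for this sketch):
# promoter vs. non-promoter, and strong / weak / non-enhancer.
NUM_LABELS = {"promoter": 2, "enhancer_types": 3}


def load_classifier(task: str):
    """Load the pre-trained backbone with a freshly initialized classification head."""
    # Imported lazily so the constants above can be inspected without
    # transformers being installed.
    from transformers import AutoModelForSequenceClassification

    return AutoModelForSequenceClassification.from_pretrained(
        MODEL_NAME, num_labels=NUM_LABELS[task]
    )
```

Calling `load_classifier("promoter")` downloads the checkpoint, so it needs network access and enough memory for a 500M-parameter model.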

First task: Promoter prediction

Promoter prediction is a sequence classification problem, in which the DNA sequence is predicted to be either a promoter or not.

A promoter is a region of DNA where transcription of a gene is initiated. Promoters are a vital component of expression vectors because they control the binding of RNA polymerase to DNA. RNA polymerase transcribes DNA to mRNA, which is ultimately translated into a functional protein.

This task was introduced in DeePromoter, where a set of TATA and non-TATA promoters was gathered. A negative sequence was generated from each promoter by randomly sampling subsets of the sequence, to guarantee that some obvious motifs were present in both the positive and negative datasets.
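To make the idea of building a negative from a promoter more concrete, here is one possible reading of it as a minimal sketch: the sequence is cut into chunks, some chunks are kept as they are (so obvious motifs can survive), and the others are replaced by random nucleotides. The function name, the chunk count and the number of replaced chunks are illustrative assumptions, not settings confirmed by this notebook.

```python
import random


def make_negative(sequence, n_chunks=20, n_replaced=12, seed=None):
    """Build a negative example by randomizing a random subset of chunks.

    n_chunks and n_replaced are illustrative defaults (assumptions).
    """
    rng = random.Random(seed)
    chunk_len = max(1, len(sequence) // n_chunks)
    chunks = [sequence[i:i + chunk_len] for i in range(0, len(sequence), chunk_len)]

    # Pick which chunks get replaced; the remaining ones are kept untouched.
    replaced = set(rng.sample(range(len(chunks)), min(n_replaced, len(chunks))))

    out = []
    for idx, chunk in enumerate(chunks):
        if idx in replaced:
            out.append("".join(rng.choice("ACGT") for _ in chunk))
        else:
            out.append(chunk)
    return "".join(out)


# Hypothetical toy promoter, not taken from the dataset.
toy_promoter = "ACGT" * 75
toy_negative = make_negative(toy_promoter, seed=0)
```

The negative has the same length as the original and shares its untouched chunks, which is what keeps the task from being trivially solvable by spotting a single motif.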

[ ]

Dataset loading and preparation

[ ]
WARNING:datasets.builder:Found cached dataset nucleotide_transformer_downstream_tasks_public (/root/.cache/huggingface/datasets/InstaDeepAI___nucleotide_transformer_downstream_tasks_public/promoter_all/0.0.0/d649d80b49e7b062da8a12a4d80a5d636571467e76a0a036d89078ffded1e5c9)
WARNING:datasets.builder:Found cached dataset nucleotide_transformer_downstream_tasks_public (/root/.cache/huggingface/datasets/InstaDeepAI___nucleotide_transformer_downstream_tasks_public/promoter_all/0.0.0/d649d80b49e7b062da8a12a4d80a5d636571467e76a0a036d89078ffded1e5c9)
[ ]
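The loading code is not shown here, so the following is only a sketch of how it might look. The repository and configuration names are inferred from the cache path in the log above, the 5% validation fraction is inferred from the dataset sizes reported by the tokenization step further down (2664 out of 50612 + 2664 examples), and `load_promoter_splits` is a hypothetical helper. Depending on the version of the datasets library, extra arguments may be needed.

```python
# Names inferred from the cache path printed in the log (assumptions).
DATASET_NAME = "InstaDeepAI/nucleotide_transformer_downstream_tasks_public"
CONFIG_NAME = "promoter_all"

# Inferred from the split sizes: 2664 / (50612 + 2664) is roughly 0.05.
VALIDATION_FRACTION = 0.05


def load_promoter_splits(seed=42):
    """Return (train, validation, test) datasets for the promoter task."""
    # Imported lazily so the constants can be inspected without datasets installed.
    from datasets import load_dataset

    train_full = load_dataset(DATASET_NAME, CONFIG_NAME, split="train")
    test = load_dataset(DATASET_NAME, CONFIG_NAME, split="test")

    # Hold out part of the training data for validation.
    split = train_full.train_test_split(test_size=VALIDATION_FRACTION, seed=seed)
    return split["train"], split["test"], test
```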

Let us have a look at the data. If we extract the last sequence of the dataset, we see that it is indeed a promoter, as its label is 1. Furthermore, we can also see that it is a TATA promoter, as the TATA motif is present at position 221 of the sequence!

[ ]
The DNA sequence is CACACCAGACAAAATTTGGTTAATTTGCGCCCAATATTCATTACTTTGACCTAACCTTTGTTCTGAAGGCCGTGTACAAGGACAAGGCCCTGAGATTATTGCAACAGTAACTTGAAAAACTTTCAGAAGTCTATTCTGTAGGATTAAAGGAATGCTGAGACTATTCAAGTTTGAAGTCCTGGGGGTGGGGAAAAATAAAAAACCTGTGCTAGAAAGCTTAGTATAGCATGTAACTTTAGAGTCCTGTGGAGTCCTGAGTCTCCCACAGACCAGAACAGTCATTTAAAAGTTTTCAGGAAA.
Its associated label is label 1.
This promoter is a TATA promoter, as the TATA motif is present at the 221th nucleotide.
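A check like the one above can be written with plain string search. This is a minimal sketch, assuming a simple substring match is what is wanted; `find_motif` and the toy sequence are hypothetical and not taken from the dataset. Note that `str.find` returns a 0-based index.

```python
def find_motif(sequence: str, motif: str = "TATA") -> int:
    """Return the 0-based index of the first occurrence of motif, or -1 if absent."""
    return sequence.upper().find(motif.upper())


# Hypothetical toy sequence for illustration.
toy_sequence = "GGCCGTATAAAGGC"
position = find_motif(toy_sequence)

if position >= 0:
    print(f"TATA motif found at index {position}")
else:
    print("No TATA motif found")
```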

Tokenizing the datasets

All inputs to neural nets must be numerical. The process of converting strings into numerical indices suitable for a neural net is called tokenization.
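To illustrate the idea, here is a toy k-mer tokenizer. It is a sketch only: in practice the tokenizer is loaded from the model checkpoint, and the 6-mer scheme, the special token names and the vocabulary order used below are assumptions made for this example.

```python
from itertools import product

K = 6  # assumed k-mer length for this sketch


def build_vocab(k=K):
    """Map special tokens, all k-mers over ACGT, and single nucleotides to integer ids."""
    specials = ["<unk>", "<pad>", "<cls>"]
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    singles = list("ACGT")
    return {tok: i for i, tok in enumerate(specials + kmers + singles)}


def tokenize(sequence, vocab, k=K):
    """Split a DNA string into non-overlapping k-mers and return their ids."""
    tokens = ["<cls>"]
    n_full = len(sequence) - len(sequence) % k
    for i in range(0, n_full, k):
        tokens.append(sequence[i:i + k])
    # Leftover nucleotides that do not fill a k-mer are tokenized one by one.
    tokens.extend(sequence[n_full:])
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokens]


vocab = build_vocab()
ids = tokenize("ACGTACGTAC", vocab)  # <cls>, "ACGTAC", then "G", "T", "A", "C"
```

Grouping nucleotides into k-mers shortens the token sequence by roughly a factor of k, which lets the model see longer stretches of DNA within a fixed context length.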

[12]
Downloading (…)okenizer_config.json:   0%|          | 0.00/129 [00:00<?, ?B/s]
Downloading (…)solve/main/vocab.txt: 0.00B [00:00, ?B/s]
Downloading (…)cial_tokens_map.json:   0%|          | 0.00/101 [00:00<?, ?B/s]
[ ]
[13]
[ ]
Map:   0%|          | 0/50612 [00:00<?, ? examples/s]
Map:   0%|          | 0/2664 [00:00<?, ? examples/s]
Map:   0%|          | 0/5920 [00:00<?, ? examples/s]

Fine-tuning and evaluation

The hyperparameters introduced here are different from the ones used in the paper, since we are training the whole model. A further hyperparameter search will likely improve the performance on the task! We initialize our TrainingArguments. These control the various training hyperparameters and will be passed to our Trainer.
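A minimal sketch of such a configuration is shown below. The batch size of 8 and the 1000 training steps are the values mentioned later in this notebook; the output directory, learning rate and logging interval are placeholder assumptions, and `build_training_arguments` is a hypothetical helper.

```python
# Keyword arguments for transformers.TrainingArguments.
TRAINING_KWARGS = {
    "output_dir": "nt-500m-promoter-finetuned",  # hypothetical directory name
    "per_device_train_batch_size": 8,            # batch size stated in this notebook
    "max_steps": 1000,                           # number of steps stated in this notebook
    "learning_rate": 1e-5,                       # assumed value, not from the notebook
    "logging_steps": 100,                        # assumed value, not from the notebook
}


def build_training_arguments():
    """Create the TrainingArguments object that is later passed to the Trainer."""
    # Imported lazily so the settings can be inspected without transformers installed.
    from transformers import TrainingArguments

    return TrainingArguments(**TRAINING_KWARGS)
```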

[ ]

Next, we define the metric we will use to evaluate our models, the F1 score, and write a compute_metrics function. We can load the F1 score from the scikit-learn library.
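To show what this metric computes, here is a dependency-free sketch of a `compute_metrics` function for the binary case. It is not the notebook's implementation, which relies on scikit-learn; the helper names and the `"f1_score"` dictionary key are assumptions for illustration.

```python
def argmax(row):
    """Index of the largest value in a sequence of logits."""
    return max(range(len(row)), key=lambda i: row[i])


def binary_f1(labels, preds, positive=1):
    """F1 = 2 * precision * recall / (precision + recall) for the positive class."""
    tp = sum(1 for y, p in zip(labels, preds) if y == positive and p == positive)
    fp = sum(1 for y, p in zip(labels, preds) if y != positive and p == positive)
    fn = sum(1 for y, p in zip(labels, preds) if y == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def compute_metrics(eval_pred):
    """Turn (logits, labels) into a dictionary of metrics."""
    logits, labels = eval_pred
    preds = [argmax(row) for row in logits]
    return {"f1_score": binary_f1(list(labels), preds)}


# Toy example: 2 true positives, 1 false positive, 1 false negative.
toy_logits = [[0.1, 0.9], [2.0, -1.0], [0.3, 0.2], [-0.5, 0.5], [0.0, 1.0]]
toy_labels = [1, 0, 1, 1, 0]
result = compute_metrics((toy_logits, toy_labels))
```

In the toy example precision and recall are both 2/3, so the F1 score is 2/3 as well.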

[ ]
[ ]

We can now finetune our model by just calling the train method:

[ ]
/usr/local/lib/python3.10/dist-packages/transformers/optimization.py:411: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(

Note that the finetuning is done with a small batch size (8). The training time can be reduced by increasing the batch size, as it leverages parallelism in the GPU.
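A little arithmetic makes the note above concrete. This sketch assumes 1000 training steps (the number quoted in the results below) and takes 50612, the first dataset size reported by the tokenization step, to be the size of the training split; the batch size of 32 is just an example.

```python
import math


def steps_for_same_examples(steps, batch_size, new_batch_size):
    """Number of optimizer steps needed to see the same number of examples."""
    examples_seen = steps * batch_size
    return math.ceil(examples_seen / new_batch_size)


examples_seen = 1000 * 8                              # steps * batch size
fraction_of_train_set = examples_seen / 50612         # assumed training split size
steps_with_batch_32 = steps_for_same_examples(1000, 8, 32)

print(examples_seen)
print(round(fraction_of_train_set, 3))
print(steps_with_batch_32)
```

Under these assumptions the model sees 8000 examples, well under one pass over the training data, and a batch size of 32 would cover the same examples in 250 steps. Larger batches need more GPU memory, and other hyperparameters such as the learning rate may need retuning.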

Validation F1 score

[ ]
[ ]

F1 score on the test dataset

[ ]
F1 score on the test dataset: 0.9388458225667529

For the promoter prediction task, training for only 1000 steps already gives an F1 score of 0.939, which is close to the 0.954 obtained in the Nucleotide Transformer paper after 10,000 training steps. To close the remaining gap, we will most likely need to train for longer!

Second task: Enhancer type prediction

In this section, we fine-tune the Nucleotide Transformer model on enhancer type prediction, which consists of classifying a DNA sequence as a strong enhancer, a weak enhancer or a non-enhancer.

In genetics, an enhancer is a short (50–1500 bp) region of DNA that can be bound by proteins (activators) to increase the likelihood that transcription of a particular gene will occur.

The paper "A deep learning framework for enhancer prediction using word embedding and sequence generation" introduced the dataset used here by augmenting an original set of enhancers with 6000 synthetic enhancers and 6000 synthetic non-enhancers produced through a generative model.

Model

[22]
Some weights of the model checkpoint at InstaDeepAI/nucleotide-transformer-500m-human-ref were not used when initializing EsmForSequenceClassification: ['lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.dense.bias', 'lm_head.dense.weight', 'lm_head.decoder.weight', 'lm_head.bias']
- This IS expected if you are initializing EsmForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing EsmForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of EsmForSequenceClassification were not initialized from the model checkpoint at InstaDeepAI/nucleotide-transformer-500m-human-ref and are newly initialized: ['classifier.out_proj.bias', 'classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Dataset loading and preparation

[23]
WARNING:datasets.builder:Found cached dataset nucleotide_transformer_downstream_tasks (/root/.cache/huggingface/datasets/InstaDeepAI___nucleotide_transformer_downstream_tasks/enhancers_types/0.0.0/4a78b0644424e03fb4f26af3966a46e57ea50a1132ab8bb2f63b7808ce6a8772)
WARNING:datasets.builder:Found cached dataset nucleotide_transformer_downstream_tasks (/root/.cache/huggingface/datasets/InstaDeepAI___nucleotide_transformer_downstream_tasks/enhancers_types/0.0.0/4a78b0644424e03fb4f26af3966a46e57ea50a1132ab8bb2f63b7808ce6a8772)
[24]
[25]

Tokenizing the datasets

[26]
[27]
Map:   0%|          | 0/13471 [00:00<?, ? examples/s]
Map:   0%|          | 0/1497 [00:00<?, ? examples/s]
Map:   0%|          | 0/400 [00:00<?, ? examples/s]

Fine-tuning and evaluation

As with the promoter task, the hyperparameters introduced here are different from the ones used in the paper, since we are training the whole model. A further hyperparameter search will likely improve the performance on the task! We initialize our TrainingArguments. These control the various training hyperparameters and will be passed to our Trainer.

[28]

Here, the metric used to evaluate the model is the Matthews Correlation Coefficient (MCC), which is more informative than accuracy when the classes in the dataset are unbalanced. We can load a predefined function from the scikit-learn library.
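To see why MCC is preferred over accuracy on unbalanced data, here is a dependency-free sketch of the multiclass formula. It is written for illustration and is not the notebook's code, which uses the scikit-learn function; `multiclass_mcc` is a hypothetical name.

```python
import math
from collections import Counter


def multiclass_mcc(labels, preds):
    """Matthews correlation coefficient computed from label and prediction counts."""
    s = len(labels)                                     # total number of samples
    c = sum(1 for y, p in zip(labels, preds) if y == p) # correctly predicted samples
    t = Counter(labels)                                 # times each class truly occurs
    p = Counter(preds)                                  # times each class is predicted
    classes = set(t) | set(p)

    numerator = c * s - sum(t[k] * p[k] for k in classes)
    denominator = math.sqrt(
        (s * s - sum(p[k] ** 2 for k in classes))
        * (s * s - sum(t[k] ** 2 for k in classes))
    )
    # By convention the score is 0 when the denominator vanishes.
    return numerator / denominator if denominator else 0.0


# Unbalanced toy data: always predicting the majority class.
toy_labels = [0] * 9 + [1]
toy_preds = [0] * 10
accuracy = sum(y == p for y, p in zip(toy_labels, toy_preds)) / len(toy_labels)
mcc = multiclass_mcc(toy_labels, toy_preds)
```

On this toy data the accuracy is 0.9 even though the classifier never detects the minority class, while the MCC is 0.0, the score of an uninformative predictor.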

[29]
[30]

We can now finetune our model by just calling the train method:

[31]
/usr/local/lib/python3.10/dist-packages/transformers/optimization.py:411: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(

As with the first task, the training time can be greatly reduced by increasing the batch size.

Validation MCC score

[35]
[36]

MCC on the test dataset

[37]
MCC score on the test dataset: 0.39976156802970286

For the enhancer types prediction task, we obtain an MCC of 0.40 after 1000 training steps, which already beats the baseline against which the Nucleotide Transformer is compared (0.395). This is still, however, 8.5 percentage points below its performance (0.485) in the Nucleotide Transformer paper. To match the paper results more closely, it will probably be necessary to increase the number of training steps. Also note that the paper used a parameter-efficient fine-tuning method called IA3, whereas in this notebook we fine-tuned the entire model for simplicity.

Conclusion

This notebook showcases the simple approach required to fine-tune a Nucleotide Transformer model on any classification task. For the sake of simplicity, a standard approach has been used in which the whole model is fine-tuned on the new dataset, which differs from the parameter-efficient approach used to tackle the downstream tasks presented in the paper!