Nucleotide Transformer DNA Sequence Modelling
If you're opening this notebook on Colab, you will probably need to install 🤗 Transformers as well as some other libraries. Uncomment the following cell and run it.
If you're opening this notebook locally, make sure your environment has the latest versions of those libraries installed.
To be able to share your model with the community and generate results like the one shown in the picture below via the inference API, there are a few more steps to follow.
First, you have to log in to the Hugging Face Hub.
Then you need to install Git-LFS. Uncomment the following instructions:
Reading package lists... Done Building dependency tree Reading state information... Done git-lfs is already the newest version (2.9.2-1). 0 upgraded, 0 newly installed, 0 to remove and 16 not upgraded.
Fine-tuning the Nucleotide Transformer
The Nucleotide Transformer paper (Dalla-Torre et al., 2023) introduces four genomics foundation models developed by InstaDeep. These transformers, of various sizes and trained on different datasets, provide powerful representations of DNA sequences that make it possible to tackle a very diverse set of problems, such as chromatin accessibility, deleteriousness prediction, and promoter and enhancer prediction. These representations can be extracted from the transformer and used as proxies for the DNA sequences (this is called probing), or the transformer can be trained further on a specific task (this is called fine-tuning).
This notebook allows you to fine-tune these models.
The model we are going to use is the 500M Human Ref model, a 500M-parameter transformer pre-trained on the human reference genome, following the training methodology presented in the Nucleotide Transformer paper. It is one of the four models introduced, all available on the InstaDeep Hugging Face page:
| Model name | Num layers | Num parameters | Training dataset |
|---------------------|------------|----------------|------------------------|
| `500M Human Ref` | 24 | 500M | Human reference genome |
| `500M 1000G` | 24 | 500M | 1000G genomes |
| `2.5B 1000G` | 32 | 2.5B | 1000G genomes |
| `2.5B Multispecies` | 32 | 2.5B | Multi-species dataset |
Note that using the larger models will require more GPU memory and lead to longer fine-tuning times.
In the following, we showcase the Nucleotide Transformer's ability to classify genomic sequences on two basic tasks: promoter prediction and enhancer type prediction. Both are classification tasks, but the enhancer types task is much more challenging with its 3 classes.
These two tasks are still very basic, but the Nucleotide Transformer models have been shown to match or beat state-of-the-art models on much more complex tasks, such as the one addressed by DeepSEA, which predicts 919 chromatin profiles from a diverse set of human cells and tissues given a single DNA sequence, or the one addressed by DeepSTARR, which predicts an enhancer's activity.
Importing required packages
Import and install
Prepare and create the model for fine-tuning
The nucleotide transformer will be fine-tuned on two classification tasks: promoter and enhancer types classification.
The AutoModelForSequenceClassification module automatically loads the model and adds a simple classification head on top of the final embeddings.
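As a minimal sketch of this loading step, the helper below wraps `AutoModelForSequenceClassification.from_pretrained`. The checkpoint name is the one that appears in this notebook's model-loading log; the helper name `load_classifier` and the `NUM_LABELS` table are illustrative assumptions, not part of the original notebook.

```python
# Checkpoint name as it appears in this notebook's model-loading log.
CHECKPOINT = "InstaDeepAI/nucleotide-transformer-500m-human-ref"

# Number of output classes per task, as described in this notebook
# (promoter vs. non-promoter; strong, weak or non-enhancer).
NUM_LABELS = {"promoter": 2, "enhancer_types": 3}


def load_classifier(task, checkpoint=CHECKPOINT):
    """Load the pre-trained model with a new classification head (hypothetical helper)."""
    # Imported inside the function so that defining the helper has no side
    # effects; calling it downloads the 500M-parameter weights.
    from transformers import AutoModelForSequenceClassification

    return AutoModelForSequenceClassification.from_pretrained(
        checkpoint, num_labels=NUM_LABELS[task]
    )
```

Calling `load_classifier("promoter")` would return a model whose classification head is randomly initialized, which is why it has to be trained before being used for predictions.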
First task : Promoter prediction
Promoter prediction is a sequence classification problem, in which the DNA sequence is predicted to be either a promoter or not.
A promoter is a region of DNA where transcription of a gene is initiated. Promoters are a vital component of expression vectors because they control the binding of RNA polymerase to DNA. RNA polymerase transcribes DNA to mRNA, which is ultimately translated into a functional protein.
This task was introduced in DeePromoter, where a set of TATA and non-TATA promoters was gathered. A negative sequence was generated from each promoter by randomly sampling subsets of the sequence, to guarantee that some obvious motifs were present in both the positive and negative datasets.
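To make the idea of building a negative from a promoter more concrete, here is a rough sketch. It is not the DeePromoter code: the function name `make_negative`, the number of chunks, and the number of replaced chunks are all assumptions chosen for illustration. The sequence is cut into equal chunks, a random subset of chunks is replaced by random nucleotides, and the remaining chunks are kept, so that motifs inside them survive in the negative.

```python
import random


def make_negative(sequence, n_chunks=20, n_replaced=12, seed=0):
    """Build a negative example from a promoter (illustrative sketch only)."""
    rng = random.Random(seed)
    chunk_len = len(sequence) // n_chunks
    chunks = [sequence[i * chunk_len:(i + 1) * chunk_len] for i in range(n_chunks)]
    tail = sequence[n_chunks * chunk_len:]  # leftover nucleotides, kept unchanged

    # Replace a random subset of chunks with random nucleotides.
    for idx in rng.sample(range(n_chunks), n_replaced):
        chunks[idx] = "".join(rng.choice("ACGT") for _ in range(chunk_len))

    return "".join(chunks) + tail


promoter = "ACGT" * 75  # toy 300-nucleotide sequence, not real data
negative = make_negative(promoter)
print(len(promoter), len(negative))
```

The untouched chunks are what lets the positive and negative sets share motifs, which forces a classifier to rely on more than the presence of a single pattern.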
Dataset loading and preparation
WARNING:datasets.builder:Found cached dataset nucleotide_transformer_downstream_tasks_public (/root/.cache/huggingface/datasets/InstaDeepAI___nucleotide_transformer_downstream_tasks_public/promoter_all/0.0.0/d649d80b49e7b062da8a12a4d80a5d636571467e76a0a036d89078ffded1e5c9)
Let us have a look at the data. If we extract the last sequence of the dataset, we see that it is indeed a promoter, as its label is 1. Furthermore, we can also see that it is a TATA promoter, as the TATA motif starts at index 221 (0-based) of the sequence!
The DNA sequence is CACACCAGACAAAATTTGGTTAATTTGCGCCCAATATTCATTACTTTGACCTAACCTTTGTTCTGAAGGCCGTGTACAAGGACAAGGCCCTGAGATTATTGCAACAGTAACTTGAAAAACTTTCAGAAGTCTATTCTGTAGGATTAAAGGAATGCTGAGACTATTCAAGTTTGAAGTCCTGGGGGTGGGGAAAAATAAAAAACCTGTGCTAGAAAGCTTAGTATAGCATGTAACTTTAGAGTCCTGTGGAGTCCTGAGTCTCCCACAGACCAGAACAGTCATTTAAAAGTTTTCAGGAAA. Its associated label is label 1. This promoter is a TATA promoter, as the TATA motif is present at the 221th nucleotide.
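The motif check can be reproduced with a plain string search. The sketch below copies the sequence and label printed above; the variable names are illustrative. `str.find` returns the 0-based index of the first match, which is where the value 221 comes from.

```python
# Sequence and label copied from the output above (300 nucleotides).
sequence = (
    "CACACCAGACAAAATTTGGTTAATTTGCGCCCAATATTCATTACTTTGAC"
    "CTAACCTTTGTTCTGAAGGCCGTGTACAAGGACAAGGCCCTGAGATTATT"
    "GCAACAGTAACTTGAAAAACTTTCAGAAGTCTATTCTGTAGGATTAAAGG"
    "AATGCTGAGACTATTCAAGTTTGAAGTCCTGGGGGTGGGGAAAAATAAAA"
    "AACCTGTGCTAGAAAGCTTAGTATAGCATGTAACTTTAGAGTCCTGTGGA"
    "GTCCTGAGTCTCCCACAGACCAGAACAGTCATTTAAAAGTTTTCAGGAAA"
)
label = 1

# 0-based index of the first "TATA" occurrence, or -1 if the motif is absent.
tata_index = sequence.find("TATA")

if tata_index >= 0:
    print(f"TATA motif found at index {tata_index} (label {label})")
else:
    print("No TATA motif found")
```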
Tokenizing the datasets
All inputs to neural nets must be numerical. The process of converting strings into numerical indices suitable for a neural net is called tokenization.
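The following toy example illustrates the idea. It is a simplified stand-in, not the tokenizer shipped with the model: it assumes non-overlapping 6-mers with leftover nucleotides kept as single-character tokens, and it builds its vocabulary on the fly, whereas the real tokenizer uses a fixed vocabulary file and special tokens. The function names are hypothetical.

```python
def kmer_tokenize(sequence, k=6):
    """Split a DNA string into non-overlapping k-mers (toy sketch)."""
    tokens = []
    i = 0
    while i + k <= len(sequence):
        tokens.append(sequence[i:i + k])
        i += k
    # Nucleotides that do not fill a whole k-mer become single-letter tokens.
    tokens.extend(sequence[i:])
    return tokens


def build_vocab(token_lists):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for tokens in token_lists:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


tokens = kmer_tokenize("ATGCGTACGTTAGC")  # 14 nucleotides
vocab = build_vocab([tokens])
input_ids = [vocab[token] for token in tokens]
print(tokens)
print(input_ids)
```

In the notebook itself, this step is handled by the tokenizer downloaded alongside the model, so the ids it produces will differ from the ones in this toy vocabulary.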
Map: 0%| | 0/50612 [00:00<?, ? examples/s]
Map: 0%| | 0/2664 [00:00<?, ? examples/s]
Map: 0%| | 0/5920 [00:00<?, ? examples/s]
Fine-tuning and evaluation
The hyperparameters introduced here are different from the ones used in the paper, since we are training the whole model. A further hyperparameter search will likely improve performance on the task.
We initialize our TrainingArguments. These control the various training hyperparameters, and will be passed to our Trainer.
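A hedged sketch of such a configuration is shown below. Only the batch size (8) and the number of training steps (1000) come from this notebook; the output directory, the learning rate, and the helper name `make_training_arguments` are placeholders to adapt.

```python
# Values marked "assumed" are illustrative and not taken from the notebook.
HPARAMS = {
    "output_dir": "nt-promoter-finetune",  # assumed directory name
    "per_device_train_batch_size": 8,      # batch size stated in this notebook
    "max_steps": 1000,                     # training steps stated in this notebook
    "learning_rate": 1e-5,                 # assumed value
}


def make_training_arguments(hparams=HPARAMS):
    """Build a TrainingArguments object from a plain dict (hypothetical helper)."""
    # Imported here so that the dict above can be inspected without the library.
    from transformers import TrainingArguments

    return TrainingArguments(**hparams)
```

Keeping the values in a plain dictionary makes it easy to log them or to sweep over them during a hyperparameter search.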
Next, we define the metric we will use to evaluate our models and write a compute_metrics function. We can load this from the scikit-learn library.
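To show what such a function computes, here is a dependency-free sketch. The notebook relies on scikit-learn for the metric; the version below re-implements binary F1 by hand purely for illustration, and the function names and the `"f1_score"` key are assumptions. It takes a pair of (logits, labels), picks the highest-scoring class for each example, and compares it with the label.

```python
def argmax(row):
    """Index of the largest value in a sequence of scores."""
    return max(range(len(row)), key=row.__getitem__)


def binary_f1(y_true, y_pred, positive=1):
    """F1 = 2*TP / (2*TP + FP + FN) for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def compute_metrics(eval_pred):
    """Turn (logits, labels) into a metrics dict, the shape a Trainer expects."""
    logits, labels = eval_pred
    predictions = [argmax(row) for row in logits]
    return {"f1_score": binary_f1(list(labels), predictions)}


# Toy logits for four sequences and their true labels (made-up numbers).
toy_logits = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]]
toy_labels = [1, 0, 0, 1]
print(compute_metrics((toy_logits, toy_labels)))
```

In the toy example there is one true positive, one false positive and one false negative, so the F1 score is 2 / (2 + 1 + 1) = 0.5.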
We can now finetune our model by just calling the train method:
/usr/local/lib/python3.10/dist-packages/transformers/optimization.py:411: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning warnings.warn(
Note that the fine-tuning is done with a small batch size (8). The time needed to process a given number of examples can be reduced by increasing the batch size, as larger batches make better use of GPU parallelism.
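A quick back-of-the-envelope calculation shows how little data such a run covers. The sketch assumes that the 50612 examples reported in the tokenization output are the training split, that there is no gradient accumulation, and that a single device is used; the function name is illustrative.

```python
def examples_seen(max_steps, batch_size, grad_accum=1, n_devices=1):
    """Number of training examples processed over a whole run."""
    return max_steps * batch_size * grad_accum * n_devices


train_size = 50612  # assumed to be the training split size
seen = examples_seen(max_steps=1000, batch_size=8)
fraction = seen / train_size
print(f"{seen} examples seen, {fraction:.1%} of one epoch")
```

Under these assumptions, 1000 steps at batch size 8 cover 8000 examples, which is about 15.8% of a single pass over the training data.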
Validation F1 score
F1 score on the test dataset
F1 score on the test dataset: 0.9388458225667529
For the promoter prediction task, we obtain an F1 score of 0.938 after just 1000 training steps, which is already close to the performance reported in the article. To get closer to the 0.954 score obtained in the Nucleotide Transformer paper after 10,000 training steps, we would need to train for longer.
Second task : Enhancer type prediction
In this section, we fine-tune the Nucleotide Transformer model on enhancer type prediction, which consists of classifying a DNA sequence as a strong enhancer, a weak enhancer, or a non-enhancer.
In genetics, an enhancer is a short (50–1500 bp) region of DNA that can be bound by proteins (activators) to increase the likelihood that transcription of a particular gene will occur.
The dataset used here was introduced in "A deep learning framework for enhancer prediction using word embedding and sequence generation", which augments an original set of enhancers with 6000 synthetic enhancers and 6000 synthetic non-enhancers produced by a generative model.
Model
Some weights of the model checkpoint at InstaDeepAI/nucleotide-transformer-500m-human-ref were not used when initializing EsmForSequenceClassification: ['lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.dense.bias', 'lm_head.dense.weight', 'lm_head.decoder.weight', 'lm_head.bias'] - This IS expected if you are initializing EsmForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model). - This IS NOT expected if you are initializing EsmForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model). Some weights of EsmForSequenceClassification were not initialized from the model checkpoint at InstaDeepAI/nucleotide-transformer-500m-human-ref and are newly initialized: ['classifier.out_proj.bias', 'classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.weight'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Dataset loading and preparation
WARNING:datasets.builder:Found cached dataset nucleotide_transformer_downstream_tasks (/root/.cache/huggingface/datasets/InstaDeepAI___nucleotide_transformer_downstream_tasks/enhancers_types/0.0.0/4a78b0644424e03fb4f26af3966a46e57ea50a1132ab8bb2f63b7808ce6a8772)
Tokenizing the datasets
Map: 0%| | 0/13471 [00:00<?, ? examples/s]
Map: 0%| | 0/1497 [00:00<?, ? examples/s]
Map: 0%| | 0/400 [00:00<?, ? examples/s]
Fine-tuning and evaluation
As with the promoter task, the hyperparameters introduced here are different from the ones used in the paper, since we are training the whole model. A further hyperparameter search will likely improve performance on the task.
We initialize our TrainingArguments. These control the various training hyperparameters, and will be passed to our Trainer.
Here, the metric used to evaluate the model is the Matthews Correlation Coefficient (MCC), which is more informative than accuracy when the classes in the dataset are imbalanced. We can load a predefined function from the scikit-learn library.
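For readers who want to see what the metric measures, the sketch below re-implements the multi-class MCC from its standard definition using only the standard library. It is meant as an illustration of the formula rather than a replacement for the scikit-learn function, and the toy labels are made up.

```python
import math
from collections import Counter


def matthews_corrcoef(y_true, y_pred):
    """Multi-class Matthews correlation coefficient, a value in [-1, 1]."""
    s = len(y_true)                                  # total number of samples
    c = sum(t == p for t, p in zip(y_true, y_pred))  # correctly predicted samples
    t_k = Counter(y_true)                            # true count per class
    p_k = Counter(y_pred)                            # predicted count per class
    classes = set(t_k) | set(p_k)

    numerator = c * s - sum(t_k[k] * p_k[k] for k in classes)
    denominator = math.sqrt(
        (s ** 2 - sum(v * v for v in p_k.values()))
        * (s ** 2 - sum(v * v for v in t_k.values()))
    )
    # By convention, return 0 when the coefficient is undefined.
    return numerator / denominator if denominator else 0.0


# Toy example with three classes (made-up labels).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
score = matthews_corrcoef(y_true, y_pred)
print(f"MCC: {score:.3f}")
```

A model that always predicts the same class gets an MCC of 0 regardless of how frequent that class is, which is why the metric is more telling than accuracy on imbalanced data.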
We can now finetune our model by just calling the train method:
/usr/local/lib/python3.10/dist-packages/transformers/optimization.py:411: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning warnings.warn(
As with the first task, the training time can be greatly reduced by increasing the batch size.
Validation MCC score
MCC on the test dataset
MCC score on the test dataset: 0.39976156802970286
For the enhancer types prediction task, we obtain an MCC of 0.40 after 1000 training steps, which already beats the baseline against which the Nucleotide Transformer is compared (0.395). This is still, however, 8.5 percentage points below its performance (0.485) in the Nucleotide Transformer paper. To match the paper's results more closely, it will probably be necessary to increase the number of training steps. Also note that the paper used a parameter-efficient fine-tuning method called IA3, whereas in this notebook we fine-tuned the entire model for simplicity.
Conclusion
This notebook showcases the simple approach required to fine-tune a Nucleotide Transformer model on any classification task. For the sake of simplicity, a standard approach has been used in which the whole model is fine-tuned on the new dataset. This differs from the approach used to tackle the downstream tasks presented in the paper.