Human Genome ML
Modeling the human genome at lightning speed
The aim of this notebook is to build a classification model that predicts a gene's family from the DNA of its coding sequence.
Gene families are collections of genes that are evolutionarily related, having originated from a common ancestral gene. These genes typically share similar DNA sequences and often retain related biological functions, although they may also evolve new roles over time. Within a gene family, the individual members can be classified based on their evolutionary history into two main types: paralogs and orthologs.
Paralogs are genes that arise through duplication events within the same genome. After duplication, these genes may diverge in function, allowing organisms to develop new traits or regulatory mechanisms. For example, two paralogous genes in humans may perform distinct but complementary roles in cellular processes.
Orthologs, on the other hand, are genes found in different species that evolved from a common ancestral gene due to speciation events. These genes typically retain similar functions across species and are valuable for comparative genomics studies. For instance, a gene in humans and its ortholog in mice may both play essential roles in embryonic development.
Understanding gene families and the relationships between paralogs and orthologs is crucial for studying gene evolution, function, and the conservation of biological processes across different organisms.
Imports
We will be using GPUs throughout the notebook.
- We will use the zero-code-change pandas accelerator (cudf.pandas) so that ordinary pandas code runs on the GPU.
- To featurize our dataset, we will use transformer models from Hugging Face.
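The notebook enables the accelerators with the `%load_ext cudf.pandas` and `%load_ext cuml.accel` magics. For a plain Python script, a rough programmatic equivalent might look like the sketch below. It assumes the `install()` entry points that RAPIDS provides for both accelerators; the function name `enable_gpu_accelerators` is hypothetical, and the sketch falls back silently when the packages or a GPU are unavailable.

```python
def enable_gpu_accelerators():
    """Try to switch on the RAPIDS zero-code-change accelerators.

    Returns a dict mapping accelerator name to True/False.
    Call this before importing pandas or scikit-learn.
    """
    status = {}
    try:
        import cudf.pandas
        cudf.pandas.install()          # pandas calls are dispatched to cuDF
        status["cudf.pandas"] = True
    except Exception:                  # package missing or no usable GPU
        status["cudf.pandas"] = False
    try:
        import cuml.accel
        cuml.accel.install()           # scikit-learn style estimators go to cuML
        status["cuml.accel"] = True
    except Exception:
        status["cuml.accel"] = False
    return status


status = enable_gpu_accelerators()
print(status)
```

If an accelerator is reported as `False`, the rest of the code still runs, only on the CPU.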
The cudf.pandas extension is already loaded. To reload it, use: %reload_ext cudf.pandas
The cuml.accel extension is already loaded. To reload it, use: %reload_ext cuml.accel
Let's get the data
The data contains two columns:
- The DNA sequence: a string of the bases adenine (A), thymine (T), cytosine (C), and guanine (G). A complete genome sequence lists all of an organism's genetic material, covering its genes as well as non-coding regions that help regulate gene activity or have other structural and functional roles; each row of this dataset holds the coding sequence of a single gene.
- Class label.
In this dataset, each class label stands for one of the following gene families:
| Class label | Gene family | Description |
|---|---|---|
| 0 | G protein-coupled receptors | Cell surface proteins that detect external signals and activate internal cellular responses by interacting with G proteins. |
| 1 | Tyrosine kinase | An enzyme that transfers phosphate groups to specific tyrosine residues on proteins, often initiating signal transduction pathways. |
| 2 | Tyrosine phosphatase | An enzyme that removes phosphate groups from phosphorylated tyrosine residues, thereby regulating or turning off signaling pathways. |
| 3 | Synthetase | An enzyme that catalyzes the joining of two molecules using energy from ATP, often involved in biosynthesis processes like attaching amino acids to tRNA. |
| 4 | Synthase | An enzyme that catalyzes the formation of a chemical bond between molecules to synthesize a compound, typically without requiring energy from ATP. |
| 5 | Ion channel | A gene-encoded protein that forms pores in cell membranes, allowing specific ions to pass in and out, which is critical for cellular signaling and homeostasis. |
| 6 | Transcription factor | A protein that binds to specific DNA sequences to regulate the transcription of genes, controlling when and how much a gene is expressed. |
Let's check the class frequencies.
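The original cell is not shown; with pandas it was presumably something like a `value_counts()` call on the class column followed by a plot. As a library-free sketch of the same idea, the snippet below counts labels with `collections.Counter`. The `labels` list is made-up illustration data, not the notebook's dataset.

```python
from collections import Counter

# Hypothetical labels standing in for the dataset's class column
labels = [6, 6, 4, 3, 6, 0, 1, 6, 4, 2, 5, 6]

counts = Counter(labels)
total = len(labels)

# Print the classes in numeric order with count, share, and a text bar
for cls in sorted(counts):
    share = counts[cls] / total
    print(f"class {cls}: {counts[cls]:3d}  {share:6.1%}  {'#' * counts[cls]}")
```

Checking the class balance first matters because a dominant class can inflate accuracy without the model learning the smaller families.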
[Plot: class frequencies; x-axis: class]
Tokenize the gene sequences to create embeddings
We will use the Nucleotide Transformer from Hugging Face to convert the sequences of adenine, thymine, cytosine, and guanine into numerical representations (embeddings). Because the whole dataset would not fit into memory at once, we process it in batches. This can take quite a while.
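The batching and embedding step could be sketched as follows. Several things here are assumptions rather than the notebook's actual code: the helper names (`batched`, `load_model`, `embed_batch`), the checkpoint name (chosen because its hidden size of 1280 and 1000-token limit match the numbers reported in this notebook), and mean pooling over the last hidden layer as the way to get one vector per sequence. The Hugging Face and PyTorch imports are done lazily so that the batching helper runs on its own.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def load_model(name="InstaDeepAI/nucleotide-transformer-500m-human-ref"):
    """Load tokenizer and model; the checkpoint name is an assumption."""
    from transformers import AutoModelForMaskedLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForMaskedLM.from_pretrained(name).eval()
    return tokenizer, model


def embed_batch(sequences, tokenizer, model, max_length=1000):
    """Return one mean-pooled embedding per sequence as a NumPy array."""
    import torch

    tokens = tokenizer(
        sequences,
        return_tensors="pt",
        padding=True,
        truncation=True,        # cut inputs longer than max_length tokens
        max_length=max_length,
    ).to(model.device)
    with torch.no_grad():
        out = model(**tokens, output_hidden_states=True)
    hidden = out.hidden_states[-1]                              # (batch, tokens, dim)
    mask = tokens["attention_mask"].unsqueeze(-1).to(hidden.dtype)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)       # skip padding tokens
    return pooled.cpu().numpy()


# Batching demo on toy sequences (no model needed)
demo = ["ATGC", "GGCA", "TTAA", "CCGT", "ATAT"]
batches = list(batched(demo, batch_size=2))
print(batches)
```

Passing `truncation=True` is one way to avoid the indexing errors that the tokenizer warns about when a sequence exceeds the model's maximum length, at the cost of ignoring the tail of long sequences.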
Producing embeddings...: 0%| | 0/438 [00:00<?, ?it/s]
Token indices sequence length is longer than the specified maximum sequence length for this model (1257 > 1000). Running this sequence through the model will result in indexing errors
Now that we have the embeddings, let's create our final DataFrame.
Classifying the genes
Data transformation
To use scikit-learn models, we need to convert our embeddings into columns, one per element. Each embedding has 1280 elements.
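A minimal, library-free sketch of this reshaping step is shown below. The function name `embeddings_to_columns` and the `emb_` column prefix are hypothetical; with pandas, passing a list of equal-length lists to `pd.DataFrame` gives the same one-column-per-element layout in a single call.

```python
def embeddings_to_columns(embeddings, prefix="emb_"):
    """Turn equal-length vectors into one dict per row, one key per element."""
    width = len(embeddings[0])
    if any(len(vec) != width for vec in embeddings):
        raise ValueError("all embeddings must have the same length")
    names = [f"{prefix}{i}" for i in range(width)]
    return [dict(zip(names, vec)) for vec in embeddings]


# Toy 3-element embeddings; the notebook's vectors have 1280 elements each
rows = embeddings_to_columns([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])
print(rows[0])
```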
Random Forest classification model with all columns
We will build a Random Forest model using all 1280 embedding columns. But first, let's split our dataset into training and testing parts.
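The notebook uses scikit-learn's `train_test_split` for this step. The sketch below is a standard-library stand-in that shows what a shuffled hold-out split does. The 20% test share is an inference from the numbers reported in this notebook (830 test rows out of 4150), and the function name `holdout_split` and the seed are assumptions.

```python
import random


def holdout_split(X, y, test_size=0.2, seed=42):
    """Shuffle row indices and split them into train and test parts."""
    if len(X) != len(y):
        raise ValueError("X and y must have the same number of rows")
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)        # reproducible shuffle
    n_test = round(len(X) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test


# Toy data: 10 rows with 2 features each
X = [[i, i + 1] for i in range(10)]
y = [i % 2 for i in range(10)]
X_train, X_test, y_train, y_test = holdout_split(X, y)
print(len(X_train), len(X_test))
```

Fixing the seed keeps the split identical across runs, which makes the later model comparisons fair.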
Now we are ready to train our classifier.
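A training call could look like the sketch below. The hyperparameters (`n_estimators=100`, a fixed seed) are assumptions, since the original cell is not shown, and `train_random_forest` is a hypothetical wrapper. scikit-learn is imported inside the function so that the definition itself has no dependencies.

```python
def train_random_forest(X_train, y_train, n_estimators=100, seed=42):
    """Fit a Random Forest classifier and return the fitted model."""
    # Lazy import; when cuml.accel is loaded it may run this estimator on the GPU
    from sklearn.ensemble import RandomForestClassifier

    model = RandomForestClassifier(n_estimators=n_estimators, random_state=seed)
    model.fit(X_train, y_train)
    return model


# Usage, given a train/test split:
#   model = train_random_forest(X_train, y_train)
#   y_pred = model.predict(X_test)
```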
Let's see how well it does. To summarize the performance, let's define a helper function.
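The original helper is not shown. One possible version, written without any libraries, builds the confusion matrix and then derives accuracy plus support-weighted precision, recall, and F1 from it. The function names and the choice of support weighting are assumptions.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are actual classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix


def weighted_metrics(matrix):
    """Accuracy plus support-weighted precision, recall, and F1."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    precision = recall = f1 = 0.0
    for k in range(n):
        tp = matrix[k][k]
        support = sum(matrix[k])                        # rows that truly are class k
        n_predicted = sum(matrix[i][k] for i in range(n))  # rows predicted as class k
        p = tp / n_predicted if n_predicted else 0.0
        r = tp / support if support else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        weight = support / total
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    accuracy = sum(matrix[k][k] for k in range(n)) / total
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def report(y_true, y_pred, n_classes=7):
    """Print the confusion matrix and metrics, and return the metrics."""
    matrix = confusion_matrix(y_true, y_pred, n_classes)
    print("Confusion matrix (rows: actual, columns: predicted)")
    for row in matrix:
        print(" ".join(f"{value:4d}" for value in row))
    metrics = weighted_metrics(matrix)
    print(" ".join(f"{name} = {value:.3f}" for name, value in metrics.items()))
    return metrics


# Toy example with three classes
metrics = report([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0], n_classes=3)
```

Applied to the Random Forest confusion matrix reported below, `weighted_metrics` gives accuracy 0.719, precision 0.750, recall 0.719, and F1 0.712, which suggests that the notebook's metrics are support-weighted averages.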
Now we can see how well our model performs.
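Confusion matrix (rows: actual class, columns: predicted class)

| Actual \ Predicted | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| 0 | 64 | 3 | 0 | 2 | 6 | 0 | 14 |
| 1 | 3 | 69 | 0 | 9 | 12 | 0 | 24 |
| 2 | 3 | 4 | 35 | 5 | 10 | 0 | 15 |
| 3 | 0 | 2 | 0 | 77 | 16 | 0 | 26 |
| 4 | 6 | 3 | 0 | 9 | 99 | 0 | 29 |
| 5 | 5 | 2 | 0 | 2 | 8 | 19 | 5 |
| 6 | 1 | 2 | 0 | 3 | 4 | 0 | 234 |

accuracy = 0.719, precision = 0.750, recall = 0.719, f1 = 0.712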
An overall accuracy of about 72% is not bad for a seven-class problem, and precision (0.750), recall (0.719), and F1 (0.712) are in the same range. But the model uses 1280 feature columns. Let's try to reduce this using UMAP.
Random Forest classification with UMAP reduced columns
First, let's define the reducer. We will use 200 components, so we expect the final dataset to have 200 columns, a significant reduction from 1280.
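A reducer along these lines could be defined as in the sketch below. It assumes the `UMAP` class from the umap-learn package (which cuML can accelerate or replace with its own implementation); the wrapper name `reduce_with_umap` and the seed are hypothetical, and all other UMAP settings are left at their defaults because the original cell is not shown.

```python
def reduce_with_umap(X, n_components=200, seed=42):
    """Project X (rows x features) down to `n_components` columns with UMAP."""
    # Lazy import: requires umap-learn (or a GPU-accelerated drop-in)
    from umap import UMAP

    reducer = UMAP(n_components=n_components, random_state=seed)
    return reducer.fit_transform(X)


# Usage, given the 1280-column feature matrix X:
#   embedding = reduce_with_umap(X)
#   embedding.shape   # (number of rows, 200)
```

Note that fitting the reducer on all rows before splitting lets the test rows influence the projection; fitting on the training part only and calling `transform` on the test part is the stricter alternative.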
Building knn graph using brute force
(4150, 200)
As expected, we have 200 columns. Now let's build our model. Note that instead of the X variable, we now pass the embedding variable to the train_test_split function.
How well did we do?
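Confusion matrix (rows: actual class, columns: predicted class)

| Actual \ Predicted | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| 0 | 58 | 4 | 6 | 5 | 6 | 2 | 8 |
| 1 | 7 | 71 | 2 | 9 | 8 | 1 | 19 |
| 2 | 4 | 6 | 29 | 8 | 9 | 1 | 15 |
| 3 | 1 | 5 | 2 | 76 | 16 | 0 | 21 |
| 4 | 7 | 7 | 5 | 19 | 88 | 2 | 18 |
| 5 | 3 | 1 | 1 | 7 | 7 | 18 | 4 |
| 6 | 2 | 6 | 2 | 20 | 10 | 2 | 202 |

accuracy = 0.653, precision = 0.655, recall = 0.653, f1 = 0.648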
Accuracy, recall, and F1 each drop by about 6 to 7 percentage points, and precision by almost 10, in exchange for a much smaller feature set (200 columns instead of 1280). This may be a good trade-off. Let's see how XGBoost does with this dataset.
XGBoost classification with UMAP reduced columns
We will train the model using a GPU.
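A GPU training call could look like the sketch below. The 200 boosting rounds and the `mlogloss` metric on a single evaluation set are read off the training log; everything else is an assumption, including the wrapper name `train_xgboost`, the `device="cuda"` setting (available in XGBoost 2.0 and later), and the use of the test split as the evaluation set. Because the log shows the validation loss reaching its minimum at round 29 and rising afterwards, an optional `early_stopping_rounds` argument is included; it is not something the notebook is known to use.

```python
def train_xgboost(X_train, y_train, X_eval, y_eval,
                  n_estimators=200, device="cuda", early_stopping_rounds=None):
    """Fit a multiclass XGBoost model, logging mlogloss on the evaluation set."""
    from xgboost import XGBClassifier  # lazy import

    model = XGBClassifier(
        n_estimators=n_estimators,
        tree_method="hist",
        device=device,                  # "cuda" for GPU, "cpu" otherwise
        eval_metric="mlogloss",
        early_stopping_rounds=early_stopping_rounds,
    )
    # eval_set produces the per-round "validation_0-mlogloss" lines
    model.fit(X_train, y_train, eval_set=[(X_eval, y_eval)], verbose=True)
    return model


# Usage, given a train/test split:
#   model = train_xgboost(X_train, y_train, X_test, y_test)
#   y_pred = model.predict(X_test)
```

If early stopping is used, a separate validation split is preferable to the test split, so that the reported test metrics stay unbiased.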
[0] validation_0-mlogloss:1.68613
[1] validation_0-mlogloss:1.54409
[2] validation_0-mlogloss:1.44845
...
[28] validation_0-mlogloss:1.06873
[29] validation_0-mlogloss:1.06635
[30] validation_0-mlogloss:1.06773
...
[50] validation_0-mlogloss:1.07409
...
[100] validation_0-mlogloss:1.13668
...
[150] validation_0-mlogloss:1.19885
...
[198] validation_0-mlogloss:1.24788
[199] validation_0-mlogloss:1.24911
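Confusion matrix (rows: actual class, columns: predicted class)

| Actual \ Predicted | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| 0 | 63 | 4 | 4 | 4 | 6 | 0 | 8 |
| 1 | 6 | 71 | 1 | 7 | 10 | 1 | 21 |
| 2 | 4 | 2 | 37 | 7 | 8 | 0 | 14 |
| 3 | 4 | 4 | 4 | 72 | 21 | 1 | 15 |
| 4 | 7 | 10 | 6 | 17 | 86 | 1 | 19 |
| 5 | 3 | 1 | 0 | 5 | 10 | 17 | 5 |
| 6 | 5 | 4 | 5 | 18 | 12 | 4 | 196 |

accuracy = 0.653, precision = 0.656, recall = 0.653, f1 = 0.650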
The XGBoost model performs about the same as the Random Forest on the reduced dataset (accuracy 0.653 in both cases). Let's see if it can do better on the full dataset.
XGBoost classifier with full dataset
[0] validation_0-mlogloss:1.65992
[1] validation_0-mlogloss:1.49398
[2] validation_0-mlogloss:1.37583
...
[50] validation_0-mlogloss:0.80176
...
[100] validation_0-mlogloss:0.79641
...
[102] validation_0-mlogloss:0.79460
[103] validation_0-mlogloss:0.79422
[104] validation_0-mlogloss:0.79500
...
[150] validation_0-mlogloss:0.79910
...
[198] validation_0-mlogloss:0.80219
[199] validation_0-mlogloss:0.80259
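Confusion matrix (rows: actual class, columns: predicted class)

| Actual \ Predicted | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| 0 | 70 | 1 | 0 | 3 | 4 | 0 | 11 |
| 1 | 1 | 77 | 1 | 5 | 10 | 2 | 21 |
| 2 | 5 | 4 | 40 | 4 | 8 | 0 | 11 |
| 3 | 2 | 4 | 2 | 89 | 10 | 0 | 14 |
| 4 | 7 | 4 | 1 | 14 | 103 | 0 | 17 |
| 5 | 3 | 2 | 0 | 2 | 7 | 22 | 5 |
| 6 | 6 | 2 | 2 | 6 | 5 | 0 | 223 |

accuracy = 0.752, precision = 0.762, recall = 0.752, f1 = 0.747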
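This model does a bit better: its accuracy of 0.752 beats the 0.719 of the Random Forest trained on the same 1280 columns and the 0.653 of both models trained on the UMAP-reduced data. This suggests that it is worth keeping the full set of embedding columns to classify the genes.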