Classifier


Train a classifier with Mistral 7B

In this tutorial, we will see how to leverage Mistral 7B to train a classifier. We need:

  • The mistral codebase: https://github.com/mistralai/mistral-inference
  • The pretrained model and its tokenizer: https://docs.mistral.ai/llm/mistral-v0.1
  • A dataset to train our classifier

We will train and evaluate the classifier on Symptom2Disease, a public classification dataset from Kaggle provided in a CSV format: https://www.kaggle.com/datasets/niyarrbarman/symptom2disease/

Set paths

[1]

Imports

[2]

Load the model and tokenizer

[3]

The tokenizer is a critical component of the model that segments input text into word-pieces.

For instance, the sentence "Roasted barramundi fish" is encoded as ['▁Ro', 'asted', '▁barr', 'am', 'und', 'i', '▁fish'].

The tokenizer's vocabulary contains 32,000 subwords, and each word is decomposed into known word-pieces, so unknown words never occur.

Load the dataset

We reload the Kaggle dataset from the disk. Each sample is composed of a sentence and a label. The dataset contains 24 different labels.

[4]
Reloaded 1200 samples with 24 labels.
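Assuming the Kaggle CSV exposes `label` and `text` columns, the loading step might look like the sketch below; `load_symptom2disease` and the column names are assumptions for illustration, not the notebook's actual code:

```python
import pandas as pd

def load_symptom2disease(csv_path):
    """Load the Symptom2Disease CSV into parallel lists of sentences and labels.
    Assumes the file has 'label' and 'text' columns, as in the Kaggle dataset."""
    df = pd.read_csv(csv_path)
    sentences = df["text"].tolist()
    labels = df["label"].tolist()
    print(f"Reloaded {len(sentences)} samples with {len(set(labels))} labels.")
    return sentences, labels
```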

The task is to classify a symptom description, for instance "I have a dry cough that never stops.", into one of the 24 disease labels (Acne, Arthritis, Bronchial Asthma, Cervical spondylosis, Chicken pox, Common Cold, etc.).

Embed data points in the dataset

We will now learn a linear classifier on frozen features provided by Mistral 7B. In particular, each sentence in the dataset will be tokenized and fed to the model.

If the input sentence is composed of N tokens, the model output will be a list of N vectors of dimension d=4096, where d is the dimensionality of the model. These vectors are then averaged along the dimension 0 to get a vector of size d.

Finally, the vectors of all sentences in the dataset are concatenated into a matrix of shape (D, d) where D is the number of samples in the dataset (in particular, D=1200).
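The pooling described above can be sketched with NumPy; `fake_outputs` below is a random stand-in for the real per-token Mistral activations, not an actual forward pass:

```python
import numpy as np

def mean_pool(token_vectors):
    """Average N token vectors of dimension d into a single d-dimensional vector."""
    return token_vectors.mean(axis=0)

d = 4096  # model dimensionality
rng = np.random.default_rng(0)
# Stand-in for real model outputs: one (N_i, d) array per sentence.
fake_outputs = [rng.normal(size=(n, d)) for n in (7, 12, 5)]

# One embedding per sentence, stacked into a (D, d) feature matrix.
features = np.stack([mean_pool(v) for v in fake_outputs])
print(features.shape)  # (3, 4096)
```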

[5]
1200it [00:25, 46.65it/s]

Plot t-SNE embeddings

Now that we have one vector/embedding for each sample in our dataset, we can visualize them with t-SNE. t-SNE reduces the dimensionality of high-dimensional data while preserving the underlying structure and relationships between data points, which can reveal patterns that would be difficult to discern in the original high-dimensional space.

In the graph below, we assign a different color to each class in the dataset. Several distinct clusters are clearly visible.
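The projection can be sketched with scikit-learn's TSNE; the random matrix below stands in for the real 1200 x 4096 feature matrix (a smaller sample keeps the sketch fast):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Stand-in features: 60 samples of dimension 4096 (the real matrix is 1200 x 4096).
X = rng.normal(size=(60, 4096))

# Project to 2-D; perplexity must be smaller than the number of samples.
emb = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(X)
print(emb.shape)  # (60, 2)
```

With the real features, each 2-D point would then be scattered with a color per class, e.g. with matplotlib.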

[6]
Output

Create a train / test split

For the sake of demonstration, we will shuffle the dataset and split it into a train set and a test set containing 80% and 20% of the data, respectively.

[7]
Train set : 960 samples
Test set  : 240 samples
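A split along these lines could be produced with scikit-learn's train_test_split; the placeholder features and labels below stand in for the real embeddings:

```python
from sklearn.model_selection import train_test_split

X = list(range(1200))               # placeholder features
y = [i % 24 for i in range(1200)]   # placeholder labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=0
)
print(f"Train set : {len(X_train)} samples")
print(f"Test set  : {len(X_test)} samples")
```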

Normalize features

It is usually recommended to normalize features to improve the performance and stability of the classifier. In particular, we ensure that each feature in the training set has a mean of 0 and a standard deviation of 1. This can be done with scikit-learn's StandardScaler:

[8]
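A minimal sketch of the normalization step, using random matrices in place of the Mistral features; note that the scaler is fit on the training set only, then applied to both splits:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=3.0, size=(960, 8))  # stand-in features
X_test = rng.normal(loc=5.0, scale=3.0, size=(240, 8))

# Fit on the training set only, then apply the same statistics to the test set.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)
print(X_train_s.mean(axis=0).round(6))  # ~0 for every feature
```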

Train a classifier and compute the test accuracy

We can now train the classifier. We will use scikit-learn's LogisticRegression.

The default solver is lbfgs, which should give an accuracy of 98.33% on this particular train/test split.

You can try different algorithms and hyperparameters. In a real-life scenario, you can often obtain better results by using k-fold cross-validation to validate the accuracy and the choice of hyperparameters.

[9]
Precision: 98.33%
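A runnable sketch of this step, using synthetic well-separated blobs in place of the normalized Mistral features (so the exact accuracy will differ from the notebook's):

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic 24-class data as a stand-in for the normalized Mistral features.
X, y = make_blobs(n_samples=1200, centers=24, n_features=16, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)  # lbfgs by default
acc = clf.score(X_test, y_test)
print(f"Accuracy: {acc:.2%}")
```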

Classify a single example

Below is an example showing how to classify new samples, and how to print the probabilities of the top-5 predicted labels.

[10]
[11]
 99.65%  Migraine
  0.20%  Dengue
  0.08%  Cervical spondylosis
  0.05%  Jaundice
  0.02%  allergy
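The top-5 printout can be sketched with predict_proba; the labels and synthetic data below are stand-ins for the real classifier and features:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

labels = ["Acne", "Arthritis", "Dengue", "Jaundice", "Migraine", "allergy"]
X, y = make_blobs(n_samples=600, centers=6, n_features=8, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X, y)

# Class probabilities for one new sample, sorted descending; keep the top 5.
probs = clf.predict_proba(X[:1])[0]
top5 = np.argsort(probs)[::-1][:5]
for i in top5:
    print(f"{probs[i]:7.2%}  {labels[i]}")
```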

Second implementation: Zero shot

Below is another classification method which does not require any training set, nor training a classifier on top of Mistral. However, it is much slower at prediction time, and it is not expected to work as well as the previous method when a training dataset is available.

The method consists in prompting the model with the following text:

Symptoms: {symptom}
Disease:

and evaluating the probability that the model assigns to each of the possible labels. This method is considerably slower, because you need to compute the probability of every label.
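The loop over labels can be sketched as follows. Here `label_log_prob` is a hypothetical stand-in: the real version would sum the model's log-probabilities of the label tokens given the prompt, whereas this toy heuristic only makes the sketch runnable:

```python
import math

def label_log_prob(prompt, label):
    """Hypothetical stand-in for model scoring: the real version would sum
    Mistral's log-probabilities of the label tokens given the prompt."""
    # Toy heuristic: score by word overlap between prompt and label.
    overlap = len(set(prompt.lower().split()) & set(label.lower().split()))
    return math.log(1 + overlap)

def zero_shot_classify(symptom, labels):
    prompt = f"Symptoms: {symptom}\nDisease:"
    # Score every candidate label and return the most probable one.
    scores = {lab: label_log_prob(prompt, lab) for lab in labels}
    return max(scores, key=scores.get)

print(zero_shot_classify("itchy skin and acne breakouts", ["Acne", "Migraine"]))
```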

[12]
240it [00:14, 16.82it/s]
Accuracy: 30.42%

Even though this approach did not require any training set, the accuracy is only 30.4%, compared to 98.3% for the previous, faster method.
