ModernBERT
Installation
Unsloth
We will be using FastSentenceTransformer to fine-tune ModernBERT.
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2026.1.2: Fast Modernbert patching. Transformers: 4.57.3.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.5.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.33.post1. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
We now add LoRA adapters so we only need to update a small number of parameters!
Setting task_type to FEATURE_EXTRACTION
Unsloth: Making `model.base_model.model` require gradients
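To see why LoRA trains only a fraction of the weights: a rank-r adapter on a linear layer with d_in inputs and d_out outputs adds r x (d_in + d_out) parameters. The sketch below applies that formula. The layer shapes (22 encoder layers, hidden size 768, MLP width 1152), the four adapted projections, and the rank of 64 are all assumptions, not values read from this notebook's code. They were picked because they are one configuration that happens to match the trainable-parameter count reported in the training banner.

```python
# Assumed (in_features, out_features) of the adapted linear layers in ONE
# encoder layer. These shapes are assumptions about a ModernBERT-base-sized
# model; the notebook's real target modules are not shown in this text.
ASSUMED_TARGETS = {
    "attn.Wqkv": (768, 2304),  # fused query/key/value projection
    "attn.Wo": (768, 768),     # attention output projection
    "mlp.Wi": (768, 2304),     # gated MLP input projection (2 x 1152)
    "mlp.Wo": (1152, 768),     # MLP output projection
}
ASSUMED_NUM_LAYERS = 22  # assumption
ASSUMED_RANK = 64        # assumption: LoRA rank r


def lora_params(d_in, d_out, rank):
    """Parameters added by one adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def total_lora_params(targets, num_layers, rank):
    """Sum the adapter parameters over every adapted layer in the encoder."""
    per_layer = sum(lora_params(d_in, d_out, rank) for d_in, d_out in targets.values())
    return per_layer * num_layers


trainable = total_lora_params(ASSUMED_TARGETS, ASSUMED_NUM_LAYERS, ASSUMED_RANK)
print(f"{trainable:,}")  # 13,516,800 under these assumptions
```

A different rank or set of target modules changes the total proportionally, so treat the match as a plausibility check rather than a confirmation of the actual configuration.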
Let's take a look at the dataset structure:
Dataset examples:
{'anchor': '.308', 'positive': 'The .308 Winchester is a popular rifle cartridge used for hunting and target shooting.'}
{'anchor': '.308', 'positive': 'Many precision rifles are chambered in .308 for its excellent long-range accuracy.'}
{'anchor': '.308', 'positive': 'The sniper selected a .308 caliber round for the mission.'}
{'anchor': '.338 lapua', 'positive': 'The .338 Lapua Magnum is a high-powered cartridge designed for extreme long-range shooting.'}
{'anchor': '.338 lapua', 'positive': 'Military snipers often use .338 Lapua for engagements beyond 1000 meters.'}
{'anchor': '.338 lapua', 'positive': 'The rifle was chambered in .338 Lapua for maximum effective range.'}
Baseline Inference
Let's test the base model before fine-tuning to see how it performs on our specific domain.
--- Pre-Training Results for query: 'apexification' ---
0.5200 | the weed whacker uses an engine that runs on a mixture of gas and oil
0.4114 | a brick left by Yuki
0.4031 | apples are a tasty treat
0.3323 | a type of cancer treatment that uses drugs to boost the body's immune response
0.3159 | a plant hormone for regulating stress responses
0.2783 | induces root tip closure in non-vital teeth
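The scoring code for this test is not shown here. As a rough sketch of how such a ranking is usually produced, assuming the scores are cosine similarities between the query embedding and each candidate embedding, the helper below ranks candidates by similarity. The function names and the toy 3-dimensional vectors are hypothetical stand-ins for real model embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_candidates(query_vec, candidates):
    """candidates is a list of (text, vector); returns (score, text) pairs, best first."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in candidates]
    return sorted(scored, reverse=True)


# Toy vectors (hypothetical): real embeddings would come from model.encode(...).
query = [1.0, 0.0, 0.0]
candidates = [
    ("orthogonal doc", [0.0, 1.0, 0.0]),
    ("doc pointing the same way", [2.0, 0.0, 0.0]),
    ("opposite doc", [-1.0, 0.0, 0.0]),
]

ranked = rank_candidates(query, candidates)
for score, text in ranked:
    print(f"{score:.4f} | {text}")
```

Cosine similarity ignores vector length, which is why the candidate with twice the magnitude still scores exactly 1.0 against the query.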
GPU = Tesla T4. Max memory = 14.741 GB. 0.348 GB of memory reserved.
Let's train the model! To resume a training run, call trainer.train(resume_from_checkpoint = True).
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 106,628 | Num Epochs = 2 | Total steps = 834
O^O/ \_/ \    Batch size per device = 256 | Gradient accumulation steps = 1
\        /    Data Parallel GPUs = 1 | Total batch size (256 x 1 x 1) = 256
 "-____-"     Trainable parameters = 13,516,800 of 162,531,072 (8.32% trained)
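The figures in this banner follow from simple arithmetic, sketched below. One assumption is made: the final partial batch of each epoch is kept rather than dropped, which is why the steps per epoch are rounded up.

```python
import math

# Values copied from the training banner.
num_examples = 106_628
batch_size_per_device = 256
grad_accum_steps = 1
num_gpus = 1
num_epochs = 2
trainable_params = 13_516_800
total_params = 162_531_072

# Effective batch size = per-device batch x accumulation steps x GPUs.
total_batch = batch_size_per_device * grad_accum_steps * num_gpus

# Assumption: the last, smaller batch of each epoch still counts as a step.
steps_per_epoch = math.ceil(num_examples / total_batch)
total_steps = steps_per_epoch * num_epochs

trainable_pct = 100 * trainable_params / total_params

print(total_batch, steps_per_epoch, total_steps, f"{trainable_pct:.2f}%")
# 256 417 834 8.32%
```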
633.9506 seconds used for training.
10.57 minutes used for training.
Peak reserved memory = 1.387 GB.
Peak reserved memory for training = 1.039 GB.
Peak reserved memory % of max memory = 9.409 %.
Peak reserved memory for training % of max memory = 7.048 %.
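These statistics can be re-derived from three reported numbers: the memory reserved before training (0.348 GB), the peak reserved memory (1.387 GB), and the GPU's maximum memory (14.741 GB). The sketch below uses hard-coded values from the output above. In a live session they would typically come from calls such as torch.cuda.max_memory_reserved(), which is an assumption about how the notebook measures them.

```python
# Values copied from the notebook output.
max_memory_gb = 14.741
start_reserved_gb = 0.348   # reserved before training started
peak_reserved_gb = 1.387    # peak reserved during the whole session
train_seconds = 633.9506

# Memory attributable to training is the peak minus what was already reserved.
training_gb = round(peak_reserved_gb - start_reserved_gb, 3)
peak_pct = round(100 * peak_reserved_gb / max_memory_gb, 3)
training_pct = round(100 * training_gb / max_memory_gb, 3)
train_minutes = round(train_seconds / 60, 2)

print(f"{train_minutes} minutes used for training.")
print(f"Peak reserved memory for training = {training_gb} GB.")
print(f"Peak reserved memory % of max memory = {peak_pct} %.")
print(f"Peak reserved memory for training % of max memory = {training_pct} %.")
```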
--- Post-Training Results for query: 'apexification' ---
0.2245 | induces root tip closure in non-vital teeth
0.1402 | a type of cancer treatment that uses drugs to boost the body's immune response
0.0677 | apples are a tasty treat
0.0352 | the weed whacker uses an engine that runs on a mixture of gas and oil
-0.0091 | a plant hormone for regulating stress responses
-0.0785 | a brick left by Yuki
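The absolute scores drop after training, so the useful comparison is where the correct definition ranks and how far it sits from the runner-up. The sketch below computes both from the scores reported in the two result listings. The helper names are illustrative and not part of the notebook.

```python
TARGET = "induces root tip closure in non-vital teeth"

# Scores copied from the pre-training and post-training outputs.
pre = {
    "the weed whacker uses an engine that runs on a mixture of gas and oil": 0.5200,
    "a brick left by Yuki": 0.4114,
    "apples are a tasty treat": 0.4031,
    "a type of cancer treatment that uses drugs to boost the body's immune response": 0.3323,
    "a plant hormone for regulating stress responses": 0.3159,
    TARGET: 0.2783,
}
post = {
    TARGET: 0.2245,
    "a type of cancer treatment that uses drugs to boost the body's immune response": 0.1402,
    "apples are a tasty treat": 0.0677,
    "the weed whacker uses an engine that runs on a mixture of gas and oil": 0.0352,
    "a plant hormone for regulating stress responses": -0.0091,
    "a brick left by Yuki": -0.0785,
}


def target_rank(scores, target):
    """1-based position of the target when candidates are sorted by score, best first."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return ordered.index(target) + 1


def top_margin(scores):
    """Gap between the best and second-best score."""
    best, runner_up = sorted(scores.values(), reverse=True)[:2]
    return best - runner_up


print("rank before:", target_rank(pre, TARGET))   # 6 (last)
print("rank after: ", target_rank(post, TARGET))  # 1 (first)
print(f"margin after: {top_margin(post):.4f}")    # 0.0843
```

A single query is only an anecdote. A fuller evaluation would average rank-based metrics over a held-out set of queries.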
To load the LoRA adapters we just saved for inference, change False to True:
Saving to float16 for vLLM
We also support saving to float16 directly. Select merged_16bit for float16 or merged_4bit for int4. Saving only the LoRA adapters is available as a fallback. Use push_to_hub_merged to upload to your Hugging Face account! You can get a personal access token at https://huggingface.co/settings/tokens.
GGUF / llama.cpp Conversion
We now support saving to GGUF / llama.cpp natively! We clone llama.cpp and save to q8_0 by default. Other quantization methods such as q4_k_m are also supported. Use save_pretrained_gguf for local saving and push_to_hub_gguf for uploading to HF.
Some supported quant methods (full list on our Wiki page):
q8_0 - Fast conversion. High resource use, but generally acceptable.
q4_k_m - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
q5_k_m - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.