Qwen3 Embedding (0.6B)
Installation
Unsloth
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2026.1.4: Fast Qwen3 patching. Transformers: 4.56.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.5.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.33.post1. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
We now add LoRA adapters so that only a small fraction of the parameters needs to be updated!
Unsloth: Making `model.base_model.model` require gradients
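As a rough sanity check on how few parameters LoRA adds, the sketch below counts adapter weights for one plausible configuration. A rank-r adapter on a linear layer with d_in inputs and d_out outputs adds r * (d_in + d_out) weights. The layer shapes, the list of adapted layers, and the rank of 32 are all assumptions (based on the published Qwen3-0.6B configuration, not on this notebook's code cells); under them the count matches the 20,185,088 trainable parameters reported in the training log.

```python
# Assumed Qwen3-0.6B shapes (not stated in this notebook): hidden size 1024,
# 16 query heads and 8 key/value heads of dimension 128, MLP width 3072, 28 layers.
HIDDEN, Q_OUT, KV_OUT, MLP, LAYERS = 1024, 16 * 128, 8 * 128, 3072, 28

# (d_in, d_out) of each linear layer assumed to receive an adapter.
TARGETS = {
    "q_proj": (HIDDEN, Q_OUT),
    "k_proj": (HIDDEN, KV_OUT),
    "v_proj": (HIDDEN, KV_OUT),
    "o_proj": (Q_OUT, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

def lora_params(rank: int) -> int:
    """Adapter weights across all layers: rank * (d_in + d_out) per adapted matrix."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in TARGETS.values())
    return per_layer * LAYERS

TOTAL_PARAMS = 615_961_600          # total reported in the training log
trainable = lora_params(rank=32)    # rank 32 is an assumption
share = 100 * trainable / TOTAL_PARAMS
print(f"{trainable:,} trainable ({share:.2f}% of {TOTAL_PARAMS:,})")
```

Changing the rank or the set of adapted layers scales the count linearly, which makes it easy to estimate the cost of a different adapter setup before training.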
Let's take a look at the dataset structure:
Dataset examples:
{'anchor': '.308', 'positive': 'The .308 Winchester is a popular rifle cartridge used for hunting and target shooting.'}
{'anchor': '.308', 'positive': 'Many precision rifles are chambered in .308 for its excellent long-range accuracy.'}
{'anchor': '.308', 'positive': 'The sniper selected a .308 caliber round for the mission.'}
{'anchor': '.338 lapua', 'positive': 'The .338 Lapua Magnum is a high-powered cartridge designed for extreme long-range shooting.'}
{'anchor': '.338 lapua', 'positive': 'Military snipers often use .338 Lapua for engagements beyond 1000 meters.'}
{'anchor': '.338 lapua', 'positive': 'The rifle was chambered in .338 Lapua for maximum effective range.'}
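Each record pairs a short anchor term with one positive sentence, and the same anchor appears in several records. A minimal sketch of reading such records and grouping the positives by anchor, assuming the data is stored as JSON Lines (one JSON object per line); the `group_by_anchor` helper is hypothetical and not part of the notebook:

```python
import json
from collections import defaultdict

# Three records copied from the examples above, written as JSON Lines.
sample_jsonl = """\
{"anchor": ".308", "positive": "The .308 Winchester is a popular rifle cartridge used for hunting and target shooting."}
{"anchor": ".308", "positive": "The sniper selected a .308 caliber round for the mission."}
{"anchor": ".338 lapua", "positive": "Military snipers often use .338 Lapua for engagements beyond 1000 meters."}
"""

def group_by_anchor(jsonl_text):
    """Map each anchor to the list of positive sentences paired with it."""
    groups = defaultdict(list)
    for line in jsonl_text.splitlines():
        if line.strip():  # skip blank lines
            record = json.loads(line)
            groups[record["anchor"]].append(record["positive"])
    return dict(groups)

groups = group_by_anchor(sample_jsonl)
for anchor, positives in groups.items():
    print(f"{anchor}: {len(positives)} positive(s)")
```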
Baseline Inference
Let's test the base model before fine-tuning to see how it performs on our specific domain.
--- Pre-Training Results for query: 'apexification' ---
0.5435 | apples are a tasty treat
0.5285 | induces root tip closure in non-vital teeth
0.5110 | a brick left by Yuki
0.5044 | the weed whacker uses an engine that runs on a mixture of gas and oil
0.4734 | a plant hormone for regulating stress responses
0.4162 | a type of cancer treatment that uses drugs to boost the body's immune response
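The scores above are similarity values between the query embedding and each candidate embedding, sorted from highest to lowest. A minimal sketch of that ranking step, assuming cosine similarity is the metric (the notebook's scoring call is not shown in this text). The tiny 3-dimensional vectors are made up for illustration and stand in for real model embeddings; `rank_by_cosine` is a hypothetical helper.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def rank_by_cosine(query_vec, candidates):
    """Return (score, text) pairs sorted from most to least similar."""
    scored = [(cosine(query_vec, vec), text) for text, vec in candidates.items()]
    return sorted(scored, reverse=True)

# Made-up toy embeddings; a real run would take these from the model.
query = [1.0, 0.0, 0.0]
candidates = {
    "induces root tip closure in non-vital teeth": [0.9, 0.1, 0.0],
    "apples are a tasty treat": [0.1, 0.9, 0.2],
    "a brick left by Yuki": [0.0, 0.2, 0.9],
}

ranking = rank_by_cosine(query, candidates)
for score, text in ranking:
    print(f"{score:.4f} | {text}")
```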
GPU = Tesla T4. Max memory = 14.741 GB. 1.256 GB of memory reserved.
Let's train the model! To resume a training run, call trainer.train(resume_from_checkpoint = True).
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 106,628 | Num Epochs = 2 | Total steps = 834
O^O/ \_/ \    Batch size per device = 256 | Gradient accumulation steps = 1
\        /    Data Parallel GPUs = 1 | Total batch size (256 x 1 x 1) = 256
 "-____-"     Trainable parameters = 20,185,088 of 615,961,600 (3.28% trained)
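The step count in the banner follows from the dataset size, the batch size, and the number of epochs. A quick check of that arithmetic, assuming the final partial batch of each epoch still counts as a step:

```python
import math

num_examples = 106_628
per_device_batch = 256
grad_accum_steps = 1
num_gpus = 1
epochs = 2

# Effective batch size per optimizer step, as printed in the banner (256 x 1 x 1).
total_batch = per_device_batch * grad_accum_steps * num_gpus

# 106,628 / 256 is about 416.5, so the last batch of each epoch is only partly full.
steps_per_epoch = math.ceil(num_examples / total_batch)
total_steps = steps_per_epoch * epochs

print(total_batch, steps_per_epoch, total_steps)
```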
916.4403 seconds used for training.
15.27 minutes used for training.
Peak reserved memory = 13.932 GB.
Peak reserved memory for training = 12.676 GB.
Peak reserved memory % of max memory = 94.512 %.
Peak reserved memory for training % of max memory = 85.991 %.
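These figures are derived from a handful of raw measurements. A sketch of the arithmetic using the values printed in this run; the variable names are illustrative, and the memory attributed to training is taken as the peak minus the amount already reserved before training started:

```python
max_memory_gb = 14.741
start_reserved_gb = 1.256    # reserved before training began
peak_reserved_gb = 13.932    # peak reserved over the whole run
train_seconds = 916.4403

# Memory attributable to training itself.
training_reserved_gb = round(peak_reserved_gb - start_reserved_gb, 3)

# Both quantities as a share of the GPU's total memory.
peak_pct = round(peak_reserved_gb / max_memory_gb * 100, 3)
training_pct = round(training_reserved_gb / max_memory_gb * 100, 3)

train_minutes = round(train_seconds / 60, 2)

print(training_reserved_gb, peak_pct, training_pct, train_minutes)
```

At roughly 94.5 % of the T4's memory, this run leaves little headroom, so a larger batch size or longer sequences would likely need a smaller adapter or gradient accumulation.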
--- Post-Training Results for query: 'apexification' ---
0.2650 | induces root tip closure in non-vital teeth
0.1967 | a type of cancer treatment that uses drugs to boost the body's immune response
0.0913 | the weed whacker uses an engine that runs on a mixture of gas and oil
0.0846 | apples are a tasty treat
0.0555 | a plant hormone for regulating stress responses
0.0027 | a brick left by Yuki
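One way to read the two result lists is to track where the dental passage (the only candidate that actually describes apexification) lands, and how far it sits from the best distractor. The sketch below does that with the scores printed in this notebook; the `rank_and_margin` helper is illustrative.

```python
RELEVANT = "induces root tip closure in non-vital teeth"

pre = {
    "apples are a tasty treat": 0.5435,
    RELEVANT: 0.5285,
    "a brick left by Yuki": 0.5110,
    "the weed whacker uses an engine that runs on a mixture of gas and oil": 0.5044,
    "a plant hormone for regulating stress responses": 0.4734,
    "a type of cancer treatment that uses drugs to boost the body's immune response": 0.4162,
}
post = {
    RELEVANT: 0.2650,
    "a type of cancer treatment that uses drugs to boost the body's immune response": 0.1967,
    "the weed whacker uses an engine that runs on a mixture of gas and oil": 0.0913,
    "apples are a tasty treat": 0.0846,
    "a plant hormone for regulating stress responses": 0.0555,
    "a brick left by Yuki": 0.0027,
}

def rank_and_margin(scores, relevant):
    """Return the 1-based rank of `relevant` and its score minus the best other score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    rank = ordered.index(relevant) + 1
    best_other = max(s for text, s in scores.items() if text != relevant)
    return rank, round(scores[relevant] - best_other, 4)

pre_rank, pre_margin = rank_and_margin(pre, RELEVANT)
post_rank, post_margin = rank_and_margin(post, RELEVANT)
print(f"before: rank {pre_rank}, margin {pre_margin:+.4f}")
print(f"after:  rank {post_rank}, margin {post_margin:+.4f}")
```

Before fine-tuning the relevant passage trails an unrelated sentence; afterwards it ranks first with a clear gap, and the absolute scores of the distractors drop sharply. A single query is only an illustration, so a held-out evaluation set would be needed to quantify the improvement.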
To load the LoRA adapters we just saved for inference, change False to True:
Saving to float16 for vLLM
We also support saving to float16 directly. Select merged_16bit for float16 or merged_4bit for int4. Saving only the LoRA adapters is available as a fallback. Use push_to_hub_merged to upload to your Hugging Face account; you can create a personal token at https://huggingface.co/settings/tokens. See our docs for more deployment options.
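A hedged sketch of how these options might be wired together. It assumes the trained model exposes `save_pretrained_merged(directory, tokenizer, save_method=...)` and `push_to_hub_merged(repo, tokenizer, save_method=..., token=...)`, as in other Unsloth notebooks, and that `"lora"` is the name of the adapter-only fallback. The `export_merged` helper and the target names in the usage comments are hypothetical.

```python
# Save methods named in the text, plus "lora" as the assumed adapter-only fallback.
ALLOWED_SAVE_METHODS = ("merged_16bit", "merged_4bit", "lora")

def export_merged(model, tokenizer, target, save_method="merged_16bit", token=None):
    """Save locally when no token is given, otherwise push to the Hugging Face Hub.

    `target` is a local directory for saving or a repository name for pushing.
    """
    if save_method not in ALLOWED_SAVE_METHODS:
        raise ValueError(f"unknown save_method: {save_method!r}")
    if token is None:
        model.save_pretrained_merged(target, tokenizer, save_method=save_method)
    else:
        model.push_to_hub_merged(target, tokenizer, save_method=save_method, token=token)

# Usage (needs the trained model and tokenizer in memory, so left commented out):
# export_merged(model, tokenizer, "model")                                  # local float16
# export_merged(model, tokenizer, "your_name/model", token="YOUR_HF_TOKEN")  # upload
```

Validating the method name up front avoids starting a long merge only to fail on a typo.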
GGUF / llama.cpp Conversion
GGUF / llama.cpp export is supported natively. We clone llama.cpp and save to q8_0 by default; other methods such as q4_k_m are also available. Use save_pretrained_gguf for local saving and push_to_hub_gguf for uploading to Hugging Face.
Some supported quant methods (full list on our Wiki page):
q8_0 - Fast conversion. High resource use, but generally acceptable.
q4_k_m - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
q5_k_m - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.
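A small sketch of choosing a quantization method before exporting. It assumes the model exposes `save_pretrained_gguf(directory, tokenizer, quantization_method=...)`, as in other Unsloth notebooks. The `export_gguf` helper is hypothetical, and its lookup table covers only the three methods described here rather than the full supported list.

```python
# Notes for the three methods described above; this is not the full list.
QUANT_NOTES = {
    "q8_0": "Fast conversion. High resource use, but generally acceptable.",
    "q4_k_m": "Recommended. Q6_K for half of attention.wv and feed_forward.w2, else Q4_K.",
    "q5_k_m": "Recommended. Q6_K for half of attention.wv and feed_forward.w2, else Q5_K.",
}
DEFAULT_QUANT = "q8_0"  # the default named in the text

def export_gguf(model, tokenizer, directory, quantization_method=DEFAULT_QUANT):
    """Save a GGUF file locally after checking the method against the table above."""
    if quantization_method not in QUANT_NOTES:
        raise ValueError(f"method not in this sketch's table: {quantization_method!r}")
    print(f"{quantization_method}: {QUANT_NOTES[quantization_method]}")
    model.save_pretrained_gguf(directory, tokenizer, quantization_method=quantization_method)

# Usage (needs the trained model and tokenizer in memory, so left commented out):
# export_gguf(model, tokenizer, "model", quantization_method="q4_k_m")
```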