Orpheus (3B) TTS
Installation
Unsloth
FastModel supports loading nearly any model now! This includes Vision and Text models!
Thank you to Etherl for creating this notebook!
==((====))==  Unsloth 2025.3.19: Fast Llama patching. Transformers: 4.50.0.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
We now add LoRA adapters so we only need to update 1 to 10% of all parameters!
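To see where a figure in the 1 to 10% range can come from, here is a rough sketch that counts LoRA parameters for a Llama-style 3B model. The layer shapes, the 28-layer depth, the rank of 64, and the choice of seven adapted projections are all assumptions, not values stated in this notebook. With those assumptions, the count matches the `Trainable parameters = 97,255,424` line printed by the trainer.

```python
# Assumed (in_features, out_features) for each adapted projection.
# These shapes are NOT given in the notebook; they are illustrative.
ASSUMED_PROJECTIONS = {
    "q_proj": (3072, 3072),
    "k_proj": (3072, 1024),
    "v_proj": (3072, 1024),
    "o_proj": (3072, 3072),
    "gate_proj": (3072, 8192),
    "up_proj": (3072, 8192),
    "down_proj": (8192, 3072),
}
ASSUMED_NUM_LAYERS = 28  # assumption
ASSUMED_RANK = 64        # assumption


def lora_param_count(projections, num_layers, rank):
    """Each adapted matrix gains A (rank x in) and B (out x rank)."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in projections.values())
    return per_layer * num_layers


trainable = lora_param_count(ASSUMED_PROJECTIONS, ASSUMED_NUM_LAYERS, ASSUMED_RANK)
fraction = trainable / 3_000_000_000  # nominal 3B base size used in the training banner
print(f"{trainable:,} trainable parameters ({fraction:.2%} of 3B)")
```

Lowering the rank or adapting fewer projections shrinks the trainable fraction proportionally.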
Data Prep
We will use the MrDragonFox/Elise dataset, which is designed for training TTS models. Ensure that your dataset follows the required format: text, audio for single-speaker models, or source, text, audio for multi-speaker models. You can modify this section to accommodate your own dataset, but keep the same field structure, because the preprocessing steps depend on it.
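Before preprocessing your own data, it can help to check that every row has the fields listed above. The helper below is a hypothetical sketch (`missing_fields` is not part of Unsloth or `datasets`), and the sample rows are made up for illustration.

```python
def missing_fields(example, multi_speaker=False):
    """Return the required field names that are absent or None in one row."""
    required = ["text", "audio"]
    if multi_speaker:
        required = ["source"] + required
    return [name for name in required if example.get(name) is None]


# Made-up rows; a real row would hold decoded audio from your dataset.
rows = [
    {"text": "Hello there.", "audio": {"array": [0.0, 0.1], "sampling_rate": 24000}},
    {"text": "No audio here."},
]

for index, row in enumerate(rows):
    print(index, missing_fields(row), missing_fields(row, multi_speaker=True))
```

Rows that report missing fields should be fixed or filtered out before the tokenization step.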
*** You can modify the text prompt here.
If you are training a multi-speaker model (e.g., canopylabs/orpheus-3b-0.1-ft),
ensure that the dataset includes a "source" field and format the input accordingly:
- Single-speaker: f"{example['text']}"
- Multi-speaker: f"{example['source']}: {example['text']}"
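The two prompt formats above can be wrapped in one small function. This is a minimal sketch; `build_text_prompt` and the sample values are hypothetical names chosen for illustration, while the f-string templates are the ones listed above.

```python
def build_text_prompt(example, multi_speaker=False):
    """Format the text that will be tokenized for one training example."""
    if multi_speaker:
        # Multi-speaker: prefix the text with the speaker name from "source".
        return f"{example['source']}: {example['text']}"
    # Single-speaker: use the text as-is.
    return f"{example['text']}"


sample = {"source": "speaker_a", "text": "Good morning!"}  # made-up example
print(build_text_prompt(sample))
print(build_text_prompt(sample, multi_speaker=True))
```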
Train the model
Now let's use the Hugging Face Trainer! More docs here: Transformers docs. We do 60 steps to speed things up, but for a full run you can set num_train_epochs=1 and turn off the step limit with max_steps=None.
Note: Using per_device_train_batch_size > 1 may lead to errors in a multi-GPU setup. To avoid issues, ensure CUDA_VISIBLE_DEVICES is set to a single GPU (e.g., CUDA_VISIBLE_DEVICES=0).
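One way to apply this from inside a notebook is to set the variable in Python. This is a sketch that assumes the cell runs before `torch` or any other CUDA-using library is imported, since the variable is read when CUDA initializes; setting it later may have no effect.

```python
import os

# Restrict this process to the first GPU. Run this before importing torch.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

print(os.environ["CUDA_VISIBLE_DEVICES"])
```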
GPU = Tesla T4. Max memory = 14.741 GB. 5.713 GB of memory reserved.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 1,195 | Num Epochs = 1 | Total steps = 298
O^O/ \_/ \    Batch size per device = 1 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (1 x 4 x 1) = 4
 "-____-"     Trainable parameters = 97,255,424/3,000,000,000 (3.24% trained)
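The batch size and step count in this banner follow from simple arithmetic. The sketch below is a hypothetical helper, not the trainer's actual code; it assumes that a trailing, incomplete group of accumulated batches is not counted as an optimizer step, which reproduces the logged values.

```python
def training_schedule(num_examples, per_device_batch, grad_accum, num_gpus, epochs=1):
    """Return (effective batch size, total optimizer steps)."""
    total_batch = per_device_batch * grad_accum * num_gpus
    # Batches seen per epoch across all GPUs (ceiling division).
    batches_per_epoch = -(-num_examples // (per_device_batch * num_gpus))
    # One optimizer step per full group of accumulated batches (assumption).
    steps_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return total_batch, steps_per_epoch * epochs


total_batch, total_steps = training_schedule(
    num_examples=1195, per_device_batch=1, grad_accum=4, num_gpus=1
)
print(total_batch, total_steps)
```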
Hey there my name is Elise, <giggles> and I'm a speech generation model that can sound like a person.
('lora_model/tokenizer_config.json',
 'lora_model/special_tokens_map.json',
 'lora_model/tokenizer.json')
Saving to float16
We also support saving to float16 directly. Select merged_16bit for float16 or merged_4bit for int4. Saving only the LoRA adapters is also available as a fallback. Use push_to_hub_merged to upload to your Hugging Face account! You can get your personal access tokens at https://huggingface.co/settings/tokens.
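For intuition about what a merged save contains, here is a tiny pure-Python sketch of the standard LoRA merge rule, W_merged = W + (alpha / r) * B A. It is not Unsloth's implementation and ignores 4-bit dequantization and dtype conversion; the matrices and the `merge_lora` helper are made up for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Fold a LoRA update into the base weight: W + (alpha / rank) * B @ A."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)  # (out x rank) @ (rank x in) -> (out x in)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


W = [[1.0, 0.0], [0.0, 1.0]]  # base weight, out=2 x in=2
A = [[1.0, 2.0]]              # rank=1 x in=2
B = [[0.5], [0.25]]           # out=2 x rank=1

merged = merge_lora(W, A, B, alpha=2, rank=1)
print(merged)
```

After merging, the adapter matrices are no longer needed at inference time, which is why a merged checkpoint can be loaded like an ordinary model.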
Unsloth: You have 1 CPUs. Using `safe_serialization` is 10x slower.
We shall switch to Pytorch saving, which might take 3 minutes and not 30 minutes.
To force `safe_serialization`, set it to `None` instead.
Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded
model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab.
Unsloth: Will remove a cached repo with size 15.1G
Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 3.99 out of 12.67 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...
100%|██████████| 28/28 [00:01<00:00, 27.83it/s]
Unsloth: Saving tokenizer... Done.
Unsloth: Saving model/pytorch_model-00001-of-00002.bin...
Unsloth: Saving model/pytorch_model-00002-of-00002.bin...
Done.