Spark TTS (0.5B)
Installation
Unsloth
FastModel supports loading nearly any model now! This includes Vision and Text models!
Thank you to Etherl for creating this notebook!
==((====))==  Unsloth 2025.3.19: Fast Qwen2 patching. Transformers: 4.51.3.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Float16 full finetuning uses more memory since we upcast weights to float32.
We now add LoRA adapters so we only need to update 1 to 10% of all parameters!
Unsloth: Making `model.base_model.model.model` require gradients
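To see where a figure like "1 to 10% of all parameters" can come from, here is a minimal sketch of the LoRA parameter arithmetic for a single weight matrix. The 896-wide square projection and the ranks are illustrative assumptions, not values taken from this notebook's configuration.

```python
def lora_trainable_fraction(d_in: int, d_out: int, rank: int) -> float:
    """Ratio of LoRA adapter parameters to the frozen weight they adapt."""
    full = d_in * d_out              # frozen base weight W (d_out x d_in)
    adapter = rank * (d_in + d_out)  # A is (rank x d_in), B is (d_out x rank)
    return adapter / full


# Hypothetical layer shape and ranks, chosen only for illustration.
for rank in (8, 32, 64):
    fraction = lora_trainable_fraction(896, 896, rank)
    print(f"rank={rank}: {fraction:.1%} of the base matrix")
```

The ratio grows linearly with the rank, so a higher rank trades memory savings for adapter capacity.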
Data Prep
We will use the MrDragonFox/Elise dataset, which is designed for training TTS models. Ensure that your dataset follows the required format: text, audio for single-speaker models, or source, text, audio for multi-speaker models. You can modify this section to accommodate your own dataset, but keeping this column structure is essential for training to work correctly.
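Before swapping in your own dataset, it can help to check the column layout described above. The helper below is a hypothetical sketch (the function name and constants are not part of any library); it only compares column names and does not inspect the audio itself.

```python
REQUIRED_SINGLE = {"text", "audio"}
REQUIRED_MULTI = {"source", "text", "audio"}


def missing_columns(column_names, multi_speaker=False):
    """Return the required columns that are absent, sorted by name."""
    required = REQUIRED_MULTI if multi_speaker else REQUIRED_SINGLE
    return sorted(required - set(column_names))


# A single-speaker layout passes; the same layout fails the multi-speaker check.
print(missing_columns(["text", "audio"]))
print(missing_columns(["text", "audio"], multi_speaker=True))
```

With a Hugging Face dataset you would typically pass its list of column names to this function.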
/usr/local/lib/python3.11/dist-packages/torch/nn/utils/weight_norm.py:143: FutureWarning: `torch.nn.utils.weight_norm` is deprecated in favor of `torch.nn.utils.parametrizations.weight_norm`.
  WeightNorm.apply(module, name, dim)
Missing tensor: mel_transformer.spectrogram.window
Missing tensor: mel_transformer.mel_scale.fb
Moving Bicodec model and Wav2Vec2Model to cpu.
Unsloth: Tokenizing ["text"] (num_proc=2): 0%| | 0/1195 [00:00<?, ? examples/s]
GPU = Tesla T4. Max memory = 14.741 GB. 5.713 GB of memory reserved.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 1,195 | Num Epochs = 1 | Total steps = 149
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 506,634,112/506,634,112 (100.00% trained)
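The numbers in the training banner can be reproduced with a little arithmetic. The sketch below assumes the trainer counts dataloader batches per epoch (rounding up) and then floor-divides by the gradient accumulation steps; this rounding convention is an assumption that happens to match the log and may differ between trainer versions.

```python
import math


def total_batch_size(per_device_batch, grad_accum, num_gpus=1):
    """Effective number of examples consumed per optimizer step."""
    return per_device_batch * grad_accum * num_gpus


def total_steps(num_examples, per_device_batch, grad_accum, num_gpus=1, epochs=1):
    """Optimizer steps under the assumed rounding convention."""
    batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_gpus))
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return updates_per_epoch * epochs


print(total_batch_size(2, 4, 1))  # 2 x 4 x 1
print(total_steps(1195, 2, 4))    # compare with "Total steps" in the log
```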
Generating speech for: 'Hey there my name is Elise, <giggles> and I'm a speech generation model that can sound like a person.'
Generating token sequence...
Token sequence generated.
Found 348 semantic tokens.
Found 32 global tokens.
Detokenizing audio tokens...
Detokenization complete.
Audio saved to generated_speech_controllable.wav
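The "semantic" and "global" counts in the log come from separating two kinds of audio tokens in the generated text. A minimal sketch of that step, assuming the tokens are spelled `<|bicodec_semantic_N|>` and `<|bicodec_global_N|>` (the spelling is an assumption here; check your tokenizer's added tokens):

```python
import re

# Assumed token spellings; adjust if your tokenizer uses different names.
SEMANTIC_RE = re.compile(r"<\|bicodec_semantic_(\d+)\|>")
GLOBAL_RE = re.compile(r"<\|bicodec_global_(\d+)\|>")


def split_audio_tokens(generated_text):
    """Return (semantic_ids, global_ids) parsed from the decoded model output."""
    semantic_ids = [int(n) for n in SEMANTIC_RE.findall(generated_text)]
    global_ids = [int(n) for n in GLOBAL_RE.findall(generated_text)]
    return semantic_ids, global_ids


sample = "<|bicodec_global_7|><|bicodec_global_3|><|bicodec_semantic_42|>"
semantic_ids, global_ids = split_audio_tokens(sample)
print(f"Found {len(semantic_ids)} semantic tokens.")
print(f"Found {len(global_ids)} global tokens.")
```

The two ID lists would then be handed to the audio detokenizer; that step depends on the codec model and is not shown.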
('lora_model/tokenizer_config.json',
 'lora_model/special_tokens_map.json',
 'lora_model/vocab.json',
 'lora_model/merges.txt',
 'lora_model/added_tokens.json',
 'lora_model/tokenizer.json')
Saving to float16
We also support saving the merged model to float16 directly. Select merged_16bit for float16 or merged_4bit for int4. Saving only the LoRA adapters is available as a fallback. Use push_to_hub_merged to upload to your Hugging Face account! You can create a personal access token at https://huggingface.co/settings/tokens.
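The options above might be wired together as in the sketch below. The wrapper `save_merged` and its `precision` argument are hypothetical; it assumes the model object exposes Unsloth's `save_pretrained_merged(directory, tokenizer, save_method=...)` method, and the `"lora"` value for adapter-only saving is likewise an assumption to verify against your installed version.

```python
# Maps a friendly name to the save_method strings mentioned in the text.
SAVE_METHODS = {
    "float16": "merged_16bit",
    "int4": "merged_4bit",
    "adapters": "lora",  # assumed name for the adapter-only fallback
}


def save_merged(model, tokenizer, out_dir, precision="float16"):
    """Save the model with the save_method that matches the requested precision."""
    if precision not in SAVE_METHODS:
        raise ValueError(f"unknown precision: {precision!r}")
    save_method = SAVE_METHODS[precision]
    # Assumes an Unsloth-patched model; other models may lack this method.
    model.save_pretrained_merged(out_dir, tokenizer, save_method=save_method)
    return save_method
```

Calling `save_merged(model, tokenizer, "model")` after training would write a float16 merged checkpoint to the `model` directory under these assumptions.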
Unsloth: You have 1 CPUs. Using `safe_serialization` is 10x slower.
We shall switch to Pytorch saving, which might take 3 minutes and not 30 minutes.
To force `safe_serialization`, set it to `None` instead.
Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab.
Unsloth: Will remove a cached repo with size 15.1G
Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 3.99 out of 12.67 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...
100%|██████████| 28/28 [00:01<00:00, 27.83it/s]
Unsloth: Saving tokenizer... Done.
Unsloth: Saving model/pytorch_model-00001-of-00002.bin...
Unsloth: Saving model/pytorch_model-00002-of-00002.bin...
Done.