Sesame CSM (1B) TTS
Installation
Unsloth
FastModel supports loading nearly any model now! This includes Vision and Text models!
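As a sketch, loading the model might look like the following (the call mirrors Unsloth's usual FastModel API; the sequence length and dtype values are illustrative assumptions, and load_in_4bit=False / full_finetuning=False match the 16-bit LoRA setup shown in the log below):

```python
from unsloth import FastModel

# Load the CSM-1B checkpoint with Unsloth's fast patching.
# Values here are illustrative, not prescriptive.
model, processor = FastModel.from_pretrained(
    model_name = "unsloth/csm-1b",
    max_seq_length = 2048,    # adjust to your text/audio lengths
    dtype = None,             # auto-detect; a T4 falls back to float16
    load_in_4bit = False,     # 16-bit LoRA rather than QLoRA
    full_finetuning = False,
)
```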
==((====))==  Unsloth 2025.5.4: Fast Csm patching. Transformers: 4.52.0.dev0.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: QLoRA and full finetuning all not selected. Switching to 16bit LoRA.
unsloth/csm-1b does not have a padding token! Will use pad_token = <|PAD_TOKEN|>.
We now add LoRA adapters so we only need to update 1 to 10% of all parameters!
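A minimal sketch of attaching the adapters (the call follows Unsloth's get_peft_model pattern; the rank, alpha, and target modules are assumptions you should adapt to your setup):

```python
from unsloth import FastModel

# Attach LoRA adapters so only a small fraction of weights are trained.
# r / lora_alpha / target_modules are illustrative choices.
model = FastModel.get_peft_model(
    model,                       # model returned by FastModel.from_pretrained
    r = 32,                      # LoRA rank: higher = more capacity, more VRAM
    lora_alpha = 32,
    lora_dropout = 0,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"],
    use_gradient_checkpointing = "unsloth",  # offload activations to save VRAM
)
```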
Unsloth: Making `model.base_model.model.backbone_model` require gradients
Data Prep
We will use the MrDragonFox/Elise dataset, which is designed for training TTS models. Ensure that your dataset follows the required format: text, audio for single-speaker models, or source, text, audio for multi-speaker models. You can adapt this section to your own dataset, but keeping the correct structure is essential for training.
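To make the expected row layout concrete, here is a small self-contained check (plain Python, no dataset download; the field names come from the format described above, and the sample rows are only illustrative):

```python
def validate_tts_rows(rows, multi_speaker=False):
    """Keep only rows that carry every field the training format expects."""
    required = {"source", "text", "audio"} if multi_speaker else {"text", "audio"}
    return [r for r in rows if required.issubset(r)]

# Illustrative rows shaped like a single-speaker dataset such as MrDragonFox/Elise.
rows = [
    {"text": "Hello there!", "audio": {"array": [0.0, 0.1], "sampling_rate": 24000}},
    {"text": "Missing the audio field"},  # dropped: no "audio" key
]
valid = validate_tts_rows(rows)
print(len(valid))  # → 1
```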
Train the model
Now let's use the Hugging Face Trainer! More docs here: Transformers docs. We do 60 steps to speed things up, but for a full run you can set num_train_epochs=1 and disable the step cap with max_steps=None.
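A sketch of the training setup (standard transformers Trainer arguments; the hyperparameter values mirror the run log below, while the dataset and collator names are placeholders for your own data prep):

```python
from transformers import Trainer, TrainingArguments

# Hyperparameters matching the logged run: batch 2 x grad-accum 4 = 8,
# capped at 60 optimizer steps for a quick demo.
args = TrainingArguments(
    per_device_train_batch_size = 2,
    gradient_accumulation_steps = 4,
    max_steps = 60,            # set to -1 and num_train_epochs=1 for a full run
    learning_rate = 2e-4,      # illustrative; tune for your data
    logging_steps = 1,
    optim = "adamw_8bit",
    output_dir = "outputs",
    report_to = "none",
)

trainer = Trainer(
    model = model,                  # the LoRA-wrapped model from above
    train_dataset = train_dataset,  # placeholder: your prepared dataset
    data_collator = data_collator,  # placeholder: pads text + audio features
    args = args,
)
trainer.train()
```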
GPU = Tesla T4. Max memory = 14.741 GB. 6.719 GB of memory reserved.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 400 | Num Epochs = 2 | Total steps = 60
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 14,516,224/1,646,616,385 (0.88% trained)
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
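The two headline numbers in that log can be re-derived with a couple of lines of arithmetic (pure Python; the values are copied from the log):

```python
# Effective batch size = per-device batch x gradient accumulation x data-parallel GPUs.
effective_batch = 2 * 4 * 1
print(effective_batch)  # → 8

# Fraction of weights actually trained by LoRA.
trainable, total = 14_516_224, 1_646_616_385
pct = 100 * trainable / total
print(f"{pct:.2f}%")  # → 0.88%
```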
Unsloth: Will smartly offload gradients to save VRAM!
313.8963 seconds used for training.
5.23 minutes used for training.
Peak reserved memory = 6.719 GB.
Peak reserved memory for training = 0.0 GB.
Peak reserved memory % of max memory = 45.58 %.
Peak reserved memory for training % of max memory = 0.0 %.
Voice and style consistency
Sesame CSM's power comes from providing audio context for each speaker. Let's pass a sample utterance from our dataset to ground speaker identity and style.
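A hedged sketch of grounding generation in a reference utterance (this assumes the model and processor from the loading step; the conversation layout follows the Hugging Face chat format for CSM, and context_text / context_audio are placeholders for a transcript and waveform taken from your dataset):

```python
# One context turn with reference audio, then the new text we want
# spoken in the same voice and style.
conversation = [
    {"role": "0", "content": [
        {"type": "text", "text": context_text},    # transcript of the sample
        {"type": "audio", "path": context_audio},  # matching reference waveform
    ]},
    {"role": "0", "content": [
        {"type": "text", "text": "This should come out in the same voice."},
    ]},
]

inputs = processor.apply_chat_template(
    conversation,
    tokenize = True,
    return_dict = True,
).to(model.device)

# output_audio=True asks the model to decode waveforms, not just audio codes.
audio_values = model.generate(**inputs, output_audio = True)
processor.save_audio(audio_values, "sample.wav")
```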
Saving to float16
We also support saving directly to float16. Select merged_16bit for float16 or merged_4bit for int4. Saving only the LoRA adapters is also supported as a fallback. Use push_to_hub_merged to upload to your Hugging Face account! You can get a personal token at https://huggingface.co/settings/tokens.
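For instance (a sketch: the repository id and token are placeholders, and the save_method strings follow the options listed above):

```python
# Merge the LoRA adapters into the base weights and upload in float16.
model.push_to_hub_merged(
    "your-username/csm-1b-elise",  # placeholder repo id
    processor,                     # processor/tokenizer to upload alongside
    save_method = "merged_16bit",  # or "merged_4bit" / "lora"
    token = "hf_...",              # placeholder: personal token from the settings page
)
```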