Sesame CSM (1B) TTS
Installation
Unsloth
FastModel supports loading nearly any model now! This includes Vision and Text models!
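As a sketch, loading the model might look like the following (the call mirrors Unsloth's usual FastModel API; the sequence length and dtype values are illustrative assumptions, and load_in_4bit=False / full_finetuning=False match the 16-bit LoRA setup shown in the log below):

```python
from unsloth import FastModel

# Load the CSM-1B checkpoint with Unsloth's fast patching.
# Values here are illustrative, not prescriptive.
model, processor = FastModel.from_pretrained(
    model_name = "unsloth/csm-1b",
    max_seq_length = 2048,    # adjust to your text/audio lengths
    dtype = None,             # auto-detect; a T4 falls back to float16
    load_in_4bit = False,     # 16-bit LoRA rather than QLoRA
    full_finetuning = False,
)
```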
==((====))==  Unsloth 2025.5.4: Fast Csm patching. Transformers: 4.52.0.dev0.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: QLoRA and full finetuning all not selected. Switching to 16bit LoRA.
unsloth/csm-1b does not have a padding token! Will use pad_token = <|PAD_TOKEN|>.
We now add LoRA adapters so we only need to update 1 to 10% of all parameters!
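A minimal sketch of attaching the adapters (the call follows Unsloth's get_peft_model pattern; the rank, alpha, and target modules are assumptions you should adapt to your setup):

```python
from unsloth import FastModel

# Attach LoRA adapters so only a small fraction of weights are trained.
# r / lora_alpha / target_modules are illustrative choices.
model = FastModel.get_peft_model(
    model,                       # model returned by FastModel.from_pretrained
    r = 32,                      # LoRA rank: higher = more capacity, more VRAM
    lora_alpha = 32,
    lora_dropout = 0,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"],
    use_gradient_checkpointing = "unsloth",  # offload activations to save VRAM
)
```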
Unsloth: Making `model.base_model.model.backbone_model` require gradients
Data Prep
We will use the MrDragonFox/Elise dataset, which is designed for training TTS models. Ensure that your dataset follows the required format: text, audio for single-speaker models, or source, text, audio for multi-speaker models. You can adapt this section to your own dataset, but keeping the correct structure is essential for training.
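To make the expected row layout concrete, here is a small self-contained check (plain Python, no dataset download; the field names come from the format described above, and the sample rows are only illustrative):

```python
def validate_tts_rows(rows, multi_speaker=False):
    """Keep only rows that carry every field the training format expects."""
    required = {"source", "text", "audio"} if multi_speaker else {"text", "audio"}
    return [r for r in rows if required.issubset(r)]

# Illustrative rows shaped like a single-speaker dataset such as MrDragonFox/Elise.
rows = [
    {"text": "Hello there!", "audio": {"array": [0.0, 0.1], "sampling_rate": 24000}},
    {"text": "Missing the audio field"},  # dropped: no "audio" key
]
valid = validate_tts_rows(rows)
print(len(valid))  # → 1
```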
Train the model
Now let's use the Hugging Face Trainer! More docs here: Transformers docs. We do 60 steps to speed things up, but for a full run you can set num_train_epochs=1 and disable the step cap with max_steps=None.
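A sketch of the training setup (standard transformers Trainer arguments; the hyperparameter values mirror the run log below, while the dataset and collator names are placeholders for your own data prep):

```python
from transformers import Trainer, TrainingArguments

# Hyperparameters matching the logged run: batch 2 x grad-accum 4 = 8,
# capped at 60 optimizer steps for a quick demo.
args = TrainingArguments(
    per_device_train_batch_size = 2,
    gradient_accumulation_steps = 4,
    max_steps = 60,            # set to -1 and num_train_epochs=1 for a full run
    learning_rate = 2e-4,      # illustrative; tune for your data
    logging_steps = 1,
    optim = "adamw_8bit",
    output_dir = "outputs",
    report_to = "none",
)

trainer = Trainer(
    model = model,                  # the LoRA-wrapped model from above
    train_dataset = train_dataset,  # placeholder: your prepared dataset
    data_collator = data_collator,  # placeholder: pads text + audio features
    args = args,
)
trainer.train()
```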
GPU = Tesla T4. Max memory = 14.741 GB. 6.719 GB of memory reserved.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 400 | Num Epochs = 2 | Total steps = 60
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 14,516,224/1,646,616,385 (0.88% trained)
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
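The two headline numbers in that log can be re-derived with a couple of lines of arithmetic (pure Python; the values are copied from the log):

```python
# Effective batch size = per-device batch x gradient accumulation x data-parallel GPUs.
effective_batch = 2 * 4 * 1
print(effective_batch)  # → 8

# Fraction of weights actually trained by LoRA.
trainable, total = 14_516_224, 1_646_616_385
pct = 100 * trainable / total
print(f"{pct:.2f}%")  # → 0.88%
```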
Unsloth: Will smartly offload gradients to save VRAM!
313.8963 seconds used for training.
5.23 minutes used for training.
Peak reserved memory = 6.719 GB.
Peak reserved memory for training = 0.0 GB.
Peak reserved memory % of max memory = 45.58 %.
Peak reserved memory for training % of max memory = 0.0 %.
Voice and style consistency
Sesame CSM's power comes from providing audio context for each speaker. Let's pass a sample utterance from our dataset to ground speaker identity and style.
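A hedged sketch of grounding generation in a reference utterance (this assumes the model and processor from the loading step; the conversation layout follows the Hugging Face chat format for CSM, and context_text / context_audio are placeholders for a transcript and waveform taken from your dataset):

```python
# One context turn with reference audio, then the new text we want
# spoken in the same voice and style.
conversation = [
    {"role": "0", "content": [
        {"type": "text", "text": context_text},    # transcript of the sample
        {"type": "audio", "path": context_audio},  # matching reference waveform
    ]},
    {"role": "0", "content": [
        {"type": "text", "text": "This should come out in the same voice."},
    ]},
]

inputs = processor.apply_chat_template(
    conversation,
    tokenize = True,
    return_dict = True,
).to(model.device)

# output_audio=True asks the model to decode waveforms, not just audio codes.
audio_values = model.generate(**inputs, output_audio = True)
processor.save_audio(audio_values, "sample.wav")
```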
Saving to float16
We also support saving directly to float16. Select merged_16bit for float16 or merged_4bit for int4. Saving only the LoRA adapters is also supported as a fallback. Use push_to_hub_merged to upload to your Hugging Face account! You can get a personal token at https://huggingface.co/settings/tokens.
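For instance (a sketch: the repository id and token are placeholders, and the save_method strings follow the options listed above):

```python
# Merge the LoRA adapters into the base weights and upload in float16.
model.push_to_hub_merged(
    "your-username/csm-1b-elise",  # placeholder repo id
    processor,                     # processor/tokenizer to upload alongside
    save_method = "merged_16bit",  # or "merged_4bit" / "lora"
    token = "hf_...",              # placeholder: personal token from the settings page
)
```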