TTS Kyutai
The TTS Kyutai operator runs local text-to-speech synthesis using Kyutai’s neural TTS models (derived from the Moshi speech-text foundation model). It launches an external Python worker process for GPU-accelerated inference and streams audio frames back into TouchDesigner as they are generated.
Key Features
- Local inference: no API keys or cloud services required
- Streaming synthesis: audio frames arrive progressively during generation
- 250+ built-in voices across multiple voice sets (VCTK, Expresso, CML-TTS, Unmute)
- Voice search: filter the large voice library by keyword
- Extend mode: append new speech to existing audio instead of replacing it
- Auto-save to disk: optionally save every synthesis as WAV or OGG with metadata
- TCP worker reattach: survive TouchDesigner file saves without reloading the model
Requirements
- Python packages: moshi, torch (with CUDA 12.1), huggingface_hub. Install them via the Install/Settings page. The installer pins PyTorch 2.4.0 with CUDA 12.1 (cu121) for compatibility.
- Models: the TTS model and voice repository must be downloaded from HuggingFace before first use.
- Hardware: an NVIDIA GPU with CUDA 12.1-compatible drivers is strongly recommended; CPU inference is supported but significantly slower.
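To confirm that the packages listed above are visible to a Python environment, you can query their installed versions. The sketch below uses the standard-library importlib.metadata module, which the changelog names as the operator's dependency-checking mechanism. The helper name check_dependencies and the assumption that the distribution names match the package names above are illustrative, not taken from the operator's code.

```python
from importlib import metadata

# Distribution names assumed to match the package names listed above.
REQUIRED = ("moshi", "torch", "huggingface_hub")

def check_dependencies(packages=REQUIRED):
    """Return {package: installed version string, or None if not installed}."""
    report = {}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report

if __name__ == "__main__":
    for name, version in check_dependencies().items():
        print(f"{name}: {version or 'MISSING'}")
```

Run it with the same Python interpreter that TouchDesigner uses, since packages installed elsewhere will not be reported.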
Input/Output
Inputs
None. Text is entered directly via the Input Text field on the KyutaiTTS page.
Outputs
- Output 1: store_output CHOP, generated audio at 24 kHz (mono)
- Output 2: synthesis_log DAT, timestamped log of all synthesis operations
- Output 3: text_queue DAT, queued text entries
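Because the audio output is mono at a fixed 24 kHz, the length of a generated clip follows directly from its sample count. A minimal sketch of that arithmetic is below. The helper name is hypothetical, and the operator path in the comment is an assumption about where the output CHOP lives in your network.

```python
def audio_duration_seconds(num_samples, sample_rate=24000):
    """Length of a mono buffer in seconds, given its sample count."""
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    return num_samples / sample_rate

# Inside TouchDesigner this could be fed from the CHOP itself, e.g.
#   chop = op('tts_kyutai/store_output')   # hypothetical path to the output CHOP
#   audio_duration_seconds(chop.numSamples, chop.rate)
print(audio_duration_seconds(36000))  # 1.5
```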
Usage Examples
First-Time Setup
- On the Install/Settings page, pulse Dependencies Available if it shows missing packages. Restart TouchDesigner after installation completes.
- Pulse Download Model to fetch the TTS model from HuggingFace.
- Pulse Download Voices to fetch the voice repository.
- On the KyutaiTTS page, pulse Initialize TTS Kyutai to launch the worker process. The status will show “Ready” when the model is loaded.
Basic Speech Generation
- Select a voice from the Voice menu (use Search Voices to filter by name or style, e.g. “happy”, “whisper”, “narration”).
- Type text into the Input Text field.
- Pulse Generate Speech. Audio appears progressively in the store_output CHOP.
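The same steps can be driven from a script. The sketch below assumes that the Inputtext and Texttospeech parameters listed in the Parameters section correspond to the Input Text field and the Generate Speech pulse; that mapping is inferred from the parameter names, not confirmed here. The helper name speak is hypothetical.

```python
def speak(tts, text):
    """Set the input text on the operator and trigger one synthesis pass.

    `tts` is the TTS Kyutai operator, e.g. op('tts_kyutai') inside TouchDesigner.
    """
    text = text.strip()
    if not text:
        raise ValueError("nothing to synthesize: text is empty")
    tts.par.Inputtext = text      # assumed to back the Input Text field
    tts.par.Texttospeech.pulse()  # assumed to back the Generate Speech pulse
    return text

# Usage from a TouchDesigner script or the textport (not runnable outside TD):
#   speak(op('tts_kyutai'), "Hello from Kyutai.")
```

Passing the operator in as an argument keeps the helper independent of any particular network path.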
Audio Playback
The Playback page controls how generated audio is played back through your system’s audio hardware.
- Enable Active to hear synthesized audio through your speakers or headphones.
- Select a Driver (default DirectSound/CoreAudio, or ASIO for low-latency setups) and choose the target Device from the menu.
- Adjust Volume to control playback level.
- If playback stalls or behaves unexpectedly, pulse Reset Playback to reinitialize the audio output.
Extending Audio
Enable Extend Current Audio on the KyutaiTTS page to append new speech to the end of the existing audio buffer instead of replacing it. This is useful for building up longer recordings across multiple synthesis passes: each pulse of Generate Speech adds to what is already in the output rather than clearing it first.
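The difference between the two modes can be pictured with a plain list standing in for the audio buffer. This is an illustrative model of the behavior described above, not the operator's implementation, and the function name is hypothetical.

```python
def update_buffer(existing, new_samples, extend):
    """Return the buffer contents after one synthesis pass.

    extend=True  -> new samples are appended to what is already there
    extend=False -> the buffer is replaced by the new samples
    """
    if extend:
        return list(existing) + list(new_samples)
    return list(new_samples)

buffer = []
for sentence_audio in ([0.1, 0.2], [0.3], [0.4, 0.5]):
    buffer = update_buffer(buffer, sentence_audio, extend=True)
print(len(buffer))  # 5
```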
Saving Audio to Disk
- On the Playback page, enable Auto Save To Disk to save every synthesis automatically, or pulse Save Current Audio for manual saves.
- Set the Save Folder, Base Name (supports the $TIMESTAMP placeholder), and File Type (WAV or OGG).
- Enable Auto Version Files to avoid overwriting existing files.
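The naming rules above ($TIMESTAMP expansion plus the _1, _2 suffixes described for Auto Version Files) can be sketched as follows. The timestamp layout and the function name are assumptions made for illustration; the operator's actual format may differ.

```python
import os
from datetime import datetime

def resolve_save_path(folder, base_name, extension, auto_version=True, now=None):
    """Build an output path, expanding $TIMESTAMP and avoiding overwrites."""
    now = now or datetime.now()
    # Timestamp layout is an assumption; the operator's actual format may differ.
    stamp = now.strftime("%Y%m%d_%H%M%S")
    stem = base_name.replace("$TIMESTAMP", stamp)
    path = os.path.join(folder, f"{stem}.{extension}")
    if not auto_version:
        return path
    # Append _1, _2, ... until the name is free.
    version = 1
    while os.path.exists(path):
        path = os.path.join(folder, f"{stem}_{version}.{extension}")
        version += 1
    return path
```

With a timestamped base name, collisions are rare, so versioning mainly matters when the base name is fixed.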
Troubleshooting
- “TTS engine is not ready”: Pulse Initialize TTS Kyutai and wait for the status to show “Ready”.
- No audio output from the Playback page: Check that Active is enabled and the correct Device and Driver are selected.
- Worker crashes on start: Verify CUDA drivers are installed. Try setting Device to “CPU” on the Install/Settings page as a fallback.
- Voices menu shows “(Download Voices First)”: Pulse Download Voices on the Install/Settings page.
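When the worker crashes on start, it can help to check whether PyTorch sees a CUDA device at all. The sketch below uses torch.cuda.is_available() and falls back to "cpu" when PyTorch is missing or cannot be loaded. The function name is hypothetical, and the result only reflects the interpreter it runs in, so run it in the environment the worker uses.

```python
def pick_device():
    """Return "cuda" if PyTorch can see a CUDA GPU, otherwise "cpu"."""
    try:
        import torch
    except (ImportError, OSError):
        # PyTorch is missing or its native libraries failed to load.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"

print(pick_device())
```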
Research & Licensing
Kyutai
Kyutai is an AI research lab focused on speech and language technologies. Their Moshi model is a speech-text foundation model designed for real-time conversational AI.
Moshi TTS
The TTS component of Moshi generates speech from text using a dual-stream transformer architecture with voice cloning from reference audio embeddings.
Technical Details
- 7B parameter transformer architecture for speech processing
- 24 kHz audio output with streaming frame-by-frame generation
- Fully causal and streaming with 80ms frame size
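The two figures above (24 kHz output, 80 ms frames) determine how many samples each streamed frame carries. A small worked example, using only those two numbers; the helper name is hypothetical:

```python
SAMPLE_RATE_HZ = 24_000   # output sample rate stated above
FRAME_MS = 80             # frame size stated above

samples_per_frame = SAMPLE_RATE_HZ * FRAME_MS // 1000
frames_per_second = 1000 / FRAME_MS

def frames_needed(duration_seconds):
    """Number of whole frames required to cover a clip of this length."""
    total_samples = round(duration_seconds * SAMPLE_RATE_HZ)
    return -(-total_samples // samples_per_frame)  # ceiling division

print(samples_per_frame)   # 1920
print(frames_per_second)   # 12.5
print(frames_needed(2.0))  # 25
```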
Research Impact
- Enables natural real-time conversation with minimal latency
- Production-ready implementations in Rust, Python, and MLX
Citation
@techreport{kyutai2024moshi,
title={Moshi: a speech-text foundation model for real-time dialogue},
author={Alexandre Défossez and Laurent Mazaré and Manu Orsini and
Amélie Royer and Patrick Pérez and Hervé Jégou and Edouard Grave and Neil Zeghidour},
year={2024},
eprint={2410.00037},
archivePrefix={arXiv},
primaryClass={eess.AS},
url={https://arxiv.org/abs/2410.00037},
}
Key Research Contributions
- Full-duplex spoken dialogue with dual-stream audio modeling
- Ultra-low latency speech synthesis (160ms theoretical)
- Streaming neural audio codec (Mimi) at 1.1 kbps
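To put the 1.1 kbps codec bitrate in perspective, the arithmetic below combines it with the 80 ms frame size and 24 kHz sample rate from the Technical Details list. The comparison against 16-bit mono PCM is an assumption chosen for illustration.

```python
BITRATE_BPS = 1_100          # Mimi codec bitrate stated above (1.1 kbps)
FRAME_MS = 80                # frame size from the Technical Details list
SAMPLE_RATE_HZ = 24_000      # audio sample rate
PCM_BITS_PER_SAMPLE = 16     # assumption: compare against 16-bit mono PCM

bits_per_frame = BITRATE_BPS * FRAME_MS / 1000
pcm_bps = SAMPLE_RATE_HZ * PCM_BITS_PER_SAMPLE
compression_ratio = pcm_bps / BITRATE_BPS

print(bits_per_frame)            # 88.0
print(pcm_bps)                   # 384000
print(round(compression_ratio))  # 349
```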
License
CC-BY 4.0 - This model is freely available for research and commercial use.
Parameters
KyutaiTTS
op('tts_kyutai').par.Status Str
- Default: "" (Empty String)
op('tts_kyutai').par.Texttospeech Pulse
- Default: False
op('tts_kyutai').par.Inputtext Str
- Default: "" (Empty String)
op('tts_kyutai').par.Initialize Pulse
- Default: False
op('tts_kyutai').par.Shutdown Pulse
- Default: False
op('tts_kyutai').par.Initializeonstart Toggle
- Default: False
op('tts_kyutai').par.Appendtooutput Toggle
- Default: False
op('tts_kyutai').par.Voicesearch Str
- Default: "" (Empty String)
op('tts_kyutai').par.Enginestatus Str
- Default: "" (Empty String)
op('tts_kyutai').par.Streamingmode Toggle
- Default: False
op('tts_kyutai').par.Temperature Float
- Default: 0.0
- Range: 0 to 1
- Slider Range: 0 to 1
op('tts_kyutai').par.Cfgcoef Float
- Default: 0.0
- Range: 0.5 to 4
- Slider Range: 0.5 to 4
op('tts_kyutai').par.Paddingbetween Int
- Default: 0
- Range: 0 to 5
- Slider Range: 0 to 5
op('tts_kyutai').par.Clearqueue Pulse
- Default: False
op('tts_kyutai').par.Stopsynth Pulse
- Default: False
op('tts_kyutai').par.Clearaudio Pulse
Clears all generated audio from memory and from the output CHOPs (store_output and full_audio).
- Default: False
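When setting the numeric parameters from a script, it can be convenient to clamp values to the documented ranges first. The sketch below copies the ranges for Temperature, Cfgcoef, and Paddingbetween from the list above. The helper names are hypothetical, and whether the operator clamps out-of-range values on its own is not stated here.

```python
# Ranges copied from the parameter list above.
RANGES = {
    "Temperature": (0.0, 1.0),
    "Cfgcoef": (0.5, 4.0),
    "Paddingbetween": (0, 5),
}

def clamp_to_range(name, value):
    """Clamp a value to the documented range for the named parameter."""
    low, high = RANGES[name]
    return max(low, min(high, value))

def apply_settings(tts, **settings):
    """Clamp each value, assign it to the operator's parameter, return what was set."""
    applied = {}
    for name, value in settings.items():
        applied[name] = clamp_to_range(name, value)
        setattr(tts.par, name, applied[name])
    return applied

# Usage inside TouchDesigner (not runnable outside TD):
#   apply_settings(op('tts_kyutai'), Temperature=0.6, Cfgcoef=2.0)
```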
Playback
op('tts_kyutai').par.Resetpulse Pulse
- Default: False
op('tts_kyutai').par.Audioactive Toggle
- Default: True
op('tts_kyutai').par.Volume Float
- Default: 1.0
- Range: 0 to 1
- Slider Range: 0 to 1
op('tts_kyutai').par.Autosavetodisk Toggle
Automatically save generated audio and metadata locally after successful synthesis.
- Default: False
op('tts_kyutai').par.Folder Folder
Folder where generated audio files and metadata are saved.
- Default: "" (Empty String)
op('tts_kyutai').par.Name Str
Base filename for saved audio. Use $TIMESTAMP for unique names.
- Default: "" (Empty String)
op('tts_kyutai').par.Autoversion Toggle
Automatically add _1, _2, etc. if filename exists.
- Default: False
op('tts_kyutai').par.Savefile Pulse
Saves the audio currently in the output CHOP to a file using the settings above.
- Default: False
Install/Settings
op('tts_kyutai').par.Installdependencies Pulse
- Default: False
op('tts_kyutai').par.Modelrepo Str
- Default: "" (Empty String)
op('tts_kyutai').par.Downloadmodel Pulse
- Default: False
op('tts_kyutai').par.Voicerepo Str
- Default: "" (Empty String)
op('tts_kyutai').par.Downloadvoices Pulse
- Default: False
op('tts_kyutai').par.Monitorworkerlogs Toggle
- Default: False
op('tts_kyutai').par.Autoreattachoninit Toggle
- Default: False
op('tts_kyutai').par.Forceattachoninit Toggle
- Default: False
Changelog
v1.1.2 2026-03-26
Initial release
v1.1.1 2026-03-01
- Fix TD 32050+ freeze by removing moshi import at module level
- Hardcode DEFAULT_DSM_TTS_REPO constants instead of importing from moshi
- Use importlib.metadata for dependency checking
- Initial commit
v1.1.0 2025-08-17
- NEW: TCP IPC Mode - Added robust TCP communication with worker processes (recommended over STDIO)
- NEW: Auto Worker Reattach - Automatically reconnect to existing workers on TD restart/reload
- NEW: TCP Heartbeat System - Automatic connection monitoring with reconnect on timeout
- NEW: Sophisticated Audio Saving - Auto-save with metadata, versioning, and multiple formats (WAV/OGG)
- NEW: Clear Audio Method - Clear audio buffers and CHOPs with one button
- NEW: Manual Save Function - Save current audio with comprehensive metadata tracking
- IMPROVED: Parameter Organization - Cleaned and reorganized parameter menus for better UX
- IMPROVED: Method Naming - Renamed Synthesize to Texttospeech with optional text parameter
- IMPROVED: Connection Reliability - Automatic TCP reconnection and worker process persistence
- IMPROVED: Audio Management - Enhanced buffering and progressive CHOP updates