STT Kyutai
Overview
The STT Kyutai operator performs real-time speech-to-text transcription entirely on your local machine using Kyutai’s Moshi STT models. It runs inference in a separate worker process connected over TCP, keeping TouchDesigner’s main thread responsive while the model streams transcription results back with low latency.
Two model sizes are available: a 1B-parameter bilingual model (English and French) with roughly 0.5 seconds of processing delay, and a 2.6B-parameter English-only model with higher accuracy at roughly 2.5 seconds of delay.
Key Features
- Fully Local: No cloud API keys or internet connection required after the model is downloaded
- Persistent Worker Process: Model loads once in a background process and stays resident across transcription sessions
- TCP IPC: Communicates with the worker over an authenticated TCP socket for reliable, low-latency data transfer
- Auto Reattach: Can reconnect to an existing worker process after TouchDesigner restarts, avoiding model reload
- Streaming Output: Transcription text arrives word-by-word as audio is processed
- GPU Acceleration: CUDA support for faster inference on NVIDIA GPUs
Requirements
- Python Dependencies: moshi, julius, torch, huggingface_hub. Use the Dependencies Available button on the Install/Settings page to check and install them. The built-in installer automatically installs PyTorch with CUDA 12.1 support
- Model Download: Models are downloaded from HuggingFace on first use (pulse Download Model on the Install/Settings page)
- Hardware: CUDA-compatible NVIDIA GPU recommended; CPU mode is supported but slower
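To check the Python dependencies by hand, the sketch below looks up installed versions without importing the packages. The changelog mentions that the operator's own check moved to importlib.metadata, but this is an independent illustration and not the operator's code. It assumes the installed distribution names match the package names listed above, and it has to run in the same Python environment that TouchDesigner uses.

```python
from importlib import metadata

# Package names from the Requirements list; the installed distribution
# names are assumed to match them.
REQUIRED = ("moshi", "julius", "torch", "huggingface_hub")


def installed_versions(packages=REQUIRED):
    """Map each package name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'MISSING'}")
```

Reading metadata instead of importing avoids loading large libraries such as torch just to confirm that they are present.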
Input/Output
Inputs
Audio data is fed to this operator by wiring an upstream audio source that calls ReceiveAudioChunk with float32 audio at 24 kHz. In a typical LOPs network, a microphone or audio-in CHOP is converted and routed into this operator.
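If the source audio is not already float32 at 24 kHz, it needs converting first. The following is a minimal standard-library sketch of that conversion, assuming mono 16-bit PCM input. The helper names are hypothetical, and the argument type that ReceiveAudioChunk accepts is not documented here, so the final call is left out.

```python
from array import array

TARGET_RATE = 24_000  # sample rate stated above


def int16_to_float(samples):
    """Scale signed 16-bit PCM samples into the [-1.0, 1.0) float range."""
    return [s / 32768.0 for s in samples]


def resample_linear(samples, source_rate, target_rate=TARGET_RATE):
    """Resample mono audio by linear interpolation (no anti-alias filter)."""
    if source_rate == target_rate or len(samples) < 2:
        return list(samples)
    out_len = int(len(samples) * target_rate / source_rate)
    step = source_rate / target_rate
    out = []
    for i in range(out_len):
        pos = i * step
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        out.append(samples[left] * (1.0 - frac) + samples[right] * frac)
    return out


def to_model_input(pcm16, source_rate):
    """Return 24 kHz mono audio packed as 32-bit floats."""
    return array("f", resample_linear(int16_to_float(pcm16), source_rate))
```

Linear interpolation keeps the example short, but it does not low-pass filter before downsampling. For production audio, a proper resampler (for example the resampling CHOPs in TouchDesigner) will give cleaner input.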
Outputs
- Transcription Text (transcription_out): Running transcript as a single text DAT, updated in real time as words arrive
- Segments Table (segments_out): Individual word/phrase segments with columns for Start, End, Text, Confidence, IsFinal, Speaker, and Language. Enable Output Segments (out1) on the KyutaiSTT page to expose this on the operator’s first output
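A script that consumes the segments table might convert its rows into typed records. The sketch below works on plain lists of strings so it can run outside TouchDesigner. Inside TouchDesigner, the rows could be collected with something like `[[c.val for c in row] for row in some_dat.rows()]`. The column names come from the list above; the encoding of IsFinal and the handling of blank cells are assumptions.

```python
COLUMNS = ["Start", "End", "Text", "Confidence", "IsFinal", "Speaker", "Language"]


def _to_float(value):
    """Convert a table cell to float, returning None for blank or bad cells."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def parse_segments(rows):
    """Turn table rows (header row first) into a list of typed dictionaries."""
    if not rows:
        return []
    header, *body = rows
    segments = []
    for row in body:
        record = dict(zip(header, row))
        segments.append({
            "start": _to_float(record["Start"]),
            "end": _to_float(record["End"]),
            "text": record["Text"],
            "confidence": _to_float(record["Confidence"]),
            # Assumption: the flag is stored as "1"/"0" or "True"/"False".
            "is_final": record["IsFinal"].strip().lower() in ("1", "true"),
            "speaker": record["Speaker"],
            "language": record["Language"],
        })
    return segments
```

Looking cells up by header name rather than by column index keeps the parser working if columns are reordered in a later version.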
Usage Examples
First-Time Setup
- On the Install/Settings page, check the Dependencies Available button. If it shows missing packages, pulse it and follow the prompts to install them via ChatTD’s Python Manager
- Restart TouchDesigner after installation completes
- Select a Model Size on the KyutaiSTT page — choose 1B EN/FR for bilingual use or 2.6B EN for English-only with higher accuracy
- If the model has not been downloaded yet, pulse Download Model on the Install/Settings page and wait for the download to finish
Starting Transcription
- Pulse Initialize STT Kyutai on the KyutaiSTT page. The engine status will show “Initializing…” while the model loads, then “Ready” when the worker is prepared
- Toggle Transcription Active to On
- Feed audio into the operator. Transcribed text will appear in the transcription_out DAT in real time
- Toggle Transcription Active to Off when finished
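The same steps can be driven from a script. The sketch below uses the parameter names listed in the Parameters section (Initialize, Active, Enginestatus) and assumes the engine status string is exactly "Ready" once loading finishes. StubPar and make_stub_operator are hypothetical stand-ins so the example runs outside TouchDesigner; in a real project you would pass `op('stt_kyutai')` instead.

```python
from types import SimpleNamespace


class StubPar:
    """Stand-in for a TouchDesigner parameter so the sketch runs anywhere."""

    def __init__(self, val=None):
        self.val = val
        self.pulsed = 0

    def eval(self):
        return self.val

    def pulse(self):
        self.pulsed += 1


def make_stub_operator(status=""):
    """Build a fake operator exposing only the parameters used below."""
    pars = SimpleNamespace(
        Initialize=StubPar(),
        Active=StubPar(False),
        Enginestatus=StubPar(status),
    )
    return SimpleNamespace(par=pars)


def request_init(stt):
    """Pulse Initialize to launch the worker and start loading the model."""
    stt.par.Initialize.pulse()


def activate_when_ready(stt):
    """Switch Active on only once the engine reports 'Ready'.

    Returns True if transcription was switched on. While the model is
    still loading it returns False, so call it again later instead of
    blocking TouchDesigner's main thread in a wait loop.
    """
    if stt.par.Enginestatus.eval() != "Ready":
        return False
    stt.par.Active.val = True
    return True
```

Polling from a timer or a parameter-change callback is one way to retry activate_when_ready without stalling the interface while the model loads.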
Language Selection
- Set Model Size to 1B EN/FR (0.5s delay); this is the only model that supports French
- Set Language to English, French, or Auto for automatic detection
- Pulse Initialize STT Kyutai to load the model with your language selection
Managing the Worker Process
- Pulse Shutdown STT Kyutai to stop the worker process and free GPU memory
- Enable Initialize On Start to automatically launch the worker whenever the operator loads
- Enable Auto Reattach On Init on the Install/Settings page so the operator reconnects to an already-running worker after a TouchDesigner restart, avoiding a full model reload
Best Practices
Model Selection
- Use the 1B EN/FR model for lower latency and bilingual support. Use the 2.6B EN model when you need the best English accuracy and can tolerate the longer delay (roughly 2.5 seconds versus 0.5 seconds)
- Set Language to a specific language rather than Auto when you know the input language, for more reliable results
Performance
- Set Device to CUDA on the Install/Settings page when an NVIDIA GPU is available. The Auto setting will detect CUDA automatically
- Keep Auto Reattach On Init enabled to avoid reloading the model on every TouchDesigner restart
- Use TCP mode (the default and recommended IPC Mode) for the most reliable communication between TouchDesigner and the worker
Audio Quality
- Provide float32 audio at a 24 kHz sample rate, which is the model’s native rate
- Feed audio continuously while transcription is active; the model uses 80 ms frames internally and maintains streaming context across chunks
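The frame size follows directly from these numbers: 80 ms at 24 kHz is 1920 samples. The sketch below shows that arithmetic and one way to cut a buffer into whole frames while carrying the remainder forward. Whether the worker needs frame-aligned chunks is not stated here (it may well buffer internally), so treat this as an illustration of the sizes involved and not as a requirement.

```python
SAMPLE_RATE = 24_000   # Hz, from the text above
FRAME_SECONDS = 0.080  # 80 ms frames, from the text above
FRAME_SAMPLES = round(SAMPLE_RATE * FRAME_SECONDS)  # 1920 samples


def split_into_frames(buffer, frame_samples=FRAME_SAMPLES):
    """Split a sample buffer into whole frames plus a leftover tail.

    Prepend the tail to the next buffer so that no audio is dropped.
    """
    whole = len(buffer) - len(buffer) % frame_samples
    frames = [buffer[i:i + frame_samples] for i in range(0, whole, frame_samples)]
    return frames, buffer[whole:]
```

For example, a 4000-sample buffer yields two full frames and a 160-sample tail that waits for the next buffer.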
Troubleshooting
Engine status stays on “Initializing…”
- Check the operator’s Logger for worker process errors
- Enable Monitor Worker Logs (stderr) on the Install/Settings page and set Worker Logging Level to Info or Debug to see detailed worker output
- Verify CUDA drivers are installed if using GPU mode; try switching Device to CPU as a fallback
No transcription output appears
- Confirm Transcription Active is toggled On and the engine status shows “Transcribing (Stream)”
- Verify audio is arriving in the correct format (float32, 24 kHz)
- Check that the upstream audio source is actively sending data
Model download fails
- Confirm internet access and that HuggingFace is reachable
- Check available disk space — models can be several gigabytes
- If huggingface_hub is missing, the operator will prompt you to install it
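Free disk space can be checked before starting a download. This is a small sketch using only the standard library. The path and the size threshold you pass in are up to you, since this page states neither the download location nor the exact model sizes.

```python
import shutil
from pathlib import Path


def has_free_space(path, required_gb):
    """Return True if the drive holding `path` has at least `required_gb` free."""
    target = Path(path).expanduser()
    # Walk up to the nearest existing directory so that a cache folder
    # which has not been created yet can still be checked.
    while not target.exists() and target != target.parent:
        target = target.parent
    free_gb = shutil.disk_usage(target).free / 1024**3
    return free_gb >= required_gb
```

Calling `has_free_space(some_cache_dir, 10)` before pulsing Download Model would flag a nearly full drive early; the 10 GB figure is only a placeholder.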
Worker process crashes or disconnects
- Review worker logs for specific error messages
- Try switching Device from CUDA to CPU to rule out GPU driver issues
- Pulse Shutdown STT Kyutai and then Initialize STT Kyutai to restart the worker cleanly
Reactive Dependencies
The operator exposes several tdu.Dependency values that downstream operators or scripts can monitor:
- TranscriptionComplete: Pulses True briefly each time a stable word or phrase commits
- OnSentenceEnd: Pulses True when a sentence-ending punctuation mark is detected
- EmptyTranscription: Pulses True if a commit cycle produces no text
- LastTranscriptionResult: Dictionary containing the latest committed text, length, timestamp, confidence, finality flag, speaker, and language
These dependencies allow other parts of your TouchDesigner network to react to transcription events without polling.
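As an illustration, a script could turn the LastTranscriptionResult dictionary into a log line. The keys confidence, is_final, speaker, and language are named in the changelog; the "text" key and the way the dependency is reached (for example `op('stt_kyutai').LastTranscriptionResult.val`) are assumptions, so verify both against the operator before relying on them.

```python
def describe_result(result):
    """Format a LastTranscriptionResult-style dictionary as one log line."""
    if not result or not result.get("text"):
        return "(no text committed)"
    text = result["text"]
    state = "final" if result.get("is_final") else "partial"
    confidence = result.get("confidence")
    if isinstance(confidence, (int, float)):
        conf_txt = f"{confidence:.2f}"
    else:
        conf_txt = "n/a"
    language = result.get("language") or "unknown"
    speaker = result.get("speaker") or "unknown"
    return f'[{state}] "{text}" (conf {conf_txt}, lang {language}, speaker {speaker})'
```

Using `.get` with fallbacks means that a result with missing or empty fields still produces a readable line instead of raising an error.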
Research & Licensing
Kyutai
Kyutai is an AI research lab focused on speech and language technologies. Their Moshi model is a speech-text foundation model designed for real-time full-duplex dialogue.
Moshi STT
The Moshi STT models extract text from audio using a streaming transformer architecture with delayed streams modeling, providing real-time transcription with low latency.
Technical Details
- Delayed Streams Modeling (DSM) processes 80ms audio frames at 12.5 Hz
- Mimi codec compresses 24 kHz audio down to 1.1 kbps for efficient tokenization
- 1B EN/FR model supports English and French with ~0.5s processing delay
- 2.6B EN model provides higher accuracy English transcription with ~2.5s delay
- SentencePiece tokenizer decodes text tokens with proper word boundary handling
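The figures above are tied together by simple arithmetic, worked through below. The comparison against uncompressed 16-bit mono PCM is an assumed reference point, chosen only to give the compression ratio some context.

```python
SAMPLE_RATE_HZ = 24_000
FRAME_MS = 80
CODEC_BITRATE_BPS = 1_100  # 1.1 kbps

# 80 ms frames mean 12.5 frames per second.
frames_per_second = 1000 / FRAME_MS

# At 1.1 kbps, each frame carries 88 bits of codec tokens.
bits_per_frame = CODEC_BITRATE_BPS / frames_per_second

# Reference point (assumption): 16-bit mono PCM at the same sample rate.
pcm_bitrate_bps = SAMPLE_RATE_HZ * 16
compression_ratio = pcm_bitrate_bps / CODEC_BITRATE_BPS

print(f"{frames_per_second} frames/s, {bits_per_frame} bits/frame")
print(f"about {compression_ratio:.0f}x smaller than 16-bit PCM")
```

This works out to 88 bits per 80 ms frame, roughly 349 times smaller than the 384 kbps PCM stream.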
Research Impact
- Enables fully local real-time transcription without cloud dependencies
- Streaming architecture allows continuous transcription with maintained context
Citation
@techreport{kyutai2024moshi,
title={Moshi: a speech-text foundation model for real-time dialogue},
author={Alexandre Défossez and Laurent Mazaré and Manu Orsini and
Amélie Royer and Patrick Pérez and Hervé Jégou and Edouard Grave and Neil Zeghidour},
year={2024},
eprint={2410.00037},
archivePrefix={arXiv},
primaryClass={eess.AS},
url={https://arxiv.org/abs/2410.00037},
}
Key Research Contributions
- Full-duplex spoken dialogue with dual-stream modeling
- Ultra-low latency speech processing (160ms theoretical)
- Streaming neural audio codec (Mimi) with 1.1 kbps compression
- Production-ready STT models at 1B and 2.6B parameter scales
License
CC-BY 4.0 - This model is freely available for research and commercial use.
Parameters
KyutaiSTT
op('stt_kyutai').par.Status Str - Default: "" (Empty String)
op('stt_kyutai').par.Active Toggle - Default: False
op('stt_kyutai').par.Copytranscript Pulse - Default: False
op('stt_kyutai').par.Enginestatus Str - Default: "" (Empty String)
op('stt_kyutai').par.Initialize Pulse - Default: False
op('stt_kyutai').par.Shutdown Pulse - Default: False
op('stt_kyutai').par.Initializeonstart Toggle - Default: False
op('stt_kyutai').par.Segments Toggle - Default: False
op('stt_kyutai').par.Chunkduration Float - Default: 0.0 - Range: 0.1 to 5 - Slider Range: 0.1 to 5
op('stt_kyutai').par.Temperature Float - Default: 0.0 - Range: 0 to 1 - Slider Range: 0 to 1
op('stt_kyutai').par.Cleartranscript Pulse - Default: False
Install/Settings
op('stt_kyutai').par.Installdependencies Pulse - Default: False
op('stt_kyutai').par.Monitorworkerlogs Toggle - Default: False
op('stt_kyutai').par.Autoreattachoninit Toggle - Default: False
op('stt_kyutai').par.Downloadmodel Pulse - Default: False
Changelog
v1.2.3 (2026-03-26)
- Expand segments_out from 4 to 7 columns: add IsFinal, Speaker, Language
- Add header enforcement to segments_out on init
- Add LastTranscriptionResult assignment to full_transcription result path
- Align LastTranscriptionResult to standard schema: confidence, is_final, speaker, language
v1.2.2 (2026-01-28)
- Fix TD 32050+ freeze by using importlib.metadata for dependency checking
- Remove direct torch/moshi/julius imports from check_dependencies()
- Initial commit
v1.2.1 (2025-08-29)
- Cleaned up the menu and added a Segments parameter to show segments in out1 instead of the whole transcript
v1.2.0 (2025-08-17)
- NEW: TCP IPC Mode - Added robust TCP communication with worker processes (recommended over STDIO)
- NEW: Auto Worker Reattach - Automatically reconnect to existing workers on TD restart/reload
- NEW: TCP Heartbeat System - Automatic connection monitoring with reconnect on timeout
- NEW: Force Attach Mode - Skip PID checks for manual worker attachment scenarios
- IMPROVED: Parameter Organization - Cleaned and reorganized parameter menus for better UX
- IMPROVED: Connection Reliability - Automatic TCP reconnection and worker process persistence
- IMPROVED: Worker Management - Enhanced process lifecycle with graceful shutdown/cleanup
- IMPROVED: Error Handling - Better error reporting and recovery mechanisms
- IMPROVED: Logging - Enhanced worker logging and monitoring capabilities
- FIXED: Stability Issues - Resolved various edge cases in worker communication