STT Kyutai

v1.2.3 (Updated)

Overview

The STT Kyutai operator performs real-time speech-to-text transcription entirely on your local machine using Kyutai’s Moshi STT models. It runs inference in a separate worker process connected over TCP, keeping TouchDesigner’s main thread responsive while the model streams transcription results back with low latency.

Two model sizes are available: a 1B-parameter bilingual model (English and French) with roughly 0.5 seconds of processing delay, and a 2.6B-parameter English-only model with higher accuracy at roughly 2.5 seconds of delay.

Key Features

Fully Local: No cloud API keys or internet connection required after the model is downloaded
Persistent Worker Process: Model loads once in a background process and stays resident across transcription sessions
TCP IPC: Communicates with the worker over an authenticated TCP socket for reliable, low-latency data transfer
Auto Reattach: Can reconnect to an existing worker process after TouchDesigner restarts, avoiding model reload
Streaming Output: Transcription text arrives word-by-word as audio is processed
GPU Acceleration: CUDA support for faster inference on NVIDIA GPUs

Requirements

Python Dependencies: moshi, julius, torch, huggingface_hub. Pulse the Dependencies Available button on the Install/Settings page to check for missing packages and install them. The built-in installer installs PyTorch with CUDA 12.1 support automatically
Model Download: Models are downloaded from HuggingFace on first use (pulse Download Model on the Install/Settings page)
Hardware: CUDA-compatible NVIDIA GPU recommended; CPU mode is supported but slower
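
To check the packages listed above by hand, a minimal sketch using the standard library's importlib.metadata is shown here. It reads installed package metadata without importing the packages themselves. The helper name missing_packages is hypothetical, and distribution names can differ from import names on some Python versions, so treat the result as a hint rather than a verdict.

```python
from importlib import metadata

# Package names taken from the Requirements list above.
REQUIRED = ["moshi", "julius", "torch", "huggingface_hub"]


def missing_packages(names=REQUIRED):
    """Return the names that have no installed distribution metadata."""
    missing = []
    for name in names:
        try:
            # Reads metadata only; the package itself is never imported.
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    print("Missing:", missing_packages() or "none")
```

Run this with the same Python interpreter the operator uses; a different interpreter may report a different set of packages.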

Input/Output

Inputs

Wire an upstream audio source into this operator. The source delivers audio by calling ReceiveAudioChunk with float32 samples at 24 kHz. In a typical LOPs network, a microphone or audio-in CHOP is converted and routed into this operator.
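
If the source audio is not already float32 at 24 kHz, it has to be converted first. The sketch below uses only the standard library and shows one way to do it, assuming signed 16-bit PCM input. The function names are hypothetical, and the linear interpolation has no anti-aliasing filter, so a proper resampler is preferable for production audio.

```python
from array import array

TARGET_RATE = 24_000  # sample rate the operator expects, per the text above


def int16_to_float32(samples):
    """Scale signed 16-bit PCM samples to float32 values in [-1.0, 1.0)."""
    return array("f", (s / 32768.0 for s in samples))


def resample_linear(samples, src_rate, dst_rate=TARGET_RATE):
    """Resample by linear interpolation (no anti-aliasing filter)."""
    if src_rate == dst_rate or len(samples) < 2:
        return array("f", samples)
    n_out = int(len(samples) * dst_rate / src_rate)
    step = src_rate / dst_rate  # source samples advanced per output sample
    last = len(samples) - 1
    out = array("f")
    for i in range(n_out):
        pos = i * step
        lo = min(int(pos), last)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out
```

Inside TouchDesigner the converted chunk would then be handed to ReceiveAudioChunk. The exact argument type that method accepts is not stated here, so confirm it before wiring this in.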

Outputs

Transcription Text (transcription_out): Running transcript as a single text DAT, updated in real time as words arrive
Segments Table (segments_out): Individual word/phrase segments with columns for Start, End, Text, Confidence, IsFinal, Speaker, and Language. Enable Output Segments (out1) on the KyutaiSTT page to expose this on the operator’s first output
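
The segments table can be processed as ordinary rows of strings. The following sketch converts rows into dictionaries keyed by the column names listed above and joins the text of final segments. The helper names are hypothetical, and how IsFinal is encoded in the table ('1', 'True', or something else) is an assumption to verify against real output.

```python
COLUMNS = ["Start", "End", "Text", "Confidence", "IsFinal", "Speaker", "Language"]


def parse_segments(rows):
    """Turn table rows (header row first) into a list of dicts keyed by column name."""
    if not rows:
        return []
    header = [str(c) for c in rows[0]]
    return [dict(zip(header, (str(c) for c in row))) for row in rows[1:]]


def final_text(segments):
    """Join the text of segments flagged as final.

    Assumption: IsFinal is stored as '1' or 'True'; adjust to the real encoding.
    """
    truthy = {"1", "true"}
    return " ".join(
        s["Text"] for s in segments if s.get("IsFinal", "").lower() in truthy
    )


if __name__ == "__main__":
    demo = [
        COLUMNS,
        ["0.0", "0.4", "hello", "0.9", "1", "", "en"],
        ["0.4", "0.8", "wor", "0.5", "0", "", "en"],
    ]
    print(final_text(parse_segments(demo)))
```

In TouchDesigner, the rows could be read from the table DAT with something like [[c.val for c in r] for r in dat.rows()], where dat is the segments DAT.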

Usage Examples

First-Time Setup

On the Install/Settings page, check the Dependencies Available button. If it shows missing packages, pulse it and follow the prompts to install them via ChatTD’s Python Manager
Restart TouchDesigner after installation completes
Select a Model Size on the KyutaiSTT page — choose 1B EN/FR for bilingual use or 2.6B EN for English-only with higher accuracy
If the model has not been downloaded yet, pulse Download Model on the Install/Settings page and wait for the download to finish

Starting Transcription

Pulse Initialize STT Kyutai on the KyutaiSTT page. The engine status will show “Initializing…” while the model loads, then “Ready” when the worker is prepared
Toggle Transcription Active to On
Feed audio into the operator. Transcribed text will appear in the transcription_out DAT in real time
Toggle Transcription Active to Off when finished
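
The same steps can be scripted. This is a minimal sketch, assuming the parameter names from the Parameters section below (Initialize, Active, Enginestatus) and the status strings quoted above. The function name is hypothetical, and any status values not mentioned in this document (for example error states) are not handled.

```python
def start_transcription(stt):
    """Advance one step toward active transcription; return True once active.

    `stt` is the STT Kyutai operator, e.g. op('stt_kyutai') inside TouchDesigner.
    Call this repeatedly (for example once per second) until it returns True.
    """
    status = str(stt.par.Enginestatus.eval())
    if "Ready" in status or "Transcribing" in status:
        stt.par.Active.val = True  # same as toggling Transcription Active to On
        return True
    if "Initializing" not in status:
        stt.par.Initialize.pulse()  # start loading the model
    return False
```

The function is written to be polled rather than to block, because waiting in a loop on TouchDesigner's main thread would stall the timeline while the model loads.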

Language Selection

Set Model Size to 1B EN/FR (0.5s delay) — this is the only model that supports French
Set Language to English, French, or Auto for automatic detection
Pulse Initialize STT Kyutai to load the model with your language selection

Managing the Worker Process

Pulse Shutdown STT Kyutai to stop the worker process and free GPU memory
Enable Initialize On Start to automatically launch the worker whenever the operator loads
Enable Auto Reattach On Init on the Install/Settings page so the operator reconnects to an already-running worker after a TouchDesigner restart, avoiding a full model reload

Best Practices

Model Selection

Use the 1B EN/FR model for lower latency and bilingual support. Use the 2.6B EN model when you need the best English accuracy and can tolerate the longer delay (roughly 2.5 seconds instead of 0.5)
Set Language to a specific language rather than Auto when you know the input language, for more reliable results

Performance

Set Device to CUDA on the Install/Settings page when an NVIDIA GPU is available. The Auto setting will detect CUDA automatically
Keep Auto Reattach On Init enabled to avoid reloading the model on every TouchDesigner restart
Use TCP mode (the default and recommended IPC Mode) for the most reliable communication between TouchDesigner and the worker

Audio Quality

Provide audio at 24 kHz sample rate in float32 format — this is the model’s native rate
Feed audio continuously while transcription is active; the model uses 80 ms frames internally and maintains streaming context across chunks
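
As a worked example of the numbers above, an 80 ms frame at 24 kHz contains 1920 samples, which gives 12.5 frames per second. The sketch below computes how many complete frames a chunk contains. The helper name whole_frames is hypothetical; the operator handles framing internally, so this is only for reasoning about chunk sizes.

```python
SAMPLE_RATE = 24_000  # Hz, the model's native rate
FRAME_MS = 80         # model frame length in milliseconds

SAMPLES_PER_FRAME = SAMPLE_RATE * FRAME_MS // 1000  # 1920 samples
FRAMES_PER_SECOND = 1000 / FRAME_MS                 # 12.5


def whole_frames(num_samples):
    """Return (complete_frames, leftover_samples) for a chunk of audio."""
    return divmod(num_samples, SAMPLES_PER_FRAME)


if __name__ == "__main__":
    # A 0.2 second chunk is 4800 samples: two full frames plus half a frame.
    print(whole_frames(4800))
```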

Troubleshooting

Engine status stays on “Initializing…”

Check the operator’s Logger for worker process errors
Enable Monitor Worker Logs (stderr) on the Install/Settings page and set Worker Logging Level to Info or Debug to see detailed worker output
Verify CUDA drivers are installed if using GPU mode; try switching Device to CPU as a fallback
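
A quick way to check whether PyTorch can see a GPU is sketched below. It uses torch.cuda.is_available() and falls back to 'cpu' when PyTorch cannot be loaded. The function name pick_device is hypothetical, and the result is only meaningful when run in the same Python environment the worker process uses.

```python
def pick_device():
    """Return 'cuda' if PyTorch reports a usable GPU, otherwise 'cpu'."""
    try:
        import torch
    except (ImportError, OSError):
        # torch is missing or its native libraries failed to load
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print("Device:", pick_device())
```

If this prints cpu on a machine with an NVIDIA GPU, the installed PyTorch build or the GPU driver is the likely cause.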

No transcription output appears

Confirm Transcription Active is toggled On and the engine status shows “Transcribing (Stream)”
Verify audio is arriving in the correct format (float32, 24 kHz)
Check that the upstream audio source is actively sending data

Model download fails

Confirm internet access and that HuggingFace is reachable
Check available disk space — models can be several gigabytes
If huggingface_hub is missing, the operator will prompt you to install it
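
Free disk space can be checked before starting a download. This is a minimal sketch using shutil.disk_usage; the function name and the 10 GB threshold are placeholders, since the exact model sizes and the download location are not stated here.

```python
import shutil


def has_free_space(path=".", required_gb=10.0):
    """Return (enough, free_gb) for the drive that contains `path`."""
    free_gb = shutil.disk_usage(path).free / 1024**3
    return free_gb >= required_gb, free_gb


if __name__ == "__main__":
    enough, free_gb = has_free_space()
    print(f"{free_gb:.1f} GB free, enough: {enough}")
```

Point path at the drive that holds the model cache to get a meaningful answer.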

Worker process crashes or disconnects

Review worker logs for specific error messages
Try switching Device from CUDA to CPU to rule out GPU driver issues
Pulse Shutdown STT Kyutai and then Initialize STT Kyutai to restart the worker cleanly

Reactive Dependencies

The operator exposes several tdu.Dependency values that downstream operators or scripts can monitor:

TranscriptionComplete: Pulses True briefly each time a stable word or phrase commits
OnSentenceEnd: Pulses True when a sentence-ending punctuation mark is detected
EmptyTranscription: Pulses True if a commit cycle produces no text
LastTranscriptionResult: Dictionary containing the latest committed text, length, timestamp, confidence, finality flag, speaker, and language

These dependencies allow other parts of your TouchDesigner network to react to transcription events without polling.
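
As an illustration, the dictionary held by LastTranscriptionResult could be formatted for logging as sketched below. The keys confidence, is_final, speaker, and language follow the schema named in the changelog; the key name text and the function name describe_result are assumptions.

```python
def describe_result(result):
    """Format a LastTranscriptionResult-style dict as a one-line summary."""
    state = "final" if result.get("is_final") else "partial"
    parts = [f"[{state}]"]
    text = result.get("text", "")  # assumed key name for the committed text
    if text:
        parts.append(text)
    if result.get("language"):
        parts.append(f"(lang={result['language']})")
    confidence = result.get("confidence")
    if confidence is not None:
        parts.append(f"conf={float(confidence):.2f}")
    return " ".join(parts)


if __name__ == "__main__":
    print(describe_result({"text": "hello world", "is_final": True,
                           "language": "en", "confidence": 0.9}))
```

Inside TouchDesigner, the dictionary would typically be read from the dependency's val attribute. Whether the dependency is reachable directly on the operator depends on how the extension is promoted, so check that first.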

Research & Licensing

Kyutai

Kyutai is an AI research lab focused on speech and language technologies. Their Moshi model is a speech-text foundation model designed for real-time full-duplex dialogue.

Moshi STT

The Moshi STT models extract text from audio using a streaming transformer architecture with delayed streams modeling, providing real-time transcription with low latency.

Technical Details

Delayed Streams Modeling (DSM) processes 80ms audio frames at 12.5 Hz
Mimi codec compresses 24 kHz audio down to 1.1 kbps for efficient tokenization
1B EN/FR model supports English and French with ~0.5s processing delay
2.6B EN model provides higher accuracy English transcription with ~2.5s delay
SentencePiece tokenizer decodes text tokens with proper word boundary handling
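
The figures above can be cross-checked with a little arithmetic. The sketch below derives the bits available per frame from the stated bitrate and frame rate, and compares the codec bitrate with uncompressed audio. The 16-bit mono PCM baseline is an assumption chosen for comparison only.

```python
FRAME_RATE_HZ = 12.5      # frames per second (80 ms frames)
CODEC_BITRATE_BPS = 1100  # 1.1 kbps
SAMPLE_RATE_HZ = 24_000

# Bits the codec can spend on each 80 ms frame.
bits_per_frame = CODEC_BITRATE_BPS / FRAME_RATE_HZ  # 88.0

# Baseline for comparison: uncompressed 16-bit mono PCM (an assumption).
pcm16_bitrate_bps = SAMPLE_RATE_HZ * 16  # 384000
compression_ratio = pcm16_bitrate_bps / CODEC_BITRATE_BPS  # about 349

if __name__ == "__main__":
    print(bits_per_frame, round(compression_ratio))
```

A budget of 88 bits per frame is consistent with the 8 codebooks of 2048 entries (11 bits each) described in the Moshi paper.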

Research Impact

Enables fully local real-time transcription without cloud dependencies
Streaming architecture allows continuous transcription with maintained context

Citation

@techreport{kyutai2024moshi,
title={Moshi: a speech-text foundation model for real-time dialogue},
author={Alexandre Défossez and Laurent Mazaré and Manu Orsini and
Amélie Royer and Patrick Pérez and Hervé Jégou and Edouard Grave and Neil Zeghidour},
year={2024},
eprint={2410.00037},
archivePrefix={arXiv},
primaryClass={eess.AS},
url={https://arxiv.org/abs/2410.00037},
}

Key Research Contributions

Full-duplex spoken dialogue with dual-stream modeling
Ultra-low latency speech processing (160ms theoretical)
Streaming neural audio codec (Mimi) with 1.1 kbps compression
Production-ready STT models at 1B and 2.6B parameter scales

License

CC-BY 4.0 - This model is freely available for research and commercial use.

Parameters

KyutaiSTT

Status (Status) op('stt_kyutai').par.Status Str

Default: "" (Empty String)

Transcription Active (Active) op('stt_kyutai').par.Active Toggle

Default: False

Copy Transcript to Clipboard (Copytranscript) op('stt_kyutai').par.Copytranscript Pulse

Default: False

STT Kyutai (Enginestatus) op('stt_kyutai').par.Enginestatus Str

Default: "" (Empty String)

Initialize STT Kyutai (Initialize) op('stt_kyutai').par.Initialize Pulse

Default: False

Shutdown STT Kyutai (Shutdown) op('stt_kyutai').par.Shutdown Pulse

Default: False

Initialize On Start (Initializeonstart) op('stt_kyutai').par.Initializeonstart Toggle

Default: False

Output Segments (out1) (Segments) op('stt_kyutai').par.Segments Toggle

Default: False

Chunk Duration (sec) (Chunkduration) op('stt_kyutai').par.Chunkduration Float

Default: 0.0
Range: 0.1 to 5
Slider Range: 0.1 to 5

Temperature (Temperature) op('stt_kyutai').par.Temperature Float

Default: 0.0
Range: 0 to 1
Slider Range: 0 to 1

Clear Transcript (Cleartranscript) op('stt_kyutai').par.Cleartranscript Pulse

Default: False

Install/Settings

Dependencies Available (Installdependencies) op('stt_kyutai').par.Installdependencies Pulse

Default: False

Worker Connection Settings Header

Monitor Worker Logs (stderr) (Monitorworkerlogs) op('stt_kyutai').par.Monitorworkerlogs Toggle

Default: False

Auto Reattach On Init (Autoreattachoninit) op('stt_kyutai').par.Autoreattachoninit Toggle

Default: False

Download Model (Downloadmodel) op('stt_kyutai').par.Downloadmodel Pulse

Default: False

Changelog

v1.2.3 (2026-03-26)

Expand segments_out from 4 to 7 columns: add IsFinal, Speaker, Language
Add header enforcement to segments_out on init
Add LastTranscriptionResult assignment to full_transcription result path
Align LastTranscriptionResult to standard schema: confidence, is_final, speaker, language

v1.2.2 (2026-01-28)

Fix TD 32050+ freeze by using importlib.metadata for dependency checking
Remove direct torch/moshi/julius imports from check_dependencies()
Initial commit

v1.2.1 (2025-08-29)

Cleaned menu and added Segments parameter to show segments in out1 instead of the whole transcript

v1.2.0 (2025-08-17)

NEW: TCP IPC Mode - Added robust TCP communication with worker processes (recommended over STDIO)
NEW: Auto Worker Reattach - Automatically reconnect to existing workers on TD restart/reload
NEW: TCP Heartbeat System - Automatic connection monitoring with reconnect on timeout
NEW: Force Attach Mode - Skip PID checks for manual worker attachment scenarios
IMPROVED: Parameter Organization - Cleaned and reorganized parameter menus for better UX
IMPROVED: Connection Reliability - Automatic TCP reconnection and worker process persistence
IMPROVED: Worker Management - Enhanced process lifecycle with graceful shutdown/cleanup
IMPROVED: Error Handling - Better error reporting and recovery mechanisms
IMPROVED: Logging - Enhanced worker logging and monitoring capabilities
FIXED: Stability Issues - Resolved various edge cases in worker communication