STT Whisper
Overview
The STT Whisper LOP performs local speech-to-text transcription using the faster-whisper library. It runs a separate worker process to keep TouchDesigner responsive, supports GPU acceleration via CUDA, and offers three operating modes: continuous streaming, push-to-talk, and file processing. It supports 99 languages with models ranging from 39M to 1550M parameters.
Requirements
- Python Packages: faster-whisper, nvidia-cudnn-cu12, huggingface-hub, and numpy 1.24.x must be installed in the shared Python environment. Use the Install Dependencies button on the Install/Settings page to install them automatically. The installer also handles downgrading numpy 2.x to the compatible 1.24.1 version if needed.
- Model Download: A Whisper model must be downloaded before first use. Use Download Model on the Install/Settings page, or pulse Initialize Whisper on the Faster Whisper page — if the model is not yet downloaded, you will be prompted to download it. Models are sourced from HuggingFace and verified for integrity before loading.
- NVIDIA GPU (optional): For CUDA acceleration. CPU-only operation is fully supported.
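Where the environment needs to be checked outside the installer, the numpy pin can be verified with a short script. This is an illustrative sketch, not the installer's actual logic:

```python
# Illustrative check for the numpy 1.24.x pin described above
# (numpy 2.x is incompatible with the worker).

def needs_downgrade(version: str) -> bool:
    """True if this numpy version falls outside the 1.24.x series."""
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) != (1, 24)


if __name__ == "__main__":
    try:
        import numpy
        status = "needs 1.24.1" if needs_downgrade(numpy.__version__) else "ok"
        print(f"numpy {numpy.__version__}: {status}")
    except ImportError:
        print("numpy not installed")
```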
Input / Output
Inputs
- Input 1: Audio stream. Connect an Audio Device In CHOP or any 16kHz mono float32 audio source. The operator receives audio through its ReceiveAudioChunk() method, which is called automatically when audio is wired to the input.
Outputs
- Output 1 (when Output Segments is enabled): A table DAT with columns Start, End, Text, Confidence, IsFinal, Speaker, and Language, containing timestamped transcription segments.
- transcription_out (internal DAT): The accumulated full transcript as plain text.
- segments_out (internal DAT): All segments with start/end timestamps, confidence scores, and metadata.
First-Time Setup
- On the Install/Settings page, pulse Install Dependencies. Follow the prompts to install the required Python packages. A TouchDesigner restart may be needed after installation.
- On the Faster Whisper page, select a Model Size. The “Distil Large v3” model is recommended as a good balance of speed and accuracy.
- Pulse Download Model on the Install/Settings page, or pulse Initialize Whisper on the Faster Whisper page — if the model is not yet downloaded, you will be prompted to download it.
- Once the Whisper Status reads “Ready”, the operator is ready for transcription.
Auto-Initialization
Enable Initialize On Start on the Faster Whisper page to have the engine start automatically when the project opens, without needing to pulse Initialize Whisper manually each time. This is independent of Auto Reattach On Init on the Install/Settings page, which reconnects to a worker that survived a TouchDesigner reinitialization rather than starting a new one.
Stream Mode (Live Transcription)
- Set Operating Mode to “Stream (Live)” on the Faster Whisper page.
- Connect a 16kHz audio source to the operator’s input.
- Pulse Initialize Whisper to start the engine.
- Toggle Transcription Active to On. The operator begins continuously transcribing incoming audio.
- Transcription text appears in the transcription_out DAT. Timestamped segments appear in segments_out.
Stream mode uses a chunking strategy controlled by Chunk Duration and Max Chunk Duration. When Smart VAD Chunking is enabled, the operator waits for natural speech pauses (controlled by Pause Sensitivity) before sending audio to the worker, improving phrase coherence. The audio buffer is capped at 60 seconds in Stream mode to prevent unbounded memory growth — older audio is discarded as new audio arrives.
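The chunking policy above can be sketched in plain Python. Class and threshold names here are illustrative stand-ins for the component's internal logic, and a simple RMS level is used as a proxy for the Silero VAD pause detection:

```python
# Illustrative sketch of stream-mode chunking: emit a chunk on a natural
# pause (Smart VAD Chunking) or when Max Chunk Duration is reached, and
# cap the rolling buffer at 60 s. Names/thresholds are stand-ins; the
# component uses Silero VAD rather than this simple RMS proxy.

SAMPLE_RATE = 16000
MAX_BUFFER_S = 60.0  # stream-mode buffer cap


class StreamChunker:
    def __init__(self, chunk_s=0.8, max_chunk_s=8.0, pause_rms=0.02):
        self.chunk_s = chunk_s          # minimum audio before a chunk may emit
        self.max_chunk_s = max_chunk_s  # hard cut even mid-speech
        self.pause_rms = pause_rms      # stand-in for Pause Sensitivity
        self.buf = []

    def feed(self, samples):
        """Append incoming samples; return a chunk to transcribe, or None."""
        if not samples:
            return None
        self.buf.extend(samples)
        cap = int(MAX_BUFFER_S * SAMPLE_RATE)
        if len(self.buf) > cap:         # discard oldest audio past 60 s
            self.buf = self.buf[-cap:]
        secs = len(self.buf) / SAMPLE_RATE
        if secs < self.chunk_s:
            return None
        tail = samples[-int(0.1 * SAMPLE_RATE):]   # last ~100 ms of input
        rms = (sum(x * x for x in tail) / len(tail)) ** 0.5
        if rms < self.pause_rms or secs >= self.max_chunk_s:
            chunk, self.buf = self.buf, []
            return chunk
        return None
```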
Push-to-Talk Mode
- Set Operating Mode to “Push to Talk”.
- Connect your audio source and initialize the engine.
- Toggle Transcription Active to On to begin recording. Audio accumulates in a buffer.
- Toggle Transcription Active to Off to stop recording and trigger transcription of the buffered audio.
Push-to-talk is ideal for discrete speech segments. The audio buffer has no size limit in this mode, so recordings of any length are supported. Recordings larger than 20MB of raw audio are automatically split into chunks with session tracking to ensure the worker processes them reliably.
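The 20MB split works out to 5,242,880 float32 samples per chunk (4 bytes per sample). A minimal sketch of such a splitter, where the session-ID scheme is an assumption rather than the component's actual bookkeeping:

```python
# Sketch of the push-to-talk splitter: raw audio is float32 (4 bytes
# per sample), so 20 MB of raw audio is 5,242,880 samples. The
# session-ID bookkeeping is illustrative only.

import itertools

BYTES_PER_SAMPLE = 4                      # float32
CHUNK_SAMPLES = 20 * 1024 * 1024 // BYTES_PER_SAMPLE

_sessions = itertools.count(1)


def split_recording(samples):
    """Yield (session_id, part_index, total_parts, chunk) for a recording."""
    session = next(_sessions)
    total = max(1, -(-len(samples) // CHUNK_SAMPLES))  # ceiling division
    for i in range(total):
        yield (session, i, total,
               samples[i * CHUNK_SAMPLES:(i + 1) * CHUNK_SAMPLES])
```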
File Processing Mode
- Set Operating Mode to “File Processing”.
- On the Faster Whisper page, set Transcription File to the path of an audio or video file (wav, mp3, mp4, mkv, etc.).
- Initialize the engine if not already running.
- Toggle Transcription Active to On. The file is sent to the worker for transcription, and Transcription Active turns Off automatically when processing completes.
Translation
Set Task Type to “Translate (To English)” to translate any supported language into English during transcription, rather than keeping the original language.
Model Selection Guide
| Category | Models | Parameters | Notes |
|---|---|---|---|
| Tiny | Tiny, Tiny EN-only | 39M | Fastest, basic accuracy |
| Base | Base, Base EN-only | 74M | Good balance for low-resource setups |
| Small | Small, Small EN-only | 244M | Solid general-purpose |
| Medium | Medium, Medium EN-only | 769M | High accuracy |
| Large | Large v1/v2/v3, Large v3 Turbo | 809M–1550M | Maximum accuracy |
| Distil | Distil Small/Medium EN, Distil Large v2/v3/v3.5 | 166M–756M | Near-large accuracy at much faster speed |
All 17 models are referenced by their full HuggingFace repository ID (e.g., Systran/faster-whisper-large-v3). EN-only models provide better English performance. Distil models are recommended for most use cases — they offer near-large-model accuracy with significantly faster inference. The “Distil Large v3” model is marked as recommended, and “Distil Large v3.5” is the latest addition.
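A label-to-repository mapping like the one the operator uses internally might look as follows. Only Systran/faster-whisper-large-v3 is confirmed above; the other repo IDs follow Systran's published naming scheme and are assumptions:

```python
# Illustrative mapping from menu labels to HuggingFace repository IDs.
# Only "Systran/faster-whisper-large-v3" is confirmed by this page; the
# other IDs are assumptions based on Systran's naming convention.

MODEL_REPOS = {
    "Tiny": "Systran/faster-whisper-tiny",
    "Base": "Systran/faster-whisper-base",
    "Small": "Systran/faster-whisper-small",
    "Medium": "Systran/faster-whisper-medium",
    "Large v3": "Systran/faster-whisper-large-v3",
    "Distil Large v3": "Systran/faster-distil-whisper-large-v3",
}


def repo_for(label: str) -> str:
    """Resolve a menu label to its repository ID."""
    if label not in MODEL_REPOS:
        raise ValueError(f"unknown model label: {label}")
    return MODEL_REPOS[label]
```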
Voice Activity Detection (VAD)
The VAD page provides filtering to reduce false transcriptions from non-speech audio.
- Use VAD Filter: Enables Silero VAD to filter out non-speech segments before transcription.
- VAD Threshold: Controls how aggressively non-speech is filtered. Higher values are stricter.
- VAD Min Silence: Minimum silence duration (in ms) to consider as a speech boundary.
- Beam Search Size: Controls the beam search width. Higher values may improve accuracy at the cost of speed.
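The interaction between VAD Threshold and VAD Min Silence can be illustrated with a toy segmenter over per-frame speech probabilities. The component uses Silero VAD internally; the frame size and logic below are illustrative assumptions:

```python
# Sketch of how VAD Threshold and VAD Min Silence interact: frames whose
# speech probability clears the threshold are grouped into segments, and
# a silence gap shorter than min_silence_ms does NOT split a segment.

FRAME_MS = 32  # a typical Silero VAD frame hop; an assumption here


def speech_segments(probs, threshold=0.5, min_silence_ms=250):
    """Return (start_ms, end_ms) spans of detected speech."""
    min_gap = min_silence_ms // FRAME_MS       # silence frames to split
    segments, start, silence = [], None, 0
    for i, p in enumerate(probs):
        if p >= threshold:
            if start is None:
                start = i
            silence = 0
        elif start is not None:
            silence += 1
            if silence >= min_gap:
                segments.append((start * FRAME_MS,
                                 (i - silence + 1) * FRAME_MS))
                start, silence = None, 0
    if start is not None:
        segments.append((start * FRAME_MS, (len(probs) - silence) * FRAME_MS))
    return segments
```

Raising min_silence_ms merges the two speech bursts below into one segment, which is exactly why a higher VAD Min Silence produces longer, less fragmented phrases.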
Custom Filtering
- Phrases to Avoid: Enter a comma-separated list of phrases to suppress. The operator converts these to token IDs and suppresses them during transcription.
- Custom Spellings (Prompt): Provide domain-specific terms or spellings as an initial prompt to guide the model’s output.
Reactive Dependencies
The operator exposes several reactive state values that external Script CHOPs or other operators can monitor:
- WorkerActive: True while the worker process is running.
- ModelReady: True when the model is loaded and ready for transcription.
- TranscriptionActive: True during active transcription.
- DownloadInProgress: True while a model is being downloaded from HuggingFace.
- TranscriptionComplete: Pulses True briefly when any transcription result arrives.
- EmptyTranscription: Pulses True when a transcription completes with no speech detected (File and Push-to-Talk modes only).
- OnSentenceEnd: Pulses True when the accumulated text ends with sentence punctuation.
- LastTranscriptionResult: Contains metadata about the most recent transcription result, including text content, confidence, mode, and timestamp.
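As an example of the OnSentenceEnd condition, here is a sketch of a sentence-boundary check. The exact punctuation set and trailing-quote handling are assumptions about the operator's logic:

```python
# Sketch of the OnSentenceEnd condition: pulse when the accumulated
# transcript ends at a sentence boundary. Punctuation set and
# trailing-quote handling are assumptions.

SENTENCE_END = (".", "!", "?", "…")


def ends_sentence(transcript: str) -> bool:
    """True when the transcript ends with sentence-final punctuation."""
    stripped = transcript.rstrip(" \"')")   # ignore trailing quotes/parens
    return stripped.endswith(SENTENCE_END)
```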
Worker Connection Settings
On the Install/Settings page:
- IPC Mode: Choose between TCP (recommended) and STDIO (legacy). TCP provides better performance and supports worker reattachment across TouchDesigner reinitializations.
- Auto Reattach On Init: When enabled, the operator attempts to reconnect to a previously running worker process on initialization, avoiding a cold start.
- Force Attach (Skip PID Check): Attempts to attach using stored port/token without verifying the worker process is alive.
- Monitor Worker Logs (stderr): Forwards the worker process’s log output to the operator’s Logger.
- Worker Logging Level: Controls verbosity of the worker subprocess logs.
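The TCP mode exchanges messages between TouchDesigner and the worker. Below is a minimal sketch of a newline-delimited JSON request/response round trip with a session-token check (the field names and framing are illustrative, not the component's actual wire protocol):

```python
# Minimal sketch of a newline-delimited JSON round trip over localhost
# TCP with a session-token check. Fields ("cmd", "token", "ok") are
# illustrative; the actual worker protocol is internal to the operator.

import json
import socket
import threading


def worker_stub(server, token):
    """Stand-in worker: answer one request after validating the token."""
    conn, _ = server.accept()
    with conn, conn.makefile("rw") as f:
        req = json.loads(f.readline())
        f.write(json.dumps({"ok": req.get("token") == token,
                            "cmd": req.get("cmd")}) + "\n")
        f.flush()


def send_command(port, token, cmd):
    """Client side: one JSON line out, one JSON line back."""
    with socket.create_connection(("127.0.0.1", port)) as s, \
            s.makefile("rw") as f:
        f.write(json.dumps({"cmd": cmd, "token": token}) + "\n")
        f.flush()
        return json.loads(f.readline())


def demo():
    srv = socket.socket()
    srv.bind(("127.0.0.1", 0))   # OS-assigned port, as with the worker
    srv.listen(1)
    t = threading.Thread(target=worker_stub, args=(srv, "secret"))
    t.start()
    try:
        return send_command(srv.getsockname()[1], "secret", "status")
    finally:
        t.join()
        srv.close()
```

A token check like this is one reason reattachment can work across TouchDesigner reinitializations: the stored port and token are enough to re-establish a session with an already-running worker.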
Project Exit Cleanup
The operator provides an on_exit() method that shuts down the worker process and clears stored connection data so that Auto Reattach On Init does not attempt to reconnect to a stale worker on the next project open. To use it, wire an Execute DAT with its onExit callback enabled to call op('stt_whisper1').on_exit() when the project closes.
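A minimal Execute DAT callback for this cleanup might look like the following, where 'stt_whisper1' is a placeholder for your operator's actual path:

```python
# Execute DAT callback wiring for project-exit cleanup in TouchDesigner.
# 'stt_whisper1' is a placeholder; op() is TouchDesigner's built-in
# operator lookup, so this only runs inside TouchDesigner.

def onExit():
    # Shuts down the worker and clears stored connection data so
    # Auto Reattach On Init does not target a stale worker next open.
    op('stt_whisper1').on_exit()
    return
```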
Troubleshooting
Engine fails to initialize
- Ensure all dependencies are installed. Pulse Install Dependencies on the Install/Settings page.
- Verify the selected model is downloaded. Check Whisper Status for error messages.
- On macOS, incompatible compute types (FP16, INT8+FP16) are automatically overridden to INT8.
No transcription output
- Confirm Transcription Active is On and Whisper Status shows “Ready” or “Transcribing”.
- Check that your audio source is producing 16kHz float32 data.
- Try disabling Use VAD Filter temporarily to rule out aggressive filtering.
- Lower the VAD Threshold if speech is being filtered out.
High latency
- Use a smaller model or a Distil variant.
- Set Device to “CUDA” if an NVIDIA GPU is available.
- Reduce Max Chunk Duration and Chunk Duration for faster turnaround.
- Disable Smart VAD Chunking for immediate fixed-interval processing.
Worker crashes or pipe errors
- Check the worker logs by enabling Monitor Worker Logs on the Install/Settings page and setting Worker Logging Level to “Info” or “Debug”.
- Ensure numpy version 1.24.x is installed in the venv (version 2.x causes compatibility issues). The Install Dependencies button handles this automatically.
- On Windows, the operator sets KMP_DUPLICATE_LIB_OK=TRUE to prevent OpenMP conflicts.
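A sketch of how such an environment override can be applied when spawning a worker subprocess (illustrative only, not the operator's code):

```python
# Sketch of the Windows OpenMP workaround noted above: build the worker
# subprocess environment with KMP_DUPLICATE_LIB_OK set on win32.

import os
import sys


def worker_env(platform: str = sys.platform) -> dict:
    """Environment for the worker subprocess."""
    env = os.environ.copy()
    if platform == "win32":
        # prevents "OMP: Error #15" aborts from duplicate OpenMP runtimes
        env["KMP_DUPLICATE_LIB_OK"] = "TRUE"
    return env
```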
Research & Licensing
OpenAI
OpenAI is an AI research organization developing general-purpose AI systems. Their Whisper model represents a major advance in open-source speech recognition.
Whisper
Whisper is a general-purpose speech recognition model trained on diverse audio data, designed to be robust to accents, background noise, and technical language.
Technical Details
- Encoder-decoder transformer architecture with attention mechanisms
- Support for 99 languages with varying accuracy levels
- Multiple model sizes from Tiny (39M) to Large (1550M parameters)
Research Impact
- Open-source model widely adopted in production speech recognition applications
- Breakthrough in cross-lingual and zero-shot speech recognition
Citation
@article{radford2022robust,
  title={Robust Speech Recognition via Large-Scale Weak Supervision},
  author={Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  journal={arXiv preprint arXiv:2212.04356},
  year={2022},
  url={https://arxiv.org/abs/2212.04356}
}
Key Research Contributions
- Large-scale weak supervision training on 680,000 hours of multilingual audio data
- Zero-shot transfer capabilities across languages and domains
- Robust performance approaching human-level accuracy in speech recognition
License
MIT License - This model is freely available for research and commercial use.
Parameters
Faster Whisper
- op('stt_whisper').par.Status (Str) - Default: "" (Empty String)
- op('stt_whisper').par.Active (Toggle) - Default: False
- op('stt_whisper').par.Copytranscript (Pulse) - Default: False
- op('stt_whisper').par.Enginestatus (Str) - Default: Shutdown
- op('stt_whisper').par.Initialize (Pulse) - Default: False
- op('stt_whisper').par.Shutdown (Pulse) - Default: False
- op('stt_whisper').par.Initializeonstart (Toggle) - Default: False
- op('stt_whisper').par.Segments (Toggle) - Default: False
- op('stt_whisper').par.Transcriptionfile (File) - Audio or video file to transcribe (wav, mp3, mp4, mkv, etc.) - Default: "" (Empty String)
- op('stt_whisper').par.Smartchunking (Toggle) - Default: True
- op('stt_whisper').par.Pausesensitivity (Float) - Default: 0.1 - Range: 0 to 1 (slider: 0 to 1)
- op('stt_whisper').par.Maxchunkduration (Float) - Default: 8.0 - Range: 3 to 15 (slider: 3 to 15)
- op('stt_whisper').par.Chunkduration (Float) - Default: 0.8 - Range: 0.5 to 5 (slider: 0.5 to 5)
- op('stt_whisper').par.Cleartranscript (Pulse) - Default: False
- op('stt_whisper').par.Phrasestoavoid (Str) - Default: "" (Empty String)
- op('stt_whisper').par.Customspellings (Str) - Default: "" (Empty String)
- op('stt_whisper').par.Usevad (Toggle) - Default: True
- op('stt_whisper').par.Vadthreshold (Float) - Default: 0.5 - Range: 0 to 1 (slider: 0 to 1)
- op('stt_whisper').par.Vadminsilence (Int) - Default: 250 - Range: 50 to 2000 (slider: 50 to 2000)
- op('stt_whisper').par.Beamsearchsize (Int) - Default: 5 - Range: 1 to 20 (slider: 1 to 20)
Install/Settings
- op('stt_whisper').par.Installdependencies (Pulse) - Default: False
- op('stt_whisper').par.Downloadmodel (Pulse) - Default: False
- op('stt_whisper').par.Downloadprogress (Float) - Default: 0.0 - Range: 0 to 1 (slider: 0 to 1)
- op('stt_whisper').par.Monitorworkerlogs (Toggle) - Default: False
- op('stt_whisper').par.Autoreattachoninit (Toggle) - Default: False
- op('stt_whisper').par.Forceattachoninit (Toggle) - Default: False
Changelog
v1.2.4 (2026-03-26)
- Expand segments_out from 3 to 7 columns: add Confidence, IsFinal, Speaker, Language
- Wire segment confidence from worker result with 2dp formatting
- Add header enforcement to segments_out on init
- Align LastTranscriptionResult to standard schema: text, confidence, is_final, speaker, language, mode
v1.2.3 (2026-01-28)
- Remove torch from dependency list (faster-whisper uses CTranslate2, not torch)
- Remove torch CUDA check and install handler
- Keep nvidia-cudnn-cu12 (required for CTranslate2 GPU acceleration)
v1.2.2 (2026-01-28)
- Fix TD 32050+ freeze by removing faster_whisper/torch imports from TD
- Subprocess worker handles all ML imports
- Remove torch from dependencies (faster-whisper uses CTranslate2, not torch)
v1.2.1 (2025-08-29)
Cleaned up the menu and added a Segments parameter to show segments in out1 instead of the whole transcript.
v1.2.0 (2025-08-17)
- NEW: TCP IPC Mode - Added robust TCP communication with worker processes (recommended over STDIO)
- NEW: Auto Worker Reattach - Automatically reconnect to existing workers on TD restart/reload
- NEW: TCP Heartbeat System - Automatic connection monitoring with reconnect on timeout
- NEW: Force Attach Mode - Skip PID checks for manual worker attachment scenarios
- NEW: CHOP Channel Monitoring - Script CHOP callbacks for real-time status/event monitoring
- NEW: Enhanced Dependencies - Pulse channels for TranscriptionComplete, EmptyTranscription, SentenceEnd
- NEW: Worker Connection Management - Improved process lifecycle with graceful shutdown/cleanup
- NEW: Status Channel Outputs - WorkerActive, ModelReady, TranscriptionActive, DownloadInProgress states
- NEW: Result Metadata Tracking - LastTranscriptionResult dependency with timestamp/mode info
- IMPROVED: Connection Reliability - Automatic TCP reconnection and worker process persistence
v1.1.2 (2025-07-24)
Added
- File Processing Mode: A new 'File' option in the 'Operating Mode' parameter to transcribe audio/video files from disk.
- Transcription File Parameter: A new Transcription File parameter to select a file for transcription. Supported formats include WAV, MP3, MP4, MKV, and more.
Changed
- The Active parameter now automatically turns off upon completion of file transcription.
- The Engine Status parameter now displays the current file being processed (e.g., "Transcribing File: my_video.mp4").
Fixed
- N/A
v1.1.1 (2025-07-03)
Added the Copytranscript parameter, which copies the full transcript to the system clipboard.
v1.1.0 (2025-06-30)
This major update overhauls the stt_whisper component, moving from a blocking, in-process transcription model to a robust, non-blocking external worker architecture. This significantly improves performance and stability within TouchDesigner. The parameter interface has also been completely redesigned for clarity and ease of use.
✨ New Features & Major Changes
- External Worker Process: The core faster-whisper transcription now runs in a separate Python process, preventing the main TouchDesigner thread from blocking or freezing during transcription.
- Push-to-Talk Mode: Added a new "Push to Talk" operating mode alongside the default "Stream" mode. This allows for higher-accuracy transcription of complete thoughts by buffering audio and processing it all at once upon release.
- Smart VAD Chunking: Implemented an intelligent, VAD-based chunking strategy for streaming mode. This feature waits for natural pauses in speech before sending audio for transcription, dramatically improving the quality and flow of the final transcript by reducing mid-sentence cuts.
- Content & Spelling Control:
- Phrases to Avoid: Added a parameter to suppress specific words or phrases (e.g., common hallucinations like "Thanks for watching") from the final transcript.
- Custom Spellings: Added a parameter to provide the model with an initial prompt containing custom spellings for technical terms or proper nouns (e.g., "TouchDesigner, LOPs, NVIDIA").
- Reactive State Dependencies: Implemented tdu.Dependency objects for WorkerActive, ModelReady, and TranscriptionActive, allowing other operators to reactively monitor the component's state.
- Engine Status Display: Added a read-only "Engine Status" parameter that provides a clear, human-readable summary of the current state (e.g., "Shutdown", "Initializing...", "Ready", "Recording (PTT)", "Transcribing (Stream)").
UI/UX & Parameter Improvements
- Complete Parameter Overhaul: The operator's custom parameters have been completely reorganized into logical pages (Faster Whisper, VAD / Filter, Content) for a much cleaner and more intuitive user experience.
- Simplified Configuration: The Model Path is now automatically derived from the central ChatTD component's configuration, removing the need for manual path setup.
- Intuitive Language Selection: The "Language" parameter is now a searchable dropdown menu populated with all languages supported by Whisper, with "Auto Detect" as the default.
- Pause Sensitivity Control: The "Smart VAD Level" has been replaced with an intuitive 0-1 "Pause Sensitivity" slider, which is much more user-friendly.
- Mode-Switching Logic: Implemented a robust callback for the "Operating Mode" parameter to ensure clean and predictable state transitions when switching between "Stream" and "Push to Talk" modes.
🐛 Bug Fixes & Performance
- Fixed Transcription Duplication: Resolved a critical bug where the previous audio overlap strategy caused duplicated text in the output. The new version uses a more robust deduplication cache.
- Fixed Punctuation Artifacts: Implemented a smart post-processing filter to intelligently remove erroneous periods, dashes, and other punctuation artifacts that often appear at chunk boundaries, while preserving legitimate sentence structure.
- Improved Logging: Refined logging to be more concise and useful, removing excessive debug messages from the final version.
- Linter Errors Resolved: Cleaned up all linter errors from previous development iterations.