
STT Whisper

v1.2.4

The STT Whisper LOP performs local speech-to-text transcription using the faster-whisper library. It runs a separate worker process to keep TouchDesigner responsive, supports GPU acceleration via CUDA, and offers three operating modes: continuous streaming, push-to-talk, and file processing. It supports 99 languages with models ranging from 39M to 1550M parameters.

Requirements

  • Python Packages: faster-whisper, nvidia-cudnn-cu12, huggingface-hub, and numpy 1.24.x must be installed in the shared Python environment. Use the Install Dependencies button on the Install/Settings page to install them automatically. The installer also handles downgrading numpy 2.x to the compatible 1.24.1 version if needed.
  • Model Download: A Whisper model must be downloaded before first use. Use Download Model on the Install/Settings page, or pulse Initialize Whisper on the Faster Whisper page — if the model is not yet downloaded, you will be prompted to download it. Models are sourced from HuggingFace and verified for integrity before loading.
  • NVIDIA GPU (optional): For CUDA acceleration. CPU-only operation is fully supported.
Inputs & Outputs

  • Input 1: Audio stream. Connect an Audio Device In CHOP or any 16kHz mono float32 audio source. The operator receives audio through its ReceiveAudioChunk() method, which is called automatically when audio is wired to the input.
  • Output 1 (when Output Segments is enabled): A table DAT with columns Start, End, Text, Confidence, IsFinal, Speaker, and Language containing timestamped transcription segments.
  • transcription_out (internal DAT): The accumulated full transcript as plain text.
  • segments_out (internal DAT): All segments with start/end timestamps, confidence scores, and metadata.
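The input expects 16kHz mono float32 audio. If your source produces a different sample rate, channel count, or sample format, it must be converted first. A minimal numpy sketch (the function name to_whisper_format is illustrative, not part of the operator):

```python
import numpy as np

def to_whisper_format(samples: np.ndarray, src_rate: int) -> np.ndarray:
    """Convert an audio buffer to the 16kHz mono float32 layout the input expects."""
    # Scale integer PCM (e.g. int16) into the [-1, 1] float range.
    if np.issubdtype(samples.dtype, np.integer):
        info = np.iinfo(samples.dtype)
        samples = samples.astype(np.float32) / info.max
    samples = samples.astype(np.float32)
    # Downmix multi-channel audio (shape [n, channels]) to mono.
    if samples.ndim == 2:
        samples = samples.mean(axis=1)
    # Naive linear-interpolation resample to 16 kHz (adequate for speech;
    # use a proper resampler for production-quality audio).
    if src_rate != 16000:
        dst_len = int(len(samples) * 16000 / src_rate)
        samples = np.interp(
            np.linspace(0, len(samples) - 1, dst_len),
            np.arange(len(samples)),
            samples,
        )
    return samples.astype(np.float32)
```

An Audio Device In CHOP can usually be configured to output 16kHz mono directly, which avoids this conversion entirely.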
Setup

  1. On the Install/Settings page, pulse Install Dependencies. Follow the prompts to install the required Python packages. A TouchDesigner restart may be needed after installation.
  2. On the Faster Whisper page, select a Model Size. The “Distil Large v3” model is recommended as a good balance of speed and accuracy.
  3. Pulse Download Model on the Install/Settings page, or pulse Initialize Whisper on the Faster Whisper page — if the model is not yet downloaded, you will be prompted to download it.
  4. Once the Whisper Status reads “Ready”, the operator is ready for transcription.

Enable Initialize On Start on the Faster Whisper page to have the engine start automatically when the project opens, without needing to pulse Initialize Whisper manually each time. This is independent of Auto Reattach On Init on the Install/Settings page, which reconnects to a worker that survived a TouchDesigner reinitialization rather than starting a new one.

Stream (Live) Mode

  1. Set Operating Mode to “Stream (Live)” on the Faster Whisper page.
  2. Connect a 16kHz audio source to the operator’s input.
  3. Pulse Initialize Whisper to start the engine.
  4. Toggle Transcription Active to On. The operator begins continuously transcribing incoming audio.
  5. Transcription text appears in the transcription_out DAT. Timestamped segments appear in segments_out.

Stream mode uses a chunking strategy controlled by Chunk Duration and Max Chunk Duration. When Smart VAD Chunking is enabled, the operator waits for natural speech pauses (controlled by Pause Sensitivity) before sending audio to the worker, improving phrase coherence. The audio buffer is capped at 60 seconds in Stream mode to prevent unbounded memory growth — older audio is discarded as new audio arrives.
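The 60-second cap behaves like a trailing buffer: new audio is appended, and anything older than the cap is dropped. A minimal sketch (the class and names are illustrative, not the operator's internals):

```python
import numpy as np

SR = 16000          # sample rate the operator expects
MAX_SECONDS = 60    # Stream-mode buffer cap described above

class StreamBuffer:
    """Append-only audio buffer that discards the oldest samples past a cap."""

    def __init__(self):
        self.data = np.zeros(0, dtype=np.float32)

    def append(self, chunk: np.ndarray):
        self.data = np.concatenate([self.data, chunk.astype(np.float32)])
        max_len = SR * MAX_SECONDS
        if len(self.data) > max_len:
            # Keep only the most recent 60 seconds of audio.
            self.data = self.data[-max_len:]

    @property
    def seconds(self) -> float:
        return len(self.data) / SR
```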

Push-to-Talk Mode

  1. Set Operating Mode to “Push to Talk”.
  2. Connect your audio source and initialize the engine.
  3. Toggle Transcription Active to On to begin recording. Audio accumulates in a buffer.
  4. Toggle Transcription Active to Off to stop recording and trigger transcription of the buffered audio.

Push-to-talk is ideal for discrete speech segments. The audio buffer has no size limit in this mode, so recordings of any length are supported. Recordings larger than 20MB of raw audio are automatically split into chunks with session tracking to ensure the worker processes them reliably.
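The 20MB split can be sketched as follows (the chunk schema with session_index is an assumption for illustration; the operator's actual wire format to the worker is internal):

```python
import numpy as np

CHUNK_BYTES = 20 * 1024 * 1024  # 20 MB raw-audio split threshold described above

def split_recording(audio: np.ndarray):
    """Split a float32 recording into chunks no larger than CHUNK_BYTES,
    tagging each with a session index so order can be tracked."""
    samples_per_chunk = CHUNK_BYTES // audio.itemsize  # float32 -> 4 bytes/sample
    return [
        {"session_index": i, "audio": audio[start:start + samples_per_chunk]}
        for i, start in enumerate(range(0, len(audio), samples_per_chunk))
    ]
```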

File Processing Mode

  1. Set Operating Mode to “File Processing”.
  2. On the Faster Whisper page, set Transcription File to the path of an audio or video file (wav, mp3, mp4, mkv, etc.).
  3. Initialize the engine if not already running.
  4. Toggle Transcription Active to On. The file is sent to the worker for transcription, and Transcription Active turns Off automatically when processing completes.

Set Task Type to “Translate (To English)” to translate any supported language into English during transcription, rather than keeping the original language.

Model Sizes

Category | Models                                           | Parameters | Notes
Tiny     | Tiny, Tiny EN-only                               | 39M        | Fastest, basic accuracy
Base     | Base, Base EN-only                               | 74M        | Good balance for low-resource setups
Small    | Small, Small EN-only                             | 244M       | Solid general-purpose
Medium   | Medium, Medium EN-only                           | 769M       | High accuracy
Large    | Large v1/v2/v3, Large v3 Turbo                   | 809M-1550M | Maximum accuracy
Distil   | Distil Small/Medium EN, Distil Large v2/v3/v3.5  | 166M-756M  | Near-large accuracy at much faster speed

All 17 models are referenced by their full HuggingFace repository ID (e.g., Systran/faster-whisper-large-v3). EN-only models provide better English performance. Distil models are recommended for most use cases — they offer near-large-model accuracy with significantly faster inference. The “Distil Large v3” model is marked as recommended, and “Distil Large v3.5” is the latest addition.
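For reference, menu labels resolve to HuggingFace repository IDs like so (a partial, illustrative mapping; the full list of 17 IDs appears under the Model Size parameter below):

```python
# Partial mapping from menu label to HuggingFace repository ID,
# matching the Model Size options in the parameter reference.
MODEL_REPOS = {
    "Tiny": "Systran/faster-whisper-tiny",
    "Tiny EN-only": "Systran/faster-whisper-tiny.en",
    "Large v3": "Systran/faster-whisper-large-v3",
    "Large v3 Turbo": "openai/whisper-large-v3-turbo",
    "Distil Large v3": "Systran/faster-distil-whisper-large-v3",
    "Distil Large v3.5": "Purfview/faster-distil-whisper-large-v3.5",
}

def repo_for(label: str) -> str:
    """Resolve a human-readable model label to its HuggingFace repo ID."""
    return MODEL_REPOS[label]
```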

VAD & Content Filtering

The VAD page provides filtering to reduce false transcriptions from non-speech audio.

  • Use VAD Filter: Enables Silero VAD to filter out non-speech segments before transcription.
  • VAD Threshold: Controls how aggressively non-speech is filtered. Higher values are stricter.
  • VAD Min Silence: Minimum silence duration (in ms) to consider as a speech boundary.
  • Beam Search Size: Controls the beam search width. Higher values may improve accuracy at the cost of speed.
  • Phrases to Avoid: Enter a comma-separated list of phrases to suppress. The operator converts these to token IDs and suppresses them during transcription.
  • Custom Spellings (Prompt): Provide domain-specific terms or spellings as an initial prompt to guide the model’s output.
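These parameters correspond closely to keyword arguments of faster-whisper's WhisperModel.transcribe(). A hedged sketch of how they might be assembled (the helper function is hypothetical; the argument names beam_size, vad_filter, vad_parameters, and initial_prompt follow faster-whisper's documented API):

```python
def build_transcribe_kwargs(
    beam_size=5,
    use_vad=True,
    vad_threshold=0.5,
    vad_min_silence_ms=250,
    initial_prompt="",
    language="en",
    task="transcribe",
):
    """Map the operator's VAD/Content parameters onto WhisperModel.transcribe()
    keyword arguments."""
    kwargs = {
        "beam_size": beam_size,
        "language": language or None,  # None lets Whisper auto-detect
        "task": task,
        "vad_filter": use_vad,
    }
    if use_vad:
        kwargs["vad_parameters"] = {
            "threshold": vad_threshold,
            "min_silence_duration_ms": vad_min_silence_ms,
        }
    if initial_prompt:
        kwargs["initial_prompt"] = initial_prompt  # Custom Spellings text
    return kwargs
```

Phrase suppression has no single keyword equivalent here; as described above, the operator tokenizes the phrases and passes the resulting token IDs for suppression.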

Reactive State

The operator exposes several reactive state values that external Script CHOPs or other operators can monitor:

  • WorkerActive: True while the worker process is running.
  • ModelReady: True when the model is loaded and ready for transcription.
  • TranscriptionActive: True during active transcription.
  • DownloadInProgress: True while a model is being downloaded from HuggingFace.
  • TranscriptionComplete: Pulses True briefly when any transcription result arrives.
  • EmptyTranscription: Pulses True when a transcription completes with no speech detected (File and Push-to-Talk modes only).
  • OnSentenceEnd: Pulses True when the accumulated text ends with sentence punctuation.
  • LastTranscriptionResult: Contains metadata about the most recent transcription result, including text content, confidence, mode, and timestamp.
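As one concrete example, the OnSentenceEnd condition could be evaluated like this (the punctuation set shown is an assumption for illustration, not the operator's exact list):

```python
# Punctuation treated as a sentence boundary (illustrative set,
# including CJK full stops and question/exclamation marks).
SENTENCE_END = (".", "!", "?", "…", "。", "！", "？")

def ends_sentence(text: str) -> bool:
    """True when the accumulated transcript ends with sentence punctuation."""
    stripped = text.rstrip().rstrip('"\'')  # ignore trailing whitespace/quotes
    return stripped.endswith(SENTENCE_END)
```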

Worker Connection Settings

On the Install/Settings page:

  • IPC Mode: Choose between TCP (recommended) and STDIO (legacy). TCP provides better performance and supports worker reattachment across TouchDesigner reinitializations.
  • Auto Reattach On Init: When enabled, the operator attempts to reconnect to a previously running worker process on initialization, avoiding a cold start.
  • Force Attach (Skip PID Check): Attempts to attach using stored port/token without verifying the worker process is alive.
  • Monitor Worker Logs (stderr): Forwards the worker process’s log output to the operator’s Logger.
  • Worker Logging Level: Controls verbosity of the worker subprocess logs.

The operator provides an on_exit() method that shuts down the worker process and clears stored connection data so that Auto Reattach On Init does not attempt to reconnect to a stale worker on the next project open. To use it, wire an Execute DAT with its onExit callback enabled to call op('stt_whisper1').on_exit() when the project closes.

Troubleshooting

Engine fails to initialize:

  • Ensure all dependencies are installed. Pulse Install Dependencies on the Install/Settings page.
  • Verify the selected model is downloaded. Check Whisper Status for error messages.
  • On macOS, incompatible compute types (FP16, INT8+FP16) are automatically overridden to INT8.

No transcription appears:

  • Confirm Transcription Active is On and Whisper Status shows “Ready” or “Transcribing”.
  • Check that your audio source is producing 16kHz float32 data.
  • Try disabling Use VAD Filter temporarily to rule out aggressive filtering.
  • Lower the VAD Threshold if speech is being filtered out.

Transcription is slow:

  • Use a smaller model or a Distil variant.
  • Set Device to “CUDA” if an NVIDIA GPU is available.
  • Reduce Max Chunk Duration and Chunk Duration for faster turnaround.
  • Disable Smart VAD Chunking for immediate fixed-interval processing.

Other tips:

  • Check the worker logs by enabling Monitor Worker Logs on the Install/Settings page and setting Worker Logging Level to “Info” or “Debug”.
  • Ensure numpy version 1.24.x is installed in the venv (version 2.x causes compatibility issues). The Install Dependencies button handles this automatically.
  • On Windows, the operator sets KMP_DUPLICATE_LIB_OK=TRUE to prevent OpenMP conflicts.
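The numpy constraint can be checked from the shared Python environment with a one-liner (a simple sketch):

```python
import numpy as np

def numpy_compatible() -> bool:
    """True when the installed numpy is a 1.x release; as noted above,
    numpy 2.x causes compatibility issues with this setup."""
    return int(np.__version__.split(".")[0]) < 2
```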

Research & Licensing

OpenAI

OpenAI is an AI research organization developing general-purpose AI systems. Their Whisper model represents a major advance in open-source speech recognition.

Whisper

Whisper is a general-purpose speech recognition model trained on diverse audio data, designed to be robust to accents, background noise, and technical language.

Technical Details

  • Encoder-decoder transformer architecture with attention mechanisms
  • Support for 99 languages with varying accuracy levels
  • Multiple model sizes from Tiny (39M) to Large (1550M parameters)

Research Impact

  • Open-source model widely adopted in production speech recognition applications
  • Breakthrough in cross-lingual and zero-shot speech recognition

Citation

@article{radford2022robust,
  title={Robust Speech Recognition via Large-Scale Weak Supervision},
  author={Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  journal={arXiv preprint arXiv:2212.04356},
  year={2022},
  url={https://arxiv.org/abs/2212.04356}
}

Key Research Contributions

  • Large-scale weak supervision training on 680,000 hours of multilingual audio data
  • Zero-shot transfer capabilities across languages and domains
  • Robust performance approaching human-level accuracy in speech recognition

License

MIT License - This model is freely available for research and commercial use.

Parameters

Status (Status) op('stt_whisper').par.Status Str
Default:
"" (Empty String)
Transcription Active (Active) op('stt_whisper').par.Active Toggle
Default:
False
Copy Transcript to Clipboard (Copytranscript) op('stt_whisper').par.Copytranscript Pulse
Default:
False
Whisper Status (Enginestatus) op('stt_whisper').par.Enginestatus Str
Default:
Shutdown
Initialize Whisper (Initialize) op('stt_whisper').par.Initialize Pulse
Default:
False
Shutdown Whisper (Shutdown) op('stt_whisper').par.Shutdown Pulse
Default:
False
Initialize On Start (Initializeonstart) op('stt_whisper').par.Initializeonstart Toggle
Default:
False
Operating Mode (Mode) op('stt_whisper').par.Mode Menu
Default:
Pushtotalk
Options:
Stream, Pushtotalk, File
Task Type (Task) op('stt_whisper').par.Task Menu

Transcribe: Keep original language. Translate: Convert any language to English

Default:
transcribe
Options:
transcribe, translate
Output Segments (out1) (Segments) op('stt_whisper').par.Segments Toggle
Default:
False
Model Size (Modelsize) op('stt_whisper').par.Modelsize Menu
Default:
tiny.en
Options:
Systran/faster-whisper-tiny, Systran/faster-whisper-tiny.en, Systran/faster-whisper-base, Systran/faster-whisper-base.en, Systran/faster-whisper-small, Systran/faster-whisper-small.en, Systran/faster-whisper-medium, Systran/faster-whisper-medium.en, Systran/faster-whisper-large-v1, Systran/faster-whisper-large-v2, Systran/faster-whisper-large-v3, openai/whisper-large-v3-turbo, Systran/faster-distil-whisper-small.en, Systran/faster-distil-whisper-medium.en, Systran/faster-distil-whisper-large-v2, Systran/faster-distil-whisper-large-v3, Purfview/faster-distil-whisper-large-v3.5
Language (Language) op('stt_whisper').par.Language StrMenu
Default:
en
Menu Options:
  • English (en)
  • Chinese (zh)
  • German (de)
  • Spanish (es)
  • Russian (ru)
  • Korean (ko)
  • French (fr)
  • Japanese (ja)
  • Portuguese (pt)
  • Turkish (tr)
  • Polish (pl)
  • Catalan (ca)
  • Dutch (nl)
  • Arabic (ar)
  • Swedish (sv)
  • Italian (it)
  • Indonesian (id)
  • Hindi (hi)
  • Finnish (fi)
  • Vietnamese (vi)
  • Hebrew (he)
  • Ukrainian (uk)
  • Greek (el)
  • Malay (ms)
  • Czech (cs)
  • Romanian (ro)
  • Danish (da)
  • Hungarian (hu)
  • Tamil (ta)
  • Norwegian (no)
  • Thai (th)
  • Urdu (ur)
  • Croatian (hr)
  • Bulgarian (bg)
  • Lithuanian (lt)
  • Latin (la)
  • Maori (mi)
  • Malayalam (ml)
  • Welsh (cy)
  • Slovak (sk)
  • Telugu (te)
  • Persian (fa)
  • Latvian (lv)
  • Bengali (bn)
  • Serbian (sr)
  • Azerbaijani (az)
  • Slovenian (sl)
  • Kannada (kn)
  • Estonian (et)
  • Macedonian (mk)
  • Breton (br)
  • Basque (eu)
  • Icelandic (is)
  • Armenian (hy)
  • Nepali (ne)
  • Mongolian (mn)
  • Bosnian (bs)
  • Kazakh (kk)
  • Albanian (sq)
  • Swahili (sw)
  • Galician (gl)
  • Marathi (mr)
  • Punjabi (pa)
  • Sinhala (si)
  • Khmer (km)
  • Shona (sn)
  • Yoruba (yo)
  • Somali (so)
  • Afrikaans (af)
  • Occitan (oc)
  • Georgian (ka)
  • Belarusian (be)
  • Tajik (tg)
  • Sindhi (sd)
  • Gujarati (gu)
  • Amharic (am)
  • Yiddish (yi)
  • Lao (lo)
  • Uzbek (uz)
  • Faroese (fo)
  • Haitian (ht)
  • Pashto (ps)
  • Turkmen (tk)
  • Nynorsk (nn)
  • Maltese (mt)
  • Sanskrit (sa)
  • Luxembourgish (lb)
  • Burmese (my)
  • Tibetan (bo)
  • Tagalog (tl)
  • Malagasy (mg)
  • Assamese (as)
  • Tatar (tt)
  • Hawaiian (haw)
  • Lingala (ln)
  • Hausa (ha)
  • Bashkir (ba)
  • Javanese (jw)
  • Sundanese (su)
Device (Device) op('stt_whisper').par.Device Menu
Default:
auto
Options:
auto, cpu, cuda
Compute Type (Computetype) op('stt_whisper').par.Computetype Menu
Default:
default
Options:
default, auto, int8, int8_float16, int16, float16, float32
Transcription File (Transcriptionfile) op('stt_whisper').par.Transcriptionfile File

Audio or video file to transcribe (wav, mp3, mp4, mkv, etc.)

Default:
"" (Empty String)
Smart VAD Chunking (Smartchunking) op('stt_whisper').par.Smartchunking Toggle
Default:
True
Pause Sensitivity (Pausesensitivity) op('stt_whisper').par.Pausesensitivity Float
Default:
0.1
Range:
0 to 1
Slider Range:
0 to 1
Max Chunk Duration (sec) (Maxchunkduration) op('stt_whisper').par.Maxchunkduration Float
Default:
8.0
Range:
3 to 15
Slider Range:
3 to 15
Chunk Duration (sec) (Chunkduration) op('stt_whisper').par.Chunkduration Float
Default:
0.8
Range:
0.5 to 5
Slider Range:
0.5 to 5
Clear Transcript (Cleartranscript) op('stt_whisper').par.Cleartranscript Pulse
Default:
False
Phrases to Avoid (Phrasestoavoid) op('stt_whisper').par.Phrasestoavoid Str
Default:
"" (Empty String)
Custom Spellings (Prompt) (Customspellings) op('stt_whisper').par.Customspellings Str
Default:
"" (Empty String)
Use VAD Filter (Usevad) op('stt_whisper').par.Usevad Toggle
Default:
True
VAD Threshold (Vadthreshold) op('stt_whisper').par.Vadthreshold Float
Default:
0.5
Range:
0 to 1
Slider Range:
0 to 1
VAD Min Silence (ms) (Vadminsilence) op('stt_whisper').par.Vadminsilence Int
Default:
250
Range:
50 to 2000
Slider Range:
50 to 2000
Beam Search Size (Beamsearchsize) op('stt_whisper').par.Beamsearchsize Int
Default:
5
Range:
1 to 20
Slider Range:
1 to 20
Install Dependencies (Installdependencies) op('stt_whisper').par.Installdependencies Pulse
Default:
False
Download Model (Downloadmodel) op('stt_whisper').par.Downloadmodel Pulse
Default:
False
Download Progress (Downloadprogress) op('stt_whisper').par.Downloadprogress Float
Default:
0.0
Range:
0 to 1
Slider Range:
0 to 1
Worker Logging Level (Workerlogging) op('stt_whisper').par.Workerlogging Menu
Default:
OFF
Options:
OFF, CRITICAL, ERROR, WARNING, INFO, DEBUG
Worker Connection Settings (Header)
IPC Mode (Ipcmode) op('stt_whisper').par.Ipcmode Menu
Default:
tcp
Options:
tcp, stdio
Monitor Worker Logs (stderr) (Monitorworkerlogs) op('stt_whisper').par.Monitorworkerlogs Toggle
Default:
False
Auto Reattach On Init (Autoreattachoninit) op('stt_whisper').par.Autoreattachoninit Toggle
Default:
False
Force Attach (Skip PID Check) (Forceattachoninit) op('stt_whisper').par.Forceattachoninit Toggle
Default:
False
Changelog

v1.2.4 (2026-03-26)
  • Expand segments_out from 3 to 7 columns: add Confidence, IsFinal, Speaker, Language
  • Wire segment confidence from worker result with 2dp formatting
  • Add header enforcement to segments_out on init
  • Align LastTranscriptionResult to standard schema: text, confidence, is_final, speaker, language, mode
v1.2.3 (2026-01-28)
  • Remove torch from dependency list (faster-whisper uses CTranslate2, not torch)
  • Remove torch CUDA check and install handler
  • Keep nvidia-cudnn-cu12 (required for CTranslate2 GPU acceleration)
v1.2.2 (2026-01-28)
  • Fix TD 32050+ freeze by removing faster_whisper/torch imports from TouchDesigner
  • Subprocess worker handles all ML imports
  • Remove torch from dependencies (faster-whisper uses CTranslate2, not torch)
  • Initial commit
v1.2.1 (2025-08-29)

Cleaned up the menu and added the Segments parameter to show segments in out1 instead of the whole transcript.

v1.2.0 (2025-08-17)
  • NEW: TCP IPC Mode - Added robust TCP communication with worker processes (recommended over STDIO)
  • NEW: Auto Worker Reattach - Automatically reconnect to existing workers on TD restart/reload
  • NEW: TCP Heartbeat System - Automatic connection monitoring with reconnect on timeout
  • NEW: Force Attach Mode - Skip PID checks for manual worker attachment scenarios
  • NEW: CHOP Channel Monitoring - Script CHOP callbacks for real-time status/event monitoring
  • NEW: Enhanced Dependencies - Pulse channels for TranscriptionComplete, EmptyTranscription, SentenceEnd
  • NEW: Worker Connection Management - Improved process lifecycle with graceful shutdown/cleanup
  • NEW: Status Channel Outputs - WorkerActive, ModelReady, TranscriptionActive, DownloadInProgress states
  • NEW: Result Metadata Tracking - LastTranscriptionResult dependency with timestamp/mode info
  • IMPROVED: Connection Reliability - Automatic TCP reconnection and worker process persistence
v1.1.2 (2025-07-24)

Added

  • File Processing Mode: A new 'File' option in the 'Operating Mode' parameter to transcribe audio/video files from disk.
  • Transcription File Parameter: A new Transcription File parameter to select a file for transcription. Supported formats include WAV, MP3, MP4, MKV, and more.

Changed

  • The Active parameter now automatically turns off upon completion of file transcription.
  • The Engine Status parameter now displays the current file being processed (e.g., "Transcribing File: my_video.mp4").

Fixed

  • N/A
v1.1.1 (2025-07-03)

Added the Copy Transcript to Clipboard (Copytranscript) parameter, which copies the full transcript to the system clipboard.

v1.1.0 (2025-06-30)

This major update overhauls the stt_whisper component, moving from a blocking, in-process transcription model to a robust, non-blocking external worker architecture. This significantly improves performance and stability within TouchDesigner. The parameter interface has also been completely redesigned for clarity and ease of use.

✨ New Features & Major Changes

  • External Worker Process: The core faster-whisper transcription now runs in a separate Python process, preventing the main TouchDesigner thread from blocking or freezing during transcription.
  • Push-to-Talk Mode: Added a new "Push to Talk" operating mode alongside the default "Stream" mode. This allows for higher-accuracy transcription of complete thoughts by buffering audio and processing it all at once upon release.
  • Smart VAD Chunking: Implemented an intelligent, VAD-based chunking strategy for streaming mode. This feature waits for natural pauses in speech before sending audio for transcription, dramatically improving the quality and flow of the final transcript by reducing mid-sentence cuts.
  • Content & Spelling Control:
    • Phrases to Avoid: Added a parameter to suppress specific words or phrases (e.g., common hallucinations like "Thanks for watching") from the final transcript.
    • Custom Spellings: Added a parameter to provide the model with an initial prompt containing custom spellings for technical terms or proper nouns (e.g., "TouchDesigner, LOPs, NVIDIA").
  • Reactive State Dependencies: Implemented tdu.Dependency objects for WorkerActive, ModelReady, and TranscriptionActive, allowing other operators to reactively monitor the component's state.
  • Engine Status Display: Added a read-only "Engine Status" parameter that provides a clear, human-readable summary of the current state (e.g., "Shutdown", "Initializing...", "Ready", "Recording (PTT)", "Transcribing (Stream)").

UI/UX & Parameter Improvements

  • Complete Parameter Overhaul: The operator's custom parameters have been completely reorganized into logical pages (Faster Whisper, VAD / Filter, Content) for a much cleaner and more intuitive user experience.
  • Simplified Configuration: The Model Path is now automatically derived from the central ChatTD component's configuration, removing the need for manual path setup.
  • Intuitive Language Selection: The "Language" parameter is now a searchable dropdown menu populated with all languages supported by Whisper, with "Auto Detect" as the default.
  • Pause Sensitivity Control: The "Smart VAD Level" has been replaced with an intuitive 0-1 "Pause Sensitivity" slider, which is much more user-friendly.
  • Mode-Switching Logic: Implemented a robust callback for the "Operating Mode" parameter to ensure clean and predictable state transitions when switching between "Stream" and "Push to Talk" modes.

🐛 Bug Fixes & Performance

  • Fixed Transcription Duplication: Resolved a critical bug where the previous audio overlap strategy caused duplicated text in the output. The new version uses a more robust deduplication cache.
  • Fixed Punctuation Artifacts: Implemented a smart post-processing filter to intelligently remove erroneous periods, dashes, and other punctuation artifacts that often appear at chunk boundaries, while preserving legitimate sentence structure.
  • Improved Logging: Refined logging to be more concise and useful, removing excessive debug messages from the final version.
  • Linter Errors Resolved: Cleaned up all linter errors from previous development iterations.