STT Whisper
Overview
The STT Whisper LOP performs local speech-to-text transcription using the faster-whisper library. It runs a separate worker process to keep TouchDesigner responsive, supports GPU acceleration via CUDA, and offers three operating modes: continuous streaming, push-to-talk, and file processing. It supports 99 languages with models ranging from 39M to 1550M parameters.
Requirements
- Python Packages: faster-whisper, nvidia-cudnn-cu12, huggingface-hub, and numpy 1.24.x must be installed in the shared Python environment. Use the Install Dependencies button on the Install/Settings page to install them automatically. The installer also handles downgrading numpy 2.x to the compatible 1.24.1 version if needed.
- Model Download: A Whisper model must be downloaded before first use. Use Download Model on the Install/Settings page, or pulse Initialize Whisper on the Faster Whisper page — if the model is not yet downloaded, you will be prompted to download it. Models are sourced from HuggingFace and verified for integrity before loading.
- NVIDIA GPU (optional): For CUDA acceleration. CPU-only operation is fully supported.
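Where the environment needs to be checked outside the installer, the numpy pin can be verified with a short script. This is an illustrative sketch, not the installer's actual logic:

```python
# Illustrative check for the numpy 1.24.x pin described above
# (numpy 2.x is incompatible with the worker).

def needs_downgrade(version: str) -> bool:
    """True if this numpy version falls outside the 1.24.x series."""
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) != (1, 24)


if __name__ == "__main__":
    try:
        import numpy
        status = "needs 1.24.1" if needs_downgrade(numpy.__version__) else "ok"
        print(f"numpy {numpy.__version__}: {status}")
    except ImportError:
        print("numpy not installed")
```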
Input / Output
Inputs
- Input 1: Audio stream. Connect an Audio Device In CHOP or any 16kHz mono float32 audio source. The operator receives audio through its ReceiveAudioChunk() method, which is called automatically when audio is wired to the input.
Outputs
- Output 1 (when Output Segments is enabled): A table DAT with columns Start, End, Text, Confidence, IsFinal, Speaker, and Language, containing timestamped transcription segments.
- transcription_out (internal DAT): The accumulated full transcript as plain text.
- segments_out (internal DAT): All segments with start/end timestamps, confidence scores, and metadata.
First-Time Setup
- On the Install/Settings page, pulse Install Dependencies. Follow the prompts to install the required Python packages. A TouchDesigner restart may be needed after installation.
- On the Faster Whisper page, select a Model Size. The “Distil Large v3” model is recommended as a good balance of speed and accuracy.
- Pulse Download Model on the Install/Settings page, or pulse Initialize Whisper on the Faster Whisper page — if the model is not yet downloaded, you will be prompted to download it.
- Once the Whisper Status reads “Ready”, the operator is ready for transcription.
Auto-Initialization
Enable Initialize On Start on the Faster Whisper page to have the engine start automatically when the project opens, without needing to pulse Initialize Whisper manually each time. This is independent of Auto Reattach On Init on the Install/Settings page, which reconnects to a worker that survived a TouchDesigner reinitialization rather than starting a new one.
Stream Mode (Live Transcription)
- Set Operating Mode to “Stream (Live)” on the Faster Whisper page.
- Connect a 16kHz audio source to the operator’s input.
- Pulse Initialize Whisper to start the engine.
- Toggle Transcription Active to On. The operator begins continuously transcribing incoming audio.
- Transcription text appears in the transcription_out DAT. Timestamped segments appear in segments_out.
Stream mode uses a chunking strategy controlled by Chunk Duration and Max Chunk Duration. When Smart VAD Chunking is enabled, the operator waits for natural speech pauses (controlled by Pause Sensitivity) before sending audio to the worker, improving phrase coherence. The audio buffer is capped at 60 seconds in Stream mode to prevent unbounded memory growth — older audio is discarded as new audio arrives.
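The chunking policy above can be sketched in plain Python. Class and threshold names here are illustrative stand-ins for the component's internal logic, and a simple RMS level is used as a proxy for the Silero VAD pause detection:

```python
# Illustrative sketch of stream-mode chunking: emit a chunk on a natural
# pause (Smart VAD Chunking) or when Max Chunk Duration is reached, and
# cap the rolling buffer at 60 s. Names/thresholds are stand-ins; the
# component uses Silero VAD rather than this simple RMS proxy.

SAMPLE_RATE = 16000
MAX_BUFFER_S = 60.0  # stream-mode buffer cap


class StreamChunker:
    def __init__(self, chunk_s=0.8, max_chunk_s=8.0, pause_rms=0.02):
        self.chunk_s = chunk_s          # minimum audio before a chunk may emit
        self.max_chunk_s = max_chunk_s  # hard cut even mid-speech
        self.pause_rms = pause_rms      # stand-in for Pause Sensitivity
        self.buf = []

    def feed(self, samples):
        """Append incoming samples; return a chunk to transcribe, or None."""
        if not samples:
            return None
        self.buf.extend(samples)
        cap = int(MAX_BUFFER_S * SAMPLE_RATE)
        if len(self.buf) > cap:         # discard oldest audio past 60 s
            self.buf = self.buf[-cap:]
        secs = len(self.buf) / SAMPLE_RATE
        if secs < self.chunk_s:
            return None
        tail = samples[-int(0.1 * SAMPLE_RATE):]   # last ~100 ms of input
        rms = (sum(x * x for x in tail) / len(tail)) ** 0.5
        if rms < self.pause_rms or secs >= self.max_chunk_s:
            chunk, self.buf = self.buf, []
            return chunk
        return None
```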
Push-to-Talk Mode
- Set Operating Mode to “Push to Talk”.
- Connect your audio source and initialize the engine.
- Toggle Transcription Active to On to begin recording. Audio accumulates in a buffer.
- Toggle Transcription Active to Off to stop recording and trigger transcription of the buffered audio.
Push-to-talk is ideal for discrete speech segments. The audio buffer has no size limit in this mode, so recordings of any length are supported. Recordings larger than 20MB of raw audio are automatically split into chunks with session tracking to ensure the worker processes them reliably.
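The 20MB split works out to 5,242,880 float32 samples per chunk (4 bytes per sample). A minimal sketch of such a splitter, where the session-ID scheme is an assumption rather than the component's actual bookkeeping:

```python
# Sketch of the push-to-talk splitter: raw audio is float32 (4 bytes
# per sample), so 20 MB of raw audio is 5,242,880 samples. The
# session-ID bookkeeping is illustrative only.

import itertools

BYTES_PER_SAMPLE = 4                      # float32
CHUNK_SAMPLES = 20 * 1024 * 1024 // BYTES_PER_SAMPLE

_sessions = itertools.count(1)


def split_recording(samples):
    """Yield (session_id, part_index, total_parts, chunk) for a recording."""
    session = next(_sessions)
    total = max(1, -(-len(samples) // CHUNK_SAMPLES))  # ceiling division
    for i in range(total):
        yield (session, i, total,
               samples[i * CHUNK_SAMPLES:(i + 1) * CHUNK_SAMPLES])
```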
File Processing Mode
- Set Operating Mode to “File Processing”.
- On the Faster Whisper page, set Transcription File to the path of an audio or video file (wav, mp3, mp4, mkv, etc.).
- Initialize the engine if not already running.
- Toggle Transcription Active to On. The file is sent to the worker for transcription, and Transcription Active turns Off automatically when processing completes.
Translation
Set Task Type to “Translate (To English)” to translate any supported language into English during transcription, rather than keeping the original language.
Model Selection Guide
| Category | Models | Parameters | Notes |
|---|---|---|---|
| Tiny | Tiny, Tiny EN-only | 39M | Fastest, basic accuracy |
| Base | Base, Base EN-only | 74M | Good balance for low-resource setups |
| Small | Small, Small EN-only | 244M | Solid general-purpose |
| Medium | Medium, Medium EN-only | 769M | High accuracy |
| Large | Large v1/v2/v3, Large v3 Turbo | 809M–1550M | Maximum accuracy |
| Distil | Distil Small/Medium EN, Distil Large v2/v3/v3.5 | 166M–756M | Near-large accuracy at much faster speed |
All 17 models are referenced by their full HuggingFace repository ID (e.g., Systran/faster-whisper-large-v3). EN-only models provide better English performance. Distil models are recommended for most use cases — they offer near-large-model accuracy with significantly faster inference. The “Distil Large v3” model is marked as recommended, and “Distil Large v3.5” is the latest addition.
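A label-to-repository mapping like the one the operator uses internally might look as follows. Only Systran/faster-whisper-large-v3 is confirmed above; the other repo IDs follow Systran's published naming scheme and are assumptions:

```python
# Illustrative mapping from menu labels to HuggingFace repository IDs.
# Only "Systran/faster-whisper-large-v3" is confirmed by this page; the
# other IDs are assumptions based on Systran's naming convention.

MODEL_REPOS = {
    "Tiny": "Systran/faster-whisper-tiny",
    "Base": "Systran/faster-whisper-base",
    "Small": "Systran/faster-whisper-small",
    "Medium": "Systran/faster-whisper-medium",
    "Large v3": "Systran/faster-whisper-large-v3",
    "Distil Large v3": "Systran/faster-distil-whisper-large-v3",
}


def repo_for(label: str) -> str:
    """Resolve a menu label to its repository ID."""
    if label not in MODEL_REPOS:
        raise ValueError(f"unknown model label: {label}")
    return MODEL_REPOS[label]
```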
Voice Activity Detection (VAD)
The VAD page provides filtering to reduce false transcriptions from non-speech audio.
- Use VAD Filter: Enables Silero VAD to filter out non-speech segments before transcription.
- VAD Threshold: Controls how aggressively non-speech is filtered. Higher values are stricter.
- VAD Min Silence: Minimum silence duration (in ms) to consider as a speech boundary.
- Beam Search Size: Controls the beam search width. Higher values may improve accuracy at the cost of speed.
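The interaction between VAD Threshold and VAD Min Silence can be illustrated with a toy segmenter over per-frame speech probabilities. The component uses Silero VAD internally; the frame size and logic below are illustrative assumptions:

```python
# Sketch of how VAD Threshold and VAD Min Silence interact: frames whose
# speech probability clears the threshold are grouped into segments, and
# a silence gap shorter than min_silence_ms does NOT split a segment.

FRAME_MS = 32  # a typical Silero VAD frame hop; an assumption here


def speech_segments(probs, threshold=0.5, min_silence_ms=250):
    """Return (start_ms, end_ms) spans of detected speech."""
    min_gap = min_silence_ms // FRAME_MS       # silence frames to split
    segments, start, silence = [], None, 0
    for i, p in enumerate(probs):
        if p >= threshold:
            if start is None:
                start = i
            silence = 0
        elif start is not None:
            silence += 1
            if silence >= min_gap:
                segments.append((start * FRAME_MS,
                                 (i - silence + 1) * FRAME_MS))
                start, silence = None, 0
    if start is not None:
        segments.append((start * FRAME_MS, (len(probs) - silence) * FRAME_MS))
    return segments
```

Raising min_silence_ms merges the two speech bursts below into one segment, which is exactly why a higher VAD Min Silence produces longer, less fragmented phrases.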
Custom Filtering
- Phrases to Avoid: Enter a comma-separated list of phrases to suppress. The operator converts these to token IDs and suppresses them during transcription.
- Custom Spellings (Prompt): Provide domain-specific terms or spellings as an initial prompt to guide the model’s output.
Reactive Dependencies
The operator exposes several reactive state values that external Script CHOPs or other operators can monitor:
- WorkerActive: True while the worker process is running.
- ModelReady: True when the model is loaded and ready for transcription.
- TranscriptionActive: True during active transcription.
- DownloadInProgress: True while a model is being downloaded from HuggingFace.
- TranscriptionComplete: Pulses True briefly when any transcription result arrives.
- EmptyTranscription: Pulses True when a transcription completes with no speech detected (File and Push-to-Talk modes only).
- OnSentenceEnd: Pulses True when the accumulated text ends with sentence punctuation.
- LastTranscriptionResult: Contains metadata about the most recent transcription result, including text content, confidence, mode, and timestamp.
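As an example of the OnSentenceEnd condition, here is a sketch of a sentence-boundary check. The exact punctuation set and trailing-quote handling are assumptions about the operator's logic:

```python
# Sketch of the OnSentenceEnd condition: pulse when the accumulated
# transcript ends at a sentence boundary. Punctuation set and
# trailing-quote handling are assumptions.

SENTENCE_END = (".", "!", "?", "…")


def ends_sentence(transcript: str) -> bool:
    """True when the transcript ends with sentence-final punctuation."""
    stripped = transcript.rstrip(" \"')")   # ignore trailing quotes/parens
    return stripped.endswith(SENTENCE_END)
```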
Worker Connection Settings
On the Install/Settings page:
- IPC Mode: Choose between TCP (recommended) and STDIO (legacy). TCP provides better performance and supports worker reattachment across TouchDesigner reinitializations.
- Auto Reattach On Init: When enabled, the operator attempts to reconnect to a previously running worker process on initialization, avoiding a cold start.
- Force Attach (Skip PID Check): Attempts to attach using stored port/token without verifying the worker process is alive.
- Monitor Worker Logs (stderr): Forwards the worker process’s log output to the operator’s Logger.
- Worker Logging Level: Controls verbosity of the worker subprocess logs.
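The TCP mode exchanges messages between TouchDesigner and the worker. Below is a minimal sketch of a newline-delimited JSON request/response round trip with a session-token check (the field names and framing are illustrative, not the component's actual wire protocol):

```python
# Minimal sketch of a newline-delimited JSON round trip over localhost
# TCP with a session-token check. Fields ("cmd", "token", "ok") are
# illustrative; the actual worker protocol is internal to the operator.

import json
import socket
import threading


def worker_stub(server, token):
    """Stand-in worker: answer one request after validating the token."""
    conn, _ = server.accept()
    with conn, conn.makefile("rw") as f:
        req = json.loads(f.readline())
        f.write(json.dumps({"ok": req.get("token") == token,
                            "cmd": req.get("cmd")}) + "\n")
        f.flush()


def send_command(port, token, cmd):
    """Client side: one JSON line out, one JSON line back."""
    with socket.create_connection(("127.0.0.1", port)) as s, \
            s.makefile("rw") as f:
        f.write(json.dumps({"cmd": cmd, "token": token}) + "\n")
        f.flush()
        return json.loads(f.readline())


def demo():
    srv = socket.socket()
    srv.bind(("127.0.0.1", 0))   # OS-assigned port, as with the worker
    srv.listen(1)
    t = threading.Thread(target=worker_stub, args=(srv, "secret"))
    t.start()
    try:
        return send_command(srv.getsockname()[1], "secret", "status")
    finally:
        t.join()
        srv.close()
```

A token check like this is one reason reattachment can work across TouchDesigner reinitializations: the stored port and token are enough to re-establish a session with an already-running worker.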
Project Exit Cleanup
The operator provides an on_exit() method that shuts down the worker process and clears stored connection data so that Auto Reattach On Init does not attempt to reconnect to a stale worker on the next project open. To use it, wire an Execute DAT with its onExit callback enabled to call op('stt_whisper1').on_exit() when the project closes.
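A minimal Execute DAT callback for this cleanup might look like the following, where 'stt_whisper1' is a placeholder for your operator's actual path:

```python
# Execute DAT callback wiring for project-exit cleanup in TouchDesigner.
# 'stt_whisper1' is a placeholder; op() is TouchDesigner's built-in
# operator lookup, so this only runs inside TouchDesigner.

def onExit():
    # Shuts down the worker and clears stored connection data so
    # Auto Reattach On Init does not target a stale worker next open.
    op('stt_whisper1').on_exit()
    return
```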
Troubleshooting
Engine fails to initialize
- Ensure all dependencies are installed. Pulse Install Dependencies on the Install/Settings page.
- Verify the selected model is downloaded. Check Whisper Status for error messages.
- On macOS, incompatible compute types (FP16, INT8+FP16) are automatically overridden to INT8.
No transcription output
- Confirm Transcription Active is On and Whisper Status shows “Ready” or “Transcribing”.
- Check that your audio source is producing 16kHz float32 data.
- Try disabling Use VAD Filter temporarily to rule out aggressive filtering.
- Lower the VAD Threshold if speech is being filtered out.
High latency
- Use a smaller model or a Distil variant.
- Set Device to “CUDA” if an NVIDIA GPU is available.
- Reduce Max Chunk Duration and Chunk Duration for faster turnaround.
- Disable Smart VAD Chunking for immediate fixed-interval processing.
Worker crashes or pipe errors
- Check the worker logs by enabling Monitor Worker Logs on the Install/Settings page and setting Worker Logging Level to “Info” or “Debug”.
- Ensure numpy version 1.24.x is installed in the venv (version 2.x causes compatibility issues). The Install Dependencies button handles this automatically.
- On Windows, the operator sets KMP_DUPLICATE_LIB_OK=TRUE to prevent OpenMP conflicts.
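A sketch of how such an environment override can be applied when spawning a worker subprocess (illustrative only, not the operator's code):

```python
# Sketch of the Windows OpenMP workaround noted above: build the worker
# subprocess environment with KMP_DUPLICATE_LIB_OK set on win32.

import os
import sys


def worker_env(platform: str = sys.platform) -> dict:
    """Environment for the worker subprocess."""
    env = os.environ.copy()
    if platform == "win32":
        # prevents "OMP: Error #15" aborts from duplicate OpenMP runtimes
        env["KMP_DUPLICATE_LIB_OK"] = "TRUE"
    return env
```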
Research & Licensing
OpenAI
OpenAI is an AI research organization developing general-purpose AI systems. Their Whisper model represents a major advance in open-source speech recognition.
Whisper
Whisper is a general-purpose speech recognition model trained on diverse audio data, designed to be robust to accents, background noise, and technical language.
Technical Details
- Encoder-decoder transformer architecture with attention mechanisms
- Support for 99 languages with varying accuracy levels
- Multiple model sizes from Tiny (39M) to Large (1550M parameters)
Research Impact
- Open-source model widely adopted in production speech recognition applications
- Breakthrough in cross-lingual and zero-shot speech recognition
Citation
@article{radford2022robust,
  title={Robust Speech Recognition via Large-Scale Weak Supervision},
  author={Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  journal={arXiv preprint arXiv:2212.04356},
  year={2022},
  url={https://arxiv.org/abs/2212.04356}
}
Key Research Contributions
- Large-scale weak supervision training on 680,000 hours of multilingual audio data
- Zero-shot transfer capabilities across languages and domains
- Robust performance approaching human-level accuracy in speech recognition
License
MIT License - This model is freely available for research and commercial use.
Parameters
Faster Whisper
- op('stt_whisper').par.Status (Str) - Default: "" (Empty String)
- op('stt_whisper').par.Active (Toggle) - Default: False
- op('stt_whisper').par.Copytranscript (Pulse) - Default: False
- op('stt_whisper').par.Enginestatus (Str) - Default: Shutdown
- op('stt_whisper').par.Initialize (Pulse) - Default: False
- op('stt_whisper').par.Shutdown (Pulse) - Default: False
- op('stt_whisper').par.Initializeonstart (Toggle) - Default: False
- op('stt_whisper').par.Segments (Toggle) - Default: False
- op('stt_whisper').par.Transcriptionfile (File) - Audio or video file to transcribe (wav, mp3, mp4, mkv, etc.) - Default: "" (Empty String)
- op('stt_whisper').par.Smartchunking (Toggle) - Default: True
- op('stt_whisper').par.Pausesensitivity (Float) - Default: 0.1 - Range: 0 to 1 (slider: 0 to 1)
- op('stt_whisper').par.Maxchunkduration (Float) - Default: 8.0 - Range: 3 to 15 (slider: 3 to 15)
- op('stt_whisper').par.Chunkduration (Float) - Default: 0.8 - Range: 0.5 to 5 (slider: 0.5 to 5)
- op('stt_whisper').par.Cleartranscript (Pulse) - Default: False
- op('stt_whisper').par.Phrasestoavoid (Str) - Default: "" (Empty String)
- op('stt_whisper').par.Customspellings (Str) - Default: "" (Empty String)
- op('stt_whisper').par.Usevad (Toggle) - Default: True
- op('stt_whisper').par.Vadthreshold (Float) - Default: 0.5 - Range: 0 to 1 (slider: 0 to 1)
- op('stt_whisper').par.Vadminsilence (Int) - Default: 250 - Range: 50 to 2000 (slider: 50 to 2000)
- op('stt_whisper').par.Beamsearchsize (Int) - Default: 5 - Range: 1 to 20 (slider: 1 to 20)
Install/Settings
- op('stt_whisper').par.Installdependencies (Pulse) - Default: False
- op('stt_whisper').par.Downloadmodel (Pulse) - Default: False
- op('stt_whisper').par.Downloadprogress (Float) - Default: 0.0 - Range: 0 to 1 (slider: 0 to 1)
- op('stt_whisper').par.Monitorworkerlogs (Toggle) - Default: False
- op('stt_whisper').par.Autoreattachoninit (Toggle) - Default: False
- op('stt_whisper').par.Forceattachoninit (Toggle) - Default: False
Changelog
v1.2.4 (2026-03-26)
- Expand segments_out from 3 to 7 columns: add Confidence, IsFinal, Speaker, Language
- Wire segment confidence from worker result with 2dp formatting
- Add header enforcement to segments_out on init
- Align LastTranscriptionResult to standard schema: text, confidence, is_final, speaker, language, mode
v1.2.3 (2026-01-28)
- Remove torch from dependency list (faster-whisper uses CTranslate2, not torch)
- Remove torch CUDA check and install handler
- Keep nvidia-cudnn-cu12 (required for CTranslate2 GPU acceleration)
v1.2.2 (2026-01-28)
- Fix TD 32050+ freeze by removing faster_whisper/torch imports from TD
- Subprocess worker handles all ML imports
- Remove torch from dependencies (faster-whisper uses CTranslate2, not torch)
v1.2.1 (2025-08-29)
Cleaned up the menu and added a Segments parameter to show segments in out1 instead of the whole transcript.
v1.2.0 (2025-08-17)
- NEW: TCP IPC Mode - Added robust TCP communication with worker processes (recommended over STDIO)
- NEW: Auto Worker Reattach - Automatically reconnect to existing workers on TD restart/reload
- NEW: TCP Heartbeat System - Automatic connection monitoring with reconnect on timeout
- NEW: Force Attach Mode - Skip PID checks for manual worker attachment scenarios
- NEW: CHOP Channel Monitoring - Script CHOP callbacks for real-time status/event monitoring
- NEW: Enhanced Dependencies - Pulse channels for TranscriptionComplete, EmptyTranscription, SentenceEnd
- NEW: Worker Connection Management - Improved process lifecycle with graceful shutdown/cleanup
- NEW: Status Channel Outputs - WorkerActive, ModelReady, TranscriptionActive, DownloadInProgress states
- NEW: Result Metadata Tracking - LastTranscriptionResult dependency with timestamp/mode info
- IMPROVED: Connection Reliability - Automatic TCP reconnection and worker process persistence
v1.1.2 (2025-07-24)
Added
- File Processing Mode: A new 'File' option in the 'Operating Mode' parameter to transcribe audio/video files from disk.
- Transcription File Parameter: A new Transcription File parameter to select a file for transcription. Supported formats include WAV, MP3, MP4, MKV, and more.
Changed
- The Active parameter now automatically turns off upon completion of file transcription.
- The Engine Status parameter now displays the current file being processed (e.g., "Transcribing File: my_video.mp4").
Fixed
- N/A
v1.1.1 (2025-07-03)
Added the Copytranscript parameter, which copies the full transcript to the system clipboard.
v1.1.0 (2025-06-30)
This major update overhauls the stt_whisper component, moving from a blocking, in-process transcription model to a robust, non-blocking external worker architecture. This significantly improves performance and stability within TouchDesigner. The parameter interface has also been completely redesigned for clarity and ease of use.
✨ New Features & Major Changes
- External Worker Process: The core faster-whisper transcription now runs in a separate Python process, preventing the main TouchDesigner thread from blocking or freezing during transcription.
- Push-to-Talk Mode: Added a new "Push to Talk" operating mode alongside the default "Stream" mode. This allows for higher-accuracy transcription of complete thoughts by buffering audio and processing it all at once upon release.
- Smart VAD Chunking: Implemented an intelligent, VAD-based chunking strategy for streaming mode. This feature waits for natural pauses in speech before sending audio for transcription, dramatically improving the quality and flow of the final transcript by reducing mid-sentence cuts.
- Content & Spelling Control:
- Phrases to Avoid: Added a parameter to suppress specific words or phrases (e.g., common hallucinations like "Thanks for watching") from the final transcript.
- Custom Spellings: Added a parameter to provide the model with an initial prompt containing custom spellings for technical terms or proper nouns (e.g., "TouchDesigner, LOPs, NVIDIA").
- Reactive State Dependencies: Implemented tdu.Dependency objects for WorkerActive, ModelReady, and TranscriptionActive, allowing other operators to reactively monitor the component's state.
- Engine Status Display: Added a read-only "Engine Status" parameter that provides a clear, human-readable summary of the current state (e.g., "Shutdown", "Initializing...", "Ready", "Recording (PTT)", "Transcribing (Stream)").
UI/UX & Parameter Improvements
- Complete Parameter Overhaul: The operator's custom parameters have been completely reorganized into logical pages (Faster Whisper, VAD / Filter, Content) for a much cleaner and more intuitive user experience.
- Simplified Configuration: The Model Path is now automatically derived from the central ChatTD component's configuration, removing the need for manual path setup.
- Intuitive Language Selection: The "Language" parameter is now a searchable dropdown menu populated with all languages supported by Whisper, with "Auto Detect" as the default.
- Pause Sensitivity Control: The "Smart VAD Level" has been replaced with an intuitive 0-1 "Pause Sensitivity" slider, which is much more user-friendly.
- Mode-Switching Logic: Implemented a robust callback for the "Operating Mode" parameter to ensure clean and predictable state transitions when switching between "Stream" and "Push to Talk" modes.
🐛 Bug Fixes & Performance
- Fixed Transcription Duplication: Resolved a critical bug where the previous audio overlap strategy caused duplicated text in the output. The new version uses a more robust deduplication cache.
- Fixed Punctuation Artifacts: Implemented a smart post-processing filter to intelligently remove erroneous periods, dashes, and other punctuation artifacts that often appear at chunk boundaries, while preserving legitimate sentence structure.
- Improved Logging: Refined logging to be more concise and useful, removing excessive debug messages from the final version.
- Linter Errors Resolved: Cleaned up all linter errors from previous development iterations.