STT AssemblyAI
Overview
The STT AssemblyAI operator transcribes audio into text in real time using AssemblyAI’s v3 streaming WebSocket API. It accepts live audio input from a CHOP, streams it to AssemblyAI’s servers, and returns continuous transcription results with turn detection, confidence scores, speaker identification, and automatic punctuation. Four speech models are available, ranging from low-latency English-only to multilingual and high-accuracy options.
Key Features
- Four speech models: Universal English, Universal Multilingual, Whisper RT, and Universal-3 Pro
- Real-time speaker diarization with configurable speaker count
- Automatic language detection for multilingual audio
- Key terms prompting to improve recognition of domain-specific vocabulary
- Medical domain specialization mode
- Configurable VAD threshold, turn detection, and idle timeout
- Per-session cost tracking with cumulative totals
- Channel monitoring via tdu.Dependency for integration with CHOP networks
Requirements
- An active AssemblyAI account and API key.
- The `websockets` and `certifi` Python libraries. Use the Install Dependencies button on the Install/Config page to install them automatically.
Input/Output
Inputs
The operator accepts a single CHOP input carrying audio data. An Audio Device In CHOP is the most common source.
Outputs
- transcription_out: A text DAT containing the full running transcript.
- segments_out: A table DAT with individual turn segments. Columns: Start, End, Text, Confidence, IsFinal, Speaker, and Language. Enable Output Segments (out1) on the STTAssemblyAI page to route this table to the operator’s first output connector.
- session_info: A table DAT tracking the current session ID, status, duration, and provider (including which speech model is in use).
- cost_history: A table DAT logging each session’s start/end times, duration, and estimated cost. This data persists across project saves.
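As a rough illustration of consuming the segments table, the sketch below turns header-plus-rows data into dictionaries and keeps only finalized turns. The column names come from the list above; the sample rows, the helper names, and the assumption that IsFinal is stored as text such as "True" or "1" are hypothetical.

```python
# Column order as documented for segments_out.
COLUMNS = ["Start", "End", "Text", "Confidence", "IsFinal", "Speaker", "Language"]

def rows_to_dicts(table):
    """Turn [header, row, row, ...] into a list of dicts keyed by the header."""
    if not table:
        return []
    header, *rows = table
    return [dict(zip(header, row)) for row in rows]

def final_segments(table):
    """Keep rows whose IsFinal cell reads as true (text encoding is an assumption)."""
    truthy = {"1", "true"}
    return [
        seg for seg in rows_to_dicts(table)
        if str(seg["IsFinal"]).strip().lower() in truthy
    ]

# Hypothetical sample data standing in for the DAT contents.
sample = [
    COLUMNS,
    ["0.0", "1.2", "hello", "0.91", "False", "A", "en"],
    ["0.0", "1.4", "Hello there.", "0.95", "True", "A", "en"],
]
finals = final_segments(sample)
```

Inside TouchDesigner, the same list-of-rows shape could presumably be built from a table DAT with `[[cell.val for cell in row] for row in dat.rows()]`, where `dat` is a reference to the segments table.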
Usage Examples
Real-time Transcription from a Microphone
- Place an STT AssemblyAI operator in your network.
- Connect an Audio Device In CHOP to its input.
- On the Install/Config page, pulse Install Dependencies if you have not already installed the required libraries. Restart TouchDesigner after installation completes.
- Enter your API key in the AssemblyAI API Key field, or let it load automatically if stored in KeyManager. Pulse Get API Key to open the AssemblyAI dashboard if you need to obtain one.
- On the STTAssemblyAI page, select your preferred Speech Model.
- Toggle Streaming Active to On. The operator will auto-connect if needed.
- Speak into your microphone. Transcribed text appears in the output DATs in real time.
- When finished, toggle Connected to Off to end the session and stop billing.
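The same start and stop steps can also be scripted. The sketch below assumes the operator is passed in as `stt` (for example `op('stt_assemblyai')`) and uses the `Active` and `Connected` parameter names listed under Parameters. The helper names and the `SimpleNamespace` stand-in, which lets the snippet run outside TouchDesigner, are hypothetical.

```python
from types import SimpleNamespace

def start_streaming(stt):
    """Enable streaming; per the steps above, the operator auto-connects if needed."""
    stt.par.Active = True

def stop_session(stt):
    """Stop streaming, then close the connection so the session stops billing."""
    stt.par.Active = False
    stt.par.Connected = False

# Stand-in object so the sketch runs outside TouchDesigner (not the real operator).
stub = SimpleNamespace(par=SimpleNamespace(Active=False, Connected=True))
start_streaming(stub)
stop_session(stub)
```

In a real project the stub would be replaced by the operator reference. Setting `Connected` to Off explicitly mirrors the last step above rather than relying on the idle timeout.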
Choosing a Speech Model
- On the STTAssemblyAI page, open the Speech Model menu.
- Select from the available models:
  - Universal English: Optimized for English with low latency. Good default for most use cases.
  - Universal Multilingual: Supports multiple languages with automatic detection.
  - Whisper RT: AssemblyAI’s real-time Whisper implementation.
  - Universal-3 Pro: Highest accuracy model with speaker diarization support. Billed at $0.45/hour instead of $0.15/hour.
- Note that the connection must be re-established when switching models. Toggle Connected Off and back On after changing the model.
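To make the billing difference concrete, here is a minimal sketch that estimates session cost from the hourly rates quoted in this document. The function name and the use of the menu labels as dictionary keys are illustrative assumptions, not the operator’s internal code.

```python
# Hourly rates in USD as stated in this document.
HOURLY_RATE_USD = {
    "Universal English": 0.15,
    "Universal Multilingual": 0.15,
    "Whisper RT": 0.15,
    "Universal-3 Pro": 0.45,
}

def estimate_cost(model, duration_seconds):
    """Estimate the cost in USD of a session of the given length."""
    if duration_seconds < 0:
        raise ValueError("duration_seconds must be non-negative")
    return HOURLY_RATE_USD[model] * duration_seconds / 3600.0

# A 30-minute session on each pricing tier.
standard = estimate_cost("Universal English", 30 * 60)
pro = estimate_cost("Universal-3 Pro", 30 * 60)
```

Under these rates, a session of any length on Universal-3 Pro costs three times as much as the same session on the other models.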
Speaker Diarization
- On the Model Settings page, enable Speaker Labels.
- Set Max Speakers to the expected number of speakers in the audio (1-10).
- Connect and start streaming. The segments table will include a Speaker column identifying which speaker produced each turn.
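Once the Speaker column is populated, per-speaker statistics are straightforward to derive. The following sketch sums the duration of finalized turns per speaker, in whatever time unit the Start and End columns use. The speaker labels "A" and "B", the sample values, and the function name are hypothetical.

```python
from collections import defaultdict

def talk_time_by_speaker(segments):
    """Sum End - Start over finalized segments, grouped by the Speaker field."""
    totals = defaultdict(float)
    for seg in segments:
        if seg["IsFinal"]:
            totals[seg["Speaker"]] += float(seg["End"]) - float(seg["Start"])
    return dict(totals)

# Hypothetical segments; the real label format is not specified in this document.
segments = [
    {"Start": 0.0, "End": 2.5, "Speaker": "A", "IsFinal": True},
    {"Start": 2.5, "End": 4.0, "Speaker": "B", "IsFinal": True},
    {"Start": 4.0, "End": 5.0, "Speaker": "A", "IsFinal": True},
    {"Start": 5.0, "End": 5.5, "Speaker": "B", "IsFinal": False},
]
totals = talk_time_by_speaker(segments)
```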
Improving Recognition Accuracy
- On the Model Settings page, enter domain-specific vocabulary in the Key Terms (comma-separated) field. For example: `TouchDesigner, CHOP, SOP, GLSL` for a TouchDesigner-focused conversation.
- If transcribing medical content, set Domain to Medical for specialized vocabulary recognition.
- Enable Language Detection when working with multilingual audio to automatically identify the spoken language per segment.
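For reference, a comma-separated Key Terms string can be split into individual terms roughly as follows. How the operator itself parses the field is not documented here, so the trimming and case-insensitive de-duplication in this sketch are assumptions.

```python
def parse_key_terms(raw):
    """Split a comma-separated string into trimmed, de-duplicated terms."""
    seen = set()
    terms = []
    for part in raw.split(","):
        term = part.strip()
        # Skip empty entries and case-insensitive repeats, keeping first spelling.
        if term and term.lower() not in seen:
            seen.add(term.lower())
            terms.append(term)
    return terms

terms = parse_key_terms("TouchDesigner, CHOP, SOP, GLSL")
```

Cleaning the list this way before pasting it into the field avoids stray empty entries from trailing commas.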
Tuning Turn Detection
- On the Model Settings page, enable Format Turns to receive punctuated, cased transcripts. Not available with Universal-3 Pro.
- Adjust End of Turn Threshold (0 to 1) to control how confidently the model must detect a pause before finalizing a turn. Higher values require longer, more definitive pauses. Not available with Universal-3 Pro.
- Set Min Turn Silence (ms) to control the minimum silence duration before a turn can end when the model is confident.
- Set Max Turn Silence (ms) to define the maximum silence before a turn is always ended, regardless of confidence.
- Adjust VAD Threshold to control voice activity detection sensitivity. Higher values require stronger voice signals to trigger detection.
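When setting these values from a script, it can help to clamp them to the ranges listed under Parameters first. The sketch below does that and returns a dictionary keyed by the parameter names. The rule that Max Turn Silence should not be lower than Min Turn Silence is an assumption drawn from the descriptions above, and the function names are hypothetical.

```python
def clamp(value, low, high):
    """Limit value to the inclusive range [low, high]."""
    return max(low, min(high, value))

def normalize_turn_settings(end_of_turn, min_silence_ms, max_silence_ms, vad):
    """Clamp turn-detection values to the documented parameter ranges."""
    min_ms = clamp(int(min_silence_ms), 50, 2000)
    max_ms = clamp(int(max_silence_ms), 100, 5000)
    # Assumption: the hard cutoff should never be shorter than the minimum.
    max_ms = max(max_ms, min_ms)
    return {
        "Endofturnthreshold": clamp(float(end_of_turn), 0.0, 1.0),
        "Minturnsilence": min_ms,
        "Maxturnsilence": max_ms,
        "Vadthreshold": clamp(float(vad), 0.0, 1.0),
    }

settings = normalize_turn_settings(1.4, 10, 9000, -0.2)
```

The resulting values could then be assigned to the matching parameters on the operator one by one.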
Channel Monitoring
The operator exposes its state through tdu.Dependency attributes, matching the pattern used by other STT operators. A Script CHOP inside the operator converts these into CHOP channels for use in TouchDesigner networks.
Available channel groups (configurable on the Script CHOP):
- Pulse Events: `transcription_complete`, `empty_transcription`, `sentence_end`. These pulse briefly when the corresponding event occurs, which is useful for triggering downstream logic.
- Status Data: `worker_active`, `model_ready`, `transcription_active`, `download_in_progress`, `connected`, `streaming_active`, `ready`. Continuous status channels reflecting the operator’s current state.
- Result Data: `last_has_segments`, `last_text_length`, `last_timestamp`, `last_confidence`, `last_is_final`. Metadata about the most recent transcription result.
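Pulse channels are typically consumed by watching for the moment they switch on. The sketch below shows that rising-edge logic on a plain list of samples. It assumes the pulse channels rest at 0 and rise to 1, which this document does not state explicitly, and the function name is hypothetical.

```python
def rising_edges(samples, threshold=0.5):
    """Return indices where the signal goes from below threshold to at or above it."""
    edges = []
    previous = 0.0
    for index, value in enumerate(samples):
        if value >= threshold and previous < threshold:
            edges.append(index)
        previous = value
    return edges

# Hypothetical samples of a pulse channel such as transcription_complete.
edges = rising_edges([0, 0, 1, 1, 0, 1, 0])
```

Within TouchDesigner itself, a CHOP Execute DAT watching the channel with its off-to-on callback would normally play this role, so hand-written edge detection is only needed when processing recorded channel data.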
Best Practices
- The Sample Rate and Audio Encoding fields display the current audio configuration. The default 16kHz PCM 16-bit LE offers the best balance of quality and bandwidth for streaming transcription.
- Set a reasonable Idle Timeout on the Model Settings page to automatically disconnect when no audio has been received for a period. This provides a safety net against forgotten connections, but do not rely on it as your primary cost control.
- Use Copy to Clipboard to quickly grab the current transcript for pasting elsewhere.
- The Estimated Total Cost ($) field tracks cumulative cost across all sessions. Keep in mind that Universal-3 Pro is billed at 3x the rate of other models.
- Use Key Terms when transcribing content with specialized vocabulary to improve accuracy.
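The idle-timeout safety net mentioned above boils down to remembering when audio last arrived and comparing that against a limit. Here is a minimal sketch of that idea, not the operator’s actual code. The class name is hypothetical, and the clock is injectable so the logic can be exercised without waiting.

```python
import time

class IdleWatchdog:
    """Track the time of the last audio chunk and report when the limit has passed."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self._clock = clock
        self._last_audio = clock()

    def audio_received(self):
        """Call whenever a new audio chunk arrives."""
        self._last_audio = self._clock()

    def should_disconnect(self):
        """True once no audio has arrived for at least timeout_seconds."""
        return self._clock() - self._last_audio >= self.timeout_seconds

# The changelog describes the timeout in minutes, so convert before use.
watchdog = IdleWatchdog(timeout_seconds=5 * 60)
```

A monotonic clock is used so that system clock adjustments cannot trigger or delay the disconnect.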
Troubleshooting
- “websockets library not installed”: Pulse Install Dependencies on the Install/Config page and restart TouchDesigner.
- Connection fails: Verify your API key is correct and that your network allows outbound WebSocket connections to `streaming.assemblyai.com`.
- No transcription appearing: Ensure your audio input CHOP is outputting valid audio data and that Streaming Active is toggled On. Check the operator’s Logger for debug messages.
- macOS SSL errors: The operator automatically sets the `SSL_CERT_FILE` environment variable using the `certifi` package. If issues persist, ensure `certifi` is installed via Install Dependencies.
- Speaker labels not appearing: Speaker diarization must be enabled before connecting. Toggle Connected Off, enable Speaker Labels on the Model Settings page, then reconnect.
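When diagnosing SSL errors, it may help to see what the certificate setup amounts to. The sketch below points `SSL_CERT_FILE` at the CA bundle shipped with `certifi`, assuming the package is importable. It approximates the behaviour described above rather than reproducing the operator’s own code, and the function name is hypothetical.

```python
import os

def ensure_ssl_cert_file(env=os.environ):
    """Set SSL_CERT_FILE to certifi's CA bundle; return the path, or None if unavailable."""
    try:
        import certifi
    except ImportError:
        # certifi is not installed: leave the environment untouched.
        return None
    path = certifi.where()
    env["SSL_CERT_FILE"] = path
    return path

# Use a throwaway dict here so the real process environment is not modified.
scratch_env = {}
bundle_path = ensure_ssl_cert_file(scratch_env)
```

If `bundle_path` comes back as `None` in the TouchDesigner Python environment, `certifi` is missing and Install Dependencies should be run again.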
Technical Notes
- The operator connects to AssemblyAI’s v3 streaming WebSocket endpoint (`wss://streaming.assemblyai.com/v3/ws`).
- Format Turns and End of Turn Threshold are not available when using the Universal-3 Pro model, which handles turn formatting internally.
- Audio is converted from float32 to int16 PCM before streaming, in 50ms chunks as recommended by AssemblyAI.
- The Idle Timeout feature is implemented locally by the operator, not by the AssemblyAI API. It monitors when the last audio chunk was received and disconnects after the configured period of inactivity.
- Cost estimation is calculated locally: $0.15/hour for Universal English, Universal Multilingual, and Whisper RT; $0.45/hour for Universal-3 Pro.
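The float32-to-int16 conversion and 50ms chunking described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the operator’s implementation: it assumes mono audio, clips input to [-1, 1], and scales by 32767. The operator’s exact scaling factor is not documented here.

```python
import struct

def float_to_pcm16(samples):
    """Clip floats to [-1, 1] and pack them as little-endian signed 16-bit PCM."""
    ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    return struct.pack("<%dh" % len(ints), *ints)

def chunk_pcm(pcm_bytes, sample_rate=16000, chunk_ms=50):
    """Yield consecutive chunks, each holding chunk_ms of 16-bit mono audio."""
    bytes_per_chunk = sample_rate * chunk_ms // 1000 * 2  # 2 bytes per sample
    for start in range(0, len(pcm_bytes), bytes_per_chunk):
        yield pcm_bytes[start:start + bytes_per_chunk]

# 100 ms of silence at 16 kHz splits into two 50 ms chunks.
pcm = float_to_pcm16([0.0] * 1600)
chunks = list(chunk_pcm(pcm))
```

At the default 16kHz, each 50ms chunk carries 800 samples, or 1600 bytes. The final chunk may be shorter when the buffer length is not a multiple of the chunk size.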
Parameters
STTAssemblyAI
`op('stt_assemblyai').par.Status` Str
- Default: "" (Empty String)
`op('stt_assemblyai').par.Active` Toggle
- Default: False
`op('stt_assemblyai').par.Copytranscript` Pulse
- Default: False
`op('stt_assemblyai').par.Sttstatus` Str
- Default: Disconnected
`op('stt_assemblyai').par.Connected` Toggle
- Default: False
`op('stt_assemblyai').par.Segments` Toggle
- Default: False
`op('stt_assemblyai').par.Getapikey` Pulse
- Default: False
`op('stt_assemblyai').par.Estopcost` Float
- Default: 0.0
- Range: 0 to 1
- Slider Range: 0 to 1
`op('stt_assemblyai').par.Cleartranscript` Pulse
- Default: False
Model Settings
`op('stt_assemblyai').par.Speakerlabels` Toggle
- Default: False
`op('stt_assemblyai').par.Maxspeakers` Int
- Default: 0
- Range: 1 to 10
- Slider Range: 1 to 10
`op('stt_assemblyai').par.Languagedetection` Toggle
- Default: False
`op('stt_assemblyai').par.Keytermsprompt` Str
- Default: "" (Empty String)
`op('stt_assemblyai').par.Vadthreshold` Float
- Default: 0.0
- Range: 0 to 1
- Slider Range: 0 to 1
`op('stt_assemblyai').par.Formatturns` Toggle
- Default: False
`op('stt_assemblyai').par.Endofturnthreshold` Float
- Default: 0.0
- Range: 0 to 1
- Slider Range: 0 to 1
`op('stt_assemblyai').par.Minturnsilence` Int
- Default: 0
- Range: 50 to 2000
- Slider Range: 50 to 2000
`op('stt_assemblyai').par.Maxturnsilence` Int
- Default: 0
- Range: 100 to 5000
- Slider Range: 100 to 5000
`op('stt_assemblyai').par.Idletimeout` Int
- Default: 0
- Range: 1 to 60
- Slider Range: 1 to 60
Install/Config
`op('stt_assemblyai').par.Apikey` Str
- Default: API KEY LOADED (KeyManager)
`op('stt_assemblyai').par.Installdependencies` Pulse
- Default: False
Changelog
v1.2.0 (2026-03-26)
- Migrate to Universal Streaming v3 WebSocket API, remove assemblyai SDK dependency
- Add Speechmodel parameter (universal-streaming-english/multilingual/whisper-rt/u3-rt-pro)
- Add Model Settings page: speaker diarization, language detection, key terms, VAD threshold, domain
- Rename EndOfTurn to IsFinal in segments_out schema
- Add header enforcement to segments_out on init
- Update LastTranscriptionResult key end_of_turn to is_final
- Rename Script CHOP channel last_end_of_turn to last_is_final
- Initial commit
v1.1.0 (2025-08-29)
- Added CHOP channels and dependencies for parity with Whisper and Kyutai
- Cleaned menu and added Segments parameter to show segments in out1 instead of the whole transcript
v1.0.1 (2025-08-17)
- Cleaned menu to match other TTS / STT operators
v1.0.0 (2025-07-28)
- Initial Release: Real-time speech-to-text transcription using AssemblyAI Streaming v3 API
- WebSocket Connection: Persistent connection management with proper cost control and cleanup
- Single Toggle Interface: `Connected` parameter handles both connect/disconnect operations
- Auto-Connection: `Active` parameter automatically connects if needed when streaming is enabled
- Parameter Mode Respect: Only updates parameters in CONSTANT mode, preserves expressions/binds
- API Key Management: Supports ChatTD KeyManager and local config file storage
- Audio Processing: Real-time audio buffering and streaming with configurable sample rates (16kHz, 44.1kHz, 48kHz)
- Turn Detection: Configurable end-of-turn detection with threshold and silence parameters
- Output DATs:
  - `transcription_out`: Full transcript text
  - `segments_out`: Individual segments with timestamps, confidence, and turn markers
  - `session_info`: Current session status and duration tracking
  - `cost_history`: Persistent cost tracking across all sessions
- Cost Management:
  - Real-time cost tracking at $0.15/hour based on session duration
  - `Estopcost` parameter showing total estimated lifetime cost
  - Persistent cost history that survives project saves/reopens
  - Cost calculations exclude currency symbols for clean data handling
- Idle Timeout: Configurable auto-disconnect after specified minutes of no audio input
- Session Persistence: Transcript data persists across TouchDesigner sessions via DAT storage
- Operator Reset: Complete reset functionality including cost history and log clearing
- Dependency Management: Automated installation of `assemblyai` and `websockets` packages
- Cost Optimization: Proper WebSocket session cleanup to prevent unnecessary billing
- Async Processing: Full async/await implementation using TDAsyncIO for non-blocking operation