STT AssemblyAI

v1.2.0 (Updated)

The STT AssemblyAI operator transcribes audio into text in real time using AssemblyAI’s v3 streaming WebSocket API. It accepts live audio input from a CHOP, streams it to AssemblyAI’s servers, and returns continuous transcription results with turn detection, confidence scores, speaker identification, and automatic punctuation. Four speech models are available, ranging from a low-latency English-only model to multilingual and high-accuracy options.

  • Four speech models: Universal English, Universal Multilingual, Whisper RT, and Universal-3 Pro
  • Real-time speaker diarization with configurable speaker count
  • Automatic language detection for multilingual audio
  • Key terms prompting to improve recognition of domain-specific vocabulary
  • Medical domain specialization mode
  • Configurable VAD threshold, turn detection, and idle timeout
  • Per-session cost tracking with cumulative totals
  • Channel monitoring via tdu.Dependency for integration with CHOP networks
  • An active AssemblyAI account and API key.
  • The websockets and certifi Python libraries. Use the Install Dependencies button on the Install/Config page to install them automatically.

The operator accepts a single CHOP input carrying audio data. An Audio Device In CHOP is the most common source.

  • transcription_out — A text DAT containing the full running transcript.
  • segments_out — A table DAT with individual turn segments. Columns: Start, End, Text, Confidence, IsFinal, Speaker, and Language. Enable Output Segments (out1) on the STTAssemblyAI page to route this table to the operator’s first output connector.
  • session_info — A table DAT tracking the current session ID, status, duration, and provider (including which speech model is in use).
  • cost_history — A table DAT logging each session’s start/end times, duration, and estimated cost. This data persists across project saves.
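
The segments_out columns listed above map naturally onto dictionaries. The following is a minimal sketch, assuming the table has already been read into a list of rows with the header row first (inside TouchDesigner that could be something like `[[c.val for c in row] for row in dat.rows()]`). The sample values are invented, and the way IsFinal is encoded as text is an assumption, not something the operator documents.

```python
# Column names come from the segments_out description; the sample rows are invented.
HEADER = ["Start", "End", "Text", "Confidence", "IsFinal", "Speaker", "Language"]


def parse_segments(rows):
    """Turn table rows (header row first) into a list of dicts with typed values."""
    header, *body = rows
    segments = []
    for row in body:
        rec = dict(zip(header, row))
        rec["Start"] = float(rec["Start"])
        rec["End"] = float(rec["End"])
        rec["Confidence"] = float(rec["Confidence"])
        # Assumption: IsFinal is stored as text such as "1"/"0" or "True"/"False".
        rec["IsFinal"] = str(rec["IsFinal"]).strip().lower() in ("1", "true")
        segments.append(rec)
    return segments


def final_text(segments):
    """Join the text of finalized turns only, skipping partial results."""
    return " ".join(s["Text"] for s in segments if s["IsFinal"])


sample_rows = [
    HEADER,
    ["0.0", "1.2", "Hello there.", "0.94", "1", "A", "en"],
    ["1.2", "2.0", "How are", "0.71", "0", "B", "en"],
]
segments = parse_segments(sample_rows)
print(final_text(segments))
```

Filtering on IsFinal keeps partial turns out of downstream logic, which is usually what you want when driving text displays or triggers.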
  1. Place an STT AssemblyAI operator in your network.
  2. Connect an Audio Device In CHOP to its input.
  3. On the Install/Config page, pulse Install Dependencies if you have not already installed the required libraries. Restart TouchDesigner after installation completes.
  4. Enter your API key in the AssemblyAI API Key field, or let it load automatically if stored in KeyManager. Pulse Get API Key to open the AssemblyAI dashboard if you need to obtain one.
  5. On the STTAssemblyAI page, select your preferred Speech Model.
  6. Toggle Streaming Active to On. The operator will auto-connect if needed.
  7. Speak into your microphone. Transcribed text appears in the output DATs in real time.
  8. When finished, toggle Connected to Off to end the session and stop billing.
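
The same start and stop sequence can be scripted through the parameters listed in the reference (Speechmodel, Active, Connected). This is a rough sketch rather than the operator's own code: the helper names are hypothetical, and a stand-in object replaces `op('stt_assemblyai')` so the snippet runs outside TouchDesigner. Turning Active off before disconnecting is an assumption; the steps above only require Connected to be switched off.

```python
from types import SimpleNamespace


def start_transcription(stt, model="universal-streaming-english"):
    """Select a speech model, then enable streaming (the operator auto-connects if needed)."""
    stt.par.Speechmodel = model
    stt.par.Active = True


def stop_transcription(stt):
    """Stop streaming and close the connection so the session stops billing."""
    stt.par.Active = False      # assumption: stop sending audio first
    stt.par.Connected = False   # ends the session


# Stand-in so the sketch runs anywhere; in a project pass op('stt_assemblyai') instead.
fake_stt = SimpleNamespace(
    par=SimpleNamespace(Speechmodel="", Active=False, Connected=False)
)
start_transcription(fake_stt)
stop_transcription(fake_stt)
print(fake_stt.par.Speechmodel, fake_stt.par.Active, fake_stt.par.Connected)
```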
  1. On the STTAssemblyAI page, open the Speech Model menu.
  2. Select from the available models:
    • Universal English — Optimized for English with low latency. Good default for most use cases.
    • Universal Multilingual — Supports multiple languages with automatic detection.
    • Whisper RT — AssemblyAI’s real-time Whisper implementation.
    • Universal-3 Pro — Highest accuracy model with speaker diarization support. Billed at $0.45/hour instead of $0.15/hour.
  3. Note that the connection must be re-established when switching models. Toggle Connected Off and back On after changing the model.
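
Because a model change only takes effect on a new connection, a script that switches models should disconnect first. The sketch below assumes the menu values shown in the parameter reference and uses a hypothetical helper name with a stand-in operator object. Inside TouchDesigner the operator may need a frame or more to close the old session before reconnecting, which this sketch does not handle.

```python
from types import SimpleNamespace

# Menu values as listed for the Speechmodel parameter.
MODELS = (
    "universal-streaming-english",
    "universal-streaming-multilingual",
    "whisper-rt",
    "u3-rt-pro",
)


def switch_model(stt, model, reconnect=True):
    """Disconnect, change the speech model, and optionally reconnect."""
    if model not in MODELS:
        raise ValueError(f"unknown speech model: {model!r}")
    stt.par.Connected = False       # the old session must end first
    stt.par.Speechmodel = model
    if reconnect:
        stt.par.Connected = True    # a new session picks up the new model


# Stand-in for op('stt_assemblyai') so the sketch runs outside TouchDesigner.
fake_stt = SimpleNamespace(
    par=SimpleNamespace(Speechmodel="universal-streaming-english", Connected=True)
)
switch_model(fake_stt, "u3-rt-pro")
print(fake_stt.par.Speechmodel, fake_stt.par.Connected)
```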
  1. On the Model Settings page, enable Speaker Labels.
  2. Set Max Speakers to the expected number of speakers in the audio (1-10).
  3. Connect and start streaming. The segments table will include a Speaker column identifying which speaker produced each turn.
  1. On the Model Settings page, enter domain-specific vocabulary in the Key Terms (comma-separated) field. For example: TouchDesigner, CHOP, SOP, GLSL for a TouchDesigner-focused conversation.
  2. If transcribing medical content, set Domain to Medical for specialized vocabulary recognition.
  3. Enable Language Detection when working with multilingual audio to automatically identify the spoken language per segment.
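
When key terms come from a list in a script or a table, they need to be joined into the comma-separated form the Key Terms field expects. A small sketch with a hypothetical helper; trimming whitespace and removing duplicates are conveniences added here, not documented operator behavior.

```python
def build_key_terms(terms):
    """Return a comma-separated string, skipping blanks and case-insensitive duplicates."""
    seen = set()
    cleaned = []
    for term in terms:
        term = term.strip()
        key = term.lower()
        if not term or key in seen:
            continue
        seen.add(key)
        cleaned.append(term)
    return ", ".join(cleaned)


key_terms = build_key_terms(["TouchDesigner", " CHOP", "SOP", "GLSL", "chop", ""])
print(key_terms)  # TouchDesigner, CHOP, SOP, GLSL
```

The resulting string could then be assigned to the Keytermsprompt parameter before connecting.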
  1. On the Model Settings page, enable Format Turns to receive punctuated, cased transcripts.
  2. Adjust End of Turn Threshold (0 to 1) to control how confidently the model must detect a pause before finalizing a turn. Higher values require longer, more definitive pauses. Not available with Universal-3 Pro.
  3. Set Min Turn Silence (ms) to control the minimum silence duration before a turn can end when the model is confident.
  4. Set Max Turn Silence (ms) to define the maximum silence before a turn is always ended, regardless of confidence.
  5. Adjust VAD Threshold to control voice activity detection sensitivity. Higher values require stronger voice signals to trigger detection.
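
The interaction between the threshold and the two silence settings can be read as a simple decision rule. The function below is only an interpretation of the parameter descriptions above, not AssemblyAI's actual algorithm, and the default values are hypothetical placeholders.

```python
def should_end_turn(silence_ms, end_of_turn_confidence,
                    threshold=0.5, min_silence_ms=400, max_silence_ms=1200):
    """Decide whether a turn ends, following the documented parameter semantics.

    - Max Turn Silence: once silence reaches it, the turn always ends.
    - End of Turn Threshold + Min Turn Silence: a confident end-of-turn
      prediction ends the turn after the shorter minimum silence.
    """
    if silence_ms >= max_silence_ms:
        return True
    confident = end_of_turn_confidence >= threshold
    return confident and silence_ms >= min_silence_ms


print(should_end_turn(500, 0.9))   # confident and past the minimum silence
print(should_end_turn(500, 0.2))   # not confident, still under the maximum
print(should_end_turn(1300, 0.0))  # maximum silence exceeded
```

Under this reading, raising the threshold makes more turns wait for the maximum silence, which lengthens the delay before a turn is finalized.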

The operator exposes its state through tdu.Dependency attributes, matching the pattern used by other STT operators. A Script CHOP inside the operator converts these into CHOP channels for use in TouchDesigner networks.

Available channel groups (configurable on the Script CHOP):

  • Pulse Events: transcription_complete, empty_transcription, sentence_end. These pulse briefly when the corresponding event occurs, useful for triggering downstream logic.
  • Status Data: worker_active, model_ready, transcription_active, download_in_progress, connected, streaming_active, ready. Continuous status channels reflecting the operator’s current state.
  • Result Data: last_has_segments, last_text_length, last_timestamp, last_confidence, last_is_final. Metadata about the most recent transcription result.
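
One way to react to the pulse channels is a CHOP Execute DAT watching the channels with its Off to On option enabled. The sketch below uses the standard `onOffToOn` callback signature and dispatches on the channel name. The handler functions and the `received` list are hypothetical, and the last lines only simulate pulses so the snippet runs outside TouchDesigner.

```python
from types import SimpleNamespace

received = []  # hypothetical log of handled events


def handle_transcription_complete():
    received.append("transcription_complete")


def handle_sentence_end():
    received.append("sentence_end")


# Map pulse channel names to the function that should run on each pulse.
HANDLERS = {
    "transcription_complete": handle_transcription_complete,
    "sentence_end": handle_sentence_end,
}


def onOffToOn(channel, sampleIndex, val, prev):
    """CHOP Execute callback: runs when a watched channel goes from 0 to non-zero."""
    handler = HANDLERS.get(channel.name)
    if handler is not None:
        handler()
    return


# Simulated pulses; remove these lines when pasting into a CHOP Execute DAT.
onOffToOn(SimpleNamespace(name="sentence_end"), 0, 1.0, 0.0)
onOffToOn(SimpleNamespace(name="empty_transcription"), 0, 1.0, 0.0)  # no handler registered
onOffToOn(SimpleNamespace(name="transcription_complete"), 0, 1.0, 0.0)
print(received)
```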
  • The Sample Rate and Audio Encoding fields display the current audio configuration. The default 16kHz PCM 16-bit LE offers the best balance of quality and bandwidth for streaming transcription.
  • Set a reasonable Idle Timeout on the Model Settings page to automatically disconnect when no audio has been received for a period. This provides a safety net against forgotten connections, but do not rely on it as your primary cost control.
  • Use Copy to Clipboard to quickly grab the current transcript for pasting elsewhere.
  • The Estimated Total Cost ($) field tracks cumulative cost across all sessions. Keep in mind that Universal-3 Pro is billed at 3x the rate of other models.
  • Use Key Terms when transcribing content with specialized vocabulary to improve accuracy.
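
To get a feel for the cost difference between models, the arithmetic can be sketched from the hourly rates quoted in this document ($0.15/hour, or $0.45/hour for Universal-3 Pro). The function name is hypothetical and the operator's own rounding is not known; this is just rate multiplied by duration.

```python
# Hourly rates in USD as quoted in this document, keyed by Speechmodel menu value.
RATES_PER_HOUR = {
    "universal-streaming-english": 0.15,
    "universal-streaming-multilingual": 0.15,
    "whisper-rt": 0.15,
    "u3-rt-pro": 0.45,
}


def estimate_cost(duration_seconds, model):
    """Estimated session cost in USD: hourly rate times session length in hours."""
    return RATES_PER_HOUR[model] * duration_seconds / 3600.0


# A 30-minute session on each pricing tier.
standard = estimate_cost(1800, "universal-streaming-english")
pro = estimate_cost(1800, "u3-rt-pro")
print(round(standard, 4), round(pro, 4))
```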
  • “websockets library not installed” — Pulse Install Dependencies on the Install/Config page and restart TouchDesigner.
  • Connection fails — Verify your API key is correct and that your network allows outbound WebSocket connections to streaming.assemblyai.com.
  • No transcription appearing — Ensure your audio input CHOP is outputting valid audio data and that Streaming Active is toggled On. Check the operator’s Logger for debug messages.
  • macOS SSL errors — The operator automatically sets the SSL_CERT_FILE environment variable using the certifi package. If issues persist, ensure certifi is installed via Install Dependencies.
  • Speaker labels not appearing — Speaker diarization must be enabled before connecting. Toggle Connected Off, enable Speaker Labels on the Model Settings page, then reconnect.
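
For reference, pointing Python's SSL machinery at the certifi bundle takes only a few lines. This is a sketch of the general technique, not the operator's actual code; the function name is hypothetical, and the variable must be set before any SSL context is created for it to take effect.

```python
import os


def ensure_ssl_cert_file():
    """Point SSL_CERT_FILE at certifi's CA bundle; return the path, or None if certifi is missing."""
    try:
        import certifi
    except ImportError:
        return None  # install certifi first (for example via Install Dependencies)
    bundle = certifi.where()  # path to certifi's cacert.pem
    os.environ["SSL_CERT_FILE"] = bundle
    return bundle


path = ensure_ssl_cert_file()
print(path or "certifi is not installed")
```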
  • The operator connects to AssemblyAI’s v3 streaming WebSocket endpoint (wss://streaming.assemblyai.com/v3/ws).
  • Format Turns and End of Turn Threshold are not available when using the Universal-3 Pro model, which handles turn formatting internally.
  • Audio is converted from float32 to int16 PCM before streaming, in 50ms chunks as recommended by AssemblyAI.
  • The Idle Timeout feature is implemented locally by the operator, not by the AssemblyAI API. It monitors when the last audio chunk was received and disconnects after the configured period of inactivity.
  • Cost estimation is calculated locally: $0.15/hour for Universal English, Universal Multilingual, and Whisper RT; $0.45/hour for Universal-3 Pro.
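
The float32 to int16 conversion and 50ms chunking described above can be illustrated with the standard library alone. This is a sketch under assumptions: the scale factor of 32767, the clamping to [-1, 1], and the helper names are choices made here, and the operator's exact conversion is not shown in this document.

```python
import math
import struct

SAMPLE_RATE = 16000  # default Sample Rate
CHUNK_MS = 50        # chunk length recommended by AssemblyAI, per the note above


def float_to_pcm16(samples):
    """Clamp floats to [-1, 1] and pack them as little-endian signed 16-bit PCM."""
    ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    return struct.pack("<%dh" % len(ints), *ints)


def chunk_pcm(pcm_bytes, sample_rate=SAMPLE_RATE, chunk_ms=CHUNK_MS):
    """Split PCM bytes into chunk_ms-sized pieces (2 bytes per mono sample)."""
    bytes_per_chunk = sample_rate * chunk_ms // 1000 * 2
    return [pcm_bytes[i:i + bytes_per_chunk]
            for i in range(0, len(pcm_bytes), bytes_per_chunk)]


# 100 ms of a 440 Hz test tone: 1600 samples at 16 kHz.
tone = [math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE // 10)]
chunks = chunk_pcm(float_to_pcm16(tone))
print(len(chunks), len(chunks[0]))  # 2 1600
```

At 16 kHz each 50ms chunk holds 800 samples, or 1600 bytes, so the 100 ms test tone splits into exactly two chunks.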
Status (Status) op('stt_assemblyai').par.Status Str
Default:
"" (Empty String)
Streaming Active (Active) op('stt_assemblyai').par.Active Toggle
Default:
False
Copy to Clipboard (Copytranscript) op('stt_assemblyai').par.Copytranscript Pulse
Default:
False
Connection Sttstatus (Sttstatus) op('stt_assemblyai').par.Sttstatus Str
Default:
Disconnected
Connected (Connected) op('stt_assemblyai').par.Connected Toggle
Default:
False
Output Segments (out1) (Segments) op('stt_assemblyai').par.Segments Toggle
Default:
False
Sample Rate (Samplerate) op('stt_assemblyai').par.Samplerate Menu
Default:
16000
Options:
16000, 44100, 48000
Audio Encoding (Encoding) op('stt_assemblyai').par.Encoding Menu
Default:
pcm_s16le
Options:
pcm_s16le, pcm_mulaw
Get API Key (Getapikey) op('stt_assemblyai').par.Getapikey Pulse
Default:
False
Speech Model (Speechmodel) op('stt_assemblyai').par.Speechmodel Menu
Default:
universal-streaming-english
Options:
universal-streaming-english, universal-streaming-multilingual, whisper-rt, u3-rt-pro
Estimated Total Cost ($) (Estopcost) op('stt_assemblyai').par.Estopcost Float
Default:
0.0
Range:
0 to 1
Slider Range:
0 to 1
Clear Transcript (Cleartranscript) op('stt_assemblyai').par.Cleartranscript Pulse
Default:
False
Speaker Labels (Speakerlabels) op('stt_assemblyai').par.Speakerlabels Toggle
Default:
False
Max Speakers (Maxspeakers) op('stt_assemblyai').par.Maxspeakers Int
Default:
0
Range:
1 to 10
Slider Range:
1 to 10
Language Detection (Languagedetection) op('stt_assemblyai').par.Languagedetection Toggle
Default:
False
Key Terms (comma-separated) (Keytermsprompt) op('stt_assemblyai').par.Keytermsprompt Str
Default:
"" (Empty String)
Domain (Domain) op('stt_assemblyai').par.Domain Menu
Default:
none
Options:
none, medical-v1
VAD Threshold (Vadthreshold) op('stt_assemblyai').par.Vadthreshold Float
Default:
0.0
Range:
0 to 1
Slider Range:
0 to 1
Format Turns (Formatturns) op('stt_assemblyai').par.Formatturns Toggle
Default:
False
End of Turn Threshold (Endofturnthreshold) op('stt_assemblyai').par.Endofturnthreshold Float
Default:
0.0
Range:
0 to 1
Slider Range:
0 to 1
Min Turn Silence (ms) (Minturnsilence) op('stt_assemblyai').par.Minturnsilence Int
Default:
0
Range:
50 to 2000
Slider Range:
50 to 2000
Max Turn Silence (ms) (Maxturnsilence) op('stt_assemblyai').par.Maxturnsilence Int
Default:
0
Range:
100 to 5000
Slider Range:
100 to 5000
Idle Timeout (minutes) (Idletimeout) op('stt_assemblyai').par.Idletimeout Int
Default:
0
Range:
1 to 60
Slider Range:
1 to 60
STT Provider (Provider) op('stt_assemblyai').par.Provider Menu
Default:
assemblyai
Options:
assemblyai
AssemblyAI API Key (Apikey) op('stt_assemblyai').par.Apikey Str
Default:
API KEY LOADED (KeyManager)
Install Dependencies (Installdependencies) op('stt_assemblyai').par.Installdependencies Pulse
Default:
False
v1.2.0 2026-03-26
  • Migrate to Universal Streaming v3 WebSocket API, remove assemblyai SDK dependency
  • Add Speechmodel parameter (universal-streaming-english/multilingual/whisper-rt/u3-rt-pro)
  • Add Model Settings page: speaker diarization, language detection, key terms, VAD threshold, domain
  • Rename EndOfTurn to IsFinal in segments_out schema
  • Add header enforcement to segments_out on init
  • Update LastTranscriptionResult key end_of_turn to is_final
  • Rename Script CHOP channel last_end_of_turn to last_is_final
  • Initial commit
v1.1.0 2025-08-29

Added CHOP channels and dependencies for parity with the Whisper and Kyutai operators.

Cleaned up the menu and added the Segments parameter to show segments in out1 instead of the whole transcript.

v1.0.1 2025-08-17

Cleaned up the menu to match other TTS / STT operators.

v1.0.0 2025-07-28
  • Initial Release: Real-time speech-to-text transcription using AssemblyAI Streaming v3 API
  • WebSocket Connection: Persistent connection management with proper cost control and cleanup
  • Single Toggle Interface: Connected parameter handles both connect/disconnect operations
  • Auto-Connection: Active parameter automatically connects if needed when streaming is enabled
  • Parameter Mode Respect: Only updates parameters in CONSTANT mode, preserves expressions/binds
  • API Key Management: Supports ChatTD KeyManager and local config file storage
  • Audio Processing: Real-time audio buffering and streaming with configurable sample rates (16kHz, 44.1kHz, 48kHz)
  • Turn Detection: Configurable end-of-turn detection with threshold and silence parameters
  • Output DATs:
    • transcription_out: Full transcript text
    • segments_out: Individual segments with timestamps, confidence, and turn markers
    • session_info: Current session status and duration tracking
    • cost_history: Persistent cost tracking across all sessions
  • Cost Management:
    • Real-time cost tracking at $0.15/hour based on session duration
    • Estopcost parameter showing total estimated lifetime cost
    • Persistent cost history that survives project saves/reopens
    • Cost calculations exclude currency symbols for clean data handling
  • Idle Timeout: Configurable auto-disconnect after specified minutes of no audio input
  • Session Persistence: Transcript data persists across TouchDesigner sessions via DAT storage
  • Operator Reset: Complete reset functionality including cost history and log clearing
  • Dependency Management: Automated installation of assemblyai and websockets packages
  • Cost Optimization: Proper WebSocket session cleanup to prevent unnecessary billing
  • Async Processing: Full async/await implementation using TDAsyncIO for non-blocking operation