STT AssemblyAI

v1.2.0 Updated

Overview

The STT AssemblyAI operator transcribes audio into text in real time using AssemblyAI’s v3 streaming WebSocket API. It accepts live audio input from a CHOP, streams it to AssemblyAI’s servers, and returns continuous transcription results with turn detection, confidence scores, speaker identification, and automatic punctuation. Four speech models are available, ranging from a low-latency English-only model to multilingual and high-accuracy options.

Key Features

Four speech models: Universal English, Universal Multilingual, Whisper RT, and Universal-3 Pro
Real-time speaker diarization with configurable speaker count
Automatic language detection for multilingual audio
Key terms prompting to improve recognition of domain-specific vocabulary
Medical domain specialization mode
Configurable VAD threshold, turn detection, and idle timeout
Per-session cost tracking with cumulative totals
Channel monitoring via tdu.Dependency for integration with CHOP networks

Requirements

An active AssemblyAI account and API key.
The websockets and certifi Python libraries. Use the Install Dependencies button on the Install/Config page to install them automatically.

Input/Output

Inputs

The operator accepts a single CHOP input carrying audio data. An Audio Device In CHOP is the most common source.

Outputs

transcription_out — A text DAT containing the full running transcript.
segments_out — A table DAT with individual turn segments. Columns: Start, End, Text, Confidence, IsFinal, Speaker, and Language. Enable Output Segments (out1) on the STTAssemblyAI page to route this table to the operator’s first output connector.
session_info — A table DAT tracking the current session ID, status, duration, and provider (including which speech model is in use).
cost_history — A table DAT logging each session’s start/end times, duration, and estimated cost. This data persists across project saves.
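As a rough illustration of consuming segments_out from a script, the sketch below turns table rows into dictionaries using the column names listed above. The way IsFinal is stored (here assumed to be "True" or "1") and the path to the DAT are assumptions, not confirmed behaviour of the operator.

```python
# Minimal sketch: convert segments_out rows (header row first) into dicts.
# In TouchDesigner the rows could be gathered with something like:
#   rows = [[c.val for c in r] for r in op('stt_assemblyai').op('segments_out').rows()]
# The inner path 'segments_out' is an assumption about where the DAT lives.

def parse_segments(rows):
    """Return one dict per segment row, with numeric and boolean fields converted."""
    if not rows:
        return []
    header, *body = rows
    segments = []
    for row in body:
        seg = dict(zip(header, row))
        # Assumption: final turns are stored as "True" or "1".
        seg["IsFinal"] = str(seg.get("IsFinal", "")).strip().lower() in ("1", "true")
        for key in ("Start", "End", "Confidence"):
            try:
                seg[key] = float(seg[key])
            except (KeyError, TypeError, ValueError):
                seg[key] = None  # empty or missing cell
        segments.append(seg)
    return segments
```

Filtering the result on IsFinal is one way to ignore partial turns that are still being revised.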

Usage Examples

Real-time Transcription from a Microphone

Place an STT AssemblyAI operator in your network.
Connect an Audio Device In CHOP to its input.
On the Install/Config page, pulse Install Dependencies if you have not already installed the required libraries. Restart TouchDesigner after installation completes.
Enter your API key in the AssemblyAI API Key field, or let it load automatically if stored in KeyManager. Pulse Get API Key to open the AssemblyAI dashboard if you need to obtain one.
On the STTAssemblyAI page, select your preferred Speech Model.
Toggle Streaming Active to On. The operator will auto-connect if needed.
Speak into your microphone. Transcribed text appears in the output DATs in real time.
When finished, toggle Connected to Off to end the session and stop billing.
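The same start and stop steps can be driven from a script. This is a minimal sketch that assumes the operator is named stt_assemblyai, as in the Parameters section; the helper names are hypothetical.

```python
# Sketch: start and stop a session through the documented Active and
# Connected parameters. 'stt' is the operator, e.g. op('stt_assemblyai').

def start_transcription(stt):
    """Enable streaming; the operator is documented to auto-connect if needed."""
    stt.par.Active = True

def stop_transcription(stt):
    """Stop streaming, then disconnect so the session (and billing) ends."""
    stt.par.Active = False
    stt.par.Connected = False

# Usage inside TouchDesigner (assumed operator name):
#   start_transcription(op('stt_assemblyai'))
#   ...
#   stop_transcription(op('stt_assemblyai'))
```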

Choosing a Speech Model

On the STTAssemblyAI page, open the Speech Model menu.
Select from the available models:
- Universal English — Optimized for English with low latency. Good default for most use cases.
- Universal Multilingual — Supports multiple languages with automatic detection.
- Whisper RT — AssemblyAI’s real-time Whisper implementation.
- Universal-3 Pro — Highest accuracy model with speaker diarization support. Billed at $0.45/hour instead of $0.15/hour.
Note that the connection must be re-established when switching models. Toggle Connected Off and back On after changing the model.
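The reconnect step can be scripted as well. The sketch below assumes the model menu is exposed as the Speechmodel parameter named in the changelog, and that its menu values match the identifiers listed there; "universal-streaming-multilingual" in particular is a guess at how the abbreviated changelog entry expands.

```python
# Sketch: switch speech model with a disconnect/reconnect around the change.
# The menu values below are assumptions based on the changelog entry.
MODELS = (
    "universal-streaming-english",
    "universal-streaming-multilingual",  # assumed expansion of "multilingual"
    "whisper-rt",
    "u3-rt-pro",
)

def switch_model(stt, model):
    """Disconnect, select the new model, then reconnect."""
    if model not in MODELS:
        raise ValueError(f"unknown model: {model!r}")
    stt.par.Connected = False
    stt.par.Speechmodel = model
    # The operator manages its connection asynchronously, so in practice the
    # reconnect may need to be deferred by a few frames rather than set here.
    stt.par.Connected = True
```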

Speaker Diarization

On the Model Settings page, enable Speaker Labels.
Set Max Speakers to the expected number of speakers in the audio (1-10).
Connect and start streaming. The segments table will include a Speaker column identifying which speaker produced each turn.
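Once the Speaker column is populated, the segments can be grouped per speaker. The following sketch works on plain rows (header first) and again assumes that final turns are marked with "True" or "1" in IsFinal.

```python
from collections import defaultdict

def text_by_speaker(rows):
    """Group the text of final turns by the value in the Speaker column."""
    header, *body = rows
    speaker_col = header.index("Speaker")
    text_col = header.index("Text")
    final_col = header.index("IsFinal")
    grouped = defaultdict(list)
    for row in body:
        # Assumption: final turns are stored as "True" or "1".
        if str(row[final_col]).strip().lower() in ("1", "true"):
            grouped[row[speaker_col]].append(row[text_col])
    return dict(grouped)
```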

Improving Recognition Accuracy

On the Model Settings page, enter domain-specific vocabulary in the Key Terms (comma-separated) field. For example: TouchDesigner, CHOP, SOP, GLSL for a TouchDesigner-focused conversation.
If transcribing medical content, set Domain to Medical for specialized vocabulary recognition.
Enable Language Detection when working with multilingual audio to automatically identify the spoken language per segment.
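If the vocabulary list comes from elsewhere in a project, a small helper can build the comma-separated string for the Key Terms field. This is only a sketch; the helper name is hypothetical, and the de-duplication rule (case-insensitive, first spelling wins) is a design choice rather than something the operator requires.

```python
def build_key_terms(terms):
    """Join terms into a comma-separated string, dropping blanks and duplicates."""
    seen = set()
    cleaned = []
    for term in terms:
        term = term.strip()
        key = term.lower()
        if term and key not in seen:
            seen.add(key)
            cleaned.append(term)
    return ", ".join(cleaned)

# Usage inside TouchDesigner (documented parameter name):
#   op('stt_assemblyai').par.Keytermsprompt = build_key_terms(my_terms)
```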

Tuning Turn Detection

On the Model Settings page, enable Format Turns to receive punctuated, cased transcripts. Not available with Universal-3 Pro, which handles turn formatting internally.
Adjust End of Turn Threshold (0 to 1) to control how confidently the model must detect a pause before finalizing a turn. Higher values require longer, more definitive pauses. Not available with Universal-3 Pro.
Set Min Turn Silence (ms) to control the minimum silence duration before a turn can end when the model is confident.
Set Max Turn Silence (ms) to define the maximum silence before a turn is always ended, regardless of confidence.
Adjust VAD Threshold to control voice activity detection sensitivity. Higher values require stronger voice signals to trigger detection.
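One way to read how the three turn parameters interact is sketched below. It is an interpretation of the descriptions above, not AssemblyAI's actual end-of-turn algorithm, and the function and argument names are hypothetical.

```python
def turn_should_end(silence_ms, confidence, threshold, min_silence_ms, max_silence_ms):
    """Illustrative end-of-turn rule based on the parameter descriptions.

    - Past Max Turn Silence the turn always ends, regardless of confidence.
    - Otherwise the turn ends only when the model is confident enough
      (End of Turn Threshold) and at least Min Turn Silence has elapsed.
    """
    if silence_ms >= max_silence_ms:
        return True
    return confidence >= threshold and silence_ms >= min_silence_ms
```

Under this reading, raising End of Turn Threshold makes turns end later on average, while Max Turn Silence puts an upper bound on how long a pause can last before the turn is closed.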

Channel Monitoring

The operator exposes its state through tdu.Dependency attributes, matching the pattern used by other STT operators. A Script CHOP inside the operator converts these into CHOP channels for use in TouchDesigner networks.

Available channel groups (configurable on the Script CHOP):

Pulse Events — transcription_complete, empty_transcription, sentence_end. These pulse briefly when the corresponding event occurs, useful for triggering downstream logic.
Status Data — worker_active, model_ready, transcription_active, download_in_progress, connected, streaming_active, ready. Continuous status channels reflecting the operator’s current state.
Result Data — last_has_segments, last_text_length, last_timestamp, last_confidence, last_is_final. Metadata about the most recent transcription result.
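The pulse channels lend themselves to a CHOP Execute DAT. The sketch below uses the standard onOffToOn callback and dispatches on the channel name; it assumes the Script CHOP's channels are reachable from a CHOP Execute DAT in your network, and the handler registry is a hypothetical convenience.

```python
# Sketch of a CHOP Execute DAT script reacting to the pulse channels.
HANDLERS = {}
events = []  # stand-in for real downstream logic

def on_pulse(channel_name):
    """Register a function to run when the named channel goes from off to on."""
    def register(func):
        HANDLERS[channel_name] = func
        return func
    return register

@on_pulse("transcription_complete")
def handle_complete():
    events.append("complete")

@on_pulse("sentence_end")
def handle_sentence_end():
    events.append("sentence_end")

def onOffToOn(channel, sampleIndex, val, prev):
    # Called by TouchDesigner when a watched channel rises above zero.
    handler = HANDLERS.get(channel.name)
    if handler is not None:
        handler()
    return
```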

Best Practices

The Sample Rate and Audio Encoding fields display the current audio configuration. The default 16kHz PCM 16-bit LE offers the best balance of quality and bandwidth for streaming transcription.
Set a reasonable Idle Timeout on the Model Settings page to automatically disconnect when no audio has been received for a period. This provides a safety net against forgotten connections, but do not rely on it as your primary cost control.
Use Copy to Clipboard to quickly grab the current transcript for pasting elsewhere.
The Estimated Total Cost ($) field tracks cumulative cost across all sessions. Keep in mind that Universal-3 Pro is billed at 3x the rate of other models.
Use Key Terms when transcribing content with specialized vocabulary to improve accuracy.
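For budgeting, the hourly rates quoted in this document can be turned into a quick estimate. This is a sketch using the model names as they appear in the Speech Model menu; it mirrors the stated rates only and is not the operator's internal calculation.

```python
# Hourly rates in USD as stated in this document.
RATES_PER_HOUR = {
    "Universal English": 0.15,
    "Universal Multilingual": 0.15,
    "Whisper RT": 0.15,
    "Universal-3 Pro": 0.45,
}

def estimate_cost(duration_seconds, model):
    """Estimated cost in USD for a session of the given length."""
    return duration_seconds / 3600.0 * RATES_PER_HOUR[model]
```

For example, a 30-minute session works out to about $0.075 on Universal English and about $0.225 on Universal-3 Pro.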

Troubleshooting

“websockets library not installed” — Pulse Install Dependencies on the Install/Config page and restart TouchDesigner.
Connection fails — Verify your API key is correct and that your network allows outbound WebSocket connections to streaming.assemblyai.com.
No transcription appearing — Ensure your audio input CHOP is outputting valid audio data and that Streaming Active is toggled On. Check the operator’s Logger for debug messages.
macOS SSL errors — The operator automatically sets the SSL_CERT_FILE environment variable using the certifi package. If issues persist, ensure certifi is installed via Install Dependencies.
Speaker labels not appearing — Speaker diarization must be enabled before connecting. Toggle Connected Off, enable Speaker Labels on the Model Settings page, then reconnect.
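When diagnosing the dependency and SSL items above, it can help to check the Python environment directly, for example from the Textport. The sketch below only inspects the environment; the operator is documented to set SSL_CERT_FILE itself, so the second helper is a manual fallback rather than a required step.

```python
import importlib.util
import os

def check_dependencies(names=("websockets", "certifi")):
    """Report which of the required libraries can be imported."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

def ensure_ssl_cert_file():
    """Point SSL_CERT_FILE at certifi's CA bundle if it is not already set."""
    if importlib.util.find_spec("certifi") is None:
        return None  # certifi missing: run Install Dependencies first
    import certifi
    os.environ.setdefault("SSL_CERT_FILE", certifi.where())
    return os.environ["SSL_CERT_FILE"]

print(check_dependencies())
```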

Technical Notes

The operator connects to AssemblyAI’s v3 streaming WebSocket endpoint (wss://streaming.assemblyai.com/v3/ws).
Format Turns and End of Turn Threshold are not available when using the Universal-3 Pro model, which handles turn formatting internally.
Audio is converted from float32 to int16 PCM before streaming, in 50ms chunks as recommended by AssemblyAI.
The Idle Timeout feature is implemented locally by the operator, not by the AssemblyAI API. It monitors when the last audio chunk was received and disconnects after the configured period of inactivity.
Cost estimation is calculated locally: $0.15/hour for Universal English, Universal Multilingual, and Whisper RT; $0.45/hour for Universal-3 Pro.
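The audio conversion described above can be sketched with the standard library alone. The scale factor of 32767 and the clipping behaviour are assumptions; the operator's own conversion may differ in detail.

```python
import struct

def float_to_pcm16(samples):
    """Convert floats in [-1.0, 1.0] to little-endian signed 16-bit PCM bytes."""
    ints = []
    for s in samples:
        s = max(-1.0, min(1.0, s))         # clip out-of-range samples
        ints.append(int(round(s * 32767)))  # assumed scale factor
    return struct.pack("<%dh" % len(ints), *ints)

def chunk_pcm(pcm, sample_rate=16000, chunk_ms=50):
    """Split PCM bytes into chunks of chunk_ms each (2 bytes per sample)."""
    bytes_per_chunk = sample_rate * chunk_ms // 1000 * 2
    return [pcm[i:i + bytes_per_chunk] for i in range(0, len(pcm), bytes_per_chunk)]
```

At 16 kHz, a 50 ms chunk holds 800 samples, or 1600 bytes of 16-bit audio; the final chunk may be shorter.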

Parameters

STTAssemblyAI

Status (Status) op('stt_assemblyai').par.Status Str

Default:: "" (Empty String)

Streaming Active (Active) op('stt_assemblyai').par.Active Toggle

Default:: False

Copy to Clipboard (Copytranscript) op('stt_assemblyai').par.Copytranscript Pulse

Default:: False

Connection Sttstatus (Sttstatus) op('stt_assemblyai').par.Sttstatus Str

Default:: Disconnected

Connected (Connected) op('stt_assemblyai').par.Connected Toggle

Default:: False

Output Segments (out1) (Segments) op('stt_assemblyai').par.Segments Toggle

Default:: False

Get API Key (Getapikey) op('stt_assemblyai').par.Getapikey Pulse

Default:: False

Estimated Total Cost ($) (Estopcost) op('stt_assemblyai').par.Estopcost Float

Default:: 0.0
Range:: 0 to 1
Slider Range:: 0 to 1

Clear Transcript (Cleartranscript) op('stt_assemblyai').par.Cleartranscript Pulse

Default:: False

Model Settings

Speaker Labels (Speakerlabels) op('stt_assemblyai').par.Speakerlabels Toggle

Default:: False

Max Speakers (Maxspeakers) op('stt_assemblyai').par.Maxspeakers Int

Default:: 0
Range:: 1 to 10
Slider Range:: 1 to 10

Language Detection (Languagedetection) op('stt_assemblyai').par.Languagedetection Toggle

Default:: False

Key Terms (comma-separated) (Keytermsprompt) op('stt_assemblyai').par.Keytermsprompt Str

Default:: "" (Empty String)

VAD Threshold (Vadthreshold) op('stt_assemblyai').par.Vadthreshold Float

Default:: 0.0
Range:: 0 to 1
Slider Range:: 0 to 1

Format Turns (Formatturns) op('stt_assemblyai').par.Formatturns Toggle

Default:: False

End of Turn Threshold (Endofturnthreshold) op('stt_assemblyai').par.Endofturnthreshold Float

Default:: 0.0
Range:: 0 to 1
Slider Range:: 0 to 1

Min Turn Silence (ms) (Minturnsilence) op('stt_assemblyai').par.Minturnsilence Int

Default:: 0
Range:: 50 to 2000
Slider Range:: 50 to 2000

Max Turn Silence (ms) (Maxturnsilence) op('stt_assemblyai').par.Maxturnsilence Int

Default:: 0
Range:: 100 to 5000
Slider Range:: 100 to 5000

Idle Timeout (minutes) (Idletimeout) op('stt_assemblyai').par.Idletimeout Int

Default:: 0
Range:: 1 to 60
Slider Range:: 1 to 60

Install/Config

AssemblyAI API Key (Apikey) op('stt_assemblyai').par.Apikey Str

Default:: API KEY LOADED (KeyManager)

Install Dependencies (Installdependencies) op('stt_assemblyai').par.Installdependencies Pulse

Default:: False

Changelog

v1.2.0 2026-03-26

Migrate to Universal Streaming v3 WebSocket API, remove assemblyai SDK dependency
Add Speechmodel parameter (universal-streaming-english/multilingual/whisper-rt/u3-rt-pro)
Add Model Settings page: speaker diarization, language detection, key terms, VAD threshold, domain
Rename EndOfTurn to IsFinal in segments_out schema
Add header enforcement to segments_out on init
Update LastTranscriptionResult key end_of_turn to is_final
Rename Script CHOP channel last_end_of_turn to last_is_final
Initial commit

v1.1.0 2025-08-29

Added CHOP channels and dependencies for parity with Whisper and Kyutai

Cleaned menu and added Segments parameter to show segments in out1 instead of the whole transcript

v1.0.1 2025-08-17

Cleaned menu to match other TTS / STT operators

v1.0.0 2025-07-28

Initial Release: Real-time speech-to-text transcription using AssemblyAI Streaming v3 API
WebSocket Connection: Persistent connection management with proper cost control and cleanup
Single Toggle Interface: Connected parameter handles both connect/disconnect operations
Auto-Connection: Active parameter automatically connects if needed when streaming is enabled
Parameter Mode Respect: Only updates parameters in CONSTANT mode, preserves expressions/binds
API Key Management: Supports ChatTD KeyManager and local config file storage
Audio Processing: Real-time audio buffering and streaming with configurable sample rates (16kHz, 44.1kHz, 48kHz)
Turn Detection: Configurable end-of-turn detection with threshold and silence parameters
Output DATs:

transcription_out: Full transcript text
segments_out: Individual segments with timestamps, confidence, and turn markers
session_info: Current session status and duration tracking
cost_history: Persistent cost tracking across all sessions

Cost Management:

Real-time cost tracking at $0.15/hour based on session duration
Estopcost parameter showing total estimated lifetime cost
Persistent cost history that survives project saves/reopens
Cost calculations exclude currency symbols for clean data handling

Idle Timeout: Configurable auto-disconnect after specified minutes of no audio input
Session Persistence: Transcript data persists across TouchDesigner sessions via DAT storage
Operator Reset: Complete reset functionality including cost history and log clearing
Dependency Management: Automated installation of assemblyai and websockets packages
Cost Optimization: Proper WebSocket session cleanup to prevent unnecessary billing
Async Processing: Full async/await implementation using TDAsyncIO for non-blocking operation