ACE-Step Music Generator
Overview
The ACE-Step Music Generator integrates the ACE-Step diffusion model into TouchDesigner for text-to-music generation, audio-to-audio transformation, and audio editing. All inference runs in the external SideCar process, keeping TouchDesigner responsive. Generated audio includes a real-time waveform visualizer and optional autoplay.
Key Features
- Text-to-Music: Generate music from descriptive tags, genres, and structured lyrics
- Audio-to-Audio: Transform existing audio using a reference file with adjustable influence strength
- Audio Editing: Edit, repaint, retake, or extend existing audio with fine-grained control
- SideCar Architecture: All model loading and inference is offloaded to an external process
- Auto Repository Setup: Prompts to download and clone the ACE-Step repository on first use
- Settings Recall: Save and reload generation parameters from previous outputs
Requirements
- SideCar Operator: Must be running and connected. All model inference happens there
- SideCar Python Environment: All ACE-Step dependencies (torch, torchaudio, librosa, diffusers, etc.) must be installed in the SideCar’s Python environment. This operator does not manage packages
- Git: Must be installed and in your system PATH for automatic repository cloning
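Since the operator does not manage packages, it can save time to verify the SideCar environment up front. A minimal sketch of such a check (hypothetical helper; run it with the SideCar's Python interpreter, not TouchDesigner's):

```python
import importlib.util

# Packages the documentation lists as examples; the full set comes from
# the ACE-Step repository's own requirements file.
REQUIRED = ("torch", "torchaudio", "librosa", "diffusers")

def missing_packages(names=REQUIRED):
    """Return the subset of `names` not importable in this environment."""
    return [n for n in names if importlib.util.find_spec(n) is None]

if __name__ == "__main__":
    gaps = missing_packages()
    print("Missing:", gaps if gaps else "none")
```

Any package reported as missing must be installed into the SideCar's environment manually.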
Input/Output
Inputs
None required. Generation is configured entirely via parameters. Reference audio files are specified by file path for audio-to-audio and editing modes.
Outputs
- Waveform Visualization: Real-time visual waveform rendered to an internal scriptTOP
- Audio Files: Generated WAV files saved to the configured output folder
Usage Examples
Text-to-Music Generation
- Ensure the SideCar is running and connected (check the About page, ‘SideCar Operator’)
- On the ACE-Step page, enter descriptive tags in ‘Prompt / Tags’ (e.g., “upbeat pop, catchy melody, female singer”)
- Enter structured lyrics in ‘Lyrics’ using tags like [verse] and [chorus]
- Set ‘Audio Duration’ to the desired length in seconds
- Pulse ‘Generate Music’
- If this is your first time, a dialog will prompt you to download the ACE-Step repository — click Download and wait for it to finish, then pulse ‘Generate Music’ again
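Because the structure tags are plain text, lyrics can also be assembled in a script before being written into the ‘Lyrics’ parameter. A minimal sketch (hypothetical helper, not part of the operator):

```python
def build_lyrics(sections):
    """Assemble structured lyrics from (tag, lines) pairs,
    e.g. [("verse", [...]), ("chorus", [...])]."""
    out = []
    for tag, lines in sections:
        out.append(f"[{tag}]")
        out.extend(lines)
    return "\n".join(out)

lyrics = build_lyrics([
    ("verse", ["City lights are calling out my name"]),
    ("chorus", ["We rise, we fall, we sing it all again"]),
])
```

Inside TouchDesigner the result would be assigned to `op('acestep').par.Lyrics` before pulsing ‘Generate Music’.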
Audio-to-Audio Transformation
- On the ACE-Step page, toggle ‘Enable Audio2Audio’ to On
- Set ‘Reference Audio Input’ to your source audio file
- Adjust ‘Reference Audio Strength’ — higher values stay closer to the reference
- Enter a prompt and lyrics to guide the transformation
- Pulse ‘Generate Music’
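A common way img2img-style diffusion pipelines apply a reference is to start denoising partway through the noise schedule. The sketch below illustrates that idea under this operator's convention, where higher ‘Reference Audio Strength’ preserves more of the reference; it is a conceptual illustration, not ACE-Step's actual implementation:

```python
def denoise_start_step(total_steps, ref_strength):
    """Illustrative only: a higher reference strength skips more of the
    early (most destructive) denoising steps, so the output stays closer
    to the reference audio."""
    if not 0.0 <= ref_strength <= 1.0:
        raise ValueError("ref_strength must be in [0, 1]")
    return int(total_steps * ref_strength)
```

At strength 1.0 the sampler would skip every step (output is essentially the reference); at 0.0 it runs the full schedule and the reference has little influence.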
Audio Editing (Edit, Repaint, Retake, Extend)
- On the Edit page, toggle ‘Enable Audio Editing/Manipulation’ to On
- Set ‘Source Audio Path’ to the audio you want to modify
- Select an ‘Edit Mode’:
- Edit Audio Content: Changes the content of the audio using original and target prompts/lyrics. Requires filling in ‘Original Prompt’ and ‘Original Lyrics’ — pulse ‘Load Src Credentials’ to auto-fill these from a previous generation’s saved parameters
- Extend Audio Duration: Extends the audio by setting ‘Extend Start’ and ‘Extend End’ beyond the original boundaries
- Repaint Audio Segment: Regenerates a time region defined by start/end times
- Retake Full Audio: Regenerates the entire audio with variance control
- Adjust ‘Variance’ and ‘Variant Seed’ for variation control
- Pulse ‘Generate Music’ on the ACE-Step page
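Extend mode reuses the repaint start/end parameters: a negative start pads on the left, and an end beyond the original duration extends on the right. A small sketch of the resulting output length (hypothetical helper, mirroring the parameter descriptions):

```python
def extended_duration(original, start, end):
    """Output length after an Extend operation, given 'Extend Start'
    (negative values pad left) and 'Extend End' (values beyond the
    original duration extend right)."""
    left_pad = max(0.0, -start)
    right_ext = max(0.0, end - original)
    return original + left_pad + right_ext
```

For a 30-second source, start -10 and end 30 yields 40 seconds of audio; start -5 and end 40 yields 45.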
Reloading Previous Settings
- Set ‘Current Audio’ to a previously generated WAV file
- Pulse ‘Settings from Current Audio’ — this loads all generation parameters from the associated JSON file saved alongside the audio
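The saved parameters are plain JSON, so they can also be inspected outside TouchDesigner. A sketch, assuming the `_input_params.json` naming convention mentioned on the Edit page:

```python
import json
from pathlib import Path

def load_saved_params(audio_path):
    """Read the parameter JSON saved alongside a generated WAV.
    Assumes the <name>_input_params.json sidecar naming convention."""
    wav = Path(audio_path)
    sidecar = wav.with_name(wav.stem + "_input_params.json")
    if not sidecar.exists():
        return None
    return json.loads(sidecar.read_text())
```

Returning None for a missing sidecar file lets a calling script distinguish "no saved settings" from a parse error.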
Best Practices
- Set ‘Manual Seed’ to a specific value for reproducible results, or leave at -1 for random
- Toggle ‘Add Unique Suffix to Filename’ to On to prevent overwriting previous outputs
- For audio editing, always use ‘Load Src Credentials’ to auto-fill the original prompt and lyrics rather than typing them manually
- Higher ‘Inference Steps’ improve quality but increase generation time — 60 is a good starting point
- On the Advanced page, ‘Use bfloat16 Precision’ speeds up inference on supported GPUs. Disable it on macOS or if you encounter errors
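The seed and filename practices above can be expressed directly; a sketch (hypothetical helpers mirroring the "-1 means random" and timestamp-suffix behaviors, with the exact suffix format being an assumption):

```python
import random
import time

def resolve_seed(manual_seed):
    """-1 requests a random seed; any other value is used as-is
    for reproducible generation."""
    if manual_seed == -1:
        return random.randint(0, 1_000_000_000)
    return manual_seed

def output_name(base, add_suffix=True):
    """Optionally append a timestamp so earlier outputs are not
    overwritten (suffix format here is illustrative)."""
    if not add_suffix:
        return f"{base}.wav"
    return f"{base}_{time.strftime('%Y%m%d_%H%M%S')}.wav"
```

Recording the resolved seed alongside the output is what makes a random run repeatable later.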
Troubleshooting
- SideCar Not Connected: Check that the SideCar server is running. Verify the ‘SideCar Operator’ reference on the About page points to the correct operator
- Repository Missing: If the clone prompt appears repeatedly, check your internet connection and Git installation. Review the TouchDesigner console for detailed errors
- Missing Dependencies: Errors about missing Python packages (e.g., torch, librosa) mean you need to install them in the SideCar’s Python environment manually
- torch.compile() Not Supported on Windows: The ACE-Step model does not support torch.compile() on Windows. Leave this toggle off unless running on Linux
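If the clone prompt keeps reappearing, the first thing to verify is that Git is actually reachable on PATH. A quick check from any Python shell (hypothetical snippet):

```python
import shutil
import subprocess

def git_version():
    """Return Git's version string, or None if Git is not on PATH."""
    exe = shutil.which("git")
    if exe is None:
        return None
    result = subprocess.run([exe, "--version"], capture_output=True, text=True)
    return result.stdout.strip() or None
```

If this returns None, install Git or add it to your system PATH before retrying the repository download.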
Research Citation
Research & Licensing
ACE-STEP Project
The ACE-STEP project is an open-source initiative focused on advancing AI music generation.
ACE-Step
ACE-Step is a foundation model for music generation that integrates diffusion-based generation with advanced encoding and transformation techniques.
Technical Details
- Combines diffusion with DCAE and linear transformer architecture
- Uses MERT and m-hubert for semantic representation alignment (REPA)
- Supports text-to-music, audio-to-audio, edit, repaint, retake, and extend tasks
Research Impact
- Provides a holistic open-source architecture for state-of-the-art music generation
- Enables original music generation across diverse genres for creative production and education
Citation
@misc{gong2025acestep,
  title={ACE-Step: A Step Towards Music Generation Foundation Model},
  author={Junmin Gong and Wenxiao Zhao and Sen Wang and Shengyuan Xu and Jing Guo},
  howpublished={\url{https://github.com/ace-step/ACE-Step}},
  year={2025},
  note={GitHub repository}
}
Key Research Contributions
- Open-source foundation model for music generation using diffusion with Deep Compression AutoEncoder (DCAE) and lightweight linear transformer
- Leverages MERT and m-hubert for semantic alignment (REPA) enabling rapid training convergence
- Faster synthesis than LLM-based models (up to 4 minutes of music in 20 seconds on A100 GPU)
- Supports voice cloning, lyric editing, remixing, and track generation through fine-grained acoustic control
License
Apache License 2.0 - This model is freely available for research and commercial use.
Parameters
ACE-Step
op('acestep').par.Status Str Current status of the operator.
- Default: "" (Empty String)
op('acestep').par.Active Toggle
- Default: False
op('acestep').par.Currentaudio File
- Default: "" (Empty String)
op('acestep').par.Playhead Float
- Default: 0.0
- Range: 0 to 1
- Slider Range: 0 to 1
op('acestep').par.Autoplay Toggle Automatically play the audio after generation.
- Default: False
op('acestep').par.Generate Pulse Trigger the music generation process based on current settings.
- Default: False
op('acestep').par.Prompt Str Descriptive tags, genres, or scene descriptions. Used for text2music, audio2audio, and as a basis for edit/repaint.
- Default: "" (Empty String)
op('acestep').par.Lyrics Str Enter lyrics with structure tags like [verse], [chorus]. Use \n for newlines. Used for text2music, audio2audio, and as a basis for edit/repaint.
- Default: "" (Empty String)
op('acestep').par.Duration Float Desired duration of the generated audio in seconds.
- Default: 0.0
- Range: 1 to 240
- Slider Range: 1 to 240
op('acestep').par.Infersteps Int Number of inference steps. Higher can improve quality but takes longer.
- Default: 0
- Range: 10 to 100
- Slider Range: 10 to 100
op('acestep').par.Manualseed Int Seed for reproducibility. -1 for random. Affects initial generation.
- Default: 0
- Range: -1 to 1000000000
- Slider Range: -1 to 1000000000
op('acestep').par.Guidancescale Float Main classifier-free guidance scale. Used if CFG Type is not "Double Condition".
- Default: 0.0
- Range: 1 to 30
- Slider Range: 1 to 30
op('acestep').par.Omegascale Float Omega scale factor for APG guidance type.
- Default: 0.0
- Range: 0 to 20
- Slider Range: 0 to 20
op('acestep').par.Guidancescaletext Float Guidance scale for text prompt when CFG Type is "Double Condition".
- Default: 0.0
- Range: 0 to 30
- Slider Range: 0 to 30
op('acestep').par.Guidancescalelyric Float Guidance scale for lyrics when CFG Type is "Double Condition".
- Default: 0.0
- Range: 0 to 30
- Slider Range: 0 to 30
op('acestep').par.Audio2audioenable Toggle Enable audio-to-audio generation. Uses Prompt & Lyrics as guidance if provided.
- Default: False
op('acestep').par.Refaudioinput File Path to the reference audio file for Audio2Audio mode.
- Default: "" (Empty String)
op('acestep').par.Refaudiostrength Float Strength of the reference audio influence (0.0 to 1.0).
- Default: 0.0
- Range: 0 to 1
- Slider Range: 0 to 1
op('acestep').par.Outputfolder Folder Folder to save the generated WAV file. Relative to project or absolute.
- Default: "" (Empty String)
op('acestep').par.Outputfilename Str Name of the generated WAV file.
- Default: "" (Empty String)
op('acestep').par.Uniquesuffix Toggle If True, appends a timestamp to the filename to prevent overwriting.
- Default: False
op('acestep').par.Initialize Pulse Check dependencies, SideCar connection, and initialize the model.
- Default: False
op('acestep').par.Unloadmodel Pulse Release the model from memory via SideCar.
- Default: False
op('acestep').par.Loadsettings Pulse Load generation parameters from the JSON associated with the Current Audio file.
- Default: False
op('acestep').par.Editaudio Toggle Master toggle to enable audio editing modes on this page.
- Default: False
op('acestep').par.Srcaudiopath File Path to the source audio file for all edit modes.
- Default: "" (Empty String)
op('acestep').par.Retakeseeds Int Seed for retake/repaint/extend variations. -1 for random.
- Default: 0
- Range: -1 to 1000000000
- Slider Range: -1 to 1000000000
op('acestep').par.Retakevariance Float Amount of variance for retake/repaint (0.0 to 1.0).
- Default: 0.0
- Range: 0 to 1
- Slider Range: 0 to 1
op('acestep').par.Repaintstart Float Start time in seconds for repaint. For extend, negative values pad left. 0 for retake.
- Default: 0.0
- Range: -240 to 240
- Slider Range: -240 to 240
op('acestep').par.Repaintend Float End time in seconds for repaint. For extend, values beyond original duration extend right. Original duration for retake.
- Default: 0.0
- Range: -240 to 480
- Slider Range: -240 to 480
op('acestep').par.Transitiontime Float Duration of the transition/crossfade in seconds for repaint/extend modes. 0 for abrupt change.
- Default: 0.0
- Range: 0 to 30
- Slider Range: 0 to 30
op('acestep').par.Editoriginalprompt Str The original prompt used to generate the Source Audio. Required for "Edit Audio Content" mode.
- Default: "" (Empty String)
op('acestep').par.Editoriginallyrics Str The original lyrics used to generate the Source Audio. Required for "Edit Audio Content" mode.
- Default: "" (Empty String)
op('acestep').par.Edittargetprompt Str Target prompt for "Edit Audio Content" mode. If empty, uses main prompt.
- Default: "" (Empty String)
op('acestep').par.Edittargetlyrics Str Target lyrics for "Edit Audio Content" mode. If empty, uses main lyrics.
- Default: "" (Empty String)
op('acestep').par.Editnmin Float Min influence for audio editing (0.0 to 1.0).
- Default: 0.0
- Range: 0 to 1
- Slider Range: 0 to 1
op('acestep').par.Editnmax Float Max influence for audio editing (0.0 to 1.0).
- Default: 0.0
- Range: 0 to 1
- Slider Range: 0 to 1
op('acestep').par.Editnavg Int Averaging window size for editing.
- Default: 0
- Range: 1 to 100
- Slider Range: 1 to 100
op('acestep').par.Loadsrccredentials Pulse Loads prompt and lyrics from the _input_params.json associated with the Src Audio Path.
- Default: False
Advanced
op('acestep').par.Guidanceinterval Float Guidance interval for CFG.
- Default: 0.0
- Range: 0 to 1
- Slider Range: 0 to 1
op('acestep').par.Guidanceintervaldecay Float Decay rate for guidance interval.
- Default: 0.0
- Range: 0 to 1
- Slider Range: 0 to 1
op('acestep').par.Minguidancescale Float Minimum guidance scale.
- Default: 0.0
- Range: 0 to 30
- Slider Range: 0 to 30
op('acestep').par.Usergtag Toggle Enable ERG (Exponentially Smoothed Moving Average Guidance) for prompt/tags.
- Default: False
op('acestep').par.Userglyric Toggle Enable ERG for lyrics.
- Default: False
op('acestep').par.Usergdiffusion Toggle Enable ERG for diffusion process.
- Default: False
op('acestep').par.Useoss Toggle Enable Optimal Step Size scheduling. Only effective if Scheduler Type is Euler.
- Default: False
op('acestep').par.Osssteps Str Steps for OSS (Optimal Step Size) scheduling, comma-separated. Must be used with the Euler scheduler.
- Default: "" (Empty String)
op('acestep').par.Deviceid Int GPU device ID to use (e.g., 0, 1). Requires re-initialize.
- Default: 0
- Range: 0 to 1
- Slider Range: 0 to 1
op('acestep').par.Usebf16 Toggle Use bfloat16 for faster inference (if supported). Uncheck for macOS or if errors occur. Requires re-initialize.
- Default: False
op('acestep').par.Torchcompile Toggle Optimize model with torch.compile() for faster inference (not supported on Windows by ACE-Step). Requires re-initialize.
- Default: False
op('acestep').par.Modelpath Folder Path to the cloned ACE-Step GitHub repository directory.
- Default: "" (Empty String)
op('acestep').par.Checkpointdir Folder Path to the ACE-Step model checkpoint directory. Leave empty to auto-download to the default location inside the repo.
- Default: "" (Empty String)
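The ‘Osssteps’ value is a comma-separated string. A sketch of how such a value might be parsed into step indices (hypothetical helper, not the operator's actual parsing code):

```python
def parse_oss_steps(value):
    """Parse a comma-separated step string such as '0, 10, 20' into a
    list of ints; an empty string yields an empty list (no OSS steps)."""
    return [int(tok) for tok in value.split(",") if tok.strip()]
```

Remember that OSS scheduling only takes effect when the scheduler type is Euler.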
Changelog
v2.0.0 (2025-07-19)
🎨 Audio Visualization & Enhanced User Experience
- Real-Time Audio Visualization:
- Professional waveform visualization with frequency analysis
- High-quality grayscale waveform rendering at 1280x960 resolution
- Dynamic amplitude processing with transient emphasis
- Frequency-based brightness variation for rich visual feedback
- Anti-aliasing and smooth envelope generation
- Automatic Visualization Triggers:
- Auto-visualization after successful generation
- Manual visualization via Currentaudio parameter changes
- Smart path tracking to prevent redundant processing
- Visual Feedback Enhancements:
- Black screen clearing when no audio is selected
- Visual confirmation of current audio status
- Seamless integration with audio playback controls
- Async Visualization Processing:
- Non-blocking waveform generation using TDAsyncIO
- Thread-safe audio analysis with librosa integration
- Graceful fallback to synchronous processing when needed
- Robust Error Handling:
- Fixed critical len() type errors in visualization pipeline
- Comprehensive try/catch blocks around FFT processing
- Safe numpy array type checking throughout audio pipeline
- Parameter Callback System:
- New Currentaudio() callback method for parameter-driven visualization
- Intelligent request state checking to prevent conflicts during generation
- Path validation and existence checking before processing
- Critical Stability Fixes:
- Resolved TouchDesigner crashes caused by async visualization errors
- Fixed "object of type 'int' has no len()" errors in audio processing
- Improved error handling in FFT frequency analysis
- Safe handling of edge cases in audio array processing
- Visualization Pipeline Fixes:
- Proper numpy array type validation throughout processing chain
- Graceful handling of malformed or empty audio files
- Improved error logging for debugging visualization issues
- Visual Audio Management:
- Immediate visual feedback when changing current audio file
- Clear visual indication when no audio is loaded (black screen)
- Smooth integration between generation and visualization workflows
- Status & Logging Improvements:
- Enhanced logging for visualization processes
- Clear status messages for audio loading and processing
- Improved error messages for troubleshooting
- Visualization Engine:
- Uses librosa for professional audio analysis
- Implements RMS and peak envelope detection
- FFT-based frequency analysis for visual brightness variation
- Supports both sync and async processing modes
- Integration Points:
- Seamless connection with existing audio playback system
- Compatible with all generation modes (text2music, audio2audio, editing)
- Maintains full backward compatibility with v1.0.0 workflows
PERFORMANCE:
- Optimized waveform generation with configurable resolution
- Efficient memory usage in visualization processing
- Non-blocking UI during visualization generation
v1.0.0 (2025-06-20)
🎵 Initial Release - ACE-Step Music Generation Integration
NEW FEATURES:
- Text-to-Music Generation: Generate music from text prompts and descriptive tags
- Lyrics Support: Full lyrics integration with structure tags like [verse], [chorus]
- Audio2Audio Mode: Transform existing audio using prompts and lyrics as guidance
- Advanced Audio Editing: Complete suite of audio manipulation tools:
- Edit Audio Content: Modify existing audio with target prompts/lyrics
- Repaint Audio Segment: Replace specific time segments
- Retake Full Audio: Generate variations of entire audio
- Extend Audio Duration: Extend audio beyond original length
- Professional Parameter Control:
- Inference steps, guidance scales, scheduler types (Euler, Heun)
- CFG types (APG, CFG, Zero STAR, Double Condition)
- ERG (Exponentially Smoothed Moving Average Guidance) controls
- Manual seed support for reproducible generation
- SideCar Integration: Seamless integration with SideCar server for distributed processing
- Dependency Management: Automatic detection and installation of required Python packages
- Output Management:
- Configurable output folders and filenames
- Automatic unique timestamp suffixes
- JSON parameter saving for reproducibility
- Settings Management: Load/save generation parameters from JSON files
- Audio Playback: Built-in audio playback with playhead control
- Model Management: Initialize, load, and unload models on demand
TECHNICAL FEATURES:
- Three-Page Parameter Layout:
- Main: Core generation and output settings
- Edit: Audio editing and manipulation controls
- Advanced: Professional diffusion and guidance parameters
- Async Processing: Non-blocking generation via TDAsyncIO integration
- Error Handling: Comprehensive dependency checking and error recovery
- Status Monitoring: Real-time status updates and progress tracking
SUPPORTED WORKFLOWS:
- Text → Music: Generate music from descriptive prompts
- Audio → Audio: Transform existing audio with new characteristics
- Audio Editing: Professional audio manipulation and refinement
- Batch Processing: Generate multiple variations with different seeds
REQUIREMENTS:
- ACE-Step repository (user must clone and configure)
- SideCar operator for processing
- Python dependencies (auto-installed when possible)
- Optional: Custom checkpoints directory