Florence-2
Research & Licensing
Microsoft
Microsoft Research developed Florence-2 as a lightweight yet capable vision-language model.
Florence-2
Florence-2 is a vision foundation model that uses a prompt-based approach to handle diverse vision and vision-language tasks through a unified sequence-to-sequence architecture.
Key Research Contributions
- Unified vision foundation model handling captioning, detection, OCR, segmentation, and grounding in a single architecture
- Sequence-to-sequence approach that converts all vision tasks into text generation
License
MIT - This model is freely available for research and commercial use.
Overview
The Florence-2 LOP runs Microsoft’s Florence-2 vision foundation model through the vision_sidecar service. It handles image captioning, object detection, OCR, phrase grounding, region analysis, and prompt generation from a single operator. Connect a TOP image input, select a task, and pulse Process Image. Processing runs asynchronously, so TouchDesigner remains responsive while the model processes the image.
Requirements
- SideCar must be running with the vision_sidecar service active. The operator will attempt to start it automatically via EnsureSidecar, but the SideCar system component must be configured first.
- A CUDA GPU is required on the machine running the vision_sidecar process. This operator is not supported on macOS.
Input/Output
Inputs
- TOP input: The image to process. Connect any TOP.
Outputs
- Output 1 (output_dat): Text result from the selected task (caption, OCR text, detection results, etc.).
- Output 2 (conversation_dat): Latest interaction in conversation format (prompt + response).
A history_dat inside the operator logs all past results with model name and timestamp.
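Detection-style tasks return results in Florence-2’s dictionary format (bounding boxes plus labels) as text. As a sketch of post-processing that text outside the operator, the helper below parses such a string into table rows; the exact serialization written to output_dat is an assumption here, based on Florence-2’s standard output shape.

```python
import ast

def detection_to_rows(result_text):
    """Parse a Florence-2 style detection result string into table rows.

    Assumes the text is a Python-literal dict like
    "{'bboxes': [[x1, y1, x2, y2], ...], 'labels': [...]}" --
    the precise format output_dat uses may differ.
    """
    data = ast.literal_eval(result_text)
    rows = [("label", "x1", "y1", "x2", "y2")]
    for bbox, label in zip(data.get("bboxes", []), data.get("labels", [])):
        rows.append((label, *[round(v, 2) for v in bbox]))
    return rows

rows = detection_to_rows(
    "{'bboxes': [[10.0, 20.0, 110.0, 220.0]], 'labels': ['cat']}"
)
```

Rows in this shape can be copied straight into a Table DAT for downstream use.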
Available Tasks
The Task menu on the Florence2 page offers these vision tasks:
| Task | Description |
|---|---|
| caption | Short image caption |
| detailed_caption | Longer descriptive caption |
| more_detailed_caption | Most verbose captioning |
| region_caption | Captions for detected regions |
| dense_region_caption | Detailed captions per region |
| region_proposal | Object detection with bounding boxes |
| caption_to_phrase_grounding | Grounds text phrases to image regions (requires prompt) |
| referring_expression_segmentation | Segments regions matching a text description (requires prompt) |
| ocr | Extract text from image |
| ocr_with_region | Extract text with bounding box locations |
| docvqa | Answer questions about document images (requires prompt) |
| prompt_gen_tags | Generate prompt tags for the image |
| prompt_gen_mixed_caption | Generate mixed-style prompt caption |
| prompt_gen_analyze | Analyze image for prompt generation |
Tasks marked “requires prompt” need text in the Input Prompt field.
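Under the hood, Florence-2 selects behavior with special task prompt tokens, and prompt-taking tasks append the user text after the token. The mapping below is a sketch based on Microsoft’s published Florence-2 model card; whether this operator uses identical spellings internally is an assumption.

```python
# Task menu entry -> Florence-2 task prompt token (per the HF model card).
# The operator's internal mapping is assumed, not confirmed.
TASK_TOKENS = {
    "caption": "<CAPTION>",
    "detailed_caption": "<DETAILED_CAPTION>",
    "more_detailed_caption": "<MORE_DETAILED_CAPTION>",
    "dense_region_caption": "<DENSE_REGION_CAPTION>",
    "region_proposal": "<REGION_PROPOSAL>",
    "caption_to_phrase_grounding": "<CAPTION_TO_PHRASE_GROUNDING>",
    "referring_expression_segmentation": "<REFERRING_EXPRESSION_SEGMENTATION>",
    "ocr": "<OCR>",
    "ocr_with_region": "<OCR_WITH_REGION>",
}

def build_prompt(task, user_prompt=""):
    """Prepend the task token; prompt-taking tasks append the user text."""
    token = TASK_TOKENS[task]
    return f"{token}{user_prompt}" if user_prompt else token
```

For example, `build_prompt("caption_to_phrase_grounding", "a red car")` yields the full prompt string the model consumes.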
Basic Captioning
- Make sure SideCar is running.
- Connect an image TOP to the Florence-2 input.
- On the Florence2 page, select a model from the Florence Model menu.
- Pulse Load Model and wait for the model to load on the server.
- Set Task to detailed_caption.
- Pulse Process Image.
- The caption appears in the output DAT.
OCR
- With SideCar running and a model loaded, connect a TOP containing text.
- Set Task to ocr (or ocr_with_region for positional data).
- Pulse Process Image.
Phrase Grounding
- Set Task to caption_to_phrase_grounding.
- Enter a descriptive caption in the Input Prompt field.
- Pulse Process Image. The output contains bounding box coordinates for each phrase.
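Florence-2 reports bounding boxes in pixel coordinates with a top-left origin. To drive overlays in TOP space, you typically normalize to 0–1 UVs and flip the vertical axis; a minimal sketch of that conversion (the flip to a bottom-left origin is a common TouchDesigner convention, not something the operator mandates):

```python
def bbox_to_uv(bbox, img_w, img_h):
    """Convert a pixel-space [x1, y1, x2, y2] box (top-left origin)
    into normalized 0-1 UV coordinates with a bottom-left origin."""
    x1, y1, x2, y2 = bbox
    return (
        x1 / img_w,          # u of left edge
        1.0 - y2 / img_h,    # v of bottom edge (vertical axis flipped)
        x2 / img_w,          # u of right edge
        1.0 - y1 / img_h,    # v of top edge
    )

uv = bbox_to_uv([100, 50, 300, 150], 400, 200)
```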
Model Variants
The Florence Model menu includes the official Microsoft models as well as community fine-tunes:
- base / large: Core Florence-2 models (base is faster, large is more accurate)
- base-ft / large-ft: Fine-tuned variants with improved task performance
- DocVQA: Specialized for document question answering
- CogFlorence: Community fine-tunes with enhanced capabilities
- SD3-Captioner / Flux: Optimized for generating Stable Diffusion and Flux prompts
- PromptGen: MiaoshouAI fine-tunes for image-to-prompt generation
Visualization
The Fill Region Masks toggle controls whether detected regions are rendered as filled masks in the output visualization. Use Mask Selection to filter specific regions by index or label (comma-separated).
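A Mask Selection string like `1, car` can match regions by index or by label. The filter below sketches one plausible interpretation; the actual matching rule (index OR label, case-insensitive) is an assumption about how the parameter behaves.

```python
def parse_mask_selection(selection, regions):
    """Filter (index, label) regions by a comma-separated selection string.

    An empty selection keeps everything. Matching by index OR by
    case-insensitive label is assumed, not documented behavior.
    """
    terms = [t.strip().lower() for t in selection.split(",") if t.strip()]
    if not terms:
        return regions
    return [
        (i, label) for i, label in regions
        if str(i) in terms or label.lower() in terms
    ]

regions = [(0, "cat"), (1, "dog"), (2, "car")]
kept = parse_mask_selection("1, car", regions)
```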
Troubleshooting
- “ML Server not available”: The vision_sidecar service is not running. Open the SideCar system component and ensure the vision_sidecar process has started.
- “No input image”: No TOP is connected to the operator’s input. Wire a TOP into the Florence-2 input before pulsing Process Image.
- Model load fails: Check the SideCar logs for GPU memory or download errors. Larger models (large, large-ft) require more VRAM. Try a base variant if memory is limited.
- Missing packages dialog: The operator checks for torch, transformers, timm, and einops on initialization. If any are missing, it will prompt to install them. Note that torch must be version 2.1.1 or greater and must be compatible with your TouchDesigner build.
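The changelog notes that these checks use importlib.metadata rather than direct imports. A sketch of such a minimum-version check (the helper names are illustrative, not the operator’s actual code; local-version suffixes like `+cu121` are stripped before comparing):

```python
from importlib.metadata import version, PackageNotFoundError

MIN_TORCH = (2, 1, 1)

def version_tuple(v):
    """'2.1.1+cu121' -> (2, 1, 1): keep only leading numeric fields."""
    core = v.split("+")[0]
    parts = []
    for field in core.split("."):
        if field.isdigit():
            parts.append(int(field))
        else:
            break
    return tuple(parts)

def torch_ok():
    """True if an installed torch satisfies the 2.1.1 minimum."""
    try:
        return version_tuple(version("torch")) >= MIN_TORCH
    except PackageNotFoundError:
        return False
```

Querying metadata instead of importing avoids loading heavy packages just to verify they exist.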
Parameters
Florence2
Section titled “Florence2”op('florence').par.Load Pulse - Default:
False
op('florence').par.Process Pulse - Default:
False
op('florence').par.Reset Pulse - Default:
False
op('florence').par.Active Toggle - Default:
False
op('florence').par.Status Str - Default:
"" (Empty String)
op('florence').par.Prompt Str Optional input prompt for specific tasks
- Default:
"" (Empty String)
op('florence').par.Maxtokens Int - Default:
512- Range:
- 1 to 4096
- Slider Range:
- 1 to 4096
op('florence').par.Numbeams Int - Default:
3- Range:
- 1 to 64
- Slider Range:
- 1 to 64
op('florence').par.Dosample Toggle - Default:
True
op('florence').par.Seed Int - Default:
42- Range:
- 0 to 18446700000000000000
- Slider Range:
- 0 to 10000000
op('florence').par.Fillmask Toggle - Default:
True
op('florence').par.Maskselect Str Comma-separated list of region indices or labels to mask
- Default:
"" (Empty String)
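Maxtokens, Numbeams, Dosample, and Seed correspond to standard Hugging Face generation settings. How the sidecar forwards them is an assumption, but a request body built from these parameters might look like:

```python
def generation_args(max_tokens=512, num_beams=3, do_sample=True, seed=42):
    """Assemble transformers-style generation kwargs from the operator's
    Maxtokens / Numbeams / Dosample / Seed parameters. The exact payload
    the vision_sidecar endpoint expects is an assumption."""
    args = {
        "max_new_tokens": max_tokens,
        "num_beams": num_beams,
        "do_sample": do_sample,
    }
    if do_sample:
        args["seed"] = seed  # a seed only matters when sampling
    return args

args = generation_args()
```

With Dosample off, decoding is deterministic beam search, so Seed has no effect; that is why the sketch drops it in that case.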
Changelog
v1.0.3 (2026-03-16)
- Added Florence2 vision model integration via vision_sidecar HTTP API
- Implemented image processing with multiple model variants and tasks
- Added async processing and result handling
v1.0.2 (2026-03-01)
- Refactor to call ml_server HTTP API directly instead of SideCar methods
- Async image processing via TDAsyncIO
- Base64 image encoding for HTTP transmission
- Model loading via ml_server HTTP endpoint
- OnModelLoaded updated for TDAsyncIO callback pattern
v1.0.1 (2026-03-01)
- Replace direct imports with importlib.metadata checks for TD 32050+ compatibility
- Initial commit
v1.0.0 (2024-11-09)
- Initial release