Florence-2
Research & Licensing
Microsoft
Microsoft Research developed Florence-2 as a lightweight yet capable vision-language model.
Florence-2
Florence-2 is a vision foundation model that uses a prompt-based approach to handle diverse vision and vision-language tasks through a unified sequence-to-sequence architecture.
Key Research Contributions
- Unified vision foundation model handling captioning, detection, OCR, segmentation, and grounding in a single architecture
- Sequence-to-sequence approach that converts all vision tasks into text generation
License
MIT - This model is freely available for research and commercial use.
Overview
The Florence-2 LOP runs Microsoft’s Florence-2 vision foundation model through the vision_sidecar service. It handles image captioning, object detection, OCR, phrase grounding, region analysis, and prompt generation from a single operator. Connect a TOP image input, select a task, and pulse Process Image. Processing runs asynchronously, so TouchDesigner remains responsive while the model processes the image.
Requirements
- SideCar must be running with the vision_sidecar service active. The operator will attempt to start it automatically via EnsureSidecar, but the SideCar system component must be configured first.
- A CUDA GPU is required on the machine running the vision_sidecar process. This operator is not supported on macOS.
Input/Output
Inputs
- TOP input: The image to process. Connect any TOP.
Outputs
- Output 1 (output_dat): Text result from the selected task (caption, OCR text, detection results, etc.).
- Output 2 (conversation_dat): Latest interaction in conversation format (prompt + response).
A history_dat inside the operator logs all past results with model name and timestamp.
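Detection-style tasks return results in Florence-2’s dictionary format (bounding boxes plus labels) as text. As a sketch of post-processing that text outside the operator, the helper below parses such a string into table rows; the exact serialization written to output_dat is an assumption here, based on Florence-2’s standard output shape.

```python
import ast

def detection_to_rows(result_text):
    """Parse a Florence-2 style detection result string into table rows.

    Assumes the text is a Python-literal dict like
    "{'bboxes': [[x1, y1, x2, y2], ...], 'labels': [...]}" --
    the precise format output_dat uses may differ.
    """
    data = ast.literal_eval(result_text)
    rows = [("label", "x1", "y1", "x2", "y2")]
    for bbox, label in zip(data.get("bboxes", []), data.get("labels", [])):
        rows.append((label, *[round(v, 2) for v in bbox]))
    return rows

rows = detection_to_rows(
    "{'bboxes': [[10.0, 20.0, 110.0, 220.0]], 'labels': ['cat']}"
)
```

Rows in this shape can be copied straight into a Table DAT for downstream use.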
Available Tasks
The Task menu on the Florence2 page offers these vision tasks:
| Task | Description |
|---|---|
| caption | Short image caption |
| detailed_caption | Longer descriptive caption |
| more_detailed_caption | Most verbose captioning |
| region_caption | Captions for detected regions |
| dense_region_caption | Detailed captions per region |
| region_proposal | Object detection with bounding boxes |
| caption_to_phrase_grounding | Grounds text phrases to image regions (requires prompt) |
| referring_expression_segmentation | Segments regions matching a text description (requires prompt) |
| ocr | Extract text from image |
| ocr_with_region | Extract text with bounding box locations |
| docvqa | Answer questions about document images (requires prompt) |
| prompt_gen_tags | Generate prompt tags for the image |
| prompt_gen_mixed_caption | Generate mixed-style prompt caption |
| prompt_gen_analyze | Analyze image for prompt generation |
Tasks marked “requires prompt” need text in the Input Prompt field.
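Under the hood, Florence-2 selects behavior with special task prompt tokens, and prompt-taking tasks append the user text after the token. The mapping below is a sketch based on Microsoft’s published Florence-2 model card; whether this operator uses identical spellings internally is an assumption.

```python
# Task menu entry -> Florence-2 task prompt token (per the HF model card).
# The operator's internal mapping is assumed, not confirmed.
TASK_TOKENS = {
    "caption": "<CAPTION>",
    "detailed_caption": "<DETAILED_CAPTION>",
    "more_detailed_caption": "<MORE_DETAILED_CAPTION>",
    "dense_region_caption": "<DENSE_REGION_CAPTION>",
    "region_proposal": "<REGION_PROPOSAL>",
    "caption_to_phrase_grounding": "<CAPTION_TO_PHRASE_GROUNDING>",
    "referring_expression_segmentation": "<REFERRING_EXPRESSION_SEGMENTATION>",
    "ocr": "<OCR>",
    "ocr_with_region": "<OCR_WITH_REGION>",
}

def build_prompt(task, user_prompt=""):
    """Prepend the task token; prompt-taking tasks append the user text."""
    token = TASK_TOKENS[task]
    return f"{token}{user_prompt}" if user_prompt else token
```

For example, `build_prompt("caption_to_phrase_grounding", "a red car")` yields the full prompt string the model consumes.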
Basic Captioning
- Make sure SideCar is running.
- Connect an image TOP to the Florence-2 input.
- On the Florence2 page, select a model from the Florence Model menu.
- Pulse Load Model and wait for the model to load on the server.
- Set Task to detailed_caption.
- Pulse Process Image.
- The caption appears in the output DAT.
OCR
- With SideCar running and a model loaded, connect a TOP containing text.
- Set Task to ocr (or ocr_with_region for positional data).
- Pulse Process Image.
Phrase Grounding
- Set Task to caption_to_phrase_grounding.
- Enter a descriptive caption in the Input Prompt field.
- Pulse Process Image. The output contains bounding box coordinates for each phrase.
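Florence-2 reports bounding boxes in pixel coordinates with a top-left origin. To drive overlays in TOP space, you typically normalize to 0–1 UVs and flip the vertical axis; a minimal sketch of that conversion (the flip to a bottom-left origin is a common TouchDesigner convention, not something the operator mandates):

```python
def bbox_to_uv(bbox, img_w, img_h):
    """Convert a pixel-space [x1, y1, x2, y2] box (top-left origin)
    into normalized 0-1 UV coordinates with a bottom-left origin."""
    x1, y1, x2, y2 = bbox
    return (
        x1 / img_w,          # u of left edge
        1.0 - y2 / img_h,    # v of bottom edge (vertical axis flipped)
        x2 / img_w,          # u of right edge
        1.0 - y1 / img_h,    # v of top edge
    )

uv = bbox_to_uv([100, 50, 300, 150], 400, 200)
```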
Model Variants
The Florence Model menu includes the official Microsoft models as well as community fine-tunes:
- base / large: Core Florence-2 models (base is faster, large is more accurate)
- base-ft / large-ft: Fine-tuned variants with improved task performance
- DocVQA: Specialized for document question answering
- CogFlorence: Community fine-tunes with enhanced capabilities
- SD3-Captioner / Flux: Optimized for generating Stable Diffusion and Flux prompts
- PromptGen: MiaoshouAI fine-tunes for image-to-prompt generation
Visualization
The Fill Region Masks toggle controls whether detected regions are rendered as filled masks in the output visualization. Use Mask Selection to filter specific regions by index or label (comma-separated).
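A Mask Selection string like `1, car` can match regions by index or by label. The filter below sketches one plausible interpretation; the actual matching rule (index OR label, case-insensitive) is an assumption about how the parameter behaves.

```python
def parse_mask_selection(selection, regions):
    """Filter (index, label) regions by a comma-separated selection string.

    An empty selection keeps everything. Matching by index OR by
    case-insensitive label is assumed, not documented behavior.
    """
    terms = [t.strip().lower() for t in selection.split(",") if t.strip()]
    if not terms:
        return regions
    return [
        (i, label) for i, label in regions
        if str(i) in terms or label.lower() in terms
    ]

regions = [(0, "cat"), (1, "dog"), (2, "car")]
kept = parse_mask_selection("1, car", regions)
```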
Troubleshooting
- “ML Server not available”: The vision_sidecar service is not running. Open the SideCar system component and ensure the vision_sidecar process has started.
- “No input image”: No TOP is connected to the operator’s input. Wire a TOP into the Florence-2 input before pulsing Process Image.
- Model load fails: Check the SideCar logs for GPU memory or download errors. Larger models (large, large-ft) require more VRAM. Try a base variant if memory is limited.
- Missing packages dialog: The operator checks for torch, transformers, timm, and einops on initialization. If any are missing, it will prompt to install them. Note that torch must be version 2.1.1 or greater and must be compatible with your TouchDesigner build.
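The changelog notes that these checks use importlib.metadata rather than direct imports. A sketch of such a minimum-version check (the helper names are illustrative, not the operator’s actual code; local-version suffixes like `+cu121` are stripped before comparing):

```python
from importlib.metadata import version, PackageNotFoundError

MIN_TORCH = (2, 1, 1)

def version_tuple(v):
    """'2.1.1+cu121' -> (2, 1, 1): keep only leading numeric fields."""
    core = v.split("+")[0]
    parts = []
    for field in core.split("."):
        if field.isdigit():
            parts.append(int(field))
        else:
            break
    return tuple(parts)

def torch_ok():
    """True if an installed torch satisfies the 2.1.1 minimum."""
    try:
        return version_tuple(version("torch")) >= MIN_TORCH
    except PackageNotFoundError:
        return False
```

Querying metadata instead of importing avoids loading heavy packages just to verify they exist.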
Parameters
Florence2
Section titled “Florence2”op('florence').par.Load Pulse - Default:
False
op('florence').par.Process Pulse - Default:
False
op('florence').par.Reset Pulse - Default:
False
op('florence').par.Active Toggle - Default:
False
op('florence').par.Status Str - Default:
"" (Empty String)
op('florence').par.Prompt Str Optional input prompt for specific tasks
- Default:
"" (Empty String)
op('florence').par.Maxtokens Int - Default:
512- Range:
- 1 to 4096
- Slider Range:
- 1 to 4096
op('florence').par.Numbeams Int - Default:
3- Range:
- 1 to 64
- Slider Range:
- 1 to 64
op('florence').par.Dosample Toggle - Default:
True
op('florence').par.Seed Int - Default:
42- Range:
- 0 to 18446700000000000000
- Slider Range:
- 0 to 10000000
op('florence').par.Fillmask Toggle - Default:
True
op('florence').par.Maskselect Str Comma-separated list of region indices or labels to mask
- Default:
"" (Empty String)
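Maxtokens, Numbeams, Dosample, and Seed correspond to standard Hugging Face generation settings. How the sidecar forwards them is an assumption, but a request body built from these parameters might look like:

```python
def generation_args(max_tokens=512, num_beams=3, do_sample=True, seed=42):
    """Assemble transformers-style generation kwargs from the operator's
    Maxtokens / Numbeams / Dosample / Seed parameters. The exact payload
    the vision_sidecar endpoint expects is an assumption."""
    args = {
        "max_new_tokens": max_tokens,
        "num_beams": num_beams,
        "do_sample": do_sample,
    }
    if do_sample:
        args["seed"] = seed  # a seed only matters when sampling
    return args

args = generation_args()
```

With Dosample off, decoding is deterministic beam search, so Seed has no effect; that is why the sketch drops it in that case.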
Changelog
v1.0.3 (2026-03-16)
- Added Florence2 vision model integration via vision_sidecar HTTP API
- Implemented image processing with multiple model variants and tasks
- Added async processing and result handling
v1.0.2 (2026-03-01)
- Refactor to call ml_server HTTP API directly instead of SideCar methods
- Async image processing via TDAsyncIO
- Base64 image encoding for HTTP transmission
- Model loading via ml_server HTTP endpoint
- OnModelLoaded updated for TDAsyncIO callback pattern
v1.0.1 (2026-03-01)
- Replace direct imports with importlib.metadata checks for TD 32050+ compatibility
- Initial commit
v1.0.0 (2024-11-09)
- Initial release