
Florence-2

v1.0.3

Research & Licensing

Microsoft

Microsoft Research developed Florence-2 as a lightweight yet capable vision-language model.

Florence-2 is a vision foundation model that uses a prompt-based approach to handle diverse vision and vision-language tasks through a unified sequence-to-sequence architecture.


Key Research Contributions

  • Unified vision foundation model handling captioning, detection, OCR, segmentation, and grounding in a single architecture
  • Sequence-to-sequence approach that converts all vision tasks into text generation

License

MIT - This model is freely available for research and commercial use.

The Florence-2 LOP runs Microsoft’s Florence-2 vision foundation model through the vision_sidecar service. It handles image captioning, object detection, OCR, phrase grounding, region analysis, and prompt generation from a single operator. Connect a TOP image input, select a task, and pulse Process Image. Processing runs asynchronously, so TouchDesigner remains responsive while the model processes the image.

Requirements

  • SideCar must be running with the vision_sidecar service active. The operator will attempt to start it automatically via EnsureSidecar, but the SideCar system component must be configured first.
  • A CUDA-capable GPU on the machine running the vision_sidecar process. This operator is not supported on macOS.

Inputs and Outputs

  • TOP input: The image to process. Connect any TOP.
  • Output 1 (output_dat): Text result from the selected task (caption, OCR text, detection results, etc.)
  • Output 2 (conversation_dat): Latest interaction in conversation format (prompt + response).

A history_dat inside the operator logs all past results with model name and timestamp.

The Task menu on the Florence2 page offers these vision tasks:

Task                                 Description
caption                              Short image caption
detailed_caption                     Longer descriptive caption
more_detailed_caption                Most verbose captioning
region_caption                       Captions for detected regions
dense_region_caption                 Detailed captions per region
region_proposal                      Object detection with bounding boxes
caption_to_phrase_grounding          Grounds text phrases to image regions (requires prompt)
referring_expression_segmentation    Segments regions matching a text description (requires prompt)
ocr                                  Extract text from image
ocr_with_region                      Extract text with bounding box locations
docvqa                               Answer questions about document images (requires prompt)
prompt_gen_tags                      Generate prompt tags for the image
prompt_gen_mixed_caption             Generate mixed-style prompt caption
prompt_gen_analyze                   Analyze image for prompt generation

Tasks marked “requires prompt” need text in the Input Prompt field.
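If you drive the operator from a script, the prompt requirement in the table above can be mirrored in a small lookup. This dict is illustrative, derived from the task table; it is not part of the operator's API:

```python
# Task menu values that need text in the Input Prompt field
# (mirrors the task table above; illustrative, not the operator's API).
TASKS_REQUIRING_PROMPT = {
    "caption_to_phrase_grounding",
    "referring_expression_segmentation",
    "docvqa",
}

def needs_prompt(task: str) -> bool:
    """Return True if the given Florence-2 task expects an input prompt."""
    return task in TASKS_REQUIRING_PROMPT
```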

Quick Start: Caption an Image

  1. Make sure SideCar is running.
  2. Connect an image TOP to the Florence-2 input.
  3. On the Florence2 page, select a model from the Florence Model menu.
  4. Pulse Load Model and wait for the model to load on the server.
  5. Set Task to detailed_caption.
  6. Pulse Process Image.
  7. The caption appears in the output DAT.
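The same steps can be scripted from the Textport or an Execute DAT. A minimal sketch, assuming the operator is reachable at op('florence') (adjust the path to your network); this must run inside TouchDesigner, where the global op() lookup exists:

```python
# Run inside TouchDesigner; 'op' is TouchDesigner's global operator lookup.
flo = op('florence')                        # path to the Florence-2 operator (adjust)
flo.par.Florencemodel = 'microsoft/Florence-2-base'
flo.par.Load.pulse()                        # load the model on the vision_sidecar server
# ...wait for the Status parameter to report the model is ready, then:
flo.par.Task = 'detailed_caption'
flo.par.Process.pulse()                     # async; the caption appears in output_dat
```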
OCR

  1. With SideCar running and a model loaded, connect a TOP containing text.
  2. Set Task to ocr (or ocr_with_region for positional data).
  3. Pulse Process Image.
Phrase Grounding

  1. Set Task to caption_to_phrase_grounding.
  2. Enter a descriptive caption in the Input Prompt field.
  3. Pulse Process Image. The output contains bounding box coordinates for each phrase.
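The exact text format written to the output DAT is not documented here; Florence-2 models typically return grounding results as a payload with `bboxes` and `labels` keys, so a parser might look like the following sketch (the sample string is illustrative, not real model output):

```python
import json

def parse_grounding(text: str):
    """Yield (label, (x1, y1, x2, y2)) pairs from a grounding result.

    Assumes a JSON payload with 'bboxes' and 'labels' keys, the shape
    Florence-2 models typically return; adjust to the actual DAT contents.
    """
    data = json.loads(text)
    return list(zip(data["labels"], [tuple(b) for b in data["bboxes"]]))

# Illustrative payload, not real model output:
sample = '{"bboxes": [[10, 20, 110, 220]], "labels": ["a red ball"]}'
```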

The Florence Model menu includes the official Microsoft models as well as community fine-tunes:

  • base / large: Core Florence-2 models (base is faster, large is more accurate)
  • base-ft / large-ft: Fine-tuned variants with improved task performance
  • DocVQA: Specialized for document question answering
  • CogFlorence: Community fine-tunes with enhanced capabilities
  • SD3-Captioner / Flux: Optimized for generating Stable Diffusion and Flux prompts
  • PromptGen: MiaoshouAI fine-tunes for image-to-prompt generation

The Fill Region Masks toggle controls whether detected regions are rendered as filled masks in the output visualization. Use Mask Selection to filter specific regions by index or label (comma-separated).
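The Mask Selection convention (comma-separated indices or labels) can be parsed as below. This is a sketch of the format described above, not the operator's internal parser:

```python
# Split a Mask Selection string into numeric region indices and text labels.
# A sketch of the comma-separated convention described above, not the
# operator's internal parser.
def parse_mask_selection(value: str):
    indices, labels = [], []
    for token in value.split(','):
        token = token.strip()
        if not token:
            continue
        if token.isdigit():
            indices.append(int(token))   # numeric region index
        else:
            labels.append(token)         # region label
    return indices, labels
```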

Troubleshooting

  • “ML Server not available”: The vision_sidecar service is not running. Open the SideCar system component and ensure the vision_sidecar process has started.
  • “No input image”: No TOP is connected to the operator’s input. Wire a TOP into the Florence-2 input before pulsing Process Image.
  • Model load fails: Check the SideCar logs for GPU memory or download errors. Larger models (large, large-ft) require more VRAM. Try a base variant if memory is limited.
  • Missing packages dialog: The operator checks for torch, transformers, timm, and einops on initialization. If any are missing, it will prompt to install them. Note that torch must be version 2.1.1 or greater and must be compatible with your TouchDesigner build.
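The package check described above can be reproduced manually with importlib.metadata (which the component itself uses as of v1.0.1). The version-parsing helper below is a sketch, not the operator's code:

```python
from importlib.metadata import version, PackageNotFoundError

REQUIRED = ("torch", "transformers", "timm", "einops")
MIN_TORCH = (2, 1, 1)   # torch must be >= 2.1.1 per the note above

def parse_version(v: str):
    """Turn a version string like '2.1.1+cu121' into a comparable tuple."""
    core = v.split('+')[0]
    return tuple(int(p) for p in core.split('.')[:3] if p.isdigit())

def check_packages():
    """Return a list of missing or too-old packages (sketch, not the operator's code)."""
    problems = []
    for pkg in REQUIRED:
        try:
            v = version(pkg)
        except PackageNotFoundError:
            problems.append(pkg)
            continue
        if pkg == "torch" and parse_version(v) < MIN_TORCH:
            problems.append(f"torch>=2.1.1 (found {v})")
    return problems
```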
Parameters

Load Model (Load) op('florence').par.Load Pulse
  Default: False
Process Image (Process) op('florence').par.Process Pulse
  Default: False
Reset (Reset) op('florence').par.Reset Pulse
  Default: False
Active (Active) op('florence').par.Active Toggle
  Default: False
Status (Status) op('florence').par.Status Str
  Default: "" (Empty String)
Florence Model (Florencemodel) op('florence').par.Florencemodel Menu
  Default: microsoft/Florence-2-base
  Options: microsoft/Florence-2-base, microsoft/Florence-2-base-ft, microsoft/Florence-2-large, microsoft/Florence-2-large-ft, HuggingFaceM4/Florence-2-DocVQA, thwri/CogFlorence-2.1-Large, thwri/CogFlorence-2.2-Large, gokaygokay/Florence-2-SD3-Captioner, gokaygokay/Florence-2-Flux-Large, MiaoshouAI/Florence-2-base-PromptGen-v1.5, MiaoshouAI/Florence-2-large-PromptGen-v1.5, MiaoshouAI/Florence-2-base-PromptGen-v2.0, MiaoshouAI/Florence-2-large-PromptGen-v2.0
Precision (Precision) op('florence').par.Precision Menu
  Default: fp32
  Options: fp16, bf16, fp32
Attention Mechanism (Attention) op('florence').par.Attention Menu
  Default: sdpa
  Options: sdpa, flash_attention_2, eager
Task (Task) op('florence').par.Task Menu
  Default: detailed_caption
  Options: caption, region_caption, dense_region_caption, region_proposal, detailed_caption, more_detailed_caption, caption_to_phrase_grounding, referring_expression_segmentation, ocr, ocr_with_region, docvqa, prompt_gen_tags, prompt_gen_mixed_caption, prompt_gen_analyze
Input Prompt (Prompt) op('florence').par.Prompt Str
  Optional input prompt for tasks that require one
  Default: "" (Empty String)
Max Tokens (Maxtokens) op('florence').par.Maxtokens Int
  Default: 512
  Range: 1 to 4096
Num Beams (Numbeams) op('florence').par.Numbeams Int
  Default: 3
  Range: 1 to 64
Do Sample (Dosample) op('florence').par.Dosample Toggle
  Default: True
Random Seed (Seed) op('florence').par.Seed Int
  Default: 42
  Range: 0 to 18446700000000000000
  Slider Range: 0 to 10000000
Fill Region Masks (Fillmask) op('florence').par.Fillmask Toggle
  Default: True
Mask Selection (Maskselect) op('florence').par.Maskselect Str
  Comma-separated list of region indices or labels to mask
  Default: "" (Empty String)
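These parameters can also be configured from Python before loading the model. A sketch, assuming the operator lives at op('florence'); run inside TouchDesigner only:

```python
# Run inside TouchDesigner. Configure precision and generation settings
# before pulsing Load Model; names follow the parameter list above.
flo = op('florence')
flo.par.Precision = 'fp16'    # roughly halves VRAM use compared with fp32
flo.par.Attention = 'sdpa'    # default; flash_attention_2 typically needs the flash-attn package
flo.par.Maxtokens = 1024
flo.par.Numbeams = 3
flo.par.Dosample = False      # deterministic output regardless of Random Seed
```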
Version History

v1.0.3 (2026-03-16)
  • Added Florence2 vision model integration via vision_sidecar HTTP API
  • Implemented image processing with multiple model variants and tasks
  • Added async processing and result handling
v1.0.2 (2026-03-01)
  • Refactored to call the ml_server HTTP API directly instead of SideCar methods
  • Async image processing via TDAsyncIO
  • Base64 image encoding for HTTP transmission
  • Model loading via ml_server HTTP endpoint
  • OnModelLoaded updated for the TDAsyncIO callback pattern
v1.0.1 (2026-03-01)
  • Replace direct imports with importlib.metadata checks for TD 32050+ compatibility
v1.0.0 (2024-11-09)
  • Initial commit
  • Initial release