Source Crawl4ai Operator
Source Crawl4AI v1.3.0 [ September 2, 2025 ]
- Table input mode support
- Better deduplication algorithms
- Enhanced multi-agent source gathering
- Improved simultaneous handling of multiple sources
The Source Crawl4ai LOP utilizes the crawl4ai Python library to fetch content from web pages, sitemaps, or lists of URLs. It uses headless browsers (via Playwright) to render pages, extracts the main content, converts it to Markdown, and structures the output into a DAT table compatible with the Rag Index operator. It supports various crawling modes, URL filtering, and resource management features like concurrency limits and adaptive memory usage control.
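Under the hood, the operator drives crawl4ai’s async API. As a rough sketch of the kind of call it wraps (run outside TouchDesigner; the operator’s actual browser and extraction configuration is not documented here):

```python
# A minimal crawl4ai sketch: fetch one page and print its Markdown.
# Illustrates the wrapped library only; the operator's own settings
# (browser options, filters, dispatcher) may differ.
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://docs.derivative.ca/Introduction_to_Python")
        print(str(result.markdown)[:500])  # first 500 characters of extracted Markdown

asyncio.run(main())
```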

Requirements
- Python Packages:
  - crawl4ai: The core crawling library.
  - playwright: Required by crawl4ai for browser automation.
  - requests: Implicitly needed for sitemap fetching.
  These can be installed via the ChatTD operator’s Python manager by first installing crawl4ai.
- Playwright Browsers: After installing the Python packages, the necessary browser binaries must be downloaded using the Install/Update Playwright Browsers parameter on this operator.
- ChatTD Operator: Required for dependency management (package installation) and asynchronous task execution. Ensure the ChatTD Operator parameter on the ‘About’ page points to your configured ChatTD instance.
Input/Output
Inputs
None
Outputs
- Output Table (DAT): The primary output, containing the crawled content. Columns match the requirements for the Rag Index operator:
  - doc_id: Unique ID for the crawled page/chunk.
  - filename: Source URL of the crawled page.
  - content: Crawled content formatted as Markdown.
  - metadata: JSON string containing source URL, timestamp, content length, etc.
  - source_path: Source URL (duplicate of filename).
  - timestamp: Unix timestamp of when the content was processed.
- Internal DATs: (Accessible via the operator viewer) index_table (summary view) and content_table (detailed view of the selected doc).
- Status/Log: Information is logged via the linked Logger component within ChatTD. Key status info is also reflected in the Status, Progress, and URLs Processed parameters.
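For downstream scripting, the Output Table behaves like any other DAT. A minimal sketch, assuming the output has been wired into a DAT named ‘crawl_results’ (a hypothetical name; adjust to your network):

```python
# Iterate the crawl results table and parse each row's metadata JSON.
# 'crawl_results' is a hypothetical DAT name for the wired output table.
import json

table = op('crawl_results')
for r in range(1, table.numRows):      # row 0 holds the column headers
    url = table[r, 'filename'].val     # source URL of the crawled page
    md = table[r, 'content'].val       # content as Markdown
    meta = json.loads(table[r, 'metadata'].val)
    print(url, len(md), sorted(meta))  # metadata keys are not documented here
```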
Parameters
Page: Crawl Config

| Parameter | Type | Default | Range |
| --- | --- | --- | --- |
| op('source_crawl4ai').par.Url | Str | None | |
| op('source_crawl4ai').par.Urltable | OP | None | |
| op('source_crawl4ai').par.Includepatterns | Str | None | |
| op('source_crawl4ai').par.Excludepatterns | Str | None | |
| op('source_crawl4ai').par.Status | Str | None | |
| op('source_crawl4ai').par.Progress | Float | None | |
| op('source_crawl4ai').par.Urlsprocessed | Int | None | |
| op('source_crawl4ai').par.Startcrawl | Pulse | None | |
| op('source_crawl4ai').par.Stopcrawl | Pulse | None | |
| op('source_crawl4ai').par.Clearontable | Toggle | None | |
| op('source_crawl4ai').par.Usehistory | Toggle | None | |
| op('source_crawl4ai').par.Clearhistory | Pulse | None | |
| op('source_crawl4ai').par.Maxdepth | Int | 2 | 1 to 10 |
| op('source_crawl4ai').par.Maxconcurrent | Int | 5 | 1 to 20 |
| op('source_crawl4ai').par.Memorythreshold | Float | 70.0 | 30 to 95 |
| op('source_crawl4ai').par.Installplaywright | Pulse | None | |
| op('source_crawl4ai').par.Clearoutput | Pulse | None | |
| op('source_crawl4ai').par.Displayfile | Str | None | |
| op('source_crawl4ai').par.Selectdoc | Int | 1 | 1 to 1 |
Page: Agents
| Parameter | Type | Default |
| --- | --- | --- |
| op('source_crawl4ai').par.Agenttotable | Toggle | None |
Page: About
| Parameter | Type | Default |
| --- | --- | --- |
| op('source_crawl4ai').par.Bypass | Toggle | None |
| op('source_crawl4ai').par.Showbuiltin | Toggle | None |
| op('source_crawl4ai').par.Version | Str | None |
| op('source_crawl4ai').par.Lastupdated | Str | None |
| op('source_crawl4ai').par.Creator | Str | None |
| op('source_crawl4ai').par.Website | Str | None |
| op('source_crawl4ai').par.Chattd | OP | None |
| op('source_crawl4ai').par.Clearlog | Pulse | None |
| op('source_crawl4ai').par.Converttotext | Toggle | None |
Agent Tool Integration
This operator exposes two tools that allow Agent and Gemini Live LOPs to crawl web pages and websites and extract their content, supporting both single-page crawling and full recursive website crawling for AI-driven content gathering.
Use the Tool Debugger operator to inspect exact tool definitions, schemas, and parameters.
The Source Crawl4ai LOP can be used as a tool by Agent LOPs, allowing an AI to autonomously crawl web pages and websites to gather information.
Available Tools
When connected to an Agent, this operator provides the following functions:
- crawl_single_page(url): Fetches and returns the text content of a single, specific web page. Best used when the agent needs the contents of one exact URL.
- crawl_full_website_recursively(url, max_depth=2): Crawls an entire website by following internal links, starting from a given URL. It processes up to 20 pages to gather comprehensive information. Ideal when an agent needs to understand the content of a whole website, not just a single page.
How It Works
- Connect to Agent: Add the Source Crawl4ai LOP to the Tool sequence parameter on an Agent LOP.
- Agent Prompts: When the Agent receives a prompt that requires web content, it can choose to call one of the crawl tools.
- Execution: The Source Crawl4ai LOP executes the crawl asynchronously and returns the extracted Markdown content to the Agent.
- Response: The Agent then uses this content to formulate its response.
Example Agent Prompt
“Please summarize the main points from the article at https://example.com/news/latest-ai-breakthroughs and also give me an overview of the company’s products from their website.”

In this scenario, the Agent could:
- Call crawl_single_page with the URL https://example.com/news/latest-ai-breakthroughs.
- Call crawl_full_website_recursively with the URL https://example.com/products.
- Use the content from both tool calls to generate a comprehensive summary and overview.
Usage Examples
Section titled “Usage Examples”Crawling a Single Page
- Set ‘Target URL / Sitemap / .txt’ to the full URL (e.g., https://docs.derivative.ca/Introduction_to_Python).
- Set ‘Crawl Mode’ to ‘Single Page’.
- Pulse ‘Start Crawl’.
- Monitor ‘Status’ and view results in the Output Table DAT.
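The same steps can be scripted. Note that the crawl-mode parameter is not listed in the reference above, so ‘Crawlmode’ and its menu value below are assumptions; check the operator’s parameter dialog for the exact names:

```python
# Scripted single-page crawl. 'Crawlmode' is a hypothetical parameter
# name, and the menu value is shown as its label; the internal menu
# name may differ.
crawler = op('source_crawl4ai')
crawler.par.Url = 'https://docs.derivative.ca/Introduction_to_Python'
crawler.par.Crawlmode = 'Single Page'  # hypothetical name and menu value
crawler.par.Startcrawl.pulse()
```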
Crawling from a Sitemap
- Set ‘Target URL / Sitemap / .txt’ to the exact URL of the sitemap (e.g., https://example.com/sitemap.xml).
- Set ‘Crawl Mode’ to ‘Sitemap Batch’.
- (Optional) Set ‘Include/Exclude URL Patterns’ to filter URLs from the sitemap.
- Adjust ‘Max Concurrent Sessions’ based on your system.
- Pulse ‘Start Crawl’.
- Monitor ‘Status’ and ‘Progress’.
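Scripted, this might look like the sketch below (‘Crawlmode’ remains a hypothetical name, as in the single-page sketch; the pattern and concurrency parameters are from the reference table):

```python
# Scripted sitemap crawl with URL filtering and a concurrency cap.
crawler = op('source_crawl4ai')
crawler.par.Url = 'https://example.com/sitemap.xml'
crawler.par.Crawlmode = 'Sitemap Batch'   # hypothetical name and menu value
crawler.par.Includepatterns = '*/docs/*'  # keep only documentation URLs
crawler.par.Excludepatterns = '*.pdf'     # skip PDF links
crawler.par.Maxconcurrent = 5
crawler.par.Startcrawl.pulse()
```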
Recursive Crawl of a Small Site Section
- Set ‘Target URL / Sitemap / .txt’ to the starting page (e.g., https://yoursite.com/documentation/).
- Set ‘Crawl Mode’ to ‘Crawl Site Links’.
- Set ‘Max Depth’ (e.g., 3). Be cautious with high values on large sites.
- (Optional) Set ‘Exclude URL Patterns’ to avoid specific sections (e.g., /blog /forum).
- Adjust ‘Max Concurrent Sessions’.
- Pulse ‘Start Crawl’.
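And a scripted variant of the recursive setup (same caveat about ‘Crawlmode’; Maxdepth, Excludepatterns, and Maxconcurrent are from the reference table):

```python
# Scripted recursive crawl of one site section.
crawler = op('source_crawl4ai')
crawler.par.Url = 'https://yoursite.com/documentation/'
crawler.par.Crawlmode = 'Crawl Site Links'  # hypothetical name and menu value
crawler.par.Maxdepth = 3                    # be cautious on large sites
crawler.par.Excludepatterns = '*/blog/* */forum/*'
crawler.par.Maxconcurrent = 5
crawler.par.Startcrawl.pulse()
```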
Initial Setup (Installation)
- Ensure the ‘ChatTD Operator’ parameter points to your ChatTD instance.
- Use ChatTD’s Python Manager to install the ‘crawl4ai’ package.
- Return to this operator. Pulse the ‘Install/Update Playwright Browsers’ parameter.
- Monitor the Textport for download progress. Installation is complete when the logs indicate success.
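Steps 1 and 3 can also be done from a script, using parameters from the reference table above:

```python
# Verify the ChatTD binding, then trigger the Playwright browser download.
crawler = op('source_crawl4ai')
print(crawler.par.Chattd.eval())       # should point at your ChatTD instance
crawler.par.Installplaywright.pulse()  # downloads the browser binaries
```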
Technical Notes
Section titled “Technical Notes”- Dependencies: Requires
crawl4aiandplaywrightPython packages, installable via ChatTD. Crucially, Playwright also needs browser binaries downloaded via theInstall/Update Playwright Browsersparameter pulse. - Resource Usage: Crawling, especially in batch modes (
Sitemap,Recursive,Text File), uses headless browsers and can consume significant CPU, RAM, and network bandwidth. - Concurrency: Adjust
Max Concurrent Sessionscarefully. Too high can destabilize TouchDesigner or your system. - Memory Management: The
Memory Threshold (%)helps prevent crashes on large crawls by pausing new sessions when system RAM usage is high. - Filtering: Use
Include URL PatternsandExclude URL Patternseffectively to limit the scope of crawls and avoid unwanted pages or file types. Wildcards (*,?) are supported. - Output Format: Content is output as Markdown in the
contentcolumn of the output DAT, ready for ingestion by the Rag Index operator. - Stopping: Pulsing
Stop Crawlattempts a graceful shutdown, but currently active browser tasks might take time to fully terminate.
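The operator’s exact pattern matcher is not documented here; fnmatch-style globbing is one plausible reading of the * and ? wildcards. An illustrative sketch of how include/exclude patterns would select URLs under that assumption:

```python
# Illustrative wildcard filtering with Python's fnmatch (an assumption;
# the operator's actual matcher may differ in details such as case).
from fnmatch import fnmatch

urls = [
    'https://yoursite.com/documentation/intro',
    'https://yoursite.com/blog/2025/news',
    'https://yoursite.com/assets/manual.pdf',
]
include = '*/documentation/*'
exclude = ('*/blog/*', '*.pdf')

kept = [u for u in urls
        if fnmatch(u, include) and not any(fnmatch(u, e) for e in exclude)]
print(kept)  # -> ['https://yoursite.com/documentation/intro']
```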
Related Operators
- Rag Index: Ingests the output of this operator to create a searchable index.
- ChatTD: Provides core services like dependency management and asynchronous task execution required by this operator.
- Source Webscraper: An alternative web scraping operator using a different backend (aiohttp, trafilatura). It may be lighter weight for simpler scraping tasks that don’t require full browser rendering.