
Save Sources

v0.1.0 (New)

The Save Sources LOP converts rows from an input table DAT into individual Markdown files on disk. It is designed as a bridge between content acquisition operators (like web scrapers or document processors) and the RAG Index LOP, which ingests folders of Markdown files for retrieval-augmented generation.

This operator has no wired inputs. It reads from an internal input_table DAT that receives data from an upstream operator connection. The table must contain at least two columns:

  • doc_id — unique identifier per row, used as the final fallback filename
  • content — the text to write into each Markdown file

Optional columns:

  • source_path — URL used for filename generation when “Use URL for Filename” is enabled
  • A custom column specified in “Filename Column (Optional)” for alternative filename sourcing

The operator has one output, which passes through the input table for downstream chaining. The primary result of this operator, however, is the set of .md files written to disk in the configured output folder.
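
The column requirements above can be checked before exporting. The following is a minimal sketch in plain Python, assuming the table is available as a list of rows with the header first; `missing_columns` and the sample data are hypothetical and not part of the operator.

```python
REQUIRED_COLUMNS = ("doc_id", "content")


def missing_columns(rows):
    """Return the required column names that are absent from the header row."""
    if not rows:
        # No header row at all: every required column is missing.
        return list(REQUIRED_COLUMNS)
    header = [str(cell) for cell in rows[0]]
    return [name for name in REQUIRED_COLUMNS if name not in header]


# Hypothetical sample table: header row first, then one data row.
sample = [
    ["doc_id", "content", "source_path"],
    ["a1", "# Title\n\nBody text.", "https://example.com/articles/machine-learning"],
]
print(missing_columns(sample))
```

An empty result means both required columns are present; the optional source_path column is not checked here.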

  1. Connect an upstream operator (such as a web scraper) so that its output populates the internal input_table DAT with doc_id and content columns.
  2. On the Save Config page, set “Output Folder” to the directory where files should be saved.
  3. Pulse “Save Markdown Files” to begin the export.
  4. Monitor “Current Status”, “Progress (%)”, and “Files Saved” to track the operation.
  5. When complete, the status will show the number of files saved and the time elapsed.
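
The same steps can also be driven from a script. This is a sketch only, assuming the parameter names listed in the parameter reference (Outputfolder, Savemarkdown, Status, Progress, Filessaved) and an operator named save_sources; the helper names and the folder path are placeholders.

```python
def start_export(save_op, output_folder):
    """Set the output folder, then pulse the save parameter."""
    save_op.par.Outputfolder = output_folder
    save_op.par.Savemarkdown.pulse()


def read_progress(save_op):
    """Collect the monitoring parameters into a dict."""
    return {
        "status": save_op.par.Status.eval(),
        "progress": save_op.par.Progress.eval(),
        "files_saved": save_op.par.Filessaved.eval(),
    }


# Inside TouchDesigner (operator name and folder are placeholders):
# start_export(op('save_sources'), 'C:/knowledge_base/markdown')
# print(read_progress(op('save_sources')))
```

Because the operator reports progress while it runs, `read_progress` would typically be called some time after the pulse rather than immediately.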

When saving content scraped from the web, enable “Use URL for Filename” to generate meaningful filenames from the source_path column:

  • https://example.com/articles/machine-learning becomes articles_machine-learning.md
  • https://site.com/docs/tutorial.html becomes docs_tutorial.md
  • https://blog.com/index.php?id=123 becomes index_php_id_123.md

The operator strips common extensions (.html, .php), sanitizes special characters, and truncates filenames to 100 characters.
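
One way to reproduce the three examples above is sketched below. This is not the operator's actual implementation: the function name, the exact character whitelist, and the choice to truncate the stem before appending .md are assumptions made for illustration.

```python
import re
from urllib.parse import urlparse

# Extensions named in the text above; the operator may strip others as well.
STRIPPED_EXTENSIONS = (".html", ".php")


def url_to_filename(url, max_length=100):
    """Turn a URL into a Markdown filename, or return None if nothing usable remains."""
    parsed = urlparse(url)
    stem = parsed.path.strip("/")
    if parsed.query:
        stem = f"{stem}?{parsed.query}"
    # Strip a known extension only when it ends the combined path and query.
    for ext in STRIPPED_EXTENSIONS:
        if stem.lower().endswith(ext):
            stem = stem[: -len(ext)]
            break
    # Replace every run of characters outside letters, digits, '_' and '-'.
    stem = re.sub(r"[^A-Za-z0-9_-]+", "_", stem).strip("_")
    if not stem:
        return None
    return stem[:max_length] + ".md"


for url in (
    "https://example.com/articles/machine-learning",
    "https://site.com/docs/tutorial.html",
    "https://blog.com/index.php?id=123",
):
    print(url_to_filename(url))
```

Returning None for a URL with no path (such as a bare domain) leaves room for a caller to fall back to another filename source.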

To use a specific column for filenames instead of URLs or document IDs:

  1. Add a column to your input table with the desired filenames (e.g., a column named filename).
  2. On the Save Config page, enter the column name in “Filename Column (Optional)”.
  3. If “Use URL for Filename” is also enabled, the URL method is tried first and this column serves as a fallback.

Set “Filename Prefix (Optional)” to prepend a string to every saved filename. For example, a prefix of project_ produces files like project_articles_machine-learning.md.

  1. Use a source operator to scrape or import content into a table.
  2. Connect its output to the Save Sources operator so that it populates the internal input_table DAT.
  3. Set “Output Folder” to your knowledge base directory.
  4. Enable “Use URL for Filename” for web content, or configure a filename column for other sources.
  5. Pulse “Save Markdown Files” to export.
  6. Point a RAG Index operator at the same output folder to ingest the saved files.

The operator resolves filenames using a three-tier fallback strategy:

  1. URL-based — if “Use URL for Filename” is enabled and a valid source_path exists, the URL is parsed and sanitized into a filename.
  2. Fallback column — if URL generation fails or is disabled, the column specified in “Filename Column (Optional)” is used.
  3. Document ID — if both above methods fail, the doc_id column value is used as the filename.

Every row is guaranteed to produce a filename through this chain.
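
The fallback chain can be sketched as a single function. This is an illustration under stated assumptions, not the operator's code: rows are modeled as dicts, `url_namer` is a hypothetical callable that converts a URL to a filename stem (or returns None), and sanitizing the fallback values is assumed rather than documented.

```python
import re


def sanitize(name):
    """Replace characters outside letters, digits, '_' and '-' with underscores."""
    return re.sub(r"[^A-Za-z0-9_-]+", "_", str(name)).strip("_")


def resolve_stem(row, use_url=False, filename_column="", url_namer=None):
    """Pick a filename stem for one row (a dict of column name to value)."""
    # Tier 1: URL-based, only when enabled and a source_path value exists.
    if use_url and url_namer and row.get("source_path"):
        stem = url_namer(row["source_path"])
        if stem:
            return stem
    # Tier 2: the user-specified filename column, if present and non-empty.
    if filename_column and row.get(filename_column):
        stem = sanitize(row[filename_column])
        if stem:
            return stem
    # Tier 3: doc_id is required, so it is always available as the last resort.
    return sanitize(row["doc_id"])


row = {"doc_id": "doc 1", "content": "text", "source_path": "", "filename": "my file"}
print(resolve_stem(row))
print(resolve_stem(row, filename_column="filename"))
```

Each tier falls through only when it yields nothing usable, which is what lets the chain end at doc_id.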

By default, “Overwrite Existing Files” is off. When disabled, the operator skips any file that already exists at the target path. Enable it to replace existing files during re-exports or content updates.
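
The skip-or-replace behavior described above amounts to one existence check before writing. A minimal sketch, assuming UTF-8 output and a hypothetical helper name:

```python
from pathlib import Path


def write_markdown(folder, stem, content, overwrite=False):
    """Write content to <folder>/<stem>.md; return True if written, False if skipped."""
    target = Path(folder) / f"{stem}.md"
    if target.exists() and not overwrite:
        # Mirrors the default behavior: leave the existing file untouched.
        return False
    target.write_text(content, encoding="utf-8")
    return True
```

Returning a boolean makes it easy for a caller to count skipped files separately from saved ones.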

“Error: Missing ‘doc_id’ column” or “Missing ‘content’ column”
The input table must have both doc_id and content as column headers in the first row. Verify that your upstream operator is producing the expected table format.

“Error: Output folder invalid”
The specified folder path could not be resolved. Ensure that the path exists or that its parent directory exists (the operator will attempt to create the final folder). Use absolute paths to avoid ambiguity.

“Error: Input table empty/no header”
The input table is empty or has no data rows beyond the header. Confirm that your source operator has finished populating the table before pulsing “Save Markdown Files”.

Files not appearing despite successful status
If “Overwrite Existing Files” is off and files with the same names already exist, they are silently skipped. Enable overwrite or use a different prefix to generate unique filenames.

Resetting after errors
Pulse “Clear Status” to reset the status, progress, and file count back to their initial state.
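
The output folder rule from the troubleshooting notes (the path must exist, or its parent must exist so that the final folder can be created) can be checked ahead of time. This is a sketch with a hypothetical helper name and error messages, not the operator's own validation:

```python
from pathlib import Path


def prepare_output_folder(path_str):
    """Return the folder as a Path, creating only the final directory if needed."""
    if not path_str:
        raise ValueError("output folder is empty")
    folder = Path(path_str).expanduser()
    if folder.is_dir():
        return folder
    if folder.exists():
        raise ValueError(f"not a directory: {folder}")
    if not folder.parent.is_dir():
        raise ValueError(f"parent directory does not exist: {folder.parent}")
    # Only the last path component is created; missing parents are an error.
    folder.mkdir()
    return folder
```

Passing an absolute path avoids depending on the process's current working directory.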

Output Folder (Outputfolder) op('save_sources').par.Outputfolder Folder

The directory where Markdown files will be saved.

Default: "" (Empty String)

Filename Prefix (Optional) (Filenameprefix) op('save_sources').par.Filenameprefix Str

Optional prefix to add to the beginning of each saved filename (before the doc_id).

Default: "" (Empty String)

Filename Column (Optional) (Filenamecolumn) op('save_sources').par.Filenamecolumn Str

Optional: Specify a column name (e.g., "filename") to use for filenames instead of "doc_id". If empty or column not found, "doc_id" is used.

Default: "" (Empty String)

Overwrite Existing Files (Overwrite) op('save_sources').par.Overwrite Toggle

If enabled, existing Markdown files with the same name will be overwritten.

Default: False

Save Markdown Files (Savemarkdown) op('save_sources').par.Savemarkdown Pulse

Starts the process of saving content from the input DAT to Markdown files.

Default: False

Clear Status (Clearstatus) op('save_sources').par.Clearstatus Pulse

Resets the status, progress, and files saved counters.

Default: False

Current Status (Status) op('save_sources').par.Status Str

Default: "" (Empty String)

Progress (%) (Progress) op('save_sources').par.Progress Float

Default: 0.0
Range: 0 to 1
Slider Range: 0 to 1

Files Saved (Filessaved) op('save_sources').par.Filessaved Int

Default: 0
Range: 0 to 1
Slider Range: 0 to 1

Use URL for Filename (Useurlasfilename) op('save_sources').par.Useurlasfilename Toggle

If enabled, attempts to create a safe filename from the "source_path" column URL.

Default: False