feat: Add CRW web scraping tools #7468

Open
us wants to merge 2 commits into microsoft:main from
us:feat/add-crw-tool

Conversation


@us us commented Mar 26, 2026

Summary

Adds CRW web scraping tools to autogen-ext. CRW is an open-source web scraper for AI agents — a single Rust binary with a built-in MCP server and Firecrawl-compatible REST API.

New tools:

  • CrwScrapeTool — scrape a single URL to markdown/HTML/plaintext (POST /v1/scrape)
  • CrwCrawlTool — crawl a website across multiple pages with depth/page limits (POST /v1/crawl)
  • CrwMapTool — discover all links on a site via crawling + sitemap (POST /v1/map)

All tools follow the existing BaseTool pattern, include type hints and docstrings, and are installable via pip install "autogen-ext[crw]". Includes a sample script demonstrating usage with an AssistantAgent.

Why CRW?

  • Zero-dependency single binary (Rust), easy to self-host
  • Drop-in Firecrawl-compatible API — agents that already work with Firecrawl can switch to CRW
  • Built-in chunking with BM25/cosine ranking for RAG pipelines
  • Open source (MIT)

Add CrwScrapeTool, CrwCrawlTool, and CrwMapTool as new tool extensions
for web scraping via CRW's Firecrawl-compatible REST API.
Copilot AI review requested due to automatic review settings March 26, 2026 14:07
Contributor

Copilot AI left a comment

Pull request overview

Adds CRW (Firecrawl-compatible) web scraping tools to autogen-ext, plus a new sample demonstrating how to use them from autogen-agentchat agents.

Changes:

  • Introduces CrwScrapeTool, CrwCrawlTool, and CrwMapTool wrappers around CRW’s /v1/scrape, /v1/crawl, and /v1/map endpoints.
  • Adds a runnable sample (app.py) and README explaining prerequisites and usage.
  • Adds a new crw optional dependency extra to autogen-ext.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 7 comments.

| File | Description |
|------|-------------|
| `python/samples/agentchat_crw_web_scraping/app.py` | New sample script showing agent + tool usage for scrape/map/crawl. |
| `python/samples/agentchat_crw_web_scraping/README.md` | Sample documentation and tool-to-endpoint mapping table. |
| `python/packages/autogen-ext/src/autogen_ext/tools/crw/_crw_tools.py` | New CRW tool implementations and request/response models. |
| `python/packages/autogen-ext/src/autogen_ext/tools/crw/__init__.py` | Exports CRW tool classes and result models. |
| `python/packages/autogen-ext/pyproject.toml` | Adds `crw` extra with httpx dependency. |

Comment on lines +60 to +66
class CrwScrapeTool(BaseTool[ScrapeArgs, ScrapeResult]):
    """Scrape a single URL and return its content as markdown, HTML, or plain text.

    Uses the CRW web scraper's ``POST /v1/scrape`` endpoint. CRW is an open-source,
    high-performance web scraper built in Rust with a Firecrawl-compatible REST API.

    .. note::

Copilot AI Mar 26, 2026

There are no unit tests added for the new CRW tools. autogen-ext already has a tests/tools suite (e.g., for HttpTool), so it would be good to add tests that mock CRW responses (via httpx.MockTransport/monkeypatching) to cover: success paths, success:false/missing fields, and crawl polling termination/error cases.

Comment on lines +9 to +13
| Tool | Function | CRW Endpoint |
|------|----------|-------------|
| `CrwScrapeTool` | Scrape a single URL | `POST /v1/scrape` |
| `CrwCrawlTool` | Crawl a website (multi-page) | `POST /v1/crawl` + `GET /v1/crawl/{id}` |
| `CrwMapTool` | Discover all links on a site | `POST /v1/map` |

Copilot AI Mar 26, 2026

The markdown table has an extra leading pipe (|| ...) on the header and separator rows, which renders as an empty first column in most markdown parsers. Please format the table with a single leading | per row.

Author

us commented Mar 26, 2026

@microsoft-github-policy-service agree

- Remove unused Console import from sample app
- Fix markdown table formatting in README
- Remove unused Literal import from _crw_tools.py
- Update CrwScrapeTool description to reflect all output formats
- Validate initial POST response in CrwCrawlTool before polling
- Add max poll limit, cancellation token check, and reuse httpx client
- Add unit tests for CrwScrapeTool, CrwCrawlTool, and CrwMapTool
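The bounded polling described in this commit message could look roughly like the sketch below. It is an illustration under stated assumptions, not the code in `_crw_tools.py`: `poll_crawl`, `fetch_status`, and `is_cancelled` are hypothetical names, and the `"completed"` / `"failed"` status strings are assumed from Firecrawl's crawl-status responses. In the real tool, `fetch_status` would wrap `GET /v1/crawl/{id}` on a reused httpx client and `is_cancelled` would consult the cancellation token.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict


async def poll_crawl(
    fetch_status: Callable[[], Awaitable[Dict[str, Any]]],
    *,
    max_polls: int = 60,
    interval: float = 2.0,
    is_cancelled: Callable[[], bool] = lambda: False,
) -> Dict[str, Any]:
    """Poll a crawl job until it finishes, fails, is cancelled, or hits the limit."""
    for _ in range(max_polls):
        # Check for cancellation before every request so the loop exits promptly.
        if is_cancelled():
            raise asyncio.CancelledError("crawl polling cancelled")
        status = await fetch_status()
        state = status.get("status")
        if state == "completed":  # assumed terminal success value
            return status
        if state == "failed":  # assumed terminal failure value
            raise RuntimeError(status.get("error", "crawl failed"))
        await asyncio.sleep(interval)
    # The poll limit guarantees termination even if the job never completes.
    raise TimeoutError(f"crawl did not finish within {max_polls} polls")
```

Passing the status fetcher in as a callable keeps the loop testable without a server, which fits the unit tests added in the same commit.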

2 participants