Add CrwScrapeTool, CrwCrawlTool, and CrwMapTool as new tool extensions for web scraping via CRW's Firecrawl-compatible REST API.
Pull request overview
Adds CRW (Firecrawl-compatible) web scraping tools to autogen-ext, plus a new sample demonstrating how to use them from autogen-agentchat agents.
Changes:
- Introduces `CrwScrapeTool`, `CrwCrawlTool`, and `CrwMapTool` wrappers around CRW's `/v1/scrape`, `/v1/crawl`, and `/v1/map` endpoints.
- Adds a runnable sample (`app.py`) and README explaining prerequisites and usage.
- Adds a new `crw` optional dependency extra to `autogen-ext`.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| python/samples/agentchat_crw_web_scraping/app.py | New sample script showing agent + tool usage for scrape/map/crawl. |
| python/samples/agentchat_crw_web_scraping/README.md | Sample documentation and tool-to-endpoint mapping table. |
| python/packages/autogen-ext/src/autogen_ext/tools/crw/_crw_tools.py | New CRW tool implementations and request/response models. |
| python/packages/autogen-ext/src/autogen_ext/tools/crw/__init__.py | Exports CRW tool classes and result models. |
| python/packages/autogen-ext/pyproject.toml | Adds crw extra with httpx dependency. |
class CrwScrapeTool(BaseTool[ScrapeArgs, ScrapeResult]):
    """Scrape a single URL and return its content as markdown, HTML, or plain text.

    Uses the CRW web scraper's ``POST /v1/scrape`` endpoint. CRW is an open-source,
    high-performance web scraper built in Rust with a Firecrawl-compatible REST API.

    .. note::
There are no unit tests added for the new CRW tools. autogen-ext already has a tests/tools suite (e.g., for HttpTool), so it would be good to add tests that mock CRW responses (via httpx.MockTransport/monkeypatching) to cover: success paths, success:false/missing fields, and crawl polling termination/error cases.
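The cases listed in this comment could be covered along the following lines. This is a minimal, dependency-free sketch rather than the PR's actual tests: `parse_scrape_payload` and `CrwResponseError` are hypothetical names, and the `{"success": ..., "data": {"markdown": ...}}` payload shape is an assumption based on the Firecrawl-style API the PR describes. In the real suite the payloads would be served through a mocked HTTP transport instead of being passed in directly.

```python
import unittest
from typing import Any, Dict


class CrwResponseError(ValueError):
    """Hypothetical error raised when a CRW payload is unusable."""


def parse_scrape_payload(payload: Dict[str, Any]) -> str:
    """Return the markdown content from an assumed {"success", "data"} payload."""
    if payload.get("success") is not True:
        # Surface the server-side message when one is present.
        raise CrwResponseError(payload.get("error", "CRW reported success=false"))
    data = payload.get("data")
    if not isinstance(data, dict) or "markdown" not in data:
        raise CrwResponseError("CRW response is missing data.markdown")
    return str(data["markdown"])


class ParseScrapePayloadTests(unittest.TestCase):
    def test_success(self) -> None:
        payload = {"success": True, "data": {"markdown": "# Title"}}
        self.assertEqual(parse_scrape_payload(payload), "# Title")

    def test_success_false(self) -> None:
        with self.assertRaises(CrwResponseError) as ctx:
            parse_scrape_payload({"success": False, "error": "blocked"})
        self.assertEqual(str(ctx.exception), "blocked")

    def test_missing_fields(self) -> None:
        # Neither a missing "data" object nor an empty one should pass.
        with self.assertRaises(CrwResponseError):
            parse_scrape_payload({"success": True})
        with self.assertRaises(CrwResponseError):
            parse_scrape_payload({"success": True, "data": {}})
```

Keeping the validation in a small pure function makes the failure paths testable without a running CRW server; the crawl polling cases would need an async test on top of this.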
| Tool | Function | CRW Endpoint |
|------|----------|-------------|
| `CrwScrapeTool` | Scrape a single URL | `POST /v1/scrape` |
| `CrwCrawlTool` | Crawl a website (multi-page) | `POST /v1/crawl` + `GET /v1/crawl/{id}` |
| `CrwMapTool` | Discover all links on a site | `POST /v1/map` |
The markdown table has an extra leading pipe (|| ...) on the header and separator rows, which renders as an empty first column in most markdown parsers. Please format the table with a single leading | per row.
@microsoft-github-policy-service agree
- Remove unused Console import from sample app
- Fix markdown table formatting in README
- Remove unused Literal import from _crw_tools.py
- Update CrwScrapeTool description to reflect all output formats
- Validate initial POST response in CrwCrawlTool before polling
- Add max poll limit, cancellation token check, and reuse httpx client
- Add unit tests for CrwScrapeTool, CrwCrawlTool, and CrwMapTool
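The polling changes in this commit (a maximum poll count and a cancellation check) can be illustrated with a small sketch. It is not the code from `_crw_tools.py`: `poll_crawl`, `CancelFlag`, `CrawlCancelledError`, and `make_fetcher` are hypothetical names, the `"completed"` and `"failed"` status strings are assumed, and the HTTP call is replaced by an injected coroutine so the example runs with the standard library alone.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, Iterable


class CrawlCancelledError(Exception):
    """Hypothetical error raised when polling is cancelled by the caller."""


class CancelFlag:
    """Hypothetical stand-in for a cancellation token (not the autogen-core class)."""

    def __init__(self) -> None:
        self._cancelled = False

    def cancel(self) -> None:
        self._cancelled = True

    def is_cancelled(self) -> bool:
        return self._cancelled


async def poll_crawl(
    fetch_status: Callable[[], Awaitable[Dict[str, Any]]],
    cancel: CancelFlag,
    max_polls: int = 5,
    interval: float = 0.0,
) -> Dict[str, Any]:
    """Poll until the job completes, fails, is cancelled, or the poll budget runs out."""
    for _ in range(max_polls):
        # Check for cancellation before every request so a cancelled job stops promptly.
        if cancel.is_cancelled():
            raise CrawlCancelledError("crawl polling cancelled")
        payload = await fetch_status()
        status = payload.get("status")
        if status == "completed":
            return payload
        if status == "failed":
            raise RuntimeError(payload.get("error", "crawl failed"))
        await asyncio.sleep(interval)
    # The loop is bounded, so a job that never finishes cannot hang the caller.
    raise TimeoutError(f"crawl did not finish within {max_polls} polls")


def make_fetcher(statuses: Iterable[str]) -> Callable[[], Awaitable[Dict[str, Any]]]:
    """Build a fake status endpoint that replays the given statuses in order."""
    remaining = list(statuses)

    async def fetch_status() -> Dict[str, Any]:
        return {"status": remaining.pop(0)}

    return fetch_status
```

For example, `asyncio.run(poll_crawl(make_fetcher(["scraping", "completed"]), CancelFlag()))` returns the completed payload on the second poll, while a fetcher that never reports completion raises `TimeoutError` once `max_polls` is exhausted.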
Summary
Adds CRW web scraping tools to `autogen-ext`. CRW is an open-source web scraper for AI agents: a single Rust binary with a built-in MCP server and Firecrawl-compatible REST API.

New tools:
- `CrwScrapeTool`: scrape a single URL to markdown/HTML/plaintext (`POST /v1/scrape`)
- `CrwCrawlTool`: crawl a website across multiple pages with depth/page limits (`POST /v1/crawl`)
- `CrwMapTool`: discover all links on a site via crawling + sitemap (`POST /v1/map`)

All tools follow the existing `BaseTool` pattern, include type hints and docstrings, and are installable via `pip install "autogen-ext[crw]"`. Includes a sample script demonstrating usage with an AssistantAgent.

Why CRW?