feat: Add StagehandCrawler with AI-powered browser automation #1854
Mantisus wants to merge 15 commits into apify:master
Co-authored-by: Copilot <copilot@github.com>
Pull request overview
Adds first-class Stagehand integration to Crawlee Python by introducing a StagehandCrawler (built on PlaywrightCrawler) plus corresponding browser-pool plugin/controller, enabling AI-driven page actions (act, extract, observe, execute) while keeping Crawlee’s existing routing/sessions/proxy/navigation-hook features.
Changes:
- Introduces `StagehandCrawler` + Stagehand-specific crawling contexts and exports them from `crawlee.crawlers`.
- Adds `StagehandBrowserPlugin`/`StagehandBrowserController`, `StagehandOptions`, and `StagehandPage`, integrated with `BrowserPool`.
- Adds Stagehand documentation + examples, updates architecture docs, and replaces the older “Playwright with Stagehand” guide; updates dependencies and adds unit tests.
Reviewed changes
Copilot reviewed 21 out of 23 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| uv.lock | Locks new optional Stagehand dependency set and adds stagehand extra resolution entries. |
| pyproject.toml | Adds stagehand optional dependency group and includes it in all. |
| src/crawlee/browsers/__init__.py | Exposes Stagehand browser plugin/controller and types via optional imports. |
| src/crawlee/browsers/_stagehand_types.py | Defines StagehandOptions and StagehandPage AI-method wrappers. |
| src/crawlee/browsers/_stagehand_browser_plugin.py | Implements StagehandBrowserPlugin lifecycle and Stagehand client initialization. |
| src/crawlee/browsers/_stagehand_browser_controller.py | Implements CDP connection + lazy session start, page creation, and header injection for Stagehand. |
| src/crawlee/crawlers/__init__.py | Exposes Stagehand crawler + contexts via optional imports. |
| src/crawlee/crawlers/_stagehand/__init__.py | Adds Stagehand crawler module exports with optional-deps handling. |
| src/crawlee/crawlers/_stagehand/_stagehand_crawler.py | Adds StagehandCrawler built on PlaywrightCrawler and auto-configures a Stagehand BrowserPool. |
| src/crawlee/crawlers/_stagehand/_stagehand_crawling_context.py | Adds Stagehand-specific crawling context dataclasses and type-narrowed page. |
| src/crawlee/crawlers/_playwright/_playwright_crawler.py | Refactors Playwright crawler to support overridable context classes and generic context typing via _build_context. |
| tests/unit/browsers/test_stagehand_browser_plugin.py | Adds unit tests for plugin activation and Stagehand client init parameter wiring. |
| tests/unit/browsers/test_stagehand_browser_controller.py | Adds unit tests for lazy session start, concurrency behavior, proxies, and header behavior. |
| tests/unit/crawlers/_stagehand/test_stagehand_crawler.py | Adds unit tests verifying context types, hook contexts, and StagehandPage AI-method delegation. |
| docs/guides/stagehand_crawler.mdx | New guide documenting StagehandCrawler, options, AI methods, and Browserbase usage. |
| docs/guides/code_examples/stagehand_crawler/basic_example.py | Example demonstrating act() + extract() with JSON schema. |
| docs/guides/code_examples/stagehand_crawler/browserbase_example.py | Example demonstrating Browserbase environment configuration. |
| docs/guides/playwright_crawler_stagehand.mdx | Removes old guide that described manual Stagehand integration with PlaywrightCrawler. |
| docs/guides/code_examples/playwright_crawler_stagehand/support_classes.py | Removes old example support classes for the manual Stagehand integration. |
| docs/guides/code_examples/playwright_crawler_stagehand/browser_classes.py | Removes old example browser plugin/controller classes for the manual Stagehand integration. |
| docs/guides/code_examples/playwright_crawler_stagehand/stagehand_run.py | Removes old “manual integration” runnable example. |
| docs/guides/architecture_overview.mdx | Updates architecture diagrams/text to include StagehandCrawler + contexts. |
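
Several rows above mention exposing the Stagehand classes "via optional imports". As a rough illustration of that general pattern (not the helper Crawlee actually uses), a minimal sketch could look like the following. The function names `optional_import` and `require` are hypothetical, and the `crawlee[stagehand]` install hint assumes the extra is named after the `stagehand` dependency group listed for pyproject.toml.

```python
from __future__ import annotations

import importlib
from types import ModuleType


def optional_import(module_name: str) -> ModuleType | None:
    """Return the imported module, or None when the optional dependency is missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


def require(module: ModuleType | None, feature: str, extra: str) -> ModuleType:
    """Raise a helpful error only when the optional feature is actually used."""
    if module is None:
        # The extra name is an assumption based on the dependency group in the PR.
        raise ImportError(f"{feature} requires the '{extra}' extra: pip install 'crawlee[{extra}]'")
    return module


# Hypothetical usage: importing the package never fails, using the feature might.
stagehand_module = optional_import('stagehand')
```

Deferring the error to first use keeps `import crawlee.crawlers` working for users who never installed the extra.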
Docs check fails due to the current versioning logic.
vdusek left a comment
Mostly doc-related / style things. Maybe you could also align the `.rules.md` file (about the double backticks and line width for docstrings).
    extract_result = extracted.data.result

    await context.push_data(cast('dict[str, str | None]', extract_result))
Maybe an explicit type rather than `cast`?
Suggested change:

    - extract_result = extracted.data.result
    - await context.push_data(cast('dict[str, str | None]', extract_result))
    + extract_result: dict[str, str | None] = extracted.data.result
    + await context.push_data(extract_result)
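
To make the trade-off in this suggestion concrete, here is a small self-contained sketch. `fake_extract_result` is a hypothetical stand-in for `extracted.data.result`, and it assumes the value is typed as `Any`; if the real attribute has a narrower declared type, a type checker may reject the annotated form and `cast` would still be needed.

```python
from __future__ import annotations

from typing import Any, cast


def fake_extract_result() -> Any:
    """Hypothetical stand-in for an untyped extraction result."""
    return {'title': 'Example', 'price': None}


raw = fake_extract_result()

# Option 1: cast() tells the type checker what to assume; it does nothing at runtime.
via_cast = cast('dict[str, str | None]', raw)

# Option 2: an annotated assignment; accepted by type checkers because `raw` is Any.
via_annotation: dict[str, str | None] = raw

# Both names refer to the very same object; neither form validates the data.
same_object = via_cast is raw and via_annotation is raw
```

Neither form checks the payload at runtime, so the choice is mostly about readability.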
    Browser crawlers use a real browser to render pages, enabling scraping of sites that require
    JavaScript. They manage browser instances, pages, and context lifecycles. Crawlee provides
    two browser crawlers:

    - <ApiLink to="class/PlaywrightCrawler">`PlaywrightCrawler`</ApiLink> utilizes the
    [Playwright](https://playwright.dev/) library and provides a high-level API for controlling
    and navigating browsers. You can learn more about it in the
    [Playwright crawler guide](./playwright-crawler).
    - <ApiLink to="class/StagehandCrawler">`StagehandCrawler`</ApiLink> extends
    `PlaywrightCrawler` with AI-powered browser automation via
    [Stagehand](https://github.com/browserbase/stagehand). It adds natural-language methods
    (`act`, `extract`, `observe`, `execute`) directly on the page object. You can learn more
    about it in the [Stagehand crawler guide](./stagehand-crawler).
Please do not wrap lines in Markdown. We don't do that; the renderer wraps them.
    Because Stagehand manages the browser session internally via CDP, only Chromium is supported.
    Browser settings are limited to the subset accepted by Stagehand's `BrowserLaunchOptions` -
    `headless`, `args`, `viewport`, `proxy`, `locale`, `executable_path`, and a few others.
    Features like full browser fingerprinting (canvas, WebGL, screen properties) and incognito
    pages are not supported. Fingerprint-consistent HTTP headers (`User-Agent`, `Accept`, `sec-ch-ua`)
    are still injected automatically.
Same as https://github.com/apify/crawlee-python/pull/1854/changes#r3188872564, please check all md(x) files.
    """Controller for managing a Stagehand-controlled browser instance.

    It creates and connects to the browser lazily on the first ``new_page`` call: Stagehand
    starts a session, and Playwright then connects to it via CDP. All pages share a single
    browser context, as Stagehand creates the browser and its context together during session
    initialisation.
    """
Could you please tell Claude to:
- use the whole 120 chars width for docstrings
- use only single backticks for symbols: ``new_page`` -> `new_page`

This applies to all source code files.
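
The two style requests above are mechanical, so they could in principle be applied with a small script. The helper below is a hypothetical sketch (it is not part of the PR) that collapses double backticks to single ones and re-wraps one docstring paragraph to 120 characters; it ignores indentation and multi-paragraph docstrings.

```python
import re
import textwrap


def normalize_docstring_paragraph(text: str, width: int = 120) -> str:
    """Replace ``symbol`` with `symbol` and re-wrap the paragraph to the given width."""
    single_ticks = re.sub(r'``([^`]+)``', r'`\1`', text)
    # Collapse the existing hard wraps before wrapping again at the new width.
    collapsed = ' '.join(single_ticks.split())
    return textwrap.fill(collapsed, width=width)


paragraph = (
    'It creates and connects to the browser lazily on the first ``new_page`` call: Stagehand\n'
    'starts a session, and Playwright then connects to it via CDP. All pages share a single\n'
    'browser context, as Stagehand creates the browser and its context together during session\n'
    'initialisation.'
)
normalized = normalize_docstring_paragraph(paragraph)
```

A formatter or linter rule would be a more robust way to enforce this across all source files.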
    initialisation.
    """

    AUTOMATION_LIBRARY = 'stagehand'
Why is this not internal? Or why is it here at all? Can the Stagehand browser controller be used with another automation library?
    """

    def __init__(self, page: Page, session: AsyncSession) -> None:
        super().__init__(page._impl_obj)  # noqa: SLF001
We rely on the internal Playwright page attribute? Is it necessary? If so, could you please add a comment here to explain & support it?
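
For context on this question, one conceivable alternative to subclassing through a private attribute is composition: wrap the page and forward unknown attributes to it. The sketch below uses hypothetical stand-ins (`FakePage`, `FakeSession`, `DelegatingPage`) rather than Playwright or Stagehand classes. Its main drawback is that the wrapper is no longer an instance of the page class, which may be exactly why the PR subclasses instead.

```python
import asyncio
from typing import Any


class FakePage:
    """Hypothetical stand-in for a browser page object."""

    def __init__(self) -> None:
        self.url = 'about:blank'

    async def goto(self, url: str) -> str:
        self.url = url
        return url


class FakeSession:
    """Hypothetical stand-in for an AI session exposing an `act` method."""

    async def act(self, instruction: str) -> str:
        return f'acted: {instruction}'


class DelegatingPage:
    """Adds AI methods by composition and forwards everything else to the wrapped page."""

    def __init__(self, page: FakePage, session: FakeSession) -> None:
        self._page = page
        self._session = session

    def __getattr__(self, name: str) -> Any:
        # Only called when normal lookup fails; refuse private names to avoid recursion.
        if name.startswith('_'):
            raise AttributeError(name)
        return getattr(self._page, name)

    async def act(self, instruction: str) -> str:
        return await self._session.act(instruction)


async def demo() -> tuple[str, str]:
    page = DelegatingPage(FakePage(), FakeSession())
    await page.goto('https://example.com')
    return page.url, await page.act('click the button')
```

Because `isinstance(page, FakePage)` is false for the wrapper, code that type-narrows on the page class would need a protocol or an explicit check instead.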
Description
- Adds `StagehandCrawler` - a new browser crawler powered by Stagehand that lets users interact with pages using natural language instead of CSS selectors or XPath. Extends `PlaywrightCrawler` and inherits all of its features: routing, sessions, autoscaling, proxies, and navigation hooks.
- `StagehandPage` extends Playwright `Page` with four AI methods: `act()`, `extract()`, `observe()`, and `execute()`.
- `StagehandOptions` configures the AI model, execution environment (LOCAL/BROWSERBASE), and session parameters.
- `StagehandBrowserPlugin` and `StagehandBrowserController` integrate Stagehand into the browser pool, managing session lifecycle and fingerprint header injection.
- `BrowserLaunchOptions`.

Issues
Testing
- Unit tests for `StagehandBrowserController`, `StagehandBrowserPlugin`, and `StagehandCrawler` with Stagehand mocked out - no real LLM connection required to run the test suite.
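
As an illustration of what testing with Stagehand mocked out can look like, the sketch below checks the lazy-start behavior described in the controller docstring: concurrent `new_page` calls should start the session only once. `LazyController` is a simplified, hypothetical model and not the PR's `StagehandBrowserController`; the session is an `AsyncMock`, so no browser or LLM is involved.

```python
import asyncio
from typing import Any, Awaitable, Callable, Optional
from unittest.mock import AsyncMock


class LazyController:
    """Hypothetical controller that starts its session on the first `new_page` call."""

    def __init__(self, start_session: Callable[[], Awaitable[Any]]) -> None:
        self._start_session = start_session
        self._session: Optional[Any] = None
        self._lock = asyncio.Lock()

    async def new_page(self) -> Any:
        # The lock ensures concurrent first calls trigger only one session start.
        async with self._lock:
            if self._session is None:
                self._session = await self._start_session()
        return await self._session.new_page()


async def main() -> int:
    session = AsyncMock()

    async def fake_start() -> AsyncMock:
        await asyncio.sleep(0)  # Yield control so concurrent callers can interleave.
        return session

    start_session = AsyncMock(side_effect=fake_start)
    controller = LazyController(start_session)
    await asyncio.gather(*(controller.new_page() for _ in range(5)))
    return start_session.await_count
```

Removing the lock in this sketch lets several callers observe `_session is None` before the first start completes, which is the race such a test is meant to catch.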