Skip to content

fix: Filter sitemap-derived URLs by enqueue strategy#1864

Merged
vdusek merged 9 commits intomasterfrom
fix/sitemap-loader-enqueue-strategy
May 4, 2026
Merged

fix: Filter sitemap-derived URLs by enqueue strategy#1864
vdusek merged 9 commits intomasterfrom
fix/sitemap-loader-enqueue-strategy

Conversation

@vdusek
Copy link
Copy Markdown
Collaborator

@vdusek vdusek commented Apr 30, 2026

Summary

SitemapRequestLoader accepted every <loc> it parsed (including nested sitemaps and robots.txt directives) without checking the host, so a sitemap could push arbitrary URLs into the queue. This wires the existing EnqueueStrategy mechanism into the sitemap loader and the robots.txt discovery path.

What changed

  • SitemapRequestLoader.__init__ gains enqueue_strategy: EnqueueStrategy = 'same-hostname', applied to nested <sitemap><loc> and <urlset><url><loc> entries against the parent sitemap URL. Pass 'all' to opt out. The strategy is stamped on emitted Requests, so BasicCrawler._check_url_after_redirects continues policing across redirects.
  • RobotsTxtFile.get_sitemaps(*, enqueue_strategy) filters cross-host Sitemap: directives (already disallowed by spec). The discover_valid_sitemaps path is closed as well.
  • Hoisted the strategy matcher into crawlee._utils.urls.matches_enqueue_strategy so request loaders and BasicCrawler share one implementation. BasicCrawler now uses yarl.URL to match the rest of the codebase.
  • Memoised the public-suffix lookup used by same-domain so it isn't re-run per URL on hot paths.

Behavior change

The new default 'same-hostname' matches enqueue_links. Callers depending on cross-host sitemap entries must pass enqueue_strategy='all' (or 'same-domain'). RobotsTxtFile.get_sitemaps() no longer returns cross-host entries; its tests were rewritten to assert filtering, with a new test_extract_same_host_sitemaps_urls covering the legitimate path.

@github-actions github-actions Bot added this to the 139th sprint - Tooling team milestone Apr 30, 2026
@github-actions github-actions Bot added t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programatically for some analytics. labels Apr 30, 2026
@vdusek vdusek force-pushed the fix/sitemap-loader-enqueue-strategy branch from 5976063 to a4317a4 Compare April 30, 2026 14:33
@vdusek vdusek marked this pull request as draft April 30, 2026 14:36
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 30, 2026

Codecov Report

❌ Patch coverage is 91.19497% with 14 lines in your changes missing coverage. Please review.
✅ Project coverage is 92.53%. Comparing base (6b78f38) to head (6f4d940).
⚠️ Report is 2 commits behind head on master.

Files with missing lines Patch % Lines
...crawlee/request_loaders/_sitemap_request_loader.py 91.13% 7 Missing ⚠️
src/crawlee/crawlers/_basic/_basic_crawler.py 78.26% 5 Missing ⚠️
src/crawlee/_utils/robots.py 88.88% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1864      +/-   ##
==========================================
+ Coverage   92.42%   92.53%   +0.10%     
==========================================
  Files         158      158              
  Lines       11027    11078      +51     
==========================================
+ Hits        10192    10251      +59     
+ Misses        835      827       -8     
Flag Coverage Δ
unit 92.53% <91.19%> (+0.10%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@vdusek vdusek added the adhoc Ad-hoc unplanned task added during the sprint. label Apr 30, 2026
@vdusek vdusek force-pushed the fix/sitemap-loader-enqueue-strategy branch from a4317a4 to 22cb16b Compare April 30, 2026 15:09
@vdusek vdusek marked this pull request as ready for review April 30, 2026 15:10
@vdusek vdusek force-pushed the fix/sitemap-loader-enqueue-strategy branch from 22cb16b to cfc66d9 Compare April 30, 2026 15:17
Copy link
Copy Markdown
Collaborator

@Mantisus Mantisus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one note about parsing URLs. Everything else is LGTM.

Comment thread src/crawlee/_utils/urls.py Outdated
Copy link
Copy Markdown
Collaborator

@Mantisus Mantisus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@vdusek vdusek requested a review from janbuchar May 3, 2026 07:32
@vdusek vdusek force-pushed the fix/sitemap-loader-enqueue-strategy branch from 6b437b1 to d785f08 Compare May 3, 2026 08:01
Comment thread tests/unit/_utils/test_urls.py Outdated
vdusek added 5 commits May 4, 2026 13:59
The wrapper added nothing beyond a debug log that didn't affect control flow,
since `matches_enqueue_strategy` already returns `False` when either host is
missing. Replace the two callsites with direct calls to the shared helper and
drop the now-unused `EnqueueStrategy` import.
@vdusek vdusek force-pushed the fix/sitemap-loader-enqueue-strategy branch from d785f08 to 610ae46 Compare May 4, 2026 12:01
@vdusek vdusek requested a review from janbuchar May 4, 2026 12:01
Comment thread src/crawlee/_utils/urls.py
@vdusek vdusek force-pushed the fix/sitemap-loader-enqueue-strategy branch from 4ba08d1 to 6f4d940 Compare May 4, 2026 17:34
@vdusek vdusek merged commit b3db0dc into master May 4, 2026
32 checks passed
@vdusek vdusek deleted the fix/sitemap-loader-enqueue-strategy branch May 4, 2026 17:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

adhoc Ad-hoc unplanned task added during the sprint. t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programatically for some analytics.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants