fix: ignore whitespace-only Disallow paths in extractUrlsFromRobotsTxt #1973
Open
juliosuas wants to merge 1 commit into smicallef:master
The regex r'disallow:\s*(.[^ #]*)' used '.' as the first character of the capture group, which matches any character including a space. This caused 'Disallow: ' (a whitespace-only path) to be returned as ' ', adding an invalid disallowed path to the list.

Per the robots.txt specification, 'Disallow: ' with no non-whitespace content means 'allow all' and should be treated as an empty/no-op rule.

Fix: replace the leading '.' with '\S' so only paths that start with a non-whitespace character are captured. This resolves the TODO comment that had been in the docstring since the original implementation.

Fixes smicallef#701
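To make the difference concrete, here is a minimal standalone sketch comparing the two patterns quoted above. The helper name `extract_disallowed` and the per-line matching loop are assumptions for illustration only; they are not the actual body of `extractUrlsFromRobotsTxt`, and case-insensitive matching is assumed.

```python
import re

# Patterns as quoted in the description; IGNORECASE is an assumption of this sketch.
OLD_PATTERN = re.compile(r'disallow:\s*(.[^ #]*)', re.IGNORECASE)
NEW_PATTERN = re.compile(r'disallow:\s*(\S[^ #]*)', re.IGNORECASE)


def extract_disallowed(robots_txt, pattern):
    """Hypothetical helper: collect the captured path from each matching line."""
    paths = []
    for line in robots_txt.splitlines():
        m = pattern.match(line)
        if m:
            paths.append(m.group(1))
    return paths


# The second line is 'Disallow: ' with a single trailing space.
sample = "User-agent: *\nDisallow: \nDisallow: /admin\nDisallow: /tmp # scratch"

old_result = extract_disallowed(sample, OLD_PATTERN)
new_result = extract_disallowed(sample, NEW_PATTERN)

print(old_result)  # [' ', '/admin', '/tmp']
print(new_result)  # ['/admin', '/tmp']
```

With the old pattern, `\s*` backtracks and gives up the trailing space so that `.` can match it, which is how `' '` ends up in the result. `\S` cannot match that space, so the whitespace-only line produces no entry.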
A bit more detail: per the robots.txt spec,
Summary
Fixes #701. Resolves the TODO in the docstring.
Problem
The regex `r'disallow:\s*(.[^ #]*)'` used `.` as the first character of the capture group, which matches any character including a space. This caused `Disallow: ` (whitespace-only path) to be returned as `' '`, adding an invalid disallowed path to the crawl-exclusion list.

Per the robots.txt spec, `Disallow:` with no path (or only whitespace) means allow all and should produce no exclusion entries.

Fix
Replace the leading `.` with `\S` so only paths that begin with a non-whitespace character are captured: