ExtractorHTML: Treat 'cite' attribute as navlink instead of embed
The cite attribute identifies the source document of a blockquote. ExtractorHTML was treating it as an embed, which can cause out-of-scope pages to be incorrectly included in a crawl. Browsers don't currently use the cite attribute, so there may be an argument for ignoring it entirely, but let's at least not treat it as an embed.
nla-manderson authored and ato committed Sep 10, 2024
1 parent 791f1c8 commit 3a447b5
Showing 1 changed file with 4 additions and 4 deletions.
@@ -197,8 +197,8 @@ public void setMaxAttributeValLength(int max) {
// matched by the above. attributes known to be URIs of various
// sorts are matched specially
static final String EACH_ATTRIBUTE_EXTRACTOR =
- "(?is)\\s?((href)|(action)|(on\\w*)" // 1, 2, 3, 4
- +"|((?:src)|(?:srcset)|(?:lowsrc)|(?:background)|(?:cite)" // ...
+ "(?is)\\s?((href|(?:cite))|(action)|(on\\w*)" // 1, 2, 3, 4
+ +"|((?:src)|(?:srcset)|(?:lowsrc)|(?:background)" // ...
+"|(?:longdesc)|(?:usemap)|(?:profile)|(?:datasrc)" // ...
+"|(?:data-src)|(?:data-srcset)|(?:data-original)|(?:data-original-set))" // 5
+"|(codebase)|((?:classid)|(?:data))|(archive)|(code)" // 6, 7, 8, 9
@@ -210,10 +210,10 @@ public void setMaxAttributeValLength(int max) {
+"|(\\S{1,"+MAX_ATTR_VAL_REPLACE+"}))"; // 16
// groups:
// 1: attribute name
- // 2: HREF - single URI relative to doc base, or occasionally javascript:
+ // 2: HREF, CITE - single URI relative to doc base, or occasionally javascript:
// 3: ACTION - single URI relative to doc base, or occasionally javascript:
// 4: ON[WHATEVER] - script handler
- // 5: SRC,SRCSET,LOWSRC,BACKGROUND,CITE,LONGDESC,USEMAP,PROFILE, or
+ // 5: SRC,SRCSET,LOWSRC,BACKGROUND,LONGDESC,USEMAP,PROFILE, or
// DATA-SRC, DATA-ORIGINAL single URI relative to doc base
// DATA-SRCSET, DATA-ORIGINAL-SET multi URI relative to doc base
// 6: CODEBASE - a single URI relative to doc base, affecting other
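The effect of moving `cite` between capture groups can be illustrated with a small standalone sketch. It is not the ExtractorHTML code: the name alternations below are abridged from the diff, `VALUE_TAIL` is a hypothetical stand-in for the real attribute-value portion of the pattern, and `classify` is an invented helper that assumes group 2 is handled as a navigational link and group 5 as an embed, as the commit title and group comments suggest.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CiteGroupDemo {
    // Abridged attribute-name alternation from before the commit:
    // cite is part of group 5, alongside src, srcset, lowsrc, background.
    static final String OLD_NAMES =
        "((href)|(action)|(on\\w*)"
        + "|((?:src)|(?:srcset)|(?:lowsrc)|(?:background)|(?:cite)))";

    // Abridged alternation from after the commit: cite joins href in group 2.
    static final String NEW_NAMES =
        "((href|(?:cite))|(action)|(on\\w*)"
        + "|((?:src)|(?:srcset)|(?:lowsrc)|(?:background)))";

    // Hypothetical simplification: only double-quoted values are matched here.
    static final String VALUE_TAIL = "\\s*=\\s*\"([^\"]*)\"";

    // Invented helper: reports which group captured the attribute name.
    // Assumes group 2 means navlink and group 5 means embed.
    static String classify(String names, String attribute) {
        Matcher m = Pattern.compile("(?is)\\s?" + names + VALUE_TAIL)
                           .matcher(attribute);
        if (!m.find()) {
            return "none";
        }
        if (m.group(2) != null) {
            return "navlink";
        }
        if (m.group(5) != null) {
            return "embed";
        }
        return "other";
    }

    public static void main(String[] args) {
        String cite = "cite=\"http://example.org/source\"";
        System.out.println("old pattern: " + classify(OLD_NAMES, cite));
        System.out.println("new pattern: " + classify(NEW_NAMES, cite));
    }
}
```

Under these assumptions, the same `cite` attribute falls into group 5 with the old alternation and into group 2 with the new one, while `src` and `href` are classified the same way by both.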
