Avoid excess load of bots going into search facet links on entity pages #2709

Closed
bram-atmire opened this issue Dec 12, 2023 · 6 comments · Fixed by #2710
Labels
bug component: SEO Search Engine Optimization performance / caching Related to performance, caching or embedded objects

@bram-atmire
Member

Describe the bug
We're seeing in Search Console for several of our clients that bots follow the facet links on entity pages. Crawling these links doesn't improve the quality of the indexing (i.e. bots shouldn't be going there), and processing the requests is resource intensive, so we should avoid this behaviour altogether.

To Reproduce
Steps to reproduce the behavior:

  1. Open Search Console for an actively indexed DSpace 7 site that has entities enabled
  2. In the reports of crawled URLs, look for patterns like:

entities/orgunit/25913818-6714-4be5-89a6-f70c8facdf7e?f.author=Wang

Expected behavior
Robots should be blocked from doing this

Proposed solution
Add the following disallow directive to robots.txt:

Disallow: /entities/*?f
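
To make the intent of this rule concrete, here is a minimal Python sketch of wildcard matching as it is commonly described for robots.txt. It assumes `*` matches any run of characters, a trailing `$` anchors the end of the URL, every other character (including `?`) is literal, and a rule matches from the start of the path. The helper names `robots_pattern_to_regex` and `is_disallowed` are made up for this illustration. Individual crawlers may behave differently, so treat it as a way to reason about the pattern, not as a reference implementation.

```python
import re


def robots_pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a compiled regex.

    Assumed semantics (not taken from any specific crawler's code):
    '*' matches any run of characters, a trailing '$' anchors the end
    of the URL, and everything else, including '?', is literal.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' for each wildcard.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_disallowed(pattern: str, path_and_query: str) -> bool:
    """Return True if the rule matches from the start of the path."""
    return robots_pattern_to_regex(pattern).match(path_and_query) is not None


RULE = "/entities/*?f"
FACET_URL = "/entities/orgunit/25913818-6714-4be5-89a6-f70c8facdf7e?f.author=Wang"
ENTITY_URL = "/entities/orgunit/25913818-6714-4be5-89a6-f70c8facdf7e"

print(is_disallowed(RULE, FACET_URL))
print(is_disallowed(RULE, ENTITY_URL))
```

Under these assumed semantics the rule matches the facet URL from the report and leaves the plain entity page crawlable. It only matches when the query string starts with `f`, so a URL such as `/entities/orgunit/...?page=2&f.author=Wang` would not be covered by this sketch.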

Related work

Previously incorrectly created in the back-end Git repo as DSpace/DSpace#9227

@bram-atmire bram-atmire added bug needs triage New issue needs triage and/or scheduling labels Dec 12, 2023
bram-atmire added a commit that referenced this issue Dec 12, 2023
@tdonohue tdonohue added component: SEO Search Engine Optimization and removed needs triage New issue needs triage and/or scheduling labels Dec 12, 2023
@tdonohue tdonohue added this to the 7.6.2 milestone Dec 12, 2023
@alanorth
Contributor

Thanks @bram-atmire! I can imagine this is a huge load (like crawling search and browse as well) and an obvious win for bots that respect robots.txt. I'm wondering if Google's interpretation of the robot exclusion protocol supports wildcards such as this after path elements. It seems maybe? Have you tried it on a live site?

As a sysadmin I'd block these patterns in Apache / nginx just to be sure—as the Russian saying goes: "trust, but verify".


Side note, we have several patterns with trailing wildcards that will be ignored by Google bot.

@bram-atmire
Member Author

@alanorth As far as I can see, as long as the wildcard isn't trailing, it shouldn't be ignored.

The change in this ticket came up in an email dialogue with a representative from Google Scholar.

One site where we have it in prod: https://repository.upenn.edu/robots.txt

@hutattedonmyarm
Contributor

Wouldn't it be useful to (additionally) add the rel="nofollow" attribute to the anchor tags in the search filters? This way we don't have to rely on how wildcards are handled by crawlers.

@alanorth
Contributor

alanorth commented Mar 6, 2024

@hutattedonmyarm if we use rel="nofollow" on search pages it would be a sign for bots to not crawl them, but they still have to load the page to read the anchor tags. In theory the robots.txt method should be better because bots can read it before requesting any page.

@hutattedonmyarm
Contributor

@alanorth Not the whole page, I was only talking about the links in search-filters.component, i.e. the checkboxes which check/uncheck the filters in the search results sidebar. These are implemented as links. Currently, crawlers follow them because they're part of an entities page, but they only lead to search results.

@alanorth
Contributor

alanorth commented Mar 7, 2024

@hutattedonmyarm oh yes, I was confusing the rel=nofollow with other robot instructions in head meta tags. I think you are right that we should make those links rel=nofollow.
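
One way to check the rel=nofollow idea against a rendered page is a small audit script. The sketch below is a hypothetical helper, not part of DSpace: it parses an HTML string with Python's standard `html.parser` and lists facet links that lack `rel="nofollow"`. It assumes facet links can be recognised by a query parameter whose name starts with `f.`, as in the URL from the report, and the sample markup is made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qsl, urlsplit


class FacetLinkAudit(HTMLParser):
    """Collect hrefs of facet links that do not carry rel="nofollow"."""

    def __init__(self):
        super().__init__()
        self.missing_nofollow = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href") or ""
        query = urlsplit(href).query
        keys = [key for key, _ in parse_qsl(query, keep_blank_values=True)]
        # Assumption: facet parameters are named "f.<something>".
        is_facet_link = any(key.startswith("f.") for key in keys)
        rel_tokens = (attributes.get("rel") or "").lower().split()
        if is_facet_link and "nofollow" not in rel_tokens:
            self.missing_nofollow.append(href)


def facet_links_missing_nofollow(html: str) -> list:
    """Return the facet link hrefs in `html` that lack rel="nofollow"."""
    audit = FacetLinkAudit()
    audit.feed(html)
    audit.close()
    return audit.missing_nofollow


# Made-up sample markup; real DSpace output will look different.
ENTITY = "/entities/orgunit/25913818-6714-4be5-89a6-f70c8facdf7e"
SAMPLE = (
    f'<a href="{ENTITY}?f.author=Wang">unmarked facet link</a>'
    f'<a rel="nofollow" href="{ENTITY}?f.author=Wang">marked facet link</a>'
    f'<a href="{ENTITY}">entity page</a>'
)

print(facet_links_missing_nofollow(SAMPLE))
```

For the sample markup only the first link is reported: the second already has `rel="nofollow"`, and the third has no facet parameter.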

github-actions bot pushed a commit that referenced this issue Apr 29, 2024
Fix for issue #2709

(cherry picked from commit fbd3529)
@tdonohue tdonohue added the performance / caching Related to performance, caching or embedded objects label Apr 29, 2024
4 participants