
robots.txt: add /server/* and /app/health to list of disallowed paths #3275

Closed
wants to merge 3 commits

Conversation

@saschaszott (Contributor) commented Aug 29, 2024

Description

This minor PR extends robots.txt with additional Disallow rules.
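A minimal sketch of the added rules, based on the PR title (the authoritative version is the diff to src/robots.txt.ejs, which is not reproduced here):

```
User-agent: *
# Proposed additions per the PR title; exact wording lives in the diff:
Disallow: /server/*
Disallow: /app/health
```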

@saschaszott changed the title from "robots.txt: add /server/* to list of disallowed paths" to "robots.txt: add /server/* and /app/health to list of disallowed paths" on Aug 29, 2024
@tdonohue (Member) left a comment


Thanks @saschaszott. Overall, this looks good. Just one minor inline comment with a recommendation.

src/robots.txt.ejs (inline review comment, resolved)
@tdonohue added labels on Aug 29, 2024: bug; component: SEO (Search Engine Optimization); 1 APPROVAL (pull request only requires a single approval to merge); port to dspace-7_x (needs to be ported to the `dspace-7_x` branch for the next bug-fix release); port to dspace-8_x (needs to be ported to the `dspace-8_x` branch for the next bug-fix release)
@alanorth (Contributor) commented
I'm confused by this. If we disallow bots from our API, the only way they can get information about content in the repository is via directly visiting and scraping communities, collections, and items that appear in the sitemap. Is that the intent?

There are many reasons to use the API programmatically instead of hitting the Angular frontend and trying to scrape HTML. I would have thought we would encourage programmatic access to the API. For example, I operate several integrations with DSpace repositories, scripting access to their APIs because I know how to use the API. If we merge this, then "technically" I would be breaking the site's robots.txt.

What is the problem this pull request is trying to solve? Obviously we don't want bots crawling endless Discovery and Browse lists and we disallowed that long ago in all frontends because it generates massive load for no benefit (content on Discovery / Browse pages is derived from the actual items so should be crawled there instead). Is the problem that @saschaszott is seeing massive load coming from bots on their API?

It's much more efficient for a bot to use the API directly (if it knows how to) than going to Angular, which ends up calling the API anyway.

@saschaszott (Contributor, Author) commented
@alanorth, thanks for your input. The idea behind the additional Disallow rules was to reduce crawler/bot traffic on the DSpace backend. I'm not sure how, or even whether, crawlers access publicly available REST API endpoints. Do you have a reference? I could not find anything appropriate.

Regarding your comment:

It's much more efficient for a bot to use the API directly (if it knows how to) than going to Angular, which ends up calling the API anyway.

If that's the case, why does DSpace use SSR at all? And how would Google represent such raw API responses in its search results?

@tdonohue (Member) commented Oct 21, 2024

@saschaszott: On further reflection, I don't think the Disallow: /server/* rule has any major benefits.

The reality is that the HAL Browser is a JavaScript app. If a bot doesn't execute JavaScript (and almost all do not, especially SEO bots), it will be impossible for it to "crawl" the REST API. Try turning off JavaScript and using the HAL Browser: it doesn't work. This means a bot that only consumes server-side rendered pages will never be able to "crawl" the HAL Browser.

I'm also worried that blocking the entire REST API could be problematic, as it could impact SEO. With @pnbecker's help, we already noticed that bitstreams need to be accessible. What about thumbnails (/server/api/items/[uuid]/thumbnail)? What about bots that use OpenSearch / RSS (/server/opensearch/search)? What about bots that use OAI-PMH (/server/oai/)?

I'm worried that blindly disallowing the entire /server/ path is likely to have unexpected side effects for some bots, and I think it's unlikely many bots will even succeed in "crawling" the REST API (the bot would have to understand JavaScript or JSON).

So, my opinion is that we should remove the /server/* entries from the robots.txt, until there's a clear use case for them. Individual sites can add it in if they feel it will have no side effects.
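For sites that do choose to keep such a rule locally, one option (a sketch only, not something this PR proposes) is to pair the Disallow with explicit Allow exceptions for the endpoints mentioned above. Major crawlers such as Googlebot honor Allow rules and wildcard paths; the exact DSpace backend paths should be verified against the running server before relying on this:

```
User-agent: *
Disallow: /server/
# Illustrative exceptions for endpoints that bots may legitimately need;
# paths are assumptions and should be checked against your backend:
Allow: /server/api/core/bitstreams/     # bitstream downloads
Allow: /server/api/items/*/thumbnail    # item thumbnails
Allow: /server/opensearch/              # OpenSearch / RSS feeds
Allow: /server/oai/                     # OAI-PMH
```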

@saschaszott (Contributor, Author) commented
Thank you all for your valuable feedback. I'll close this as "won't fix". We'll adapt the changes to robots.txt on our own and see what happens.

@alanorth (Contributor) commented
For what it's worth, in our repository I have a massive list of datacenter ISP network subnets that get rate limited (implemented in nginx):

  • AS12876 (Scaleway)
  • AS132203 (Tencent)
  • AS13238 (Yandex)
  • AS200350 (Yandex Cloud)
  • AS136907 (Huawei Cloud)
  • AS14061 (Digital Ocean)
  • AS14618 (Amazon-AES)
  • AS16276 (OVH)
  • AS16509 (Amazon-02)
  • AS203020 (HostRoyale)
  • AS204287 (HostRoyale)
  • AS21859 (Zenlayer)
  • AS23576 (Naver)
  • AS24940 (Hetzner)
  • AS396982 (Google Cloud)
  • AS45102 (Alibaba US)
  • AS37963 (Alibaba CN)
  • AS50245 (Serverel)
  • AS55286 (Server Mania)
  • AS6939 (Hurricane)
  • AS8075 (Microsoft)
  • AS150436 (Byteplus)
  • AS26548 (PureVoltage)
  • AS212238 (CDNEXT - Datacamp Limited, GB)
  • AS29802 (HVC-AS, US) - HIVELOCITY, Inc.
  • AS132817 (DZCRD-AS-AP DZCRD Networks Ltd, BD)
  • AS201341 (CENTURION-INTERNET-SERVICES - trafficforce, UAB, LT) - Code200
  • AS209709 (CODE200-ISP1 - UAB code200, LT)
  • AS207223 (GLOBALCON - Global Connections Network LLC, US)
  • AS64286 (LOGICWEB, US) - Tesonet
  • AS396319 (US-INTERNET-396319, US) - OxyLabs
  • AS62874 (WEB2OBJECTS, US) - Web2Objects LLC
  • AS174 (COGENT-174, US)

IPs in these networks have all participated in scraping our repository on a massive scale (hundreds or thousands of IPs concurrently) over the last few years. Some host known malicious and bad-faith bots (confirmed via AbuseIPDB.com, GreyNoise, and my own logs).

It's not enough to rely on user agents or good-faith parsing of robots.txt anymore, because so many bots lie or just don't bother. I am rate limiting tens of millions of IPs just to keep the server up for real users. I do have a whitelist of known-good IPs that takes precedence over this.
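A simplified nginx sketch of this kind of setup, for anyone curious what it looks like in practice. The CIDR ranges below are placeholders, not the actual AS prefix lists (in a real deployment they would be generated from ASN-to-prefix data), and the backend address is an assumption:

```nginx
# These directives go in the http {} context of nginx.conf.

# Map client IPs to a rate-limit flag. geo uses longest-prefix match, so a
# more specific known-good range set to 0 takes precedence over a flagged one.
geo $datacenter {
    default          0;
    203.0.113.0/24   0;   # placeholder: known-good whitelist range
    192.0.2.0/24     1;   # placeholder: e.g. a cloud provider subnet
    198.51.100.0/24  1;   # placeholder: another datacenter subnet
}

# An empty key is never rate limited; flagged clients are limited per IP.
map $datacenter $limit_key {
    0  "";
    1  $binary_remote_addr;
}

limit_req_zone $limit_key zone=datacenters:20m rate=1r/s;

server {
    listen 80;
    server_name repository.example.org;

    location / {
        limit_req zone=datacenters burst=10 nodelay;
        proxy_pass http://localhost:4000;   # DSpace Angular frontend (assumed port)
    }
}
```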

Labels
1 APPROVAL · bug · component: SEO · port to dspace-7_x · port to dspace-8_x
4 participants