
robots.txt: add /server/* and /app/health to list of disallowed paths #3275

Closed
wants to merge 3 commits

Conversation

@saschaszott (Contributor) commented Aug 29, 2024

Description

This minor PR extends robots.txt with additional Disallow rules.
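A minimal sketch of the added rules, based on the PR title (the authoritative version is the diff to src/robots.txt.ejs, which is not reproduced here):

```
User-agent: *
# Proposed additions per the PR title; exact wording lives in the diff:
Disallow: /server/*
Disallow: /app/health
```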

@saschaszott changed the title from "robots.txt: add /server/* to list of disallowed paths" to "robots.txt: add /server/* and /app/health to list of disallowed paths" on Aug 29, 2024
@tdonohue (Member) left a comment


Thanks @saschaszott. Overall, this looks good. Just one minor inline comment with a recommendation.

src/robots.txt.ejs (inline review comment, resolved)
@tdonohue added labels on Aug 29, 2024: bug; component: SEO (Search Engine Optimization); 1 APPROVAL (pull request only requires a single approval to merge); port to dspace-7_x (needs to be ported to the `dspace-7_x` branch for the next bug-fix release); port to dspace-8_x (needs to be ported to the `dspace-8_x` branch for the next bug-fix release)
@alanorth (Contributor) commented
I'm confused by this. If we disallow bots from our API, the only way they can get information about content in the repository is via directly visiting and scraping communities, collections, and items that appear in the sitemap. Is that the intent?

There are many reasons to use the API programmatically instead of hitting the Angular frontend and trying to scrape HTML. I would have thought we would encourage programmatic access to the API. For example, I operate several integrations with DSpace repositories, scripting access to their APIs because I know how to use the API. If we merge this, then "technically" I would be breaking the site's robots.txt.

What is the problem this pull request is trying to solve? Obviously we don't want bots crawling endless Discovery and Browse lists and we disallowed that long ago in all frontends because it generates massive load for no benefit (content on Discovery / Browse pages is derived from the actual items so should be crawled there instead). Is the problem that @saschaszott is seeing massive load coming from bots on their API?

It's much more efficient for a bot to use the API directly (if it knows how to) than going to Angular, which ends up calling the API anyway.

@saschaszott (Contributor, Author) commented
@alanorth, thanks for your input. The idea behind the additional Disallow rules was to reduce crawler/bot traffic on the DSpace backend. I'm not sure how, or even whether, crawlers access publicly available REST API endpoints. Do you have a reference? I could not find anything appropriate.

Regarding your comment:

It's much more efficient for a bot to use the API directly (if it knows how to) than going to Angular, which ends up calling the API anyway.

If that's the case, why does DSpace use SSR at all? And how would Google represent such raw API responses in its search results?

@tdonohue (Member) commented Oct 21, 2024

@saschaszott: On further reflection, I don't think the Disallow: /server/* rule has any major benefits.

The reality is that the HAL Browser is a JavaScript app. If a bot doesn't execute JavaScript (and almost all do not, especially SEO bots), it will be impossible for it to "crawl" the REST API. Try turning off JavaScript and using the HAL Browser: it doesn't work. This means a bot that only consumes server-side rendered pages will never be able to "crawl" the HAL Browser.

I'm also worried that blocking the entire REST API could be problematic, as it could impact SEO. With @pnbecker's help, we already noticed that bitstreams need to be accessible. What about thumbnails (/server/api/items/[uuid]/thumbnail)? What about bots that use OpenSearch / RSS (/server/opensearch/search)? What about bots that use OAI-PMH (/server/oai/)?

I'm worried that blindly disallowing the entire /server/ path is likely to have unexpected side effects for some bots, and I think it's unlikely many bots will even succeed in "crawling" the REST API (the bot would have to understand JavaScript or JSON).

So, my opinion is that we should remove the /server/* entries from the robots.txt, until there's a clear use case for them. Individual sites can add it in if they feel it will have no side effects.
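For sites that do choose to keep such a rule locally, one option (a sketch only, not something this PR proposes) is to pair the Disallow with explicit Allow exceptions for the endpoints mentioned above. Major crawlers such as Googlebot honor Allow rules and wildcard paths; the exact DSpace backend paths should be verified against the running server before relying on this:

```
User-agent: *
Disallow: /server/
# Illustrative exceptions for endpoints that bots may legitimately need;
# paths are assumptions and should be checked against your backend:
Allow: /server/api/core/bitstreams/     # bitstream downloads
Allow: /server/api/items/*/thumbnail    # item thumbnails
Allow: /server/opensearch/              # OpenSearch / RSS feeds
Allow: /server/oai/                     # OAI-PMH
```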

@saschaszott (Contributor, Author) commented
Thank you all for your valuable feedback. I'll close this as "won't fix". We'll adapt the changes to robots.txt on our own and see what happens.

@alanorth (Contributor) commented
For what it's worth, in our repository I have a massive list of datacenter ISP network subnets that get rate limited (implemented in nginx):

  • AS12876 (Scaleway)
  • AS132203 (Tencent)
  • AS13238 (Yandex)
  • AS200350 (Yandex Cloud)
  • AS136907 (Huawei Cloud)
  • AS14061 (Digital Ocean)
  • AS14618 (Amazon-AES)
  • AS16276 (OVH)
  • AS16509 (Amazon-02)
  • AS203020 (HostRoyale)
  • AS204287 (HostRoyale)
  • AS21859 (Zenlayer)
  • AS23576 (Naver)
  • AS24940 (Hetzner)
  • AS396982 (Google Cloud)
  • AS45102 (Alibaba US)
  • AS37963 (Alibaba CN)
  • AS50245 (Serverel)
  • AS55286 (Server Mania)
  • AS6939 (Hurricane)
  • AS8075 (Microsoft)
  • AS150436 (Byteplus)
  • AS26548 (PureVoltage)
  • AS212238 (CDNEXT - Datacamp Limited, GB)
  • AS29802 (HVC-AS, US) - HIVELOCITY, Inc.
  • AS132817 (DZCRD-AS-AP DZCRD Networks Ltd, BD)
  • AS201341 (CENTURION-INTERNET-SERVICES - trafficforce, UAB, LT) - Code200
  • AS209709 (CODE200-ISP1 - UAB code200, LT)
  • AS207223 (GLOBALCON - Global Connections Network LLC, US)
  • AS64286 (LOGICWEB, US) - Tesonet
  • AS396319 (US-INTERNET-396319, US) - OxyLabs
  • AS62874 (WEB2OBJECTS, US) - Web2Objects LLC
  • AS174 (COGENT-174, US)

IPs in these networks have all participated in scraping our repository on a massive scale (hundreds or thousands of IPs concurrently) over the last few years. Some host known malicious and bad-faith bots (confirmed via AbuseIPDB.com, GreyNoise, and my own logs).

It's not enough to rely on user agents or good-faith parsing of robots.txt anymore, because so many bots lie or just don't bother. I am rate limiting tens of millions of IPs just to keep the server up for real users. I do have a whitelist of known-good IPs that takes precedence over this.
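A simplified nginx sketch of this kind of setup, for anyone curious what it looks like in practice. The CIDR ranges below are placeholders, not the actual AS prefix lists (in a real deployment they would be generated from ASN-to-prefix data), and the backend address is an assumption:

```nginx
# These directives go in the http {} context of nginx.conf.

# Map client IPs to a rate-limit flag. geo uses longest-prefix match, so a
# more specific known-good range set to 0 takes precedence over a flagged one.
geo $datacenter {
    default          0;
    203.0.113.0/24   0;   # placeholder: known-good whitelist range
    192.0.2.0/24     1;   # placeholder: e.g. a cloud provider subnet
    198.51.100.0/24  1;   # placeholder: another datacenter subnet
}

# An empty key is never rate limited; flagged clients are limited per IP.
map $datacenter $limit_key {
    0  "";
    1  $binary_remote_addr;
}

limit_req_zone $limit_key zone=datacenters:20m rate=1r/s;

server {
    listen 80;
    server_name repository.example.org;

    location / {
        limit_req zone=datacenters burst=10 nodelay;
        proxy_pass http://localhost:4000;   # DSpace Angular frontend (assumed port)
    }
}
```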

Labels
1 APPROVAL · bug · component: SEO · port to dspace-7_x · port to dspace-8_x
4 participants