robots.txt: add /server/* and /app/health to list of disallowed paths #3275
Conversation
Thanks @saschaszott. Overall, this looks good. Just one minor inline comment as a recommendation.
I'm confused by this. If we disallow bots from our API, the only way they can get information about content in the repository is by directly visiting and scraping the communities, collections, and items that appear in the sitemap. Is that the intent? There are many reasons to use the API programmatically instead of hitting the Angular frontend and trying to scrape HTML, and I would have thought we would encourage programmatic access to the API. For example, I operate several integrations with DSpace repositories, scripting access to their APIs because I know how to use the API. If we merge this, then "technically" I would be breaking the site's robots.txt rules.

What is the problem this pull request is trying to solve? Obviously we don't want bots crawling endless Discovery and Browse lists; we disallowed that long ago in all frontends because it generates massive load for no benefit (content on Discovery / Browse pages is derived from the actual items, so it should be crawled there instead). Is the problem that @saschaszott is seeing massive load coming from bots on their API? It's much more efficient for a bot to use the API directly (if it knows how) than to go through Angular, which ends up calling the API anyway.
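For illustration, the kind of programmatic API access described above might look like the following minimal sketch (the base URL is the public DSpace demo site, used here as a placeholder; /server/api/core/items is the standard DSpace 7+ items endpoint):

```
# List the first five items from a DSpace 7+ REST API
curl -s "https://demo.dspace.org/server/api/core/items?size=5" \
  -H "Accept: application/json"
```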
@alanorth, thanks for your input. The idea behind the additional Disallow rules is to keep search engines from crawling and indexing raw REST API responses.

Regarding your comment: why does DSpace use SSR? And how does Google represent such API responses in the overall search result?
@saschaszott: On further reflection, I don't think the /server disallow rule is necessary. The reality is that the HAL Browser is a JavaScript app. If a bot doesn't understand JavaScript (and almost all do not, especially SEO bots), then it will be impossible for it to "crawl" the REST API. Try turning off JavaScript and using the HAL Browser: it doesn't work. This means that a bot which is only using SSR will never be able to "crawl" the HAL Browser.

I'm also worried that blocking the entire REST API could be problematic, as it could impact SEO. With @pnbecker's help, we already noticed that bitstreams need to be accessible. What about thumbnails, which are also served via the REST API? I'm worried that blindly disallowing the entire /server path could block resources that search engines legitimately need.

So, my opinion is that we should remove the /server disallow rule.
Thank you all for your valuable feedback. I'll close the issue as "won't fix". We'll adapt the changes to robots.txt on our own and see what happens.
For what it's worth, in our repository I have a massive list of datacenter ISP network subnets that get rate limited (implemented in nginx):
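The list itself isn't reproduced in this thread. As a minimal sketch of one common nginx pattern for subnet-based rate limiting, with placeholder TEST-NET ranges standing in for the real subnets:

```nginx
# Flag requests coming from known datacenter subnets (placeholders shown)
geo $datacenter {
    default          0;
    192.0.2.0/24     1;  # placeholder subnet
    198.51.100.0/24  1;  # placeholder subnet
}

# Rate-limit key: the client address for flagged subnets, empty otherwise.
# An empty key means the request is not rate limited at all.
map $datacenter $limit_key {
    0 "";
    1 $binary_remote_addr;
}

limit_req_zone $limit_key zone=datacenters:10m rate=2r/s;

server {
    location / {
        # Flagged clients get at most 2 req/s with a small burst allowance
        limit_req zone=datacenters burst=10 nodelay;
        # ... proxy to the DSpace frontend/backend ...
    }
}
```

The empty-key trick means only the flagged subnets are throttled; all other clients pass through unaffected.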
IPs in these networks have all participated in scraping our repository on a massive scale (hundreds or thousands of IPs concurrently) over the last few years. Some host known bad, malicious, and bad-faith bots (confirmed on AbuseIPDB.com, GreyNoise, and my own logs). It's not enough to rely on user agents or good-faith parsing of robots.txt.
Description
This minor PR extends robots.txt with additional Disallow rules.
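Going by the PR title, the added rules are presumably along these lines (a reconstruction; the exact diff is not shown in this thread):

```
User-agent: *
Disallow: /server/*
Disallow: /app/health
```

The User-agent: * line is assumed here; DSpace's stock robots.txt already ships such a section, which these rules would extend.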