The API `result_count` is no more than 240 for unauthenticated requests #4476

obulat · 2024-06-12T06:53:34Z

obulat
Jun 12, 2024
Maintainer

Description

Previously, search results showed an "Over 10,000 results for ..." label at the top. However, users could view at most 240 results (12 pages x 20 results).

When implementing the additional search views, which have the same page depth and the same maximum number of results, we decided not to change the result_count value returned by the API, but change the result count label in the frontend to say something like "Showing top 240 [image|audio] results for 'cat'".

#4372 changed the result_count to return at most 240 for unauthenticated users. So, without the "showing top ..." label, it might make the users think that Openverse has very few results for all search terms.

The API documentation should also be updated. Currently, it says "Although there may be millions of relevant records, only the most relevant or the most recent several thousand records can be viewed. This is by design: the search endpoint should be used to find the top 10,000 most relevant results, not for exhaustive search or bulk download of every barely relevant result."

Possible solutions

1. Keep returning 240 as result_count for all searches that have 240 or more results, and update the labels and the API documentation. We can re-use the label that we've always had (Over x results for query), but reduce the x to 240. Or we can change the label to "Show top 240 results for ..."
2. Revert the change to return 10,000 as result_count if there are 10,000 or more results, and keep "Over 10,000 results".

My opinion is that we should keep the 240, as that is the de facto maximum number of results that we return, but we should update the code to say "Over 240 results". Currently, we only add "Over ..." when the result_count is above 10,000 (which never happens with the changes from #4372).
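A minimal sketch of the label rule proposed above, assuming the label switches to "Over ..." as soon as `result_count` reaches the cap. The helper name `result_label` and the constant `MAX_UNAUTHENTICATED_RESULTS` are hypothetical and are not taken from the Openverse frontend.

```python
# Hypothetical constant: 12 pages x 20 results, as stated in the description.
MAX_UNAUTHENTICATED_RESULTS = 240


def result_label(
    result_count: int, query: str, cap: int = MAX_UNAUTHENTICATED_RESULTS
) -> str:
    """Build a result-count label (illustrative helper, not the frontend's code)."""
    if result_count >= cap:
        # The API never reports more than the cap, so reaching it means
        # "at least this many", not an exact total.
        return f"Over {cap:,} results for {query}"
    noun = "result" if result_count == 1 else "results"
    return f"{result_count:,} {noun} for {query}"
```

One caveat with this rule: a query with exactly 240 matches would also be labelled "Over 240", which may be a reason to prefer the "top 240" wording.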

Initial note by @obulat from the original issue:

I think this was unintentional because we never discussed reducing the shown result_count for the API results. It is tricky since both 240 and 10000 are confusing: an unauthenticated user will only get at most 240 results. However, I think we wanted to always show that we do have the results, but that we are not showing all of them due to the restrictions related to the API performance (to prevent scraping).

sarayourfriend · 2024-06-12T07:50:42Z

sarayourfriend
Jun 12, 2024
Collaborator

This was intentional. We have other places to find stats about how many results we have. Why expose a different non-specific number to a user? 10000 is even more obscure: it doesn't tell the user anything other than that we have a bunch of works, but they can't access them. For a scraper, maybe it's even an indication that they should crawl the tags of each work or something to try to uncover all those extra works behind the pagination barrier. It was even worse because we also showed page_count to match the useless 10000 results. If we showed 10k and a page count based on that, then for someone using the API programmatically, the only way to know that a query was exhausted was to make a bunch of requests until the API suddenly decided they weren't allowed anymore and sent them a 401. That's absurd. Why not just say the limit? It's the real limit, for that user, at that instance.

Both are artificial barriers. 240 and an accurate page count based on that at least indicate how many real pages the user could request. It means something to API consumers. They can predict how many pages of results will exist for a query (e.g., for a frontend that wanted to show this information... maybe even ours?).
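To illustrate that point, here is a rough sketch of how a consumer could derive the pages worth requesting from a capped `result_count`, instead of paging until the API answers with a 401. The function names and the default page size of 20 (from the "12 pages x 20 results" figure in the description) are assumptions for illustration.

```python
import math


def page_count(result_count: int, page_size: int = 20) -> int:
    """Number of pages needed to show result_count results."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    return math.ceil(result_count / page_size)


def pages_to_request(result_count: int, page_size: int = 20) -> range:
    """1-based page numbers a client can expect to exist for a query."""
    return range(1, page_count(result_count, page_size) + 1)
```

With `result_count` capped at 240, `page_count(240)` gives 12, which matches the depth an unauthenticated user can actually reach.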

10000 doesn't do that. And it's still just as abstract/artificial as 240, and essentially an arbitrary limit (each responding to different problems being solved). Of the two, 240 (or a different value, if authenticated) is the only one with any real meaning.

I don't believe this is an issue and recommend closing it.

> we are not showing all of them due to the restrictions related to the API performance (to prevent scraping)

To clarify, these are separate issues. Scraping can hurt API performance, but the primary motivation to prevent scraping is to prevent scraping. It is against our ToS. Just want to clarify that, for example, we wouldn't undo this pagination limit just because we could handle the performance of it.

0 replies

sarayourfriend · 2024-06-12T21:04:10Z

sarayourfriend
Jun 12, 2024
Collaborator

> The API documentation should also be updated

For sure 👍

It didn't occur to me that this would change how the frontend presents works. I think the frontend should just say "top 240" as you suggested. "Over 10k" was already vague. You couldn't even see half that number. If we want to expose the actual number of works that match a search, then we could start including hits somewhere in the response (maybe as a header, X-Openverse-Query-ES-Hits?), but I don't think it should be represented in the body, right next to result_count, where there's little to no way to disambiguate the two. Even with documentation, it would still be confusing to know which one to care about, and whether people read documentation that deeply is a genuine concern.
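If the header idea were adopted, a client could read the total separately from the body's `result_count`. The sketch below assumes the header name floated above; that header is only a suggestion in this thread and nothing here confirms the API sends it. `total_hits` is a hypothetical helper.

```python
from typing import Mapping, Optional

# Hypothetical header: only proposed in this discussion.
HITS_HEADER = "X-Openverse-Query-ES-Hits"


def total_hits(headers: Mapping[str, str]) -> Optional[int]:
    """Return the total hit count from response headers, or None if absent or malformed."""
    for name, value in headers.items():
        # HTTP header names are case-insensitive, so compare in lowercase.
        if name.lower() == HITS_HEADER.lower():
            try:
                return int(value)
            except ValueError:
                return None
    return None
```

Keeping the total out of the body means existing consumers that only read `result_count` see no change.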

2 replies

sarayourfriend Jun 18, 2024
Collaborator

I'll have a PR up in a moment to change the frontend to use "Top 240" (and so on).

@WordPress/openverse-frontend input would be appreciated.

zackkrida Jun 18, 2024
Collaborator

#4509
