Add cache status to objects #125

Open
ale-de-vries opened this issue Sep 23, 2019 · 3 comments

Comments

@ale-de-vries
Contributor

ale-de-vries commented Sep 23, 2019

For any of the data entities (i.e. AuthorRetrieval, ContentAffiliationRetrieval, AbstractRetrieval, and conceivably also the search types), it would be helpful to include a property/method that indicates whether a local data cache already exists for that entity and, if so, how old it is. This lets a script check whether the data needs to be fetched or refreshed from the REST endpoint, which in turn can be used to apply throttling when needed.

Background:
Note that the Scopus API endpoints enforce throttling; any requests that exceed the default requests-per-second limit will fail. Also, any client that continuously exceeds the throttling limits risks having its API key suspended. This means that the client needs to monitor and control the rate at which it calls the API to avoid such failed requests, e.g. by including a timeout (`sleep`) when looping over API calls.
The challenge is that this timeout is not necessary when initiating a retrieval/search object for which a cache already exists, because for such cached objects the API call isn't made. In fact, the timeout would be unhelpful: looping with a timeout over a series of cached objects means that initiating those objects takes longer than needed, unnecessarily increasing program run time.
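
A minimal sketch of the script-side pattern described above, assuming a hypothetical cache layout in which each entity is stored as one file named after its identifier. `fetch_all`, `cache_dir`, and `fetch` are illustrative names, not part of pybliometrics:

```python
import time
from pathlib import Path


def fetch_all(ids, cache_dir, fetch, delay=0.5, sleep=time.sleep):
    """Call fetch(identifier) for each id, pausing only after uncached ones.

    Assumption: a cached entity is a file named after its identifier inside
    cache_dir. The real cache layout may differ.
    """
    results = {}
    for identifier in ids:
        # Check before fetching, because fetching may create the cache file.
        was_cached = (Path(cache_dir) / str(identifier)).exists()
        results[identifier] = fetch(identifier)
        if not was_cached:
            # Only calls that actually hit the API need to be spaced out.
            sleep(delay)
    return results
```

With such a check, a loop over already-cached entities runs without any delay, while uncached entities are still spaced out.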

(A more elegant approach would be for pybliometrics to enforce throttling, e.g. by building a timeout into the get_content.py module. That, however, requires the module to persist the timestamp of the last request made to api.elsevier.com one way or another, which isn't trivial, as the timestamp needs to be either persisted on disk or maintained in memory, like the elsapy library does.)
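
For illustration, the in-memory variant could look roughly like the sketch below. `Throttle` and `min_interval` are hypothetical names; this is not how get_content.py or elsapy is actually implemented.

```python
import time


class Throttle:
    """Keep the time of the last request in memory and space out calls."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # minimum seconds between requests
        self._clock = clock
        self._sleep = sleep
        self._last = None  # no request has been made yet

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `wait()` immediately before every real request would enforce the spacing, but only within a single Python process; sharing the timestamp across processes would need the on-disk variant.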

@Michael-E-Rose
Contributor

Hi @ale-de-vries, and thanks so much for this issue. You raise many connected issues, all of which are worth thinking about!

I respond in reverse order:

  1. We cannot enforce throttling at the global level of pybliometrics (between different queries) without a lot of changes to the backend. But we can easily slow down requests within one query. A colleague of mine actually experimented with this once in an effort to reduce the number of broken requests and missing data in one query, but to no avail. Still, if it should help in principle, let's do it.
  2. I have long thought about adding a property to all classes that tells the user when the file was last cached (i.e. created or modified). Doing so requires a new base class from which both the Search() and the Retrieval() classes inherit. Getting the modified timestamp via os is easy.
  3. Using the timestamp from 2., I plan to adapt the refresh parameters slightly. Users will be able to provide either a boolean or an integer. The integer will be interpreted as the maximum age of the cache in days. If the file is older than the provided value, pybliometrics refreshes the file.
  4. Given these, I don't really see the point of having a property that tells the user whether the file has been cached or not. For one, there is the download parameter in the search classes. If it's set to False and the file exists, the relevant parameters are still filled. So that's how users see whether the file exists. Second, I don't see a use case for having information on the cache status if it's not True. That is, why would someone be interested in knowing whether the cached file is already there and then decide not to retrieve the corresponding information? Of course, I am open to discussion here.
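
Points 2 and 3 could be sketched as follows, using only the standard library. `cache_age_days` and `needs_refresh` are hypothetical helper names, not the pybliometrics API, and the rule that a missing file always triggers a download is an assumption:

```python
import os
import time

SECONDS_PER_DAY = 86400


def cache_age_days(path, now=None):
    """Return the age of the cached file in days, or None if it is missing."""
    try:
        modified = os.path.getmtime(path)
    except OSError:
        return None
    now = time.time() if now is None else now
    return (now - modified) / SECONDS_PER_DAY


def needs_refresh(path, refresh=False, now=None):
    """Decide whether to download, given a bool or a maximum age in days."""
    age = cache_age_days(path, now)
    if age is None:
        return True  # assumption: nothing cached yet, so download
    # bool is a subclass of int in Python, so test for it first.
    if isinstance(refresh, bool):
        return refresh
    return age > refresh
```

For example, `needs_refresh(path, refresh=30)` would trigger a new download only if the cached file is older than 30 days.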

@Michael-E-Rose
Contributor

With fde4a8c, any pybliometrics class can show how old the cached file is. That's certainly a good step in the right direction.

@Michael-E-Rose
Contributor

Throttling is implemented in e32c349.
