export: entry point for impact graph record export #241
base: prod
Conversation
Awesome! Thanks Gilles!
I deployed it on http://inspireheptest.cern.ch but what is the URL?
date = date[0]
match_obj = re.search(r"\d\d\d\d", date)
if match_obj is not None:
    return int(match_obj.group())
@glouppe you can better retrieve the year using bibrec.earliest_date,
which is maintained by a bibcheck script that looks into these and many more fields.
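A sketch of what that suggestion could look like (earliest_date is assumed here to be a plain date string on the record object; the attribute name comes from the comment above and is not verified against the codebase):

    import re

    def get_year(record):
        """Prefer the curated earliest_date over ad-hoc date parsing."""
        date = getattr(record, "earliest_date", None)  # assumed attribute
        if date:
            match_obj = re.search(r"\d{4}", date)
            if match_obj is not None:
                return int(match_obj.group())
        return None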
Nope @jalavik, unapi is just close by in the patch but is unrelated. The entry point is:
Seems quite heavy at the moment. It takes 17s to fetch the export for a random record, but that seems due to the fact that the JSON is 1MB!
Thanks for the deployment @jalavik! Yes, https://inspireheptest.cern.ch/record/1340769/export/impact is the correct URL. It seems to work as it should :) And yeah, it is slow because 1) the fetchers are not very fast in the first place, 2) it is doing tons of queries at the application level and 3) JSON is awfully verbose :(
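For anyone who wants to reproduce the timing, a minimal fetch of that URL (assumes only the third-party requests package):

    import requests

    resp = requests.get(
        "https://inspireheptest.cern.ch/record/1340769/export/impact")
    resp.raise_for_status()
    impact_graph = resp.json()
    print("fetched %d bytes" % len(resp.content))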
Works well, I’ve just used the URL with the initial publication list with my code and it works nicely! Gilles, I just spoke with Jan; apparently redis is installed on that machine, so we could have the results cached there.
@glouppe Yeah, we store user sessions in redis and maybe even the cite dict, so there already exists an integration. Check out how to use it here: https://github.com/inspirehep/inspire/blob/master/bibtasklets/bst_webcoll_postprocess.py#L26-L39
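For reference, a minimal caching sketch along the lines of the linked bst_webcoll_postprocess.py pattern (the key scheme, the TTL and the way the redis connection is obtained are all illustrative, not taken from the codebase):

    import json

    CACHE_TTL = 3600  # one hour; tune to how often citation data changes

    def get_impact_graph_cached(recid, redis_client, compute_fn):
        """Serve the export from redis when possible, recomputing on a miss."""
        key = "impact_graph:%s" % recid
        cached = redis_client.get(key)
        if cached is not None:
            return json.loads(cached)
        result = compute_fn(recid)
        # setex stores the value with an expiry, which bounds both staleness
        # and how long a rarely-hit 1MB blob lingers in memory.
        redis_client.setex(key, CACHE_TTL, json.dumps(result))
        return result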
Take into consideration that this JSON blob is 1MB, so if someone crawls us and pings all the entries, they will all end up in redis: 1MB × 1M records, gosh. Plus, from within CERN it looks like it's quite fast to serve the given export anyway. So we could probably save some time by serving gzipped JSON (using regular Apache support), which halves the size of the file.
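A quick back-of-the-envelope check of that size claim (the blob below is fabricated but has the repetitive key structure typical of such exports; zlib uses the same DEFLATE algorithm as Apache's mod_deflate):

    import json
    import zlib

    # Fabricated payload with the repetitive keys typical of the export.
    blob = json.dumps([{"recid": i, "year": 2000 + i % 15,
                        "self_citation": False}
                       for i in range(20000)]).encode("utf-8")
    print("raw: %d bytes, deflated: %d bytes"
          % (len(blob), len(zlib.compress(blob))))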
Does it make sense to revive this now that we will include the impact graph in the new design, or has it been superseded by a different technique? @glouppe @eamonnmag @kaplun |
There needs to be an endpoint in the new system to get the data back.
For this to be effective we have to pre-compute a fair amount of information (for every citing and cited paper we need to keep the year of that paper and its self-citation flag with respect to the current paper, and we need to keep the citation count of that paper up to date). Now that I have studied elasticsearch a bit more, I don't think it is feasible for this task: we would end up storing a massive amount of information in every record, and we would need to refresh every record linked to any touched record. It is probably time to revive the idea of a dedicated graph DB for this task: we could register to a signal and, every time a record is touched, propagate the change to the graph DB for that very node only. The only tricky computation is the self-citation flag, since we need to pay attention to whether a given update touches the author list; if it does, we need to recompute all the self-citation flags on all the incoming and outgoing relations.
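A rough sketch of that signal-based propagation (the record-updated signal name, the graph_db client and the helper functions are all hypothetical; only the flow follows the comment above):

    from blinker import signal

    # Hypothetical signal emitted by the ingestion layer on record changes.
    record_updated = signal("record-updated")

    @record_updated.connect
    def propagate_to_graph(sender, recid=None, changed_fields=(), **extra):
        # Touch only the node for the record that changed.
        graph_db.update_node(recid, fetch_node_metadata(recid))
        if "authors" in changed_fields:
            # An author-list change can flip the self-citation flag on every
            # incoming and outgoing citation edge of this node.
            for edge in graph_db.edges_of(recid):
                edge["self_citation"] = authors_overlap(edge["citing"],
                                                        edge["cited"])
                graph_db.save_edge(edge)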
This is meant to be an entry point for exporting the metadata necessary for building the impact graph of a record. Could we try this branch on the test machine before deployment?
CC: @jalavik @eamonnmag
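For readers of this thread, an illustrative Flask-style sketch of the shape of such an entry point (the actual patch targets Invenio's legacy web layer, so the names and plumbing here are assumptions, not the code under review):

    from flask import Flask, jsonify

    app = Flask(__name__)

    def build_impact_graph(recid):
        """Collect citing/cited records with years and self-citation flags."""
        raise NotImplementedError  # provided by the export module in the patch

    @app.route("/record/<int:recid>/export/impact")
    def export_impact(recid):
        # Serialize the pre-assembled metadata as the JSON export.
        return jsonify(build_impact_graph(recid))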