This repository has been archived by the owner on Sep 20, 2021. It is now read-only.

export: entry point for impact graph record export #241

Status: Open. 1 commit to be merged into base: prod.

@glouppe

glouppe commented Jun 23, 2015

This is meant to be an entry point for exporting the metadata necessary for building the impact graph of a record. Could we try this branch on the test machine before deployment?

CC: @jalavik @eamonnmag

@eamonnmag

Awesome! Thanks Gilles!

@jalavik

jalavik commented Jun 23, 2015

I deployed it on http://inspireheptest.cern.ch but what is the URL? /unapi/impact?recid=1340812 does not work.

@kaplun

date = date[0]
# Raw string so that \d reaches the regex engine unescaped.
match_obj = re.search(r"\d\d\d\d", date)
if match_obj is not None:
    return int(match_obj.group())

@glouppe you could retrieve the year more reliably from bibrec.earliest_date, which is maintained by a bibcheck script that looks at these and many more fields.

@kaplun

kaplun commented Jun 23, 2015

Nope @jalavik, unapi just happens to be close by in the patch but is unrelated. The entry point is:
http://inspireheptest.cern.ch/record/.../export/impact

@kaplun

kaplun commented Jun 23, 2015

Seems quite heavy at the moment. It takes 17s to fetch a random record, but that seems to be because the JSON is 1MB!
https://inspireheptest.cern.ch/record/1340769/export/impact

@glouppe
Author

glouppe commented Jun 23, 2015

Thanks for the deployment @jalavik !

Yes, https://inspireheptest.cern.ch/record/1340769/export/impact is the correct url. It seems to work as it should :)

And yeah, it is slow because 1) the fetchers are not very fast in the first place, 2) it is doing tons of queries at the application level and 3) JSON is awfully uncompact :(

@eamonnmag

Works well, I’ve just used the URL with the initial publication list in my code and it works nicely! Gilles, I just spoke with Jan: apparently redis is installed on that machine, so we could have the results cached there.

@jalavik

jalavik commented Jun 24, 2015

@glouppe Yeah, we store user sessions in redis and maybe even the cite dict, so an integration already exists. Check out how to use it here: https://github.com/inspirehep/inspire/blob/master/bibtasklets/bst_webcoll_postprocess.py#L26-L39
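
A minimal sketch of the caching idea discussed here, not the existing INSPIRE integration linked above. It assumes a client object exposing `get(key)` and `setex(key, seconds, value)`, which is the call shape of redis-py's `Redis` class (3.0 and later). The key format, the TTL, the `compute` callback and the `InMemoryCache` stand-in are all hypothetical names introduced for illustration.

```python
import json


class InMemoryCache(object):
    """Hypothetical stand-in with the same get/setex shape as a redis
    client, so the sketch runs without a server. Expiry is ignored."""

    def __init__(self):
        self._store = {}

    def get(self, key):
        return self._store.get(key)

    def setex(self, key, seconds, value):
        self._store[key] = value


def cached_impact_export(client, recid, compute, ttl_seconds=3600):
    """Return the impact export for recid, computing it only on a cache miss.

    compute is a callable taking a recid and returning JSON-serializable
    data (an assumption: the real export code is not shown in this thread).
    """
    key = "impact-export:%d" % recid  # hypothetical key format
    cached = client.get(key)
    if cached is not None:
        return json.loads(cached)
    data = compute(recid)
    # Store with an expiry so that rarely requested records age out.
    client.setex(key, ttl_seconds, json.dumps(data))
    return data
```

With a real deployment one would pass a redis-py client instead of `InMemoryCache()`. The expiry matters because each cached export is large, so entries should not accumulate indefinitely.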

@kaplun

kaplun commented Jun 24, 2015

Take into consideration that this JSON blob is 1MB, so if someone crawls us and pings all the entries, they will all end up in redis: 1MB * 1M records, i.e. about 1TB, gosh. Plus, from within CERN the export looks quite fast to serve. So we could probably save some time by serving gzipped JSON (using regular Apache support), which halves the size of the file.
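
To get a feel for the gzip saving outside Apache, here is a rough sketch using only the Python standard library. The payload is made up (the field names `recid`, `year` and `self_citation` are assumptions, not the actual export schema), so the ratio it prints says nothing about the real 1MB export; in production the compression would be done by the web server, as suggested above.

```python
import gzip
import json


def gzip_json(payload):
    """Serialize payload to compact JSON and gzip it.

    Returns a (raw_bytes, compressed_bytes) pair so the sizes can be compared.
    """
    raw = json.dumps(payload, separators=(",", ":")).encode("utf-8")
    return raw, gzip.compress(raw)


# Made-up payload: many small, similarly shaped entries.
sample = {
    "citations": [
        {"recid": 1000000 + i, "year": 1990 + i % 25, "self_citation": False}
        for i in range(5000)
    ]
}

raw, compressed = gzip_json(sample)
print("raw: %d bytes, gzipped: %d bytes" % (len(raw), len(compressed)))
```

Repetitive JSON keys compress well, which is why the transfer size drops even though the JSON itself stays verbose.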

@jalavik

jalavik commented Oct 22, 2015

Does it make sense to revive this now that we will include the impact graph in the new design, or has it been superseded by a different technique? @glouppe @eamonnmag @kaplun

@eamonnmag

There needs to be an endpoint in the new system to get the data back to
build the graph. However it will be similar to what is already there for
the citations and references.

@kaplun

kaplun commented Oct 23, 2015

For this to be effective we have to pre-compute quite a lot of information (for every citing and cited paper we need to keep the year of that paper and its self-citation flag with respect to the current paper, and we need to keep that paper's citation count up to date).

Now that I have studied elasticsearch a bit more, I don't think it is feasible for the task (we would end up storing a massive amount of information in every record, and we would need to refresh every record linked to any touched record).

It is probably time to revive the idea of using a dedicated graph DB for this task (basically we could register to a signal, fired every time a record is touched, and propagate the change to the graph DB only for that very node. The only tricky computation is the self-citation flag: when an update changes the author list, we need to recompute the self-citation flags on all the incoming and outgoing relations.)
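
The propagation rule described above can be sketched with a toy in-memory graph. This is only an illustration under stated assumptions: `CitationGraph` and `on_record_touched` are hypothetical names, no particular graph DB or signal framework is implied, and a self-citation is assumed to mean that the citing and cited papers share at least one author.

```python
class CitationGraph(object):
    """Toy in-memory stand-in for a graph DB of records and citations."""

    def __init__(self):
        self.authors = {}  # recid -> set of author names
        self.edges = {}    # (citing recid, cited recid) -> self-citation flag

    def _flag(self, citing, cited):
        # Assumed definition: the two papers share at least one author.
        shared = self.authors.get(citing, set()) & self.authors.get(cited, set())
        return bool(shared)

    def add_citation(self, citing, cited):
        self.edges[(citing, cited)] = self._flag(citing, cited)

    def on_record_touched(self, recid, authors):
        """What a 'record touched' signal handler might do.

        Updates only the touched node. Flags are recomputed only when the
        author list changed, and only on relations entering or leaving
        that node. Returns the number of relations recomputed.
        """
        new_authors = set(authors)
        if self.authors.get(recid) == new_authors:
            return 0
        self.authors[recid] = new_authors
        recomputed = 0
        # Linear scan for brevity; a graph DB would follow the node's edges.
        for citing, cited in list(self.edges):
            if recid in (citing, cited):
                self.edges[(citing, cited)] = self._flag(citing, cited)
                recomputed += 1
        return recomputed
```

For example, if record 2 is cited by records 1 and 3 and gains an author it shares with record 1, only those two relations are revisited, and only the one from record 1 flips to a self-citation.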
