Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Large numbers of HTTP requests for the same JSON-LD context due to lack of caching #292

Closed
jakubklimek opened this issue Dec 1, 2023 · 6 comments · Fixed by #304
Closed
Labels
enhancement New feature or request
Milestone

Comments

@jakubklimek
Copy link

jakubklimek commented Dec 1, 2023

Describe the bug
When parsing a JSON-LD file (JSON-LD to RDF), already loaded external contexts are not cached. In a situation where there is a hierarchy of external contexts referencing each other, and used multiple times in a file, an HTTP request is made every time such a JSON-LD context is required. This results to a large number of identical HTTP requests, flooding the server hosting the contexts, triggering rate limiting, GOAWAY messages, etc., not to mention performance issues.

To Reproduce
Try transforming inzeraty.json to RDF. It contains 36 job advertisements, reusing other JSON-LD contexts.

Expected behavior
Each required context is loaded only once per transformation. Subsequent usages of that context should use the already available copy.
E.g. the JSON-LD playground works this way and transforms the file immediately with 17 HTTP requests for contexts, some of them also duplicate.

Additional context
Originally, the server hosting the contexts, running nginx, had the default keepalive_requests 1000;, which resulted in GOAWAY in approx. 13 seconds of loading the file. I have raised the limit to 10000, which resulted in GOAWAY in 5 minutes. Only after I raised it to 100000, in 13 minutes, the file got processed.

@skodapetr
Copy link
Contributor

It seems the issue is a lack of caching in ActiveContextBuilder line 244.
On this line new document is requested every time without checking the cache.
There are cache mechanisms in JsonLdOptions, yet they are not used here.

I see several ways to tackle this issue:

  1. Introduce a custom HttpLoader with another cache. We may be able to do this outside the
    titanium-json-ld code base.
  2. Add caching to ActiveContextBuilder.
    2.1) We can leverage existing contextCache as we are working with the context here anyway.
    The main issue is that contextCache stores JsonValue instead of JsonStructure.
    2.2) We can add a third cache importedDocumentCache (to JsonLdOptions) to the existing contextCache, documentCache.
    That would not only allow to use of proper type but also when left empty (null) would not change
    existing behaviour.
    2.3) We can use documentCache, but I would rather avoid it. Using documentCache
    would lead to unnecessary parsing of JSON harming performance. This would be equivalent to 1) but in titanium-json-ld code base.

@filip26 Can I get your thoughts, is there a solution you would prefer?

@filip26
Copy link
Owner

filip26 commented Dec 18, 2023

HTTP cache must be fully handled by provided HttpLoader implementation. e.g. native Java 11 HttpClient. I'm not in favor to start implementing another HTTP cache as a part of the processor.

How to cache parsed and pre-processed context is another thing. In the next major version I would like to see a separate method allowing you to pre-fetch/process a context separatelly.

Something like

var context = JsonLd.getContent(....);
var expanded =  JsonLd.expand(input, context);

So you would be able to re-use a context of a set contexts between calls.

@skodapetr
Copy link
Contributor

So that takes 1), and 2.3) out of the question.
From my perspective the difference between 2.1 and 2.2 is backwards compatibility.

As of the

Jsonld.getContext( .... );

That is a neat idea. Yet, I do not see how this would address the issue. I would need to know all contexts used throughout the document tree, which is often not the case.

I see that you do aim on the high-level API (JsonLd object). As I should be able to use re-use cache by sharing the JsonLdOptions between the calls. Is that correct? If so, we are back to 2.1 or 2.2.

skodapetr added a commit to skodapetr/titanium-json-ld that referenced this issue Dec 29, 2023
@filip26
Copy link
Owner

filip26 commented Jan 24, 2024

@jakubklimek can we close this issue as a shareable cache has been delivered by @skodapetr ?

@jakubklimek
Copy link
Author

@filip26 I would prefer to close it when the resolution makes its way to a release and we can test it in our tool. But I will understand if you want to close it based on the fact that the corresponding test is passing. We can always raise another issue later.

@filip26 filip26 linked a pull request Jan 24, 2024 that will close this issue
@filip26
Copy link
Owner

filip26 commented Jan 24, 2024

@jakubklimek great, it's easier link an issue to PR and close it together, to keep it trackable, instead of waiting on a release and then do the closing manually. It's part of 1.4.0-SNAPSHOT now.

Please feel free to re-open or raise a new issue.

@filip26 filip26 closed this as completed Jan 24, 2024
@filip26 filip26 added this to the 1.4.0 milestone Jan 24, 2024
@filip26 filip26 added the enhancement New feature or request label Jan 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants