Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Faster pprint #1145

Closed

Conversation

atsuoishimoto
Copy link

@atsuoishimoto atsuoishimoto commented Jul 28, 2023

jsonschema can sometimes be very slow when validation errors occur. This is due to the slow formatting of large schemas or data with pprint.pformat(). In this PR, I propose replacing pprint.pformat() with json.dumps(), which is expected to perform more than 10 times faster.

The following samples measure the time taken to format 300 GitHub PRs.

pprint.pformat()

$ python3 -m timeit -s "import requests, pprint;L=requests.get('https://api.github.com/repos/python/cpython/pulls?state=all').json() * 100" "pprint.pformat(L)"
1 loop, best of 5: 14.2 sec per loop

json.dumps()

$ python3 -m timeit -s "import requests, json;L=requests.get('https://api.github.com/repos/python/cpython/pulls?state=all').json() * 100" "json.dumps(L, ensure_ascii=False, indent=2, default=repr)"
1 loop, best of 5: 1.19 sec per loop

(Memo: If it's possible to add as a dependency, using orjson would make it more than 10 times faster.)

python3 -m timeit -s "import requests, orjson;L=requests.get('https://api.github.com/repos/python/cpython/pulls?state=all').json() * 100" "orjson.dumps(L, option=orjson.OPT_INDENT_2|orjson.OPT_SORT_KEYS, default=repr)"
5 loops, best of 5: 79.2 msec per loop

json.dumps() does not perform wordwrapping or other formatting pformat() does. However, such formatting can insert unnecessary newline characters into the data, sometimes making debugging hard. I believe it's more appropriate not to perform such modifications.


📚 Documentation preview 📚: https://python-jsonschema--1145.org.readthedocs.build/en/1145/

@atsuoishimoto atsuoishimoto temporarily deployed to PyPI July 28, 2023 07:15 — with GitHub Actions Inactive
@Julian
Copy link
Member

Julian commented Jul 28, 2023

Hi there, thanks for the PR.

jsonschema can sometimes be very slow when validation errors occur. This is due to the slow formatting of large schemas or data with pprint.pformat().

This sounds slightly odd -- exceptions are only formatted when they're printed (or otherwise when repr ends up getting called) -- your application repeatedly prints large numbers of exceptions coming from jsonschema?

Regardless though, this PR likely can't go forward, because it isn't the case that exception contents will be JSON serializable! The instance or schema may have been deserialized using a custom JSONDecoder which the library won't know about, or may not have originated in JSON ini the first place!

@atsuoishimoto
Copy link
Author

Hi, thank you for comment!

your application repeatedly prints large numbers of exceptions coming from jsonschema?

Yes, we output errors as log entries. Generally, they don't cause harm, but they can be time-consuming when outputting a JSON with numerous elements involved.

Regardless though, this PR likely can't go forward, because it isn't the case that exception contents will be JSON serializable!

Agreed. So the default=repr option will handle such non-serializable objects. This is also the default behavior when pprint outputs user-defined objects.

@Julian
Copy link
Member

Julian commented Jul 28, 2023

I still wouldn't want to use json.dumps, as it introduces the possibility that a schema is invalid for some instance before dumping but appears to be valid afterwards. This is the case if e.g. as mentioned the user used a custom deserializer, but also in cases like coercing numeric keys (properties) to strings.

Is there a reason in your logs you don't simply format exceptions in whatever way you'd like by pulling off whatever attributes you're interested in?

You might also be interested in #243 which is about making this precise function (pformat) injectable, and would allow you to inject json.dumps if you so preferred, without touching the internals. That'd be more like a solution I'd be comfortable with.

@atsuoishimoto
Copy link
Author

I still wouldn't want to use json.dumps, as it introduces the possibility that a schema is invalid for some instance before dumping but appears to be valid afterwards. This is the case if e.g. as mentioned the user used a custom deserializer, but also in cases like coercing numeric keys (properties) to strings.

Points taken.

Is there a reason in your logs you don't simply format exceptions in whatever way you'd like by pulling off whatever attributes you're interested in?

No, we can format our log ourselves. But I thought using json.dumps() is a reasonable default to improve jsonschema.

I'm closing the PR. Thank you!

@Julian
Copy link
Member

Julian commented Jul 28, 2023

Fair enough, I appreciate the offer(s) regardless! Thanks for giving it a shot.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants