
Poor performance for String serialization #2645

Open
kilink opened this issue Mar 11, 2024 · 8 comments

@kilink

kilink commented Mar 11, 2024

Gson version

Gson 2.10.1

Java / Android version

Happening on JDK 21, but presumably this is an issue for any JDK, as the relevant code hasn't changed in 10+ years.

Description

Serializing a String to JSON using toJson can result in excessive copying of the internal StringBuffer array for certain inputs, in particular for Strings with a length >= 33. This all boils down to how the internal StringBuffer used by StringWriter is initialized and how it resizes: it starts with a capacity of 16, then grows to double the previous capacity plus 2 (or to the size being written, if that is larger). If the capacity is increased only just enough to accommodate the String being written, the final call to write the closing double quote triggers yet another resize.

Expected behavior

Avoid excessive allocations / copying as much as possible. JsonWriter's string method could check if the Writer is a StringWriter, and call ensureCapacity on its StringBuffer, to avoid the excessive resizing.

--- a/gson/src/main/java/com/google/gson/stream/JsonWriter.java
+++ b/gson/src/main/java/com/google/gson/stream/JsonWriter.java
@@ -32,6 +32,7 @@ import com.google.gson.Strictness;
 import java.io.Closeable;
 import java.io.Flushable;
 import java.io.IOException;
+import java.io.StringWriter;
 import java.io.Writer;
 import java.math.BigDecimal;
 import java.math.BigInteger;
@@ -734,6 +735,9 @@ public class JsonWriter implements Closeable, Flushable {
   }
 
   private void string(String value) throws IOException {
+    if (out instanceof StringWriter) {
+      ((StringWriter) out).getBuffer().ensureCapacity(value.length() + 2);
+    }
     String[] replacements = htmlSafe ? HTML_SAFE_REPLACEMENT_CHARS : REPLACEMENT_CHARS;
     out.write('\"');
     int last = 0;

Obviously the above really only works for Strings that don't require escaping, but that may even be sufficient for most scenarios.

Actual behavior

Reproduction steps

It's easier to look at a trivial case where the input being serialized does not require any escaping:

import com.google.common.base.Strings;
import com.google.gson.Gson;
import java.io.StringWriter;

public class Repro {
  public static void main(String[] args) {
    Gson gson = new Gson();
    StringWriter writer = new StringWriter();
    gson.toJson(Strings.repeat("A", 33), writer);
    System.out.println(writer.getBuffer().capacity()); // prints 70
    System.out.println(writer.getBuffer().length()); // prints 35
  }
}

The above results in two resizes of the internal array (16 -> 34 -> 70), the last of which is the most excessive: the capacity jumps from 34 to 70 just to accommodate the single closing double quote.

@kilink kilink added the bug label Mar 11, 2024
@Marcono1234
Collaborator

Marcono1234 commented Mar 17, 2024

This suggestion sounds reasonable, but is this rather a theoretical issue, or something you identified as a performance issue when using Gson?

I assume this change would mainly help when writing large strings at the beginning of the JSON data. In the other cases it would not change much:

  • When writing small strings at the beginning, using ensureCapacity might have effects similar to the implicit resizing done by append
  • Writing large strings later in the JSON data might not trigger any resizing from ensureCapacity, since the buffer is already large enough
  • The perfectly sized buffer shown in your "Reproduction steps" would probably only matter for top-level string values (which don't require escaping); for non-top-level string values, subsequent values or the closing ] or } would still cause resizing

But I am not completely sure. It seems the main advantages of your proposed change would be:

  • less buffer resizing for large strings at the beginning of the JSON data
  • potentially perfectly sized buffers for top-level string values

It would probably also make sense to check not only for StringWriter but also for com.google.gson.internal.Streams.AppendableWriter, in case the user directly provided a StringBuilder or StringBuffer to Gson.toJson(Object, Appendable).
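
A minimal sketch of that second path (class name and input are illustrative): a StringBuilder passed to Gson.toJson(Object, Appendable) is wrapped in the internal Streams.AppendableWriter, so an instanceof StringWriter check alone would never see it.

import com.google.gson.Gson;

public class AppendableTargetDemo {
  public static void main(String[] args) {
    Gson gson = new Gson();
    // A StringBuilder target goes through Gson's internal Streams.AppendableWriter,
    // not StringWriter, so the proposed check would need to handle this path as well.
    StringBuilder output = new StringBuilder(); // same default capacity of 16
    gson.toJson("A".repeat(33), output);
    System.out.println(output); // "AAA...A" (35 chars including the surrounding quotes)
  }
}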

@kilink
Author

kilink commented Sep 11, 2024

This suggestion sounds reasonable, but is this rather a theoretical issue, or something you identified as a performance issue when using Gson?

Sorry for the delayed response. This is not a theoretical issue; it's a case we hit in an actual production app using Gson. I worked around it by supplying an explicit StringWriter that was sized appropriately.

I assume this change would mainly help when writing large strings at the beginning of the JSON data.

That sounds about right; if I recall correctly, we were serializing lots of top-level Strings.

Would probably also make sense to check not only for StringWriter but also for com.google.gson.internal.Streams.AppendableWriter then, in case the user provided directly a StringBuilder or StringBuffer to Gson.toJson(Object, Appendable).

That sounds like a good idea.

@Marcono1234
Collaborator

Thanks for the additional information!

To me it seems such a change for Gson would only help for this specific use case [1], and for other use cases it would either have no effect, or might even negatively impact performance due to more resizing being necessary?
Gson's JsonWriter#string(String) method could maybe apply this pre-sizing only if no JSON data has been written yet and the value is therefore at top level; however, that still assumes that no escaping will be necessary.

Maybe for these use cases it would be better if the user (in this case you) pre-sizes the StringWriter / StringBuilder since the user has more knowledge about the use case than Gson can assume or infer.
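
As a rough illustration of that caller-side pre-sizing (the initial size below is only a made-up estimate, not a recommendation):

import com.google.gson.Gson;
import java.io.StringWriter;

public class CallerPreSizingDemo {
  public static void main(String[] args) {
    Gson gson = new Gson();
    // Size the StringWriter's buffer up front using the caller's knowledge of the
    // payload; 8 * 1024 is purely illustrative.
    StringWriter writer = new StringWriter(8 * 1024);
    gson.toJson("A".repeat(33), writer);
    System.out.println(writer.toString().length()); // 35, with no intermediate resizing
  }
}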

What do you think?
Also @eamonnmcmanus what do you think?

Footnotes

  1. As mentioned above, writing a top-level string without any characters which need escaping. Otherwise during writing the StringWriter will have to increase its capacity anyway.

@kilink
Author

kilink commented Sep 15, 2024

To me it seems such a change for Gson would only help for this specific use case [1]

Actually, this issue isn't confined to top-level String values. It can be reproduced with the following input as well:

gson.toJson(Map.of(Strings.repeat("A", 33), "value"));

It just seems like a problem that can be triggered by any large String being written near the start of the serialized output.

and for other use cases it would either have no effect, or might even negatively impact performance due to more resizing being necessary?

How would it negatively impact performance? The ensureCapacity call does nothing if the buffer already has sufficient capacity. The one scenario I can imagine is if the String contains a lot of characters that need to be escaped; that could probably be addressed by padding the size asked for in ensureCapacity by more than the 2 needed for the start / end quotes. It may also be fine to ignore since I don't think the behavior in that scenario would be any worse than what it is today, would it?

@eamonnmcmanus
Member

I think if we were going to do anything here it would be something a bit more general. When we are about to write a string of length $n$, we know we are going to add at least $n + 2$ characters to the StringWriter. For large $n$, that could lead to several doublings of the buffer, each requiring a copy. We could therefore preemptively call ensureCapacity to jump straight to the final capacity we know we're going to reach eventually anyway. For example, if you are writing a string of 100,000 characters, you would immediately resize to 147,454 rather than reaching that capacity via 13 doublings starting from the initial 16, and you might end up copying just a few characters rather than 147,412. (Here I'm using "doublings" as shorthand for the $2n + 2$ growth that AbstractStringBuilder uses.)
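
For what it's worth, the doubling arithmetic above can be replayed in a few lines (a sketch of the $2n + 2$ growth rule only, assuming the buffer is full at every resize, which is where the 147,412 upper bound comes from):

public class DoublingArithmetic {
  public static void main(String[] args) {
    int capacity = 16;        // StringWriter's StringBuffer starts at 16
    int needed = 100_000 + 2; // a 100,000-char string plus its two quotes
    int doublings = 0;
    int copied = 0;           // chars copied if the buffer were full at every resize
    while (capacity < needed) {
      copied += capacity;
      capacity = capacity * 2 + 2; // the "2n + 2" growth AbstractStringBuilder uses
      doublings++;
    }
    // Prints: 13 doublings, final capacity 147454, up to 147412 chars copied
    System.out.println(doublings + " doublings, final capacity " + capacity
        + ", up to " + copied + " chars copied");
  }
}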

However I feel that giant strings are probably something of a niche use case, and I'm somewhat unhappy with the idea of not only reaching into the StringBuffer behind a StringWriter but also hardcoding knowledge of the unspecified resizing algorithm of its JDK implementation.

What you could do instead would be to subclass JsonWriter and override its value(String) method so that it does something like the above before calling super.value(value). Then you can serialize with Gson.toJson(Object, Type, JsonWriter).
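
A rough sketch of that subclassing approach, under the assumption that the output goes to a StringWriter (names are illustrative, and escape sequences may still force additional growth):

import com.google.gson.Gson;
import com.google.gson.stream.JsonWriter;
import java.io.IOException;
import java.io.StringWriter;

// Illustrative subclass: pre-size the StringWriter's buffer before each string value,
// then delegate to the normal JsonWriter behavior.
class PreSizingJsonWriter extends JsonWriter {
  private final StringWriter stringWriter;

  PreSizingJsonWriter(StringWriter out) {
    super(out);
    this.stringWriter = out;
  }

  @Override
  public JsonWriter value(String value) throws IOException {
    if (value != null) {
      StringBuffer buffer = stringWriter.getBuffer();
      // Reserve room for the string plus its surrounding quotes.
      buffer.ensureCapacity(buffer.length() + value.length() + 2);
    }
    return super.value(value);
  }
}

public class PreSizingJsonWriterDemo {
  public static void main(String[] args) {
    StringWriter out = new StringWriter();
    JsonWriter jsonWriter = new PreSizingJsonWriter(out);
    new Gson().toJson("A".repeat(100_000), String.class, jsonWriter);
    // The buffer is sized once up front; without the override it would end up
    // roughly twice as large for this input.
    System.out.println(out.getBuffer().capacity());
  }
}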

@kilink
Author

kilink commented Sep 17, 2024

However I feel that giant strings are probably something of a niche use case, and I'm somewhat unhappy with the idea of not only reaching into the StringBuffer behind a StringWriter but also hardcoding knowledge of the unspecified resizing algorithm of its JDK implementation.

Sure, I probably shouldn't have included that example workaround, as I don't care much how it's solved. I think it's worthwhile to have the discussion though about the default behavior here. A lot of this just boils down to issues with the aging StringWriter class at the end of the day.

What you could do instead would be to subclass JsonWriter and override its value(String) method so that it does something like the above before calling super.value(value). Then you can serialize with Gson.toJson(Object, Type, JsonWriter).

Yep, and that is what I ended up doing in the application where this was an issue; but it still means the more frequently used toJson variant has this problem that users need to be aware of. I inherited an app where it wasn't easy to switch to a more performant JSON library, and also the baseline performance / allocation rate was kind of baked in. Someone has to go poking around to even see that there is a problem in such a scenario.

@eamonnmcmanus
Member

We could imagine using an alternative to StringWriter for the very common case where it is created implicitly by the various Gson.toJson overloads. It's absolutely true that Gson has information that it could use to provide a better size for the initial buffer, and/or grow the buffer in a more sophisticated way than the $2n + 2$ that StringBuffer uses. But I feel that getting that right is a pretty major project. We don't want to harm the common case of serializing into a smallish JSON string. For example, we could imagine a two-pass approach where we initially scan the input just to know how much space to allocate, but that almost certainly would be worse for small outputs.

One thing we could very easily do is provide a non-default initial size for the StringWriter. 16 is pretty small and I would guess that nearly all serialized outputs exceed that. An initial size like 256 feels like it might be more appropriate. It doesn't much matter if the buffer is a bit too big: at the end we copy just the output characters into the String result, then discard the buffer. That doesn't change much in the giant-output case, though, where most of the work is going to be copying for the larger $n$ values in $2n + 2$ increases.
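
A sketch of roughly what that could look like, written here as a standalone helper for illustration rather than the actual shape of Gson's internal toJson overloads:

import com.google.gson.Gson;
import java.io.StringWriter;
import java.lang.reflect.Type;

public class PreSizedToJson {
  // A larger starting capacity (256, per the suggestion above) instead of
  // StringWriter's default of 16.
  private static final int INITIAL_CAPACITY = 256;

  public static String toJson(Gson gson, Object src, Type typeOfSrc) {
    StringWriter writer = new StringWriter(INITIAL_CAPACITY);
    gson.toJson(src, typeOfSrc, writer);
    return writer.toString();
  }
}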

@Marcono1234
Collaborator

How would it [= pre-sizing] negatively impact performance? The ensureCapacity call does nothing if the buffer already has sufficient capacity.

For example, with the current resizing it might be 16 -> 34 -> 70, assuming the complete JSON data is <= 70 chars. With your proposed pre-sizing it might instead be 16 -> 50; but if it then turns out there is more JSON data to write (because the JSON string is not at top level, or there are escape sequences), it would grow to 102, and so in the end it would take up more space than it does currently.
I am not saying this is dramatic; I am just saying there might be cases where the pre-sizing results in less ideal behavior.
(Though I haven't verified that yet; maybe I overlooked something.)
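
For what it's worth, the two growth paths can be checked directly against StringBuffer (a small sketch; the 50- and 60-char figures are only illustrative):

public class GrowthComparison {
  public static void main(String[] args) {
    // Current behavior: chunked appends (quote, 33-char payload, quote) grow the
    // buffer 16 -> 34 -> 70, matching the reproduction steps above.
    StringBuffer current = new StringBuffer();
    current.append('"').append("A".repeat(33)).append('"');
    System.out.println(current.capacity()); // 70

    // Pre-sizing: ensureCapacity(50) first, then more data than expected arrives
    // (non-top-level value, escape sequences, ...), forcing one more 2n + 2 step.
    StringBuffer preSized = new StringBuffer();
    preSized.ensureCapacity(50);             // 16 -> 50
    preSized.append("B".repeat(60));         // 50 -> 102
    System.out.println(preSized.capacity()); // 102
  }
}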

Also, I just noticed in your originally proposed diff above you wrote ensureCapacity(value.length() + 2); is that intended? ensureCapacity seems to take the total capacity, not the free capacity. Should it rather be ensureCapacity(buffer.length() + value.length() + 2)?
