-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Generate measurements with random station names #125
Conversation
Hey @mtopolnik, you mentioned this would fail some of the existing contenders. Can you add a measurements.txt and corresponding out to src/test/resources and report back the result of ./test_all.sh? Thx! |
0057011
to
f1dda12
Compare
I ran it... it's not a pretty sight :)
I had to disable |
Where does the data for the CSV file come from? Would be good (and kind) to add attribution, unless you've come up with it yourself :) |
Attribution provided here (as a code comment). Does it need to be more visible? If I put it into the CSV itself, I'll need more code to ignore those lines. |
Ah, sorry, I had missed that! IMO preferable to put it in the CSV itself, should be relatively easy if you just treat all lines starting with Mostly important to license it correctly because it comes from a company, and wouldn't want them to come ruin the party (doubt they would, but you never know)! And they're relatively explicit about the data being licensed under Creative Commons Attribution 4.0 and requiring attribution. |
Oh wow, that's interesting. Curious why so many are failing. E.g. @royvanrijn, @spullara, @Palmr, any thoughts about it? Before committing this to the test suite I think we need to better understand the reasons. Admittedly, the constraints were not as well-defined initially as they'd ideally have been, and I want to be cautious of not moving goal posts (too much, anyways ;). That being said, one thing every impl definitely should satisfy is to support the maximum station name length (100), I've logged #156 for adding this to the test suite. I think that's rather uncontroversial. |
The newly added test file contains a worst-case name that has exactly 100 characters, but they are all the "scariest"-looking characters I could find in a larger sample of names. This does bring up the point that was made elsewhere -- in general, Unicode characters have unbounded length due to the combining codepoints. |
I took the maximum station name length of 100 as bytes (or some larger arbitrary number). If we choose that it will be easier to test and make sense of it. |
I think the wording is quite unambigious in that regard:
That would be UTF-8 characters, not bytes. That being said, that limit was chosen rather arbitrarily, and I don't think it changes anything substantial about the nature of the challenge or its goals (which are not to solve all character encoding oddities in the world). So IMO we wouldn't compromise by effectively limiting it to 100 bytes indeed, e.g. by testing a 100 ASCII (one byte) characters and maybe 50 two-byte chars, and consider implementations which satisfy that as valid. Thoughts? |
I agree, let's put a 100 byte limit on it. |
After this adjustment, test results have improved:
|
Sure, for me just increase the buffer size. Index: src/main/java/dev/morling/onebrc/CalculateAverage_twobiers.java
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/main/java/dev/morling/onebrc/CalculateAverage_twobiers.java b/src/main/java/dev/morling/onebrc/CalculateAverage_twobiers.java
--- a/src/main/java/dev/morling/onebrc/CalculateAverage_twobiers.java (revision 9cef402a3e2246d28091861e17d635c7fc214616)
+++ b/src/main/java/dev/morling/onebrc/CalculateAverage_twobiers.java (date 1704494949716)
@@ -178,7 +178,7 @@
var measurements = new ArrayList<Measurement>(100_000);
final int limit = byteBuffer.limit();
- final byte[] buffer = new byte[64];
+ final byte[] buffer = new byte[128];
while (byteBuffer.position() < limit) {
final int start = byteBuffer.position(); |
Ah yeah, nice. Could you PR this, please? Thx. |
It's a bug in the fickle SIMD part of my implementation. Was in the process of refactoring that anyway. |
Ok, cool. Let's do this then. Could someone perhaps log issues for each of those failing ones, tagging the respective authors? Thx! |
Name length goes from 1 to 100.
@mtopolnik, I see you're still doing changes here. Let me know when it's good to go. Thx! |
I just had a few ideas when I woke up in the middle of the night :) I think this should be it now, at least for a start. |
One hint, don't know if it's useful: I realized I got a ton of hanging processes from running |
Haha, ok :) One more question on the provenience of the data, is this derived from the "Basic" data set at https://simplemaps.com/data/world-cities, which is published under Creative Commons Attribution 4.0? If so, can you add this line to the top of the file:
Then it's good to go. Thx! |
Merging, thanks a lot! |
Contributes a new way to create
measurements.txt
, using 10,000 random station names with length from 1 to 100.To generate names with a realistic distribution, it uses a public list of city names. It concatenates all the names into one long string, and then uses this as a "source of name randomness". The distribution of generated name lengths is biased towards short names, but with quite frequent outliers.
Fixes #156