Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure CopyElement does not break UTF-8 characters #192

Merged
merged 4 commits into from
Oct 11, 2023

Conversation

poikilotherm
Copy link
Member

Closes #188

landreev and others added 4 commits September 15, 2023 17:42
This reproducer test will ensure the output is not bodged by broken
UTF-8 multibyte chars again.
With this commit, we introduce an analysis routine that will go over
the last few bytes when reading with CopyElement. This is faster
than converting to String and checking for the UTF-8 unknown char sign
over and over again.

By using a buffered input stream, we can rewind the stream if necessary
and read again up to the point that we don't have a broken char.

Extensive testing was added to make sure the analysis function works
with any length of multibyte UTF-8 chars
@poikilotherm poikilotherm added the bug Something isn't working label Sep 30, 2023
@poikilotherm
Copy link
Member Author

@landreev @pdurbin please take a look. Thx!

@sonarcloud
Copy link

sonarcloud bot commented Sep 30, 2023

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 0 Code Smells

96.7% 96.7% Coverage
0.0% 0.0% Duplication

Copy link
Collaborator

@landreev landreev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the solution.
And it's great to have tests for this!
Thank you.

@pdurbin
Copy link
Member

pdurbin commented Oct 11, 2023

At sprint planning today I pulled this issue...

... into the current sprint because it looks like we're close! Thanks @poikilotherm and @landreev!

@poikilotherm poikilotherm merged commit 31fda1d into branch-5.0 Oct 11, 2023
4 checks passed
@poikilotherm poikilotherm deleted the 188-copyelement-utf8 branch October 11, 2023 21:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

UTF-8 characters can be broken in two in CopyElement, resulting in invalid records
3 participants