Escape control characters for DynamoDB source #5177

Open
wants to merge 1 commit into
base: main
@paulsasi paulsasi commented Nov 8, 2024

Description

Escapes control characters for the DynamoDB source.

Issues Resolved

Resolves #5027

Check List

  • New functionality includes testing.
  • New functionality has a documentation issue. Please link to it in this PR.
    • New functionality has javadoc added
  • Commits are signed with a real name per the DCO

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Paul Sasieta Arana <[email protected]>
@paulsasi paulsasi changed the title Escape control characters Escape control characters for DynamoDB source Nov 8, 2024
char c = jsonData.charAt(i);
if (Character.isISOControl(c) && c != '\t' && c != '\n' && c != '\r') {
    // Replace control characters with escaped versions (e.g. \u0000 for null, \u0001 for start of heading, etc.)
    sanitizedStringBuilder.append(String.format("\\u%04X", (int) c));
Member

Is there a way to write this string without calling String.format to avoid the performance penalty?
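One possible answer, as a minimal sketch rather than the change that was actually adopted: build the four hex digits from a lookup table and append them directly, so no format string is parsed per character. The class name `ControlCharEscaper`, the method name, and the `else` branch that copies ordinary characters through are assumptions for illustration; the condition is the one quoted in the diff.

```java
// Hypothetical helper; names are illustrative and not taken from the PR.
final class ControlCharEscaper {
    // Upper-case digits, matching what the "%04X" format in the diff produces.
    private static final char[] HEX = "0123456789ABCDEF".toCharArray();

    private ControlCharEscaper() {
    }

    static String escapeControlCharacters(final String jsonData) {
        final StringBuilder sanitized = new StringBuilder(jsonData.length());
        for (int i = 0; i < jsonData.length(); i++) {
            final char c = jsonData.charAt(i);
            // Same condition as the diff: tab, line feed and carriage return are left alone.
            if (Character.isISOControl(c) && c != '\t' && c != '\n' && c != '\r') {
                // Append a backslash, the letter u, then four hex digits, one nibble at a time.
                sanitized.append('\\').append('u')
                        .append(HEX[(c >> 12) & 0xF])
                        .append(HEX[(c >> 8) & 0xF])
                        .append(HEX[(c >> 4) & 0xF])
                        .append(HEX[c & 0xF]);
            } else {
                // Assumed behavior: every other character is copied unchanged.
                sanitized.append(c);
            }
        }
        return sanitized.toString();
    }

    public static void main(final String[] args) {
        // The input holds a start-of-heading character between 'a' and 'b'.
        System.out.println(escapeControlCharacters("a\u0001b"));
    }
}
```

Whether this is measurably faster than `String.format` for real DynamoDB payloads would need a benchmark; the sketch only shows that the formatting call is avoidable.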

StringBuilder sanitizedStringBuilder = new StringBuilder();
for (int i = 0; i < jsonData.length(); i++) {
    char c = jsonData.charAt(i);
    if (Character.isISOControl(c) && c != '\t' && c != '\n' && c != '\r') {
Member

I tend to think that we should have this as an optional configuration to avoid breaking any existing behavior. Thoughts?
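A rough sketch of what an opt-in switch could look like, assuming a boolean option that defaults to off so existing pipelines keep their current behavior. The class `EscapingToggle`, the method `sanitizerFor`, and the flag name are hypothetical; how the flag would be exposed in the DynamoDB source configuration is not shown here.

```java
import java.util.function.UnaryOperator;

// Hypothetical wiring; none of these names come from the PR.
final class EscapingToggle {
    private EscapingToggle() {
    }

    // Returns the escaper only when the option is enabled; otherwise the input passes through untouched.
    static UnaryOperator<String> sanitizerFor(final boolean escapeControlCharacters,
                                              final UnaryOperator<String> escaper) {
        return escapeControlCharacters ? escaper : UnaryOperator.identity();
    }

    public static void main(final String[] args) {
        // Stand-in escaper for the demo: it only rewrites the NUL character.
        final UnaryOperator<String> standIn = s -> s.replace("\u0000", "\\u0000");

        final String input = "before\u0000after";
        // Option off: the string is returned as-is, so the existing behavior is preserved.
        System.out.println(sanitizerFor(false, standIn).apply(input).equals(input));
        // Option on: the stand-in escaper is applied.
        System.out.println(sanitizerFor(true, standIn).apply(input));
    }
}
```

Choosing the sanitizer once, when the source is configured, keeps the per-record path free of repeated flag checks.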

@@ -188,7 +188,8 @@ void test_writeSingleRecordToBuffer() throws Exception {
 "and/or",
 "c:\\Home",
 "I take\nup multiple\nlines",
-"String with some \"backquotes\"."
+"String with some \"backquotes\".",
+"String with some control characters: \0\1\2\3\4\5\6\7\10\11\12\13\14\15\16\17\20\21\22\23\24\25\26\27\28\29\30\31\127\b\t\n\f\r"
Member

We should also have a test that provides both the input and the expected output to verify that the result is what we want.
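A sketch of such a check, written in plain Java so it runs without a test framework; in the project it would presumably be a JUnit test. The `sanitize` method repeats the loop quoted in the diff, with an assumed `else` branch that copies other characters unchanged, and the class and method names are illustrative. The cases use Unicode escapes rather than octal ones, because Java reads `\28` as `\2` followed by the digit `8`, and `\127` as the octal escape for the letter `W`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical check; names are illustrative and not taken from the PR.
final class EscapeExpectationCheck {
    private EscapeExpectationCheck() {
    }

    // The loop from the diff under review; the else branch is an assumption.
    static String sanitize(final String jsonData) {
        final StringBuilder sanitizedStringBuilder = new StringBuilder();
        for (int i = 0; i < jsonData.length(); i++) {
            final char c = jsonData.charAt(i);
            if (Character.isISOControl(c) && c != '\t' && c != '\n' && c != '\r') {
                sanitizedStringBuilder.append(String.format("\\u%04X", (int) c));
            } else {
                sanitizedStringBuilder.append(c);
            }
        }
        return sanitizedStringBuilder.toString();
    }

    // Each entry maps an input string to the exact output expected for it.
    static Map<String, String> expectations() {
        final Map<String, String> cases = new LinkedHashMap<>();
        cases.put("nul:\u0000", "nul:\\u0000");
        cases.put("backspace:\b", "backspace:\\u0008");
        cases.put("form feed:\f", "form feed:\\u000C");
        cases.put("unit separator:\u001F", "unit separator:\\u001F");
        cases.put("delete:\u007F", "delete:\\u007F");
        // Tab, line feed and carriage return are excluded by the condition and must survive as-is.
        cases.put("kept:\t\n\r", "kept:\t\n\r");
        cases.put("plain text", "plain text");
        return cases;
    }

    // Throws an AssertionError naming the first case whose output differs from the expectation.
    static void verifyAll() {
        expectations().forEach((input, expected) -> {
            final String actual = sanitize(input);
            if (!expected.equals(actual)) {
                throw new AssertionError("expected [" + expected + "] but got [" + actual + "]");
            }
        });
    }

    public static void main(final String[] args) {
        verifyAll();
        System.out.println("all cases matched");
    }
}
```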

Development

Successfully merging this pull request may close these issues.

[BUG] DynamoDB Source doesn't support parsing data with Control Characters
3 participants