Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add max buffer sizes and if should process bad field flag #2188

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

mac477
Copy link

@mac477 mac477 commented Sep 11, 2023

It is possible that buffers and field buffers can grow with no limits what leads to big arrays allocations and large memory consumption, what finally leads to OutOfMemory exception. This PR introduces properties to limit max sizes of those buffers. And beside that ProcessBadDataFields flag which makes 'bad field' not processed. Since default is true, current behaviour would not be changed. Otherwise that field will be set to array of empty char because in such flag setting we are not interested in processing bad field. In some cases it can improve performance and memory consumption.
WARNING: Limiting max processed field buffer size can lead to field value truncation (if that buffer is too small to fit in field). But it works well with ShouldSkipRecord configured to skip too long fields.

@Rob-Hague
Copy link
Contributor

A similar feature was added in #1278 (version 30) as CsvConfiguration.MaxFieldSize. Is this different?

@mac477
Copy link
Author

mac477 commented Sep 12, 2023

A similar feature was added in #1278 (version 30) as CsvConfiguration.MaxFieldSize. Is this different?

That is completely different since it lets skip too long rows/fields. MaxFieldSize would throw whole parsing if set.

BTW: Tomorrow I will push improvements to that pr's branch.

@mac477
Copy link
Author

mac477 commented Sep 13, 2023

I've just added mentioned improvement.

Copy link

@nicemanman nicemanman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think of these changes?

@@ -28,6 +28,8 @@ public record CsvConfiguration : IReaderConfiguration, IWriterConfiguration
/// <inheritdoc/>
public virtual BadDataFound BadDataFound { get; set; } = ConfigurationFunctions.BadDataFound;

public virtual bool ProcessBadDataFields { get; set; } = true;
Copy link

@nicemanman nicemanman Oct 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All the other properties (not only properties) have /// <inheritdoc/> above them, you forgot to add it above this line

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's been added.

@@ -281,15 +369,56 @@ public async Task<bool> ReadAsync()
}
}

private async Task<bool> FillBufferAsync()

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. You have fully repeated the FillBuffer method's code, it's not good. Don't you think it would be better to call FillBufferAsync().Result from the FillBuffer method?
  2. The method's purpose is not clear, because of a lot of global fields. I don't think it's a good idea to refactor this class now, so add a comment line above both FillBuffer and FillBufferAsync
  3. It's a little unclear what is going on in lines 386 to 407, maybe we could call an additional method with a good name?

@@ -356,34 +494,41 @@ private ReadLineResult ReadLine(ref char c, ref char cPrev)
}

var isFirstCharOfRow = rowStartPosition == bufferPosition - 1;
if (isFirstCharOfRow && (allowComments && c == comment || ignoreBlankLines && ((c == '\r' || c == '\n') && !isNewLineSet || c == newLineFirstChar && isNewLineSet)))

if (isFirstCharOfRow &&

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Too many conditions for one if statement, hard to understand, maybe we can create several bool variables, so we get something like this -

bool4 = bool7 || bool8;
bool3 = (bool2 && bool3) || (bool4 && bool5);
if (isFirstCharOfRow && bool3)
{}

}

if (buffer[newStart] != quote || buffer[newStart + newLength - 1] != quote || newLength == 1 && buffer[newStart] == quote)
if (buffer[newStart] != quote ||

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we could move these conditions to an additional method and call it, I think we can make the purpose of these conditions more clear

@mac477
Copy link
Author

mac477 commented Oct 11, 2023

@nicemanman thanks for review but you mostly reviewed things which are not related to this pull request. Please create new pull request regarding changes you propose which are not related to this branch.

@nicemanman
Copy link

@nicemanman thanks for review but you mostly reviewed things which are not related to this pull request. Please create new pull request regarding changes you propose which are not related to this branch.

Okay, I will, thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants