Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DataFrame when parsing a CSV incorrectly detects a column type when renameDuplicatedColumns is true #7240

Open
tudor-turcu opened this issue Sep 19, 2024 · 2 comments · May be fixed by #7242
Labels
in-pr untriaged New issue has not been triaged

Comments

@tudor-turcu
Copy link

tudor-turcu commented Sep 19, 2024

System Information (please complete the following information):

  • OS & Version: Windows 11, Linux
  • ML.NET Version: 3.0.1 (Microsoft.Data.Analysis 0.21.1)
  • .NET Version: .NET 8.0

Describe the bug
DataFrame.LoadCsv() or LoadCsvFromString() incorrectly detects a column type when renameDuplicatedColumns = true and dataTypes = null or empty.

To Reproduce
Call DataFrame.LoadCsv() or LoadCsvFromString() with renameDuplicatedColumns = true and dataTypes = null or empty,
with CultureInfo.CurrentCulture = CultureInfo.InvariantCulture; // or en-US
If a column in the CSV contains on a row a valid date value,
and in a subsequent row appear two or more empty string values, one of which appears in a previous column,
the column containing a date is not considered as having a date type, but a single/float type.
Parsing the date values on that column will fail with an exception.
Probably the same issue appears for boolean columns or other types.
Sample CSV:

"Col1","Col2","Col3","Col4"
"v1.1","5/7/2017","v3.1","v4.1"
"","","v3.2","v4.2"

Expected behavior
The column type should considered to be Date type, even if it contains a few empty string values.

Screenshots, Code, Sample Projects

using Microsoft.Data.Analysis;
using System.Globalization;

namespace DataFrameColumnTypeDetectionBug
{
    internal class Program
    {
        static void Main(string[] args)
        {
            CultureInfo.CurrentCulture = CultureInfo.InvariantCulture; // or en-US

            string csv =
@"""Col1"",""Col2"",""Col3"",""Col4""
""v1.1"",""5/7/2017"",""v3.1"",""v4.1""
"""","""",""v3.2"",""v4.2""
";

            var dataFrame = DataFrame.LoadCsvFromString(csv, separator: ',', header: true, 
                dataTypes: null, // guess the column types
                renameDuplicatedColumns: true, // try to rename the duplicated columns, if any
                cultureInfo: CultureInfo.InvariantCulture);

            // when renameDuplicatedColumns: true will crash with:
            // System.FormatException: The input string '5/7/2017' was not in a correct format.
            // at System.Number.ThrowFormatException[TChar](ReadOnlySpan`1 value)
            //at System.String.System.IConvertible.ToSingle(IFormatProvider provider)
            //at System.Convert.ChangeType(Object value, Type conversionType, IFormatProvider provider)
            //at Microsoft.Data.Analysis.DataFrame.Append(IEnumerable`1 row, Boolean inPlace, CultureInfo cultureInfo)
            //at Microsoft.Data.Analysis.DataFrame.ReadCsvLinesIntoDataFrame(...)

            // no crash if renameDuplicatedColumns: false
            // or if dataTypes is set (no 'column type guessing')
            // or if CultureInfo.CurrentCulture has other format for DateTime (like dd.mm.yyyy) ==> the column will not be considered as DateTime, but string
        }
    }
}

Additional context
No crash if renameDuplicatedColumns: false
or if dataTypes is set (no 'column type guessing')
or if CultureInfo.CurrentCulture has other format for DateTime (like dd.mm.yyyy) ==> the column will not be considered as DateTime, but string.

Possible root cause:
This seems to be a bug in Microosft.Data.Analysis.DataFrame - in ReadCsvLinesIntoDataFrame() function, only if renameDuplicatedColumns param is true, not only the duplicated column names are renamed, but also 'duplicated' row values are 'renamed':
ex.: "345", "345.1", "345.2"...
Several empty string values become: "", ".1", ".2", ".3".
The code that tries to guess the column types will consider these former empty strings as float/single values under en-US culture, marking the entire column a having a single/float data type, even if no float values really exist on that column.

@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged label Sep 19, 2024
@asmirnov82
Copy link
Contributor

asmirnov82 commented Sep 20, 2024

Hello @tudor-turcu. Thank you for detailed explanation and provided code for unit testing. You are absolutely right with proposed root cause. I fixed the issue, please see #7242

@luisquintanilla, @JakeRadMSFT could you please review and merge my PR?

@tudor-turcu
Copy link
Author

Hello @tudor-turcu. Thank you for detailed explanation and provided code for unit testing. You are absolutely right with proposed root cause. I fixed the issue, please see #7242

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
in-pr untriaged New issue has not been triaged
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants
@tudor-turcu @asmirnov82 and others