You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Describe the bug
DataFrame.LoadCsv() or LoadCsvFromString() incorrectly detects a column type when renameDuplicatedColumns = true and dataTypes = null or empty.
To Reproduce
Call DataFrame.LoadCsv() or LoadCsvFromString() with renameDuplicatedColumns = true and dataTypes = null or empty,
with CultureInfo.CurrentCulture = CultureInfo.InvariantCulture; // or en-US
If a column in the CSV contains on a row a valid date value,
and in a subsequent row appear two or more empty string values, one of which appears in a previous column,
the column containing a date is not considered as having a date type, but a single/float type.
Parsing the date values on that column will fail with an exception.
Probably the same issue appears for boolean columns or other types.
Sample CSV:
Expected behavior
The column type should considered to be Date type, even if it contains a few empty string values.
Screenshots, Code, Sample Projects
using Microsoft.Data.Analysis;using System.Globalization;namespaceDataFrameColumnTypeDetectionBug{internalclassProgram{staticvoidMain(string[]args){
CultureInfo.CurrentCulture = CultureInfo.InvariantCulture;// or en-USstringcsv=@"""Col1"",""Col2"",""Col3"",""Col4""""v1.1"",""5/7/2017"",""v3.1"",""v4.1"""""","""",""v3.2"",""v4.2""";vardataFrame= DataFrame.LoadCsvFromString(csv, separator:',', header:true,
dataTypes:null,// guess the column types
renameDuplicatedColumns:true,// try to rename the duplicated columns, if any
cultureInfo: CultureInfo.InvariantCulture);// when renameDuplicatedColumns: true will crash with:// System.FormatException: The input string '5/7/2017' was not in a correct format.// at System.Number.ThrowFormatException[TChar](ReadOnlySpan`1 value)//at System.String.System.IConvertible.ToSingle(IFormatProvider provider)//at System.Convert.ChangeType(Object value, Type conversionType, IFormatProvider provider)//at Microsoft.Data.Analysis.DataFrame.Append(IEnumerable`1 row, Boolean inPlace, CultureInfo cultureInfo)//at Microsoft.Data.Analysis.DataFrame.ReadCsvLinesIntoDataFrame(...)// no crash if renameDuplicatedColumns: false// or if dataTypes is set (no 'column type guessing')// or if CultureInfo.CurrentCulture has other format for DateTime (like dd.mm.yyyy) ==> the column will not be considered as DateTime, but string}}}
Additional context
No crash if renameDuplicatedColumns: false
or if dataTypes is set (no 'column type guessing')
or if CultureInfo.CurrentCulture has other format for DateTime (like dd.mm.yyyy) ==> the column will not be considered as DateTime, but string.
Possible root cause:
This seems to be a bug in Microosft.Data.Analysis.DataFrame - in ReadCsvLinesIntoDataFrame() function, only if renameDuplicatedColumns param is true, not only the duplicated column names are renamed, but also 'duplicated' row values are 'renamed':
ex.: "345", "345.1", "345.2"...
Several empty string values become: "", ".1", ".2", ".3".
The code that tries to guess the column types will consider these former empty strings as float/single values under en-US culture, marking the entire column a having a single/float data type, even if no float values really exist on that column.
The text was updated successfully, but these errors were encountered:
Hello @tudor-turcu. Thank you for detailed explanation and provided code for unit testing. You are absolutely right with proposed root cause. I fixed the issue, please see #7242
Hello @tudor-turcu. Thank you for detailed explanation and provided code for unit testing. You are absolutely right with proposed root cause. I fixed the issue, please see #7242
System Information (please complete the following information):
Describe the bug
DataFrame.LoadCsv() or LoadCsvFromString() incorrectly detects a column type when renameDuplicatedColumns = true and dataTypes = null or empty.
To Reproduce
Call DataFrame.LoadCsv() or LoadCsvFromString() with renameDuplicatedColumns = true and dataTypes = null or empty,
with CultureInfo.CurrentCulture = CultureInfo.InvariantCulture; // or en-US
If a column in the CSV contains on a row a valid date value,
and in a subsequent row appear two or more empty string values, one of which appears in a previous column,
the column containing a date is not considered as having a date type, but a single/float type.
Parsing the date values on that column will fail with an exception.
Probably the same issue appears for boolean columns or other types.
Sample CSV:
Expected behavior
The column type should considered to be Date type, even if it contains a few empty string values.
Screenshots, Code, Sample Projects
Additional context
No crash if renameDuplicatedColumns: false
or if dataTypes is set (no 'column type guessing')
or if CultureInfo.CurrentCulture has other format for DateTime (like dd.mm.yyyy) ==> the column will not be considered as DateTime, but string.
Possible root cause:
This seems to be a bug in Microosft.Data.Analysis.DataFrame - in ReadCsvLinesIntoDataFrame() function, only if renameDuplicatedColumns param is true, not only the duplicated column names are renamed, but also 'duplicated' row values are 'renamed':
ex.: "345", "345.1", "345.2"...
Several empty string values become: "", ".1", ".2", ".3".
The code that tries to guess the column types will consider these former empty strings as float/single values under en-US culture, marking the entire column a having a single/float data type, even if no float values really exist on that column.
The text was updated successfully, but these errors were encountered: