Excel formatted Date come out as 2 year digit #351

skybutter · 2021-02-25T03:44:31Z

Excel date field is not correct with only 2 digit year when read as StringType using inferSchema=false.

Expected Behavior

I would expect the string should come out as the same as shown in Excel.
However, the date string came out with only 2 digit year.

Current Behavior

I have an Excel file with a Date column. The Date column is formatted with 4 digit year, as seen below. The date is show as 6/18/2020 in the Excel file.

I have inferSchema as false to read the file. Every field is StringType in Dataframe. The value in the Dataframe only has 2 year digit instead of 4. For example, the above screenshot showed the value in the Dataframe as 6/18/20.
Here is the output from the dataframe:

Possible Solution

I debug the code and saw that the cell (XSSFCell) has 18-Jun-2020, so the code has the 4 year digit read in. Just that when it convert to string it became 2 digit year.

I believe the dataFormatter is writing it as a 2 year digit in the line stringValue ...case _ => Some(dataFormatter.formatCellValue(cell))
Internally, I think the excelFormatStr is "m/d/yy", thus the dataFormatter output as 6/18/20. In the Excel screen, it showed the cell value as 6/18/2020, so I would expect a 4 year digit in the dataframe.
I wonder if we can have this line dataFormatter.addFormat("m/d/yy", new SimpleDateFormat("MM/dd/yyyy")) added, so it can print it as 4 year digit.

Or we can add an option to specify the excel format conversion?

Steps to Reproduce (for bugs)

Read the attached file without using schema.
Date2digit.xlsx
val df = spark.read.format("com.crealytics.spark.excel")
.option("header", "true")
.load("Date2digit.xlsx")

Context

I try to read the file without using inferSchema as I want to read every field as string.
In addition, I might have some rows in the date field that is not a valid date. The source file come in like that which I have no control.

Your Environment (I tried multiple versions)

Spark version 2.4.7 and 2.4.3
Language: Scala 2.12 and 2.11
Spark-Excel version: 0.13.1 and 0.12.0
Window 10 and Databricks cluster. This happened in my local machine and on Databricks Spark cluster.

nightscape · 2021-02-25T17:16:17Z

Hey @skybutter, great report!
This sounds related to this open PR which adds a dateFormat for parsing dates:
https://github.com/crealytics/spark-excel/pull/342/files
We could use the same option for specifying how to format dates into Strings.
Once @Harivignesh3419 had the time to finish his PR, this could be added on top (or the other way round).

Amanda949 · 2024-05-21T07:28:49Z

Hi, @nightscape @Harivignesh3419 @skybutter
I have encountered the same problem again. Is there any possible to apply this PR?
Thanks

tamilselvanyes · 2024-07-23T08:46:56Z

I have encountered the same issue as well. Is there a way where we can read the data as StringType without any truncate of the date values from Excel File(.xls or .xlsx) ?
@skybutter Were you able to find a work around for this problem? Since the issue is still open!

tamilselvanyes mentioned this issue Jul 26, 2024

Excel formatted Date come out as 2 year digit even when the Excel date format is 4 year digit #879

Open

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Excel formatted Date come out as 2 year digit #351

Excel formatted Date come out as 2 year digit #351

skybutter commented Feb 25, 2021

nightscape commented Feb 25, 2021

Amanda949 commented May 21, 2024

tamilselvanyes commented Jul 23, 2024 •

edited

Loading

Excel formatted Date come out as 2 year digit #351

Excel formatted Date come out as 2 year digit #351

Comments

skybutter commented Feb 25, 2021

Expected Behavior

Current Behavior

Possible Solution

Steps to Reproduce (for bugs)

Context

Your Environment (I tried multiple versions)

nightscape commented Feb 25, 2021

Amanda949 commented May 21, 2024

tamilselvanyes commented Jul 23, 2024 • edited Loading

tamilselvanyes commented Jul 23, 2024 •

edited

Loading