Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Excel formatted Date come out as 2 year digit #351

Open
skybutter opened this issue Feb 25, 2021 · 3 comments
Open

Excel formatted Date come out as 2 year digit #351

skybutter opened this issue Feb 25, 2021 · 3 comments

Comments

@skybutter
Copy link

Excel date field is not correct with only 2 digit year when read as StringType using inferSchema=false.

Expected Behavior

I would expect the string should come out as the same as shown in Excel.
However, the date string came out with only 2 digit year.

Current Behavior

I have an Excel file with a Date column. The Date column is formatted with 4 digit year, as seen below. The date is show as 6/18/2020 in the Excel file.
image

I have inferSchema as false to read the file. Every field is StringType in Dataframe. The value in the Dataframe only has 2 year digit instead of 4. For example, the above screenshot showed the value in the Dataframe as 6/18/20.
Here is the output from the dataframe:
image

Possible Solution

I debug the code and saw that the cell (XSSFCell) has 18-Jun-2020, so the code has the 4 year digit read in. Just that when it convert to string it became 2 digit year.
image
I believe the dataFormatter is writing it as a 2 year digit in the line stringValue ...case _ => Some(dataFormatter.formatCellValue(cell))
Internally, I think the excelFormatStr is "m/d/yy", thus the dataFormatter output as 6/18/20. In the Excel screen, it showed the cell value as 6/18/2020, so I would expect a 4 year digit in the dataframe.
I wonder if we can have this line dataFormatter.addFormat("m/d/yy", new SimpleDateFormat("MM/dd/yyyy")) added, so it can print it as 4 year digit.
image
Or we can add an option to specify the excel format conversion?

Steps to Reproduce (for bugs)

Read the attached file without using schema.
Date2digit.xlsx
val df = spark.read.format("com.crealytics.spark.excel")
.option("header", "true")
.load("Date2digit.xlsx")

Context

I try to read the file without using inferSchema as I want to read every field as string.
In addition, I might have some rows in the date field that is not a valid date. The source file come in like that which I have no control.

Your Environment (I tried multiple versions)

  • Spark version 2.4.7 and 2.4.3
  • Language: Scala 2.12 and 2.11
  • Spark-Excel version: 0.13.1 and 0.12.0
  • Window 10 and Databricks cluster. This happened in my local machine and on Databricks Spark cluster.
@nightscape
Copy link
Collaborator

Hey @skybutter, great report!
This sounds related to this open PR which adds a dateFormat for parsing dates:
https://github.com/crealytics/spark-excel/pull/342/files
We could use the same option for specifying how to format dates into Strings.
Once @Harivignesh3419 had the time to finish his PR, this could be added on top (or the other way round).

@Amanda949
Copy link

Hi, @nightscape @Harivignesh3419 @skybutter
I have encountered the same problem again. Is there any possible to apply this PR?
Thanks

@tamilselvanyes
Copy link

tamilselvanyes commented Jul 23, 2024

I have encountered the same issue as well. Is there a way where we can read the data as StringType without any truncate of the date values from Excel File(.xls or .xlsx) ?
@skybutter Were you able to find a work around for this problem? Since the issue is still open!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants