Extract sheet names using pyspark #856

Open · 2 tasks done
Krukosz opened this issue Apr 30, 2024 · 5 comments

@Krukosz commented Apr 30, 2024

Am I using the newest version of the library?

  • I have made sure that I'm using the latest version of the library.

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

I have a problem with the WorkbookReader class. My code in Python looks like this:

```python
reader = spark._jvm.com.crealytics.spark.excel.WorkbookReader(
    {"path": "Worktime.xlsx"},
    spark.sparkContext._jsc.hadoopConfiguration(),
)
sheetnames = reader.sheetNames()
```

My problems:

  1. I cannot use hadoopConfiguration explicitly due to security options.
  2. When I omit the second argument in the constructor, I get this error:

```
py4j.Py4JException: Constructor com.crealytics.spark.excel.WorkbookReader([class java.util.HashMap]) does not exist
```

In PR #196 there's a discussion about using the apply method, but I don't know how to call it.

Has anyone gotten this working on PySpark? I can't use Scala, because it's blocked by the administrator in my environment.
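[Editor's note: a minimal sketch of what the apply call discussed in PR #196 can look like from py4j, mirroring the working snippet posted later in this thread. The path is a placeholder, and the hadoopConfiguration() call is exactly the part that security-restricted clusters block, as discussed below.]

```python
# Sketch only: call the Scala companion object's apply method instead of the
# constructor, so py4j resolves an overload that accepts the auto-converted
# Java map. Assumes hadoopConfiguration() is accessible, which it is not on
# Unity Catalog shared clusters (see the later comments).
params = {"path": "Worktime.xlsx"}  # placeholder path
reader = spark._jvm.com.crealytics.spark.excel.WorkbookReader.apply(
    params,
    spark.sparkContext._jsc.hadoopConfiguration(),
)
print(reader.sheetNames())
```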

Expected Behavior

No response

Steps To Reproduce

No response

Environment

- Spark version: Apache Spark 3.4.1
- Spark-Excel version: Scala 2.1
- OS:
- Cluster environment: Databricks 13.3 LTS

Anything else?

No response

@nightscape (Collaborator)

Does this help? #196 (comment)

@Krukosz (Author) commented Apr 30, 2024

Oh, I tested it on a "legacy" Databricks cluster and it works.

My code:

```python
reader = spark._jvm.com.crealytics.spark.excel.WorkbookReader.apply(
    {"path": "my_file.xlsx"},
    spark.sparkContext._jsc.hadoopConfiguration(),
)

d = reader.sheetNames()
print(d)
```

In a Unity Catalog environment I'm getting this error (it's directly tied to the cluster access mode, which cannot be changed in my case):

```
py4j.security.Py4JSecurityException: Method public org.apache.hadoop.conf.Configuration org.apache.spark.api.java.JavaSparkContext.hadoopConfiguration() is not whitelisted on class class org.apache.spark.api.java.JavaSparkContext
```

Is there any other way to get the sheet names without the WorkbookReader constructor? I'd rather not mix crealytics spark-excel code with pandas or any other library.
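[Editor's note: one hedged idea, not verified on Unity Catalog and not proposed in the thread itself: instead of calling the blocked JavaSparkContext.hadoopConfiguration(), construct a fresh org.apache.hadoop.conf.Configuration through py4j and pass that to apply. The py4j security allowlist on shared clusters may well reject this constructor call too.]

```python
# Hypothetical workaround: build an empty Hadoop Configuration instead of
# reading it from the (blocked) JavaSparkContext. An empty configuration may
# suffice for simple paths, but the py4j allowlist on Unity Catalog shared
# clusters could block this JVM call as well.
hconf = spark._jvm.org.apache.hadoop.conf.Configuration()
reader = spark._jvm.com.crealytics.spark.excel.WorkbookReader.apply(
    {"path": "my_file.xlsx"},  # placeholder path
    hconf,
)
print(reader.sheetNames())
```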

@nightscape (Collaborator)

@ewong18 commented Sep 16, 2024

We are having the same issue with our Scala code in Unity Catalog (DBR 14.3 LTS).

As per this documentation: https://learn.microsoft.com/en-us/azure/databricks/compute/access-mode-limitations#spark-api-limitations-and-requirements-for-unity-catalog-shared-access-mode, sparkContext (and therefore hadoopConfiguration) can't be accessed in DBR 14.0 and newer.

So even if there's a workaround for 13.3 for now, newer runtimes won't be able to support it.

@nightscape (Collaborator)

Hmm, that would then require a bigger refactoring, because we also need a Hadoop Configuration in the standard use case (even without reading sheet names):
https://github.com/crealytics/spark-excel/blob/main/src/main/scala/com/crealytics/spark/excel/DefaultSource.scala#L38
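[Editor's note: for context, the standard read path the comment refers to looks like this from PySpark. It resolves the Hadoop configuration on the JVM side inside the data source, so user code never touches sparkContext. A sketch using spark-excel's documented options; the path and sheet address are placeholders.]

```python
# Standard spark-excel usage via the DataSource API. The Hadoop configuration
# is obtained inside the connector on the JVM side, so this does not call the
# py4j-blocked JavaSparkContext.hadoopConfiguration() from Python.
df = (
    spark.read.format("excel")
    .option("dataAddress", "'Sheet1'!A1")  # placeholder sheet/cell address
    .option("header", "true")
    .load("Worktime.xlsx")  # placeholder path
)
df.show()
```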
