Spark partitioned tables #2083

Open
tombaeyens opened this issue May 15, 2024 · 2 comments

tombaeyens commented May 15, 2024

While testing schema validation with a Databricks connection, we found a problem with partitioned tables.
SODA treats the rows # Partition Information and # col_name as table columns during validation (see the first image).
We think this happens because of the output of the table's DESCRIBE (second image).
Is there anything we can change on our side, such as a setting? Or is this a bug on the SODA side that needs to be fixed?

spark-partition-1
spark-partition-2

The partition information is redundant for schema validation: each partition column already appears in the main column list and is then repeated in the partition information section.
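To make the problem concrete, here is a minimal sketch in plain Python. It assumes the DESCRIBE output arrives as rows of (col_name, data_type, comment) and that the partition section starts with a row whose name begins with "#". The sample rows and the helper name `strip_partition_section` are hypothetical and are not part of soda-core.

```python
def strip_partition_section(describe_rows):
    """Return (name, data_type) pairs for real columns only.

    Assumption: everything from the first '#'-prefixed row onward is
    metadata (for example '# Partition Information'), and the partition
    columns listed there already appear in the main column list.
    """
    columns = []
    for col_name, data_type, _comment in describe_rows:
        name = (col_name or "").strip()
        if name.startswith("#"):
            # Start of the partition/metadata section: stop reading.
            break
        if not name:
            # Skip blank separator rows.
            continue
        columns.append((name, data_type))
    return columns


# Hypothetical rows shaped like the DESCRIBE output of a partitioned table.
sample_rows = [
    ("id", "bigint", None),
    ("event_date", "date", None),
    ("# Partition Information", "", ""),
    ("# col_name", "data_type", "comment"),
    ("event_date", "date", None),
]

print(strip_partition_section(sample_rows))
```

With a filter like this, the "# Partition Information" and "# col_name" rows never reach the schema check, and the partition column is counted once.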

@tools-soda

SAS-3465

@tombaeyens

Potential fix:

In PySpark, the partition and non-partition columns can be separated like this:

partition_columns = [col.name for col in spark.catalog.listColumns("schema_name.table_name") if col.isPartition]

non_partition_columns = [col.name for col in spark.catalog.listColumns("schema_name.table_name") if not col.isPartition]

(source: https://stackoverflow.com/questions/51540906/how-to-get-the-hive-partition-column-name-using-spark )
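The catalog-based approach could be wrapped in a small helper, sketched here under some assumptions. The function name `split_partition_columns` is hypothetical, and the example uses a stand-in namedtuple so it runs without a Spark session; in real use the input would be the list returned by `spark.catalog.listColumns(...)`, whose entries expose `name` and `isPartition`.

```python
from collections import namedtuple


def split_partition_columns(columns):
    """Split catalog column entries into partition and regular column names.

    `columns` is any iterable of objects with `name` and `isPartition`
    attributes, such as the result of spark.catalog.listColumns(...).
    """
    columns = list(columns)  # allow generators; we iterate twice
    partition_columns = [c.name for c in columns if c.isPartition]
    non_partition_columns = [c.name for c in columns if not c.isPartition]
    return partition_columns, non_partition_columns


# Stand-in for catalog entries (assumption: only the two fields used here).
FakeColumn = namedtuple("FakeColumn", ["name", "isPartition"])
demo_columns = [
    FakeColumn("id", False),
    FakeColumn("amount", False),
    FakeColumn("event_date", True),
]

partition_cols, regular_cols = split_partition_columns(demo_columns)
print(partition_cols)
print(regular_cols)
```

Calling the catalog once and splitting the result avoids listing the columns twice, and the full schema for validation would be the regular columns plus the partition columns, each counted once.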

It may be sufficient to apply the fix at these two places:

https://github.com/sodadata/soda-core/blob/main/soda/spark/soda/data_sources/spark_data_source.py#L213

and

https://github.com/sodadata/soda-core/blob/main/soda/spark/soda/data_sources/spark_data_source.py#L355
