Not able to query glue/Athena views ['java.lang.IllegalArgumentException: Can not create a Path from an empty string;'] #29
I have the same issue and get the same error as @mvaniterson.
I am encountering the same issue using only Glue and the spark.sql API.
I may be late to the party, but I hope this helps someone who runs into one of those cryptic errors (we encountered this during table creation, when the location was not defined properly in the catalog): https://docs.databricks.com/data/metastores/aws-glue-metastore.html#troubleshooting
Personally, we circumvent this by creating a Delta table over the same path for Spark/Spark SQL and using Athena for generic querying.
I too have been investigating this exact issue. @bbenzikry, would you please explain a bit more about the "delta table" workaround? For now, I have to create two separate views, one from Spark and another from Athena, since these are not mutually compatible.
Hi @kironp, sorry for not replying sooner. Our method is similar to what you said you already tried. We don't consume Athena views from Spark at all. We use the same Glue catalog and create two table definitions and views: one for Delta (https://github.com/delta-io/delta) and one for Athena. Both definitions are configured to use the same path by generating an Athena table from the Delta manifest (https://docs.delta.io/0.7.0/presto-integration.html).
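For context, the Delta-manifest approach from the linked presto-integration page works in two steps: Spark first generates a manifest for the Delta table, then an Athena external table is declared over that manifest directory. A sketch, assuming a hypothetical Delta table at `s3://my-bucket/events` (column list abbreviated; see the linked docs for the exact DDL):

```sql
-- In Spark SQL: generate the symlink manifest for the Delta table
GENERATE symlink_format_manifest FOR TABLE delta.`s3://my-bucket/events`;

-- In Athena: declare an external table over the manifest directory
CREATE EXTERNAL TABLE events (id bigint, name string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://my-bucket/events/_symlink_format_manifest/';
```

Both definitions end up pointing at the same underlying data, which is what allows Spark and Athena to coexist on one path.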
I have created a Spark cluster, and I can list tables and display them as follows:
But accessing an individual table gives the error:
I have tried to enable the Hive metastore as:
But it didn't work.
For me, the same issue happened when I created a view in Athena and then tried to query it in a Glue job via Spark.
Is there any update on this?
Same annoying issue here!
I found a solution for this problem!
I also found a workaround that enables both Glue (Spark) and Athena to read the same view from the Glue Catalog. TL;DR: after creating the view from Spark, you can overwrite two properties of that Glue Catalog "view" with boto3, still in the same Glue job run:

```python
import boto3

spark.sql("create view YOUR_DB.YOUR_VIEW as select * from SOME_TABLE")

glue = boto3.client("glue")
view_from_spark = glue.get_table(DatabaseName="YOUR_DB", Name="YOUR_VIEW")
view_from_spark['Table']['Parameters']['presto_view'] = 'true'
# base64_json is Base64-encoded JSON that describes the view schema in the
# Facebook Presto format.
view_from_spark['Table']['ViewOriginalText'] = base64_json
# Note: strip the read-only keys (DatabaseName, CreateTime, UpdateTime,
# CreatedBy, IsRegisteredWithLakeFormation, CatalogId) from the dict before
# calling update_table, as in the generic version later in this thread.
glue.update_table(DatabaseName="YOUR_DB", TableInput=view_from_spark['Table'])
```

Note: you need to do some cleanup on the base64_json; it should look something like this in the end:

```
base64_json = '/* Presto View: eyJvcmlnaW5hbFNxbCI6IihcbiAgIFNFTEVDVCAqXG4gICBGUk9NXG4gICAgIHJhdy51bXNfdXNlcnNcbikgIiwiY2F0YWxvZyI6ImF3c2RhdGFjYXRhbG9nIiwic2NoZW1hIjoiY2xlY ... == */'
```

After this, the view works from both Spark and Athena. It is a hacky workaround and I am not sure it would work for all use cases. Luckily it works for us, as we rely mostly on Spark and Athena is just for ad hoc querying. I will post my generic solution once it is ready.
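For reference, the encoded payload follows a recognizable pattern: the Base64 string decodes to JSON with `originalSql`, `catalog`, and `schema` fields (visible in the truncated example above). A minimal pure-Python sketch of building such a string; the helper name is hypothetical, and real Athena/Presto view JSON may carry additional fields (e.g. column metadata), so treat this as an illustration rather than a complete spec:

```python
import base64
import json


def build_presto_view_text(sql: str, schema: str, catalog: str = "awsdatacatalog") -> str:
    """Encode a view definition the way Presto stores it in ViewOriginalText:
    Base64-encoded JSON wrapped in a '/* Presto View: ... */' comment."""
    payload = json.dumps({"originalSql": sql, "catalog": catalog, "schema": schema})
    encoded = base64.b64encode(payload.encode("utf-8")).decode("utf-8")
    return f"/* Presto View: {encoded} */"
```

Decoding the marker and Base64 wrapper in reverse recovers the original SQL, which is a handy way to inspect what Athena actually stored.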
So here is my generic workaround. Keep in mind that the query has to be both Spark- and Presto-compatible, so I suggest keeping the SQL of the views as simple as possible.

```python
import boto3
import time


def execute_blocking_athena_query(query: str):
    # Assumes the Athena workgroup has a query result location configured;
    # otherwise pass a ResultConfiguration to start_query_execution.
    athena = boto3.client("athena")
    res = athena.start_query_execution(QueryString=query)
    execution_id = res["QueryExecutionId"]
    while True:
        res = athena.get_query_execution(QueryExecutionId=execution_id)
        state = res["QueryExecution"]["Status"]["State"]
        if state == "SUCCEEDED":
            return
        if state in ["FAILED", "CANCELLED"]:
            raise Exception(res["QueryExecution"]["Status"]["StateChangeReason"])
        time.sleep(1)


def drop_table_if_exists(glue, db: str, table: str):
    try:
        glue.delete_table(DatabaseName=db, Name=table)
    except glue.exceptions.EntityNotFoundException:
        pass


def create_cross_platform_view(db: str, table: str, query: str, spark_session):
    glue = boto3.client("glue")
    drop_table_if_exists(glue, db, table)

    # 1. Create the view with Athena and capture its Presto-encoded schema.
    create_view_sql = f"create view {db}.{table} as {query}"
    execute_blocking_athena_query(create_view_sql)
    presto_schema = glue.get_table(DatabaseName=db, Name=table)["Table"][
        "ViewOriginalText"
    ]
    glue.delete_table(DatabaseName=db, Name=table)

    # 2. Re-create the view with Spark, then patch the Glue table definition
    #    so Athena recognizes it as a Presto view again.
    spark_session.sql(create_view_sql).show()
    spark_view = glue.get_table(DatabaseName=db, Name=table)["Table"]
    for key in [
        "DatabaseName",
        "CreateTime",
        "UpdateTime",
        "CreatedBy",
        "IsRegisteredWithLakeFormation",
        "CatalogId",
    ]:
        # These keys are returned by get_table but rejected by update_table.
        if key in spark_view:
            del spark_view[key]
    spark_view["ViewOriginalText"] = presto_schema
    spark_view["Parameters"]["presto_view"] = "true"
    glue.update_table(DatabaseName=db, TableInput=spark_view)


spark_session = ...  # insert code to create the session
create_cross_platform_view(
    "YOUR_DB", "TEST_VIEW", "select * from YOUR_DB.YOUR_TABLE", spark_session
)
```
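Since the workaround above only touches AWS through boto3 clients, its polling logic can be exercised locally by passing in any object with the same `get_query_execution` response shape. A minimal sketch; the `wait_for_query` helper and `FakeAthena` stub are hypothetical names, not part of the thread's code:

```python
import time


def wait_for_query(athena, execution_id: str, poll_seconds: float = 1.0) -> None:
    """Block until the query reaches a terminal state. `athena` is any object
    exposing get_query_execution(QueryExecutionId=...) with the boto3 shape."""
    while True:
        res = athena.get_query_execution(QueryExecutionId=execution_id)
        status = res["QueryExecution"]["Status"]
        if status["State"] == "SUCCEEDED":
            return
        if status["State"] in ("FAILED", "CANCELLED"):
            raise RuntimeError(status.get("StateChangeReason", "query failed"))
        time.sleep(poll_seconds)


class FakeAthena:
    """Hypothetical in-memory stub that replays a fixed sequence of states."""

    def __init__(self, states):
        self._states = iter(states)
        self.calls = 0

    def get_query_execution(self, QueryExecutionId):
        self.calls += 1
        return {
            "QueryExecution": {
                "Status": {"State": next(self._states), "StateChangeReason": "boom"}
            }
        }
```

`wait_for_query(FakeAthena(["QUEUED", "RUNNING", "SUCCEEDED"]), "id", poll_seconds=0)` returns after three polls, while a `"FAILED"` state raises with the reported reason, so the retry behavior can be verified without an AWS account.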
Thank you @IstvanM! Your solution works well. I had a couple of questions:
@talalryz I believe they could change it, but it is very unlikely, because it would break a lot of views for a lot of users.
So we use this solution in a limited way.
@sbottelli How did you set your options and configurations?
I'm not sure why, but for me this error message seems to be notebook-related, and it only means that "something is wrong". For example:
After I resolved the underlying issue, it also worked from the notebook.
Hey, did you try setting the Location option on the database when creating it? I got the same error when using a database without the Location option.
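The database-location fix mentioned here can be applied at creation time; a sketch with a hypothetical bucket path (the same LOCATION clause works from both Athena and Spark SQL):

```sql
CREATE DATABASE my_db
LOCATION 's3://my-data-bucket/path/to/tables/';
```

For an existing database, the LocationUri can also be set from the Glue console or the Glue `update_database` API.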
I was struggling with a similar issue: in AWS EMR (Hive and Spark/PySpark), I was trying to create a table in the Glue metastore (using .saveAsTable() and .bucketBy()), and I was using the "default" database for my tests. The problem was an empty default location for the database, so I set it to s3://my-data-bucket/path/to/tables. I followed this answer: https://repost.aws/questions/QU5Vg4fVMMT02Qo3NM21CrCg. The default database location must be set regardless of the metastore. Good luck!
I am using Glue interactive sessions, and this doesn't work in my case at all. I set the location value for all databases, but I still get this error.
I'm running an EMR cluster with the 'AWS Glue Data Catalog as the Metastore for Hive' option enabled.
Connecting through a Spark notebook works fine, e.g.:
Everything works as expected, but when querying a view I get the following error:
I guess that since views are not stored on disk, I have to specify a temporary path somewhere, but I cannot find out how.