Pig HcatStorer fails with AWS Glue Data Catalog as metastore for Hive. #37

Open
dgghosalaws opened this issue Jan 13, 2021 · 11 comments

@dgghosalaws

Use case
Running the example here: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hcatalog-pig.html
Outcome: the Pig script fails when Glue is the Hive metastore. The script reports a failed status.
The files are written to S3, though.
Error logs

OperationException: getTokenStrForm is not supported
	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputCommitter.commitJob(PigOutputCommitter.java:257)
	at org.apache.pig.backend.hadoop.executionengine.tez.runtime.PigOutputFormatTez$PigOutputCommitterTez.commitJob(PigOutputFormatTez.java:98)
	at org.apache.tez.mapreduce.committer.MROutputCommitter.commitOutput(MROutputCommitter.java:99)
	at org.apache.tez.dag.app.dag.impl.DAGImpl$1.run(DAGImpl.java:1032)
	at org.apache.tez.dag.app.dag.impl.DAGImpl$1.run(DAGImpl.java:1029)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1926)
	at org.apache.tez.dag.app.dag.impl.DAGImpl.commitOutput(DAGImpl.java:1029)
	at org.apache.tez.dag.app.dag.impl.DAGImpl.access$2000(DAGImpl.java:149)
	at org.apache.tez.dag.app.dag.impl.DAGImpl$3.call(DAGImpl.java:1108)
	at org.apache.tez.dag.app.dag.impl.DAGImpl$3.call(DAGImpl.java:1103)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.UnsupportedOperationException: getTokenStrForm is not supported
	at com.amazonaws.glue.catalog.metastore.GlueMetastoreClientDelegate.getTokenStrForm(GlueMetastoreClientDelegate.java:1583)
	at com.amazonaws.glue.catalog.metastore.AWSCatalogMetastoreClient.getTokenStrForm(AWSCatalogMetastoreClient.java:516)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hive.hcatalog.common.HiveClientCache$CacheableHiveMetaStoreClient.invoke(HiveClientCache.java:590)
	at com.sun.proxy.$Proxy67.getTokenStrForm(Unknown Source)
	at org.apache.hive.hcatalog.mapreduce.FileOutputCommitterContainer.cancelDelegationTokens(FileOutputCommitterContainer.java:1012)
	at org.apache.hive.hcatalog.mapreduce.FileOutputCommitterContainer.commitJob(FileOutputCommitterContainer.java:274)
	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputCommitter.commitJob(PigOutputCommitter.java:255)

@dgghosalaws dgghosalaws changed the title Pig HcatStorer fails with AWS Glue Data Catalog as Hive metastore for Hive. Pig HcatStorer fails with AWS Glue Data Catalog as metastore for Hive. Jan 13, 2021
@itharavi

itharavi commented May 7, 2021

+1

@Oleks777

+1

I get another error:
Unable to instantiate a metastore client factory com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory due to: java.lang.ClassNotFoundException: Class com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory not found)

@moneroexamples

moneroexamples commented Jun 20, 2022

@Oleks777

I had the same issue on emr-5.36.0 (I did not test other versions) when trying to use Pig with HCatalog so that I could load tables from Glue into Pig:

pig -useHCatalog

In my case the solution was to manually specify the missing jar:

pig -useHCatalog -Dpig.additional.jars=/usr/share/aws/hmclient/lib/aws-glue-datacatalog-hive2-client-1.18.0.jar

On other EMR versions, the aws-glue-datacatalog-hive2-client-1.18.0.jar file may have a different version number, so check the contents of /usr/share/aws/hmclient/lib/.
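Because the version number changes between releases, the lookup can be scripted instead of hard-coded. This is only a minimal sketch: it assumes the jar keeps the aws-glue-datacatalog-hive*-client-*.jar naming seen in this thread, and the function name find_glue_client_jar is made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the first Glue Data Catalog Hive client jar
# found in a directory (default: the EMR path mentioned in this thread).
find_glue_client_jar() {
  dir="${1:-/usr/share/aws/hmclient/lib}"
  for jar in "$dir"/aws-glue-datacatalog-hive*-client-*.jar; do
    # If the glob matched nothing, $jar is the literal pattern,
    # so confirm that it is a real file before printing it.
    if [ -f "$jar" ]; then
      printf '%s\n' "$jar"
      return 0
    fi
  done
  echo "no Glue client jar found in $dir" >&2
  return 1
}

# Usage on an EMR node (not executed here):
#   pig -useHCatalog -Dpig.additional.jars="$(find_glue_client_jar)"
```

If more than one client jar is present, this picks the first match in glob order, which may not be the one you want; listing the directory by hand is still the safer check.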

Then, to load data from a Glue table:

data = LOAD 'somedatabase.sometablename' USING org.apache.hive.hcatalog.pig.HCatLoader();

then check:

describe data;

@Oleks777

Thanks @moneroexamples! Yes, that did the trick. Instead of adding the jar the way you describe, you can also use the REGISTER command in the script.
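The REGISTER variant might look roughly like the sketch below, which writes a small Pig script that registers the jar itself. The function name write_pig_script and the output file name are hypothetical; the jar path and table name are the ones quoted earlier in this thread and will differ on other clusters.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write a Pig script that registers the Glue client jar
# itself, instead of relying on -Dpig.additional.jars on the command line.
#   $1 = output script path, $2 = path to the Glue client jar
write_pig_script() {
  out="$1"
  jar="$2"
  cat > "$out" <<EOF
REGISTER $jar;
data = LOAD 'somedatabase.sometablename' USING org.apache.hive.hcatalog.pig.HCatLoader();
DESCRIBE data;
EOF
}

# Usage on an EMR 5.x node (not executed here):
#   write_pig_script load_glue_table.pig \
#     /usr/share/aws/hmclient/lib/aws-glue-datacatalog-hive2-client-1.18.0.jar
#   pig -useHCatalog load_glue_table.pig
```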

It looks like this solution works only on EMR 5.x releases (hive2); it doesn't work on 6.x. Does anyone have any advice?

@moneroexamples

moneroexamples commented Jun 21, 2022

@Oleks777 I just checked on emr-6.6 and the following works:

pig -useHCatalog -Dpig.additional.jars=/usr/share/aws/hmclient/lib/aws-glue-datacatalog-hive3-client-3.5.0.jar

As a side note, on EMR 6.6 hcat on its own does not work with Glue either:

hcat -e "show databases;"

which gives this error:

Caused by: MetaException(message:Unable to instantiate a metastore client factory com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory due to: java.lang.ClassNotFoundException: Class com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory not found)

You can solve this by setting HIVE_AUX_JARS_PATH before you call hcat:

export HIVE_AUX_JARS_PATH=/usr/share/aws/hmclient/lib/aws-glue-datacatalog-hive3-client-3.5.0.jar
hcat -e "show databases;"

@Oleks777

@moneroexamples Many thanks! I spent a lot of time compiling the client for hive2, and it is good to know that a compiled version is available from AWS.
Is this path:
/usr/share/aws/hmclient/lib/aws-glue-datacatalog-hive3-client-3.5.0.jar
available on the data nodes by default, or does EMR need to be configured somehow in a bootstrap step?

@dgghosalaws
Author

I ask everyone to either support the premise of the issue title or confirm whether HCatStorer works for partition writes with the Glue Data Catalog as the Hive metastore. I fully understand the iterations above that get basic commands working with Pig on EMR. Thanks.

@moneroexamples

@Oleks777 Sadly, I don't know how to configure EMR so that the extra paths/jars are loaded for Pig and hcat at the bootstrap step.

@eagleshine

Any update on this issue? We also encountered the same "getTokenStrForm is not supported" error when using HCatStorer(...) on EMR.

@zsaltys

zsaltys commented Sep 6, 2023

I'm getting the same error when storing data to ORC or Parquet tables with the latest version of EMR, 6.12.0. It seems that support for writing to Glue tables is broken.

@zsaltys

zsaltys commented Sep 6, 2023

After a little bit of digging we can see the problem originates here:

at org.apache.hive.hcatalog.mapreduce.FileOutputCommitterContainer.cancelDelegationTokens(FileOutputCommitterContainer.java:1012)
at org.apache.hive.hcatalog.mapreduce.FileOutputCommitterContainer.commitJob(FileOutputCommitterContainer.java:274)

If we look at the file:

https://github.com/apache/hive/blob/920f9e535db6270a401db274eef3267d70c1fd2f/hcatalog/core/src/main/java/org/apache/hive/hcatalog/mapreduce/FileOutputCommitterContainer.java#L258

We can see that cancelDelegationTokens is the last thing that happens in commitJob. We can also see how it's used:

https://github.com/apache/hive/blob/920f9e535db6270a401db274eef3267d70c1fd2f/hcatalog/core/src/main/java/org/apache/hive/hcatalog/mapreduce/FileOutputCommitterContainer.java#L997

All we really need to do is return null from getTokenStrForm instead of throwing UnsupportedOperationException; the delegation token cancellation method should then work fine.


6 participants