Hive table and partition basic statistics are not correctly imported into AWS Glue Catalog.
simobatt changed the title from "Hive table and partition basic statistics are not correctly imported into AWS Glue Catalog." to "Hive table and partition basic statistics are not correctly imported into AWS Glue Catalog" on Sep 6, 2018
Hive table and partition basic statistics are not correctly imported into AWS Glue Catalog.
The statistics properties are included in the Glue table properties; however, it looks like Hive is not honoring them.
The Glue migration script is capable of migrating table and partition statistics from the Hive Metastore; however, it appears that the script escapes some of the characters, which makes the statistics unusable in the Glue catalog:
https://github.com/aws-samples/aws-glue-samples/blob/master/utilities/Hive_metastore_migration/src/hive_metastore_migration.py#L455
When I created a table in the Hive metastore, the table and column statistics looked like the following:
Table Parameters:
COLUMN_STATS_ACCURATE {"BASIC_STATS":"true","COLUMN_STATS":{"id":"true","name":"true"}}
EXTERNAL TRUE
numFiles 1
numRows 2
rawDataSize 14
totalSize 16
transient_lastDdlTime 1536141689
After I ran the migration script to migrate the Hive metastore to the Glue catalog, the same statistics looked like the following, and I found that they are unusable through Hive:
Table Parameters:
COLUMN_STATS_ACCURATE \{\"BASIC_STATS\"\:\"true\",\"COLUMN_STATS\"\:\{\"id\"\:\"true\",\"name\"\:\"true\"\}\}
EXTERNAL TRUE
numFiles 1
numRows 2
rawDataSize 14
totalSize 16
transient_lastDdlTime 1536141689
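To make the difference between the two listings concrete, here is a minimal sketch that checks whether each COLUMN_STATS_ACCURATE value parses as JSON. It assumes that the consumer of this property expects valid JSON, which I have not verified against the Hive source; `is_valid_json` is a hypothetical helper written for this illustration and is not part of the migration script.

```python
import json

# Values copied from the two "Table Parameters" listings above.
original = '{"BASIC_STATS":"true","COLUMN_STATS":{"id":"true","name":"true"}}'
migrated = r'\{\"BASIC_STATS\"\:\"true\",\"COLUMN_STATS\"\:\{\"id\"\:\"true\",\"name\"\:\"true\"\}\}'


def is_valid_json(value: str) -> bool:
    """Return True if value parses as JSON (hypothetical helper, for illustration only)."""
    try:
        json.loads(value)
        return True
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False


print("original parses as JSON:", is_valid_json(original))
print("migrated parses as JSON:", is_valid_json(migrated))
```

The migrated value starts with a backslash, so a JSON parser rejects it before reading any of the statistics flags.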
I then manually modified the table property (COLUMN_STATS_ACCURATE) in the Glue console, which converted COLUMN_STATS_ACCURATE into a usable format.
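The same kind of repair could be done programmatically rather than by hand. The sketch below is only an illustration under the assumption that every backslash in the migrated value was introduced by the migration and that the value contains no legitimate JSON escapes; `unescape_param` is a hypothetical helper, not a function of the migration script or of any AWS SDK.

```python
import json
import re


def unescape_param(value: str) -> str:
    """Remove one backslash in front of each escaped character (hypothetical helper)."""
    # "\\(.)" matches a backslash followed by any single character;
    # the replacement keeps only that character.
    return re.sub(r"\\(.)", r"\1", value)


# Escaped value as it appeared in the Glue catalog after the migration.
migrated = r'\{\"BASIC_STATS\"\:\"true\",\"COLUMN_STATS\"\:\{\"id\"\:\"true\",\"name\"\:\"true\"\}\}'

repaired = unescape_param(migrated)
stats = json.loads(repaired)  # raises ValueError if the repaired value is still not JSON

print(repaired)
```

Writing the repaired value back to the Glue catalog is left out here, since it depends on how the catalog is accessed.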
I didn't check the compatibility of the migrated statistics with the other EMR tools (Spark, Presto) and AWS services (Glue ETL job, Athena, Redshift Spectrum).
Regards,
Simone