diff --git a/docs/gravitino-manage-hive.md b/docs/gravitino-manage-hive.md
index 11030d3fa7d..61818b5d56c 100644
--- a/docs/gravitino-manage-hive.md
+++ b/docs/gravitino-manage-hive.md
@@ -29,7 +29,7 @@ Example JSON:
 ```json
 {
-  "name": "test",
+  "name": "test_hive_catalog",
   "comment": "my test Hive catalog",
   "type": "RELATIONAL",
   "provider": "hive",
@@ -39,22 +39,137 @@ Example JSON:
 }
 ```

-* `provider`: Set this to "hive" to use Hive as the catalog provider.
-* `metastore.uris`: This is a required configuration, and it should be the Hive metastore service URIs.
-* Other configuration parameters with the `gravitino.bypass.` prefix can be added to the "properties" section and passed down to the underlying Hive metastore.
+* `name`: The name of the Hive catalog to be created.
+* `comment`: Optional. A user-defined comment for the catalog.
+* `provider`: Must be set to "hive" to use Hive as the catalog provider.
+* `type`: Must be set to "RELATIONAL" because Hive uses a relational data model, such as `db.table`.
+* `properties`: The properties of the Hive catalog. See the catalog properties table below for details.

-### configuration
+### catalog properties

-| Configuration item | Description | value |
-|--------------------|-------------|-------|
-| `metastore.uris` | Hive metastore service address, separate multiple addresses with commas | `thrift://127.0.0.1:9083` |
-| `client.pool-size` | The maximum number of Hive metastore clients in the pool for Gravitino. 1 by default value | 1 |
+| Property name | Description | Example value | Since version |
+|---------------|-------------|---------------|---------------|
+| `metastore.uris` | Required. The Hive metastore service URIs; separate multiple addresses with commas. | `thrift://127.0.0.1:9083` | 0.2.0 |
+| `client.pool-size` | The maximum number of Hive metastore clients in the pool for Gravitino. Defaults to 1. | 1 | 0.2.0 |
+| `gravitino.bypass.` | Properties with this prefix are passed down to the underlying HMS client. Empty by default. | `gravitino.bypass.hive.metastore.failure.retries = 3` indicates 3 retries upon failure of Thrift metastore calls | 0.2.0 |

-## After the catalog is initialized
+## Creating a Hive Schema

-You can manage and operate on tables using the following URL format:
+After the catalog is created, you can create a schema by submitting a schema JSON to the Gravitino server using the following URL format:
+
+```shell
+http://{GravitinoServerHost}:{GravitinoServerPort}/api/metalakes/{metalake}/catalogs/{catalog}/schemas
+```
+
+Example JSON:
+
+```json
+{
+  "name": "test_schema",
+  "comment": "my test schema",
+  "properties": {
+    "location": "/user/hive/warehouse"
+  }
+}
+```
+
+* `name`: The name of the Hive database to be created.
+* `comment`: Optional. A user-defined comment for the Hive database.
+* `properties`: The properties of the Hive database. See the schema properties table below for details.
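+
+As a minimal sketch, assuming Gravitino's default HTTP port `8090` and a metalake named `metalake` (both are placeholders to adjust for your deployment), the schema JSON above can be submitted with `curl`; the catalog JSON shown earlier is posted the same way to the `.../catalogs` endpoint:
+
+```shell
+# Sketch only: host, port, and the metalake name "metalake" are placeholders.
+curl -X POST \
+  -H "Content-Type: application/json" \
+  -d '{"name":"test_schema","comment":"my test schema","properties":{"location":"/user/hive/warehouse"}}' \
+  http://localhost:8090/api/metalakes/metalake/catalogs/test_hive_catalog/schemas
+```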
+
+### schema properties
+
+| Property name | Description | Example value | Since version |
+|---------------|-------------|---------------|---------------|
+| `location` | The storage directory of the Hive database. Optional; by default HMS uses the value of `hive.metastore.warehouse.dir` in the Hive configuration file `hive-site.xml`. | `/user/hive/warehouse` | 0.1.0 |
+| `gravitino.bypass.` | Properties with this prefix are passed down to the Hive database parameters without the prefix. | `"gravitino.bypass.my-key" = "my-value"` | 0.2.0 |
+
+## Creating a Hive Table
+
+After the schema is created, you can create a table by submitting a table JSON to the Gravitino server using the following URL format:

 ```shell
 http://{GravitinoServerHost}:{GravitinoServerPort}/api/metalakes/{metalake}/catalogs/{catalog}/schemas/{schema}/tables
 ```

-Now you can use Hive as a catalog for managing your metadata in Gravitino. If you encounter any issues or need further assistance, refer to the Gravitino documentation or seek help from the support team.
+Example JSON:
+
+```json
+{
+  "name": "test_table",
+  "comment": "my test table",
+  "columns": [
+    {
+      "name": "id",
+      "type": "int",
+      "comment": "id column comment"
+    },
+    {
+      "name": "name",
+      "type": "string",
+      "comment": "name column comment"
+    },
+    {
+      "name": "age",
+      "type": "int",
+      "comment": "age column comment"
+    },
+    {
+      "name": "dt",
+      "type": "date",
+      "comment": "dt column comment"
+    }
+  ],
+  "partitions": [
+    {
+      "strategy": "identity",
+      "fieldName": ["dt"]
+    }
+  ],
+  "distribution": {
+    "strategy": "hash",
+    "number": 32,
+    "expressions": [
+      {
+        "expressionType": "field",
+        "fieldName": ["id"]
+      }
+    ]
+  },
+  "sortOrders": [
+    {
+      "expression": {
+        "expressionType": "field",
+        "fieldName": ["age"]
+      },
+      "direction": "asc",
+      "nullOrdering": "first"
+    }
+  ],
+  "properties": {
+    "format": "ORC"
+  }
+}
+```
+
+* `name`: The name of the Hive table to be created.
+* `comment`: Optional. A user-defined comment for the Hive table.
+* `columns`: The columns of the Hive table.
+* `partitions`: Optional. The partitioning of the Hive table; the example above creates a table partitioned by the `dt` column.
+* `distribution`: Optional. Equivalent to the `CLUSTERED BY` clause in Hive DDL; the example above buckets (clusters) the table by the `id` column into 32 buckets.
+* `sortOrders`: Optional. Equivalent to the `SORTED BY` clause in Hive DDL; in the example above, the data in each bucket is sorted by `age` in ascending order.
+* `properties`: The properties of the Hive table. See the table properties table below for details; any other properties are passed down to the underlying Hive table parameters.
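+
+As a rough sketch, assuming the same placeholder metalake, catalog, and schema names used above and the default port `8090`, the table JSON could be submitted like this (the file name `test_table.json` is hypothetical and would contain the JSON shown above):
+
+```shell
+# Sketch only: host, port, metalake name, and test_table.json are placeholders.
+curl -X POST \
+  -H "Content-Type: application/json" \
+  -d @test_table.json \
+  http://localhost:8090/api/metalakes/metalake/catalogs/test_hive_catalog/schemas/test_schema/tables
+```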
+
+### table properties
+
+| Property name | Description | Example value | Since version |
+|---------------|-------------|---------------|---------------|
+| `location` | The storage location of the table. Optional; by default HMS uses the database location as the parent directory. | `/user/hive/warehouse/test_table` | 0.2.0 |
+| `table-type` | The type of the table. Valid values are `MANAGED_TABLE` and `EXTERNAL_TABLE`. Defaults to `MANAGED_TABLE`. | `MANAGED_TABLE` | 0.2.0 |
+| `format` | The table file format. Valid values include `TEXTFILE`, `SEQUENCEFILE`, `RCFILE`, `ORC`, `PARQUET`, `AVRO`, `JSON`, `CSV`, and `REGEX`. Defaults to `TEXTFILE`. | `ORC` | 0.2.0 |
+| `input-format` | The input format class for the table. If not set, it is derived from the `format` property; the default `TEXTFILE` format maps to `org.apache.hadoop.mapred.TextInputFormat`. | `org.apache.hadoop.hive.ql.io.orc.OrcInputFormat` | 0.2.0 |
+| `output-format` | The output format class for the table. If not set, it is derived from the `format` property; the default `TEXTFILE` format maps to `org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat`. | `org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat` | 0.2.0 |
+| `serde-lib` | The SerDe library class for the table. If not set, it is derived from the `format` property; the default `TEXTFILE` format maps to `org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe`. | `org.apache.hadoop.hive.ql.io.orc.OrcSerde` | 0.2.0 |
+| `serde.parameter.` | The prefix for SerDe parameters; properties with this prefix are passed to the table's SerDe without the prefix. Empty by default. | `"serde.parameter.orc.create.index" = "true"` instructs the `ORC` SerDe library to create row indexes | 0.2.0 |
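+
+After creating the table, one way to check which of these properties were resolved by HMS (for example the `input-format` and `serde-lib` derived from `format`) is to load the table back; this sketch assumes the same placeholder names as above and that the server exposes a matching load endpoint at the same path:
+
+```shell
+# Load the table metadata back from Gravitino to inspect its resolved properties.
+curl -X GET \
+  http://localhost:8090/api/metalakes/metalake/catalogs/test_hive_catalog/schemas/test_schema/tables/test_table
+```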