## Creating a Hive Catalog

Example JSON:

```json
{
  "name": "test_hive_catalog",
  "comment": "my test Hive catalog",
  "type": "RELATIONAL",
  "provider": "hive",
  "properties": {
    "metastore.uris": "thrift://127.0.0.1:9083"
  }
}
```

* `name`: The name of the Hive catalog to be created.
* `comment`: Optional, a user-defined comment for the catalog.
* `provider`: Must be set to "hive" to use Hive as the catalog provider.
* `type`: Must be set to "RELATIONAL" because Hive organizes data relationally, as `db.table`.
* `properties`: The properties of the Hive catalog. See the catalog properties table below for details.
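
The creation flow can be sketched as a small script. The snippet below only builds the request URL and body from the example above; the host, port, and metalake name are placeholders, and the actual HTTP POST (for example via `urllib.request` or `requests`) is left to you.

```python
import json

def build_create_catalog_request(host, port, metalake):
    # URL format taken from this document; all arguments are placeholders.
    url = f"http://{host}:{port}/api/metalakes/{metalake}/catalogs"
    payload = {
        "name": "test_hive_catalog",
        "comment": "my test Hive catalog",
        "type": "RELATIONAL",
        "provider": "hive",
        "properties": {"metastore.uris": "thrift://127.0.0.1:9083"},
    }
    return url, json.dumps(payload)

url, body = build_create_catalog_request("localhost", 8090, "my_metalake")
```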

### Catalog properties

| Property name       | Description                                                                                 | Example value                                                                                                     | Since version |
|---------------------|---------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------|---------------|
| `metastore.uris`    | Required. The Hive metastore service URIs; separate multiple addresses with commas.         | `thrift://127.0.0.1:9083`                                                                                         | 0.2.0         |
| `client.pool-size`  | The maximum number of Hive metastore clients in the pool for Gravitino. Defaults to 1.      | 1                                                                                                                 | 0.2.0         |
| `gravitino.bypass.` | Properties with this prefix are passed down to the underlying HMS client. Empty by default. | `gravitino.bypass.hive.metastore.failure.retries = 3` configures 3 retries upon failure of Thrift metastore calls | 0.2.0         |
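
The `gravitino.bypass.` convention can be illustrated with a short sketch: properties carrying the prefix are selected and forwarded to the underlying HMS client. This is an illustration of the convention only, not Gravitino's actual implementation.

```python
BYPASS_PREFIX = "gravitino.bypass."

def extract_bypass_properties(catalog_properties):
    # Keep only the prefixed entries, stripping the prefix before handing
    # them to the underlying HMS client configuration.
    return {
        key[len(BYPASS_PREFIX):]: value
        for key, value in catalog_properties.items()
        if key.startswith(BYPASS_PREFIX)
    }

props = {
    "metastore.uris": "thrift://127.0.0.1:9083",
    "gravitino.bypass.hive.metastore.failure.retries": "3",
}
hms_conf = extract_bypass_properties(props)
# hms_conf == {"hive.metastore.failure.retries": "3"}
```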

## Creating a Hive Schema

After the catalog is created, you can create a schema under it by submitting a schema JSON to the Gravitino server using the following URL format:

```shell
http://{GravitinoServerHost}:{GravitinoServerPort}/api/metalakes/{metalake}/catalogs/{catalog}/schemas
```

Example JSON:

```json
{
  "name": "test_schema",
  "comment": "my test schema",
  "properties": {
    "location": "/user/hive/warehouse"
  }
}
```

* `name`: The name of the Hive database to be created.
* `comment`: Optional, a user-defined comment for the Hive database.
* `properties`: The properties of the Hive database. See the schema properties table below for details.
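
As with the catalog, the schema request can be sketched programmatically. The snippet below only assembles the URL and body from the example above; host, port, metalake, and catalog names are placeholders, and the HTTP POST itself is left to the caller.

```python
import json

def build_create_schema_request(host, port, metalake, catalog):
    # URL format taken from this document; all arguments are placeholders.
    url = f"http://{host}:{port}/api/metalakes/{metalake}/catalogs/{catalog}/schemas"
    payload = {
        "name": "test_schema",
        "comment": "my test schema",
        "properties": {"location": "/user/hive/warehouse"},
    }
    return url, json.dumps(payload)

url, body = build_create_schema_request(
    "localhost", 8090, "my_metalake", "test_hive_catalog"
)
```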

### Schema properties

| Property name       | Description                                                                                                                                         | Example value                            | Since version |
|---------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------|---------------|
| `location`          | The storage directory of the Hive database. Optional; if unspecified, HMS falls back to the value of `hive.metastore.warehouse.dir` in hive-site.xml. | `/user/hive/warehouse`                   | 0.1.0         |
| `gravitino.bypass.` | Properties with this prefix are passed down to the Hive database parameters with the prefix stripped.                                                 | `"gravitino.bypass.my-key" = "my-value"` | 0.2.0         |


## Creating a Hive Table

After the schema is created, you can create a table by submitting a table JSON to the Gravitino server using the following URL format:

```shell
http://{GravitinoServerHost}:{GravitinoServerPort}/api/metalakes/{metalake}/catalogs/{catalog}/schemas/{schema}/tables
```

Example JSON:

```json
{
  "name": "test_table",
  "comment": "my test table",
  "columns": [
    {
      "name": "id",
      "type": "int",
      "comment": "id column comment"
    },
    {
      "name": "name",
      "type": "string",
      "comment": "name column comment"
    },
    {
      "name": "age",
      "type": "int",
      "comment": "age column comment"
    },
    {
      "name": "dt",
      "type": "date",
      "comment": "dt column comment"
    }
  ],
  "partitions": [
    {
      "strategy": "identity",
      "fieldName": ["dt"]
    }
  ],
  "distribution": {
    "strategy": "hash",
    "number": 32,
    "expressions": [
      {
        "expressionType": "field",
        "fieldName": ["id"]
      }
    ]
  },
  "sortOrders": [
    {
      "expression": {
        "expressionType": "field",
        "fieldName": ["age"]
      },
      "direction": "asc",
      "nullOrdering": "first"
    }
  ],
  "properties": {
    "format": "ORC"
  }
}
```

* `name`: The name of the Hive table to be created.
* `comment`: Optional, a user-defined comment for the Hive table.
* `columns`: The columns of the Hive table.
* `partitions`: Optional, the partitions of the Hive table; the example above partitions the table by the `dt` column.
* `distribution`: Optional, equivalent to the `CLUSTERED BY` clause in Hive DDL; the example above buckets the table by the `id` column.
* `sortOrders`: Optional, equivalent to the `SORTED BY` clause in Hive DDL; the example above sorts the data within each bucket in ascending order of `age`.
* `properties`: The properties of the Hive table. See the table properties table below for details; other properties are passed down to the underlying Hive table parameters.
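
Because the table payload nests several structures (columns, partitioning, distribution, sort orders), it can help to assemble it programmatically. The sketch below rebuilds a trimmed version of the example payload; the `column` helper is hypothetical, while the field names and structure follow the JSON example above.

```python
def column(name, type_, comment):
    # Hypothetical helper for one entry of the "columns" array.
    return {"name": name, "type": type_, "comment": comment}

payload = {
    "name": "test_table",
    "comment": "my test table",
    "columns": [
        column("id", "int", "id column comment"),
        column("dt", "date", "dt column comment"),
    ],
    # Identity partitioning on `dt`, as in the example above.
    "partitions": [{"strategy": "identity", "fieldName": ["dt"]}],
    # Hash distribution into 32 buckets by `id` (Hive CLUSTERED BY).
    "distribution": {
        "strategy": "hash",
        "number": 32,
        "expressions": [{"expressionType": "field", "fieldName": ["id"]}],
    },
    "properties": {"format": "ORC"},
}
```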

### Table properties

| Property name      | Description                                                                                                                                                 | Example value                                                                                 | Since version |
|--------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------|---------------|
| `location`         | The storage location of the table. Optional; if unspecified, HMS uses the database location as the parent directory.                                        | `/user/hive/warehouse/test_table`                                                             | 0.2.0         |
| `table-type`       | The type of the table. Valid values are `MANAGED_TABLE` and `EXTERNAL_TABLE`. Defaults to `MANAGED_TABLE`.                                                  | `MANAGED_TABLE`                                                                               | 0.2.0         |
| `format`           | The table file format. Valid values are `TEXTFILE`, `SEQUENCEFILE`, `RCFILE`, `ORC`, `PARQUET`, `AVRO`, `JSON`, `CSV`, and `REGEX`. Defaults to `TEXTFILE`. | `ORC`                                                                                         | 0.2.0         |
| `input-format`     | The input format class of the table. Defaults to `org.apache.hadoop.mapred.TextInputFormat`; setting `format` changes the default accordingly.              | `org.apache.hadoop.hive.ql.io.orc.OrcInputFormat`                                             | 0.2.0         |
| `output-format`    | The output format class of the table. Defaults to `org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat`; setting `format` changes the default accordingly. | `org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat`                                            | 0.2.0         |
| `serde-lib`        | The serde library class of the table. Defaults to `org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe`; setting `format` changes the default accordingly.   | `org.apache.hadoop.hive.ql.io.orc.OrcSerde`                                                   | 0.2.0         |
| `serde.parameter.` | Properties with this prefix are passed down as serde parameters with the prefix stripped. Empty by default.                                                 | `"serde.parameter.orc.create.index" = "true"` instructs the ORC serde to create row indexes   | 0.2.0         |

Now you can use Hive as a catalog for managing your metadata in Gravitino. If you encounter any issues or need further assistance, refer to the Gravitino documentation or seek help from the support team.