Commit
[GLUTEN-3902][VL] Add documentation to configure the Velox+GCS connector (#3902)
tigrux authored Dec 4, 2023
1 parent 27ca04c commit 1857959
Showing 3 changed files with 43 additions and 4 deletions.
2 changes: 2 additions & 0 deletions docs/get-started/GlutenUsage.md
@@ -19,6 +19,7 @@ Please set them via `--`, e.g. `--build_type=Release`.
| enable_iaa | enable IAA for shuffle data de/compression | OFF |
| enable_hbm | enable HBM allocator | OFF |
| enable_s3 | build with s3 lib | OFF |
| enable_gcs | build with gcs lib | OFF |
| enable_hdfs | build with hdfs lib | OFF |
| enable_ep_cache | enable caching for external project build (Velox) | OFF |
| skip_build_ep | skip the build of external projects (velox) | OFF |
@@ -32,6 +33,7 @@ Please set them via `--`, e.g., `--velox_home=/YOUR/PATH`.
| velox_home | Velox build path | GLUTEN_DIR/ep/build-velox/build/velox_ep|
| build_type | Velox build type, CMAKE_BUILD_TYPE | Release|
| enable_s3 | Build Velox with -DENABLE_S3 | OFF |
| enable_gcs | Build Velox with -DENABLE_GCS | OFF |
| enable_hdfs | Build Velox with -DENABLE_HDFS | OFF |
| build_protobuf | build protobuf from source | ON |
| run_setup_script | Run Velox setup script before build | ON |
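
For reference, a hypothetical invocation enabling the GCS library might look like the sketch below; the script name is an assumption, while the flag syntax follows the tables above.

```sh
# Illustrative only: build Gluten's Velox backend with GCS support enabled.
./dev/buildbundle-veloxbe.sh --build_type=Release --enable_gcs=ON
```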
39 changes: 39 additions & 0 deletions docs/get-started/VeloxGCS.md
@@ -0,0 +1,39 @@
---
layout: page
title: Using GCS with Gluten
nav_order: 5
parent: Getting-Started
---
Object stores such as GCS, offered by CSPs, are important for users of Gluten to store their data. This doc discusses the configs and use cases around using Gluten with object stores. In order to use a GCS endpoint as your data source, please ensure you are using the following GCS configs in your spark-defaults.conf. If you're experiencing any issues authenticating to GCS with additional auth mechanisms, please reach out to us using the 'Issues' tab.

# Working with GCS

## Installing the gcloud CLI

To access GCS objects using Gluten and Velox, first you have to [download and install the gcloud CLI](https://cloud.google.com/sdk/docs/install).
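
Once installed, a quick sanity check might look like the following sketch (not part of the official instructions):

```sh
# Confirm the gcloud CLI is on the PATH, then initialize a configuration.
gcloud version
gcloud init
```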


## Configuring GCS using a user account

This is recommended for regular users; follow the [instructions to authorize a user account](https://cloud.google.com/sdk/docs/authorizing#user-account).
After these steps, no specific configuration is required for Gluten, since the authorization is handled entirely by the gcloud tool.
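
For example, a typical user-account login sketch (assuming, as we read the linked instructions, that credentials picked up via the gcloud tool are sufficient):

```sh
# Log in interactively with your user account.
gcloud auth login
# Also expose the credentials as Application Default Credentials for client libraries.
gcloud auth application-default login
```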


## Configuring GCS using a credential file

For workloads that need to be fully automated, authorizing manually can be problematic. For such cases it is better to use a JSON file with the credentials.
This is described in the [instructions to configure a service account](https://cloud.google.com/sdk/docs/authorizing#service-account).

Such a JSON file with the credentials can be passed to Gluten:

```sh
spark.hadoop.fs.gs.auth.type SERVICE_ACCOUNT_JSON_KEYFILE
spark.hadoop.fs.gs.auth.service.account.json.keyfile // path to the json file with the credentials.
```
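
As a hedged sketch of how such a keyfile might be created (the service-account name, project, and path are hypothetical placeholders):

```sh
# Create a JSON key for an existing service account (all names are placeholders).
gcloud iam service-accounts keys create /secrets/gcs-keyfile.json \
  --iam-account=my-sa@my-project.iam.gserviceaccount.com
```

The resulting path would then be the value of `spark.hadoop.fs.gs.auth.service.account.json.keyfile`.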

## Configuring GCS endpoints

For cases where a GCS mock is used, an optional endpoint can be provided:
```sh
spark.hadoop.fs.gs.storage.root.url // URL of the mock GCS service, starting with http or https
```
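
For example, when pointing at a local GCS emulator such as fake-gcs-server (host, port, and scheme below are assumptions for illustration):

```sh
spark.hadoop.fs.gs.storage.root.url http://localhost:4443
```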
6 changes: 2 additions & 4 deletions docs/get-started/VeloxS3.md
@@ -10,7 +10,7 @@ Object stores offered by CSPs such as AWS S3 are important for users of Gluten t

## Configuring S3 endpoint

-S3 proivdes the endpoint based method to access the files, here's the example configuration. Users may need to modify some values based on real setup.
+S3 provides the endpoint based method to access the files, here's the example configuration. Users may need to modify some values based on real setup.

```sh
spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
@@ -22,8 +22,6 @@ spark.hadoop.fs.s3a.connection.ssl.enabled true
spark.hadoop.fs.s3a.path.style.access false
```
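
As a usage sketch (the application jar, bucket, and path are hypothetical), the same settings can also be passed on the command line:

```sh
spark-submit \
  --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
  --conf spark.hadoop.fs.s3a.path.style.access=false \
  your-app.jar s3a://my-bucket/path/to/data
```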

-Note if testing with a mock AWS S3 environment(like Minio/Ceph), users may required to modify some of the values. E.g., on Minio setup, below config is required: `spark.hadoop.fs.s3a.path.style.access true`

## Configuring S3 instance credentials

S3 also provides other methods for access; you can use instance credentials by setting the following config
@@ -58,7 +56,7 @@ spark.gluten.sql.columnar.backend.velox.ssdCachePath // the folder to store
spark.gluten.sql.columnar.backend.velox.ssdCacheSize // the total size of the SSD cache, default is 128MB. Velox will do in-mem cache only if this value is 0.
spark.gluten.sql.columnar.backend.velox.ssdCacheShards // the shards of the SSD cache, default is 1.
spark.gluten.sql.columnar.backend.velox.ssdCacheIOThreads // the IO threads for cache promoting, default is 1. Velox will try to do "read-ahead" if this value is bigger than 1
-spark.gluten.sql.columnar.backend.velox.ssdODirect // enbale or disable O_DIRECT on cache write, default false.
+spark.gluten.sql.columnar.backend.velox.ssdODirect // enable or disable O_DIRECT on cache write, default false.
```
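
As an illustrative sketch, a spark-defaults.conf fragment enabling the SSD cache might look like this; the path and values are assumptions, not recommendations, and the size is given in bytes on the assumption that plain byte counts are accepted:

```sh
# Hypothetical SSD cache settings; tune for your own hardware.
spark.gluten.sql.columnar.backend.velox.ssdCachePath /mnt/ssd/gluten-cache
spark.gluten.sql.columnar.backend.velox.ssdCacheSize 10737418240
spark.gluten.sql.columnar.backend.velox.ssdCacheShards 4
spark.gluten.sql.columnar.backend.velox.ssdCacheIOThreads 2
spark.gluten.sql.columnar.backend.velox.ssdODirect true
```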

It's recommended to mount SSDs to the cache path to get the best performance of local caching. On startup of the Spark context, the cache files will be allocated under "spark.gluten.sql.columnar.backend.velox.cachePath", with a UUID-based suffix, e.g. "/tmp/cache.13e8ab65-3af4-46ac-8d28-ff99b2a9ec9b0". Gluten is not able to reuse older caches for now, and the old cache files are left behind after Spark context shutdown.
