[GLUTEN-7267][CH]Support nested column pruning for `HiveTableScan` json/parquet/orc format #7268

KevinyhZou · 2024-09-18T12:35:15Z

What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

(Fixes: #7267)

How was this patch tested?

By unit tests (UT).

github-actions · 2024-09-18T12:35:32Z

#7267

github-actions · 2024-09-18T12:35:49Z

Run Gluten Clickhouse CI

github-actions · 2024-09-19T07:08:49Z

Run Gluten Clickhouse CI

github-actions · 2024-09-25T04:31:34Z

Run Gluten Clickhouse CI

github-actions · 2024-09-26T02:47:12Z

Run Gluten Clickhouse CI

github-actions · 2024-09-26T02:53:18Z

Run Gluten Clickhouse CI

github-actions · 2024-09-26T04:28:06Z

Run Gluten Clickhouse CI

github-actions · 2024-09-26T04:30:46Z

Run Gluten Clickhouse CI

github-actions · 2024-09-26T06:35:00Z

Run Gluten Clickhouse CI

taiyang-li · 2024-09-29T08:04:06Z

...ckhouse/src/test/scala/org/apache/gluten/execution/hive/GlutenClickHouseHiveTableSuite.scala

+      "select id, d1.c, d1.d[0].x, d2.d['m124'].y from %s where day = '2024-09-26' and hour = '12'"
+        .format(pq_table_name)
+    withSQLConf(
+      ("spark.sql.hive.convertMetastoreParquet" -> "false"),


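In what usage scenarios are these two ORC and Parquet switches set to false?

These two settings need to be false when the table should be read through the Hive Parquet/ORC SerDe rather than through Spark's built-in Parquet/ORC reader. @taiyang-li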
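
As a rough illustration of the scenario described in this reply (a sketch, not code from the PR), the two switches could be turned off for a session as shown below. `spark.sql.hive.convertMetastoreParquet` appears in the test diff above; `spark.sql.hive.convertMetastoreOrc` is assumed to be the ORC counterpart the question refers to. The application name and the table name `some_hive_table` are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: a Hive-enabled session that reads Parquet/ORC Hive tables
// through the Hive SerDe path instead of Spark's built-in readers.
val spark = SparkSession.builder()
  .appName("hive-serde-read-example") // hypothetical name
  .enableHiveSupport()
  .config("spark.sql.hive.convertMetastoreParquet", "false")
  .config("spark.sql.hive.convertMetastoreOrc", "false")
  .getOrCreate()

// With both switches off, Spark does not convert these Hive tables to its
// native data source relations, so the scan goes through the Hive table scan path.
// `some_hive_table` is a hypothetical table with a struct column `d1`.
spark.sql("select id, d1.c from some_hive_table").explain()
```

Both settings are ordinary Spark SQL configurations, so inside a test they can equally be scoped with `withSQLConf`, as the diff above does for the Parquet switch.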

KevinyhZou · 2024-10-09T08:07:08Z

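Performance test

Table schema: test_tbl (a STRING, b STRUCT<x1: STRING, x2: STRING, x3: STRING, x4: STRING, x5: STRING>)
Test SQL: select count(b.x1) from test_tbl
Data volume: 12 million rows
The data was stored in JSON, Parquet, and ORC in turn, and the end-to-end time of this query was measured for each format.

Average time before the optimization:
JSON: 16.52s
Parquet: 2.02s
ORC: 1.25s

Average time after the optimization:
JSON: 12.71s
Parquet: 0.63s
ORC: 0.36s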
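
For scale, the averages above can be expressed as speedup ratios (time before divided by time after). These ratios are derived here from the reported numbers and were not stated in the original comment.

```latex
\begin{aligned}
\text{speedup} &= \frac{t_{\text{before}}}{t_{\text{after}}} \\
\text{JSON:}    &\quad 16.52 / 12.71 \approx 1.30\times \\
\text{Parquet:} &\quad 2.02 / 0.63 \approx 3.21\times \\
\text{ORC:}     &\quad 1.25 / 0.36 \approx 3.47\times
\end{aligned}
```

On this single query, the two columnar formats gain roughly 3x, while JSON gains about 1.3x.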

github-actions · 2024-10-30T12:53:11Z

Run Gluten Clickhouse CI

github-actions · 2024-10-31T07:02:18Z

Run Gluten Clickhouse CI

KevinyhZou marked this pull request as draft September 18, 2024 12:35

github-actions bot added the CORE (works for Gluten Core) and CLICKHOUSE labels Sep 18, 2024

KevinyhZou changed the title ~~[GLUTEN-7267][CH]Support nested column pruning for HiveTableScan json format~~ [GLUTEN-7267][CH]Support nested column pruning for HiveTableScan json format Sep 18, 2024

zouyunhe added 3 commits September 19, 2024 09:45

support nested column pruning

4fa9091

rebase and solve conflict

bb1def6

resolve conflict

4c202a6

KevinyhZou force-pushed the support_nested_project_push_down_json branch from 5ba3026 to 4c202a6 Compare September 19, 2024 07:08

zouyunhe added 2 commits September 25, 2024 09:52

use spark shema prunning

b29ce7d

use spark schema pruning

b367576

KevinyhZou added 2 commits September 25, 2024 17:21

support prunning for orc/parquet

955ad45

fix test

766e7d4

KevinyhZou changed the title ~~[GLUTEN-7267][CH]Support nested column pruning for HiveTableScan json format~~ [GLUTEN-7267][CH]Support nested column pruning for HiveTableScan json/parquet/orc format Sep 26, 2024

remove useless file

7f98477

fix ci test

c865c8c

fix ci test

d45d022

KevinyhZou marked this pull request as ready for review September 26, 2024 06:33

remove orc test

8259217

taiyang-li reviewed Sep 29, 2024

Merge branch 'main' into support_nested_project_push_down_json

f3282a1

add config for nested column pruning

53255e3

only for ch backend

f9ba278
