
[GLUTEN-5414][VL] FEAT: Support read CSV #5447

Merged · zhouyuan merged 7 commits into apache:main on May 8, 2024
Conversation

@jinchengchenghh (Contributor) commented Apr 18, 2024

This PR uses Arrow's CSV reader to parse the CSV file and then feeds the Arrow-format data into the Velox pipeline.
Query plan:

*(1) ColumnarToRow
+- ArrowFileScan arrowcsv [Name#17,Language#18] Batched: true, DataFilters: [], Format: org.apache.gluten.datasource.ArrowCSVFileFormat@7772ec28, Location: InMemoryFileIndex(1 paths)[file:/mnt/DP_disk1/code/incubator-gluten/backends-velox/target/scala-2..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<Name:string,Language:string>


VeloxColumnarToRowExec
+- ^(2) FilterExecTransformer (isnotnull(Name#17) AND (Name#17 = Peter))
   +- ^(2) InputIteratorTransformer[Name#17, Language#18]
      +- ^(2) InputAdapter
         +- ^(2) ArrowFileScan arrowcsv [Name#17,Language#18] Batched: true, DataFilters: [isnotnull(Name#17), (Name#17 = Pet

The ArrowFileScan arrowcsv node indicates that the file format has already been changed to Arrow.

If the specified schema differs from the file schema, the scan falls back to vanilla Spark to generate UnsafeRow, which is then converted to a Velox ColumnarBatch and then to an ArrowRecordBatch, because we do not yet have a row-to-Arrow converter, and supportsBatch = true means we must output an Arrow ColumnarBatch.

This PR introduces protobuf to compile the JNI code (https://github.com/apache/arrow/pull/36929/files), but a higher protobuf version causes UnsatisfiedLinkError: /tmp/jnilib-4372912739792055919.tmp: /tmp/jnilib-4372912739792055919.tmp: undefined symbol: _ZTIN6google8protobuf7MessageE, so we need to compile the Arrow Java dataset module ourselves.
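
A minimal spark-shell sketch of how the new scan path shows up (the file path is hypothetical; any CSV with a header row works):

// Hypothetical local CSV file with a header row.
val df = spark.read.format("csv").option("header", "true").load("file:///tmp/student.csv")

// With this PR the scan node appears as ArrowFileScan arrowcsv, feeding
// Arrow record batches into the Velox pipeline.
print(df.queryExecution.executedPlan)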


Thanks for opening a pull request!

Could you open an issue for this pull request on Github Issues?

https://github.com/apache/incubator-gluten/issues

Then could you also rename commit message and pull request title in the following format?

[GLUTEN-${ISSUES_ID}][COMPONENT]feat/fix: ${detailed message}


Comment on lines 222 to 257
// Holds at most one AutoCloseable at a time and closes it on release.
class GenericRetainer[T <: AutoCloseable] {
  private var retained: Option[T] = None

  def retain(batch: T): Unit = {
    if (retained.isDefined) {
      throw new IllegalStateException
    }
    retained = Some(batch)
  }

  // Close and drop the currently retained element, if any.
  def release(): Unit = {
    retained.foreach(b => b.close())
    retained = None
  }
}

// Iterator wrapper that closes the previously returned element before
// advancing, and releases the last one on task completion.
class UnsafeItr[T <: AutoCloseable](delegate: Iterator[T]) extends Iterator[T] {
  val holder = new GenericRetainer[T]()

  addLeakSafeTaskCompletionListener[Unit](
    (_: TaskContext) => {
      holder.release()
    })

  override def hasNext: Boolean = {
    holder.release()
    val hasNext = delegate.hasNext
    hasNext
  }

  override def next(): T = {
    val b = delegate.next()
    holder.retain(b)
    b
  }
}

@zhouyuan zhouyuan changed the title [VL] Support read CSV [GLUTEN-5414][VL] Support read CSV Apr 18, 2024

#5414

sparkSession: SparkSession,
options: Map[String, String],
path: Path): Boolean = {
true

Perhaps using super.isSplitable(sparkSession, options, path) would be better.

@jinchengchenghh (Author) replied:
super returns false, but here it is true. When the codec is not empty, the file cannot be split; I will refactor this.
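
A possible shape of that refactor (a sketch only, not the code in this PR; the helper name is hypothetical), mirroring vanilla Spark's behaviour of splitting only when no compression codec applies to the path:

import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.compress.CompressionCodecFactory
import org.apache.spark.sql.SparkSession

// Sketch: a CSV file is splittable only when it is not compressed.
def isSplitableSketch(sparkSession: SparkSession, path: Path): Boolean = {
  val conf = sparkSession.sparkContext.hadoopConfiguration
  new CompressionCodecFactory(conf).getCodec(path) == null
}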


@FelixYBW (Contributor): Can you paste a UI diagram in the first comment?

@jinchengchenghh jinchengchenghh marked this pull request as draft April 18, 2024 23:35
@@ -458,6 +458,17 @@ class TestOperator extends VeloxWholeStageTransformerSuite {
}
}

test("csv scan") {

Does it support CSV with compression like snappy?

@jinchengchenghh (Author) replied Apr 19, 2024:
Yes, this PR supports compression codecs (apache/arrow#9685).
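
A minimal usage sketch (hypothetical file path), assuming the compression codec is inferred from the file extension as in vanilla Spark:

// Hypothetical path to a gzip-compressed CSV file with a header row.
val df = spark.read
  .format("csv")
  .option("header", "true")
  .load("file:///tmp/student.csv.gz")
df.show()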


@zhouyuan zhouyuan changed the title [GLUTEN-5414][VL] Support read CSV [GLUTEN-5414][VL] FEAT: Support read CSV Apr 19, 2024
@jinchengchenghh jinchengchenghh marked this pull request as ready for review April 23, 2024 23:51

@liujiayi771 (Contributor) commented May 7, 2024:

We need to add a shade exclusion for the Arrow dataset module; otherwise, the Arrow JNI cannot find the correct dataset Java methods.

@liujiayi771 (Contributor) commented May 7, 2024:

We also need to add HDFS support for the Arrow dataset module.

caused by: java.lang.RuntimeException: Got HDFS URI but Arrow compiled without HDFS support                                                                                                                                  
        at org.apache.arrow.dataset.file.JniWrapper.makeFileSystemDatasetFactory(Native Method)                                                                                                                              
        at org.apache.arrow.dataset.file.FileSystemDatasetFactory.createNative(FileSystemDatasetFactory.java:40)                                                                                                             
        at org.apache.arrow.dataset.file.FileSystemDatasetFactory.<init>(FileSystemDatasetFactory.java:31)                                                                                                                   
        at org.apache.gluten.utils.ArrowUtil$.makeArrowDiscovery(ArrowUtil.scala:149)                                                                                                                                        
        at org.apache.gluten.datasource.ArrowCSVFileFormat.$anonfun$buildReaderWithPartitionValues$3(ArrowCSVFileFormat.scala:281)                                                                                           
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:231)                                               
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:293)                                                                                                                
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:125)

But even after enabling ARROW_HDFS, there are still issues when accessing csv in HDFS.

[libprotobuf ERROR google/protobuf/descriptor_database.cc:642] File already exists in database: Security.proto                                                                                                               
[libprotobuf FATAL google/protobuf/descriptor.cc:1986] CHECK failed: GeneratedDatabase()->Add(encoded_file_descriptor, size):                                                                                                
terminate called after throwing an instance of 'google::protobuf::FatalException'                                                                                                                                            
  what():  CHECK failed: GeneratedDatabase()->Add(encoded_file_descriptor, size):   


@jinchengchenghh (Author) replied:
> We also need to add HDFS support for arrow dataset. [...] But even after enabling ARROW_HDFS, there are still issues when accessing csv in HDFS.

This may be related to the protobuf version; I will try to test CSV in HDFS.


@jinchengchenghh (Author):

After enabling HDFS in Arrow, I can successfully read a CSV file from HDFS.

scala> val filePath = "/input/student.csv"
filePath: String = /input/student.csv

scala> val df = spark.read.format("csv").option("header", "true").load(filePath)
E0508 10:25:42.841267 3005137 Exceptions.h:69] Line: /mnt/DP_disk1/code/incubator-gluten/ep/build-velox/build/velox_ep/velox/exec/Task.cpp:1850, Function:terminate, Expression:  Cancelled, Source: RUNTIME, ErrorCode: INVALID_STATE
df: org.apache.spark.sql.DataFrame = [Name: string, Language: string]

scala>
     | df.show()
+-----+--------+
| Name|Language|
+-----+--------+
| Juno|    Java|
|Peter|  Python|
|Celin|     C++|
+-----+--------+


scala> print(df.queryExecution.executedPlan)
*(1) ColumnarToRow
+- ArrowFileScan arrowcsv [Name#17,Language#18] Batched: true, DataFilters: [], Format: org.apache.gluten.datasource.ArrowCSVFileFormat@485f3327, Location: InMemoryFileIndex(1 paths)[hdfs://0.0.0.0:9000/input/student.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<Name:string,Language:string>


My local protobuf version is below; I'm not sure whether the issue is related to the protobuf version. @liujiayi771

root@sr249:/mnt/DP_disk2/tpcds/scripts# protoc --version
libprotoc 3.21.4

@liujiayi771 (Contributor): @jinchengchenghh I have successfully read CSV files in HDFS. Thanks.

@jinchengchenghh (Author): Can you help merge it? Thanks! @zhztheplayer

@zhouyuan (Contributor) left a review comment: 👍

@zhouyuan zhouyuan merged commit 32775f8 into apache:main May 8, 2024
43 checks passed
@FelixYBW (Contributor) commented May 8, 2024:

@jinchengchenghh can you run a performance comparison: CSV parsing only, and CSV + Gluten? You may create a large table and run TPCH Q6.

@jinchengchenghh (Author) commented May 11, 2024:

TPCH SF2000 Q6 performance, query:
select sum(l_extendedprice * l_discount) as revenue from lineitem where l_shipdate >= '1994-01-01' and l_shipdate < '1995-01-01' and l_discount between .06 - 0.01 and .06 + 0.01 and l_quantity < 24

lineitem data: 622G

gluten without native reader | gluten native csv reader | vanilla Spark
8333                         | 2456                     | 8385

Test config:

--num-executors 18 \
  --driver-memory 20g \
  --executor-cores 8 \
  --executor-memory 4g \
  --master local[1] \
  --deploy-mode client \
  --conf spark.executor.memoryOverhead=1g \

Test script:

val schema = new StructType()
  .add("l_orderkey", LongType)
  .add("l_partkey", LongType)
  .add("l_suppkey", LongType)
  .add("l_linenumber", LongType)
  .add("l_quantity", DoubleType)
  .add("l_extendedprice", DoubleType)
  .add("l_discount", DoubleType)
  .add("l_tax", DoubleType)
  .add("l_returnflag", StringType)
  .add("l_linestatus", StringType)
  .add("l_shipdate", DateType)
  .add("l_commitdate", DateType)
  .add("l_receiptdate", DateType)
  .add("l_shipinstruct", StringType)
  .add("l_shipmode", StringType)
  .add("l_comment", StringType)

val lineitem = spark.read.format("csv").option("header","true").schema(schema).load("file:///mnt/DP_disk2/tpch/csvdata/")
spark.sql(q6)

Note: because the file schema must match the Arrow schema, the schema should be specified explicitly with .schema(arrow_matched_schema).
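
For completeness, a hypothetical way to wire the query into the script above; the query text is the one quoted at the top of this comment, and the temp-view registration is an assumption:

// Register the CSV-backed DataFrame so the SQL below can reference it.
lineitem.createOrReplaceTempView("lineitem")

val q6 =
  """select sum(l_extendedprice * l_discount) as revenue
    |from lineitem
    |where l_shipdate >= '1994-01-01' and l_shipdate < '1995-01-01'
    |  and l_discount between .06 - 0.01 and .06 + 0.01
    |  and l_quantity < 24""".stripMargin

spark.sql(q6).show()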

@FelixYBW (Contributor):

Thank you, Chengcheng. What's the vanilla spark performance in this case? And how many task threads did you use?
