[Failing Test]: Various TPC-DS queries throw NPEs using SparkRunner #28256

Closed

mosche opened this issue Aug 31, 2023 · 27 comments · Fixed by #29162

mosche commented Aug 31, 2023

What happened?

Various TPC-DS queries started throwing NPEs with the SparkRunner a while back:

java.lang.NullPointerException
        at org.apache.beam.vendor.guava.v32_1_2_jre.com.google.common.base.Preconditions.checkNotNull(Preconditions.java:903)
        at org.apache.beam.sdk.util.WindowedValue$TimestampedWindowedValue.<init>(WindowedValue.java:312)
        at org.apache.beam.sdk.util.WindowedValue$TimestampedValueInGlobalWindow.<init>(WindowedValue.java:329)
        at org.apache.beam.sdk.util.WindowedValue.of(WindowedValue.java:95)

Without looking further into the underlying root cause, this seems to be related to #27617.
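
For context on where the check fires: WindowedValue.of funnels the timestamp through Preconditions.checkNotNull, so a null timestamp reaching it reproduces the trace above in isolation. A minimal sketch (hypothetical value; this uses the public WindowedValue API, not the TPC-DS code path itself):

import org.apache.beam.sdk.transforms.windowing.GlobalWindow;
import org.apache.beam.sdk.transforms.windowing.PaneInfo;
import org.apache.beam.sdk.util.WindowedValue;

public class NpeRepro {
  public static void main(String[] args) {
    // A non-MIN timestamp in the global window goes through
    // TimestampedValueInGlobalWindow, whose constructor null-checks the
    // timestamp; passing null throws the same NPE as in the stack trace above.
    WindowedValue.of("value", /* timestamp= */ null, GlobalWindow.INSTANCE, PaneInfo.NO_FIRING);
  }
}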

Issue Failure

Failure: Test is flaky

Issue Priority

Priority: 2 (backlog / disabled test but we think the product is healthy)

Issue Components

  • [ ] Component: Python SDK
  • [ ] Component: Java SDK
  • [ ] Component: Go SDK
  • [ ] Component: Typescript SDK
  • [ ] Component: IO connector
  • [ ] Component: Beam examples
  • [ ] Component: Beam playground
  • [ ] Component: Beam katas
  • [ ] Component: Website
  • [x] Component: Spark Runner
  • [ ] Component: Flink Runner
  • [ ] Component: Samza Runner
  • [ ] Component: Twister2 Runner
  • [ ] Component: Hazelcast Jet Runner
  • [ ] Component: Google Cloud Dataflow Runner
mosche commented Sep 8, 2023

cc @aromanenko-dev

aromanenko-dev commented Sep 27, 2023

After a quick investigation with git bisect, I can confirm that it was caused by this change (#27617); the first bad commit is 05305ede45366f158f27fc2b83b9ce00db4df2ab.

Interestingly, it seems to affect only the Spark RDD runner with certain types of pipelines, even though there are no failures in the VR (ValidatesRunner) tests.

aromanenko-dev commented:

The CLI command to reproduce the issue:

./gradlew :sdks:java:testing:tpcds:run -Ptpcds.runner=":runners:spark:3" -Ptpcds.args=" \
  --runner=SparkRunner \
  --queries=3 \
  --tpcParallel=1 \
  --dataDirectory=/path/to/input/data/ \
  --dataSize=1GB \
  --sourceType=PARQUET \
  --resultsDirectory=/path/to/results/"

echauchot commented:

@aromanenko-dev Thanks for the root cause analysis. Do you use Beam schemas in the TPC-DS implementation?

aromanenko-dev commented:

@echauchot Yes, the CSV or Parquet schema is converted into a Beam schema so that the queries can be executed with Beam SQL.
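
For illustration, that conversion step looks roughly like this (a sketch with made-up field names; AvroUtils.toBeamSchema is assumed here, in its Avro-extension location in recent Beam releases):

import org.apache.avro.SchemaBuilder;
import org.apache.beam.sdk.extensions.avro.schemas.utils.AvroUtils;
import org.apache.beam.sdk.schemas.Schema;

public class SchemaConversionSketch {
  public static void main(String[] args) {
    // The Parquet file schema is read as an Avro schema...
    org.apache.avro.Schema avroSchema =
        SchemaBuilder.record("store_sales").fields()
            .requiredLong("ss_item_sk")
            .requiredDouble("ss_sales_price")
            .endRecord();
    // ...and converted to a Beam schema so rows can be queried with Beam SQL.
    Schema beamSchema = AvroUtils.toBeamSchema(avroSchema);
    System.out.println(beamSchema);
  }
}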

aromanenko-dev commented:

Full stacktrace:

23/10/03 15:21:23 ERROR org.apache.spark.executor.Executor: Exception in task 1.0 in stage 9.0 (TID 45)
java.lang.NullPointerException
        at org.apache.beam.vendor.guava.v32_1_2_jre.com.google.common.base.Preconditions.checkNotNull(Preconditions.java:903)
        at org.apache.beam.sdk.util.WindowedValue$TimestampedWindowedValue.<init>(WindowedValue.java:312)
        at org.apache.beam.sdk.util.WindowedValue$TimestampedValueInGlobalWindow.<init>(WindowedValue.java:329)
        at org.apache.beam.sdk.util.WindowedValue.of(WindowedValue.java:95)
        at org.apache.beam.runners.spark.translation.SparkCombineFn$SingleWindowWindowedAccumulator.extractOutput(SparkCombineFn.java:251)
        at org.apache.beam.runners.spark.translation.SparkCombineFn.extractOutputStream(SparkCombineFn.java:774)
        at org.apache.beam.runners.spark.translation.TransformTranslator$5.lambda$evaluate$8d6d352$1(TransformTranslator.java:351)
        at org.apache.spark.api.java.JavaPairRDD.$anonfun$flatMapValues$1(JavaPairRDD.scala:680)
        at org.apache.spark.rdd.PairRDDFunctions.$anonfun$flatMapValues$3(PairRDDFunctions.scala:763)
        at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
        at scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:32)
        at org.apache.beam.runners.spark.translation.MultiDoFnFunction.call(MultiDoFnFunction.java:130)
        at org.apache.beam.runners.spark.translation.MultiDoFnFunction.call(MultiDoFnFunction.java:60)
        at org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsToPair$1(JavaRDDLike.scala:186)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:863)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:863)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:131)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1491)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)

echauchot commented:

Yes, the timestamp is null in some cases.

echauchot commented:

Can it be reproduced using CSV input files?

aromanenko-dev commented:

Yes, I have the same issue with CSV files.
Btw, a quick fix to make it work with generated CSV data: #28819

echauchot commented:

Getting a different exception on master using --sourceType=CSV:

Caused by: java.lang.IllegalArgumentException: Expect 28 fields, but actually 29
        at org.apache.beam.sdk.extensions.sql.impl.schema.BeamTableUtils.csvLines2BeamRows(BeamTableUtils.java:76)
        at org.apache.beam.sdk.tpcds.CsvToRow.lambda$expand$43aa1fdf$1(CsvToRow.java:54)
        at org.apache.beam.sdk.transforms.FlatMapElements$3.processElement(FlatMapElements.java:167)
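
One plausible cause of that off-by-one (an assumption on my side, not verified here) is a trailing column delimiter in the generated files; BeamTableUtils parses lines with commons-csv, where a trailing delimiter yields one extra empty field:

import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;

public class TrailingDelimiterSketch {
  public static void main(String[] args) throws Exception {
    // dsdgen-style lines end with the delimiter; parsed naively, that
    // trailing '|' produces one extra (empty) field: 4 instead of 3 here.
    String line = "a|b|c|";
    CSVParser parser = CSVParser.parse(line, CSVFormat.newFormat('|'));
    CSVRecord record = parser.getRecords().get(0);
    System.out.println(record.size()); // prints 4
  }
}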

echauchot commented:

Ah yes, I get the NPE if I apply #28819.

aromanenko-dev commented:

CC @je-ik: since you worked on the Group/Combine transform translations for the original Spark RDD runner, could you take a look? Is it a Spark runner issue?

je-ik commented Oct 27, 2023

Oh my, this is old history. :)
I walked through the code, and I actually don't understand why accTimestamp is set to null when the timestamp should be BoundedWindow.TIMESTAMP_MIN_VALUE.

I created #29162; it seems to pass all tests and validatesRunner suites locally. Can you try this patch?
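
Not the literal diff in #29162, just a sketch of the kind of defensive defaulting being discussed: fall back to TIMESTAMP_MIN_VALUE instead of handing WindowedValue.of a null timestamp.

import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
import org.joda.time.Instant;

class TimestampFallbackSketch {
  // If the accumulator never observed an element timestamp, default to
  // TIMESTAMP_MIN_VALUE rather than propagating null into WindowedValue.of.
  static Instant outputTimestamp(Instant accTimestamp) {
    return accTimestamp != null ? accTimestamp : BoundedWindow.TIMESTAMP_MIN_VALUE;
  }
}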

aromanenko-dev commented Oct 27, 2023

@je-ik Thanks!

I quickly tested it with a couple of the TPC-DS queries that were failing, and they pass now.

So I think that if the ValidatesRunner tests pass, we should merge this fix. Though it's strange that this issue was not caught by any of the VR tests running with SparkRunner.

je-ik commented Oct 27, 2023

Yes, I'd just like to walk through the code again to be sure exactly what the impact of the fix might be. And yes, it is strange that it was not caught by the VR tests; I'll look into it.
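
For reference, a covering test could be as simple as a global-window Combine, which exercises the same extractOutput path as the trace above (a hypothetical sketch, not an existing suite; it would need to run under the ValidatesRunner category with SparkRunner to catch this):

import org.apache.beam.sdk.testing.PAssert;
import org.apache.beam.sdk.testing.TestPipeline;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.Sum;
import org.junit.Rule;
import org.junit.Test;

public class GlobalWindowCombineSketchTest {
  @Rule public final transient TestPipeline p = TestPipeline.create();

  @Test
  public void combineGloballyInGlobalWindow() {
    // Combining in the global window forces the runner to emit a single
    // WindowedValue whose timestamp must not be null.
    PAssert.thatSingleton(p.apply(Create.of(1, 2, 3)).apply(Sum.integersGlobally()))
        .isEqualTo(6);
    p.run().waitUntilFinish();
  }
}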

je-ik self-assigned this Oct 27, 2023
je-ik commented Oct 30, 2023

I don't have any background on the TPC-DS queries. Do we have input data that I can pass to the gradle command to reproduce the NPEs (the dataDirectory)?

je-ik commented Oct 30, 2023

Ah, I see, gs://beam-tpcds/datasets/parquet/nonpartitioned.

je-ik commented Oct 30, 2023

I'm unable to reproduce the error locally. Complete command-line:

./gradlew :sdks:java:testing:tpcds:run -Ptpcds.runner=":runners:spark:3" -Ptpcds.args=" \
  --runner=SparkRunner \
  --queries=3,3,3,3,3,3,3,3,3,3,3 \
  --tpcParallel=1 \
  --dataDirectory=gs://beam-tpcds/datasets/parquet/partitioned/1GB \
  --dataSize=1GB \
  --sourceType=PARQUET \
  --resultsDirectory=/tmp/tpc-ds-results/"

All attempts pass as Successful on current master (c2816c8d97), the same for both the partitioned and nonpartitioned versions.

aromanenko-dev commented Oct 30, 2023

@je-ik Hmm, interesting. A couple of notes:

  1. Why do you specify several identical queries in your command? Only one should be enough.
  2. Could you change the input path to gs://beam-tpcds/datasets/parquet/partitioned/ and rerun? It seems we have an issue here in the Beam TPC-DS runner: it should fail if it doesn't read any records.

je-ik commented Oct 30, 2023

> 1. Why do you specify several identical queries in your command? Only one should be enough.

Just to re-run the test multiple times to reveal any flakes.

> 2. Could you change the input path to `gs://beam-tpcds/datasets/parquet/partitioned/` and rerun? It seems we have an issue here in the Beam TPC-DS runner: it should fail if it doesn't read any records.

I tried both, same results.

je-ik commented Oct 30, 2023

Ah, I see. I need to remove the 1GB suffix. Yes, I'll try, thanks!

aromanenko-dev commented:

Yes, I'll fix this.

je-ik commented Oct 30, 2023

I'm obviously doing something wrong.

Running the command like this

$ ./gradlew :sdks:java:testing:tpcds:run -Ptpcds.runner=":runners:spark:3" -Ptpcds.args=" \
  --runner=SparkRunner \
  --queries=3 \
  --tpcParallel=1 \
  --dataDirectory=gs://beam-tpcds/datasets/parquet/partitioned/ \
  --dataSize=1GB \
  --sourceType=PARQUET \
  --resultsDirectory=/tmp/tpc-ds-results/"

I get a success:

+--------------+------------------------------+--------------+------------+--------------+--------------------------------+--------------------------------+----------------------+
|  Query Name  |           Job Name           |  Data Size   |  Dialect   |    Status    |           Start Time           |            End Time            |  Elapsed Time(sec)   |
+--------------+------------------------------+--------------+------------+--------------+--------------------------------+--------------------------------+----------------------+
|    query3    |  query3result1698672244841   |     1GB      |  Calcite   |  Successful  |  Mon Oct 30 14:24:08 CET 2023  |  Mon Oct 30 14:24:15 CET 2023  |        6.483         |
+--------------+------------------------------+--------------+------------+--------------+--------------------------------+--------------------------------+----------------------+

but the outputs are empty:

$ ls -l /tmp/tpc-ds-results/1GB/
total 0
-rw-rw-r-- 1 honza honza 0 Oct 30 14:24 query3result1698672244841-00000-of-00001.txt

Accessing the bucket seems to be working fine, e.g.:

$ gsutil ls -l gs://beam-tpcds/datasets/parquet/partitioned/1GB/catalog_page
         8  2021-03-24T06:03:33Z  gs://beam-tpcds/datasets/parquet/partitioned/1GB/catalog_page/._SUCCESS.crc
      5456  2021-03-24T06:03:33Z  gs://beam-tpcds/datasets/parquet/partitioned/1GB/catalog_page/.part-00000-43e37567-6034-4fae-bda9-db2a85216f3f-c000.snappy.parquet.crc
         0  2021-03-24T06:03:34Z  gs://beam-tpcds/datasets/parquet/partitioned/1GB/catalog_page/_SUCCESS
    697339  2021-03-24T06:03:34Z  gs://beam-tpcds/datasets/parquet/partitioned/1GB/catalog_page/part-00000-43e37567-6034-4fae-bda9-db2a85216f3f-c000.snappy.parquet
TOTAL: 4 objects, 702803 bytes (686.33 KiB)

aromanenko-dev commented:

If the results are empty, then it's very likely that the input was empty too (for some reason) - ParquetIO doesn't fail if it doesn't find any input files.
Could you try running it with gs://beam-tpcds/datasets/parquet/nonpartitioned as the input path? This is what the TPC-DS Jenkins job does.
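
One way to turn a silently empty input into a hard failure (a sketch, untested here; the filepattern is illustrative) is to match the inputs with EmptyMatchTreatment.DISALLOW before reading:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.fs.EmptyMatchTreatment;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class FailOnEmptyInput {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    // FileIO.match with DISALLOW fails at execution time if no files match,
    // instead of silently producing an empty PCollection of metadata.
    p.apply(
        FileIO.match()
            .filepattern("gs://beam-tpcds/datasets/parquet/nonpartitioned/1GB/store_sales/*.parquet")
            .withEmptyMatchTreatment(EmptyMatchTreatment.DISALLOW));
    p.run().waitUntilFinish();
  }
}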

je-ik commented Oct 30, 2023

Same results. Runs OK, but empty outputs.

je-ik commented Oct 30, 2023

Update: I removed the last slash, and it failed! :)

+--------------+------------------------------+--------------+------------+----------+--------------+------------+----------------------+
|  Query Name  |           Job Name           |  Data Size   |  Dialect   |  Status  |  Start Time  |  End Time  |  Elapsed Time(sec)   |
+--------------+------------------------------+--------------+------------+----------+--------------+------------+----------------------+
|    query3    |  query3result1698674602714   |     1GB      |  Calcite   |  Failed  |              |            |                      |
+--------------+------------------------------+--------------+------------+----------+--------------+------------+----------------------+

command-line:

$ ./gradlew :sdks:java:testing:tpcds:run -Ptpcds.runner=":runners:spark:3" -Ptpcds.args=" \
  --runner=SparkRunner \
  --queries=3 \
  --tpcParallel=1 \
  --dataDirectory=gs://beam-tpcds/datasets/parquet/nonpartitioned \
  --dataSize=1GB \
  --sourceType=PARQUET \
  --resultsDirectory=/tmp/tpc-ds-results/"

And I got the NPE:

Caused by: java.lang.NullPointerException
        at org.apache.beam.vendor.guava.v32_1_2_jre.com.google.common.base.Preconditions.checkNotNull(Preconditions.java:903)
        at org.apache.beam.sdk.util.WindowedValue$TimestampedWindowedValue.<init>(WindowedValue.java:312)
        at org.apache.beam.sdk.util.WindowedValue$TimestampedValueInGlobalWindow.<init>(WindowedValue.java:329)
        at org.apache.beam.sdk.util.WindowedValue.of(WindowedValue.java:95)

aromanenko-dev commented:

I created an issue for that: #29198

je-ik added a commit that referenced this issue Oct 31, 2023
github-actions bot added this to the 2.52.0 Release milestone Oct 31, 2023