
Dataflow pipelines in GCP are 2x to 3x slower with Apache Beam Java SDK 2.50 #28765

Open
halaharr opened this issue Oct 2, 2023 · 3 comments

@halaharr commented Oct 2, 2023

We are seeing Dataflow pipelines take 2x to 3x longer to run with Apache Beam SDK 2.50 compared to Apache Beam SDK 2.44. As part of troubleshooting we compared the DAGs in 2.44 and 2.50: the BQ read-from-table step (a full table scan using DIRECT_TABLE_ACCESS) takes 3 sec to read 19 records / 13 KB in 2.44, while the exact same pipeline, with the exact same 19 records / 13 KB, takes 1 min 5 sec in 2.50. Has this API degraded in 2.50? The throughput for this DAG step is also much higher in 2.44 than in 2.50. Please find the throughput graphs (elements/sec) for both versions below.

Throughput in 2.44 --> 0.15 elements/sec (high)

Throughput in 2.50 --> 0.083 elements/sec (low)

[Attached screenshots: throughput graphs apache_beam_250 and apache_beam_244]
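For context, the read step being timed is essentially the following (a minimal sketch, assuming the DIRECT_TABLE_ACCESS step in the DAG corresponds to BigQueryIO's Method.DIRECT_READ, i.e. the BigQuery Storage Read API; the table reference and transform name are hypothetical):

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.TypedRead.Method;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;

public class DirectReadRepro {
  public static void main(String[] args) {
    // Standard pipeline options from command-line args (runner, project, etc.).
    PipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().create();
    Pipeline p = Pipeline.create(options);

    // Full-table read via the BigQuery Storage Read API (DIRECT_READ):
    // the "read from table" DAG step whose timing is compared above.
    // The table reference "my-project:my_dataset.my_table" is hypothetical.
    PCollection<TableRow> rows =
        p.apply("ReadFromBQ",
            BigQueryIO.readTableRows()
                .from("my-project:my_dataset.my_table")
                .withMethod(Method.DIRECT_READ));

    p.run().waitUntilFinish();
  }
}
```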
@Abacn (Contributor) commented Oct 2, 2023

These metrics are averaged over each minute. A dataset this small is not sufficient for comparing efficiency; you usually need ~10 minutes of steady throughput, and multiple runs, to tell the difference.

@halaharr (Author) commented Oct 2, 2023

> These metrics are averaged over each minute. A dataset this small is not sufficient for comparing efficiency; you usually need ~10 minutes of steady throughput, and multiple runs, to tell the difference.

Thanks @Abacn, appreciate your quick reply. We will test with a larger BQ table (probably 5k to 10k records) and compare the results on both versions.

@halaharr (Author) commented Oct 3, 2023

> These metrics are averaged over each minute. A dataset this small is not sufficient for comparing efficiency; you usually need ~10 minutes of steady throughput, and multiple runs, to tell the difference.
>
> Thanks @Abacn, appreciate your quick reply. We will test with a larger BQ table (probably 5k to 10k records) and compare the results on both versions.

@Abacn I am assuming you are talking about the throughput metrics when you say the metrics are averaged over each minute, but what about the execution time? The Dataflow pipeline runs for over 1 min 15 sec with SDK 2.50, as opposed to a 30-second run with SDK 2.44.
