
Dataflow pipelines in GCP are 2x to 3x slower with Apache Beam Java SDK 2.50 #28765

Open
halaharr opened this issue Oct 2, 2023 · 3 comments

@halaharr commented Oct 2, 2023

We are seeing Dataflow pipelines take 2x to 3x longer to run with Apache Beam SDK 2.50 compared to Apache Beam SDK 2.44. As part of troubleshooting we compared the DAGs in 2.44 and 2.50: the BQ read-from-table step (a full table scan using DIRECT_TABLE_ACCESS) takes 3 sec to read 19 records / 13 KB in 2.44, while the exact same pipeline, with the exact same 19 records / 13 KB, takes 1 min 5 sec in 2.50. Has this API degraded in 2.50? The throughput for this DAG step is also much higher in 2.44 than in 2.50. Please find the throughput graphs (elements/sec) for both versions below.

Throughput in 2.44 --> 0.15 elements/sec (high)

Throughput in 2.50 --> 0.083 elements/sec (low)

[Attached screenshots: throughput graphs apache_beam_250 and apache_beam_244]
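For context, the read step being timed is essentially the following (a minimal sketch, assuming the DIRECT_TABLE_ACCESS step in the DAG corresponds to BigQueryIO's Method.DIRECT_READ, i.e. the BigQuery Storage Read API; the table reference and transform name are hypothetical):

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.TypedRead.Method;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;

public class DirectReadRepro {
  public static void main(String[] args) {
    // Standard pipeline options from command-line args (runner, project, etc.).
    PipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().create();
    Pipeline p = Pipeline.create(options);

    // Full-table read via the BigQuery Storage Read API (DIRECT_READ):
    // the "read from table" DAG step whose timing is compared above.
    // The table reference "my-project:my_dataset.my_table" is hypothetical.
    PCollection<TableRow> rows =
        p.apply("ReadFromBQ",
            BigQueryIO.readTableRows()
                .from("my-project:my_dataset.my_table")
                .withMethod(Method.DIRECT_READ));

    p.run().waitUntilFinish();
  }
}
```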
@Abacn (Contributor) commented Oct 2, 2023

These metrics are averaged over each minute. A dataset this small is not sufficient for comparing efficiency; you usually need ~10 minutes of steady throughput, and multiple runs, to tell the difference.

@halaharr (Author) commented Oct 2, 2023

> These metrics are averaged over each minute. A dataset this small is not sufficient for comparing efficiency; you usually need ~10 minutes of steady throughput, and multiple runs, to tell the difference.

Thanks @Abacn, appreciate your quick reply. We will test with a larger BQ table (probably 5k to 10k records) and compare the results on both versions.

@halaharr (Author) commented Oct 3, 2023

> These metrics are averaged over each minute. A dataset this small is not sufficient for comparing efficiency; you usually need ~10 minutes of steady throughput, and multiple runs, to tell the difference.
>
> Thanks @Abacn, appreciate your quick reply. We will test with a larger BQ table (probably 5k to 10k records) and compare the results on both versions.

@Abacn I am assuming you are talking about the throughput metrics when you say the metrics are averaged over each minute, but what about the execution time? The Dataflow pipeline runs for over 1 min 15 sec with SDK 2.50, as opposed to a 30-second run with SDK 2.44.
