Dataflow pipelines in GCP are 2x to 3x slower with Apache Beam Java SDK ver 2.50 #28765
Comments
These metrics are averaged per minute. This small amount of data is not sufficient for comparing efficiency; it usually takes ~10 min of steady throughput and multiple runs to tell the difference.
Thanks @Abacn, appreciate your quick reply. We will test with a larger BQ table (probably 5k to 10k records) and compare the results in both versions.

@Abacn I assume you are referring to the throughput metrics when you mention that metrics are averaged per minute, but what about the execution time? The Dataflow pipeline runs for over 1 min 15 sec with SDK version 2.50, as opposed to a 30-second run with SDK version 2.44.
We are seeing Dataflow pipelines take 2x to 3x longer to run with Apache Beam SDK ver 2.50 than with ver 2.44. As part of troubleshooting we compared the DAGs for 2.44 and 2.50. The BQ read-from-table step (a full table scan using DIRECT_TABLE_ACCESS) takes 3 sec to read 19 records / 13 KB in 2.44, while the exact same pipeline with the same 19 records and 13 KB takes 1 min 5 sec in 2.50. Has this API degraded in ver 2.50? The throughput for this DAG step is also much higher in 2.44 than in 2.50. Please find the throughput graphs (elements/sec) for both versions below:
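For reference, a minimal sketch of the pipeline step being compared: a BigQuery read using the Storage Read API (`Method.DIRECT_READ`, which surfaces as DIRECT_TABLE_ACCESS in the Dataflow DAG). The project, dataset, and table names are placeholders, and the surrounding pipeline here is assumed, not taken from the report; only the `BigQueryIO.readTableRows().withMethod(DIRECT_READ)` shape is what the issue describes.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class BqDirectReadRepro {
  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline p = Pipeline.create(options);

    // The step under comparison: a full-table read via the BigQuery
    // Storage Read API (DIRECT_TABLE_ACCESS in the Dataflow DAG).
    p.apply("ReadFromBQ",
        BigQueryIO.readTableRows()
            .from("my-project:my_dataset.my_table") // placeholder table
            .withMethod(BigQueryIO.TypedRead.Method.DIRECT_READ));

    p.run().waitUntilFinish();
  }
}
```

Running this same class against a Dataflow project with the Beam BOM pinned to 2.44.0 and then 2.50.0 (changing only the SDK version in the build file) would isolate whether the read step itself is the source of the slowdown.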
Throughput in ver 2.44 --> 0.15 elements/sec (high)
Throughput in ver 2.50 --> 0.083 elements/sec (low)