[Bug]: BigQueryIO is very slow when using the Storage API and dynamic destinations to write data to over a thousand different tables with high data skew #32508
Comments
Have you tried to profile the pipeline to figure out some potential issues?
@liferoad There are some upstream transforms I could improve, but they have nothing to do with the BigQuery write. The only difference in the code is writing to one table versus writing to many tables.
Added the dev list thread here: https://lists.apache.org/thread/gz5zhnworvcjog0o4g96lsqbw5tz6y03
Can you try to enable multiplexing [1]? You can do so by setting the connection-pool option; see the sketch below.
[1] https://cloud.google.com/bigquery/docs/write-api-best-practices#connection_pool_management
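The exact option name was lost in this copy of the thread; the following is a minimal sketch, assuming Beam's `useStorageApiConnectionPool` pipeline option (on `BigQueryOptions`) is the one meant — worth verifying against your SDK version:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class MultiplexedWritePipeline {
  public static void main(String[] args) {
    BigQueryOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(BigQueryOptions.class);
    // Assumption: this option shares Storage Write API gRPC connections across
    // append streams (connection pooling / multiplexing), and is only expected
    // to take effect with the at-least-once Storage API write method.
    options.setUseStorageApiConnectionPool(true);
    Pipeline pipeline = Pipeline.create(options);
    // ... attach the read and write transforms here ...
    pipeline.run();
  }
}
```

The same flag can also be passed on the command line as `--useStorageApiConnectionPool=true`.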
@ahmedabu98
What is your support ticket number? Is this streaming or batch?
@liferoad Case 53209037
It is streaming, in at-least-once mode.
Can you share the latest complete code if possible? From the ticket, it seems the job with …
@liferoad The latest complete code is a bit complicated; I can give you a simplified version of the BigQuery write part.
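The original snippet is not preserved in this copy of the thread; the following is a minimal sketch of a Storage API at-least-once write with dynamic destinations. `MyEvent`, `tableSpecFor`, `toTableRow`, and `SHARED_SCHEMA` are hypothetical names, and all tables are assumed to share one schema, as the report states:

```java
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.DynamicDestinations;
import org.apache.beam.sdk.io.gcp.bigquery.TableDestination;
import org.apache.beam.sdk.values.ValueInSingleWindow;

// PCollection<MyEvent> events = ...;
events.apply("WriteDynamic",
    BigQueryIO.<MyEvent>write()
        .to(new DynamicDestinations<MyEvent, String>() {
          @Override
          public String getDestination(ValueInSingleWindow<MyEvent> element) {
            // Route each element to its logical table, e.g.
            // "other-project:dataset.events_<tenant>".
            return tableSpecFor(element.getValue());
          }

          @Override
          public TableDestination getTable(String tableSpec) {
            return new TableDestination(tableSpec, /* tableDescription= */ null);
          }

          @Override
          public TableSchema getSchema(String tableSpec) {
            return SHARED_SCHEMA; // every destination uses the same schema
          }
        })
        .withFormatFunction(event -> event.toTableRow())
        .withMethod(BigQueryIO.Write.Method.STORAGE_API_AT_LEAST_ONCE)
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER));
```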
@liferoad One thing I'm not sure about is how we get the schema of the table. From the internal implementation it looks like it caches the table schema by TableDestination, so it will fetch the schema for each table once? If I use …
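For illustration only (this is not from the thread, and it is separate from whatever caching Beam does internally): if `getSchema` performed an expensive remote lookup, it could be memoized per destination inside the `DynamicDestinations` subclass. `fetchSchemaFromService` is a hypothetical helper:

```java
// Hypothetical user-side memoization of getSchema(): one lookup per
// destination table, regardless of how many elements target it.
private static final java.util.concurrent.ConcurrentHashMap<String, TableSchema>
    SCHEMA_CACHE = new java.util.concurrent.ConcurrentHashMap<>();

@Override
public TableSchema getSchema(String tableSpec) {
  return SCHEMA_CACHE.computeIfAbsent(tableSpec, spec -> fetchSchemaFromService(spec));
}
```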
One thing I notice that doesn't look right to me is that the memory usage keeps going up. I enabled the profiler and I saw a lot of memory used by …
@liferoad And in the diagnostic errors table I see a lot of …
@ns-shua Our engineers provided more comments through your support ticket. Let us move the discussion to the support ticket. I have also shared this issue with our team. Thanks.
@liferoad Thanks!
What happened?
I'm trying to use BigQueryIO with the Storage Write API, as suggested, in at-least-once mode (for both the pipeline and the IO). My requirement is to write data to over a thousand tables in different projects, and the data is highly skewed: the top 10 tables can take 80% of the traffic. I observe that the pipeline becomes extremely slow and CPU utilization is almost always below 30%. I think it is a data-skew problem, but our data is logically partitioned that way and I have no control over it. I tried writing the same volume of data to a single table (all the tables have the same schema), and it performed very well even with 1/4 of the machines. The documentation claims DynamicDestinations should perform as well as a single destination. Is there a performance issue, or are there any suggestions?
Here is the code I use to write to the different tables; it has the same dynamic-destinations shape as the simplified sketch in the comments above. With the same amount of data, it performs much, much worse than writing to a single table (a sketch of the single-table variant follows below).
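A hedged sketch of the single-table comparison; the table spec and helper names are placeholders, not the reporter's actual values:

```java
events.apply("WriteSingleTable",
    BigQueryIO.<MyEvent>write()
        .to("my-project:my_dataset.single_table")
        .withFormatFunction(event -> event.toTableRow())
        .withSchema(SHARED_SCHEMA)
        .withMethod(BigQueryIO.Write.Method.STORAGE_API_AT_LEAST_ONCE)
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER));
```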
When writing to different tables, CPU usage is constantly below 30%, while when writing to a single table, CPU usage is constantly near 100%.
Issue Priority
Priority: 2 (default / most bugs should be filed as P2)
Issue Components