
[Bug]: BigQueryIO is very slow when using the Storage API and dynamic destinations to write data to over a thousand different tables with high data skew #32508

Open
2 of 17 tasks
ns-shua opened this issue Sep 19, 2024 · 9 comments

Comments

@ns-shua

ns-shua commented Sep 19, 2024

What happened?

I'm trying to use BigQueryIO with the Storage API as suggested, in at-least-once mode (for both the pipeline and the IO). My requirement is to write data to over a thousand tables in different projects, and the data is highly skewed: the top 10 tables take about 80% of the traffic. I observe that the pipeline becomes very slow and CPU utilization is almost always below 30%. I think it is a data-skew problem, but our data is logically partitioned that way and I have no control over it. I tried writing the same volume of data to a single table (all the tables have the same schema), and it performed very well even with 1/4 of the machines. The documentation claims DynamicDestinations should perform as well as a single destination. Is there a known performance issue, or are there any suggestions?

Here is the code I use to write to different tables:

BigQueryIO.<KV<TopicMetadata, T>>write()
        .withFormatFunction(...)
        .withoutValidation()
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
        .optimizedWrites()
        .withClustering()
        .to(
            new SerializableFunction<..>() {...} // Here I tried both SerializableFunction and DynamicDestination class
         );

This code performs much, much worse than

BigQueryIO.<KV<TopicMetadata, T>>write()
        .withFormatFunction(...)
        .withoutValidation()
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
        .optimizedWrites()
        .withClustering()
        .to(
            "project_all.example_dataset.alldata"
         );

with the same amount of data.

When writing to different tables, CPU usage is constantly below 30%; when writing to a single table, CPU usage is constantly near 100%.
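
For completeness, here is roughly how the DynamicDestinations variant is wired up. This is a simplified sketch rather than my exact code: TopicMetadata, toTableSpec(), and SHARED_SCHEMA stand in for the real key type, the table-name derivation, and the schema that all the tables share.

.to(
    new DynamicDestinations<KV<TopicMetadata, T>, String>() {
      @Override
      public String getDestination(ValueInSingleWindow<KV<TopicMetadata, T>> element) {
        // Map each element to a fully qualified table spec, e.g. "project:dataset.table".
        return toTableSpec(element.getValue().getKey());
      }

      @Override
      public TableDestination getTable(String tableSpec) {
        return new TableDestination(tableSpec, null);
      }

      @Override
      public TableSchema getSchema(String tableSpec) {
        // All destination tables share the same schema.
        return SHARED_SCHEMA;
      }
    });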

Issue Priority

Priority: 2 (default / most bugs should be filed as P2)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam YAML
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Infrastructure
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
@liferoad
Collaborator

Have you tried to profile the pipeline to figure out some potential issues?
cc @ahmedabu98
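
On Dataflow, for example, Cloud Profiler can be turned on through a service option. A minimal sketch, assuming the Dataflow runner and options parsed from the command-line args:

DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class);
// Enables Cloud Profiler for the job; equivalent to passing
// --dataflowServiceOptions=enable_google_cloud_profiler on the command line.
options.setDataflowServiceOptions(List.of("enable_google_cloud_profiler"));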

@ns-shua
Author

ns-shua commented Sep 20, 2024

@liferoad There are some upstream transforms I could improve, but they have nothing to do with the BigQuery write. The only difference in the code is writing to one table versus writing to many tables.

@liferoad
Collaborator

liferoad commented Sep 20, 2024

Added the dev list thread here: https://lists.apache.org/thread/gz5zhnworvcjog0o4g96lsqbw5tz6y03
@ns-shua Have you opened a customer support ticket for Dataflow? It would be helpful to check your Dataflow jobs.

@ahmedabu98
Contributor

@ns-shua
Author

ns-shua commented Sep 21, 2024

@ahmedabu98
I believe that if I don't use the connection pool, writing to one table won't work. So yes, I've set it to true.
@liferoad
I asked on the mailing list and also created a support ticket, but so far I've gotten no useful help or tips. They mentioned they found a hot key? I'm not sure. Can you explain how auto-sharding behaves when the data volume is highly skewed across the tables? Does it create more workers for the hot tables?
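
For reference, this is roughly how I enable the connection pool. A sketch, assuming options are parsed from the command-line args; the setting is the one on BigQueryOptions:

BigQueryOptions options =
        PipelineOptionsFactory.fromArgs(args).as(BigQueryOptions.class);
// Multiplex Storage Write API append streams over a shared connection pool.
options.setUseStorageApiConnectionPool(true);
Pipeline pipeline = Pipeline.create(options);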

@liferoad
Collaborator

What is your support ticket number? Is this streaming or batch?

@ns-shua
Author

ns-shua commented Sep 21, 2024

@liferoad Case 53209037
I'm confused by the memory dump: I see a lot of StorageApiWriteUnshardedRecords, but I have withAutoSharding() in my code.
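
For context, these are the two Storage Write API configurations I have been comparing (other settings elided as in the snippets above). My assumption, which may be wrong, is that withAutoSharding() only takes effect on the exactly-once STORAGE_WRITE_API path, not in at-least-once mode:

// What the job currently runs: at-least-once Storage Write API.
write.withMethod(BigQueryIO.Write.Method.STORAGE_API_AT_LEAST_ONCE);

// Exactly-once Storage Write API with auto-sharding, which is what I expected
// withAutoSharding() to control; a triggering frequency is required here.
write
        .withMethod(BigQueryIO.Write.Method.STORAGE_WRITE_API)
        .withTriggeringFrequency(Duration.standardSeconds(5))
        .withAutoSharding();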

@ns-shua
Author

ns-shua commented Sep 21, 2024

It is streaming, in at-least-once mode.

@liferoad
Collaborator

Can you share the latest complete code, if possible? From the ticket, it seems the job with withAutoSharding does not scale down.
