Google BigQuery CDC Partition Optimisation To Reduce Cost
Consumption charges for applying change data to BigQuery are based on the size of the underlying data: BigQuery cannot update individual rows in place, so it rewrites the entire table or partition in order to apply a change. Each mutation causes the whole underlying table/partition to be read, and consumption charges are raised accordingly. For batch applies based on MERGE statements, BigQuery does not automatically isolate which partitions are affected, so it rewrites the entire table regardless of partitioning. For a 1 TB table, this results in a roughly $5 charge for each batch of changes. Over the course of a day's replication, applying changes at 5-minute intervals can produce a monthly bill in the region of $14,000 for a single table.
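As a rough illustration of the arithmetic, the sketch below estimates the monthly apply cost from the table size, the batch interval, and BigQuery's on-demand rate of $5 per TB scanned. The number of hours per day during which changes are actually applied is an assumption; the figure quoted above depends heavily on it.

```python
def estimate_monthly_merge_cost(table_tb: float,
                                interval_minutes: int,
                                active_hours_per_day: float,
                                price_per_tb: float = 5.0,
                                days_per_month: int = 30) -> float:
    """Estimate the monthly cost of full-table MERGE applies.

    Each batch re-reads the whole table, so the per-batch cost is
    table_tb * price_per_tb regardless of how small the change set is.
    """
    batches_per_day = (active_hours_per_day * 60) / interval_minutes
    cost_per_day = batches_per_day * table_tb * price_per_tb
    return cost_per_day * days_per_month


# Hypothetical scenario: a 1 TB table, 5-minute batches, changes applied
# during an ~8-hour business day.
print(estimate_monthly_merge_cost(table_tb=1.0,
                                  interval_minutes=5,
                                  active_hours_per_day=8))
# -> 14400.0, i.e. roughly the $14,000/month figure quoted above
```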
This scenario is largely unavoidable when using SQL statements to apply changes to BigQuery. However, the underlying BigQuery APIs allow individual partitions to be targeted and replaced without rewriting the entire table, and also allow data to be streamed into tables for scenarios where historic views are desirable.
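For reference, a minimal sketch of both approaches using the google-cloud-bigquery Python client (project, dataset, table names and the partition date are placeholders): a load job with WRITE_TRUNCATE against a partition decorator replaces only that partition, and insert_rows_json streams rows into an append-only table for history-style targets.

```python
from google.cloud import bigquery

client = bigquery.Client()

# 1) Replace a single day's partition instead of rewriting the whole table.
#    The "$YYYYMMDD" decorator targets one partition of a day-partitioned table.
rows = [{"id": 1, "status": "shipped", "updated_at": "2024-01-15T10:00:00"}]
job = client.load_table_from_json(
    rows,
    "my-project.my_dataset.orders$20240115",   # hypothetical table
    job_config=bigquery.LoadJobConfig(
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    ),
)
job.result()  # only the 2024-01-15 partition is rewritten

# 2) Stream changes into an append-only table when a full change history
#    (rather than the latest row image) is the desired target shape.
errors = client.insert_rows_json(
    "my-project.my_dataset.orders_history",    # hypothetical table
    [{"id": 1, "op": "UPDATE", "status": "shipped"}],
)
assert not errors, errors
```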
Our customer sees the potential for QR (Qlik Replicate) to be improved based on the above, as the existing method of reducing this cost comes at the expense of source-to-target latency.
Thank you for the suggestion. We have items on the roadmap, including creating clustered and partitioned tables in the target and adjusting the generated SQL to take advantage of these options.
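As an illustration of what "adjusting the SQL" could look like (not necessarily how Replicate generates it), a MERGE whose join condition includes a constant filter on the partitioning column lets BigQuery prune untouched partitions instead of scanning the whole target table. Table and column names below are hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical CDC apply: the constant date predicate on the partitioning
# column (order_date) allows partition pruning, so only the touched days
# are read rather than the whole 1 TB table.
merge_sql = """
MERGE `my-project.my_dataset.orders` AS t
USING `my-project.my_dataset.orders_changes` AS s
ON  t.order_id = s.order_id
AND t.order_date BETWEEN DATE '2024-01-14' AND DATE '2024-01-15'
WHEN MATCHED THEN
  UPDATE SET status = s.status, updated_at = s.updated_at
WHEN NOT MATCHED THEN
  INSERT (order_id, order_date, status, updated_at)
  VALUES (s.order_id, s.order_date, s.status, s.updated_at)
"""

client.query(merge_sql).result()
```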
The feature is supported as of Replicate 2021.5. From the Release Notes:
Support for Google Cloud BigQuery clustered tables
A Create tables as clustered by primary key option has been added to the Advanced tab of the Google Cloud BigQuery target endpoint. When this option is selected, the target tables will be created as clustered (according to the first four Primary Key columns that support clustering). Clustered tables usually provide significantly faster query performance as well as reduced billing costs.
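Outside of Replicate, the equivalent table shape can be created directly with the BigQuery client. The sketch below is illustrative only (the schema, clustering column, and optional partitioning are assumptions, not what Replicate itself emits): a table clustered on a primary-key column, optionally combined with day partitioning for pruning.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical target table: clustered on the primary-key column so that
# lookups and MERGE joins on the key scan far less data.
table = bigquery.Table(
    "my-project.my_dataset.orders",
    schema=[
        bigquery.SchemaField("order_id", "INT64", mode="REQUIRED"),
        bigquery.SchemaField("order_date", "DATE", mode="REQUIRED"),
        bigquery.SchemaField("status", "STRING"),
        bigquery.SchemaField("updated_at", "TIMESTAMP"),
    ],
)
table.clustering_fields = ["order_id"]       # up to four columns are allowed
table.time_partitioning = bigquery.TimePartitioning(  # optional: combine with
    field="order_date",                               # partitioning for pruning
)
client.create_table(table)
```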