Google is telling us that the current Qlik implementation of its BigQuery endpoint uses “micro-batching” to get data into BigQuery. While that does work, it runs afoul of a couple of quotas: the number of batches per day and the number of tables/partitions in a project.
Can you create another version of the endpoint that uses their latest streaming ingestion, the new Storage Write API? Here are a couple of references (a short code sketch follows them):
Exactly-once delivery semantics. The Storage Write API supports exactly-once semantics through the use of stream offsets. Unlike the tabledata.insertAll method, the Storage Write API never writes two messages that have the same offset within a stream, if the client provides stream offsets when appending records.
Stream-level transactions. You can write data to a stream and commit the data as a single transaction. If the commit operation fails, you can safely retry the operation.
Transactions across streams. Multiple workers can create their own streams to process data independently. When all the workers have finished, you can commit all of the streams as a transaction.
Efficient protocol. The Storage Write API is more efficient than the older insertAll method because it uses gRPC streaming rather than REST over HTTP. The Storage Write API also supports binary formats in the form of protocol buffers, which are a more efficient wire format than JSON. Write requests are asynchronous with guaranteed ordering.
Schema update detection. If the underlying table schema changes while the client is streaming, then the Storage Write API notifies the client. The client can decide whether to reconnect using the updated schema, or continue to write to the existing connection.
Lower cost. The Storage Write API has a significantly lower cost than the older insertAll streaming API. In addition, you can ingest up to 2 TB per month for free.
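To make the ask concrete, here is a minimal sketch of the pending-stream flow in the Python client (google-cloud-bigquery-storage), following Google's published sample. The change_record_pb2 module and its ChangeRecord message are hypothetical placeholders for a protobuf schema compiled to match the target table; the point is just the offset / finalize / commit flow, not a proposed endpoint implementation.

```python
"""Sketch: appending CDC rows to BigQuery with the Storage Write API."""

from google.cloud import bigquery_storage_v1
from google.cloud.bigquery_storage_v1 import types, writer
from google.protobuf import descriptor_pb2

# Placeholder: a module compiled with protoc from a .proto whose fields
# match the target table's schema.
import change_record_pb2


def append_change_records(project_id, dataset_id, table_id, rows):
    write_client = bigquery_storage_v1.BigQueryWriteClient()
    parent = write_client.table_path(project_id, dataset_id, table_id)

    # A PENDING stream buffers rows until the stream is finalized and
    # committed, giving a stream-level (all-or-nothing) transaction.
    write_stream = types.WriteStream()
    write_stream.type_ = types.WriteStream.Type.PENDING
    write_stream = write_client.create_write_stream(
        parent=parent, write_stream=write_stream
    )

    # The first request on the connection carries the protobuf schema.
    request_template = types.AppendRowsRequest()
    request_template.write_stream = write_stream.name
    proto_schema = types.ProtoSchema()
    proto_descriptor = descriptor_pb2.DescriptorProto()
    change_record_pb2.ChangeRecord.DESCRIPTOR.CopyToProto(proto_descriptor)
    proto_schema.proto_descriptor = proto_descriptor
    proto_data = types.AppendRowsRequest.ProtoData()
    proto_data.writer_schema = proto_schema
    request_template.proto_rows = proto_data

    append_rows_stream = writer.AppendRowsStream(write_client, request_template)

    # Providing an explicit offset is what gives exactly-once semantics:
    # a retried append at the same offset is never written twice.
    proto_rows = types.ProtoRows()
    for row in rows:
        proto_rows.serialized_rows.append(row.SerializeToString())
    request = types.AppendRowsRequest()
    request.offset = 0
    proto_data = types.AppendRowsRequest.ProtoData()
    proto_data.rows = proto_rows
    request.proto_rows = proto_data
    append_rows_stream.send(request).result()

    # Finalize and commit: the appended rows become visible atomically.
    append_rows_stream.close()
    write_client.finalize_write_stream(name=write_stream.name)
    commit_request = types.BatchCommitWriteStreamsRequest()
    commit_request.parent = parent
    commit_request.write_streams = [write_stream.name]
    write_client.batch_commit_write_streams(commit_request)
```

Because this goes through the Write API rather than load jobs, it is not subject to the per-day batch quota mentioned above.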
In our POC with Google, we're using Qlik Replicate to push to Kafka into two different topics: one topic stores the CDC records and the other stores schema changes. The data is in JSON format. From there we have a Dataflow job (Google's managed Apache Beam product) that reads the JSON records and inserts the data into BQ tables. In short, the purpose of the Dataflow job is to read the CDC records from Kafka (in JSON), reformat them, and append them to the BQ tables.
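For context, that Dataflow step amounts to something like the Beam sketch below; the broker, topic, and table names are placeholders rather than our actual configuration.

```python
"""Sketch of the current Dataflow step: Kafka JSON -> BigQuery append."""

import json

import apache_beam as beam
from apache_beam.io.kafka import ReadFromKafka
from apache_beam.options.pipeline_options import PipelineOptions


def run():
    options = PipelineOptions(streaming=True)
    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadCdcTopic" >> ReadFromKafka(
                consumer_config={"bootstrap.servers": "broker:9092"},
                topics=["cdc-records"],
            )
            # Kafka records arrive as (key, value) byte pairs; the value is
            # the JSON CDC record produced by Replicate.
            | "ParseJson" >> beam.Map(lambda kv: json.loads(kv[1].decode("utf-8")))
            | "AppendToBq" >> beam.io.WriteToBigQuery(
                "my-project:my_dataset.cdc_history",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
            )
        )


if __name__ == "__main__":
    run()
```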
In this ideation, we're asking whether Qlik can create a new endpoint for BQ that removes the need for Kafka and Dataflow. The process itself only involves appends (no need to actually apply insert/update/delete in BQ). Once the data is stored chronologically in BQ, my team can build a view that shows users only the latest data.
Currently the Dataflow job is also creating the view... so ideally we would like Qlik to handle the view creation as well.
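The view itself is simple: for each primary key, keep only the most recent change record. A sketch of the DDL the endpoint would ideally generate is below; all table and column names (order_id, change_ts, operation, etc.) are placeholders, and the real version would be driven by the table's key and change-timestamp columns.

```python
"""Sketch of the 'latest row per key' view over the appended CDC history."""

from google.cloud import bigquery

VIEW_SQL = """
CREATE OR REPLACE VIEW `my-project.my_dataset.orders_current` AS
SELECT * EXCEPT (rn)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY change_ts DESC) AS rn
  FROM `my-project.my_dataset.orders_cdc_history`
)
WHERE rn = 1
  AND operation != 'DELETE'
"""

client = bigquery.Client()
client.query(VIEW_SQL).result()  # run the DDL and wait for completion
```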
Hey Joe - We do in fact have this request on our radar for Replicate and our Qlik Cloud DI. It is a relatively high priority, but we have not started the work yet.
You can always ping me directly (you have my email) 🙂 and I will keep you up to date.
Wanted to update you that we did an initial evaluation of this feature with R&D and have added it to our roadmap.
It's an important item for both Replicate and Qlik Cloud Data Integration, but it's not something we will be targeting in the short term, as this is not a straightforward implementation and we are currently tied up with other engagements.