Qlik is a strategic partner for Snowflake, and we are proud to have been named Snowflake's 2020 Partner of the Year for Data Engineering. The Qlik Data Integration platform is a market leader when it comes to continuous, real-time data ingestion from a wide range of enterprise data sources, and we work closely with Snowflake to provide an aligned and optimized solution. When Snowflake introduced changes to its Cloud Services consumption billing in 2020, we realized they might have an impact on our joint customers. To get ahead of this, we embarked on an effort to gain detailed insight into the new billing model, how the use of Qlik Replicate could affect a customer's Snowflake bill, and what changes or guidance we needed to provide to help customers optimize their cost/performance balance.
Background
Snowflake has historically billed its customers for consumption of data warehouse "compute" resources, along with a fee for storage. What was not billed was the computational work involved in parsing a SQL statement, generating an execution plan, and so on. These non-warehouse computational resources are what Snowflake calls "Cloud Services".
In the fall of 2019, Snowflake announced that it would adjust this billing model and begin charging customers for Cloud Services (also sometimes called "GS" or "Global Services"). This Snowflake blog post provides some background on the changes to the billing model.
When you net things out, virtually every interaction you might have with Snowflake involves execution of a SQL statement of some sort, which in turn would result in consumption of Cloud Services resources. As Qlik Replicate can deliver very high volumes of data into Snowflake in real-time, it was important for us to understand how Replicate’s delivery of data into Snowflake would impact a customer’s Snowflake bill.
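If you want to see how Cloud Services show up in your own account, Snowflake exposes this in the SNOWFLAKE.ACCOUNT_USAGE.METERING_DAILY_HISTORY view. The rough sketch below sums daily compute and Cloud Services credits along with the adjustment Snowflake applies for Cloud Services usage that falls under the daily 10% allowance; the 30-day window is an arbitrary choice.

-- Daily compute vs. Cloud Services credits, plus the Cloud Services adjustment.
SELECT usage_date,
       SUM(credits_used_compute)              AS compute_credits,
       SUM(credits_used_cloud_services)       AS cloud_services_credits,
       SUM(credits_adjustment_cloud_services) AS cloud_services_adjustment,
       SUM(credits_billed)                    AS credits_billed
FROM   snowflake.account_usage.metering_daily_history
WHERE  usage_date >= DATEADD(day, -30, CURRENT_DATE())
GROUP BY usage_date
ORDER BY usage_date;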
Our Research
We set out to run a series of benchmark tests to help us understand how Qlik Replicate's delivery of change data into Snowflake consumes Cloud Services resources, and how Replicate's tuning parameters affect that consumption.
Our test environment was isolated: no other users were accessing the Snowflake database and warehouse used in testing. Our tests involved running a consistent, repeatable source database load containing a mixture of INSERT, UPDATE, and DELETE operations, with Qlik Replicate delivering the change data into Snowflake. Each test ran for a little more than two and a half hours and generated more than 6 million CDC records, using different combinations of Replicate tuning parameters. The load was not large by any means, but it was enough for us to draw some important conclusions.
Customer Data
As a part of our study, we also worked with several customers to examine their production environments. We looked at the configuration of all their production Qlik Replicate tasks to understand how they had Qlik Replicate configured, and we analyzed over 3 million SQL statements generated by Qlik Replicate in those environments to understand their respective Cloud Services resource consumption. Our analysis included both DML statements (Data Manipulation Language – statements that modify or otherwise access data stored in the database) and DDL statements (Data Definition Language – statements that perform administrative functions, define the structure of tables and indexes … essentially all statements that aren’t DML).
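Readers who want to do a similar breakdown on their own account can group Cloud Services credits by statement type using the SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY view. A hedged sketch follows; the warehouse name REPLICATE_WH is just a placeholder for whatever warehouse your Replicate tasks target, and the 7-day window is arbitrary.

-- Cloud Services credits by statement type for queries issued against one warehouse.
SELECT query_type,
       COUNT(*)                         AS statement_count,
       SUM(credits_used_cloud_services) AS cloud_services_credits
FROM   snowflake.account_usage.query_history
WHERE  warehouse_name = 'REPLICATE_WH'
  AND  start_time >= DATEADD(day, -7, CURRENT_TIMESTAMP())
GROUP BY query_type
ORDER BY cloud_services_credits DESC;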
Conclusions
We learned quite a bit during our testing, and our findings from the internal benchmarks were substantiated by what we learned from the customers we worked with. In every case, DML operations accounted for the bulk of Cloud Services resource consumption, and the customers consistently had Qlik Replicate configured to deliver data in near real time, leaving considerable room to tune Qlik Replicate to deliver larger batches at a reduced Cloud Services cost. Each customer will need to find its own balance between acceptable latency and cost.
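One practical way to evaluate a tuning change is to watch the target warehouse's compute versus Cloud Services credits before and after adjusting the batch settings. A minimal sketch, again assuming a placeholder warehouse name of REPLICATE_WH:

-- Hourly compute vs. Cloud Services credits for the Replicate target warehouse.
SELECT start_time                       AS hour_start,
       SUM(credits_used_compute)        AS compute_credits,
       SUM(credits_used_cloud_services) AS cloud_services_credits
FROM   snowflake.account_usage.warehouse_metering_history
WHERE  warehouse_name = 'REPLICATE_WH'
  AND  start_time >= DATEADD(day, -7, CURRENT_TIMESTAMP())
GROUP BY start_time
ORDER BY start_time;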
For details on the results of our research, see the whitepaper below.
Thanks for the article @John_Neal .
What would be the best batch tuning settings for Snowflake if a task is configured for CDC (Store Changes to __CT tables)?
What would be the minimum and maximum values to optimize credit usage on Snowflake if a warehouse is not always running and auto-suspends after 10 minutes?
@ranjithreddy Personal opinion here, but IMO there isn't a "best" setting recommendation I can make. It all depends on your environment and your tolerance for latency ...
As mentioned in the article,
If you are amenable to more latency, in your case I might configure
and then evaluate and adjust as desired from there.
As for your auto-suspend being set at 10 minutes: so long as you have the warehouse set to auto-resume as well, that setting really doesn't enter into the conversation (once again, IMO).
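For reference, both of those settings live on the warehouse itself. A hedged example (the warehouse name is a placeholder, and 600 seconds mirrors the 10-minute auto-suspend you described):

-- Keep the 10-minute auto-suspend, but make sure the warehouse resumes
-- automatically when Replicate submits its next batch.
ALTER WAREHOUSE replicate_wh SET
  AUTO_SUSPEND = 600   -- seconds
  AUTO_RESUME  = TRUE;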
Excellent analysis @John_Neal !
@ranjithreddy - as John says, the auto-suspend setting doesn't impact the Cloud Services cost. However, be aware that it could impact the overall performance of the Replicate apply processes.
Snowflake compute warehouses are not guaranteed to provide the same physical resources (VMs under the covers) on resumption. That means the compute layer must refresh its local data cache for the objects you are working with from the central storage that Snowflake leverages. So, if you have large volumes of updates occurring, the re-caching of your data in the compute (warehouse) layer may impact performance.
(I've seen some workloads where it's had an impact and others where it hasn't; once again, there are a lot of variables with respect to micro-partitions, etc.)