Between 15:43 and 23:53 (UTC) on April 20, Qlik Cloud Services had an outage. During this time, many users were unable to access Qlik Sense Enterprise for SaaS, Qlik Signup, and Digital Purchases. This post will explain what caused the outage and the changes we are making to prevent similar failures in the future.
Auth0 is an authentication and authorization platform that stores users in a database. Qlik partners with Auth0 to manage users for Qlik Sense Enterprise for SaaS, Qlik Signup, and Digital Purchases. Auth0 notified Qlik of an increase in Auth0 error rates.
Qlik quickly announced the incident on the Qlik Status Page and created a blog post in the Qlik Support Updates Blog. Teams within Qlik promptly began testing to determine the extent of the issue, and through that process determined that the issue did not affect customers using their own identity provider.
Auth0 worked on the issue continuously until they began to see performance improvements. Qlik was notified that the service was restored and began monitoring performance. Once Qlik was confident that users were no longer affected, Qlik updated the Qlik Status Page to indicate that the service had been restored.
Auth0 determined the issue was related to specific queries that created resource contention and impacted database performance. Within the next few days, Auth0 will be providing a detailed Root Cause Analysis (RCA), which we will link to this post. For more information on their timeline of the incident, please see Increased errors in Auth0.
Root Cause Analysis
This chart represents the timeline of events that occurred during the outage.
Time Since Issue Introduced
Apr 20 15:43
Auth0 reports increased error rate; engineering investigating.
Qlik notified and incident process initiated.
Qlik posts incident notice to status.qlikcloud.com.
Auth0 acknowledges their status page is inaccessible.
Qlik teams confirm the issue affects only customers who do not use their own identity provider (IdP).
Qlik updates notice to clarify impact.
Auth0 identifies database issue and acknowledges impact to entirety of environment.
Qlik updates notice that the service appears to be restored and we are monitoring.
Qlik updates status page with incident restored.
Auth0 reports that they restored the affected regions.
Auth0 reports that the issue is resolved.
We take the uptime and performance of our infrastructure seriously. We are investigating ways to minimize the impact should a similar incident occur again. We are confident that our partner, Auth0, has handled the situation effectively and is taking all necessary actions to reduce the likelihood of a recurrence.