
Qlik Design Blog

All about product and Qlik solutions: scripting, data modeling, visual design, extensions, best practices, etc.

Employee


Greetings Qlik Community. I am pleased to introduce our newest guest blogger, David Freriks. David is a Technology Evangelist on the Innovation and Design team at Qlik. He has been working in the "big data" space for over three years, starting with Hadoop and moving on to Spark in this continuously evolving ecosystem. He has 18+ years in the BI space, helping launch new products to market. David is here today to discuss a few approaches to how Qlik can address... "Big Data".

"Big Data"

The term "big data" has been thrown around for several years, and yet it continues to have a very vague definition. In fact, no two big data installations and configurations are alike (insert snowflake paradigm here). It's no surprise that, given its unique nature, "big data" cannot be forced into an abstract model. These types of data systems evolve organically, morphing with ever-changing business requirements.


If we accept that no two big data systems are alike, how can one deliver analytics from those systems with a singular approach?


Well, we can’t – in fact it would be quite limiting to do so.  Why?


Picking one and only one method of analysis prevents the basic question “What problem is the business user trying to solve?” from being answered. So what do I mean by “picking one version of analysis”? 


The market breaks it down into the following narrow paths:

  • Simple SQL on Hadoop/Spark/etc.
  • Some form of SQL caching on Hadoop/Spark/etc.
  • ETL into a database, then analysis

These solutions have their place, but to pick only one greatly limits a user’s ability to succeed, especially when the limits of each solution are reached.


So how does Qlik differentiate itself from the narrow approaches and tools that exist in the market?


Simple answer: variety. Qlik is in a unique position, offering a set of techniques and strategies that cover the widest range of capabilities within a big data ecosystem.


Below are some of the approaches Qlik offers the big data community:


  • In-Memory Analytics: Get the data you need and accelerate it, which makes a great solution for concepts such as data lakes. Qlik creates a "Synch and Drink" strategy for big data: fast and powerful, but it does not retrieve all the data, which may be fine given the requirements. Think of it as a water tower for your data lake. Do you really need one petabyte of log data, or just the errors and anomalies from the last 30 days? (See the first sketch after this list.)

  • Direct/Live Query: Sometimes you do need all the data, or a set too large to realistically fit into memory, or latency is a concern; in those cases, use Qlik in live query mode. The catch with this strategy is that you are completely dependent on the source system for speed. This scenario works best when an accelerator (Teradata, Jethro, atScale, Impala, etc.) is used as a performance booster. Qlik enables this scenario through our Direct Discovery capability. (See the second sketch after this list.)

  • On-Demand App Generation: This is a "shopping cart" approach that lets users select from a cart of content curated from the big data system. By guiding users to make selections, this technique reduces the raw volume of data returned from the system to just what they need. It also allows IT to place controls, security, and limiters in front of those choices, so that mistakes (such as trying to return all records from a multi-petabyte system) can be avoided.

  • API - App on Demand: This is an API evolution of the shopping cart method above, embedded within a process or environment of another interface or mashup. This technique allows Qlik apps to be created temporarily (i.e., as session apps) or permanently, based on inputs from another starting point. It is an ideal solution for big data partners or OEMs who would like to build Qlik integration directly into their tools.
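
To make the first two approaches concrete, here are two minimal load script sketches. The connection name, tables, and fields are illustrative assumptions, not references to any real deployment.

First, a "water tower" extract for the in-memory approach: pull only the errors and anomalies from the last 30 days of a hypothetical Hive log table into the QIX engine, letting the source do the filtering:

    // Sketch only: assumes a Hive/ODBC connection named 'HadoopLake',
    // a hypothetical table logs.app_events, and Hive-style date functions
    LIB CONNECT TO 'HadoopLake';

    ErrorEvents:
    LOAD *;
    SQL SELECT event_time, host, error_code, message
    FROM logs.app_events
    WHERE severity = 'ERROR'
      AND event_date >= date_sub(current_date(), 30);

Second, a Direct Discovery sketch for live query mode. DIMENSION fields have their distinct values loaded into memory, while MEASURE fields stay in the source system and are aggregated there on demand:

    // Sketch only: table and field names are hypothetical
    LIB CONNECT TO 'HadoopLake';

    DIRECT QUERY
        DIMENSION product_id, region, sale_date
        MEASURE sales_amount, quantity
        FROM sales.fact_sales;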


In summary, to avoid limited interactions with whatever "big data" system you use, you need options. Qlik is uniquely positioned in this area thanks to the power of the QIX engine and our ELT + Acceleration + Visualization three-in-one architecture. Since no two big data systems are alike, Qlik offers the most flexible set of solutions in the market, able to adapt to any data scenario, big or small.


Regards,


David Freriks

Emerging Technology Evangelist

Follow me: David Freriks (@dlfreriks) | Twitter


12 Comments
Contributor III

David, a nice and brief summary, but you forgot to mention one important use case for Hadoop implementations: realtime/streaming analytics. This is where Tableau's LIVE mode wins over Qlik's Direct Query (which has a very long list of limitations). The API option is not really feasible for your regular customer.

I think most of the Qlik customers I spoke with at Qonnections use Hadoop to offload heavy ETL (Hive, Impala, etc.), which is really not a Qlik function or a feature that Qlik can take credit for.

Employee

Interesting question. As you know, traditional Hadoop (HDFS + MapReduce) is a batch system and completely useless for realtime and streaming analytics of any kind (regardless of which query tool you use). Only when you enter the world of Spark Streaming does this become a possibility. However, most companies use Spark as a processing engine, not a data repository, which complicates things even more...

Hive/Impala are not ETL tools; they are SQL constructs built on schemas over Hadoop. They do have transformation capabilities (HiveQL), which Qlik natively supports (not all vendors support all functions; the one you mentioned above struggles with arrays and maps, for example). But Qlik can add even more power on top of the processed data with our LOAD capability, applied after the SELECT SQL statements that run natively against Hadoop.
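
As a small illustration of that pattern (table and field names are hypothetical), the SELECT below executes natively on Hadoop, while the preceding LOAD runs in the QIX engine and layers Qlik transformations on top of the returned rows:

    Events:
    LOAD
        event_id,
        Upper(severity) AS Severity,                     // transformed in Qlik, not in Hive
        If(Upper(severity) = 'ERROR', 1, 0) AS IsError;  // derived flag computed in Qlik
    SQL SELECT event_id, severity
    FROM logs.web_events;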

Again, it's all about the use case - Qlik offers more flexibility than just being a simple query tool.

Contributor III

Thanks for the reply, David. I think the bigger issue with Qlik is that you cannot load entire tables from Hadoop (if the project requires it). Of course, one can argue that you can aggregate/filter things before they get to Qlik, but this is not always possible. Direct Query can help with some use cases, but the feature is very limiting. In fact, I know one of Qlik's large clients pushed for that feature because they wanted to load very large tables into Qlik and failed to do it. These are the limits of an in-memory technology that is not distributed.

I had discussions about this with Qlik at HIMSS and at Qonnections, and many Qlik customers agreed this is going to be a problem very soon; your customers will start seeking other tools that can work with entire tables, not just aggregated versions of them.

Hopefully Qlik will come up with something sooner rather than later, because Hadoop is not going anywhere.

Valued Contributor III

Hi Boris

I agree with you. When I create transaction table reports, I often run into this issue: it runs out of memory, so I use an action button to limit the data load.

You mentioned that this is a limitation of Qlik's in-memory technology. Since you know Tableau, may I ask: does Tableau have an in-memory feature?

Paul

Contributor III

Hi Paul, first off, I am not affiliated with Qlik or Tableau. We are Qlik customers, and I am a very big fan of QlikView; we got tremendous value out of it, and after 3 years we still have a lot of ideas and value to get out of it.

Now that my company is going to embark on a big data journey, I keep asking myself whether Qlik would be a good BI tool, or whether we would need something designed to work with Hadoop (e.g., Datameer) or a tool that can use data from Hadoop in a realtime fashion (old-school tools that generate queries against the source systems while the user interacts with the BI app).

While I am not a big fan of Tableau, it does have two modes: LIVE mode and offline. Offline mode is similar to Qlik's in-memory engine: you preload data into a proprietary file (like a QVW), and it then gets loaded into RAM. It is not as fast as Qlik's engine, and Tableau is not as flexible and powerful as QlikView, but it is much nicer and friendlier than QlikView. Qlik launched Sense to compete with Tableau in that space.

LIVE mode in Tableau is basically a combination of an old-school query generator and a highly optimized caching engine. Of course, it is much slower than offline mode, but Tableau still does a great job with its caching engine. While it is much slower, you can put it on top of a high-performance MPP database, an in-memory database, or Hadoop (via a fast SQL-on-Hadoop engine like Impala or Spark SQL), and this is where Qlik fails to deliver; you have to use some of the cumbersome workarounds outlined by David. If you can load your entire data set from Hadoop, or aggregate or process it in advance before it gets to Qlik, you will be fine; but in our case we might be talking about terabytes of data on which we need to calculate metrics on the fly, so that would not work.

A very interesting discussion, though, and I am sure Qlik is already thinking about how to get a foothold in the big data space.

Employee

So, an interesting conversation. LIVE mode, whether in Qlik, Tableau, etc., is completely dependent on the source system for performance. Caching (which Qlik also has in live mode) is only good for the selections and data already returned into the charts on the page. Make a new selection, and it's back to the source system...


Very few companies go direct to native Hadoop because of performance. Qlik users who expect sub-second response aren't happy when they enter the live query world of waiting 5, 10, 60+ minutes for a report to return (Tableau users may be used to this level of performance). That's why you need options.

It's fair to say "I can't load an entire Hadoop table into memory", but how does one visualize 1 billion records of textual log data? It all comes back to the use case. What is the data you need, and how are you going to show it in order to tell a story and build a narrative? Data for the sake of data isn't helpful.

By the way, please watch this video of Qlik Sense running on one TRILLION records.

https://www.youtube.com/watch?v=ZnMDeg8V2sg

Contributor III

Agreed. Most companies I've talked to use Hadoop to do batch ETL, then filter/pre-aggregate and load the data into their enterprise DW, or load prepped tables into Hive/Tez or Impala and then use SQL to pull data into the BI tool of their choice. A lot of them talk about doing this in realtime, but few actually do. Still, this market will only grow; again, the example is Datameer, which got a lot of funding. Tableau can do it too (yes, it will be slow). Also, a lot of the effort in Hadoop over the past few years has gone into sub-second response times (Hive LLAP, HAWQ, improvements to Impala), so it keeps getting faster, and on-the-fly calculations over TBs of data are now a reality.

Thanks for the video, but here is the catch: they use very simple expressions like SUM or COUNT. I actually did similar demos internally to brag about QlikView performance: I had one billion rows loaded into a QVW, opened it on our HP server with 1 TB of RAM, and it was super fast. The secret: simple expressions and a super simple data model. In real life, projects are a bit more complicated than that. Direct Query takes it to the next level with fast backend systems like Teradata or Vertica, but again, this feature comes with a very long list of limitations. The big one for me was the lack of set analysis support.
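
For readers who haven't used it, set analysis is Qlik's expression syntax for evaluating a measure over a modified selection state. A typical chart expression looks like the sketch below (field names are illustrative), and expressions of this form were exactly what a Direct Discovery measure could not evaluate at the time:

    // Sum of sales for 2017, regardless of the user's current year selection
    Sum({<OrderYear = {2017}>} sales_amount)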

Thanks again for a productive discussion! Go Qlik!!

Employee

Great discussion. Boris, you are correct that Direct Discovery does not support set analysis; however, we have a Technology Partner that does. Jethro is one of the fastest SQL-on-Hadoop solutions on the market today, and it has recently introduced Qlik set analysis support via Direct Discovery. Jethro now allows Qlik set analysis syntax to be passed directly through the query, and it handles the calculations itself. We have live Direct Discovery demos hosted at http://jethrodata.qlik.com/ that show the power of Qlik on Jethro + Hadoop.

Employee

Right, Hugo. Here's a video of set analysis for Direct Discovery in action with Jethro.

https://www.youtube.com/watch?v=leJSWeFaWX8

Contributor III

Yep, this is very impressive indeed; I saw a demo at Qonnections.


So when will Qlik (or should I say Thoma Bravo) acquire Jethro?

Not applicable

Glad I found this thread – a very informative discussion.

Being a technology vendor that focuses on this exact area, I wanted to share some of our observations. I hope my unavoidable vendor-biased perspective will be balanced out by some useful info :)

Most people who use BI on large datasets in Hadoop take the approach of a selective extract (into a QVD or TDE) and load it into memory. The discussion here, however, is about what to do when the extract-and-load method is not a practical option: the extract itself is still too big, the lag time is too high, or other reasons. In such cases, live access to the data at its source (Direct Discovery, Live Connect) is the preferable approach.

Indeed, Tableau's Live Connect is the more mature interface, but Qlik's Direct Discovery can be made to overcome its key limitations and provide similar functionality. Specifically, the integration work done by Jethro and Qlik addresses the following issues:

  • Ability to map a complex data model (e.g., a star schema) into Direct Discovery (see the sketch below)
  • Enabling in-DB functionality that emulates Qlik's set analysis functionality and syntax
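
One common way to realize the first point, sketched here with assumed names: flatten the star schema into a view on the SQL-on-Hadoop side, then point Direct Discovery at that single view. The DDL can be passed straight through from the load script, though depending on the driver it may need to be run outside Qlik:

    LIB CONNECT TO 'HadoopLake';

    // Pass-through DDL: pre-join a hypothetical star schema into one queryable view
    SQL CREATE VIEW sales.v_sales_star AS
    SELECT f.sale_date, f.sales_amount, f.quantity, s.region, p.product_name
    FROM sales.fact_sales f
    JOIN sales.dim_store s ON f.store_id = s.store_id
    JOIN sales.dim_product p ON f.product_id = p.product_id;

    DIRECT QUERY
        DIMENSION region, product_name, sale_date
        MEASURE sales_amount, quantity
        FROM sales.v_sales_star;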

A great video by Dave demonstrating this solution: https://www.youtube.com/watch?v=leJSWeFaWX8

Once you are able to connect your BI tool live to your datasets in Hadoop, you're likely to encounter the next challenge: the performance of SQL-on-Hadoop tools (e.g., Hive) is usually too slow for any acceptable interaction. As noted earlier, there is constant progress being made in this area, and the latest versions of Impala and HAWQ are significantly faster than earlier releases. Still, the architecture used by all of these tools is nearly identical: they are all MPP / full-scan engines. And while this architecture was effective with high-end appliances (e.g., Teradata), it is less efficient with the switch to off-the-shelf hardware, especially when combined with much larger dataset sizes and more complex workloads.

At Jethro, we address this issue with a different architecture: full indexing. By pre-indexing all the columns, Jethro can serve typical BI queries much faster than the full-scan engines, and with significantly fewer cluster resources.


We have a live demo of Jethro + Qlik available at: Qlik Sense.


As for the future of Jethro, we're enjoying the singles' dating scene :)

Valued Contributor III

Hi David

I like your argument: why would one need to load very big data into one table, and how would you analyze it?

Paul

