How do you handle "Big Data"? - Qlik Community

Michael_Tarallo · ‎2016-08-02

Greetings Qlik Community. I am pleased to introduce you to our newest Guest Blogger, David Freriks. David is a Technology Evangelist on the Innovation and Design team at Qlik. He has been working in the "big data" space for over three years, starting with Hadoop and moving on to Spark in this continuously evolving ecosystem. He has 18+ years in the BI space, helping launch new products to market. David is here today to discuss a few approaches on how Qlik can address "Big Data".

"Big Data"

The term "Big Data" has been thrown around for several years, and yet it continues to have a very vague definition. In fact, no two big data installations and configurations are alike – insert snowflake paradigm here. It’s no surprise that, given the unique nature of “big data”, it cannot be forced into an abstract model. These types of data systems evolve organically, and morph based on ever-changing business requirements.

If we accept that no two big data systems are alike, how can one deliver analytics from those systems with a singular approach?

Well, we can’t – in fact it would be quite limiting to do so. Why?

Picking one and only one method of analysis prevents the basic question “What problem is the business user trying to solve?” from being answered. So what do I mean by “picking one version of analysis”?

The market breaks it down into the following narrow paths:

Simple SQL on Hadoop/Spark/etc.
Some form of caching of SQL on Hadoop/Spark/etc
ETL into database then analysis

These solutions have their place, but to pick only one greatly limits a user’s ability to succeed, especially when the limits of each solution are reached.

So how does Qlik differentiate itself from the narrow approaches and tools that exist in the market?

Simple answer: variety. Qlik is in a unique position, offering a set of techniques and strategies that allow the widest range of capabilities within a big data ecosystem.

Below are some of the approaches Qlik offers the big data community:

In-Memory Analytics: Get the data you need and accelerate it, which provides a great solution for concepts such as data lakes. Qlik creates a “Synch and Drink” strategy for big data. Fast and powerful, but does not retrieve all the data, which might be ok given the requirements. Think of it as a water tower for your data lake. Do you really need 1 petabyte of log data, or maybe just the errors and anomalies over the last 30 days?

Direct/Live Query: Sometimes you do need all the data, or a large set that isn’t realistic to fit into memory, or latency is a concern – then use Qlik in live query mode. The catch with this strategy is that you are completely dependent on the source system to provide speed. This scenario works best when an accelerator (Teradata, Jethro, AtScale, Impala, etc.) is used as a performance booster. Qlik uses our Direct Discovery capability to enable this scenario.

On-Demand App Generation: This is a “shopping cart” approach that allows users to select from a cart of content curated from the big data system. By guiding users to make selections, this technique reduces the raw volume of data being returned from a system to just what they need. It also allows IT to place controls, security, and limiters in front of those choices so mistakes (such as trying to return all records from a multi-petabyte system) can be avoided.

API - App on Demand: This is an API evolution of the shopping cart method above, but embedded within a process or environment of another interface or mashup. This technique allows Qlik apps to be created temporarily (i.e. as a session app) or permanently, based on the inputs from another starting point. This is an ideal solution for big data partners or OEMs who would like to build Qlik integration directly into their tool.
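To make the “shopping cart” idea a bit more concrete, here is a minimal Python sketch of the kind of limiter IT could place in front of user selections. It is not Qlik's implementation: the names `Selection`, `build_query`, `ALLOWED_FIELDS`, and `MAX_ROWS` are hypothetical, and the row estimate is assumed to come from somewhere else (for example, table statistics).

```python
from dataclasses import dataclass

# Hypothetical controls that IT places in front of the user's choices.
MAX_ROWS = 1_000_000
ALLOWED_FIELDS = {"level", "region", "log_date"}


@dataclass(frozen=True)
class Selection:
    """One choice from the curated cart, e.g. level = 'ERROR'."""
    field: str
    value: str


def build_query(table: str, selections: list[Selection], estimated_rows: int) -> tuple[str, list[str]]:
    """Return parameterized SQL for the selections, or refuse the request."""
    if not selections:
        # No selections would mean "return everything", which is the mistake to avoid.
        raise ValueError("Make at least one selection before generating an app")
    for s in selections:
        if s.field not in ALLOWED_FIELDS:
            raise ValueError(f"Field not in the curated cart: {s.field}")
    if estimated_rows > MAX_ROWS:
        raise ValueError(f"Selections would return about {estimated_rows} rows; limit is {MAX_ROWS}")
    where = " AND ".join(f"{s.field} = ?" for s in selections)
    params = [s.value for s in selections]
    return f"SELECT * FROM {table} WHERE {where}", params
```

A request for only the error rows passes and yields a filtered query, while a request estimated at billions of rows is rejected before it ever reaches the source system.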

In summary, to avoid being limited in how you interact with whatever “big data” system you use, you need options. Qlik is uniquely positioned in this area due to the power of the QIX engine and our ELT + Acceleration + Visualization three-in-one architecture. Since no two big data systems are alike, Qlik offers the most flexible set of solutions in the market to adapt to any data scenario, big or small.

Regards,

David Freriks

Emerging Technology Evangelist

Follow me: David Freriks (@dlfreriks) | Twitter

Anonymous · ‎2016-08-02

David, nice and brief summary, but you forgot to mention one important use case for Hadoop implementations - realtime/streaming analytics, and this is where Tableau's LIVE mode wins over Qlik's Direct Query (which has a very long list of limitations). The API option is not really feasible for your regular customer.

I think most of the Qlik customers I spoke with at Qonnections use Hadoop to offload heavy ETL (Hive, Impala, etc.), which is really not a Qlik function or a feature that Qlik can take credit for.

DavidFreriks · ‎2016-08-02

Interesting question. As you know, traditional Hadoop (HDFS+MapReduce) is a batch system and completely useless for realtime and streaming analytics of any kind (regardless of the query tool you use). Not until you enter the world of Spark Streaming does this become a possibility. However, most companies use Spark as a processing engine, not a data repository, so that complicates things even more....

Hive/Impala are not ETL tools; they are SQL constructs built on schemas on Hadoop. They do have transformation capabilities (HiveQL), which Qlik natively supports (not all vendors support all functions - the one you mentioned above struggles with arrays and maps, for example). But the power of Qlik is to add even more processing on top of that data with our LOAD capability, applied after the SELECT SQL statements that run natively against Hadoop.
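As a rough illustration of the "SELECT in the source, then LOAD on the result" pattern, here is a small Python sketch. It is only an analogy, not Qlik script: `sqlite3` stands in for the Hadoop SQL layer so the example runs anywhere, and the `web_logs` table and its columns are made up.

```python
import sqlite3

# Stand-in for the Hadoop SQL layer (Hive/Impala); sqlite3 is an assumption for portability.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE web_logs (ts TEXT, level TEXT, bytes INTEGER)")
conn.executemany(
    "INSERT INTO web_logs VALUES (?, ?, ?)",
    [
        ("2016-08-01 10:00:00", "ERROR", 2048),
        ("2016-08-01 10:05:00", "INFO", 512),
        ("2016-08-02 09:30:00", "ERROR", 4096),
    ],
)

# Step 1 (the SELECT): filtering runs natively in the source system.
rows = conn.execute(
    "SELECT ts, level, bytes FROM web_logs WHERE level = 'ERROR' ORDER BY ts"
).fetchall()

# Step 2 (the LOAD-style step): derive extra fields on the rows that came back.
loaded = [
    {"date": ts[:10], "level": level, "kilobytes": size / 1024}
    for ts, level, size in rows
]
```

The point of the split is that the heavy filtering stays where the data lives, and only the reduced result is reshaped on the client side.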

Again, it's all about the use case - Qlik offers more flexibility than just being a simple query tool.

Anonymous · ‎2016-08-03

Thanks for the reply, David. I think the bigger issue with Qlik is that you cannot load entire tables from Hadoop (if the project requires it). Of course one can argue you can aggregate/filter things before they get to Qlik, but this is not always possible. Direct Query can help with some use cases, but this feature is very limiting. In fact, I know one of Qlik's large clients pushed that feature since they wanted to load very large tables into Qlik, and failed to do it. This is the limit of in-memory technology that is not distributed.

I had discussions about this with Qlik and Qlik customers at HIMSS and Qonnections, and many agreed this is going to be a problem very soon and that your customers will start looking for other tools that can work with entire tables, not just aggregated versions of them.

Hopefully Qlik will come up with something sooner rather than later, because Hadoop is not going anywhere.

paulyeo11 · ‎2016-08-04

Hi Boris

I agree with you. When I create a transaction table report, I used to encounter this issue: it would run out of memory, so I use an action button to limit the data load.

You mentioned that it is a limitation of Qlik's in-memory technology. Since you know Tableau, may I ask: does Tableau have an in-memory feature?

Paul

Anonymous · ‎2016-08-04

Hi Paul, first off I am not affiliated with Qlik or Tableau. We are Qlik's customers and I am a very big fan of QlikView - we got tremendous value out of it and after 3 years we still have a lot of ideas and value to get out of it.

Now that my company is going to embark on a Big Data journey, I keep asking myself if Qlik would be a good BI tool, or whether we would have to get something that was designed to work with Hadoop (e.g. Datameer) or a tool that can use data from Hadoop in realtime fashion (old school tools that generate queries to the source systems while the user interacts with your BI app).

While I am not a big fan of Tableau, it does have two modes - LIVE mode and offline. Offline mode is similar to Qlik's in-memory engine: you preload data to a proprietary file (like QVW) and then it gets loaded to RAM. It is not as fast as Qlik's engine, and Tableau is not as flexible and powerful as QlikView, but it is much nicer and much friendlier than QlikView. Qlik launched Sense to compete with Tableau in that space.

The LIVE mode in Tableau is basically a combination of an old school query generator and a highly optimized caching engine - of course it is much slower than offline mode, but Tableau still does a great job using their caching engine. While it is much slower, you can put it on top of a high performance MPP database, an in-memory database, or Hadoop (via a fast SQL on Hadoop engine like Impala or Spark SQL), and this is where Qlik fails to deliver and you have to use some of the cumbersome workarounds outlined by David. If you can load your entire data set from Hadoop, or aggregate or process it in advance before it gets to Qlik, you will be fine, but in our case we might be talking about terabytes of data that we need to calculate metrics on the fly, so that would not work.

Very interesting discussion though, and I am sure Qlik is already thinking about how to get their foot in the Big Data space.

DavidFreriks · ‎2016-08-04

So, interesting conversation. LIVE mode, whether Qlik, Tableau, etc., is completely dependent on the source system for performance. Caching (which Qlik also has in live mode) is only good for the selections and data returned into the charts on the page. Make a selection, back to the source system....

Very few companies go direct to native Hadoop because of performance. Qlik users who expect sub-second response aren't happy when they enter the live query world of waiting 5, 10, 60+ minutes for a report to return (Tableau users may be used to this level of performance). That's why you need options.

It's fair to mention "I can't load an entire Hadoop table into memory", but how does one visualize 1 billion records of textual log data? It all comes back to the use case. What is the data you need, and how are you going to show it in order to tell a story and build a narrative? Data for the sake of data isn't helpful.
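A tiny Python sketch of why live-mode caching only helps for selections that have already been made. This is an illustration under assumptions, not how Qlik or Tableau implement their caches; `SelectionCache` and `run_query` are hypothetical names.

```python
from typing import Callable


class SelectionCache:
    """Caches query results keyed by the exact set of current selections."""

    def __init__(self, run_query: Callable[[dict[str, str]], list]):
        self._run_query = run_query  # hypothetical call into the source system
        self._results: dict[frozenset, list] = {}
        self.source_hits = 0  # how many times the source system was queried

    def get(self, selections: dict[str, str]) -> list:
        key = frozenset(selections.items())
        if key not in self._results:
            # New selection state: nothing cached, so go back to the source.
            self.source_hits += 1
            self._results[key] = self._run_query(selections)
        return self._results[key]
```

Repeating the same selection is served from the cache, but every new combination of selections is a cache miss and costs a full round trip to the source system.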

By the way --- please watch this video of Qlik Sense running on one TRILLION records.

https://www.youtube.com/watch?v=ZnMDeg8V2sg

Anonymous · ‎2016-08-04

Agreed, most companies I talked to use Hadoop to do batch ETL, then filter/pre-aggregate and load data to their enterprise DW, or to prepped tables in Hive/Tez or Impala, and then use SQL to pull data into their BI tool of choice. A lot of them talk about doing this in realtime, but few actually do. But this market will only grow - again, the example is Datameer, which got a lot of funding. Tableau can do it too (yes, it will be slow). Also, a lot of the effort in Hadoop over the past few years has been about sub-second response time (Hive LLAP, HAWQ, improvements to Impala), so it gets faster and faster, and on-the-fly calculations over TBs of data are now a reality.

Thanks for the video, but here is the catch: they use very simple expressions like SUM or COUNT - actually, I did similar demos internally to brag about QlikView performance. I had one billion rows loaded into a QVW and opened on our HP server with 1 TB of RAM, and it was super fast. The secret - simple expressions and a super simple data model. In real life, projects are a bit more complicated than that. Direct query takes it to the next level with fast backend systems like Teradata or Vertica, but again this feature comes with a very long list of limitations. The big one for me was the lack of set analysis support.

Thanks again for the productive discussion! Go Qlik!!

Hugo_Sheng · ‎2016-08-04

Great discussion. Boris, you are correct that Direct Discovery does not support set analysis. However, we have a Technology Partner that does. Jethro is one of the fastest SQL on Hadoop solutions on the market today, and they have recently introduced Qlik set analysis support via Direct Discovery. Jethro now allows Qlik set analysis syntax to be passed directly through the query, and it handles the calculations itself. We have live Direct Discovery demos hosted on http://jethrodata.qlik.com/ that show the power of Qlik on Jethro+Hadoop.

DavidFreriks · ‎2016-08-04

Right Hugo -- here's a video of set analysis for Direct Discovery in action with Jethro.

https://www.youtube.com/watch?v=leJSWeFaWX8

Anonymous · ‎2016-08-04

Yep, this is very impressive indeed. I saw a demo at Qonnections.

So when will Qlik (or should I say Thoma Bravo) acquire Jethro?