Who do the Kudu that you do? - Qlik Community

Michael_Tarallo · ‎2017-05-31

In this edition of the Qlik Design Blog, our Emerging Technology Evangelist, David Freriks is back discussing integration between Qlik and Kudu.

Navigating the analytics labyrinth with integration of Kudu, Impala, and Qlik

Using Hadoop for Big Data analytics is nothing new, but a new entity has entered the stale file format conversation with the backing of Cloudera – you might have heard of it, it’s called Kudu.

What is Kudu?

Let’s first take a step back and think about the dullest topic in the universe: file system storage formats. Flat files, Avro, Parquet, ORC, etc. have been around for a while, and all provide various advantages and strategies for optimizing data access in an HDFS construct. However, they all suffer from the same issue: static data that can only be appended to, unlike a real database.

So, enter Kudu – defined by Apache: “Kudu provides a combination of fast inserts/updates and efficient columnar scans to enable multiple real-time analytic workloads across a single storage layer.” Deconstructing that message – Kudu acts as a columnar database that allows real database operations that aren’t possible in HDFS file formats. INSERTs, UPDATEs, DELETEs, ALTERs, etc. are now available as operations on your Hadoop data. This means not just read/write capabilities for Hadoop, but also interactive operations without having to move to HBase or other systems. IoT use cases, interactive applications, write-back, and traditional data warehousing are now possible without adding layer upon layer of additional technologies.
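To make those operations concrete, here is a minimal Python sketch that assembles the kind of statements Impala accepts against a Kudu table (UPSERT, UPDATE, DELETE). The table name `sales_kudu` and the `list_price` column are hypothetical, and values are passed in as ready-made SQL text, so treat this as an illustration rather than a safe query builder.

```python
def upsert_stmt(table, row):
    """Build an UPSERT; `row` maps column names to SQL literals given as text."""
    cols = ", ".join(row)
    vals = ", ".join(row.values())
    return f"UPSERT INTO {table} ({cols}) VALUES ({vals})"


def update_stmt(table, assignments, where):
    """Build an UPDATE; `assignments` maps column names to SQL expressions."""
    sets = ", ".join(f"{col} = {expr}" for col, expr in assignments.items())
    return f"UPDATE {table} SET {sets} WHERE {where}"


def delete_stmt(table, where):
    """Build a DELETE restricted by a WHERE clause."""
    return f"DELETE FROM {table} WHERE {where}"


if __name__ == "__main__":
    # Hypothetical table and column names, for illustration only.
    print(upsert_stmt("sales_kudu", {"customer_sk": "1", "item_sk": "42", "list_price": "9.99"}))
    print(update_stmt("sales_kudu", {"list_price": "7.49"}, "customer_sk = 1 AND item_sk = 42"))
    print(delete_stmt("sales_kudu", "customer_sk = 1 AND item_sk = 42"))
```

None of these statements would work against a plain Parquet table in Impala, which is the difference the paragraph above describes.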

Now that we have a general understanding of what Kudu can do, how does this benefit Qlik? Kudu is fast, columnar, and designed for analytics – but with the ability to manipulate and transform the data to power new use cases.

Let’s start simple by showing how easy it is to move some data from an Impala table on Parquet into Kudu.

Starting in Hue we need to do some basic database-like work. To put data into a table, one needs to first create a table, so we’ll start there.

Kudu uses standard database syntax for the most part, but you’ll notice that Kudu is less specific and rigid about data types than your typical relational database – and that’s awesome. Not sure if your data is a varchar(20), or if it is smaller or larger? No worries, with Kudu – just declare it as a basic string.

Numerical data are basic as well; there are just a few types to choose from, based on the length of the number. This makes creating columns and designing a schema very straightforward and easy to set up. It also reduces data type problems when loading data.
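As a rough illustration of choosing a numeric type by the length of the number, the sketch below picks the smallest Impala integer type (TINYINT, SMALLINT, INT, or BIGINT at 8, 16, 32, and 64 bits) that can hold a given range of values. The helper name `smallest_int_type` is made up for this example.

```python
# Signed integer types available in Impala, with their width in bits.
INT_TYPES = [("TINYINT", 8), ("SMALLINT", 16), ("INT", 32), ("BIGINT", 64)]


def smallest_int_type(lowest, highest):
    """Return the narrowest integer type whose range covers [lowest, highest]."""
    for name, bits in INT_TYPES:
        type_min = -(2 ** (bits - 1))
        type_max = 2 ** (bits - 1) - 1
        if type_min <= lowest and highest <= type_max:
            return name
    raise ValueError("range does not fit in a 64-bit signed integer")


if __name__ == "__main__":
    print(smallest_int_type(0, 100))
    print(smallest_int_type(0, 3_000_000_000))
```

Text columns need no such decision, since they can all be declared as STRING.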

Having a general understanding of table creation, we will go ahead and create a table we are going to copy from Parquet. It’s worth noting there are some differences here versus creating a Parquet table in Hue.

• First: A Kudu table needs to have at least 1 primary key to be created.

• Second: A Kudu table needs a partition method to distribute those primary keys.

Referencing the schema design guide, we are going to use a HASH partition with 3 partitions (since we have 3 worker nodes).

In summary, we have a bunch of strings, a few integers, and some floating decimals to represent prices and profit. We’ve identified our keys and specified our partitions – let’s roll!
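The original statement is not reproduced here, so the following is only a sketch of what such a table definition might look like. It assumes the `PARTITION BY HASH ... PARTITIONS n` and `STORED AS KUDU` syntax of Impala 2.8 and later (older releases used a different form). Apart from `customer_sk` and `item_sk`, the table and column names are hypothetical.

```python
def kudu_create_table(table, columns, primary_key, hash_partitions):
    """Assemble a CREATE TABLE statement for an Impala table stored in Kudu.

    columns      -- list of (name, type) pairs
    primary_key  -- list of column names forming the primary key
    """
    types = dict(columns)
    missing = [key for key in primary_key if key not in types]
    if missing:
        raise ValueError(f"primary key columns not defined: {missing}")
    # Impala expects the primary key columns to come first in the column list.
    ordered = [(key, types[key]) for key in primary_key]
    ordered += [(name, typ) for name, typ in columns if name not in primary_key]
    keys = ", ".join(primary_key)
    lines = [f"CREATE TABLE {table} ("]
    lines += [f"  {name} {typ}," for name, typ in ordered]
    lines += [
        f"  PRIMARY KEY ({keys})",
        ")",
        f"PARTITION BY HASH ({keys}) PARTITIONS {hash_partitions}",
        "STORED AS KUDU",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    ddl = kudu_create_table(
        "sales_kudu",  # hypothetical table name
        [
            ("item_desc", "STRING"),  # hypothetical column
            ("customer_sk", "INT"),
            ("item_sk", "INT"),
            ("quantity", "INT"),  # hypothetical column
            ("list_price", "DOUBLE"),  # hypothetical column
            ("net_profit", "DOUBLE"),  # hypothetical column
        ],
        primary_key=["customer_sk", "item_sk"],
        hash_partitions=3,  # one per worker node, as in the text
    )
    print(ddl)
```

The printed statement can be pasted into the Impala editor in Hue.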

The query runs for a second and voilà – we have our new (albeit empty) table. Next, we need some data. We have an existing table that we would like to copy over into Kudu. We will run another query to move the data and make a little tweak on the keys to match our new table.

We had to cast our customer_sk and item_sk columns from string in Parquet to int in Kudu but that’s pretty easy to do as shown in the SQL here.
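The SQL itself is not reproduced here; a copy with casts along these lines might be assembled as in the sketch below. The table names `sales_parquet` and `sales_kudu` and the non-key columns are hypothetical, and the column order is assumed to match the target table.

```python
def copy_with_casts(target, source, columns, casts):
    """Build an INSERT ... SELECT that casts selected columns on the way over.

    columns -- column names in the order the target table expects them
    casts   -- mapping of column name to the type it should be cast to
    """
    select_list = [
        f"CAST({col} AS {casts[col]}) AS {col}" if col in casts else col
        for col in columns
    ]
    return (
        f"INSERT INTO {target}\n"
        f"SELECT {', '.join(select_list)}\n"
        f"FROM {source}"
    )


if __name__ == "__main__":
    print(
        copy_with_casts(
            "sales_kudu",  # hypothetical Kudu table
            "sales_parquet",  # hypothetical Parquet table
            ["customer_sk", "item_sk", "item_desc", "list_price"],
            {"customer_sk": "INT", "item_sk": "INT"},
        )
    )
```

Only the two key columns are cast; everything else is copied through unchanged.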

We run the INSERT query and now we have our data moved over into Kudu, and even better – that table is now immediately available to query using Impala!

Enter Qlik

With the data loaded into Kudu and exposed via Impala – we can now connect to it with Qlik and start building visualizations.

Using the latest Cloudera Impala drivers, we start the process of building a Qlik app by connecting to our new data set.

Opening Qlik Sense, we will create a new connection to our cluster and select our new table.

Once we have the table and columns selected, we can modify the load script created by the data manager to query Kudu directly (versus loading the data into memory) and take advantage of the speed and power of Impala on Kudu. We do this using Direct Discovery (note the DIRECT QUERY syntax). The change is a slight alteration of the load statement that identifies which columns are dimensions and which are measures.
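For readers who have not seen a Direct Discovery load script, the sketch below generates one in the general `DIRECT QUERY ... DIMENSION ... MEASURE ... FROM` shape. The connection name, table name, and all columns other than `customer_sk` and `item_sk` are hypothetical; the actual script from the demo is not reproduced here.

```python
def direct_query_script(connection, table, dimensions, measures, details=()):
    """Assemble a Qlik load script fragment that uses Direct Discovery."""
    lines = [
        f"LIB CONNECT TO '{connection}';",
        "",
        "DIRECT QUERY",
        "  DIMENSION " + ", ".join(dimensions),
        "  MEASURE " + ", ".join(measures),
    ]
    if details:
        # DETAIL fields are shown on demand but are not part of the associations.
        lines.append("  DETAIL " + ", ".join(details))
    lines.append(f"FROM {table};")
    return "\n".join(lines)


if __name__ == "__main__":
    print(
        direct_query_script(
            "Impala_Kudu",  # hypothetical connection name
            "default.sales_kudu",  # hypothetical table
            dimensions=["customer_sk", "item_sk"],
            measures=["list_price", "net_profit"],  # hypothetical columns
            details=["item_desc"],  # hypothetical column
        )
    )
```

Dimension values are loaded into memory for selections, while measures stay in the source and are aggregated there when a chart needs them.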

We now have live queries running against Kudu data sets through Impala.

The great part about Kudu is that we’re just getting started with the possibilities of how we can leverage the technology with Qlik. Some things we’re cooking up for the not-too-distant future involve write-back with Kafka and Qlik Server Side Extension integration – so stay tuned.

Please visit cloudera.qlik.com for more demos and to see the Kudu demo in action.

Regards,

David Freriks (@dlfreriks) | Twitter
Emerging Technology Evangelist

Anonymous · ‎2017-05-31

It is nice, but the limitations of the Direct Discovery feature were not mentioned (see help). The biggest one is the lack of set analysis support, and I have yet to create an app without set analysis. So until this is properly supported, it is just a marketing response / workaround from Qlik rather than a good solution. Tableau, for example, has a live mode which queries the source data store on the fly and supports ALL the features, not a subset as with Direct Discovery.

Report Inappropriate Content · ‎2017-05-31

Hi Boris,

Thanks for this information, it's useful to be aware of. I haven't tried using Direct Discovery in Sense yet, but I did try to use Direct Discovery in QlikView, and it seemed to me basically as you described: pretty much a box-ticking exercise, technically possible but virtually useless for any real-life scenario. It's really a shame, as I think it's a powerful capability that is sorely missing from the Qlik stable.

Let's hope it gets some real effort and attention soon.

Regards,

Graeme

DavidFreriks · ‎2017-05-31

Well - direct discovery depends greatly on the use case... For this simple case, where the calculations are able to be passed through back to Impala - it works just fine. However, you are correct about loss of Set Analysis capabilities - but, hopefully you have watched the videos from Qonnections and seen direct discovery's successor - Associative Big Data Indexer...

kkkumar82 · ‎2017-05-31

All,

This is where Jethro helps, by passing set analysis through Direct Discovery.

Jethro for Qlik

Hope this helps

Anonymous · ‎2017-05-31

Hi David, I'd love to learn more about Associative Big Data Index - maybe you can do another blog post on that, as I cannot find any details besides that it will be awesome and scalable. I am really happy to hear that Qlik has something in the works, as in my humble opinion a BI tool that does not properly support Big Data will not survive the coming years. And I will be very sad if that happens to my favorite tool / vendor!

DavidFreriks · ‎2017-05-31

Qlik has a number of ways of working with Big Data. The top theme from Qonnections this year was "leave your data where it lives" which will be accomplished with Big Data indexer when it's out. However, until then Qlik has a number of ways of supporting "Big Data" with either in-memory applications, direct query, on-demand applications, or server side extension calculations with Python/R/Scala etc.

The options are dictated by the use case and requirements and can be used in conjunction as well.

Cheers

mangalsk · 2018-07-10

Hi,

Very nice post. I wanted to confirm which port is required.

According to the Impala driver documentation, port 21050 is used for Impala and port 10000 for Hive.

I tried connecting to Impala using port 10000. I am able to fetch Hive tables but not the Kudu table, although I can see that the Kudu table is available there.

Any help will be appreciated.

DavidFreriks · ‎2018-07-10

Impala usually runs on port 21050. Hive runs on port 10000. Hive does not support Kudu - you will have to use Impala.
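As a small sketch of that rule, the helper below picks the default port for each engine and refuses to route a Kudu query through Hive. The function name and the host name are hypothetical, and a given cluster may be configured with different ports.

```python
# Default ports mentioned above; a cluster may be configured differently.
DEFAULT_PORTS = {"impala": 21050, "hive": 10000}


def endpoint_for(engine, host, needs_kudu=False):
    """Return (host, port) for the engine, rejecting Hive when Kudu is needed."""
    engine = engine.lower()
    if engine not in DEFAULT_PORTS:
        raise ValueError(f"unknown engine: {engine}")
    if needs_kudu and engine != "impala":
        raise ValueError("Kudu tables must be queried through Impala")
    return host, DEFAULT_PORTS[engine]


if __name__ == "__main__":
    # Hypothetical host name.
    print(endpoint_for("impala", "worker1.example.com", needs_kudu=True))
```

In practice this means pointing the Impala ODBC connection at an Impala daemon on 21050 rather than at the Hive service on 10000.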

mangalsk · 2018-07-11

Thanks for the reply. I’m able to find the Kudu table but not able to load and fetch data. What could be the reason? Is it the port or the driver? I have used the Cloudera Impala ODBC driver with port 10000. I wanted to confirm: if I use port 21050, will it be resolved? Also, a different port is listed in Cloudera Manager.