In case you haven’t seen it, there is a super powerful unstructured search engine in the big data ecosystem called Solr. What’s great about Solr is that it can index just about anything: text, XML, JSON, PDF, Word, Excel, or pretty much any kind of text-based data. That means you can drop just about anything into Solr and have it searched by the Lucene core that powers the Solr interface.
So, where does Qlik fit in, you ask? Well, let’s observe what a Solr query output looks like:
Hmmm, not very user friendly, not to mention it was somewhat slow.
A little bit about what we’re looking at in these examples: this data is the collective set of Enron emails from the company's infamous collapse in the early 2000s. We’ve loaded this data set into our Cloudera cluster and indexed it using Solr.
Once this data was loaded and indexed we tested with a series of queries… A full query on someone with a lot of references such as Ken Lay can run upwards of 15 minutes to bring back every email that contains a reference to him.
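To make the kind of query we are talking about concrete, here is a minimal Python sketch that builds a request against Solr's standard `/select` handler. The host name and the `enron_emails` collection name are assumptions made up for illustration; only the `q`, `wt`, `rows`, and `start` parameters are standard Solr query parameters.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical host and collection name; substitute your own cluster details.
SOLR_BASE = "http://solr.example.com:8983/solr/enron_emails"


def build_solr_query(term, rows=10, start=0):
    """Return a /select URL that searches for an exact phrase and asks for JSON."""
    params = {
        "q": f'"{term}"',  # quoted, so Solr treats the name as a phrase
        "wt": "json",      # response format
        "rows": rows,      # page size
        "start": start,    # offset of the first result to return
    }
    return f"{SOLR_BASE}/select?{urlencode(params)}"


url = build_solr_query("Ken Lay", rows=100)
print(url)
```

Each call returns only one page of matches, which is part of why pulling every email for a heavily referenced name takes so long.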
Imagine tens or hundreds of users each waiting 10-15 minutes for a single question to be answered. It clearly dilutes the effectiveness of the engine as a business tool.
Qlik has a tremendously powerful REST connector that is perfectly suited for connecting to sources such as Solr. (A great video on the Qlik REST connector can be found here: https://www.youtube.com/watch?v=FqwNU_pnFt4).
Qlik In-Memory Analytics with Solr
Armed with the REST connector and a few connection parameters, we can pull the entire Enron email dataset into the Qlik engine via Solr.
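Inside Qlik the REST connector takes care of paging through the results. Purely to illustrate the pattern, the Python sketch below walks a Solr result set page by page using the `numFound` and `docs` fields of the standard Solr JSON response. The `fake_fetch` function and its 2,500 documents are stand-ins invented for the example, not real Enron data.

```python
def iter_solr_docs(fetch_page, page_size=1000):
    """Yield every document by requesting successive pages until numFound is reached."""
    start = 0
    while True:
        body = fetch_page(start, page_size)["response"]
        docs = body["docs"]
        yield from docs
        start += len(docs)
        # Stop on an empty page or once every matching document has been seen.
        if not docs or start >= body["numFound"]:
            break


# Stand-in for a real HTTP call: serves slices of an in-memory list
# shaped like a Solr JSON response.
_FAKE_INDEX = [{"id": i} for i in range(2500)]


def fake_fetch(start, rows):
    return {
        "response": {
            "numFound": len(_FAKE_INDEX),
            "start": start,
            "docs": _FAKE_INDEX[start:start + rows],
        }
    }


all_docs = list(iter_solr_docs(fake_fetch, page_size=1000))
print(len(all_docs))  # 2500
```

In a real setup `fetch_page` would issue the HTTP request and parse the JSON body; keeping it as a parameter makes the paging logic easy to check on its own.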
By pulling the entire data set, we now ensure that all users have sub-second access to all the data down to the most granular level, and thanks to our associative search technology – all the data has been indexed and correlated in-memory. We can gain further insights by incorporating stock market data. Combining Enron’s stock performance with their emails tells an interesting story of rising email volume along with collapsing stock prices and elevating trade volumes.
Using a mix of visualization techniques, we can see a pretty interesting collection of data, including the famous “deleted emails” gap on the bottom right chart.
Performing some additional analysis, we can drill in on the height of the crash that also correlates with the spike in email volume, followed by a rapid drop in volume.
Making a few more selections we can dive down into a specific name, or comment to filter down the result sets further.
This associative search allows us to dive down into the details of the “TO” elements of the data set and see the metrics affiliated with those names.
We can also jump over to the final sheet of the app and look at the body content of individual emails, filtered by the selections we made earlier in the application.
QIX API Powered Solr Search:
The above approach of using Qlik in-memory to front end the Solr search engine is just one of the many ways Qlik can access unstructured data in big data systems.
Let’s consider another application that also uses Qlik with Solr, this time with just the Qlik APIs. As a quick refresher, the Qlik engine (called QIX) is a fully API-enabled engine with tremendous extensibility that allows Qlik to plug into any web-based technology (like Solr). The esteemed Johannes Sunden adapted his awesome QlikSocial framework (https://github.com/johsund/QlikSocial) to connect to Solr on demand and build a full webapp from scratch.
We start with a search box… And our name(s) of interest:
Now, unlike the formatted Qlik Sense app, when a user hits the “search” bar everything happens dynamically, on the fly, using the APIs.
Qlik will dynamically generate a REST connection to Solr, create and load the requested data into memory, and then build a web app around the data using bootstrap.js and Angular.
The webapp is still using the Qlik engine, so selections and the search engine are still available – but all the charts and graphics are html and d3js charts – not Qlik. We’re just powering the app and the data interactivity with the QIX engine!
Solr is an extremely powerful unstructured search engine that can benefit from the speed and structure Qlik analytics can provide as a focusing lens on the core Solr search technology. That data can be consumed in a number of formats including a completely structured Qlik Sense app, or as an API powered web application without any Qlik UI components.
Navigating the analytics labyrinth with integration of Kudu, Impala, and Qlik.
Using Hadoop for Big Data analytics is nothing new, but a new entity has entered the stale file format conversation with the backing of Cloudera – you might have heard of it, it’s called Kudu.
What is Kudu? Let’s first take a step back and think about the dullest topic in the universe: file system storage formats. Flat files, AVRO, Parquet, ORC, etc. have been around for a while, and all provide various advantages and strategies for optimizing data access in an HDFS construct. However, they all suffer from the same issue: static data that can only be appended to, unlike a real database.
So, enter Kudu – defined by Apache: “Kudu provides a combination of fast inserts/updates and efficient columnar scans to enable multiple real-time analytic workloads across a single storage layer.” Deconstructing that message – Kudu acts as a columnar database that allows real database operations that aren’t possible in HDFS file formats. It is now possible to interact with your Hadoop data where INSERTS, UPDATES, DELETES, ALTERS, etc. are now available as data operations.
This means not just read/write capabilities for Hadoop, but also interactive operations without having to move to HBase or other systems. IoT use cases, interactive applications, write-back, and traditional data warehousing are now possible without adding layer upon layer of additional technologies.
Understanding what Kudu can do, how does this benefit Qlik? Kudu is fast, columnar, and designed for analytics – but with the ability to manipulate and transform the data to power new use cases.
Let’s start simple by showing how easy it is to move some data from an Impala table on Parquet into Kudu.
Starting in Hue we need to do some basic database-like work. To put data into a table, one needs to first create a table, so we’ll start there.
Kudu uses standard database syntax for the most part, but you’ll notice that Kudu is less specific and rigid about data types than your typical relational database, and that’s awesome. Not sure if your data is a varchar(20), or if it is smaller or larger? With Kudu you don’t have to care; just declare it as a basic string.
Numerics are basic as well: there are just a few types to choose from, based on the length of the number. This makes creating columns and designing a schema very straightforward and easy to set up. It also reduces data type problems when loading data in.
With the basic table creation syntax understood, we will go ahead and create the table that will receive the data we copy from Parquet. It’s worth noting that there are some differences here versus creating a Parquet table in Hue.
First: A Kudu table needs to have at least 1 primary key to be created.
Second: A Kudu table needs a partition method to distribute those primary keys
Referencing the schema design guide above, we are going to use a HASH partition and use the number 3 (since we have 3 worker nodes).
Summarizing, we have a bunch of strings, a few integers, and some floating decimals for prices and profit. We’ve identified our keys and specified our partitions – let’s roll!
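As a rough sketch of what such a statement looks like, the Python below assembles an Impala `CREATE TABLE ... STORED AS KUDU` statement with a primary key and a three-way hash partition. Only `customer_sk` and `item_sk` come from this walkthrough; the table name `store_sales_kudu` and the remaining columns are hypothetical placeholders.

```python
def kudu_create_table(table, columns, primary_key, hash_partitions):
    """Build an Impala CREATE TABLE statement for a Kudu-backed table."""
    types = dict(columns)
    # Impala expects the primary key columns to come first in the column list.
    ordered = [(name, types[name]) for name in primary_key]
    ordered += [(name, t) for name, t in columns if name not in primary_key]
    column_lines = [f"  {name} {t}," for name, t in ordered]
    keys = ", ".join(primary_key)
    return "\n".join(
        [f"CREATE TABLE {table} ("]
        + column_lines
        + [
            f"  PRIMARY KEY ({keys})",
            ")",
            f"PARTITION BY HASH ({keys}) PARTITIONS {hash_partitions}",
            "STORED AS KUDU;",
        ]
    )


# Hypothetical table and column names, except customer_sk and item_sk.
ddl = kudu_create_table(
    table="store_sales_kudu",
    columns=[
        ("store_name", "STRING"),
        ("customer_sk", "INT"),
        ("item_sk", "INT"),
        ("quantity", "INT"),
        ("sales_price", "DOUBLE"),
        ("net_profit", "DOUBLE"),
    ],
    primary_key=["customer_sk", "item_sk"],
    hash_partitions=3,  # one partition per worker node in our cluster
)
print(ddl)
```

The generated text can be pasted into the Hue Impala editor. Note how the key columns are moved to the front of the column list.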
The query runs for a second and voilà, we have our new (albeit empty) table. Next, we need some data. We have an existing table that we would like to copy over into Kudu. We will run another query to move the data and make a little tweak on the keys to match our new table.
We had to cast our customer_sk and item_sk columns from string in Parquet to int in Kudu but that’s pretty easy to do as shown in the SQL here.
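A sketch of that copy step, again generated from Python: an `INSERT INTO ... SELECT` that casts the two key columns from string to int on the way in. The source and target table names and the non-key columns are hypothetical; only the two casts are taken from the text.

```python
def kudu_insert_select(target, source, columns, casts):
    """Build an INSERT ... SELECT that casts selected columns while copying."""
    select_list = [
        f"CAST({name} AS {casts[name]})" if name in casts else name
        for name in columns
    ]
    return (
        f"INSERT INTO {target} ({', '.join(columns)})\n"
        f"SELECT {', '.join(select_list)}\n"
        f"FROM {source};"
    )


# Hypothetical table names; the key columns are strings in the Parquet table.
insert_sql = kudu_insert_select(
    target="store_sales_kudu",
    source="store_sales_parquet",
    columns=["customer_sk", "item_sk", "store_name", "sales_price"],
    casts={"customer_sk": "INT", "item_sk": "INT"},
)
print(insert_sql)
```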
We run the INSERT query and boom… We have our data moved over into Kudu, and even better – that table is now immediately available to query using Impala!
With the data loaded into Kudu and exposed via Impala – we can now connect to it with Qlik and start building visualizations.
Using the , we start the process of building a Qlik app..
Opening Qlik Sense, we will create a new connection to our cluster and select our new table.
Once we have our data – we’ll build an app to directly query Kudu (versus loading the data into memory) to take advantage of the speed and power of Impala on Kudu. This change is accomplished with a slight alteration in the syntax to identify dimensions and measures.
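We assume the syntax change referred to here is Qlik's `DIRECT QUERY` statement, which tags fields as `DIMENSION` or `MEASURE` instead of loading them. The Python sketch below only generates such a script fragment as text; the table and field names are hypothetical.

```python
def direct_query_script(table, dimensions, measures):
    """Build a Qlik DIRECT QUERY statement that leaves the data in the source."""
    return "\n".join(
        [
            "DIRECT QUERY",
            f"    DIMENSION {', '.join(dimensions)}",  # fields used for selections
            f"    MEASURE {', '.join(measures)}",      # fields aggregated by the source
            f"    FROM {table};",
        ]
    )


# Hypothetical table and field names.
script = direct_query_script(
    table="store_sales_kudu",
    dimensions=["customer_sk", "item_sk", "store_name"],
    measures=["sales_price", "net_profit"],
)
print(script)
```

With a statement of this shape, aggregations on the measure fields are pushed down to Impala rather than computed from an in-memory copy.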
We now have live queries running against Kudu datasets through Impala.
The great part about Kudu is that we’re just getting started with the possibilities of how we can leverage the technology with Qlik. Some things we’re cooking up for the not too distant future involve write-back with Kafka and Qlik Server Side Extension integration, so stay tuned.
This demo connects to Cloudera Manager via 20+ REST API calls to collect operational metrics on the performance of our Cloudera cluster. We are collecting operational stats for Hive, YARN, Spark, Kudu, Kafka, Solr, and Impala. We are also collecting detailed query metrics and performance data for Impala. This application updates every hour to refresh the latest stats.
This is a purely technical demo showing how to use one of the more advanced features of Cloudera Impala, called Complex Types. Complex types (also referred to as nested types) let you represent multiple data values within a single row/column position. They differ from the familiar column types such as BIGINT and STRING, known as scalar types or primitive types, which represent a single data value within a given row/column position.
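To picture the difference, the small Python sketch below models a row whose `orders` column holds an array of structs and flattens it into ordinary scalar rows, which is roughly what joining a table to its own nested column does in a query. The data and field names are invented for illustration and are not taken from the demo.

```python
# Each customer row carries an array of order structs in a single column.
customers = [
    {"name": "alice", "orders": [{"item": "book", "qty": 1}, {"item": "pen", "qty": 3}]},
    {"name": "bob", "orders": []},
]


def flatten(rows, array_field):
    """Emit one scalar row per array element, repeating the parent's scalar columns."""
    for row in rows:
        scalars = {k: v for k, v in row.items() if k != array_field}
        for element in row[array_field]:
            yield {**scalars, **element}


flat = list(flatten(customers, "orders"))
for record in flat:
    print(record)
```

Rows with an empty array produce no output rows here, mirroring inner-join behavior.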
In this demo, Qlik uses our direct query capability to connect to Impala and run interactive queries against a TPC-DS data set stored in Parquet format. What's unique about Qlik is that even though the data is not initially stored in memory, the associative experience is still available to the user. This capability executes queries in parallel against the Impala engine to achieve maximum performance.
We have seen the power of the Qlik APIs using Solr data; here we have created a structured application that performs the same type of Enron email analysis using Qlik native components. This demo also fuses in stock data to profile email volumes against stock price and trade volumes. It is a good example of the DAR (Dashboard, Analysis, Report) method Qlik uses to help users navigate applications and data.
This app analyzes every US Government contract for fiscal years 2011 - 2016. It includes over 18.7 million contracts with a total spend of over $2.6 trillion. Data was sourced from www.usaspending.gov and has been enriched with geospatial data for mapping capabilities. The app displays key spending metrics such as total spend, # of contracts, # of vendors, and spend over time.
Spark is one of the greatest Big Data advancements to appear on the scene since Hadoop. In this demo, Qlik leverages the power of Spark machine learning to process raw transactional data into "Market Baskets". A Market Basket is a categorization of similar things sold in conjunction with each other, i.e. if I buy Product A, then Products B, C, and E are often sold with it, but not Product D. This application merges the original Point of Sale data with the Spark machine learning processed data in-memory to analyze the Market Baskets.
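The demo itself relies on Spark machine learning, which is not reproduced here. As a much simpler stand-in that conveys the idea, the plain Python sketch below counts how often pairs of products appear in the same transaction. The transactions are made up to mirror the Product A example above.

```python
from collections import Counter
from itertools import combinations


def pair_counts(transactions):
    """Count how many transactions contain each unordered pair of products."""
    counts = Counter()
    for basket in transactions:
        # sorted(set(...)) removes repeats and gives each pair one canonical order
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts


def bought_with(counts, product):
    """Return the products seen alongside `product` and how often."""
    partners = {}
    for (left, right), n in counts.items():
        if left == product:
            partners[right] = n
        elif right == product:
            partners[left] = n
    return partners


# Invented point-of-sale transactions for illustration.
transactions = [["A", "B", "C"], ["A", "B", "E"], ["A", "C", "E"], ["B", "D"]]
counts = pair_counts(transactions)
print(bought_with(counts, "A"))  # {'B': 2, 'C': 2, 'E': 2}
```

Product D never shows up next to Product A, so it is absent from the result, just as in the description above.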
This demo is based on 20+ data sources that have been loaded into HDFS and then transformed into a pure in-memory Qlik app. The datasets loaded into Cloudera come from a variety of sources including: CDC, World Health Organization, Twitter, Flight Stats data, Weather data, Texas hospital data, and other clinic sources. This visually stunning app showcases Qlik's ability to tell a powerful story with data.
What makes Qlik different than all other analytics tools is our powerful APIs. This webapp uses Solr to query the Enron email data set on any set of topics. That data is returned in JSON which Qlik parses, loads into the Qlik Engine (called QIX) and indexes. That indexed data is then consumed via APIs through a Bootstrap.JS interface to build, from scratch, a webapp that uses D3js and other web technologies to present the data, but without using any Qlik interfaces other than our APIs. This webapp is for searching the Enron email data set...
This demo leverages over 16 million IoT sensor and maintenance readings sourced from Kafka and StreamSets to create a Qlik app that allows deep analytics on well maintenance issues. With this app, users can drill down to a very granular level to see performance issues related to real-world well production issues in Alberta, Canada.
Attunity Replicate for SAP is a high-performance, automated, and easy-to-use data replication solution that is optimized to deliver SAP application data in real time for Big Data analytics. It moves the right SAP application data easily, securely, and at scale to any major database, data warehouse, or Hadoop, on premises or in the cloud. This solution builds on decades of leadership in enterprise data replication and SAP integration.
This application demonstrates a direct load from SAP ECC into Cloudera. The data is loaded directly from SAP into HDFS and then turned into Impala tables. Qlik connects to those tables and applies complex transforms, adding business-friendly terms and time series analytics capabilities.
This demonstration showcases Qlik's ability to manage and consume large data sets in a governed environment using On-Demand Application Generation technology built into Qlik Sense. A user can browse summary information about banking trends and then drill down into the transactions "on-demand" to get the details. In this example, the user can only get the details once they have filtered the number of accounts down to under 100. The resulting app is the user's own personalized app to explore and create new content.
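The guardrail described above boils down to a simple check on the size of the current selection. In Qlik Sense this is configured in the app rather than coded by hand, so the Python below is only a sketch of the rule; the function name and the account IDs are hypothetical.

```python
MAX_ACCOUNTS = 100  # detail generation is allowed only below this count


def can_generate_detail_app(selected_accounts):
    """Allow the detail drill-down only for a non-empty selection under the limit."""
    distinct = len(set(selected_accounts))
    return 0 < distinct < MAX_ACCOUNTS


# Hypothetical account IDs standing in for the user's current selection.
print(can_generate_detail_app(range(99)))   # True
print(can_generate_detail_app(range(100)))  # False
```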
The Centers for Medicare and Medicaid Services (CMS) defines Quality Measures as “tools that help us measure or quantify healthcare processes, outcomes, patient perceptions, and organizational structure and/or systems that are associated with the ability to provide high-quality health care and/or that relate to one or more quality goals for health care. These goals include: effective, safe, efficient, patient-centered, equitable, and timely care.”
This application presents an approach that a health system may want to take to visualize its data so that it can target the right areas for improvement. This data set contains 62.5 million quality records for 2.76 million patients and spans 8 health systems, with 685 Practice Groups employing 5 thousand physicians.
Cloudera is rich in metadata useful for understanding the data behind the analytics visualized by Qlik. However, this metadata is somewhat scattered across different areas of the Cloudera ecosystem. In this application, Qlik pulls valuable metadata from Cloudera Navigator and Cloudera Manager using REST APIs. These published API calls give insight into query usage, query performance, and metadata tags.
Combining this REST data with a series of looping SQL calls against Impala, we are able to associate database, table, and column statistics with the Navigator and Manager data. By combining this data, we are able to create a full understanding of relevant Cloudera metadata that Qlik can analyze. This application also powers the selection criteria for the upcoming Cloudera Data Explorer.
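One plausible shape for those looping SQL calls is sketched below in Python: for every table in every database, emit Impala's `SHOW TABLE STATS` and `SHOW COLUMN STATS` statements. Whether the application uses exactly these statements is an assumption, and the database and table names are hypothetical.

```python
def stats_statements(catalog):
    """Yield the Impala statistics statements for every table in the catalog."""
    for database, tables in catalog.items():
        for table in tables:
            yield f"SHOW TABLE STATS {database}.{table};"
            yield f"SHOW COLUMN STATS {database}.{table};"


# Hypothetical databases and tables.
catalog = {
    "tpcds": ["store_sales", "customer"],
    "enron": ["emails"],
}

statements = list(stats_statements(catalog))
for statement in statements:
    print(statement)
```

Each statement's result set can then be tagged with its database and table name so it associates with the Navigator and Manager data.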
Cloudera Altus is a cloud service platform with services that enable you to use CDH to analyze and process data at scale within a public cloud infrastructure. It is designed to provision clusters quickly and to make it easy for you to build and run your data workloads in the cloud.
Altus works within the cloud service provider architecture. That framework provides an excellent foundation for Qlik Sense in a cloud-based solution powered by Altus. This dashboard application is powered by Altus running a TPC-DS data set on S3, with Impala as the query engine.
This demo shows how Qlik, Cloudera, and DataRobot can be integrated to provide a modern analytics stack for an anti-money laundering use case. The fictional PomBar Bank has just released an international payments system, powered by Ripple. They want to extend visibility of their AML/KYC system into their Ripple transaction data.
Cloudera's Enterprise Data Hub provides the storage and infrastructure for a secure, governed anti-money laundering system, centralizing data across all legacy banking systems, as well as from Ripple's API. DataRobot - a highly automated platform for machine learning - is used to implement an anomaly detection routine, as part of PomBar's AML workflow. Qlik then provides an efficient end-user platform for monitoring, visualizing, and transforming that data.
This is the "GA" of the Cloudera Data Explorer based off the Data Concierge platform developed by Dennis Jaskowiak and Qlik DACH SA team.
Philip Corr of Bardess Consulting rebuilt the code to enhance user viability and create guardrails for user interaction. The application is powered by the Cloudera Data Catalog developed by Dave Freriks. The data catalog collects and associates metadata from Impala, Cloudera Navigator, and Cloudera Manager.
This software is released "AS-IS", but we welcome improvements to the base code, as there are many cool things that could be added to this concept.