Generative Artificial Intelligence (GenAI) and related applications have exploded onto the tech scene over the last couple of years. While the technology shows great promise, building data pipelines that leverage customers' structured and unstructured data is a challenging, high-effort integration activity.
Qlik Talend Cloud (QTC) Knowledge Mart data capabilities enable customers to simplify and accelerate the work needed to get their data flowing to Large Language Model (LLM) Retrieval Augmented Generation (RAG) based GenAI applications. In this blog we'll cover this exciting new capability that simplifies using your data with GenAI applications.
Watch a DEMO of this capability HERE!
Background – GenAI, LLM, RAG, Vector stores
Before diving into how QTC Knowledge Mart data capabilities use automation to make enterprise data seamlessly available to RAG based GenAI applications, let's outline the technologies involved and the complexities found when building GenAI applications from scratch.
RAG is a method of implementing GenAI applications that grounds the LLM with the data context it must use when answering a query. It is used in conjunction with LLMs both to avoid the need to train an LLM on customer-specific data and to limit the scope of the data the LLM will use to answer the questions posed to it. While LLM based chat interfaces, such as ChatGPT, are the most readily recognizable element of a GenAI application, several precursor technologies and processes need to be selected and integrated, typically with complex code-based methods.
Anatomy of a RAG based solution
A typical RAG based GenAI solution contains the following components and process flow.
For the RAG application or chat bot to service a user's query against enterprise data, that data first needs to be loaded into a vector store with appropriate LLM embeddings. An LLM embedding is a high-dimensional numerical vector representation of a piece of text (such as a word, sentence, or document) generated by an LLM like GPT, BERT, or other advanced models. Embeddings capture the semantic meaning of the text so that semantically similar pieces of data are closer together in the vector space, which allows the model to perform tasks such as similarity search, classification, or language generation efficiently.
At query time, an embedding is generated from the user's query text and used to retrieve the most similar vectors from the store. The retrieved content is then passed to the LLM, along with the text of the user query, as the context against which the LLM generates the response back to the user.
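The retrieve-then-generate flow described above can be sketched in a few lines of Python. This is a toy illustration only: real embeddings are high-dimensional vectors produced by an embedding model, while the hand-made 3-dimensional vectors and document names below are invented purely to show the mechanics of similarity search and prompt grounding.

```python
import math

# Invented toy "document" embeddings (real ones come from an LLM and
# have hundreds or thousands of dimensions).
DOC_VECTORS = {
    "invoice policy":  [0.9, 0.1, 0.0],
    "travel policy":   [0.1, 0.9, 0.0],
    "security policy": [0.0, 0.2, 0.9],
}

def cosine_similarity(a, b):
    """Similarity of two vectors: 1.0 means same direction in vector space."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def retrieve(query_vector, top_k=1):
    """Return the top_k document keys closest to the query embedding."""
    ranked = sorted(
        DOC_VECTORS,
        key=lambda d: cosine_similarity(query_vector, DOC_VECTORS[d]),
        reverse=True,
    )
    return ranked[:top_k]

def build_prompt(question, context_docs):
    """Ground the LLM: retrieved text becomes the only answer context."""
    context = "\n".join(context_docs)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

# A query embedding close to the "invoice policy" vector retrieves it.
query_vec = [0.8, 0.2, 0.1]
hits = retrieve(query_vec)
prompt = build_prompt("How do I submit an invoice?", hits)
```

In a production pipeline the `retrieve` step is performed by the vector store's own indexed search, not a full scan as here; the grounding pattern in `build_prompt` is the same.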
RAG based solution technology components
For this process to work, several technology decisions and integrations need to be made in advance.
All of this together paints the following picture of the required integration.
An implementation of this solution requires a large scripting/coding effort and specialized knowledge. As we'll see next, Qlik Talend Cloud automates most of the integration and requires only configuration and selection of the technologies to be utilized.
Qlik Talend Cloud – Knowledge Marts
Qlik Talend Cloud (QTC) is purpose-built to simplify and accelerate the implementation of RAG based GenAI data integration pipelines by using a no-code approach. Let's cover each of the features in detail and how they leverage automation to enable this capability.
Data source connectivity
QTC offers no-code connectivity to hundreds of data sources, including enterprise systems, mainframes, SAP, databases, and SaaS applications. It offers efficient, zero-footprint, minimal-impact near real-time log-based Change Data Capture (CDC) or incremental API reads to send data and changes only once, from source to target, without the need to reload the same data over and over. The intuitive interface allows for an easy implementation of this connectivity and movement process, as shown below.
More information is available at the following link on qlik.com
Data preparation/transformation
Once the data is in the target cloud platform, the next step is to prepare it for vectorization. This entails creating derived data sets, with the appropriate field and record joining and filtering, that feed the relevant bits of data for the LLM to use. QTC offers a multi-modal transformation design experience ranging from no-code Transformation Flows to pro-code GenAI-assisted query crafting. Learn more about these features on the Qlik Community blog and online guide.
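The kind of derived data set described above can be sketched with plain SQL against an in-memory database. The table and column names here are invented for illustration: the point is that joining and filtering source tables yields exactly the slice of data the LLM should see.

```python
import sqlite3

# Hypothetical source tables (names and data invented for this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER, name TEXT, region TEXT);
    CREATE TABLE tickets (id INTEGER, customer_id INTEGER,
                          summary TEXT, status TEXT);
    INSERT INTO customers VALUES (1, 'Acme', 'EMEA'), (2, 'Globex', 'AMER');
    INSERT INTO tickets VALUES
        (10, 1, 'Login failure', 'open'),
        (11, 2, 'Invoice question', 'closed');
""")

# Derived data set: only open tickets, with the customer name joined in.
rows = conn.execute("""
    SELECT c.name, t.summary
    FROM tickets t
    JOIN customers c ON c.id = t.customer_id
    WHERE t.status = 'open'
""").fetchall()
```

In QTC this join-and-filter logic is expressed through no-code Transformation Flows or GenAI-assisted queries rather than hand-written SQL, but the resulting derived data set plays the same role.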
Data modeling
Once the necessary data sets have been generated, we then define relationship metadata between them. This allows the subsequent Knowledge Mart step to recognize the potential building blocks for the documents to prepare and store in the Vector DB.
Knowledge Marts and Vector DB/LLM integration
The data to be vectorized needs to go through a process of parsing, chunking, embedding, and indexing. Structured data (from tables and columns) needs to be converted to document format prior to these steps. QTC shines in this area with an intuitive interface for determining the elements to include in the document. For an example of the level of effort that QTC Knowledge Mart tasks automate, for just one point-solution LLM and vector store integration, please refer to the following article.
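Of the steps named above, chunking is the one most easily shown in isolation: long documents are split into fixed-size, overlapping pieces so that text cut at a chunk boundary still appears whole in a neighboring chunk. A minimal sketch (chunk size and overlap values are arbitrary; production pipelines often chunk on sentence or token boundaries instead of raw characters):

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks with overlap, so that
    content cut at one boundary is repeated at the start of the next chunk."""
    assert 0 <= overlap < chunk_size
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# A 500-character toy document yields overlapping chunks:
chunks = chunk_text("abcdefghij" * 50)
```

Each chunk is then embedded and indexed individually, so a query can retrieve just the relevant passage rather than the whole document.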
We can store vectors in either:
3. Specify the LLM connection. This connection and the specified models will be used both to create the embeddings for storing the document data in the Vector DB and to power the completions of the chat interface available to the implementer for testing the LLM. The options here depend on the prior choice of Vector DB.
Note: This interface is intended for the Knowledge Mart data implementer to test the integration of the data and processing components (LLM, Vector DB, etc.). It's not intended to be an end-user chat interface.
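The dual role of the LLM connection described in step 3 — embedding at indexing time, completion at test-chat time — can be sketched with a stub client. The class and method names below are invented stand-ins for whatever provider the connection points at; the point is that a single configured connection serves both call paths.

```python
class StubLLMConnection:
    """Hypothetical stand-in for a configured LLM connection that serves
    both embedding (when loading the Vector DB) and completion (when the
    implementer uses the test chat interface)."""

    def embed(self, text: str) -> list[float]:
        # Deterministic toy "embedding": vowel-frequency features.
        # A real connection would call the provider's embedding model.
        return [text.count(c) / max(len(text), 1) for c in "aeiou"]

    def complete(self, prompt: str) -> str:
        # A real connection would call the provider's chat/completion model.
        return f"[stub completion for {len(prompt)}-char prompt]"

conn = StubLLMConnection()
vec = conn.embed("knowledge mart")            # used at indexing time
reply = conn.complete("test chat question")   # used by the test chat
```

Because both paths share one connection, the embeddings stored in the Vector DB and the embeddings generated for incoming queries are guaranteed to come from the same model, which is what makes the similarity search meaningful.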
The completed pipeline looks like the following.
Conclusion – Accelerating your GenAI journey
GenAI offers new and exciting capabilities for interacting with data. Building the workflow that combines all the data sources, processing, and technologies typically entails a large effort. QTC accelerates enterprise GenAI implementations and allows for a faster time to value, at lower effort and cost than building from scratch.
Whether through automatic ingestion of data from structured or unstructured sources, transformation into the required data sets, creation of vector records with appropriate LLM embeddings, or testing of chat answers, QTC lowers the barrier to entry and adoption for delivering RAG based GenAI solutions on your data. Reach out to your account team today to take advantage of this groundbreaking functionality.
Watch a DEMO of this capability HERE!
NOTE: Initial GA release (July 8th 2025) supports Snowflake/Cortex, OpenAI, Azure OpenAI, Amazon Bedrock, Elasticsearch, OpenSearch, and Pinecone. Support for other platforms mentioned will come in subsequent releases.