

STT - Setting Up Knowledge Marts for AI

Troy_Raney
Digital Support


Last Update:

Jan 30, 2026 3:25:59 AM

Updated By:

Troy_Raney

Created date:

Jan 30, 2026 3:25:59 AM


 

Environment

  • Qlik Cloud
  • Qlik Talend Data Integration

Transcript


Hello everyone and welcome to the January edition of Techspert Talks. I'm Troy Raney and I'll be your host for today's session. Today's Techspert Talks session is Setting Up Knowledge Marts for AI with our own Simon Swan.
Thank you very much for inviting me to the session.
Absolutely. Today we're going to be talking about the context and challenges of preparing data for AI and how to use Knowledge Marts in a pipeline. We'll have a demo on that and explore some of the reasons why you should use them and how to get started. Now, Simon, to set the stage for why we're even talking about this, what kind of problems have you heard from customers in the industry?
Yeah, this is a good question. What we hear is that the pipelines they're creating are quite brittle. Organizations are being asked to source information from many different systems, things like a CRM or a document store, and pull all this together. And as we know, in many large organizations this information is fragmented across the data estate, so it can be quite challenging to access all of that information.
Mhm.
Unstructured data is huge, and I think most people are aware of this; I saw a figure the other day that between 80 and 90% of enterprise data is unstructured now. And we're talking anything from PDF reports, documentation, audio files, video files, all these different types. So being able to access those is incredibly challenging. One thing that I think is a differentiator for Qlik is that we want to commit to delivering this AI experience, be it a chatbot, or insights, or, in the world of agentic AI, people actively making decisions using AI. But they need to make sure that the data is timely and governed, and this is incredibly challenging, especially when moving at the pace of AI. Any latency with data means that the insight which has been delivered is falling behind where it needs to be.
Yeah.
We're seeing regulatory requirements becoming much more complex and stringent worldwide in different territories.
And how is that affecting data?
Well, it means that data can't be freely accessed across the whole of the enterprise. What we need to do is make sure that the people who should see data have access to it.
I imagine in the world of AI that can be a real challenge.
Absolutely. People are basically trying to throw data at AI at the moment, and as a result, operational rigor isn't quite what it needs to be. And I think a lot of folks are used to internal processes which hold them back from launching these types of projects. You know, you've got your governance team who give you the correct guidance on this, but fundamentally it's very hard to get these off the ground.
Sure.
It's incredibly important that the data is in the right format, that it's available, and that it's documented for both LLM and RAG use. Making sure that the data is AI-ready is incredibly important to customers.
Yeah. All that prep sounds really challenging.
Yeah. Because you need to have some level of ownership or accountability. And so it's not just using the technology, it's also having an operational structural change so you have the right people and you know who to speak with.
Talking about getting things off the ground, one thing I'm not seeing on here is cost.
Yes, that's rather an oversight actually. That is really important. What we're seeing from a lot of customers is that AI budgets are such that it happens at all costs, if you know what I mean. That's not really going to be sustainable. What companies are looking to do is make sure that their AI budgets go further. I don't think any organization is looking to spend less on AI, but what they are looking to do is redirect that budget so each of their projects becomes economically viable, and there will be increasing scrutiny on that. So cost is certainly a very important factor.
And you mentioned a term there: RAG. What is a RAG and why is that important?
Yeah. So RAG stands for retrieval augmented generation. Typically, when organizations use large language models, they're using them in such a way that they need to train them with their data. That can be a) very costly and b) very risky, which we'll talk about in a minute. So RAG basically allows you to add context on top of an LLM. This is where we get into what we call vectorization of information.
Okay.
So what would happen is that a user speaks to a chatbot. The chatbot interprets that query or question and vectorizes it. A vector store is basically like a number store; it's a multi-dimensional store of information where you can retrieve data based on the information which has been provided. So it looks at keywords and it looks for the semantic closeness of the data. When a question comes in, it will vectorize it, go to the vector store, and pull out the nearest neighbors. One thing to bear in mind here: I said neighbors rather than neighbor. You basically have to pull out information in pieces, and when people write information to a vector store, they use a process called chunking. A chunk is basically the snippet of information which has been vectorized, and there's typically overlap between the different chunks, so you get the full picture back. So the query goes in, goes to the vector store, we retrieve the relevant pieces of information, it comes back to the app, and then we send that to the large language model. Now, a large language model interprets textual information and does a fantastic job; it's very much a generalist in terms of the way that it's set up. If we supplement that with our information from RAG, from our vector store, this adds context to the answer. It now becomes a subject matter expert, because we're giving it the information to interpret so it can give a stronger answer. So once it goes through that process, it will use the LLM's general knowledge, anything that it's trained on, plus the RAG information, and that is used to generate a response which goes back to the user. Now it has a memory, so when you talk to an LLM, you get more and more relevant information; the amount of information that you're interacting with can become greater and greater, and as a result you can see slowdown as well. So it's really important that you're passing it just the information that it needs to be successful.
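As a rough illustration of the flow Simon describes (chunk with overlap, embed, retrieve nearest neighbors, then build the LLM prompt), here is a minimal Python sketch. The embed() function is only a stand-in for a real embedding model, and none of the names below are Qlik or vendor APIs.

```python
# Minimal RAG retrieval sketch. embed() is a placeholder, not a real model.
import numpy as np

def chunk(text, size=400, overlap=80):
    """Split text into overlapping snippets so answers aren't cut in half."""
    out, start = [], 0
    while start < len(text):
        out.append(text[start:start + size])
        start += size - overlap
    return out

def embed(text):
    """Stand-in for an embedding model call; returns a deterministic vector."""
    rng = np.random.default_rng(abs(hash(text)) % (2 ** 32))
    return rng.standard_normal(384)

def retrieve(question, chunks, vectors, k=3):
    """Vectorize the question and return its k nearest-neighbour chunks."""
    q = embed(question)
    sims = np.array([np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)) for v in vectors])
    return [chunks[i] for i in sims.argsort()[::-1][:k]]

# Index once: chunk the documents and store their embeddings.
docs = ["Quarterly market commentary text ...", "Internal KYC policy text ..."]
chunks = [c for d in docs for c in chunk(d)]
vectors = [embed(c) for c in chunks]

# At question time: retrieve context and assemble the prompt for the LLM.
question = "What moved the market this quarter?"
context = "\n---\n".join(retrieve(question, chunks, vectors))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```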
So is that the main benefit of the RAG? It optimizes it with that context?
Absolutely. You don't have to provide everything; using a RAG makes it very easy to get started. It's also a lot cheaper than training an LLM, as an example. You don't have to worry too much about corporate data knowledge; you just get the information that's required. You don't have to worry too much about the security either, because it is very tightly scoped: when you actually query the vector store, you only get the information in context, with the filter as well. Also, LLMs have an issue where they have to be continuously trained, and this is where we get into this issue of synchronization of data. Do you have a robust way of making sure you're adding new insight into your LLM? How do you replace information that's already there or remove existing information? This is where RAGs shine, and this is why they're so popular with our customers today.
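To make the "no retraining" point concrete, here is a purely illustrative sketch: keeping a RAG current is just upserting and deleting entries in a vector store, and scoping works as a metadata filter at query time. The in-memory dictionary below is a stand-in for a real vector database, and all names are invented.

```python
# Toy in-memory stand-in for a vector store: keeping knowledge current is an
# upsert/delete, not a model retraining run, and scoping is a metadata filter.
store = {}  # doc_id -> {"vector": [...], "text": "...", "metadata": {...}}

def upsert(doc_id, vector, text, metadata):
    store[doc_id] = {"vector": vector, "text": text, "metadata": metadata}

def delete(doc_id):
    store.pop(doc_id, None)

def query(region):
    """Return only entries the caller is scoped to see; similarity ranking omitted."""
    return [d["text"] for d in store.values() if d["metadata"].get("region") == region]

# New insight arrives: add it. Stale insight? Remove it. The LLM itself is untouched.
upsert("news-0130", [0.12, 0.98, 0.33], "Rates held steady; bank stocks rally.", {"region": "EMEA"})
delete("news-1201")
print(query("EMEA"))
```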
So how can Qlik help with this?
Yes. Well, Qlik's been a part of the data integration space for many, many years now, so we can connect anything to anything. And when we talk about that, we're talking about data stores all the way from cloud data warehouses and data lakes to applications like SAP, and also files, textual data sources, etc. So there's a tremendous amount of connectivity. Typically what we see is that there are two classifications: structured and unstructured data. We get into the cycle of ingesting the data, transforming it for use, and then governing it as well. We can surface this as a data product, as an example, so we can share it with the rest of the team. Business users can go in, see this data, and start using it for their analytics workload, knowing that the data has been certified using the Qlik Trust Score as well. Prior to Knowledge Marts, customers were basically writing this information out into the vector store themselves, and this is where we started to see a little bit of pain: they were taking the data from there and then writing it to the vector store. The introduction of Knowledge Marts has allowed us as an organization to really innovate and make it a lot easier for folks to add them to existing pipelines. For unstructured data, we take those document formats, split them into chunks as I mentioned, and then vectorize this information after creating the embeddings, and then it can be retrieved by our chatbot. Now, with structured data we do it a little bit differently, because free text information always works best with LLMs. So we take our structured information and convert it into a JSON format. What's very important is that we have the correct context and relationships within the data to create these JSON files, which we can then ingest.
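As a rough sketch of the two ingestion paths just described, assuming a placeholder embed() function and an in-memory list standing in for the shared vector store (the real pipeline automates all of this, and these names are not part of any Qlik API):

```python
# Illustrative only: unstructured text is chunked then embedded; structured
# rows are serialized to JSON (keeping their relationships) then embedded.
# Both land in the same store. embed() is a placeholder, not a real model.
import json

def embed(text):
    return [float(ord(c)) for c in text[:16]]  # stand-in vector, not meaningful

vector_store = []

# Path 1: unstructured document -> overlapping chunks -> embeddings
report = "Central bank holds rates steady; the technology sector rallies on strong earnings."
size, overlap = 40, 10
for start in range(0, len(report), size - overlap):
    piece = report[start:start + size]
    vector_store.append({"text": piece, "vector": embed(piece)})

# Path 2: structured rows -> one JSON document with its relationships -> embedding
order = {"order_id": 42, "customer": "Acme Holdings", "lines": [{"sku": "A-100", "qty": 3}]}
doc = json.dumps(order)
vector_store.append({"text": doc, "vector": embed(doc)})

print(len(vector_store), "entries in the shared vector store")
```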
And is that whole process automated?
Yes. So what happens is that once you've connected it to your data store, you can automate the production of these JSON files and then write them to your vector store. The two entry points are slightly different, but in terms of writing to the vector store, it's the same process. You can write to the same vector store with structured and unstructured data as well, which we'll see in the demonstration later on, along with what the key benefits are there.
Can you show us what it looks like adding a Knowledge Mart to a pipeline?
Yeah, I can probably show folks how to do that.
So, we're looking at a project you've got here and I see a Knowledge Mart already on the end.
Yeah, adding it is quite simple. Any existing pipeline can turn into a knowledge pipeline very easily. If you go to the top right and click Create New, you can add in, as you can see here, a Knowledge Mart or a file-based Knowledge Mart. Obviously, with a file-based Knowledge Mart you need to make sure that you have a document store that you're connecting to; in this case here, I've got a local file share. As you can see, with the Knowledge Mart you can put it after a Transform or a Storage component. For the Knowledge Mart here, we can store the vectors in the existing data platform. So if you're using, as an example, Snowflake or Databricks, you can use Snowflake Cortex to manage everything, and there are two things you need to configure: the Embedding Model and the Completion Model. Now, a lot of this depends on the use case, so you do need to tune these accordingly, but you can see here it's pretty easy, once you've got this information, to go ahead and configure it. You can also bring your own external vector database. If I go ahead and Create New here, you can see we have support for Pinecone, OpenSearch, and Elasticsearch for your vector database, and then you also need to bring an LLM as well. You just very easily connect to your choice of LLM; configuration is super simple. You basically just have the gateway that you use, the API key, and a name, and you're good to go. With the file-based Knowledge Mart it is a little different. Again, I can store the vectors in my data project's platform, or I can go to an external data store, and this is where I have the ability again to connect to Pinecone, etc., as a vector database. But the source in this particular example would be things like, as I said, S3, Azure Data Lake storage, FTP, a local file share, etc. So you can pull document information from all of these different sources. Our aim is to make this incredibly easy for people to a) create these types of pipelines and b) add them into existing pipelines that are trusted today. We really want folks to be able to move quickly on this, and having these features drop into existing pipelines means you can populate a vector database very quickly.
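Purely as a recap of the settings that walkthrough touches, here is a hypothetical checklist written as Python data. The real configuration happens in the Qlik Talend Cloud UI, and the placeholder values below are not actual option names.

```python
# Hypothetical checklist only; configure these in the UI, not via code.
knowledge_mart_checklist = {
    "placement": "after a Transform or Storage task in an existing pipeline",
    "vector_storage": {
        "in_platform": "e.g. Snowflake with Cortex, or Databricks",
        "external": ["Pinecone", "OpenSearch", "Elasticsearch"],
    },
    "models_to_choose": ["Embedding Model", "Completion Model"],  # tune per use case
    "llm_connection": {"gateway": "<your gateway>", "api_key": "<API key>", "name": "<connection name>"},
    "file_based_sources": ["S3", "Azure Data Lake storage", "FTP", "local file share"],
}
```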
Wow. So in that Knowledge Mart file share, you can actually point it to a folder on SharePoint, perform some transformations perhaps, and then make that into a Knowledge Mart?
Yeah, this is a really good point actually. Yes, you can use all of the transformation-type features. As an example, one thing you might want to do is change the names of the fields themselves so they work more contextually. Customer No or Customer Number might not be enough; let's describe it a little bit better. This metadata makes the information so much stronger when it's picked up by the RAG. So any transformation you can do to remove inconsistent values or to restructure the data, this is really what you want to do. Ideally, what we're trying to achieve here is to create a model which we can then pass effectively downstream to the RAG.
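A small sketch of the kind of renaming Simon suggests, assuming hypothetical column names; in practice you would do this in the Transform task rather than in Python.

```python
# Rename terse columns into self-describing names before they reach the RAG.
# Column names here are invented for illustration.
import json

row = {"CustNo": 1001, "AcctBal": 25000.0, "KYCFlag": "Y"}

descriptive_names = {
    "CustNo": "customer_number",
    "AcctBal": "account_balance_usd",
    "KYCFlag": "kyc_verification_completed",
}

renamed = {descriptive_names.get(key, key): value for key, value in row.items()}
print(json.dumps(renamed, indent=2))
```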
Great. What's the difference between the table Knowledge Marts and the file Knowledge Marts?
With table Knowledge Marts, we actually work natively inside the Enterprise warehouse or Lakehouse. And what we're trying to do is to preserve things like these live table relationships, the entities and also the Lineage as well.
Mhm.
We're creating these JSON structures which we can then pass to the vector database. It's critical for things like customer 360 or supply chain, anything where fresh and trusted data really matters, and this is what we're feeding into these table Knowledge Marts.
And how does that compare with the file version?
Yeah, so with file Knowledge Marts, it's a little different. Basically, this is unstructured text content, so things like PDFs, text, that type of stuff. It won't be able to work with things like videos and images; however, you can use services like Descript to transcribe those into textual format as well. Our aim is to be able to connect to all the different varieties of document stores that people have, and this is at incredible scale as well. So those are the main differences between the two.
Fantastic. I would love to see it in action. What do you got for us for a demo today?
Yeah, we'll be examining Finsight, which is a fictitious finance application. This demonstration represents complete lineage, including raw structured and unstructured sources processed by Qlik Talend Cloud, stored as vectors in Snowflake, and surfaced to business users via Finsight, our financial intelligence chatbot.
Okay, great. I can see we're in Qlik Talend Data Integration.
Yes, here we can view our various integration projects which are categorized as either replication or pipeline projects. I'll be opening my primary project, the finance knowledge pipeline. This project utilizes Snowflake as our data platform. Inside the project view, you can see we have defined two distinct source systems. We have an on-prem PostgreSQL database hosting our structured financial data. Second, we have a local SMB file-share that hosts our unstructured data, specifically market news in PDF format.
Okay.
The PostgreSQL source feeds into a sequence of tasks: Landing, Storage, Transform, and Knowledge Mart. The SMB source feeds into a dedicated file-based Knowledge Mart task.
Okay. Can you walk us through what this pipeline is doing?
Absolutely. I start the Landing task. The Landing task is responsible for the continuous transfer of assets from on-prem to our Landing area. It begins by performing a full historical load after which it automatically switches to change data capture or CDC mode. Tables such as trades, customer profiles, financials, and our KYC information are now being replicated. To demonstrate the real-time capabilities, I'll use my demo assistant application that our team have created to mimic database operations.
Okay, so this demo assistant is simulating the insertion of a whole bunch of new trades, new data into the system?
Yes, if we look at Qlik Talend Cloud, we can observe the system catching up and replicating these newly inserted rows in real time.
Very cool.
I will perform a similar test for customer data. Using the assistant, I'll insert a few customers, update their profiles, and finally delete the records. Notice how the system instantly captures these changes from the source database, ensuring the target remains synchronized. Returning to the pipeline, I'll now start the Storage task and open the Transform task. Here we can perform transformations using a low-code drag-and-drop canvas, a pro-code SQL approach, or even utilize our AI assistant to generate queries via prompt. In this demo, we won't perform specific transformations. Instead, I want to highlight the data model, where we have defined the relationships between our tables to establish a coherent structure.
So, what's the next step?
We're now building a structured Knowledge Mart. We model this data into facts and dimensions and apply the embedding models. This converts rows and columns into the concepts the AI can understand.
Can we take a closer look at that knowledge task?
Absolutely.
Okay. So this is the document schema and the data sets. What kind of governance do we have on this?
Absolutely. Let's examine the governance capabilities. Qlik Talend Cloud provides comprehensive data lineage. Starting from our Knowledge Mart task, we can trace the data flow all the way back to specific rows in our source PostgreSQL database.
Nice.
This provides full end-to-end visibility. Now let's define the document schema. Creating this is straightforward. I select my base data set, which is my customer table. I then associate related data sets. For data enrichment, the one-to-one relationships, I have selected customer financials, customer profile, and the KYC info. For subdocuments, which are one-to-many relationships, I've selected trades, as a single customer can have multiple trade execution history entries. With the schema defined, we can now interact with our structured data using the embedded Test Assistant feature directly within Qlik Talend Cloud.
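To visualize the document shape that schema implies (the customer base record, its one-to-one enrichments folded in, and trades as a one-to-many subdocument array), here is a hypothetical example; the field names are invented and the actual generated documents may differ.

```python
# Hypothetical document shape for the schema described above.
customer_document = {
    "customer_id": 2001,
    "name": "Jane Example",
    "customer_profile": {"segment": "Private Banking", "advisor": "A. Smith"},      # 1:1
    "customer_financials": {"net_worth_usd": 1200000, "risk_profile": "Balanced"},  # 1:1
    "kyc_info": {"status": "Verified", "last_review": "2025-11-02"},                # 1:1
    "trades": [                                                                     # 1:many
        {"trade_id": 7001, "instrument": "Tech ETF", "side": "BUY", "quantity": 150},
        {"trade_id": 7002, "instrument": "Gold futures", "side": "SELL", "quantity": 10},
    ],
}
```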
Very cool. So, you can just ask it questions about the data?
Yes, under the hood, this interaction is powered by an LLM and a vector database, both hosted securely within the Snowflake data platform using Cortex and configured directly in the pipeline settings. We will now interact with the Finsight interface. This is a custom application built by Geoffrey Dommergue, one of Qlik's leading solution architects, who speaks to customers all the time. He was also a huge help with this demonstration. Finsight was designed to mimic a chatbot banking assistant for financial advisers and customers.
Okay.
We have verified that the pipeline tasks are running and the replication is occurring in real time. We've also validated the data logic using the Test Assistant. Now let's move to production and see how our end users experience this data. Finsight serves as a chat application for users and advisers to ask questions about their portfolio, investments, and trades. And it's fed by the data directly from the Knowledge Mart.
Nice. So this is a mobile application experience built on top of Qlik Talend Cloud that lets you interact with a chatbot regarding the data. Very cool.
Yeah. When I ask Finsight the same questions we used in the Test Assistant, it retrieves accurate answers using the same underlying Snowflake LLM and Vector database. At this stage, we have a production grade financial assistant. However, it currently has a limited context. It's only aware of the structured data, the customers, and the trades from our database.
Okay.
It's not yet aware of external market factors or news evidence that might impact these positions.
So, how do you address that?
Unstructured data. Financial context often lives in documents. Right now, our AI has information on our trades, but it doesn't know the market context. To bridge this gap, let's focus on the file-based Knowledge Mart. This task is designed to ingest unstructured data such as PDF documents. Using the demo assistant again, I will trigger a batch import of some new market news stories stored as PDFs on my local machine. As I start the task, you will see the unstructured data being ingested immediately. Crucially, both our structured database records and these new unstructured PDF vectors are stored in the same vector database as we share an index between both tasks.
All right. And it's already done.
Now, if I return to Finsight and ask a question that requires knowledge of both customer positions and market news, the application can answer effectively. It now understands the broader context. We can see here that Finsight is referencing the recently ingested news articles and combining that insight with the structured trade data to provide a much more comprehensive answer.
That's awesome.
With the pipeline now ingesting both structured and unstructured data, Finsight can present insight into market moves in real time. Returning to the demo assistant, I'm going to inject a new scenario: I'm adding a new customer profile, adding specific financial details for them, and executing a new set of trades. If we look at the CDC status in the console, we can confirm that these structured changes are being ingested and propagated through the pipeline to the vector database in real time. Simultaneously, I will use the file-based Knowledge Mart to ingest a specific news article. This article contains critical information that directly impacts the market position of the trades we just inserted for our new customer.
Okay, a bunch of new live data. How will that affect the application?
I'm going to refresh Finsight to ensure it captures the latest state. Now I'll ask the same questions as before. The result is immediate.
Wow.
Finsight is now fully aware of the new customer, their specific trades, and the implications of the breaking news article. It has successfully correlated the structured and unstructured data in real time to provide an accurate, context-aware risk assessment.
Wow, that was a very cool demo. Simon, could you talk a bit about the benefits of this?
Yeah, absolutely. And thank you to our team, as I mentioned, for putting together that demonstration. It's great to see a very realistic use case, especially when it's manifested in the types of applications that people are building for customers. First of all, Knowledge Marts can be integrated into our existing technology, and being able to use our universal data replication and CDC is so important. We're also unifying across all of these different data types: we vectorize the information from both structured and unstructured data and then write it to the same vector store. A multimodal Knowledge Mart adds in this sentiment, this context, which really makes data come alive. The governance and observability are very important too, so having these governance tools in place matters. Lineage allows people to trace back to the data, and we can also look at impact analysis as well. Creating an AI-ready architecture to support vectorized information gives you so much more value from the data and allows customers to make better decisions more quickly.
Great. So how can customers find more information about this?
Absolutely. The best place to start is the help: go to help.qlik.com and search for Knowledge Marts. It's very easy to get started, but there is a tremendous amount of depth here as well.
Great. I'll definitely include a link to this as well. What is the first step they should take when they want to get started?
You know, just get started. Being able to bring your own data platform, be it Snowflake or Databricks, is extremely powerful for accessing your pipeline data in real time. You can bring your own vector store: if you're using something like Pinecone, you can do so. Or you can connect to your file data sources and use our Test Assistant; it allows you to look at the documents, test the embeddings, and test the vectorization with semantic search as well. Having that fully integrated into the platform is very, very powerful. Find a line of business or a use case that you can deliver quickly, and create this groundswell around it, so people can see that you're not just able to deliver AI, you're able to deliver trusted, synchronized AI in real time. And then that's when you get people's attention.
Great. Well, now it's time for Q&A. Please submit your questions to the Q&A tool on the left-hand side of the ON24 console. A few questions have already come in, so I'll just read them from the top. How can we manage AI knowledge bases or Knowledge Marts in a customer per tenant environment?
Okay, so best practices: you're using a gateway to initiate this, so you take your Data Movement gateway and then install the AI feature that you need on there. With that in mind, what happens is you're creating a different type of Data Movement gateway, and so I would advocate creating a specific gateway for these types of workloads.
Okay.
One of the things to bear in mind is that when creating projects, if you create a gateway, then all of the transformations and storage steps will use that same gateway. So you can actually create a cross-project pipeline. These are like modular components you can incorporate into other pipelines as well. This way you're scoped just for these AI workloads, and you can share them between multiple pipelines. So in terms of best practices, I would certainly advocate cross-project pipelines.
I've seen those. That's pretty cool. Great. Next question: How to make data marts created in Qlik Talend Data Integration available for users on the analytics side?
Oh yeah, a lot of people are using Qlik Answers as an example, which is a fantastic application. The difference between something like Qlik Answers and Knowledge Marts is that Qlik Answers is very straightforward: people can go ahead, ingest their documents, and they're ready to use for analytics workloads. Both have a role in this. Knowledge Marts are typically for engineering use cases, where people are looking to create these RAG or agentic experiences through which people can access data. The production and consumption of data happens as a data product: once you have this information, you can write it out to a data product, making it available for downstream analytics consumption.
That's a great point. All right. Next question: Can you monitor the usage of the Knowledge Mart?
That's a good question, and I won't answer questions I don't know all the answers to. I think this is an opportunity for us to chat with some of our technical team as well. And I think we have a follow-up session planned, don't we, where we're actually doing a technical Q&A?
Yeah, next week we're having a Q&A with Qlik session focused on the use of AI in Qlik Talend Data Integration. We're going to have a few more technical experts in that, so that would be a great place to bring up that question. If you're available, just check out Techspert Events on Qlik Community to register for that event and you'll get the invitation. Just submit that question there and we'll bring it up in that second session. All right, next question: Do I have to make changes to support Knowledge Mart jobs?
No, that's really the beauty of it. Other than obviously having a dedicated gateway, where you've done the installation process for AI workloads, you can integrate it into any existing pipeline, and that's where I see organizations accelerate: taking a trusted pipeline that's been running at scale and has been validated, and then adding in these AI workloads. It makes it very easy for people to extend. So there's no real limitation, and I would again advocate for people to try to get started and see what they can do.
Perfect. Next question: What are best practices for creating Knowledge jobs?
I think I may have already answered that one. The ability to use cross-project pipelines is obviously quite important, but also use the Test Assistant. It's a great innovation having that directly in the platform itself.
Great. And last question we have time for today: How can I turn on this feature or is it already available?
Ah, okay. So it depends. This feature is available in Qlik Talend Cloud Enterprise. If customers have a requirement and don't have that version of the software, speak to your account representative and share more about your particular use case; they'd love to hear from you. Right now, these are the conversations that we're having with customers on their AI initiatives; we're very well versed in understanding what's going on here and we really want to make sure that people have access to this as well. So it's Qlik Talend Cloud Enterprise that they need, but we're very open to partnering up on a PoC. For customers that do have that, it's active in the tenant today: you can go ahead, right-click, and add a Knowledge Mart or a file-based Knowledge Mart to an existing pipeline, or create a new pipeline as well. I strongly believe that when delivering these types of projects, planning is essential: being able to scope your data estate to understand what's happening, and making sure that the data is properly governed and that you're not oversharing information, as an example. But basically, make us part of that story. It's not just Qlik as an organization; we have a tremendous partner network where we can bring in their expertise and experience, maybe in a vertical-specific way for your use case. So just get in touch with us, because we really want to make sure that we can help you achieve your AI projects.
Well Simon thank you so much for putting this together. It's a great presentation and a great feature. I'm glad we're able to share the knowledge of this and let people hopefully try it out.
Absolutely. And I think people can tell I'm actually very enthusiastic about this. I've seen the power. I've seen some of the internal conversations we've had and with customers as well. So, I'd like to take the opportunity to thank the audience today and for the great questions as well. And we look forward to seeing people in our next Q&A. If you've got any questions, please come to us. We'd be delighted to hear from you.
Great. We hope you enjoyed this session. And special thanks to Simon for presenting. We always appreciate having experts like Simon to share with us. Here's our legal disclaimer. And thank you once again. Have a great rest of your day.
