Hello everyone and welcome to the February edition of Techspert Talks. I'm Troy Raney and I'll be your host for today's session. Today's Techspert Talks session is Building Streaming Pipelines with Qlik Open Lakehouse with our own Jason Hall. Jason, why don't you tell us a little bit about yourself?
Hey Troy, I'm Jason Hall. I'm a Solution Architect at Qlik and just hit my one-year anniversary here. I came to Qlik through the Upsolver acquisition last year; Upsolver was the company that helped Qlik build out its Data Lakehouse and Iceberg capabilities within Qlik Talend Cloud. So excited to talk with you today.
Great. Yeah. Today we're going to be talking about a whole end-to-end IoT architecture. We're going to look at Streaming data via AWS Kinesis, talk about some Open Lakehouse specifics for that, and the final step in that journey is the analytics from the pipelines. And really the reason we're having this session today, Jason, is because there are some new Streaming capabilities coming out with Qlik Talend Cloud. I was hoping you could tell us some more about that.
Yeah, absolutely. So it's an exciting time for the Open Lakehouse product; we announced at the AWS re:Invent conference in December of 2025 the new Streaming ingestion capabilities that are being introduced to Qlik Open Lakehouse within Qlik Talend Cloud. And since we were making these announcements at the AWS conference, we put together an experience that we call the Tour de Qlik. We wanted attendees to have some fun in a real end-to-end Streaming data IoT use case.
Could you quickly explain what IoT stands for?
Yeah, so IoT is kind of a blanket term: ‘internet of things.’ Businesses now have all of these different data generators that are network connected and can send their data somewhere. A very common IoT use case would be a manufacturing company that has assembly lines with sensors scattered all over them gathering device readings.
Very cool. So in that demo the participants are cycling and the cycling data was actually what was Streaming into your dashboards?
Exactly. So if I show you the end-to-end architecture that we built out, the data generator was a Zwift bike with a Wahoo Kickr off the back, a stationary cycling bike that generates a whole bunch of sensor data.
Very fun.
That data is stored locally in an InfluxDB instance; technically, we have a Raspberry Pi sitting close to that bike. From that point, we used the Talend portfolio to stream that data into Amazon Kinesis. Kinesis is Amazon's event management system that can do real-time event processing. Now, from there, Kinesis is just sending messages, right? Qlik Open Lakehouse will then take that data from Kinesis and, in near real-time, build and optimize Iceberg tables in AWS that we can then run analytics against. So users could pedal their hearts out on the bike. It was about a 30-second sprint; it's amazing how much energy you can expend in 30 seconds. And they could see how they did against their peers and what metric data streamed off the bike. There was an automation component to that as well: right after they jumped off the bike, they got an email in their inbox that highlighted their key metric readings. And then we also used the Qlik Predict capability: what if I could keep the same pedaling cadence but increase my power by 10%? How much faster would I finish this ride? Really cool experience. We allowed conference attendees to act as data generators that fed into this full end-to-end IoT Streaming data analytics pipeline, fully powered by different components of the Qlik portfolio.
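To make the producer side of that flow a bit more concrete, here is a minimal, hypothetical sketch of pushing a single sensor reading onto a Kinesis stream with boto3. This is a simplified stand-in for the actual Talend job used in the demo; the stream name and field names are assumptions, not the real Tour de Qlik schema.

```python
import json
import time

import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

reading = {
    "rider_id": "jason",      # hypothetical field names and values
    "bike_id": 2,
    "power_watts": 729,
    "cadence_rpm": 86,
    "event_ts": time.time(),
}

# PartitionKey controls which shard the record lands on; keying by rider
# keeps one rider's readings ordered within a shard.
kinesis.put_record(
    StreamName="bike-data",   # hypothetical stream name
    Data=json.dumps(reading).encode("utf-8"),
    PartitionKey=reading["rider_id"],
)
```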
It's a really cool demo. What are some other possible Streaming sources or use cases for Qlik Open Lakehouse?
Yeah, the three Streaming event sources that we're introducing support for are what you see on the left. There's Apache Kafka, which is an open-source messaging system. There's Amazon Kinesis, which is what we used; it's native to AWS, very simple to set up, and very cost-effective, even at scale. And there's S3. We often think of S3 as just a place to store a file, but S3 is often used for Streaming use cases where files are generated in real-time, sometimes into the millions, and those files might need to be picked up and processed.
So, what else is happening in this ingestion process?
So we're converting that Streaming data into parquet files that make up the Iceberg tables that we're querying. The schema of events can change over time: new fields can show up, strings can grow in length. So your ingestion pipeline has to be able to accommodate that, and it has to automatically maintain row lineage, so that for a particular key column you can see the history of that value as it changes over the course of ingestion, compactions, deletes, and table optimizations. This is often referred to as the secret sauce within Open Lakehouse that makes Iceberg function at scale. We automatically perform the various types of table optimizations needed when Streaming into Iceberg, so that your Iceberg table performs well when the data is being consumed and uses an efficient amount of storage within your data architecture.
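As a rough illustration of the kind of schema change that has to be absorbed automatically, here is what evolving an Iceberg table by hand could look like in Spark SQL; Open Lakehouse handles this for you during ingestion. The catalog, table, and column names are hypothetical, and the Spark session is assumed to be configured with the Iceberg SQL extensions and a Glue-backed catalog named glue.

```python
from pyspark.sql import SparkSession

# Assumes an Iceberg-enabled Spark session with a catalog named "glue";
# table and column names below are placeholders.
spark = SparkSession.builder.appName("schema-evolution-sketch").getOrCreate()

# A new field started appearing on the stream: add it to the table.
spark.sql("ALTER TABLE glue.bike.demo_bike_data ADD COLUMN heart_rate INT")

# A numeric field outgrew its type: Iceberg allows safe type promotion.
spark.sql("ALTER TABLE glue.bike.demo_bike_data ALTER COLUMN power_watts TYPE BIGINT")
```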
And I guess that's a big value that the Qlik Open Lakehouse brings to the table, right? That optimization?
Exactly. Huge value. And there are three things that need to be done when writing into Iceberg at anything approaching near real-time. You have to compact small files into larger files, and you also have to perform file cleanups. You know, I was actually working with a customer a few months ago who wanted to see what happened without any optimization. Within a day of real-time data ingestion, they had an Iceberg table that contained 50 gigs of data but consumed over a terabyte of storage on the back end.
Oh wow.
And then finally the automatic scaling and healing component. These optimizations need to be performed continuously and they need to be highly fault tolerant. These are all things that again the Open Lakehouse does automatically to make sure that your Iceberg tables perform as well as they can and use the most efficient amount of storage.
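For context, here is roughly what those optimizations amount to if you were to run them by hand using Spark's built-in Iceberg maintenance procedures; Open Lakehouse schedules and performs the equivalent work automatically and continuously. The catalog and table names are placeholders, and the session is assumed to already be configured for Iceberg.

```python
from pyspark.sql import SparkSession

# Assumes an Iceberg-enabled Spark session with a catalog named "glue";
# the table name is hypothetical.
spark = SparkSession.builder.appName("iceberg-maintenance-sketch").getOrCreate()

# 1. Compact the small files produced by streaming ingestion into larger ones.
spark.sql("CALL glue.system.rewrite_data_files(table => 'bike.demo_bike_data')")

# 2. Expire old snapshots so the files they alone reference can be removed.
spark.sql("CALL glue.system.expire_snapshots(table => 'bike.demo_bike_data', retain_last => 10)")

# 3. Clean up orphan files that no snapshot references at all.
spark.sql("CALL glue.system.remove_orphan_files(table => 'bike.demo_bike_data')")
```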
So how does the data actually flow through the architecture in the background?
The Open Lakehouse architecture supports data collection from any of the different types of sources that you see over on the left. What we then provide is something called a Lakehouse cluster. This Lakehouse cluster runs in the customer's Cloud infrastructure, which is a really important aspect of the solution: it makes the product very secure, because it runs in the customer's network and the customer maintains all levels of security controls and governance on top of the data being processed, right? Which is very important for the use case that I'm about to show you. We've got a Kinesis stream sending data into a Lakehouse cluster on AWS. We're using a combination of Amazon S3 and the AWS Glue catalog to provide Iceberg tables that are then consumed by a variety of query engines that you see over on the right.
Very cool. What kind of use cases can you show us?
Good question. We're talking about an IoT Streaming use case here, but most of your organizations out there are implementing or thinking about real-time Streaming use cases. Two of the more common ones that I've seen are, one, generic event logging. These could be network security logs, firewall logs, right? Who's doing what inside of our network? Generating the logs is relatively easy; processing and analyzing them can be very difficult. The other very common use case that I see is user telemetry. It's really helpful for application teams to know who's doing what inside of that application.
Yeah, I appreciate that. I mean, everybody loves Streaming data and real-time data, but it's nice to apply some actual use cases to it that you've seen. Can you show us how to build a Streaming pipeline?
Yeah, let's do it.
Okay. So, we're looking at Qlik Talend Data Integration pipeline projects on your tenant. Where do you want to start?
Yeah. So, I'll go ahead and build out a new project. And this is exactly the project to power this IoT experience. So, we'll give this project a name. I'll call this demo bike data. This specifically is a data pipeline project with the Qlik Open Lakehouse as the data platform. First of all, we have to define a connection to the Iceberg catalog that we want to use.
Okay.
Connecting to a catalog is very straightforward. We just need the region; in this example, it's AWS Glue. We need a bucket that we want to store this data in, and we need a way to authenticate to AWS. So, I'm using an AWS role here to authenticate to my catalog. We'll then define an S3 bucket where we want to land raw data: point to a bucket, again using authentication credentials, and that's where the raw data we ingest gets processed.
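Outside the UI, you could sanity-check those same three inputs (region, bucket, and IAM role) with a short boto3 script like the hedged sketch below. The role ARN and bucket name are placeholders, not values from the demo.

```python
import boto3

# Assume the role that the connection will use, then confirm it can see
# the Glue catalog and the landing bucket. ARN and bucket are hypothetical.
sts = boto3.client("sts")
creds = sts.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/qlik-lakehouse-role",
    RoleSessionName="connection-check",
)["Credentials"]

session = boto3.Session(
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
    region_name="us-east-1",
)

# List databases visible in the Glue catalog with this role.
print(session.client("glue").get_databases()["DatabaseList"])

# Raises an error if the role cannot reach the landing bucket.
session.client("s3").head_bucket(Bucket="demo-bike-data-landing")
```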
Okay.
And the Lakehouse cluster is those services that you saw in the architecture diagram that ingest data and process it into Iceberg.
Okay. So we've got the basics of the project and the connections we want to use. What's next?
We now need to connect to our Streaming source data. So that's a process we call onboarding. I've got my Kinesis stream already registered here, but the process to create a connection is very straightforward. You can see those three Streaming sources I referenced. Amazon Kinesis, Apache Kafka, and finally an S3 bucket. Defining a new connection is just the region your stream runs in and how you want to authenticate.
All right.
In my case, I've already got the Kinesis stream created here, so I'll click Next. In this Amazon account, there are actually multiple Kinesis streams. For this use case, though, I'm just going to pick the Bike Data stream and add it into the data onboarding.
Okay.
Here's the process where we define the type of data that exists on that stream. Now, there is auto detection here if you're not sure; in my case, though, I know it's CSV data. I'm also going to take advantage of an option that says automatically infer types. There's no definition within a CSV file that tells you what data type each column is, but Qlik Talend Cloud is able to infer data types based on the data that it sees within the file. So, we'll take advantage of that.
I love that it gives you a sample of the data so you can see what it looks like.
Yeah, absolutely. It's just nice as a real-time validation that the stream contains data and it's of the type that we expect. Click Next here, and you'll see a 'Read data from' option. This just asks: do we want to ingest all of the data that still exists on the stream, including the 24 hours of history, or do we just want to start this pipeline from now? I'll say let's go grab that retained data. We also have an option to define table partitioning within our Iceberg targets: no partitioning, or partition the table by date. You'll have the option of overriding each of these settings later; these are just the defaults for the project. And finally, we've got a general summary of the project and we can go ahead and create it. We have our source over on the left. We have our landing zone right here in the middle, where the raw data will be processed. And then finally, over on the right, we have the task that's going to process that data into Iceberg.
So once we've got the shell of this project, what's the next step?
The next step is to prepare each task, which creates the necessary artifacts, folders, and objects that are needed for ingestion. And then, once everything's prepared, we can go ahead and kick off the task itself, collecting that data from the stream and processing it into landing.
And what is the task on the end responsible for?
Processing this data into Iceberg tables.
Okay.
We've got one stream and we're just going to send that stream to one table, right? But you can map this out in a variety of ways. Maybe you've got multiple streams into one table or one stream into multiple tables and that's all possible here with this data mapping. But in this use case, pretty simple. One stream into one table. If we open up this target table over on the right, maybe we don't like the name of the table that's getting created. By default, it's just inherited from the stream name. But I'm just going to call this Demo Bike Data.
Looks like there's a lot of stuff you can do here.
Yeah, as a data engineer, sometimes we inherently know some things we want to do, right? Maybe I know I'm not going to need this UUID here; just select it and remove the field. Or you can change the data types of fields or perform transformations on the data. But let's assume we're given a stream and we're not sure what we want to do to it yet. So let's just set up a basic pipeline and start to inspect some of the data.
Okay.
I'm going to change a few of the settings here within our Glue catalog. We can write to different databases and schemas, and I'm just going to redirect this data. And there's an option down here to publish this data set into the Qlik Catalog. This will allow us to also apply some of the Qlik Talend Cloud capabilities around data quality and data products to the data set that we're Streaming data into. So we've renamed the table, chosen the schema that we want to ingest this data into, and now I'll just prepare this storage task. Once it gets started, you can track its progress in the tab at the bottom. So you can see the preparation has now finished successfully.
Okay.
And we'll now run it. So if I jump back to the project level here, you can see our landing task is already running. And then that raw data that we've ingested will start to process into Iceberg tables.
Can we see the data being stored in the S3 bucket?
Yeah, absolutely. It takes the name of the project; it's this Demo Bike Data table. And here you can see the name of the task that I set up. Here is a folder that contains our Iceberg tables. There's the ID of the table. And then finally, we have data and metadata folders here. So within Iceberg, you've got metadata: these are all of those snapshot files that I referenced in the slides that are processed when data is written into Iceberg. If we go back a level and into the data directory, you can also see lots of parquet files here. Every time we write into Iceberg, it generates a small file that makes up a part of the table. So after even just a few minutes' worth of ingestion, we've got 12 files that are going to need to be compacted so that this table runs more efficiently.
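The same check could be scripted; here is a hedged sketch of counting the data files under a table's data/ prefix with boto3, the same thing we just eyeballed in the console. The bucket name and prefix are placeholders for the paths shown in the demo.

```python
import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

# Bucket and prefix are hypothetical; the table-id segment stays a placeholder.
parquet_keys = []
for page in paginator.paginate(
    Bucket="demo-bike-data-landing",
    Prefix="demo_bike_data/<table-id>/data/",
):
    parquet_keys += [
        obj["Key"]
        for obj in page.get("Contents", [])
        if obj["Key"].endswith(".parquet")
    ]

print(f"{len(parquet_keys)} data files awaiting compaction")
```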
Very cool. So I just want to wrap my head around what we've seen so far. So you set up the project. It's picking up Streaming data. It's loading that into parquet files in Iceberg tables already.
Yeah. So we have to convert the data into parquet so that it can be written into an Iceberg table. And then there's Iceberg metadata that's created so that some query engine can connect to and consume the Iceberg data itself.
Awesome.
Now if we switch back here to Qlik Talend Cloud, if I go to a project that's been running now for a while, you'll see this pipeline and specifically this top task is exactly what we just set up. Remember that publish to catalog setting that allows the Qlik Talend Cloud data quality capabilities to now be applied to this data set. And if we go into our catalog within Qlik Talend Cloud here you can see that bike data raw that we're creating. Now we can calculate some data quality metrics on top of the data.
Yeah. How does the Trust Score get calculated?
Yeah. It looks at a variety of items, and that is something you can configure. Here you can see the different dimensions of trust that we apply to calculate a score: data validity; completeness, which is non-null values; accuracy; and timeliness, meaning how fresh is that data. These are the different dimensions that we calculate trust against, and the weight over here is how much that particular dimension affects the overall score. So if I jump back to the data set that we were just looking at, you can see our current Trust Score is 81%. It's actually down a little bit.
Is it possible to see what could be lowering the Trust Score?
Yeah, if I scroll down, you can see all the different fields that make up this data set and a field-by-field analysis of the quality of that data. You can see there's a few here, bike ID, event ID, and rider ID that are mostly null, right? 93% of these values are empty. It could indicate a data quality problem that we might want to address.
I love the granularity that it can tell you exactly what fields are having issues and what those issues are.
Yeah, we can show you a data profile. It shows you some of the top values collected across each metric and the distribution of data within each particular field. You can also preview some of the data that's coming in without having to go back and query the table; here's the actual data that's being ingested into this data set. And if I click into a particular field, we can see things like the validation rules that are applied. Right? Remember, part of the Trust Score is data validation and data accuracy. And while we built out this pipeline, every once in a while we noticed some data that just didn't look right. We would look at a rider, they'd get off the bike, and it looked like they were generating 100,000 watts of power, right? Anyone who's a cyclist knows that's not possible. These sensors are inherently faulty; they're very low cost and likely to fail. And those failures could be going offline altogether, or maybe every once in a while one just spits out some garbage data. So, what we may want to do here is apply a validation rule to this. And I've actually already created one that checks whether a specific meter reading is reasonable for this data.
Can we take a closer look at the settings for that?
Absolutely. And there are actually some generative AI capabilities here which can look across your data and suggest some data quality rules to apply. But in this case, I'm pretty sure I know what I want to create. So we can create a new one. And all it really says here is: this is the column that I want to analyze; is it text, number, boolean? And you define the condition here. For this data to be valid, I think it should always be less than 5,000.
So any value that comes in outside of that will be flagged as invalid?
Exactly.
That's a good rule.
But in my case, I've already created it; that's exactly what this is. We are now going to apply this. Once it's applied, we'll refresh the calculation. That will then affect the Trust Score and show some of the records within this column as invalid. And this can be helpful: for one, I want to be confident that it's a valid meter reading, right? And you can see now it's flagged some that are invalid. Here's a reading of over 32,000. Here's a reading of negative one. That's not how bikes work, right? So, we've got some faulty readings in here, and it's affected the Trust Score; you can see it's dragged the Trust Score down. And maybe this now gives me some guidance on ways that I can transform this data within the pipeline to get a higher quality data set downstream.
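For illustration only, here is the same validity check expressed as a plain predicate; this is not how Qlik evaluates rules internally. The "less than 5,000" threshold comes from the demo; the non-negative check is an extra assumption that would also catch the negative reading called out above.

```python
def power_reading_is_valid(power_watts: float) -> bool:
    # Demo rule: a plausible bike power reading is below 5,000 watts.
    # The lower bound is an added assumption to reject negative readings.
    return 0 <= power_watts < 5000

for reading in (729, 32000, -1):
    status = "valid" if power_reading_is_valid(reading) else "invalid"
    print(f"{reading} watts -> {status}")
```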
Is it possible to set up some kind of alerts as well to notify an admin of anomalies like this?
Absolutely. Yeah. So the Trust Score can be monitored visually, and there's also an Automation framework within Qlik where you can set up Automations that check the status of Trust Scores and alert you if there's a decline, or if a data set's Trust Score falls below a certain threshold that you've configured.
Very cool. Yeah. How do you solve the problem of having a bunch of unneeded data?
Yeah. So, let's go back to the pipeline now and try to clean up some of this data. One of the capabilities we announced at re:Invent is not just Streaming data ingestion but also Streaming data transformation. So this is a second version of the same task. If I open up the table over here on the right, first off, something we actually didn't notice in data quality, but I'll call out here: the start, stop, and time values were coming in as strings, and we wanted to process those into proper timestamps. We selected these three fields, edited them to change their data type into a datetime format, and then we applied the datetime function over here in the expression that converts the strings coming in through the stream into properly formatted timestamps.
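As a hedged stand-in for that conversion, here is the equivalent parsing step in plain Python: turning the stream's string timestamps into proper datetime values. The input format string is an assumption about what the sensor emitted, not the actual expression used in the pipeline.

```python
from datetime import datetime

def to_timestamp(raw: str) -> datetime:
    # Assumed input format, e.g. "2025-12-03 14:05:31".
    return datetime.strptime(raw, "%Y-%m-%d %H:%M:%S")

print(to_timestamp("2025-12-03 14:05:31"))
```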
Very cool.
Another transformation we applied was filtering this data. We noticed that those bikes generate data whether or not someone's actually riding them. If I hop on a bike and pedal for 30 seconds, then get off, and there's five minutes before the next rider comes on, we've got five minutes of sensor data that we don't need. If you remember the data quality analysis, that's where all those empty values were coming from.
Right. All those empty values.
Exactly. We had sensor data being generated when there was nobody actually sitting on the bike. So, we set up a filter condition here that checks that both the rider ID, that's who's on the bike, and the bike ID, which is which bike they were on, are populated. If either is null, we don't even want to bother processing it into Iceberg. So, let's test it: if the rider ID is Jason and the bike ID is two, we'd expect the test expression here to return true. But if either of these two fields is null, now you can see it's false, so the record will be discarded.
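The same filter condition, sketched as a plain predicate: only ingest events where both rider ID and bike ID are present. Field names follow the transcript; the record shape is an assumption, and this is not the pipeline's actual expression syntax.

```python
def should_ingest(event: dict) -> bool:
    # Drop any event where either identifier is missing.
    return event.get("rider_id") is not None and event.get("bike_id") is not None

print(should_ingest({"rider_id": "Jason", "bike_id": 2}))  # True  -> processed
print(should_ingest({"rider_id": None, "bike_id": 2}))     # False -> discarded
```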
That's great. I love that, testing out the expression feature.
Yeah, exactly. Those are just a few basic transformations, but there's a ton of power here and how you can transform this data. The other thing I want to show, I glazed over partitioning earlier.
Uh-huh.
But partitioning within Iceberg is the best way, at scale, to improve performance of an Iceberg table. The default partitioning mechanism is by date, but in this example it made sense to partition by the event. Most of the time that you analyze the leaderboard, you're only analyzing it within the particular event that you're currently riding in. That makes sense, and it'll just make everything more efficient downstream.
Okay.
We can also apply sorting to this data. And we can configure some of the data maintenance parameters. You know, how many snapshots do we want to keep? When do we want to expire snapshots? You have full control over what those settings are. So all of these settings were applied and then the data is filtered out, converted to different timestamps and the table data is now more efficiently partitioned.
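Roughly, that partitioning choice and the snapshot-retention setting translate to table-level definitions like the hypothetical Spark SQL below; Open Lakehouse applies the equivalent for you. The catalog, table, column names, and property value are placeholders, and the session is assumed to be Iceberg-enabled.

```python
from pyspark.sql import SparkSession

# Assumes an Iceberg-enabled Spark session with a catalog named "glue".
spark = SparkSession.builder.appName("partitioning-sketch").getOrCreate()

# An event-partitioned Iceberg table, analogous to the transformed target.
spark.sql("""
    CREATE TABLE glue.bike.demo_bike_data_transformed (
        rider_id    STRING,
        bike_id     INT,
        event_id    INT,
        power_watts DOUBLE,
        event_ts    TIMESTAMP
    )
    USING iceberg
    PARTITIONED BY (event_id)
""")

# Snapshot retention is a table property, mirroring the "how many snapshots
# do we want to keep" setting mentioned above.
spark.sql("""
    ALTER TABLE glue.bike.demo_bike_data_transformed
    SET TBLPROPERTIES ('history.expire.min-snapshots-to-keep' = '10')
""")
```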
And as an example, if I go back into the S3 bucket, this is the project folder that we were just looking at. Here you can see the raw data, and if I drill into it, into the data folder, notice there's only one file. That's the benefit of the compaction that we're doing. Remember, when we looked at S3 a few minutes ago, we saw 12 files that were created by ingestion. Over time, those files get compacted down, and here's the result of that compaction.
That's very cool.
But if we jump back now and look at the transformed folder and go into the table directory, now you can see a partition that's created for event 14. 14 just maps to the data for this webinar. In the real production table, you'd see a separate folder for re:Invent, for Qlik Connect, for Big Data London, right? For all the different events that the Tour de Qlik is running at. And then within that partition, you see the files that make up that partition.
Okay, now you've got this clean data. It's been filtered and optimized. What's that final step look like?
Great question. If I go into the catalog here and take the data set, we can now create an application based off this data set. What this actually uses is an Amazon Athena connection, because this data is stored in Iceberg within AWS, and Amazon Athena is a very cost-effective, very scalable query engine from Amazon that allows you to query Iceberg data. But you can use a number of different query engines. Some customers want this data to be mirrored into Snowflake: I can take a table and mirror this data into your data warehouse. It actually doesn't copy the data; it just creates objects in the data warehouse that allow you to query your Iceberg tables that are processed by Open Lakehouse. We currently support Snowflake and Redshift mirrors, and we'll be adding additional data warehouse targets in the future. But if we're mirroring this data to Snowflake, we can have Qlik Cloud Analytics point to a Snowflake connection and consume the data from there. The way that we built this out is a leaderboard experience: somebody rides the bike, they get off the bike, and they want to see how they did.
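For a sense of what consuming the table through Athena looks like outside of Qlik Analytics, here is a minimal sketch using boto3. The database, table, column names, and results bucket are all placeholders, and the leaderboard query is just an assumed example, not the one behind the dashboard.

```python
import time

import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Hypothetical leaderboard query against the transformed Iceberg table.
resp = athena.start_query_execution(
    QueryString="""
        SELECT rider_id, MIN(finish_time_s) AS best_time
        FROM demo_bike_data_transformed
        GROUP BY rider_id
        ORDER BY best_time
        LIMIT 10
    """,
    QueryExecutionContext={"Database": "bike"},
    ResultConfiguration={"OutputLocation": "s3://demo-athena-results/"},
)
qid = resp["QueryExecutionId"]

# Poll until the query finishes, then fetch the result rows.
while True:
    state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state not in ("QUEUED", "RUNNING"):
        break
    time.sleep(1)

rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
print(rows)
```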
Mhm.
So you can see over the course of re:Invent we had 470 cyclists that rode 508 times. Some general KPIs at the top. And here you see the leaderboard experience, right? There are all kinds of filters that you can set here if you want to see different events or different dates, or if you just want to filter on genders or countries. If you drill into a rider, and let me just find myself here, so this was me. I finished in 29th place out of 470 riders; I was pretty happy about that.
That's not bad.
I think this ride on the 3rd is when I did my best. And here you can see the sensor data that we're pulling out of those Iceberg tables. So over the course of the ride, here's how much power I generated, here's my speed over the course of the ride (you can see I kind of fell off at the end), and here's the cadence that I was able to keep up. You can also do some really cool things here. We can compare to another rider. So let's say, for example, I want to see how I compared to the top rider. Here's the gentleman who finished in first place, and I can see what my power curve looked like compared to his. You can see I was not able to generate as much power, hence his faster finish time. If we jump back now, this is something people would often ask when they got off the bike: "How could I do better? Is there a way that I can improve my time?" So, if I again find myself here, you'll see there's an option to run a race prediction. I could take my best race time and see that I generated a max power of 729 and a cadence of 86. What if I could keep that power up but increase my cadence to 100? Or what if I increased my cadence to 100 but generated a little less power? What would that do to my simulated time? In this case, increasing my cadence at the sacrifice of power would have resulted in a slower race time. My guess is that my best result would come from keeping a consistent cadence at a higher power output; I could shave a little bit of time off. And this is all based on modeling all of the sensor data that we have over the course of the event and then running predictive algorithms against that data to enable this kind of what-if analysis.
I love how it actually spelled it out in natural language, too, like what that change would mean for your final outcome.
Very much so, right. And this leaderboard experience is all Qlik Analytics. But remember, what's powering all of this analytics on the data side is high-performing, efficient Iceberg tables that are ingested into and optimized by the Open Lakehouse capability within Qlik Talend Cloud, through a variety of optimization services and data transformations, to make sure that the data we're building this analytics on is of high quality and provided within a scalable and cost-effective data architecture.
Okay, now it's time for Q&A. Please submit your questions to the Q&A tool on the left-hand side of your ON24 console. First question: can Qlik Open Lakehouse provide data to another analytics tool like Power BI, Tableau, Excel, etc.?
Oh, absolutely. Yeah. So, we're using Qlik Analytics specifically to interface with our Iceberg data, but any analytics tool that can run queries against an Iceberg table using a supported query engine should be able to connect to and consume it. That's much of the value proposition of Open Lakehouse, and why we called it Open Lakehouse: to ensure that this data is accessible from the widest range of engines. So yes, any analytics engine that can query Iceberg tables through a supported query engine will be able to consume data from the pipelines that we've built out during this demo.
Great. Next question: What is a good monitoring tool for reviewing the data activity and measuring latency?
Yeah, so there are ways to monitor Qlik pipelines externally. Whatever monitoring tool you have, assuming it has the capability to integrate with Qlik, should be perfectly acceptable. There are two ways you can monitor from within Qlik. One is the visual monitoring I showed you, more on an inspection basis, as often as you want or need; you can check up on the health of the data for your most critical pipelines, maybe even build dashboards around critical pipelines and inspect their data quality and freshness. There's also the Qlik Automate framework. So if you want to generate notifications based on key metrics or key performance indicators coming out of your pipelines, you can send notifications to your inbox, to Slack, to Teams, a variety of targeted approaches. So there are ways to monitor from within Qlik, but there are also ways to monitor externally using whatever your monitoring tool of choice is.
That's right. Great. Next question: at what point in the pipeline is it best practice to apply transformations?
Yeah, it's more of a philosophical question. One approach is what we did: we brought in raw data from Kinesis and loaded it into landing, and that could effectively serve as our bronze layer. Right now, this is just raw data; it's exactly as it came from the source and it's processed into that landing zone. We then decided to apply transformations to that data so that once it lands in Iceberg, it's of higher quality. I often call those inline transformations. And the argument for that approach is: if we know there's bad data in the source, let's filter it out as quickly as possible so that it's not consuming resources throughout our data pipeline.
Now, the other approach is: let's get that raw data into Iceberg so that we have a queryable table with data that's as close to raw as possible. Even if there's bad data in the source, I want to see it so that I can do some forensic analysis on it or fully understand the lineage of data as it moves through the tiers. In that case, you might want to take advantage of this top option: just get the data into Iceberg so that it's queryable in its exact raw format, and then maybe build additional transformations on top of that. There are pros and cons to both methods. I can't really stand here and say you should definitely do it one way; it depends on the exact use case and the approach that you want to take with your data. But the nice thing is that you're provided with these options as you build out the pipelines. Either apply transformations as early in the pipeline as you can to filter out bad data and improve quality, or apply transformations downstream: ingest the data into Iceberg, get the raw data into a queryable format, and then build out transformations beyond that.
Okay, that's fair. Next question: is this AWS Kinesis Streaming possible using Qlik Replicate on-prem?
Yeah, it is. So Qlik Replicate does support Kinesis as a target for replication. So if you're using Qlik Replicate on-prem, you can send CDC data collected by Replicate to Kinesis, and that data could then of course be picked up by Open Lakehouse. So it's very much possible. Replicate supports Kafka, Replicate supports Kinesis, and Replicate supports S3; all three of our Streaming ingestion sources could be fed by Replicate. There are also roadmap items due out later this year to further simplify the integration of Qlik Replicate with Open Lakehouse. So if you want to get started today through one of those supported sources and targets, you could certainly get going right now, but there are some enhancements coming later this year to tighten up that integration as well.
Great. Next question: What version and types of Kafka does Open Lakehouse support?
You know, Kafka is an open-source Streaming event system, and there are lots of different (quote unquote) flavors of Kafka. Lots of different vendors have taken that open-source software and wrapped their own service around it. The general answer is that anything that's Kafka compatible or compliant should just work; we're using Kafka-specific APIs to read from Kafka. For example, Amazon MSK, which I saw called out in the question, is fully supported. MSK is Amazon's managed service around Kafka, but under the covers, it's just Kafka. And there are dozens of vendor-managed Kafka services. As long as they're using Kafka under the hood and they support the Kafka APIs for authentication and reading of data, it should just work.
All right, next question: If I have semi-structured data on my stream, for example, JSON, can I extract and flatten the data via transformations?
Good question. I'll actually build out a very quick demo here. So, this orders data set is semi-structured, right? We'll auto-detect its format; yep, it detected it as JSON. And notice that some of these fields, like for example the customer field, are root-level fields, but within that field we've got an email address, a first name, a last name. There's also an items array. So, remember, Elizabeth here placed an order; that order may have contained multiple items, and those items themselves are stored within an array. Each event here is one row, but I've got nested data and an array element within it. So we'll now go and build this out. I've got my pipeline, and just so we have some data to work with, let me go ahead and prepare and run the landing task. Now, if I open up the Streaming transform task here, we've got our orders data going to an orders table. Let me open up the table definition. The customer field here and the data field are stored as STRUCTs; a STRUCT just means we're keeping that nested format. In our Iceberg table, we will have one column, and within that one column we will have all of our nested fields. And for the data array, we will have one column, and within that one row for that column, we will have all of the array elements. Now, at the settings level, if we want to automatically unnest all of this data into separate columns, we can just do that generally. That will take each sub-field of a nested column and automatically map it to a separate table column within our output.
That's very cool that just a click will do all that.
Exactly. One click, and it'll do it all. You can also be more selective about it. For example, if I want to take this customer field, you'll just see an unnest button here. It found four sub-fields contained within the customer field. If I just want to unnest all of them, then I can just do a select all and I can choose whether or not I want to preserve the nested field or remove the original STRUCT from the data set. And you'll see now we've created four columns in our target which are the nested fields within that original STRUCT. Now similarly with the data array, we've got multiple items within that same array. You can see you can also do things like flatten and unnest that as well.
Now I noticed one of the fields you unnested, “customer address.” It came out and it's got another structured data type. So is that…
Yeah.
So if you…
Exactly. You can even think of this as multiple levels of nesting. If you think about an address, you've got a city, a state, a zip code, a country. The first level of unnesting left the address as a STRUCT, but it peeled out and unnested the email address, first name, and last name. If we want to keep going, I can select address and unnest that as well, and here you can see all the nested address fields. Let's just peel out the city, the country, and the state. And now I've unnested those three fields, city, country, state, but I also have the full address, which remains as a STRUCT in my output.
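To illustrate what those unnest clicks produce, here is a plain-Python sketch: struct fields become top-level columns, and the items array fans out to one output row per element. The order record below is hypothetical, loosely following the fields mentioned in the demo, and this is not how the pipeline implements it internally.

```python
# Hypothetical nested order event, modeled on the demo's fields.
order = {
    "order_id": 42,
    "customer": {
        "email": "elizabeth@example.com",
        "first_name": "Elizabeth",
        "last_name": "Smith",
        "address": {"city": "London", "state": "", "zip": "SW1A 1AA", "country": "UK"},
    },
    "items": [
        {"sku": "A-1", "qty": 2},
        {"sku": "B-7", "qty": 1},
    ],
}

def unnest(order: dict) -> list[dict]:
    # Peel selected nested fields out into flat columns, keeping the full
    # address struct alongside them, as in the demo.
    flat = {
        "order_id": order["order_id"],
        "customer_email": order["customer"]["email"],
        "customer_first_name": order["customer"]["first_name"],
        "customer_last_name": order["customer"]["last_name"],
        "customer_address_city": order["customer"]["address"]["city"],
        "customer_address_country": order["customer"]["address"]["country"],
        "customer_address": order["customer"]["address"],
    }
    # Flatten the array: one output row per item.
    return [{**flat, **item} for item in order["items"]]

for row in unnest(order):
    print(row)
```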
That's really cool.
Defining the transformation logic to handle that nesting and array flattening is very simple to do within the pipeline itself.
Great. And last question we have today, will this demo be at Qlik Connect in 2026?
Absolutely. Yeah, I'm glad that question came up. If you're attending Qlik Connect, and we hope to see as many of you there as possible, then (I think I'm at liberty to say this) yes, we will have the Tour de Qlik there. We will have the bike set up, and you'll be able to participate in this exact same use case, see how you show up on the leaderboard, and have a fun time doing it. So, if you happen to be at the Qlik Connect conference, definitely come by and find us; it'll be hard to miss. There will probably be cowbells ringing and people yelling and breathing heavily, and I'll be there to help work the experience with several of my co-workers. So, definitely swing by and see how you can do.
Awesome. Well, Jason, thank you so much for showing this to us, you know, kind of lifting the hood on what Streaming data into Iceberg tables and Open Lakehouse looks like. Yeah. Thank you so much.
Yeah. Thanks, Troy, and thank you to the audience for attending today. I hope this was an interesting look at Streaming data pipelines within Qlik Talend Cloud and our Open Lakehouse capabilities. If anybody has further questions, definitely reach out to us in whatever way is most efficient. And for those of you that are considering or planning on attending Qlik Connect, it'd be great to meet you in person. So, come find the bikes and come say hi.
Great. Thank you everyone. We hope you enjoyed this session. And special thanks to Jason for presenting. We always appreciate having experts like Jason to share with us. Here's our legal disclaimer. And thank you once again. Have a great rest of your day.