Re: The Secret Life of SubField - Qlik Community

Report Inappropriate Content · ‎2013-08-29

The QlikView documentation describes SubField() function as the following:

In its three-parameter version, this script function returns a given substring from a larger string s with delimiter 'delimiter'. index is an optional integer denoting which of the substrings should be returned. If index is omitted when subfield is used in a field expression in a load statement, the subfield function will cause the load statement to automatically generate one full record of input data for each substring that can be found in s.

In its two-parameter version, the subfield function generates one record for each substring that can be taken from a larger string s with the delimiter 'delimiter'. If several subfield functions are used in the same load statement, the Cartesian product of all combinations will be generated.

NOTE: Ouch, I just realized I made a "noob" error on posting this.. When experimenting with SubField, I was looking at the QVD for the first output (without subfield), and looking at a Table Box for the second output (with subfield). The tablebox automatically eliminates redundant rows, while the QVD, of course does not. Oops

The really awesome feature of SubField is that it automatically eliminates duplicate records generated by the function. This is particularly handy when working with large data sets.

In my current project, I have a dataset that describes what combination of projects and zips that a specific customer wants business. The dataset has three fields: Customer, Zip(List), Project(list). Zip and Project contain comma-separated lists in their fields, which is not in a friendly format. To do analysis in QlikView, I need to get each project and zip combination for each customer onto separate records. Fortunately, SubField does all the heavy lifting for me.

In the following example, the fields are separated by a semicolon. Within the zip and project fileds, you can see the comma separated lists.

customer;zip;project

338;87124,87125;1,2

338;87124;1,3

339;87124,87129;1,2,3

339;94903;1

338;87124;1,3

Loading this into QlikView without SubField:

LOAD
customer,
zip,
project
FROM
test.txt
(txt, codepage is 1252, embedded labels, delimiter is ';', msq);

I get the following unusable table.:

customer	zip	project
338	87124,87125	1,2
338	87124	1,3
339	87124,87129	1,2,3
339	94903	1
338	87124,87125	1,3

However, incorporating the SubField() function to break out the zip and project fields:

LOAD
customer,
SubField(zip,',') as zip,
Subfield(project,',') as project
FROM
test.txt
(txt, codepage is 1252, embedded labels, delimiter is ';', msq);

I get a Cartesian join containing every DISTINCT combination of zip and project for each customer:

customer	zip	project
338	87124	1
338	87124	2
338	87124	3
338	87125	1
338	87125	2
339	87124	1
339	87124	2
339	87124	3
339	87129	1
339	87129	2
339	87129	3
339	94903	1

Even though I have repeated data in my input, my output contains only unique combinations. I found this exceptionally useful. My production input dataset contained 78,000 lines. After applying SubField to the LOAD statement on Project and Zip, my table resulted in 20 million unique records ready for analysis.

This is my first post of any type in the QlikView Community. I have read and learned so much from many other folks on this forum, and I felt it was time to give back . Next up (when I get the chance), I will post on an end-to-end system to split out current QVD's by a specified time period (aka "sharding"), then incrementally load data to these sharded QVD's and create new ones as the specified time period changes (e.g. 2013-April.qvd to 2013-May.qvd), and finally have the Tier 2 loader load these QVD's based on a specified number of time periods back. It'll be interesting to see if I can get that cleaned up for a post on here. Cheers!

Miguel_Angel_Baeyens · ‎2013-08-29

Welcome Chadd and Congratulations for such and explanatory first post! Keep them coming!

Miguel

Report Inappropriate Content · ‎2013-08-29

Note, Miguel suggested I post this to the Resource Library docs section. I have copied it over there, pending approval. http://community.qlik.com/docs/DOC-4657

Miguel_Angel_Baeyens · ‎2013-08-29

It's now live!

Report Inappropriate Content · ‎2013-08-29

Woot!

Report Inappropriate Content · ‎2013-08-30

Ouch, I just realized I made a "noob" error. When experimenting with SubField, I was looking at the QVD for the first output (without subfield), and looking at a Table Box for the second output (with subfield). The tablebox automatically eliminates redundant rows, while the QVD, of course does not. Oops

rajeshvaswani77 · ‎2013-08-30

I like the title of the post, sounds mysterious.

thanks,

Rajesh Vaswani

Report Inappropriate Content · ‎2013-08-30

I just ran my script with LOAD DISTINCT. My row count dropped from 19,244,625 to 19,012,623, but my script time increased from 12 minutes to 31 minutes. Gotta take this into consideration.