Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 
Not applicable

The Secret Life of SubField

The QlikView documentation describes SubField() function as the following:

In its three-parameter version, this script function returns a given substring from a larger string s with delimiter 'delimiter'. index is an optional integer denoting which of the substrings should be returned. If index is omitted when subfield is used in a field expression in a load statement, the subfield function will cause the load statement to automatically generate one full record of input data for each substring that can be found in s.

In its two-parameter version, the subfield function generates one record for each substring that can be taken from a larger string s with the delimiter 'delimiter'. If several subfield functions are used in the same load statement, the Cartesian product of all combinations will be generated.

NOTE: Ouch, I just realized I made a "noob" error on posting this..  When experimenting with SubField, I was looking at the QVD for the first output (without subfield), and looking at a Table Box for the second output (with subfield).  The tablebox automatically eliminates redundant rows, while the QVD, of course does not.  Oops

The really awesome feature of SubField is that it automatically eliminates duplicate records generated by the function.  This is particularly handy when working with large data sets.

In my current project, I have a dataset that describes what combination of projects and zips that a specific customer wants business. The dataset has three fields: Customer, Zip(List), Project(list).    Zip and Project contain comma-separated lists in their fields, which is not in a friendly format.  To do analysis in QlikView, I need to get each project and zip combination for each customer onto separate records.  Fortunately, SubField does all the heavy lifting for me.

In the following example, the fields are separated by a semicolon.  Within the zip and project fileds, you can see the comma separated lists.

customer;zip;project

338;87124,87125;1,2

338;87124;1,3

339;87124,87129;1,2,3

339;94903;1

338;87124;1,3

Loading this into QlikView without SubField:

LOAD
customer,
zip,
project
FROM
test.txt
(
txt, codepage is 1252, embedded labels, delimiter is ';', msq);

I get the following unusable table.:

customerzipproject
33887124,871251,2
338871241,3
33987124,871291,2,3
339949031
33887124,871251,3

However, incorporating the SubField() function to break out the zip and project fields:

LOAD
customer,
SubField(zip,',')         as zip,
Subfield(project,',')     as project
FROM
test.txt
(
txt, codepage is 1252, embedded labels, delimiter is ';', msq);

I get a Cartesian join containing every DISTINCT combination of zip and project for each customer:

customerzipproject
338871241
338871242
338871243
338871251
338871252
339871241
339871242
339871243
339871291
339871292
339871293
339949031

Even though I have repeated data in my input, my output contains only unique combinations.  I found this exceptionally useful. My production input dataset contained 78,000 lines.  After applying SubField to the LOAD statement on Project and Zip, my table resulted in 20 million unique records ready for analysis.

This is my first post of any type in the QlikView Community. I have read and learned so much from many other folks on this forum, and I felt it was time to give back .  Next up (when I get the chance), I will post on an end-to-end system to split out current QVD's by a specified time period (aka "sharding"), then incrementally load data to these sharded QVD's and create new ones as the specified time period changes (e.g. 2013-April.qvd to 2013-May.qvd), and finally have the Tier 2 loader load these QVD's based on a specified number of time periods back.  It'll be interesting to see if I can get that cleaned up for a post on here.  Cheers!

7 Replies
Miguel_Angel_Baeyens

Welcome Chadd and Congratulations for such and explanatory first post! Keep them coming!

Miguel

Not applicable
Author

Note, Miguel suggested I post this to the Resource Library docs section.  I have copied it over there, pending approval. http://community.qlik.com/docs/DOC-4657

Miguel_Angel_Baeyens

It's now live!

Not applicable
Author

Woot!

Not applicable
Author

Ouch, I just realized I made a "noob" error.  When experimenting with SubField, I was looking at the QVD for the first output (without subfield), and looking at a Table Box for the second output (with subfield).  The tablebox automatically eliminates redundant rows, while the QVD, of course does not.  Oops

rajeshvaswani77
Specialist III
Specialist III

I like the title of the post, sounds mysterious.

thanks,

Rajesh Vaswani

Not applicable
Author

I just ran my script with LOAD DISTINCT. My row count dropped from 19,244,625 to 19,012,623, but my script time increased from 12 minutes to 31 minutes.  Gotta take this into consideration.