The Secret Life of SubField

    NOTE: I just realized I made an amateur error on posting this..  When experimenting with SubField, I was looking at the QVD for the first output (without subfield), and looking at a Table Box for the second output (with subfield).  The tablebox automatically eliminates redundant rows, while the QVD, of course does not.  DISTINCT is necessary to eliminate redundant rows, even when using SubField. Oops

     

    The QlikView documentation describes SubField() function as the following:

    In its three-parameter version, this script function returns a given substring from a larger string s with delimiter 'delimiter'. index is an optional integer denoting which of the substrings should be returned. If index is omitted when subfield is used in a field expression in a load statement, the subfield function will cause the load statement to automatically generate one full record of input data for each substring that can be found in s.

    In its two-parameter version, the subfield function generates one record for each substring that can be taken from a larger string s with the delimiter 'delimiter'. If several subfield functions are used in the same load statement, the Cartesian product of all combinations will be generated.

     

    The really awesome feature of SubField is that it automatically eliminates duplicate records generated by the function (note: with used of DISTINCT in LOAD statement).  This is particularly handy when working with large data sets.

     

    In my current project, I have a dataset that describes what combination of projects and zips that a specific customer wants business. The dataset has three fields: Customer, Zip(List), Project(list).    Zip and Project contain comma-separated lists in their fields, which is not in a friendly format.  To do analysis in QlikView, I need to get each project and zip combination for each customer onto separate records.  Fortunately, SubField does all the heavy lifting for me.

     

    In the following example, the fields are separated by a semicolon.  Within the zip and project fileds, you can see the comma separated lists.

    customer;zip;project

    338;87124,87125;1,2

    338;87124;1,3

    339;87124,87129;1,2,3

    339;94903;1

    338;87124;1,3

     

    Loading this into QlikView without SubField:

    LOAD
    customer,
    zip,
    project
    FROM
    test.txt
    (
    txt, codepage is 1252, embedded labels, delimiter is ';', msq);

     

    I get the following unusable table.:

    customerzipproject
    33887124,871251,2
    338871241,3
    33987124,871291,2,3
    339949031
    33887124,871251,3

     

    However, incorporating the SubField() function to break out the zip and project fields:

    LOAD DISTINCT
    customer,
    SubField(zip,',')         as zip,
    Subfield(project,',')     as project
    FROM
    test.txt
    (
    txt, codepage is 1252, embedded labels, delimiter is ';', msq);

     

     

    I get a Cartesian join containing every DISTINCT combination of zip and project for each customer:

    customerzipproject
    338871241
    338871242
    338871243
    338871251
    338871252
    339871241
    339871242
    339871243
    339871291
    339871292
    339871293
    339949031

    Even though I have repeated data in my input, my output contains only unique combinations.  I found this exceptionally useful. My production input dataset contained 78,000 lines.  After applying SubField to the LOAD statement on Project and Zip, my table resulted in 20 million unique records ready for analysis.

     

    This is my first post of any type in the QlikView Community. I have read and learned so much from many other folks on this forum, and I felt it was time to give back .  Next up (when I get the chance), I will post on an end-to-end system to split out current QVD's by a specified time period (aka "sharding"), then incrementally load data to these sharded QVD's and create new ones as the specified time period changes (e.g. 2013-April.qvd to 2013-May.qvd), and finally have the Tier 2 loader load these QVD's based on a specified number of time periods back.  It'll be interesting to see if I can get that cleaned up for a post on here.  Cheers!