7 Replies Latest reply: Aug 30, 2013 10:38 AM by Chadd McNicholas RSS

    The Secret Life of SubField

      The QlikView documentation describes SubField() function as the following:

      In its three-parameter version, this script function returns a given substring from a larger string s with delimiter 'delimiter'. index is an optional integer denoting which of the substrings should be returned. If index is omitted when subfield is used in a field expression in a load statement, the subfield function will cause the load statement to automatically generate one full record of input data for each substring that can be found in s.

      In its two-parameter version, the subfield function generates one record for each substring that can be taken from a larger string s with the delimiter 'delimiter'. If several subfield functions are used in the same load statement, the Cartesian product of all combinations will be generated.

       

      NOTE: Ouch, I just realized I made a "noob" error on posting this..  When experimenting with SubField, I was looking at the QVD for the first output (without subfield), and looking at a Table Box for the second output (with subfield).  The tablebox automatically eliminates redundant rows, while the QVD, of course does not.  Oops

       

      The really awesome feature of SubField is that it automatically eliminates duplicate records generated by the function.  This is particularly handy when working with large data sets.

       

      In my current project, I have a dataset that describes what combination of projects and zips that a specific customer wants business. The dataset has three fields: Customer, Zip(List), Project(list).    Zip and Project contain comma-separated lists in their fields, which is not in a friendly format.  To do analysis in QlikView, I need to get each project and zip combination for each customer onto separate records.  Fortunately, SubField does all the heavy lifting for me.

       

      In the following example, the fields are separated by a semicolon.  Within the zip and project fileds, you can see the comma separated lists.

      customer;zip;project

      338;87124,87125;1,2

      338;87124;1,3

      339;87124,87129;1,2,3

      339;94903;1

      338;87124;1,3

       

      Loading this into QlikView without SubField:

      LOAD
      customer,
      zip,
      project
      FROM
      test.txt
      (
      txt, codepage is 1252, embedded labels, delimiter is ';', msq);

       

      I get the following unusable table.:

      customerzipproject
      33887124,871251,2
      338871241,3
      33987124,871291,2,3
      339949031
      33887124,871251,3

       

      However, incorporating the SubField() function to break out the zip and project fields:

      LOAD
      customer,
      SubField(zip,',')         as zip,
      Subfield(project,',')     as project
      FROM
      test.txt
      (
      txt, codepage is 1252, embedded labels, delimiter is ';', msq);

       

       

      I get a Cartesian join containing every DISTINCT combination of zip and project for each customer:

      customerzipproject
      338871241
      338871242
      338871243
      338871251
      338871252
      339871241
      339871242
      339871243
      339871291
      339871292
      339871293
      339949031

      Even though I have repeated data in my input, my output contains only unique combinations.  I found this exceptionally useful. My production input dataset contained 78,000 lines.  After applying SubField to the LOAD statement on Project and Zip, my table resulted in 20 million unique records ready for analysis.

       

      This is my first post of any type in the QlikView Community. I have read and learned so much from many other folks on this forum, and I felt it was time to give back .  Next up (when I get the chance), I will post on an end-to-end system to split out current QVD's by a specified time period (aka "sharding"), then incrementally load data to these sharded QVD's and create new ones as the specified time period changes (e.g. 2013-April.qvd to 2013-May.qvd), and finally have the Tier 2 loader load these QVD's based on a specified number of time periods back.  It'll be interesting to see if I can get that cleaned up for a post on here.  Cheers!