QlikView Expressor: Parsing an XML Document

    Originally published on 07-21-2011 12:12 AM

     

    Expressor 3.3 (and any earlier release) does not include an input operator  that will read and parse an XML document.  However, expressor does  include a SAX XML parser module that you can use to incrementally process XML.  When you use this parser, events, such as the start or end  of an element or the existence of CDATA, are reported to your code through a callback function.  Code you write within these callback functions extracts the attribute and element data.

    In order to use the SAX XML parser module, your script must include the Datascript Module that implements the parser functionality.  Include the  following statement in your script.

    require "lxp"

    The lxp module includes one constructor method used to create an instance of the parser and eight parser methods.  The following table summarizes the module methods.

    sax_methods1.png

    Before you can create an instance of the parser, however, you must first define an Expressor Datascript table that contains the callback function implementations.  It is not necessary to provide implementations for all of the callback functions, but only the functions included in this table when the parser is instantiated will be called.  Consequently, if you are uncertain as to whether you will need  a callback function, include a function defined as false; this implementation can then be redefined at any time.  The following table summarizes the callback functions (more detail is available online).

    sax_callback_functions1.png

    Let's develop an example, which will illustrate how straight-forward parsing an XML document can be.  In this example, you need to process a document, menus.xml, that sets out the menu at a restaurant.

    <?xml version="1.0" encoding="utf-8" ?>
    <menus restaurant="Jack's Pit Stop">
      <menu meal="breakfast_menu">
        <food>
          <name>Belgian Waffles</name>
          <price currency="dollars">5.95</price>
          <description>two of our famous Belgian Waffles with plenty of real maple syrup</description>
          <calories>650</calories>
        </food>
        <food>
          <name>Strawberry Belgian Waffles</name>
          <price currency="dollars">7.95</price>
          <description>light Belgian waffles covered with strawberries and whipped cream</description>
          <calories>900</calories>
        </food>
        <food>
          <name>Berry-Berry Belgian Waffles</name>
          <price currency="dollars">8.95</price>
          <description>light Belgian waffles covered with an assortment of fresh berries and whipped cream</description>
          <calories>900</calories>
        </food>
        <food>
          <name>French Toast</name>
          <price currency="dollars">4.50</price>
          <description>thick slices made from our homemade
           sourdough bread</description>
          <calories>600</calories>
        </food>
        <food>
          <name>Homestyle Breakfast</name>
          <price currency="dollars">6.95</price>
          <description>two eggs, bacon or sausage, toast, and our ever-popular hash browns</description>
          <calories>950</calories>
        </food>
       </menu>
       <menu meal="lunch_menu">
        <food>
          <name>Hamburger</name>
          <price currency="dollars">5.95</price>
          <description>All beef patty, special sauce, lettuce, cheese, pickles, onions on a bun</description>
          <calories>650</calories>
        </food>
        <food>
          <name>BLT</name>
          <price currency="dollars">7.95</price>
          <description>Bacon, lettuce and tomato on wheat with french fries</description>
          <calories>900</calories>
        </food>
        <food>
          <name>Club sandwich</name>
          <price currency="dollars">8.95</price>
          <description>Ham, turkey, cheese, lettuce and tomato on toast with french fries</description>
          <calories>900</calories>
        </food>
      </menu>
      <menu meal="dinner_menu">
        <food>
          <name>Chicken fingers</name>
          <price currency="dollars">15.95</price>
          <description>Chicken fingers served with honey mustard sauce and corn on the cob</description>
          <calories>650</calories>
        </food>
        <food>
          <name>Steak</name>
          <price currency="dollars">17.95</price>
          <description>New York strip served with mashed potatoes and vegetable of the day</description>
          <calories>900</calories>
        </food>
        <food>
          <name>Lobster</name>
          <price currency="dollars">18.95</price>
          <description>Maine lobster steamed and served with melted butter,baked potato
                              and vegetable of the day</description> 
          <calories>900</calories>
        </food>
      </menu>
    </menus>

    In your application, you want to extract all the information about each food option with the exception of the description and process each food as if it were a separate record.  Therefore, for each food the meal (breakfast, lunch, dinner), name, price and calories will be included in the record.  The price entry should also include the currency.

    All of the scripting needed to parse the XML document is included in the initialize function of the Read Custom operator.  This scripting not only processes the XML document, but it also creates an Expressor Datascript table that contains each of the records that will be emitted by the operator.  Then in the read function, the code simply iterates through this table, emitting individual records.  The following figure shows the scripting within the initialize function.

    sax_read_custom1.png

    The extraction of the data is all performed in the StartElement callback function.  In the main body of this function, the meal and currency attribute values are extracted and transferred into an Expressor Datascript table.  Note that the CharacterData callback function is redefined within the StartElement function and is responsible for extracting the food name, price, and calories values and transferring them to the expressor Datascript table. The statement on line 22 may appear confusing.  Since white space, which includes control characters such as new lines, is treated as character data, the string.find function with a pattern capture is used to ensure that only character data composed of alphanumeric and punctuation characters are extracted from the XML document. In the EndElement callback function, the table holding the particulars of each food are transferred into another table named results.  The results table is then available to the read function, which extracts each entry and emits it as a separate record.

    sax_read_custom2.png

    What makes the coding within the read function especially easy is the fact that the entries in the table food have the same index names as the entries in the output record.  Consequently, it is a simple matter to assign each element of the table results to an output record.  There is no need to explicitly transfer each value into the output record.