Reading an XML Document

    As of version 3.8, QlikView Expressor does not include an input operator to read an XML document.  If you want to use an XML document as input data to a dataflow, you must use the Read Custom operator to parse the document into individual records.  In developing this code, you rely on the Lua Object Model, which is included as a datascript module in the QlikView Expressor product.

     

    Before discussing the coding, let's examine a simple XML document displayed in an XML viewer such as Internet Explorer.

    xml_document.png

    While the XML viewer displays the document's content in a clear format, the document's actual content might be clearly formatted, with a structure identical to that displayed by the XML viewer, or it may lack formatting with the entire text contained in a single line.  When you use the Lua Object Model it is essential that you know how the document's content is formatted.

     

    The Lua Object Model converts an XML document into a numerically indexed Expressor Datascript table where each entry in this table is itself another table that contains one string indexed entry (index name 'tag') that is the name of the XML element's tag, and one or more numerically indexed entries that may be either the XML element's value or another table representing a child element.  The XML declaration and the root element are not included in this table.

     

    Because the processing is a little easier to follow, let's first consider the case where the XML document is not formatted and the entire content is contained in a single line of text (although in this document, the content spans multiple lines)

     

    <?xml version="1.0" encoding="utf-8"?><Root><Customers><Customer CustomerID="XYZ"><CompanyName>XYZ Market</CompanyName><ContactName>John Doe</ContactName><ContactTitle>Marketing Manager</ContactTitle><Phone>(212) 555-1212</Phone><FullAddress><Address>100 Park Avenue.</Address><City>New York</City><State>NY</State></FullAddress></Customer></Customers></Root>

     

    Assuming that the XML document's file system location is C:\data\Customers.xml, your code to read and parse this document into the Lua Object Model would be similar to the following.

     

    -- reference the datascript module
    require "lom"

    -- setup XML parser
    lom = lxp.lom

    -- create a file handle to the file
    handle = io.open("C:\\data\\Customers.xml","r")

    -- read and parse the XML document
    tbl = lom.parse(handle:read("*a"))
    handle:close()

     

    The variable tbl now contains the object model of the XML document.  Let's examine each entry in this table.

     

    Since tbl is a numerically indexed table, the first entry at index 1 (that is,tbl[1]) is another table with a string indexed entry with the index 'tag'.  If you were to display the value of this table entry (that is, tbl[1].tag), you would see the XML tag name Customers.  This nested table also contains a numerically indexed entry.  If you examine the original XML document, you will see that the <Customers> element contains the child element <Customer>, so the value of the numerically indexed entry is another Lua Object Model table (and tbl[1][1].tag is Customer).  And since tbl[1][1] is also a Lua Object Model table, it includes a numerically indexed entry.  As the <Customer> tag contains multiple child elements, the value of tbl[1][1][1] is also a table with the tag value CompanyName.  But now the XML element CompanyName contains CDATA not another child element, so the value of its numerically indexed entry (that is, tbl[1][1][1][1]) is XYZ Market.

     

    Similar analysis will lead to the name of the contact person (tbl[1][1][2].tag is ContactName and tbl[1][1][2][1] is John Doe) as well as the other details for this customer.

     

    If the XML docuent contained multiple <Customer> elements, the extraction logic would remain the same except that the index value for the second dimension would be incremented from 1 to 2 (that is, tbl[1][2][1][1] would be the name of a second company).  In order to extract all of the details of each company from the XML document, your code would need to include a loop, incrementing the second dimension with each iteration.  If  tbl[1][#] returns nil instead of a table, then you know you have processed all the <Customer> elements.

     

    Once you have a feeling for how to work through the Lua Object Model, the coding in the Read Custom operator is straight-forward.

    xml_document_2.png

    Now, what about the situation in which the original XML document is nicely formatted.  In this case, when the document is parsed, each new line control character is also added as an independent entry in the Lua Object Model table.  In this situation, the new line that follows the opening root tag is the table entry tbl[1] and the entry corresponding to the <Customers> element is tbl[2].  Likewise, the <Customer> element is tbl[2][2].  That is, to move through the Lua Obect Model table you will need to set the initial indices to 2 and increment by 2 rather than 1 as illustrated in the following code.

    xml_document_5.png

    There is one more piece of information that needs to be extracted from each <Customer>, the value of the CustomerID attribute.  In the Lua Object Model, each XML element's attributes are contained in a table with the index value 'attr'.  The above coding could include the following additional line of code immediately after declaring the table customer.  You would also need to add a CustomerID attribute to the output.

     

    xml_document_6.png

     

    **********************

     

    Let's examine another example that is a little more 'real world.'  In the Microsoft SQL Server Adventure Works database, the table Sales.Individual includes a column - Demographics - that contains XML content.  Each record's entry in this column is unformatted and stored as a single line.  If you copy one entry into a text document and open it in an XML viewer, you will see the names of the individual elements in the entry.

    xml_element.png

    When parsed into the Lua object model, this XML document will be converted into a numerically indexed Expressor Datascript table and each entry in that table will be another table.  Each of these nested tables will include one string indexed entry ('tag') and one numerically indexed entry that contains the value of that element in the XML document.  Since there are no nested elements or formatting in the original XML document, your code can iterate through the top level table using the ipairs iterator function.  For each iteration, ipairs returns the key and the table entry referenced by that key.  In the following code fragment, which shows code that could be in a transform operator function rule, the XML value in the Demographics column from each record in the Sales.Individual table is copied into a variable and then parsed into the Lua object model and the individual XML element values assigned to the function rule's output parameters.

     

    require 'lom'

    lom = lxp.lom

    content = input.Demographics

    parsed_content = lom.parse(content)

    for k,v in ipairs(parsed_content) do
      output[v.tag = v[1])
    end

     

    For additional information, see the description of the Lua Object Model and the Writing an XML Document knowledge base article for a more general discussion on handling attributes in the XML elements.