QlikView Expressor: How To Process Records Rejected by a Read File Operator

    This is a companion article to How To Process Rejected Records. That deals with records rejected by the Read Excel operator and applies to Read operators in other expressor Extensions and to the Read Custom operator.

    Parsing records rejected by the Read File operator is simpler than parsing records from Read Excel and related operators. The Read File operator retains the sequence of fields and attributes when it writes them to the RecordData field of the Reject Record Schema. Because of that, it is not necessary for the Read File operator to write header records to describe the field and attribute sequence. That greatly simplifies the datascript required to parse the RecordData field and reconstruct the rejected records.

    As explained in the Processing Rejected Records article, when a Read operator rejects records, it may be sufficient to simply send those records to a Write File operator and examine the records in the output file. If the intent is, however, to correct or otherwise use those records, examining each error and changing the data could be very cumbersome. The more efficient approach would be to reprocess the records as they come out the reject port.

    Records rejected by input operators such as Read File and Read Table are structured into the following fields:

    RejectType
    RecordNumber
    RecordData
    RejectReason
    RejectMessage

    The record data as it was constituted before being rejected is contained in the RecordData field. To process that data, it must first be reconstructed from the rejected record format. That requires extracting the data from the RecordData field and allocating each of the CSV elements in it to the original record fields.

    Note: All non-input operators that have  a reject port emit rejected records with the existing attributes of the  record, that is, they do not restructure the records the way input  operators do. Reprocessing records rejected by non-input operators do  not have to be reconstructed the way records rejected by input operators  do.

    The Transform operator is used to reconstruct rejected records. To illustrate the logic used to reconstruct rejected data, we will use a CSV file containing a list of US presidents. The following table shows the header and first few rows in the file. Notice that Thomas Jefferson's first name field is blank. That is one record that will produce an error.

    place,lname,fname,party
    1,Washington,George,none
    2,Adams,John,Federalist
    3,Jefferson,Democratic-Republican
    4,Madison,James,Democratic-Republican
    5,Monroe,James,Democratic-Republican
    6,Adams,John Q,Democratic-Republican
    7,Jackson,Andrew,Democratic
    8,Van Buren,Martin,Democratic
    9,Harrison,William H,Whig
    10,Tyler,John,Whig

    The following dataflow uses an Aggregate operator to count the number of Democratic presidents and the number of Republican presidents in United States history.

    657d1336771699-bpreadfiledf1.png

    The Read File operator uses Semantic Type constraints to separate Democratic and Republican party presidents from all the others, such as Whig and Democratic-Republican. The following screen shows the constraint set on Allowed values.658d1337002926-bpcsvattributedb.png

    This constraint setting (Democratic,Republican) will cause all the records whose "party" value is not Democratic or Republican to be rejected.

    To illustrate the rejected records that will be sent out the Read File operator's reject port, we will use the following dataflow, which is like the one above but

    with the Write File  operator attached directly to the reject port. This allows us to see the format of the rejected records.

    659d1337003290-bpcsvtestdf.png

    The CSV file produced by the Write File 3 operator is:

    _________________________________________________________________

    RejectType,RecordNumber,RecordData,RejectReason,RejectMessage
    1,2,"1,Washington,George,none","1|TRANSLITERATE-1021,party,none",The value 'none' for attribute 'party' is not a member of the allowed list
    1,3,"2,Adams,John,Federalist","1|TRANSLITERATE-1021,party,Federalist",The value 'Federalist' for attribute 'party' is not a member of the allowed list
    3,4,"3,Jefferson,Democratic-Republican",0|notEnoughFields,Input record 4 from file '.\presidents.dat' is being rejected because only 3 of the expected 4 fields are present.
    1,5,"4,Madison,James,Democratic-Republican","1|TRANSLITERATE-1021,party,Democratic-Republican",The value 'Democratic-Republican' for attribute 'party' is not a member of the allowed list
    1,6,"5,Monroe,James,Democratic-Republican","1|TRANSLITERATE-1021,party,Democratic-Republican",The value 'Democratic-Republican' for attribute 'party' is not a member of the allowed list
    1,7,"6,Adams,John Q,Democratic-Republican","1|TRANSLITERATE-1021,party,Democratic-Republican",The value 'Democratic-Republican' for attribute 'party' is not a member of the allowed list
    1,10,"9,Harrison,William H,Whig","1|TRANSLITERATE-1021,party,Whig",The value 'Whig' for attribute 'party' is not a member of the allowed list
    1,11,"10,Tyler,John,Whig","1|TRANSLITERATE-1021,party,Whig",The value 'Whig' for attribute 'party' is not a member of the allowed list
    1,13,"12,Taylor,Zachary,Whig","1|TRANSLITERATE-1021,party,Whig",The value 'Whig' for attribute 'party' is not a member of the allowed list
    1,14,"13,Fillmore,Millard,Whig","1|TRANSLITERATE-1021,party,Whig",The value 'Whig' for attribute 'party' is not a member of the allowed list

    __________________________________________________________________

    The first row is the standard header row for a rejected records file. The subsequent rows contain the rejected records. All but the third have a RejectType 1, which indicates the records were rejected for constraint violations. The RejectMessage for all of them indicates they were rejected because they did not have a value for the "party" attribute that was either Democratic or Republican.

    The third record has a RejectType 3, which indicates there was an error encountered while reading the data from the CSV file. In this case, the RejectMessage reports that one of the expected four fields was missing. As we saw above in the CSV file, the record for Thomas Jefferson is missing the fname field.

    Note that the sequence in which the fields are presented in the RecordData field of the rejected record output is the same as in the CSV file and in this Schema and matching Composite Type.

    661d1337005227-bpcsvschema.png

    Because the sequence of record fields in the RecordData field is consistent and predictable, the process of reconstructing the original records is much simpler than that required to reconstruct records rejected by operators such as Read Excel (see How To Process Rejected Records).

    Returning to the first dataflow shown above, we will now configure the Transform operator to reconstruct the original records before they were rejected in the Read File operator.

    The logic of the transformation is contained in one Function Rule, labeled Rule_1.

    663d1337006535-bpcsvfunctionrule.png

    The RecordData attribute from the rejected record is mapped to the Rule_1 input. The output attributes come from the Composite Type in the downstream Write File 2 operator.  Those attributes are the four from the original records plus an  attribute to capture the RejectMessage content so that it can be kept  with the record. The five output attributes that parallel the input attributes (marked by the arrow symbols) appear automatically in the output attributes, but they are not used because the downstream operator uses the other output attributes.

    The transform function uses a function from an expressor library, so that library must be included with a require statement at the beginning of the rule. The rule also constructs three variables as tables. The data in the header table is constant; it is the header field names used in the original input records. They do not have to be exactly the same as the original header names, but it simplifies the transformation to use the same names. The names must match the names used by the downstream operator, and in this case, the Write File 2 operator uses the PresidentsRejectSchema, which uses the original header names.

    The transform function begins by using the expressor.csv.lineToCSV function to convert the data in RecordData to a CSV record. The resulting record is the original input record read by the Read File operator.

    Next, a for loop uses the ipairs function (one of the Datascript basic functions) to match the header fields in the indexed header table to the values in the indexed rec table and places them in the output table, which becomes a dictionary or associative array:

    output { fname="George",lname="Washington",party="",place=1,Error_Message="The value 'none' for attribute 'party' is not a member of the allowed list" }

    The labeled values become the output parameters of the Function Rule, and they go to the matching attributes in the Transform operator's output.

    While the transform function processes the RecordData field from each record, the input attribute RejectMessage is mapped directly to the Error_Message output attribute so that it can be transferred downstream with the record data.

    In this example, when the rejected records have been processed by the Transform operator, they are propagated to a Write File operator so the results can be viewed. In most cases, further processing would be performed after the Transform operator, but in this case, we simply want to see the results of the processing in the Transform operator. The Write File 2 operator  uses a Schema that maps to the attributes propagated from the Transform operator:

    666d1336165551-bppresidentsrejectschema.png

    The Error_Message attribute is mapped to the RejectReason field in the Delimited Schema. The output file to which the Write File 2 sends the data appears as follows:

    ___________________________________________________________

    place,lname,fname,party,RejectReason
    1,Washington,George,none,The value 'none' for attribute 'party' is not a member of the allowed list
    2,Adams,John,Federalist,The value 'Federalist' for attribute 'party' is not a member of the allowed list
    3,Jefferson,Democratic-Republican,,Input record 4 from file '.\presidents.dat' is being rejected because only 3 of the expected 4 fields are present.
    4,Madison,James,Democratic-Republican,The value 'Democratic-Republican' for attribute 'party' is not a member of the allowed list
    5,Monroe,James,Democratic-Republican,The value 'Democratic-Republican' for attribute 'party' is not a member of the allowed list
    6,Adams,John Q,Democratic-Republican,The value 'Democratic-Republican' for attribute 'party' is not a member of the allowed list
    9,Harrison,William H,Whig,The value 'Whig' for attribute 'party' is not a member of the allowed list
    10,Tyler,John,Whig,The value 'Whig' for attribute 'party' is not a member of the allowed list
    12,Taylor,Zachary,Whig,The value 'Whig' for attribute 'party' is not a member of the allowed list
    13,Fillmore,Millard,Whig,The value 'Whig' for attribute 'party' is not a member of the allowed list

    _____________________________________________________________