topic Re: Parse delimited column based on dynamic format, among non-delimited columns in Connectivity & Data Prep

Parse delimited column based on dynamic format, among non-delimited columns

Wed, 20 Apr 2016 22:26:34 GMT

Hello again,

My first question here was answered in short order, I'm hoping for the same here! This one is (I think) a little more complex.

I have a tool-generated .xls file that contains many columns. Two of these columns are multipurpose (which sucks). Essentially I need to parse these two columns on load and associate them with one-another in some fashion. I've attached a sample, but bear with me here:

The columns are pipe-delimited, with a space on each end of the pipe. Ex: "xyz | abc | 123"
The number of delimited attributes is dynamic. Ex: "xyz | 123" vs. the next row "123 | abc | jkl"
The order of the delimited attributes is dynamic, with respect to other rows, but will always match the same row's other column. That is to say the order of delimited column 1 on row 23 will always match the order of delimited column 2 on row 23, but not necessarily on row 24 (with one exception, see number 6).
The first delimited column contains the names of the attributes delimited in the second column. Ex: column 1 = "type | product | amount" and column 2 = "type abc | product xyz | 123"
Sometimes there are no delimiters, as there is only one value in each column.
The columns may not have the same number of delimited attributes. Ex: column 1 = "serial | product | amount" and column 2 = "product xyz | 7"

The ideal solution is one that parses the two columns into a number of new columns, let's say a maximum of 6 for each delimited column, named Custom Attribute 1 through 6, then Custom Value 1 through 6 with the corresponding values populating each new column on the row. Ex:

Delimited 1	Delimited 2	Custom Attribute 1	Custom Attribute 2	Custom Value 1	Custom Value 2
product \| serial	abc \| 1234-asdf	product	serial	abc	1234-asdf

Now if there is a better way to associate these parsed values to the rest of the table, please do pose some ideas. The real want is to be able to report on this in such a way that I can pick a row ID and say what the values are for the parsed columns, and run various metrics on the information therein, via counts, sums, etc. In a delimited column, it's somewhat useless.

I would also be ok using these delimited columns to create new tables that correspond to an associated transaction ID (in the attached, consider Random Column A or B as a unique key).

Re: Parse delimited column based on dynamic format, among non-delimited columns

swuehl — Wed, 20 Apr 2016 22:57:19 GMT

You can look into SubField() to separate the values and create new records, but one thing may be problematic:

The columns may not have the same number of delimited attributes. Ex: column 1 = "serial | product | amount" and column 2 = "product xyz | 7"

This example seems to show a missing serial in column2. Is this correct? Wouldn't this break the other rule, that the order is preserved?

Re: Parse delimited column based on dynamic format, among non-delimited columns

swuehl — Wed, 20 Apr 2016 23:05:41 GMT

Maybe try this for a start:


INPUT:
LOAD [Random Column A],
    [Delimited Field 1],
    [Delimited Field 2],
    [Random Column B],
    recno() as RecID
FROM
[Exceptions Transactions - Example.xls]
(biff, embedded labels, table is Data$);


LOAD RecID,
  Subfield( [Delimited Field 1],'|',IterNo() ) as DelField1,
  Subfield( [Delimited Field 2],'|',IterNo() ) as DelField2,
  IterNo() as DelIndex
Resident INPUT
While iterno() <= RangeMax(SubStringCount([Delimited Field 1],'|'),SubStringCount([Delimited Field 2],'|'))+1;

Re: Parse delimited column based on dynamic format, among non-delimited columns

Wed, 20 Apr 2016 23:06:18 GMT

It does break that rule, and it is generally only the case with one exception type. With the exception of the missing field, the order is preserved. I'd have to dig into the thousands of transactions to verify, but I believe I may be able to amend the rule to something like:

The order is preserved except in the instance a serial number is missing. The first delimited attribute in column 1 will be "serial number" or similar, and the first delimited attribute in column 2 SHOULD be the serial number, however it may be missing. In this case, something along the lines of the below may work:

IF(number or pipes in column 1 > number of pipes in column 2, first attribute in column 1 = null) then continue to match the other attributes together

Does that help? I will look into SubField().

Thanks!

Re: Parse delimited column based on dynamic format, among non-delimited columns

Wed, 20 Apr 2016 23:08:27 GMT

This looks like a good start. I will try this out and report back.

Thanks again,

Russ

Re: Parse delimited column based on dynamic format, among non-delimited columns

swuehl — Thu, 21 Apr 2016 14:11:18 GMT

This code is not checking for the explicite values, just aligning the available column2 values starting from the end.

For your sample data, this seems to be sufficient:

INPUT:

LOAD [Random Column A],

[Delimited Field 1],

[Delimited Field 2],

[Random Column B],

recno() as RecID

FROM

[Exceptions Transactions - Example.xls]

(biff, embedded labels, table is Data$);

LOAD RecID,

Subfield( [Delimited Field 1],'|',IterNo() ) as DelField1,

Subfield( [Delimited Field 2],'|',IterNo()-(SubStringCount([Delimited Field 1],'|')-SubStringCount([Delimited Field 2],'|') )) as DelField2,

IterNo() as DelIndex

Resident INPUT

While iterno() <= RangeMax(SubStringCount([Delimited Field 1],'|'),SubStringCount([Delimited Field 2],'|'))+1;

Re: Parse delimited column based on dynamic format, among non-delimited columns

Thu, 21 Apr 2016 18:53:30 GMT

Thanks again for your help - you're seriously good at this stuff.

This should work quite well.

Best,

Russ