2 Replies Latest reply: Oct 5, 2015 3:02 PM by yogendra ch

# Percentage of similarity of two fields

Hi to everyone.

For a project I'm making I need to compare two fields in a strange manner.

I make an example to let you understand what my problem is:

ORIGINAL TABLE:

ID     Supplier

1          A

1          D

1          B

1          A

2          B

2          C

2          A

2          D

3          B

3          E

TABLE THAT I CREATE:

ID          List of Suppliers

1                   A, B, D

2                   A, B, C, D

3                   B, E

Now I need to select the "List of Suppliers" that reapeat more than once. But not only if they appear identical, all the list of suppliers that appear equal for the 70%.

So in this case the flag will be:

ID          List of Suppliers          FLAG

1                   A, B, D                   X               because "A, B, D" is fully repeated in "A, B, C, D"

2                   A, B, C, D              X               because the 75% of "A, B, C, D" is repeated in "A, B, D"

3                   B, E                        -                because only the 50% of "B, E" is repeated

Is it possible?

Many thanks to who can help!

Filippo

• ###### Re: Percentage of similarity of two fields

I think you will need several steps for this. There are to define which supplier-entry will be the master to which will all other values compared - I assume those ID with the most suppliers and this string-list-value and the number of suppliers with them needs you as variable-values against each of the other string-list-values will be in loop compared. Maybe something like this:

t1:

Load ID, count(ID) as MaxCountID, concat(Supplier, '|') as SupplierList From xyz Group By ID;

t2:

Noconcatenate Load * Resident t1 order by MaxCountID;

let vMaxCountID = peek('MaxCountID', -1, 't2');

let vSupplierList = peek('SupplierList', -1, 't2');

t3:

ID, \$(vMaxCountID) as MaxCountID, SupplierList,

if(index('\$(vSupplierList)', subfield(SupplierList, '|', iterno())>=1, 1, 0) as FlagTemp

Resident t1 while iterno() <= substringcount(SupplierList, '|');

t4:

ID, MaxCountID, SupplierList,

if(MaxCountID * 0.7 < sum(FlagTemp), 0, 1) as Flag

Resident t3 Group Bx ID, MaxCountID, SupplierList;

drop tables t1, t2, t3;

let vMaxCountID = null(); let vSupplierList = null();

It's just written down without any testing and there might be one or another syntax- or logical issue but I think it give you some ideas how it could be solved.

- Marcus

• ###### Re: Percentage of similarity of two fields

Try this:

SampleData:

[

ID,Supplier

1,A

1,D

1,B

1,A

2,B

2,C

2,A

2,D

3,B

3,E

]

;

test1:

LOAD ID, Count(DISTINCT Supplier) as NoSupplier

Resident SampleData

Group By ID;

join

LOAD Count( DISTINCT Supplier) as TotalSupplier

Resident SampleData

;

DROP Table SampleData;

test2: