2 Replies Latest reply: Oct 5, 2015 3:02 PM by yogendra ch RSS

    Percentage of similarity of two fields

      Hi to everyone.

      I need ad advise.

      For a project I'm making I need to compare two fields in a strange manner.

      I make an example to let you understand what my problem is:

       

      ORIGINAL TABLE:

      ID     Supplier

      1          A

      1          D

      1          B

      1          A

      2          B

      2          C

      2          A

      2          D

      3          B

      3          E

       

       

      TABLE THAT I CREATE:

      ID          List of Suppliers

      1                   A, B, D

      2                   A, B, C, D

      3                   B, E

       

       

      Now I need to select the "List of Suppliers" that reapeat more than once. But not only if they appear identical, all the list of suppliers that appear equal for the 70%.

      So in this case the flag will be:

       

      ID          List of Suppliers          FLAG

      1                   A, B, D                   X               because "A, B, D" is fully repeated in "A, B, C, D"

      2                   A, B, C, D              X               because the 75% of "A, B, C, D" is repeated in "A, B, D"

      3                   B, E                        -                because only the 50% of "B, E" is repeated

       

      Is it possible?

      Many thanks to who can help!

      Filippo

        • Re: Percentage of similarity of two fields
          Marcus Sommer

          I think you will need several steps for this. There are to define which supplier-entry will be the master to which will all other values compared - I assume those ID with the most suppliers and this string-list-value and the number of suppliers with them needs you as variable-values against each of the other string-list-values will be in loop compared. Maybe something like this:

           

          t1:

          Load ID, count(ID) as MaxCountID, concat(Supplier, '|') as SupplierList From xyz Group By ID;

           

          t2:

          Noconcatenate Load * Resident t1 order by MaxCountID;

           

          let vMaxCountID = peek('MaxCountID', -1, 't2');

          let vSupplierList = peek('SupplierList', -1, 't2');

           

          t3:

          Load

               ID, $(vMaxCountID) as MaxCountID, SupplierList,

               if(index('$(vSupplierList)', subfield(SupplierList, '|', iterno())>=1, 1, 0) as FlagTemp

          Resident t1 while iterno() <= substringcount(SupplierList, '|');

           

          t4:

          Load

               ID, MaxCountID, SupplierList,

               if(MaxCountID * 0.7 < sum(FlagTemp), 0, 1) as Flag

          Resident t3 Group Bx ID, MaxCountID, SupplierList;

           

          drop tables t1, t2, t3;

          let vMaxCountID = null(); let vSupplierList = null();

           

          It's just written down without any testing and there might be one or another syntax- or logical issue but I think it give you some ideas how it could be solved.

           

          - Marcus

          • Re: Percentage of similarity of two fields
            yogendra ch

            Try this:

             

             

            SampleData:

            LOAD * Inline

            [

            ID,Supplier

             

            1,A

             

            1,D

             

            1,B

             

            1,A

             

            2,B

             

            2,C

             

            2,A

             

            2,D

             

            3,B

             

            3,E

             

            ]

            ;

             

             

            test1:

             

            LOAD ID, Count(DISTINCT Supplier) as NoSupplier

            Resident SampleData

            Group By ID;

             

            join

             

            LOAD Count( DISTINCT Supplier) as TotalSupplier

            Resident SampleData

            ;

             

            DROP Table SampleData;

             

            test2:

             

              LOAD ID,NoSupplier,TotalSupplier,if(NoSupplier/TotalSupplier>=0.5,'Yes','No') as Flag

            Resident test1

            ;

             

             

            DROP Table test1;