10 Replies Latest reply: Oct 5, 2010 4:07 AM by Amien Amien RSS

    Fuzzy join possible?

    Martin Wigchert

      Have the following issue:

      One table with transaction data [TRANSACTIONS]. In this table there is a (free format) field that contains the description of the transaction.

      Have created a mapping table with keywords and categories [MAPPINGS].

      Now I like to use the mapping table in such a way that it searches for the keywords in the description field of the TRANSACTION table. It can then take the categories of the first matching/found keyword.

      Are there possibilities in QlikView to do such a fuzzy join or fuzzy mapping?

      Table TRANSACTIONS (simplified view):

      • DATE
      • VALUE
      • DESCRIPTION

      Table MAPPINGS:

      • KEYWORD
      • CATEGORY
        • Fuzzy join possible?
          Vlad Gutkovsky

          Well, it's definitely not possible with mapping since that is a very simple 1-input-1-output method. As far as I know there is no built-in way to do a fuzzy join either, although you might be able to code this. For example, you might be able to search for a keyword using the index() or substringcount() functions. However, I imagine this would only work if you don't have too many unique values in the Keyword field. Essentially, the solution I had in mind would look like this:

           


          JOIN (TRANSACTIONS) LOAD DISTINCT * RESIDENT MAPPINGS; //large outer join

          Transactions_new:
          LOAD
          DATE,
          VALUE,
          DESCRIPTION,
          CATEGORY
          RESIDENT TRANSACTIONS
          WHERE substringcount(DESCRIPTION,KEYWORD)>0;

          DROP TABLE TRANSACTIONS;


          This would create a Cartesian join (a join of every possible combination of both tables) because the fields Description and Keyword start off as disconnected. So, for example, if Transactions has 1M rows and Mappings has 10K rows, the resulting table from the outer join would contain 10B rows which I doubt your server (or your patience) could handle. So the short answer to your question is: maybe, if you don't have too many keywords.

          Edit: By the way, if DESCRIPTION is not 100% unique, it would be cheaper to create an intermediate table that contains just DISTINCT DESCRIPTION and perform the outer join from MAPPINGS into that intermediate table. Afterwards, you can either leave the table as is or left join it back into TRANSACTIONS with CATEGORY now included.

          Regards,

          • Fuzzy join possible?
            Karl Pover

            Download Rob Wunderlich's cookbook that has an example of doing a fuzzy mapping or mapping with wildcards.

            http://robwunderlich.com/Download.html

            Regards.

              • Fuzzy join possible?
                Vlad Gutkovsky

                Wow, I humbly defer to Rob on this. Brilliant solution!

                • Fuzzy join possible?
                  Martin Wigchert

                  This indeed seems to be what I was looking for!

                  Thank you very much for making me aware of these solutions from Rob Wunderlich.

                  Great to have this forum available when you have the feeling that you got stuck Smile

                   

                  • Fuzzy join possible?
                    Amien Amien

                    where can i find the fuzzy option in the cookbook? can only find the wildcardmatch

                    also check out this option : mapsubstring

                    http://community.qlik.com/forums/t/25723.aspx

                    i had the following issues with the solution in the cookbook, perhaps someone allready found a solution for this? :

                    * issues when i have 2 hits

                    * i wanted to put the no hits in a seperate category, like 'others'

                     

                     

                     

                      • Fuzzy join possible?
                        Vlad Gutkovsky

                        The solution is script\mappingWithWildcards.qvw

                        You can check for no hits easily in the front end by using a calculated dimension to show Category: if(len(trim(Category))>0,Category,'Others')

                        Regards,

                          • Fuzzy join possible?
                            Amien Amien

                            that's for wildcards .. not fuzzy matching

                            i was really looking for a load solution

                              • Fuzzy join possible?
                                John Witherspoon

                                 


                                Amien wrote:
                                that's for wildcards .. not fuzzy matching
                                i was really looking for a load solution<div></div>


                                Wigchert asked for "searches for the keywords", which the wildcard mapping can solve. It IS a load solution. The "no hits" check that Vlad is talking about can easily be added to the load if you want that additional piece.

                                But you appear to mean something different by "fuzzy matching". What do YOU mean by it? Something like if one table has "Amien" and another has "amian", it'll match them anyway, because they're close? So you'd need some sort of "closeness function", and then map to the row with the highest value for the "closeness function"? Something like that? That would be challenging.

                            • Fuzzy join possible?
                              Rob Wunderlich

                               


                              Amien wrote:
                              i had the following issues with the solution in the cookbook, perhaps someone allready found a solution for this? :
                              * issues when i have 2 hits
                              * i wanted to put the no hits in a seperate category, like 'others'


                              In the cookbook example, the last entry in the Wildmatch table is:
                              *, Other
                              * matches everything. This is intended to catch the "no-hits" and assign them to "Other".

                              What's the issue re 2 hits? Do you want to generate another row for the second (or more) hits?

                              -Rob

                                • Fuzzy join possible?
                                  Amien Amien

                                  @Rob.. my bad here .. Others is here .. 2 hits .. i wanted to set priorities. but that also works fine .. tested it with "?1*, TestGroup"

                                  @John .. will i was indeed talking about algorithms like q-gram, jaro winkler en Lieberstein. you would need a vbscript to do this. or use a SQL procedure. cause the topic said fuzzy join i was immediately thinking of these algorithms. I don't associate fuzzy match with wildcardmatch :)