Nemo1
Creator

Problem with Data Mapping and Matching in huge dataset

Hello everyone, 

I have the following problem.

I have a table A, shown below, which is my data source. It has one column for the text (Text A) and one column for its classification (Classification A).

Table A

Text A                   | Classification A
The mouse eats a carrot  | Nature
Prague is a nice city    | Geography
I am a human             | Human
The cat eats the mouse   | Nature
Sun is shinning          | Weather
She is a professional    | Social

 

Then I have table B, which has text but no classification. Its text contains words that also appear in Text A, such as cat, sun, human, etc. I want an algorithm or formula that, based on these words, looks them up in table A and returns the classification. These two tables are just an example; in reality I have two huge datasets.

For example, for "That cat is my pet", Classification B should be "Nature".

What could I do to solve this? Could I solve it in Qlik?

 

Thanks

Table B

Text B                    | Classification B
The human is complex      | ?
That cat is my pet        | ?
I do not like the mouse   | ?
the sun is yellow         | ?
he is a professional      | ?
Prage is in europe        | ?
4 Replies
marcus_sommer

Qlik has very powerful string functions and mapping features, so it would be possible to develop an appropriate categorization. But the biggest and hardest part of the work is not the technical implementation; it is developing a sensible and valid set of rules for the categorization, especially with regard to cleaning and preparing the data beforehand and determining the order of execution and the priority of the matches.

Nemo1
Creator
Author

Hey, thanks for your answer. I have already prepared the data in two datasets, but I do not know how to proceed from here. What would you do? Any suggestions are welcome, thanks.

qv_testing
Specialist II

How many combinations do you have, like Cats, Sun, Human, professional?

marcus_sommer

Do you really have a set of rules that differentiates between nouns, verbs and adjectives, and that deals with filler words and all kinds of punctuation marks? Is the context within a sentence important or not? How should typos be handled? In which order should the rules be searched and matched?

... the human looked like the mouse to the shining sun ... // which one should win ?

Besides this, take a look at mapsubstring(), which can insert multiple match returns into a string that can be evaluated later.
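
To make the mapsubstring() idea concrete, here is a minimal sketch. The table and field names (WordMap, TableB, TextB, ClassificationB), the inline sample rows and the '<' '>' markers are assumptions for illustration only, not taken from the real datasets. Each lookup word is replaced by its category wrapped in markers, and textbetween() then pulls the first marked category out of the mapped string.

```qlik
// Hypothetical rules: lookup word -> category wrapped in '<' and '>' markers
WordMap:
mapping load * inline [
Lookup,Return
cat,<Nature>
mouse,<Nature>
sun,<Weather>
human,<Human>
];

// Hypothetical sample of table B
TableB:
load TextB,
     // lower() because the mapping is case sensitive;
     // textbetween() returns the first category found, reading left to right
     textbetween(mapsubstring('WordMap', lower(TextB)), '<', '>') as ClassificationB
inline [
TextB
That cat is my pet
the sun is yellow
The human is complex
];
```

Note that mapsubstring() matches substrings, not whole words, so a rule for "cat" would also fire inside "category". If that matters, padding both the text and the lookup words with spaces is one possible workaround.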

Another common way is to split each string into n records with subfield() and then apply a normal mapping to those records, maybe something like:

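// Mapping table: one row per lookup word and the category it returns
m: mapping load Lookup, Return from MyRules;

// The inner load splits String into one record per space-separated word;
// the preceding load maps each word to a category, '#NV' if no rule matches.
// Field names are case sensitive, so SubString must be spelled the same in both loads.
t:
load *, applymap('m', SubString, '#NV') as Category, rowno() as RowNo;
load Key, subfield(String, ' ', iterno()) as SubString, recno() as RecNo, iterno() as IterNo
from MyDataset while iterno() <= substringcount(String, ' ') + 1;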
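
After the split, a record with several matching words has several Category rows, so the "which one should win?" question still needs a rule. As a sketch only, assuming table t has been loaded with the fields Key, Category and IterNo as above, the following keeps the category of the earliest matching word per Key. The table name Winner and the "first word wins" rule are assumptions; a priority column from the rules table could be used as the sort weight instead.

```qlik
// One row per Key: the category of the first word that matched a rule
Winner:
load Key,
     firstsortedvalue(Category, IterNo) as Category  // lowest IterNo = earliest word
resident t
where Category <> '#NV'   // ignore words that matched no rule
group by Key;
```

Keys without any matching word do not appear in Winner at all, which makes the unclassified records easy to spot afterwards.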