Solved: Compare Words from dual strings - Qlik Community

eduardo_dimperio · ‎2018-12-06

Hi, I have a table with two fields Address1 and Address2 and i need to use a wildmatch, word by word like a cartesian produt to know how many words in Address2 exists in Address1.

Ex:

Input:

Address1	Address2
AV+AMERICAS+1700+BARRA+DA+TIJUCA+RIO+DE+JANEIRO	Av.+das+Américas+1650+Bloco+4+Loja+119+Barra+da+Tijuca+Rio+de+Janeiro
AV+AMERICAS+1700+BARRA+DA+TIJUCA+RIO+DE+JANEIRO	Av+das+Americas+1650+Blc+1+Lj+102+Barra+da+Tijuca+Rio+de+Janeiro
AV+AMERICAS+1700+BARRA+DA+TIJUCA+RIO+DE+JANEIRO	Avenida+das+Américas+1600+Barra+da+Tijuca+Rio+de+Janeiro
AV+AMERICAS+1700+BARRA+DA+TIJUCA+RIO+DE+JANEIRO	Avenida+das+Américas+1650+Barra+da+Tijuca+Rio+de+Janeiro
AV+AMERICAS+1700+BARRA+DA+TIJUCA+RIO+DE+JANEIRO	Avenida+das+Américas+1650+Bloco+04/+Loja+118+Recreio+dos+Bandeirantes+Rio+de+Janeiro
AV+AMERICAS+1700+BARRA+DA+TIJUCA+RIO+DE+JANEIRO	Avenida+das+Américas+1690+Barra+da+Tijuca+Rio+de+Janeiro

Output

Address1	Num_WildMatch
AV+AMERICAS+1700+BARRA+DA+TIJUCA+RIO+DE+JANEIRO	7
AV+AMERICAS+1700+BARRA+DA+TIJUCA+RIO+DE+JANEIRO	8
AV+AMERICAS+1700+BARRA+DA+TIJUCA+RIO+DE+JANEIRO	6
AV+AMERICAS+1700+BARRA+DA+TIJUCA+RIO+DE+JANEIRO	6
AV+AMERICAS+1700+BARRA+DA+TIJUCA+RIO+DE+JANEIRO	4
AV+AMERICAS+1700+BARRA+DA+TIJUCA+RIO+DE+JANEIRO	8

evan_kurowski · ‎2018-12-06

Hello Eduardo, you could start with something like the following:

Concatenate the A+B string pairs
Split them into all subfields
Count repetitions of each subfield grouped by pairing ID
Tabulate only subfields which repeat

[ADDRESSES]:
LOAD AutoNumberHash128(Address1, Address2) AS COMBO_ID, Upper(Address1) AS Address1, Upper(Address2) AS Address2 INLINE [
Address1, Address2
AV+AMERICAS+1700+BARRA+DA+TIJUCA+RIO+DE+JANEIRO, Av.+das+Américas+1650+Bloco+4+Loja+119+Barra+da+Tijuca+Rio+de+Janeiro
AV+AMERICAS+1700+BARRA+DA+TIJUCA+RIO+DE+JANEIRO, Av+das+Americas+1650+Blc+1+Lj+102+Barra+da+Tijuca+Rio+de+Janeiro
AV+AMERICAS+1700+BARRA+DA+TIJUCA+RIO+DE+JANEIRO, Avenida+das+Américas+1600+Barra+da+Tijuca+Rio+de+Janeiro
AV+AMERICAS+1700+BARRA+DA+TIJUCA+RIO+DE+JANEIRO, Avenida+das+Américas+1650+Barra+da+Tijuca+Rio+de+Janeiro
AV+AMERICAS+1700+BARRA+DA+TIJUCA+RIO+DE+JANEIRO, Avenida+das+Américas+1650+Bloco+04/+Loja+118+Recreio+dos+Bandeirantes+Rio+de+Janeiro
AV+AMERICAS+1700+BARRA+DA+TIJUCA+RIO+DE+JANEIRO, Avenida+das+Américas+1690+Barra+da+Tijuca
];

[SUBCOMPONENTS]:
LOAD COMBO_ID, Subfield(Address1 & '+' & Address2,'+') AS SUBCOMPONENTS RESIDENT ADDRESSES;

[COUNT_SUBCOMPONENTS]:
LOAD DISTINCT COMBO_ID, SUBCOMPONENTS, Count(SUBCOMPONENTS) AS COUNT RESIDENT [SUBCOMPONENTS] GROUP BY COMBO_ID, SUBCOMPONENTS;

[ISOLATE_REPEATERS]:
LOAD COMBO_ID, Count(DISTINCT SUBCOMPONENTS) AS SUBCOMPONENT_MATCHES RESIDENT [COUNT_SUBCOMPONENTS] WHERE COUNT > 1 GROUP BY COMBO_ID;
Image of repeating substrings A in B.png

View solution in original post

evan_kurowski · ‎2018-12-06

Hello Eduardo, you could start with something like the following:

Concatenate the A+B string pairs
Split them into all subfields
Count repetitions of each subfield grouped by pairing ID
Tabulate only subfields which repeat

[ADDRESSES]:
LOAD AutoNumberHash128(Address1, Address2) AS COMBO_ID, Upper(Address1) AS Address1, Upper(Address2) AS Address2 INLINE [
Address1, Address2
AV+AMERICAS+1700+BARRA+DA+TIJUCA+RIO+DE+JANEIRO, Av.+das+Américas+1650+Bloco+4+Loja+119+Barra+da+Tijuca+Rio+de+Janeiro
AV+AMERICAS+1700+BARRA+DA+TIJUCA+RIO+DE+JANEIRO, Av+das+Americas+1650+Blc+1+Lj+102+Barra+da+Tijuca+Rio+de+Janeiro
AV+AMERICAS+1700+BARRA+DA+TIJUCA+RIO+DE+JANEIRO, Avenida+das+Américas+1600+Barra+da+Tijuca+Rio+de+Janeiro
AV+AMERICAS+1700+BARRA+DA+TIJUCA+RIO+DE+JANEIRO, Avenida+das+Américas+1650+Barra+da+Tijuca+Rio+de+Janeiro
AV+AMERICAS+1700+BARRA+DA+TIJUCA+RIO+DE+JANEIRO, Avenida+das+Américas+1650+Bloco+04/+Loja+118+Recreio+dos+Bandeirantes+Rio+de+Janeiro
AV+AMERICAS+1700+BARRA+DA+TIJUCA+RIO+DE+JANEIRO, Avenida+das+Américas+1690+Barra+da+Tijuca
];

[SUBCOMPONENTS]:
LOAD COMBO_ID, Subfield(Address1 & '+' & Address2,'+') AS SUBCOMPONENTS RESIDENT ADDRESSES;

[COUNT_SUBCOMPONENTS]:
LOAD DISTINCT COMBO_ID, SUBCOMPONENTS, Count(SUBCOMPONENTS) AS COUNT RESIDENT [SUBCOMPONENTS] GROUP BY COMBO_ID, SUBCOMPONENTS;

[ISOLATE_REPEATERS]:
LOAD COMBO_ID, Count(DISTINCT SUBCOMPONENTS) AS SUBCOMPONENT_MATCHES RESIDENT [COUNT_SUBCOMPONENTS] WHERE COUNT > 1 GROUP BY COMBO_ID;
Image of repeating substrings A in B.png

eduardo_dimperio · ‎2018-12-07

Hey Evan, thank you for your time.

Your solution doesnt work like i needed but was very clever the way that you splited word by word after concatenate the two strings, but with that i can easely use a wildmatch.

I saw that you used a AutoNumberHash128 and i'm sorry for that extra effort, cause i already have a primary key, i just din't wrote cause i thought thats no use here (and i was wrong).
I'll end the code e post here, but the right answer is your. Thank you

eduardo_dimperio · ‎2018-12-07

Here the final code:

NoConcatenate
Aux:
load
CNPJ,
UPPER(EnderecoCompleto) AS Address1,
UPPER(End) AS Address2
Resident PlaceId;

[SUBCOMPONENTS]:
LOAD
CNPJ,
Address1,
Address2,
Subfield(Address2,'+') AS SUBCOMPONENTS
RESIDENT Aux;

[COUNT_SUBCOMPONENTS]:
LOAD DISTINCT
CNPJ,
Address1,
Address2,
SUBCOMPONENTS,
WildMatch(Address1,'*'&SUBCOMPONENTS&'*') AS Match
RESIDENT [SUBCOMPONENTS]
;

[COUNT_SUBCOMP]:
LOAD DISTINCT
CNPJ,
Address1,
Address2,
Count(SUBCOMPONENTS) AS COUNT
RESIDENT [SUBCOMPONENTS]
GROUP BY CNPJ, Address1,Address2;

DROP TABLE Aux,SUBCOMPONENTS;

[RESULT]:
LOAD DISTINCT
CNPJ,
Address1,
Address2,
sum(Match) AS Macth
RESIDENT [COUNT_SUBCOMPONENTS]
Group by
CNPJ,
Address1,
Address2,
;
Left Join(RESULT)
LOAD DISTINCT
CNPJ,
Address1,
Address2,
COUNT
Resident [COUNT_SUBCOMP];

DROP TABLE [COUNT_SUBCOMPONENTS],COUNT_SUBCOMP;

[SCORE]:
LOAD DISTINCT
CNPJ,
Address1,
Address2,
Macth/COUNT as Score
RESIDENT [RESULT];

DROP TABLE RESULT;

And Output

Now with just a Min() function i can found the likely Address

Compare Words from dual strings

Qlik Sense

String

wildmatch