Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 
eduardo_dimperio
Specialist II
Specialist II

Compare Words from dual strings

Hi, I have a table with two fields Address1 and Address2 and i need to use a wildmatch, word by word like a cartesian produt to know how many words in Address2 exists in Address1.

Ex: 

Input:

Address1Address2
AV+AMERICAS+1700+BARRA+DA+TIJUCA+RIO+DE+JANEIROAv.+das+Américas+1650+Bloco+4+Loja+119+Barra+da+Tijuca+Rio+de+Janeiro
AV+AMERICAS+1700+BARRA+DA+TIJUCA+RIO+DE+JANEIROAv+das+Americas+1650+Blc+1+Lj+102+Barra+da+Tijuca+Rio+de+Janeiro
AV+AMERICAS+1700+BARRA+DA+TIJUCA+RIO+DE+JANEIROAvenida+das+Américas+1600+Barra+da+Tijuca+Rio+de+Janeiro
AV+AMERICAS+1700+BARRA+DA+TIJUCA+RIO+DE+JANEIROAvenida+das+Américas+1650+Barra+da+Tijuca+Rio+de+Janeiro
AV+AMERICAS+1700+BARRA+DA+TIJUCA+RIO+DE+JANEIROAvenida+das+Américas+1650+Bloco+04/+Loja+118+Recreio+dos+Bandeirantes+Rio+de+Janeiro
AV+AMERICAS+1700+BARRA+DA+TIJUCA+RIO+DE+JANEIROAvenida+das+Américas+1690+Barra+da+Tijuca+Rio+de+Janeiro

 

Output

Address1 Num_WildMatch
AV+AMERICAS+1700+BARRA+DA+TIJUCA+RIO+DE+JANEIRO7
AV+AMERICAS+1700+BARRA+DA+TIJUCA+RIO+DE+JANEIRO8
AV+AMERICAS+1700+BARRA+DA+TIJUCA+RIO+DE+JANEIRO6
AV+AMERICAS+1700+BARRA+DA+TIJUCA+RIO+DE+JANEIRO6
AV+AMERICAS+1700+BARRA+DA+TIJUCA+RIO+DE+JANEIRO4
AV+AMERICAS+1700+BARRA+DA+TIJUCA+RIO+DE+JANEIRO8
Labels (3)
1 Solution

Accepted Solutions
evan_kurowski
Specialist
Specialist

Hello Eduardo, you could start with something like the following:

  • Concatenate the A+B string pairs
  • Split them into all subfields
  • Count repetitions of each subfield grouped by pairing ID
  • Tabulate only subfields which repeat


 
[ADDRESSES]:
LOAD AutoNumberHash128(Address1, Address2) AS COMBO_ID, Upper(Address1) AS Address1, Upper(Address2) AS Address2 INLINE [
Address1, Address2
AV+AMERICAS+1700+BARRA+DA+TIJUCA+RIO+DE+JANEIRO, Av.+das+Américas+1650+Bloco+4+Loja+119+Barra+da+Tijuca+Rio+de+Janeiro
AV+AMERICAS+1700+BARRA+DA+TIJUCA+RIO+DE+JANEIRO, Av+das+Americas+1650+Blc+1+Lj+102+Barra+da+Tijuca+Rio+de+Janeiro
AV+AMERICAS+1700+BARRA+DA+TIJUCA+RIO+DE+JANEIRO, Avenida+das+Américas+1600+Barra+da+Tijuca+Rio+de+Janeiro
AV+AMERICAS+1700+BARRA+DA+TIJUCA+RIO+DE+JANEIRO, Avenida+das+Américas+1650+Barra+da+Tijuca+Rio+de+Janeiro
AV+AMERICAS+1700+BARRA+DA+TIJUCA+RIO+DE+JANEIRO, Avenida+das+Américas+1650+Bloco+04/+Loja+118+Recreio+dos+Bandeirantes+Rio+de+Janeiro
AV+AMERICAS+1700+BARRA+DA+TIJUCA+RIO+DE+JANEIRO, Avenida+das+Américas+1690+Barra+da+Tijuca
]
;
 
[SUBCOMPONENTS]:
LOAD COMBO_ID, Subfield(Address1 & '+' & Address2,'+') AS SUBCOMPONENTS RESIDENT ADDRESSES;
 
[COUNT_SUBCOMPONENTS]:
LOAD DISTINCT COMBO_ID, SUBCOMPONENTS, Count(SUBCOMPONENTS) AS COUNT RESIDENT [SUBCOMPONENTS] GROUP BY COMBO_ID, SUBCOMPONENTS;
 
[ISOLATE_REPEATERS]:
LOAD COMBO_ID, Count(DISTINCT SUBCOMPONENTS) AS SUBCOMPONENT_MATCHES RESIDENT [COUNT_SUBCOMPONENTS] WHERE COUNT > 1 GROUP BY COMBO_ID;
 Image of repeating substrings A in B.png

View solution in original post

3 Replies
evan_kurowski
Specialist
Specialist

Hello Eduardo, you could start with something like the following:

  • Concatenate the A+B string pairs
  • Split them into all subfields
  • Count repetitions of each subfield grouped by pairing ID
  • Tabulate only subfields which repeat


 
[ADDRESSES]:
LOAD AutoNumberHash128(Address1, Address2) AS COMBO_ID, Upper(Address1) AS Address1, Upper(Address2) AS Address2 INLINE [
Address1, Address2
AV+AMERICAS+1700+BARRA+DA+TIJUCA+RIO+DE+JANEIRO, Av.+das+Américas+1650+Bloco+4+Loja+119+Barra+da+Tijuca+Rio+de+Janeiro
AV+AMERICAS+1700+BARRA+DA+TIJUCA+RIO+DE+JANEIRO, Av+das+Americas+1650+Blc+1+Lj+102+Barra+da+Tijuca+Rio+de+Janeiro
AV+AMERICAS+1700+BARRA+DA+TIJUCA+RIO+DE+JANEIRO, Avenida+das+Américas+1600+Barra+da+Tijuca+Rio+de+Janeiro
AV+AMERICAS+1700+BARRA+DA+TIJUCA+RIO+DE+JANEIRO, Avenida+das+Américas+1650+Barra+da+Tijuca+Rio+de+Janeiro
AV+AMERICAS+1700+BARRA+DA+TIJUCA+RIO+DE+JANEIRO, Avenida+das+Américas+1650+Bloco+04/+Loja+118+Recreio+dos+Bandeirantes+Rio+de+Janeiro
AV+AMERICAS+1700+BARRA+DA+TIJUCA+RIO+DE+JANEIRO, Avenida+das+Américas+1690+Barra+da+Tijuca
]
;
 
[SUBCOMPONENTS]:
LOAD COMBO_ID, Subfield(Address1 & '+' & Address2,'+') AS SUBCOMPONENTS RESIDENT ADDRESSES;
 
[COUNT_SUBCOMPONENTS]:
LOAD DISTINCT COMBO_ID, SUBCOMPONENTS, Count(SUBCOMPONENTS) AS COUNT RESIDENT [SUBCOMPONENTS] GROUP BY COMBO_ID, SUBCOMPONENTS;
 
[ISOLATE_REPEATERS]:
LOAD COMBO_ID, Count(DISTINCT SUBCOMPONENTS) AS SUBCOMPONENT_MATCHES RESIDENT [COUNT_SUBCOMPONENTS] WHERE COUNT > 1 GROUP BY COMBO_ID;
 Image of repeating substrings A in B.png

eduardo_dimperio
Specialist II
Specialist II
Author

Hey Evan, thank you for your time.

Your solution doesnt work like i needed but was very clever the way that you splited word by word after concatenate the two strings, but with that i can easely use a wildmatch.

I saw that you used a AutoNumberHash128 and i'm sorry for that extra effort, cause i already have a primary key, i just din't wrote cause i thought thats no use here (and i was wrong).
I'll end the code e post here, but the right answer is your. Thank you

eduardo_dimperio
Specialist II
Specialist II
Author

Here the final code:

NoConcatenate
Aux:
load
CNPJ,
UPPER(EnderecoCompleto) AS Address1,
UPPER(End) AS Address2
Resident PlaceId;

[SUBCOMPONENTS]:
LOAD
CNPJ,
Address1,
Address2,
Subfield(Address2,'+') AS SUBCOMPONENTS
RESIDENT Aux;

[COUNT_SUBCOMPONENTS]:
LOAD DISTINCT
CNPJ,
Address1,
Address2,
SUBCOMPONENTS,
WildMatch(Address1,'*'&SUBCOMPONENTS&'*') AS Match
RESIDENT [SUBCOMPONENTS]
;

[COUNT_SUBCOMP]:
LOAD DISTINCT
CNPJ,
Address1,
Address2,
Count(SUBCOMPONENTS) AS COUNT
RESIDENT [SUBCOMPONENTS]
GROUP BY CNPJ, Address1,Address2;

DROP TABLE Aux,SUBCOMPONENTS;

[RESULT]:
LOAD DISTINCT
CNPJ,
Address1,
Address2,
sum(Match) AS Macth
RESIDENT [COUNT_SUBCOMPONENTS]
Group by
CNPJ,
Address1,
Address2,
;
Left Join(RESULT)
LOAD DISTINCT
CNPJ,
Address1,
Address2,
COUNT
Resident [COUNT_SUBCOMP];

DROP TABLE [COUNT_SUBCOMPONENTS],COUNT_SUBCOMP;

[SCORE]:
LOAD DISTINCT
CNPJ,
Address1,
Address2,
Macth/COUNT as Score
RESIDENT [RESULT];

DROP TABLE RESULT;

And Output

 
image.pngNow with just a Min() function i can found the likely Address