Skip to main content
Announcements
See what Drew Clarke has to say about the Qlik Talend Cloud launch! READ THE BLOG
cancel
Showing results for 
Search instead for 
Did you mean: 
mwscott1
Creator
Creator

Removing duplicate values in field during load

I have a csv file with two columns

  

NAMESTATE
John DoeNC, AL, CA, CA
Jim SmithSC
Jane DoeAR, NY
Jake SparowAL, GA, MS

some users have duplicate values in the STATE field. How can eliminate the duplicate values during the load.

  

NAMESTATE
John DoeNC, AL, CA
Jim SmithSC
Jane DoeAR, NY
Jake SparowAL, GA, MS

CA is just an example it could be any state, and multiple duplicates. End result just needs to be one state for every state represented.

1 Solution

Accepted Solutions
swuehl
MVP
MVP

You need to adapt the limiter ',' to the one used: ';', then it works just fine:

STATES:

LOAD [USER ID], Concat(DISTINCT SubSTATE,'; ') as  [LICENSE STATES]

GROUP BY [USER ID];

LOAD [USER ID],  Trim(Subfield( [LICENSE STATES], ';')) as SubSTATE;

LOAD [USER ID],

    [LICENSE STATES]

FROM

[States.xlsx]

(ooxml, embedded labels, table is Tabelle1);

USER ID LICENSE STATES
168644AR; AZ; CT; IL; MI; NC; NM; SC; TN; TX; VA; WV
229438CA
685237IL
695742CO; FL; GA; IA; IL; IN; KY; MI; NC; NY; OH; PA; SC; TN; TX; VA

View solution in original post

8 Replies
swuehl
MVP
MVP

You can split your STATE value substrings, creating a record per state, then aggregating using a distinct concat:

LOAD NAME, Concat(DISTINCT SubSTATE,', ') as STATE

GROUP BY NAME;

LOAD NAME,  Trim(Subfield(STATE, ',')) as SubSTATE;

LOAD * INLINE [

NAME STATE

John Doe NC, AL, CA, CA

Jim Smith SC

Jane Doe AR, NY

Jake Sparow AL, GA, MS

] (delimiter is '\t');

Sergey_Shuklin
Specialist
Specialist

Hello, Melvin!

You can also use a secret ability of a SubField() function! Just omit the third parameter of it when load. Like this:

LOAD DISTINCT

NAME,

Subfield(STATE,', ') as STATE

...

It will create a loop where all comma-separated values will be... separate for rows:

subfiled.png

Yes, the states will be in his own row and names are duplicated, but it can be more usefull in backend!

mwscott1
Creator
Creator
Author

I am not able to get this to work here is my script to load all the data.

LOAD [USER ID],

     [LICENSE STATES]

FROM

(ooxml, embedded labels, table is USER);

and a snap of my data

     

USER IDLICENSE STATES
168644 AR;AR;AZ;CT;IL;MI;NC;NM;SC;TN;TX;VA;WV
229438 CA;CA;CA
695742 CO;FL;FL;GA;IA;IL;IN;KY;MI;NC;NY;OH;PA;SC;TN;TX;VA
685237

IL;IL

I need to remove the duplicate states in each cell

mwscott1
Creator
Creator
Author

I am not able to get this to work here is my script to load all the data.

LOAD [USER ID],

     [LICENSE STATES]

FROM

(ooxml, embedded labels, table is USER);

and a snap of my data

     

USER IDLICENSE STATES
168644 AR;AR;AZ;CT;IL;MI;NC;NM;SC;TN;TX;VA;WV
229438 CA;CA;CA
695742 CO;FL;FL;GA;IA;IL;IN;KY;MI;NC;NY;OH;PA;SC;TN;TX;VA
685237

IL;IL

I need to remove the duplicate states in each cell

swuehl
MVP
MVP

You need to adapt the limiter ',' to the one used: ';', then it works just fine:

STATES:

LOAD [USER ID], Concat(DISTINCT SubSTATE,'; ') as  [LICENSE STATES]

GROUP BY [USER ID];

LOAD [USER ID],  Trim(Subfield( [LICENSE STATES], ';')) as SubSTATE;

LOAD [USER ID],

    [LICENSE STATES]

FROM

[States.xlsx]

(ooxml, embedded labels, table is Tabelle1);

USER ID LICENSE STATES
168644AR; AZ; CT; IL; MI; NC; NM; SC; TN; TX; VA; WV
229438CA
685237IL
695742CO; FL; GA; IA; IL; IN; KY; MI; NC; NY; OH; PA; SC; TN; TX; VA
mwscott1
Creator
Creator
Author

On a different note. Would there be a way to just identify the license states cells that have duplicate values and have them highlighted in a table object as apposed to removing the duplicates during the load.

swuehl
MVP
MVP

Shoudl be possible:

STATES:

LOAD *,

     If(SubStringCount([LICENSE STATES],';') <> SubStringCount([LICENSE STATES DEDUP],';'), 1,0) as Flag;

LOAD [USER ID],[LICENSE STATES], Concat(DISTINCT SubSTATE,'; ') as  [LICENSE STATES DEDUP]

GROUP BY [USER ID], [LICENSE STATES];

LOAD [USER ID],[LICENSE STATES],  Trim(Subfield( [LICENSE STATES], ';')) as SubSTATE;

LOAD [USER ID],

     [LICENSE STATES]

FROM

[States.xlsx]

(ooxml, embedded labels, table is Tabelle1);

USER ID LICENSE STATES Flag LICENSE STATES DEDUP
12345FL;TX0FL; TX
168644AR;AR;AZ;CT;IL;MI;NC;NM;SC;TN;TX;VA;WV1AR; AZ; CT; IL; MI; NC; NM; SC; TN; TX; VA; WV
229438CA;CA;CA1CA
685237IL;IL1IL
695742CO;FL;FL;GA;IA;IL;IN;KY;MI;NC;NY;OH;PA;SC;TN;TX;VA1CO; FL; GA; IA; IL; IN; KY; MI; NC; NY; OH; PA; SC; TN; TX; VA
MarcoWedel

Hi,

just to add a possible front end solution:

QlikCommunity_Thread_261156_Pic1.JPG

=Concat(DISTINCT Trim(SubField([LICENSE STATES],';',ValueLoop(1,Max(TOTAL SubStringCount([LICENSE STATES],';'))+1))),';')

(I think you should go with Stefan's script based approach though.)

hope this helps

regards

Marco