Skip to main content
Announcements
Qlik Introduces a New Era of Visualization! READ ALL ABOUT IT
cancel
Showing results for 
Search instead for 
Did you mean: 
nougatitati
Contributor III
Contributor III

Finding the subset of a table with only distinct entries

Hi,

This is potentially a very basic question, but I have not been able to resolve it on my own.

I have a table with 18 fields, 3 of which are timestamps which sometimes have duplicate entries all simultaneously for many records (due to a data entry error). I want to create a version of this original table with no duplicate entries for these three fields. I have tried

NoConcatenate UniqueTimestampsTable:
LOAD Distinct UnitName, [timestamp1], [timestamp2], [timestamp3]
Resident TableWithDuplicateTimestamps;

However this means I don't have any of the other attribute information for each of the records. To do my analysis, I need a table with all 18 fields included for each record. To do this, I tried a left join with:

Left Join (UniqueTimestampsTable)
AnotherOne:
Load
    [ID],
    [field0],
    [field1],
    [field2],
    [timestamp1],
    [field3],
    [field4],
Resident TableWithDuplicateTimestamps;

Where I intended the tables to join via timestamp1, and have only the entries of the former table included in the joined table (hence the left join). However, this does not work as intended, and I have numerous duplicate entries for all three timestamp entries.

I think I am overengineering my solution to this. How can I resolve this?

2 Solutions

Accepted Solutions
Lech_Miszkiewicz
Partner Ambassador/MVP
Partner Ambassador/MVP

This is why I asked you for FULL data example.... to be clear

In this case you want to do following:

  • First you want to get min ID for your timestamps
  • then Left Join rest of data to it:

like this...

 

// Get table with unique ID and Timestamps
MyTable:
Load
    [timestamp0],
    [timestamp1],
    [timestamp2],
    min([ID]) as "ID"
Resident 
     TableWithDuplicateEntries
Group By  
    [timestamp0],
    [timestamp1],
    [timestamp2]
;
// Join to it rest of the records assuming that ID and timestamps will generate unique row based on info provided by you
Left Join (MyTable)
Load Distinct
    [ID],    
    [field0],
    [field1],
    [field2],
    [timestamp0],
    [timestamp1],
    [timestamp2],
    [field3],
    [field4]
Resident 
     TableWithDuplicateEntries
;
cheers Lech, When applicable please mark the correct/appropriate replies as "solution" (you can mark up to 3 "solutions". Please LIKE threads if the provided solution is helpful to the problem.

View solution in original post

Lech_Miszkiewicz
Partner Ambassador/MVP
Partner Ambassador/MVP

In short - keys for join are differetn. You used only timestamps, I added to it ID...

Your first approach did not work as the result of distinct load had only timestamps in it. As soon as you joined table on timestamps it was exploded due to multiple rows caused by differetn ID's and other fields.

With my approach you are creating unique row based on timestamps and MIN ID

Then you use this ID as one of the keys when joining hence only rows with matching Ids/timestamps are joined

That is quite basic and I hope this is clear

cheers Lech, When applicable please mark the correct/appropriate replies as "solution" (you can mark up to 3 "solutions". Please LIKE threads if the provided solution is helpful to the problem.

View solution in original post

11 Replies
Lech_Miszkiewicz
Partner Ambassador/MVP
Partner Ambassador/MVP

Hi it is not clear to me what is duplicated...

If whole row is duplicated and is identical then you can just use LOAD DISTINCT,

If  whole row is duplicated except from timestamps, then which timestamps do you want to keep? Min or Max or maybe some others...

Can you add an example with actual data - what you have and what you are trying to get to?

cheers Lech, When applicable please mark the correct/appropriate replies as "solution" (you can mark up to 3 "solutions". Please LIKE threads if the provided solution is helpful to the problem.
nougatitati
Contributor III
Contributor III
Author

Sure here is some example data:

IDfield0timestamp0timestamp1timestamp2
0foo18/07/2019 16:45:0218/07/2019 16:45:0618/07/2019 16:45:09
1foo18/07/2019 16:45:3118/07/2019 16:45:4318/07/2019 16:45:52
2bar18/07/2019 16:46:0218/07/2019 16:46:0618/07/2019 16:46:12
3bar18/07/2019 16:46:1518/07/2019 16:46:1818/07/2019 16:46:21
4bar18/07/2019 16:46:1518/07/2019 16:46:1818/07/2019 16:46:21
5bar18/07/2019 16:46:1518/07/2019 16:46:1818/07/2019 16:46:21
6bar18/07/2019 16:48:0518/07/2019 16:48:0818/07/2019 16:49:11
7baz18/07/2019 17:01:3018/07/2019 17:01:3318/07/2019 17:01:39

 

You see how ID and field0 are progressing, but for some of the timestamps (ID 3 to 5) have duplicates. I want a table like this

IDfield0timestamp0timestamp1timestamp2
0foo18/07/2019 16:45:0218/07/2019 16:45:0618/07/2019 16:45:09
1foo18/07/2019 16:45:3118/07/2019 16:45:4318/07/2019 16:45:52
2bar18/07/2019 16:46:0218/07/2019 16:46:0618/07/2019 16:46:12
3bar18/07/2019 16:46:1518/07/2019 16:46:1818/07/2019 16:46:21
6bar18/07/2019 16:48:0518/07/2019 16:48:0818/07/2019 16:49:11
7baz18/07/2019 17:01:3018/07/2019 17:01:3318/07/2019 17:01:39

 

I hope that makes it clearer what I am asking. Thanks!

Lech_Miszkiewicz
Partner Ambassador/MVP
Partner Ambassador/MVP

Ok - this makes more sense and i want to point out that rows are not distinct as ID is different on each row, hence no metter what join you do as long as you keep ID in the table you will always have the same number of records. Which ID do you want to keep from below table? First or Last or the Middle one?

If the only thing progressing is ID you could do for example:

Load

field0,

timestamp0,

timestamp1,

timestamp2,

min(ID) as ID

From your table

Group by

field0,

timestamp0,

timestamp1,

timestamp2,

;

 

cheers Lech, When applicable please mark the correct/appropriate replies as "solution" (you can mark up to 3 "solutions". Please LIKE threads if the provided solution is helpful to the problem.
nougatitati
Contributor III
Contributor III
Author

Hey, thanks for the reply.

Unfortunately I have an issue where if I add the line "min(ID) as ID" Qlik Sense throws an error saying "Invalid expression".

 

MyTable:
    Load
    [field0],
    [field1],
    [field2],
    [timestamp0],
    [timestamp1],
    [timestamp2],
    [field3],
    [field4],
    min([ID]) as "ID"
Resident TableWithDuplicateEntries
Group By [timestamp0], [timestamp1], [timestamp2];

 

Gives error

 

The following error occurred:
Invalid expression

 

If I remove the "min([ID]) as "ID" and the "Group By [timestamp0] ..." this error goes away. Is it clear as to why this might happen, or do you need to know more context?

EDIT: According to this link, it is the case that


In order for an aggregation (sum) to work, you will have to add all other fields to your "group by" statement.

If you only want to do the aggregation over a few fields, do the load process in two statements and join the tables.


 Is this relevant to my case?

Lech_Miszkiewicz
Partner Ambassador/MVP
Partner Ambassador/MVP

you need to group by all fields which are not aggregate

like this

MyTable:
    Load
    [field0],
    [field1],
    [field2],
    [timestamp0],
    [timestamp1],
    [timestamp2],
    [field3],
    [field4],
    min([ID]) as "ID"
Resident TableWithDuplicateEntries
Group By  
[field0],
    [field1],
    [field2],
    [timestamp0],
    [timestamp1],
    [timestamp2],
    [field3],
    [field4];

 

cheers Lech, When applicable please mark the correct/appropriate replies as "solution" (you can mark up to 3 "solutions". Please LIKE threads if the provided solution is helpful to the problem.
nougatitati
Contributor III
Contributor III
Author

When I do that I get duplicate timestamps still, because we are grouping by the other fields as well. That is, I get the first table again:

0foo18/07/2019 16:45:0218/07/2019 16:45:0618/07/2019 16:45:09
1foo18/07/2019 16:45:3118/07/2019 16:45:4318/07/2019 16:45:52
2bar18/07/2019 16:46:0218/07/2019 16:46:0618/07/2019 16:46:12
3bar18/07/2019 16:46:1518/07/2019 16:46:1818/07/2019 16:46:21
4bar18/07/2019 16:46:1518/07/2019 16:46:1818/07/2019 16:46:21
5bar18/07/2019 16:46:1518/07/2019 16:46:1818/07/2019 16:46:21
6bar18/07/2019 16:48:0518/07/2019 16:48:0818/07/2019 16:49:11
7baz18/07/2019 17:01:3018/07/2019 17:01:3318/07/2019 17:01:39

 

I need to only group by the time stamps, nothing else, but have all the other fields still available as attributes.

Lech_Miszkiewicz
Partner Ambassador/MVP
Partner Ambassador/MVP

This is why I asked you for FULL data example.... to be clear

In this case you want to do following:

  • First you want to get min ID for your timestamps
  • then Left Join rest of data to it:

like this...

 

// Get table with unique ID and Timestamps
MyTable:
Load
    [timestamp0],
    [timestamp1],
    [timestamp2],
    min([ID]) as "ID"
Resident 
     TableWithDuplicateEntries
Group By  
    [timestamp0],
    [timestamp1],
    [timestamp2]
;
// Join to it rest of the records assuming that ID and timestamps will generate unique row based on info provided by you
Left Join (MyTable)
Load Distinct
    [ID],    
    [field0],
    [field1],
    [field2],
    [timestamp0],
    [timestamp1],
    [timestamp2],
    [field3],
    [field4]
Resident 
     TableWithDuplicateEntries
;
cheers Lech, When applicable please mark the correct/appropriate replies as "solution" (you can mark up to 3 "solutions". Please LIKE threads if the provided solution is helpful to the problem.
nougatitati
Contributor III
Contributor III
Author

Fantastic, that seems to have worked! For the sake of learning, do you know why my original approach (in my first post) using "SELECT DISTINCT [Timestamp0] [Timestamp1] ..." would not work? As I understand it would have done a very similar than to the "group by" clause.

Lech_Miszkiewicz
Partner Ambassador/MVP
Partner Ambassador/MVP

In short - keys for join are differetn. You used only timestamps, I added to it ID...

Your first approach did not work as the result of distinct load had only timestamps in it. As soon as you joined table on timestamps it was exploded due to multiple rows caused by differetn ID's and other fields.

With my approach you are creating unique row based on timestamps and MIN ID

Then you use this ID as one of the keys when joining hence only rows with matching Ids/timestamps are joined

That is quite basic and I hope this is clear

cheers Lech, When applicable please mark the correct/appropriate replies as "solution" (you can mark up to 3 "solutions". Please LIKE threads if the provided solution is helpful to the problem.