Unlock a world of possibilities! Login now and discover the exclusive benefits awaiting you.
Hi,
This is potentially a very basic question, but I have not been able to resolve it on my own.
I have a table with 18 fields, 3 of which are timestamps which sometimes have duplicate entries all simultaneously for many records (due to a data entry error). I want to create a version of this original table with no duplicate entries for these three fields. I have tried
NoConcatenate UniqueTimestampsTable:
LOAD Distinct UnitName, [timestamp1], [timestamp2], [timestamp3]
Resident TableWithDuplicateTimestamps;
However this means I don't have any of the other attribute information for each of the records. To do my analysis, I need a table with all 18 fields included for each record. To do this, I tried a left join with:
Left Join (UniqueTimestampsTable)
AnotherOne:
Load
[ID],
[field0],
[field1],
[field2],
[timestamp1],
[field3],
[field4],
Resident TableWithDuplicateTimestamps;
Where I intended the tables to join via timestamp1, and have only the entries of the former table included in the joined table (hence the left join). However, this does not work as intended, and I have numerous duplicate entries for all three timestamp entries.
I think I am overengineering my solution to this. How can I resolve this?
This is why I asked you for FULL data example.... to be clear
In this case you want to do following:
like this...
// Get table with unique ID and Timestamps
MyTable:
Load
[timestamp0],
[timestamp1],
[timestamp2],
min([ID]) as "ID"
Resident
TableWithDuplicateEntries
Group By
[timestamp0],
[timestamp1],
[timestamp2]
;
// Join to it rest of the records assuming that ID and timestamps will generate unique row based on info provided by you
Left Join (MyTable)
Load Distinct
[ID],
[field0],
[field1],
[field2],
[timestamp0],
[timestamp1],
[timestamp2],
[field3],
[field4]
Resident
TableWithDuplicateEntries
;
In short - keys for join are differetn. You used only timestamps, I added to it ID...
Your first approach did not work as the result of distinct load had only timestamps in it. As soon as you joined table on timestamps it was exploded due to multiple rows caused by differetn ID's and other fields.
With my approach you are creating unique row based on timestamps and MIN ID
Then you use this ID as one of the keys when joining hence only rows with matching Ids/timestamps are joined
That is quite basic and I hope this is clear
Hi it is not clear to me what is duplicated...
If whole row is duplicated and is identical then you can just use LOAD DISTINCT,
If whole row is duplicated except from timestamps, then which timestamps do you want to keep? Min or Max or maybe some others...
Can you add an example with actual data - what you have and what you are trying to get to?
Sure here is some example data:
ID | field0 | timestamp0 | timestamp1 | timestamp2 |
0 | foo | 18/07/2019 16:45:02 | 18/07/2019 16:45:06 | 18/07/2019 16:45:09 |
1 | foo | 18/07/2019 16:45:31 | 18/07/2019 16:45:43 | 18/07/2019 16:45:52 |
2 | bar | 18/07/2019 16:46:02 | 18/07/2019 16:46:06 | 18/07/2019 16:46:12 |
3 | bar | 18/07/2019 16:46:15 | 18/07/2019 16:46:18 | 18/07/2019 16:46:21 |
4 | bar | 18/07/2019 16:46:15 | 18/07/2019 16:46:18 | 18/07/2019 16:46:21 |
5 | bar | 18/07/2019 16:46:15 | 18/07/2019 16:46:18 | 18/07/2019 16:46:21 |
6 | bar | 18/07/2019 16:48:05 | 18/07/2019 16:48:08 | 18/07/2019 16:49:11 |
7 | baz | 18/07/2019 17:01:30 | 18/07/2019 17:01:33 | 18/07/2019 17:01:39 |
You see how ID and field0 are progressing, but for some of the timestamps (ID 3 to 5) have duplicates. I want a table like this
ID | field0 | timestamp0 | timestamp1 | timestamp2 |
0 | foo | 18/07/2019 16:45:02 | 18/07/2019 16:45:06 | 18/07/2019 16:45:09 |
1 | foo | 18/07/2019 16:45:31 | 18/07/2019 16:45:43 | 18/07/2019 16:45:52 |
2 | bar | 18/07/2019 16:46:02 | 18/07/2019 16:46:06 | 18/07/2019 16:46:12 |
3 | bar | 18/07/2019 16:46:15 | 18/07/2019 16:46:18 | 18/07/2019 16:46:21 |
6 | bar | 18/07/2019 16:48:05 | 18/07/2019 16:48:08 | 18/07/2019 16:49:11 |
7 | baz | 18/07/2019 17:01:30 | 18/07/2019 17:01:33 | 18/07/2019 17:01:39 |
I hope that makes it clearer what I am asking. Thanks!
Ok - this makes more sense and i want to point out that rows are not distinct as ID is different on each row, hence no metter what join you do as long as you keep ID in the table you will always have the same number of records. Which ID do you want to keep from below table? First or Last or the Middle one?
If the only thing progressing is ID you could do for example:
Load
field0,
timestamp0,
timestamp1,
timestamp2,
min(ID) as ID
From your table
Group by
field0,
timestamp0,
timestamp1,
timestamp2,
;
Hey, thanks for the reply.
Unfortunately I have an issue where if I add the line "min(ID) as ID" Qlik Sense throws an error saying "Invalid expression".
MyTable:
Load
[field0],
[field1],
[field2],
[timestamp0],
[timestamp1],
[timestamp2],
[field3],
[field4],
min([ID]) as "ID"
Resident TableWithDuplicateEntries
Group By [timestamp0], [timestamp1], [timestamp2];
Gives error
The following error occurred:
Invalid expression
If I remove the "min([ID]) as "ID" and the "Group By [timestamp0] ..." this error goes away. Is it clear as to why this might happen, or do you need to know more context?
EDIT: According to this link, it is the case that
In order for an aggregation (sum) to work, you will have to add all other fields to your "group by" statement.
If you only want to do the aggregation over a few fields, do the load process in two statements and join the tables.
Is this relevant to my case?
you need to group by all fields which are not aggregate
like this
MyTable:
Load
[field0],
[field1],
[field2],
[timestamp0],
[timestamp1],
[timestamp2],
[field3],
[field4],
min([ID]) as "ID"
Resident TableWithDuplicateEntries
Group By
[field0],
[field1],
[field2],
[timestamp0],
[timestamp1],
[timestamp2],
[field3],
[field4];
When I do that I get duplicate timestamps still, because we are grouping by the other fields as well. That is, I get the first table again:
0 | foo | 18/07/2019 16:45:02 | 18/07/2019 16:45:06 | 18/07/2019 16:45:09 |
1 | foo | 18/07/2019 16:45:31 | 18/07/2019 16:45:43 | 18/07/2019 16:45:52 |
2 | bar | 18/07/2019 16:46:02 | 18/07/2019 16:46:06 | 18/07/2019 16:46:12 |
3 | bar | 18/07/2019 16:46:15 | 18/07/2019 16:46:18 | 18/07/2019 16:46:21 |
4 | bar | 18/07/2019 16:46:15 | 18/07/2019 16:46:18 | 18/07/2019 16:46:21 |
5 | bar | 18/07/2019 16:46:15 | 18/07/2019 16:46:18 | 18/07/2019 16:46:21 |
6 | bar | 18/07/2019 16:48:05 | 18/07/2019 16:48:08 | 18/07/2019 16:49:11 |
7 | baz | 18/07/2019 17:01:30 | 18/07/2019 17:01:33 | 18/07/2019 17:01:39 |
I need to only group by the time stamps, nothing else, but have all the other fields still available as attributes.
This is why I asked you for FULL data example.... to be clear
In this case you want to do following:
like this...
// Get table with unique ID and Timestamps
MyTable:
Load
[timestamp0],
[timestamp1],
[timestamp2],
min([ID]) as "ID"
Resident
TableWithDuplicateEntries
Group By
[timestamp0],
[timestamp1],
[timestamp2]
;
// Join to it rest of the records assuming that ID and timestamps will generate unique row based on info provided by you
Left Join (MyTable)
Load Distinct
[ID],
[field0],
[field1],
[field2],
[timestamp0],
[timestamp1],
[timestamp2],
[field3],
[field4]
Resident
TableWithDuplicateEntries
;
Fantastic, that seems to have worked! For the sake of learning, do you know why my original approach (in my first post) using "SELECT DISTINCT [Timestamp0] [Timestamp1] ..." would not work? As I understand it would have done a very similar than to the "group by" clause.
In short - keys for join are differetn. You used only timestamps, I added to it ID...
Your first approach did not work as the result of distinct load had only timestamps in it. As soon as you joined table on timestamps it was exploded due to multiple rows caused by differetn ID's and other fields.
With my approach you are creating unique row based on timestamps and MIN ID
Then you use this ID as one of the keys when joining hence only rows with matching Ids/timestamps are joined
That is quite basic and I hope this is clear