Skip to main content
Announcements
See what Drew Clarke has to say about the Qlik Talend Cloud launch! READ THE BLOG
cancel
Showing results for 
Search instead for 
Did you mean: 
cristian_av
Creator III
Creator III

How to find duplicated values within time intervals

Hi

I have a database with the registry of people from a food service. The food service is free, but because the POS has some troubles, sometimes the records are duplicated (i.e. there are two transactions with a difference of 30 seconds, or sometimes max. 2 minutes) .

I need to find record that are duplicated and mark them (Both, the correct and the duplicated!.

What i Have is something like:

1     Person1    Food1     12/05/2014 11:01:00          NEED TO HAVE A MARK (Duplicated)

2     Person1    Food1     12/05/2014 11:01:30          NEED TO HAVE A MARK (Duplicated)

3     Person1    Food2     12/05/2014 12:00:30

4     Person2    Food1     12/05/2014 11:01:20

5     Person2    Food2     12/05/2014 11:01:30

n     ...

Mark the registry 2 is (I think..) not a big deal, cause I can use Peek or Previous.. But how do I mark the registry 1?

So in this example, I have to mark the registry 1 and 2, because it is from the same person, same food and in the same day.

I need to have at least two-three Marks, depending the time difference between the registers.

i.e      One mark if the difference of time is less than 30 seconds

          One mark if the difference of time is less than 2 minutes.. and so on

Thank you!!

1 Solution

Accepted Solutions
Not applicable

Hi Cristian

It sounds like you want to achieve a result something like this:

Find duplicates2.png

The script below works by counting the number of distinct entries against each combination of person and food.  For it to work your data must have a unique key for each Person/Food combination.  I achieved this in the example by creating a composite key from your Person and Food fields but in your real data you may have to find something that will offer uniqueness as I imagine there will be more than one Mr or Mrs X having food Y.  However, hopefully this example will demonstrate a principle by which you could solve your issue.  Of course if your Person and Food fields do generate a unique key for each Person/Food combination then the script  will work for you.

The basic steps are:

  1. Load your data
  2. Create a composite key to provide unique combinations of Person and Food
  3. Using the composite key count the number of unique records per key
  4. Calculate the earliest and latest transaction times then use these to calculate the time difference.
    (I have converted the time difference to seconds for simplicity)
  5. Identify the first record for each key and mark it as 'Correct'
  6. Identify the duplicate records for each key and flag them as 'Duplicate'
  7. Create the time intervals dimension
  8. Create a straight table to display the results

I have added a little more test data to the inline load and have included one example where there are multiple duplicate entries so you can see how the script handles such a scenario.

Copy the script below into the script editor then save and reload your application.

Data:
load * inline [
Ref,Person,Food,Transaction_Time
1, Person1, Food1, 12/05/2014 11:01:00
2, Person1, Food1, 12/05/2014 11:01:30
3, Person1, Food2, 12/05/2014 12:00:30
4, Person2, Food1, 12/05/2014 11:01:20
5, Person2, Food2, 12/05/2014 11:01:30
6, Person3, Food1, 12/05/2014 11:01:20
7, Person3, Food1, 12/05/2014 11:02:15
8, Person4, Food2, 12/05/2014 11:01:20
9, Person4, Food2, 12/05/2014 11:02:45
10, Person4, Food2, 12/05/2014 11:02:20
11, Person4, Food2, 12/05/2014 11:06:45
12, Person5, Food1, 12/05/2014 11:01:20
13, Person5, Food1, 12/05/2014 11:03:20
]
;

left join(Data)
Load
     
Ref,
     
Person & Food as KEY
Resident Data ;

left join(Data)
Load
     
Person & Food as KEY,
     
Count (Distinct Ref & Transaction_Time) as COUNT_OF_RECORDS,
     
Min(Transaction_Time) as EARLIEST_TIME,
     
Max(Transaction_Time) as LATEST_TIME,
     
Round(((Max(Transaction_Time)-Min(Transaction_Time))*86400),1) as TIME_DIFFERENCE
Resident Data
Group By Person & Food ;

left join(Data)
Load
     
Ref,
     
If(Transaction_Time=EARLIEST_TIME,'Correct',
     
If(COUNT_OF_RECORDS>1 and Transaction_Time>EARLIEST_TIME,'Duplicate',
      'Check'))
as DUPLICATE_IDENTIFIER
Resident Data;

left join(Data)
Load
Ref,
     
If(TIME_DIFFERENCE=0,'Correct',
     
If(TIME_DIFFERENCE>0 and TIME_DIFFERENCE<=30,'0-30 second gap',
     
If(TIME_DIFFERENCE>30 and TIME_DIFFERENCE<=60,'30-60 second gap',
     
If(TIME_DIFFERENCE>60 and TIME_DIFFERENCE<=90,'60-90 second gap',
     
If(TIME_DIFFERENCE>90 and TIME_DIFFERENCE<=120,'90-120 second gap',
     
If(TIME_DIFFERENCE>120,'> 2 minute gap')))))) as RESULT_FLAG
Resident Data;

Hope that helps.

Kind regards

Steve

View solution in original post

3 Replies
Not applicable

Hi Cristian

It sounds like you want to achieve a result something like this:

Find duplicates2.png

The script below works by counting the number of distinct entries against each combination of person and food.  For it to work your data must have a unique key for each Person/Food combination.  I achieved this in the example by creating a composite key from your Person and Food fields but in your real data you may have to find something that will offer uniqueness as I imagine there will be more than one Mr or Mrs X having food Y.  However, hopefully this example will demonstrate a principle by which you could solve your issue.  Of course if your Person and Food fields do generate a unique key for each Person/Food combination then the script  will work for you.

The basic steps are:

  1. Load your data
  2. Create a composite key to provide unique combinations of Person and Food
  3. Using the composite key count the number of unique records per key
  4. Calculate the earliest and latest transaction times then use these to calculate the time difference.
    (I have converted the time difference to seconds for simplicity)
  5. Identify the first record for each key and mark it as 'Correct'
  6. Identify the duplicate records for each key and flag them as 'Duplicate'
  7. Create the time intervals dimension
  8. Create a straight table to display the results

I have added a little more test data to the inline load and have included one example where there are multiple duplicate entries so you can see how the script handles such a scenario.

Copy the script below into the script editor then save and reload your application.

Data:
load * inline [
Ref,Person,Food,Transaction_Time
1, Person1, Food1, 12/05/2014 11:01:00
2, Person1, Food1, 12/05/2014 11:01:30
3, Person1, Food2, 12/05/2014 12:00:30
4, Person2, Food1, 12/05/2014 11:01:20
5, Person2, Food2, 12/05/2014 11:01:30
6, Person3, Food1, 12/05/2014 11:01:20
7, Person3, Food1, 12/05/2014 11:02:15
8, Person4, Food2, 12/05/2014 11:01:20
9, Person4, Food2, 12/05/2014 11:02:45
10, Person4, Food2, 12/05/2014 11:02:20
11, Person4, Food2, 12/05/2014 11:06:45
12, Person5, Food1, 12/05/2014 11:01:20
13, Person5, Food1, 12/05/2014 11:03:20
]
;

left join(Data)
Load
     
Ref,
     
Person & Food as KEY
Resident Data ;

left join(Data)
Load
     
Person & Food as KEY,
     
Count (Distinct Ref & Transaction_Time) as COUNT_OF_RECORDS,
     
Min(Transaction_Time) as EARLIEST_TIME,
     
Max(Transaction_Time) as LATEST_TIME,
     
Round(((Max(Transaction_Time)-Min(Transaction_Time))*86400),1) as TIME_DIFFERENCE
Resident Data
Group By Person & Food ;

left join(Data)
Load
     
Ref,
     
If(Transaction_Time=EARLIEST_TIME,'Correct',
     
If(COUNT_OF_RECORDS>1 and Transaction_Time>EARLIEST_TIME,'Duplicate',
      'Check'))
as DUPLICATE_IDENTIFIER
Resident Data;

left join(Data)
Load
Ref,
     
If(TIME_DIFFERENCE=0,'Correct',
     
If(TIME_DIFFERENCE>0 and TIME_DIFFERENCE<=30,'0-30 second gap',
     
If(TIME_DIFFERENCE>30 and TIME_DIFFERENCE<=60,'30-60 second gap',
     
If(TIME_DIFFERENCE>60 and TIME_DIFFERENCE<=90,'60-90 second gap',
     
If(TIME_DIFFERENCE>90 and TIME_DIFFERENCE<=120,'90-120 second gap',
     
If(TIME_DIFFERENCE>120,'> 2 minute gap')))))) as RESULT_FLAG
Resident Data;

Hope that helps.

Kind regards

Steve

cristian_av
Creator III
Creator III
Author

Thank you Steve!

It worked perfect

Not applicable

Hi Cristian

You're welcome.  Glad it helped.  Thanks for marking my answer as correct.

Steve