Skip to main content
Woohoo! Qlik Community has won “Best in Class Community” in the 2024 Khoros Kudos awards!
Announcements
Nov. 20th, Qlik Insider - Lakehouses: Driving the Future of Data & AI - PICK A SESSION
cancel
Showing results for 
Search instead for 
Did you mean: 
sjhussain
Partner - Creator II
Partner - Creator II

Combine daily files and keep the earliest value for each ID

Dear All,

Hope all is well with everyone.

We are working on a Telecom Project and have to combine and save distinct Customer Information from the Customer Detail Files.

Each Day we receive the Customer Details files in csv format. CustomerDetail_YYYYMMDD.csv

ACCOUNTID, NAME, MSISDN, IMSI, STATUS

We filter the YYYYMMDD from the file and make a dimension FILE_DATE MM/DD/YYYY

Each day we get the the Customer Details along with any additions. What we want to do is to keep the earliest Customer Details.

For example -

We receive a file CustomerDetail_20160707.csv with the following information.

FILE_DATE,ACCOUNTID,NAME,MSISDN,IMSI,STATUS

7/7/2016,123,DAVE SMITH,123456789,8900987654321,ACTIVE

next day we receive CustomerDetail_20160708.csv with the following information.

FILE_DATE,ACCOUNTID,NAME,MSISDN,IMSI,STATUS

7/8/2016,123,DAVE SMITH,123456789,8900987654321,ACTIVE

Now on 7/8/2016 we want to have just one record of DAVE SMITH with the FILE_DATE being 7/7/2016

7/7/2016,123,DAVE SMITH,123456789,8900987654321,ACTIVE

Then on 7/9 we receive the following file

FILE_DATE,ACCOUNTID,NAME,MSISDN,IMSI,STATUS

7/9/2016,123,JOHN,123456789,8900987654321,ACTIVE

Notice that the name changed so on 7/9 we would like to have two records


7/7/2016,123,DAVE SMITH,123456789,8900987654321,ACTIVE

7/9/2016,123,JOHN,123456789,8900987654321,ACTIVE

The way we approach it is that everyday we receive the file and then concatenate with the previous file.

We make a KEY of ACCOUNTID & NAME & MSISDN & IMSI & STATUS and then group it to find the count.  If the count is 1 then we know it is unique and we add the new date to it.

Want to find out if there is another way we could approach it? because the customer details is growing to more than 2million records and the grouping is taking long time.

Will appreciate if someone can assist.

Thanks.

5 Replies
swuehl
MVP
MVP

Maybe create a KEY made of ACCOUNTID & NAME in your script and use a WHERE NOT EXISTS(KEY, ACCOUNTID & NAME) clause when appending your new records.

sjhussain
Partner - Creator II
Partner - Creator II
Author

‌Dear Stephen

Thanks for the prompt response.

1. How would i add the new date for each new record?

2. How would i use the where not exists clause? After concatenating thw old and previous files? 

Thanks

swuehl
MVP
MVP

1. Isn't the date part of your input data / file name?

2. Assuming you first load your historic data from QVD, then concatenate your new data, then store all into the QVD, something like

Data:

LOAD * FROM Historic.qvd (qvd);

Let vToday = Date(Today(1), 'YYYYMMDD');

CONCATENATE (Data)

LOAD FILE_DATE,ACCOUNTID,NAME,MSISDN,IMSI,STATUS, ACCOUNTID & NAME as KEY

FROM CustomerDetail_$(vToday).csv

(...)

WHERE NOT EXISTS(KEY, ACCOUNTID & NAME);

STORE Data INTO Historic.qvd (qvd);

sjhussain
Partner - Creator II
Partner - Creator II
Author

‌Stephen

If i use where not exists on key and accountid&name then will it filter the distinct dates as per the requirement.  Meaning if all the fields are the same between two dates then it will take only the min date?

Also can you tell me a bit about where not exists. We have two values in it - key and accountid&name. Howbdoes it work? It looks for the key in the first table (before concat) and then compares it with accountid&name in the second (after concat) ?

Thanks.

swuehl
MVP
MVP

Exists() basically checks the values already loaded into the specified field / symbol table so far (including the current LOAD statement).

So LOAD order of your records is important, you need to load the dates / files in chronological order so that only the first / earliest date is kept.

Exists() can be used with one argument, a field name, then it checks the current field value against the loaded field values so far.

If you use Exists() with two arguments, first denotes the field / symbol table to check, second denotes an expression that creates the values you want to check. You would need this when you are building a key field that does not exists in your input table, only in your output table.