Hi guys,
It's not easy to describe the issue, but I'll do my best to keep it simple:
Our data system drops a .csv file into a folder every day with product information such as product reference no., price and stock amount. I would like to pick that file up every day and append it to the previous day's data, building up historical data that way. For example:
The 20170601 csv file:
ProductRef | Price | Amount |
---|---|---|
10010001 | €100.00 | 50 |
10010002 | €49.99 | 23 |
10020004 | €150.00 | 11 |
The 20170602 csv file:
ProductRef | Price | Amount |
---|---|---|
10010001 | €100.00 | 48 |
10010002 | €49.99 | 23 |
10020004 | €145.00 | 11 |
The 20170603 csv file:
ProductRef | Price | Amount |
---|---|---|
10010001 | €100.00 | 47 |
10010002 | €49.99 | 22 |
10020004 | €145.00 | 11 |
The easiest solution would be to add a date stamp and to concatenate the files:
Date | ProductRef | Price | Amount |
---|---|---|---|
20170601 | 10010001 | €100.00 | 50 |
20170601 | 10010002 | €49.99 | 23 |
20170601 | 10020004 | €150.00 | 11 |
20170602 | 10010001 | €100.00 | 48 |
20170602 | 10010002 | €49.99 | 23 |
20170602 | 10020004 | €145.00 | 11 |
20170603 | 10010001 | €100.00 | 47 |
20170603 | 10010002 | €49.99 | 22 |
20170603 | 10020004 | €145.00 | 11 |
However, the daily file is about 23 columns wide and 130,000 rows long. Appending/concatenating it every single day would lead to a lot of rows that are duplicates except for the date field, and to a huge file size. To prevent this, the ideal end result would look something like this:
From | To | ProductRef | Price | Amount |
---|---|---|---|---|
20170601 | 20170601 | 10010001 | €100.00 | 50 |
20170602 | 20170602 | 10010001 | €100.00 | 48 |
20170603 | 20170603 | 10010001 | €100.00 | 47 |
20170601 | 20170602 | 10010002 | €49.99 | 23 |
20170603 | 20170603 | 10010002 | €49.99 | 22 |
20170601 | 20170601 | 10020004 | €150.00 | 11 |
20170602 | 20170603 | 10020004 | €145.00 | 11 |
Extra tricky: a product could have an initial stock of 50 items on 20170601, slowly be depleted (stock = 0 at 20171231) and then be replenished to a stock of 50 on 20180101. In that case a new record needs to be inserted for 20180101 (not an edit of the existing 20170601 record's "To" date), even though all fields except the date would be the same.
I've been looking into using hash functions to add a unique key to accomplish this, and although that has helped me get a bit closer to a solution, I can't seem to wrap my head around this issue. Is this at all feasible to begin with? And if so, could you point me in the right direction?
All help is very much appreciated!
Cheers!
Hi Wouter,
I think you'll need to decide what you want to get out of the data that you're going to store. You'll need to decide which of your 23 columns have data you want to keep, and hopefully you can disregard some of them. Then, of the columns you can't discard, maybe there are some that you can summarise, that is, aggregate over time periods. Some data you might need to track daily; some you might aggregate over weeks or longer.
So it's possible that instead of building up one big table the best solution would be to create a number of tables, some aggregated over weeks, some aggregated over months and some not aggregated at all.
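For example, a weekly aggregate of the stock column might be built along these lines (a rough sketch, assuming a concatenated daily table like the AllDays one sketched earlier in the thread; the aggregation functions are just examples):

// Rough sketch: reduce the daily stock figures to one row per product per week.
TmpWeekly:
LOAD
    WeekStart(Date) as Week,
    ProductRef,
    Amount
RESIDENT AllDays;

WeeklyStock:
LOAD
    Week,
    ProductRef,
    Avg(Amount) as AvgAmount,
    Min(Amount) as MinAmount
RESIDENT TmpWeekly
GROUP BY Week, ProductRef;

DROP TABLE TmpWeekly;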
You may even only add records to your table(s) that show a change in a value from the previous day.
Sounds like fun!
Good luck
Andrew
Perhaps like this:
MyTable:
LOAD
min(Date) as From,
max(Date) as To,
ProductRef,
Price,
Amount
FROM
GROUP BY
ProductRef,
Price,
Amount
;
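One caveat with a plain GROUP BY: in the "extra tricky" case from the question, a ProductRef/Price/Amount combination that reappears after a gap (stock back at 50 on 20180101) would be merged with the earlier period, because Min/Max run over all matching rows. A way around that is to number blocks of consecutive unchanged rows first and include that block id in the GROUP BY. A rough sketch, again assuming a concatenated daily table like AllDays above (the Peek-based comparison and field names are illustrative, not a tested implementation):

// Rough sketch: number blocks of consecutive days on which nothing changed,
// then collapse each block into one From/To record. Assumes a concatenated
// daily table AllDays with a Date field, as sketched earlier in the thread.
Blocks:
LOAD
    *,
    If(ProductRef = Peek('ProductRef')
         and Price = Peek('Price')
         and Amount = Peek('Amount'),
       Peek('BlockId'),                          // unchanged from the previous day: same block
       RangeSum(Peek('BlockId'), 1)) as BlockId  // any change (or a new product): start a new block
RESIDENT AllDays
ORDER BY ProductRef, Date;

DROP TABLE AllDays;

MyTable:
LOAD
    Date(Min(Date)) as From,
    Date(Max(Date)) as To,
    ProductRef,
    Price,
    Amount
RESIDENT Blocks
GROUP BY BlockId, ProductRef, Price, Amount;

DROP TABLE Blocks;

Because a block ends as soon as any of the grouped values changes, the replenished stock on 20180101 starts a new record instead of stretching the 20170601 one.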
Hi guys,
Thanks for your help. I have decided to go for Andrew's approach; it was the fastest and easiest to accomplish, and the resulting .qvd file is not as big as I'd expected (17 MB for 10 days of data), with the daily increase shrinking over time (+3.5 MB in the first 3 days, +1.5 MB in the last 2 days).