Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 
melissapluke
Partner - Creator
Partner - Creator

Loading only the necessary data in script

Hi all,

I get a large daily file with no date stamp in it. It has 20 fields in it and roughly 500,000 lines everyday. It is an extract of the state of the data for that particular date in time.

Example:

ID            Total Remaining                    Area

1               $100                                          Site 1
2               $75                                            Site 1
3               $60                                            Site 2

Tomorrow I might get another file that looks like this:

ID            Total Remaining                    Area Owed

1               $100                                          Site 1
2               $40                                            Site 1
3               $20                                            Site 2

Where the information for ID 1 did not change, and the "Total Remaining" changed for the other 2 IDs.
The ID field is an identifier, all other fields are descriptive of the ID and can change. 

I'd like to create a trend overtime, so I understand that I could just add a made up field each day that is Date() as "Loaded Date" to trend it. However, because the files are so large and I need to load ~5 years worth of daily files, it gets extremely large. I was wondering if there was a way to somehow only load data if it is different, but if it was the same, somehow mark that it is the same as the day before for the "loaded date". Or if there was another solution to reducing the size of the load, I'd love to hear it.

 

Labels (1)
9 Replies
jheasley
Luminary Alumni
Luminary Alumni

Hello! It would be a little involved at the beginning, but you could create a QVD of all of the initial files, and create an ID out of ID&'-'&Total Remaining&'-'&Area.  once this is created, you can load this qvd first each day, then load the remaining files with a "Where Not Exists(NEW ID FIELD) in the load statement.  this will bring in only values that are not yet represented. This could also incorporate your add date if needed. Once its all loaded, you overwrite the original big qvd for the next day's reload. 

melissapluke
Partner - Creator
Partner - Creator
Author

Would I still be able to trend via that load date this way? 

melissapluke
Partner - Creator
Partner - Creator
Author

I don't understand how that would accommodate the date without creating full separate instances of things?

JordyWegman
Partner - Master
Partner - Master

Hi Melissa,

Do you really need it to be on ID level? Or can you group it on site level? This would reduce the amount of rows sufficiently.

Jordy

Climber

Work smarter, not harder
melissapluke
Partner - Creator
Partner - Creator
Author

Hi Jordy,

I unfortunately do really need it on an ID level.

JordyWegman
Partner - Master
Partner - Master

Hi Melissa,

That means that you need all the data, so you can't aggregate. Another solution would be to replace all the values in your columns with integers and create dimensions.

Example:

Old:
ID            Total Remaining                    Area Owed
1               $100                                          Site 1
2               $40                                            Site 1
3               $20                                            Site 2

Turn into:

ID            Total Remaining                    %KeyArea
1               $100                                         1
2               $40                                          1
3               $20                                          2

AreaTable
Load
  %KeyArea
  Area Owed
From [YourSource]  (qvd)
;

This way you reduce the amount of characters in your data and will make it faster.

Jordy

Climber

Work smarter, not harder
melissapluke
Partner - Creator
Partner - Creator
Author

Thanks, Jordy. I'll try that today!

DavidM
Partner - Creator II
Partner - Creator II

@jheasleythat way the data would not load if the value ever changes back to what it was in past.

eg:

Day1

ID Total Remaining Area

1           100                      1

Day2

ID Total Remaining Area

1            80                      1

Day3

ID Total Remaining Area

1           100                      1

Day3 data would not load because there would already be id 1-100-1

 

jheasley
Luminary Alumni
Luminary Alumni

Agreed - but it would if you included the date as part of the ID.