Hi all,
I get a large daily file with no date stamp in it. It has 20 fields and roughly 500,000 lines every day. It is an extract of the state of the data as of that particular date.
Example:
ID | Total Remaining | Area
1  | $100            | Site 1
2  | $75             | Site 1
3  | $60             | Site 2
Tomorrow I might get another file that looks like this:
ID | Total Remaining | Area   | Owed
1  | $100            | Site 1 |
2  | $40             | Site 1 |
3  | $20             | Site 2 |
Here the information for ID 1 did not change, and the "Total Remaining" changed for the other two IDs.
The ID field is an identifier, all other fields are descriptive of the ID and can change.
I'd like to build a trend over time. I understand that I could just add a made-up field each day, Date() as "Loaded Date", to trend on. However, because the files are so large and I need to load ~5 years' worth of daily files, the app gets extremely large. I was wondering if there is a way to only load data when it has changed, and otherwise somehow mark that it is the same as the day before for the "Loaded Date". Or if there is another way to reduce the size of the load, I'd love to hear it.
Hello! It would be a little involved at the beginning, but you could create a QVD of all of the initial files and build a key out of ID & '-' & [Total Remaining] & '-' & Area. Once this is created, you load this QVD first each day, then load the new file with a Where Not Exists([NewKeyField]) in the load statement. This will bring in only rows that are not yet represented, and it could also incorporate your load date if needed. Once it's all loaded, you overwrite the original big QVD, ready for the next day's reload.
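A rough sketch of that pattern in load script (the file names, the key name %RowKey, and the delimiter settings are my placeholders, not from the thread):

```
// Load the accumulated history first; it already contains %RowKey and [Loaded Date].
History:
LOAD * FROM [History.qvd] (qvd);

// Then load today's extract, keeping only key combinations not seen before.
Concatenate (History)
LOAD *,
     ID & '-' & [Total Remaining] & '-' & Area AS %RowKey,
     Today() AS [Loaded Date]
FROM [DailyExtract.csv] (txt, utf8, embedded labels, delimiter is ',')
WHERE NOT Exists(%RowKey, ID & '-' & [Total Remaining] & '-' & Area);

// Overwrite the big QVD so tomorrow's reload starts from here.
STORE History INTO [History.qvd] (qvd);
```

On the very first run there is no History.qvd yet, so you would load all the initial files once and store them with the key before switching to this daily script.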
Would I still be able to trend via that load date this way?
I don't understand how that would accommodate the date without creating full separate copies of everything?
Hi Melissa,
Do you really need it at the ID level? Or can you group it at the site level? That would reduce the number of rows significantly.
Jordy
Climber
Hi Jordy,
I unfortunately do really need it on an ID level.
Hi Melissa,
That means that you need all the data, so you can't aggregate. Another solution would be to replace the long text values in your columns with integer keys and move the text into separate dimension tables.
Example:
Old:
ID | Total Remaining | Area   | Owed
1  | $100            | Site 1 |
2  | $40             | Site 1 |
3  | $20             | Site 2 |
Turn into:
ID | Total Remaining | %KeyArea
1  | $100            | 1
2  | $40             | 1
3  | $20             | 2
AreaTable:
Load
    %KeyArea,
    Area,
    Owed
From [YourSource] (qvd);
This way you reduce the amount of text stored in your data, which makes the app smaller and faster.
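One possible way to generate that integer key in script is AutoNumber() (source path and field names below are illustrative; note that AutoNumber values are only stable within a single reload, so for incremental loads across days you would want to persist the key mapping in a QVD instead):

```
// Fact table: keep the measures, swap the long Area text for an integer key.
Facts:
LOAD
    ID,
    [Total Remaining],
    AutoNumber(Area, 'Area') AS %KeyArea
FROM [DailyExtract.csv] (txt, utf8, embedded labels, delimiter is ',');

// Dimension table: one row per distinct Area, linked by the same key.
AreaTable:
LOAD DISTINCT
    AutoNumber(Area, 'Area') AS %KeyArea,
    Area
FROM [DailyExtract.csv] (txt, utf8, embedded labels, delimiter is ',');
```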
Jordy
Climber
Thanks, Jordy. I'll try that today!
@jheasley That way the data would not load if the value ever changes back to what it was in the past.
e.g.:

Day 1:
ID | Total Remaining | Area
1  | 100             | 1

Day 2:
ID | Total Remaining | Area
1  | 80              | 1

Day 3:
ID | Total Remaining | Area
1  | 100             | 1

Day 3's data would not load, because the key 1-100-1 would already exist from Day 1.
Agreed - but it would if you included the date as part of the ID.
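A sketch of that variant, folding the date into the key (file and field names are placeholders; assumes an accumulated History QVD as in the earlier discussion):

```
// Load the accumulated history, which already carries the date-prefixed key.
History:
LOAD * FROM [History.qvd] (qvd);

// Prefix the key with today's date, so a value that reverts on a
// later day still produces a new key and gets its own row.
Concatenate (History)
LOAD *,
     Today() & '|' & ID & '-' & [Total Remaining] & '-' & Area AS %RowKey,
     Today() AS [Loaded Date]
FROM [DailyExtract.csv] (txt, utf8, embedded labels, delimiter is ',')
WHERE NOT Exists(%RowKey, Today() & '|' & ID & '-' & [Total Remaining] & '-' & Area);
```

One trade-off worth noting: since today's date can never already be in the history, this loads every row every day, and the Exists filter then only guards against accidentally loading the same day's file twice.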