- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Loading only the necessary data in script
Hi all,
I get a large daily file with no date stamp in it. It has 20 fields in it and roughly 500,000 lines everyday. It is an extract of the state of the data for that particular date in time.
Example:
ID Total Remaining Area
1 $100 Site 1
2 $75 Site 1
3 $60 Site 2
Tomorrow I might get another file that looks like this:
ID Total Remaining Area Owed
1 $100 Site 1
2 $40 Site 1
3 $20 Site 2
Where the information for ID 1 did not change, and the "Total Remaining" changed for the other 2 IDs.
The ID field is an identifier, all other fields are descriptive of the ID and can change.
I'd like to create a trend overtime, so I understand that I could just add a made up field each day that is Date() as "Loaded Date" to trend it. However, because the files are so large and I need to load ~5 years worth of daily files, it gets extremely large. I was wondering if there was a way to somehow only load data if it is different, but if it was the same, somehow mark that it is the same as the day before for the "loaded date". Or if there was another solution to reducing the size of the load, I'd love to hear it.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hello! It would be a little involved at the beginning, but you could create a QVD of all of the initial files, and create an ID out of ID&'-'&Total Remaining&'-'&Area. once this is created, you can load this qvd first each day, then load the remaining files with a "Where Not Exists(NEW ID FIELD) in the load statement. this will bring in only values that are not yet represented. This could also incorporate your add date if needed. Once its all loaded, you overwrite the original big qvd for the next day's reload.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Would I still be able to trend via that load date this way?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I don't understand how that would accommodate the date without creating full separate instances of things?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi Melissa,
Do you really need it to be on ID level? Or can you group it on site level? This would reduce the amount of rows sufficiently.
Jordy
Climber
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi Jordy,
I unfortunately do really need it on an ID level.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi Melissa,
That means that you need all the data, so you can't aggregate. Another solution would be to replace all the values in your columns with integers and create dimensions.
Example:
Old:
ID Total Remaining Area Owed
1 $100 Site 1
2 $40 Site 1
3 $20 Site 2
Turn into:
ID Total Remaining %KeyArea
1 $100 1
2 $40 1
3 $20 2
AreaTable
Load
%KeyArea
Area Owed
From [YourSource] (qvd)
;
This way you reduce the amount of characters in your data and will make it faster.
Jordy
Climber
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Thanks, Jordy. I'll try that today!
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
@jheasleythat way the data would not load if the value ever changes back to what it was in past.
eg:
Day1
ID Total Remaining Area
1 100 1
Day2
ID Total Remaining Area
1 80 1
Day3
ID Total Remaining Area
1 100 1
Day3 data would not load because there would already be id 1-100-1
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Agreed - but it would if you included the date as part of the ID.