xyz_1011
Partner - Creator

Sequential LOAD

Hi all,

I have a table with support-case data that has >100M records. To optimize processing time, I would like to sequentially LOAD and process this table in chunks of Incident_IDs. One Incident_ID can have multiple records in this table, so I am aiming for an approach that loads the first 1M IDs, then the second 1M IDs, the third 1M IDs, etc.

Any good approach on how to do this?

Many thanks in advance!

4 Replies
marcus_sommer

In your case the ID itself isn't unique, which means it isn't usable within advanced load processes. Maybe there are sub-IDs or another unique marker that could be combined with the ID into a unique key per record.

Easier than the above may be to slice the data not per ID but against a period field, maybe a YYYYMM. It's simpler to create, to loop through, and to load from. That this usually isn't a perfect incremental approach is in most scenarios no significant disadvantage; the benefits of simple handling are more important.
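A minimal sketch of that period loop, assuming the source is a QVD with a YYYYMM field called Period (file and field names here are illustrative, not from the thread):

For Each vPeriod in 202401, 202402, 202403   // extend the list as needed
    Cases:
    LOAD *
    FROM [lib://Data/SupportCases.qvd] (qvd)
    WHERE Period = $(vPeriod);

    // ... process the slice, e.g. transform and STORE it ...

    DROP TABLE Cases;
Next vPeriod

Note that a plain WHERE clause forces an unoptimized QVD read; a single-field Where Exists() is the only filter that keeps a QVD load optimized.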

xyz_1011
Partner - Creator
Author

Hi @marcus_sommer, thanks for the reply. Yes, correct, the IDs are not unique. I cannot use a time dimension either, as I need the full set of data for every given ID. I wonder if I could LOAD the distinct IDs into a reference table and then LOAD from my raw table with a where exists() clause in which I address the first 1M records, then the second and so on... have to think about it.
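A rough sketch of what that could look like, assuming the raw data sits in a QVD and the ID field is Incident_ID (file and field names are placeholders). The distinct IDs are loaded once, then each pass builds a 1M-ID helper table and pulls the matching records with a single-field Where Exists(), which keeps the QVD read optimized:

AllIDs:
LOAD Distinct Incident_ID as AllID
FROM [lib://Data/SupportCases.qvd] (qvd);

LET vChunkSize = 1000000;
LET vChunks = Ceil(NoOfRows('AllIDs') / $(vChunkSize));

For vChunk = 0 To $(vChunks) - 1
    // Helper table holding only this pass's IDs, named Incident_ID
    // so the single-argument Exists() below matches against it
    ChunkIDs:
    LOAD AllID as Incident_ID
    RESIDENT AllIDs
    WHERE RecNo() >  $(vChunk) * $(vChunkSize)
      AND RecNo() <= ($(vChunk) + 1) * $(vChunkSize);

    Cases:
    LOAD *
    FROM [lib://Data/SupportCases.qvd] (qvd)
    WHERE Exists(Incident_ID);

    // ... process the slice, e.g. STORE it away ...

    // Dropping both tables clears Incident_ID from the symbol table,
    // so the next pass's Exists() starts from a clean slate
    DROP TABLES Cases, ChunkIDs;
Next vChunk

DROP TABLE AllIDs;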

marcus_sommer

I don't think I would try to slice the data into fixed subsets of a million records, because it will probably cost some extra effort and performance.

What about clustering the IDs themselves? Quite often they are just numeric and might be sliced with something like class(ID, 10000000). And in alphanumeric IDs some logic is often encoded, for channels, countries, whatever, which might be usable for such slicing.
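For the numeric case, a small sketch of that clustering idea (file and field names are illustrative). Floor(ID / interval) produces the same bucketing as class() but is easier to compare against a loop counter:

LET vBucketSize = 10000000;

// Peek at the highest ID to know how many buckets to loop over
MaxID:
LOAD Max(Incident_ID) as MaxID
FROM [lib://Data/SupportCases.qvd] (qvd);
LET vMaxBucket = Floor(Peek('MaxID', 0, 'MaxID') / $(vBucketSize));
DROP TABLE MaxID;

For vBucket = 0 To $(vMaxBucket)
    Cases:
    LOAD *
    FROM [lib://Data/SupportCases.qvd] (qvd)
    WHERE Floor(Incident_ID / $(vBucketSize)) = $(vBucket);

    // ... process the slice ...

    DROP TABLE Cases;
Next vBucket

The buckets will be uneven if the IDs aren't evenly distributed, but the script stays very simple.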

Or
MVP

I don't really follow why this would improve performance, but that said, depending on how IDs work, you can probably get away with chunking the table into uneven segments that still eventually go through the entire thing. For example, if IDs are numeric, you could loop through 00 to 99 and pull all IDs matching *XX. If they're alphanumeric, you can loop through letters, etc.
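A sketch of that suffix loop for numeric IDs (names are illustrative). Every record for a given ID always lands in the same pass, because the filter depends only on the ID itself:

For vSuffix = 0 To 99
    Cases:
    LOAD *
    FROM [lib://Data/SupportCases.qvd] (qvd)
    WHERE Mod(Incident_ID, 100) = $(vSuffix);   // IDs ending in this two-digit suffix

    // ... process the slice ...

    DROP TABLE Cases;
Next vSuffix

For alphanumeric IDs, the same structure works with a For Each over the letters and something like WHERE Right(Incident_ID, 1) = '$(vChar)'.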