We are in the process of migrating data from Oracle to Snowflake. Part of this process we have tables with lot of data (Large Tables), which we have split into a dedicated tasks (~11 tables) and we have another task (with ~100 tables). Our tasks are taking hours to perform a full load in both the tasks. I need an expert advise how to improve the performance. Additionally we have enabled parallel load of the tasks with 11 big tables and it is taking hours for each table to load sequentially. Could you please help me fix the performance of the full reload? Are there any parameters that I can configure within my endpoints/tasks to enhance the performance?
for full load.
1. check the snowflake endpoint / max file size , is this set for 1000MB or 2000MB ?
2. if you are using parallel loading, what is full load tuning for : Maximum number of tables: ?
example, if you have Maximum number of tables: 5,
and segment loading got 10 segment , this task can only open 5 segment because of the max number of tables is set at 5.
3. what is your full load tuning : Commit rate during full load: try to increase this .
First of all this is a Replicate question, not an Enterprise Manager question as the EM is only used to trigger the tastk start but has no further control. Please consider moving this topic, or better still (imho) create a new topic in the Replicate forum with more, more concrete, information.
For example, as @Steve_Nguyen asks, what is the max-file size and full load tuning? Those are easy ones but there are more and it is best to simply provide the (relevant parts) of the JSON - obfuscation any sections which you believe are sensitive and removing highly repetitive sections (like 90 out of 100 tables).
That's the Replicate information we need to try to help. Now what about the data? "a lot of data " is hardly useable information. How many gigabytes? Given the network pipe throughput maximum, what is your back-of-the-enveloppe minimum time to load and how does that relate to the actual?
When you reach the maximum of the network pipe, adding more load streams only hurts - you need enough streams to fill the pipe, but not more. That's likely closer to 5 than to 10, but it could be 20.
Splitting a task into two can create more parallelism but reduces control and increases contention. Those hundred 'less important' tables will appear as important and will slow down the 10 important ones (or visa versa). It may be better, and easier to manage, to create a single tasks and use the per-table priority options within that task.
I haven't looked at the reptask logs for at Snowflake full-load tasks in a while. The CDC tasks have a nice log lines showing the start and end for the network transfers for each CSV file. Can you get that for full-load? May with TRACE option on TARGET_LOAD? Please share the result.
Good luck, and see you in the Replicate Forum!