This article describes possible designs for backfill patterns. A backfill sync is a process that syncs historical data from a source to a target.
Long-running data syncs in Blendr
Record processing blends can run into issues when they are used to sync a large number of records.
We usually see this in contact sync use cases where a blend syncs contacts between two CRM systems. During its first run, the blend syncs all historical contact information. In subsequent runs, it only syncs new and updated contacts.
In cases where an account has many contacts (more than 1 million), this first run can cause challenges:
It can take longer than the maximum blend run duration
While this run is being executed, other blends in a bundle are blocked
These challenges can be overcome by building backfill blends. Instead of syncing all data in the first run, a backfill blend processes only a small batch of the total data in each run and reruns itself when that batch is finished. When all data is synced, the blend stops rerunning.
By splitting the total number of records over multiple blend runs, the sync process can run longer than a single blend run allows. And depending on the mechanism used to rerun the blend, it can be possible to run other blends between the runs of the backfill blend.
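The core of the pattern is a run that syncs one batch, remembers where it stopped, and signals whether another run is needed. The sketch below illustrates this in Python; the Source, Target, and StateStore classes are hypothetical in-memory stand-ins for the connectors and Data Store a real blend would use.

```python
# Minimal sketch of the backfill pattern. All classes here are hypothetical
# stand-ins: a real blend would use connector actions and the Data Store.

BATCH_SIZE = 3

class Source:
    def __init__(self, records):
        self.records = records  # assumed sorted by id

    def fetch(self, after, limit):
        remaining = [r for r in self.records if after is None or r["id"] > after]
        return remaining[:limit]

class Target:
    def __init__(self):
        self.synced = []

    def upsert(self, record):
        self.synced.append(record)

def run_once(source, target, state):
    """One blend run: sync a single batch and advance the cursor."""
    cursor = state.get("backfill_cursor")  # None on the very first run
    batch = source.fetch(after=cursor, limit=BATCH_SIZE)
    if not batch:
        state.pop("backfill_cursor", None)  # backfill finished
        return False
    for record in batch:
        target.upsert(record)
    # Store the last synced id so the next run continues from here.
    state["backfill_cursor"] = batch[-1]["id"]
    return True  # more data may remain: rerun the blend

source = Source([{"id": i} for i in range(1, 8)])  # 7 records
target = Target()
state = {}  # stand-in for the Data Store
runs = 0
while run_once(source, target, state):
    runs += 1
print(runs, len(target.synced))  # 7 records synced across 3 batch runs
```

Whether `run_once` is invoked by a self-trigger or by a schedule is exactly the design choice the two approaches below differ on.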
Currently, there are two approaches to rerunning blends: using a triggered blend that triggers itself, or using a scheduled blend.
A triggered blend triggers itself in order to perform the backfill operation. To specify which portion of the data should be synced, the blend needs to keep track of a field from the record that was synced last in its previous run (an id, an updated_at timestamp, or another field). It can do this by storing the value as a parameter in the Data Store or by sending it in the payload when it retriggers itself.
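The payload variant can be sketched as follows. Here `trigger_blend()` is a hypothetical helper standing in for the platform's retrigger mechanism (in practice, an HTTP call to the blend's own trigger URL), and the cursor travels forward inside the payload.

```python
# Sketch of the self-triggering variant. trigger_blend() and the record
# list are hypothetical; the point is that the cursor rides in the payload.
import json

triggered_payloads = []  # stand-in for the retrigger queue

def trigger_blend(payload):
    triggered_payloads.append(json.dumps(payload))

def handle_trigger(payload, records, batch_size=2):
    cursor = payload.get("cursor")  # last id synced by the previous run
    batch = [r for r in records if cursor is None or r > cursor][:batch_size]
    if not batch:
        return "done"
    # ... sync the batch to the target system here ...
    trigger_blend({"cursor": batch[-1]})  # pass the cursor forward
    return "retriggered"

records = [1, 2, 3, 4, 5]
result = handle_trigger({}, records)  # first run: empty payload
while result == "retriggered":
    payload = json.loads(triggered_payloads.pop())
    result = handle_trigger(payload, records)
print(result)  # -> done
```

Because each run hands the cursor directly to the next one, no external state lookup is needed; the trade-off is that a failed run breaks the chain, as described below.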
This approach is faster than working with scheduled blends. But if a run fails, the retriggering chain is interrupted. And when used in a bundle, it only allows webhook blends to run between the triggered runs; other blends must wait until the initial sync is finished.
A scheduled blend is executed according to a schedule. As with the triggered approach, the blend needs to track which portion of the data should be synced. A scheduled blend can't receive a payload, so it has to store a "state" parameter in the Data Store instead.
This approach is slower than working with triggered blends. But a failed run won't interrupt the sync, and the blend can even be set to retry a failed batch a number of times before moving on to the next batch.
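The scheduled variant, including a bounded retry of a failed batch, can be sketched like this. Every name is hypothetical: `state` stands in for the Data Store, each call to `scheduled_run()` represents one scheduled execution, and the retry limit is an illustrative choice.

```python
# Sketch of the scheduled variant with a bounded retry. The cursor only
# advances after a successful sync, so a failed batch is retried on the
# next scheduled run instead of being skipped.

MAX_RETRIES = 3  # illustrative "X times" limit

def scheduled_run(state, fetch_batch, sync_batch):
    cursor = state.get("cursor")
    batch = fetch_batch(cursor)
    if not batch:
        return "done"
    try:
        sync_batch(batch)
    except Exception:
        # Leave the cursor untouched; the same batch is retried next run.
        state["failures"] = state.get("failures", 0) + 1
        return "aborted" if state["failures"] >= MAX_RETRIES else "retry"
    state["cursor"] = batch[-1]
    state["failures"] = 0  # reset the retry counter after a success
    return "continue"

records = list(range(10))

def fetch(cursor, size=4):
    return [r for r in records if cursor is None or r > cursor][:size]

synced = []
calls = {"n": 0}

def sync(batch):
    calls["n"] += 1
    if calls["n"] == 2:  # simulate one transient failure
        raise RuntimeError("transient API error")
    synced.extend(batch)

state = {}  # stand-in for the Data Store
outcomes = []
while True:
    outcome = scheduled_run(state, fetch, sync)
    outcomes.append(outcome)
    if outcome in ("done", "aborted"):
        break
print(outcomes)  # the failed batch is retried once, then the sync completes
```

Keeping the cursor update and the failure counter in the same state record is what makes the retry safe: a crash between runs can at worst repeat an already-synced batch, which an upsert-style write absorbs.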