Checkpoint and Continue ======================= Marshalling long running tasks on unreliable or opportunistic compute resources demands defensive programming to ensure that results are not lost. Imagine an app running for the duration of a week only to be pre-empted on a shared resource, resulting in the loss of one week of compute time. This can be avoided with application with checkpointing and restart functionality. For the sake of simplicity we use a script which generates numbers from the fibonacci series. The script, takes a checkpoint file and also generates the output file which functions as the checkpoint for the next invocation of the same app. Application logic: ------------------ Here's the core logic of the application [source,bash] ----- (checkpoint_out) fibo ( checkpoint_in, terminate_condition, max_steps ) { if check_point_in is empty start series from 0 1 else while (terminate_condition is not valid) generate max_steps of the fibonacci series from the last two items in the checkpoint_in } ----- Swift script logic: ------------------- Here's the core logic for the swift implementation for the checkpoint and continue case: [source, bash] ----- // Set the initial checkpoint file to an empty file checkpoint[0] = empty_file; iterate steps { // Place the checkpoint result into an array, and use the checkpoint from // previous step to run the app. checkpoint[steps+1] = fibonacci ( checkpoint[steps] ); } until ( checkpoint[steps] == "done" ); ----- Config options: --------------- We use two specific config options in cases where errors are expected: 1. lazyErrors : Setting this config option to "true" ensures that swift does not terminate the entire run, upon encountering any single app failure, which is the default behavior. lazyErrors ensure that the swift run will be terminated only when no progress can be made without results from a failed app. 2. executionRetries : This defines the number of attempts swift will make to retry an app when a failure is detected. Takes an integer. executionRetries together with lazyErrors, allow swift to tolerate failures and attempt retries to ensure that the app is reattempted to rule out failures that cannot be handled ,like corrupt input. For eg, if an app was pre-empted while running on a shared resource, setting lazyErrors and executionRetries allows the app to be resubmitted with the last completed checkpoint, which gives the app the chance to continue from a known checkpoint. Checkpoint and restart from failures in parallel ------------------------------------------------ The swift script wraps the iterate logic described in the Swift script logic in a foreach loop. This allows several sequencial iterations of the application to be run in parallel. To run N parallel instances of the fibonacci thread do: [source, bash] ----- # To run 3 parallel instances: swift checkpoint_restart_parallel.swift -nsim=3 -----