DMTCP
Distributed MultiThreaded CheckPointing (DMTCP) checkpoints a running program on Linux with no modifications to the program or the OS. The program can later be restarted from a checkpoint.
Modules
To access dmtcp, load a dmtcp module. For example:
module load dmtcp/3.0.0
Example Programs
Here is a dummy example that prints increasing integers, one every 2 seconds. Copy it to a text file on Oscar and name it dmtcp_serial.c
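A minimal sketch of such a program, given as a shell snippet that writes dmtcp_serial.c to the current directory. The C source is an illustrative reconstruction from the description above (increasing integers, one every 2 seconds); the variable name and output format are assumptions.

```shell
# Write the example C source to dmtcp_serial.c in the current directory.
cat > dmtcp_serial.c <<'EOF'
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    unsigned long count = 1;

    /* Print an increasing integer every 2 seconds, forever. */
    while (1) {
        printf("%lu\n", count++);
        fflush(stdout);   /* make each line visible immediately */
        sleep(2);
    }
    return 0;
}
EOF
```

The explicit fflush call keeps the output current even when stdout is redirected to a job output file.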
Compile this program with a C compiler, for example by running gcc dmtcp_serial.c -o dmtcp_serial
You should now have these files in your directory:
dmtcp_serial
dmtcp_serial.c
Basic Usage
Launch a Program
The dmtcp_launch command launches a program and checkpoints it automatically. To set the checkpoint interval in seconds, add the "-i num_seconds" option to the dmtcp_launch command.
Example: the following command launches the program dmtcp_serial and checkpoints it every 8 seconds.
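A minimal sketch of such a launch, assuming the dmtcp module is loaded and the compiled dmtcp_serial is in the current directory. The -i option is the one described above; the check for dmtcp_launch on the PATH is an added convenience, not part of DMTCP.

```shell
interval=8   # seconds between automatic checkpoints

# Launch under DMTCP only if the dmtcp_launch command is available.
if command -v dmtcp_launch >/dev/null 2>&1; then
    dmtcp_launch -i "$interval" ./dmtcp_serial
else
    echo "dmtcp_launch not found; run 'module load dmtcp/3.0.0' first" >&2
fi
```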
This run creates a checkpoint file (ckpt_dmtcp_serial_24f183c2194a7dc4-40000-42af86bb59385.dmtcp), which can be used to restart the program.
Restart from a Checkpoint
The dmtcp_restart command restarts a program from a checkpoint and continues to checkpoint it automatically. To set the checkpoint interval in seconds, add the "-i num_seconds" option to the dmtcp_restart command.
Example: the following command restarts the dmtcp_serial program from a checkpoint and checkpoints it every 12 seconds.
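A minimal sketch of such a restart, assuming the dmtcp module is loaded. The checkpoint file name is the one quoted earlier on this page; your own run will produce a different name, so substitute it. The existence check is an added safeguard.

```shell
interval=12   # seconds between automatic checkpoints after the restart
ckpt=ckpt_dmtcp_serial_24f183c2194a7dc4-40000-42af86bb59385.dmtcp

# Restart only if the checkpoint file is present in the current directory.
if [ -f "$ckpt" ]; then
    dmtcp_restart -i "$interval" "$ckpt"
else
    echo "checkpoint file $ckpt not found" >&2
fi
```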
Batch Jobs
It is desirable for a single job script to be able to
launch a program if there are no checkpoints, or
automatically restart from a checkpoint if one or more checkpoints exist
The job script dmtcp_serial_job.sh is an example that achieves this goal:
If there is no checkpoint in the current directory, it launches the program dmtcp_serial
If one or more checkpoints exist in the current directory, it restarts the program dmtcp_serial from the latest checkpoint
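The two cases above can be sketched as a job script along these lines. This is an illustrative sketch, not necessarily the site's own dmtcp_serial_job.sh: the #SBATCH directives assume a Slurm scheduler, the time limit and interval are hypothetical values, and the helper names latest_checkpoint and launch_or_restart are made up for this example.

```shell
#!/bin/bash
#SBATCH -n 1
#SBATCH -t 5:00
#SBATCH -J dmtcp_serial

checkpoint_interval=8   # seconds between checkpoints (hypothetical value)

# Print the newest ckpt_*.dmtcp file in the current directory, or nothing.
latest_checkpoint() {
    local latest="" f
    for f in ckpt_*.dmtcp; do
        [ -e "$f" ] || continue            # the glob matched nothing
        if [ -z "$latest" ] || [ "$f" -nt "$latest" ]; then
            latest=$f
        fi
    done
    printf '%s\n' "$latest"
}

# Launch the program on the first run; restart it on later runs.
launch_or_restart() {
    local ckpt
    ckpt=$(latest_checkpoint)
    if [ -z "$ckpt" ]; then
        dmtcp_launch -i "$checkpoint_interval" ./dmtcp_serial
    else
        dmtcp_restart -i "$checkpoint_interval" "$ckpt"
    fi
}

# Only act inside a Slurm job, so the file can also be sourced for testing.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    module load dmtcp/3.0.0
    launch_or_restart
fi
```

Picking the newest file by modification time means repeated submissions always continue from the most recent state, even if older checkpoint files are left in the directory.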
First Submission - Launch a Program
Submit dmtcp_serial_job.sh and wait for the job to run until it times out. The beginning and end of the job output file are shown below.
Later Submissions - Restart from a Checkpoint
Submit dmtcp_serial_job.sh again and wait for the job to run until it times out. The beginning of the job output file, shown below, demonstrates that the job restarts from the checkpoint of the previous job.
Job Array
The following example script
creates a subdirectory for each task of a job array and saves each task's checkpoints in that task's own subdirectory when the job script is submitted for the first time
restarts each task from the checkpoint in its subdirectory when the job script is submitted for the second time or later
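A minimal sketch of such a job array script, under the same assumptions as before: Slurm directives, a hypothetical array range and interval, and an illustrative helper named run_task. The subdirectory naming scheme task_N is also an assumption.

```shell
#!/bin/bash
#SBATCH -n 1
#SBATCH -t 5:00
#SBATCH --array=1-4
#SBATCH -J dmtcp_array

checkpoint_interval=8   # seconds between checkpoints (hypothetical value)

# Run one array task in its own subdirectory task_<id>:
# launch the program if the directory has no checkpoint,
# otherwise restart from the newest checkpoint found there.
run_task() {
    local task_id=$1 program=$2 task_dir ckpt="" f
    task_dir="task_${task_id}"
    mkdir -p "$task_dir"
    cd "$task_dir" || return 1
    for f in ckpt_*.dmtcp; do
        [ -e "$f" ] || continue            # the glob matched nothing
        if [ -z "$ckpt" ] || [ "$f" -nt "$ckpt" ]; then
            ckpt=$f
        fi
    done
    if [ -z "$ckpt" ]; then
        dmtcp_launch -i "$checkpoint_interval" "$program"
    else
        dmtcp_restart -i "$checkpoint_interval" "$ckpt"
    fi
}

# Only act inside a Slurm array task, so the file can be sourced for testing.
if [ -n "${SLURM_ARRAY_TASK_ID:-}" ]; then
    module load dmtcp/3.0.0
    run_task "$SLURM_ARRAY_TASK_ID" "$PWD/dmtcp_serial"
fi
```

The program is passed as an absolute path because run_task changes into the task subdirectory before launching, so checkpoint files land in that subdirectory rather than in the submission directory.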