Distributed MultiThreaded Checkpointing (DMTCP) checkpoints a running program on Linux with no modifications to the program or the OS. The program can later be restarted from a checkpoint.
Modules
To access dmtcp, load a dmtcp module. For example:
module load dmtcp/3.0.0
Example Programs
Here's a dummy example that prints increasing integers every 2 seconds. Copy this to a text file on Oscar and name it dmtcp_serial.c.
The dmtcp_launch command launches a program and automatically checkpoints it. To specify the checkpoint interval in seconds, add the "-i num_seconds" option to the dmtcp_launch command.
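A minimal sketch of such a program, written out with a shell here-document so it can be pasted directly into a terminal. The C source below is an illustration that matches the description (increasing integers, one every 2 seconds), not necessarily the original file. The use of gcc and the output name dmtcp_serial are assumptions.

```shell
# Write a hypothetical dmtcp_serial.c: prints 1, 2, 3, ... with a 2-second pause.
cat > dmtcp_serial.c <<'EOF'
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    unsigned long count = 1;

    while (1) {
        printf("%lu\n", count++);
        fflush(stdout);   /* flush so each number shows up immediately */
        sleep(2);
    }
    return 0;
}
EOF

# Compile it if a C compiler is available (gcc is an assumption).
if command -v gcc >/dev/null 2>&1; then
    gcc -o dmtcp_serial dmtcp_serial.c
fi
```

The program never exits on its own, which makes it convenient for observing checkpoints and restarts.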
Example: the following command launches the program dmtcp_serial and checkpoints every 8 seconds.
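A sketch of that launch command, assuming the compiled binary is ./dmtcp_serial in the current directory and the dmtcp module is already loaded. The variable names are illustrative; the guard only keeps the snippet from failing where DMTCP is not available.

```shell
CKPT_INTERVAL=8          # seconds between automatic checkpoints
PROGRAM=./dmtcp_serial   # assumed location of the compiled example

if command -v dmtcp_launch >/dev/null 2>&1 && [ -x "$PROGRAM" ]; then
    # Launch the program under DMTCP and checkpoint every 8 seconds.
    dmtcp_launch -i "$CKPT_INTERVAL" "$PROGRAM"
else
    echo "dmtcp_launch or $PROGRAM not found; load the dmtcp module and build the example first" >&2
fi
```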
As shown in the example above, a checkpoint file (ckpt_dmtcp_serial_24f183c2194a7dc4-40000-42af86bb59385.dmtcp) is created, and can be used to restart the program.
Restart from a checkpoint
The dmtcp_restart command restarts a program from a checkpoint and also automatically checkpoints it. To specify the checkpoint interval in seconds, add the "-i num_seconds" option to the dmtcp_restart command.
Example: the following command restarts the dmtcp_serial program from a checkpoint, and checkpoints every 12 seconds.
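A sketch of the restart, assuming the checkpoint file sits in the current directory. Rather than typing the full generated file name, this version picks the most recently written ckpt_dmtcp_serial_*.dmtcp file; that selection logic is an illustration, not part of DMTCP.

```shell
CKPT_INTERVAL=12   # seconds between automatic checkpoints after the restart

# Newest checkpoint image for dmtcp_serial in the current directory, if any.
CKPT_FILE=$(ls -t ckpt_dmtcp_serial_*.dmtcp 2>/dev/null | head -n 1)

if [ -n "$CKPT_FILE" ] && command -v dmtcp_restart >/dev/null 2>&1; then
    # Resume from the checkpoint and keep checkpointing every 12 seconds.
    dmtcp_restart -i "$CKPT_INTERVAL" "$CKPT_FILE"
else
    echo "no checkpoint file or dmtcp_restart not found" >&2
fi
```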
Submit dmtcp_serial_job.sh and then wait for the job to run until it times out. The beginning and end of the job output file are shown below.
$ head slurm-5157871.out -n 15
## SLURM PROLOG ###############################################################
## Job ID : 5157871
## Job Name : dmtcp_serial
## Nodelist : node1139
## CPUs : 1
## Mem/CPU : 2800 MB
## Mem/Node : 65536 MB
## Directory : /gpfs/data/ccvstaff/yliu385/Test/dmtcp/serial/batch_job
## Job Started : Wed May 18 09:38:39 EDT 2022
###############################################################################
ls: cannot access ckpt_*.dmtcp: No such file or directory
1
2
3
4
$ tail slurm-5157871.out
147
148
149
150
151
152
153
154
155
slurmstepd: error: *** JOB 5157871 ON node1139 CANCELLED AT 2022-05-18T09:43:58 DUE TO TIME LIMIT ***
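The first line after the prolog (ls: cannot access ckpt_*.dmtcp) suggests that the job script looks for an existing checkpoint before deciding whether to launch or restart. A minimal sketch of a script with that behavior follows. The 5-minute time limit is inferred from the timestamps above; the 8-second interval, the helper name latest_checkpoint, and the location of the binary are assumptions.

```shell
#!/bin/bash
#SBATCH -J dmtcp_serial
#SBATCH -n 1
#SBATCH -t 00:05:00

# Load DMTCP where the module command exists.
command -v module >/dev/null 2>&1 && module load dmtcp/3.0.0

# Hypothetical helper: print the newest checkpoint file in a directory, or nothing.
latest_checkpoint() {
    ls -t "$1"/ckpt_*.dmtcp 2>/dev/null | head -n 1
}

ckpt=$(latest_checkpoint .)

if ! command -v dmtcp_launch >/dev/null 2>&1; then
    echo "dmtcp_launch not found; load the dmtcp module first" >&2
elif [ -z "$ckpt" ]; then
    # First submission: no checkpoint yet, so start the program fresh.
    dmtcp_launch -i 8 ./dmtcp_serial
else
    # Later submissions: continue from the newest checkpoint.
    dmtcp_restart -i 8 "$ckpt"
fi
```

Because the same script handles both cases, it can be resubmitted unchanged after each timeout.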
Later Submissions - Restart from a Checkpoint
Submit dmtcp_serial_job.sh again and then wait for the job to run until it times out. The beginning of the job output file, shown below, demonstrates that the job restarts from the checkpoint of the previous job.
creates a subdirectory for each task of a job array, and then saves a task's checkpoint in the task's own subdirectory when the job script is submitted for the first time
restarts from the checkpoints in the task subdirectories when the job script is submitted for the second time or later
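The two behaviors above can be sketched as a job array script. This is an illustration under stated assumptions: the array range, the task_N directory naming, the 8-second interval, and the binary living one level above each task directory are all hypothetical. Coordinator handling for tasks that share a node is not shown.

```shell
#!/bin/bash
#SBATCH -J dmtcp_array
#SBATCH --array=1-4
#SBATCH -t 00:05:00

# One subdirectory per array task; fall back to task 1 outside of Slurm.
task_id="${SLURM_ARRAY_TASK_ID:-1}"
task_dir="task_${task_id}"
mkdir -p "$task_dir"
cd "$task_dir" || exit 1

# Checkpoints are written to the current directory, so each task keeps its own.
ckpt=$(ls -t ckpt_*.dmtcp 2>/dev/null | head -n 1)

if ! command -v dmtcp_launch >/dev/null 2>&1; then
    echo "dmtcp_launch not found; load the dmtcp module first" >&2
elif [ -z "$ckpt" ]; then
    # First submission: start the task fresh.
    dmtcp_launch -i 8 ../dmtcp_serial
else
    # Second or later submission: resume this task from its own checkpoint.
    dmtcp_restart -i 8 "$ckpt"
fi
```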