DMTCP

Distributed MultiThreaded Checkpointing (DMTCP) checkpoints a running program on Linux without any modification to the program or the operating system. The program can later be restarted from a checkpoint.

Modules

To access dmtcp, load a dmtcp module. For example:

module load dmtcp/3.0.0

Example Programs

The following example program prints increasing integers, one every 2 seconds. Save it to a text file on Oscar named dmtcp_serial.c:

#include<stdio.h>
#include<unistd.h>

int main(int argc, char* argv[])
{
    int count = 1;
    while (1)
    {
        printf(" %2d\n",count++);
        fflush(stdout);
        sleep(2);
    }
    return 0;
}

Compile this program to produce an executable named dmtcp_serial.
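A minimal sketch of the compile step, assuming gcc is the compiler you use on Oscar (any C compiler should work) and that dmtcp_serial.c is in the current directory. The variable names are only for illustration.

```shell
# Source file from the example above; the executable name drops the .c suffix.
src=dmtcp_serial.c
exe="${src%.c}"

# gcc is an assumption; substitute the compiler you normally use.
if [ -f "$src" ]; then
    gcc -o "$exe" "$src"
else
    echo "$src not found in $(pwd)" >&2
fi
```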

You should now have the following files in your directory:

  • dmtcp_serial

  • dmtcp_serial.c

Basic Usage

Launch a Program

The dmtcp_launch command launches a program and automatically checkpoints it. To specify the checkpoint interval in seconds, add the "-i num_seconds" option to the dmtcp_launch command.

Example: the following command launches the program dmtcp_serial and checkpoints every 8 seconds.
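A sketch of that command, assuming a dmtcp module is loaded and the compiled dmtcp_serial executable is in the current directory. The variable names and the PATH check are illustrative; the "-i" option is the one described above.

```shell
# Checkpoint interval in seconds, as in the example text.
interval=8
program=./dmtcp_serial

# Run the program under DMTCP if dmtcp_launch is on PATH (i.e. after `module load`).
if command -v dmtcp_launch >/dev/null 2>&1; then
    dmtcp_launch -i "$interval" "$program"
else
    echo "dmtcp_launch not found; load a dmtcp module first" >&2
fi
```

Because dmtcp_serial loops forever, the command keeps running until it is interrupted, for example with Ctrl-C.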

A checkpoint file (for example, ckpt_dmtcp_serial_24f183c2194a7dc4-40000-42af86bb59385.dmtcp) is created in the current directory and can be used to restart the program.

Restart from a checkpoint

The dmtcp_restart command restarts a program from a checkpoint and continues to checkpoint it automatically. To specify the checkpoint interval in seconds, add the "-i num_seconds" option to the dmtcp_restart command.

Example: the following command restarts the dmtcp_serial program from a checkpoint and checkpoints every 12 seconds.
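A sketch of the restart, assuming a dmtcp module is loaded. The checkpoint name is the one quoted in the launch section with the usual .dmtcp extension; replace it with the name of your own checkpoint file.

```shell
# Checkpoint image written by an earlier run (replace with your own file name).
ckpt=ckpt_dmtcp_serial_24f183c2194a7dc4-40000-42af86bb59385.dmtcp
# New checkpoint interval in seconds, as in the example text.
interval=12

# Restart only if the checkpoint image is actually present.
if [ -f "$ckpt" ]; then
    dmtcp_restart -i "$interval" "$ckpt"
else
    echo "checkpoint $ckpt not found in $(pwd)" >&2
fi
```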

Batch Jobs

Ideally, a single job script can

  • launch a program if there are no checkpoints, or

  • automatically restart from a checkpoint if one or more checkpoints exist

The job script dmtcp_serial_job.sh is an example that achieves this goal:

  • If there is no checkpoint in the current directory, launch the program dmtcp_serial

  • If one or more checkpoints exist in the current directory, restart the program dmtcp_serial from the latest checkpoint
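A minimal sketch of such a script, assuming jobs are submitted through Slurm on Oscar and that "latest" means the most recently modified ckpt_*.dmtcp file. The #SBATCH values, the 8-second interval, and the helper names latest_checkpoint and run_program are hypothetical.

```shell
#!/bin/bash
# Placeholder resources: one task and a short time limit so the job times out.
#SBATCH -J dmtcp_serial
#SBATCH -n 1
#SBATCH -t 00:05:00

# Checkpoint interval in seconds (illustrative value).
CKPT_INTERVAL=8

# Print the newest checkpoint image in a directory (default: current),
# or nothing if there is none.
latest_checkpoint() {
    ls -t "${1:-.}"/ckpt_*.dmtcp 2>/dev/null | head -n 1
}

# Restart from the newest checkpoint if one exists, otherwise launch fresh.
run_program() {
    local ckpt
    ckpt=$(latest_checkpoint .)
    if [ -n "$ckpt" ]; then
        echo "Restarting from $ckpt"
        dmtcp_restart -i "$CKPT_INTERVAL" "$ckpt"
    else
        echo "No checkpoint found, launching ./dmtcp_serial"
        dmtcp_launch -i "$CKPT_INTERVAL" ./dmtcp_serial
    fi
}

# Do the real work only inside a Slurm job; elsewhere this only defines the functions.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    module load dmtcp/3.0.0
    run_program
fi
```

Submitting the same file repeatedly, for example with sbatch dmtcp_serial_job.sh, then launches on the first run and restarts on later runs without editing the script.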

First Submission - Launch a Program

Submit dmtcp_serial_job.sh and wait for the job to run until it times out. The beginning and end of the job output file are shown below.

Later Submissions - Restart from a Checkpoint

Submit dmtcp_serial_job.sh again and wait for the job to run until it times out. The beginning of the job output file, shown below, demonstrates that the job restarts from the checkpoint of the previous job.

Job Array

The following example script

  • creates a subdirectory for each task of a job array and saves each task's checkpoints in that task's own subdirectory when the job script is submitted for the first time

  • restarts checkpoints in task subdirectories when the job script is submitted for the second time or later
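A minimal sketch of a job array script along these lines, assuming Slurm starts each task in the submission directory and that dmtcp_serial lives there. The array range, the task_N directory naming, the #SBATCH values, and the helper names are hypothetical.

```shell
#!/bin/bash
# Placeholder resources and array range.
#SBATCH -J dmtcp_array
#SBATCH -n 1
#SBATCH -t 00:05:00
#SBATCH --array=1-4

# Checkpoint interval in seconds (illustrative value).
CKPT_INTERVAL=8

# Create (if needed) and print the subdirectory that belongs to one array task.
task_dir() {
    local dir="task_$1"
    mkdir -p "$dir"
    echo "$dir"
}

# Print the newest checkpoint image in the current directory, or nothing.
latest_checkpoint() {
    ls -t ckpt_*.dmtcp 2>/dev/null | head -n 1
}

# Do the real work only inside a Slurm array task.
if [ -n "${SLURM_ARRAY_TASK_ID:-}" ]; then
    module load dmtcp/3.0.0
    # Work inside the task's own subdirectory so its checkpoints stay separate.
    cd "$(task_dir "$SLURM_ARRAY_TASK_ID")" || exit 1
    ckpt=$(latest_checkpoint)
    if [ -n "$ckpt" ]; then
        dmtcp_restart -i "$CKPT_INTERVAL" "$ckpt"
    else
        # The executable is one level up, in the submission directory.
        dmtcp_launch -i "$CKPT_INTERVAL" ../dmtcp_serial
    fi
fi
```

Changing into the task directory before launching relies on DMTCP writing checkpoint files to the current directory, as in the earlier examples.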
