DMTCP

Last updated 11 months ago

Distributed MultiThreaded Checkpointing (DMTCP) checkpoints a running program on Linux with no modifications to the program or the OS. The program can later be restarted from a checkpoint.

Modules

To access dmtcp, load a dmtcp module. For example:

module load dmtcp/3.0.0
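
After loading the module, it can be useful to confirm that the DMTCP commands are actually on your PATH before launching anything. The sketch below is one way to do that; the helper name missing_commands is hypothetical, and the list of commands checked is an assumption based on the standard DMTCP tools.

```shell
#!/bin/bash
# Hypothetical helper: print each named command that is not found on PATH.
missing_commands() {
    local cmd
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
    done
}

# Assumed set of DMTCP tools to look for after "module load dmtcp/3.0.0".
missing=$(missing_commands dmtcp_launch dmtcp_restart dmtcp_coordinator)
if [ -z "$missing" ]; then
    echo "DMTCP commands are available"
else
    echo "not found (is the module loaded?):" $missing
fi
```

If any command is reported as not found, load the module again in the same shell and rerun the check.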

Example Programs

Here is a dummy example that prints increasing integers, one every 2 seconds. Copy it to a text file named dmtcp_serial.c on Oscar.

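#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    int count = 1;
    /* Print an increasing counter forever, one value every 2 seconds. */
    while (1)
    {
        printf(" %2d\n", count++);
        fflush(stdout);  /* flush so each value appears immediately */
        sleep(2);
    }
    return 0;
}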

Compile this program by running

gcc dmtcp_serial.c -o dmtcp_serial

You should now have the following files in your directory:

  • dmtcp_serial

  • dmtcp_serial.c

Basic Usage

Launch a Program

The dmtcp_launch command launches a program and automatically checkpoints it. To specify the checkpoint interval in seconds, add the "-i num_seconds" option to the dmtcp_launch command.

Example: the following commands pick a random port for the DMTCP coordinator (passed with -p), launch the program dmtcp_serial, and checkpoint every 8 seconds.

$ port=$(shuf -i 40000-60000 -n 1)
$ dmtcp_launch -p $port -i 8 ./dmtcp_serial
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
^C
[yliu385@node1317 interact]$ ll
total 2761
-rw------- 1 yliu385 ccvstaff 2786466 May 18 11:18 ckpt_dmtcp_serial_24f183c2194a7dc4-40000-42af86bb59385.dmtcp
lrwxrwxrwx 1 yliu385 ccvstaff      60 May 18 11:18 dmtcp_restart_script.sh -> dmtcp_restart_script_24f183c2194a7dc4-40000-42af82ef922a7.sh
-rwxr--r-- 1 yliu385 ccvstaff   12533 May 18 11:18 dmtcp_restart_script_24f183c2194a7dc4-40000-42af82ef922a7.sh
-rwxr-xr-x 1 yliu385 ccvstaff    8512 May 18 08:36 dmtcp_serial

As shown in the example above, a checkpoint file (ckpt_dmtcp_serial_24f183c2194a7dc4-40000-42af86bb59385.dmtcp) is created and can be used to restart the program.

Restart from a checkpoint

The dmtcp_restart command restarts a program from a checkpoint and continues to checkpoint it automatically. To specify the checkpoint interval in seconds, add the "-i num_seconds" option to the dmtcp_restart command.

Example: the following commands restart the dmtcp_serial program from a checkpoint and checkpoint every 12 seconds.

$ port=$(shuf -i 40000-60000 -n 1)
$ dmtcp_restart -p $port -i 12 ckpt_dmtcp_serial_24f183c2194a7dc4-40000-42af86bb59385.dmtcp
  9
 10
 11
 12
 13
 14
 15
^C
[yliu385@node1317 interact]$ dmtcp_restart -p $port -i 12 ckpt_dmtcp_serial_24f183c2194a7dc4-40000-42af86bb59385.dmtcp 
 15
 16
 17
^C

Batch Jobs

Ideally, a single job script can

  • launch a program if there is no checkpoint, or

  • automatically restart from a checkpoint if one or more checkpoints exist

The job script dmtcp_serial_job.sh below shows how to achieve this:

  • If there is no checkpoint in the current directory, launch the program dmtcp_serial

  • If one or more checkpoints exist in the current directory, restart the program dmtcp_serial from the latest checkpoint

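#!/bin/bash

#SBATCH -n 1
#SBATCH -t 5:00
#SBATCH -J dmtcp_serial

module load dmtcp/3.0.0

# Newest checkpoint in the current directory (empty if none exists yet)
checkpoint_file=$(ls -t ckpt_*.dmtcp | head -n 1)
checkpoint_interval=8
# Random port for the DMTCP coordinator
port=$(shuf -i 40000-60000 -n 1)

if [ -z "$checkpoint_file" ]; then
    # No checkpoint: start the program from the beginning
    dmtcp_launch -p $port -i $checkpoint_interval ./dmtcp_serial
else
    # Resume from the newest checkpoint
    dmtcp_restart -p $port -i $checkpoint_interval "$checkpoint_file"
fi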
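
The checkpoint lookup in the job script can also be written as a small function that stays quiet when no checkpoint exists, instead of letting ls print an error into the job output. This is a minimal sketch, not part of DMTCP; the name latest_checkpoint is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print the newest ckpt_*.dmtcp file in a directory,
# or print nothing (and no ls error) when no checkpoint exists yet.
latest_checkpoint() {
    local dir=${1:-.}
    # ls -t sorts by modification time, newest first; stderr is discarded
    # so a directory without checkpoints produces no message.
    ls -t "$dir"/ckpt_*.dmtcp 2>/dev/null | head -n 1
}

checkpoint_file=$(latest_checkpoint .)
if [ -z "$checkpoint_file" ]; then
    echo "no checkpoint found: the job would call dmtcp_launch"
else
    echo "newest checkpoint: $checkpoint_file (the job would call dmtcp_restart)"
fi
```

The function prints the path with the directory prefix it was given, so the result can be passed straight to dmtcp_restart.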

First Submission - Launch a Program

Submit dmtcp_serial_job.sh and wait for the job to run until it times out. The beginning and end of the job output file are shown below.

$ head  slurm-5157871.out -n 15
## SLURM PROLOG ###############################################################
##    Job ID : 5157871
##  Job Name : dmtcp_serial
##  Nodelist : node1139
##      CPUs : 1
##   Mem/CPU : 2800 MB
##  Mem/Node : 65536 MB
## Directory : /gpfs/data/ccvstaff/yliu385/Test/dmtcp/serial/batch_job
##   Job Started : Wed May 18 09:38:39 EDT 2022
###############################################################################
ls: cannot access ckpt_*.dmtcp: No such file or directory
  1
  2
  3
  4
$ tail slurm-5157871.out
 147
 148
 149
 150
 151
 152
 153
 154
 155
slurmstepd: error: *** JOB 5157871 ON node1139 CANCELLED AT 2022-05-18T09:43:58 DUE TO TIME LIMIT ***

Later Submissions - Restart from a Checkpoint

Submit dmtcp_serial_job.sh again and wait for the job to run until it times out. The beginning of the job output file, shown below, demonstrates that the job restarts from the checkpoint of the previous job.

$ head  slurm-5158218.out -n 15
## SLURM PROLOG ###############################################################
##    Job ID : 5158218
##  Job Name : dmtcp_serial
##  Nodelist : node1327
##      CPUs : 1
##   Mem/CPU : 2800 MB
##  Mem/Node : 65536 MB
## Directory : /gpfs/data/ccvstaff/yliu385/Test/dmtcp/serial/batch_job
##   Job Started : Wed May 18 09:50:39 EDT 2022
###############################################################################
 153
 154
 155
 156
 157
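
Because each submission runs until the time limit and the next one resumes from the latest checkpoint, the resubmissions can be queued in advance rather than submitted by hand. The sketch below chains them with Slurm job dependencies. It assumes the sbatch options --parsable and --dependency=afterany behave as described in the Slurm documentation; the helper name submit_chain is hypothetical. The afterany type is chosen because the earlier job ends by timing out rather than completing successfully.

```shell
#!/bin/bash
# Hypothetical helper: submit the same job script N times so that each
# submission starts only after the previous one has ended (in any state).
submit_chain() {
    local script=$1
    local count=$2
    local job_id dep=""
    local i=1
    while [ "$i" -le "$count" ]; do
        # --parsable makes sbatch print the job ID (optionally "id;cluster").
        # $dep is intentionally unquoted so an empty value adds no argument.
        job_id=$(sbatch --parsable $dep "$script") || return 1
        job_id=${job_id%%;*}
        echo "submitted job $job_id"
        dep="--dependency=afterany:$job_id"
        i=$((i + 1))
    done
}

# Usage on a login node (not executed here):
#   submit_chain dmtcp_serial_job.sh 3
```

With three submissions and the 5-minute limit from the example, the program would get roughly 15 minutes of run time in total, minus whatever progress is lost between the last checkpoint and each timeout.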

Job Array

The following example script

  • creates a subdirectory for each task of a job array and saves each task's checkpoint in that task's own subdirectory when the job script is submitted for the first time

  • restarts each task from the checkpoint in its subdirectory when the job script is submitted for the second time or later

#!/bin/bash

#SBATCH -n 1
#SBATCH --array=1-4
#SBATCH -t 5:00
#SBATCH -J dmtcp_job_array

module load dmtcp/3.0.0

checkpoint_interval=8
# Derive a port in the range 40000-59999 from the job ID
port=$((SLURM_JOB_ID % 20000 + 40000))
# Each array task works in its own subdirectory
task_dir=jobtask_$SLURM_ARRAY_TASK_ID

if [ ! -d "$task_dir" ]; then
    # First submission: create the task directory and launch the program
    mkdir "$task_dir"
    cd "$task_dir"
    dmtcp_launch -p $port -i $checkpoint_interval ../dmtcp_serial
else
    cd "$task_dir"
    # Newest checkpoint in the task directory (empty if none exists yet)
    checkpoint_file=$(ls -t ckpt_*.dmtcp | head -n 1)
    if [ -z "$checkpoint_file" ]; then
        dmtcp_launch -p $port -i $checkpoint_interval ../dmtcp_serial
    else
        dmtcp_restart -p $port -i $checkpoint_interval "$checkpoint_file"
    fi
fi
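
The job array script derives the port from the job ID instead of drawing a random number. The following sketch works through that formula for the two job IDs that appear in the output listings earlier on this page; the function name port_for_job is hypothetical. It assumes, as Slurm normally does, that each array task receives its own SLURM_JOB_ID, so tasks submitted together get different ports.

```shell
#!/bin/bash
# Hypothetical helper reproducing the port formula from the job array script:
# port = job_id % 20000 + 40000, which always lands in 40000..59999.
port_for_job() {
    local job_id=$1
    echo $(( job_id % 20000 + 40000 ))
}

# The two job IDs shown earlier on this page, used purely as sample inputs.
port_for_job 5157871   # prints 57871
port_for_job 5158218   # prints 58218
```

Two jobs map to the same port only when their IDs differ by a multiple of 20000, so tasks with nearby job IDs do not collide.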
