This page describes how to install popular frameworks such as TensorFlow, PyTorch, and JAX on your Oscar account.
Preface: Oscar is a heterogeneous cluster, meaning it has nodes with GPUs of different architectures (Pascal, Volta, Turing, and Ampere). We recommend building the environment the first time on Ampere GPUs with the latest CUDA 11 modules, so it is backward compatible with the older GPU architectures.
In this example, we will install PyTorch (refer to the sub-pages for TensorFlow and JAX).
Step 1: Request an interactive session on a GPU node with Ampere architecture GPUs
interact -q gpu -g 1 -f ampere -m 20g -n 4
Here, -f specifies a node feature. We only need to build on Ampere once.
Step 2: Once your session has started on a compute node, run nvidia-smi to verify the GPU, then load the appropriate modules
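This step might look like the following (a sketch; there is no output to check until you are on the node):

```shell
# Confirm that a GPU has been assigned to your interactive session
nvidia-smi
```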
Step 3: Unload the pre-loaded modules, then load the cudnn and cuda dependencies
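A sketch of the module step (the module versions below are assumptions; run `module avail cuda` and `module avail cudnn` on Oscar to see what is actually installed):

```shell
# Clear modules that may conflict with the CUDA build environment
module purge

# Load CUDA and cuDNN (versions are illustrative examples)
module load cuda/11.8.0 cudnn/8.7.0
```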
Step 4: Create and activate a new virtual environment
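For example (the environment name `pytorch.venv` is just an illustration; pick any name and location with enough quota):

```shell
# Create a virtual environment
python -m venv pytorch.venv

# Activate it; subsequent pip installs go into this environment
source pytorch.venv/bin/activate
```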
Step 5: Install the required packages
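With the virtual environment activated, the install command might look like (a sketch; the exact package list depends on your needs):

```shell
# Install the latest PyTorch build, plus the common companion packages
pip install torch torchvision torchaudio
```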
The command above installs the latest version of PyTorch with CUDA 11 compatibility. For older versions, you can specify the version explicitly:
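For example (the version number and CUDA tag below are illustrative; see the PyTorch "previous versions" page for the exact command for your combination):

```shell
# Pin a specific PyTorch version built against CUDA 11.7
pip install torch==1.13.1 --index-url https://download.pytorch.org/whl/cu117
```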
Step 6: Test that PyTorch is able to detect GPUs
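A quick check from the Python interpreter, using PyTorch's standard CUDA query functions:

```python
import torch

# Should print True when PyTorch can see a GPU
print(torch.cuda.is_available())

# Should print the model of GPU 0, e.g. an Ampere-class device
print(torch.cuda.get_device_name(0))
```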
If the above functions return True and the GPU model, then it's working correctly. You are all set; now you can install other necessary packages.
This page describes how to install JAX with Python virtual environments
In this example, we will install JAX.
Step 1: Request an interactive session on a GPU node with Ampere architecture GPUs
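This is the same interactive request used in the PyTorch example: one Ampere GPU, 20 GB of memory, and 4 cores:

```shell
interact -q gpu -g 1 -f ampere -m 20g -n 4
```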
Here, -f = feature. We only need to build on Ampere once.
Step 2: Once your session has started on a compute node, run nvidia-smi to verify the GPU, then load the appropriate modules
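As in the PyTorch example, this might look like (module versions are assumptions; check `module avail`):

```shell
# Verify the GPU, then load CUDA and cuDNN
nvidia-smi
module load cuda/11.8.0 cudnn/8.7.0
```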
Step 3: Create and activate the virtual environment
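For example (the name `jax.venv` is just an illustration):

```shell
# Create and activate a fresh virtual environment for JAX
python -m venv jax.venv
source jax.venv/bin/activate
```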
Step 4: Install the required packages
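The install command might look like the following, based on JAX's published CUDA install instructions; the extra (`cuda11_pip`) and the releases URL change between JAX versions, so check the JAX documentation for the current form:

```shell
# Install JAX with CUDA 11 support from the JAX wheel index
pip install "jax[cuda11_pip]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
```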
Step 5: Test that JAX is able to detect GPUs
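A quick check using JAX's standard backend query:

```python
import jax

# Should print "gpu" when JAX is running on the GPU backend
print(jax.default_backend())
```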
If the above function returns gpu, then it's working correctly. You are all set; now you can install other necessary packages.
Setting up a GPU-accelerated environment can be challenging due to driver dependencies, version conflicts, and other complexities. Apptainer simplifies this process by encapsulating all these details in a container image.
There are multiple ways to install and run TensorFlow. Our recommended approach is via NGC containers, which are available from the NGC Registry. In this example, we will pull the TensorFlow NGC container.
Build the container:
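The build command might look like the following; the image tag `23.08-tf2-py3` and the local filename are illustrative, so browse the NGC catalog for a current tag:

```shell
# Pull the TensorFlow image from the NGC registry and build a local Apptainer image
apptainer build tensorflow-23.08.simg docker://nvcr.io/nvidia/tensorflow:23.08-tf2-py3
```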
This will take some time, and once it completes you should see a .simg file.
For your convenience, the pre-built container images are located in directory:
/oscar/runtime/software/external/ngc-containers/tensorflow.d/x86_64/
You can choose either to build your own or use one of the pre-downloaded images.
Working with Apptainer images requires a lot of storage space. By default, Apptainer uses ~/.apptainer as its cache directory, which can push you over your Home quota.
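One way to avoid this is to point the cache at a larger filesystem via Apptainer's `APPTAINER_CACHEDIR` environment variable (the path below is a placeholder; substitute a directory of your own with sufficient quota):

```shell
# Redirect Apptainer's cache away from ~/.apptainer
export APPTAINER_CACHEDIR=/path/to/scratch/apptainer-cache
```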
Once the container is ready, request an interactive session with a GPU
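For example, reusing the interact flags from earlier on this page (drop or adjust flags to taste):

```shell
# Request 1 GPU, 20 GB of memory, and 4 cores
interact -q gpu -g 1 -m 20g -n 4
```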
Run a container with GPU support
The --nv flag is important, as it enables the NVIDIA sub-system inside the container
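For example (the image filename is illustrative; use your own .simg file or one of the pre-built images listed above):

```shell
# Start an interactive shell inside the container with GPU access
apptainer shell --nv tensorflow-23.08.simg
```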
Or, if you're executing a specific command inside the container:
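A sketch with `apptainer exec` (image and script names are placeholders):

```shell
# Run a single command inside the container instead of an interactive shell
apptainer exec --nv tensorflow-23.08.simg python my_script.py
```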
Make sure your TensorFlow image is able to detect GPUs
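A quick check using TensorFlow's standard device query:

```python
import tensorflow as tf

# Should print at least one PhysicalDevice with device_type='GPU'
print(tf.config.list_physical_devices('GPU'))
```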
If you need to install more custom packages, the container itself is non-writable, but we can use pip's --user flag to install packages inside .local
Example:
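A sketch, run from a shell inside the container (the package name is just an example):

```shell
# Installs into ~/.local on the host, which remains visible inside the container
pip install --user scikit-learn
```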
You can also run your container from a SLURM job script, using the srun command to launch it. Here is a basic example:
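A minimal sketch of such a batch script; the partition name, resource flags, image, and script names are assumptions, so adapt them to your account and workload:

```shell
#!/bin/bash
#SBATCH -p gpu            # GPU partition (adjust to your cluster's partition name)
#SBATCH --gres=gpu:1      # request one GPU
#SBATCH -n 4              # 4 cores
#SBATCH --mem=20g         # 20 GB of memory
#SBATCH -t 1:00:00        # 1 hour walltime

# Run the training script inside the container with GPU support enabled
srun apptainer exec --nv tensorflow-23.08.simg python train.py
```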