This page describes how to install popular frameworks such as TensorFlow, PyTorch, and JAX on your Oscar account.
Preface: Oscar is a heterogeneous cluster, meaning it has nodes with GPUs of different architectures (Pascal, Volta, Turing, and Ampere). We recommend building the environment the first time on Ampere GPUs with the latest CUDA 11 modules, so it is backward compatible with the older GPU architectures.
In this example, we will install PyTorch (refer to the sub-pages for TensorFlow and JAX).
Step 1: Request an interactive session on a GPU node with Ampere architecture GPUs
interact -q gpu -g 1 -f ampere -m 20g -n 4
Here, -f specifies a node feature. We only need to build on Ampere once.
Step 2: Once your session has started on a compute node, run nvidia-smi to verify the GPU, then load the appropriate modules
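This step might look like the following (a sketch; there is no output to check until you are on the node):

```shell
# Confirm that a GPU has been assigned to your interactive session
nvidia-smi
```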
Step 3: Unload the pre-loaded modules, then load the cudnn and cuda dependencies
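A sketch of the module step (the module versions below are assumptions; run `module avail cuda` and `module avail cudnn` on Oscar to see what is actually installed):

```shell
# Clear modules that may conflict with the CUDA build environment
module purge

# Load CUDA and cuDNN (versions are illustrative examples)
module load cuda/11.8.0 cudnn/8.7.0
```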
Step 4: Create and activate a new virtual environment
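For example (the environment name `pytorch.venv` is just an illustration; pick any name and location with enough quota):

```shell
# Create a virtual environment
python -m venv pytorch.venv

# Activate it; subsequent pip installs go into this environment
source pytorch.venv/bin/activate
```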
Step 5: Install the required packages
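With the virtual environment activated, the install command might look like (a sketch; the exact package list depends on your needs):

```shell
# Install the latest PyTorch build, plus the common companion packages
pip install torch torchvision torchaudio
```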
The command above installs the latest version of PyTorch with CUDA 11 compatibility. For older versions, you can specify the version explicitly:
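For example (the version number and CUDA tag below are illustrative; see the PyTorch "previous versions" page for the exact command for your combination):

```shell
# Pin a specific PyTorch version built against CUDA 11.7
pip install torch==1.13.1 --index-url https://download.pytorch.org/whl/cu117
```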
Step 6: Test that PyTorch is able to detect GPUs
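A quick check from the Python interpreter, using PyTorch's standard CUDA query functions:

```python
import torch

# Should print True when PyTorch can see a GPU
print(torch.cuda.is_available())

# Should print the model of GPU 0, e.g. an Ampere-class device
print(torch.cuda.get_device_name(0))
```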
If the above functions return True and the GPU model, then it's working correctly. You are all set; now you can install other necessary packages.
This page describes how to install JAX with Python virtual environments
In this example, we will install JAX.
Step 1: Request an interactive session on a GPU node with Ampere architecture GPUs
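This is the same interactive request used in the PyTorch example: one Ampere GPU, 20 GB of memory, and 4 cores:

```shell
interact -q gpu -g 1 -f ampere -m 20g -n 4
```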
Here, -f = feature. We only need to build on Ampere once.
Step 2: Once your session has started on a compute node, run nvidia-smi to verify the GPU, then load the appropriate modules
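As in the PyTorch example, this might look like (module versions are assumptions; check `module avail`):

```shell
# Verify the GPU, then load CUDA and cuDNN
nvidia-smi
module load cuda/11.8.0 cudnn/8.7.0
```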
Step 3: Create and activate the virtual environment
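For example (the name `jax.venv` is just an illustration):

```shell
# Create and activate a fresh virtual environment for JAX
python -m venv jax.venv
source jax.venv/bin/activate
```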
Step 4: Install the required packages
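The install command might look like the following, based on JAX's published CUDA install instructions; the extra (`cuda11_pip`) and the releases URL change between JAX versions, so check the JAX documentation for the current form:

```shell
# Install JAX with CUDA 11 support from the JAX wheel index
pip install "jax[cuda11_pip]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
```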
Step 5: Test that JAX is able to detect GPUs
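A quick check using JAX's standard backend query:

```python
import jax

# Should print "gpu" when JAX is running on the GPU backend
print(jax.default_backend())
```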
If the above function returns gpu, then it's working correctly. You are all set; now you can install other necessary packages.
Setting up a GPU-accelerated environment can be challenging due to driver dependencies, version conflicts, and other complexities. Apptainer simplifies this process by encapsulating all these details in a container image.
There are multiple ways to install and run TensorFlow. Our recommended approach is via NGC containers, which are available from the NGC Registry. In this example, we will pull the TensorFlow NGC container.
Build the container:
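The build command might look like the following; the image tag `23.08-tf2-py3` and the local filename are illustrative, so browse the NGC catalog for a current tag:

```shell
# Pull the TensorFlow image from the NGC registry and build a local Apptainer image
apptainer build tensorflow-23.08.simg docker://nvcr.io/nvidia/tensorflow:23.08-tf2-py3
```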
This will take some time, and once it completes you should see a .simg file.
For your convenience, the pre-built container images are located in directory:
/oscar/runtime/software/external/ngc-containers/tensorflow.d/x86_64/
You can choose either to build your own or use one of the pre-downloaded images.
Working with Apptainer images requires a lot of storage space. By default, Apptainer uses ~/.apptainer as its cache directory, which can push you over your Home quota.
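One way to avoid this is to point the cache at a larger filesystem via Apptainer's `APPTAINER_CACHEDIR` environment variable (the path below is a placeholder; substitute a directory of your own with sufficient quota):

```shell
# Redirect Apptainer's cache away from ~/.apptainer
export APPTAINER_CACHEDIR=/path/to/scratch/apptainer-cache
```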
Once the container is ready, request an interactive session with a GPU
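For example, reusing the interact flags from earlier on this page (drop or adjust flags to taste):

```shell
# Request 1 GPU, 20 GB of memory, and 4 cores
interact -q gpu -g 1 -m 20g -n 4
```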
Run a container with GPU support
The --nv flag is important, as it enables the NVIDIA sub-system inside the container
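For example (the image filename is illustrative; use your own .simg file or one of the pre-built images listed above):

```shell
# Start an interactive shell inside the container with GPU access
apptainer shell --nv tensorflow-23.08.simg
```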
Or, if you're executing a specific command inside the container:
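A sketch with `apptainer exec` (image and script names are placeholders):

```shell
# Run a single command inside the container instead of an interactive shell
apptainer exec --nv tensorflow-23.08.simg python my_script.py
```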
Make sure your TensorFlow image is able to detect GPUs
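A quick check using TensorFlow's standard device query:

```python
import tensorflow as tf

# Should print at least one PhysicalDevice with device_type='GPU'
print(tf.config.list_physical_devices('GPU'))
```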
If you need to install more custom packages, the container itself is non-writable, but we can use pip's --user flag to install packages inside .local
Example:
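A sketch, run from a shell inside the container (the package name is just an example):

```shell
# Installs into ~/.local on the host, which remains visible inside the container
pip install --user scikit-learn
```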
You can also run your container from a SLURM job script, using the srun command to launch it. Here is a basic example:
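A minimal sketch of such a batch script; the partition name, resource flags, image, and script names are assumptions, so adapt them to your account and workload:

```shell
#!/bin/bash
#SBATCH -p gpu            # GPU partition (adjust to your cluster's partition name)
#SBATCH --gres=gpu:1      # request one GPU
#SBATCH -n 4              # 4 cores
#SBATCH --mem=20g         # 20 GB of memory
#SBATCH -t 1:00:00        # 1 hour walltime

# Run the training script inside the container with GPU support enabled
srun apptainer exec --nv tensorflow-23.08.simg python train.py
```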