Ollama

This page describes how to run large language models (LLMs) directly on Oscar nodes using Ollama.

LLMs Hosted by CCV

CCV hosts several dozen public, open-weight LLMs on Oscar, including Llama 3.3, DeepSeek-R1, Mistral, and Gemma 3. You can see the complete list in the Appendix section below.

We begin by opening a terminal and connecting to Oscar. This can be done using Open OnDemand, a terminal application on your local machine, or PuTTY if you're on a Windows machine.
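
For example, from a local terminal you might connect with something like the command below (replace username with your Oscar username; the host shown is Oscar's standard SSH endpoint):

ssh username@ssh.ccv.brown.edu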

Once we have a terminal session on Oscar, we need to set an environment variable that tells Ollama where to look for the CCV-hosted models. This only needs to be done once, and you can do so using the commands below.

echo 'export OLLAMA_MODELS=/oscar/data/shared/ollama_models' >> ~/.bashrc

source ~/.bashrc
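
To confirm the variable is set in your current shell, you can print it; it should show the shared models path above.

echo $OLLAMA_MODELS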

Requesting a GPU Node

LLMs are particularly well suited to running on GPUs, so we begin by requesting a GPU node on Oscar using the following interact command, which requests 4 CPU cores, 32 GB of memory, and 1 GPU for 1 hour.

interact -n 4 -m 32g -q gpu -g 1 -t 1:00:00

Note that depending on the particular LLM, you may want additional resources (e.g., more CPU cores, memory, or GPUs). The above example should be good for most models.
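
For example, a request along the following lines (an illustrative sketch; adjust the numbers to the model you plan to run) would provide more CPU cores, memory, and time:

interact -n 8 -m 64g -q gpu -g 1 -t 2:00:00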

Starting an Ollama Server

There are several ways to run large language models directly on Oscar. One particularly straightforward and flexible approach is to use the Ollama framework, which is installed as a module on Oscar.

Once our job is allocated and we are on a GPU node, we next need to load the ollama module.

module load ollama
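
To confirm the module loaded and the ollama client is on your path, you can check its version (a quick sanity check; the exact version will vary):

ollama --version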

Because the Ollama framework operates using a client/server architecture, we must now launch the server component of Ollama. This is done using the command below.

ollama serve

After running the command above, we will see a stream of output; this indicates that the Ollama server has started. Leave this command running, as the server must stay up for the steps below.

Running an LLM Interactively with Ollama

Now that the Ollama server is running, we can use it to launch LLMs on our GPU node. To do so, we must start a new terminal session and use it to connect to our GPU node. The original terminal session we started above needs to keep running, since it is responsible for the Ollama server; the new session will run the client. If you are using an Open OnDemand Desktop session, you can right-click the Terminal icon at the bottom of the screen and select New Window. If you are connecting via your local machine's terminal application or via PuTTY on Windows, simply open a new window.

Once we have the new terminal open, run the myq command to find the hostname of the node running our Ollama server; it appears under the NODES heading and looks something like gpuXXXX. We can connect to our GPU node from the login node by running the following command, replacing XXXX with the node number shown by myq.

ssh gpuXXXX            # replace XXXX with your GPU node's number

Starting an Interactive Chat

Once we have connected to our GPU node, we are nearly ready to run an LLM. First, we need to load the ollama module again on this node, using the command below.

module load ollama

We can now run an interactive chat session with the llama3.2 model using the command below. Note that it may take a few seconds for the chat interface to start.

ollama run llama3.2

The above command takes us into an interactive chat session with llama3.2. This should be apparent when the usual Linux command prompt changes to a >>> prompt.

You can now enter queries directly at the prompt. For example, you could ask the following.

How would I write a function to compute Fibonacci numbers in Rust?

When we are finished with our chat session, we can exit by typing /bye.
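
Ollama can also answer a single prompt without entering the interactive chat, which is convenient for quick checks or shell scripts. For example (using llama3.2 again, with any prompt you like):

ollama run llama3.2 "What is the origin of Unix epoch time?"

The model prints its response and then returns to the normal shell prompt.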

Loading Specific Versions of a Model

Suppose that instead of running the default version of a model in Ollama, we want a specific version. For example, the default for gemma2 is the 9b version, which has roughly 9 billion parameters. If we wanted to launch the larger 27b version of gemma2, we could do so using the command below.

ollama run gemma2:27b
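
To see every model and version tag currently hosted on Oscar, you can run the ollama list command (also mentioned in the Appendix below) from the same terminal:

ollama list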

Running Ollama Models via Python

If we would prefer to interact with Ollama models via Python, we can do so using the ollama Python package. This is useful, for example, for benchmarking models against one another or for other long-running tasks.

This section assumes we have already started the Ollama server as described at the beginning of this page. If so, we can start a new terminal as we did before, and then create a Python virtual environment using the following commands.

mkdir -p ~/projects/python_ollama_test
cd ~/projects/python_ollama_test
python -m venv venv

We can then activate our newly created virtual environment and install the Ollama Python package using the following commands.

source venv/bin/activate
pip install ollama
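
Before writing a full script, you can do a quick smoke test from a Python session. The snippet below is a minimal sketch; it assumes the Ollama server started earlier is still running and reachable from this terminal, and that the llama3.2 model is available.

import ollama

# One-off request to the running Ollama server (llama3.2 is assumed to be hosted)
result = ollama.generate(model='llama3.2', prompt='Say hello in one short sentence.')
print(result['response'])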

Calling Ollama from a Python Script

Now, let's create a Python script that will compare the output of three models on a few questions. Create a file called main.py inside our ~/projects/python_ollama_test directory and paste the following text into that file.

import ollama

def print_model_response(model, query):
    # Send a single-turn chat request to the Ollama server and print the reply
    response = ollama.chat(model=model, messages=[
        {
            'role': 'user',
            'content': query,
        },
    ])

    print(response['message']['content'])

if __name__ == '__main__':
    # Models to compare and the questions to ask each of them
    models = ['llama3.1', 'llama3.2', 'gemma2']
    queries = ['Write a function to compute Fibonacci numbers using iteration in Rust?',
               'What is the cube root of 1860867?',
               'What is the origin of Unix epoch time?',
               'Are AI bots going to turn humans into paperclips? Yes or no?',
               'Question: what kind of bear is best?']

    for model in models:
        for query in queries:
            print(f"Here is the output from: {model.upper()}")
            print_model_response(model, query)

Save the file and close your favorite text editor (hopefully it's Vim 🙂). We can now run the script and compare the results of llama3.1, llama3.2, and gemma2 using the command below.

python main.py
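
If you would rather see tokens appear as they are generated instead of waiting for each full response, the ollama package also supports streaming. The snippet below is a minimal sketch under the same assumptions as above (the server is running and llama3.2 is available).

import ollama

# Request a streamed response; chunks are printed as they arrive
stream = ollama.chat(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'Explain the Unix epoch in two sentences.'}],
    stream=True,
)

for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)
print()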

Notes Regarding Performance

As mentioned above, LLMs are well suited to GPUs because their computations are highly parallel, but they are also very resource-hungry. Some of the LLMs hosted on Oscar are too large to fit into the VRAM of a single GPU. If you attempt to launch one of the largest models (e.g., llama3.1:405b) with only one GPU, the Ollama server will split the model between the CPU and the GPU, which generally leads to poor performance. If you need to load an especially large LLM, we therefore recommend requesting multiple GPUs (see the example request below); the Ollama server will automatically split the model weights across them.
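
For example, a request like the one below (an illustrative sketch; tune the values to the model you plan to run) would allocate two GPUs, and Ollama would spread the model weights across both:

interact -n 8 -m 64g -q gpu -g 2 -t 2:00:00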

Appendix

Below is a list of the models that CCV currently hosts on Oscar. We will add to this list over time; if there is a model you would like us to host, please let us know by emailing support@ccv.brown.edu. Because models are added regularly, the list below may be slightly out of date at any given time. The most reliable way to see the models CCV hosts is to run the ollama list command, assuming you have started an Ollama server and set the OLLAMA_MODELS environment variable as described above.

Model Name            Version Tag
codellama             13b
codellama             34b
codellama             70b
codellama             7b
codellama             latest
codestral             22b
codestral             latest
deepseek-coder-v2     16b
deepseek-coder-v2     236b
deepseek-coder-v2     latest
gemma2                27b
gemma2                2b
gemma2                9b
gemma2                latest
llama2-uncensored     latest
llama3.1              405b
llama3.1              70b
llama3.1              8b
llama3.1              latest
llama3.2              1b
llama3.2              3b
llama3.2              latest
llama3.2-vision       11b
llama3.2-vision       90b
llama3.3              70b
llama3-chatqa         70b
llama3-chatqa         8b
llava                 13b
llava                 7b
llava                 latest
mistral               7b
mistral               latest
mistral-nemo          latest
mistral-small         22b
mixtral               latest
moondream             latest
neural-chat           latest
orca2                 13b
orca2                 7b
orca-mini             13b
orca-mini             70b
orca-mini             7b
phi3                  latest
phi3                  medium
qwen2.5-coder         32b
qwen2.5-coder         14b
qwen2.5-coder         7b
qwen2.5-coder         3b
solar                 latest
starcoder             latest
starcoder2            15b
starcoder2            3b
starcoder2            7b
starling-lm           latest
vicuna                latest
wizardlm2             latest
