Ollama
This page describes how to run large language models (LLMs) directly on Oscar nodes using Ollama.
LLMs Hosted by CCV
CCV hosts several dozen public, open-weight LLMs on Oscar. This includes Llama 3.2, Phi 3, Mistral, and Gemma 2. You can see the complete list in the Appendix section below.
We begin by opening a terminal and connecting to Oscar. This can be done using Open OnDemand, a terminal application on your local machine, or PuTTY if you're on a Windows machine.
Once we have a terminal session on Oscar, we need to set an environment variable that tells Ollama where to look for the CCV-hosted models. This only needs to be done once, and you can do so using the commands below.
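For example, the variable can be added to your ~/.bashrc so it persists across sessions. Note that the shared directory path below is a placeholder assumption; check with CCV for the actual location of the hosted model directory.

```shell
# Point Ollama at the shared, CCV-hosted model directory.
# NOTE: the path below is a placeholder; substitute the actual shared
# directory published by CCV.
echo 'export OLLAMA_MODELS=/path/to/ccv/shared/ollama/models' >> ~/.bashrc
source ~/.bashrc
```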
Requesting a GPU Node
LLMs are particularly well suited to running on GPUs, so we begin by requesting a GPU node on Oscar using the following interact
command, which requests 4 CPU cores, 32 GB of memory, and 1 GPU for 1 hour.
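Such a request might look like the following sketch; the flag spellings follow Oscar's interact utility, so check its help output to confirm the exact options on your system.

```shell
# Request 4 CPU cores, 32 GB of memory, and 1 GPU for 1 hour
# on the GPU partition.
interact -q gpu -n 4 -m 32g -g 1 -t 1:00:00
```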
Note that depending on the particular LLM, you may want additional resources (e.g., more CPU cores, memory, or GPUs). The above example should be good for most models.
Starting an Ollama Server
There are several ways to run large language models directly on Oscar. One particularly straightforward and flexible approach is to use the Ollama framework, which is installed as a module on Oscar.
Once our job has been allocated and we are on a GPU node, we next need to load the ollama
module.
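Loading the module looks like this:

```shell
module load ollama
```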
Because the Ollama framework operates using a client/server architecture, we must now launch the server component of Ollama. This is done using the command below.
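The server is started with:

```shell
# Starts the Ollama server in the foreground; leave this running
# for as long as you want to use the models.
ollama serve
```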
After running the command above, we will see a stream of output; this is the indication that the Ollama server has started.
Running an LLM Interactively with Ollama
Now that we have the Ollama server running, we can use it to launch LLMs on our GPU node. To do so, we must first start a new terminal session and use it to connect to our GPU node. Note that our original terminal session that we started above needs to continue running; that session is responsible for running the Ollama server. We are going to use a new terminal session to start the client. If you are using an Open OnDemand Desktop session, you can right-click on the Terminal
icon at the bottom of the screen, and select New Window
. Similarly, if you are connecting via your local machine's terminal application or PuTTY on Windows, simply open a new window.
Once we have a new terminal started, run the myq
command to see the hostname of our running Ollama server; it will be under the NODES
heading and look something like gpuXXXX
. We can connect to our GPU node from the login node by running the following command, where XXXX
is an integer greater than 1000
.
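Putting those two steps together (the node name shown is illustrative):

```shell
myq          # note the hostname under the NODES heading, e.g. gpuXXXX
ssh gpuXXXX  # replace gpuXXXX with the hostname reported by myq
```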
Starting an Interactive Chat
Once we have connected to our GPU node, we are nearly ready to start our LLM running. We first need to load the ollama
module again, which we do using the command below.
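As before, this is:

```shell
module load ollama
```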
We can now run an interactive chat session with the llama3.2
model using the command below. Note that it may take a few seconds for the chat interface to start.
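The chat session is started with:

```shell
# Opens an interactive chat with the llama3.2 model; the first launch
# may take a few seconds while the model loads onto the GPU.
ollama run llama3.2
```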
The above command will take us into an interactive chat session with llama3.2
. This should be apparent when our command prompt changes from the usual Linux command prompt into a >>>
prompt.
You can now enter queries directly into the prompt. For example, you could ask: "Why is the sky blue?"
Whenever we are finished with our chat session, we can exit the chat by typing /bye.
Loading Specific Versions of a Model
Suppose that instead of running the default version of a model in Ollama, we wanted a particular version. For example, the default for gemma2
is the 9b
version, which has roughly 9-billion parameters. If we wanted to launch the larger 27b
model of gemma2
, we could do so using the command below.
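In Ollama, a specific version is selected by appending a tag to the model name with a colon:

```shell
# Launch the 27-billion-parameter version rather than the default 9b.
ollama run gemma2:27b
```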
Running Ollama Models via Python
If we would prefer to interact with Ollama models via Python, we can do so using the ollama
package in Python. This is useful, for example, if we want to benchmark models against one another or other long-running tasks.
This section assumes we have already started the Ollama server as described at the beginning of this page. If so, we can start a new terminal as we did before and then create a Python virtual environment using the following commands.
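For example (the environment name ollama_env below is just an illustrative choice):

```shell
# On Oscar you may first need to load a recent Python module, e.g.:
#   module load python
mkdir -p ~/projects/python_ollama_test
cd ~/projects/python_ollama_test
python3 -m venv ollama_env   # "ollama_env" is an example name
```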
We can then activate our newly created virtual environment and install the Ollama Python package using the following commands.
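Assuming the environment was created at ~/projects/python_ollama_test/ollama_env (an example path), this looks like:

```shell
source ~/projects/python_ollama_test/ollama_env/bin/activate
pip install ollama
```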
Calling Ollama from Python Script
Now, let's create a Python script that will compare the output of three models on a few questions. Create a file called main.py
inside our ~/projects/python_ollama_test
directory and paste the following text into that file.
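A minimal sketch of such a script is shown below. The model names come from the Appendix; the questions are arbitrary examples, and the script assumes the Ollama server started earlier is still running and that the ollama package is installed in the active environment.

```python
# main.py -- compare three hosted models on the same set of questions.
import ollama

# Illustrative choices; swap in any models from the Appendix.
MODELS = ["llama3.1", "llama3.2", "gemma2"]
QUESTIONS = [
    "Why is the sky blue?",
    "Explain the difference between a process and a thread.",
]

for question in QUESTIONS:
    print(f"=== {question} ===")
    for model in MODELS:
        # ollama.chat sends the conversation to the running Ollama
        # server and returns the model's reply.
        response = ollama.chat(
            model=model,
            messages=[{"role": "user", "content": question}],
        )
        print(f"--- {model} ---")
        print(response["message"]["content"])
    print()
```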
Save the file and close your favorite text editor—hopefully it's Vim 🙂.
We can now run the script and compare the results of llama3.1
, llama3.2
, and gemma2
using the command below.
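With the virtual environment still active:

```shell
python main.py
```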
Notes Regarding Performance
As mentioned above, LLMs are very well suited to running on GPUs because of their extreme parallelism. LLMs are also just very resource-hungry in general. Some of the LLMs hosted on Oscar are too large to fit into the VRAM of a single GPU. Therefore, if you attempt to launch one of the largest LLMs (e.g., llama3.1:405b
), the Ollama server will split the model between the CPU and the GPU. This will generally lead to poor performance. Thus, if you need to load an especially large LLM, we recommend requesting multiple GPUs. The Ollama server will handle splitting the model weights across the multiple GPUs automatically.
Appendix
Below is a list of all the models that CCV currently hosts on Oscar. We will add to this list over time, and if there are models you would like to request, please let us know by emailing support@ccv.brown.edu.
Model Name | Version Tag |
---|---|
codellama | 13b |
codellama | 34b |
codellama | 70b |
codellama | 7b |
codellama | latest |
codestral | 22b |
codestral | latest |
deepseek-coder-v2 | 16b |
deepseek-coder-v2 | 236b |
deepseek-coder-v2 | latest |
gemma2 | 27b |
gemma2 | 2b |
gemma2 | 9b |
gemma2 | latest |
llama2-uncensored | latest |
llama3.1 | 405b |
llama3.1 | 70b |
llama3.1 | 8b |
llama3.1 | latest |
llama3.2 | 1b |
llama3.2 | 3b |
llama3.2 | latest |
llama3-chatqa | 70b |
llama3-chatqa | 8b |
llava | 13b |
llava | 7b |
llava | latest |
mistral | 7b |
mistral | latest |
mistral-nemo | latest |
mistral-small | 22b |
mixtral | latest |
moondream | latest |
neural-chat | latest |
orca2 | 13b |
orca2 | 7b |
orca-mini | 13b |
orca-mini | 70b |
orca-mini | 7b |
phi3 | latest |
phi3 | medium |
qwen2.5-coder | 32b |
qwen2.5-coder | 14b |
qwen2.5-coder | 7b |
qwen2.5-coder | 3b |
solar | latest |
starcoder | latest |
starcoder2 | 15b |
starcoder2 | 3b |
starcoder2 | 7b |
starling-lm | latest |
vicuna | latest |
wizardlm2 | latest |