How-To-Tutorials · August 24, 2025

How to Run Ollama LLM on NVIDIA Jetson Orin Nano in 5 Minutes

Running a language model locally on an edge device is genuinely useful -- private inference, no cloud costs, and it works offline. The Jetson Orin Nano has enough GPU horsepower to run smaller models like Llama 3.2:1b at a reasonable speed, and Ollama makes the whole process dead simple.

Flashing and Preparing Your Jetson Orin Nano

Before anything else, your Jetson needs a proper OS and the right drivers.

  • Flash JetPack 6.x using SDK Manager: Grab NVIDIA's SDK Manager on a separate Ubuntu host machine and flash the latest JetPack 6 onto your Orin Nano. This gets you the correct L4T (Linux for Tegra) base, CUDA libraries, and cuDNN -- all of which Ollama needs to actually use the GPU. JetPack 5.x still works but you'll miss out on newer CUDA optimizations.
  • Update everything:
    sudo apt update && sudo apt upgrade -y
  • Set max performance mode:
    sudo nvpmodel -m 0
    sudo jetson_clocks
    This unlocks the full clock speeds. Without it, the Jetson runs in a throttled power mode and inference will be noticeably slower.
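
Before moving on, it's easy to confirm both of these on the Jetson itself. The exact output varies between JetPack releases, but you should see the L4T/JetPack version you flashed and the power mode you just selected:

# Confirm the flashed L4T / JetPack release
cat /etc/nv_tegra_release
apt-cache show nvidia-jetpack | grep Version

# Confirm the active power mode (should match what you set with nvpmodel -m 0)
sudo nvpmodel -q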

Step 1: Install Ollama

You've got two options here. Pick whichever fits your workflow.

Option A: Native Install (Fastest Setup)

One command. The install script detects the ARM64 architecture, installs the matching build, and sets Ollama up as a systemd service:

curl -fsSL https://ollama.com/install.sh | sh

After install, the Ollama server starts automatically in the background. You can verify with ollama --version -- make sure you're on v0.5+ (current releases are well past that). Older versions had issues with Llama 3.2 model formats.
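
A quick sanity check: the installer registers an ollama systemd service, so you can confirm both the version and that the server is actually up:

ollama --version                  # client/server version, should be v0.5 or newer
systemctl is-active ollama        # should print "active"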

Option B: Docker via Jetson Containers

If you prefer containerized setups (or need to run multiple AI tools side by side), use NVIDIA's jetson-containers project:

git clone https://github.com/dusty-nv/jetson-containers.git
cd jetson-containers
sudo bash install.sh
jetson-containers run $(autotag ollama)

This approach guarantees the container has the right CUDA/cuDNN versions matched to your JetPack -- no library mismatch headaches.
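
Whichever install route you take, you can confirm the server is listening on its default port with a quick request to the version endpoint:

curl http://localhost:11434/api/version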

Step 2: Pull the Model

With Ollama running, grab Llama 3.2:1b. It's a 1.24-billion parameter model, roughly 1.3 GB in Q8_0 quantization. Small enough to fit comfortably in the Orin Nano's memory:

ollama pull llama3.2:1b

Want something a bit more capable? Try llama3.2:3b if your Orin Nano has the 8GB configuration. The 3B model is noticeably better at reasoning tasks but will be slower. For the base 4GB Orin Nano, stick with 1b.
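
Once the pull finishes, a quick listing confirms the model is on disk (the 1B model should show up at roughly 1.3 GB):

ollama list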

Step 3: Run It

ollama run llama3.2:1b

That drops you into an interactive chat session right in the terminal. Type a question, get a response. The first inference takes a few seconds as the model loads into GPU memory, then subsequent responses are faster.

You can also hit the Ollama API programmatically at http://localhost:11434 -- useful if you're building an application that needs LLM responses as part of a larger pipeline.
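
As a minimal example, here's a single non-streaming request to the generate endpoint; the prompt is just a placeholder, and the reply comes back as JSON with the text in the "response" field:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2:1b",
  "prompt": "Explain what edge computing is in one sentence.",
  "stream": false
}'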

Optional: Add a Web Interface with Open WebUI

If you'd rather chat through a browser than a terminal, Open WebUI gives you a clean ChatGPT-style interface. The command below uses the :ollama image tag, which bundles its own Ollama instance inside the container, so it works regardless of which install option you chose above:

docker run -d -p 3000:8080 --gpus=all \
  -v ollama:/root/.ollama \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:ollama

Then open http://<your-jetson-ip>:3000 in a browser. You get conversation history, model switching, and a much nicer experience for longer sessions.
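
If the page doesn't come up, two quick Docker checks will tell you whether the container started cleanly. (Note that on Jetson, depending on how your Docker daemon is configured, you may need --runtime nvidia in place of --gpus=all.)

docker ps --filter name=open-webui     # the container should be listed as "Up"
docker logs --tail 20 open-webui       # look for startup errors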

Troubleshooting

  • "No GPU detected" error: This usually means CUDA libraries aren't in the path. If you did a native install, run ollama show --system and check that it reports a GPU. If not, make sure /usr/local/cuda/lib64 is in your LD_LIBRARY_PATH. The Docker/container approach avoids this issue entirely since the container image ships with the right paths.
  • Thermal throttling or overcurrent warnings: The 1B text model shouldn't cause this on its own, but if you're running other GPU workloads simultaneously, the Orin Nano can hit its power limit. Use tegrastats to monitor in real time.
  • Slow responses: Double-check that you ran nvpmodel -m 0 and jetson_clocks. Without max performance mode, inference speed drops dramatically.
  • Model download stalls: Ollama downloads models from its own CDN. If you're behind a corporate proxy, the HTTP_PROXY and HTTPS_PROXY environment variables need to be visible to the Ollama server before you pull; with the native install that means setting them on the systemd service, not just in your shell (see the snippet after this list).
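
Two of those fixes involve environment variables. Here's a rough sketch for the native install; the proxy address is just a placeholder for whatever your network actually uses:

# CUDA libraries not found: put the CUDA lib directory on the library path
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

# Proxy: the native install runs Ollama as a systemd service, so the proxy has to
# be set on the service, not just in your shell. Add these lines via the editor:
sudo systemctl edit ollama
#   [Service]
#   Environment="HTTPS_PROXY=http://proxy.example.com:8080"   # placeholder address
sudo systemctl daemon-reload
sudo systemctl restart ollama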

Quick Reference: All Commands

# System prep
sudo apt update && sudo apt upgrade -y
sudo nvpmodel -m 0
sudo jetson_clocks

# Native install
curl -fsSL https://ollama.com/install.sh | sh

# Or Docker setup
git clone https://github.com/dusty-nv/jetson-containers.git
cd jetson-containers
sudo bash install.sh
jetson-containers run $(autotag ollama)

# Pull and run the model
ollama pull llama3.2:1b
ollama run llama3.2:1b

# Optional: web UI
docker run -d -p 3000:8080 --gpus=all \
  -v ollama:/root/.ollama \
  -v open-webui:/app/backend/data \
  --name open-webui --restart always \
  ghcr.io/open-webui/open-webui:ollama

That's it. You've got a fully local LLM running on an edge device the size of a credit card. From here, you could integrate it into a home automation system, use it for local document Q&A, or build a voice assistant that never phones home. The Ollama API makes it straightforward to plug into Python scripts or Node.js apps for whatever you're building.