How to Run AI/LLM Workloads on a Linux Server Using Docker and GPU Passthrough (2026)

Running AI and large language models (LLMs) on your own infrastructure has shifted from an experimental idea to a production reality in 2026. Whether you’re looking to self-host models for privacy, reduce API costs, or build internal AI tooling for your team, a properly configured Linux server with GPU passthrough through Docker is the foundation you need.

In this guide, we’ll walk through the complete setup — from NVIDIA driver installation to running your first LLM container — on Ubuntu 22.04 / 24.04.


Why Run AI Workloads On-Premise?

  • Privacy — Your data never leaves your infrastructure
  • Cost control — No per-token API billing at scale
  • Latency — Local inference is faster for real-time applications
  • Customization — Fine-tune and run custom models your way

Prerequisites

  • Ubuntu 22.04 or 24.04 server (bare metal or dedicated VM)
  • NVIDIA GPU (RTX 3080 / A100 / H100 or equivalent)
  • At least 16 GB RAM (32 GB+ recommended for 13B+ models)
  • Docker 24+ installed
  • Root / sudo access

Step 1: Verify Your GPU

First, confirm the system detects your GPU:

bash

lspci | grep -i nvidia

Check the GPU model and driver compatibility:

bash

ubuntu-drivers devices

Step 2: Install NVIDIA Drivers

Option A — Automatic (Recommended)

bash

sudo ubuntu-drivers autoinstall
sudo reboot

Option B — Manual (Specific Version)

bash

sudo apt update
sudo apt install nvidia-driver-550 -y
sudo reboot

After reboot, verify the driver:

bash

nvidia-smi

Expected output will show your GPU model, driver version, and CUDA version. If you see the table, the driver is working correctly.


Step 3: Install Docker

If Docker is not already installed:

bash

sudo apt update
sudo apt install ca-certificates curl gnupg -y

sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg

echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] \
  https://download.docker.com/linux/ubuntu \
  $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null

sudo apt update
sudo apt install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin -y

Verify:

bash

sudo docker run hello-world

Step 4: Install NVIDIA Container Toolkit

This is what bridges NVIDIA GPU access into Docker containers.

bash

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt update
sudo apt install nvidia-container-toolkit -y

Configure Docker to use NVIDIA runtime:

bash

sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

Step 5: Test GPU Access Inside Docker

Run a quick test to confirm the GPU is accessible inside a container:

bash

sudo docker run --rm --gpus all nvidia/cuda:12.3.0-base-ubuntu22.04 nvidia-smi

You should see the same nvidia-smi output as on the host. If it works, GPU passthrough is functioning correctly.


Step 6: Run an LLM with Ollama on Docker

Ollama is the most practical tool for self-hosting LLMs in 2026. It supports models like LLaMA 3, Mistral, Gemma, Phi-3, and many others.

Pull and Run the Ollama Docker Image

bash

sudo docker run -d \
  --gpus all \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  --restart always \
  ollama/ollama

Pull a Model (e.g., LLaMA 3)

bash

sudo docker exec -it ollama ollama pull llama3

Run a Model Interactively

bash

sudo docker exec -it ollama ollama run llama3

You now have a fully self-hosted LLM running on your Linux server, accessible over port 11434.


Step 7: Expose an OpenAI-Compatible API

Ollama exposes an OpenAI-compatible REST API at:

http://your-server-ip:11434/v1/

Test it with curl:

bash

curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "What is the Linux kernel?",
  "stream": false
}'

This means you can point any OpenAI-compatible client, app, or SDK at your local server — no external API key required.


Step 8: Secure the API with Nginx Reverse Proxy + SSL

You don’t want port 11434 exposed publicly without authentication. Set up Nginx as a reverse proxy with basic auth:

bash

sudo apt install nginx apache2-utils -y
sudo htpasswd -c /etc/nginx/.htpasswd aiuser

Create an Nginx config at /etc/nginx/sites-available/ollama:

nginx

server {
    listen 443 ssl;
    server_name ai.yourdomain.com;

    ssl_certificate /etc/letsencrypt/live/ai.yourdomain.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/ai.yourdomain.com/privkey.pem;

    auth_basic "Restricted";
    auth_basic_user_file /etc/nginx/.htpasswd;

    location / {
        proxy_pass http://localhost:11434;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}

Enable and reload:

bash

sudo ln -s /etc/nginx/sites-available/ollama /etc/nginx/sites-enabled/
sudo nginx -t && sudo systemctl reload nginx

Recommended Models by GPU VRAM

ModelParametersMin VRAMUse Case
Phi-3 Mini3.8B4 GBQuick tasks, summarization
Mistral7B8 GBGeneral purpose
LLaMA 38B8 GBGeneral purpose, coding
LLaMA 370B40 GBHigh-quality reasoning
CodeLlama13B16 GBCode generation

Monitoring GPU Usage

While the model is running, monitor GPU utilization in real time:

bash

watch -n 1 nvidia-smi

For longer-term monitoring, integrate with Prometheus + DCGM Exporter to collect GPU metrics and visualize them in Grafana.


Common Issues and Fixes

nvidia-smi not found after driver install → Reboot is required after driver installation.

Docker container can’t see GPU → Ensure nvidia-container-toolkit is installed and Docker was restarted after nvidia-ctk runtime configure.

Out of memory (OOM) errors → The model is too large for your GPU VRAM. Try a smaller quantized version (e.g., llama3:8b-instruct-q4_0).

Port 11434 not reachable → Check UFW rules. Run: sudo ufw allow 11434/tcp


Conclusion

Self-hosted AI inference on Linux is no longer a niche skill — it’s becoming a core competency for sysadmins and DevOps engineers in 2026. With Docker, NVIDIA container toolkit, and tools like Ollama, you can stand up a private LLM server in under an hour.

Pair this setup with a reverse proxy, proper firewall rules (see our UFW guide), and monitoring, and you have a production-ready AI inference server that keeps your data private and your API bills at zero.

Related posts

How to Set Up and Configure UFW Firewall on Linux (Ubuntu/Debian)

How to Convert PEM to PPK Format on Linux and Windows

How to Connect to SSH Using Linux/Ubuntu Desktop