How to Run AI/LLM Workloads on a Linux Server Using Docker and GPU Passthrough (2026)

Salman ChawhanJune 18, 20260170 views

Running AI and large language models (LLMs) on your own infrastructure has shifted from an experimental idea to a production reality in 2026. Whether you’re looking to self-host models for privacy, reduce API costs, or build internal AI tooling for your team, a properly configured Linux server with GPU passthrough through Docker is the foundation you need.

In this guide, we’ll walk through the complete setup — from NVIDIA driver installation to running your first LLM container — on Ubuntu 22.04 / 24.04.

Why Run AI Workloads On-Premise?

Privacy — Your data never leaves your infrastructure
Cost control — No per-token API billing at scale
Latency — Local inference is faster for real-time applications
Customization — Fine-tune and run custom models your way

Prerequisites

Ubuntu 22.04 or 24.04 server (bare metal or dedicated VM)
NVIDIA GPU (RTX 3080 / A100 / H100 or equivalent)
At least 16 GB RAM (32 GB+ recommended for 13B+ models)
Docker 24+ installed
Root / sudo access

Step 1: Verify Your GPU

First, confirm the system detects your GPU:

bash

lspci | grep -i nvidia

Check the GPU model and driver compatibility:

bash

ubuntu-drivers devices

Step 2: Install NVIDIA Drivers

Option A — Automatic (Recommended)

bash

sudo ubuntu-drivers autoinstall
sudo reboot

Option B — Manual (Specific Version)

bash

sudo apt update
sudo apt install nvidia-driver-550 -y
sudo reboot

After reboot, verify the driver:

bash

nvidia-smi

Expected output will show your GPU model, driver version, and CUDA version. If you see the table, the driver is working correctly.

Step 3: Install Docker

If Docker is not already installed:

bash

sudo apt update
sudo apt install ca-certificates curl gnupg -y

sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg

echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] \
  https://download.docker.com/linux/ubuntu \
  $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null

sudo apt update
sudo apt install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin -y

Verify:

bash

sudo docker run hello-world

Step 4: Install NVIDIA Container Toolkit

This is what bridges NVIDIA GPU access into Docker containers.

bash

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt update
sudo apt install nvidia-container-toolkit -y

Configure Docker to use NVIDIA runtime:

bash

sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

Step 5: Test GPU Access Inside Docker

Run a quick test to confirm the GPU is accessible inside a container:

bash

sudo docker run --rm --gpus all nvidia/cuda:12.3.0-base-ubuntu22.04 nvidia-smi

You should see the same nvidia-smi output as on the host. If it works, GPU passthrough is functioning correctly.

Step 6: Run an LLM with Ollama on Docker

Ollama is the most practical tool for self-hosting LLMs in 2026. It supports models like LLaMA 3, Mistral, Gemma, Phi-3, and many others.

Pull and Run the Ollama Docker Image

bash

sudo docker run -d \
  --gpus all \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  --restart always \
  ollama/ollama

Pull a Model (e.g., LLaMA 3)

bash

sudo docker exec -it ollama ollama pull llama3

Run a Model Interactively

bash

sudo docker exec -it ollama ollama run llama3

You now have a fully self-hosted LLM running on your Linux server, accessible over port 11434.

Step 7: Expose an OpenAI-Compatible API

Ollama exposes an OpenAI-compatible REST API at:

http://your-server-ip:11434/v1/

Test it with curl:

bash

curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "What is the Linux kernel?",
  "stream": false
}'

This means you can point any OpenAI-compatible client, app, or SDK at your local server — no external API key required.

Step 8: Secure the API with Nginx Reverse Proxy + SSL

You don’t want port 11434 exposed publicly without authentication. Set up Nginx as a reverse proxy with basic auth:

bash

sudo apt install nginx apache2-utils -y
sudo htpasswd -c /etc/nginx/.htpasswd aiuser

Create an Nginx config at /etc/nginx/sites-available/ollama:

nginx

server {
    listen 443 ssl;
    server_name ai.yourdomain.com;

    ssl_certificate /etc/letsencrypt/live/ai.yourdomain.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/ai.yourdomain.com/privkey.pem;

    auth_basic "Restricted";
    auth_basic_user_file /etc/nginx/.htpasswd;

    location / {
        proxy_pass http://localhost:11434;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}

Enable and reload:

bash

sudo ln -s /etc/nginx/sites-available/ollama /etc/nginx/sites-enabled/
sudo nginx -t && sudo systemctl reload nginx

Recommended Models by GPU VRAM

Model	Parameters	Min VRAM	Use Case
Phi-3 Mini	3.8B	4 GB	Quick tasks, summarization
Mistral	7B	8 GB	General purpose
LLaMA 3	8B	8 GB	General purpose, coding
LLaMA 3	70B	40 GB	High-quality reasoning
CodeLlama	13B	16 GB	Code generation

Monitoring GPU Usage

While the model is running, monitor GPU utilization in real time:

bash

watch -n 1 nvidia-smi

For longer-term monitoring, integrate with Prometheus + DCGM Exporter to collect GPU metrics and visualize them in Grafana.

Common Issues and Fixes

nvidia-smi not found after driver install → Reboot is required after driver installation.

Docker container can’t see GPU → Ensure nvidia-container-toolkit is installed and Docker was restarted after nvidia-ctk runtime configure.

Out of memory (OOM) errors → The model is too large for your GPU VRAM. Try a smaller quantized version (e.g., llama3:8b-instruct-q4_0).

Port 11434 not reachable → Check UFW rules. Run: sudo ufw allow 11434/tcp

Conclusion

Self-hosted AI inference on Linux is no longer a niche skill — it’s becoming a core competency for sysadmins and DevOps engineers in 2026. With Docker, NVIDIA container toolkit, and tools like Ollama, you can stand up a private LLM server in under an hour.

Pair this setup with a reverse proxy, proper firewall rules (see our UFW guide), and monitoring, and you have a production-ready AI inference server that keeps your data private and your API bills at zero.

Why Run AI Workloads On-Premise?

Prerequisites

Step 1: Verify Your GPU

Step 2: Install NVIDIA Drivers

Option A — Automatic (Recommended)

Option B — Manual (Specific Version)

Step 3: Install Docker

Step 4: Install NVIDIA Container Toolkit

Step 5: Test GPU Access Inside Docker

Step 6: Run an LLM with Ollama on Docker

Pull and Run the Ollama Docker Image

Pull a Model (e.g., LLaMA 3)

Run a Model Interactively

Step 7: Expose an OpenAI-Compatible API

Step 8: Secure the API with Nginx Reverse Proxy + SSL

Recommended Models by GPU VRAM

Monitoring GPU Usage

Common Issues and Fixes

Conclusion

Install ImunifyAV Without a Hosting Panel

How to Set Up and Configure UFW Firewall on Linux (Ubuntu/Debian)

Related posts

How to Set Up and Configure UFW Firewall on Linux (Ubuntu/Debian)

How to Convert PEM to PPK Format on Linux and Windows

How to Connect to SSH Using Linux/Ubuntu Desktop