Running AI and large language models (LLMs) on your own infrastructure has shifted from an experimental idea to a production reality in 2026. Whether you’re looking to self-host models for privacy, reduce API costs, or build internal AI tooling for your team, a properly configured Linux server with GPU passthrough through Docker is the foundation you need.
In this guide, we’ll walk through the complete setup — from NVIDIA driver installation to running your first LLM container — on Ubuntu 22.04 / 24.04.
Why Run AI Workloads On-Premise?
- Privacy — Your data never leaves your infrastructure
- Cost control — No per-token API billing at scale
- Latency — Local inference is faster for real-time applications
- Customization — Fine-tune and run custom models your way
Prerequisites
- Ubuntu 22.04 or 24.04 server (bare metal or dedicated VM)
- NVIDIA GPU (RTX 3080 / A100 / H100 or equivalent)
- At least 16 GB RAM (32 GB+ recommended for 13B+ models)
- Docker 24+ installed
- Root /
sudoaccess
Step 1: Verify Your GPU
First, confirm the system detects your GPU:
bash
lspci | grep -i nvidia
Check the GPU model and driver compatibility:
bash
ubuntu-drivers devices
Step 2: Install NVIDIA Drivers
Option A — Automatic (Recommended)
bash
sudo ubuntu-drivers autoinstall
sudo reboot
Option B — Manual (Specific Version)
bash
sudo apt update
sudo apt install nvidia-driver-550 -y
sudo reboot
After reboot, verify the driver:
bash
nvidia-smi
Expected output will show your GPU model, driver version, and CUDA version. If you see the table, the driver is working correctly.
Step 3: Install Docker
If Docker is not already installed:
bash
sudo apt update
sudo apt install ca-certificates curl gnupg -y
sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
echo \
"deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] \
https://download.docker.com/linux/ubuntu \
$(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt update
sudo apt install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin -y
Verify:
bash
sudo docker run hello-world
Step 4: Install NVIDIA Container Toolkit
This is what bridges NVIDIA GPU access into Docker containers.
bash
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update
sudo apt install nvidia-container-toolkit -y
Configure Docker to use NVIDIA runtime:
bash
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
Step 5: Test GPU Access Inside Docker
Run a quick test to confirm the GPU is accessible inside a container:
bash
sudo docker run --rm --gpus all nvidia/cuda:12.3.0-base-ubuntu22.04 nvidia-smi
You should see the same nvidia-smi output as on the host. If it works, GPU passthrough is functioning correctly.
Step 6: Run an LLM with Ollama on Docker
Ollama is the most practical tool for self-hosting LLMs in 2026. It supports models like LLaMA 3, Mistral, Gemma, Phi-3, and many others.
Pull and Run the Ollama Docker Image
bash
sudo docker run -d \
--gpus all \
-v ollama:/root/.ollama \
-p 11434:11434 \
--name ollama \
--restart always \
ollama/ollama
Pull a Model (e.g., LLaMA 3)
bash
sudo docker exec -it ollama ollama pull llama3
Run a Model Interactively
bash
sudo docker exec -it ollama ollama run llama3
You now have a fully self-hosted LLM running on your Linux server, accessible over port 11434.
Step 7: Expose an OpenAI-Compatible API
Ollama exposes an OpenAI-compatible REST API at:
http://your-server-ip:11434/v1/
Test it with curl:
bash
curl http://localhost:11434/api/generate -d '{
"model": "llama3",
"prompt": "What is the Linux kernel?",
"stream": false
}'
This means you can point any OpenAI-compatible client, app, or SDK at your local server — no external API key required.
Step 8: Secure the API with Nginx Reverse Proxy + SSL
You don’t want port 11434 exposed publicly without authentication. Set up Nginx as a reverse proxy with basic auth:
bash
sudo apt install nginx apache2-utils -y
sudo htpasswd -c /etc/nginx/.htpasswd aiuser
Create an Nginx config at /etc/nginx/sites-available/ollama:
nginx
server {
listen 443 ssl;
server_name ai.yourdomain.com;
ssl_certificate /etc/letsencrypt/live/ai.yourdomain.com/fullchain.pem;
ssl_certificate_key /etc/letsencrypt/live/ai.yourdomain.com/privkey.pem;
auth_basic "Restricted";
auth_basic_user_file /etc/nginx/.htpasswd;
location / {
proxy_pass http://localhost:11434;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
}
}
Enable and reload:
bash
sudo ln -s /etc/nginx/sites-available/ollama /etc/nginx/sites-enabled/
sudo nginx -t && sudo systemctl reload nginx
Recommended Models by GPU VRAM
| Model | Parameters | Min VRAM | Use Case |
|---|---|---|---|
| Phi-3 Mini | 3.8B | 4 GB | Quick tasks, summarization |
| Mistral | 7B | 8 GB | General purpose |
| LLaMA 3 | 8B | 8 GB | General purpose, coding |
| LLaMA 3 | 70B | 40 GB | High-quality reasoning |
| CodeLlama | 13B | 16 GB | Code generation |
Monitoring GPU Usage
While the model is running, monitor GPU utilization in real time:
bash
watch -n 1 nvidia-smi
For longer-term monitoring, integrate with Prometheus + DCGM Exporter to collect GPU metrics and visualize them in Grafana.
Common Issues and Fixes
nvidia-smi not found after driver install → Reboot is required after driver installation.
Docker container can’t see GPU → Ensure nvidia-container-toolkit is installed and Docker was restarted after nvidia-ctk runtime configure.
Out of memory (OOM) errors → The model is too large for your GPU VRAM. Try a smaller quantized version (e.g., llama3:8b-instruct-q4_0).
Port 11434 not reachable → Check UFW rules. Run: sudo ufw allow 11434/tcp
Conclusion
Self-hosted AI inference on Linux is no longer a niche skill — it’s becoming a core competency for sysadmins and DevOps engineers in 2026. With Docker, NVIDIA container toolkit, and tools like Ollama, you can stand up a private LLM server in under an hour.
Pair this setup with a reverse proxy, proper firewall rules (see our UFW guide), and monitoring, and you have a production-ready AI inference server that keeps your data private and your API bills at zero.