Running local AI models on Linux: GLM-4.7-Flash with vLLM on Fedora Silverblue (RTX PRO 6000 Blackwell)
In an ideal world without any bad actors, it would be possible to rely on closed AI models running on someone else's servers for absolutely everything. In the real world, however, we need to weigh risks related to information security, geopolitics, biases, service availability and cost predictability. There are multiple levels of control and ownership to choose from, but short of manufacturing your own hardware and training your own models from the ground up, the reasonable "extreme" for gaining control is running and fine-tuning open-weight models on hardware that you own. To dig deeper into that option and research both the theoretical and practical sides of running models locally, I recently built a beefy AI workstation for Kipinä. In this post, I'll share some hardware and software decisions, as well as the hurdles I faced with my specific setup.
This post documents the exact changes required to run one of the more capable local coding models, zai-org/GLM-4.7-Flash, with vLLM on Fedora Silverblue using Podman and an NVIDIA RTX PRO 6000 Blackwell GPU.
If you're interested in running containerized GPU workloads on NVIDIA hardware, and especially if you're seeing CUDA Error 803 even though nvidia-smi works, this guide is for you.
Hardware specifications
As my laptop is a MacBook Pro with Apple Silicon, I wanted to also have NVIDIA-based hardware so that I can test local models on both leading platforms. I also want to enable shared usage for Kipinä experts and run non-AI workloads on the machine, so rather than going the DGX Spark route, I decided on a workstation build with the beefiest workstation GPU currently available for AI use: the NVIDIA RTX PRO 6000 Blackwell Workstation with 96 GB of VRAM.
The GPU and RAM prices being what they are, I wanted the rest of the system not to bottleneck the GPU without going totally overboard on CPU and motherboard cost. I found out that 192 GB of RAM is achievable in a stable manner on AM5, so rather than going with Threadripper, I settled on the top-end Ryzen 9 9950X CPU and an X870E-CREATOR motherboard. Full specs are below:
GPU: NVIDIA RTX PRO 6000 Blackwell Workstation (96 GB VRAM)
CPU: AMD Ryzen 9 9950X
RAM: 192 GB (4 x G.Skill Flare X5 DDR5-5600 48 GB)
Motherboard: ASUS PROART X870E-CREATOR WIFI ATX
PSU: Corsair HX1500i SHIFT
Storage: 4 TB + 2 TB Samsung 990 PRO M.2 NVMe SSDs
Software specifications
Since I want other Kipinä people to be able to run GPU workloads on the machine as well, I wanted to try out an atomic and immutable Linux distribution to keep the host system clean while allowing for containerized workloads and Toolbx-style development tool management.
This was my first time setting up CUDA and the NVIDIA Container Toolkit, and I had some issues figuring out which drivers I need on the host and how I should point the containers to the host driver libraries. Below are the driver and framework versions that gave me the first working stack.
OS: Fedora Silverblue (rpm-ostree, immutable base)
Container runtime: Podman (rootless)
Driver: NVIDIA 580.119.02
CUDA: 13.0
Primary framework: PyTorch 2.9.1 + CUDA 13.0
Inference stack: vLLM (cu130 nightly)
Models tested: GLM-4.7-Flash (MoE, long-context)
Containers: NVIDIA CUDA images + vLLM OpenAI server
Key runtime detail: must prefer host NVIDIA driver libs (LD_LIBRARY_PATH=/usr/lib64) to avoid CUDA Error 803
Typical workflow: GPU workloads run in containers; HF models cached on host and bind-mounted
TL;DR (The Fix)
Before going into the long explanation, if you’ve actually faced CUDA Error 803 and you just want it working, here’s what you need:
vllm/vllm-openai:cu130-nightly
Latest transformers from Git
NVIDIA CDI hook enabled in Podman
LD_LIBRARY_PATH=/usr/lib64 inside the container
That’s it. However, if you haven’t gotten that far yet or if you need additional details, continue reading for the whole setup guide.
NVIDIA driver installation on the Fedora Silverblue host
Even installing the NVIDIA drivers has its own hassles, especially with Secure Boot and LUKS enabled. Luckily, Comprehensive-Wall28 has an excellent guide for installing the drivers on Fedora desktops. I encourage you to read it and follow it to the letter. However, since the guide has many different branches, here is my rpm-ostree status listing showing which packages I ended up layering for my hardware (a rough sketch of the corresponding install commands follows the listing):
● fedora:fedora/43/x86_64/silverblue
Version: 43.20260211.0 (2026-02-11T01:21:41Z)
BaseCommit: ec73495c511530de7716612290ac0bc0f476ceba73bbce6b2cabadd3d52a5583
GPGSignature: Valid signature by C6E7F081CF80E13146676E88829B606631645531
LayeredPackages: akmod-nvidia akmods nvidia-container-toolkit rpmdevtools vim xorg-x11-drv-nvidia xorg-x11-drv-nvidia-cuda
LocalPackages: akmods-keys-0.0.2-8.fc43.noarch rpmfusion-free-release-43-1.noarch rpmfusion-nonfree-release-43-1.noarch
Initramfs: regenerate
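For reference, here is a minimal sketch of the layering commands that produce the package set above, assuming the RPM Fusion repositories and akmods-keys are already set up as described in the linked guide:

# Layer the NVIDIA driver stack and container tooling on the immutable base
# (package list mirrors the LayeredPackages line above)
rpm-ostree install akmods akmod-nvidia xorg-x11-drv-nvidia \
    xorg-x11-drv-nvidia-cuda nvidia-container-toolkit rpmdevtools vim

# The new deployment only takes effect after a reboot
systemctl reboot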
Issue: containerized workloads hit CUDA Error 803
After following the driver installation guide, I got to a point where everything looked correct, but when I tried to start vLLM in Podman it failed with:
RuntimeError: cudaGetDeviceCount() failed with Error 803
Inside the container:
nvidia-smi works ✅
torch.cuda.device_count() returns 1 ✅
torch.cuda.is_available() returns False ❌
This combination is the key diagnostic signal for figuring out whether you are hitting the exact same problem I did.
Root cause
The container is selecting the wrong libcuda.so.1 at runtime.
Inside the vLLM CUDA 13 image, multiple libcuda.so.1 candidates exist:
/usr/local/cuda-13.0/compat/libcuda.so.1 ❌
/usr/lib64/libcuda.so.1 ✅ (host driver)
PyTorch was resolving the CUDA compat driver library, which does not match the running kernel driver on the host. This mismatch produces Error 803, even though device nodes are present.
nvidia-smi working is not sufficient. PyTorch requires the correct user-space driver libraries.
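If you want to see this for yourself, one quick check is to look at which libcuda.so.1 the Python process actually maps after CUDA initialization. This is just a diagnostic sketch; run it inside the container with the same podman flags as in the validation step below:

# Trigger CUDA init, then print which libcuda.so.1 the process actually mapped
python3 -c "import torch; torch.cuda.is_available(); print(sorted({l.split()[-1] for l in open('/proc/self/maps') if 'libcuda.so' in l}))"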
Putting everything together: step-by-step guide after installing the drivers and nvidia-smi works on the host
Follow these instructions to get a working setup for GLM-4.7-Flash on vLLM in Podman.
Step 1: Build a vLLM Image That Supports GLM-4.7-Flash
Older vLLM releases do not recognize the GLM-4.7-Flash architecture correctly. You must use the cu130 nightly image.
Containerfile
FROM docker.io/vllm/vllm-openai:cu130-nightly

# Needed only to install Transformers from git
RUN apt-get update && \
    apt-get install -y --no-install-recommends git && \
    rm -rf /var/lib/apt/lists/*

# Bleeding-edge Transformers for GLM-4.7 support
RUN python3 -m pip install -U pip && \
    python3 -m pip install -U "git+https://github.com/huggingface/transformers.git"
Build it:
podman build -t vllm-glm47 .
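As a quick sanity check that the Git build of Transformers actually landed in the image (vllm-glm47 is just the tag chosen in the build command above), you can print its version; it should report a development version rather than a stable release:

# Print the Transformers version baked into the freshly built image
podman run --rm vllm-glm47 python3 -c "import transformers; print(transformers.__version__)"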
Step 2: Ensure NVIDIA CDI Is Enabled (Critical on Silverblue)
On Fedora Silverblue, Podman does not automatically activate the NVIDIA CDI hook.
Even if:
nvidia-smi works on the host
/etc/cdi/nvidia.yaml exists
CUDA may still fail inside containers unless CDI is active.
2.1 Verify CDI Spec Exists
ls -l /etc/cdi/nvidia.yaml
If missing, generate it:
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
(Installed via nvidia-container-toolkit-base.)
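To double-check what the generated spec actually exposes, nvidia-ctk can also list the CDI device names it knows about; on a single-GPU machine I would expect to see at least nvidia.com/gpu=0 and nvidia.com/gpu=all:

# List the CDI device names defined by the specs under /etc/cdi
nvidia-ctk cdi list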
2.2 Enable CDI in Podman
Edit:
sudo vim /etc/containers/containers.conf
Ensure:
[engine]
cdi_enabled = true
Reboot (recommended on Silverblue).
2.3 Use the NVIDIA OCI Hook When Running Containers
On Silverblue, explicitly pass:
--hooks-dir=/etc/containers/oci/hooks.d --security-opt=label=disable -e LD_LIBRARY_PATH=/usr/lib64
Without this, you may see:
UserWarning: Can't initialize NVML
torch.cuda.is_available() == False
even though the GPU is present.
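Before involving PyTorch at all, a minimal smoke test with a plain CUDA base image and the same flags helps confirm that the hook and SELinux label settings are wired up. The image tag below is only an example; any recent nvidia/cuda base image should do:

# nvidia-smi inside the container should show the same GPU and driver as on the host
podman run --rm --gpus all \
  --hooks-dir=/etc/containers/oci/hooks.d \
  --security-opt=label=disable \
  docker.io/nvidia/cuda:13.0.1-base-ubuntu24.04 \
  nvidia-smi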
Step 3: Validate CUDA From PyTorch
⚠️ Important: Use the same flags you will use for vLLM.
podman run --rm -i --gpus all \
  --hooks-dir=/etc/containers/oci/hooks.d \
  --security-opt=label=disable \
  -e LD_LIBRARY_PATH=/usr/lib64 \
  --ipc=host \
  vllm-glm47 \
  python3 - <<'PY'
import torch
print(torch.__version__, torch.version.cuda, torch.cuda.is_available(), torch.cuda.device_count())
PY
Expected:
2.9.1+cu130 13.0 True 1
If:
device_count() == 0
or is_available() == False
your CDI or driver injection is not active.
Step 4: Run vLLM
podman run --name vllm-server --replace --gpus all \
  --hooks-dir=/etc/containers/oci/hooks.d \
  --security-opt=label=disable \
  -e LD_LIBRARY_PATH=/usr/lib64 \
  --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm-glm47 \
  serve --model zai-org/GLM-4.7-Flash
You should see:
Resolved architecture: Glm4MoeLiteForCausalLM
and the model will begin loading.
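Once the logs show the server is ready, a quick request against the OpenAI-compatible endpoint confirms end-to-end inference. The model name matches the --model flag above:

# Minimal chat completion against the vLLM OpenAI-compatible API on port 8000
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "zai-org/GLM-4.7-Flash",
        "messages": [{"role": "user", "content": "Say hello in one short sentence."}]
      }'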
Why this is required on Silverblue
Silverblue + rootless Podman differs from Docker in three important ways:
NVIDIA runtime is not automatically injected
CUDA images contain compat driver libraries that may shadow host libs
SELinux can block device access unless labels are disabled
The combination means you must:
Enable CDI
Use the OCI hooks
Prefer host driver libraries
This is not a vLLM bug or a driver bug. It is a container runtime integration detail.
Key diagnostic signals
If you see:
| Symptom | Meaning |
|---|---|
| nvidia-smi works but torch fails | Driver library mismatch |
| Error 803 | Wrong libcuda.so.1 selected |
| NVML warning | CDI/hook not active |
| device_count()==1 but is_available()==False | Compat lib shadowing |
Final Result
After:
Enabling CDI
Using OCI hooks
Forcing host driver libraries
GLM-4.7-Flash runs cleanly on:
Fedora Silverblue
RTX PRO 6000 Blackwell
CUDA 13
Podman (rootless)
vLLM cu130 nightly
Optional: Make the Fix Permanent
To avoid passing the environment variable every time, bake it into the image:
ENV LD_LIBRARY_PATH=/usr/lib64
Rebuild the image, and the runtime flag is no longer needed.
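Concretely, append that line to the end of the Containerfile from Step 1 and rebuild with the same tag, so the podman run commands above keep working as-is:

# Rebuild after adding the ENV line to the Containerfile
podman build -t vllm-glm47 .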
Optional: Debugging Tip
To see what the container is actually loading:
ldconfig -p | grep libcuda.so.1
If you see /usr/local/cuda-*/compat/libcuda.so.1 being preferred, that’s the bug.
Final Words
This setup now runs GLM-4.7-Flash reliably on Blackwell GPUs under Fedora Silverblue with Podman and vLLM.
Happy inferencing 🚀
When you need a sparring partner for secure AI, contact Jari!
Jari Huilla, CTO & partner Kipinä
Jari Huilla is Kipinä's new CTO with an exceptionally long and diverse background in technology, having landed his first job at Nokia Research Center at the age of 15. Over the years, he has worked as a developer, a leader and a builder of growth companies. At Kipinä, Jari brings together deep technical knowledge and business-oriented thinking. In particular, he is interested in how to make AI solutions not only technically functional but also truly fit for purpose, and in how to understand their limitations rather than ignore them.