Running local AI models on Linux: GLM-4.7-Flash with vLLM on Fedora Silverblue (RTX PRO 6000 Blackwell)

In an ideal world without any bad actors, it would be possible to rely on closed AI models running on someone else's servers for absolutely everything. In the real world, however, we need to weigh risks related to information security, geopolitics, biases, service availability, and cost predictability. There are multiple levels of control and ownership to choose from, but short of manufacturing your own hardware and training your own models from the ground up, the reasonable “extreme” for gaining control is running and fine-tuning open-weight models on hardware that you own. To dig deeper into that option and research both the theoretical and practical sides of running models locally, I recently built a beefy AI workstation for Kipinä. In this post, I’ll share some hardware and software decisions, as well as the hurdles I faced with my specific setup.

This post documents the exact changes required to run one of the more capable local coding models, zai-org/GLM-4.7-Flash, with vLLM on Fedora Silverblue using Podman and an NVIDIA RTX PRO 6000 Blackwell GPU.

If you're interested in running containerized GPU workloads on NVIDIA hardware—and especially if you're seeing CUDA Error 803 even though nvidia-smi works—this guide is for you.

Hardware specifications

As my laptop is a MacBook Pro with Apple Silicon, I wanted to also have NVIDIA-based hardware so that I could test local models on both leading platforms. I also want Kipinä experts to be able to share the machine and to run non-AI workloads on it, so rather than going the DGX Spark route, I decided on a workstation build with the beefiest workstation GPU currently available for AI use—the NVIDIA RTX PRO 6000 Blackwell Workstation with 96 GB of VRAM.

With GPU and RAM prices being what they are, I wanted the rest of the system not to bottleneck the GPU, without going totally overboard on CPU and motherboard cost. I found out that 192 GB of RAM is achievable in a stable configuration on AM5, so rather than going with Threadripper, I settled on a top-end Ryzen 9 9950X CPU and an X870E-CREATOR motherboard. Full specs are below:

GPU: NVIDIA RTX PRO 6000 Blackwell Workstation (96 GB VRAM)

CPU: AMD Ryzen 9 9950X

RAM: 192 GB (4 x G.Skill Flare X5 DDR5-5600 48 GB)

Motherboard: ASUS PROART X870E-CREATOR WIFI ATX

PSU: Corsair HX1500i SHIFT

Storage: 4 TB + 2 TB Samsung 990 PRO M.2 NVMe SSDs

Software specifications

Since I want other Kipinä people to be able to run GPU workloads on the machine as well, I wanted to try out an atomic and immutable Linux distribution to keep the host system clean while allowing for containerized workloads and Toolbx-style development tool management.

This was my first time setting up CUDA and the NVIDIA Container Toolkit, and I had some trouble figuring out which drivers I needed on the host and how to point the containers at the host driver libraries. Below are the driver and framework versions that gave me my first working stack.

OS: Fedora Silverblue (rpm-ostree, immutable base)
Container runtime: Podman (rootless)
Driver: NVIDIA 580.119.02
CUDA: 13.0
Primary framework: PyTorch 2.9.1 + CUDA 13.0
Inference stack: vLLM (cu130 nightly)
Models tested: GLM-4.7-Flash (MoE, long-context)
Containers: NVIDIA CUDA images + vLLM OpenAI server
Key runtime detail: must prefer host NVIDIA driver libs (LD_LIBRARY_PATH=/usr/lib64) to avoid CUDA Error 803
Typical workflow: GPU workloads run in containers; HF models cached on host and bind-mounted


TL;DR (The Fix)

Before going into the long explanation, if you've actually faced CUDA Error 803 and you just want it working, here's what you need:

  1. The vllm/vllm-openai:cu130-nightly image

  2. Latest transformers from Git

  3. NVIDIA CDI hook enabled in Podman

  4. LD_LIBRARY_PATH=/usr/lib64 inside the container

That's it. However, if you haven't gotten that far yet or if you need additional details, continue reading for the whole setup guide.
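
As a quick sanity check, here is a hedged sketch of those four pieces combined into one command, using the vllm-glm47 image built in Step 1 below; it should print True once everything is wired up:

# Nightly-based image, OCI hook dir, SELinux label off, and host driver libs
podman run --rm --gpus all \
  --hooks-dir=/etc/containers/oci/hooks.d \
  --security-opt=label=disable \
  -e LD_LIBRARY_PATH=/usr/lib64 \
  vllm-glm47 \
  python3 -c "import torch; print(torch.cuda.is_available())"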


Installing NVIDIA drivers on the Fedora Silverblue host

Even installing the NVIDIA drivers has its own hassles, especially with Secure Boot and LUKS enabled. Luckily, Comprehensive-Wall28 has an excellent guide for installing the drivers on Fedora desktops. I encourage you to read it and follow it to the letter. However, since the guide has many different branches, here is the minimal rpm-ostree status listing showing which packages I ended up layering for my hardware:

● fedora:fedora/43/x86_64/silverblue
Version: 43.20260211.0 (2026-02-11T01:21:41Z)
               BaseCommit: ec73495c511530de7716612290ac0bc0f476ceba73bbce6b2cabadd3d52a5583
             GPGSignature: Valid signature by C6E7F081CF80E13146676E88829B606631645531
          LayeredPackages: akmod-nvidia akmods nvidia-container-toolkit rpmdevtools vim xorg-x11-drv-nvidia xorg-x11-drv-nvidia-cuda
            LocalPackages: akmods-keys-0.0.2-8.fc43.noarch rpmfusion-free-release-43-1.noarch rpmfusion-nonfree-release-43-1.noarch
                Initramfs: regenerate
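
For reference, the layered packages above correspond roughly to a single rpm-ostree invocation. This is a sketch that assumes the RPM Fusion repositories and akmods-keys are already set up as described in the linked guide:

# Layer the NVIDIA driver, CUDA userspace tools, and container toolkit
rpm-ostree install akmods akmod-nvidia xorg-x11-drv-nvidia \
  xorg-x11-drv-nvidia-cuda nvidia-container-toolkit rpmdevtools vim

# Reboot into the new deployment so the akmod-built kernel module loads
systemctl reboot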

Issue: Containerized workloads encounter CUDA Error 803

After following the driver installation guide, I got to a point where everything looked correct, but when I tried to start vLLM in Podman, it failed with:

RuntimeError: cudaGetDeviceCount() failed with Error 803

Inside the container:

  • nvidia-smi works ✅

  • torch.cuda.device_count() returns 1 ✅

  • torch.cuda.is_available() returns False ❌

This combination is the key diagnostic signal for determining whether you are facing the same problem I did.
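
A quick way to run the same probes yourself is shown below; this is a sketch against the base image, and the exact output depends on which flags you already pass, but in the broken state described above it prints 1 False:

# Probe PyTorch's view of the GPU inside the container
podman run --rm --gpus all --security-opt=label=disable \
  docker.io/vllm/vllm-openai:cu130-nightly \
  python3 -c "import torch; print(torch.cuda.device_count(), torch.cuda.is_available())"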

Root cause

The container is selecting the wrong libcuda.so.1 at runtime.

Inside the vLLM CUDA 13 image, multiple libcuda.so.1 candidates exist:

/usr/local/cuda-13.0/compat/libcuda.so.1 ❌

/usr/lib64/libcuda.so.1 ✅ (host driver)


PyTorch was resolving the CUDA compat driver library, which does not match the running kernel driver on the host. This mismatch produces Error 803, even though device nodes are present.

nvidia-smi working is not sufficient. PyTorch requires the correct user-space driver libraries.
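
You can inspect both candidates yourself; a sketch, assuming the image accepts an arbitrary command the same way the validation step later in this guide does:

# List every libcuda.so.1 the container can see
podman run --rm --gpus all --security-opt=label=disable \
  docker.io/vllm/vllm-openai:cu130-nightly \
  bash -c 'find /usr -name "libcuda.so.1" 2>/dev/null'

In the problematic state, both the compat copy and the host-provided copy show up, and PyTorch ends up resolving the compat copy.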


Putting everything together: step-by-step guide once the drivers are installed and nvidia-smi works on the host

Follow these instructions to get a working setup for GLM-4.7-Flash on vLLM in Podman.


Step 1: Build a vLLM Image That Supports GLM-4.7-Flash

Older vLLM releases do not recognize the GLM-4.7-Flash architecture correctly. You must use the cu130 nightly image.

Containerfile

FROM docker.io/vllm/vllm-openai:cu130-nightly

# Needed only to install Transformers from git
RUN apt-get update && \
    apt-get install -y --no-install-recommends git && \
    rm -rf /var/lib/apt/lists/*

# Bleeding-edge Transformers for GLM-4.7 support
RUN python3 -m pip install -U pip && \
    python3 -m pip install -U "git+https://github.com/huggingface/transformers.git"

Build it:

podman build -t vllm-glm47 .
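
To sanity-check that the image really picked up a Git build of Transformers (a quick sketch; the exact version string will differ on your machine, but Git installs typically carry a .dev0 suffix):

# Print the Transformers version baked into the new image
podman run --rm vllm-glm47 \
  python3 -c "import transformers; print(transformers.__version__)"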

Step 2: Ensure NVIDIA CDI Is Enabled (Critical on Silverblue)

On Fedora Silverblue, Podman does not automatically activate the NVIDIA CDI hook.

Even if:

  • nvidia-smi works on the host

  • /etc/cdi/nvidia.yaml exists

CUDA may still fail inside containers unless CDI is active.

2.1 Verify CDI Spec Exists

ls -l /etc/cdi/nvidia.yaml

If missing, generate it:

sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

(The nvidia-ctk tool comes with nvidia-container-toolkit-base.)
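
You can also ask the toolkit which CDI device names Podman will be able to resolve:

# List the CDI devices described by the generated spec
nvidia-ctk cdi list

On a single-GPU machine this typically lists nvidia.com/gpu=0 and nvidia.com/gpu=all.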

2.2 Enable CDI in Podman

Edit:

sudo vim /etc/containers/containers.conf

Ensure:

[engine]
cdi_enabled = true

Reboot (recommended on Silverblue).
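
After the reboot, a quick smoke test shows whether the GPU is reachable at all; a sketch, assuming the base image lets you run nvidia-smi directly (the --security-opt=label=disable flag keeps SELinux from blocking device access, as discussed later):

# nvidia-smi should already work here, even before the LD_LIBRARY_PATH fix
podman run --rm --gpus all --security-opt=label=disable \
  docker.io/vllm/vllm-openai:cu130-nightly nvidia-smi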

2.3 Use the NVIDIA OCI Hook When Running Containers

On Silverblue, explicitly pass:

--hooks-dir=/etc/containers/oci/hooks.d
--security-opt=label=disable
-e LD_LIBRARY_PATH=/usr/lib64

Without these flags, you may see:

UserWarning: Cannot initialize NVML
torch.cuda.is_available() == False

even though the GPU is present.


Step 3: Validate CUDA From PyTorch

⚠️ Important: Use the same flags you will use for vLLM.

podman run --rm --gpus all \
  --hooks-dir=/etc/containers/oci/hooks.d \
  --security-opt=label=disable \
  -e LD_LIBRARY_PATH=/usr/lib64 \
  --ipc=host \
  vllm-glm47 \
  python3 - <<'PY'
import torch
print(torch.__version__, torch.version.cuda,
      torch.cuda.is_available(),
      torch.cuda.device_count())
PY

Expected:

2.9.1+cu130 13.0 True 1

If device_count() == 0 or is_available() == False, your CDI or driver injection is not active.


Step 4: Run vLLM

podman run --name vllm-server --replace --gpus all \
  --hooks-dir=/etc/containers/oci/hooks.d \
  --security-opt=label=disable \
  -e LD_LIBRARY_PATH=/usr/lib64 \
  --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm-glm47 \
  serve --model zai-org/GLM-4.7-Flash

You should see:

Resolved architecture: Glm4MoeLiteForCausalLM

and the model will begin loading.
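
Once the weights have loaded and the server is listening on port 8000, you can poke the OpenAI-compatible API from the host. A minimal sketch with curl:

# List the models the server exposes
curl http://localhost:8000/v1/models

# Ask for a short chat completion
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "zai-org/GLM-4.7-Flash",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64
      }'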


Why this is required on Silverblue

Silverblue + rootless Podman differs from Docker in three important ways:

  1. The NVIDIA runtime is not automatically injected.

  2. CUDA images ship compat driver libraries that can shadow the host driver libraries.

  3. SELinux can block device access unless labels are disabled.

The combination means you must:

  • Enable CDI

  • Use the OCI hooks

  • Prefer host driver libraries

This is not a vLLM bug or a driver bug. It is a container runtime integration detail.


Key diagnostic signals

If you see one of these symptoms, here is what it means:

  • nvidia-smi works but torch fails → driver library mismatch

  • Error 803 → wrong libcuda.so.1 selected

  • NVML warning → CDI/hook not active

  • device_count() == 1 but is_available() == False → compat library shadowing

Final Result

After:

  • Enabling CDI

  • Using OCI hooks

  • Forcing host driver libraries

GLM-4.7-Flash runs cleanly on:

  • Fedora Silverblue

  • RTX PRO 6000 Blackwell

  • CUDA 13

  • Podman (rootless)

  • vLLM cu130 nightly


Optional: Make the Fix Permanent

To avoid passing the environment variable every time, bake it into the image:

ENV LD_LIBRARY_PATH=/usr/lib64

Rebuild the image, and the -e LD_LIBRARY_PATH=/usr/lib64 flag is no longer needed at runtime; the hook and SELinux flags are still required.

Optional: Debugging Tip

To see which libcuda.so.1 the dynamic linker inside the container actually resolves:

ldconfig -p | grep libcuda.so.1

If you see /usr/local/cuda-*/compat/libcuda.so.1 being preferred, that’s the bug.
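
Note that the check has to run inside the container rather than on the host. One way is to exec into the server started in Step 4 (named vllm-server there):

# Inspect the linker cache of the running vLLM container
podman exec vllm-server bash -c 'ldconfig -p | grep libcuda.so.1'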


Final Words

This setup now runs GLM-4.7-Flash reliably on Blackwell GPUs under Fedora Silverblue with Podman and vLLM.

Happy inferencing 🚀


If you need a sparring partner for secure AI, contact Jari!


Jari Huilla, CTO & Partner at Kipinä

Jari Huilla is Kipinä's new CTO with an exceptionally long and diverse background in technology, having landed his first job at Nokia Research Center at the age of 15. Over the years, he has worked as a developer, a leader, and a builder of growth companies. At Kipinä, Jari brings together deep technical knowledge and business-oriented thinking. In particular, he is interested in how to make AI solutions not only technically functional but also truly fit for purpose—and in how to understand their limitations rather than ignore them.
