Running local AI models on Linux: GLM-4.7-Flash with vLLM on Fedora Silverblue (RTX PRO 6000 Blackwell)
In an ideal world without any bad actors, it would be possible to rely on closed AI models running on someone else's servers for absolutely everything. In the real world, however, we need to weigh risks related to information security, geopolitics, biases, service availability and cost predictability. There are multiple levels of control and ownership to choose from, but short of manufacturing your own hardware and training your own models from the ground up, the reasonable "extreme" for gaining control is running and fine-tuning open-weight models on hardware that you own. To dig deeper into that option and research both the theoretical and practical sides of running models locally, I recently built a beefy AI workstation for Kipinä. In this post, I'll share some hardware and software decisions, as well as the hurdles I faced with my specific setup.
This post documents the exact changes required to run one of the more capable local coding models, zai-org/GLM-4.7-Flash, with vLLM on Fedora Silverblue using Podman and an NVIDIA RTX PRO 6000 Blackwell GPU.
If you're interested in running containerized GPU workloads on NVIDIA hardware, and especially if you're seeing CUDA Error 803 even though nvidia-smi works, this guide is for you.
Hardware specifications
As my laptop is a MacBook Pro with Apple Silicon, I wanted to also have NVIDIA-based hardware so that I can test local models on both leading platforms. I also want to enable shared usage for Kipinä experts and run non-AI workloads on the machine, so rather than going the DGX Spark route, I decided on a workstation build with the beefiest workstation GPU currently available for AI use: the NVIDIA RTX PRO 6000 Blackwell Workstation with 96 GB of VRAM.
The GPU and RAM prices being what they are, I wanted the rest of the system not to bottleneck the GPU without going totally overboard on CPU and motherboard cost. I found out that 192 GB of RAM is achievable in a stable manner on AM5, so rather than going with Threadripper, I settled on the top-end Ryzen 9 9950X CPU and an X870E-CREATOR motherboard. Full specs are below:
GPU: NVIDIA RTX PRO 6000 Blackwell Workstation (96 GB VRAM)
CPU: AMD Ryzen 9 9950X
RAM: 192 GB (4 x G.Skill Flare X5 DDR5-5600 48 GB)
Motherboard: ASUS PROART X870E-CREATOR WIFI ATX
PSU: Corsair HX1500i SHIFT
Storage: 4 TB + 2 TB Samsung 990 PRO M.2 NVMe SSDs
Software specifications
Since I want other Kipinä people to be able to run GPU workloads on the machine as well, I wanted to try out an atomic and immutable Linux distribution to keep the host system clean while allowing for containerized workloads and Toolbx-style development tool management.
This was my first time setting up CUDA and the NVIDIA Container Toolkit, and I had some issues figuring out which drivers I need on the host and how I should point the containers to the host driver libraries. Below are the driver and framework versions that gave me the first working stack.
OS: Fedora Silverblue (rpm-ostree, immutable base)
Container runtime: Podman (rootless)
Driver: NVIDIA 580.119.02
CUDA: 13.0
Primary framework: PyTorch 2.9.1 + CUDA 13.0
Inference stack: vLLM (cu130 nightly)
Models tested: GLM-4.7-Flash (MoE, long-context)
Containers: NVIDIA CUDA images + vLLM OpenAI server
Key runtime detail: must prefer host NVIDIA driver libs (LD_LIBRARY_PATH=/usr/lib64) to avoid CUDA Error 803
Typical workflow: GPU workloads run in containers; HF models cached on host and bind-mounted
TL;DR (The Fix)
Before going into the long explanation, if you’ve actually faced CUDA Error 803 and you just want it working, here’s what you need:
vllm/vllm-openai:cu130-nightly
Latest transformers from Git
NVIDIA CDI hook enabled in Podman
LD_LIBRARY_PATH=/usr/lib64 inside the container
That’s it. However, if you haven’t gotten that far yet or if you need additional details, continue reading for the whole setup guide.
NVIDIA driver installation on the Fedora Silverblue host
Even installing the NVIDIA drivers has its own hassles, especially with Secure Boot and LUKS enabled. Luckily, Comprehensive-Wall28 has an excellent guide for installing the drivers on Fedora desktops. I encourage you to read it and follow it to the letter. However, since the guide has many different branches, here is my rpm-ostree status listing showing which packages I ended up layering for my hardware (a rough sketch of the corresponding install commands follows the listing):
● fedora:fedora/43/x86_64/silverblue
Version: 43.20260211.0 (2026-02-11T01:21:41Z)
BaseCommit: ec73495c511530de7716612290ac0bc0f476ceba73bbce6b2cabadd3d52a5583
GPGSignature: Valid signature by C6E7F081CF80E13146676E88829B606631645531
LayeredPackages: akmod-nvidia akmods nvidia-container-toolkit rpmdevtools vim xorg-x11-drv-nvidia xorg-x11-drv-nvidia-cuda
LocalPackages: akmods-keys-0.0.2-8.fc43.noarch rpmfusion-free-release-43-1.noarch rpmfusion-nonfree-release-43-1.noarch
Initramfs: regenerate
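For reference, here is a minimal sketch of the layering commands that produce the package set above, assuming the RPM Fusion repositories and akmods-keys are already set up as described in the linked guide:

# Layer the NVIDIA driver stack and container tooling on the immutable base
# (package list mirrors the LayeredPackages line above)
rpm-ostree install akmods akmod-nvidia xorg-x11-drv-nvidia \
    xorg-x11-drv-nvidia-cuda nvidia-container-toolkit rpmdevtools vim

# The new deployment only takes effect after a reboot
systemctl reboot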
Issue: containerized workloads hit CUDA Error 803
After following the driver installation guide, I got to a point where everything looked correct, but when I tried to start vLLM in Podman it failed with:
RuntimeError: cudaGetDeviceCount() failed with Error 803
Inside the container:
nvidia-smi works ✅
torch.cuda.device_count() returns 1 ✅
torch.cuda.is_available() returns False ❌
This combination is the key diagnostic signal for figuring out whether you are hitting the exact same problem I did.
Root cause
The container is selecting the wrong libcuda.so.1 at runtime.
Inside the vLLM CUDA 13 image, multiple libcuda.so.1 candidates exist:
/usr/local/cuda-13.0/compat/libcuda.so.1 ❌
/usr/lib64/libcuda.so.1 ✅ (host driver)
PyTorch was resolving the CUDA compat driver library, which does not match the running kernel driver on the host. This mismatch produces Error 803, even though device nodes are present.
nvidia-smi working is not sufficient. PyTorch requires the correct user-space driver libraries.
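If you want to see this for yourself, one quick check is to look at which libcuda.so.1 the Python process actually maps after CUDA initialization. This is just a diagnostic sketch; run it inside the container with the same podman flags as in the validation step below:

# Trigger CUDA init, then print which libcuda.so.1 the process actually mapped
python3 -c "import torch; torch.cuda.is_available(); print(sorted({l.split()[-1] for l in open('/proc/self/maps') if 'libcuda.so' in l}))"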
Putting everything together: step-by-step guide after installing the drivers and nvidia-smi works on the host
Follow these instructions to get a working setup for GLM-4.7-Flash on vLLM in Podman.
Step 1: Build a vLLM Image That Supports GLM-4.7-Flash
Older vLLM releases do not recognize the GLM-4.7-Flash architecture correctly. You must use the cu130 nightly image.
Containerfile
FROM docker.io/vllm/vllm-openai:cu130-nightly

# Needed only to install Transformers from git
RUN apt-get update && \
    apt-get install -y --no-install-recommends git && \
    rm -rf /var/lib/apt/lists/*

# Bleeding-edge Transformers for GLM-4.7 support
RUN python3 -m pip install -U pip && \
    python3 -m pip install -U "git+https://github.com/huggingface/transformers.git"
Build it:
podman build -t vllm-glm47 .
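As a quick sanity check that the Git build of Transformers actually landed in the image (vllm-glm47 is just the tag chosen in the build command above), you can print its version; it should report a development version rather than a stable release:

# Print the Transformers version baked into the freshly built image
podman run --rm vllm-glm47 python3 -c "import transformers; print(transformers.__version__)"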
Step 2: Ensure NVIDIA CDI Is Enabled (Critical on Silverblue)
On Fedora Silverblue, Podman does not automatically activate the NVIDIA CDI hook.
Even if:
nvidia-smi works on the host
/etc/cdi/nvidia.yaml exists
CUDA may still fail inside containers unless CDI is active.
2.1 Verify CDI Spec Exists
ls -l /etc/cdi/nvidia.yaml
If missing, generate it:
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
(Installed via nvidia-container-toolkit-base.)
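To double-check what the generated spec actually exposes, nvidia-ctk can also list the CDI device names it knows about; on a single-GPU machine I would expect to see at least nvidia.com/gpu=0 and nvidia.com/gpu=all:

# List the CDI device names defined by the specs under /etc/cdi
nvidia-ctk cdi list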
2.2 Enable CDI in Podman
Edit:
sudo vim /etc/containers/containers.conf
Ensure:
[engine]
cdi_enabled = true
Reboot (recommended on Silverblue).
2.3 Use the NVIDIA OCI Hook When Running Containers
On Silverblue, explicitly pass:
--hooks-dir=/etc/containers/oci/hooks.d --security-opt=label=disable -e LD_LIBRARY_PATH=/usr/lib64
Without this, you may see:
UserWarning: Can't initialize NVML
torch.cuda.is_available() == False
even though the GPU is present.
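Before involving PyTorch at all, a minimal smoke test with a plain CUDA base image and the same flags helps confirm that the hook and SELinux label settings are wired up. The image tag below is only an example; any recent nvidia/cuda base image should do:

# nvidia-smi inside the container should show the same GPU and driver as on the host
podman run --rm --gpus all \
  --hooks-dir=/etc/containers/oci/hooks.d \
  --security-opt=label=disable \
  docker.io/nvidia/cuda:13.0.1-base-ubuntu24.04 \
  nvidia-smi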
Step 3: Validate CUDA From PyTorch
⚠️ Important: Use the same flags you will use for vLLM.
podman run --rm -i --gpus all \
  --hooks-dir=/etc/containers/oci/hooks.d \
  --security-opt=label=disable \
  -e LD_LIBRARY_PATH=/usr/lib64 \
  --ipc=host \
  vllm-glm47 \
  python3 - <<'PY'
import torch
print(torch.__version__, torch.version.cuda, torch.cuda.is_available(), torch.cuda.device_count())
PY
Expected:
2.9.1+cu130 13.0 True 1
If:
device_count() == 0
or is_available() == False
your CDI or driver injection is not active.
Step 4: Run vLLM
podman run --name vllm-server --replace --gpus all \
  --hooks-dir=/etc/containers/oci/hooks.d \
  --security-opt=label=disable \
  -e LD_LIBRARY_PATH=/usr/lib64 \
  --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm-glm47 \
  serve --model zai-org/GLM-4.7-Flash
You should see:
Resolved architecture: Glm4MoeLiteForCausalLM
and the model will begin loading.
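Once the logs show the server is ready, a quick request against the OpenAI-compatible endpoint confirms end-to-end inference. The model name matches the --model flag above:

# Minimal chat completion against the vLLM OpenAI-compatible API on port 8000
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "zai-org/GLM-4.7-Flash",
        "messages": [{"role": "user", "content": "Say hello in one short sentence."}]
      }'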
Why this is required on Silverblue
Silverblue + rootless Podman differs from Docker in three important ways:
NVIDIA runtime is not automatically injected
CUDA images contain compat driver libraries that may shadow host libs
SELinux can block device access unless labels are disabled
The combination means you must:
Enable CDI
Use the OCI hooks
Prefer host driver libraries
This is not a vLLM bug or a driver bug. It is a container runtime integration detail.
Key diagnostic signals
If you see:
| Symptom | Meaning |
|---|---|
| nvidia-smi works but torch fails | Driver library mismatch |
| Error 803 | Wrong libcuda.so.1 selected |
| NVML warning | CDI/hook not active |
| device_count()==1 but is_available()==False | Compat lib shadowing |
Final Result
After:
Enabling CDI
Using OCI hooks
Forcing host driver libraries
GLM-4.7-Flash runs cleanly on:
Fedora Silverblue
RTX PRO 6000 Blackwell
CUDA 13
Podman (rootless)
vLLM cu130 nightly
Optional: Make the Fix Permanent
To avoid passing the environment variable every time, bake it into the image:
ENV LD_LIBRARY_PATH=/usr/lib64
Rebuild the image, and the runtime flag is no longer needed.
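Concretely, append that line to the end of the Containerfile from Step 1 and rebuild with the same tag, so the podman run commands above keep working as-is:

# Rebuild after adding the ENV line to the Containerfile
podman build -t vllm-glm47 .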
Optional: Debugging Tip
To see what the container is actually loading:
ldconfig -p | grep libcuda.so.1
If you see /usr/local/cuda-*/compat/libcuda.so.1 being preferred, that’s the bug.
Final Words
This setup now runs GLM-4.7-Flash reliably on Blackwell GPUs under Fedora Silverblue with Podman and vLLM.
Happy inferencing 🚀
When you need a sparring partner for secure AI, contact Jari!
Jari Huilla, CTO & partner Kipinä
Jari Huilla is Kipinä's new CTO with an exceptionally long and diverse background in technology, having landed his first job at Nokia Research Center at the age of 15. Over the years, he has worked as a developer, a leader and a builder of growth companies. At Kipinä, Jari brings together deep technical knowledge and business-oriented thinking. In particular, he is interested in how to make AI solutions not only technically functional but also truly fit for purpose, and in how to understand their limitations rather than ignore them.