Turning a GB10 Into an LLM Server: Everything That Broke First

Context

I bought an ASUS GX10 — a clone of NVIDIA’s DGX Spark. The spec sheet sounds great: Grace Blackwell GPU, 128 GB of unified memory, 273 GB/s bandwidth, sm_121 compute capability, 4TB NVMe. I wanted to turn it into a dedicated LLM server on my LAN so that Hermes Agent (running on my XPS 13 laptop, orchestrated by GPT 5.4) could use local models instead of always hitting cloud APIs.

What I expected: install a serving framework, pick a model, expose a port. A weekend project.

What I got: eight days, three AI assistants, five inference frameworks, ten-plus models, at least three full machine crashes, and one non-obvious kernel discovery that unlocked the entire project.

This is Part 1 — the exploration. Part 2 covers building the actual appliance.

The hardware constraint nobody warns you about

The GB10’s sm_121 is “desktop Blackwell.” It is not the same as sm_120 or sm_120a (datacenter Blackwell). It lacks TMEM (Tensor Memory), has only 99 KiB of shared memory per SM (vs 228 KiB on a B200), and — the big one — shares all 128 GB between CPU and GPU in a single unified memory pool.

That last point means OOM is not “your GPU process dies.” It means the Linux OOM killer takes out your SSH session, your tmux, your docker containers, and every user process on the box. I learned this the hard way. Multiple times.
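One cheap defence is to check available memory before launching anything heavy. The following is a minimal sketch, not something from the original sessions: it parses `/proc/meminfo` and refuses to proceed unless `MemAvailable` covers the job plus a reserve. The 16 GiB reserve and the function names are assumptions chosen for illustration.

```python
def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field: value in KiB}."""
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def safe_to_launch(meminfo_text: str, needed_gib: float,
                   reserve_gib: float = 16.0) -> bool:
    """True if MemAvailable covers the job plus a reserve for the OS and SSH.

    reserve_gib is a hypothetical safety margin, not a measured value.
    """
    available_gib = parse_meminfo(meminfo_text)["MemAvailable"] / (1024 ** 2)
    return available_gib >= needed_gib + reserve_gib


def read_meminfo(path: str = "/proc/meminfo") -> str:
    """Read the live file on a Linux host."""
    with open(path) as f:
        return f.read()
```

On the box itself this would be called as `safe_to_launch(read_meminfo(), needed_gib=73)` before a `docker run`. It only guards the moment of launch; a process that grows later can still exhaust the pool.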

Ollama: “performance seems meh”

The project started with a single prompt to Google Antigravity (“agy”), the first of three AI assistants I’d end up using:

“this is a nvidia gb10 machine, and I am trying to figure out what’s the best way to turn this into a LLM server; I have installed ollama, but performance seems meh”

Agy diagnosed three problems: a 262,144-token default context window forcing 8 GB+ of KV cache pre-allocation, 17–20 second first-prompt latency from CUDA graph compilation, and generic aarch64 binaries missing ARM optimizations for the Grace CPU. The third one was interesting but the first one was the killer — Ollama was reserving memory for a quarter-million tokens of context I’d never use.

Fair enough. What else is out there?
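For a sense of why a 262,144-token default is so expensive, the usual back-of-envelope formula for a plain attention KV cache is two tensors (keys and values) per layer, each holding `n_kv_heads * head_dim` elements per token. The sketch below uses a hypothetical model shape (32 layers, 8 KV heads, head dimension 128); these are not the dimensions of any model in this post, and hybrid or sliding-window architectures will need less.

```python
F16_BYTES = 2.0
Q8_0_BYTES = 34 / 32  # q8_0 stores 32 int8 values plus one fp16 scale per block


def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim,
                   bytes_per_elem=F16_BYTES):
    """Bytes needed to hold keys and values for n_tokens of context."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * n_tokens


def gib(n_bytes):
    return n_bytes / 2 ** 30


# Hypothetical shape, for illustration only.
SHAPE = dict(n_layers=32, n_kv_heads=8, head_dim=128)

full_default = gib(kv_cache_bytes(262_144, **SHAPE))
hermes_minimum = gib(kv_cache_bytes(65_536, **SHAPE))
full_default_q8 = gib(kv_cache_bytes(262_144, bytes_per_elem=Q8_0_BYTES, **SHAPE))
```

For this made-up shape the f16 cache comes to 32 GiB at 262,144 tokens and 8 GiB at 65,536 tokens, and a q8_0 cache brings the 262,144-token figure down to 17 GiB. The point is only that the cost scales linearly with the context you reserve, whether or not you ever fill it.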

Everything fails (June 15–17)

SGLang: dead on arrival

Failed to get device capability: SM 12.x requires CUDA >= 12.9

The NVIDIA container had CUDA 12.x; the GB10 needs CUDA 13.0+. A second attempt to build SGLang from source inside a PyTorch 26.05 container triggered pip dependency hell: SGLang force-downgraded PyTorch to 2.9.1 (CUDA 12.4), which can’t even see the GPU. torch.cuda.is_available() returned False.

That second attempt also crashed the machine. Agy was building SGLang inside Docker without monitoring memory, and the compilation ate everything.

vLLM: OOM crash

Early vLLM attempts on gpt-oss-120b crashed the entire machine. vLLM’s default behavior is to claim 90% of GPU memory. On a discrete GPU that’s fine. On UMA, where “GPU memory” is “all your RAM,” that’s a host-wide OOM waiting to happen.
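The knob involved is vLLM's `--gpu-memory-utilization` flag, which defaults to 0.9. A rough sketch of the arithmetic, assuming the fraction is applied to the nominal 128 GB pool (the real usable total is somewhat lower, and the 32 GiB reserve in the usage note is a hypothetical figure):

```python
TOTAL_GIB = 128.0  # nominal unified pool from the spec sheet


def left_for_host(total_gib: float, utilization: float) -> float:
    """GiB left for the OS and everything else once vLLM takes its share."""
    return total_gib * (1.0 - utilization)


def max_utilization(total_gib: float, reserve_gib: float) -> float:
    """Largest fraction that still leaves reserve_gib outside vLLM."""
    if not 0 <= reserve_gib < total_gib:
        raise ValueError("reserve must be between 0 and the total")
    return (total_gib - reserve_gib) / total_gib
```

At the default, `left_for_host(TOTAL_GIB, 0.9)` is about 12.8 GiB for the kernel, page cache, Docker, and SSH combined. Asking for a 32 GiB reserve instead gives `max_utilization(TOTAL_GIB, 32)`, which is 0.75.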

NVIDIA AI Workbench: abandoned

A brief detour into NVIDIA AI Workbench hit systemd service issues and was abandoned within the hour.

The crash count begins

June 16, 09:20 — first crash. Agy had been running unmonitored background tasks that consumed all memory. I had to force power-cycle the machine.

June 16, 11:27 — second crash. SGLang recompilation, same cause, same remedy.

“so… whatever you did caused the host to run out of memory and I had to force power cycle the machine”

At this point the Python AI ecosystem had zero working paths on sm_121. No precompiled wheels existed for CUDA 13.2. Every pip install was either a dependency conflict or an OOM event.

llama.cpp: the steady baseline (June 16–17)

With the Python ecosystem proving hostile, we fell back to llama.cpp compiled locally with sm_121 support. Agy first wrote a custom benchmark harness (autoserving/) that mutated serving parameters and measured TPS, then discovered that llama-bench, the tool that ships with llama.cpp, already does this better. The hand-rolled harness was retired as a grid-search tool, but the autoserving scaffolding survived: agy later repurposed it as a perplexity test runner (run_perplexity.py), reusing the same Docker container and dataset setup. Not a bad outcome for a tool that got replaced on day one.

The actual grid search ran through llama-bench testing combinations of batch size (512/1024/2048), thread count (10/20), and KV cache quantization (f16/q8_0).
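The grid itself is small: 3 batch sizes, 2 thread counts, and 2 KV cache types give 12 runs per model. A minimal sketch of enumerating it, assuming one llama-bench invocation per combination; the model path is a placeholder and this is not agy's actual script. The flags used (`-m`, `-b`, `-t`, `-ctk`, `-ctv`, `-o`) are llama-bench's own. Depending on the build, a quantized V cache may also need flash attention enabled.

```python
from itertools import product

BATCH_SIZES = (512, 1024, 2048)
THREADS = (10, 20)
KV_TYPES = ("f16", "q8_0")


def bench_commands(model_path: str, binary: str = "llama-bench") -> list:
    """Build one llama-bench argv per (batch, threads, kv type) combination."""
    commands = []
    for batch, threads, kv in product(BATCH_SIZES, THREADS, KV_TYPES):
        commands.append([
            binary,
            "-m", model_path,
            "-b", str(batch),
            "-t", str(threads),
            "-ctk", kv,   # K cache type
            "-ctv", kv,   # V cache type, kept equal to K here
            "-o", "json",
        ])
    return commands


# Placeholder path; substitute a real GGUF file.
COMMANDS = bench_commands("models/example.gguf")
```

Each entry can be passed to `subprocess.run(cmd, capture_output=True, text=True)` and the JSON parsed from stdout. llama-bench also accepts comma-separated value lists per flag, which would sweep the same grid in a single invocation.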

The results came back remarkably stable:

| Model | Quant | Token gen (t/s) |
|---|---|---|
| gpt-oss-120b (MoE, 120B) | Q4_K_M | 58.02 |
| gpt-oss-120b (MoE, 120B) | MXFP4 | 51.20 |
| Qwen3.5-122B-A10B (MoE, 122B) | Q4_K_M | 21.86 |
| Gemma-4-31B (Dense, 31B) | Q4_K_M | 10.41 |

Some findings worth noting:

Q4_K_M beats MXFP4. Despite MXFP4 being NVIDIA’s promoted format for Blackwell, it came out about 12% slower on llama.cpp (51.20 vs 58.02 t/s; put the other way, Q4_K_M was about 13% faster). The reason: llama.cpp’s GGUF runtime can’t exploit the dedicated MXFP4 tensor cores. Only TRT-LLM’s ahead-of-time compiled engines can. On llama.cpp the “native” format is slower than the community standard.

The q8_0 KV cache is free. Halving KV cache memory (q8_0 vs f16) cost essentially zero throughput: 58.02 vs 58.91 TPS. Agy called this “The q8_0 KV Cache Secret,” and it was one of the few results I never had to revise.

10 threads, not 20. The Grace CPU has 10 Cortex-X925 performance cores and 10 Cortex-A725 efficiency cores. Using all 20 introduced scheduling conflicts that degraded performance. Small but consistent.

Dense models pay a steep tax. Gemma-4-31B at 10.41 TPS vs gpt-oss-120b at 58.02 TPS — a model 4x smaller running 6x slower. MoE models only activate a fraction of their parameters per token, so effective bandwidth demand is much lower. On memory-bandwidth-bound hardware this is the dominant variable.
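The bandwidth argument can be made concrete with a crude ceiling: if every generated token has to stream all active weights through memory once, tokens per second cannot exceed bandwidth divided by bytes read per token. The sketch below assumes a round 4.5 bits per weight for a 4-bit quant (real Q4_K_M files average somewhat more) and takes the 10B active-parameter count for the Qwen model from its "A10B" name. It ignores KV cache reads, attention, and routing overhead, so measured numbers should land well under it.

```python
BANDWIDTH_GB_S = 273.0       # from the spec sheet
ASSUMED_BITS_PER_WEIGHT = 4.5  # rough figure for a 4-bit quant, an assumption


def ceiling_tps(active_params_billions: float,
                bits_per_weight: float = ASSUMED_BITS_PER_WEIGHT,
                bandwidth_gb_s: float = BANDWIDTH_GB_S) -> float:
    """Upper bound on single-stream tokens/s if weights are read once per token."""
    gb_per_token = active_params_billions * bits_per_weight / 8.0
    return bandwidth_gb_s / gb_per_token


dense_31b = ceiling_tps(31.0)   # Gemma-4-31B: every parameter is active
moe_a10b = ceiling_tps(10.0)    # Qwen3.5-122B-A10B: about 10B active per token
```

Under these assumptions the dense 31B model tops out near 15.7 t/s and the 10B-active MoE near 48.5 t/s, against measured 10.41 and 21.86. The absolute ceilings are loose, but the roughly 3x gap between them comes purely from active parameter count, which is the pattern the benchmark shows.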

Speculative decoding: dead end

Every draft+target combination made things worse:

| Target | Draft | TPS with draft | Baseline TPS |
|---|---|---|---|
| gpt-oss 120B | gpt-oss 20B | 51.60 | 58.02 |
| Qwen 122B | Qwen 4B | 21.35 | 21.86 |
| Gemma 31B | Gemma E2B/E4B | 10.22 | 10.41 |

Two models fighting over the same memory bus is worse than one model at full bandwidth. On discrete GPUs with separate VRAM this story might be different. On UMA it’s a dead end.
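To put the table in relative terms, a few lines of arithmetic over the measured numbers (nothing here is new data; the dictionary keys are just labels):

```python
# (TPS with draft model, baseline TPS) from the table above
RESULTS = {
    "gpt-oss 120B + gpt-oss 20B": (51.60, 58.02),
    "Qwen 122B + Qwen 4B": (21.35, 21.86),
    "Gemma 31B + Gemma E2B/E4B": (10.22, 10.41),
}


def change_pct(tps: float, baseline: float) -> float:
    """Relative change versus baseline, in percent (negative means slower)."""
    return (tps - baseline) / baseline * 100.0


CHANGES = {name: round(change_pct(*pair), 1) for name, pair in RESULTS.items()}
```

That works out to about -11.1%, -2.3%, and -1.8%. The largest draft model (20B) costs the most, which is consistent with the bandwidth-contention explanation.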

First Hermes connection

We got Qwen 3.6-35B served via llama-server on port 11305 and I connected Hermes Agent from my laptop:

“ok I got hermes launching 3 parallel agents talking to this current endpoint”

But Hermes immediately complained: “Model … has a context window of 2,048 tokens, which is below the minimum 64,000 required by Hermes Agent.” Every subsequent model configuration would need to clear that 64K bar.

TRT-LLM: one winner, everything else broken (June 17–18)

gpt-oss-120b on TRT-LLM 1.3.0rc18 worked beautifully. Under load testing it scaled to 980–992 TPS at 512 concurrent requests, and it handled 131K+ tokens of context. Its MXFP4 format natively triggers TRT-LLM’s fused multi-head attention kernel, bypassing the code paths that break on sm_121.

Everything else failed.

Qwen3.5-122B-NVFP4 — six crash attempts:

  1. Default run: KV cache OOM (auto-inferred 262K max_seq_len)
  2. Constrained bounds (32K, 0.8 fraction): AutoTuner warmup crash
  3. Disabled AutoTuner: host OOM (100 GB+ mmap spike killed SSH)
  4. Lowered fraction to 0.5: fused_moe workspace deadlock at 300s
  5. Bare minimum (4096 context, 0.2 fraction): still deadlocked
  6. Patched hang detector to 3600s: container never recovered

“I don’t think your thing is running, should we give up now?”

Attempt #3 caused another full machine reboot. That was the third OOM crash in three days.

Other TRT-LLM failures (tested again later on June 21 with Claude’s help): Coder-Next AWQ rejected by the dynamic loader (only NVFP4/FP8 supported), Qwen 35B NVFP4 hit a QuantAlgo enum mismatch plus a weight shape error, and Coder-Next FP8 OOMed on KV cache allocation because TRT-LLM’s internal overhead used 2x expected memory on UMA. All confirmed as known upstream bugs.

Definitive conclusion: TRT-LLM on GB10 only works for gpt-oss-120b (MXFP4). Everything else needs a different stack.

The vLLM breakthrough (June 19)

This was the pivotal day. Claude Code had entered the project by now — I had two AI assistants running in tmux side by side, with Claude babysitting agy’s permission prompts via a 3-minute polling loop while I stepped out for dinner.

Agy ran five models through vLLM systematically, and the first one just… worked.

| # | Model | Type | Result |
|---|---|---|---|
| 1 | Qwen3.6-27B-FP8 | Dense, FP8 | SUCCESS |
| 2 | Gemma-4-26B-A4B-NVFP4 | MoE, NVFP4 | SUCCESS (after fix) |
| 3 | Qwen3.6-35B-A3B-NVFP4 | MoE, NVFP4 | SUCCESS (after fix) |
| 4 | Qwen3.5-122B-A10B-NVFP4 | MoE, NVFP4 | SUCCESS |
| 5 | Mistral-Medium-3.5-128B-NVFP4 | Dense, NVFP4 | SUCCESS |

Test 4 was the Qwen 122B that had crashed TRT-LLM six times. Via vLLM with conservative memory settings it loaded at ~73 GiB and served responses in ~8 seconds. No crash, no deadlock, no drama.

The recipe

Two mandatory settings, discovered through iterative failure:

# FlashInfer has no sm_121 cubins
--attention-backend TRITON_ATTN

# CUTLASS NVFP4 uses TMEM/SMEM that sm_121 doesn't have
VLLM_NVFP4_GEMM_BACKEND=marlin

The Marlin setting is the key insight of the entire project. The GB10 is bandwidth-bottlenecked, not compute-bottlenecked. Marlin W4A16 reads smaller 4-bit weights from memory and dequantizes them to FP16 in GPU registers. This directly exploits the architecture’s constraint — less data through the 273 GB/s pipe. The native NVFP4 kernel tries to use TMEM and wide SMEM that desktop Blackwell doesn’t have. The “fallback” kernel outperforms the “native” one.

A performance inversion. The kind of thing you’d never guess and can’t find in any documentation. Agy discovered it by trial and error after Test 2 produced garbage output, then worked after setting the Marlin backend.

The env var learning curve

Agy initially set three environment variables based on NVIDIA’s docs: VLLM_USE_FLASHINFER_MOE_FP4=0, VLLM_NVFP4_GEMM_BACKEND=marlin, and VLLM_TEST_FORCE_FP8_MARLIN=1. Through crashes it emerged that the first one triggers a ValueError on Qwen models and the third crashes on merged linear layers. The safe minimal config is just VLLM_NVFP4_GEMM_BACKEND=marlin.

Docker image for all of this: nvcr.io/nvidia/vllm:26.05.post1-py3.
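Put together, a launch might look roughly like the command this sketch assembles. The environment variable, the `--attention-backend` value, and the image tag come from the text above; `--gpu-memory-utilization`, `--max-model-len`, and `--port` are standard vLLM flags. Everything else is an assumption: the container-internal port 8000, the `vllm serve` entrypoint being on the image's PATH, and the example values in the usage note. Volume mounts for the model cache are omitted.

```python
IMAGE = "nvcr.io/nvidia/vllm:26.05.post1-py3"


def vllm_docker_argv(model: str, host_port: int,
                     gpu_memory_utilization: float,
                     max_model_len: int, image: str = IMAGE) -> list:
    """Assemble a `docker run` argv applying the two sm_121 workarounds."""
    return [
        "docker", "run", "--rm", "--gpus", "all", "--ipc=host",
        "-p", f"{host_port}:8000",
        # CUTLASS NVFP4 path needs hardware sm_121 lacks; use Marlin instead.
        "-e", "VLLM_NVFP4_GEMM_BACKEND=marlin",
        image,
        "vllm", "serve", model,
        # FlashInfer ships no sm_121 cubins; fall back to Triton attention.
        "--attention-backend", "TRITON_ATTN",
        "--gpu-memory-utilization", str(gpu_memory_utilization),
        "--max-model-len", str(max_model_len),
        "--port", "8000",
    ]
```

For example, `vllm_docker_argv("<org>/<model>", 8000, 0.5, 65536)` yields a list that can be handed to `subprocess.run`. The 0.5 fraction and 65,536-token limit are illustrative values, not the settings used in these tests.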

The Llama 70B dead end and the pivot to Coder-Next

Test 6 was nvidia/Llama-3.3-70B-Instruct-NVFP4. It loaded, it served, and it produced pure gibberish.

The root cause turned out to be a distinction I hadn’t seen documented anywhere: modelopt NVFP4 (from NVIDIA’s TensorRT Model Optimizer) and compressed-tensors NVFP4 (from llm-compressor/NeuralMagic) look identical from the outside — same format name, same file structure — but use different quantization toolchains internally. The Mistral/Qwen models that worked used compressed-tensors. The Llama checkpoint used modelopt. On sm_121, modelopt NVFP4 produces garbage through the Marlin dequantization path.

This is the kind of failure that wastes a day: the model loads, runs without errors, and generates confident nonsense. You only catch it by reading the output.
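A crude way to catch this earlier is a known-answer smoke test run right after a model comes up. The sketch below is an illustration rather than anything used in the project: `ask` is a hypothetical callable the reader supplies (for instance a thin wrapper around the server's chat completions endpoint) that takes a prompt and returns the reply text. The prompts and expected substrings are arbitrary.

```python
SMOKE_PROMPTS = [
    ("What is 2 + 2? Answer with a single digit.", "4"),
    ("What is the capital of France? Answer in one word.", "paris"),
]


def passes_smoke_test(ask, prompts=SMOKE_PROMPTS) -> bool:
    """Return True only if every reply contains its expected substring.

    `ask` is any callable mapping a prompt string to the model's reply text.
    """
    for prompt, expected in prompts:
        reply = ask(prompt)
        if expected not in reply.lower():
            return False
    return True
```

This will not prove a model is healthy, and gibberish can contain a "4" by chance, but a checkpoint that emits pure noise is very unlikely to pass several such prompts in a row.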

The pivot

The real turning point was finding bullpoint/Qwen3-Coder-Next-AWQ-4bit:

  • 80B total, only 10B active (512 experts, Gated DeltaNet hybrid attention)
  • AWQ INT4 — a completely different, battle-tested Marlin INT4 kernel path, no NVFP4 involved at all
  • ~40 GB weights, fitting easily in 128 GB UMA
  • 256K native context window

It passed on the first try: sub-1-second responses, clean tool calling, 54 tok/s at 4K input. The GDN layers use Triton-based Flash Linear Attention ops (not CUDA-specific), so sm_121 compatibility was never in question.

The --generation-config vllm discovery

During Hermes Agent benchmarking, Qwen 35B and 27B both appeared to completely fail tool calling — no tool_calls in the response, just text. I was ready to mark both models as incompatible.

Claude investigated and found the cause: without --generation-config vllm, Qwen3 models default to “thinking mode.” They wrap every response in <think> blocks, and the XML tool call parser can’t find the tool calls inside the thinking wrapper.

Adding one flag to all serve scripts fixed the problem. What I’d been treating as a fundamental model limitation was a one-flag configuration issue.
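When tool calling appears broken, it helps to look at what actually came back before blaming the model. A small sketch, assuming the server returns the OpenAI-compatible chat completions shape (`choices[0].message` with optional `tool_calls` and `content`); the three labels are made up for this example:

```python
def classify_reply(response: dict) -> str:
    """Label a chat completion as a tool call, thinking-wrapped text, or plain text."""
    message = response["choices"][0]["message"]
    if message.get("tool_calls"):
        return "tool_call"
    content = message.get("content") or ""
    if "<think>" in content:
        # The model answered in thinking mode; any tool call text is buried
        # inside the reasoning block where the parser did not find it.
        return "thinking_text"
    return "plain_text"
```

A run of `thinking_text` results on prompts that should trigger tools points at serving configuration rather than at the model.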

The Hermes bake-off (June 21)

GPT 5.4 orchestrated comprehensive benchmarks through Hermes — exact obedience, tool calling, spec analysis, code review, cross-file bug detection, long-context retrieval — across five models:

| Rank | Model | Speed | Tool Calling | Key Strength |
|---|---|---|---|---|
| 1 | Coder-Next AWQ (80B MoE, 10B active) | 54 tok/s | PASS | Overall best |
| 2 | Mistral 128B NVFP4 (128B dense) | 3 tok/s | PASS | Found subtle cross-file bug |
| 3 | Qwen 35B NVFP4 (35B MoE, 3B active) | 30 tok/s | PARTIAL | Best speed/quality ratio |
| 4 | Gemma 26B NVFP4 (26B MoE, 4B active) | 30 tok/s | PARTIAL | Best instruction obedience |
| 5 | Qwen 27B FP8 (27B dense) | 8 tok/s | FAIL | Not recommended |

Note the pattern: MoE models (3B to 10B active) run at 30–54 tok/s, while dense models (27B, 128B) crawl at 3–8 tok/s. On bandwidth-bound hardware, MoE isn’t a tradeoff; it’s the only viable architecture for interactive use.

Perplexity

I insisted on large-context-only measurements (“small context PPL numbers are scientifically uninteresting”):

| Model | PPL (65K context) |
|---|---|
| Qwen3.6-27B-FP8 (dense) | 5.31 |
| Qwen3.6-35B NVFP4 (MoE) | 5.46 |
| Coder-Next AWQ (MoE) | 6.12 |
| Gemma-4-26B NVFP4 | 9117.23 |

Gemma’s catastrophic score was investigated in depth. Two root causes: vLLM’s BOS token detection checks for model_type "gemma" but the model reports "gemma4", so the attention-sink BOS token never gets prepended; and 25/30 layers use 1024-token sliding window attention, so feeding a single 65K-token prompt overwhelms the architecture entirely. The NVFP4 quantization was not the issue — chat completions worked fine. The correct approach would be ~1024-token overlapping chunks with manual BOS prepend, matching what EleutherAI/lm-evaluation-harness does when configured properly. We diagnosed it, wrote it up (gemma4_ppl_investigation.md), and never went back to re-measure. One for the backlog.
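The chunked approach described above can be sketched as follows. This is a generic sliding-window perplexity layout, not the contents of gemma4_ppl_investigation.md or the harness's actual code: each window is at most 1024 tokens including a manually prepended BOS, windows overlap, and only the tokens not already scored by the previous window count toward the result. The stride of 512 and the function names are assumptions.

```python
import math


def make_windows(token_ids, bos_id, window=1024, stride=512):
    """Split a long token sequence into overlapping BOS-prefixed windows.

    Returns a list of (input_ids, n_scored): the last n_scored tokens of each
    window are the ones whose log-probabilities should be counted.
    """
    body = window - 1  # leave one slot for the BOS token
    if not 0 < stride <= body:
        raise ValueError("stride must be between 1 and window - 1")
    windows = []
    prev_end = 0
    for begin in range(0, len(token_ids), stride):
        end = min(begin + body, len(token_ids))
        windows.append(([bos_id] + list(token_ids[begin:end]), end - prev_end))
        prev_end = end
        if end == len(token_ids):
            break
    return windows


def perplexity(token_logprobs):
    """exp of the negative mean natural-log probability over scored tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))
```

Every token is scored exactly once, and every token after the first window is scored with up to 511 tokens of preceding context, which keeps each request inside the sliding-window span instead of feeding one 65K-token prompt.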

FP8 vs NVFP4/AWQ

Downloaded FP8 variants of the top models for comparison:

| Model | AWQ/NVFP4 | FP8 | FP8 memory vs 4-bit |
|---|---|---|---|
| Coder-Next | PPL 6.12, ~40 GB | PPL 6.08, ~75 GB | 1.9x |
| Qwen 35B | PPL 5.46, ~20 GB | PPL 5.35, ~40 GB | 2.0x |

FP8 offers a 0.04–0.11 PPL improvement at roughly 2x the memory cost. On a 128 GB UMA system where every gigabyte counts, that’s an easy call.

The architecture decision

At this point I had a clear picture: Coder-Next AWQ was the best model, Qwen 35B NVFP4 was a good fast secondary, and everything else was either too slow (Mistral 128B) or broken (Qwen 27B).

The question was how to serve multiple models for Hermes — different tasks want different models. Claude investigated vLLM’s capabilities:

  1. No in-process model swap. One vLLM server = one model, fixed at launch.
  2. Sleep mode useless on UMA. Level 1 offloads device-to-host, but on GB10 they’re the same 128 GB pool. Frees nothing. Level 2 discards weights, but wake = reload from disk, which costs the same as a restart.

My key constraint dissolved the problem: I didn’t want the box listening for swap triggers. If both models Hermes routinely needs are already running on their own ports, there’s nothing to trigger. Routing becomes purely client-side.

The design: a fixed two-lane vLLM appliance. Lane A (Coder-Next AWQ, port 10101, 45% of memory) for trusted work — tool calls, code, review. Lane B (Qwen 35B NVFP4, port 10102, 35% of memory) for fast read-only tasks — triage, summarization. Cloud frontier models for escalation when neither local model is sufficient.
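Client-side routing for such a setup can be as small as a lookup table. A minimal sketch, assuming OpenAI-compatible endpoints under `/v1` on the two ports named above; the hostname `gx10.local` and the task labels are hypothetical, and returning `None` stands in for "escalate to a cloud model":

```python
HOST = "gx10.local"  # hypothetical LAN hostname

LANE_PORTS = {"A": 10101, "B": 10102}

TASK_TO_LANE = {
    "tool_call": "A",
    "code": "A",
    "review": "A",
    "triage": "B",
    "summarization": "B",
}


def endpoint_for(task: str, host: str = HOST):
    """Return the base URL for a task, or None if it should go to the cloud."""
    lane = TASK_TO_LANE.get(task)
    if lane is None:
        return None
    return f"http://{host}:{LANE_PORTS[lane]}/v1"
```

Because both lanes are always up, the client never has to ask the box to do anything other than answer requests.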

Both always up, both launched sequentially (simultaneous launch causes UMA contention), combined ~114 GB of 128 GB. No orchestrator, no proxy, no swap logic. Two docker run commands and a systemd dependency. A decision that dissolves a problem rather than solving it.
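The sequential-launch rule amounts to "do not start lane B until lane A reports healthy". The post handles this with a systemd dependency; the sketch below shows the same ordering as a plain loop instead, with `start` and `is_healthy` as hypothetical callables the reader supplies (for example a `docker run` wrapper and an HTTP probe of vLLM's `/health` endpoint). The timeout and polling interval are arbitrary.

```python
import time


def launch_sequentially(lanes, start, is_healthy,
                        timeout_s=1800.0, poll_s=5.0,
                        sleep=time.sleep, clock=time.monotonic):
    """Start each lane in order, waiting for it to be healthy before the next.

    start(lane) kicks off a lane; is_healthy(lane) returns True once it serves.
    Raises TimeoutError if a lane does not come up within timeout_s seconds.
    """
    for lane in lanes:
        start(lane)
        deadline = clock() + timeout_s
        while not is_healthy(lane):
            if clock() > deadline:
                raise TimeoutError(f"lane {lane!r} did not become healthy")
            sleep(poll_s)
```

Injecting `sleep` and `clock` keeps the loop testable without waiting on a real model load.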

Building the appliance is Part 2.

The complete dead-end list

| What | Why It Failed |
|---|---|
| Ollama | 262K default context, generic binaries, slow CUDA graph warmup |
| SGLang | sm_121 needs CUDA ≥ 12.9; pip dependency hell |
| NVIDIA AI Workbench | systemd issues, abandoned |
| vLLM via pip | no precompiled wheels for CUDA 13.2 + sm_121 |
| TRT-LLM (everything except gpt-oss-120b) | format incompatibilities, UMA memory bugs |
| Speculative decoding | bandwidth contention on single UMA pool |
| FP8 over AWQ/NVFP4 | 2x memory for ~0.1 PPL gain |
| modelopt NVFP4 (dense Llama) | gibberish on sm_121 via Marlin |
| vLLM sleep mode | host and device share the same memory pool |
| Dense models as primary | 4–10x slower than MoE equivalents |

Meta-lessons

  1. sm_121 is not sm_120a. Desktop Blackwell and datacenter Blackwell look the same from the outside. They are not. TMEM, SMEM size, and UMA vs discrete — these matter for every kernel path in every framework.

  2. The “fallback” kernel can win. Marlin W4A16 outperforms native NVFP4 because reading smaller weights and dequantizing in registers exploits the bandwidth bottleneck directly. Performance inversions are real and they’re not in any documentation.

  3. “Same quant format” is a lie. modelopt NVFP4 and compressed-tensors NVFP4 look identical. One works, one produces gibberish. You can’t tell without running inference and reading the output.

  4. One flag can change everything. --generation-config vllm turned two “broken” models into working ones. Always test configuration before blaming the model.

  5. UMA makes offloading meaningless. Sleep mode, swap space, device-to-host transfer — all useless when host and device are the same pool. The only way to free real memory is to kill the process.

  6. OOM on UMA kills the machine, not just the process. Budget memory conservatively and launch containers sequentially. There is no graceful degradation.

  7. MoE is the architecture for bandwidth-bound hardware. 10B active parameters at 54 tok/s beats 27B dense at 8 tok/s. This isn’t a tradeoff; it’s the physics.

  8. The 273 GB/s ceiling is real. llama.cpp, vLLM, and TRT-LLM single-stream all converge at ~58 TPS for gpt-oss-120b. The only escape is batched throughput or buying different hardware.


Research, benchmarks, and serving scripts by Google Antigravity (“agy”) across 25 sessions over 8 days. Architecture, perplexity analysis, and the appliance design by Claude (claude-sonnet-4-6). Model rankings by Hermes Agent (GPT 5.4 orchestrator). The human did the rebooting, the tmux wrangling, and the repeated conversations about “what are you even trying to do right now.”
