Same Model, Same Quant, Different Answers: Ollama vs LM Studio

Context

My post-creation pipeline runs all its LLM work on gemma4:26b through Ollama (structure extraction, slug romanization, and cover validation — see the gemma4 migration post). I wanted to add a second backend so I could point the same scripts at LM Studio instead of Ollama via a single env var (LLM_BACKEND=ollama|lmstudio).

The code part was easy: a duck-typed backend with generate() and vision(), a make_backend() factory, and MIME detection. The interesting part was the regression check — and the rabbit hole it led me down.

The regression check

I didn’t want “it runs” to be the bar. I wanted “switching backends doesn’t change the answers.” So I ran the cover-validation benchmark three ways, set up to compare transitively:

master + Ollama (baseline)
feature branch + Ollama (proves the refactor changed nothing)
feature branch + LM Studio (the actual question)

All three on gemma4:26b, same gold sample, same seed. Results:

Run	answered_acc	errors
master + Ollama	0.95	0
feature + Ollama	0.95	0
feature + LM Studio	0.80	0

The refactor was clean (1 and 2 identical). But Ollama and LM Studio disagreed — the same model, run two ways, scored 15 points apart. That shouldn’t happen. So I went looking for why.

Dead end #1: “it must be the quant”

My first guess was that LM Studio had quietly loaded a different quantization — maybe the QAT (quantization-aware training) variant. Easy to check: LM Studio’s native REST API (/api/v1/models, not the OpenAI-compat /v1/models) reports the actual file:

google/gemma-4-26b-a4b  ->  Q4_K_M, 4 bpw, 17.99 GB

Ollama’s gemma4:26b is also Q4_K_M, 18 GB. Same quant. Verified, not assumed. Dead end — but an important one, because it killed the easy explanation and forced a real investigation.

Dead end #2: “it must be the vision preprocessing”

The failures all clustered on one book series where the only difference between volumes is a small volume number on the cover. Gemma’s vision encoder normalizes to 896×896 and relies on “pan & scan” tiling to keep detail on big images. My covers are 1050×1500. Plausible story: LM Studio squashes the image and loses the small digit.

I even had a “smoking gun.” I cropped just the vol. 2 region and asked LM Studio to read it:

Full cover (1050×1500): read the volume as 1, 0, 0 — wrong and unstable
Tight crop: read 2, 2, 2 — correct, stable

Case closed, right? I started writing up “LM Studio loses small-glyph detail, fix it with tiling or --image-min-tokens.”

Then it fell apart. When I re-ran the full-cover read with more trials, it came back 2 ten times out of ten. Same bytes, same engine. My “decisive” 1,0,0 result was a fluke — most likely a cold model instance (LM Studio auto-unloads ~2 minutes after the last call; the API literally reports remaining_ttl_seconds) plus a slightly different prompt. I had built a whole theory on an N=3 proxy that didn’t survive N=10.

Lesson, the hard way: an unfaithful proxy is worse than no proxy. It gives you a clean, confident, wrong answer.

What actually fixed the diagnosis

I threw out the proxy and ran the exact benchmark decision call — same prompt, same schema — N=10 on each backend, after a warmup, on the three failing cases. This time it was rock-solid deterministic (0/10 or 10/10, no noise at all — so the earlier “instability” was never real):

Expected title	Cover	Truth	Ollama	LM Studio
…崩れる２ (vol 2)	vol 1	no	no ✓	yes ✗
…崩れる (no number)	vol 2	no	yes ✗	yes ✗
…崩れる２ (vol 2)	vol 2	yes	yes ✓	yes ✓

Read carefully, this says something very different from “bad image preprocessing”:

It isn’t vision. Both models can read the covers. When the expected title has no number, both backends wrongly accept a numbered cover (row 2). That’s a prompt weakness, not a backend gap.
The backend difference is one rule: “the expected title has a volume the cover lacks, so reject it” (row 1). Ollama enforces it 10/10. LM Studio ignores it 10/10 — it matches on the base title and never checks the number.

So the gap is at the decision / instruction-following layer, not the pixels. Tiling and image-resolution settings would have done nothing.

Why would the same model follow instructions differently?

The most likely culprit is how each backend frames the prompt. My Ollama adapter uses /api/generate (a raw prompt string; Ollama applies the model’s chat template and places the image its own way). My LM Studio adapter uses /v1/chat/completions (OpenAI-style message structure, different image-token placement). Same weights, different framing — and on a borderline rule like “check the volume number,” that framing is apparently enough to tip the model.

Decision

For now: I’m staying on Ollama. It scores higher on my dataset and obeys the volume rule, and the migration code keeps LM Studio available if I want it for text-only jobs later. No need to chase a tidier number for a 3-case delta.

The meta-lessons (the real takeaways)

Verify the obvious before theorizing. “Same quant” was an assumption until I read it off the API. Half my dead ends would have been shorter if I’d checked the easy facts first.
Don’t trust a proxy that doesn’t reproduce the real failure. Reproduce the actual call (same prompt, same schema), not a convenient stand-in.
N=3 is not a measurement. Warm the model, run N=10, and look at whether results are deterministic before drawing conclusions. Local backends auto- unload and have ugly first-request behavior.
The error message is data. The reason field showing the model misreading the title (生まれる vs 崩れる) vs reading it fine but matching wrong is exactly what tells you whether you’re debugging vision or logic.
“Same model” is a lie at the systems level. Quant, chat template, prompt framing, sampling defaults, and image preprocessing all live outside the weights. Two runtimes serving identical weights are not the same system.

Drafted with Claude (claude-opus-4-8) based on a debugging session; reviewed and edited by me.

CatG Homepage