Migrating honto Extraction from gemma3 to gemma4

Context

This blog has a small “post creation pipeline” that turns a honto.jp URL into a ready-to-edit Hugo post:

  • scripts/book_review_helper.py extracts title / creators / publisher / label (and some light heuristics like web_origin)
  • scripts/create_review.py orchestrates Hugo post creation
  • scripts/upload_cover.py optionally validates + uploads a cover image
  • scripts/validation/* provides repeatable evaluation on my existing posts and a small cover dataset

When I first built this, gemma3:12b was the sweet spot for fast + consistent JSON extraction. In April 2026, Google released the Gemma 4 family, and I wanted to see if I could migrate the extraction part of the workflow to gemma4:26b (and potentially converge more steps on a single model).

This post is a record of what changed, what broke, and what ended up working.

The Problem: “JSON is easy” (until it isn’t)

The extraction scripts rely on structured output. If the model returns anything that isn’t parseable JSON, the whole flow becomes flaky and annoying to operate.

In early tests, Gemma 4 would sometimes:

  • emit extra tokens around JSON (code fences, chat-template markers, partial objects)
  • truncate output mid-JSON (especially when the input included a lot of page text)
  • “yolo” a best-effort response when the prompt and the schema didn’t line up cleanly

For cover validation (vision), I already had a robust pattern: start with a fast path and fall back only when needed. I wanted the same reliability for structure extraction.

The Key Change: Ollama structured outputs (schema format)

The big migration step was to stop relying on “please output JSON” prompt-only discipline and instead use Ollama structured outputs:

  • send a JSON Schema in the request format field
  • keep the schema minimal (fewer keys = fewer failure modes)

In practice, “strict schema + small output” is what made Gemma 4 behave.
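As a concrete sketch, here is roughly what such a request looks like using only the standard library. The two-field schema is a toy for illustration, and the helper names (build_payload, extract) are mine, not the repo’s:

```python
# Sketch: calling Ollama's /api/generate with a JSON Schema in the
# `format` field, so decoding is constrained to schema-valid JSON.
import json
import urllib.request

DEMO_SCHEMA = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "author": {"type": "string"},
    },
    "required": ["title", "author"],
}

def build_payload(model: str, prompt: str, schema: dict, num_predict: int = 192) -> dict:
    """Assemble a /api/generate request with structured output enforced."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "format": schema,  # JSON Schema here, not just "json"
        "options": {"temperature": 0, "num_predict": num_predict},
    }

def extract(api_url: str, payload: dict) -> dict:
    req = urllib.request.Request(
        api_url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=240) as resp:
        body = json.loads(resp.read())
    # With `format` set, body["response"] should itself be parseable JSON.
    return json.loads(body["response"])
```

The point is that the schema lives in the request, not in the prompt text, so “please output JSON” discipline stops being load-bearing.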

tier1_structured vs tier1_structured_bool

In the cover benchmark scripts there are two schema variants for the same Tier 1 “structured output” method:

  • tier1_structured: the schema returns match as an enum string ("yes" | "no" | "unsure").
  • tier1_structured_bool: the schema returns match as a tri-state boolean (true | false | null).

Empirically, Gemma 4 is more consistent with the boolean schema (fewer cases where the model reasons “yes” but emits "no"), so the pipeline defaults to tier1_structured_bool for gemma4:* models.
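Sketched as plain JSON Schema dicts (the exact schemas in the repo may differ slightly, and the prefix-based selection is illustrative):

```python
# tier1_structured: match as an enum string.
SCHEMA_ENUM = {
    "type": "object",
    "properties": {"match": {"type": "string", "enum": ["yes", "no", "unsure"]}},
    "required": ["match"],
}

# tier1_structured_bool: match as a tri-state boolean (true/false/null).
SCHEMA_BOOL = {
    "type": "object",
    "properties": {"match": {"type": ["boolean", "null"]}},
    "required": ["match"],
}

def pick_schema(model: str) -> dict:
    """Default to the tri-state boolean schema for gemma4:* models."""
    return SCHEMA_BOOL if model.startswith("gemma4:") else SCHEMA_ENUM
```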

Minimal extraction schema

For novels, the schema is basically:

  • title (string)
  • author (string)
  • illustrator (string|null)
  • publisher (string|null)
  • label (string|null)

Manga adds original_author where relevant.

Crucially: tags are not part of the schema anymore.
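Written out as JSON Schema (a sketch following the field list above; `"type": ["string", "null"]` is how string|null is expressed):

```python
# Minimal novel schema: every key required, nullable where the page
# may genuinely lack the information.
NOVEL_SCHEMA = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "author": {"type": "string"},
        "illustrator": {"type": ["string", "null"]},
        "publisher": {"type": ["string", "null"]},
        "label": {"type": ["string", "null"]},
    },
    "required": ["title", "author", "illustrator", "publisher", "label"],
}

# Manga reuses the novel schema plus original_author where relevant.
MANGA_SCHEMA = {
    **NOVEL_SCHEMA,
    "properties": {
        **NOVEL_SCHEMA["properties"],
        "original_author": {"type": ["string", "null"]},
    },
}
```

Note that tags is deliberately absent from both schemas.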

Speed + Reliability: fast-first + retry fallback

To keep latency low without dropping accuracy, extraction now uses the same philosophy as cover validation:

  1. Fast attempt with a small OLLAMA_NUM_PREDICT (e.g. 192)
  2. If the structured parse fails, retry once with a larger budget (OLLAMA_RETRY_NUM_PREDICT, e.g. 512 or 1024)

That keeps the common case fast, while recovering the occasional “needs a few more tokens” response.
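The two-step logic is simple enough to sketch in full. extract_structured and the injected call_ollama helper are illustrative names, but the budget handling mirrors the env vars above:

```python
import json
import os

def extract_structured(call_ollama, prompt: str) -> dict:
    """Fast attempt first; on a parse failure, retry once with a larger budget."""
    budgets = [
        int(os.getenv("OLLAMA_NUM_PREDICT", "192")),        # fast path
        int(os.getenv("OLLAMA_RETRY_NUM_PREDICT", "512")),  # one retry
    ]
    last_error = None
    for num_predict in budgets:
        raw = call_ollama(prompt, num_predict=num_predict)
        try:
            return json.loads(raw)
        except json.JSONDecodeError as exc:
            last_error = exc  # likely truncated mid-JSON; try a bigger budget
    raise RuntimeError("structured extraction failed after retry") from last_error
```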

Ergonomics: set it once in .env

I don’t want to pass flags every time I create a post. The pipeline is meant to be “one command” on a new URL.

So the scripts were made to consistently load the repo-root .env (even when invoked from a different working directory), and .env.example was updated to reflect the Gemma 4 defaults.
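One way to do that, sketched here with a tiny hand-rolled parser instead of a python-dotenv dependency, assuming the scripts live in <repo>/scripts/ so the repo root is one directory up:

```python
import os
from pathlib import Path

def load_repo_env(script_file: str) -> None:
    """Load <repo>/.env regardless of the current working directory."""
    env_path = Path(script_file).resolve().parent.parent / ".env"
    if not env_path.exists():
        return
    for line in env_path.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # setdefault: values already exported in the shell win over .env
        os.environ.setdefault(key.strip(), value.strip())
```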

Recommended extraction defaults:

OLLAMA_API_URL=http://localhost:11434/api/generate
OLLAMA_MODEL=gemma4:26b
OLLAMA_USE_FORMAT=1
OLLAMA_TEMPERATURE=0
OLLAMA_NUM_CTX=4096
OLLAMA_NUM_PREDICT=192
OLLAMA_RETRY_NUM_PREDICT=512
OLLAMA_TIMEOUT=240

Cover validation is still independent:

OLLAMA_VISION_MODEL=gemma4:26b

In my current setup I switched cover validation over to gemma4:26b as well, using Ollama structured outputs (schema format) to keep it deterministic.

Benchmarks: warmup is now the default

Local models have a very non-representative first request: model loading, cache warming, and JIT-ish overhead can dwarf steady-state latency.

For that reason, benchmark scripts now do a per-model warmup request by default, with an escape hatch (--no-warmup) when you explicitly want cold-start numbers.
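The warmup pattern itself is a few lines; this benchmark() is an illustrative shape, not the repo’s actual code (the real scripts wire warmup to an argparse --no-warmup flag):

```python
import time

def benchmark(models, call_model, warmup: bool = True) -> dict:
    """Time one request per model, optionally discarding a warmup request first."""
    results = {}
    for model in models:
        if warmup:
            call_model(model, "ping")  # absorbs model load / cache warming
        start = time.perf_counter()
        call_model(model, "real prompt")
        results[model] = time.perf_counter() - start  # steady-state latency
    return results
```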

One gotcha: web_origin and deterministic tags

When structured output is enabled, tags are often omitted to keep outputs small.

So tags are now generated deterministically in Python when missing. That includes optional tags like narou / kakuyomu based on a heuristic web_origin detector.

The subtle bug: if web_origin is added to the metadata after tag generation, that tag never appears.

Fix: detect web_origin early (right after fetching HTML) so tag generation can include it.

Results (so far)

  • Structure extraction: gemma4:26b works well once schema format is enabled, output keys are minimized, and a retry budget exists.
  • Cover validation (vision): gemma4:26b is accurate enough for me to move to production.

Latest benchmark snapshot (Gold mode: sample, negatives per positive: 5, seed: 1337, Ollama 0.20.2):

  • tier1_structured (scripts/validation/bench_results/bench_tier1_structured_1775422335.json)
    • gemma4:26b: answered_acc=0.983, fp=1, fn=1, errors=0
    • gemma4:latest: answered_acc=0.908, fp=11, fn=0, errors=0
  • tier1_structured_bool (scripts/validation/bench_results/bench_tier1_structured_1775422991.json)
    • gemma4:26b: answered_acc=1.000, fp=0, fn=0, errors=0

What’s next

There are still real edge cases in the dataset (honto “set” pages, edition markers like 【電子特別版】, series metadata drift, etc.). Some of these are “expected ≠ ground truth” problems (my blog titles are sometimes intentionally normalized), and some are genuine extraction mistakes.

The important part is that the pipeline is now:

  • predictable (schema structured output)
  • fast in the common case (small num_predict)
  • resilient (one retry fallback)
  • ergonomic (configure once in .env)

That’s enough to start treating Gemma 4 as the default extraction model in the real workflow.

Decision (current production default)

At this point I’m comfortable switching the whole post-creation flow to a single model:

  • OLLAMA_MODEL=gemma4:26b (structure extraction + slug romanization)
  • OLLAMA_VISION_MODEL=gemma4:26b (cover validation)

One note: slug generation is still an LLM call (title_romaji_prompt in scripts/prompts.json) and it currently uses the same OLLAMA_MODEL. I didn’t run a dedicated “slug benchmark”; it relies on strict output constraints (lowercase, [a-z0-9-], 20 chars max) and falls back to a local fallback_slugify() if the model returns something invalid.
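For reference, a fallback under those constraints can be sketched like this. fallback_slugify is the name mentioned above, but this body is an illustrative reimplementation, not the repo’s code:

```python
import re

def fallback_slugify(text: str, max_len: int = 20) -> str:
    """Local slug fallback: lowercase, [a-z0-9-] only, max 20 chars."""
    slug = text.lower()
    slug = re.sub(r"[^a-z0-9]+", "-", slug)  # collapse invalid runs to "-"
    slug = slug.strip("-")[:max_len].rstrip("-")
    return slug or "post"  # never return an empty slug

def is_valid_slug(slug: str, max_len: int = 20) -> bool:
    """Check the LLM's slug against the same constraints before accepting it."""
    return bool(re.fullmatch(r"[a-z0-9-]{1,%d}" % max_len, slug))
```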

Related Posts

  1. When Over-Engineering Meets Reality: The Author Database Story
  2. Script for creating New Post