
Subtitling Cardcaptor Sakura Archive: Three Evenings, Two Pivots

Context

For a while now I’ve had an archive of Cardcaptor Sakura recorded off Japanese broadcast — 13 episodes of Clear Card and 67 of the 70 original-series episodes. About 80 MP4 files in total, of which 45 also have the raw .ts source on disk. My daughters got hooked on the opening at first sight, but they don’t understand Japanese, so I needed a way to get them a subtitled version.

So: a side project. Generate Japanese subtitles first, then derive an English translation from those, for the whole set, on local hardware. Gemma 4 E2B/E4B claims native audio support, and the 26B/31B variants are supposed to be good at Japanese — I wanted to test both claims.

Time budget context up front: about three weeknight evenings, after dinner. Several choices below — model picks, reaching for whatever was already on disk — make more sense with that in mind.

It pivoted twice. This post is a record of those pivots and the bits I’d want to remember.

The initial plan: Gemma 4 native audio

The first instinct was: Gemma 4 has native audio input. Just feed it 30 seconds of episode at a time and ask for SRT-formatted Japanese with speaker tags. One model, one prompt, done.

Picking the inference engine took longer than I expected:

  • Hugging Face transformers: crashed on Windows in ways I didn’t want to debug at 10pm.
  • vLLM: no audio support yet at the time I tried.
  • LM Studio: GUI is great but it didn’t expose a native-audio endpoint I could script against.
  • llama.cpp (llama-server): worked, with the caveat that running Gemma 4 multimodally needs both the main .gguf weights and the matching audio projector .mmproj file. Easy to miss the second one.

So llama-server it was. The plan was: one model, audio in, SRT out.

The data breakthrough: ARIB embedded subs

Before writing any of the audio pipeline, I went poking at the .ts files. Turns out NHK broadcasts ship ARIB closed captions embedded in the transport stream — with colors, with furigana — and a pure-Python library, johnoneil/arib, will rip them out to .srt or .ass losslessly. This step lives outside the main llm-subtitler pipeline as a one-shot pre-process; the pipeline only consumes the resulting JSON anchors.
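
The pre-process itself was a short loop. A minimal sketch; the arib-ts-extract command name and its -o flag are assumptions standing in for whatever entry point the arib package actually installs:

```python
import subprocess
from pathlib import Path

# Walk the .ts half of the archive and shell out to the ARIB extractor.
# "arib-ts-extract" and "-o" are placeholders for the real entry point.
for ts in sorted(Path("archive").glob("*.ts")):
    out = ts.with_suffix(".ass")
    if out.exists():                  # skip episodes already extracted
        continue
    subprocess.run(["arib-ts-extract", str(ts), "-o", str(out)], check=True)
```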

That was an embarrassing amount of free progress. About half the archive (the .ts half) now had perfect subtitles with zero LLM involvement.

But the more interesting consequence wasn’t the convenience — it was that the ARIB extractions are essentially ground truth. Exact display windows, exact Japanese, exactly what NHK aired. That reframed the project: anything I built for the MP4 half could be measured against the TS half, instead of vibes-tuned.

The hybrid pipeline (for the MP4 half)

For the episodes without ARIB, the original sketch was “faster-whisper for ears, Gemma 4 for speaker tagging.” That’s not what shipped. The production pipeline ended up using:

  • Silero VAD for timing — it produces tight speech segments without anyone having to run a full transcription model just to know when people are talking.
  • Gemma 4 E4B for the audio content of each segment — a three-temperature FRAG ensemble per chunk (sketched after this list), so the synthesis step downstream has options to pick the cleanest one.
  • A synthesis pass (Gemini, with local Gemma 26B-A4B as offline fallback) to collapse the FRAG ensemble into one clean Japanese line per cue.
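
A minimal sketch of the per-chunk ensemble; transcribe_chunk() is a placeholder for the real llama-server request, and the temperature values are illustrative:

```python
TEMPERATURES = (0.2, 0.6, 1.0)    # illustrative, not the production values

def frag_ensemble(chunk_wav: bytes) -> list[str]:
    # Ask the audio model for the same chunk at several temperatures so
    # synthesis can pick the cleanest fragment downstream.
    fragments = []
    for temp in TEMPERATURES:
        text = transcribe_chunk(chunk_wav, temperature=temp)  # placeholder
        if text.strip():              # keep only non-empty candidates
            fragments.append(text.strip())
    return fragments
```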

Speaker tagging didn’t survive contact with reality — the character-dictionary approach was finicky and the families watching this don’t need [Sakura]: prefixes to follow along, so it got cut.

The autoresearch loop

This is the part I’d been waiting for an excuse to try.

I’d been reading about Karpathy’s autoresearch sandbox for a while and wanted to point one at a real problem. CCS gave me exactly the shape it wants: a frozen evaluator (ARIB ground truth), a single mutable file (the VAD pipeline), and a metric (F1 of predicted speech segments against ARIB display windows).

I set up a sibling repo, llm-subtitler-research:

  • eval/ — frozen, untouched by the agent. Loads ARIB, computes F1 over a bipartite IoU match (sketched after this list).
  • pipeline.py — the one file the agent is allowed to edit.
  • program.md — the per-stage brief (metric, mutation scope, keep/discard rule).
  • results.tsv — append-only log, one row per iteration.
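
The matcher at the heart of eval/ is small. A minimal sketch of the idea, assuming segments are (start, end) tuples in seconds; the Hungarian solver stands in for whatever bipartite matcher the real eval uses, and the min_iou floor is illustrative:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a: tuple[float, float], b: tuple[float, float]) -> float:
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def f1(preds, truths, min_iou: float = 0.1) -> float:
    # Bipartite match: each prediction counts for at most one ARIB window,
    # so flooding the timeline with predictions hurts precision.
    if not preds or not truths:
        return 0.0
    cost = np.array([[-iou(p, t) for t in truths] for p in preds])
    rows, cols = linear_sum_assignment(cost)      # maximizes total IoU
    tp = sum(1 for r, c in zip(rows, cols) if -cost[r, c] >= min_iou)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(preds), tp / len(truths)
    return 2 * precision * recall / (precision + recall)
```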

A cron launched a Claude Opus iteration every few minutes overnight. Stage A target: maximize VAD F1 against ARIB display windows.

Stage A: 93 iterations, F1 from 0.3955 to 0.5997

| Metric | Baseline | Locked winner |
|---|---|---|
| Train F1 (5 episodes) | 0.3955 | 0.5997 (+52%) |
| Val F1 (2 episodes) | n/a | 0.7228 |

Final config locked at: THRESHOLD=0.45, MIN_SPEECH=50ms, MIN_SILENCE=200ms, PAD=2000ms, EXTRA_TRAIL=1400ms, plus a structural change called split-long-preds (re-run silero with a finer min_silence on any prediction ≥4s, replace the parent if ≥2 sub-segments come out).
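
A sketch of how those numbers might map onto silero-vad's public get_speech_timestamps API; the padding and split shapes are reconstructed from the prose above, not copied from the research repo, and the finer min_silence inside the split is an assumption:

```python
from silero_vad import load_silero_vad, read_audio, get_speech_timestamps

SR = 16000
THRESHOLD, MIN_SPEECH_MS, MIN_SILENCE_MS = 0.45, 50, 200
PAD, EXTRA_TRAIL, SPLIT_THRESHOLD = 2.0, 1.4, 4.0          # seconds

model = load_silero_vad()

def vad_segments(wav_path: str) -> list[tuple[float, float]]:
    wav = read_audio(wav_path, sampling_rate=SR)
    raw = get_speech_timestamps(
        wav, model, threshold=THRESHOLD,
        min_speech_duration_ms=MIN_SPEECH_MS,
        min_silence_duration_ms=MIN_SILENCE_MS,
        return_seconds=True)
    segs = []
    for seg in raw:
        start, end = seg["start"], seg["end"]
        if end - start >= SPLIT_THRESHOLD:
            # split-long-preds: re-run with a finer min_silence and replace
            # the parent only if at least two sub-segments come out
            sub = get_speech_timestamps(
                wav[int(start * SR):int(end * SR)], model,
                threshold=THRESHOLD,
                min_speech_duration_ms=MIN_SPEECH_MS,
                min_silence_duration_ms=100,      # finer value: an assumption
                return_seconds=True)
            if len(sub) >= 2:
                segs += [(start + s["start"], start + s["end"]) for s in sub]
                continue
        segs.append((start, end))
    # symmetric PAD on both sides, EXTRA_TRAIL on the trailing side only
    return [(max(0.0, s - PAD), e + PAD + EXTRA_TRAIL) for s, e in segs]
```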

What worked

| Δ F1 | Change | Why it worked |
|---|---|---|
| +0.071 | MIN_SILENCE 500→200ms | The anime-tuned baseline over-merged; recall was the bottleneck. |
| +0.041 | Split-long-preds | Biggest structural win — exposed multiple ARIB matches inside what used to be one merged prediction. |
| +0.034 | PAD sweep 400→2000ms (six small keeps) | ARIB display windows are wider than the actual speech; a bigger pad → better IoU. |
| +0.024 | Asymmetric end-only TRAIL 0→1400ms | ARIB persists ~1.4s past the speech tail; padding only the trailing side wins. |
| +0.014 | MIN_SILENCE 100→150→200ms (late re-test) | Different optimum once split-long-preds was in place. |

What failed (the instructive part)

| Idea | Result | Lesson |
|---|---|---|
| Lower threshold (0.20–0.30) | −0.07 | Silero over-merges at low threshold. |
| Dual-threshold union (0.35 + 0.50) | −0.142 | Doubling pred count tanked precision; the bipartite matcher penalizes excess. |
| Gap-bridge merge of preds <200ms apart | −0.542 | Cascade-merged into 30s+ excluded segments. Spectacular. |
| Trail clamped to next-pred start (no overlap) | −0.012 | Overlap is beneficial — bipartite picks best-IoU per ARIB anyway. |

The −0.542 was my favorite. It was the kind of plausible-sounding change (“clean up tiny gaps”) that you’d never test by hand because of course it works. The eval said no in about thirty seconds.

The plot twist: I had a perfectly good dataset and didn’t read it

After Stage A locked, I started planning Stage B (transcription CER) by — finally — opening the ARIB ground truth and reading it. About an hour of skimming (twelve segments out of 2,357) was enough to find four categories of problems in my own evaluator:

  • 220 BGM-only blocks (♫〜) across 7 episodes. Pure recall drag — no VAD will ever match them because there’s no speech. They cap recall at 0.84–0.94 per episode regardless of how good the model is.
  • 189 ruby-reading lines. ARIB displays kana readings on a separate line above the kanji (きのもと さくら / 私 木之本 桜。). My ruby-stripping regex only handled the inline 《...》 form. Every name introduction was poisoning the metric.
  • 197 continuation arrows. ARIB splits a single spoken utterance across multiple display windows, chaining them with an arrow glyph. Without merging those chains at extraction time, my predictions were always “wrong” against an artificially split ground truth.
  • 83 speaker tags sitting after a newline that my ^(...) regex missed.

When I cleaned all of this up and re-ran the locked Stage A winner on the cleaned data, F1 fell from 0.5997 to 0.5738. The drop is purely a metric-shape change — merging arrow-chains made ARIB blocks fewer/longer, so VAD over-segmenting inside one merged block now hurts precision. I tested five nearby variants; the locked config still won. The optimization was real. The ceiling wasn’t.

Put differently: ~10–15 percentage points of what I’d been calling “headroom” was metric noise. The loop had been faithfully optimizing a metric that was partially measuring its own artifacts.

Rule one of practical ML is “look at your data before you point an optimizer at it.” I skipped the hour of ARIB skimming that would have caught all four categories because I was itching to launch the cron.

The loop itself was great — it did exactly what an autoresearch loop should do, which is honestly optimize whatever metric you hand it. I’d happily run another one tomorrow. (One cost data-point: those two evenings of iteration burned my entire weekly Claude Opus quota. Autoresearch is not free.)

A second-opinion review from a different agent (“codex”) materially improved the eval cleanup itself — most notably catching that a kana cue line should only be stripped when the next line literally starts with the expected character, so the aggressive heuristic wouldn’t over-strip real content like the sentence-starter とりあえず. Lesson within the lesson: even when you finally do look at the data, a second pair of eyes on the cleanup rules earns its keep.

The second pivot: stop translating, start retiming

Around the time I was processing the autoresearch findings, the pipeline was producing reasonable Japanese text — but going from Japanese to English went sideways fast. One-shot Japanese→English translation kept turning “Sakura” into “cherry blossom” all over the place, and the character names were impossible to keep consistent. After another chat with Gemini about what to build next, the obvious-in-retrospect option finally surfaced: CCS has a huge fandom. Crunchyroll / NIS released English subs years ago. People keep those files alive on Kitsunekko and similar. The English problem is solved already — I just need to align someone else’s .ass to my local video timing.

That reframed the whole back half of the project. Instead of “transcribe + translate”, the new pipeline became:

  1. Find ARIB anchors (or fall back to a generated .ja.srt from the audio pipeline).
  2. Take an existing Crunchyroll/NIS .ass.
  3. Compute one global time offset and shift.

NHK helps a lot here, because there are no commercial breaks. One scalar offset covers the whole episode.

The ‘dumb math’ retiming breakthrough

The first design instinct was Dynamic Programming. Resist that instinct.

What actually worked is a 1D Hough transform — or, more honestly, “compute pairwise time-differences between every source cue and every anchor cue, then take the mode of the histogram” (sketched just after this list):

  • For each pair (source cue s, anchor cue a), record t_s − t_a (within some reasonable window).
  • Bin the differences (~0.5s buckets). The tallest bin is your global offset.
  • Subtract that offset from every source cue. Done.
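
In code, the whole thing is a couple of dozen lines. A minimal sketch, assuming the cue lists are already parsed into start times in seconds; the voting window, bin width, and tolerance are illustrative values:

```python
from collections import Counter

def global_offset(source, anchors, window=120.0, bin_s=0.5):
    # Vote every pairwise difference into a 0.5s histogram bin; the mode
    # wins. Outliers (OP/ED lyrics, missing cues) just become low bins.
    votes = Counter()
    for s in source:
        for a in anchors:
            d = s - a
            if abs(d) <= window:
                votes[round(d / bin_s)] += 1
    return votes.most_common(1)[0][0] * bin_s

def coverage(source, anchors, offset, tol=0.3):
    # fraction of shifted source cues landing within tol of some anchor
    hits = sum(1 for s in source
               if any(abs((s - offset) - a) <= tol for a in anchors))
    return hits / len(source)
```

Phase 3.5’s ≥0.60 gate then reduces to one comparison against a coverage number like this.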

This makes it robust to missing cues, extra OP/ED lyrics, and one-off bad anchors — they show up as noise around the mode, not as wrong answers. On real episodes it found exact offsets like -7.0s or -6.5s with ~80% coverage and <0.3s residuals, in seconds of compute.

(With the benefit of hindsight: after all of this, the answer for most episodes was just “+7 seconds, applied globally.” I could have nudged the slider in VLC’s Track Synchronization dialog and gone to bed.)

Where ARIB wasn’t available, the generated .ja.srt (from the Silero VAD → Gemma E4B → synthesis pipeline) was good enough to serve as the anchor. ~75% coverage was plenty for the histogram mode to lock onto a single offset.

Putting it together: the four-phase pipeline

The whole thing lives in a private llm-subtitler repo — it’s vibe code held together by checkpoints and good intentions, not ready to open-source yet. But the shape is worth describing: main.py is a state-machine orchestrator that walks each video through four (well, four-and-a-half) phases. Every phase is idempotent based on file presence: if episode01.ja.srt already exists, Phases 1–3 are skipped for that episode. Crash mid-run, restart, and it picks up exactly where it left off. This sounds boring, but it’s the single most valuable property of the whole pipeline — debugging is much shorter when restart-from-scratch isn’t the failure mode. (A minimal sketch of this gate follows the diagram.)

video.mp4
  │
  │  ── Phase 1+2 ──▶  ffmpeg → 16kHz mono WAV
  │                    Silero VAD (Stage A locked config)
  │                    Gemma E4B audio → raw FRAG fragments
  │                  produces: episode.raw.ja.srt
  │
  │  ── Phase 3   ──▶  drop empty FRAG blocks
  │                    sliding window (batch=20, ctx=3)
  │                    Gemini synthesizes clean Japanese SRT
  │                  produces: episode.ja.srt
  │
  │  ── Phase 3.5 ──▶  pick anchor: ARIB → .ja.srt fallback
  │                    1D-Hough mode offset, evaluate coverage
  │                    if coverage ≥ 0.60 → shift Crunchyroll .ass
  │                  produces: episode.en.srt (or skips → Phase 4)
  │
  └─ ── Phase 4   ──▶  only for episodes Phase 3.5 couldn't time
                       Gemini translates .ja.srt → English SRT
                     produces: episode.en.srt
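
The gate logic reduces to file-existence checks. A minimal sketch of the idea; the helper names are stubs, not the real main.py API:

```python
from pathlib import Path

def run_episode(video: Path) -> None:
    # Every gate is "does the output file exist?", so a crashed run
    # restarts exactly where it stopped. Helper functions are stubs.
    ja = video.with_suffix(".ja.srt")      # e.g. episode01.ja.srt
    en = video.with_suffix(".en.srt")

    if not ja.exists():                    # one gate for Phases 1-3
        wav = extract_audio(video)         # Phase 1: ffmpeg, 16 kHz mono
        frags = transcribe(wav)            # Phase 2: VAD + E4B fragments
        synthesize(frags, out=ja)          # Phase 3: clean Japanese SRT

    if not en.exists():
        if not retime(video, out=en):      # Phase 3.5: anchor + offset
            translate(ja, out=en)          # Phase 4: LLM fallback
```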

A few implementation choices worth noting:

VRAM lifecycle in one process. A LlamaServerManager boots llama-server.exe with the E4B multimodal model for Phase 1+2, kills it cleanly at the end, then can re-boot with the 26B-A4B text model for Phase 3 if you’ve configured --phase3-engine llama. On a 24GB 4090 there is no way to hold both models in VRAM simultaneously, so the boot/kill dance is mandatory. I did try the 31B variant — it fits but runs unusably slow at this VRAM budget; the 26B-A4B is blazing fast and good enough at Japanese for synthesis. In practice the production default is --phase3-engine gemini anyway, because the Gemini API is faster still and turned out to cost about $6 total for the whole CCS run.
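
A sketch of the boot/kill dance, assuming a llama-server binary on PATH. The -m, --mmproj, and --port flags are real llama.cpp options; the class shape and the health probe are reconstructions, not the actual LlamaServerManager:

```python
import subprocess, time, urllib.request

class LlamaServerManager:
    """Boot llama-server for one phase, kill it before the next model."""

    def __init__(self, model: str, port: int, mmproj: str | None = None):
        cmd = ["llama-server", "-m", model, "--port", str(port)]
        if mmproj:
            cmd += ["--mmproj", mmproj]   # needed for E4B audio input
        self.proc = subprocess.Popen(cmd)
        self.port = port

    def wait_ready(self, timeout: float = 120.0) -> None:
        deadline = time.time() + timeout
        while time.time() < deadline:
            try:
                urllib.request.urlopen(f"http://127.0.0.1:{self.port}/health")
                return
            except OSError:
                time.sleep(1.0)
        raise TimeoutError("llama-server never became healthy")

    def stop(self) -> None:
        self.proc.terminate()   # frees VRAM for the next model's boot
        self.proc.wait()
```

Phase 1+2 boots with the E4B .gguf plus its .mmproj, stop() frees the VRAM, and Phase 3 can re-boot with the text model on its own port.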

Sliding window for synthesis and translation. Phase 3 and Phase 4 both use a SlidingWindow(batch_size=20, context_size=3) when the engine is Gemini. Twenty blocks per request gives the model enough context to keep pronoun/name consistency across cues; three blocks of prior context carry character voice forward. The llama engine default of batch_size=1 (one request per block) was where I started, and is roughly 20× slower and noticeably worse on cross-cue consistency.
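
The windowing itself is a few lines. A minimal sketch, assuming “context” means the blocks immediately before each batch:

```python
def sliding_window(blocks: list[str], batch_size: int = 20,
                   context_size: int = 3):
    # Yield (context, batch): up to 20 blocks to process, preceded by the
    # last 3 already-processed blocks for name/pronoun consistency.
    for i in range(0, len(blocks), batch_size):
        context = blocks[max(0, i - context_size):i]
        yield context, blocks[i:i + batch_size]
```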

Phase 3.5 short-circuits Phase 4. If retiming succeeds with ≥60% coverage, .en.srt is written and Phase 4 sees the file already exists, so it skips. If retiming fails (low coverage or a parse error), a diagnostic JSON gets written to .cache/retime/ and Phase 4 picks the episode up via the LLM translation path. Same output filename, different production paths, both valid.

The synthesis prompt is deliberately hostile to chattiness. Local Gemma loves to think out loud; the prompt body is:

Synthesize Japanese subtitles from transcription fragments.
Do not think step by step.
Do not output analysis, reasoning, thoughts, explanations, markdown, or commentary.
Output exactly ONE SRT block per input block.
Maintain sequence numbers and timestamps exactly.
If fragments are noisy, choose the best non-empty fragment.
Output ONLY raw SRT. No conversational text.

Even with this, the silence-hallucination filter and the regex block splitter (next section) are non-optional.

Stage A’s VAD config is baked into src/audio.py. The constants — THRESHOLD=0.45 / MIN_SPEECH=50 / MIN_SILENCE=200 / PAD=2000 / TRAIL=1400 / SPLIT_THRESHOLD=4000 — sit at the top of process_vad() with a comment pointing back to the research repo’s locked winner. That comment is the entire “deliverable” of those two autoresearch evenings: five numbers and one structural change. Worth it.

Production hardening

A few small things that turned a “works on episode 1” prototype into something I’d let loose on the whole archive:

  • Silence-hallucination filter. Gemma E4B occasionally answers a silent audio chunk with English meta-commentary like “I cannot hear any speech.” A small text-filter module (src/text_filters.py) requires Japanese characters in fragments and drops common “cannot hear / cannot transcribe / silence” English meta-responses before they reach the synthesis prompt.
  • Regex SRT block splitter. Naively splitting on blank lines breaks the moment an LLM’s output contains an internal blank line. The fix: only split on a valid Index\nTimestamp anchor (see the sketch after this list).
  • Index healing. Auto-renumber SRT indices so a downstream LLM never sees gaps.
  • UTF-8 BOM tolerance on the glossary loader (utf-8-sig).
  • Safer FFmpeg stream mapping when muxing .ja.srt + .en.srt into the MP4 container — explicit -map so an existing subtitle stream in the source doesn’t quietly pick up the new metadata.
  • Episode-number extraction from Japanese filenames (src/episode_map.py). This was Gemini’s wry observation during review: the math was the easy part, the messy reality was filenames with full-width digits, parenthesized episode numbers, broadcast dates appended at the end, and at least one movie file whose title happened to contain the digit “4.” Without prefer-parenthesized-episode-number logic, the histogram could lock onto a perfect offset for an entirely unrelated file.
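
Minimal sketches of the splitter, the index healing, and the silence filter. The regex and the keyword list reconstruct the idea; they are not the production src/text_filters.py:

```python
import re

# Split only where a blank line is followed by "index\ntimestamp", so a
# stray blank line inside an LLM's output can't tear a block apart.
BLOCK_ANCHOR = re.compile(r"\n\s*\n(?=\d+\s*\n\d{2}:\d{2}:\d{2},\d{3} -->)")

def split_blocks(srt_text: str) -> list[str]:
    return [b for b in BLOCK_ANCHOR.split(srt_text.strip()) if b.strip()]

def heal_indices(blocks: list[str]) -> list[str]:
    healed = []
    for i, block in enumerate(blocks, start=1):
        _, rest = block.split("\n", 1)    # drop the old index line
        healed.append(f"{i}\n{rest}")     # renumber 1..N, no gaps
    return healed

JP_CHARS = re.compile(r"[ぁ-んァ-ヶ一-龯]")
META_MARKERS = ("cannot hear", "cannot transcribe", "silence", "no speech")

def keep_fragment(text: str) -> bool:
    # require Japanese characters; drop E4B's English meta-responses
    if not JP_CHARS.search(text):
        return False
    return not any(m in text.lower() for m in META_MARKERS)
```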

Meta-lessons (the ones I want to remember)

  1. Look at your data before you point a loop at it. The single biggest lesson.
  2. Re-test “dead” knobs in new contexts. Several knobs (THRESHOLD=0.45, PAD=2000, MIN_SPEECH=50) were discarded in early iterations and became keeps later once other changes shifted the surface.
  3. Big wins are structural, not knob-tweaks. Pure constant-sweeps capped around F1=0.55. Asymmetric trailing pad and split-long-preds were the +0.05 push.
  4. Visual eyeballing of data is the highest-ROI activity. A regex tested against the wrong example will pass forever. No unit test catches “you tested the wrong form of the input.”
  5. Keep / noise / discard against σ keeps you honest. Stage A had 93 iterations: 19 keeps, 44 noise, 30 discards. Without an explicit σ-aware threshold (sketched after this list) I would have committed half the noise as “wins” and ratcheted on randomness.
  6. Split research repo from production repo. The research repo could try −0.542 ideas freely; production stayed boring and usable for actual subtitle generation.
  7. Autoresearch is not free. Two evenings ≈ one weekly Opus quota. Worth knowing before launching the next one.
  8. Some episodes are just hard. Ep4 “くたくた日曜日” stayed stuck at recall ~0.40 through all of Stage A. It’s quiet introspective dialogue — an audio-character problem, not a knob problem. The honest answer is to write that down rather than tune blindly.
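
Lesson 5 as code — a minimal sketch, where k and the noise σ (estimated from repeated eval runs) are illustrative:

```python
def classify(delta_f1: float, sigma: float, k: float = 2.0) -> str:
    # Commit a change only when its delta clears the eval's noise floor;
    # everything inside +/- k*sigma gets logged as noise, not kept.
    if delta_f1 >= k * sigma:
        return "keep"
    if delta_f1 <= -k * sigma:
        return "discard"
    return "noise"
```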

Where it ended up

  • TS episodes: ARIB extraction, lossless, done. Colors and furigana preserved in the .ass output.
  • MP4 episodes: Silero VAD → Gemma E4B fragment ensemble → Gemini synthesis (or local Gemma 26B-A4B when running fully offline) → .ja.srt. Then Crunchyroll/NIS .ass retimed via 1D-Hough against the ARIB or .ja.srt anchor → .en.srt. Final FFmpeg mux into the MP4 container.
  • Coverage: 60+ episodes processed end-to-end. The translation-from-audio path stayed in the toolkit as the last-resort fallback for episodes where retiming confidence is too low, rather than the headline feature it was originally going to be.

Decision (current production default)

  • Anchor priority: ARIB extraction → generated .ja.srt from local pipeline → translation fallback.
  • English subs: retime existing Crunchyroll/NIS .ass with 1D-Hough global offset; only translate from scratch when alignment coverage drops below threshold.
  • Models: gemma-4-e4b (multimodal, port 8081 via llama-server) for audio fragments; Gemini API for synthesis and translation; local gemma-4-26B-A4B (port 8082) as the offline fallback. The 31B variant is too slow on 24GB VRAM to be the default.
  • Embedding: FFmpeg mux with explicit stream mapping and a --dry-run mode so I can’t accidentally rewrite the wrong file in a library of recordings I do not want to lose.

Three evenings, two pivots, and one hour of data-reading I’ll do first next time. Anyway, my daughters are now happily watching Tomoyo recording Sakura… 😉

The pipeline in three frames

Three frames from one MP4 episode, showing the pipeline’s output at each stage:

Phase 1+2 (“3-try audio to text”): three raw FRAG fragments from Gemma E4B for the same chunk — the multi-temperature ensemble before synthesis picks the best.

Phase 3 (“final processed Japanese based on prior context”): Gemini synthesis collapses the FRAG ensemble into a single clean Japanese line, using prior cues as context.

Phase 3.5 (or 4) (“final text”): the muxed result in VLC — Japanese and English subtitle tracks both available.

Credits

The work was split across three LLM coding agents, each doing what they’re best at:

  • Claude (Opus 4.7) — the autoresearch loop driver; co-author on this post. The Stage A iteration cron, the eval-cleanup write-up, the four-phase orchestration in main.py, and most of the prose here are its work, edited by me.
  • Codex — second-opinion code reviews and the targeted fixes that survived. The conservative kana-cue stripping rule, the utf-8-sig glossary BOM fix, the silence-hallucination filter in src/text_filters.py, and the episode-mapping fixes for Japanese filenames all came out of codex’s review passes.
  • Gemini — the Phase 3.5 retiming design and most of the actual code. The “1D Hough / pairwise-difference mode” insight, the .ja.srt-as-fallback-anchor idea, the SRT block-splitter regex, and the FFmpeg embed utility were Gemini’s calls — often after I’d flailed at the same problem with DP. Most of the codebase was written by Gemini Pro 3.1 (preview); the Flash variant needed explicit TDD instructions to stay disciplined (“write the failing test first, then the minimum code to pass”) but, once told, produced perfectly serviceable code at a fraction of the cost.

The shape of the final pipeline is the product of three agents arguing politely about the same problem from different angles. None of them alone would have arrived at this design; I would not have arrived at it without all three.
