When Over-Engineering Meets Reality: The Author Database Story

The Stale Database Problem

After building my automated book review script, I had a nice workflow going. The script would extract book metadata from honto.jp, auto-fill author social links from author_db.json, and generate a draft post ready for me to write.

The author database was built early on by scanning all my existing posts. It worked great—until it didn’t.

Two months and dozens of reviews later, I noticed every new review had ⚠️ New author warnings. The script would create empty placeholders like [[Twitter/X]()], and I’d manually hunt down the author’s social links. Again. And again.

The database had 2,492 authors from my old posts, but none of the new authors I’d been reviewing recently. It was just frozen in time.
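
To make that concrete, the auto-fill step is essentially a dictionary lookup against author_db.json. It boils down to something like this (a sketch; the file path and field names are illustrative, not the script's actual schema):

import json

# Load the author database that was built by scanning existing posts
with open("author_db.json", encoding="utf-8") as f:
    author_db = json.load(f)   # maps author name -> social links

def lookup_author(name: str) -> dict:
    """Return known social links, or flag the author as new."""
    if name in author_db:
        return author_db[name]          # e.g. {"twitter": "...", "website": "..."}
    print(f"⚠️ New author: {name}")     # draft gets empty [[Twitter/X]()] placeholders
    return {}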

The Obvious Solution (That Wasn’t)

The fix seemed clear: add periodic sync to update the database with new authors from recent posts.

But running a full scan every time felt wasteful. Why re-process 1,455 posts when only 5-10 had changed? That’s the kind of inefficiency that makes programmers itch.

So I did what any engineer would do: I reached for incremental updates. “I’ll build a smart sync that only processes posts modified since the last run. It’ll be fast, efficient, and scalable. Just like rsync or git pull!”

The plan made perfect sense:

  • ✅ Only scan new/changed posts (efficient!)
  • ✅ Track last sync time (smart!)
  • ✅ Use git commit history (modern!)
  • ✅ Scale to thousands of posts (future-proof!)

So I spent an entire afternoon with Claude building a sophisticated incremental sync system with git log integration, timestamp tracking, and multiple command-line modes (--since-last-sync, --since <date>, --full). After adding ~200 lines of code, I was ready to test my elegant solution.
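
The heart of it was a per-file git lookup: ask git when each post was last committed, and keep only the ones newer than the last sync. Reconstructed roughly, it looked like this (a sketch, not the exact code):

import subprocess
from datetime import datetime
from pathlib import Path

def last_commit_date(path: Path) -> datetime | None:
    """One git subprocess call per file: when was this post last committed?"""
    out = subprocess.run(
        ["git", "log", "-1", "--format=%cI", "--", str(path)],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    return datetime.fromisoformat(out) if out else None

def changed_since(posts: list[Path], last_sync: datetime) -> list[Path]:
    """Filter the post list down to files committed after the last sync."""
    return [p for p in posts if (d := last_commit_date(p)) and d > last_sync]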

The Performance Test

# Full sync (the "naive" way)
time python3 scripts/sync_author_db.py
# → 2.0 seconds

# Incremental sync (the "smart" way)
time python3 scripts/sync_author_db.py --since-last-sync
# → 16.3 seconds

Wait. What?

I was watching the incremental sync run and it felt like it was taking way longer than the full sync. So I asked Claude (and Codex): “Looking at the bigger picture, does this update make sense? i.e., given the small size of the data (~1500 posts to scan), does incremental update make sense? Or just run full update once every week and call it done better?”

That’s when Claude suggested I actually measure it with the time command. And the numbers didn’t lie.

The “optimized” incremental sync was 8× SLOWER than the full scan.

What Went Wrong

The problem was simple: I optimized the wrong thing.

What I thought was expensive:

  • Processing 1,455 posts (reading files, parsing markdown, updating database)
  • Time per post: ~1.4 milliseconds
  • Total: 2 seconds

What was actually expensive:

  • Running git log on 1,455 files to check commit dates
  • Time per git command: ~11 milliseconds
  • Total: 16 seconds

My “optimization” added 16 seconds of git overhead to save 2 seconds of processing time. Classic.

Even when incremental sync only processed 26 changed posts (instead of 1,455), it still took 16 seconds because it had to check every file to find those 26.
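
You can see the per-call overhead for yourself with a few lines (the file path is just an example; the exact number depends on your machine and repo):

import subprocess, time

# Time a single `git log` call — roughly the per-file cost of the incremental filter
start = time.perf_counter()
subprocess.run(["git", "log", "-1", "--format=%cI", "--", "README.md"],
               capture_output=True, text=True)
print(f"{(time.perf_counter() - start) * 1000:.1f} ms")   # ~11 ms here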

The Math of Over-Engineering

Let me break down the costs:

Full Sync (Simple):

Read 1,455 files:    ~0.5s
Parse & extract:     ~1.0s
Write database:      ~0.5s
Total:               ~2.0s

Incremental Sync (Complex):

Run git log ×1,455:  ~16.0s  ← The killer
Read 26 files:       ~0.01s
Parse & extract:     ~0.02s
Write database:      ~0.5s
Total:               ~16.5s

The incremental approach only works if:

  • Filtering is cheaper than processing (not true here—git is expensive)
  • The dataset is huge (100,000+ posts where full scan takes minutes)
  • Processing is expensive (API calls, LLM inference, heavy computation)

My reality:

  • 1,455 posts (small dataset)
  • Cheap processing (regex + dictionary updates)
  • Expensive filtering (git subprocess per file)

I had optimized for a scale problem I didn’t have, using a tool that was slower than the problem itself.
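
The break-even condition fits in a few lines, using the rough per-item costs above:

# Back-of-the-envelope comparison with the measured per-item costs
n_total    = 1455    # posts in the repo
n_changed  = 26      # posts modified since the last sync
process_ms = 1.4     # read + parse + update, per post
filter_ms  = 11.0    # one git subprocess, per post

full_ms        = n_total * process_ms                          # ≈ 2,000 ms
incremental_ms = n_total * filter_ms + n_changed * process_ms  # ≈ 16,000 ms

# Incremental only wins when filtering a post costs less than just processing it:
#   filter_ms < process_ms * (1 - n_changed / n_total)
print(f"full: {full_ms:.0f} ms   incremental: {incremental_ms:.0f} ms")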

The Bugs We Found Along the Way

Before committing, I ran the code through a few different AI code reviewers. Between them, OpenAI Codex’s local review and Gemini Code Assist on GitHub caught three bugs:

  1. Timezone crash: Comparing timezone-naive vs timezone-aware timestamps would crash (Codex)
  2. Wrong git date: Used the author date instead of the commit date (Codex - subtle but wrong; see the example below)
  3. No atomic writes: A crash during save could corrupt the database (Gemini)
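
The second bug is worth spelling out: git stores two dates per commit, and for “what changed since my last sync” you want the committer date (%cI), not the author date (%aI), because the two diverge after rebases and amended commits. A quick way to compare them for a file (the path is just an example):

import subprocess

path = "content/posts/example-review.md"   # example path, not a real post
for label, fmt in [("author date:   ", "%aI"), ("committer date:", "%cI")]:
    date = subprocess.run(
        ["git", "log", "-1", f"--format={fmt}", "--", path],
        capture_output=True, text=True,
    ).stdout.strip()
    print(label, date)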

But the most important feedback came from my question to Claude about whether this even made sense at my scale. That’s when I actually measured the performance and realized I’d spent hours building something that made the tool worse.

The Pivot to Simplicity

I made a pragmatic call: keep the incremental mode (it’s already written and works), but:

  1. Recommend full sync - It’s faster and simpler
  2. Fix the real bugs - Add timezone normalization and atomic writes
  3. Document the lesson - Note in README that full sync is faster

The safety fixes turned out to be simple but crucial:

from datetime import datetime, timezone
import json

def normalize_to_utc(dt: datetime) -> datetime:
    """Prevent timezone comparison crashes by making every datetime aware (UTC)."""
    if dt.tzinfo is None:
        return dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc)

# Atomic write (crash-safe): write to a temp file, then rename over the original
temp_path = db_path.with_suffix('.json.tmp')
with open(temp_path, 'w', encoding='utf-8') as f:
    json.dump(db, f, ensure_ascii=False, indent=2)
temp_path.replace(db_path)  # rename is atomic on the same filesystem

Just 20 lines of code made the database actually safe. No more crash risks, no more corruption potential. And they worked for both modes.

The “optimization” I’d spent hours on turned out to be slower. But the safety fixes I added in 20 minutes? Those actually solved real problems.

Lessons Learned

1. Measure before optimizing

I assumed incremental sync would be faster without testing. The measurement proved me wrong in seconds. Assumptions are dangerous.

2. Know your scale

Incremental sync makes sense at Google scale (millions of files). At my scale (1,455 posts), the overhead dominates.

3. Complexity is a cost

The incremental sync added:

  • 200 lines of code
  • 3 new bugs
  • 8× slower execution
  • Harder to understand and maintain

All to solve a problem I didn’t have.

4. Simple solutions can be the right solutions

Sometimes the naive approach is optimal. A 2-second full scan is instant for a human. Why add complexity?

5. Sometimes you need to ask the right question

Asking “does this even make sense at my scale?” led Claude to suggest actually measuring performance. Without stopping to question the whole approach, I might have just shipped the slower version and never known.

The Final Workflow

Today, my workflow is simple:

# 1. Create a new review (auto-fills known authors from database)
python3 scripts/create_review.py <honto_url> --use-latest

# 2. Manually add social links for any new authors

# 3. Publish (set draft: false, commit to git)

# 4. Update author database (2 seconds, processes all 1455 posts)
python3 scripts/sync_author_db.py

# Done! Next review with this author will auto-fill.

The database now has 2,492 authors and stays fresh. Full sync still takes 2 seconds. I run it once after publishing each review, or just weekly to catch up.

It’s fast enough that I never think about it.

Postscript: When to Actually Optimize

Don’t get me wrong—incremental sync isn’t wrong. It’s just wrong for my use case.

Incremental updates make sense when:

  • Scale is huge: 100,000+ files where full scans take minutes
  • Updates are expensive: Each item requires API calls, LLM inference, or network I/O
  • Frequency is high: Running every minute or second (background daemon)
  • Filtering is cheap: Quick metadata checks vs expensive processing

In those cases, the complexity of incremental sync pays off.

But for small datasets with cheap operations and expensive filtering? The simple full scan is often unbeatable.

As Donald Knuth said: “Premature optimization is the root of all evil.”

I’d add: “Measure first, optimize second. And sometimes, don’t optimize at all.”


P.S. The incremental sync code is still there if you want to use --since 2025-12-20 to debug specific date ranges. It’s just not the recommended default. Sometimes keeping the experiment around is valuable, even if it’s not the main path.


Written with help from Claude Sonnet 4.5 (suggesting the measurement), OpenAI Codex (catching timezone and git date bugs), and Gemini Code Assist (finding the atomic write issue). Sometimes the best code review is a combination of tools and asking the right questions.
