EveryClue

Why Daily Whodunit Puzzles Need a SAT Solver (and What Happens Without One)

A standard daily-puzzle complaint, across whodunit games and word games alike, is that the puzzle has two solutions. Players solve a board, the game says wrong, they peek at the official answer, and the official answer is a different valid solution. This kills trust faster than anything else a daily can do. Players don't always rage-quit immediately, but they stop trusting you to be the daily that respects their time, and within a month they're playing somewhere else.

At EveryClue we generate the bulk of our daily inventory with a language model. AI generators are great at writing fluent clue prose and bad at guaranteeing one and only one valid solution. Closing that gap is the entire reason we run the Z3 solver in our pipeline.

What “unique solution” actually requires

For a five-suspect, four-category whodunit, each category is an independent permutation of five attributes, so there are 5! × 5! × 5! × 5! = 120⁴ = 207,360,000 possible assignments of attributes to suspects, about 207 million. The clue list narrows that down. The puzzle is good if the clues narrow it to exactly one. The puzzle is broken if they narrow it to zero (unsolvable) or two-or-more (ambiguous).
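
If you want to sanity-check that figure, it is one line of arithmetic:

    import math

    # Four categories, each an independent permutation of five attributes.
    assert math.factorial(5) ** 4 == 207_360_000   # about 207 million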

Checking this by hand is impossible. Checking it heuristically — “let me solve it once and see if I get an answer” — catches the unsolvable case, but cheerfully ships ambiguous puzzles where two distinct assignments both satisfy every clue.

The only reliable check is to enumerate. Take the clue list, feed it into a solver, ask the solver to produce up to two models. If it finds zero, reject; if it finds two, reject; if it finds exactly one, publish.
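
Concretely, the check is a few lines against Z3's Python bindings (z3py). A minimal sketch, assuming the puzzle's constraints are already loaded into a Solver named solver and cells holds the grid variables; the names are ours for illustration:

    from z3 import Or, sat

    def count_models(solver, cells, limit=2):
        # Count satisfying assignments, stopping as soon as we hit `limit`.
        solver.push()                      # keep the caller's constraints intact
        found = 0
        while found < limit and solver.check() == sat:
            model = solver.model()
            found += 1
            # Block this exact assignment so the next check must find a
            # different one.
            solver.add(Or([c != model[c] for c in cells]))
        solver.pop()
        return found                       # 0 = unsolvable, 1 = unique, 2 = ambiguous

The blocking clause is the whole trick: after the first model, we forbid that exact assignment and ask again. A second sat answer means a second distinct solution exists.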

Why Z3 specifically

Z3 is a SAT/SMT solver from Microsoft Research, the same engine you'd use for hardware verification or for proving a piece of cryptographic code correct. We use a fraction of its capabilities. For a whodunit grid, we declare (sketched in code after this list):

  • For each (suspect, category) pair, an integer variable that indexes into the category’s attribute list.
  • A Distinct constraint per category (Latin-square enforcement).
  • One Boolean expression per clue, translated from our clue vocabulary (e.g., pos(A) < pos(B), Implies(varA, varB)).
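
In z3py that declaration is a handful of lines. A minimal sketch with invented suspect and category names; the two constraints at the end mirror the ordering and implication clue shapes above, standing in for our actual clue vocabulary:

    from z3 import Int, Distinct, Implies, Solver

    SUSPECTS = ["Ash", "Blake", "Cruz", "Drew", "Ellis"]   # illustrative names
    CATEGORIES = ["weapon", "room", "motive", "alibi"]

    solver = Solver()

    # One integer per (suspect, category) pair, indexing into that
    # category's five-attribute list.
    cell = {(s, c): Int(f"{s}_{c}") for s in SUSPECTS for c in CATEGORIES}
    for v in cell.values():
        solver.add(0 <= v, v <= 4)

    # Latin-square enforcement: within a category, no two suspects share
    # an attribute.
    for c in CATEGORIES:
        solver.add(Distinct([cell[(s, c)] for s in SUSPECTS]))

    # Stand-in clues: one ordering clue, one conditional clue.
    solver.add(cell[("Ash", "room")] < cell[("Blake", "room")])
    solver.add(Implies(cell[("Cruz", "weapon")] == 2,
                       cell[("Cruz", "motive")] == 0))

From here, uniqueness is one call to the count_models sketch above: a return value of 1 means the grid is publishable.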

Z3 takes the lot and answers in milliseconds. If we get exactly one model, we lock the puzzle for publishing. If we get zero or two, the puzzle goes to a quarantine table; the AI generator gets one retry; if the retry also fails, we fall back to a hand-curated seed puzzle for that day. (Wave 1 has seven seed-grade EN puzzles per slot in inventory; we have never run out.)
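
Put together, the nightly routing step looks roughly like the sketch below. The four callables (publish, quarantine, regenerate, seed_fallback) are hypothetical stand-ins for pipeline calls we haven't shown here, not our actual API:

    def route_candidate(candidate, solver, cells,
                        publish, quarantine, regenerate, seed_fallback):
        # Route one nightly candidate using the count_models sketch above.
        n = count_models(solver, cells, limit=2)
        if n == 1:
            publish(candidate)                 # verified unique: lock it in
            return
        quarantine(candidate, "unsolvable" if n == 0 else "ambiguous")
        retry = regenerate(candidate)          # the generator gets one retry
        if retry is not None and count_models(retry.solver, retry.cells, limit=2) == 1:
            publish(retry)
        else:
            publish(seed_fallback(candidate))  # hand-curated seed for the slot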

What the solver does not fix

Worth being clear: SAT verification only checks structural correctness. It does not check:

  • Whether the clue text is grammatical (the AI generator is responsible).
  • Whether the cultural anchors are appropriate (human review handles that).
  • Whether the difficulty matches the slot (a separate scoring pass).
  • Whether the puzzle is fun. (No solver does this.)

We sample for the last three. The solver only ensures the deduction works. Without it, none of the other work matters.

What broke before we added Z3

In the first week of pre-launch testing, before we had the SAT step wired up, we shipped a batch of eight AI-generated grids to internal testers. Three were unsolvable (one clue contradicted another). One had four valid solutions. One was genuinely good. Three were ambiguous in subtle ways the testers didn't catch until they peeked at the "official" answer.

That ratio, one clean puzzle in eight, is bad. It is also approximately what you'll see from any production system that publishes AI-generated logic puzzles without a verifier: every "daily AI-generated mystery" site we've audited exhibits at least one of these failure modes in any given week.

After Z3 went in, the publish rate dropped: 16 candidates per night used to ship as-is; now roughly 13 of the 16 pass verification, and the other 3 are rejected, then either retried or replaced by a seed puzzle. The flip side is that every puzzle that reaches a player has a verified unique solution. Mistakes-counter complaints in the first two weeks of soft launch were uniformly player mistakes, not puzzle mistakes. We can stand behind that.

What this means for you, the player

If you solve today's EveryClue grid and the game tells you you're wrong, you're wrong. The puzzle isn't broken. Look at your assignments again: somewhere on the grid, one of your cells is off, and one of the clues you read past directly excludes it.

That’s the bargain. We make sure the puzzle is right; you do the deduction. The whole product is built so we can keep our side of that bargain.