“Vibe Check: Opus 4.7 Stopped Reading Between the Lines” — Katie Parrott (Every, Apr 17 2026)
Why this is in the vault
Opus 4.7 is the model we run the channels-agent on. Every’s day-zero verdict — it’s the strongest coding model on Every’s LFG benchmark, but it stopped doing implicit prompt engineering on the user’s behalf — directly explains a class of failure mode we should expect across every loose-brief skill in the harness. The Alex Albert confirmation reframes 4.7 not as “smarter” but as “less generous with under-specified asks.” Operator playbook attached.
The core argument
- 4.7 cleared Every’s hardest LFG coding benchmark (no 4.6 baseline numbers shown).
- Without detailed briefs, 4.7 hedges, stalls, or guesses wrong — it no longer fills gaps for you.
- Anthropic researcher Alex Albert confirmed the shift: 4.6 was doing meaningful prompt engineering on behalf of users; 4.7 isn’t.
- With detailed briefs (apartment-hunting dashboard with overnight scheduling, PowerPoint generation, consulting-prose drafting), it produced output Parrott rated “better than reading my own.”
- Recommended effort tiers: `max` for benchmarks/architecture, `extra high` (now the Claude Code default) for async work, `high`/`medium` for interactive iteration.
- Operator playbook (“Rewrite the rail this weekend”): tighten prompts with explicit acceptance criteria, constraints, and budgets. Stop relying on inference.
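The recommended tiers could be pinned down as a per-task-type default map rather than re-decided ad hoc. A minimal sketch, assuming hypothetical tier names and an `effort_for` helper; none of these identifiers come from the article or from any real Claude Code setting:

```python
# Hypothetical mapping from task type to reasoning-effort tier,
# following the article's recommended tiers. Illustrative only.
EFFORT_TIERS = {
    "benchmark": "max",        # benchmarks and architecture work
    "architecture": "max",
    "async": "extra_high",     # async/agent work (the new default, per the article)
    "interactive": "high",     # interactive iteration
    "quick_iteration": "medium",
}

def effort_for(task_type: str) -> str:
    """Return the recommended effort tier, defaulting to the async tier."""
    return EFFORT_TIERS.get(task_type, "extra_high")
```

Unknown task types fall back to `extra_high`, matching the async default the article suggests for unattended work.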
Mapping against Ray Data Co
Direct relevance — channels-agent + every loose-brief skill. Our autonomous loop runs Opus 4.7. Skills like /check-board, /process-inbox, watch-mode /process-newsletter, and the dynamic /loop all operate on tasks whose Notion descriptions are often terse. If 4.7 is materially less forgiving of underspecification, we should expect more “stalls” and “guesses wrong” cycles than we saw on 4.6 — and the right fix is prompt-side, not model-side.
Concrete actions this surfaces:
- Audit the channels-agent skill prompts for implicit-spec smell. Anywhere the SKILL.md says “use good judgment” or “infer the right format” — those are exactly the spots 4.6 would have papered over and 4.7 will surface as drift. The `/process-newsletter` skill is already in good shape (explicit step ordering, frontmatter template, copy-paste caution). Higher-risk surfaces: `/process-inbox`, `/check-board`, free-form Discord/iMessage replies.
- Reinforce the “queue work to the board” practice. The feedback_queue_work_to_the_board memory already requires individual actionable items, but tightening the description format (acceptance criteria, scope bound) becomes more important on 4.7. Loose Notion task = stalled cycle.
- Effort-tier discipline. Document which skills should run at max vs extra-high vs high effort. The autonomous loop default should likely be extra-high (Parrott’s recommendation for async). Interactive iMessage replies probably stay at high to keep latency tolerable.
- Re-evaluate model routing. 2026-02-18-every-vibe-check-sonnet-4-6 showed Sonnet 4.6 hits Opus-close intelligence at half the cost. With 4.7 demanding more prompt rigor anyway, the cost-benefit on routing well-specified tasks to Sonnet 4.6 changes. If we’re going to pay the spec-writing tax, we may as well pay it on the cheaper model where appropriate.
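The routing logic in the actions above can be sketched as a spec-completeness gate: well-specified tasks go to the cheaper Sonnet 4.6, under-specified ones stay on Opus 4.7 (or get bounced back for a tighter brief). This is a sketch under assumptions — the keyword heuristic, length threshold, and model ID strings are all illustrative, not anything the article prescribes:

```python
# Phrases suggesting a task description actually bounds the work.
# This keyword heuristic is an assumption, not from the article.
SPEC_MARKERS = ("acceptance criteria", "done when", "scope:", "budget:", "constraints:")

def is_well_specified(description: str) -> bool:
    """Crude check: long enough and carries at least one explicit spec marker."""
    text = description.lower()
    return len(text) >= 80 and any(marker in text for marker in SPEC_MARKERS)

def route_model(description: str) -> str:
    """Route well-specified tasks to the cheaper model; default to Opus."""
    return "sonnet-4.6" if is_well_specified(description) else "opus-4.7"
```

In practice the gate would live wherever the loop dequeues a Notion task, so the spec-writing tax is paid before the expensive model ever spins up.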
Harness-thesis tie-in. Aligns with 2026-04-11-garry-tan-thin-harness-fat-skills: as models become more literal, the skill (the spec) carries more weight. 4.7 is a model that punishes thin skills. This is the same direction as 2026-04-08-better-harness-evals-hill-climbing — eval-able, specified skills win.
Contradiction to flag. 2026-01-21-every-opus-wilkinson-ai-life (Wilkinson on “Opus as life partner”) was written when 4.6’s implicit prompt engineering made the model feel preternaturally aware. That experience may not replicate cleanly on 4.7 without more explicit briefing — the “it just gets me” magic was partly the model doing unpaid spec-writing labor.
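The board-tightening action above could also be enforced mechanically before a task is queued, so loose briefs never reach the loop at all. A sketch, where the required fields and the `lint_task` name are hypothetical choices, not an existing check in the harness:

```python
# Minimum spec fields a queued task should carry; the set is an assumption.
REQUIRED_FIELDS = ("acceptance criteria", "scope")

def lint_task(description: str) -> list[str]:
    """Return missing spec fields; an empty list means the task may be queued."""
    text = description.lower()
    return [field for field in REQUIRED_FIELDS if field not in text]
```

A non-empty result would block the queue-to-board step and prompt for a tighter description instead of letting 4.7 stall on the gap.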
Sources & bias notes
- Author: Katie Parrott, Every staff writer. Same author as the Sonnet 4.6 vibe check (2026-02-18-every-vibe-check-sonnet-4-6). Pattern is consistent — pragmatic operator review with internal Every team’s actual workflow tests.
- Sponsor block: None detected. Article is paywalled with multiple “Subscribers only” sections; we read via WebFetch summary, not full body.
- Bias to flag: Every is a paying Anthropic customer running production tools (Spiral, Cora) on the API. Every has a structural incentive to keep finding “the new model is great if you adapt.” No disclosure issue, but read the recommendations as “from a sophisticated user,” not “from a neutral evaluator.”
- Source quote (Albert): “4.6 had been doing a meaningful amount of prompt engineering on your behalf that 4.7 doesn’t” — paraphrased from second-hand attribution in the article.
Related
- 2026-02-18-every-vibe-check-sonnet-4-6 — prior Vibe Check, same author, frames the cost-tier dynamic
- 2026-04-15-every-claude-managed-agents-mini-vibe-check — Every’s prior week mini-vibe-check on Anthropic Routines
- 2026-02-05-every-codex-vs-opus — competitive frame for Opus
- 2026-04-16-innermost-loop-welcome-apr-16 — current channels-agent context with Opus 4.7
- 2026-04-11-garry-tan-thin-harness-fat-skills — harness-thesis alignment
- 2026-04-08-better-harness-evals-hill-climbing — eval discipline becomes more load-bearing
- internal-review-mg-harness-cc-wrapped-2026-04-13 — Claude Code wrap-up referenced for effort-tier defaults
- 2026-01-21-every-opus-wilkinson-ai-life — possible contradiction: 4.6’s implicit-spec magic may not survive
- feedback_queue_work_to_the_board — tightening Notion task descriptions is more important on 4.7
Per copy-paste caution: paraphrased throughout, no body text pasted, single quote ≤15 words. Article fetched via WebSearch + WebFetch summary because Gmail thread returned snippet only.