“OpenAI GPT-5.5: 82.7% Terminal-Bench, $5/M tokens, live now” — @Lior Alexander (AlphaSignal)
Why this is in the vault
Three concurrent shifts in one issue: (1) GPT-5.5 ships with agentic-coding numbers that approach Claude’s territory at lower API cost, (2) OpenAI ships Workspace Agents — the “shared org-level bot” pattern that competes directly with Claude Skills + the autonomous-loop pattern RDCO is building, (3) Qwen3.6-27B claims to beat its own 397B sibling on coding while running locally on 18GB VRAM. All three change the cost/capability frontier RDCO is assuming for the harness.
⚠️ Sponsorship
- AugmentCode — explicit “Presented by” mid-newsletter ad block (“Your prompts go stale. Your spec shouldn’t”). Third-party paid. Pitches “Intent” — a living-spec product that updates as code changes, advertised as free with Claude Code/OpenCode/Codex. Bias note: AlphaSignal earns ad revenue, not editorial endorsement.
- exe.dev — embedded as Signals slot #2 with “Presented by exe.dev” tag. Third-party paid placement masquerading as a curation item. Treat as ad, not curation signal.
No author-adviser relationships disclosed. Lior Alexander is the AlphaSignal founder writing in his own voice; no cross-promotion of other AlphaSignal properties detected.
Issue contents
Top News (3 items):
- GPT-5.5 launch — 82.7% Terminal-Bench 2.0, 58.6% SWE-Bench Pro, $5/M input tokens, 1M context window, same speed as 5.4 with fewer tokens. Live in ChatGPT and Codex (Plus/Pro/Business/Enterprise); API “coming very soon.” Pitched as the “stop babysitting your AI” model — multi-step agentic coding, planning, tool use, self-verification.
- Workspace Agents in ChatGPT — shared team agents, cloud-resident, free until May 6 2026 then credit-based. Business/Enterprise/Edu/Teachers plans only. Use cases: triage, approval routing, IT tickets, Slack feedback capture, account research, call-transcript summarization (claims 5-6h/week saved per rep).
- Qwen3.6-27B — 77.2 SWE-bench Verified vs 76.2 for the 397B sibling, 59.3 Terminal-Bench 2.0, 18GB VRAM, Apache 2.0, 262K context, native multimodal (image+video), “Thinking Preservation” feature retains reasoning across conversation turns.
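The 18GB VRAM figure for a 27B model is plausible under 4-bit quantization. A back-of-envelope sanity check (the bit-width and overhead factor are my assumptions, not from the issue):

```python
# Rough weight-memory estimate for a quantized LLM.
# overhead (assumed 1.2x) loosely covers KV cache, activations,
# and runtime context on top of the raw weights.
def vram_estimate_gb(params_b: float, bits: int, overhead: float = 1.2) -> float:
    """params_b billions of parameters * bytes per parameter * overhead."""
    return params_b * (bits / 8) * overhead

print(f"27B @ 4-bit: ~{vram_estimate_gb(27, 4):.1f} GB")  # ~16.2 GB
print(f"27B @ 8-bit: ~{vram_estimate_gb(27, 8):.1f} GB")  # ~32.4 GB
```

At 4-bit the weights alone land around 16GB, consistent with the claimed 18GB footprint on a 24GB consumer card; 8-bit would not fit.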
Signals (6 items, including 2 paid):
- Google open-sources AI design guidelines (counter-move to Anthropic locking theirs down)
- exe.dev cloud dev VMs (PAID — skip)
- Open-source doc-to-PowerPoint converter (7,466 stars)
- Recurrent text-embedding model with constant memory (research)
- Unsloth quantized Qwen3.6-27B with fine-tuning + tool calling (131K downloads)
- MIT recursive-model framework, 10M tokens, 58x better than standard models
Mapping against Ray Data Co
Strong mapping — load-bearing for the harness thesis.
- GPT-5.5 pricing pressure on the Claude-Code harness. $5/M input tokens with 1M context puts OpenAI ~30-40% below the Opus-tier rate RDCO is paying for the autonomous loop. If the agentic-coding scores hold (82.7% Terminal-Bench 2.0 is competitive with 4.7’s reported numbers), the harness-vs-model debate Karpathy and Anthropic have been having gets a third data point: OpenAI is now shipping the “model does the harness work” pattern at a lower price point. RDCO’s current bet is that thin harness + Claude wins on judgment quality; this issue is an inflection point worth re-evaluating before the next API-cost review.
- Workspace Agents directly competes with the RDCO autonomous-loop pattern. OpenAI is shipping the “shared org-level bot that runs cloud-resident workflows” as a productized feature — the exact pattern Ray (this agent) implements via cron + Notion + sub-agent fan-out. Two questions: (a) does Workspace Agents subsume what we’re building, making the custom harness redundant? (b) or does the “describe a workflow, ChatGPT guides you to build the agent” pattern miss the depth of vault-grounded judgment we’re optimizing for? Worth a deeper look — this might be a candidate for /cross-check against the harness-thesis cluster.
- Qwen3.6-27B at 18GB VRAM is the local-model story for sensitive workflows. RDCO has flagged a future-state desire to run finance/contact-graph workflows locally. A 27B Apache-2.0 model that beats its 397B sibling on coding and runs on a single consumer GPU (18GB = RTX 4090 territory) makes that future state cheaper than expected. File for the local-model evaluation thread when it surfaces.
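The pricing-pressure claim in the first bullet can be made concrete with a per-run cost sketch. Only the $5/M GPT-5.5 input price comes from the issue; the Opus-tier rate and the per-run token count below are illustrative assumptions:

```python
# Hedged sketch: input-token cost per autonomous-loop run.
GPT55_INPUT_PER_M = 5.00   # $/M input tokens (from the newsletter)
OPUS_INPUT_PER_M = 7.50    # assumed Opus-tier rate, ~30-40% above GPT-5.5

def run_cost(input_tokens: int, price_per_m: float) -> float:
    """Dollar cost of a single run's input tokens at a given $/M rate."""
    return input_tokens / 1_000_000 * price_per_m

tokens = 400_000  # assumed typical input volume for one harness run
gpt = run_cost(tokens, GPT55_INPUT_PER_M)
opus = run_cost(tokens, OPUS_INPUT_PER_M)
print(f"GPT-5.5: ${gpt:.2f}  Opus-tier: ${opus:.2f}  saving: {1 - gpt/opus:.0%}")
# GPT-5.5: $2.00  Opus-tier: $3.00  saving: 33%
```

The absolute numbers are small per run, but the delta compounds across a cron-driven loop that runs many times per day, which is why this belongs in the next API-cost review.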
Threshold-crossing note: GPT-5.5 + Workspace Agents in the same week is the first time OpenAI has shipped an integrated harness+model package that meaningfully threatens the “Claude is the obvious choice for autonomous COO work” assumption. Not a recommendation to switch, but a recommendation to re-examine the assumption before the next quarterly review.
Curation section — notes
- Item 1 (GPT-5.5) — primary OpenAI announcement, third-party domain (openai.com presumably), high relevance. No deep-fetch performed (body had sufficient specifics; budget-capped at 2). Worth a follow-up read of OpenAI’s official benchmark page if/when API pricing is confirmed.
- Item 2 (Workspace Agents) — primary OpenAI announcement, high relevance. Same as above — body had the load-bearing details (free until May 6, Business+ tier, use case taxonomy).
- Item 3 (Qwen3.6-27B) — third-party (Alibaba/Hugging Face). High relevance for local-model evaluation. Body specifics sufficient.
- Signal #1 (Google AI design guidelines) — relevant for design-system thinking but likely a brand-guidelines doc, not technical content. Low priority for deep-fetch.
- Signal #2 (exe.dev) — PAID placement. Skip.
- Signal #3 (doc-to-PowerPoint) — adjacent to RDCO’s “doodle-as-hero” design pipeline but not load-bearing. Skip.
- Signal #4 (recurrent text embeddings) — research-only, far from current RDCO bets. Skip.
- Signal #5 (Unsloth Qwen3.6-27B quantized) — direct follow-on to Top News #3; if the local-model thread activates, this is the deployment artifact. Note for that workflow.
- Signal #6 (MIT 10M-token recursive model) — research-only but interesting for the “context is no longer the bottleneck” framing Lior pushes in the intro. File the framing, skip the deep-fetch.
Deep-fetches performed
None. Cap was 2; the email body itself had sufficient specifics on all three Top News items (benchmark numbers, pricing, availability windows, technical specs). Deep-fetches deferred to the moment a specific RDCO decision needs the official source (e.g. confirming GPT-5.5 API pricing before a harness re-evaluation).
Related
- 2026-04-23-alphasignal-claude-via-nim-llm-fallacy — yesterday’s AlphaSignal issue, also touches harness-vs-model debate
- 2026-04-17-alphasignal-opus-4-7-codex-desktop-control — prior Codex/desktop-control coverage
- 2026-04-15-thariq-claude-code-session-management-1m-context — context-rot reference, relevant to the 1M context window claim
- 2026-04-12-alphasignal-claude-code-leak-harness-engineering — harness-engineering thread
Copyright note
Body summarized and paraphrased from the AlphaSignal email; benchmark numbers and pricing quoted as factual claims attributed to the original sources (OpenAI, Alibaba/Qwen). No paragraph-length copies. AlphaSignal text used under fair-use summary for assessment purposes.