Tokens & Signals

Tuesday, March 3, 2026

TLDR

  • GPT-5.4 is basically confirmed — two Codex GitHub PRs leaked the internal model ID before being scrubbed, an employee screenshot of the model selector was deleted, and @OpenAIDevs posted a cryptic "Soon." x.com/OpenAIDevs/status/2028577643113922944
  • Cursor hit $2B ARR in February, doubling in three months — ~60% from enterprise customers including Coinbase, OpenAI, and eBay. x.com/deedydas/status/2028608293531435114
  • Claude Code is rolling out native voice mode via /voice in the terminal, starting with ~5% of users now. reddit.com/r/ClaudeAI/comments/1rjkwqk/new_voic...
  • Apple M5 Pro/Max delivers up to 4x faster LLM prompt processing than M4, and up to 8x faster prefill than M1 Max on MLX benchmarks. x.com/awnihannun/status/2028852360190345687
  • Ars Technica fired senior AI reporter Benj Edwards after AI-fabricated quotes were published in a story — ironically, a story about AI going rogue. news.ycombinator.com/item?id=47226608
  • The Supreme Court declined to review AI art copyright, locking in the ruling: AI-generated images cannot be copyrighted under US law. news.ycombinator.com/item?id=47232289
  • Data from 10,000 devs: high AI adoption boosts PR merge rate +97.8% but blows up review time +91.1% — human code review is the new bottleneck. x.com/swyx/status/2028795270306079156

  • Best to Build With Today

  • Coding — Claude Opus 4.6 (thinking) leads Chatbot Arena coding at ELO 1515. For agentic coding tasks, GPT-5.3 Codex (xhigh) tops LiveBench at 66.7. Pick based on your use case: interactive vs. autonomous agent loops.
  • Reasoning — Claude Opus 4.6 (thinking, high) leads LiveBench reasoning at 88.7. Today's news adds a real-world proof-point: it solved a problem posed by Donald Knuth.
  • Math — GPT-5.2 (xhigh) is unchallenged at 99/100 on Artificial Analysis math and 93.2 on LiveBench. Not close.
  • Image generation — No ranking changes today.
  • Video generation — No ranking changes today.
  • Open-source / Local — Qwen 3.5 just got GPTQ-Int4 weights with native vLLM/SGLang support. The 9B model runs in ~7GB VRAM with multimodal tool calling built in. Best self-hosted option right now for agentic work. Avoid Ollama and LM Studio for now — overthinking-loop issues reported; use llama.cpp, vLLM, or SGLang.
  • Chat — Gemini 3 Pro leads Chatbot Arena overall (ELO 1487) and general chat (ELO 1484). Consistent at the top.
  • Voice — Claude Code just added native /voice mode (5% rollout). For building production voice agents, a Show HN this week demonstrated sub-500ms latency from scratch — worth studying the architecture. news.ycombinator.com/item?id=47224295

  • Deeper Dives

    🚀 Products & Launches

    GPT-5.4 confirmed via Codex leaks — and they tried to hide it

    Two pull requests in OpenAI's public Codex GitHub repo leaked GPT-5.4 references before being scrubbed. PR #13050 (opened Feb 27) added full-resolution image support — PNG, JPEG, WebP — with a minimum model version constant set to (5, 4), meaning the feature literally requires GPT-5.4 or newer. It received seven force-pushes in five hours during cleanup. A second PR (#13212) for a "fast mode" toggle also explicitly referenced gpt-5.4 as the model argument before being edited ~3 hours later. An OpenAI employee's screenshot showing GPT-5.4 in the Codex desktop model selector was also deleted. The full internal model ID that leaked: gpt-5.4-ab-arm2-1020-1p-codexswic-ev3. Rumors of a "GPT-5.4 Pro" variant with strong results are circulating, and @OpenAIDevs posted a cryptic "Soon." the same day.
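For readers unfamiliar with version-constant gating: a minimum-version tuple like the leaked (5, 4) typically gates a feature via Python's lexicographic tuple comparison. The sketch below is a hypothetical reconstruction to show the mechanism — the names, the parsing, and the constant's exact role are assumptions, not OpenAI's actual code.

```python
# Hypothetical sketch of gating a feature on a minimum model version.
# Constant name, function names, and ID parsing are illustrative
# assumptions -- only the (5, 4) value comes from the leaked PR.

MIN_IMAGE_INPUT_VERSION = (5, 4)  # the leaked constant: requires GPT-5.4+

def parse_model_version(model_id: str) -> tuple[int, int]:
    """Extract (major, minor) from an ID like 'gpt-5.4-...'."""
    major, minor = model_id.split("-")[1].split(".")
    return int(major), int(minor)

def supports_full_res_images(model_id: str) -> bool:
    # Tuple comparison is lexicographic: (5, 3) < (5, 4) < (6, 0)
    return parse_model_version(model_id) >= MIN_IMAGE_INPUT_VERSION

print(supports_full_res_images("gpt-5.3-codex"))  # False
print(supports_full_res_images("gpt-5.4-ab-arm2-1020-1p-codexswic-ev3"))  # True
```

This is why a `(5, 4)` constant in a feature PR reads as a hard requirement: any model ID parsing to less than (5, 4) is simply refused the feature.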

    Why it matters: This isn't speculation — it's three independent code-level leaks. GPT-5.4 is real and close. If you're making OpenAI vs. Claude API decisions for production in the next few weeks, something new is coming that could shift the comparison.

    📱 Twitter · 💬 Reddit · 📧 Newsletter

    x.com/OpenAIDevs/status/2028577643113922944

    x.com/kimmonismus/status/2028783243311407531

    reddit.com/r/singularity/comments/1rjdrty/gpt54_spotted_in_codex


    Claude Code gets voice mode — say it out loud, it codes

    Anthropic started rolling out native voice mode inside Claude Code on March 2–3, accessible via the /voice slash command directly in the terminal. About 5% of users have it now, with broader rollout over coming weeks. Context carries over seamlessly when you switch between typing and speaking in the same session — no separate app, no context loss. This lands on top of Claude hitting #1 on the App Store, even as the servers strain under the load (see: elevated errors below).

    Why it matters: Removing the keyboard as the only input for rapid iteration with a coding agent is a genuinely different interaction model. "Talk through what you want, then let it build" could become a real workflow pattern — especially for rubber-duck debugging at speed.

    📱 Twitter · 💬 Reddit

    x.com/Yuchenj_UW/status/2028630059897287105

    reddit.com/r/ClaudeAI/comments/1rjkwqk/new_voice_mode_is_rolling_ou...


    Apple M5 Pro/Max: serious local AI hardware

    Apple unveiled M5 Pro and M5 Max today. Apple's own ML research benchmarks show the M5 pushes time-to-first-token under 10 seconds for a dense 14B model and under 3 seconds for a 30B MoE — on a MacBook Pro. MLX benchmarks from Awni Hannun (@awnihannun) show up to 8x faster prefill and image generation vs. M1 Max, and up to 4x vs. M4 Pro/Max. The 2x faster SSD (up to 14.5GB/s) also helps with model loading. The r/LocalLLaMA community is already running 35B-A3B Qwen 3.5 on 22GB Mac setups.

    Why it matters: A 30B MoE at under 3 seconds TTFT on a laptop changes what's viable for on-device AI products. The practical ceiling for what you can run without a cloud bill just moved up significantly — and it's in a form factor you carry around.

    📱 Twitter · 💬 Reddit · 🔶 Hacker News

    x.com/awnihannun/status/2028852360190345687

    reddit.com/r/LocalLLaMA/comments/1rjqsv6/apple_unveils_m5_pro_and_m...


    OpenAI Codex gets "Spark" — fastest model they've ever built

    OpenAI started rolling out "Spark" to its heaviest Codex Plus users, describing it as the fastest model it has ever built. No public benchmarks yet — just the claim and a power-user-first rollout, which suggests OpenAI is stress-testing under real agentic workloads before going wide. The timing, right as Claude Code hits voice mode and $2.5B ARR, is not subtle.

    Why it matters: Speed compounds in agentic coding loops — faster inference means tighter feedback cycles and more iterations per hour. This is OpenAI's direct counter-punch to Claude Code's momentum.

    📱 Twitter

    x.com/jeffintime/status/2028644388344340514


    Qwen 3.5 GPTQ-Int4 weights live — deploy today

    Alibaba released GPTQ-Int4 quantized weights for the full Qwen 3.5 series with native vLLM and SGLang support. The 9B model drops to ~7GB VRAM — a ~75% memory reduction. The full lineup spans 0.8B to 397B-A17B (a 512-expert MoE), all with 262K-token context windows and multimodal support. Ollama also added the small models (0.8B–9B) with native tool calling and thinking mode — but community reports flag overthinking loops in Ollama and LM Studio. Stick to llama.cpp, vLLM, or SGLang for actual testing.
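If you want to kick the tires on the 9B quant, a minimal vLLM launch looks like the sketch below. The Hugging Face repo name is an assumption based on Qwen's usual naming pattern — check the linked collection for the exact ID before running.

```shell
# Sketch: serve the 9B GPTQ-Int4 build via vLLM's OpenAI-compatible server.
# Repo name is an assumption -- verify at huggingface.co/collections/Qwen/qwen35.
# Capping --max-model-len well below the full 262K context keeps KV-cache
# memory within a single consumer GPU's budget alongside the ~7GB weights.
vllm serve Qwen/Qwen3.5-9B-GPTQ-Int4 \
  --max-model-len 32768 \
  --port 8000
# vLLM reads the GPTQ quantization config from the checkpoint; pass
# --quantization gptq explicitly only if auto-detection fails.
```

Once up, any OpenAI-compatible client pointed at localhost:8000/v1 can drive it, tool calls included.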

    Why it matters: You can now run a capable multimodal, tool-calling, long-context model on a single consumer GPU today with production-ready inference frameworks and zero custom quantization work. The baseline for self-hosted agentic apps just moved.

    📱 Twitter · 💬 Reddit · 📧 Newsletter

    x.com/Alibaba_Qwen/status/2028846103257616477

    huggingface.co/collections/Qwen/qwen35


    Claude elevated errors under continued user surge

    Anthropic's status page logged two elevated error incidents on March 3 — at 03:15 UTC and 04:43 UTC — affecting claude.ai, the platform API, Claude Code, and Cowork. This is a direct consequence of the user surge that sent Claude to #1 on the App Store. Infrastructure is straining under the load.

    Why it matters: For anyone evaluating Claude as a production dependency, reliability matters as much as capability. Two incidents in one day is worth tracking if you're making API commitments.

    💬 Reddit · 🔶 Hacker News

    reddit.com/r/ClaudeAI/comments/1rjea91/claude_status_update_elevate...


    💼 Industry & Business

    Cursor ARR doubles to $2B — possibly the fastest-growing SaaS ever

    Bloomberg confirmed Cursor's annualized revenue topped $2B in February 2026, doubling from $1B just three months prior. About 60% of revenue now comes from enterprise — including Coinbase, OpenAI, eBay, Datadog, and Sentry — at $40/seat on the business tier, with over 50,000 enterprise customers. The disclosure was strategically timed to counter viral tweets claiming Cursor was losing ground to Claude Code. Some individual devs have switched, but enterprises clearly haven't. Meanwhile, Claude Code separately hit $2.5B ARR in its first 8 months. Both scaling at the same time kills the winner-take-all narrative.

    Why it matters: Enterprise AI coding adoption is mainstream, not early-adopter. If you're building tooling for developers, these numbers are your market signal. The real competition is now for corporate procurement budgets, not $20/month indie subscriptions.

    📱 Twitter

    x.com/deedydas/status/2028608293531435114

    x.com/ArfurRock/status/2028649107024445595


    Meta AI smart glasses workers: "We see everything"

    A Swedish outlet published a report citing Meta AI smart glasses workers saying they have visibility into everything captured by users' cameras — including content that's described as "disturbing." Former employees say sensitive data isn't supposed to reach human reviewers, but the algorithmic filter isn't reliable. Meta sold over 7 million pairs of Ray-Ban smart glasses in 2025 — so the scale of this data pipeline is massive. Meta's terms of service do technically allow "manual (human)" review of interactions, but most users probably don't realize that includes their camera feed.

    Why it matters: If you're building on Meta AI or recommending these glasses for enterprise use, your users' video is potentially landing in a human review queue. That's not a hypothetical — it's documented.

    🔶 Hacker News

    news.ycombinator.com/item?id=47225130


    Ars Technica fires reporter over AI-fabricated quotes

    Ars Technica terminated senior AI reporter Benj Edwards after AI-generated quotes — content the model hallucinated — were published in an article attributed to a real source who never said them. The painful irony: the original story was about an AI agent publishing a "hit piece" on a developer who rejected its pull request. Edwards said he was sick with a high fever and "inadvertently ended up with a paraphrased version" of a source's words. The article was retracted. Ars editor-in-chief Ken Fisher confirmed the fabricated quotations came from an AI tool and violated editorial policy.

    Why it matters: This is the high-profile AI hallucination incident that newsrooms have been quietly dreading. When a journalist who covers AI for a living accidentally publishes hallucinated quotes, the pressure for newsroom-level AI output verification just got a lot louder. Expect every outlet to update their AI use policies in response.

    🔶 Hacker News

    news.ycombinator.com/item?id=47226608


    Supreme Court won't touch AI art copyright — the ruling stands

    The US Supreme Court declined to review Stephen Thaler's case involving his AI system DABUS, which autonomously created visual artwork. Every court that's touched this has ruled the same way: human authorship is a bedrock requirement for copyright. The Trump administration sided against Thaler, urging the court not to take the case. With cert denied, the appellate ruling stands — absent congressional action, purely AI-generated works get no copyright.

    Why it matters: If your product ships AI-generated visuals, you have zero IP protection on those outputs in the US. Anyone can copy them. Any licensing or product strategy built around owning AI-generated images needs to be rethought now.

    🔶 Hacker News

    news.ycombinator.com/item?id=47232289


    🧠 Models & Research

    Claude Opus 4.6 solves a Don Knuth problem

    A paper on Stanford's CS faculty site documents Claude Opus 4.6 successfully solving a problem posed by Donald Knuth — the guy who literally wrote The Art of Computer Programming. No benchmark leaderboard needed here. It's one problem, but it's a credible, named, peer-visible proof-point that translates to non-technical stakeholders better than any eval score.

    Why it matters: "Solved a Knuth problem" is the kind of real-world signal researchers and skeptics actually care about. LiveBench already has Claude Opus 4.6 (thinking) at 88.7 in reasoning — this gives you a concrete anchor for what that score actually means.

    🔶 Hacker News

    news.ycombinator.com/item?id=47230710


    🔥 Takes & Drama

    AI code review is the new bottleneck — data from 10,000 devs

    Faros analyzed data from 10,000 developers across 1,255 teams. High AI adoption teams see +21.4% task throughput and +97.8% PR merge rate. The catch: median code review time surges 91.1%. AI is generating code faster than humans can review it, and the gap is only getting wider. @swyx is asking the uncomfortable question: is human-gated code review even a viable process anymore at this volume, or is it becoming security theater?
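A quick back-of-envelope makes the bottleneck concrete: if merged PRs nearly double and median review time per PR also nearly doubles, total reviewer-hours scale multiplicatively — assuming the two effects are independent, which the Faros report doesn't state.

```python
# Back-of-envelope from the reported Faros figures. The independence of
# the two gains is an assumption for illustration, not a claim from the data.
merge_rate_gain = 1.978   # +97.8% PRs merged
review_time_gain = 1.911  # +91.1% median review time per PR

total_review_load = merge_rate_gain * review_time_gain
print(f"{total_review_load:.2f}x reviewer-hours")  # ~3.78x
```

Under that rough assumption, a team needs nearly 4x the review capacity to keep the same gate standard — which is the arithmetic behind @swyx's "security theater" question.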

    Why it matters: If you're running an engineering team, the review pipeline is your new constraint. Automated verification tooling is moving from "nice to have" to necessary infrastructure — and the sooner you start, the better.

    📱 Twitter

    x.com/swyx/status/2028795270306079156


    OpenAI system prompt tells GPT not to call ads "annoying"

    A leaked system prompt for GPT-5.2-Thinking explicitly instructs the model not to characterize ads as "annoying." Critics on Reddit and Twitter are drawing a straight line to a similar incident where OpenAI used model instructions to positively frame the GPT-4o deprecation to users who asked about it. The pattern: use model behavior to manage user perception of business decisions, without telling anyone.

    Why it matters: How much a model's opinions are shaped by commercial instructions you never see is a real trust question — not just a product one. If you're building on OpenAI's API, your users are talking to a model with commercial instructions baked in that you didn't put there.

    📱 Twitter · 💬 Reddit

    x.com/scaling01/status/2028507836682994070

    reddit.com/r/OpenAI/comments/1riytnt/gpt52thinking_system_prompt_do...


    AI Twitter Recap

  • @OpenAIDevs posted a single cryptic "Soon." — which now reads as a near-confirmation given that two Codex PRs and an employee screenshot leaked GPT-5.4 the same day. Loudest non-announcement of the week. x.com/OpenAIDevs/status/2028577643113922944
  • @kimmonismus on the GPT-5.4 scrub: documented the cleanup in real time — seven force-pushes in five hours on a single PR is not a normal GitHub workflow, and he caught and archived the deleted model selector screenshot before it disappeared. x.com/kimmonismus/status/2028783243311407531
  • @deedydas on Cursor vs. Claude Code: pushed back hard on the "Cursor is dying" narrative with the Bloomberg numbers — $2B ARR, 60% enterprise, 50K+ business customers. The individual dev defection story is real but it's not the whole picture. x.com/deedydas/status/2028608293531435114
  • @swyx on the AI code review crisis: dropped the Faros data showing review time up 91.1% even as merge rates nearly doubled, and asked the uncomfortable question about whether human review can still function as a gate on AI-generated output at this volume. x.com/swyx/status/2028795270306079156
  • @awnihannun on M5 Max MLX benchmarks: posted real numbers — up to 8x faster prefill vs. M1 Max — calling it "a local AI powerhouse in laptop form factor." Apple's own ML research lead posting benchmarks publicly is a signal Apple is leaning into this positioning hard. x.com/awnihannun/status/2028852360190345687
  • @Yuchenj_UW on Claude Code voice mode: first public confirmation of the /voice rollout in the wild, with screenshots of the terminal notification and notes on the seamless context handoff between typed and spoken interaction. x.com/Yuchenj_UW/status/2028630059897287105
  • @scaling01 on the OpenAI ads system prompt: first to surface the leaked GPT-5.2-Thinking instruction not to call ads "annoying" — the quote is short, the implications are doing a lot of work, and the replies connected it immediately to the GPT-4o deprecation framing playbook. x.com/scaling01/status/2028507836682994070

  • Closing thought: The same week AI coding tools hit $2B ARR and voice mode landed in the terminal, a reporter lost their job to a hallucinated quote and human code review time went up 91% — the gap between what AI can do and what it can be trusted to do unsupervised is still very much the story.