Tax automation stole the show this week. OpenAI built self-improving tax agents that hit 97% accuracy in actual production use, piloted with PwC inside OpenAI’s own finance team. At the same time, Hugging Face dropped a benchmark showing most AI agents can’t crack 50% on real enterprise IT work. Both things are true. Progress isn’t linear.
OpenAI’s Tax Agents Learn From Their Own Mistakes
OpenAI partnered with Thrive on tax agents built on Codex that do something interesting: they catch their own errors and adjust without human hand-holding. We’re talking treasury workflows, tax filings, financial reporting — domains where mistakes trigger audits and penalties.
The PwC pilot isn’t a press release demo. It’s running inside OpenAI’s finance org right now. If the 97% accuracy figure holds, these agents are ready for work that previously required senior accountants.
The Benchmark Nobody Wanted To See
Hugging Face and IBM Research released ITBench-AA, testing AI agents on actual enterprise IT tasks. GPT-4, Claude, Gemini — all scored below 50%.
Here’s what this tells us: chat interfaces make models sound capable. Production systems reveal what they can actually execute without breaking things. We’re still closing that gap.
Copilot Moves Into Windows’ Taskbar
Microsoft confirmed Ask Copilot lands in the Windows 11 taskbar mid-2026. Instead of launching an app, AI becomes part of the OS chrome. Computer-using agents are generally available now, alongside redesigned workflows and real-time voice.
This is an infrastructure play, not a feature play. Microsoft wants AI where your hands already are.
NVIDIA Upgrades The Plumbing
CUDA 13.3 shipped with tile programming in C++, compiler autotuning, and Python updates. While everyone chases model benchmarks, NVIDIA’s building the systems where GPUs, networking, and storage actually work together.
The next phase’s winners won’t just have better models. They’ll have better systems.
Quick Hits
Anthropic published containment engineering docs (May 25) on how to limit blast radius as Claude gets more capable across claude.ai, Claude Code, and Cowork. Boring work. Critical work.
Google DeepMind continues the Gemini Omni rollout. Multimodal in one architecture: text, images, audio, and video handled together instead of frankensteined. Flash model card dropped May 19.
xAI opened the Grok Build CLI beta to SuperGrok and X Premium Plus subscribers. Terminal-based coding agent, runs locally. Worth testing if you live in tmux.
Mistral’s Medium 3.5 (early May) is still their flagship: 128B params, 77.6% on SWE-Bench Verified. Quiet week.
Meta AI: nothing new. Llama 4 from April 2025 remains current. Could be strategic silence. Could just be quiet.
Rundown for May 28, 2026. Sources: Anthropic, OpenAI, DeepMind, Meta, Microsoft, Mistral, Hugging Face, NVIDIA, xAI.