AI News Daily | May 28 - OpenAI Tax Agents, Hugging Face Benchmark & Windows Copilot Lead Today

Tax automation stole the show this week. OpenAI built self-improving tax agents that hit 97% accuracy in actual production use with PwC inside its own finance team. At the same time, Hugging Face dropped a benchmark showing most AI agents can’t crack 50% on real enterprise IT work. Both things are true. Progress isn’t linear.

OpenAI’s Tax Agents Learn From Their Own Mistakes

OpenAI partnered with Thrive on tax agents built on Codex that do something interesting: they catch their own errors and adjust without human hand-holding. We’re talking treasury workflows, tax filings, financial reporting — domains where mistakes trigger audits and penalties.

The PwC pilot isn’t a press release demo. It’s running inside OpenAI’s finance org right now. Ninety-seven percent accuracy means these agents are ready for work that previously required senior accountants.

The Benchmark Nobody Wanted To See

Hugging Face and IBM Research released ITBench-AA, testing AI agents on actual enterprise IT tasks. GPT-4, Claude, Gemini — all scored below 50%.

Here’s what this tells us: chat interfaces make models sound capable. Production systems reveal what they can actually execute without breaking things. We’re still closing that gap.

Copilot Moves Into Windows’ Taskbar

Microsoft confirmed Ask Copilot lands in the Windows 11 taskbar mid-2026. Instead of launching an app, AI becomes part of the OS chrome. Computer-using agents are generally available now, plus redesigned workflows and real-time voice.

This is an infrastructure play, not a feature play. Microsoft wants AI where your hands already are.

NVIDIA Upgrades The Plumbing

CUDA 13.3 shipped with tile programming in C++, compiler autotuning, and Python updates. While everyone chases model benchmarks, NVIDIA’s building the systems where GPUs, networking, and storage actually work together.

Next phase winners won’t just have better models. They’ll have better systems.

Quick Hits

Anthropic published containment engineering docs (May 25) — how to limit blast radius when Claude gets more capable across claude.ai, Claude Code, Cowork. Boring work. Critical work.

Google DeepMind continues Gemini Omni rollout. Multimodal in one architecture — text, images, audio, video handled together instead of frankensteined. Flash model card dropped May 19.

xAI opened Grok Build CLI beta to SuperGrok and X Premium Plus. Terminal-based coding agent, runs local. Worth testing if you live in tmux.

Mistral’s Medium 3.5 (early May) is still their flagship: 128B params, 77.6% on SWE-Bench Verified. Quiet week.

Meta AI: nothing new. Llama 4 from April 2025 remains current. Could be strategic silence. Could just be quiet.

Rundown for May 28, 2026. Sources: Anthropic, OpenAI, DeepMind, Meta, Microsoft, Mistral, Hugging Face, NVIDIA, xAI.
