2026.07: Three tools that own tasks end-to-end (and what that means for your team)
This week, three signals point in the same direction: the tools are getting serious about doing real work, not just suggesting it.
Anthropic shipped a new flagship AI model that can hold an entire codebase in its head. OpenAI released a coding agent that owns tasks end-to-end instead of just writing snippets. And Microsoft updated Power Platform to put AI copilots and agents directly inside the business apps your team already uses.
The pattern? AI just moved from "helpful assistant" to "semi-autonomous coworker." The companies that figure out where to let it own whole workflows, not just answer questions, will pull ahead fast.
Let's break it down.
Signal:
Signal One: Anthropic's Opus 4.6 Lets You Feed It an Entire Codebase and Get Useful Answers Back.
Anthropic released Claude Opus 4.6 with a 1 million token context window (in beta). That means it can ingest an entire codebase, a multi-year regulatory filing, or a full patent family, and reason across all of it without you chopping it into pieces first. It also ships with better agent capabilities: it can split work into subtasks, run tools in parallel, and keep multi-step workflows moving with less hand-holding. On a 1M-token retrieval benchmark, it scores 76% vs. ~18.5% for the previous version, meaning it actually finds what you need in large documents instead of drifting.
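If you want to see what "no chunking" looks like in practice, here's a minimal sketch using Anthropic's Python SDK: one request carrying a whole repo. The model ID and the long-context beta flag below are placeholders I'm assuming for illustration, so check the current Anthropic docs for the real identifiers before running it.

```python
# Rough sketch: one request, whole repo, no chunking.
# ASSUMPTIONS: the model ID and the beta flag are placeholders, not confirmed
# identifiers; look them up in Anthropic's docs before running this.
import pathlib
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Concatenate every Python file in the repo into a single prompt.
repo = pathlib.Path("path/to/your/repo")
corpus = "\n\n".join(
    f"### {path}\n{path.read_text(errors='ignore')}"
    for path in sorted(repo.rglob("*.py"))
)

response = client.beta.messages.create(
    model="claude-opus-4-6",          # placeholder model ID
    betas=["context-1m-2026-02-01"],  # placeholder flag for the 1M-token beta
    max_tokens=2048,
    messages=[{
        "role": "user",
        "content": corpus + "\n\nWhere is retry logic duplicated across modules?",
    }],
)
print(response.content[0].text)
```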
Signal Two: OpenAI's GPT-5.3-Codex Turns "AI for Coding" Into an Actual Software Worker.
OpenAI merged its coding model with its reasoning model into GPT-5.3-Codex. It runs 25% faster than the previous version and uses fewer tokens for the same tasks. The difference from earlier models: this one is built for long-running, tool-using tasks. It can plan, execute, self-check, and keep going across terminals, repos, and environments. OpenAI used it internally to debug its own training runs. It's rolling out inside GitHub Copilot, and early reports say it produces fewer half-baked fixes and handles repo-scale reasoning better, especially for bug-hunting and refactors.
Signal Three: Microsoft Power Platform's February Drop Puts AI Agents Inside Your Business Apps.
Microsoft's February 2026 update pushes Copilot and agents deeper into Power Platform. M365 Copilot chat is now embedded inside model-driven apps (preview), so it can reason over your app data plus docs, email, and collaboration content. A new MCP Server lets agents use app capabilities as tools, starting with data entry from unstructured content into forms. There's a shared feed so humans can supervise, compare, and approve agent actions before they go live. And "Code Apps" hit general availability, meaning dev teams can host React or Vue apps as governed Power Apps assets.
Scale:
Scale One: Anthropic's Opus 4.6 Lets You Feed It an Entire Codebase and Get Useful Answers Back.
Start Here: Pick one recurring research or analysis task where your team currently preps documents for AI. Feed the full document set into Opus 4.6 without chunking. Compare the output quality to your current pipeline. The beta 1M context is available on the Claude Developer Platform. Start with read-only analysis, not anything that writes back to your systems. Have a subject-matter expert compare AI answers to known-good answers on 3-5 test cases before you trust it on new questions. Track prep time eliminated, answer accuracy vs. your current process, and time-to-answer for 30 days. If it's faster and accurate, start moving more document-heavy tasks over.
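Here's one way to structure that known-good comparison so the expert review doesn't live in someone's head. It's a sketch only: `ask_model` stands in for whatever call you end up using (the long-context request sketched above, for example), and the test case is made up.

```python
# Sketch of the 3-5 test-case comparison described above. Nothing here is an
# official evaluation tool; `ask_model` is a stub you replace with your call.
import csv
import time

TEST_CASES = [
    {"question": "What is the notice period in the 2024 supplier agreement?",
     "known_good": "90 days, per section 12.3"},  # illustrative example only
    # ...add your remaining test cases here
]

def ask_model(question: str) -> str:
    # Replace with your actual Opus 4.6 call.
    return "MODEL ANSWER GOES HERE"

with open("opus_eval_log.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["question", "known_good", "model_answer", "seconds"])
    for case in TEST_CASES:
        start = time.monotonic()
        answer = ask_model(case["question"])
        writer.writerow([case["question"], case["known_good"], answer,
                         round(time.monotonic() - start, 1)])

# The subject-matter expert marks each row correct / partial / wrong before
# you trust the model on new questions.
```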
Scale Two: OpenAI's GPT-5.3-Codex Turns "AI for Coding" Into an Actual Software Worker.
Start Here: Pick 3-5 bugs from your backlog that have clear reproduction steps and existing test coverage. Point GPT-5.3-Codex at them one at a time. Review every PR before merging. Treat this like onboarding a junior developer: you check everything. Only use it on code with strong test suites. No production-critical systems on the first pass. Keep a developer in review mode for every change. Set up CI gates so nothing merges without passing tests. Track time-to-fix per bug (agent vs. human), PR quality (revision rate), and developer time spent reviewing vs. writing. Run this for 2 weeks before expanding to more complex tasks.
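A spreadsheet works fine for that tracking, but if you'd rather script it, here's a rough sketch of the comparison. The records and field names are made up; pull the real numbers from your issue tracker or PR history.

```python
# Sketch of the agent-vs-human tracking described above. Field names and
# records are illustrative, not tied to any particular tracker.
from statistics import mean

fixes = [
    # who: "agent" or "human"; hours until the fix merged; review revisions requested
    {"bug": "BUG-101", "who": "agent", "hours_to_fix": 1.5, "revisions": 2},
    {"bug": "BUG-102", "who": "human", "hours_to_fix": 3.0, "revisions": 1},
    {"bug": "BUG-103", "who": "agent", "hours_to_fix": 0.8, "revisions": 0},
]

for who in ("agent", "human"):
    rows = [r for r in fixes if r["who"] == who]
    if rows:
        print(f"{who}: avg {mean(r['hours_to_fix'] for r in rows):.1f}h to fix, "
              f"avg {mean(r['revisions'] for r in rows):.1f} review revisions")
```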
Scale Three: Microsoft Power Platform's February Drop Puts AI Agents Inside Your Business Apps.
Start Here: If you're on Microsoft's stack, enable M365 Copilot chat in one model-driven app where your team already works daily. Pick a read-heavy use case first, like answering questions about existing records, not writing new ones. Assign someone to monitor the agent feed daily for the first 30 days. Define which actions the agent can suggest vs. which require human approval. Assume Copilot coverage is patchy right now. Canvas apps and Power Pages don't have this yet. Track how often the team uses Copilot in-app vs. going back to their old method, quality of answers (spot-check 10% weekly), and operational cost of monitoring the agent feed. If the supervision cost is higher than the time saved, narrow the use case.
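For the weekly spot-check, the simplest version is a random sample you can hand to a reviewer. The sketch below assumes you can export the week's Copilot interactions to a CSV with question and answer columns; that export and those column names are my assumptions, not a documented Power Platform feature.

```python
# Sketch of the 10% weekly spot-check. ASSUMPTION: you have some way to export
# the week's Copilot interactions to CSV; the file and column names are made up.
import csv
import random

SAMPLE_RATE = 0.10

with open("copilot_interactions_week.csv", newline="") as f:
    interactions = list(csv.DictReader(f))

sample = random.sample(interactions, max(1, int(len(interactions) * SAMPLE_RATE)))

with open("spot_check_queue.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["question", "answer", "verdict"])
    writer.writeheader()
    for row in sample:
        writer.writerow({"question": row["question"], "answer": row["answer"],
                         "verdict": ""})  # reviewer fills in correct / wrong
```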
Deep Dive:
Your AI Agent Has a UX Problem. What Apple's Research on Computer Use Agents Means for Your Business.
Last week I watched a demo where an AI agent booked a flight, reserved a hotel, and rented a car. All by itself. Then it bought the wrong insurance, upgraded to a suite nobody asked for, and sent a confirmation to the wrong contact. The technology worked. The experience around it didn't.
A new Apple/Carnegie Mellon research paper puts structure around what's been obvious to anyone actually deploying these tools: the hard part isn't getting AI to click buttons. It's making sure humans can work with these things without losing control of their own business.
This deep dive breaks down the four areas that matter most, from how you talk to agents, to what they should show you, to where they need to pause and ask permission, and gives you a practical framework for evaluating any agent tool before you hand it to your team.
Thanks for reading!
I'd love to hear which of these three signals hit closest to home for you. Reply and let me know what you're testing, what's working, or what still feels like vendor noise.
See you next Friday.
P.S. Three different companies, three different tools, same lesson: AI that lives inside the work beats AI that lives in a separate tab. Pick one workflow where the data is already digital and the process is already documented. That's your starting point.