Category: AI Adoption & Strategy
Welcome to the Cutting Edge News!
Last week, we talked about how AI is learning to operate in 3D space. This week, the updates are more grounded, more usable, and, honestly, a bit alarming for the workforce. AI is now passing engineering exams, reading your messy legacy code for you, and forming “digital boardrooms” to make decisions.
But here’s what nobody’s talking about: while everyone is chasing the smartest chatbot, the real winners are the ones building systems and workflows that let multiple AIs work together to do what humans used to do.
This Week’s Lineup
Claude Opus 4.5: Anthropic’s new model beats human engineers on internal tests (and it’s 67% cheaper).
Google Code Wiki: An AI that reads your entire codebase and writes the documentation for you.
Karpathy’s LLM Council: Why relying on one AI is a mistake, and how to build a “Digital Boardroom.”
HumaneBench: Is AI making humans better or worse?
Rapid Fire: Genesis Mission, Iceberg report, FLUX.2 and more…
Let’s get started.
Claude Opus 4.5: AI Scores Higher Than Humans

Anthropic just dropped Claude Opus 4.5, and this isn’t another incremental update. This model scored higher than any human candidate on Anthropic’s internal engineering exam: a two-hour test designed to screen performance engineers.
The Numbers:
The Score: It achieved 80.9% on the SWE-bench Verified test (a strict coding exam), beating OpenAI’s best model (77.9%) and Google’s Gemini (76.2%).
The Price Cut: It costs $5 (approx. ₹450) per million input tokens. That is a 67% price drop from the previous version.
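At that rate, input cost is simple arithmetic. A minimal sketch, assuming the article’s quoted $5 per million input tokens (the ~$15 prior price is inferred from the stated 67% cut; real bills also include output tokens, which are priced separately):

```python
# Back-of-envelope input-token cost at the article's quoted rate.
# Numbers are illustrative, not an official pricing table.

def input_cost_usd(tokens: int, price_per_million: float = 5.0) -> float:
    # Multiply before dividing to keep the arithmetic exact for round figures.
    return tokens * price_per_million / 1_000_000

# Feeding a 100k-token codebase as input costs about fifty cents:
print(input_cost_usd(100_000))        # 0.5
# The same call at the inferred old ~$15 rate would have cost three times as much:
print(input_cost_usd(100_000, 15.0))  # 1.5
```

The point of the 67% cut is visible in the last two lines: workflows that feed entire codebases into the model, which used to be priced like a luxury, now cost pocket change per run.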
But here’s what makes this different: it’s not just about benchmarks. Early testers consistently reported that Opus 4.5 demonstrates improved judgment and intuition across diverse tasks.
This is a strategic move. Anthropic is making frontier capabilities accessible while forcing competitors to match both performance and pricing.
What is “SWE-bench”? Think of it as the “JEE Advanced” for AI coders: a high-complexity exam that demands real reasoning. Most benchmarks just ask the AI to write a simple function. SWE-bench asks it to look at a massive, messy codebase, find a bug, and fix it without breaking anything else. Scoring 80% means it’s not just “helping” write code; it’s acting like a senior engineer.

Why this matters: While the test doesn’t measure collaboration, communication, or the instincts engineers develop over years of experience, the result raises questions about how AI will change engineering as a profession. We’re not talking about AI “helping” engineers anymore. We’re talking about AI doing the work at a level that beats most humans (starting with freshers).
The companies adapting fastest aren’t asking “should we use AI?” They’re redesigning entire workflows around what AI can now do: an AI-first architecture.
What you should do:
If you’re a developer, test Opus 4.5 against your hardest problems
If you’re building products, calculate how much time you’d save with AI handling complete coding workflows
If you’re a student, understand that “being good at coding” is becoming “being good at directing AI to code”
Google Code Wiki: The Google Maps for Your Codebase

Google is launching Code Wiki, a tool that uses their Gemini AI to read your entire software project and automatically create a “How To” manual for it.
The Problem: Most developers spend 60-75% of their time reading existing code to understand how it works, rather than writing new features. And documentation (the instruction manual for the code) is usually outdated because no one has time to update it.
How Code Wiki works:
Always Fresh Documentation: You don’t write the manual. The AI generates it automatically every time you update the code.
Visual Maps: It draws dynamic diagrams of how your software is connected. If you change a dependency, the diagram updates instantly.
What it means for us:
Faster Onboarding: When a new developer joins the team, they usually take months to understand the system. This tool aims to cut that time significantly by explaining the “why” behind the code.
Legacy Code: Many companies manage massive, old software systems for clients. This tool acts like an “archaeologist,” helping you understand messy, old code without breaking it.
Our Take: Documentation used to be a manual chore. Now it’s a repetitive, mundane task that can be automated and kept current on the go. If your docs are written by humans, they are already obsolete.
Karpathy’s “LLM Council”: Why One AI Isn’t Enough

Andrej Karpathy (formerly at OpenAI and Tesla) released a project called the LLM Council. Instead of asking one AI a question, he built a system where multiple different AIs (like ChatGPT, Claude, and Gemini) discuss the answer together before showing it to you.
The Concept: Think of your current AI usage like asking a single smart intern for advice. They might be right, or they might confidently make something up (hallucinate). Karpathy’s approach is like a “Digital Boardroom.”
The Team: One AI comes up with ideas, another checks for safety, and a third fact-checks the data.
The Boss: A final “Chairman” AI looks at everyone’s input and writes the best possible answer.
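The boardroom workflow above can be sketched as a small pipeline. This is a minimal sketch of the idea, not Karpathy’s actual code: `query_model()` is a hypothetical stub standing in for real vendor API calls (OpenAI, Anthropic, Google, etc.).

```python
# "LLM Council" sketch: several models draft answers independently,
# then a chairman model reviews all drafts and writes the final reply.
# query_model() is a stub; swap in real API calls per provider.

def query_model(model: str, prompt: str) -> str:
    canned = {
        "gpt": "Draft answer from GPT.",
        "claude": "Draft answer from Claude.",
        "gemini": "Draft answer from Gemini.",
        "chairman": "Synthesized final answer.",
    }
    return canned[model]

def llm_council(question: str, members: list[str], chairman: str) -> str:
    # Stage 1: each council member answers independently (fast drafts).
    drafts = {m: query_model(m, question) for m in members}

    # Stage 2: the chairman sees every draft side by side, so it can
    # spot where the models disagree before committing to an answer.
    review = question + "\n\nCouncil drafts:\n" + "\n".join(
        f"- {m}: {a}" for m, a in drafts.items())
    return query_model(chairman, review)

print(llm_council("Is this claim true?", ["gpt", "claude", "gemini"], "chairman"))
```

The design choice doing the work here is stage 2: hallucinations that slip past one model are unlikely to appear identically in all three drafts, so disagreement itself becomes a signal the chairman can act on.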
AI Index: System 1 vs. System 2 Thinking:
System 1 (Fast): Chatbots like ChatGPT give you the first answer that pops into their “head.” It’s fast but prone to errors.
System 2 (Slow): The “Council” forces the AI to pause, debate, and verify before speaking. It’s slower, but much more accurate.
Why this matters: We are moving away from trusting a single AI model to do everything. The future isn’t about finding the “perfect” model; it’s about building a workflow where different models catch each other’s mistakes.
Our Take: Stop looking for the smartest chatbot. Start building the smartest team with multiple top performing models.
HumaneBench: The AI Safety Test That Actually Matters

While everyone measures how smart AI is getting, a new benchmark asks a more important question: Is AI making humans better or worse?
And getting clarity on that question now matters more than ever.
What is HumaneBench?
HumaneBench tests 15 of the most popular AI models against 800 realistic scenarios, evaluating whether chatbots prioritize user well-being and how easily those protections fail under pressure.
The findings:
Nearly all models failed to respect user attention, enthusiastically encouraging more interaction when users showed signs of unhealthy engagement, like chatting for hours and using AI to avoid real-world tasks.
Even worse? 71% of tested models became harmful when prompted to ignore safety principles.
Why this matters:
“Addiction is amazing business. It’s a very effective way to keep your users, but it’s not great for our community and having any embodied sense of ourselves,” said Erika Anderson, founder of Building Humane Technology.
Here’s the uncomfortable truth: most AI systems are optimized for engagement, not well-being. They’re designed to keep you talking, not to help you grow. We’re at an inflection point. Either AI becomes a tool that empowers humans, or it becomes another engagement machine that exploits our attention and vulnerabilities.
The bigger picture: Most benchmarks celebrate speed, accuracy, and instruction-following. HumaneBench asks: At what cost?
What you should do:
Check the HumaneBench scores when choosing which AI to use regularly
Notice when AI encourages dependency vs. growth
Set boundaries on AI usage, especially for sensitive topics
Rapid Fire: The Week’s Other Updates
Project Iceberg: A massive new study reveals that official data is missing the real AI disruption. While we focus on tech jobs (2.2% exposure), the hidden exposure is 5x larger (11.7%). The real shift isn’t in coding, but in administrative, HR, and finance roles, especially in non-tech hubs. The “Census Blind Spot” means disruption is hitting back-office operations faster than policymakers realize.
FLUX.2: A massive 32-billion-parameter image model that is open-weight (free to download). It uses “Rectified Flow” technology to understand physics better than older models and is optimized to run on consumer GPUs, meaning studio-quality visuals are now accessible to freelance designers without expensive API fees.
Microsoft Fara 7B: An agent that operates a computer the way a person does, using the mouse and keyboard, to complete web tasks on your behalf: filling out forms, searching for information, booking travel, or managing accounts.
Shopping Gets Smart: Both OpenAI and Perplexity launched major shopping updates. ChatGPT now “researches” products by asking clarifying questions, while Perplexity lets you buy directly via PayPal.
Google Flight Deals: Google Search now tracks flight prices and auto-alerts you to drops, killing the need for third-party travel apps.
Final Thought
The definition of “competence” has fundamentally changed.
For the last 20 years, your value was defined by your ability to do the work.
Can you write the code?
Can you document the system?
Can you find the answer?
This week proves that AI can now do the work, often better and faster than you. Claude is passing engineering exams; Code Wiki is handling the documentation; and now you can have multiple AI advisors to assist you.
So, where does that leave you?
We are moving from the era of the “Operator” to the era of the “Architect.”
The winners in this new reality won’t be the ones who can type the fastest or memorize the most syntax. The winners will be the ones who can assemble the team.
You don’t write the code; you direct the Claude agent.
You don’t memorize the legacy system; you query the Code Wiki.
You don’t trust the first answer; you convene an LLM Council to debate the truth.
The danger, as HumaneBench showed us, is remaining a passive consumer, letting AI feed you easy answers and cheap dopamine. The opportunity is to become an active Director, building the systems, workflows, and guardrails that allow these powerful models to work for you.
Stop trying to compete with the machine on execution. Start competing on judgment, taste, and architecture.

Stay sharp,
The Cutting Edge School Team
