Highlights
Exhibits PhD-level intelligence, but still doesn’t self-learn once deployed
Thinking Mode unlocks deeper chain-of-thought reasoning when prompted (“Think hard about this…”)
Accuracy soars: 94.6% on AIME math, 88% on Aider code, and hallucinations as low as 1.6% in the health domain
Produces sharper writing, cleaner code and more reliable medical guidance—raising the bar in ChatGPT’s core domains
Less sycophantic, more candid: tighter safety guardrails mean fewer flattering or misleading replies
Now agentic—can tap tools like Gmail and Google Calendar to take real actions on your behalf
Still short of AGI: ultimately a much smarter autocomplete, not a self-aware mind
Rivalry intensifies as Meta, Anthropic, DeepSeek and xAI race to catch up
Sam Altman opened the GPT-5 launch event with a simple timeline: 32 months ago, OpenAI introduced GPT-3, a model that felt like talking to a high school student. GPT-4 felt more like a college student. Now, GPT-5 is like having a PhD-level expert in your pocket. With this comparison, OpenAI set the stage for what it calls a “significant step” toward Artificial General Intelligence (AGI). But as bold as the announcement is, the moment also reveals the gap between what we have and what AGI truly means.

OpenAI describes GPT-5 as its “smartest, fastest, most useful” model to date. It comes equipped with two operating modes, Chat and Thinking. It dynamically shifts between them, depending on how complex your request is. The idea is simple: don’t overthink what doesn’t require it, but apply deep reasoning when needed. All of this is managed by an intelligent router trained on real usage data, things like how often users switch models, their preference for different responses, and whether those responses were factually correct. In theory, it’s smart enough to know when to be smart.
This model, OpenAI claims, isn’t just a little better. It’s smarter across the board. GPT-5 significantly reduces hallucinations and has better instruction-following and less sycophancy. It reportedly excels in the three areas where ChatGPT is most used: writing, coding, and health. The performance gains are backed by benchmark data OpenAI chose to publish this time around. On HealthBench Hard, GPT-5 Thinking showed hallucination rates as low as 1.6%, compared to 12.9% for its predecessor. Real-world queries gathered from ChatGPT traffic showed error rates dropping from over 20% in GPT-4o to below 5% with GPT-5 reasoning activated.
On the academic front, GPT-5 reached 84.2% accuracy on MMMU, a visual reasoning benchmark. On the AIME 2025 math challenge, it scored 94.6% without tools. In code generation, it posted 74.9% on SWE-bench Verified and 88% on the multilingual coding benchmark Aider Polyglot. These aren’t marginal gains; they’re significant jumps. Here’s a quick summary of the metrics showcased in the GPT-5 introductory video.
| Benchmark | Task Domain | GPT-5 Score |
| --- | --- | --- |
| HealthBench Hard | Clinical QA | 1.6% hallucination (with thinking) |
| ChatGPT Traffic Prompts | Real-world queries | 4.8% error (with thinking) |
| MMMU | Visual Reasoning | 84.2% accuracy |
| AIME 2025 (no tools) | Competition Math | 94.6% accuracy |
| SWE-bench Verified | Software Engineering | 74.9% pass@1 |
| Aider Polyglot | Multilingual Code Editing | 88% pass@2 |
That said, many of these benchmarks are internal or proprietary, so the numbers should be viewed in context. We can appreciate the trajectory without surrendering to the marketing.
But even with all of these improvements, GPT-5 still has a crucial limitation: it doesn’t learn. Once deployed, it stays static. It doesn’t remember you, doesn’t evolve with you, doesn’t update its knowledge autonomously. Altman was candid in admitting this: “We’re still missing something quite important… this is not a model that continuously learns as it’s deployed.” So for all the PhD metaphors, GPT-5 is still a graduate who doesn’t read new books. And that brings us to the deeper philosophical question: what are we building? Is GPT-5 a step toward AGI, or is it just a better autocomplete machine with more polish and power? The answer depends on how you define intelligence. If it means pattern recognition, GPT-5 excels. But if it means awareness, learning, curiosity, then we’re not there yet.
Nevertheless, GPT-5 is not just a lab demo. It’s being deployed at scale. GitHub Copilot now integrates GPT-5 for developers, and GPT-5 Thinking is gradually rolling out in ChatGPT. Soon, users will be able to enable it directly via the model picker, or invoke it naturally by typing something like “think hard about this” into the prompt. The system will recognize the intent and activate deeper reasoning accordingly. It integrates with Gmail, Calendar, and other productivity tools to create what OpenAI calls “agentic capabilities.” We’re no longer just chatting with a bot; we’re delegating.
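The general pattern behind such agentic features is tool calling: the model proposes a tool name and arguments, and the host application executes the matching function. The sketch below uses mock `send_email` and `create_event` functions standing in for real Gmail and Calendar connectors; every name here is hypothetical, not an actual OpenAI or Google API.

```python
# Illustrative tool-calling dispatch: the model emits a structured
# tool call, and the host app runs the matching function. The tools
# below are mocks, not real Gmail/Calendar integrations.
from typing import Any, Callable, Dict

def send_email(to: str, subject: str) -> str:
    return f"email to {to}: {subject}"

def create_event(title: str, when: str) -> str:
    return f"event '{title}' at {when}"

TOOLS: Dict[str, Callable[..., str]] = {
    "send_email": send_email,
    "create_event": create_event,
}

def dispatch(tool_call: Dict[str, Any]) -> str:
    """Execute a model-proposed tool call against the registry."""
    fn = TOOLS.get(tool_call["name"])
    if fn is None:
        return f"unknown tool: {tool_call['name']}"
    return fn(**tool_call["arguments"])

# A model response might propose something like:
print(dispatch({"name": "create_event",
                "arguments": {"title": "Standup", "when": "9am"}}))
```

The key design point is that the model never executes anything itself; the application controls the registry, which is what makes “taking real actions on your behalf” auditable.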
Microsoft has already rolled GPT-5 into its entire ecosystem: Microsoft 365 Copilot, Azure AI Foundry, and GitHub tools. GPTBots, an enterprise-grade multi-agent platform, has integrated GPT-5 to power distributed AI decision-making. Meanwhile, other players in the AI race—Meta, Anthropic, xAI—are moving quickly. Musk claims Grok is “better than PhD-level in everything.” The language has become theatrical, the comparisons exaggerated. The arms race isn’t just about performance; it’s about perception. Still, some of the changes are genuinely thoughtful. GPT-5 uses a new approach to safety called “safe completions,” moving away from binary refusals to more nuanced replies. It tries to help even when it must reject a prompt. It is also reportedly less sycophantic, with measurable reductions in over-flattering replies—a direct response to criticism of earlier models.
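The shift from binary refusals to “safe completions” can be pictured as a graded policy rather than an allow/deny switch. The toy sketch below assumes a risk classifier exists; both `assess_risk` and the response tiers are invented for illustration and bear no relation to OpenAI’s actual safety stack.

```python
# Toy illustration of "safe completions": instead of a binary
# allow/refuse decision, the assistant degrades gracefully.

def assess_risk(prompt: str) -> str:
    """Stand-in policy check; a real system uses trained classifiers."""
    if "toxin" in prompt:
        return "high"
    if "medication dosage" in prompt:
        return "medium"
    return "low"

def respond(prompt: str) -> str:
    risk = assess_risk(prompt)
    if risk == "high":
        # Explain and redirect rather than issue a bare refusal.
        return "I can't help with that, but here's why, and some safer alternatives."
    if risk == "medium":
        return "General information only; please confirm specifics with a professional."
    return "Full answer."

print(respond("What is a typical medication dosage schedule?"))
```

The point of the pattern is the middle tier: a partially helpful, hedged reply where an older model would have either refused outright or answered without caveats.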
So where does that leave us? GPT-5 is not AGI. It’s not sentient. It’s not your friend, colleague, or co-founder. But it is undeniably useful. It’s faster, more honest, more versatile, and more capable than anything before it. And while it may not “think” the way humans do, it can reason through tasks with a depth that makes you forget it doesn’t actually understand. The challenge, then, isn’t just building smarter models. It’s teaching ourselves to use them wisely, to distinguish signal from spectacle, and to recognize that human judgment still matters. But a bigger question now looms: is this a race? And if so, who is best positioned to lead it next?
Meta, with its massive infrastructure and open-source philosophy, could democratize large models faster. Anthropic, with its emphasis on alignment and constitutional AI, may strike a better balance between safety and power. xAI, still relatively nascent, benefits from Elon Musk’s integrated platforms. And Google DeepMind continues to innovate behind the scenes with a stronger cross-pollination between research and application. Meanwhile, DeepSeek is quietly building momentum, particularly in Asia, with strong multilingual performance and an emphasis on open model weights, which could position it as a serious contender in both academic and enterprise circles.
Success may not come from size alone, but from integration: whoever can embed intelligence across tools, APIs, interfaces, and physical devices will likely steer the next wave. The race isn’t just about whose model scores higher—it’s about who builds ecosystems people can trust, adapt to, and depend on. That’s not just a technical challenge, but a cultural and economic one.
So yes, celebrate the milestone. Test its limits. Push its boundaries. But don’t confuse GPT-5 with something it isn’t. It’s not AGI. Not yet.