OpenAI’s reasoning models: Better thinking skills at the expense of factual accuracy

OpenAI’s latest AI models, o3 and o4-mini, show impressive progress in logical reasoning but suffer from a surprising problem: they hallucinate far more frequently than their predecessors.

In the world of artificial intelligence, the trend so far has been clear: newer models hallucinate less, i.e. they produce less false information than older versions. But OpenAI’s latest reasoning models break this trend in an unexpected way. While the o3 and o4-mini models, developed specifically for complex reasoning and problem-solving tasks, clearly outperform their predecessors in areas such as programming and mathematics, they also show a worrying increase in hallucination rates.

Internal assessments show that o3 generates false information in 33% of its answers on the PersonQA benchmark, while o4-mini hallucinates in as many as 48% of cases. In comparison, the hallucination rate for GPT-4o is only 12%. This development runs counter to the general industry trend, in which leading models such as Google’s Gemini-2.0-Flash-001 achieve hallucination rates as low as 0.7%.

Causes of the hallucination problem

The phenomenon can be attributed to several technical factors. The reinforcement learning methods used in o3 and o4-mini reward logical coherence more than factual accuracy. As a result, the models provide plausible-sounding but not fact-based explanations. In addition, these reasoning models focus more on pattern recognition in structured problems than on broad data ingestion, leading to gaps in general knowledge.
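To make the incentive problem concrete, here is a deliberately simplified sketch, not OpenAI’s actual training objective, of a reward that weights logical coherence more heavily than factual accuracy. The weights and scores are hypothetical; the point is only that under such a weighting, a fluent but fabricated answer can outscore a cautious, correct one.

```python
# Toy illustration (hypothetical weights, not real training code): a reward
# that prefers coherence over factuality.

def reward(coherence: float, factuality: float,
           w_coherence: float = 0.8, w_factuality: float = 0.2) -> float:
    """Both scores are assumed to lie in [0, 1]."""
    return w_coherence * coherence + w_factuality * factuality

# A confident, well-structured but fabricated answer...
fabricated = reward(coherence=0.95, factuality=0.10)
# ...can beat a hedged but correct one under this weighting.
correct = reward(coherence=0.60, factuality=0.95)

print(fabricated > correct)  # True with the weights above
```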

Particularly problematic: the models often present false information with a high degree of confidence. In some cases, they invent detailed but completely fictitious biographies of historical figures or suggest seemingly plausible but potentially dangerous drug combinations in medical applications.


Industry-wide solutions

Companies are pursuing a range of strategies to tackle the hallucination problem. Google’s Gemini 2.0 combines Retrieval Augmented Generation (RAG) with real-time fact checking, while Anthropic’s Constitutional AI embeds ethical guidelines directly into the thought process, reducing harmful hallucinations by 58% – but at the cost of analytical flexibility.
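As a rough illustration of the general retrieval-plus-verification pattern (a minimal sketch, not Google’s actual Gemini pipeline), the following Python example grounds an answer in retrieved passages and falls back to a caveat when claims cannot be verified. The retriever, generator and checker are stand-in stubs; real systems would use a search index, an LLM call and a claim-verification model.

```python
from typing import List

def retrieve(question: str, top_k: int = 5) -> List[str]:
    # Stub: return passages relevant to the question from a document store.
    return ["passage about the topic"] * top_k

def generate(question: str, context: List[str]) -> str:
    # Stub: an LLM would draft an answer grounded in the retrieved context.
    return "draft answer citing the retrieved passages"

def unsupported_claims(draft: str, context: List[str]) -> List[str]:
    # Stub: a verifier flags claims in the draft that no passage supports.
    return []

def answer_with_rag(question: str) -> str:
    context = retrieve(question)
    draft = generate(question, context)
    if unsupported_claims(draft, context):
        # Real systems regenerate, add caveats, or refuse when claims
        # cannot be grounded in the retrieved sources.
        return "Parts of this answer could not be verified against sources."
    return draft

print(answer_with_rag("How often does o3 hallucinate on PersonQA?"))
```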

Among the most promising approaches are multi-agent systems in which specialized “critic” models analyze the chain of thought before a final output is made. Equally interesting are dynamic confidence values, in which the models automatically assess the certainty of each assertion and add appropriate notes in the case of uncertain statements.
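The dynamic-confidence idea can be sketched in a few lines. The example below is a hypothetical illustration with made-up confidence scores: assertions whose estimated certainty falls below a threshold are tagged with an explicit caveat; in practice the scores would come from token probabilities, self-evaluation prompts or a separate verifier model.

```python
from dataclasses import dataclass

@dataclass
class Assertion:
    text: str
    confidence: float  # assumed to lie in [0, 1]

def annotate(assertions: list[Assertion], threshold: float = 0.7) -> str:
    # Append a caveat to any assertion below the confidence threshold.
    lines = []
    for a in assertions:
        if a.confidence < threshold:
            lines.append(f"{a.text} (uncertain - please verify independently)")
        else:
            lines.append(a.text)
    return "\n".join(lines)

print(annotate([
    Assertion("o3 was released by OpenAI.", confidence=0.95),
    Assertion("The benchmark figure stems from internal testing.", confidence=0.55),
]))
```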

OpenAI’s situation epitomizes a fundamental dilemma in AI development: systems are becoming more powerful, while their errors are becoming more subtle and consequential. Resolving the hallucination paradox may require a reorientation of the fundamental goals of machine reasoning – not just what conclusions models can reach, but how they should reach conclusions in an uncertain world.


Summary

  • OpenAI’s new reasoning models o3 and o4-mini show significantly higher hallucination rates (33% and 48% respectively) than older models (GPT-4o: 12%)
  • The models also offer significantly improved problem-solving capabilities in areas such as programming and mathematics
  • The phenomenon contradicts the industry trend of falling hallucination rates
  • Causes include reinforcement learning methods that favor logical coherence over factual accuracy
  • Multi-agent systems and dynamic confidence values are considered promising solutions to the problem

Source: TechCrunch