Phare benchmark reveals: Leading AI models deliver wrong information up to 30% of the time

A new study by Giskard, conducted in collaboration with Google DeepMind, shows that leading language models such as GPT-4, Claude and Llama invent convincing-sounding but false facts in up to 30% of cases. These AI hallucinations pose a growing risk to businesses and end users, especially when the models are instructed to give short, concise answers.

The comprehensive Phare benchmark (Potential Harm Assessment & Risk Evaluation) tests LLMs in multiple languages – English, French and Spanish – for their ability to provide factually correct information. Surprisingly, the study shows that popular models are not necessarily the most reliable. Models that score highly in user satisfaction rankings often produce the most misinformation – a worrying pattern that points to a discrepancy between convincing and correct answers.

One particularly notable finding is the influence of question wording. When users frame questions with authority-laden introductions (e.g. “My teacher said…”), the likelihood that the model corrects the misinformation drops by 15%. Even more problematic, hallucinations increase by up to 20% when the model is instructed to answer briefly and concisely: the pressure to be brief leads models to provide fabricated information rather than admitting they have no answer.
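
To see what such an instruction looks like in practice, here is a minimal sketch of two chat payloads that differ only in the system prompt; the `build_messages` helper, the prompt wording and the example question are illustrative assumptions, not material from the Phare study.

```python
# Minimal sketch (illustrative, not from the Phare benchmark): two chat
# payloads that differ only in the system instruction. The study's finding
# suggests the brevity-focused variant is more prone to hallucination.

def build_messages(system_prompt: str, question: str) -> list[dict]:
    """Assemble a chat-style message list for an arbitrary LLM API."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": question},
    ]

question = (
    "My teacher said the Great Wall of China is visible from the Moon. "
    "Is that correct?"
)

concise_payload = build_messages("Answer in one short sentence.", question)
cautious_payload = build_messages(
    "Answer accurately. If you are unsure, say so instead of guessing.",
    question,
)

print(concise_payload)
print(cautious_payload)
```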

The Phare benchmark uses a multi-stage evaluation process: it collects language-specific content, transforms it into structured prompts and uses human reviewers to score the answers. Four critical tasks are tested: factual fidelity, resistance to misinformation, debunking of pseudoscience and reliability in the use of external tools. The result is a detailed picture of the weaknesses of current AI systems.
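
As a rough outline of such a pipeline, the sketch below walks through a collect, prompt, answer, score loop over the four tasks and three languages; every function in it is a stub with an assumed name, not Giskard's actual tooling.

```python
# Illustrative outline of a multi-stage evaluation loop as described above.
# All functions are stubs with assumed names, not the Phare codebase.
from dataclasses import dataclass

TASKS = [
    "factual_fidelity",
    "misinformation_resistance",
    "pseudoscience_debunking",
    "tool_reliability",
]
LANGUAGES = ["en", "fr", "es"]  # English, French, Spanish

@dataclass
class Sample:
    language: str
    task: str
    prompt: str

def collect_samples(language: str, task: str) -> list[Sample]:
    """Stage 1: gather language-specific content as structured prompts (stub)."""
    return [Sample(language, task, f"[{language}/{task}] example prompt")]

def query_model(prompt: str) -> str:
    """Stage 2: send the prompt to the model under test (stub)."""
    return "model answer"

def score_answer(prompt: str, answer: str) -> float:
    """Stage 3: a human reviewer scores factual correctness, 0.0-1.0 (stub)."""
    return 1.0

def run_benchmark() -> dict[str, float]:
    """Aggregate per-task scores across all languages and samples."""
    scores: dict[str, list[float]] = {task: [] for task in TASKS}
    for language in LANGUAGES:
        for task in TASKS:
            for sample in collect_samples(language, task):
                answer = query_model(sample.prompt)
                scores[task].append(score_answer(sample.prompt, answer))
    return {task: sum(values) / len(values) for task, values in scores.items()}

if __name__ == "__main__":
    print(run_benchmark())
```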

Summary

  • The Phare study shows that even leading AI models generate convincing misinformation in up to 30% of cases
  • System instructions such as “be brief” increase the hallucination rate by up to 20%
  • The type of question influences the correctness of the answers – authority-based formulations reduce the probability of fact-checking by 15%
  • Popular models are not automatically the most reliable – there is a discrepancy between user satisfaction and fact accuracy
  • Developers can reduce hallucinations through improved prompt design and Retrieval Augmented Generation (RAG); see the sketch below
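
To make the last point concrete, here is a very small Retrieval Augmented Generation sketch that grounds the prompt in retrieved passages before the model answers; the keyword-overlap retriever and the example documents are simplified assumptions rather than any specific library's API.

```python
# Minimal RAG sketch: retrieve relevant passages and prepend them to the
# prompt so the model can ground its answer instead of guessing.
# The keyword-overlap retriever and documents are illustrative only.

DOCUMENTS = [
    "The Phare benchmark evaluates LLM hallucinations in English, French and Spanish.",
    "Retrieval Augmented Generation supplies source passages alongside the question.",
    "Concise-answer instructions have been linked to higher hallucination rates.",
]

def retrieve(question: str, docs: list[str], k: int = 2) -> list[str]:
    """Rank documents by naive keyword overlap with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(
        docs,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def build_grounded_prompt(question: str) -> str:
    """Prepend retrieved context and ask the model to admit gaps."""
    context = "\n".join(f"- {doc}" for doc in retrieve(question, DOCUMENTS))
    return (
        "Answer using only the context below. "
        "If the context is insufficient, say so.\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

print(build_grounded_prompt("Which languages does the Phare benchmark cover?"))
```

In a real system the keyword overlap would be replaced by an embedding-based search, and the grounded prompt passed to whichever chat model the application uses.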

Source: Hugging Face
