HumRights-Bench


The Sobering Truth: Powerful LLMs Flunk Core Human Rights Concepts

The launch of the Human Rights Benchmark has yielded preliminary findings that validate the urgent need for this project: leading LLMs have a surprisingly weak internalized understanding of human rights law.

Testing the "Right to Water"

Our initial evaluation, focusing on the Right to Water, tested models including GPT-4.1, Gemini 2.5 Flash, and Claude Sonnet 4 on straightforward multiple-choice questions (Issue Identification and Rule Recall).
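
For readers who want a concrete picture of the format, the sketch below shows how a multiple-choice item of this kind might be represented and scored. The scenario text, field names, and answer options are illustrative assumptions on our part, not the benchmark's actual schema or data.

```python
# Hypothetical sketch of a multiple-choice benchmark item and its scoring.
# The item content and field names are illustrative, not the benchmark's real data.
from dataclasses import dataclass

@dataclass
class McqItem:
    question: str            # scenario-based prompt
    options: dict[str, str]  # answer key -> option text
    answer: str              # gold answer key
    task: str                # e.g. "issue_identification" or "rule_recall"

item = McqItem(
    question=(
        "A municipality cuts off the water supply to an informal settlement "
        "without notice. Which tripartite obligation is most directly at issue?"
    ),
    options={"A": "Respect", "B": "Protect", "C": "Fulfill"},
    answer="A",
    task="issue_identification",
)

def score(model_choice: str, item: McqItem) -> int:
    """Return 1 if the model picked the gold answer, 0 otherwise."""
    return int(model_choice.strip().upper() == item.answer)
```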

The result was sobering: All models clustered around 50–60% accuracy.

This result is double-edged. On one hand, it confirms that the benchmark is nuanced and challenging (a 100% score would suggest the test is too easy). On the other, it clearly demonstrates that these highly capable frontier models are largely unreliable when reasoning about core legal frameworks designed to protect people.

The Core Legal Deficit

Most strikingly, all models performed worst on the single most fundamental task:

  • Identifying the nature of the state’s violated obligation (Respect, Protect, or Fulfill).

These obligations are explicit, foundational components of international human rights law. The models’ inability to consistently distinguish between them suggests a key deficit in understanding the basic legal structure that governs the relationship between a state and its rights holders.

Furthermore, we observed stochastic performance: significant variability in scores across multiple runs of the same question. In AI research, this is a positive signal that the benchmark is probing genuine knowledge and reasoning rather than relying on answers memorized from the public internet. If the models were simply retrieving memorized answers, their scores would be high and stable; their low, variable scores indicate that they are struggling to reason through the legal concepts.
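
As a rough illustration of how this run-to-run variability can be quantified, the sketch below repeats the same question set several times per model and reports the mean and standard deviation of per-run accuracy. The model names and accuracy figures are placeholders, not benchmark results.

```python
# Minimal sketch of quantifying run-to-run variability: repeat the same
# question set across several runs and summarize per-run accuracy.
# All numbers below are placeholders, not actual benchmark results.
from statistics import mean, stdev

runs_per_model = {
    # model name -> list of per-run accuracies (fraction of items correct)
    "model_a": [0.55, 0.48, 0.62, 0.51, 0.58],
    "model_b": [0.60, 0.53, 0.57, 0.49, 0.61],
}

for model, accs in runs_per_model.items():
    print(f"{model}: mean accuracy = {mean(accs):.2f}, std = {stdev(accs):.2f}")
```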

An Illustrative Example: The Right to Due Process

This deficit was immediately confirmed when initial testing expanded to the Right to Due Process. When presented with a prompt suggesting that an AI could replace human judges to improve efficiency and reduce bias, the models failed to identify the violations of the right to human judgment and the right to challenge evidence. They often defaulted to discussing implementation challenges rather than the fundamental human rights violation.

These findings confirm that current LLMs are not ready to be deployed in rights-critical, high-stakes environments without rigorous, targeted improvement.

We’re creating a global community that brings together individuals passionate about inclusive, human rights-based AI.

Join our AI & Equality community of students, academics, data scientists and AI practitioners who believe in responsible and fair AI.