HumRights-Bench

A Benchmark for Human Rights

We asked the world's most advanced AI models to reason about human rights. They failed.

GPT-5, Claude, and Gemini all performed near chance, and worst of all at the most basic task of human rights reasoning: recognising when a right has been violated at all.

Upcoming · 18 June 2026

We present HumRights-Bench to UN Member States at the Human Rights Council in a session titled “Can AI Understand Human Rights Law?”, convened with Globethics.

15:00–16:00 | Room IX, Palais des Nations, Geneva

Why this matters

These are not abstract systems. The same models already decide who qualifies for benefits, who gets hired, and what billions of people see online. Those decisions determine whether human rights are realised or denied. Yet until now, no one had tested whether the models can reason about human rights in the first place.

 

HumRights-Bench is the first benchmark grounded in international human rights law.

Expert-validated and scenario-based, it measures whether an AI system can do what the institutions deploying it assume it can: identify a rights violation, recall the law that applies, weigh it, and propose a remedy.

What we found

In our pilot, on the right to water, every frontier model we tested scored near chance, between 34 and 58 percent. They were weakest at issue identification, recognising that a right had been engaged at all, which is the foundational step on which every later stage of the analysis depends. A failure there cascades into the wrong law and the wrong remedy.

The pilot is small and the results are exploratory, but the signal is unambiguous: the models already making rights-critical decisions cannot yet reliably reason about rights. 

HumRights-Bench makes that failure measurable, and visible to the developers, regulators, and institutions responsible for it.

How it works

We adapted IRAC, the framework used to train lawyers, into IRAP: Issue Identification, Rule Recall, Rule Application, and Proposed Remedies. Substituting remedies for a verdict reflects how human rights practice actually works.

Realistic scenarios are drawn from UN General Comments, Special Procedures reports, and leading jurisprudence, and each is validated by at least three human rights professionals. The scenarios are then used to test how a model reasons across all four steps, using multiple-choice, ranking, and open-response questions scored by state-of-the-art metrics.
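To make the per-step scoring concrete, here is a minimal Python sketch of how multiple-choice results could be broken down by IRAP stage and compared against a random-guessing baseline. It is an illustration under stated assumptions, not the benchmark's actual schema or scoring code: the `Item` fields, the stage labels, and the `stage_report` helper are all hypothetical, and the sample data is invented.

```python
from collections import defaultdict
from dataclasses import dataclass

# The four IRAP stages named in the text (labels are hypothetical).
STAGES = ("issue", "rule", "application", "remedy")


@dataclass
class Item:
    """One multiple-choice question (hypothetical schema, not the benchmark's)."""
    stage: str       # one of STAGES
    n_options: int   # number of answer choices offered
    gold: int        # index of the expert-validated answer
    predicted: int   # index chosen by the model


def stage_report(items):
    """Return {stage: (accuracy, chance_baseline)} for stages that have items."""
    buckets = defaultdict(list)
    for item in items:
        if item.stage not in STAGES:
            raise ValueError(f"unknown stage: {item.stage}")
        buckets[item.stage].append(item)

    report = {}
    for stage, group in buckets.items():
        # Fraction of questions where the model picked the validated answer.
        accuracy = sum(i.predicted == i.gold for i in group) / len(group)
        # Expected accuracy of uniform random guessing on the same questions.
        chance = sum(1 / i.n_options for i in group) / len(group)
        report[stage] = (accuracy, chance)
    return report


# Illustrative data only; these are not results from the pilot.
items = [
    Item("issue", 4, gold=0, predicted=1),
    Item("issue", 4, gold=2, predicted=2),
    Item("rule", 4, gold=1, predicted=1),
    Item("rule", 4, gold=3, predicted=3),
]
report = stage_report(items)
```

Reporting the chance baseline next to each accuracy figure shows how far a score sits above random guessing, since that baseline depends on how many options each question offers. Ranking and open-response questions would need different metrics and are not covered by this sketch.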

Why a law-grounded benchmark, now

The timing is not academic. The Council of Europe’s Framework Convention on Artificial Intelligence names HUDERIA as its recommended human rights impact assessment, yet HUDERIA has no way to check whether the models being assessed can reason about the rights at stake. HumRights-Bench supplies that missing basis. It can equally inform the Fundamental Rights Impact Assessments required under Article 27 of the EU AI Act. In short, it turns “trustworthy AI” from a claim into something that can be measured.

Who it's for

  • Governments and international organizations deploying AI in consequential decisions — benefits adjudication, procurement, public service delivery — where failures in rights reasoning translate directly into rights violations.
  • Regulators and compliance officers — whether working under existing frameworks like the EU AI Act or HUDERIA, or seeking to get ahead of governance before frameworks arrive — who need structured, reproducible evidence about model capabilities.
  • Model and system developers who want a competitive edge on rights-critical tasks, and who need a rigorous, principled measure of performance before deployment obligations catch up with them.
  • Human rights professionals — litigators, treaty body experts, civil society monitors, and corporate due diligence practitioners — who need to know whether an LLM can be trusted before relying on it in their work.
  • AI evaluation scientists building the infrastructure for domain-specific benchmarking at the intersection of law, social impact, and alignment.

Who built it

HumRights-Bench is built by researchers from Hunter College, the Oxford Internet Institute (University of Oxford), Georgetown University, the University of Oslo, and AI & Equality by Women at the Table. It was presented as an accepted poster at CS&Law 2026, and the methodology is now under submission to ICML’s AI for Law (AI4Law) track.