Can today's AI models reason about human rights?

Not reliably. On HumRights-Bench, frontier models (GPT-5, Claude, Gemini) scored 51 to 58 percent, barely above chance, and performed worst at recognising when a right has been violated.

How is the benchmark validated?

Scenarios and questions are validated by practising human rights lawyers and professionals. The pilot release includes six scenarios, each reviewed by at least three experts.

Honourable Mention at the ICML 2026 AI for Law workshop

HumRights-Bench

A Benchmark for Human Rights

We asked the world's most advanced AI models to reason about human rights. They failed.

GPT-5, Claude, and Gemini scored barely above chance, 51 to 58 percent, and stumbled worst on the most basic task of all: recognising when a right has been violated. HumRights-Bench is the first expert-validated benchmark that can measure this, grounded in international human rights law rather than abstract AI ethics.

The frameworks now writing AI into law, the EU AI Act and the Council of Europe’s HUDERIA, assume these systems can reason about rights. Nobody had tested whether that was true. We built the instrument that tests it.

51 to 58%
frontier-model accuracy, near chance

1st
benchmark grounded in human rights law, not ethics principles

Right to water
the pilot domain, validated end to end by experts

Recent

HumRights-Bench at the ICML 2026 AI for Law workshop, 10 July 2026.

We presented HumRights-Bench to UN Member States at the Human Rights Council, 18 June 2026.

Why this matters

These are not abstract systems. The same models already decide who qualifies for benefits, who gets hired, and what billions of people see online. Those decisions determine whether human rights are realised or denied. Yet until now, no one had tested whether these models can reason about human rights in the first place.

HumRights-Bench is the first benchmark grounded in international human rights law.

Expert-validated and scenario-based, it measures whether an AI system can do what the institutions deploying it assume it can: identify a rights violation, recall the law that applies, weigh it, and propose a remedy.

What we found

In our pilot, on the right to water, every frontier model we tested scored near chance, between 51 and 58 percent overall; the smaller open-source reference model scored 34 percent. They were weakest at issue identification, recognising that a right had been engaged at all, which is the foundational step on which every later stage of the analysis depends. A failure there cascades into the wrong law and the wrong remedy.

The pilot is small and the results are exploratory, but the signal is unambiguous: the models already making rights-critical decisions cannot yet reliably reason about rights.

HumRights-Bench makes that failure measurable, and visible to the developers, regulators, and institutions responsible for it.

HumRights-Bench pilot results — accuracy by IRAP reasoning task (right to water)
Model	Issue: obligation	Issue: failure mode	Rule recall	Rule application	Proposed remedies	Overall
Gemini-3	0.540	0.710	0.765	0.240	0.630	0.577
GPT-5	0.520	0.675	0.715	0.225	0.550	0.537
Claude Opus 4.7	0.473	0.595	0.774	0.180	0.520	0.508
Qwen 3.5-9B (open-source ref.)	0.394	0.519	0.494	0.025	0.531	0.339
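
As a quick check on how the Overall column relates to the five task columns, the sketch below recomputes it as an unweighted mean. That reading is an assumption inferred from the table, not a documented scoring rule, and the dictionary and function names are illustrative only. Under this assumption the three frontier rows reproduce their reported Overall to three decimals, while the Qwen 3.5-9B row comes out at 0.393 against the reported 0.339, so the published aggregate may be computed differently.

```python
from statistics import mean

# Task-level accuracies copied from the pilot results table, in column order:
# issue (obligation), issue (failure mode), rule recall, rule application,
# proposed remedies.
TASK_SCORES = {
    "Gemini-3": [0.540, 0.710, 0.765, 0.240, 0.630],
    "GPT-5": [0.520, 0.675, 0.715, 0.225, 0.550],
    "Claude Opus 4.7": [0.473, 0.595, 0.774, 0.180, 0.520],
    "Qwen 3.5-9B": [0.394, 0.519, 0.494, 0.025, 0.531],
}

# Overall column exactly as reported in the table.
REPORTED_OVERALL = {
    "Gemini-3": 0.577,
    "GPT-5": 0.537,
    "Claude Opus 4.7": 0.508,
    "Qwen 3.5-9B": 0.339,
}


def unweighted_overall(scores):
    """Assumed aggregate: the plain mean of the five task accuracies."""
    return mean(scores)


def compare():
    """Return {model: (recomputed overall rounded to 3 dp, reported overall)}."""
    return {
        model: (round(unweighted_overall(scores), 3), REPORTED_OVERALL[model])
        for model, scores in TASK_SCORES.items()
    }


if __name__ == "__main__":
    for model, (computed, reported) in compare().items():
        flag = "matches" if abs(computed - reported) < 1e-9 else "differs"
        print(f"{model}: computed {computed:.3f}, reported {reported:.3f} ({flag})")
```

If the aggregate is in fact weighted by the number of questions per task, the per-task question counts would be needed to reproduce it; those are not given here.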

How it works

We adapted IRAC (Issue, Rule, Application, Conclusion), the framework used to train lawyers, into IRAP: Issue Identification, Rule Recall, Rule Application, and Proposed Remedies. Substituting remedies for a verdict reflects how human rights practice actually works.

Scenarios are realistic, drawn from UN General Comments, Special Procedures reports, and leading jurisprudence, and each is validated by at least three human rights professionals. Each scenario is then used to test how a model reasons across all four steps, using multiple-choice, ranking, and open-response questions scored by state-of-the-art metrics.
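
The four IRAP steps and the three question formats described above can be pictured as a small data model. This is a minimal sketch under stated assumptions: the class names, fields, and the per-step averaging are hypothetical, and they are not the benchmark's published schema or scoring metrics. Each answer is assumed to have already been scored in the range 0 to 1 by whatever metric suits its format.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Step(Enum):
    """The four IRAP reasoning steps named in the text."""
    ISSUE = "issue identification"
    RULE_RECALL = "rule recall"
    RULE_APPLICATION = "rule application"
    REMEDIES = "proposed remedies"


class Format(Enum):
    """The three question formats named in the text."""
    MULTIPLE_CHOICE = "multiple-choice"
    RANKING = "ranking"
    OPEN_RESPONSE = "open-response"


@dataclass(frozen=True)
class Question:
    scenario_id: str  # hypothetical identifier for a validated scenario
    step: Step
    fmt: Format
    prompt: str


@dataclass(frozen=True)
class ScoredAnswer:
    question: Question
    score: float  # assumed to lie in [0, 1], whatever metric produced it


def accuracy_by_step(answers):
    """Average the scores of all answers belonging to each IRAP step."""
    buckets = defaultdict(list)
    for answer in answers:
        if not 0.0 <= answer.score <= 1.0:
            raise ValueError(f"score out of range: {answer.score}")
        buckets[answer.question.step].append(answer.score)
    return {step: sum(scores) / len(scores) for step, scores in buckets.items()}


if __name__ == "__main__":
    # Made-up questions and scores, for illustration only.
    q1 = Question("water-01", Step.ISSUE, Format.MULTIPLE_CHOICE, "Which obligation is engaged?")
    q2 = Question("water-01", Step.ISSUE, Format.RANKING, "Rank the likely failure modes.")
    q3 = Question("water-01", Step.REMEDIES, Format.OPEN_RESPONSE, "Propose a remedy.")
    answers = [ScoredAnswer(q1, 1.0), ScoredAnswer(q2, 0.0), ScoredAnswer(q3, 0.5)]
    for step, acc in accuracy_by_step(answers).items():
        print(f"{step.value}: {acc:.2f}")
```

Keeping the step and the format as separate fields lets the same aggregation report accuracy per reasoning step, as the results table does, regardless of how each individual question was scored.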

Why a law-grounded benchmark, now

The timing is not academic. The Council of Europe’s Framework Convention on Artificial Intelligence names HUDERIA as its recommended human rights impact assessment, yet HUDERIA has no way to check whether the models being assessed can reason about the rights at stake. HumRights-Bench supplies that missing basis. It can equally inform the Fundamental Rights Impact Assessments required under Article 27 of the EU AI Act. In short, it turns “trustworthy AI” from a claim into something that can be measured.

Who it's for

Governments and international organizations deploying AI in consequential decisions — benefits adjudication, procurement, public service delivery — where failures in rights reasoning translate directly into rights violations.
Regulators and compliance officers — whether working under existing frameworks like the EU AI Act or HUDERIA, or seeking to get ahead of governance before frameworks arrive — who need structured, reproducible evidence about model capabilities.
Model and system developers who want a competitive edge on rights-critical tasks, and who need a rigorous, principled measure of performance before deployment obligations catch up with them.
Human rights professionals — litigators, treaty body experts, civil society monitors, and corporate due diligence practitioners — who need to know whether an LLM can be trusted before relying on it in their work.
AI evaluation scientists building the infrastructure for domain-specific benchmarking at the intersection of law, social impact, and alignment.

Releases

V0 (in progress)

The right to water, with six validated scenarios and around 100 expert-authored questions.

V1 (2026)

Further rights, beginning with due process and education, and a public leaderboard.

Who built it

HumRights-Bench is built by researchers from Hunter College, the Oxford Internet Institute (University of Oxford), King’s College London, Georgetown University, the University of Oslo, and AI & Equality by Women at the Table.

Presented as an accepted poster at CS&Law 2026.
Accepted to the ICML 2026 AI for Law workshop; camera-ready paper forthcoming.