We asked the world's most advanced AI models to reason about human rights. They failed.
GPT-5, Claude, and Gemini all performed near chance, and worst of all at the most basic task of human rights reasoning: recognising when a right has been violated at all.
Upcoming · 18 June 2026
We present HumRights-Bench to UN Member States at the Human Rights Council: Can AI Understand Human Rights Law? Convened with Globethics.
15:00–16:00 | Room IX, Palais des Nations, Geneva
Why this matters
These are not abstract systems. The same models already decide who qualifies for benefits, who gets hired, and what billions of people see online. These decisions determine whether human rights are realised or denied. Yet until now, no one had tested whether they can reason about human rights in the first place.
HumRights-Bench is the first benchmark grounded in international human rights law.
Expert-validated and scenario-based, it measures whether an AI system can do what the institutions deploying it assume it can: identify a rights violation, recall the law that applies, weigh it, and propose a remedy.
What we found
In our pilot, on the right to water, every frontier model we tested scored near chance, between 34 and 58 percent. They were weakest at issue identification: recognising that a right had been engaged at all. This is the foundational step on which every later stage of the analysis depends. A failure there cascades into the wrong law and the wrong remedy.
The pilot is small and the results are exploratory, but the signal is unambiguous: the models already making rights-critical decisions cannot yet reliably reason about rights.
HumRights-Bench makes that failure measurable, and visible to the developers, regulators, and institutions responsible for it.
How it works
We adapted IRAC, the framework used to train lawyers, into IRAP: Issue Identification, Rule Recall, Rule Application, and Proposed Remedies. Substituting remedies for a verdict reflects how human rights practice actually works.
Realistic scenarios are drawn from UN General Comments, Special Procedures reports, and leading jurisprudence, and each is validated by at least three human rights professionals. They are then used to test how a model reasons across all four steps, through multiple-choice, ranking, and open-response questions scored by state-of-the-art metrics.
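To make the idea of step-by-step scoring concrete, here is a minimal sketch in Python of how multiple-choice answers could be tallied per IRAP step and compared with random guessing. It is an illustration only, not the benchmark's own code: the `Item` schema, the stage labels, and the functions `stage_accuracy` and `chance_level` are all hypothetical, and it covers only multiple-choice questions, not ranking or open-response ones.

```python
from collections import defaultdict
from dataclasses import dataclass

# The four IRAP steps named in the text; these short labels are assumptions.
STAGES = ("issue", "rule_recall", "rule_application", "remedy")


@dataclass(frozen=True)
class Item:
    """One multiple-choice question tied to an IRAP step (hypothetical schema)."""
    scenario_id: str
    stage: str
    options: tuple[str, ...]
    correct: int  # index of the correct entry in `options`


def stage_accuracy(items, answers):
    """Return {stage: fraction correct}; `answers` holds one chosen index per item."""
    if len(items) != len(answers):
        raise ValueError("need exactly one answer per item")
    hits = defaultdict(int)
    totals = defaultdict(int)
    for item, chosen in zip(items, answers):
        totals[item.stage] += 1
        hits[item.stage] += int(chosen == item.correct)
    # Only report stages that actually have questions.
    return {s: hits[s] / totals[s] for s in STAGES if totals[s]}


def chance_level(items):
    """Expected accuracy of uniform random guessing over the given items."""
    return sum(1 / len(item.options) for item in items) / len(items)


# Toy items with placeholder options; not real benchmark content.
opts = ("A", "B", "C", "D")
items = [
    Item("s1", "issue", opts, 0),
    Item("s1", "rule_recall", opts, 0),
    Item("s1", "rule_application", opts, 2),
    Item("s1", "remedy", opts, 3),
]
answers = [1, 0, 2, 3]  # the model misses only the issue-identification item

print(stage_accuracy(items, answers))
print(chance_level(items))
```

Reporting accuracy per step rather than as one total is what would let a weakness at issue identification show up separately from performance on the later steps.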
Why a law-grounded benchmark, now
The timing is not academic. The Council of Europe’s Framework Convention on Artificial Intelligence names HUDERIA as its recommended human rights impact assessment, yet HUDERIA has no way to check whether the models being assessed can reason about the rights at stake. HumRights-Bench supplies that missing basis. It can equally inform the Fundamental Rights Impact Assessments required under Article 27 of the EU AI Act. In short, it turns “trustworthy AI” from a claim into something that can be measured.
Who it's for
Who built it
HumRights-Bench is built by researchers from Hunter College, the Oxford Internet Institute (University of Oxford), Georgetown University, the University of Oslo, and AI & Equality by Women at the Table. It was presented as an accepted poster at CS&Law 2026; the methodology is now under submission to ICML’s AI for Law (AI4Law) track.