HumRights-Bench

Blog

From Code to Compliance: We’re building the first benchmark to hold AI accountable to human rights law.

We put GPT-5, Claude, and Gemini to the test on human rights reasoning. They scored near chance, and stumbled hardest on the most basic task of all: recognising that a right had been violated. Here is what the numbers show, and why it should concern anyone deploying these systems in decisions that affect people’s lives.

How do you test whether a machine understands human rights law? We adapted the framework used to train lawyers, built realistic scenarios from UN jurisprudence, and had human rights professionals validate every one. A look inside how HumRights-Bench turns a legal question into a measurable score.

Grading a multiple-choice answer is easy. Grading whether a model proposed the right remedy for a rights violation is not. This is the hardest problem in benchmarking legal reasoning. Here is how we are solving it: by calibrating automated scoring against the judgement of human rights experts.

Most AI evaluation asks what a model should do, measured against aggregated human preference. Human rights law asks something harder and more exact: what a model must recognise, because the law requires it. The case for grounding the next generation of AI evaluation in obligation, not opinion.

We’re creating a global community that brings together individuals passionate about inclusive, human rights-based AI.

Join our AI & Equality community of students, academics, data scientists and AI practitioners who believe in responsible and fair AI.