A new benchmark, validated by human rights experts, for assessing internal representations of human rights principles in leading LLMs and LRMs
Overview
The HumRights-Benchmark evaluates how well LLMs/LRMs can:
Determine which human rights provisions may be most relevant, and which least relevant, in a given real-world scenario.
Propose remedies to mitigate specific human rights violations in a given real-world scenario. (A sketch of how both tasks might be posed to a model appears below.)
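As an illustration of these two tasks, here is a minimal, hypothetical sketch of how an evaluation harness might pose them to a model. The prompt wording and the query_model callable are assumptions made for this example, not the benchmark's actual prompts or code.

```python
# Hypothetical harness sketch for the two Overview tasks. The prompts and
# the `query_model` helper are illustrative assumptions, not the
# benchmark's actual implementation.

RELEVANCE_PROMPT = (
    "Scenario: {scenario}\n\n"
    "Rank the following UDHR provisions from most to least relevant "
    "to this scenario: {provisions}"
)

REMEDY_PROMPT = (
    "Scenario: {scenario}\n\n"
    "Propose concrete remedies to mitigate the human rights "
    "violations described above."
)

def evaluate_scenario(query_model, scenario: str, provisions: list) -> dict:
    """Pose both benchmark tasks for one scenario and collect raw responses."""
    return {
        "relevance": query_model(
            RELEVANCE_PROMPT.format(
                scenario=scenario, provisions=", ".join(provisions)
            )
        ),
        "remedies": query_model(REMEDY_PROMPT.format(scenario=scenario)),
    }
```

In practice, query_model would wrap whatever API serves the LLM or LRM being tested.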
Developed for:
Human rights professionals, to inform judgments about the strengths and limitations of LLMs they may wish to use in their workflows.
Model and system developers, to provide a measure of model performance on these critical tasks.
Why this matters
Large language models now influence decisions that shape people’s lives, from benefit eligibility and hiring to content moderation for billions of users. Yet no rigorous benchmark tests whether these systems understand basic human rights such as non-discrimination, due process, or access to essential resources. Without such measures, we risk automating rights violations at unprecedented scale.
About
Methods:
We adapt the IRAC (Issue, Rule, Application, Conclusion) legal reasoning framework, described in another LLM benchmark, LegalBench, to the unique tasks that human rights work entails.
We taxonomize the human rights problem space by typologies of obligation violations, perpetrators, implicated stakeholders, social contexts, and complex conditions (such as natural disasters, armed conflict, or the involvement of Indigenous peoples).
We create 20 complex metascenarios, each implicating a specific human right as enumerated in the UDHR and comprising 4-6 subscenarios that together allow full combinatorial coverage of each element in our taxonomy (see the sketch following this list).
We also create specific assessment heuristics to accompany each subscenario: multiple-choice, multiple-select, ranking, and open-response questions. These questions are posed to the LLM under evaluation, and its responses are scored with state-of-the-art metrics (illustrative scoring functions follow this list).
We validate every scenario and heuristic with at least 3 human rights professionals.
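To make the combinatorial structure above concrete, the following sketch enumerates the subscenario design space from hypothetical taxonomy axes. The axis names and values are placeholders (the benchmark's actual categories are described only at a high level here), and a real metascenario would draw its 4-6 subscenarios from this space, with full coverage presumably achieved across the 20 metascenarios rather than within any single one.

```python
import itertools
from dataclasses import dataclass

# Hypothetical taxonomy axes and values; the benchmark's real categories
# are only described at a high level in this section.
TAXONOMY = {
    "obligation_violation": ["respect", "protect", "fulfill"],
    "perpetrator": ["state actor", "private company"],
    "social_context": ["urban", "rural"],
    "complex_condition": ["natural disaster", "armed conflict", "none"],
}

@dataclass
class Subscenario:
    metascenario_id: int
    right: str        # the UDHR right the metascenario implicates
    attributes: dict  # one value per taxonomy axis

def subscenario_design_space(metascenario_id: int, right: str) -> list:
    """Enumerate every combination of taxonomy values for one metascenario."""
    axes = list(TAXONOMY)
    return [
        Subscenario(metascenario_id, right, dict(zip(axes, combo)))
        for combo in itertools.product(*TAXONOMY.values())
    ]

# Example: 3 * 2 * 2 * 3 = 36 candidate subscenarios for one metascenario.
candidates = subscenario_design_space(1, "UDHR Article 19: freedom of expression")
assert len(candidates) == 36
```

And a hedged sketch of scoring for three of the four question types. The metrics shown (exact match, Jaccard overlap, pairwise rank agreement) are illustrative stand-ins for the unspecified state-of-the-art metrics; open-response grading, which would need a rubric- or model-based judge, is omitted.

```python
import itertools

def score_multiple_choice(pred: str, gold: str) -> float:
    # Exact-match accuracy for a single multiple-choice answer.
    return float(pred == gold)

def score_multiple_select(pred: set, gold: set) -> float:
    # Jaccard overlap between selected and correct option sets:
    # 1.0 for a perfect match, 0.0 for disjoint selections.
    union = pred | gold
    return len(pred & gold) / len(union) if union else 1.0

def score_ranking(pred: list, gold: list) -> float:
    # Fraction of item pairs ordered the same way in both rankings.
    # Assumes pred and gold contain exactly the same items.
    position = {item: i for i, item in enumerate(pred)}
    pairs = list(itertools.combinations(range(len(gold)), 2))
    agree = sum(position[gold[a]] < position[gold[b]] for a, b in pairs)
    return agree / len(pairs) if pairs else 1.0

# Open-response questions would need a rubric- or model-based judge,
# which is beyond this sketch.
```

For instance, score_ranking(["A", "B", "C"], ["A", "C", "B"]) returns 2/3, since two of the three item pairs are ordered the same way in both rankings.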
Dataset and Planned Releases: