HumRights-Bench (v0)

A Benchmark for Human Rights

A new benchmark, validated by leading human rights experts, for assessing internal representations of human rights principles in state-of-the-art LLMs and LRMs.

Overview

HumRights-Bench evaluates how well LLMs and LRMs can:

Identify which types of human rights obligations are unmet in a given real-world scenario.

Recall specific provisions in human rights conventions, laws, or principles that are relevant to a given real-world scenario.

Determine which provisions may be most relevant, and which least, in a given real-world scenario.

Propose remedies to mitigate specific human rights violations in a given real-world scenario.
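
To make these four task types concrete, here is a minimal sketch of how a single evaluation item might look. Every field name, scenario description, and answer option below is an illustrative placeholder, not an item from the actual HumRights-Bench dataset.

```python
# Illustrative sketch of a single evaluation item covering the four task types above.
# Every field name, scenario description, and answer option is a hypothetical
# placeholder, not an item from the actual HumRights-Bench dataset.

example_item = {
    "subscenario": (
        "A municipal utility shuts off water to an informal settlement over "
        "unpaid bills, with no notice and no opportunity to contest the decision."
    ),
    "tasks": {
        "identify_obligation": {
            "format": "multiple_choice",
            "question": "Which type of human rights obligation is unmet here?",
            "options": ["respect", "protect", "fulfil", "none"],
        },
        "recall_provisions": {
            "format": "multiple_select",
            "question": "Which provisions are relevant to this scenario?",
            "options": ["ICESCR Art. 11", "ICESCR Art. 12", "UDHR Art. 25", "UDHR Art. 17"],
        },
        "rank_provisions": {
            "format": "ranking",
            "question": "Rank the provisions above from most to least relevant.",
        },
        "propose_remedies": {
            "format": "open_response",
            "question": "Propose remedies to mitigate the violations described.",
        },
    },
}
```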

Developed for:

Human rights professionals, to inform judgements about the strengths and limitations of LLMs they may wish to use in their workflow.

Model and system developers, to provide a measure of model performance on these critical tasks.

Why this matters

Large language models now influence decisions that shape people’s lives—from benefit eligibility and hiring to content moderation for billions. Yet no rigorous benchmark tests whether these systems understand basic human rights like non-discrimination, due process, or access to essential resources. Without such measures, we risk automating rights violations at unprecedented scale.

HumRights-Bench is the first expert-validated framework to evaluate LLMs against human rights principles.

Starting with the right to water and due process, it builds a scalable foundation to assess emerging models and guide responsible deployment—before harm occurs.

About

Methods:

We adapt the IRAC (Issue, Rule, Application, Conclusion) legal reasoning framework, also used in the LegalBench LLM benchmark, to the unique tasks human rights work entails.

We taxonomize the human rights problem space by typologies of violated obligations, perpetrators, implicated stakeholders, social contexts, and complicating conditions (such as natural disasters, armed conflict, or the involvement of Indigenous peoples).

We create 20 complex metascenarios (each comprising 4-6 subscenarios, allowing for full combinatorial coverage of each element in our taxonomy), implicating each specific human right enumerated in the UDHR.
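
To illustrate how the taxonomy and the combinatorial subscenario coverage described above could fit together, here is a minimal Python sketch. The axis names mirror the typology, but every axis value, group size, and helper name is a hypothetical placeholder rather than the benchmark's actual construction.

```python
from itertools import islice, product

# Hypothetical taxonomy axes; the axis names mirror the typology above,
# but the values are illustrative placeholders only.
TAXONOMY = {
    "violated_obligation": ["respect", "protect", "fulfil"],
    "perpetrator": ["state actor", "private company"],
    "implicated_stakeholders": ["children", "migrants", "rural communities"],
    "complicating_condition": ["natural disaster", "armed conflict", "Indigenous peoples involved"],
}

def enumerate_subscenarios(taxonomy):
    """Yield every combination of taxonomy values (full combinatorial coverage)."""
    axes = list(taxonomy)
    for values in product(*taxonomy.values()):
        yield dict(zip(axes, values))

def group_into_metascenarios(subscenarios, group_size=5):
    """Bundle subscenarios into metascenarios of a few related subscenarios each."""
    it = iter(subscenarios)
    while chunk := list(islice(it, group_size)):
        yield {"right": "right to water", "subscenarios": chunk}  # illustrative right

if __name__ == "__main__":
    metas = list(group_into_metascenarios(enumerate_subscenarios(TAXONOMY)))
    print(f"{len(metas)} metascenarios covering "
          f"{sum(len(m['subscenarios']) for m in metas)} subscenario combinations")
```

In practice each generated combination would be written up as a realistic narrative subscenario rather than left as a tuple of taxonomy values.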

We also create specific assessment heuristics to accompany each subscenario: multiple-choice, multiple-select, ranking, and open-response questions. These questions are designed to be posed to an LLM under evaluation, and responses are scored with state-of-the-art metrics.
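
The metrics themselves are not specified here, so the following is only an illustrative baseline scorer (every function name and example value below is an assumption): the closed-form formats can be compared against expert-validated references with simple reference-based functions, while open responses would typically require a rubric-guided human or LLM judge.

```python
from itertools import combinations

def score_multiple_choice(predicted: str, gold: str) -> float:
    """Exact-match accuracy for a single multiple-choice answer."""
    return float(predicted.strip().lower() == gold.strip().lower())

def score_multiple_select(predicted: set[str], gold: set[str]) -> float:
    """Jaccard overlap between the selected options and the reference set."""
    if not predicted and not gold:
        return 1.0
    return len(predicted & gold) / len(predicted | gold)

def score_ranking(predicted: list[str], gold: list[str]) -> float:
    """Fraction of item pairs ordered the same way as the reference ranking.

    Assumes the model ranks exactly the reference items, in some order.
    """
    pos_pred = {item: i for i, item in enumerate(predicted)}
    pos_gold = {item: i for i, item in enumerate(gold)}
    pairs = list(combinations(gold, 2))
    agree = sum(
        (pos_pred[a] < pos_pred[b]) == (pos_gold[a] < pos_gold[b]) for a, b in pairs
    )
    return agree / len(pairs)

# Open responses (proposed remedies) are free text; a simple string metric is a poor
# fit, so in practice they would be scored by a rubric-guided human or LLM judge.

if __name__ == "__main__":
    print(score_multiple_choice("protect", "Protect"))                            # 1.0
    print(score_multiple_select({"ICESCR Art. 11"},
                                {"ICESCR Art. 11", "UDHR Art. 25"}))              # 0.5
    print(score_ranking(["A", "C", "B"], ["A", "B", "C"]))                        # ~0.67
```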

We validate every scenario and heuristic with at least 3 human rights professionals.

Dataset and Planned Releases: