Scenario to Score: Inside the Methodology of the First Human Rights Benchmark

Creating a standard for AI accountability requires more than just a list of questions; it demands a “Gold Standard” methodology. Our Human Rights Benchmark Project is rooted in scientific rigor and deep domain expertise, ensuring we measure genuine legal understanding rather than statistical artifacts.

The Four Pillars of Construct Validity

Our design process, a collaboration between technical AI researchers and human rights experts, adheres to four critical principles:

  1. Real-World Fidelity: The tasks must reflect the actual work human rights professionals do—specifically, monitoring and reporting on potential violations—to ensure the benchmark is useful and relevant. We are not using simplified standardized tests (like a bar exam); we are simulating real-world claims.

  2. Domain Coverage (Taxonomy): We must capture the full range of human rights variation. Our taxonomy divides scenarios into:

    • Analytic Categories: The nature of the obligation violated (Respect, Protect, Fulfill) and the type of failure (Structural, Process, or Outcome).

    • Descriptive Categories: The actors involved (State, Company, NGO) and the specific rights holders affected (women, children, indigenous peoples, etc.). This also includes complex scenarios involving AI, conflict, or climate change.

  3. Measurability: Tasks must allow for objective scoring. We break down complex legal concepts into measurable sub-tasks, moving beyond difficult-to-evaluate open-ended text generation where possible.

  4. Appropriate Metrics: We develop nuanced scoring methods that accurately assess an LLM’s response while accommodating the uncertainty found in expert legal judgment, for example a slight tolerance in law-ranking tasks where experts may reasonably disagree (see the scoring sketch after this list).
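
To make the fourth pillar concrete, here is a minimal sketch of what a tolerance-aware ranking score could look like for the law-ranking task: the model gets credit for an instrument if it places it within one position of the expert ranking. The function name, the one-position tolerance, and the example instruments are illustrative assumptions, not the benchmark’s actual scoring rule.

```python
from typing import List

def tolerant_ranking_score(model_ranking: List[str],
                           expert_ranking: List[str],
                           tolerance: int = 1) -> float:
    """Fraction of instruments the model places within `tolerance`
    positions of the expert ranking (illustrative metric only)."""
    expert_pos = {law: i for i, law in enumerate(expert_ranking)}
    credited = sum(
        1 for i, law in enumerate(model_ranking)
        if law in expert_pos and abs(i - expert_pos[law]) <= tolerance
    )
    return credited / len(expert_ranking)

# Hypothetical example: two experts might reasonably swap the top two laws.
expert = ["ICESCR Art. 11", "ICESCR Art. 12", "CEDAW Art. 14"]
model  = ["ICESCR Art. 12", "ICESCR Art. 11", "CEDAW Art. 14"]
print(tolerant_ranking_score(model, expert))  # 1.0 -- the swap falls within tolerance
```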

The IRAC Methodology

To build challenging and realistic prompts, we use a three-step process informed by a modified version of IRAC (Issue, Rule, Application, Conclusion), the classic legal reasoning framework:

  • Scenario: A realistic situation informed by human rights law (e.g., lack of clean water in a settlement).

  • Sub-Scenario: A specific action or impact within the scenario that narrows the focus (e.g., policy failure, or disproportionate burden on women and children).

  • IRAC Prompts: Distinct question types applied to the sub-scenario, one for each step of the framework:

    • I (Issue Identification): A multiple-choice question identifying which obligation was violated (Respect, Protect, or Fulfill).

    • R (Rule Recall): Identifying the specific international law that applies.

    • A (Rule Application): Ranking multiple applicable laws by relevance.

    • C (Proposed Remedies): An open-ended task asking the model to suggest up to ten remedial actions.
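
As a minimal sketch of how a single benchmark item could be represented end to end (scenario, sub-scenario, taxonomy labels, and one IRAC-style prompt), consider the structure below. The field names and example content are illustrative assumptions; the project’s actual data format may differ.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class BenchmarkItem:
    # Scenario layer
    scenario: str                 # realistic situation grounded in human rights law
    sub_scenario: str             # narrowed action or impact within the scenario
    # Taxonomy labels (analytic and descriptive categories)
    obligation: str               # "Respect", "Protect", or "Fulfill"
    failure_type: str             # "Structural", "Process", or "Outcome"
    actors: List[str]             # e.g. ["State", "Company", "NGO"]
    rights_holders: List[str]     # e.g. ["women", "children"]
    # One IRAC-style prompt
    question_type: str            # "Issue", "Rule Recall", "Rule Application", "Remedies"
    prompt: str
    choices: Optional[List[str]] = None   # present for multiple-choice questions
    reference_answer: Optional[str] = None

# Hypothetical item built from the clean-water example above.
item = BenchmarkItem(
    scenario="Residents of an informal settlement lack access to clean water.",
    sub_scenario="The municipality has no plan to extend the water network, "
                 "placing a disproportionate burden on women and children.",
    obligation="Fulfill",
    failure_type="Structural",
    actors=["State"],
    rights_holders=["women", "children"],
    question_type="Issue",
    prompt="Which state obligation is most directly implicated here?",
    choices=["Respect", "Protect", "Fulfill"],
    reference_answer="Fulfill",
)
```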

Crucially, every scenario, question, and answer is validated by at least three human rights experts. This annotation step ensures we are measuring real human rights concepts and maintains the integrity and construct validity of the benchmark.
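
One simple way such multi-expert validation could be operationalised is a consensus check: an item enters the benchmark only if a clear majority of its (at least three) annotators agree on the reference answer. The function and thresholds below are an illustrative assumption, not a description of the project’s actual review workflow.

```python
from collections import Counter
from typing import List, Optional

def consensus_label(annotations: List[str],
                    min_annotators: int = 3,
                    min_agreement: float = 2 / 3) -> Optional[str]:
    """Return the majority label if enough annotators agree,
    otherwise None so the item can be sent back for review.
    Thresholds are illustrative."""
    if len(annotations) < min_annotators:
        return None
    label, count = Counter(annotations).most_common(1)[0]
    return label if count / len(annotations) >= min_agreement else None

print(consensus_label(["Fulfill", "Fulfill", "Protect"]))   # "Fulfill"
print(consensus_label(["Fulfill", "Protect", "Respect"]))   # None -> re-review
```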

We’re creating a global community that brings together individuals passionate about inclusive, human rights-based AI.

Join our AI & Equality community of students, academics, data scientists and AI practitioners who believe in responsible and fair AI.