HumRights-Bench


Scaling Accountability: The Challenge of Automating Expert-Level Legal Judgment for LLMs

The Human Rights Benchmark Project has validated its methodology and achieved its first critical goal: revealing the extent of the LLM knowledge gap concerning human rights. The next phase focuses on scaling this standard to establish it as the industry norm.


Expansion and Domain Coverage

Having successfully validated the methodology on the Right to Water, the project is now expanding its scope to cover additional critical areas:

  • Right to Due Process

  • Right to Health

  • Right to Social Security

  • Right to Privacy

  • Freedom from Discrimination

This scaling ensures the benchmark captures the full spectrum of human rights, from socioeconomic rights to civil and political rights, fulfilling the requirement of robust domain coverage.

The Scoring Challenge: From Human Expert to Automated Metric

The biggest ongoing challenge, and the key to widespread adoption, remains accurately and scalably scoring the open-ended Proposed Remedies question.

While the multiple-choice questions (IRA) can be graded automatically, the final open-ended question asks an LLM to generate a list of up to ten remedial actions, and scoring that list currently requires the nuanced judgment of multiple human rights experts.
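To make the contrast concrete, here is a minimal sketch of the two grading situations in Python. The function names and the exact-match rule are illustrative assumptions, not the benchmark's actual grading code.

```python
def grade_multiple_choice(predicted: str, gold: str) -> bool:
    """Grading a multiple-choice (IRA) item reduces to a simple string comparison."""
    return predicted.strip().upper() == gold.strip().upper()


def grade_proposed_remedies(scenario: str, remedies: list[str]) -> float:
    """Grading the open-ended remedies has no comparably simple rule.

    Each of the (up to ten) proposed remedies must be judged for specificity,
    legal grounding, and contextual appropriateness, which today means review
    by human rights experts.
    """
    raise NotImplementedError("requires expert-level judgment")
```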

The Dilemma:

  • Relying on human experts for scoring guarantees accuracy (construct validity) but significantly hurts adoption, as developers must hire teams of lawyers to use the benchmark.

  • Relying on another LLM or simple keywords for scoring is fast and scalable but threatens the validity of the measurement, potentially scoring an inaccurate or harmful suggestion as “correct.”

To solve this, we are developing automated metrics designed to approximate the nuance of human expert judgment. This involves building evaluation models that can assess the specificity, legal grounding, and contextual appropriateness of an LLM’s proposed remedy.
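As one way to picture what such an automated metric could look like, here is a minimal rubric-based sketch in Python. Everything in it is an assumption for illustration: the dimension wording, the equal weighting, and the `judge` callable, which stands in for whatever evaluation model (an LLM judge, a fine-tuned classifier) would ultimately fill that role.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Rubric dimensions named above; the wording and equal weighting are illustrative assumptions.
RUBRIC: Dict[str, str] = {
    "specificity": "Does the remedy name concrete actions, responsible actors, and timelines?",
    "legal_grounding": "Is the remedy grounded in applicable human rights law and standards?",
    "contextual_appropriateness": "Is the remedy feasible and appropriate for the scenario described?",
}


@dataclass
class RemedyScore:
    remedy: str
    per_dimension: Dict[str, float]  # dimension -> score in [0, 1]

    @property
    def overall(self) -> float:
        # Unweighted mean across dimensions (an assumption, not the project's metric).
        return sum(self.per_dimension.values()) / len(self.per_dimension)


def score_remedies(
    scenario: str,
    remedies: List[str],
    judge: Callable[[str], float],  # hypothetical: maps a grading prompt to a score in [0, 1]
) -> List[RemedyScore]:
    """Score each proposed remedy along the rubric dimensions."""
    results: List[RemedyScore] = []
    for remedy in remedies:
        per_dim: Dict[str, float] = {}
        for dim, question in RUBRIC.items():
            prompt = (
                f"Scenario: {scenario}\n"
                f"Proposed remedy: {remedy}\n"
                f"Rubric question ({dim}): {question}\n"
                "Reply with a single score between 0 and 1."
            )
            per_dim[dim] = judge(prompt)
        results.append(RemedyScore(remedy=remedy, per_dimension=per_dim))
    return results
```

The design choice worth noting in this sketch is that each remedy is scored per dimension rather than with a single overall grade, which keeps the automated metric auditable against the expert rubric it is meant to approximate.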

The successful automation of this scoring step is crucial. It will allow us to offer the Human Rights Benchmark as an accessible, plug-and-play standard for the entire AI research and development community. By encouraging its adoption, we aim to move beyond simply talking about ethics and start measurably holding LLMs accountable for their societal impact.

We’re creating a global community that brings together individuals passionate about inclusive, human rights-based AI.

Join our AI & Equality community of students, academics, data scientists and AI practitioners who believe in responsible and fair AI.