How we built CoPE

This paper describes how we taught our language model to accurately read content policies.

We just published the methodology behind CoPE. This is the model that powers Zentropi, and we think the approach might be useful for others working on policy-steerable classification systems.

We had already open-sourced the model itself, but the more significant contribution here might be the technique. The paper describes how we trained CoPE - the methodology that others can use to build similar systems for their own needs.

Briefly put: we trained a 9-billion-parameter model that matches GPT-4o at content moderation at roughly 1% of the size. This paper gets into the details of how we did that.

The Problem We Were Trying to Solve

Content classification has a dependency problem. When policies change - and they change constantly, in response to new harms, regulatory requirements, or community needs - existing tools require retraining. Organizations articulate new content standards, then wait months for data collection and model updates. The enforcement system is always behind the policy.

This happens because traditional classifiers learn patterns from labeled examples. They learn what hate speech "looks like" based on the training data, not what a specific policy actually says. Change the policy, and you need new training data that reflects the new definitions.

We wanted a model that could take any policy as input and evaluate content against it directly - no retraining required.

Contradictory Example Training

The core technique is what we call Contradictory Example Training. We show the model the same content with different policies that produce opposite labels.

For example, consider a social media post that includes a slur used in a reclaimed, in-group context. Under a strict policy that prohibits all slur usage regardless of context, this violates. Under a policy that permits reclaimed usage by in-group members, it doesn't. Same content, opposite correct answers - depending entirely on what the policy says.

By training on both cases, we create an environment where the only way for the model to determine the correct label is to pay close attention to the details of the policy. Pattern matching won't work. Cultural heuristics won't work. The model has to actually read.
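To make the idea concrete, here is a minimal sketch of how a contradictory training pair might be assembled. The policy wording, field names, labels, and prompt template are illustrative assumptions, not the exact format we used to train CoPE.

```python
# Illustrative sketch of a contradictory example training pair.
# Policy wording, labels, and prompt format are hypothetical.

POLICY_STRICT = (
    "Slur Policy A: Any use of a slur violates this policy, "
    "regardless of speaker identity or intent."
)

POLICY_RECLAIMED_OK = (
    "Slur Policy B: Slurs violate this policy unless used in a "
    "reclaimed, in-group context by members of the targeted group."
)

post = "Example post in which an in-group member uses a reclaimed slur."

# The same content appears twice in the training set with opposite labels.
# Only the policy text distinguishes the two examples, so the model must
# read the policy to predict correctly.
training_examples = [
    {"policy": POLICY_STRICT, "content": post, "label": "VIOLATES"},
    {"policy": POLICY_RECLAIMED_OK, "content": post, "label": "ALLOWED"},
]

def to_prompt(example: dict) -> str:
    """Format one example as a prompt/target pair for supervised fine-tuning."""
    return (
        f"POLICY:\n{example['policy']}\n\n"
        f"CONTENT:\n{example['content']}\n\n"
        f"Does the content violate the policy? Answer: {example['label']}"
    )

for ex in training_examples:
    print(to_prompt(ex))
    print("---")
```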

The paper goes deeper into the theory here, including why we believe this approach generalizes to policies the model never saw during training.

Building the Dataset: Binocular Labeling

The paper goes deep into methodology, including how we created the training data. This is the nerdy stuff - how do you build a sufficiently high-quality dataset where the same content has contradictory but correct labels under different policies?

The challenge is that contradictory training requires deterministic policies - policies where independent readers would reach identical conclusions when applying them to the same content. Without this consistency, the model could succeed through guesswork rather than policy interpretation. If humans can't agree on what a policy means, we can't train a model to follow it.

To build the dataset, we used LLM-assisted labeling with a technique we call binocular labeling:

  1. We draft an initial policy and use an LLM to generate a semantically equivalent but linguistically distinct alternative version
  2. We run the same content through both policy versions using an LLM-based labeling system
  3. We only manually review the mismatches - cases where the two versions produced different labels
  4. Based on those reviews, we refine the policy language and repeat until the two versions achieve high agreement

This approach dramatically reduces the manual labeling burden. Instead of reviewing every example, we focus human attention on the ambiguous cases that reveal where policy language needs clarification.
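As a rough sketch of that loop, the Python below runs each item through two policy versions and surfaces only the disagreements for human review. The `label_with_llm` callable, the data structures, and the agreement threshold are stand-ins for whatever labeling system and stopping criterion you use; they are assumptions, not our production pipeline.

```python
# Sketch of binocular labeling. `label_with_llm` is a placeholder for an
# LLM-based labeling call; thresholds and structures are illustrative.

from typing import Callable, List, Tuple

def binocular_label(
    items: List[str],
    policy_a: str,
    policy_b: str,
    label_with_llm: Callable[[str, str], str],
) -> Tuple[List[Tuple[str, str]], List[str], float]:
    """Label items under two equivalent policy wordings and collect mismatches."""
    agreed, mismatches = [], []
    for content in items:
        label_a = label_with_llm(policy_a, content)
        label_b = label_with_llm(policy_b, content)
        if label_a == label_b:
            agreed.append((content, label_a))
        else:
            # Only these cases go to human review; they usually point to
            # ambiguous policy language rather than labeler noise.
            mismatches.append(content)
    agreement = len(agreed) / max(len(items), 1)
    return agreed, mismatches, agreement

# Outer loop (hypothetical helper shown as a comment): review mismatches,
# tighten the policy wording, and repeat until the versions nearly always agree.
# while agreement < 0.98:
#     agreed, mismatches, agreement = binocular_label(items, policy_a, policy_b, label_with_llm)
#     policy_a, policy_b = revise_policies_after_review(mismatches)
```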

Results

CoPE achieves 91% F1 on hate speech compared to GPT-4o's 87%, with sub-200ms latency on a single consumer GPU. We tested across seven harm categories: hate, sexual content, violence, harassment, self-harm, drugs, and toxicity.

The model runs at roughly 1% of GPT-4o's parameter count, making it practical to deploy at scale without frontier model inference costs. The paper includes detailed benchmark comparisons against other open models including LlamaGuard and ShieldGemma.
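If you want to try the open-weights model, inference looks like ordinary causal-LM generation with the policy and the content in the prompt. The model identifier and prompt template below are placeholders, not confirmed specifics - check the released model card for the exact format.

```python
# Sketch of policy-steerable inference with the open model via Hugging Face
# transformers. MODEL_ID and the prompt template are placeholder assumptions.

from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "zentropi-ai/cope-model"  # placeholder identifier

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

policy = (
    "Hate Speech Policy: content that attacks people on the basis of a "
    "protected attribute violates this policy."
)
content = "Some user-generated post to evaluate."

prompt = (
    f"POLICY:\n{policy}\n\n"
    f"CONTENT:\n{content}\n\n"
    "Does the content violate the policy? Answer:"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=5)

# Print only the newly generated tokens (the model's verdict).
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```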

Open Research Problems

We should be clear about what's still hard. A few areas where we'd welcome collaboration:

Evaluation is genuinely difficult. Most public benchmarks don't disclose the labeling guidelines given to raters, making it hard to know whether you're measuring policy interpretation or agreement with unstated cultural assumptions. We had to build our own evaluation framework with held-out policies, but the field needs better shared benchmarks.

Deterministic policies are a constraint. The methodology requires policies where humans can achieve high agreement. Highly subjective categories - "this feels creepy" - may not meet this threshold. We don't yet know how to extend the approach to inherently ambiguous domains.

Multilingual performance remains untested. Our current evaluation focuses on English. The base model supports other languages, but we haven't validated performance, and policy interpretation may present different challenges across linguistic and cultural contexts.

This Powers Zentropi

This is the methodology behind our product. If you want to see CoPE in action, Zentropi offers custom content labeling at scale - you define the policy, and we help you refine it into a machine-interpretable version so you can label content accurately. Learn more at https://zentropi.ai

Read the full paper here: https://arxiv.org/abs/2512.18027

Questions about the methodology? Reach out at info@zentropi.ai.
