Observations on Toxicity
Earlier this week, we launched seven publicly sharable content policies on Zentropi - harassment, hate, violence, self-harm, sexual content, drugs, and toxicity.
The toxic content policy, in particular, is worth examining in detail. It not only illustrates a conceptual problem in typical approaches to content classification, but offers an alternative methodology that we hope could be useful for others.
The Usual Approach: Predicting Outcomes
The common definition of toxicity used in the industry is something like "content that is likely to make people leave a discussion".
This is an outcome-centered definition, which defines "toxic content" by the *impact* it has, not the *rhetoric* it uses. It is a very useful idea for thinking about product design, feed ranking, and even policy-design goals. But on its own, it is not a usable idea for classifying content.
The problem is that a policy built on this definition delegates the central question (what drives users away?) to the person applying it rather than answering it for them. Every time they review content, moderators must predict "will this make users leave?" based on their own intuitions.
Consider: "This is a terrible approach that completely misses the point. We need to start over."
One moderator predicts this will make the recipient defensive and leave. Another thinks it's expected critical feedback in a professional context. Both predictions are defensible. Neither is obviously wrong. And there's no way to determine which is correct - you're asking them to predict unknowable future reactions of hypothetical users.
Without community-specific behavioral data, moderators must speculate based on their own assumptions. Two moderators can make different predictions about the same content, both reasonable, with no mechanism to resolve the disagreement. Quality control becomes difficult, consistency suffers, and the core purpose of a policy - enabling moderators to make the same call without debating every case - goes unmet.
Observable Features: Defining Specific Content Characteristics
An alternative approach: the policy defines specific, observable content features that moderators can assess without predicting reactions.
Instead of asking "will this make users leave?", the policy answers that question by identifying concrete features - specific language patterns, targeting requirements, contextual considerations - and asks moderators only to assess whether those features are present in the content being reviewed.
This shifts the hard work from moderators to the policy itself. The policy translates behavioral goals ("reduce content that drives users away") into observable content characteristics ("identify hostile targeting of participants using specific patterns"). Moderators then apply those defined characteristics rather than developing their own predictions about user behavior.
A useful policy should allow different observers to separately reach the same conclusion when presented with the same facts. Put another way, the purpose of a policy is to keep everyone on the same page. Achieving this purpose places limits on what a policy can ask moderators to do - outcome prediction requires speculation, while observable features enable consensus.
For example, the policy might specify that content targeting conversation participants is treated differently from content targeting public figures, establish specific rhetorical patterns that constitute hostile language, or create explicit exclusions for legitimate criticism. The key is that these determinations happen once, in the policy, rather than case-by-case in moderators' heads.
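To make this concrete, here is a minimal sketch of what such a policy could look like in code, assuming we reduce it to three defined features. The structure and names are ours for illustration, not Zentropi's actual policy format. The moderator's only job is to record whether each feature is present; the decision rule is fixed by the policy:

```python
# A hypothetical sketch of an observable-features policy. The structure and
# field names are ours for illustration, not Zentropi's actual policy format.
from dataclasses import dataclass

@dataclass
class Assessment:
    """What a moderator records: whether the defined features are present."""
    targets_participant: bool     # directed at someone in the conversation, not a public figure?
    uses_hostile_pattern: bool    # matches a defined pattern (attack, belittling, condescension)?
    criticism_of_work_only: bool  # explicit exclusion: aimed at work product, not the person?

def label(assessment: Assessment) -> str:
    """The decision logic lives in the policy, not in the moderator's head."""
    if assessment.criticism_of_work_only:
        return "not_toxic"
    if assessment.targets_participant and assessment.uses_hostile_pattern:
        return "toxic"
    return "not_toxic"
```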
Our Approach to Toxicity
For our toxicity policy, we answered "what rhetoric drives users away?" with hostile targeting of conversation participants. We define specific patterns - combative language, belittling, personal attacks, condescension - that count only when directed at people in the conversation, not at public figures or other third parties. We also created explicit exclusions for legitimate criticism, so strong language about work product doesn't get conflated with hostile language about people.
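Continuing the illustrative sketch above, the exclusion is what separates the ambiguous example from earlier in the post from a genuinely hostile rewording of it (the second message is ours, invented for contrast):

```python
# "This is a terrible approach that completely misses the point. We need to start over."
# Strong language, but the target is the approach, not the person.
work_criticism = Assessment(
    targets_participant=True,       # said to the author in the conversation
    uses_hostile_pattern=False,     # no attack, belittling, or condescension toward the person
    criticism_of_work_only=True,    # the "terrible" thing is the work product
)

# "You're clueless and completely missing the point. Stop wasting our time."
# The same sentiment, redirected at the participant.
personal_attack = Assessment(
    targets_participant=True,
    uses_hostile_pattern=True,      # belittling / personal attack
    criticism_of_work_only=False,
)

assert label(work_criticism) == "not_toxic"
assert label(personal_attack) == "toxic"
```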
It's worth acknowledging that this definition means our labeler will perform differently on existing toxicity datasets - by design. We're measuring something specific (hostile targeting of participants) rather than trying to predict general "toxicity," so benchmark performance will reflect that definitional difference.
But that difference is the entire point. Our policy is simply *our* answer to the question "what content is toxic?" If you believe different rhetoric is likely to make people leave conversations, you can fork our toxicity labeler and adapt the definition - add different patterns, remove the participant-targeting restriction, adjust the exclusions. The key is that you're defining observable rhetoric, not asking moderators to predict outcomes. The methodology is what matters, not our specific choices.
Operational Benefits
When your policy defines observable features rather than asking moderators to predict reactions, three things change:
1. Quality control becomes possible - you can audit whether moderators correctly identified the defined features, with the ground truth sitting in the policy text rather than in competing predictions about unknowable behavior.
2. Rule changes become predictable - you can identify exactly which content would change if you add a distinction between participant and non-participant targeting, test edge cases, and debate outcomes before implementing (see the sketch after this list), rather than running experiments to see whether wording shifts moderator intuitions.
3. Consistency becomes achievable - moderators can debate the interpretation of rule text and reach consensus on observable features, rather than each developing their own answer about predicted outcomes based on personal intuitions.
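As a sketch of point 2, and assuming you have a fixed corpus of already-reviewed content and some way to run a labeler (human or model) under both the old and new policy text, measuring the impact of a rule change reduces to a diff. The function names here are placeholders:

```python
def diff_policies(corpus, label_old, label_new):
    """Return items whose label would change under the new policy,
    given two labeling functions run over the same fixed corpus."""
    changes = []
    for item in corpus:
        before, after = label_old(item), label_new(item)
        if before != after:
            changes.append((item, before, after))
    return changes

# Hypothetical usage: adding the participant/non-participant distinction should
# flip only items that target public figures rather than conversation participants.
# changed = diff_policies(review_queue, label_without_distinction, label_with_distinction)
# for item, before, after in changed:
#     print(f"{before} -> {after}: {item[:60]}")
```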
Try It Yourself
Check out the toxicity labeler (toxicity-public-s5) on Zentropi, which you can integrate with your platform instantly using our free API. Browse the full policy to see how defining observable features creates different outcomes than behavioral prediction. All seven policies are available to browse and fork for your specific community context.
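If you want to wire the labeler into a pipeline, the call shape will look something like the sketch below. The endpoint URL, auth scheme, and request fields here are placeholders we made up, so refer to the Zentropi API documentation for the real interface:

```python
# Hypothetical sketch of calling a hosted labeler over HTTP. The URL, auth
# header, and request fields are illustrative placeholders, not the documented
# Zentropi API; check the API docs for the actual interface.
import os
import requests

API_KEY = os.environ["ZENTROPI_API_KEY"]  # assumed to be set in your environment

def label_content(text: str) -> dict:
    response = requests.post(
        "https://api.zentropi.example/v1/label",   # placeholder endpoint
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"labeler": "toxicity-public-s5", "content": text},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()

print(label_content("This is a terrible approach that completely misses the point."))
```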