Enabling Streaming Classification
CoPE now enables streaming classification using a linear probe. This is an experimental methodology we developed to help ensure the safety of real-time generative AI systems.
Content classifiers today work on complete text. You hand them a finished document, a completed comment, a fully generated response — and they tell you whether it violates a policy. This made sense when content was written by humans and published in discrete units.
Generative AI changed the equation. An LLM generates content token by token, and the user sees each token as it arrives. A classifier that waits for the full output to score it is, by definition, too late — the user has already read everything the model produced. If the output violates a policy, the damage is done before the classifier even runs.
We need classifiers that can score content as it streams, token by token, and raise an alarm before the sequence is complete. We've been working on a technique that does exactly this, and we're publishing our methodology openly today so others can use and improve upon it. Download it from Hugging Face: zentropi-ai/cope-a-9b-stream-probe
The problem with post-hoc classification
CoPE, our policy-conditioned content evaluator, works by processing the full content alongside a policy and producing a binary verdict: does this content adhere to the policy criteria? It's extremely accurate, but it requires the complete content before it can answer.
In a streaming context, this creates an uncomfortable gap. An LLM generating a response might produce 1000 tokens before the violation occurs. A post-hoc classifier catches it, but only after a long delay or after the user has seen the bulk of it. What we want is a system that can flag the violation as it happens — or ideally, a few tokens before the full picture becomes clear.
Hidden states already know
The key insight behind our approach is that CoPE's internal representations — the hidden state vectors produced at each token position — already encode information about whether a violation is developing, well before the model reaches its final verdict.
This makes intuitive sense. A language model doesn't wait until the last token to "understand" the text. By the time it's processed "I can't stand my coworker Bob anymore. He is genuinely the most incompetent person I have ever worked with," the model's hidden states already reflect that degrading language is present — even though dozens of tokens remain before the ANSWER position where CoPE would normally render its judgment.
We exploit this by training a lightweight linear probe — a simple logistic regression — on these intermediate hidden states. The probe takes a single hidden state vector (3,584 dimensions from CoPE's final layer norm) and produces a score between 0 and 1 indicating the probability at that point in the sequence that a policy violation will occur by the end of the content sample.
The probe is four numpy arrays totaling a few megabytes. Inference is a simple dot product and a sigmoid — making this method essentially free compared to the cost of the forward pass that produces the hidden states.
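As a sketch, scoring one hidden state with such a probe could look like the following. The weight vector `w` and bias `b` here are random stand-ins for illustration, not the released probe's parameters:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Stand-in probe parameters; the released probe ships its own numpy arrays.
rng = np.random.default_rng(0)
w = rng.normal(scale=0.01, size=3584)  # one weight per hidden dimension
b = 0.0

def probe_score(hidden_state):
    """Score one 3,584-dim hidden state: a dot product and a sigmoid."""
    return float(sigmoid(hidden_state @ w + b))

h = rng.normal(size=3584)  # stand-in for a real CoPE hidden state
score = probe_score(h)
assert 0.0 <= score <= 1.0
```

Because the expensive part (the forward pass that produces `h`) happens anyway, this adds effectively nothing to inference cost.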
Training the probe to activate at the right moment
A naive approach would label every content token with the sample's final label: if the content ultimately violates the policy, label every token 1. This produces a probe that fires from the first token onward — before any violation has appeared in the content. It's useless for streaming because it triggers on the policy prefix, not the content.
Our solution is span labeling. For each positive training example, we annotate where in the content the violation begins — the "onset" position. Tokens before the onset are labeled 0; tokens from the onset onward are labeled 1. Negative examples have all tokens labeled 0.
This teaches the probe to activate only when it sees violating content, not merely because the policy is strict. We annotated onset positions for ~10,000 positive training examples, with an estimated 98.5% word-level matching accuracy.
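A minimal sketch of this labeling scheme (the function name is ours, not from the released training code):

```python
def span_labels(n_tokens, onset=None):
    """Per-token training labels for one example.

    Positive examples: 0 before the violation onset, 1 from onset onward.
    Negative examples (onset=None): all zeros.
    """
    if onset is None:
        return [0] * n_tokens
    return [0] * onset + [1] * (n_tokens - onset)

assert span_labels(5, onset=2) == [0, 0, 1, 1, 1]  # violation begins at token 2
assert span_labels(3) == [0, 0, 0]                 # negative example
```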
Contrastive training data
A content classifier can learn a dangerous shortcut: instead of learning to evaluate content against a policy, it can learn that certain topics are always violations. A probe trained on "self-harm content is always bad" will score high on any mention of self-harm regardless of the policy.
To prevent this, we constructed a contrastive training set (as detailed in our CoPE paper) with two guarantees:
- Every piece of content appears in the dataset with both a positive and negative label (under different policies)
- Every policy appears in the dataset with both positive and negative examples
This forces the probe to learn policy-conditioned features — it cannot succeed by pattern-matching on content alone or policy alone. This produced roughly 10,000 high-quality contrastive examples that we used to train the probe.
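The two guarantees are easy to check mechanically. A sketch with hypothetical dataset rows (the IDs are illustrative, not from the actual training set):

```python
from collections import defaultdict

# Hypothetical rows: (content_id, policy_id, label)
rows = [
    ("c1", "p_selfharm", 1),
    ("c1", "p_violence", 0),
    ("c2", "p_selfharm", 0),
    ("c2", "p_violence", 1),
]

content_labels = defaultdict(set)
policy_labels = defaultdict(set)
for content, policy, label in rows:
    content_labels[content].add(label)
    policy_labels[policy].add(label)

# Guarantee 1: every content appears with both a positive and a negative label.
assert all(labels == {0, 1} for labels in content_labels.values())
# Guarantee 2: every policy appears with both positive and negative examples.
assert all(labels == {0, 1} for labels in policy_labels.values())
```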
The probe in action
Here's what this looks like in practice. The content below is a workplace vent containing personal insults, evaluated against a policy that flags degrading language directed at individuals. The score column is the raw probe output at each token; the ema column is an exponential moving average (decay=0.3) that smooths the noisy per-token signal into something usable for streaming decisions.
Policy: "Flag content that contains personal insults, name-calling,
or degrading language directed at specific individuals."
# token score ema
─────────────────────────────────────────────────
1 I 0.001 0.001
2 can 0.001 0.001
3 ' 0.014 0.005
4 t 0.000 0.004
5 stand 0.000 0.002
6 my 0.000 0.001
7 coworker 0.000 0.001
8 Bob 0.012 0.005
9 anymore 0.086 0.033
10 . 0.000 0.022
11 He 0.216 0.090
12 is 0.000 0.058
13 genuinely 0.001 0.038
14 the 0.000 0.025
15 most 0.001 0.017
16 incompetent 0.000 0.011
17 person 0.551 0.200 ◀ first spike
18 I 0.024 0.138
19 have 0.293 0.192
20 ever 0.008 0.128
21 worked 0.836 0.376 ◀
22 with 0.083 0.273
23 . 0.001 0.178
24 Every 0.351 0.239
25 single 0.343 0.275
26 project 0.047 0.195
27 he 0.991 0.474 ◀
28 touches 0.015 0.313
29 turns 0.273 0.299
30 into 0.091 0.226
31 a 0.771 0.417 ◀
32 complete 0.745 0.532 ◀ EMA crosses 0.5
33 disaster 0.159 0.401
34 . 0.007 0.263
35 The 0.002 0.172
36 whole 0.039 0.125
37 team 0.965 0.419 ◀
38 thinks 0.032 0.283
39 he 0.698 0.429
40 ' 0.001 0.279
41 s 0.449 0.338
42 useless 0.070 0.245
43 and 0.907 0.476 ◀
44 honestly 0.651 0.537
45 he 0.327 0.464
46 should 0.380 0.434
47 be 0.593 0.490
48 embarrassed 0.059 0.339
49 to 0.292 0.323
50 even 0.070 0.234
51 show 0.373 0.283
52 his 0.005 0.186
53 face 0.607 0.333
54 at 0.001 0.217
55 meetings 0.001 0.141
56 . 0.070 0.116
57 What 0.815 0.361 ◀
58 a 0.055 0.254
59 pathetic 0.019 0.171
60 excuse 0.730 0.367
61 for 0.567 0.437
62 a 0.023 0.292
63 professional 0.894 0.503 ◀
64 . 0.450 0.485
─────────────────────────────────────────────────
ANS <answer token> 1.000 (cope=1.000)
A few things to notice:
- The probe is quiet at the start. Tokens 1–8 ("I can't stand my coworker Bob") all score near zero. Nothing in the content has violated the policy yet — this is neutral preamble, and the probe correctly treats it as such.
- It spikes on the right token regions. "complete disaster", "he's useless...", and "pathetic excuse for a professional" — these are the token regions where the model's hidden states reflect accumulating evidence of degrading language.
- The EMA smooths the signal. It first crosses 0.5 at token 32 ("complete"), halfway through the 64-token sequence. A streaming system monitoring the EMA would flag this content at that point — 32 tokens before the model's final verdict.
- The ANSWER token is definitive. At the end, both the probe and CoPE score 1.000. The streaming signal gave early warning; the post-hoc signal confirms it.

When the same content is evaluated against an irrelevant policy ("Flag content that contains explicit threats of physical violence"), both the per-token scores and the EMA stay near zero throughout, and the ANSWER token scores 0.000. The probe learned to condition on the policy, not just the content. Please see this tutorial notebook to run the full example.
A streaming classifier is inherently less confident
You can see this directly in the table above: the per-token scores are spiky and less decisive than the final ANSWER token. This is worth stating plainly — a streaming classifier will always be less confident than a post-hoc one. It has to be. It doesn't know what's coming next.
Consider a sentence that begins "I'm going to..." — the next word could be "help" or "hurt." A streaming classifier at this token position genuinely cannot know the final label because the information doesn't exist yet. To make matters more complex, content can swing from seemingly violating back to benign (e.g., hate speech that turns out to be quoted rather than direct). A post-hoc classifier that sees the full sentence has no such ambiguity.
Our probe reflects this. At the ANSWER token position (where the full content has been processed), the probe achieves an F1 score that essentially matches CoPE's. But at intermediate positions, per-token scores are spiky and less decisive. This is correct behavior, not a limitation.
The practical consequence is that streaming thresholds require more careful, policy-specific calibration than post-hoc thresholds. A threshold of 0.5 might be appropriate for one policy but too aggressive or too conservative for another. Production deployments should calibrate thresholds on held-out data for each policy, rather than relying on a single universal threshold.
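One simple way to do that calibration, sketched here as a grid search that maximizes F1 on held-out samples (the function and approach are illustrative, not a method from the paper):

```python
import numpy as np

def calibrate_threshold(ema_scores, labels, grid=None):
    """Pick the decision threshold maximizing F1 on held-out data.

    ema_scores: peak EMA per held-out sample; labels: final 0/1 verdicts.
    Run this once per policy rather than reusing a universal 0.5.
    """
    if grid is None:
        grid = np.linspace(0.05, 0.95, 19)
    best_t, best_f1 = 0.5, -1.0
    for t in grid:
        preds = [int(s >= t) for s in ema_scores]
        tp = sum(p and y for p, y in zip(preds, labels))
        fp = sum(p and not y for p, y in zip(preds, labels))
        fn = sum((not p) and y for p, y in zip(preds, labels))
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1

# Toy held-out set: clean samples peak low, violating samples peak high.
t, f1 = calibrate_threshold([0.1, 0.2, 0.7, 0.9], [0, 0, 1, 1])
assert f1 == 1.0 and 0.2 < t < 0.7
```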
Aggregation matters
Raw per-token probe scores are noisy (a token might score 0.95 while the next scores 0.01), so producing a usable streaming signal requires an aggregation strategy.
We recommend an exponential moving average (EMA) with a decay factor of ~0.3. The EMA responds quickly to bursts of high-scoring tokens (which correspond to violating content) and decays naturally when the probe stops firing. With a 0.3 decay, the EMA roughly reflects what the probe has been seeing in the last few tokens — recent enough to catch violations promptly, smooth enough to avoid false alarms from isolated spikes.
A running mean is the simpler alternative, but it has a structural weakness: early low-scoring tokens permanently drag the average down. For content where violations are interspersed with neutral tokens (which is common — not every word in an insult is itself insulting), the running mean may never cross a decision threshold even when the probe is clearly detecting violations.
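A sketch of the EMA update, initialized to the first score (which appears consistent with the table above):

```python
def ema_stream(scores, decay=0.3):
    """Smooth per-token probe scores with an exponential moving average.

    ema_t = decay * score_t + (1 - decay) * ema_{t-1},
    initialized to the first score.
    """
    out, ema = [], None
    for s in scores:
        ema = s if ema is None else decay * s + (1 - decay) * ema
        out.append(ema)
    return out

# First three tokens from the example above: 0.001, 0.001, 0.014
ema = ema_stream([0.001, 0.001, 0.014])
assert abs(ema[2] - 0.0049) < 1e-6  # matches the table's 0.005
```

The `decay` parameter trades responsiveness for smoothness: higher values react faster to bursts of violating tokens but pass more isolated spikes through.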
Early warning, not final verdict
We want to be precise about the current value of streaming classification. It is not a replacement for post-hoc scoring — it is an early warning system.
In a production setting, this means:
- During generation: monitor the EMA of probe scores. If it crosses a policy-specific threshold, you have a strong signal that a violation is developing. You can interrupt generation, flag for review, escalate to more powerful models, or take other action — critically, before the user sees the complete violating output.
- After generation: run the full post-hoc score (the probe at the ANSWER token position, or CoPE directly) for a definitive classification. This remains the most accurate signal.
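The during-generation half might look like this sketch, where `probe_score` and a calibrated `threshold` are assumed to come from the steps above:

```python
def monitor_stream(hidden_states, probe_score, threshold, decay=0.3):
    """Watch the EMA of probe scores during generation.

    Returns ("flagged", i) at the first token where the EMA crosses the
    policy-specific threshold, so generation can be interrupted there;
    otherwise ("clean", n) after the full sequence.
    """
    ema = None
    for i, h in enumerate(hidden_states):
        s = probe_score(h)
        ema = s if ema is None else decay * s + (1 - decay) * ema
        if ema >= threshold:
            return ("flagged", i)  # interrupt, flag for review, escalate
    return ("clean", len(hidden_states))

# With an identity "probe", a burst of high scores trips the monitor:
assert monitor_stream([0.0, 0.9, 0.9, 0.9], lambda s: s, 0.5) == ("flagged", 3)
assert monitor_stream([0.1, 0.1], lambda s: s, 0.5) == ("clean", 2)
```

After the stream ends, the definitive post-hoc score still runs as described above; the monitor only buys the early warning.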
The streaming signal is most valuable when the cost of late detection is high — real-time chat, voice interfaces, agentic systems that take actions based on LLM output. In these contexts, catching a violation 30 tokens early is worth more than catching it with marginally higher confidence after the fact.
Open methodology
This work is experimental. We're publishing it openly — including a tutorial notebook with full code — because we believe the technique is useful today and we want others to build on it.
Some open questions we'd like the community to explore:
- Better probes: would a small MLP improve separation on edge cases where the linear probe is uncertain?
- Automatic onset detection: can violation onset positions be inferred from the model's own attention patterns, eliminating the need for external annotation?
- Threshold calibration: what's the right framework for setting policy-specific streaming thresholds in production?
If you're working on real-time content safety for generative AI systems, we'd love to hear what you find. The probe weights, training scripts, and tutorial are all available on Hugging Face at zentropi-ai/cope-a-9b-stream-probe.