The Missing Coordinate in the RLHF Diversity Debate

Why aligned language models occupy a region 1.70× narrower than humans, and what that means for the next generation of affective AI governance

May 13, 2026

The diversity literature on aligned language models has converged on one finding, now a major research current: alignment narrows what models produce.

Kirk and colleagues, in their ICLR 2024 work on RLHF generalisation and diversity, showed that reinforcement learning from human feedback improves robustness while narrowing output diversity. Murthy, Ullman, and Hu, at NAACL 2025, extended the argument into conceptual space: aligned models display lower conceptual diversity than their instruction-tuned counterparts, and no current model reaches human-level diversity.

Both findings matter.

But both live inside the semantic plane.

We count clusters of meaning.
We have not been counting clusters of affect.

The geometry between how complex a thought is and how strongly it is expressed has remained largely unmapped.

This essay is about that missing plane, and about a paper published this week in PLOS ONE that attempts to chart it across 351,734 anonymous narratives.

What the existing measurement misses

A sentence carries at least two independent signals.

The first is narrative complexity: the structural elaboration of the underlying thought. Temporal nesting. Counterfactuals. Hedging. Multiple actors. Self-monitoring.

The second is affective intensity: how strongly emotion is permitted to appear on the surface of the sentence.

These two signals can move together. But they are not required to.

A maximally complex grief can arrive in four flat words.

A minor irritation can arrive in a carefully structured paragraph.

Most sentiment systems collapse both into a single scalar. Most diversity benchmarks ignore the second axis entirely.

The result is that a class of compression effects has remained invisible to the instruments we currently use.

The study

In the paper published this week in PLOS ONE, 351,734 anonymous narratives drawn from English-language relationship-support communities were scored independently on two 0–10 scales:

Narrative complexity (N)
Expressed affective intensity (A)

Their discrepancy, D = N − A, was treated not as residual noise but as a regulated degree of freedom.

A measurable signal of how a person balances the cost of being understood against the cost of being exposed.

The Pearson correlation between N and A across the corpus was 0.009. Effectively zero: the two scores share less than 0.01% of their variance.

Stronger emotion does not reliably produce more elaborate language. More elaborate language does not reliably carry stronger affect.
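To make the two quantities concrete, here is a minimal sketch of the discrepancy D = N − A and the Pearson correlation between the two scores. The function names and the six toy scores are illustrative assumptions, not the paper's code or data; the toy values were chosen so that N and A are exactly uncorrelated while D still varies widely.

```python
import math

def discrepancy(n_scores, a_scores):
    """Per-narrative discrepancy D = N - A."""
    return [n - a for n, a in zip(n_scores, a_scores)]

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)

# Toy scores on the 0-10 scales (illustrative only, not from the paper).
N = [8, 2, 5, 5, 2, 8]
A = [2, 2, 8, 2, 8, 8]

D = discrepancy(N, A)
r = pearson_r(N, A)
print(D)  # [6, 0, -3, 3, -6, 0]
print(r)  # 0.0
```

A correlation near zero does not mean D is small. In the toy data D ranges from −6 to 6 even though r is exactly zero, which is the sense in which the plane between the two axes can be populated.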

The plane between them is real, populated, and structured.

The distribution stabilised into four expressive regions.

Coupled expression (91.3%)

Narrative complexity and affective intensity remain broadly aligned.

Strategic understatement (5.7%)

High emotional complexity compressed into restrained language.

Strategic overstatement (0.6%)

Low affective intensity carried by elaborate narrative structure.

Collapse (2.3%)

Affective intensity remains high while narrative scaffolding becomes minimal.

The quietest sentences in the corpus were not the simplest minds.

Disproportionately, they were the most complex.

The alignment finding

The same scoring procedure was then applied to an RLHF-aligned language model responding to matched prompts.

Projected onto the same coordinate system, the model occupied a region approximately 1.70× narrower than the human distribution (bootstrap 95% CI [1.68, 1.70], permutation p < 0.0001).
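For readers who want to see how a figure of this kind can be produced, the sketch below computes a spread ratio with a percentile bootstrap interval and a one-sided permutation test. It rests on loud assumptions: this post does not say how "narrower" was measured, so the ratio of standard deviations of D is a stand-in, and the data are synthetic Gaussian draws rather than the study's scores. All function names are hypothetical.

```python
import math
import random

def stdev(values):
    """Sample standard deviation (n - 1 in the denominator)."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))

def spread_ratio(human, model):
    """Human spread over model spread; above 1 means the model is narrower."""
    return stdev(human) / stdev(model)

def bootstrap_ci(human, model, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the spread ratio."""
    rng = random.Random(seed)
    ratios = sorted(
        spread_ratio(rng.choices(human, k=len(human)),
                     rng.choices(model, k=len(model)))
        for _ in range(n_boot)
    )
    lower = ratios[int(n_boot * alpha / 2)]
    upper = ratios[int(n_boot * (1 - alpha / 2)) - 1]
    return lower, upper

def permutation_p(human, model, n_perm=1000, seed=0):
    """One-sided permutation p-value for 'human spread > model spread'."""
    rng = random.Random(seed)
    observed = spread_ratio(human, model)
    # Centre each group so that only spread, not location, is being tested.
    mean_h = sum(human) / len(human)
    mean_m = sum(model) / len(model)
    pooled = [v - mean_h for v in human] + [v - mean_m for v in model]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if spread_ratio(pooled[:len(human)], pooled[len(human):]) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction avoids p = 0

# Synthetic D values (assumption: Gaussian, human spread 1.7x the model's).
rng = random.Random(42)
human_d = [rng.gauss(0, 1.7) for _ in range(300)]
model_d = [rng.gauss(0, 1.0) for _ in range(300)]

ratio = spread_ratio(human_d, model_d)
ci_low, ci_high = bootstrap_ci(human_d, model_d)
p_value = permutation_p(human_d, model_d)
print(f"ratio={ratio:.2f}, 95% CI=[{ci_low:.2f}, {ci_high:.2f}], p={p_value:.4f}")
```

With 1,000 permutations the smallest reportable p-value is about 0.001, so a claim of p < 0.0001 would need at least 10,000 permutations. A two-dimensional measure of occupied area on the N–A plane would be a reasonable alternative to the one-dimensional spread used here.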

The compression was not uniform.

The model rarely entered the collapse zone. Extreme overstatement was also sparse.

These are precisely the regions where humans under sustained regulatory pressure disproportionately appear.

This is not a claim that alignment causes the gap.

The paper is careful about that distinction.

The observation is narrower and more important:

The aligned model tested leaves parts of the human affective plane largely unoccupied.

Where existing RLHF diversity work mapped output diversity and conceptual diversity, this measurement maps affective expressive geometry.

Why this matters downstream

Four implications follow.

1. Alignment research

Affective diversity becomes measurable.

The 1.70× figure is not an argument against RLHF. It is evidence that current diversity benchmarks are incomplete.

A model can preserve lexical variety while compressing expressive geometry.

Those are not the same phenomenon.
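A small sketch shows how the two measurements can come apart. It compares a common lexical diversity score (distinct-1, the share of unique tokens) with the spread of D on two hand-made response sets. The texts, the N and A scores attached to them, and the function names are all hypothetical, chosen only to illustrate the point.

```python
import statistics

def distinct_1(texts):
    """Share of unique tokens among all whitespace-separated tokens."""
    tokens = [tok.lower() for text in texts for tok in text.split()]
    return len(set(tokens)) / len(tokens)

def d_spread(scored):
    """Population standard deviation of D = N - A over (text, N, A) items."""
    return statistics.pstdev([n - a for _, n, a in scored])

# Hypothetical (text, N, A) triples; the scores are invented for illustration.
wide_set = [
    ("i am fine", 7, 1),
    ("we talked again", 3, 3),
    ("he left. gone.", 1, 9),
]
narrow_set = [
    ("i feel okay", 4, 4),
    ("we spoke today", 5, 5),
    ("he moved out", 6, 6),
]

for name, scored in (("wide", wide_set), ("narrow", narrow_set)):
    texts = [text for text, _, _ in scored]
    print(name, distinct_1(texts), round(d_spread(scored), 2))
```

Both sets use nine distinct tokens out of nine, so a lexical benchmark scores them identically, while the spread of D is zero in the second set and large in the first.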

2. Affective computing

Silence and absence become distinguishable.

A flat sentence inside the coupled region means something fundamentally different from a flat sentence inside the collapse region.

Most sentiment systems currently erase that distinction.

3. Clinical NLP and computational psychiatry

Strategic understatement and collapse stop appearing identical.

One is a sustainable regulatory strategy.

The other is the linguistic signature of someone running out of room.

Confusing them is already a known screening failure.

4. AI policy

Since February 2025, Article 5(1)(f) of the EU AI Act has prohibited certain forms of emotion recognition in workplaces and educational settings.

The regulation defines what affective AI systems are not allowed to do.

It does not yet define what expressive freedom a system must return to the human.

D offers a quantitative handle on that missing question.

The longer programme

The broader programme this paper belongs to, which I have been developing under the name Affective Sovereignty, asks a question the accuracy era could not formulate.

Not how well a system reads us.

Which regions of our expression it leaves intact.

The fastest-wearing relationship is not the one that fights.

It is the one in which both parties continuously translate themselves into a readable form.

Something similar is beginning to happen between humans and the systems trained to understand them.

Each accommodation feels small.

The geometry is what changes.

References

Kim, R. S. (2026). Narrative–affect discrepancy as a regulated degree of freedom in 351,734 relationship narratives. PLOS ONE. https://doi.org/10.1371/journal.pone.0348715

Kirk, R., Mediratta, I., Nalmpantis, C., Luketina, J., Hambro, E., Grefenstette, E., & Raileanu, R. (2024). Understanding the effects of RLHF on LLM generalisation and diversity. ICLR 2024.

Murthy, S. K., Ullman, T., & Hu, J. (2025). One fish, two fish, but not the whole sea: Alignment reduces language models’ conceptual diversity. Proceedings of NAACL 2025.

European Parliament and Council of the European Union. (2024). Regulation (EU) 2024/1689 (Artificial Intelligence Act), Article 5(1)(f). Prohibitions applicable from 2 February 2025.

Ari Mostov

May 14

Makes me curious to see how this work compares to studies of physical expressions of emotion, such as eustress/distress. Would be interesting to see how the body responds to these exchanges with RLHF-aligned models, not just at the semantic level but at the biological level.

Ryan Sangbaek Kim, Ph.D.
