When you think about sorting books or passages into groups, the obvious approach is by topic. Put all the war passages together, all the romance passages together, all the courtroom scenes together. Every search engine and text analysis tool works this way: it looks at the words and asks “what is this about?”
I trained a model to ask a different question: not what is this passage about, but what is this passage doing?
A marriage proposal in a Jane Austen novel and a diplomatic negotiation in a political thriller have nothing in common topically. One is about love, the other about trade agreements. But structurally they’re doing the same work: two parties with unequal power, high stakes, careful language where every word is chosen for its effect and the reader doesn’t yet know which way it will go. The words are different. The shape is the same.
The model finds these shapes.
Explore the demo →

I took 9,766 books from Project Gutenberg, a broad corpus of English-language literature spanning four centuries, and split them into 25 million short passages.
Before the model sees anything, a separate system reads each passage and converts it into a set of numbers, like coordinates on a map. From this point on the words are no longer involved. Each passage is now just a point on this map, and similar passages end up near each other. This step uses existing technology (a text embedding model) and isn’t the interesting part. If you just looked at the groups on this map, you’d get topic clusters: all the battle passages together, all the romance passages together. Useful, but obvious.
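To make this first step concrete, here is a minimal sketch of the interface involved. The `toy_embed` function below is an invented stand-in for illustration only; the real pipeline uses an off-the-shelf text embedding model. The only thing that matters downstream is the shape of the contract: text in, fixed-length vector out.

```python
import hashlib
import numpy as np

def toy_embed(text, dim=64):
    """Stand-in for a real text-embedding model: hashes each word into a
    fixed-size vector and normalises it. A real embedding captures meaning;
    this only captures word overlap. The interface (text -> unit vector)
    is the same."""
    v = np.zeros(dim)
    for word in text.lower().split():
        v[int(hashlib.md5(word.encode()).hexdigest(), 16) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

passages = [
    "The troops advanced at dawn under heavy fire.",
    "She poured the tea and asked about his mother.",
]
X = np.stack([toy_embed(p) for p in passages])
# From here on, each passage is just a point on the map: X[0], X[1], ...
```

After this step the pipeline never looks at the text again; everything downstream operates on the points in `X`.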
The interesting part is what happens next.
I took those points on the map and showed them to the model in the same order you would see them if you read the book. As if the model itself were reading the book, just in number form.
The model was asked one question: which points on the map tend to appear near each other in sequence when you read through a book? Or, put another way: which passages tend to have similar neighbours across thousands of books?
For each passage, the model looks at what tends to come before and after it across all the books. That’s its neighbourhood. Then it compares neighbourhoods. When two passages from completely different books consistently have similar kinds of passages surrounding them, the model learns they’re doing the same job, even if the passages themselves are nothing alike.
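The neighbourhood comparison can be sketched in a few lines. This toy version is my illustration, not the actual training objective: it just counts which passage types sit next to which across a handful of miniature "books". Types 0 and 3 never appear in the same book, yet their neighbourhood profiles come out nearly identical, so they would be judged to be doing the same job.

```python
import numpy as np

# Toy "books": each is a sequence of passage types (ids on the first map).
# Types 0 and 3 are topically different but fill the same structural slot.
books = [
    [1, 0, 2, 1, 0, 2],
    [1, 3, 2, 1, 3, 2],
    [4, 1, 0, 2, 4],
    [4, 1, 3, 2, 4],
]
n_types = 5
ctx = np.zeros((n_types, n_types))
for seq in books:
    for i, t in enumerate(seq):
        for j in (i - 1, i + 1):          # immediate neighbours only
            if 0 <= j < len(seq):
                ctx[t, seq[j]] += 1

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Types 0 and 3 never co-occur, yet their neighbourhood profiles match:
print(cos(ctx[0], ctx[3]))   # near 1.0: same neighbours, same job
```

The real model learns this from 25 million passages rather than counting over five types, but the signal it is chasing is the same: similar surroundings, similar function.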
The model’s job is to learn all 373 million of these neighbourhood relationships. But it’s deliberately too small to memorise them: it doesn’t have enough capacity to get every one right, and in fact it’s correct less than half the time. So it finds ways to be right as often as it can within its limited capacity. It does this by generalising: finding rules that let it group passages so it’s right more often than it would be by guessing alone.
This is also how you and I generalise. We can’t remember every dog we’ve ever seen, so we extract what they have in common: four legs, a snout, a tail. The next time you see a dog you’ve never met, you recognise it instantly. The concept “dog” isn’t any specific dog. It’s what survived the compression of not being able to remember them all. The model does the same thing with narrative patterns. It can’t remember every neighbourhood relationship, so it extracts what recurs: passages that perform the same structural function tend to have the same kinds of neighbours. That’s the rule that survives compression.
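The compression argument can be demonstrated directly. The toy below is my own construction, not the model's actual architecture: a neighbourhood-count matrix of 400 numbers is squeezed through a rank-2 bottleneck of only 100 numbers. Memorising individual entries is impossible, so what survives is exactly the shared structure, the two underlying profiles.

```python
import numpy as np

rng = np.random.default_rng(2)

# A toy neighbourhood-count matrix: rows 0-4 share one neighbourhood
# profile, rows 5-9 another, each with a little per-row noise.
profile_a = rng.random(40)
profile_b = rng.random(40)
M = np.stack([profile_a + 0.01 * rng.random(40) for _ in range(5)] +
             [profile_b + 0.01 * rng.random(40) for _ in range(5)])

# Compress to rank 2: 10 x 40 = 400 numbers squeezed through
# (10 + 40) * 2 = 100. Individual entries can't be memorised, so
# what the low-rank fit keeps is the shared structure.
U, s, Vt = np.linalg.svd(M, full_matrices=False)
M_hat = (U[:, :2] * s[:2]) @ Vt[:2]

err = float(np.abs(M - M_hat).max())   # small: structure survives compression
```

The same pressure, scaled up, is what pushes the model from memorising 373 million relationships toward rules like "passages with the same structural function have the same kinds of neighbours".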
What comes out is a new map. On the first map, passages are grouped by what they mean. On the new map, passages are grouped by what they do in a story. A marriage proposal in Jane Austen and a diplomatic negotiation in a political thriller sit on completely different parts of the first map. On the new map, they’re next to each other because they serve the same narrative function: two parties with unequal power, high stakes, careful language and an outcome the reader can’t yet predict.
So the journey goes: we start with words, convert them to points on a map, throw the words away, let the model rearrange those points into a new map based on sequential patterns, and then at the very end, look up what words live at each point. The words you see in the demo were never part of the analysis. They’re the human-readable label we paste back onto each point so you can see what the model found. The model worked entirely in numbers. The words are just how we read its answer.
I call these groups “concepts,” not because the model understands them, but because that’s what they turn out to be.
The model can group passages at different levels of detail, from very broad to very specific. Think of it like a zoom control.
Zoomed out: the model separates the most basic distinctions. Poetry from prose. Dialogue from narration. Action sequences from reflective passages. These are unsurprising, but they confirm the model is picking up real structure rather than noise.
Mid-range: this is where it gets interesting. The model identifies things like “direct confrontation and negotiation,” which spans diplomatic fiction, drawing-room arguments, interrogation scenes and power struggles across 5,000 books. Or “cynical worldly wisdom,” the register used when a narrator or character offers pragmatic observations about human nature, whether it’s Dickens, Twain or a forgotten Victorian satirist. Or “lyrical landscape meditation,” the prose-slowing, mood-setting descriptive mode found in Romantic poetry, Gothic atmosphere, travel writing and pastoral fiction.
These aren’t topics. A confrontation between a sheriff and a rustler and a confrontation between a governess and her employer are grouped together not because they share a subject, but because they perform the same structural beat in a story.
Zoomed in: the model isolates specific registers: sailor dialect across adventure fiction and naval history. Courtroom cross-examination. The particular rhythm of scientific correspondence in the Darwin-Huxley tradition. Absurdist social rituals like the tea party scenes in Alice in Wonderland. At this level, the groups become narrow enough that you can hear the distinctive voice of each one.
Any grouping system can produce groups. The question is whether the groups mean anything when applied to material the system has never seen.
I took novels held out from training (books the model never encountered while learning) and ran them through in a single pass. The model assigned each passage to its nearest group based on the patterns it had already learned. The assignments were coherent: passages landed in groups that matched their narrative function, even though the model had never seen those particular books.
This is the result that matters. The patterns the model extracted from 9,766 books are general enough to apply to new ones. It learned something real about how narrative works, not just a set of memorised relationships specific to its training data.
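Mechanically, the single pass over a held-out book is just nearest-group lookup: no further training happens. A sketch, with random numbers standing in for the learned group centres and for the unseen book's embedded passages:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-ins: 50 learned group centres and 120 embedded passages
# from a book the model never saw during training.
centers = rng.normal(size=(50, 16))
held_out_book = rng.normal(size=(120, 16))

def assign(points, centers):
    """Each passage goes to its nearest learned group (squared
    Euclidean distance); the centres themselves never move."""
    d = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return d.argmin(1)

labels = assign(held_out_book, centers)
# labels[i] is the structural group of the book's i-th passage.
```

Because the centres are frozen, coherent assignments on unseen books can only come from the generality of the learned groups, which is why this is the test that matters.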
In the demo, the zoom level is controlled by a number called “k.” This is simply how many groups the model divides the passages into. At k=50, every passage in the entire corpus is sorted into one of 50 broad groups. At k=2,000, the same passages are sorted into 2,000 much narrower groups. A lower k gives you the big picture; a higher k gives you fine detail. The model runs at six levels simultaneously: k=50, 100, 250, 500, 1,000 and 2,000. So you can see the same passage classified at every level of detail at once.
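The post doesn't pin down which grouping algorithm produces the k groups, so here is the idea with plain k-means as a stand-in (and smaller values of k than the demo's, to keep the toy fast): the same points, clustered independently at several values of k, yield one label per passage per zoom level.

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(X, k, iters=50):
    """Plain Lloyd's k-means; a stand-in for whatever grouping
    the real system uses."""
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):            # recentre each non-empty group
            pts = X[labels == j]
            if len(pts):
                centers[j] = pts.mean(0)
    return labels

X = rng.normal(size=(600, 16))        # toy stand-in for passage points
levels = {k: kmeans(X, k) for k in (5, 10, 25)}

# The same passage gets one label per level -- the "zoom control".
passage = 7
print({k: int(lab[passage]) for k, lab in levels.items()})
```

Running the clustering once per k, rather than splitting coarse groups recursively, matches the demo's behaviour of showing six independent levels for the same passage.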
A novel’s structure laid out as a colour-coded strip. Each colour represents a different narrative function. You can see at a glance how a novel moves between action, reflection, dialogue and description. Multiple zoom levels are shown simultaneously, so you can see how a broad category like “natural history exposition” breaks down into finer distinctions: “hunting and animal behaviour,” “maritime observation,” “maritime exploration description.” Click any passage to see its classification at all six levels of detail.
Click the ✦ button on any group and an AI reads the passages, then explains in plain language what structural pattern connects them. This is the fastest way to understand what the model has found without reading through hundreds of passages yourself. The explanations often surface surprising connections: why a passage from a Gothic novel and a passage from a scientific treatise ended up in the same group. The results that look like mistakes at first glance are often the most interesting: they’re where the model has found a structural parallel that isn’t obvious until you look at what the passages are actually doing rather than what they’re about. The prompt the explainer runs with is below.
You are an expert literary analyst examining clusters discovered by an unsupervised AI model (Predictive Associative Memory) trained on 10,000 Project Gutenberg novels. The model groups text chunks by temporal co-occurrence patterns — passages that serve similar narrative structural functions tend to appear in similar sequential contexts across different novels, regardless of their surface content.
Important: the model knows nothing about themes, topics, or meaning. It only knows what kinds of passages tend to appear before and after each other. Two passages can be structurally identical (same rhetorical pattern, same pacing, same position in a narrative arc) while being about completely different things.
When the cluster label seems wrong for the chunk, this is often the most interesting case. Look for the structural parallel — the rhetorical pattern, pacing, narrative position, or formal technique that connects them despite different surface content. Explain what structural feature the model likely detected.
Explain in 2–3 sentences: (1) What narrative/structural function this chunk serves. (2) Why it belongs in this cluster — what structural feature connects it to the other samples, even if the surface content differs. (3) What’s interesting or surprising about the grouping. Be specific about the text. Reference actual phrases. Be concise.
Paste any text — a chapter you’re writing, an essay, a speech — and the model assigns each passage to its nearest structural group, showing you which narrative patterns it detects in your writing.
This is not a topic model. It won’t tell you a passage is about war. It tells you a passage is performing the same structural work as thousands of other passages across hundreds of books, regardless of subject matter.
The group names you see in the demo are AI-generated descriptions, my attempt to label what each group contains after the fact. The model produces the groupings. The names are interpretation.