Semantic Self-Similarity or How I Split a Conversation into Scenes Using Language Models
whisper, signal/noise, Audio Analysis, Dungeons and Dragons, SSM, Self Similarity Matrix, Speech to Text, python, transformers, semantic embeddings
Introduction - From Audio to Text
My DnD friends and I always forget past interactions (names, places, instructions, the usual Dungeons & Dragons stuff). So, why not make use of the marvels of the modern era and record the sessions? Great!... We now have a four-hour, 200 MB .aac or .mp3 file.
Anyone could scrub through it, fast-forwarding and rewinding to find something related to whatever clue we need 😎
But what if I process that huge-ass audio file with a Speech-to-Text library like whisper?
Great, again, after six hours of my poor laptop CPU running at 100% utilization, we now have the audio (which won’t disappear) and a sausage of text, mostly “eeehhhh... aaand... so...”, swearing, and other OOC remarks about penises or “what if” sexual situations involving NPCs.
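For the record, the transcription itself is only a few lines with the openai-whisper package. The file name and model size below are placeholders; bigger models are more accurate and even slower on a CPU:

```python
import whisper

# Load a small model; larger ones ("medium", "large") are more accurate but much slower on CPU
model = whisper.load_model("base")

# Transcribe the whole session; this is the part that takes hours on a laptop
result = model.transcribe("session.mp3")

# Dump the raw text "sausage" to a file for the next steps
with open("transcript.txt", "w", encoding="utf-8") as f:
    f.write(result["text"])
```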
Being aware that there are commercial solutions (those meeting bots that automatically summarize conversations) where you just invite the bot to an online meeting, it sits silently, and later a fancy summary lands in everyone’s inbox, I wondered what these "magic black boxes" might be doing. Those meetings can be noisy too, yet the bot somehow figures out that “ok, my computer is acting weird today” isn’t part of a presentation about quarterly results.
So, what if somehow we could detect the inner coherence of situations? My engineering thesis was about using self-correlation to tell when a signal was an actual signal and not just random noise. Language isn’t that different: meaningful things often have internal structure.
Now, again, if we could measure the meaning of each sentence as a vector in a multi-dimensional space, that would be cool. Well, that’s what modern Natural Language Processing models do. These vectors, called semantic embeddings, represent semantic information: sentences with similar meaning end up close together, and unrelated ones are far apart. With that, we can start finding which parts of the transcript are related to each other, effectively mapping the conversation’s structure.
Breaking the sausage
So, now we’ve got two monster files: the audio file and a single 135k-character line of text about what seems to be some players talking over each other, random jokes, and sometimes the actual game plot. To find structure in that mess, we need to slice it somehow.
A naive approach would be to split by sentence or punctuation. But speech-to-text output doesn’t really care about grammar. So instead, I’ll divide the transcript into chunks of N words. Think of it as cutting the sausage into equal slices, no matter where the commas fall.
A window of around 80 words worked reasonably well for this kind of dialogue. Smaller chunks are too noisy; larger ones start mixing different topics together. We can also overlap the windows (for example, by 50%) if we don’t want sharp edges between slices. Each chunk will later become one point in the semantic space (our “unit of meaning”).
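A minimal sketch of that slicing, with the 80-word window and 50% overlap being just the values that worked for me:

```python
def chunk_words(text, window=80, overlap=0.5):
    """Split a transcript into fixed-size word windows, optionally overlapping."""
    words = text.split()
    step = max(1, int(window * (1 - overlap)))  # 50% overlap -> advance 40 words per slice
    chunks = []
    for start in range(0, len(words), step):
        chunk = words[start:start + window]
        if len(chunk) < window // 2:  # drop a tiny leftover tail
            break
        chunks.append(" ".join(chunk))
    return chunks

with open("transcript.txt", encoding="utf-8") as f:
    chunks = chunk_words(f.read())
```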
You may think, "but sentences aren't fixed-length! You're doing it wrong!" Yeah... but we also have a monstrous source of entropy to deal with: a behemoth of four hours of speech-to-text output to find coherence in.
Giving meaning a shape
Now that we have a bunch of text chunks, we can feed each of them into a sentence embedding model. I’ll be using all-MiniLM-L6-v2 from the Sentence Transformers library, which turns text into a 384-dimensional vector. Larger models exist, but this one is fast and decent for conversational data.
Here’s what happens, conceptually:
- Each chunk of text gets mapped to a point in a very high-dimensional space.
- The direction of that vector encodes what the text chunk means.
- Two chunks that talk about the same thing (say, both are about a negotiation between a kobold and a witch buying equipment) will have vectors pointing in similar directions.
- Completely unrelated talk, like someone asking for pizza, will point elsewhere entirely.
Once we have all those vectors, we can compare them pairwise using math techniques like cosine similarity (i.e. the dot product divided by the product of their magnitudes). That gives us a matrix showing how semantically close each part of the conversation is to every other. If we visualize it, we start seeing blocks of coherence (the “scenes” of our story emerging from the noise).
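Here’s a sketch of that step with the sentence-transformers package, assuming the chunks list from the previous snippet. Normalizing the embeddings turns the whole pairwise cosine-similarity computation into a single matrix multiplication:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# One 384-dimensional vector per chunk, L2-normalized
embeddings = model.encode(chunks, normalize_embeddings=True)

# With unit-length vectors, cosine similarity is just the dot product,
# so all pairwise similarities come from one matrix product
ssm = embeddings @ embeddings.T   # shape: (n_chunks, n_chunks)
```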
Plotting that in a Self-Similarity Matrix and seeing the patterns feels like watching the session memory take physical form.
The Self-Similarity Matrix
Once every chunk has its vector, the fun begins. If we compare each vector with every other one, we can see which parts of the conversation are semantically related.
Plotting all those pairwise similarities gives us a self-similarity matrix: a square grid where both axes represent the timeline of the session. Brighter spots mean two segments talk about roughly the same thing. Diagonal blocks appear when the conversation stays on the same topic for a while — that’s a “scene.” Dark gaps or sudden color changes mark topic shifts, interruptions, or those moments when someone derails the story to order pizza.
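Plotting it is basically one matplotlib call (assuming the ssm array from the sketch above):

```python
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 8))
plt.imshow(ssm, cmap="magma", origin="lower")
plt.colorbar(label="cosine similarity")
plt.xlabel("chunk index (time →)")
plt.ylabel("chunk index (time →)")
plt.title("Self-similarity matrix of the session")
plt.show()
```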
What’s cool about this is that it’s language-agnostic. The model doesn’t need to know what Delerion or Varka are doing; it just sees that a group of sentences share a similar semantic fingerprint. You can literally see story structure emerging from noise, like patterns in a spectrogram but for meaning instead of sound.
I made a post about using it for music a while ago, plotting the SSM of a Cocteau Twins song. Not the lyrics, but taking the FFT of the audio itself.
Scene Boundaries, Keywords & Narrative Shape
Once you have the matrix, the next step is spotting where one topic ends and the next begins. A simple trick is to look only at the similarity between consecutive chunks along the diagonal. When that similarity suddenly drops, it probably means someone changed the subject. Plotting that line over time shows valleys at scene transitions. Taking its derivative makes those drops pop even more clearly, turning topic changes into visible spikes.
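A rough version of that, again assuming the ssm array from before. The threshold here is completely arbitrary and something you’ll end up tuning per recording:

```python
import numpy as np

# Similarity between each chunk and the next one: the first off-diagonal of the SSM
consecutive = np.diag(ssm, k=1)

# Sudden drops in that curve are candidate scene boundaries
drops = np.diff(consecutive)
threshold = drops.mean() - 1.5 * drops.std()   # arbitrary heuristic, tune to taste
boundaries = np.where(drops < threshold)[0] + 1

print("Possible scene changes at chunks:", boundaries)
```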
To label them, run a quick TF-IDF analysis inside each cluster and grab the top few words.
They won’t be poetic, but they tell you what each block is about: ["mercenaries", "tavern", "tiefling"], ["delegates", "guild", "negotiation"], and so on. From my own (Spanish-language) session, the raw output looked like this:
Scene 6 (2 segments): vivo, cobol, carpintero
Scene 7 (11 segments): whisky, persuasión, alcohol
Scene 8 (3 segments): caco, doctor, maestro
Scene 9 (2 segments): normalmente, trago, así
Scene 10 (1 segment): tres, faltaba, entramos
Scene 11 (2 segments): palmeira, sao, paulo
Scene 12 (1 segment): pepsi, césar, ensalada
Scene 13 (1 segment): paz, laguer, alta
Scene 14 (2 segments): lord, capitán, órdenes
Scene 15 (4 segments): hippie, ahí, casa
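The labeling itself is a few lines of scikit-learn. In this sketch, scenes is a hypothetical list of chunk-index groups produced by whatever clustering you run on the matrix, and chunks is the list from earlier:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Pass a stop-word list for your language here to filter out filler words
vectorizer = TfidfVectorizer(max_features=5000)

# Join the chunks belonging to each scene into one document per scene
scene_texts = [" ".join(chunks[i] for i in scene) for scene in scenes]
tfidf = vectorizer.fit_transform(scene_texts)
vocab = vectorizer.get_feature_names_out()

for n, row in enumerate(tfidf.toarray()):
    top = row.argsort()[::-1][:3]  # the three highest-scoring terms
    keywords = ", ".join(vocab[i] for i in top)
    print(f"Scene {n + 1} ({len(scenes[n])} segments): {keywords}")
```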
Finally, you can reduce the embedding space to two dimensions with PCA or UMAP and plot the points colored by time. The curve that appears is your narrative trajectory (how the conversation drifts across topics). Clusters form little constellations of meaning; jumps and loops mark returns to earlier ideas.
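Here’s a quick version of the PCA variant (UMAP works the same way with the umap-learn package), assuming the embeddings array from earlier:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Collapse the 384-dimensional embeddings down to a 2-D map
points = PCA(n_components=2).fit_transform(embeddings)

# Color by position in the session so time flows through the colormap
plt.scatter(points[:, 0], points[:, 1], c=np.arange(len(points)), cmap="viridis")
plt.plot(points[:, 0], points[:, 1], alpha=0.2)   # faint line = narrative trajectory
plt.colorbar(label="chunk index (time)")
plt.title("Conversation trajectory in semantic space")
plt.show()
```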
Conclusion
When someone imagines a self-similarity matrix, they might picture a perfect carpet of diamonds neatly aligned along the main diagonal, with every “scene” and every transition as a clean break. Reality is messier.
The matrix just mirrors the texture of real conversation. Finding the structure inside that noise becomes a bit of an art: tuning window size, adjusting clustering thresholds, filtering filler words, deciding what counts as “similar enough.”
Once you accept that, the method works quite well. You could see chapters forming in long podcasts, story arcs emerging in role-playing sessions, or topic drift in team meetings. With the right tuning, even a four-hour DnD recording starts to look like a living narrative map rather than a wall of words.
More satisfying than any polished commercial summary bot is getting a glimpse of how machines perceive stories as a bunch of vectors in high-dimensional space, and knowing that I didn't need to subscribe to any pay-as-you-go service 😸.