Semantic Analysis Applied to "Sent From My Telephone" by Voice Actor

October 2025
Audio Analysis, Digital Humanities, Experimental Music, Voice Actor, embeddings, music, python, semantic embeddings, whisper

Introduction

Spotify sometimes shuffles in the right direction when an album ends (that’s how I discovered Sent From My Telephone by Voice Actor). And sometimes I also sleep relatively well (that’s why I had time and energy to analyze it).

I noticed something in the monologues spread across its 100+ tracks: the album feels more like a log of emotions than a conventional record. Maybe there was even a linear story hidden inside it.

Kurzweil (the artist) released the album with its tracks in alphabetical order, which makes it an ideal candidate for reassembly. I couldn’t resist, so I bought it (to have every track as a separate file), decided to treat it as a "fragmented log system", and set out to see what kind of "signal" I could recover.

I asked the oracle 🧞 why Voice Actor’s Bandcamp cites Martha C. Nussbaum’s Upheavals of Thought: Since that book explores emotion as a form of intelligence and self-knowledge, the reference suggests that the album, too, treats feeling as structure: an attempt to think through emotion rather than perform it.

Perhaps taking an objective / systematic / procedural approach isn’t missing the point at all. It might be a sensible way to explore a four-hour dataset of dreams and emotions.

Collection and Preparation

I bought the album on Bandcamp and downloaded the tracks in AAC format (.m4a files)
Renamed and normalized the file names with PowerShell

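# $src and $dst must already hold the source and destination folder paths
Get-ChildItem -Path $src -Filter *.m4a | ForEach-Object {
    # Replace anything that is not a word character or a hyphen with an underscore
    $newName = $_.BaseName -replace '[^\w\-]', '_'
    $newPath = Join-Path $dst ($newName + $_.Extension)
    # -LiteralPath keeps characters such as [ ] in file names from being read as wildcards
    Move-Item -LiteralPath $_.FullName -Destination $newPath
}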

Transcribed all 108 tracks with Whisper (medium) in Python

result = model.transcribe(audio_file, word_timestamps=True, fp16=False, language="en")
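For context, the call above can be wrapped in a loop along these lines. This is a minimal sketch, not necessarily the script used for the album: the function names, the folder arguments and the one-JSON-per-track naming are assumptions. Only `whisper.load_model("medium")` and the `transcribe` arguments come from the text.

```python
import json
from pathlib import Path


def load_medium_model():
    """Load Whisper's "medium" checkpoint (needs the openai-whisper package)."""
    import whisper  # imported lazily so the helper below can be used without it
    return whisper.load_model("medium")


def transcribe_folder(model, audio_dir, out_dir, pattern="*.m4a"):
    """Transcribe every matching audio file and write one JSON file per track.

    `model` is anything with a Whisper-style `transcribe` method.
    The folder layout is a hypothetical one chosen for this sketch.
    """
    audio_dir, out_dir = Path(audio_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for audio_file in sorted(audio_dir.glob(pattern)):
        # Same arguments as the single call shown in the post
        result = model.transcribe(
            str(audio_file), word_timestamps=True, fp16=False, language="en"
        )
        out_path = out_dir / (audio_file.stem + ".json")
        with open(out_path, "w", encoding="utf-8") as f:
            json.dump(result, f, ensure_ascii=False, indent=2)
        written.append(out_path)
    return written
```

Usage would look like `transcribe_folder(load_medium_model(), "audio", "speeches")`, where both folder names are placeholders.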

Each transcription was saved as a JSON file in a speeches/ directory. These files are no longer in my GitHub repository.

The transcripts are full of the small (and sometimes not-so-small) glitches typical of speech-to-text software, but they generally preserve the meaning and intent.

Semantic Extraction

Processed each text field with sentence-transformers (all-MiniLM-L6-v2).

from sentence_transformers import SentenceTransformer
import numpy as np
import json
from pathlib import Path

src = Path(r"C:\temp\reconstructingMyTelephone\speeches")
dst = Path(r"C:\temp\reconstructingMyTelephone\embeddings")
dst.mkdir(exist_ok=True)

model = SentenceTransformer("all-MiniLM-L6-v2")

for json_file in src.glob("*.json"):
    with open(json_file, "r", encoding="utf-8") as f:
        data = json.load(f)
    text = data.get("text", "").strip()
    if not text:
        continue

    emb = model.encode(text, normalize_embeddings=True)
    out_path = dst / (json_file.stem + ".npy")
    np.save(out_path, emb)
    print(f"Saved {out_path.name} ({emb.shape})")

This saved one .npy embedding per track.

106 thoughts (2 tracks are instrumentals), each one encoded in 384 dimensions

Ordering by Similarity

Computed a cosine similarity matrix (np.dot(embeddings, embeddings.T), which equals cosine similarity because the embeddings are L2-normalized)
Used a simple greedy heuristic: start from one track and always jump to the most similar unvisited one
Saved the resulting order to semantic_order.csv and manually recreated it as a Spotify playlist
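The three steps above can be sketched roughly as follows. It assumes `embeddings` is an (n_tracks, 384) array of L2-normalized vectors; the function names, the start index and the CSV column names are illustrative assumptions, not the original script.

```python
import csv

import numpy as np


def greedy_order(embeddings, start=0):
    """Start at `start`, then always jump to the most similar unvisited track."""
    sim = embeddings @ embeddings.T  # cosine similarity for unit-length rows
    n = sim.shape[0]
    visited = np.zeros(n, dtype=bool)
    order = [start]
    visited[start] = True
    for _ in range(n - 1):
        # Mask visited tracks so they can never be picked again
        row = np.where(visited, -np.inf, sim[order[-1]])
        nxt = int(np.argmax(row))
        order.append(nxt)
        visited[nxt] = True
    return order


def save_order(order, names, path="semantic_order.csv"):
    """Write the sequence as CSV (the column names are hypothetical)."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["position", "track"])
        for pos, idx in enumerate(order, start=1):
            writer.writerow([pos, names[idx]])
```

The result depends on the starting track, which is one reason a greedy ordering is only a first approximation.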

Validation with a Self-Similarity Matrix (SSM)

Visualized the similarity matrix using Matplotlib.

The diagonal blocks can be interpreted as “narrative chapters” (clusters of voices and moods).

Look at the matrix around positions ~25, ~70 and ~95 in the sequence.
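A plot of this kind can be produced with a sketch like the one below. It assumes L2-normalized `embeddings` and an `order` list of track indices; the colormap, figure size and output file name are arbitrary choices, not the settings behind the figure above.

```python
import numpy as np


def reordered_ssm(embeddings, order):
    """Self-similarity matrix with rows and columns in playlist order."""
    ordered = embeddings[np.asarray(order)]
    return ordered @ ordered.T  # cosine similarity for unit-length rows


def plot_ssm(ssm, path="ssm.png"):
    """Save the matrix as a heatmap (needs Matplotlib)."""
    import matplotlib.pyplot as plt  # imported lazily; only plotting needs it

    fig, ax = plt.subplots(figsize=(6, 6))
    im = ax.imshow(ssm, cmap="magma", vmin=-1.0, vmax=1.0)
    ax.set_xlabel("position in sequence")
    ax.set_ylabel("position in sequence")
    fig.colorbar(im, ax=ax, label="cosine similarity")
    fig.savefig(path, dpi=150)
    plt.close(fig)
```

Bright square blocks along the diagonal are runs of consecutive tracks that are all similar to each other, which is what makes them readable as chapters.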

Notes on sorting the embeddings

That first attempt to order the embeddings used a greedy “nearest-neighbor” approach: start from one track and always jump to the most similar one not yet visited. It is intuitive and simple, but I learned that this algorithm only makes local decisions: it optimizes only the next step (the next track, in this case). The result looks fine at first but gradually breaks down: as the algorithm runs out of nearby points, the remaining ones become progressively less related, creating a high-entropy tail where similarity collapses. It is like picking the 'nice things' out of a box of 'assorted things', where 'niceness' is the proximity of the next embedding: the nicest items go first, so whatever is left in the box is farther and farther away.

Then I tried to remember something (anything) from my Signals and Systems classes. Of course I remembered almost nothing, but I was sure this was a known issue. Again, I asked the oracle 🧞 what I could do in this situation, and the robot answered: recasting the problem as a Traveling Salesman Problem (TSP) would change the picture.

Instead of linking neighbors greedily, the solver minimizes the total distance across all tracks. It looks at the whole map at once, producing a globally smoother “semantic path” where similar ideas remain contiguous from start to finish.
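The post does not say which TSP solver was used, so the following is only one possible sketch: a nearest-neighbor path improved with 2-opt, a standard local-search heuristic for the TSP. It treats the sequence as an open path rather than a closed tour and uses 1 minus cosine similarity as the distance. Both choices, and all function names, are assumptions.

```python
import numpy as np


def path_length(dist, order):
    """Sum of distances between consecutive tracks in `order`."""
    idx = np.asarray(order)
    return float(dist[idx[:-1], idx[1:]].sum())


def nearest_neighbour_path(dist, start=0):
    """Greedy starting path: always move to the closest unvisited track."""
    n = dist.shape[0]
    visited = np.zeros(n, dtype=bool)
    order = [start]
    visited[start] = True
    for _ in range(n - 1):
        row = np.where(visited, np.inf, dist[order[-1]])
        nxt = int(np.argmin(row))
        order.append(nxt)
        visited[nxt] = True
    return order


def two_opt(dist, order):
    """Reverse segments of the path while doing so shortens the total length.

    Assumes a symmetric distance matrix, so reversing order[i..j]
    only changes the two edges at the ends of the segment.
    """
    order = list(order)
    n = len(order)
    improved = True
    while improved:
        improved = False
        for i in range(n - 1):
            for j in range(i + 1, n):
                before = after = 0.0
                if i > 0:
                    before += dist[order[i - 1], order[i]]
                    after += dist[order[i - 1], order[j]]
                if j < n - 1:
                    before += dist[order[j], order[j + 1]]
                    after += dist[order[i], order[j + 1]]
                if after < before - 1e-12:
                    order[i:j + 1] = order[i:j + 1][::-1]
                    improved = True
    return order


def semantic_tsp_order(embeddings, start=0):
    """Order L2-normalized embeddings along a short path in cosine distance."""
    dist = 1.0 - embeddings @ embeddings.T
    dist = (dist + dist.T) / 2.0  # remove tiny floating-point asymmetries
    return two_opt(dist, nearest_neighbour_path(dist, start))
```

2-opt does not guarantee the optimal path, but it repairs many of the long jumps that the greedy pass leaves near the end of the sequence.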

In the Δ-distance plot (the distance between each pair of consecutive tracks in the new order), peaks mark where a track’s theme or idea drifts away from the previous one, suggesting potential chapter boundaries.
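A Δ-distance curve and its peaks can be computed with a sketch like this one. It assumes L2-normalized `embeddings` and an `order` list. The peak rule (a local maximum above the mean plus one standard deviation) is an illustrative threshold, not one taken from the post.

```python
import numpy as np


def delta_distances(embeddings, order):
    """Cosine distance between each track and the next one in the sequence."""
    ordered = embeddings[np.asarray(order)]
    sims = np.sum(ordered[:-1] * ordered[1:], axis=1)
    return 1.0 - sims


def find_peaks_simple(delta, threshold=None):
    """Indices i where the jump from position i to i + 1 is a local maximum
    above `threshold` (default: mean + 1 standard deviation, an assumption)."""
    delta = np.asarray(delta, dtype=float)
    if threshold is None:
        threshold = delta.mean() + delta.std()
    peaks = []
    for i in range(1, len(delta) - 1):
        is_local_max = delta[i] >= delta[i - 1] and delta[i] >= delta[i + 1]
        if delta[i] > threshold and is_local_max:
            peaks.append(i)
    return peaks
```

A peak at index i means the largest local jump sits between positions i and i + 1, so a candidate chapter would start at position i + 1.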

In signal-processing terms, the greedy order behaves like a locally-stable but globally-unstable filter — it drifts. The TSP approach enforces global phase coherence, reducing that drift and flattening the noise floor of the sequence.

With the new ordering, the SSM looks like this:

Look at the matrix around positions ~25 and ~75 in the sequence.

Note: while the Δ-distance plot captures local continuity between consecutive tracks, the Self-Similarity Matrix provides a global representation of structural relationships, as it considers all pairwise similarities (all vs all). Chapter boundaries in the SSM therefore reflect more stable, cluster-level transitions.

And the "Sent From My Telephone - Semantic Order - TSP Algorithm" Spotify playlist is here.

Bonus: Acoustic Analysis

While I was at it, using librosa 0.1x, I extracted tempo, key, RMS energy, and spectral centroid for each .m4a.

The resulting .csv file is here.
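The post does not show the extraction code, so this is only a sketch of how those four features could be obtained with librosa. Note that librosa has no built-in key detector: the "key" here is simply the strongest average pitch class of a chromagram, a crude stand-in for proper key estimation. The function names and dictionary keys are hypothetical.

```python
import numpy as np

# librosa's chroma features are ordered starting from C
PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def dominant_pitch_class(chroma):
    """Name of the pitch class with the highest mean energy.

    `chroma` has shape (12, n_frames), as returned by librosa.feature.chroma_cqt.
    """
    return PITCH_CLASSES[int(np.argmax(chroma.mean(axis=1)))]


def extract_features(path):
    """Tempo, rough key, mean RMS energy and mean spectral centroid of a file.

    Needs librosa, plus an audio backend able to decode .m4a files.
    """
    import librosa  # imported lazily; only this function depends on it

    y, sr = librosa.load(path, sr=None, mono=True)
    tempo, _ = librosa.beat.beat_track(y=y, sr=sr)
    rms = librosa.feature.rms(y=y)
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)
    chroma = librosa.feature.chroma_cqt(y=y, sr=sr)
    return {
        "file": str(path),
        # beat_track may return the tempo as a scalar or a 1-element array
        "tempo_bpm": float(np.atleast_1d(tempo)[0]),
        "key": dominant_pitch_class(chroma),
        "rms": float(rms.mean()),
        "spectral_centroid_hz": float(centroid.mean()),
    }
```

On short spoken-word tracks the tempo and key estimates are likely to be noisy, so they are better read as rough descriptors than as musical facts.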

I made a playlist based on key and energy. Tracks are sorted by key (A, A#, B, C, C#, D, D#, E, F, F#, G, G#) and then, inside each group, by RMS energy.
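That two-level sort can be expressed as a single sort key. In this sketch the rows are dictionaries with hypothetical `key` and `rms` fields, and energy is assumed to ascend within each key (the post does not say which direction was used).

```python
# Key order as listed in the post
KEY_ORDER = ["A", "A#", "B", "C", "C#", "D", "D#", "E", "F", "F#", "G", "G#"]


def sort_by_key_and_energy(rows):
    """Sort tracks by musical key, then by RMS energy inside each key group."""
    rank = {key: i for i, key in enumerate(KEY_ORDER)}
    return sorted(rows, key=lambda r: (rank[r["key"]], r["rms"]))


# Made-up rows, only to show the shape of the data
example = [
    {"track": "t1", "key": "C", "rms": 0.20},
    {"track": "t2", "key": "A", "rms": 0.30},
    {"track": "t3", "key": "A", "rms": 0.10},
]
playlist = sort_by_key_and_energy(example)
```

For descending energy within each key, the second element of the sort key would be `-r["rms"]`.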

Closing Thoughts

Fittingly, the TSP playlist closes the cycle with “Object Positioning”, a 33-second track that simply says, “What? I didn’t understand a word. I think I heard notes.”

Reordering the voice notes by semantic proximity produces a trajectory that isn’t present in the original order: quotations group with quotations, interpersonal loops group together, and low-context content converges at the end.
