Mapping Consonants as Percussion: A Small Experiment with Whisper and Audio Analysis
Audio Analysis, Whisper, Vocal Percussion, Creative AI, Music Tech, Plosive Consonants
I've been playing with treating consonant sounds in vocals as if they were percussion hits. Many singers don't just carry melody and semantics with their voices: knowingly or not, they also use the attack of consonants (p, t, k, ch) as rhythmic accents that become part of the drumming texture.
The experiment was simple:
- Take a track ("The World Backwards" by the UK band Broadcast) and run it through Whisper, OpenAI's FOSS speech recognition model (see the sketch after this list).
- From the transcript, ignore the accuracy and meaning of the words altogether and pay attention only to their plosive consonants. If the model hears "feel" instead of "fear", or "pattern" instead of "patent", we don't care: we're after the plosives, not semantic accuracy.
- Place those events on a timeline aligned with a bar/beat grid in 3/4 at 188 BPM (the song's meter and tempo).
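For the transcription step, Whisper's command-line tool can emit word-level timestamps as JSON. A minimal sketch, assuming the openai-whisper package; the model size and file names are placeholders, not the exact settings used here:

whisper track.mp3 --model small --word_timestamps True --output_format json

# Load Whisper's JSON output (the CLI names the file after the input)
import json

with open("track.json") as f:
    result = json.load(f)

# Flatten the per-segment word lists into one transcript; each entry
# is a dict like {"word": " pattern", "start": 12.3, "end": 12.6}
transcript = [w for segment in result["segments"] for w in segment["words"]]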
What comes out is a kind of map of vocal percussion, a scatter of markers where the singer's consonants align with or play against the beat.
Why bother with this? Maybe because lyrics aren't just semantics or poetry, but also audio events that interact with rhythm, just like a hi-hat or a snare. By visualizing them this way, you can see how the human voice is simultaneously melody, meaning, and percussion.
The Technical Side
The Python implementation was straightforward:
- Parse Whisper's JSON output with word-level timestamps
- Filter for words containing plosive consonants (p, b, t, d, k, g, c, q, ch)
- Plot them against beat markers calculated from the BPM (see the grid sketch after the snippet below)
# Simple plosive detection (a crude substring check on spellings, not sounds)
plosives = ["p", "b", "t", "d", "k", "g", "c", "q", "ch"]
events = []
for word in transcript:  # word dicts from the Whisper output above
    if any(plosive in word["word"].lower() for plosive in plosives):
        events.append(word["start"])
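And the plotting step: a minimal sketch assuming the events list from the snippet above plus matplotlib, with the grid computed from the stated 188 BPM and 3/4 meter rather than detected from the audio:

# Plosive timestamps against a calculated beat grid (no beat tracking)
import matplotlib.pyplot as plt

BPM = 188
BEATS_PER_BAR = 3                      # 3/4 meter
beat = 60.0 / BPM                      # ~0.319 s per beat at 188 BPM

n_beats = int(max(events) / beat) + 1  # rough extent; ideally use the audio duration
beats = [i * beat for i in range(n_beats)]
bars = beats[::BEATS_PER_BAR]          # a bar line every three beats

fig, ax = plt.subplots(figsize=(12, 2))
ax.vlines(beats, 0, 0.5, colors="lightgray", label="beats")
ax.vlines(bars, 0, 1.0, colors="gray", label="bars")
ax.scatter(events, [0.75] * len(events), marker="x", color="crimson", label="plosives")
ax.set_xlabel("time (s)")
ax.set_yticks([])
ax.legend(loc="upper right")
plt.show()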
Unexpected Insights
It's a tiny project, but it sparked some observations:
- Machine transcription tools aren't just for accessibility or subtitles.
- Rhythm isn't only in drums and basslines.
- You can strip away meaning and listen to the raw mechanics of sound.
What's Next?
The logical next step? What if we applied this to glossolalia or constructed languages like Simlish? Whisper would still attempt to fit phonetic patterns to real words, potentially revealing the "envelope", the rhythmic skeleton, of the vocals.
Sometimes you should try to use tools for purposes they weren't designed for 😸