Filtering spam using cheap heuristics
LinkedIn, Indeed, and many other websites send me weekly "opportunities" / "picks" / whatever. Most of those emails go gracefully ignored. But then I thought I might be missing something great. I still would not follow every link, but there were a couple of genuinely interesting ones...
Since modern engineering and development have a recurring tendency to escalate simple problems into machine learning problems, I tried to prove to myself and the universe that a lightweight heuristic system is frequently sufficient. You can kill a mosquito with an Orbital Ion Cannon, but reaching for cheaper tools first is usually the preferred approach.
The practical problem does not require deep semantic understanding. It is about filtering large volumes of semi-structured noise down to a smaller set of potentially relevant messages.
For instance, a deterministic token-overlap approach has several advantages over more fashionable ML-heavy alternatives...
The best solution to a problem is usually the easiest one.
First, computational cost is negligible. A simple set intersection between a vocabulary extracted from a CV and the lowercased, tokenized content of recruiter emails executes almost instantly, even across thousands of messages. No GPU inference, API billing, vector indexing, or model maintenance is required.
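As a minimal sketch of that core operation (the CV vocabulary and email text below are invented for illustration), matching is just a set intersection, which also makes every score directly explainable through the matched terms:

```python
import re

def tokenize(text: str) -> set[str]:
    """Lowercase the text and split on non-alphanumeric characters.
    '+' and '#' are kept so tokens like 'c++' and 'c#' survive."""
    return {t for t in re.split(r"[^a-z0-9+#]+", text.lower()) if t}

# Hypothetical vocabulary extracted from a CV, and one recruiter email.
cv_vocab = tokenize("Python developer, Django, PostgreSQL, Docker, CI/CD")
email = "Exciting opportunity: Senior Python Engineer (Django, AWS, Docker)"

# The overlap itself is the explanation of the match.
matched = cv_vocab & tokenize(email)
print(sorted(matched))  # ['django', 'docker', 'python']
```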
Second, the system remains inspectable. The score assigned to an email can be directly explained through the matched terms. This is operationally useful. When a ranking looks incorrect, debugging consists of examining tokenization and scoring rules rather than reverse-engineering opaque embedding behavior.
Third, heuristic pipelines tend to degrade gracefully. Recruiter emails are highly repetitive documents filled with tracking links, HTML remnants, unsubscribe sections, and generic corporate language. A small amount of preprocessing, such as removing URLs, already eliminates a substantial portion of the irrelevant variance. The remaining signal is usually dominated by technical keywords, which are exactly the features a heuristic matcher handles well.
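A noise-reduction pass can be sketched with a few regular expressions; the rules below are illustrative, not the exact ones used:

```python
import re

def clean(raw: str) -> str:
    """Strip common email clutter before tokenization (illustrative rules)."""
    text = re.sub(r"<[^>]+>", " ", raw)       # leftover HTML tags
    text = re.sub(r"https?://\S+", " ", text) # URLs and tracking links
    text = re.sub(r"\S{30,}", " ", text)      # extremely long tokens (hashes, base64 blobs)
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace

raw = '<a href="https://t.example/click?id=abc123">Apply now</a> Python role'
print(clean(raw))  # Apply now Python role
```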
A weekly recruiter digest does not need perfect semantic understanding. It only needs to separate “possibly relevant” from “probably irrelevant” with acceptable reliability. Also, systems that are understandable, deterministic, and cheap to operate are easier to maintain.
Core design choices
- Source-driven vocabulary: My CV and related files are treated as the authoritative profile. This biases the matching toward domains already relevant to the user (me).
- Simple token overlap: Similarity is computed through set intersection between source vocabulary and email vocabulary. This keeps the system computationally cheap, transparent, and easy to debug.
- Normalized scoring: Scores are divided by sqrt(email_length) to reduce bias from long emails, while avoiding excessive penalties on verbose but relevant messages.
- Noise reduction: URLs, tracking artifacts, HTML remnants, and extremely long tokens are removed before tokenization, mitigating common email clutter.
- Adaptive thresholding: Instead of a hardcoded cutoff, the script computes a dynamic threshold based on the average score of the current email batch.
- Minimal dependencies: The pipeline relies mostly on Python standard library components, improving portability and reducing operational complexity.
- Human-readable output: The resulting matches and scores are in plain text.
- Telegram integration: Notifications provide a compact summary of the results.
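Putting the middle three choices together, the scoring and adaptive-thresholding logic might look like the sketch below. The function names, sample emails, and vocabulary are invented for illustration; only the sqrt normalization and batch-average cutoff follow the design described above:

```python
import math
import re

def tokenize(text: str) -> list[str]:
    """Lowercase and split on non-alphanumeric characters ('+' and '#' kept)."""
    return [t for t in re.split(r"[^a-z0-9+#]+", text.lower()) if t]

def score(cv_vocab: set[str], email_text: str) -> float:
    """Token overlap normalized by sqrt(email_length) to damp long-email bias."""
    tokens = tokenize(email_text)
    if not tokens:
        return 0.0
    return len(cv_vocab & set(tokens)) / math.sqrt(len(tokens))

# Hypothetical CV vocabulary and a small batch of recruiter emails.
cv_vocab = set(tokenize("Python Django PostgreSQL Docker Kubernetes CI/CD"))
emails = [
    "Senior Python Engineer: Django, Docker, remote",
    "Sell more with our marketing platform, limited offer",
    "Platform role: Kubernetes, PostgreSQL, Python, on-site",
]

scores = [score(cv_vocab, e) for e in emails]
# Adaptive threshold: the average score of the current batch,
# instead of a hardcoded cutoff.
threshold = sum(scores) / len(scores)
relevant = [e for e, s in zip(emails, scores) if s >= threshold]
```

With this batch, the marketing email scores zero overlap and falls below the average, while both engineering emails clear it.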