Longitudinal Sentiment Analysis of Personal Chat Logs
Objective
Analyze ~36,000 personal WhatsApp messages to:
- Normalize and structure raw chat logs
- Store them in a queryable database
- Apply Spanish transformer-based sentiment analysis
- Aggregate results weekly
- Observe longitudinal emotional patterns
Parsing and Normalization
Input format
A WhatsApp export contains lines like:
23/07/2020, 11:59 - Facundo Ipharraguerre: Para gastar más
Parsing strategy
- Regex pattern:
- Extract timestamp
- Extract username
- Extract message
- Handle multiline messages
- Filter unwanted content:
- media placeholders
- URLs
- empty/symbol-only lines
Core regex:
r"^(\d{2}/\d{2}/\d{4}, \d{2}:\d{2}) - ([^:]+): (.+)"
Normalization decisions
- Preserve raw text
- Convert timestamps to ISO 8601
- Store clean text without system noise
- Do not lowercase globally (preserve signal for NLP)
Result: structured rows of:
(timestamp, participant, message)
SQLite Ingestion
Instead of per-user text files, use a relational model:
Tables
- conversations
- participants
- messages
Key design:
- Single messages table
- Foreign keys to participants and conversations
- Indexed by participant and timestamp
Benefits:
- Flexible aggregation
- Cross-user comparison
- Time-based queries
- Easy extension to MSN archives later
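A schema sketch matching these design notes (column names are assumptions; the actual project schema may differ):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # swap for a file path in practice
conn.executescript("""
CREATE TABLE conversations (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE participants (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE messages (
    id              INTEGER PRIMARY KEY,
    conversation_id INTEGER NOT NULL REFERENCES conversations(id),
    participant_id  INTEGER NOT NULL REFERENCES participants(id),
    timestamp       TEXT NOT NULL,  -- ISO 8601
    text            TEXT NOT NULL
);
CREATE INDEX idx_messages_participant ON messages(participant_id);
CREATE INDEX idx_messages_timestamp   ON messages(timestamp);
""")
```

Storing timestamps as ISO 8601 text keeps them lexically sortable, so the timestamp index supports range queries directly.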
Temporal Behavioral Metrics
Before sentiment, basic structural analysis:
- Messages per participant
- Activity by quarter of day (00–06, 06–12, 12–18, 18–00)
- Monthly and weekly aggregation
This establishes behavioral baseline independent of emotional interpretation.
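Both metrics reduce to simple SQL. A self-contained sketch, assuming only a messages table with ISO 8601 timestamps (a minimal stand-in table is created here for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (participant TEXT, timestamp TEXT)")  # minimal stand-in
conn.executemany("INSERT INTO messages VALUES (?, ?)", [
    ("A", "2020-07-23T05:30:00"),
    ("A", "2020-07-23T11:59:00"),
    ("B", "2020-07-23T19:10:00"),
])

# Messages per participant
per_user = conn.execute(
    "SELECT participant, COUNT(*) FROM messages GROUP BY participant"
).fetchall()

# Activity by quarter of day: integer-dividing the hour by 6 maps
# buckets 0..3 to 00-06, 06-12, 12-18, 18-00
by_quarter = conn.execute(
    "SELECT CAST(strftime('%H', timestamp) AS INTEGER) / 6 AS quarter, COUNT(*) "
    "FROM messages GROUP BY quarter ORDER BY quarter"
).fetchall()
```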
Transformer-Based Sentiment (Spanish)
Used pysentimiento, a Spanish NLP toolkit built on transformers.
For each message, the model outputs probabilities:
NEG, NEU, POS
Example:
"La comida está bien, pero el servicio es lento."→ NEG 0.72
Continuous Sentiment Index
Instead of categorical labels, derive a numeric signal: sentiment = P(POS) − P(NEG)
Properties:
- Range ≈ [-1, +1]
- Neutral messages ≈ 0
- Strong polarity weighted by model confidence
This avoids a crude +1 / -1 encoding.
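The index is a small pure function over the model's three class probabilities (shown here as a plain dict, mirroring the probability mapping a toolkit like pysentimiento returns per message):

```python
def sentiment_index(probas):
    """Map class probabilities to a continuous score in [-1, +1].

    probas: dict with keys "POS", "NEG", "NEU", summing to ~1.
    A confident positive lands near +1, a confident negative near -1,
    and a mostly-neutral message stays near 0, so polarity is
    automatically weighted by model confidence.
    """
    return probas["POS"] - probas["NEG"]
```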
Weekly Aggregation
Messages are grouped by ISO week.
Computed per week:
- mean_sentiment
- std_sentiment (volatility proxy)
- mean_pos
- mean_neg
- message_count
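A stdlib-only sketch of the grouping step (mean_pos and mean_neg are omitted here for brevity; they aggregate the same way from the per-message probabilities):

```python
from collections import defaultdict
from statistics import mean, stdev

def weekly_stats(rows):
    """Aggregate (date, sentiment) pairs by ISO (year, week).

    Returns {(iso_year, iso_week): {"mean_sentiment", "std_sentiment",
    "message_count"}}.
    """
    buckets = defaultdict(list)
    for d, s in rows:
        iso = d.isocalendar()  # (ISO year, ISO week, ISO weekday)
        buckets[(iso[0], iso[1])].append(s)
    return {
        week: {
            "mean_sentiment": mean(scores),
            "std_sentiment": stdev(scores) if len(scores) > 1 else 0.0,
            "message_count": len(scores),
        }
        for week, scores in buckets.items()
    }
```

Grouping by ISO year *and* week avoids merging week 1 of one year with week 1 of another.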
Observed Patterns (in glorious Excel 2003)
From the output:
- Baseline slightly negative (~ -0.2)
- Consistent volatility (stdev ≈ 0.067)
- Occasional strong positive spikes (on vacations)
- No extreme polarity drift
Baseline Removal (“DC Coupling” Adjustment)
Initial weekly sentiment showed a persistent negative offset (≈ −0.2). This likely reflects:
- Conversational framing (complaints, problem-solving tone)
- Cultural linguistic patterns
- Model calibration bias
- Personal writing style
To focus on variation rather than absolute polarity, the weekly series was mean-centered: each week's value minus the overall mean.
This transformation:
- Preserves relative swings
- Removes stylistic bias
- Makes upward/downward deviations easier to see
- Allows comparison across different periods or platforms
Conceptually, this is equivalent to removing the DC component in signal processing.
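The centering step itself is one line over the weekly series:

```python
from statistics import mean

def mean_center(series):
    """Subtract the long-run mean so 0 marks the baseline (DC removal)."""
    baseline = mean(series)
    return [x - baseline for x in series]
```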
After normalization:
- Zero represents the long-run expressive baseline
- Positive values represent relatively more positive weeks
- Negative values represent relatively more negative weeks
This makes the graph about dynamics, not personality labeling.
What This Project Demonstrates
- Personal archives can be structured into research-grade datasets
- SQLite is sufficient for longitudinal text analysis
- Transformer sentiment models can scale to tens of thousands of short texts
- Weekly aggregation provides stable signal
- Self-quantification via NLP is feasible without large infrastructure