Longitudinal Sentiment Analysis of Personal Chat Logs
Objective
Analyze ~36,000 personal WhatsApp messages to:
- Normalize and structure raw chat logs
- Store them in a queryable database
- Apply Spanish transformer-based sentiment analysis
- Aggregate results weekly
- Observe longitudinal emotional patterns
Parsing and Normalization
Input format
A WhatsApp export contains lines like:
23/07/2020, 11:59 - Facundo Ipharraguerre: Para gastar más
Parsing strategy
- Regex pattern:
- Extract timestamp
- Extract username
- Extract message
- Handle multiline messages
- Filter unwanted content:
- media placeholders
- URLs
- empty/symbol-only lines
Core regex:
r"^(\d{2}/\d{2}/\d{4}, \d{2}:\d{2}) - ([^:]+): (.+)"
Normalization decisions
- Preserve raw text
- Convert timestamps to ISO 8601
- Store clean text without system noise
- Do not lowercase globally (preserve signal for NLP)
Result: structured rows of:
(timestamp, participant, message)
SQLite Ingestion
Instead of per-user text files, use a relational model:
Tables
- conversations
- participants
- messages
Key design:
- Single messages table
- Foreign keys to participants and conversations
- Indexed by participant and timestamp
Benefits:
- Flexible aggregation
- Cross-user comparison
- Time-based queries
- Easy extension to MSN archives later
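A schema sketch matching these design notes (column names are assumptions; the actual project schema may differ):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # swap for a file path in practice
conn.executescript("""
CREATE TABLE conversations (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE participants (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE messages (
    id              INTEGER PRIMARY KEY,
    conversation_id INTEGER NOT NULL REFERENCES conversations(id),
    participant_id  INTEGER NOT NULL REFERENCES participants(id),
    timestamp       TEXT NOT NULL,  -- ISO 8601
    text            TEXT NOT NULL
);
CREATE INDEX idx_messages_participant ON messages(participant_id);
CREATE INDEX idx_messages_timestamp   ON messages(timestamp);
""")
```

Storing timestamps as ISO 8601 text keeps them lexically sortable, so the timestamp index supports range queries directly.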
Temporal Behavioral Metrics
Before sentiment, basic structural analysis:
- Messages per participant
- Activity by quarter of day (00–06, 06–12, 12–18, 18–00)
- Monthly and weekly aggregation
This establishes behavioral baseline independent of emotional interpretation.
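Both metrics reduce to simple SQL. A self-contained sketch, assuming only a messages table with ISO 8601 timestamps (a minimal stand-in table is created here for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (participant TEXT, timestamp TEXT)")  # minimal stand-in
conn.executemany("INSERT INTO messages VALUES (?, ?)", [
    ("A", "2020-07-23T05:30:00"),
    ("A", "2020-07-23T11:59:00"),
    ("B", "2020-07-23T19:10:00"),
])

# Messages per participant
per_user = conn.execute(
    "SELECT participant, COUNT(*) FROM messages GROUP BY participant"
).fetchall()

# Activity by quarter of day: integer-dividing the hour by 6 maps
# buckets 0..3 to 00-06, 06-12, 12-18, 18-00
by_quarter = conn.execute(
    "SELECT CAST(strftime('%H', timestamp) AS INTEGER) / 6 AS quarter, COUNT(*) "
    "FROM messages GROUP BY quarter ORDER BY quarter"
).fetchall()
```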
Transformer-Based Sentiment (Spanish)
Used pysentimiento, a Spanish NLP toolkit built on transformers.
For each message, the model outputs probabilities:
NEG, NEU, POS
Example:
"La comida está bien, pero el servicio es lento."→ NEG 0.72
Continuous Sentiment Index
Instead of categorical labels, derive a numeric signal: sentiment = P(POS) − P(NEG)
Properties:
- Range ≈ [-1, +1]
- Neutral messages ≈ 0
- Strong polarity weighted by model confidence
This avoids a crude +1 / -1 encoding.
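The index is a small pure function over the model's three class probabilities (shown here as a plain dict, mirroring the probability mapping a toolkit like pysentimiento returns per message):

```python
def sentiment_index(probas):
    """Map class probabilities to a continuous score in [-1, +1].

    probas: dict with keys "POS", "NEG", "NEU", summing to ~1.
    A confident positive lands near +1, a confident negative near -1,
    and a mostly-neutral message stays near 0, so polarity is
    automatically weighted by model confidence.
    """
    return probas["POS"] - probas["NEG"]
```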
Weekly Aggregation
Messages are grouped by ISO week.
Computed per week:
- mean_sentiment
- std_sentiment (volatility proxy)
- mean_pos
- mean_neg
- message_count
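A stdlib-only sketch of the grouping step (mean_pos and mean_neg are omitted here for brevity; they aggregate the same way from the per-message probabilities):

```python
from collections import defaultdict
from statistics import mean, stdev

def weekly_stats(rows):
    """Aggregate (date, sentiment) pairs by ISO (year, week).

    Returns {(iso_year, iso_week): {"mean_sentiment", "std_sentiment",
    "message_count"}}.
    """
    buckets = defaultdict(list)
    for d, s in rows:
        iso = d.isocalendar()  # (ISO year, ISO week, ISO weekday)
        buckets[(iso[0], iso[1])].append(s)
    return {
        week: {
            "mean_sentiment": mean(scores),
            "std_sentiment": stdev(scores) if len(scores) > 1 else 0.0,
            "message_count": len(scores),
        }
        for week, scores in buckets.items()
    }
```

Grouping by ISO year *and* week avoids merging week 1 of one year with week 1 of another.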
Observed Patterns (in glorious Excel 2003)
From the output:
- Baseline slightly negative (~ -0.2)
- Consistent volatility (stdev ≈ 0.067)
- Occasional strong positive spikes (on vacations)
- No extreme polarity drift
Baseline Removal (“DC Coupling” Adjustment)
Initial weekly sentiment showed a persistent negative offset (≈ −0.2). This likely reflects:
- Conversational framing (complaints, problem-solving tone)
- Cultural linguistic patterns
- Model calibration bias
- Personal writing style
To focus on variation rather than absolute polarity, the weekly series was mean-centered: each week's value minus the overall mean.
This transformation:
- Preserves relative swings
- Removes stylistic bias
- Makes upward/downward deviations easier to see
- Allows comparison across different periods or platforms
Conceptually, this is equivalent to removing the DC component in signal processing.
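The centering step itself is one line over the weekly series:

```python
from statistics import mean

def mean_center(series):
    """Subtract the long-run mean so 0 marks the baseline (DC removal)."""
    baseline = mean(series)
    return [x - baseline for x in series]
```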
After normalization:
- Zero represents the long-run expressive baseline
- Positive values represent relatively more positive weeks
- Negative values represent relatively more negative weeks
This makes the graph about dynamics, not personality labeling.
What This Project Demonstrates
- Personal archives can be structured into research-grade datasets
- SQLite is sufficient for longitudinal text analysis
- Transformer sentiment models can scale to tens of thousands of short texts
- Weekly aggregation provides stable signal
- Self-quantification via NLP is feasible without large infrastructure