Longitudinal Sentiment Analysis of Personal Chat Logs

February 2026

Objective

Analyze ~36,000 personal WhatsApp messages to:

  • Normalize and structure raw chat logs
  • Store them in a queryable database
  • Apply Spanish transformer-based sentiment analysis
  • Aggregate results weekly
  • Observe longitudinal emotional patterns

Parsing and Normalization

Input format

A WhatsApp export contains lines like:

23/07/2020, 11:59 - Facundo Ipharraguerre: Para gastar más

Parsing strategy

  • Regex pattern:
    • Extract timestamp
    • Extract username
    • Extract message
  • Handle multiline messages
  • Filter unwanted content:
    • media placeholders
    • URLs
    • empty/symbol-only lines

Core regex:

r"^(\d{2}/\d{2}/\d{4}, \d{2}:\d{2}) - ([^:]+): (.+)"
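A minimal parsing sketch built around this regex (function and variable names are illustrative; the media/URL filtering step is omitted here):

```python
import re
from datetime import datetime

# Matches "23/07/2020, 11:59 - Name: message"
LINE_RE = re.compile(r"^(\d{2}/\d{2}/\d{4}, \d{2}:\d{2}) - ([^:]+): (.+)")

def parse_chat(lines):
    """Yield (iso_timestamp, participant, message) rows.

    Lines that do not match the pattern are treated as
    continuations of the previous message (multiline handling).
    """
    current = None
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            if current:
                yield current
            # Convert the WhatsApp timestamp to ISO 8601
            ts = datetime.strptime(m.group(1), "%d/%m/%Y, %H:%M")
            current = (ts.isoformat(), m.group(2), m.group(3))
        elif current:
            # Continuation line: append to the previous message
            current = (current[0], current[1],
                       current[2] + "\n" + line.rstrip("\n"))
    if current:
        yield current
```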

Normalization decisions

  • Preserve raw text
  • Convert timestamps to ISO 8601
  • Store clean text without system noise
  • Do not lowercase globally (preserve signal for NLP)

Result: structured rows of:

(timestamp, participant, message)

SQLite Ingestion

Instead of per-user text files, use a relational model:

Tables

  • conversations
  • participants
  • messages

Key design:

  • Single messages table
  • Foreign keys to participants and conversations
  • Indexed by participant and timestamp
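A sketch of this design via Python's sqlite3 (the exact table and column names are assumptions based on the description above):

```python
import sqlite3

# Hypothetical schema matching the three tables described above
SCHEMA = """
CREATE TABLE IF NOT EXISTS conversations (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS participants (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE IF NOT EXISTS messages (
    id              INTEGER PRIMARY KEY,
    conversation_id INTEGER NOT NULL REFERENCES conversations(id),
    participant_id  INTEGER NOT NULL REFERENCES participants(id),
    timestamp       TEXT NOT NULL,   -- ISO 8601
    message         TEXT NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_messages_participant ON messages(participant_id);
CREATE INDEX IF NOT EXISTS idx_messages_timestamp   ON messages(timestamp);
"""

def init_db(path=":memory:"):
    """Open the database and ensure the schema exists."""
    con = sqlite3.connect(path)
    con.executescript(SCHEMA)
    return con
```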

Benefits:

  • Flexible aggregation
  • Cross-user comparison
  • Time-based queries
  • Easy extension to MSN archives later

Temporal Behavioral Metrics

Before sentiment, basic structural analysis:

  • Messages per participant
  • Activity by quarter of day (00–06, 06–12, 12–18, 18–00)
  • Monthly and weekly aggregation

This establishes a behavioral baseline independent of emotional interpretation.
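Both counts fall out of plain SQL against the schema above; a sketch (table and column names are illustrative):

```python
import sqlite3

def temporal_metrics(con):
    """Return (messages per participant, messages per quarter of day).

    Quarter 0 = 00-06, 1 = 06-12, 2 = 12-18, 3 = 18-00.
    Assumes the messages/participants tables sketched earlier.
    """
    per_participant = con.execute(
        "SELECT p.name, COUNT(*) FROM messages m "
        "JOIN participants p ON p.id = m.participant_id "
        "GROUP BY p.name ORDER BY COUNT(*) DESC"
    ).fetchall()
    by_quarter = con.execute(
        # Integer division of the hour by 6 buckets it into quarters
        "SELECT CAST(strftime('%H', m.timestamp) AS INTEGER) / 6, COUNT(*) "
        "FROM messages m GROUP BY 1 ORDER BY 1"
    ).fetchall()
    return per_participant, by_quarter
```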

Transformer-Based Sentiment (Spanish)

Used pysentimiento, a Spanish NLP toolkit built on transformers.

For each message, the model outputs probabilities:

NEG, NEU, POS

Example:

"La comida está bien, pero el servicio es lento." → NEG 0.72

Continuous Sentiment Index

Instead of categorical labels, derive a numeric signal:

sentiment_index = POS - NEG

Properties:

  • Range ≈ [-1, +1]
  • Neutral messages ≈ 0
  • Strong polarity weighted by model confidence

This avoids a crude +1 / -1 encoding.
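The transform itself is a one-liner over the per-class probabilities; a sketch (the commented analyzer call reflects pysentimiento's usual entry point, not code from this project):

```python
# With pysentimiento (not run here), per-message scoring looks like:
#   from pysentimiento import create_analyzer
#   analyzer = create_analyzer(task="sentiment", lang="es")
#   probas = analyzer.predict(text).probas   # {"NEG": ..., "NEU": ..., "POS": ...}

def sentiment_index(probas):
    """Continuous polarity in roughly [-1, +1]: POS minus NEG probability."""
    return probas["POS"] - probas["NEG"]
```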

Weekly Aggregation

Messages are grouped by ISO week.

Computed per week:

  • mean_sentiment
  • std_sentiment (volatility proxy)
  • mean_pos
  • mean_neg
  • message_count
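A pandas sketch of this rollup (column names are assumptions; `timestamp` must already be a datetime column):

```python
import pandas as pd

def weekly_stats(df):
    """Aggregate per ISO week.

    Expects columns 'timestamp' (datetime), 'sentiment_index',
    'pos', and 'neg' (names are illustrative).
    """
    iso = df["timestamp"].dt.isocalendar()
    # Key like "2020-W30" so weeks sort chronologically within a year
    key = iso["year"].astype(str) + "-W" + iso["week"].astype(str).str.zfill(2)
    return df.groupby(key).agg(
        mean_sentiment=("sentiment_index", "mean"),
        std_sentiment=("sentiment_index", "std"),   # volatility proxy
        mean_pos=("pos", "mean"),
        mean_neg=("neg", "mean"),
        message_count=("sentiment_index", "size"),
    )
```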

Observed Patterns (in glorious Excel 2003)

From the output:

  • Baseline slightly negative (~ -0.2)
  • Consistent volatility (~0.067 stdev)
  • Occasional strong positive spikes (on vacations)
  • No extreme polarity drift

Baseline Removal (“DC Coupling” Adjustment)

Initial weekly sentiment showed a persistent negative offset (≈ −0.2). This likely reflects:

  • Conversational framing (complaints, problem-solving tone)
  • Cultural linguistic patterns
  • Model calibration bias
  • Personal writing style

To focus on variation rather than absolute polarity, the weekly series was mean-centered:

sentiment_normalized = sentiment_weekly - mean(sentiment_weekly)

This transformation:

  • Preserves relative swings
  • Removes stylistic bias
  • Makes upward/downward deviations easier to see
  • Allows comparison across different periods or platforms

Conceptually, this is equivalent to removing the DC component in signal processing.
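The centering step itself is trivial; a sketch over the weekly series:

```python
import statistics

def mean_center(weekly_sentiment):
    """Subtract the long-run mean so 0 marks the expressive baseline."""
    baseline = statistics.fmean(weekly_sentiment)
    return [s - baseline for s in weekly_sentiment]
```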

After normalization:

  • Zero represents the long-run expressive baseline
  • Positive values represent relatively more positive weeks
  • Negative values represent relatively more negative weeks

This makes the graph about dynamics, not personality labeling.

What This Project Demonstrates

  • Personal archives can be structured into research-grade datasets
  • SQLite is sufficient for longitudinal text analysis
  • Transformer sentiment models can scale to tens of thousands of short texts
  • Weekly aggregation provides stable signal
  • Self-quantification via NLP is feasible without large infrastructure