./how-its-built/data-foundation
[02]

Data Foundation - Understanding iMessage

Deep dive into the iMessage database schema, message decoding, and snapshot strategy.

Data Foundation

Alright so this whole project really starts with the data. I'm a believer that data will be a company's moat in the future, and given this one doesn't capture any data, not a great moat.

Before we can do anything - train models, or build our vector embeddings - it is good to understand where most iMessage data actually lives on disk.

The iMessage Database

Where Messages Live

If you've ever used iMessage on a Mac, there's a SQLite database sitting on your disk right now at:

~/Library/Messages/chat.db

You should literally try to ls it right now (partially as a sanity check, given this project falls apart if that doesn't exist). There are numerous factors that could influence this sqlite database. You'll most likely need to have iCloud messages enabled. It's important to note that iCloud is a sync system, so it will have a large amount of your iMessages local for reference. This is not guaranteed to be all of your iMessages. Apple states:

When you set up iCloud for Messages, your messages are stored in the cloud, with only the most recently accessed Messages stored locally on your device. You can see them on any device that's set up for iCloud and Messages, including your iPhone, iPad, and Mac.

So basically if you have no iCloud storage on, they will sync locally which you can do. Generally, I have iCloud syncing enabled, and the vast majority of my messages are in this local database. Other engineers report the same finding, but it's not definitive given Apple's iMessage and syncing strategy around the iMessage chat.db database, is a bit opaque. There is still a value add here given I think most of the questions will be biased temporally to recent events.

One note: since MacOS Mojave (10.14), Apple has put ~/Library/Messages behind TCC (transparency, consent, and control). That's why we need *Full Disk Access granted to the process that is trying to read from it. And that is why you'll need to grant Be Right Back Full Disk Access (FDA) during setup. We need to be able to read from that folder.

Database Schema

Apple's schema is rather... complex? Probably with good reason, and I don't know the full story, but I would imagine the variety of the oss that are supported, the mass volume (roughly 8.4B/day in 2024) has led to some of these growing pains.

We'll take a simpler first pass, and focus on the four tables that BRB cares about. These are taken literally from my laptop with my specific os version:

╭─johnlarkin@Mac ~/Documents/coding/be-right-back/code/be-right-back ‹feat-blog-work-pt3*›
╰─➤  sw_vers
ProductName:		macOS
ProductVersion:		26.4
BuildVersion:		25E246

╭─johnlarkin@Mac ~/Documents/coding/be-right-back/code/be-right-back ‹feat-blog-work-pt3*›
╰─➤  sqlite3 ~/Library/Messages/chat.db ".tables"
_SqliteDatabaseProperties
attachment
chat
chat_handle_join
chat_lookup
chat_message_join
chat_recoverable_message_join
chat_service
deleted_messages
handle
index_state_metrics
kvtable
message
message_attachment_join
message_processing_task
persistent_tasks
recoverable_message_part
scheduled_messages_pending_cloudkit_delete
sync_chat_slice
sync_deleted_attachments
sync_deleted_chats
sync_deleted_messages
unsynced_removed_recoverable_messages

Feel free to run this on your mac:

  sqlite3 ~/Library/Messages/chat.db ".schema message"
  sqlite3 ~/Library/Messages/chat.db ".schema handle"
  sqlite3 ~/Library/Messages/chat.db ".schema chat"
  sqlite3 ~/Library/Messages/chat.db ".schema chat_handle_join"

message - the core one, every message is a row

More than 100 cols, so here are the important ones:

col description
ROWID Primary key
text Plain text content (nullable - more on this)
attributedBody Binary blob containing an NSAttributedString (also nullable)
is_from_me 1 if you sent it, 0 if you received it
date Timestamp in Apple's Cocoa format (this is annoying from apple)
service "iMessage" or "SMS" or "RCS" (RCS is new)
associated_message_guid Non-null for reactions/tapbacks, links back to the original message

handle - contact identifiers (phone numbers, email addresses)

col description
ROWID Primary key
id The actual identifier - phone number or email address
service "iMessage" or "SMS" - same person on both is two separate rows
country Country code (e.g. "us"), nullable
uncanonicalized_id The original un-normalized identifier, nullable
person_centric_id Apple's attempt to merge duplicates of the same contact, nullable

chat - conversation metadata. Group chats have a non-null room_name

More than 30 cols, so here are the important ones:

col description
ROWID Primary key
guid Unique string identifier (e.g. iMessage;-;+15551234567)
chat_identifier Human-readable identifier (phone/email for 1
, group id for groups)
service_name "iMessage" or "SMS"
room_name NULL for 1
, non-null for group chats (this is how we detect)
display_name Custom group chat name if the user set one, nullable
style Chat style flag (1
vs group encoding)
is_archived 1 if the user archived the conversation

chat_handle_join - the junction table mapping participants to conversations

Tiny table, just the join:

col description
chat_id FK to chat.ROWID (cascades on delete)
handle_id FK to handle.ROWID (cascades on delete)

The relationship chain is: message -> chat_message_join -> chat -> chat_handle_join -> handle. I'm sure Apple has dedicated ways for handling this, but obviously it does appear to be a lot of joins to answer the question of "who sent this message".

The queries can get a bit complex for pulling information, and I'll walk you through one for example. This was something that even our lovely Claude Code/Codex agents failed at "one-shotting". Sure context problem and didn't fully go deep research -> agent hook, but validation was required.

/// Query to get all one on one messages for a specific contact
pub const GET_MESSAGES_FOR_CONTACT: &str = r#"
    WITH the_handle AS (
      SELECT ROWID AS handle_rowid,
             COALESCE(person_centric_id, id) AS persona_key
      FROM handle
      WHERE id = ?
    ),
    p2p_chats_for_person AS (
      SELECT ch.ROWID AS chat_id
      FROM chat ch
      JOIN chat_handle_join chj ON chj.chat_id = ch.ROWID
      JOIN handle h             ON h.ROWID     = chj.handle_id
      JOIN the_handle th        ON COALESCE(h.person_centric_id, h.id) = th.persona_key
      WHERE ch.room_name IS NULL
        AND ch.guid NOT LIKE '%;chat%'
      GROUP BY ch.ROWID
      HAVING COUNT(DISTINCT COALESCE(h.person_centric_id, h.id)) = 1
    )
    SELECT DISTINCT
      m.ROWID  AS rowid,
      m.text,
      m.attributedBody,
      m.is_from_me,
      m.date,
      m.service
    FROM message m
    JOIN chat_message_join cmj ON cmj.message_id = m.ROWID
    WHERE cmj.chat_id IN (SELECT chat_id FROM p2p_chats_for_person)
      AND m.item_type = 0
      AND m.is_system_message = 0
      -- keep inline replies; drop pure reactions/tapbacks:
      AND NOT (
        m.associated_message_guid IS NOT NULL
        AND m.text IS NULL
        AND m.attributedBody IS NULL
      )
    ORDER BY m.date ASC, m.ROWID ASC
    "#;

To walk through this a bit:

  • the_handle portion is really answering the question, "ok who is this number"
  • p2p_chats_for_person is filtering out and looking for one on one chats with this person across numerous handles. this is to get around the issue of iPhone numbers and their emails. So like my dear friend who got destroyed in our groupchat recently:
icloud woahs
  • the main select is pretty self explanatory
  • the filters are as follows:
    • m.item_type = 0 -> dictates regular iMessages
    • m.is_system_message -> filters out the rows like in our screenshot above, where we're saying "x left the chat"
    • NOT (...) -> drops the tapback reactions. they're stored as full message rows with the associated_message_guid set.

Apple's Timestamp Format

One interesting, and very annoying callout, is Apple doesn't use Unix timestamps. They instead use Cocoa/Foundation time - which are the seconds since Jan 1st, 2001 (rather than 1970). The offset is 978,307,200, so you'll probably see that offset constant stashed in some Python or Rust code.

Even more annoyingly is that at some point across macOS versions, Apple switched from storing seconds to nanoseconds... So the same column can contain values in two different units depending on when the message was sent or the underlying macOS version that was associated with it.

Our conversion logic uses a simple heuristic:

const COCOA_TO_UNIX_OFFSET: i64 = 978_307_200;

pub fn cocoa_to_unix(cocoa_ts: i64) -> i64 {
    normalize_cocoa_timestamp(cocoa_ts) + COCOA_TO_UNIX_OFFSET
}

fn normalize_cocoa_timestamp(ts: i64) -> i64 {
    if ts > 2_000_000_000 {
        ts / 1_000_000_000  // nanoseconds -> seconds
    } else {
        ts                   // already seconds
    }
}

This was a pretty objective threshold. Cocoa timestamps in seconds won't exceed that until 2064. So yes, very candidly this is a bug that we would break some timestamp parsing in 2064. I'm fine with that for now, because given the rate engineering is changing, I'd bet my life that no one will be using this in 2064.

Message Decoding Challenges

This was actually a real pain! Certainly wasn't a one-shot with Opus 4.6. I doubt it would have been with 4.7 either. I'll talk about some of the reasons why.

Plain Text vs NSAttributedString

I very incorrectly assumed that every message has its text in the text column. That assumption fails for lots of newer messages, basically anything with rich links, any kind of formatting, and I think storing in attributedBody is just the default now. attributedBody is a sqlite column that is stored as a serialized NSAttributedStringnsattributedstring.

So the conditional logic is if text is null, you should flip over to attributedBody to decode a binary blob and get the message content. I wasn't doing this at first and I was dropping a vast amount of conversation context.

Decoding Binary Blobs

It's interesting to talk about the attributedBody column and what it actually looks like on disk.

The column contains a NSKeyedArchive encoded NSAttributedString. This is Apple's binary serialization format. It's basically a plist wrapping some Objective-C objects and ohhhh boy do I not like working with Objective-C.

As I've discovered from working with the data, there are three different serialization formats that Apple has adopted over the years:

  1. Modern NSKeyedArchive - starts with magic bytes bplist00, uses the secure coding API
  2. Legacy NSKeyedArchive - same format, less restrictive unarchiving API
  3. NSArchiver - ancient format, starts with streamtyped around byte 2

We try all three in order, falling back through each if the previous one fails.

Blob Example

Here's what these blobs actually look like. This is a real attributedBody for the message "Yeah ":

╭─johnlarkin@Mac ~ ‹main›
╰─➤  xxd /tmp/attributed_body_sample.bin | head -8
00000000: 040b 7374 7265 616d 7479 7065 6481 e803  ..streamtyped...
00000010: 8401 4084 8484 124e 5341 7474 7269 6275  ..@....NSAttribu
00000020: 7465 6453 7472 696e 6700 8484 084e 534f  tedString....NSO
00000030: 626a 6563 7400 8592 8484 8408 4e53 5374  bject.......NSSt
00000040: 7269 6e67 0194 8401 2b05 5965 6168 2086  ring....+.Yeah .
00000050: 8402 6949 0105 9284 8484 0c4e 5344 6963  ..iI.......NSDic
00000060: 7469 6f6e 6172 7900 8484 084e 534f 626a  tionary....NSObj
00000070: 6563 7400 8596 9284 8484 084e 5353 7472  ect........NSStr

So a 5-char message takes ~180B on disk. You can see the structure: streamtyped magic at byte 2, Objective-C class hierarchy (NSAttributedString -> NSObject -> NSString in thie middle there), and then the actual Yeah text. A longer message is going to be somewhat optimized for storage given the overhead is mostly fixed.

Objective-C FFI from Rust

This was part of the headache. We need Apple's own runtime to decode these blobs, so we use direct objC FFI from Rust via the objc crate. This links against both the Foundation and AppKit frameworks. The unarchiver requires an explicit allowlist of the classes that we're permitting it to deserialize for safety. We allow most of the basics - NSString, NSDictionary, etc, plus some optional AppKit ones that we need for the new rich text message features.

Here you can see the staged approach of trying all three. A very interesting note, and I'm not sure why, but all of my messages hit the streamtyped format. Perhaps some legacy / attribute of me migrating across numerous MacOS devices over the years.

/// Internal unsafe implementation of attributedBody decoding
///
/// Tries multiple archiver formats:
/// 1. NSKeyedUnarchiver modern secure API (for "bplist00" format)
/// 2. NSKeyedUnarchiver legacy API (fallback)
/// 3. NSUnarchiver (for old NSArchiver "streamtyped" format)
unsafe fn try_decode_attributed_body(bytes: &[u8]) -> Option<String> {
    // Create NSData from bytes
    let data = NSData::from_vec(bytes.to_vec());

    // Try NSKeyedUnarchiver first (for modern archives starting with "bplist00")
    let keyed_unarchiver_class: *const Class = class!(NSKeyedUnarchiver);
    let attr_string_class: *const Class = class!(NSAttributedString);

    // Try modern secure API first (requires allowed classes set)
    if let Some(text) = unsafe { try_modern_unarchive(keyed_unarchiver_class, attr_string_class, &data) } {
        debug!("NSKeyedUnarchiver modern API succeeded");
        return Some(text);
    }

    // Try legacy NSKeyedUnarchiver API
    if let Some(text) = unsafe { try_legacy_keyed_unarchive(keyed_unarchiver_class, &data) } {
        debug!("NSKeyedUnarchiver legacy API succeeded");
        return Some(text);
    }

    // Try NSUnarchiver for old NSArchiver format (starts with "streamtyped")
    let unarchiver_class: *const Class = class!(NSUnarchiver);
    if let Some(text) = unsafe { try_nsunarchiver(unarchiver_class, &data) } {
        debug!("NSUnarchiver succeeded");
        return Some(text);
    }

    None
}

Once any unarchiver succeeds, we extract the plain text through another chain of null-checked ObjC calls. This fn is referenced throughout all of them.

unsafe fn extract_string_from_nsattributed(attr_obj: *mut Object) -> Option<String> {
    if attr_obj.is_null() { return None; }

    // [attributedString string] -> NSString
    let nsstring: *mut Object = msg_send![attr_obj, string];
    if nsstring.is_null() { return None; }

    // [nsstring UTF8String] -> *const c_char
    let c_str_ptr: *const c_char = msg_send![nsstring, UTF8String];
    if c_str_ptr.is_null() { return None; }

    // C string -> Rust String
    match unsafe { CStr::from_ptr(c_str_ptr) }.to_str() {
        Ok(s) => Some(s.to_owned()),
        Err(e) => {
            debug!(error = ?e, "Failed to convert C string to Rust String");
            None
        }
    }
}

And if you're a good Rust eng, you'll see that this is all unsafe. That should be pretty obvious but we're calliing into Obj-C runtime through raw FFI. It's all unsafe because we have no idea about the attr_obj pointer that we're referencing. Also the return type we don't actually know what it is, so we have to type it ourselves.

Snapshot Strategy

Ok changing gears to talk about my snapshot strategy!

Why Not Query Live?

This is probably intuitive to some but I did not want the open chat.db and query it live. This mainly boiled down to:

  1. locking
  2. fears of data consistency
  3. safety
  4. performance

I'm not going to go into much more detail than that.

Safe Database Access

Instead, I snapshot. The process copies three files atomically:

~/Library/Messages/chat.db      -> ~/.be-right-back/data/raw/{timestamp}/chat_snapshot.db
~/Library/Messages/chat.db-wal  -> ~/.be-right-back/data/raw/{timestamp}/chat_snapshot.db-wal
~/Library/Messages/chat.db-shm  -> ~/.be-right-back/data/raw/{timestamp}/chat_snapshot.db-shm

Those -wal and -shm files are critical. This comes back to SQLite usage. sqlite uses WAL for crash recovery and concurrency access. Copying just the .db file withohut those would mean losing recently committed transactions.

I also snag the AddressBook. Part of that is because I wanted to be able to resolve the phone numbers and email addresses to display names. You'll see that we join on this data at some points as well. This is non-fatal though, if you don't have your addressbook locally or its empty, that's fine. We just fallback to the phone numbers.

Multiple Snapshots

Currently.... I don't manage multiple. We only ever have one. I started with multiple to do cool diffing and multiple version control but the sheer space on disk was becoming an issue and I thought the vanilla user wouldn't even care.


With the data foundation in place - we can read iMessages, decode Apple's binary formats, and safely snapshot the database - the next step is turning all of that raw data into something useful. That's the ingestion pipeline.

Data Foundation - Understanding iMessage - Be Right Back