The mbox Format

How Email Clients Store Mail on Your Hard Disk

One of the most common formats for storing mail messages is mbox, which is short for MailBOX. An mbox is a single file containing zero or more mail messages.

The mbox Format

If we use the mbox format to store emails, we put all of them in one file. This creates one long text file containing one email message after the other. (Internet email traditionally exists only as 7-bit ASCII text; everything else, such as attachments, is encoded.) How do we know where one message ends and another starts?

Fortunately, every message in an mbox file has a From-line at its very beginning. This line starts with "From " (the word "From" followed by a space character) and is also called a "From_" line. If "From " at the beginning of a line is preceded by an empty line or is at the top of the file, we have found the beginning of a message.

So what we look for when parsing an mbox file is, essentially, an empty line followed by "From ".

As a regular expression, we can write this as "\n\nFrom .*\n". Only the very first message is different. It starts merely with "From " at the beginning of a line ("^From .*\n").
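The parsing rule above can be sketched in a few lines of Python. This is a minimal illustration, not a complete mbox reader: the function name split_mbox and the sample addresses are made up for this example, and the whole file is assumed to fit in memory as one string with "\n" line endings.

```python
import re

# A "From " line counts as a separator only at the very top of the file
# or directly after an empty line (two consecutive newlines).
FROM_LINE = re.compile(r"(?:\A|(?<=\n\n))From .*\n")


def split_mbox(text):
    """Split mbox text into messages, each starting with its "From " line.

    Hypothetical helper for illustration only.
    """
    starts = [match.start() for match in FROM_LINE.finditer(text)]
    ends = starts[1:] + [len(text)]
    return [text[start:end] for start, end in zip(starts, ends)]


sample = (
    "From alice@example.com Mon Jan  1 00:00:00 2024\n"
    "Subject: one\n"
    "\n"
    "Hello\n"
    "\n"
    "From bob@example.com Tue Jan  2 00:00:00 2024\n"
    "Subject: two\n"
    "\n"
    "Bye\n"
)

messages = split_mbox(sample)
```

Here, messages holds two strings, one per email. For real mailboxes, Python's standard library also ships a mailbox module with an mbox class that handles the details this sketch ignores.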

"From " In the Body

What if exactly the sequence above appears in the body of an email message? What if the following is part of an email?

  • ...I send you the most recent report.
  • From this report, you need not...

Here, we have an empty line followed by "From " at the beginning of a line. If this appears in an mbox file, it unmistakably marks the beginning of a new message. At least that's what the parser thinks, and it is why both the email client and we would be quite confused by an email message that contains neither sender nor recipient but begins with "From this report."

To avoid such confusion, we need to make sure "From " never appears at the beginning of a line following an empty line in the body of an email.

Whenever we add a new message to an mbox file, we look for such sequences in the body and replace "From " with ">From ". The parser then no longer mistakes the line for the start of a message. The example above now looks like this and no longer triggers the parser:

  • ...I send you the most recent report.
  • >From this report, you need not...

This is why you may sometimes find ">From " in an email where you'd expect a mere "From ".
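The quoting step can be sketched in Python as well. The function name quote_from_lines is invented for this illustration, and the sketch follows only the rule described here: it changes "From " at the start of the body or directly after an empty line. Real mail software may apply the quoting more broadly.

```python
import re

# "From " at the start of the body or directly after an empty line.
BODY_FROM = re.compile(r"(?:\A|(?<=\n\n))From ")


def quote_from_lines(body):
    """Prefix risky "From " lines in a message body with ">".

    Hypothetical helper for illustration only.
    """
    return BODY_FROM.sub(">From ", body)


body = (
    "...I send you the most recent report.\n"
    "\n"
    "From this report, you need not...\n"
)

quoted = quote_from_lines(body)
```

After the call, the third line of quoted begins with ">From this report", while a "From " that does not follow an empty line is left untouched.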