Gnus, Isync, Dovecot, and Lucene searches

For someone who is completely reliant on email for both his professional and personal lives, I seem to have a hell of a time getting my email environment just the way I like it. I use Emacs, with the Gnus newsreader, caching IMAP email locally with Dovecot.

In the course of setting up better full-text email search within Gnus, I’ve switched to a slightly more complex email setup, and am blogging for posterity.

My email environment

  • Archlinux (as of September 2014)
  • The git version of Emacs: 24.4.50
  • The git version of Gnus: Ma Gnus 0.12
  • The git version of Isync: 1.1.2 (note the executable is called mbsync)
  • Dovecot: 2.2.13

My basic requirements:

  • I’m on a single-user computer, so would prefer not to create a bunch of system-level users just to read my email. Don’t put my mail in /home/vmail etc., as detailed in some Dovecot setups.
  • I have multiple email addresses.
  • A depressing number of these email addresses are Gmail.
  • I want all messages for all accounts stored under a single location in my own home directory, for ease of encryption, backup, etc.
  • I want to search like a boss, and I want to do it in Chinese.

The problem

A couple of years ago I set things up largely based on this blog post, which is offlineimap-specific, but fairly easy to tailor to mbsync. This is the bare-bones easiest way to use Dovecot: Dovecot doesn’t run as a server, and only wakes up when it is called by either mbsync or Gnus. Getting this working required having Dovecot installed on the system, and nothing else: zero configuration.

This worked great for ages, but only because I didn’t care about IMAP search. I did my searching using a Notmuch search index that existed in parallel to my IMAP installation. That started to prove annoying, however, because while Notmuch is an awesome search tool, it doesn’t integrate well with Gnus by default. Getting from Notmuch search results to the “real” messages in a Gnus summary buffer requires a hack, and it all just felt wrong.

Searching in Gnus is ideally done with the nnir meta-search engine: it searches messages from the server under point, based on the search engine you’ve configured for that server, and creates a native summary buffer – it feels just like regular Gnus, regardless of how the messages were collected.

There are two problems with using nnir to search IMAP: 1) native IMAP search is dog slow, and 2) nnir’s query syntax for IMAP is oddly limited. This blog post will address problem one; problem two might have to wait.

I started looking into Dovecot-based full-text search indexes, and Lucene presented itself as the simplest solution, though I don’t need Java so it was the clucene C++ version that made the most sense.

The problem is that full text search (FTS) with Lucene seems to require a running Dovecot daemon (if I’m wrong please email eric at this domain name and tell me!). So all of a sudden we’re going from our beautiful bare-bones Dovecot setup, to requiring an actual running daemon, with configuration and everything.

So be it! But the goal is to minimize configuration, because really I was perfectly happy with the original no-configuration arrangement, I just want to add FTS.

Dovecot

Dovecot setup instructions often seem to assume one email account per system user, and multiple users per machine. Many of us are in the opposite situation, however: a single-user computer, with multiple email accounts. Dovecot has the concept of virtual users, which is fairly well suited to this situation. The following is the basic /etc/dovecot/dovecot.conf file, as simple as possible:

protocols = imap

listen = *, ::
log_path = /var/log/dovecot.log
info_log_path = /var/log/dovecot-info.log

ssl = no
disable_plaintext_auth = no

auth_verbose = yes
auth_mechanisms = plain

passdb {
       driver = passwd-file
       args = /etc/dovecot/passwd
}

userdb {
       driver = static
       args = uid=eric gid=users home=/home/eric/.mail/%d/%n
       default_fields = mail=maildir:/home/eric/.mail/%d/%n/mail
}

mail_plugins = $mail_plugins fts fts_lucene

plugin {
       fts = lucene
       fts_lucene = whitespace_chars=@.
       fts_autoindex = yes
}

The upshot of all this is, I’m creating only virtual users, no system users. I did not create a dovecot user, nor a dovenull user, nor a vmail user, or anything else the HOWTOs tell you to do. I’m the only user on my system, and I can do without those. Dovecot is flexible.

The “passdb” section specifies a file where I’ve stored user information: i.e., the username and (local-use only) password for each of my email addresses. The contents of my /etc/dovecot/passwd look like:

eric@ericabrahamsen.net:{PLAIN}passwurd
eric@paper-republic.org:{PLAIN}prasswowrdy2
info@paper-republic.org:{PLAIN}plasswsword
[etc]

You might like to use a better authentication mechanism than PLAIN, see this page for options. If you use a different mechanism, you might need to change the auth_mechanisms entry in dovecot.conf.
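
If the worry is only that the local passwords sit in clear text in /etc/dovecot/passwd, one option is to store a salted hash instead. The following is a minimal Python sketch that builds a passwd-file line using Dovecot's {SSHA256} scheme, on the assumption that the scheme is base64(SHA256(password + salt) + salt). The function name ssha256_entry is made up for this illustration. Dovecot's own "doveadm pw -s SSHA256" is the authoritative way to generate these, so treat this as an explanation of the format rather than a replacement for that tool.

```python
import base64
import hashlib
import os


def ssha256_entry(user, password, salt=None):
    """Build one passwd-file line of the form user:{SSHA256}<base64>.

    Assumption: {SSHA256} is base64(sha256(password + salt) + salt).
    A random 4-byte salt is used unless one is passed in (handy for testing).
    """
    if salt is None:
        salt = os.urandom(4)
    digest = hashlib.sha256(password.encode("utf-8") + salt).digest()
    encoded = base64.b64encode(digest + salt).decode("ascii")
    return f"{user}:{{SSHA256}}{encoded}"


if __name__ == "__main__":
    print(ssha256_entry("eric@ericabrahamsen.net", "passwurd"))
```

Because the client still sends the password itself and Dovecot does the hashing, auth_mechanisms = plain should keep working with a line like this. mbsync and Gnus keep using the plain password on their side.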

Then the ‘userdb’ section says where each of these accounts keeps its mail, and the ownership of those files. The args and default_fields stuff is opaque to me, but specifying the values this way works.

Because all of the accounts belong to my user, the uid and gid correspond to my system user. The home directories for each account are under the ~/.mail folder, in directories that look like domainname/user (specified by the “%d/%n” escapes). The home directories hold more than just the mail: they also hold Dovecot’s index files, the uidvalidity stuff, and the Lucene indexes – the whole point of this exercise to begin with.
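
To make the “%d/%n” escapes concrete, here is a small Python sketch of how a login name maps to a home directory under the userdb settings above. It only mimics the two escapes used in this post (%n for the part before the @, %d for the part after it). The function name dovecot_home is made up for the illustration, and this is not how Dovecot itself implements variable expansion.

```python
def dovecot_home(user, template="/home/eric/.mail/%d/%n"):
    """Expand the %d (domain) and %n (local part) escapes for one login name."""
    local_part, _, domain = user.partition("@")
    return template.replace("%d", domain).replace("%n", local_part)


if __name__ == "__main__":
    for user in ("eric@ericabrahamsen.net", "info@paper-republic.org"):
        home = dovecot_home(user)
        # The maildir itself lives in a "mail" subdirectory of the home.
        print(user, "->", home, "(mail in", home + "/mail)")
```

So eric@ericabrahamsen.net ends up with /home/eric/.mail/ericabrahamsen.net/eric as its home, with the maildir in the mail subdirectory and the indexes alongside it.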

On Archlinux, start the server (and set it up for automatic restart) with:

$ sudo systemctl start dovecot
$ sudo systemctl enable dovecot

Adjust for your distribution.

Isync

Dovecot is done, so now we move to the ~/.mbsyncrc. Here’s the account configuration for one address:

IMAPAccount ea
Host imap.gmail.com
User eric@ericabrahamsen.net
PassCmd "/usr/bin/pass email/ea" # retrieves the remote password
UseIMAPS yes
CertificateFile /etc/ssl/certs/ca-certificates.crt

IMAPStore ea-remote
Account ea

IMAPAccount ea-dovecot
RequireSSL no
Host localhost
User eric@ericabrahamsen.net
Pass passwurd  # local password I don't care much about
UseIMAPS no
UseTLSV1 no

IMAPStore ea-local
Account ea-dovecot

Channel ea
Master :ea-remote:
Slave :ea-local:
Patterns * !"[Gmail]/All Mail"
Create Both

We’re good to go! That’s enough to run “mbsync ea” in the terminal, and get a complete sync of messages from the server.

Note that, because the Dovecot config file activates the FTS plugin and sets fts_autoindex to “yes”, the simple act of syncing mail with the server will also create a local full text search index of mail. You don’t have to do anything else, or worry about keeping it up to date.
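
A word on the Patterns line in the channel definition above: as I understand the mbsync documentation, “*” matches anything, “%” matches anything up to the next hierarchy delimiter, a leading “!” excludes, and later matches take precedence. The Python sketch below models just those rules to show which mailboxes that line selects. The function names are made up, “/” is assumed as the delimiter, and any special handling mbsync gives INBOX is ignored, so this is an approximation rather than mbsync's actual matcher.

```python
import re


def _pattern_to_regex(pattern, delim="/"):
    """Translate one mbsync-style pattern into an anchored regular expression."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")  # anything, including the delimiter
        elif ch == "%":
            parts.append("[^" + re.escape(delim) + "]*")  # stop at the delimiter
        else:
            parts.append(re.escape(ch))  # literal character, e.g. "[" in [Gmail]
    return re.compile("".join(parts) + r"\Z", re.DOTALL)


def is_synced(mailbox, patterns):
    """Return True if the last pattern matching `mailbox` is not an exclusion."""
    synced = False
    for pattern in patterns:
        exclude = pattern.startswith("!")
        if exclude:
            pattern = pattern[1:]
        if _pattern_to_regex(pattern).match(mailbox):
            synced = not exclude
    return synced


if __name__ == "__main__":
    # The quotes in the config file are only quoting for the config parser.
    patterns = ["*", "![Gmail]/All Mail"]
    for box in ("INBOX", "[Gmail]/Sent Mail", "[Gmail]/All Mail"):
        print(box, "->", "sync" if is_synced(box, patterns) else "skip")
```

In other words, everything is synced except Gmail's “All Mail” folder, which would otherwise duplicate every message.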

Gnus

Now we configure Gnus similarly:

(nnimap "EA"
  (nnimap-stream network)
  (nnimap-address "localhost")
  (nnimap-authenticator login)
  (nnimap-user "eric@ericabrahamsen.net"))

You’ll probably have other server parameters in there, but that’s enough to get going. The first time you sync in Gnus, it will ask you for the local password (“passwurd”, in this case), and prompt to save it in ~/.authinfo. Because I don’t care much about this password, I leave it saved plain in that file. You could choose to GPG encrypt it. Create one server entry for each of your addresses.

Searching

Now we search in Gnus using nnir: “G G” on a group name, or on several marked groups, or on a topic heading. Or just “G” on a server name in the Server buffer.

Actually, this is where things fall down just a little bit. Indexing is painless and searches are fast, but there are three remaining problems:

The first is that nnir search syntax for imap searches is weird. By default it searches on only one field (which you choose with nnir-imap-default-search-key), or, with a prefix arg, allows you to select a different field to search on. If you want to search multiple fields, you have to fall back to raw imap search syntax, which is cumbersome. The whole thing is awkward, but will eventually get addressed.

A potentially bigger issue is encoding in searches. The Lucene index assumes utf-8 encoding for all your emails, and in a perfect world that would be enough. Many emails come in other encodings, however, and/or are base-64 munged. I and others have found that Lucene doesn’t index such messages properly, so some encoded strings in message headers and bodies aren’t located by searches. Some people run filters in the indexing process so that the messages are converted to utf-8 before they’re indexed. So far I’m just ignoring this problem – I’ve been bitten by it very rarely.
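
For a sense of what such a filter has to do, here is a minimal Python sketch using only the standard library's email package: it undoes the transfer encoding (base64, quoted-printable) and the original charset, and hands back the subject and plain-text body as utf-8. The function name to_utf8_text is made up, and how a filter like this would be hooked into Dovecot's indexing is not shown and is not something I have set up.

```python
from email import message_from_bytes, policy


def to_utf8_text(raw):
    """Return the subject and plain-text body of a raw message as utf-8 bytes."""
    msg = message_from_bytes(raw, policy=policy.default)
    # With policy.default, header values come back with encoded words decoded.
    subject = msg["Subject"] or ""
    # get_body() picks the text/plain part; get_content() decodes both the
    # transfer encoding and the declared charset into a Python string.
    body = msg.get_body(preferencelist=("plain",))
    text = body.get_content() if body is not None else ""
    return f"Subject: {subject}\n\n{text}".encode("utf-8")


if __name__ == "__main__":
    import base64

    raw = (
        b"Subject: =?gb2312?B?" + base64.b64encode("测试".encode("gb2312")) + b"?=\r\n"
        b"MIME-Version: 1.0\r\n"
        b"Content-Type: text/plain; charset=gb2312\r\n"
        b"Content-Transfer-Encoding: base64\r\n"
        b"\r\n" + base64.b64encode("中文搜索".encode("gb2312")) + b"\r\n"
    )
    print(to_utf8_text(raw).decode("utf-8"))
```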

The third problem is a combination of the first two: if you want to search for non-ascii strings via an IMAP server’s SEARCH command, there are two ways to enter the string. Most servers (including Dovecot, but possibly not Gmail?) let you do it by enclosing the string in double quotes (see RFC-2060), which you simply enter as part of the nnir search.

Servers that don’t support this can search for non-ascii strings using a fairly complicated system of feeding literal search strings to the server, along with the number of bytes in the string. Gnus doesn’t currently support this, though I have a patch that partially addresses it.
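
To show what “along with the number of bytes” means in practice, here is a small Python sketch that builds the two pieces of such a SEARCH command. It follows the literal syntax from the IMAP RFC: a byte count in braces ends the first line, and the raw bytes follow once the server has answered with a “+” continuation. The function name and the tag are made up, and this is not the code from my Gnus patch.

```python
def search_with_literal(tag, key, text):
    """Return (first_line, continuation) for an IMAP SEARCH using a literal.

    The count in braces is the number of utf-8 *bytes*, not characters.
    """
    payload = text.encode("utf-8")
    first_line = f"{tag} SEARCH CHARSET UTF-8 {key} {{{len(payload)}}}\r\n".encode("ascii")
    # A client sends first_line, waits for the server's "+" continuation
    # response, and only then sends the literal bytes and the final CRLF.
    return first_line, payload + b"\r\n"


if __name__ == "__main__":
    first, rest = search_with_literal("A001", "BODY", "中文")
    print(first)  # b'A001 SEARCH CHARSET UTF-8 BODY {6}\r\n'
    print(rest)
```

The two characters of “中文” take three bytes each in utf-8, which is why the count is 6 rather than 2.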

Obviously, searching imap via nnir isn’t quite there yet. Over the next few months, I’m hoping it will make a little progress…