Sunday, August 30, 2015

Bulk-download from GMAIL

I collected over a decade of work-related emails. Now it's time to run some analytics on them. A long time ago I convinced the IT department at work to automatically forward a copy of every email message to my GMAIL account. In return, I promised I would never complain about problems with the Exchange server.

The following method is fairly well documented on the Web and works on pretty much every UNIX-based system, as well as on Windows using Cygwin. The installation steps are different for each operating system. Here, I'm describing the steps for OS X. While there's really nothing new to it, I hope an updated version will be helpful.

First, I had to install the 'fetchmail' client to download the messages. The program is not in the Homebrew distribution. I got it from http://sourceforge.net/projects/fetchmail/files/branch_6.3/fetchmail-6.3.26.tar.xz.

Fetchmail is built on the OpenSSL library, which is no longer part of OS X. I ignored all the warnings and installed it with brew install openssl. Then enter the fetchmail directory and run

./configure --with-ssl=/usr/local/Cellar/openssl/1.0.1j/
make
sudo make install

Verify that this is the actual location of the openssl installation; the version directory (1.0.1j here) depends on the version Homebrew installed.
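A minimal sketch of such a check, assuming the Homebrew layout used above (/usr/local/Cellar/openssl/<version>) and assuming that a usable installation contains the header include/openssl/ssl.h. The function name find_openssl_prefixes is made up for this illustration.

```python
import os


def find_openssl_prefixes(cellar="/usr/local/Cellar/openssl"):
    """Return the version directories under `cellar` that contain the OpenSSL headers."""
    if not os.path.isdir(cellar):
        return []  # Homebrew keg not found at the assumed location
    prefixes = []
    for version in sorted(os.listdir(cellar)):
        prefix = os.path.join(cellar, version)
        # Assumption: a complete install ships include/openssl/ssl.h
        if os.path.isfile(os.path.join(prefix, "include", "openssl", "ssl.h")):
            prefixes.append(prefix)
    return prefixes


if __name__ == "__main__":
    for prefix in find_openssl_prefixes():
        print(prefix)
```

Any directory it prints is a candidate for the --with-ssl argument.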

The fetchmail program pulls messages from the source mail server and expects the local mail system to take care of them. Fortunately, procmail is already installed. The following shows the configuration file for fetchmail. I'm not saving it as the default configuration file (~/.fetchmailrc), but rather specify the file on the command line.

poll imap.gmail.com protocol IMAP 
        user "johndoe@gmail.com" is john here
        password 'J0hn$p@ssword'
        folder 'MYFOLDER'
        nokeep
        flush
        fetchsizelimit 0
        ssl
        mda '/usr/bin/procmail -d %T'

Here, johndoe@gmail.com and J0hn$p@ssword are the GMAIL username and password, john is the username on the local system, and MYFOLDER is the GMAIL folder to download.
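Since this file holds the password in clear text, it should be readable by its owner only. As far as I know, fetchmail also refuses to use a configuration file with looser permissions. The following is a small sketch, not part of the original setup, that writes such a file with mode 0600 and checks the result. The helper names write_private_file and is_private are hypothetical.

```python
import os
import stat


def write_private_file(path, text):
    """Write `text` to `path` so that only the owner can read or write it (mode 0600)."""
    # The mode passed to os.open only applies when the file is created,
    # so chmod afterwards covers the case where the file already existed.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(text)
    os.chmod(path, 0o600)


def is_private(path):
    """Return True if neither group nor others have any permission on `path`."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode & 0o077 == 0
```

For a file that already exists, chmod 600 myfetchmailrc on the command line does the same job.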

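For a first test, you may want to download only a limited number of messages and leave out the 'flush' option. As far as I can tell from the fetchmail manual, the option that caps the number of messages per poll is 'fetchlimit', while 'fetchsizelimit' only controls how many message sizes are requested in one transaction (0 requests them all at once). Even without a limit in fetchmail, GMAIL may still have set a limit on its side. Check your GMAIL settings. Run the command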

fetchmail -vf myfetchmailrc

and check if everything connects properly. My objective is to download all the messages and clear up some space. Therefore, I use the 'flush' option. However, it doesn't look like GMAIL actually removes messages if they have additional labels, i.e. belong to another folder. Also, fetchmail will only download messages that are flagged as "unread". The following steps get the data ready to download:

  1. Move all the desired messages into a folder (i.e. label).
  2. Select all messages and remove any other labels.
  3. Select all messages and mark as "unread".

This should get them ready for download. It took me multiple attempts to clean up the labels. I ended up moving the messages that were not automatically removed from GMAIL to a new folder and then repeated the above steps. This produces duplicates, but that's not much of a concern since each message has a unique identifier.
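To illustrate how the duplicates can be spotted later, here is a minimal sketch using Python's standard mailbox module. It assumes that the unique identifier is the Message-ID header, which is my assumption and may be missing on some messages. The function name find_duplicate_keys is made up for this example.

```python
import mailbox


def find_duplicate_keys(box):
    """Return the keys of messages whose Message-ID already appeared earlier in `box`.

    `box` can be any mailbox.Mailbox instance, e.g. mailbox.mbox or mailbox.MH.
    """
    seen = set()
    duplicates = []
    for key, message in box.iteritems():
        message_id = message.get("Message-ID")
        if message_id is None:
            continue  # no identifier: cannot tell whether it is a duplicate, so keep it
        message_id = str(message_id).strip()
        if message_id in seen:
            duplicates.append(key)
        else:
            seen.add(message_id)
    return duplicates
```

The function only reports the keys and leaves the decision to delete anything to the caller.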

Once successfully "fetched", procmail will store them in a flat mbox file. There are numerous clients and libraries to read mbox files. For my purposes, I prefer having each message in a separate text file. The "good old" MH (Mail Handler) commands come in handy here. On OS X I had to download the source code from http://www.nongnu.org/nmh/ and install with

./configure
make
sudo make install

The default install directory is /usr/local/nmh/bin. I left the commands there because they will be part of some other scripts. Before one can extract the messages from the system mailbox, MH needs to be configured by running /usr/local/nmh/bin/install-mh, which creates the configuration file ~/.mh_profile. Alternatively, one could just create the file by hand with the following line:

Path: /Volumes/DATA/GmailProject/Mail

Here, "Path" specifies the directory for the extracted messages. The "inc" command will create a subdirectory "inbox". Run

/usr/local/nmh/bin/inc

to "incorporate" the newly downloaded messages.
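As a first sanity check on the result, the extracted folder can be read with Python's standard mailbox.MH class. The sketch below counts messages per sender address. The function name count_senders is hypothetical, and it assumes the folder contains the .mh_sequences file that MH tools normally maintain.

```python
import mailbox
from collections import Counter
from email.utils import parseaddr


def count_senders(folder_path):
    """Count the messages per sender address in the MH folder at `folder_path`."""
    counts = Counter()
    folder = mailbox.MH(folder_path, create=False)
    try:
        for message in folder:
            # parseaddr splits 'Jane Doe <jane@example.com>' into name and address
            _, address = parseaddr(str(message.get("From", "")))
            counts[address.lower() or "(unknown)"] += 1
    finally:
        folder.close()
    return counts
```

With the profile above, count_senders("/Volumes/DATA/GmailProject/Mail/inbox").most_common(10) would list the ten most frequent senders.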

Downloading tens of thousands of messages will take a while. I suggest running this on a computer that can be left alone for a couple of hours. Now I'm ready to get some insight from years of emailing...
