Maildir deduplication script in Python

Some months ago I wrote a tiny Python script that scans all folders and sub-folders of a Maildir, then removes duplicate mails.

You can give the script a list of email headers to ignore when it compares mails with each other. This is particularly helpful for finding duplicate mails that have the exact same content but different headers or metadata.

I created this script to clean up a Maildir folder I messed up after repeatedly moving tons of mails from a Lotus Notes database. The same mail imported twice contains a variable header based on the date and time the import was performed.

This variable header makes the mails look different from the script's point of view. That explains why I implemented the HEADERS_TO_IGNORE parameter, with its default set to X-MIMETrack.
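The idea of comparing mails while ignoring some headers can be sketched as follows. This is a minimal Python 3 illustration, not the script's actual code: the function name `mail_digest`, the choice of SHA-256, and the sample `X-MIMETrack` values are all assumptions made for the example.

```python
import email
import hashlib

# Assumed default, named after the HEADERS_TO_IGNORE parameter described above.
HEADERS_TO_IGNORE = ("X-MIMETrack",)


def mail_digest(raw_mail, headers_to_ignore=HEADERS_TO_IGNORE):
    """Return a hex digest of a raw mail with the ignored headers removed."""
    message = email.message_from_bytes(raw_mail)
    for name in headers_to_ignore:
        # Deleting a header is case-insensitive, removes every occurrence,
        # and does nothing if the header is absent.
        del message[name]
    return hashlib.sha256(message.as_bytes()).hexdigest()


# Illustrative mails: the X-MIMETrack value is made up for this example.
first_import = (
    b"Subject: Quarterly report\n"
    b"X-MIMETrack: Itemize by import on 01/02/2010 10:00:00\n"
    b"\n"
    b"See attachment.\n"
)
second_import = first_import.replace(b"10:00:00", b"18:30:00")
other_mail = first_import.replace(b"See attachment.", b"See you tomorrow.")

print(mail_digest(first_import) == mail_digest(second_import))  # same content
print(mail_digest(first_import) == mail_digest(other_mail))     # different body
```

Two mails with the same digest are candidates for deduplication. Passing an empty tuple as `headers_to_ignore` makes the two imports look different again, which is the situation the parameter was meant to avoid.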

The script is available on my GitHub repository. It was tested on Mac OS X 10.6 with Python 2.6.2 but should work on other systems and versions, as the code is really simple (and stupid).

16 thoughts on “Maildir deduplication script in Python”

  1. Great, I wanted that but never bothered looking for a script. Thanks, it works like a charm!

    6670 duplicate removed in a total of 20927 mails.

  2. That’s great, thanks for sharing! Just wondering why you didn’t dedupe based on the Message-ID header? I see you wrote “This is particularly helpful to find duplicate mails having the exact same content but different UIDs” – by UID are you referring to Message-ID, and if so, under what conditions would you expect to have two duplicate mails with differing Message-ID headers?

    Also, it looks like your code takes into account the body of the mail when computing the hash digest, but this won’t work in all cases. For example, when sending a mail to a list, some mail setups will not automatically ignore the copy returned from the list server, and this copy may have a standard list footer appended to the body. Another example would be if someone else sends a mail to two or more lists that you are subscribed to, and one (or more) of those lists appends a footer to the body which is unique to that particular list. As in the first example, you’d end up with copies of the same mail but with different endings to the mail body.

  3. @Adam: I really don’t know what I had in mind when I mentioned UIDs. I just remember that I wrote this blog post in a hurry, so it’s quite likely that I meant “headers” instead of “UIDs”. Thanks for pointing out this inconsistency! Post fixed! ;)

    Regarding mailing lists, you’re absolutely right. Now, making a bullet-proof script able to accurately detect this kind of duplicate requires more work. I may update this script in the future with such features if I encounter this exact same use case. For the moment I have no need for that. But if you have patches, I’ll be happy to merge them into my script! :)

  4. Pingback: Ultimate guide of Lotus Notes mail migration | Kev's blog

  5. Hey guys, any chance you’d know what is causing this:

    Processing 79207 mails in ~/Mail/GMail/archive ....Traceback (most recent call last):
      File "/Users/me/maildir-deduplicate.py", line 336, in <module>
        main()
      File "/Users/me/maildir-deduplicate.py", line 330, in main
        mail_count += collateFolderByHash(mails_by_hash, maildir, opts.message_id)
      File "/Users/me/maildir-deduplicate.py", line 187, in collateFolderByHash
        mail_hash = computeHashKey(message, use_message_id)
      File "/Users/me/maildir-deduplicate.py", line 161, in computeHashKey
        m = re.match("(\[\w[\w_-]+\w\] )(.+)", mail['Subject'])
      File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/re.py", line 137, in match
        return _compile(pattern, flags).match(string)
    TypeError: expected string or buffer
    

    in an offlineimap Maildir that I have also touched with mutt? The source is a GMail IMAP repository and I would touch it with various IMAP clients including re-alpine, which leaves traces of itself sometimes through messages.

    I also have noticed perhaps all of my duplicates are because one message will have a Content-Length: header and then the duplicate message (or original message!) will not. I don’t know which behavior is correct.

  6. TypeError is fixed by this pull request but Anonymous should note that some of his messages are missing the Subject header, which is not good.

    @Anonymous: the copy with the Content-Length header is probably more correct than the one without. However, the deduplication script ignores this header, so it will still spot the duplicates. Also, when removing duplicates via the --remove option, it will remove all but the longest version, so you will end up with the copy containing that header, assuming all other things are equal.

  7. @Adam: Thanks for this detailed analysis and for your code contribution! I merged your pull request into the main repository this afternoon. Thanks again Adam. I really appreciate your involvement. :)

  8. Hi!

    surely nice but I got:

    Traceback (most recent call last):
      File "/home/anderl/bin/maildir-deduplicate", line 337, in <module>
        main()
      File "/home/anderl/bin/maildir-deduplicate", line 333, in main
        duplicates, sets = findDuplicates(mails_by_hash, opts)
      File "/home/anderl/bin/maildir-deduplicate", line 212, in findDuplicates
        subject, count = re.subn('\s+', ' ', subject)
      File "/usr/lib/python2.7/re.py", line 162, in subn
        return _compile(pattern, flags).subn(repl, string, count)
    TypeError: expected string or buffer
    

    On line 212 I put

            try:
                subject, count = re.subn('\s+', ' ', subject)
                print "\nSubject: " + subject
            except:
                print "\nNo Subject"
    

    to fix it. But I am not sure whether this will be alright.

  9. I should add that upon trying --remove for a copy it issued

    Fatal Python error: Inconsistent interned string state.
    

    in the end. But duplicates seem to be gone :)
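
Both tracebacks in the comments above end with `TypeError: expected string or buffer`, which is consistent with the diagnosis in comment 6: a message without a Subject header yields `None` instead of a string. A minimal Python 3 sketch of one way to guard against that is shown here. It is not the fix merged into the script, and `normalized_subject` is a hypothetical helper name. An explicit `None` check is used instead of a bare `except:`, because a bare `except:` would also hide unrelated errors.

```python
import email
import re


def normalized_subject(message):
    """Collapse runs of whitespace in Subject; treat a missing header as empty."""
    subject = message["Subject"]  # None when the header is absent
    if subject is None:
        return ""
    return re.sub(r"\s+", " ", str(subject))


with_subject = email.message_from_string("Subject: Hello   \t world\n\nBody\n")
without_subject = email.message_from_string("From: a@example.com\n\nBody\n")

print(repr(normalized_subject(with_subject)))
print(repr(normalized_subject(without_subject)))
```

Whether an empty string is the right stand-in for a missing subject depends on how the rest of the script uses the value; it is only an assumption here.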
