Some months ago I wrote a tiny Python script that scans all folders and sub-folders of a Maildir, then removes duplicate mails.
You can give the script a list of email headers to ignore while it compares mails with each other. This is particularly helpful for finding duplicate mails that have the exact same content but different headers/metadata.
I created this script to clean up a Maildir folder I messed up after repeatedly moving tons of mails from a Lotus Notes database. The same mail imported twice contains a variable header based on the date and time the import was performed.
This variable header makes the mails look different from the script's point of view. That explains why I implemented the HEADERS_TO_IGNORE parameter, with the default set to X-MIMETrack.
The script is available on my GitHub repository. It was tested on Mac OS X 10.6 with Python 2.6.2 but should work on other systems and versions, as the code is really simple (and stupid).
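To illustrate the idea, here is a minimal sketch of hashing a mail while skipping ignored headers. It is not the script's actual code: the function name `mail_digest` is hypothetical, the choice of SHA-1 is an assumption, and it targets Python 3 rather than the Python 2.6 the script was tested with.

```python
import email
import hashlib

# Same name and default as the parameter described in the post.
HEADERS_TO_IGNORE = ["X-MIMETrack"]


def mail_digest(raw_mail, headers_to_ignore=HEADERS_TO_IGNORE):
    """Return a hex digest of a mail computed without the ignored headers.

    Hypothetical helper for illustration only.
    """
    message = email.message_from_string(raw_mail)
    for name in headers_to_ignore:
        # Deletes every occurrence of the header (case-insensitive match);
        # does nothing if the header is absent.
        del message[name]
    # Hash what is left: the remaining headers plus the body.
    remaining = message.as_string().encode("utf-8")
    return hashlib.sha1(remaining).hexdigest()
```

Two copies of a mail that differ only in their X-MIMETrack value then produce the same digest, while any difference in the body or in another header produces a different one.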



Great, I wanted that but never bothered looking for a script. Thanks, it works like a charm!
6670 duplicates removed out of a total of 20927 mails.
That’s great, thanks for sharing! Just wondering why you didn’t dedupe based on the Message-ID header? I see you wrote “This is particularly helpful to find duplicate mails having the exact same content but different UIDs” – by UID are you referring to Message-ID, and if so, under what conditions would you expect to have two duplicate mails with differing Message-ID headers?
Also, it looks like your code takes into account the body of the mail when computing the hash digest, but this won’t work in all cases. For example, when sending a mail to a list, some mail setups will not automatically ignore the copy returned from the list server, and this copy may have a standard list footer appended to the body. Another example would be if someone else sends a mail to two or more lists that you are subscribed to, and one (or more) of those lists appends a footer to the body which is unique to that particular list. As in the first example, you’d end up with copies of the same mail but with different endings to the mail body.
@Adam: I really don’t know what I had in mind when I mentioned UIDs. I just remember that I wrote this blog post in a hurry, so it’s quite likely that I meant “headers” instead of “UIDs”. Thanks for pointing out this inconsistency! Post fixed!
Regarding mailing lists, you’re absolutely right. Now, making a bullet-proof script able to accurately detect this kind of duplicate requires more work. I may update this script in the future with such features if I encounter this exact same use case. For the moment I have no need for that. But if you have patches, I’ll be happy to merge them into my script!
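As a rough sketch of the Message-ID approach suggested above (not the script's own code), mails can be grouped by that header alone, so a footer appended by a list server no longer hides a duplicate. The function name `group_by_message_id` is hypothetical, mails are assumed to be available as raw strings, and the example targets Python 3.

```python
import email
from collections import defaultdict


def group_by_message_id(raw_mails):
    """Map each Message-ID seen more than once to the indexes of its mails.

    Hypothetical helper: mails without a Message-ID header are skipped,
    since there is nothing to compare them on.
    """
    groups = defaultdict(list)
    for index, raw_mail in enumerate(raw_mails):
        message = email.message_from_string(raw_mail)
        message_id = message.get("Message-ID")  # None when the header is absent
        if message_id is None:
            continue
        groups[message_id.strip()].append(index)
    # Keep only the IDs that occur at least twice, i.e. the duplicate sets.
    return {mid: indexes for mid, indexes in groups.items() if len(indexes) > 1}
```

The trade-off is that this trusts the Message-ID completely: two genuinely different mails that share an ID would be treated as duplicates.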
Well, it took half a year or so, but here you go
https://github.com/kdeldycke/scripts/pull/1
Thanks for accepting those! Here’s some more …
https://github.com/kdeldycke/scripts/pull/2
If I submit any more in the future, most likely I won’t bother commenting here again – other readers, see GitHub for the latest.
Awesome, Adam! Thanks for your contributions! That’s exactly why I love open source!
Hey guys, any chance you’d know what is causing this:

    Processing 79207 mails in ~/Mail/GMail/archive ....
    Traceback (most recent call last):
      File "/Users/me/maildir-deduplicate.py", line 336, in <module>
        main()
      File "/Users/me/maildir-deduplicate.py", line 330, in main
        mail_count += collateFolderByHash(mails_by_hash, maildir, opts.message_id)
      File "/Users/me/maildir-deduplicate.py", line 187, in collateFolderByHash
        mail_hash = computeHashKey(message, use_message_id)
      File "/Users/me/maildir-deduplicate.py", line 161, in computeHashKey
        m = re.match("(\[\w[\w_-]+\w\] )(.+)", mail['Subject'])
      File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/re.py", line 137, in match
        return _compile(pattern, flags).match(string)
    TypeError: expected string or buffer

in an offlineimap Maildir that I have also touched with mutt? The source is a GMail IMAP repository and I would touch it with various IMAP clients including re-alpine, which leaves traces of itself sometimes through messages.
I have also noticed that perhaps all of my duplicates arise because one message has a Content-Length: header and the duplicate message (or original message!) does not. I don’t know which behavior is correct.
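For context on the traceback above: Python's `email` module returns `None` for a header that is absent, and `re.match` raises `TypeError` when handed `None`. The sketch below reuses the regular expression shown in the traceback and guards against a missing Subject. The function name `subject_without_list_tag` and the choice to return an empty string are assumptions for illustration, not the fix that was applied to the script; the code targets Python 3.

```python
import email
import re

# Pattern copied from the traceback: a "[list-name] " prefix, then the rest.
LIST_TAG = re.compile(r"(\[\w[\w_-]+\w\] )(.+)")


def subject_without_list_tag(message):
    """Return the Subject with a leading "[list] " tag stripped.

    Hypothetical helper: returns "" when the mail has no Subject header.
    """
    subject = message["Subject"]  # None if the header is missing
    if subject is None:
        return ""
    match = LIST_TAG.match(subject)
    return match.group(2) if match else subject


tagged = email.message_from_string("Subject: [my-list] Hello\n\nBody\n")
print(subject_without_list_tag(tagged))
```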
@Anonymous: I have no time to look at your issue now. Can you add this error as a bug report in maildir-deduplicate.py’s GitHub repository?
TypeError is fixed by this pull request, but Anonymous should note that some of his messages are missing the Subject header, which is not good.
@Anonymous: the copy with the Content-Length header is probably more correct than the one without. However, the deduplication script ignores this header, so it will still spot the duplicates. Also, when removing duplicates via the --remove option, it will remove all but the longest version, so you will end up with the copy containing that header, assuming all other things are equal.
@Adam: Thanks for this detailed analysis and for your code contribution! I merged your pull request into the main repository this afternoon. Thanks again, Adam. I really appreciate your involvement.
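The "remove all but the longest version" rule described above can be sketched as follows. This is an illustration under assumptions, not the script's code: the function name `copies_to_remove` is hypothetical, and length is measured here in characters of raw mail text, which may differ from how the script measures it.

```python
def copies_to_remove(duplicates):
    """Given {path: raw mail text} for one set of duplicates, return the
    paths to delete so that only the longest copy is kept.

    Hypothetical helper: on a tie, the first longest copy encountered
    is the one that is kept.
    """
    keeper = max(duplicates, key=lambda path: len(duplicates[path]))
    return sorted(path for path in duplicates if path != keeper)
```

With one copy carrying an extra Content-Length header and all else equal, that copy is the longer one, so it is the one that survives.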
You’re very welcome
Hi!
surely nice but I got:

    Traceback (most recent call last):
      File "/home/anderl/bin/maildir-deduplicate", line 337, in <module>
        main()
      File "/home/anderl/bin/maildir-deduplicate", line 333, in main
        duplicates, sets = findDuplicates(mails_by_hash, opts)
      File "/home/anderl/bin/maildir-deduplicate", line 212, in findDuplicates
        subject, count = re.subn('\s+', ' ', subject)
      File "/usr/lib/python2.7/re.py", line 162, in subn
        return _compile(pattern, flags).subn(repl, string, count)
    TypeError: expected string or buffer

On line 212 I put

    try:
        subject, count = re.subn('\s+', ' ', subject)
        print "\nSubject: " + subject
    except:
        print "\nNo Subject"

to fix it. But I am not sure whether this will be alright.
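A narrower alternative to the bare `except` in the workaround above is to test for the missing Subject explicitly, so that unrelated errors are not silently swallowed. This is only a sketch in Python 3; the function name `normalise_subject` is hypothetical and it is not the change that went into the script.

```python
import re


def normalise_subject(subject):
    """Collapse runs of whitespace in a Subject into single spaces.

    Hypothetical helper: returns None unchanged when the mail had no
    Subject header, instead of letting re.subn raise TypeError.
    """
    if subject is None:
        return None
    # re.subn returns (new_string, number_of_substitutions).
    normalised, _count = re.subn(r"\s+", " ", subject)
    return normalised.strip()
```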
@Andreas: Thanks for reporting this error. I just created a bug report. Next time you stumble upon a bug, don’t hesitate to create a ticket on GitHub!
I should add that upon trying --remove for a copy it issued [...] in the end. But duplicates seem to be gone.
@Andreas: Issue created on GitHub.