Kolab and Spam

In January, I switched my mail server to Kolab, which is a great solution for a single domain. With a little creativity, you can make Kolab work with virtual domains as well. (I hear a development version has this support built in, so the next official release ought to be very interesting.) However, there was one small glitch: its built-in spam filters were not doing a fantastic job. This is not the fault of the developers, though; it has more to do with tweaking the system to meet your own needs. So, after my post last week about hating spam, I did some digging to find out how I could increase the effectiveness of the spam filtering.

First, I came across a page called Fighting Spam that details some of the steps you can take. The section titled "Using SBL and XBL lists from Spamhaus and ORDB" on its own will eliminate a lot of your spam. But of course, it's not a magic bullet. The next thing to do is get SpamAssassin to learn what is spam and what is not.

The webpage talks about setting up a shared folder to store spam in. In my case, this isn't the best choice, as I don't need groupware collaboration on my network and don't have my mail client set up to use shared folders. So, I created a new mailbox account called "spam". The idea is that I will move all spam I receive to this mailbox. I then created two subfolders in this account: one called spam, and another called ham (for stuff that was wrongly flagged and needs to be unlearned as spam). I use KMail as my mail client when I'm at my desktop, so I've configured KMail with another mail account and pointed it at this spam account. The key here is that I made KMail treat this account as an IMAP account, rather than a POP3 account, so the messages stay on the server. Now, whenever I receive spam, I move the message into the spam folder of the spam account. This allows me to collect spam in a central location, which happens to sit on my email server.

Next, we need to tell Kolab's SpamAssassin to learn from these messages. I found a script for this, but only used it as a basis for my own. The important parts are:

/kolab/bin/sa-learn - this is the learning tool for Kolab's SpamAssassin
/kolab/var/imapd/spool/domain/ - this is where Kolab stores the IMAP emails
/kolab/var/amavisd/.spamassassin - this is where SpamAssassin stores the database of learned email

It's important to note that whether you choose to use a public folder or a separate mailbox, the messages will be stored in an appropriate directory under /kolab/var/imapd/spool/domain/. For me, this translated to /kolab/var/imapd/spool/domain/open2space.com/user/spam/spam/. As you can see, there is a directory for your domain under the domain directory. Then there is a user directory, with a corresponding directory for each of your users. And under my spam account's directory, there are corresponding spam and ham folders. If you were using shared folders, you would have a directory called "shared^mydirname" at the same level as the user folder, and you would use this path instead.
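As a rough sketch of that layout, the helper below builds the on-disk path for a folder in a user's mailbox and counts the message files waiting in it. The function names mailbox_dir and count_messages and the SPOOL_ROOT variable are made up for illustration; only the directory layout comes from the description above. The -maxdepth option is assumed to be available, as it is in GNU and BSD find.

```shell
# Base of the Kolab IMAP spool; override SPOOL_ROOT if your install differs.
SPOOL_ROOT="${SPOOL_ROOT:-/kolab/var/imapd/spool/domain}"

# Print the on-disk directory for a folder inside a user's mailbox.
# usage: mailbox_dir DOMAIN USER FOLDER
mailbox_dir() {
    printf '%s/%s/user/%s/%s\n' "$SPOOL_ROOT" "$1" "$2" "$3"
}

# Count the message files (names starting with 1-9) directly inside DIR,
# ignoring the other files the IMAP server keeps in the same directory.
# usage: count_messages DIR
count_messages() {
    find "$1" -maxdepth 1 -type f -name '[1-9]*' | wc -l | tr -d ' '
}
```

For the account described here, count_messages "$(mailbox_dir open2space.com spam spam)" would report how many messages are waiting in the spam folder.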

Now that we know these paths, we can execute the command to learn the email:

/kolab/bin/sa-learn --spam /kolab/var/imapd/spool/domain/open2space.com/user/spam/spam/[1-9]* --dbpath /kolab/var/amavisd/.spamassassin

A lot of typing, but most of it is simply the directories. Note the --spam argument - this tells sa-learn that we are learning spam. To learn ham, we simply change this argument to --ham and change the target directory.

There is one problem with this command though. If you are like me and have stored a folder full of spam (I had 3000+ messages in mine), then the final part of the path introduces the problem. The [1-9]* says to get everything that starts with a number. But when you're dealing with 3000+ messages, that is equivalent to passing 3000+ arguments to the command, and you quickly get an error indicating the argument list is too long. To resolve this, I had to modify my command to use something like "1*" as the last part (instead of the [1-9]*). Then run the command 9 times changing the number as needed. This kept the argument list to a manageable size. But why not just use "*"? Well, there are some other files in these directories that are needed by the IMAP server, and we don't want to learn these as spam, or remove them.
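Another way around the argument limit, different from what I did, is to let find hand the files to sa-learn in batches. This is only a sketch: learn_dir is a hypothetical helper name, the paths are the ones used in this post, and it assumes a find that supports -maxdepth (GNU and BSD find do). The "-exec ... {} +" form groups as many file names as will fit on each command line, and runs nothing at all when the folder is empty.

```shell
# Defaults follow the paths used in this post; override them for your setup.
SALEARN="${SALEARN:-/kolab/bin/sa-learn}"
DBPATH="${DBPATH:-/kolab/var/amavisd/.spamassassin}"

# Feed every message file in DIR to sa-learn, in as many batches as needed.
# usage: learn_dir spam|ham DIR
learn_dir() {
    kind=$1
    dir=$2
    # Only names starting with 1-9 are picked up, so the IMAP server's own
    # files in the same directory are left alone.
    find "$dir" -maxdepth 1 -type f -name '[1-9]*' \
        -exec "$SALEARN" "--$kind" --dbpath "$DBPATH" {} +
}
```

With this in place, learn_dir spam /kolab/var/imapd/spool/domain/open2space.com/user/spam/spam would cover the whole folder in one call, however many messages it holds.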

I created a script that looks like this:

#!/kolab/lib/openpkg/bash
# ********************************************************
# This file applies the bayesian learning of spamassassin
# to any files in the spam and ham directories, then it
# updates the antivirus signatures.
# ********************************************************

#Change the ROOTMAILPATH to reflect your domain.
ROOTMAILPATH=/kolab/var/imapd/spool/domain/open2space.com
SALEARN=/kolab/bin/sa-learn
DBPATH=/kolab/var/amavisd/.spamassassin/

#Learn the spam messages
$SALEARN --spam $ROOTMAILPATH/user/spam/spam/1* --dbpath $DBPATH
$SALEARN --spam $ROOTMAILPATH/user/spam/spam/2* --dbpath $DBPATH
$SALEARN --spam $ROOTMAILPATH/user/spam/spam/3* --dbpath $DBPATH
$SALEARN --spam $ROOTMAILPATH/user/spam/spam/4* --dbpath $DBPATH
$SALEARN --spam $ROOTMAILPATH/user/spam/spam/5* --dbpath $DBPATH
$SALEARN --spam $ROOTMAILPATH/user/spam/spam/6* --dbpath $DBPATH
$SALEARN --spam $ROOTMAILPATH/user/spam/spam/7* --dbpath $DBPATH
$SALEARN --spam $ROOTMAILPATH/user/spam/spam/8* --dbpath $DBPATH
$SALEARN --spam $ROOTMAILPATH/user/spam/spam/9* --dbpath $DBPATH

#Learn the ham messages (if any)
$SALEARN --ham $ROOTMAILPATH/user/spam/ham/1* --dbpath $DBPATH
$SALEARN --ham $ROOTMAILPATH/user/spam/ham/2* --dbpath $DBPATH
$SALEARN --ham $ROOTMAILPATH/user/spam/ham/3* --dbpath $DBPATH
$SALEARN --ham $ROOTMAILPATH/user/spam/ham/4* --dbpath $DBPATH
$SALEARN --ham $ROOTMAILPATH/user/spam/ham/5* --dbpath $DBPATH
$SALEARN --ham $ROOTMAILPATH/user/spam/ham/6* --dbpath $DBPATH
$SALEARN --ham $ROOTMAILPATH/user/spam/ham/7* --dbpath $DBPATH
$SALEARN --ham $ROOTMAILPATH/user/spam/ham/8* --dbpath $DBPATH
$SALEARN --ham $ROOTMAILPATH/user/spam/ham/9* --dbpath $DBPATH

#clear the directories once learning is complete
#this is done to help keep the number of files to be learned to a minimum
#rm -rf $ROOTMAILPATH/user/spam/spam/[1-9]*
#rm -rf $ROOTMAILPATH/user/spam/ham/[1-9]*

#update the antivirus signatures
/kolab/bin/freshclam

A few things to note here:

  • The very first line says "#!/kolab/lib/openpkg/bash" instead of "#!/bin/bash". This is because we want Kolab to be aware of this script, and Kolab runs in a chrooted environment.
  • We are using variables to reduce the commands to a more readable size.
  • I'm handling each number separately via individual commands. This only really needs to be done if the folder contains a large number of files. Otherwise, you could use the [1-9]* approach, and a single command.
  • We are learning both the spam and the ham.
  • The two rm lines remove the files after they are learned, to keep the folder manageable. There are some issues resulting from this, though nothing major, so they are commented out in the listing. I would suggest you start with these lines commented out too.
  • Finally, we update the antivirus definitions, just because antivirus is related to antispam, and it doesn't hurt to do it here.
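The nine near-identical lines per folder could also be folded into a loop. The sketch below keeps the same one-command-per-digit approach; learn_by_digit is a hypothetical function name, and the variables mirror the ones in the script. It also skips any digit that has no matching files, so sa-learn is never handed an unexpanded pattern.

```shell
# Defaults mirror the variables in the script; override them as needed.
SALEARN="${SALEARN:-/kolab/bin/sa-learn}"
DBPATH="${DBPATH:-/kolab/var/amavisd/.spamassassin/}"

# Run sa-learn once per leading digit so each command line stays short.
# usage: learn_by_digit spam|ham DIR
learn_by_digit() {
    kind=$1
    dir=$2
    for digit in 1 2 3 4 5 6 7 8 9; do
        # Expand the pattern into the function's positional parameters.
        set -- "$dir/$digit"*
        # With no match the pattern stays literal (or expands to nothing),
        # so there is no such file and this digit is skipped.
        [ -e "$1" ] || continue
        "$SALEARN" "--$kind" "$@" --dbpath "$DBPATH"
    done
}
```

The two learning sections of the script would then shrink to learn_by_digit spam $ROOTMAILPATH/user/spam/spam and learn_by_digit ham $ROOTMAILPATH/user/spam/ham.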

The only thing to do now is to schedule our script to run periodically. The catch here is whether you do this in Kolab's chrooted environment, or in the base system's environment. The way the script is written, it shouldn't matter which though. But if you do decide to use the chrooted environment, then you should grab the opa script to switch your role, before creating the cron entry.

I used the "crontab -e" command to edit my scheduled jobs, and added the following line:

20 4,16 * * *   /root/cron/salearn.sh > /dev/null 2>&1

So I'm saying I want to run the script at 4:20 am and 4:20 pm (twice a day), and don't want to see the output of the script. Of course, I ran the script a couple of times manually to make sure things were happening as they should.
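One way to check that a manual run actually taught the filter something is to look at the message counters in the Bayes database before and after. The sketch below uses sa-learn's "--dump magic" option and assumes its usual output, where the counters appear on lines ending in "nspam" and "nham" with the count in the third column. bayes_counts is a hypothetical helper name, and the column position is worth checking against your own version's output.

```shell
# Defaults follow the paths used in this post; override them for your setup.
SALEARN="${SALEARN:-/kolab/bin/sa-learn}"
DBPATH="${DBPATH:-/kolab/var/amavisd/.spamassassin}"

# Print "nspam=<count> nham=<count>" from the Bayes database summary.
bayes_counts() {
    "$SALEARN" --dump magic --dbpath "$DBPATH" |
        awk '$NF == "nspam" { spam = $3 }
             $NF == "nham"  { ham = $3 }
             END { printf "nspam=%s nham=%s\n", spam, ham }'
}
```

Running bayes_counts before and after the script should show nspam (and nham, if there was any ham) going up.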

And the overall result? I've gone from getting 10 to 40 spam messages a day to getting only 3 or 4. And these are slowly disappearing as they are added to the learning queue. Well, this isn't quite accurate. I'm still getting more than that, but the messages are properly tagged as spam and then filtered into another directory, where I hardly notice them. This is a setting in SpamAssassin though: you can have it outright delete the mail if you wish. But I'm keeping an eye out for any false positives.

There is one other suggestion in the Fighting Spam page. They suggest publishing your spam account by signing it up for a porn site, or something else that is well known to harvest email addresses for spam purposes. While I agree with the principle, I won't be doing this myself. The idea is that this is a spam trap - any message that the account receives is spam. So we then learn it as spam making our spam filter more effective. That's the part I agree with. The downside though is that all that spam has to be received by your mail server - so you are using up your bandwidth to make this happen. If you're not concerned about your bandwidth, then this is not an issue. But my background suggests I should not be sending unnecessary network traffic, or accepting any. So, I have chosen not to use a spam trap. This might change though if I start getting flooded with spam again.

I hope this info is of some use to you. I know it's made a difference for me.

(now, I just have to figure out how to get my ISP off the blacklists so that my own email is not treated as spam.... but they won't listen to me... sigh...)