SpamAssassin: b0rked
April 25, 2004 11:10 AM

If you use SpamAssassin, you may have noticed the service degrading since January 2004. As usual, Google has the Answer - it seems a spammer paid an overly helpful geek on Google Answers $200 to detail exactly how SpamAssassin works... I wonder if said geek ever got the money?
posted by wibbler (27 comments total)
(via filepile)
posted by wibbler at 11:11 AM on April 25, 2004

Eh, it's open source -- more completely revealing the system's workings will ultimately serve to strengthen it. I found the description really interesting. I suspect I'll end up tweaking my scoring as a result of it.

I finally took my spam in hand earlier this month. I kept my tag-as-spam threshold at a score of 5, but set a "> /dev/null" at a score of 10, so everything from 5.0-9.9 gets filed in my spam folder. Whereas I once had 200+ pieces of spam to paw through in my spam folder each day, now I'm down to 1-5.

I've never auto-deleted before (mostly out of a packrat mentality that I must save and study...all...spam), but after reviewing a large corpus of legitimate mail and of spam, I found that I could eliminate 95% of my spam while, with a decent whitelist, losing a fraction of 1% of my e-mail, and that fraction being the sort of quasi-not-spam, like promotional e-mails from companies with whom I did business but apparently failed to check off the "don't contact me" box. (Incidentally, I created my whitelist by just grepping through my Sent Items IMAP folder, pulling out every "To: " field, piping that through sort, and then uniq. I'll just set up a cron job so that it runs again once a month and Bob's your uncle.)
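The whitelist step above can be sketched in Python instead of grep, sort and uniq. This is only an illustration: the function name `sent_recipients` and the sample messages are hypothetical, and it reads message text from a list rather than from an IMAP folder.

```python
from email import message_from_string
from email.utils import getaddresses


def sent_recipients(raw_messages):
    """Return the sorted, de-duplicated addresses found in To: headers."""
    addresses = set()
    for raw in raw_messages:
        msg = message_from_string(raw)
        # get_all returns every To: header, or the default when there is none.
        to_headers = msg.get_all("To", [])
        for _name, addr in getaddresses(to_headers):
            if addr:
                # Lowercase so the same correspondent is not listed twice.
                addresses.add(addr.lower())
    return sorted(addresses)


# Hypothetical sent messages standing in for a Sent Items folder.
SAMPLE_SENT = [
    "To: Alice <alice@example.com>\nSubject: lunch\n\nNoon?\n",
    "To: bob@example.org, Alice <ALICE@example.com>\nSubject: re\n\nOK\n",
]

whitelist = sent_recipients(SAMPLE_SENT)
```

Each address in `whitelist` could then be written out as a whitelist entry in whatever form the filter expects.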

To help spam get up to the score of 10 more readily, I pumped up the score for the Bayesian filtering -- at the 90% level, it gets a +5, and at the 99% level, it gets a 9.9.

Incidentally, I set my whitelist entries to receive a ranking of -100. I also set all mail with a Microsoft executable attached to have +100 and, poof, no more viruses. :)
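As a rough sketch of how these adjustments and the two thresholds combine, here is the arithmetic in Python. It is not SpamAssassin's own code: the function names are made up, and treating the Bayesian result as a single probability with two tiers is an assumption based on the description above.

```python
TAG_THRESHOLD = 5.0      # tag as spam and file in the spam folder
DELETE_THRESHOLD = 10.0  # discard outright, the "> /dev/null" case


def adjusted_score(base_score, bayes_probability, whitelisted, has_ms_executable):
    """Add the custom adjustments described above to a base rule score."""
    score = base_score
    if bayes_probability >= 0.99:
        score += 9.9      # 99% level
    elif bayes_probability >= 0.90:
        score += 5.0      # 90% level
    if whitelisted:
        score -= 100.0    # whitelist entries
    if has_ms_executable:
        score += 100.0    # Microsoft executable attached
    return score


def triage(score):
    """Map a final score to one of the three outcomes."""
    if score >= DELETE_THRESHOLD:
        return "delete"
    if score >= TAG_THRESHOLD:
        return "spam-folder"
    return "inbox"
```

With these numbers, a message that only scores 1.0 on the ordinary rules but hits the 99% Bayesian level ends up at 10.9 and is deleted, while a whitelisted sender stays in the inbox almost no matter what else fires.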
posted by waldo at 11:32 AM on April 25, 2004

All I see here is the question and the answer, without any evidence (or even a narrative) to support your contention that jmwilson is a spammer and that this resulted in him and others defeating SpamAssassin. I'm not saying it's not true, it's just that I'd find the part of the story you left out much more interesting than what you included.
posted by Ethereal Bligh at 11:32 AM on April 25, 2004

Dude, hello? The first rule?

I'm not convinced this guy is a spammer. Yes, the voluminous detail he asks for, his willingness (verging on downright eagerness) to pay the maximum amount Google allows, and his desire for the answer to be removed from the site after he receives it are all suspicious.

But if you're looking to beat SpamAssassin, this analysis only gets you halfway there -- you still have to craft messages that will get through the filters, and this analysis does nothing to identify any possible holes in the system.

However, if your goal is to replicate SpamAssassin's functionality in your own product, this would be exactly what you'd need.
posted by jjg at 11:34 AM on April 25, 2004

still works for me. combined with reverse dns lookup blocklists i get no spam (and make no attempt to hide my email).

and anyway, is it that difficult to reproduce spam assassin? the only part i can think that isn't standard knowledge (ie basic modern stats) is how they divide messages up into different "features", and you can get that from the source without trying to understand all the rest of the code.

i really don't understand why people don't force their employers/isp providers to filter properly. it's not hard (if you're used to doing linux admin - i wouldn't expect my parents to be able to configure it) and it works.
posted by andrew cooke at 11:52 AM on April 25, 2004

Speaking of which, has anyone noticed a sudden upswing in the amount of spam sneaking under Spamassassin and's junk filter in the past day or two? I've been getting a lot rated less than 5, consisting of a bunch of nonsense words, maybe one or two words related to viagra or penis enlargement, a hairy URL, and a bunch more nonsense words.

Not sure even how to filter on that, but I'd be eager for suggestions.
posted by adamrice at 12:09 PM on April 25, 2004

SpamAssassin had got rather less efficient for me recently but since adding the antidrug, chickenpox and tripwire custom rule sets it's got a great deal better. You can download them from here.
posted by kerplunk at 12:27 PM on April 25, 2004

Am I just lucky in that I don't get a ton of spam, ever, to any of the e-mail addresses I've collected over the years? I don't do anything with them but correspond, and limit that to people I know. I don't publicize any of them but one, I'm mindful of the blind carbon copies, and that's about it. Is that all it takes? This question has been nagging at me for a while, and I've never come up with a good explanation; I feel horror looking at some of my friends' inboxes.

Yes, Mr. G. Horse I am looking you square in the mouth
posted by WolfDaddy at 1:08 PM on April 25, 2004

But if you're looking to beat SpamAssassin, this analysis only gets you halfway there -- you still have to craft messages that will get through the filters, and this analysis does nothing to identify any possible holes in the system.

No.. but it does help you 'craft messages that will get through the filters' ;-) I work on a corporate e-newsletter and I have to make sure that the customers will receive it (they're all legit customers, and can unsubscribe specifically from the newsletter at any time), and if rules like these would help me get around the filters.. then I'm sure real spammers could find them useful too.
posted by wackybrit at 1:40 PM on April 25, 2004

Wouldn't it be easier to just write an email and test it against SpamAssassin? It's not like the code or algorithms are secrets.

I found that spamassassin got a lot less useful for me last December. I made an effort to tune up my spamassassin configuration (self link) and found that with good Bayesian filtering trained, spamassassin still does pretty well.

63% of my email last month was spam.
posted by Nelson at 2:32 PM on April 25, 2004

I have to say, I haven't noticed any such degradation in service; I get somewhere around 500-600 pieces of unsolicited email a day, and perhaps one or two slip through a day, unchanged over the past six to eight months.

Waldo's idea of autodeleting only spam that reaches a critical threshold hadn't ever occurred to me, but how brilliant! I just set the rule in my .procmailrc file, and in the past five minutes, I've watched three pieces of spam flushed down the /dev/null toilet. Satisfying.

And lastly, I concur -- there's nothing in that Google Answer that couldn't just-as-easily be achieved by a spammer by running an email against SpamAssassin and checking out the score. It's not like it costs a dime...
posted by delfuego at 3:01 PM on April 25, 2004

What this does allow, however, is for someone to implement a feature-equivalent closed-source version of SpamAssassin, much like the clean room process used in the first IBM compatibles.

Or it might just be someone with a personal interest in how it works but doesn't want to learn to read Perl.
posted by gi_wrighty at 3:58 PM on April 25, 2004

Clean room design
posted by gi_wrighty at 3:59 PM on April 25, 2004

First, SpamAssassin still works quite well as long as you run sa-learn periodically so it can tune the Bayesian classifier - I just don't get spam on accounts which use SpamAssassin.

Second, the entire point is a little nonsensical. Yes, that answer is suspicious but spammers were already well aware of SpamAssassin and actively trying to break it for ages. More importantly, most of the techniques SpamAssassin uses aren't easy to cheat, particularly since they overlap (e.g. obfuscation techniques score highly but not using them means that the Bayesian classifier will trivially catch the message). The main reason why spam is still a problem is simply that most people aren't using any sort of filter at all.
posted by adamsc at 4:24 PM on April 25, 2004

SpamAssassin has not been working well at all for me lately. In particular, lots of messages with random single word titles and nonsense senders.
posted by smackfu at 5:47 PM on April 25, 2004

What procmail rule are you using? Something like this?
:0 :
* ^X-Spam-Level: \*{10,}$
posted by mrbill at 6:05 PM on April 25, 2004

Nevermind, you have to do this:
:0 :
* ^X-Spam-Level: \*\*\*\*\*\*\*\*\*\*
posted by mrbill at 6:18 PM on April 25, 2004

And you really, really do need to slash-escape those asterisks. :)
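To see what the escaped pattern is matching, here is a minimal Python sketch of the same check. It assumes the `X-Spam-Level` header carries one asterisk per whole point of score, which is what the recipes above rely on; the function name and the sample headers are made up for illustration. Python's `re` module accepts the `{10,}` repetition, so the pattern is written that way here, even though the thread found it had to be spelled out for procmail.

```python
import re

# Ten or more literal asterisks; the backslash stops "*" acting as a quantifier.
TEN_STARS = re.compile(r"^X-Spam-Level: \*{10,}", re.MULTILINE)


def should_discard(headers):
    """True when the header block has an X-Spam-Level of ten or more stars."""
    return TEN_STARS.search(headers) is not None


# Hypothetical header blocks: six stars and twelve stars.
low = "Subject: hi\nX-Spam-Level: ******\n"
high = "Subject: hi\nX-Spam-Level: ************\n"
```

Without the backslash, the asterisk would be read as "repeat the previous character" rather than as a literal star, which is the reason for the escaping.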

As was mentioned previously, if you want to trust SpamAssassin's Bayesian filter by cranking up the points for the highly-spammish (90%+) messages, it's important that you feed SpamAssassin a decent corpus of hundreds or thousands of pieces of spam (sa-learn --spam --mbox /home/yourname/mail_folder/your_uce_folder) along with, ideally, all of your legitimate mail. This is easiest if you store your mail in IMAP. If you download it via POP, see if you can somehow export your mail folders as mbox files and upload them to your server for the purpose of running through sa-learn, to add to your corpus. In my experience -- and I have a huge, huge corpus, of all of the e-mail that I've gotten since 1999 -- the Bayesian filtering is like having a parallel spam filter, running side-by-side with all the rest of the SpamAssassin rules. I've had great success with cranking up the penalties for high Bayesian scores.

Thanks for the links to the additional rulesets, kerplunk. I added several of those to my two servers, and I'm really enjoying watching that Procmail log scroll by. :)
posted by waldo at 7:32 PM on April 25, 2004

Spamassassin is catching less than half my spam. What options, tweaks, etc., are you guys using?

Cough it up. This is no good for my blood pressure.
posted by NortonDC at 8:45 PM on April 25, 2004

Spamassassin is catching less than half my spam. What options, tweaks, etc., are you guys using?

It should work very nicely out of the box, no tweaks necessary, as long as you feed it a big ol' corpus of your mail, via "sa-learn". Pretend that SpamAssassin is just like a very confused old lady that works at the post office, and you're trying to teach her to discriminate between mail and junk mail. So just give her two big piles: one of the hundreds of legitimate envelopes that you've gotten over the past year or two, and one of the hundreds of junk mail envelopes that you've gotten in the same period.

Without that corpus, Spam Assassin will cut down on your spam, sure, but it won't be nearly as good as if you train it. On the off chance that you have trained it, and it still sucks, then I suspect that you accidentally fed it a batch of good mail labeled as spam, or vice-versa, in which case you should clear out shop and start over again.
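The "two big piles" idea can be shown with a toy classifier. The sketch below is a generic naive Bayes over whole words, with add-one smoothing and equal priors; it is not SpamAssassin's actual classifier, and the function names and the tiny sample piles are invented for illustration.

```python
import math
from collections import Counter


def train(spam_docs, ham_docs):
    """Count word occurrences in each pile of mail."""
    spam_counts = Counter(w for d in spam_docs for w in d.lower().split())
    ham_counts = Counter(w for d in ham_docs for w in d.lower().split())
    return spam_counts, ham_counts


def spam_probability(text, spam_counts, ham_counts):
    """Naive Bayes estimate with add-one smoothing and equal priors."""
    vocab = set(spam_counts) | set(ham_counts)
    spam_total = sum(spam_counts.values()) + len(vocab)
    ham_total = sum(ham_counts.values()) + len(vocab)
    log_odds = 0.0
    for word in text.lower().split():
        if word not in vocab:
            continue  # words never seen in training carry no evidence
        p_spam = (spam_counts[word] + 1) / spam_total
        p_ham = (ham_counts[word] + 1) / ham_total
        log_odds += math.log(p_spam / p_ham)
    return 1.0 / (1.0 + math.exp(-log_odds))


# Two hypothetical piles, far smaller than a real corpus.
spam_counts, ham_counts = train(
    ["cheap pills online now", "buy cheap pills"],
    ["meeting notes attached", "lunch meeting tomorrow"],
)
```

A message made only of words the classifier has never seen comes out at exactly 0.5, which is the toy version of why an untrained filter adds so little. Mislabeled piles push the word counts the wrong way, which fits the advice to start over if the training data got mixed up.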
posted by waldo at 9:36 PM on April 25, 2004

I K9
posted by milnak at 1:39 AM on April 26, 2004

NortonDC: Spamassassin is fundamentally broken. I'm going to get crucified by the geeks for saying this, but it's true.

SA's first step is to look at the content of the message, and try to tell if it's spam by asking questions like 'is that a real domain?' and 'does that reverse DNS test succeed?' These questions are worthwhile, but they're fallible.

SA's much-touted second line of defense is Bayesian filtering. You see, with Bayesian filtering, you no longer have one inbox, you have two. First, you have your "good" inbox, which you must endeavour to keep spam free. The problem is your newfound second inbox. You see, now you need to keep your "spam" inbox free of legitimate email. If you don't, then eventually your false positive rate will slowly rise, and you probably won't even notice until it nails something important.

Nobody likes SpamAssassin because it's good, they like it because it's the best thing that they've used. These are two very different standards.
posted by mosch at 1:41 AM on April 26, 2004

Nobody likes SpamAssassin because it's good, they like it because it's the best thing that they've used. These are two very different standards.

That's not true. I like SpamAssassin because it's good.
posted by waldo at 2:47 PM on April 26, 2004

I received e-mail, a result of this thread, from a contributor to SpamAssassin. As he is without a MeFi account, he asked that I pass on the following:

1. SpamAssassin v3.0 should be out in 2-3 months. It will have lots of new rulesets with many old ones rewritten, which will make the analysis available to spammers outdated and much less useful.
2. Additional rulesets are available at the SpamAssassin wiki and the SpamAssassin Rules Emporium.
3. It is important that SpamAssassin be set to autolearn from any messages tagged with a high level of spamishness. So if you have SpamAssassin erasing anything ranked, say, 10 or higher, make sure that you have SpamAssassin learning from that experience.
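The autolearn advice in the third point amounts to a simple decision on each message's final score. Here is a minimal sketch of that decision in Python; the function name and both cutoff values are hypothetical, chosen only to line up with the "10 or higher" example, and are not SpamAssassin's defaults.

```python
def autolearn_action(score, spam_cutoff=10.0, ham_cutoff=0.5):
    """Decide whether a scored message should be fed back into training.

    Both cutoffs are illustrative assumptions, not real defaults.
    """
    if score >= spam_cutoff:
        return "learn-as-spam"
    if score <= ham_cutoff:
        return "learn-as-ham"
    # Scores in between are too ambiguous to train on automatically.
    return "skip"
```

The point of the middle band is that a message the filter is unsure about should not be allowed to teach the filter anything without a human looking at it.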

I post this not so much for the benefit of MeFites (this thread is about to go off the front page), but for the benefit of those that may find their way here through Google.
posted by waldo at 8:57 PM on April 26, 2004

From SA developer j2323 (not me):

We've always known that spammers can hire techies to take apart the rules. that's not a problem. the SA rules are designed to:

(a) spot signs that the message was sent through open proxies or other ways to hide who actually sent it -- anyone using a spamming tool to send *anything* will get hit by this, no matter what other things they try to hide it as. In other words, it detects the *hiding* part. No matter how deep they get into the rules, they still won't be able to avoid that.

(b) not give trivially forgeable negative rules -- at least not any more. we did have a few trivially forgeable ones in about 2.3x and learned our lesson back then. oops.

(c) provide several hundred not-quite-as-effective "cover" rules so they waste their time trying to figure *everything* out ;)

As it stands, any spam filter will get attacked by spammer techies anyway, even without this guy helping them -- witness the "Bayes buster" text, the hash-buster strings etc. (BTW, on the "open source is easier to hack than closed" angle, the hash-buster strings were designed to attack AOL's secret filtering scheme; the spammers worked it out through trial and error without access to AOL's source.)

Also, SA's a "wide spectrum" approach. the idea is that with the simple text rules, PLUS Bayesian filtering, PLUS dns blocklists PLUS dcc/razor etc., it provides something that in combination is much much harder to evade than any one alone. The test types are supposed to overlap so that evasion of one shows up as a hit on another.

Plus a few of the answers that "maniac-ga" gives him are downright incorrect ;)

Regarding the theory that this answer had something to do with a change in spam in January -- I doubt it. What I *am* noticing now, is that unlike before where there seemed to be a large number of small-time spammers using spamware apps to send spam independently, there now seem to be fewer people involved, but those people are the big names, they really know *how* to spam, and they are sending more mails. This could be because more of the small-timers have been scared off by CAN-SPAM etc., and the big-timers are going all-out for some reason -- possibly they feel their days of easy spamming are numbered...

BTW, if anyone's finding that SpamAssassin 2.6x is letting more spam through, the dev version of 3.0.0 is kicking ass (especially with the new SURBL code to match URLs in the message body). we're hoping to release that in a month or 2, once we close out the remaining bugs on the list, but it's working well for me at least.

PS: ha, the spammer can't get the explanation text removed from public display! love it. open-source isn't really spammer style ;)

oh btw -- 'I wonder if he gets offers from spammers to help them' -- actually, never! but I have received a death threat from a scary Russian once. :(
posted by mutagen at 10:48 PM on April 26, 2004

We need a reachout program of MeFi accounts for anti-spam devs.
posted by nedrichards at 12:33 AM on April 27, 2004

my university put SA on the new mail servers; seems to work OK although i wish they'd enable an autodelete option for users - what gets marked is always junk, i've seen no false positives yet; i'd rather blackhole the crap than let it sit and take up space on my server until i delete it. i get messages every day marked as spam; easy enough to set a delete rule for my mail client (mozilla). what little gets through unmarked is 99% of the time caught and dumped by mozilla's bayesian filter.

that said i would like to see a serious effort to hunt down and kneecap spammers and those who encourage them by purchasing items through spam emails. they deserve it.
posted by caution live frogs at 10:41 AM on April 27, 2004

This thread has been archived and is closed to new comments