You say Viagra, I say \/!@&2^
June 27, 2007 10:06 AM

How Many Ways Can You Spell V1@gra? Building on previous research (Cockerham, 2004), Brian Hayes attempts to find the limits of Viagra-spammer ingenuity.
posted by Horace Rumpole (17 comments total) 5 users marked this as a favorite

 
P.O.R.N.
posted by Wizzle at 10:21 AM on June 27, 2007


That support vector thing was pretty cool, but wouldn't it necessarily be less effective than the Bayesian method? Or are they equivalent?
posted by DU at 10:28 AM on June 27, 2007


R O L A I D S
posted by hermitosis at 10:42 AM on June 27, 2007


I'm certain this has been on mefi before, but I'm failing to find it. An Intuitive Explanation of Bayesian Reasoning is truly excellent as an intro to this stuff.
posted by Skorgu at 11:33 AM on June 27, 2007 [2 favorites]


I'm certain this has been on mefi before, but I'm failing to find it.

The cockeyed.com link was, but the main article is brand new.
posted by Horace Rumpole at 11:40 AM on June 27, 2007


I'm pretty sure Cockerham's name wouldn't make it through my spam filter.
posted by salishsea at 11:50 AM on June 27, 2007


c 0 c k 3 r h 4 m
posted by Mikey-San at 1:01 PM on June 27, 2007


DU: in that particular case, an SVM could well be more accurate, but I doubt it.

An SVM will find a pattern amongst heaps of training data in the tokens (I suppose these are words) that indicate spam, by putting them in million-dimensional spaces in thousands of combinations - much more powerful than a fairly simple one-dimensional Bayesian algorithm. It needs lots of data for that, ideally a dozen or two thousand messages classified by a human into spam and ham, preferably in the same proportion as they occur in reality. In theory, it should be more accurate.

What I can't see is how it would be more resistant to padding with random paragraphs lifted from non-spam corpora. Unknown tokens will be ignored by the SVM, since it can't generalise, so new spellings will fool it. If a joker 'unknown word' is introduced into the training, it's still susceptible to being made irrelevant by large enough padding. Unless it gives such a weight to unknown words that no one can write neologisms, which is just silly. Maybe it should then be coupled with a whitelist.
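
To make the padding worry concrete, here is a minimal sketch of a bag-of-words scorer. It is not the article's experiment and not a trained SVM: the weights, the bias, and the function names are all made up for illustration. It only assumes the classifier ends up summing a per-token weight, which is the form a linear SVM (and naive Bayes log-odds) takes at prediction time.

```python
# Hypothetical per-token weights: positive leans spam, negative leans ham.
# A real filter would learn these from labelled mail; these are invented.
WEIGHTS = {
    "viagra": 4.0,
    "cheap": 1.5,
    "pills": 2.0,
    "meeting": -1.0,
    "thanks": -1.0,
    "report": -1.0,
}
BIAS = -0.5  # hypothetical offset; a message with no known tokens counts as ham


def score(message: str) -> float:
    """Sum the weights of known tokens; unknown tokens contribute nothing."""
    tokens = message.lower().split()
    return BIAS + sum(WEIGHTS.get(tok, 0.0) for tok in tokens)


def is_spam(message: str) -> bool:
    return score(message) > 0


plain = "cheap viagra pills"
# New spellings are unknown tokens, so they carry no weight at all.
respelled = "ch3ap v1@gra p1lls"
# Padding with ham-leaning words drags the total back below zero.
padded = plain + " " + " ".join(["meeting thanks report"] * 3)

for msg in (plain, respelled, padded):
    print(f"{score(msg):6.1f}  spam={is_spam(msg)}  {msg}")
```

With these invented numbers the plain message is flagged, while both the respelled and the padded versions slip through, which is the weakness described above.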

The article suggests using the messages themselves as tokens, which doesn't make sense to me - no two messages are the same, so the SVM would just not find any match to decide upon.

I have done some work with SVMs, but don't have a deep understanding of all their aspects, so maybe I'm missing something. Maybe someone more versed in AI can explain what the experiments are about.
posted by Spanner Nic at 1:11 PM on June 27, 2007


I'm not buying the leap from "you can put asterisks in between each letter" to "you can put any symbol or letter in between the letters". Like I don't think "VtAGGReA" works.
posted by smackfu at 2:06 PM on June 27, 2007


I'm not buying the leap from "you can put asterisks in between each letter" to "you can put any symbol or letter in between the letters". Like I don't think "VtAGGReA" works.

I don't buy it either, but your example is not an example of applying that principle. Also, your point is already made in the article.
posted by jejune at 2:55 PM on June 27, 2007


How Viagra spam works
posted by acro at 3:03 PM on June 27, 2007


Favorited because it uses properly formatted MLA citation of cockeyed.com.
posted by chlorus at 3:37 PM on June 27, 2007


Horace Rumpole: Oops, I actually meant I was sure the Bayesian reasoning post had been on mefi before. Sorry. This was a good post, thank you.
posted by Skorgu at 4:24 PM on June 27, 2007


...putting them in million-dimensional spaces in thousands of combinations - much more powerful than a fairly simple one-dimensional Bayesian algorithm...

Oh, I was imagining them to have the same number of dimensions. The Bayesian stuff we do at work has many inputs, which was confusing me.
posted by DU at 4:41 PM on June 27, 2007


The man from U.N.C.L.E.
posted by Tube at 6:31 PM on June 27, 2007


\/ | /\ (+ |2 /|/ ! /-\ [, /2 I'.' l .'. I, 42 /!

Some are better than others, but none are listed in the mentioned algorithm.
posted by Kickstart70 at 8:40 PM on June 28, 2007


And I hate that the preview showed those on three lines but the post didn't show that.
posted by Kickstart70 at 8:40 PM on June 28, 2007

