That support vector thing was pretty cool, but wouldn't it necessarily be less effective than the Bayesian method? Or are they equivalent? posted by DU at 10:28 AM on June 27, 2007
I'm pretty sure Cockerham's name wouldn't make it through my spam filter. posted by salishsea at 11:50 AM on June 27, 2007
c 0 c k 3 r h 4 m posted by Mikey-San at 1:01 PM on June 27, 2007
DU: in that particular case, an SVM could well be more accurate, but I doubt it.
An SVM will find a pattern amongst heaps of training data in the tokens (I suppose these are words) that indicate spam, by putting them in million-dimensional spaces in thousands of combinations - much more powerful than a fairly simple one-dimensional Bayesian algorithm. It needs lots of data for that, ideally a dozen or two thousand messages classified by a human into spam and ham, ideally in the same proportion as they occur in reality. In theory, it should be more accurate.
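To make the "dimensions" idea concrete, here is a rough sketch, assuming tokens are whitespace-separated words and each distinct token gets its own axis. The weights are hand-set and purely hypothetical (a real SVM would learn them from the training messages), and all the function names are made up for illustration.

```python
from collections import Counter

def build_vocabulary(messages):
    """Give each distinct token its own dimension (column index)."""
    vocab = {}
    for text in messages:
        for token in text.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab

def vectorize(text, vocab):
    """Turn a message into a count vector; tokens not in the vocabulary are dropped."""
    counts = Counter(text.lower().split())
    vec = [0] * len(vocab)
    for token, n in counts.items():
        if token in vocab:
            vec[vocab[token]] = n
    return vec

def decision(vec, weights, bias):
    """Linear decision function w.x + b; positive means 'spam' in this sketch."""
    return sum(w * x for w, x in zip(weights, vec)) + bias

training = ["buy cheap pills now", "meeting moved to friday"]
vocab = build_vocabulary(training)

# Hypothetical weights set by hand, standing in for what training would produce.
weights = [0.0] * len(vocab)
for token in ("buy", "cheap", "pills"):
    weights[vocab[token]] = 1.0
for token in ("meeting", "friday"):
    weights[vocab[token]] = -1.0
bias = 0.0

print(decision(vectorize("cheap pills", vocab), weights, bias))
print(decision(vectorize("ch3ap p1lls", vocab), weights, bias))
```

With a real corpus the vocabulary runs to hundreds of thousands of tokens, which is where the very high dimension counts come from.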
What I can't see is how it would be more resistant to padding with random paragraphs lifted from non-spam corpora. Unknown tokens will be ignored by the SVM, since it can't generalise to them, so new spellings will fool it. If a joker 'unknown word' token is introduced into the training, it can still be made irrelevant by a large enough padding. Unless it gives such a weight to unknown words that no one can write neologisms, which is just silly. Maybe it should then be coupled with a whitelist.
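The padding problem can be sketched with a toy per-token scorer. This is a naive-Bayes-style log-odds sum, not an SVM, and the probabilities below are invented for illustration; the point is only that hammy padding drags the total down and unknown spellings contribute nothing.

```python
import math

# Hypothetical per-token spam probabilities, as if estimated from a training corpus.
spam_prob = {
    "pills": 0.95,
    "cheap": 0.90,
    "meeting": 0.05,
    "friday": 0.10,
    "report": 0.05,
}

def spam_log_odds(text, probs):
    """Sum the log-odds of the known tokens; positive leans spam, negative leans ham."""
    total = 0.0
    for token in text.lower().split():
        p = probs.get(token)
        if p is None:
            continue  # unknown token: ignored, so it cannot count against the message
        total += math.log(p / (1.0 - p))
    return total

plain = "cheap pills"
padded = plain + " meeting friday report meeting friday report"
respelled = "ch3ap p1lls"

for label, text in (("plain", plain), ("padded", padded), ("respelled", respelled)):
    print(label, round(spam_log_odds(text, spam_prob), 2))
```

The plain message scores positive, the padded one goes negative, and the respelled one scores exactly zero because none of its tokens are known.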
The article suggests using the messages themselves as tokens, which doesn't make sense to me - no two messages are the same, so the SVM would just not find any match to decide upon.
I have done some work with SVMs, but don't have a deep understanding of all their aspects, so maybe I'm missing something. Maybe someone more versed in AI can explain what the experiments are about. posted by Spanner Nic at 1:11 PM on June 27, 2007
I'm not buying the leap from "you can put asterisks in between each letter" to "you can put any symbol or letter in between the letters". Like I don't think "VtAGGReA" works. posted by smackfu at 2:06 PM on June 27, 2007
I'm not buying the leap from "you can put asterisks in between each letter" to "you can put any symbol or letter in between the letters". Like I don't think "VtAGGReA" works.
I don't buy it either, but your example is not an example of applying that principle. Also, your point is already made in the article. posted by jejune at 2:55 PM on June 27, 2007
Favorited because it uses properly formatted MLA citation of cockeyed.com. posted by chlorus at 3:37 PM on June 27, 2007
Horace Rumpole: Oops, I actually meant I was sure the Bayesian reasoning post had been on mefi before. Sorry. This was a good post, thank you. posted by Skorgu at 4:24 PM on June 27, 2007
...putting them in million-dimensional spaces in thousands of combinations - much more powerful than a fairly simple one-dimensional Bayesian algorithm...
Oh, I was imagining them to have the same number of dimensions. The Bayesian stuff we do at work has many inputs, was confusing me. posted by DU at 4:41 PM on June 27, 2007
The man from U.N.C.L.E. posted by Tube at 6:31 PM on June 27, 2007
\/ | /\ (+ |2 /|/ ! /-\ [, /2 I'.' l .'. I, 42 /!
Some are better than others, but none are listed in the mentioned algorithm. posted by Kickstart70 at 8:40 PM on June 28, 2007
And I hate that the preview showed those on three lines but the post didn't show that. posted by Kickstart70 at 8:40 PM on June 28, 2007
posted by Wizzle at 10:21 AM on June 27, 2007