I for one welcome our self-aware spam bot overlords.
January 31, 2005 10:57 PM

breaking CAPTCHAs. In this case, the programmers were able to use software they had already designed to analyze images of people.
posted by delmoi (30 comments total)
 
Quite impressive. If only they would use this knowledge for good...
posted by sourwookie at 11:14 PM on January 31, 2005


The authors of that page seem to think that any success rate greater than zero is an overall success, since the process is automated anyways, and it's easy for the computer to just try again. But wouldn't it be trivial for the server running the CAPTCHA to blacklist any IP that failed more than a couple of times? In that case, only a few attempts would be successful before the server dropped the banhammer.
posted by neckro23 at 11:20 PM on January 31, 2005


If only they would use this knowledge for good

if only they would spend their time learning how not to be worthless, bottom-feeding pricks.
posted by blendor at 11:46 PM on January 31, 2005


if only they would spend their time learning how not to be worthless, bottom-feeding pricks.

who is this fucking maniac?
posted by luckyclone at 11:54 PM on January 31, 2005


ah, that would be me. sorry, up too late, switching between screens and projects, and misinterpreted the article. comment retracted.
posted by blendor at 11:58 PM on January 31, 2005


Someone who can't tell the difference between CS researchers at a top university and spammers.
posted by Ethereal Bligh at 11:58 PM on January 31, 2005


someone who could tell the difference if he was paying attention. again, apologies.
posted by blendor at 12:00 AM on February 1, 2005


what have i just stepped into - the fucking maniac clubhouse?
posted by luckyclone at 12:02 AM on February 1, 2005


Ve alvays begin ze knowledge at zi institute!
posted by HTuttle at 12:06 AM on February 1, 2005


Don't worry about luckyclone. He just goes from post to post saying "fucking maniac" about everything.
posted by pracowity at 12:09 AM on February 1, 2005


Slashdot has an article about using 3D models for CAPTCHAs. Interesting.
posted by seanyboy at 12:09 AM on February 1, 2005


Someone who can't tell the difference between CS researchers at a top university and spammers.

I'm quite aware of the difference. My apologies if I came off as cynical.

This is indeed impressive work and I enjoyed reading it. It's just that in light of this thread I was expecting more sinister motives for this research.

I understand the parsing of visual info to be a CS holy grail of sorts and I won't begrudge anyone involved in that field.

Though there is no doubt that "worthless, bottom-feeding pricks" will have no trouble finding vexing uses for this work.
posted by sourwookie at 12:26 AM on February 1, 2005


Recognizing childlike drawings of common objects: guitar, house, bird, shoe, etc., would be easy for humans, yet very difficult for computers. Or you could be asked to define a property of one of a group of objects: "What color is the shoe?" "How many windows on the house?"
Going from image to property to text would be tricky.
posted by weapons-grade pandemonium at 12:38 AM on February 1, 2005


I'd be more impressed by captchas if saying their name out loud didn't make me sound completely retarded.

captcha!
posted by blacklite at 12:51 AM on February 1, 2005


Recognizing childlike drawings of common objects: guitar, house, bird, shoe, etc., would be easy for humans, yet very difficult for computers. Or you could be asked to define a property of one of a group of objects: "What color is the shoe?" "How many windows on the house?"
Going from image to property to text would be tricky.


I just recently took a class on computer vision. While the deriving-properties bit is, afaik, out of reach right now, the identifying-drawings scenario is quite doable if you have some idea of the scope of possible objects that would be drawn. It's essentially the same as recognizing distorted letters, except that with letters you're working with only 26 possible objects.
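
Roughly, the simplest version of that is plain template matching. A toy sketch in Python/NumPy, assuming the unknown drawing or letter has already been segmented and scaled to the same size as a small set of reference images (a serious attack like the one linked would use something much fancier, but the flavor is the same):

import numpy as np

def score(image, template):
    # Normalized correlation between the unknown image and one template.
    a = (image - image.mean()).ravel()
    b = (template - template.mean()).ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def classify(image, templates):
    # templates: dict mapping a label ("guitar", "house", or one of the
    # 26 letters) to a reference array with the same shape as `image`.
    return max(templates, key=lambda label: score(image, templates[label]))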
posted by juv3nal at 2:59 AM on February 1, 2005


I'm amazed that captchas still work as well as they do, given that this page has been around since Dec. 3, 2002, and their object recognition software longer than that.

I see that captcha.net is sort of keeping track of how the algorithms are doing, so I don't have to. Also, this article.
posted by sninky-chan at 5:03 AM on February 1, 2005


The /. thread seanyboy posted contains mention of a technique to get around CAPTCHAs in general by getting humans to decode them for you. Some porn sites apparently make users complete CAPTCHAs in order to continue viewing their content. But these CAPTCHAs are actually not from the porn sites but rather are being served up by webmail systems through which the porn site sends spam. Thus a simple pornographic carrot tempts horny humans into creating spam accounts for the machines. It's all apparently so easy if you take your clothes off.
posted by Songdog at 8:10 AM on February 1, 2005


The /. thread seanyboy posted contains mention of a technique to get around CAPTCHAs in general by getting humans to decode them for you.

That method seems to imply that there is no surefire solution based on distinguishing between humans and computers since in that case you are actually getting humans to tell you the answer.

But what about using cap-whatevers that are extremely context-sensitive such as "What is the name of the weblog you are visiting?" I suspect though that that advantage will only last a short while in this arms race...
posted by vacapinta at 9:51 AM on February 1, 2005


Highlighting is easily detectable, for instance through some of the non-standard JavaScript functions that IE provides.
posted by sonofsamiam at 10:54 AM on February 1, 2005


perhaps we should simply abandon this place and head for the SMTP servers co-lo'd in Zion.
posted by jungturk at 11:10 AM on February 1, 2005


Actually, something like odinsdream's idea would work fine if you tagged every word in a block of text (not in a form's blank) with a span, each with a unique class or ID, and then used CSS to apply colors. If you asked "which of the above words is blue?" the spambot would have to A) parse and understand the sentence; B) figure out that the above text was tagged for CSS hooks; C) load the stylesheet; D) parse the CSS and know that #009 means "blue". Not impossible, but tricky.
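
For instance, a rough sketch in Python of generating such a challenge (the class names, word list, and #009-style palette here are just made-up placeholders):

import random

WORDS = ["guitar", "house", "bird", "shoe", "window"]
PALETTE = {"blue": "#009", "red": "#900", "green": "#090"}

def make_challenge():
    # Wrap every word in its own span; a separate stylesheet colors one
    # of them, and the question asks which word got that color.
    color_name, css_value = random.choice(list(PALETTE.items()))
    answer = random.choice(WORDS)
    spans, rules = [], []
    for i, word in enumerate(WORDS):
        spans.append('<span class="w%d">%s</span>' % (i, word))
        rules.append(".w%d { color: %s; }"
                     % (i, css_value if word == answer else "#000"))
    html = " ".join(spans)
    css = "\n".join(rules)
    question = "Which of the above words is %s?" % color_name
    return html, css, question, answer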
posted by adamrice at 1:33 PM on February 1, 2005


This is an arms race between those who display text (limited to what human beings can read) and machine reading of text by spammers (with, perhaps, clever workarounds to get humans into the loop). What the researchers have done is figure out how to get computers to read existing CAPTCHAs. But CAPTCHAs can be made much more difficult for computers, and only slightly more difficult for humans, by treating each letter differently (a rough sketch follows the list):

* different background (lines)
* different font size
* different font
* different rotation (skew, up to 30 degrees in each direction)
* multiples of a character, stacked and slightly offset
* varying [and better] clutter (almost-connecting, slightly different lines, dots, etc.)
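
Something along those lines is straightforward to generate. A rough sketch with recent Pillow (the font path, sizes, and amount of clutter are just placeholder guesses):

import random
from PIL import Image, ImageDraw, ImageFont

def render_captcha(text, size=(400, 90)):
    # One glyph at a time: random font size and rotation per character,
    # then almost-connecting clutter lines over the whole image.
    img = Image.new("L", size, 255)
    x = 5
    for ch in text:
        font = ImageFont.truetype("DejaVuSans.ttf", random.randint(22, 34))
        glyph = Image.new("L", (48, 56), 255)
        ImageDraw.Draw(glyph).text((4, 4), ch, font=font, fill=0)
        glyph = glyph.rotate(random.uniform(-30, 30), expand=True, fillcolor=255)
        img.paste(glyph, (x, random.randint(0, 12)))
        x += glyph.width
    draw = ImageDraw.Draw(img)
    for _ in range(8):
        draw.line([(random.randint(0, size[0]), random.randint(0, size[1])),
                   (random.randint(0, size[0]), random.randint(0, size[1]))],
                  fill=random.randint(0, 120))
    return img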

Even if a computer is able to get a reasonably high percentage of cases correct, a spammer cannot afford large amounts of computer (CPU) time per attempt. So a CAPTCHA that can be read correctly (say) 10% of the time by spammer automation, but requires 30 seconds of CPU power per attempt (thus 5 minutes per successful attempt) is good enough to prevent automated spamming (except from zombie computers, which is another problem).

Finally, diversionary text (strike-throughs, several character-strings that are upside-down along with a note to ignore upside-down text; characters of different colors with a note to [say] input the red characters, and so on) can severely increase the cost of machine reading (if nothing else, by lowering the probability of success).

So I'd bet on the CAPTCHA generators rather than the CAPTCHA automated readers, if I had to pick a horse.
posted by WestCoaster at 1:39 PM on February 1, 2005


But wouldn't it be trivial for the server running the CAPTCHA to blacklist any IP that failed more than a couple of times?

neckro23, my thoughts exactly. Fail the test n times in m minutes and your IP is banned for a few hours....
It costs a little more server-side, but it would be worth it.
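
Something like this server-side would do it (a minimal sketch; the thresholds are just placeholders):

import time
from collections import defaultdict

FAIL_LIMIT = 3          # n failures...
WINDOW = 10 * 60        # ...within m = 10 minutes
BAN_SECONDS = 3 * 3600  # banned for a few hours

failures = defaultdict(list)   # ip -> timestamps of recent failures
banned_until = {}              # ip -> when the ban lifts

def record_failure(ip):
    now = time.time()
    failures[ip] = [t for t in failures[ip] if now - t < WINDOW] + [now]
    if len(failures[ip]) >= FAIL_LIMIT:
        banned_until[ip] = now + BAN_SECONDS

def is_banned(ip):
    return banned_until.get(ip, 0) > time.time()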


posted by login at 2:12 PM on February 1, 2005


odinsdream: document.selection.createRange().text seems to work. via msdn

adamrice: there's no need to parse CSS, the browser can already do those things. Code from IE or Mozilla can be called by a bot.

I feel that, ultimately, only challenges generated by a real human will be easily solvable by humans but not by computers.
posted by sonofsamiam at 2:32 PM on February 1, 2005


WestCoaster, the problem you leave out is that computers get faster and cheaper every year, and you have not suggested a way to keep increasing the processing cost to keep pace.
posted by billsaysthis at 5:13 PM on February 1, 2005


Just for the people saying to ban the IP address if you get the CAPTCHA wrong too many times: the hackers/spammers will have a list of thousands of proxies. The site kills one and they have plenty more to use.

I agree most captchas could be distorted more and people could still read them. The ones on the page are pretty basic. I think it's Yahoo Mail that has one that's pretty hard to read, and I've gotten it wrong, but I'm sure it's worth it.

It is all a numbers game, though: even something that can decode the captchas at 10% is good enough; just make 10 times more attempts!
posted by phyle at 5:22 PM on February 1, 2005


There is only one solution: ASCII art.
posted by kindall at 6:01 PM on February 1, 2005


I like vacapinta's too.
posted by Songdog at 6:07 AM on February 2, 2005


billsaysthis: I think you could reasonably increase the complexity of the hash calculation required for posting.

Think of this: User A wants to post on my blog. In order to do so, their web browser is given a chunk of data to hash using a specific algorithm. That algorithm and the data hashed can be chosen by me, and can be arbitrarily complex and computationally intensive. In order to post a comment, User A must return the appropriate hash value that I have already computed in a batch process, updating the list of hashes on a regular basis. This means that for User A, while they are typing out their comment, their browser is furiously computing my hash. It takes, say, 5 seconds to compute the hash; once the hash has been computed, the user can post a comment. This would be all but invisible to users, and tremendously time-consuming and inefficient for spammers, either requiring massive capital outlays for powerful computers or slowing down their zombie botnet.
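
A minimal sketch of that in Python (the client half would really run as script in the visitor's browser, and the hash function and iteration count are just placeholders):

import hashlib
import os

ITERATIONS = 2_000_000  # tuned so the hash takes a few seconds to compute

def stretch(data):
    # Iterated SHA-256: trivial to specify, deliberately slow to compute.
    digest = data.encode()
    for _ in range(ITERATIONS):
        digest = hashlib.sha256(digest).digest()
    return digest.hex()

def make_challenge():
    # Server side, done in batches: pick random data and precompute
    # the answer the browser will have to reproduce.
    data = os.urandom(16).hex()
    return data, stretch(data)

data, expected = make_challenge()
# The browser gets `data` with the comment form, computes stretch(data)
# while the user types, and submits it; the server just compares:
assert stretch(data) == expected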

The next step is Bayesian filtering of comments: User A comments and has the appropriate hash value, but the comment looks a lot like spam (say, it ranks 90% likely to be spam or above). That comment would be subject to a second test, a Turing test of some sort, as well as being flagged.
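
The filtering step could be as simple as word-level naive Bayes (a rough sketch; the per-word counts are assumed to come from comments already labelled as spam or not):

import math
import re

def spam_probability(text, spam_counts, ham_counts, n_spam, n_ham):
    # Combine per-word spam likelihoods (with add-one smoothing) into an
    # overall probability that this comment is spam.
    log_odds = math.log((n_spam + 1) / (n_ham + 1))
    for word in set(re.findall(r"[a-z']+", text.lower())):
        p_spam = (spam_counts.get(word, 0) + 1) / (n_spam + 2)
        p_ham = (ham_counts.get(word, 0) + 1) / (n_ham + 2)
        log_odds += math.log(p_spam / p_ham)
    return 1 / (1 + math.exp(-log_odds))

# Anything scoring 0.9 or above gets the second, human-facing test.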

Repeated failure of any of these tests would result in temporary ip-bannination.

I'm looking into making some sort of system like this for my blog, although I haven't really the free time to implement it these days.
posted by Freen at 8:10 AM on February 2, 2005


freen: I surf with Java disabled and don't use IE. How will you run the code on my machine? Not trying to kick your idea over - I think it's going in the right direction. Just not sure how you get the client side penalty without pissing off/losing a percentage of your audience...

posted by login at 3:11 PM on February 2, 2005



