Corect Seplling Mayd Eesy
April 10, 2007 6:45 PM Subscribe

How to write a spelling corrector in twenty lines of Python.
posted by alms (44 comments total) 33 users marked this as a favorite

I am so jealous. It's the explanation that makes this great, not the code. It doesn't matter if you don't know a lick of programming -- read through how it works.

He had me at "collection of Sherlock Holmes stories."
posted by waldo at 7:01 PM on April 10, 2007

Wizardry.
posted by paladin at 7:02 PM on April 10, 2007

Wow. I'm strongly reminded of my boss who, even considering where he works and at what level he is it, is a certifiable genious. Or maybe I'm just dumb. In any case, I read his code and I'm like....wha? (It doesn't help that he uses variables named "x" and "tmp".) So I ask for an explanation and he gives me one. It only gives me the most superficial understanding. So I struggle with it for a few more hours and ask a really dumb question which is answered in Martian. Eventually things become clear enough that I am impressed by his brilliance at thinking of an approach that would never have occurred to me in a million, billion years.

And now that I've read more of the page, I'm tempted to say how obvious it is to apply error-correcting code type processing (which I think this is, though he doesn't say it) to spelling, but after I more or less understand what my boss is doing I find his ideas obvious too.
posted by DU at 7:03 PM on April 10, 2007

DU writes "Eventually things become clear enough that I am impressed by his brilliance at thinking of an approach that would never have occurred to me in a million, billion years."

You may find the Boyer-Moore string search algorithm fun. Or Knuth's video lectures.
posted by orthogonality at 7:06 PM on April 10, 2007 [1 favorite]

Well, the spell checker in Firefox fucking sucks, so it can't be that easy.

Google's, on the other hand, is actually better then MS words', and I think it's aided by how often people click on the "did you mean" link, it weighs in how common queries are (so it can get some strange word that's been queried a lot of times) and so on, as well as having the web itself as a huge corpus of misspellings.

The result of his program is this:

So on the development set of 270 cases, we get 80% correct in 17 seconds (a rate of 16 Hz), and on the final test set we actually do a little better, 87% correct (but at only 14 Hz). Therefore I can declare victory: the program is doing better than 80% and better than 10 corrections/second.

That's really, really bad. Imagine if 20% of all of the miss-spelled words were guessed incorrectly. That would actually suck. It would mean one in five words couldn't be guessed, and that's even worse then firefox for me. (I'd say that it guesses right 90% of the time). Word and google must be better then 98% for me. And I'm a terrible speller.

And 10 corrections per second? Imagine if Google's search system could only perform 10 operations per second. That would mean Google could only handle 864,000 searches per day. I think it does a bit more then that.

80% is pretty easy to achieve in AI. 80% is also almost completely worthless for any real world application.

Anyway, the effort is in coming up with the formula, not expressing it in code. It's likely that many machine learning algorithms can be expressed in just a few lines of code.
posted by delmoi at 7:09 PM on April 10, 2007 [1 favorite]

Can you make it recognise different types of trees from quite a long way away?
posted by Smedleyman at 7:12 PM on April 10, 2007 [2 favorites]

Err, I probably said that a bit more strongly then I needed too. But this is really not very impressive to anyone who's taken a course in machine learning. And 80% really is pretty poor performance.
posted by delmoi at 7:12 PM on April 10, 2007

Can you make it recognise different types of trees from quite a long way away?

The lurch!
posted by maxwelton at 7:24 PM on April 10, 2007

Did you mean: the larch
posted by Acetylene at 7:38 PM on April 10, 2007 [4 favorites]

.1 seconds to perform a correction != 10 corrections/second or 864,000 searches a day

The idea is that a machine, when given a word, can correct it in .1 seconds. Certainly Google has thousands and thousands of machines and it's very likely that each machine can handle multiple corrections concurrently.

A comparative analogy would be cashiers in a store. If it takes a cashier 1 minute to check a customer out, it is the total number of cashiers that would ultimately determine the stores throughput. A tenth of a second to correct the spelling of a word seems pretty fast for a relatively naive algorithm.
posted by christonabike at 7:40 PM on April 10, 2007

Delmoi: all your other criticisms aside, I'm pretty sure that google has many more (more than one!) operational units capable of performing a given algorithm.
posted by Matt Oneiros at 7:43 PM on April 10, 2007

zxzxzxzyyy
posted by Kwine at 7:48 PM on April 10, 2007 [2 favorites]

Google's, on the other hand, is actually better then MS words', and I think it's aided by how often people click on the "did you mean" link, it weighs in how common queries are (so it can get some strange word that's been queried a lot of times) and so on, as well as having the web itself as a huge corpus of misspellings.

This is what Norvig means when he suggests improving the language model.

And 10 corrections per second? Imagine if Google's search system could only perform 10 operations per second. That would mean Google could only handle 864,000 searches per day. I think it does a bit more then that.

This is irrelevant, since the corrections can be (and almost certainly are being) generated statically beforehand, and then just looked up at runtime.
posted by gsteff at 7:49 PM on April 10, 2007

A comprehensive index not just of all words in English but of all misspellings in English?

Whoa.
posted by Richard Daly at 7:57 PM on April 10, 2007

Google would often come court the CS majors at [my previous-institute-of-higher-learning], giving technical talks ending with, "oh, and by the way, we're hiring summer interns." (Also free pizza)

Anyhow, during one of those "how the magic happens" talks, we were informed that the spelling correction is the most taxing part of the process of bringing you your search results quickly. Pretty neat I guess, and ever since then I've always felt just a little bit guilty when using google as a spell checker.
posted by kaytwo at 8:03 PM on April 10, 2007

It's not that scary. The english version of Wictionary currently claims @350,000 words. If the average length of a misspelling is 8 characters, an average of 9 possible corrections are stored per misspelling, and the misspellings are about the same length as their corrections, you'll have around 30mb of data (less if it's stored in a normalized database). The amount of data balloons if you're trying to find corrections for sequences of words rather than individual words, as Norvig describes in his 4th potential optimization, but regardless, for an operation like Google, space will be much cheaper than processing power.
posted by gsteff at 8:14 PM on April 10, 2007

matt?
Matt?
MATT?
MATT?
MATtHEW HAUGHEY WHERE IS OUR SPELL CHECKER???!!!
posted by caddis at 8:39 PM on April 10, 2007

see, if we had one I might not have misspelled your name (well at least not have slipped in that lower case "t"). Please, please bring back the spell checker which makes us all look more erudite than we really are.
posted by caddis at 8:42 PM on April 10, 2007

+1 cadiss
posted by staggernation at 8:46 PM on April 10, 2007

That was very illuminating.
posted by autodidact at 9:09 PM on April 10, 2007

The idea is that a machine, when given a word, can correct it in .1 seconds. Certainly Google has thousands and thousands of machines and it's very likely that each machine can handle multiple corrections concurrently.

Most likely, but it still seems like a huge waste of time, when you're going to be handling billions of transactions a day. Of course it can be optimized, but it's going to take more then 20 lines of code to do that.

That said, I just realized that this is the guy who wrote my textbook so perhaps I should just shut up.
posted by delmoi at 9:09 PM on April 10, 2007

Bayesian statistics can be a controversial technique, depending on how it's used.

The Bayesian approach calculates a posterior probability, or what statistician Ronald Fisher called the "reverse probability" from multiplying a "normalized" likelihood with the prior probability of the null hypothesis, or Bayes' theorem:

P(θ|X) = P(X|θ) P(θ) / P(X)

where θ refers to determining the parameters (mean, variance, etc.) of a null hypothesis, given the empirical, or observed data set ("X").

As a basic example, Bayes' theorem allows us to state the probability of the fairness of a given coin pulled from a bag of coins, both before and after any tosses, if we make reliable assumptions about the fairness of the coins within the bag, prior to taking any out.

A frequent complaint about the Bayesian approach is that it makes subjective, sometimes contentious assumptions about the prior nature of the data being observed. The source of this philosophical contention in the use of Bayesian over classical frequentist testing stems from where (or, perhaps, with whom) these prior probability distributions originate.

So-called "non-informative" priors, such as the Jeffreys' prior, express vague or general information about the parameter, to try to make as few "subjective" assumptions about the data as possible.

Informative priors, such as those used in this application, use previous experience or information to set parameter values in the prior: e.g., a simple guess of the expected temperature at noon tomorrow could be calculated from today's temperature at noon, plus or minus normal, day-to-day variance in observed noon-time temperatures.

So-called "conjugate priors" are used when the prior distribution takes on the same form as the posterior distribution and when the mean and variance are usually dependent, and are therefore often used in analysis of empirical data.

The "empirical Bayes" approach was introduced by Herbert Robbins as a way to infer how accident-prone someone is, given the observed fractions of accidents already suffered by the larger population. The objectivity of this testing stems from using observed, "empirical" data to generate informative priors used in Bayesian inference.

In the case of empirical data found in the English language, we would use Bayes' theorem to evaluate the probability of different spellings for a particular word, given observed spellings of that word across numerous works of literature. Throw in Beowulf to Shakespeare to Pynchon and you'll have a "profile" of the language — but depending on the spell-checker you want to build, maybe it would be better to leave Beowulf out? Or for Google, it could use its snapshot of the browsable web that it thinks is in English. If enough sites on the Web spell something wrong, then Google would probably prefer to direct its users to use search results aligned with the mispelling, as opposed to the dictionary word. This is where subjectivity and uncertainty can creep in.

Many empirical Bayesian techniques have been applied in various areas within the field of systems biology, which are data-rich and analysis-poor. In particular, Brad Efron at Stanford is one of the big names in this field, and has an excellent introduction to Bayesian thinking over here. Other papers of his on the subject of empirical Bayesian testing can be found here.
posted by Blazecock Pileon at 9:20 PM on April 10, 2007 [6 favorites]

btw, Peter Norvig is currently the Director of Research at Google.

My favorite Norvig essay remains "Teach Yourself Programming in 10 Years"

on preview: and speaking of ten years, I think it'll take me that long to parse Blazecock Pileons post. Thanks for the background info!
posted by gwint at 9:24 PM on April 10, 2007

The real constraint here was not accuracy or speed per se but programming efficiency.

In an environment where processing power is generally a more than suitable replacement for elegance in script writing, it's nice to see an example of super efficient code. It also illustrates why Python is for the hardcore.

Anyway, the effort is in coming up with the formula, not expressing it in code.

Yes and no. No doubt effort is made in coming up with the search methods and often the software implementation is straightforward, but there can also be lots of clever genius in the source coding.
posted by three blind mice at 10:11 PM on April 10, 2007

It's threads like this that make me both love and hate metafilter. On any given day, I can believe that I will be at least as smart as one in ten people that I talk to. On occasion, I find that the ratio is even lower and I am as smart as many of the people that I meet.

And I feel good about myself. I think, "Hey, you got that obscure reference to the Battleship Potemkin, and then you accurately identified those three kinds of unusual animal photos that were stumping your friends, and then when all seemed lost, you were able to troubleshoot that router while explaining to someone else why the .40 S&W was superior to both the 9mm and the .45 ACP."

And I feel smart.

Then I come here and see my weakness. I don't understand programming. Never have.

I've been lucky enough to always have friends and employees that I could turn to, but it has long been the bane of my existence that the true hacker arts are a mystery to me.

And threads like this lay that pain bare for all to see.

Damn/ Thank you metafilter. Damn/ Thank you all to hell.
posted by quin at 10:19 PM on April 10, 2007

there can also be lots of clever genius in the source coding.

Indeed. Raise your hand if you could have done this -- efficiently or not -- in 20 lines of code like Norvig just did.
posted by event at 10:20 PM on April 10, 2007

Fascinating technique. Also buried in the article is a reference to Google Research's 5-deep n-gram data on the English language, which is found here. Make sure to tell your favorite linguists and statisticians about this!
posted by aeschenkarnos at 10:25 PM on April 10, 2007

This is cool, but Google's incredible corpus is a far bigger factor in its amazing spell-checking ability than the algorithms it uses. Google can spell-check my surname, which, according to various white-pages sites, is possessed by only about 50 people in the US. I don't care what kind of algorithm Office or aspell or ispell use; their dictionaries will never even begin to compare to that which Google has assembled.
posted by IshmaelGraves at 11:24 PM on April 10, 2007

Ishmael, did you follow aeschenkarnos' link? For 150 bucks you can get a copy of a nice chunk of that corpus. Sure it's not complete, but it's 5-deep.

24GB of compressed text! Amazing.

Also, this post rocks, as I've just modified a Python based blog editor last night, and was impressed at how easily I could understand the code.

I don't know why threeblindmice says Python is for the Hardcore. I mean, surely it can be utilized by the hardcore, but it's really a great simple language, from what little I've played with it. Fuck, at least I understood it hella easier than my first (and also probably, my last) Perl script...

Yay Python. Yay smart Google engineers. Yay computers!
posted by symbioid at 11:30 PM on April 10, 2007

Teach a man to spell, and he'll be able to spell; teach a man to write a spellchecker, and nobody else will bother learning to spell.
posted by davejay at 11:31 PM on April 10, 2007 [3 favorites]

I don't know why threeblindmice says Python is for the Hardcore.

Without starting any the your-favorite-programming-language sucks, it seems to me that people who know their shit often demonstrate their prowness using Python.
posted by three blind mice at 11:59 PM on April 10, 2007

tbm, Yes Python programmers are often more educated than Visual Basic or PHP programmers. :) But this is true for most langauge communities which are (a) general purpose and (b) used mostly under unix. Perl people are equally well educated.

But here your really seeing the beauty of the functional programming style, not the specific langauge. Yes again programmers who appreciate functional programming are another cut above the rest.
posted by jeffburdges at 2:10 AM on April 11, 2007

Just as mathematicians see grace and elegance in maths and mathematical expressions, so there are the same potentials in code.

My futile attempts at code are more akin to a drunken, three-legged elephant trying to scale a wall: lots of trying to climb the wall, then just breaking through it by (getting a snippet somewhere else) and not cleaning up. A bloody mess even I can't wade through if I leave it for a week.
posted by flippant at 3:16 AM on April 11, 2007

Python is the distillation of many thousands of man-years of using lesser tools - it is the response to the flaws on those tools. As such, it is very powerful without sacrificing simplicity, and would be my first suggestion to anyone learning how to program. The fact that it is fast (directly based on C), complete (large choice of components) and stable (where you keep the horses) makes it viable and fun to use even at high level programming, which is why it is so popular in examples such as this one.
posted by CautionToTheWind at 6:03 AM on April 11, 2007

This is freaking awesome.

But here your really seeing the beauty of the functional programming style, not the specific language.

I mostly agree. However I would argue that Python's versatility also lends itself to elegant code in non-functional/language processing contexts. Joe Schmoe (me) can create some runnable pseudo-code, while someone with the genius of Norvig can create some truly beautiful algorithms.

A purely functional language like Haskell might be prettier for this, but not by much. Not only that, but purely functional languages are pretty inaccessible to the beginner. Not so with Python.

/goes back to writing Ruby
posted by anomie at 10:42 AM on April 11, 2007

I think the best part about this thread is the random and potentially purposeful spelling mistakes in the commentary.

If folks are so enamoured by the spell checking capabilities of Google, for example, folks should install the Toolbar, which has a simple spell-check button attached to the same query engine that corrects your spelling when you type in "Furry Pron".
posted by thanotopsis at 10:43 AM on April 11, 2007

Or, you could just use... Firefox. Not that it's spell checker is all that... But, it's got built-in spell check for forms.
posted by symbioid at 11:09 AM on April 11, 2007

Unfortunately, the Firefox spell checker can't tell you the difference between "it's" and "its."
posted by Acetylene at 12:35 PM on April 11, 2007

import collections

model = collections.defaultdict( lambda : 1 )

nifty trick. files away for future reference
posted by Fezboy! at 1:25 PM on April 11, 2007

Reminds me of this great quote told to Joel Spolsky in 2005, describing the gulf in thinking between Google and Microsoft:

A very senior Microsoft developer who moved to Google told me that Google works and thinks at a higher level of abstraction than Microsoft. "Google uses Bayesian filtering the way Microsoft uses the if statement," he said. That's true. Google also uses full-text-search-of-the-entire-Internet the way Microsoft uses little tables that list what error IDs correspond to which help text. Look at how Google does spell checking: it's not based on dictionaries; it's based on word usage statistics of the entire Internet, which is why Google knows how to correct my name, misspelled, and Microsoft Word doesn't.

posted by MetaMonkey at 3:32 PM on April 11, 2007

Blazecock: I take it your a med student or doctor or something?

In computer science Bayesian analysis is the way to go, because it vastly, vastly reduces the amount of computation needed. If you're spellchecking thousands of words a second, you just don't have time to do anything more thorough.
posted by delmoi at 6:29 PM on April 11, 2007

Or, you could just use... Firefox. Not that it's spell checker is all that... But, it's got built-in spell check for forms.

Yes, and it sucks. :P
posted by delmoi at 6:31 PM on April 11, 2007

The Joel Sposky quote is another nail in the coffin. Microsoft is the walking dead: no longer really in the game, because it hasn't yet figured out the game has even changed.

Ever since learning of Bayesian theory I've come up with dozens of ways to use such a tool to add significant useful functionality to all sorts of everyday tasks. Most of life is quite predictable.
posted by five fresh fish at 8:28 PM on April 11, 2007

Microsoft is the walking dead.
posted by gemini at 11:52 PM on April 11, 2007

« Older The Spirit of Truth | Stem cell research: Natural Born Killers? Newer »

This thread has been archived and is closed to new comments

MetaFilter

Corect Seplling Mayd Eesy
April 10, 2007 6:45 PM Subscribe

Tags

Share

Corect Seplling Mayd Eesy April 10, 2007 6:45 PM Subscribe

Tags

Share

Corect Seplling Mayd Eesy
April 10, 2007 6:45 PM Subscribe