Exact String Matching Algorithms
October 26, 2010 3:25 AM   Subscribe

Exact String Matching Algorithms - Source code for Boyer-Moore, Horspool and other string-matching algorithms, along with visualizations of their operation
posted by Blazecock Pileon (15 comments total) 61 users marked this as a favorite
 
I'd love to see the approximate string matching version of this.
posted by caelumluna at 3:35 AM on October 26, 2010


Boyer-Moore is worth looking at even if you have no interst in programming or algorithms, because, if you're stupid the way I am, it's one of those things that is so obvious -- but only in retrospect. Which makes it mind-blowing: you're in a small way, seeing the world differently than you did before you saw the Boyer-Moore algorithm.
posted by orthogonality at 3:56 AM on October 26, 2010 [8 favorites]


it's one of those things that is so obvious -- but only in retrospect.

I've heard this referred to as the "Genius from Mars" technique.

As an admin, the one real lesson I want developers to take from this page is string matching is expensive and must be avoided whenever possible. Boyer-Moore and Alpha Skip are very clever, but string matching is still a huge loss if you didn't need to do the string comparison at all. Or, if you will, the CPU hates when you LIKE it without reason.
posted by eriko at 5:37 AM on October 26, 2010


Super! Great post.
posted by Wolfdog at 5:58 AM on October 26, 2010


I like how the DNA nucleotide codes are used for the animated examples. In the late 1990s boom days of nucleotide sequencing I got a big leg up in my dissertation research by using stuff like BLAST to piece together expressed sequence tags and discover some human enzymes using string matching based on a recently identified mouse enzyme.
posted by exogenous at 6:35 AM on October 26, 2010


Although it sounds counter-intuitive, even GNU grep uses a form of Boyer-Moore matching. When it compiles the regexp to a DFA, it computes on the side a list of "must match" sub-keywords that are present in the regexp. For example if the regexp was "foo.*bar" then you know that any potentially matching line must contains the keywords "foo" and "bar". It then uses the extremely fast BM exact string matching algorithm against the whole file, not even bothering to consider lines yet. Only when it's found a potential match does it search out line endings that bookend the potential match and then call the full DFA regex matching engine on that line. In this way whole parts of the file can be eliminated from ever having to be considered, and is a reason why GNU grep beats the naive grep implementation where the file is split into lines and each line is applied to the DFA.
posted by Rhomboid at 7:14 AM on October 26, 2010 [10 favorites]


This is old school internet learning at its best.
posted by Free word order! at 7:17 AM on October 26, 2010 [3 favorites]


To be specific, these are searching for substrings within a larger search space, not testing two strings for equality.
posted by mkb at 7:32 AM on October 26, 2010


Speaking of DNA, where's the Burrows-Wheeler string searching algorithm? It's kind of new, but is used with all the new sequence data to match a ridiculous number of strings in time proportional only to the length of the query. It also is yet another way that techniques from compression provide a big advance in a different field. When you're only focused on reducing the number of bits to the bare essentials, you often end up finding out something new about that type of data.
posted by Llama-Lime at 7:36 AM on October 26, 2010 [2 favorites]


Speaking of DNA, where's the Burrows-Wheeler string searching algorithm?

Speculating here, but I suspect because the listed algorithms relate to finding exact matches whereas nucleotide algorithms intentionally allow for imprecise matches, i.e. mismatches. Then again, my understanding of algorithms is about on par with juggalo understanding of magnets.
posted by exogenous at 7:49 AM on October 26, 2010 [1 favorite]


Burrows-Wheeler actually excels at precise matching, and the gymnastics that have to be used for imprecise matching probably limit it's shelf life with DNA to only a few more years, because as sequence reads get longer there are more total number of errors.

That said, I kind of have a juggalo awe of the BWT, because I always find it magical the way that a string is turned into a seeming mishmash and then completely recovered. How the fuck does it work, indeed.
posted by Llama-Lime at 8:02 AM on October 26, 2010


Boyer-Moore and Alpha Skip are very clever, but string matching is still a huge loss if you didn't need to do the string comparison at all. Or, if you will, the CPU hates when you LIKE it without reason.

Seriously. I just found a huge optimization for our code where... there were two nested linear searches of an array of items, string comparing the name of each of them with the name of the desired one until the correct one was found.

Then there was a list of sub items and... it did it again. *facepalm*

I changed it to a generated dictionary and the several hundred lookups per web page view went from several hundred milliseconds (it wasn't a small list of items!) to 0.

At least I get to say I made a huge performance gain but GOOD GOD.
posted by flaterik at 1:10 PM on October 26, 2010 [1 favorite]


Either I'm stupid (or very tired) or these are incredibly terse and difficult explanations. I'm going to need to read the C code to understand these.
posted by Xezlec at 10:01 PM on October 26, 2010


Hmmm...

I would think that Python code would be a bit more instructive than C code, but hey, who's counting. Also, commenting in the code would go a long way.
posted by kaibutsu at 11:51 PM on October 26, 2010


Xezlec, I've got a degree in computer science and do a fair amount of optimization work. I was very interested to read these descriptions but they weren't accessible enough to get into with all the other stuff in my head these days. Maybe if I was going to be tested on it...
posted by flaterik at 6:33 PM on October 28, 2010


« Older I have made a decision.   |   Didn't predict that... Newer »


This thread has been archived and is closed to new comments