A Regular Expressions Sandbox
August 24, 2018 9:14 AM

Regular Expressions 101 is an online sandbox for writing and testing regular expressions. It supports PCRE, JavaScript, Python, and Go syntax, and has a well-designed interface for visualization, explanation, and performance evaluation.
posted by jedicus (32 comments total) 60 users marked this as a favorite
 
I fucking love that site. Any time I'm constructing a REGEX, I'm over there putting in concepts and seeing how they come out.
posted by sciatrix at 9:18 AM on August 24, 2018 [5 favorites]


An explanation of your regex will be automatically generated as you type.

Niiiiiiice.

Now have it generate a state machine diagram on the fly as well.
posted by tobascodagama at 9:25 AM on August 24, 2018


It describes pcre as "php". I must be even older than I thought :(

Nice app though. I'm mobile ATM and wasn't expecting such a slick result.
posted by merlynkline at 9:28 AM on August 24, 2018


For .NET-style regexes, there's Regex Storm.
posted by a snickering nuthatch at 9:34 AM on August 24, 2018


No Regex thread is complete without a mention of Regex Crosswords.
posted by Tell Me No Lies at 9:45 AM on August 24, 2018 [5 favorites]


For Ruby-style regexes, there's Rubular.
posted by Nossidge at 9:52 AM on August 24, 2018 [2 favorites]


I have a love-hate relationship with regex. While I'm writing it, it's a nightmare. When I'm done writing it and it works, and I look back on what I have created, I feel like I have written a magical spell that I have cast on the computer. I am a powerful witch, inscribing magical arcane runes! Fear me!!
posted by one of these days at 9:52 AM on August 24, 2018 [11 favorites]


This site has saved me so many times.
posted by OverlappingElvis at 9:57 AM on August 24, 2018


I don't ever write regex without running it through this site, with a bunch of different text strings in the bottom box. I also don't troubleshoot other folks' regex without plopping it in here along with typical app data to see if their regex is doing what they think it's doing.

Regex is such a fascinating and frustrating insight into stuff our brains do pretty well but that is hard to get machines to do well. I probably can't formulate a super-tricky example off the cuff, but something like "hey, can you highlight all image tags which have a data attribute and are within the third consecutive paragraph tag after a block-level tag..." is not hard to get most people to do properly but can be one of those "my regex is longer than the sample text" things.
posted by maxwelton at 10:05 AM on August 24, 2018 [2 favorites]


There's a bunch of regex helper web tools but this is the best one. I use it frequently.

If you're writing complicated regular expressions, make friends with the idea of multiline "extended" (aka verbose) regexps. That lets you format a complicated regexp in a sane, maintainable way. Combined with named capture groups you have something great.
                    # identify URLs within a text file
              [^="] # do not match URLs in IMG tags like:
                    #   <img src="http://...">
(?:http|ftp|gopher) # make sure we find a resource type
              :\/\/ # ...needs to be followed by colon-slash-slash
          [^ \n\r]+ # stuff other than space, newline, return is in URL
        (?=[\s\.,]) # assert: followed by whitespace/period/comma
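
In Python, that style is switched on with the re.VERBOSE flag. A minimal sketch (the pattern and test string here are simplified illustrations, not the full expression above):

    import re

    # Whitespace and comments are ignored under re.VERBOSE, so the
    # pattern can be laid out for human readers.
    url_re = re.compile(r"""
        (?:http|ftp|gopher)   # resource type
        ://                   # colon-slash-slash
        [^\s"<>]+             # then a run of non-space, non-quote characters
    """, re.VERBOSE)

    text = 'Docs at http://example.com/page and ftp://example.org/f.txt'
    print(url_re.findall(text))
    # ['http://example.com/page', 'ftp://example.org/f.txt']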
posted by Nelson at 10:08 AM on August 24, 2018 [12 favorites]


I have been using this for at least a year or two.

By far and away the best regex helper I have used.
posted by KaizenSoze at 10:14 AM on August 24, 2018


Yep, that's a purple link all right.

This very useful website runs on donations, but the $ button in the bottom left is pretty subtle, so if it's helped you out a lot like it has me, consider helping them out in turn.
posted by one for the books at 10:18 AM on August 24, 2018 [3 favorites]


Vim is my editor, and it's basically how I learned regex.

If you have the "hlsearch" and "incsearch" options on, Vim will evaluate your regex and highlight matches as you type it out. Perfect for learning.
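
(In a ~/.vimrc that's:)

    " live-highlight matches while a search pattern is being typed
    set incsearch
    set hlsearch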

Vim's syntax does differ slightly from standard regex, but this was a lifesaver for me.
posted by dcipjr at 10:37 AM on August 24, 2018 [6 favorites]


Vim's regex facility is very handy, but I can never remember what needs to be escaped, so I have to use a fair amount of trial-and-error to write them.
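
For what it's worth, with Vim's default 'magic' setting you escape +, (, ), and | but not . or *; 'very magic' mode (\v) gets much closer to PCRE. A quick comparison:

    /\d\+\(foo\|bar\)    " default 'magic': +, (, ), | need backslashes
    /\v\d+(foo|bar)      " 'very magic': roughly PCRE-style syntax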
posted by a snickering nuthatch at 10:58 AM on August 24, 2018


Earlier today, I had to salvage some text out of a gawd-awful old table-based HTML layout. My solution was to use pandoc for a quick-and-dirty conversion, and sed to nibble away what I didn't need. I think where people run into trouble with regex is trying to do too much in a single pattern. In this case, it was easier to write 10 simple expressions to cut away the junk.
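
The same approach sketched in Python (the cleanup patterns here are illustrative, not the actual ones): a list of small substitutions applied in order, instead of one monster regex:

    import re

    # Each pattern does one small job; together they nibble away
    # the junk, sed-style.
    cleanups = [
        (r"</?t[dr][^>]*>", " "),   # strip table cell/row tags
        (r"&nbsp;", " "),           # decode non-breaking spaces
        (r"[ \t]{2,}", " "),        # collapse runs of whitespace
    ]

    def clean(text):
        for pattern, repl in cleanups:
            text = re.sub(pattern, repl, text)
        return text

    print(clean("<td>one&nbsp;two</td><td>three</td>"))
    # -> " one two three "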
posted by GenderNullPointerException at 11:20 AM on August 24, 2018 [5 favorites]


Ah, but then you don't get to do the "one line of code"* Perl-scripter boast thing.

* line may be several hundred characters.
posted by Artw at 11:33 AM on August 24, 2018 [1 favorite]


Thank you, I'd lost this!
posted by lokta at 1:43 PM on August 24, 2018


Oh good, if you show "all tokens" you can get into all the weird non-capturing grouping operators and stuff that I can never remember. You program Perl long enough and the basics do get ingrained, but for the 5% of situations where the basics don't cover it I still have a bookmark in Camel 3 where they start to be explained. I bought it in 2000.

All it really needs is the coworker who knows more about Perl internals than nearly everybody, who can then explain in detail why you don't want to use a particular [[:thing:]] shorthand because of how it mishandles Unicode or numeric characters from other languages or whatever.
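
One concrete flavor of that gotcha, shown in Python 3 rather than Perl (same class of surprise):

    import re

    # \w is Unicode-aware by default in Python 3...
    print(bool(re.match(r"\w", "é")))            # True
    # ...but under re.ASCII it quietly stops matching accented letters.
    print(bool(re.match(r"\w", "é", re.ASCII)))  # False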
posted by fedward at 2:49 PM on August 24, 2018 [1 favorite]


As a learning exercise I wrote a PDF citation scraper that used a regex that ended up about 8 lines of 80 characters long. By the time I was done I had to relearn normal human speech from scratch. Fortunately I had the time because the scraper was that slow.
posted by srboisvert at 5:02 PM on August 24, 2018 [7 favorites]


I have a love-hate relationship with regex. While I'm writing it, it's a nightmare. When I'm done writing it and it works, and I look back on what I have created, I feel like I have written a magical spell that I have cast on the computer. I am a powerful witch, inscribing magical arcane runes! Fear me!!

I have a love-hate relationship with regex. While I'm writing it, it's a dream: I know exactly what I'm going for, and after a little trial-and-error, it works. When I'm long since done writing it, and I look back on what I have created because it needs to change a little bit to support some new use case, I feel like I have taken a shit in a gallon of milk and left it in the back of my closet to fester in an act of utter aggression against my future self.
posted by invitapriore at 6:05 PM on August 24, 2018 [6 favorites]


When I'm long since done writing it, and I look back on what I have created because it needs to change a little bit to support some new use case, I feel like I have taken a shit in a gallon of milk and left it in the back of my closet to fester in an act of utter aggression against my future self.

Yeah, my rule of thumb is, you can't tweak a regex expression. You have to kill it with fire and start again from scratch. Every fucking time.

It's funny you posted this link today. I've had that site open in a tab for the last two days while I've been trying to perfect a script that runs through a two-million-word file, finds acronyms that couldn't phonotactically be a real English word, and puts them in uppercase. I think I FINALLY finished it, so I can close the tab at last, so I rewarded myself by coming to Metafilter and here's what I found.
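
A drastically simplified sketch of that kind of pass, using "no vowels" as a crude stand-in for the real phonotactic test:

    import re

    # Treat any vowel-free token as something that can't be an
    # English word, and uppercase it. (Toy version of the idea.)
    no_vowels = re.compile(r"\b[b-df-hj-np-tv-z]{2,}\b", re.IGNORECASE)

    def shout_acronyms(text):
        return no_vowels.sub(lambda m: m.group(0).upper(), text)

    print(shout_acronyms("the html spec and the css grid"))
    # the HTML spec and the CSS grid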
posted by lollusc at 2:24 AM on August 25, 2018 [3 favorites]


Neat! I've found the way debuggex visualises the expressions to be very easy to follow.
posted by Tzarius at 3:56 AM on August 25, 2018


If you have the "hlsearch" and "incsearch" option on, Vim will evaluate your regex and highlight matches as you type it out.

Well you just made my life a little better
posted by PMdixon at 6:14 AM on August 25, 2018


I've had to use regexes in a couple of Oracle stored procedures in the last year. Both times I've left the room, spun around three times, spit, and cursed before pushing my changes.

So far nothing's broken.
posted by Mr. Bad Example at 6:28 AM on August 25, 2018 [1 favorite]


you can't tweak a regex expression

You can though! That's why up above I sang the virtues of multiline / verbose regular expressions and named capture groups. Compare these two regexps for matching input like 3:5 and 4.9:
(\d+)?(([\-:])(\d+)?)?

 (?P<start>   \d+   )?     # numbers
(  (?P<sep>   [\-:] )      # separator
   (?P<end>   \d+   )?     # numbers
)?                         # final sep+numbers optional
They match the exact same text. I sure would prefer to maintain the second instead of the first, though. The named groups make it crystal clear what the values you're extracting are, particularly in the code below, which can reference things like m.group('start'). The white space groups the components of the regex structurally. And the comments hold your hand all the way through. (Maybe a little too much in this case, but they are invaluable when matching something more complex than \d+.)
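
Compiled and used, it looks like this (Python flavor, as a sketch):

    import re

    range_re = re.compile(r"""
        (?P<start> \d+ )?          # numbers
        ( (?P<sep>  [\-:] )        # separator
          (?P<end>  \d+ )?         # numbers
        )?                         # final sep+numbers optional
    """, re.VERBOSE)

    m = range_re.match("3:5")
    print(m.group("start"), m.group("sep"), m.group("end"))  # 3 : 5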
posted by Nelson at 7:16 AM on August 25, 2018 [6 favorites]


It's hard to overstate how much surface area PCRE* has, compared to POSIX and other simpler regex flavors/dialects.

There's a lot of research about converting regular expressions to NFAs (non-deterministic finite automata) and then to DFAs (deterministic finite automata) -- converting the regex to a format where the computer can look at each input character once and update its state (its key into a table of what to match next), without spending time on any other decisions along the way. Many PCRE extensions make it much harder, or outright impossible, to do this conversion.
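
A toy illustration of that end state in Python: a hand-built DFA for the regex a*b, doing one table lookup per input character, with no backtracking and no other decisions along the way:

    # States: 0 = still reading a*, 1 = saw the final b (accepting).
    # Any transition missing from the table is a dead end.
    table = {
        (0, "a"): 0,
        (0, "b"): 1,
    }

    def matches(s):
        state = 0
        for ch in s:
            state = table.get((state, ch))
            if state is None:
                return False
        return state == 1

    print(matches("aaab"))  # True
    print(matches("aba"))   # False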

In the best case, the regex can still be converted to an NFA: they're non-deterministic because the regex engine needs to either choose between possible options (and if it hits a dead end, backtrack and try the others), or branch off and handle multiple different matches-in-progress at once, possibly coordinating a whole crowd.

Many of the extensions can't be expressed in an NFA at all, so the engine does something ad hoc instead. This can combine badly with other extensions and become a denial-of-service attack vector. (You can see this in the tool: it's not hard to write a regex that gets "engine error" or "catastrophic backtracking", because they're carefully limiting resource usage.)
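
The classic demonstration in Python (don't run this anywhere latency matters):

    import re

    # Nested quantifiers plus a match that can never succeed: the
    # engine tries every way of splitting the a's between the two
    # plusses, so each extra "a" roughly doubles the running time.
    evil = re.compile(r"(a+)+$")
    evil.match("a" * 27 + "!")   # already takes seconds in CPython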

A lot of these extensions could be handled easily by a parser -- they don't fit what automata theory supports at all, but PCRE is the ultimate triumph of scope creep over theory and good taste.

If you want to learn more, Russ Cox has a great series of posts about implementing regular expression matching, and in particular, why backtracking is dangerous. (I think many of the languages listed have switched over since; it's from 2007.)

* Also, PCRE stands for *Perl*-Compatible Regular Expressions; the "(php)" is probably just a clue for people looking for their regex flavor by language name, because that's how the other flavors are categorized.
posted by silentbicycle at 6:09 AM on August 26, 2018 [4 favorites]


I think many of the languages listed have switched over since; it's from 2007.

At least Java, Perl, Python, .NET, and PCRE still allow back-references, so they still have a backtracking algorithm as part of the matching. One could handle this by using backtracking only when it is needed, but I think this is not done. Instead, the suggestion is to use a specialized library for those cases (like RE2).
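
A back-reference is the canonical example of a feature no plain finite automaton can handle (sketch in Python):

    import re

    # \1 must repeat whatever group 1 happened to match, which a
    # fixed set of states cannot remember; hence the backtracking.
    doubled = re.compile(r"\b(\w+)\s+\1\b")
    m = doubled.search("it was the the best of times")
    print(m.group(0))  # the the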

Also, the general attitude seems to be something like "the user just has to know how the matching algorithms works and what is expensive" (that's how I understood e.g. Friedl's comments in and under this old blog post).

As a theoretical computer scientist, I find this attitude rather puzzling: Why not use the algorithm with performance guarantees, when most people don't even seem to use the features that require expensive backtracking? (Two years or so ago, I downloaded all examples of some huge library of user written regex, and only very few used back-references in a meaningful way. Capture groups to some degree, but those can be handled differently.)

On the other hand, part of my PhD was about repetition operators in regex, I already turned that connection into one successful grant, and I plan to add more grants and papers. So I won't complain at all. (Oh, and fun fact: PCRE is not Perl compatible anymore. Because why not.)
posted by erdferkel at 1:49 PM on August 26, 2018 [3 favorites]


Oh, nice! I was wanting something like this the other day when I was trying to convert a messy file list with varying naming conventions scraped off a website into something more neatly formatted. Though eventually I realised that by the time I've got a regex beyond about fifty characters, I'd much rather write it out in explicit functions.
posted by lucidium at 7:00 PM on August 26, 2018 [1 favorite]


As a theoretical computer scientist, I find this attitude rather puzzling: Why not use the algorithm with performance guarantees, when most people don't even seem to use the features that require expensive backtracking?

As Damian Conway put it, describing a different aspect of Perl: “Programmers who know what they’re doing get extra functionality. Programmers who don’t know what they’re doing get punished. Either way, they deserved it.”
posted by um at 8:41 PM on August 26, 2018 [1 favorite]


Also, the general attitude seems to be something like "the user just has to know how the matching algorithms works and what is expensive" (that's how I understood e.g. Friedl's comments in and under this old blog post).

There's a lot about Perl that's that way. Perl has a bias towards programmer speed, not execution speed. Also, TMTOWTDI, so who am I to go "correcting" somebody else's code? Often even the slow algorithm is fast enough, since computers keep getting faster. In the circles I've traveled in you might get a question about "why did you do it this way and not this way*" but there's a whole lot of Perl running in environments where a little bit of execution speed is inconsequential. If somebody needs that extra bit of execution speed they're either going to ask or do the research, and then they can find out about the optimized way of doing things and determine if it's right for them.

* Also sometimes there's not even a real performance gain in doing it a specific, more idiomatic way. One question I got at my last gig was why I habitually used C style for loops instead of idiomatic Perl foreach with $_. My answer was partly habit and partly that I'd done too much work in other languages without an equivalent foreach, but the C style for loop is one bit of syntax that works in every language and it costs me less to remember it. The Perl foreach is more idiomatic, but it's not like there are any Perl programmers reading my code who can't also read the C style, and the performance should actually be identical.

As a theoretical computer scientist, I find this attitude rather puzzling: Why not use the algorithm with performance guarantees, when most people don't even seem to use the features that require expensive backtracking? (Two years or so ago, I downloaded all examples of some huge library of user written regex, and only very few used back-references in a meaningful way. Capture groups to some degree, but those can be handled differently.)

I think portability and habit reinforce each other. If I want a high performance algorithm that includes just a subset of functionality, I can either include it and make sure the module is where I need it (and part of my deployment scripts, and so on) or I can just avoid using the slow stuff. I don't know that I've ever needed to write a backtracking regex, so I probably could easily switch to whatever subset worked faster, but what do I gain from that? Most of the time the standard is fast enough and I'm not writing the expensive stuff anyway. So, yes, I could get a minor performance gain, but I also would then have another thing to deploy and test. Until I know I need that gain, I'm not going to add the complexity.
posted by fedward at 8:22 AM on August 27, 2018


The problem is that a naive user risks creating a ReDoS vulnerability without realizing it. Given the risk of denial of service, I would rather use a safe alternative.
posted by Monday, stony Monday at 9:14 PM on August 28, 2018




This thread has been archived and is closed to new comments