Weekend means filesearch comparisons and a drink (drink not provided)
September 23, 2016 7:06 PM

From grep, ag, git grep, ucg, pt, sift comes a New Challenger: ripgrep. In this article I will introduce a new command line search tool, ripgrep, that combines the usability of The Silver Searcher (an ack clone) with the raw performance of GNU grep. ripgrep is fast, cross platform (with binaries available for Linux, Mac and Windows) and written in Rust.
We will attempt to do the impossible: a fair benchmark comparison between several popular code search tools. Specifically, we will dive into a series of 25 benchmarks
posted by CrystalDave (30 comments total) 43 users marked this as a favorite
 
Informative discussions on hackernews, /r/rust, and /r/programming.
posted by a snickering nuthatch at 8:09 PM on September 23, 2016 [1 favorite]


Obsessive beanplating from an apparently highly-skilled technical person. Doubleplusgood!
posted by spacewrench at 8:27 PM on September 23, 2016 [1 favorite]


When I came across this earlier, I got about 20% in before I checked to make sure that this treatise is like twice as long as the only thing I've written that I ever refer to as a "book" in conversation.

Bravo to the author. People should strive for this level of documentation. It's a drag that so many more "important" projects don't even make the attempt.
posted by brennen at 8:34 PM on September 23, 2016 [1 favorite]


(It's also true that grep seems to be particularly rich literary nerd soil for some reason.)
posted by brennen at 8:35 PM on September 23, 2016 [1 favorite]


Also, that hacker news thread is a model of friendly cooperation between developers of competing tools.

Good work, everybody. Now I'm going to bed before I accidentally read the news and everything sucks again.
posted by brennen at 8:39 PM on September 23, 2016 [2 favorites]


Not on homebrew yet? More like homeboooooooo
posted by Going To Maine at 8:42 PM on September 23, 2016


This is relevant to my interests. Or will be as soon as somebody writes an Emacs integration. As it stands, ag is already fast as hell and there are good Emacs integrations.
posted by edheil at 8:56 PM on September 23, 2016 [1 favorite]


Going To Maine, you can add it to your homebrew with a one-liner given in the first link. I mean, it still just downloads the binary, cause not everybody's going to want to install Rust just so they can compile it from source....
posted by edheil at 9:01 PM on September 23, 2016


I was drawn in by the promise of fast Unicode matching, then a little put out to hear that it only works with ASCII-compatible encodings.

UTF-8 != Unicode. I'm very open to an argument that it's common enough that it's worth optimizing for specially, sure. But the comparison with other tools with full Unicode support felt a little unfair.
posted by brett at 9:16 PM on September 23, 2016 [4 favorites]


brett, yeah, that does seem like it really limits things.

Which of the other tools do better with Unicode support, and what do they do? UTF-16? 32?
posted by edheil at 10:01 PM on September 23, 2016


heh, I'm surprised to see this on MeFi, I thought it'd be too technical for the front page.

About unicode: UTF-16 is used in some filesystems' metadata, in some programming languages' in-memory strings… but does anyone ever use it in text files? I've seen legacy encodings like CP1251 and KOI8-R. And UTF-8. That's it.
posted by floatboth at 1:34 AM on September 24, 2016


From the first sentence of the post I totally thought this was going to be about Jean M. Auel.
posted by No-sword at 3:56 AM on September 24, 2016 [1 favorite]


> In other words, if you like fancy regexes, non-UTF-8 character encodings or decompressing and searching on-the-fly, then ripgrep may not quite meet your needs

The harshest fetish remains unfulfilled.
posted by srboisvert at 4:40 AM on September 24, 2016 [4 favorites]


> From the first sentence of the post I totally thought this was going to be about Jean M. Auel.

Like this?
posted by farlukar at 5:34 AM on September 24, 2016 [2 favorites]


GNU grep will support basically any encoding you want to throw at it, by setting the LC_CTYPE environment variable. It's never explicitly stated but the documentation suggests it works simply by decoding to Unicode code points and then matching on those. It makes a few references to how the current locale determines the definition of various character classes.

The ripgrep author seems to agree. Near the bottom, in the discussion of its subtitles_no_literal test, it says:
> Now, in all fairness, GNU grep’s locale and encoding support far exceeds what rg supports. However, in today’s world, UTF-8 is quite prevalent, so supporting that alone is often enough. More to the point, given how common UTF-8 is, it’s important to remain fast while supporting Unicode, which GNU grep isn’t able to do.
I realize UTF-8 support is going to be plenty enough for most Western programmers (i.e., almost all the ones discussing it on these sites) but I'm less sure about the rest of the world. Around the time we all started declaring UTF-8 the winner for world's preferred encoding, uptake of Unicode itself was still slow in Japan because of concerns around Han unification. But that was a few years ago. I haven't seen more recent information about what encodings are popular in Asian countries today.
posted by brett at 5:45 AM on September 24, 2016


if nothing else, there are loss-free conversions from every other unicode format to utf-8, and there are a lot of formats out there. For unicode alone we have UTF-1, UTF-7, UTF-9, UTF-EBCDIC, UTF-16, and UTF-32.

UTF-8 is normal on *nix systems, and UTF-16 is the Windows standard.
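
For anyone curious what "convert to UTF-8 first, then search" looks like, here is a minimal sketch in Rust using only the standard library. It assumes little-endian UTF-16 input with an optional byte order mark; the function names are made up for illustration, and this is not a claim about how grep or ripgrep handle encodings internally.

```rust
/// Decode UTF-16LE bytes into a Rust String (which is always UTF-8).
/// Returns None for odd-length input or invalid surrogate sequences.
/// (Hypothetical helper, named for this example only.)
fn utf16le_to_string(bytes: &[u8]) -> Option<String> {
    if bytes.len() % 2 != 0 {
        return None;
    }
    let units: Vec<u16> = bytes
        .chunks_exact(2)
        .map(|pair| u16::from_le_bytes([pair[0], pair[1]]))
        .collect();
    // Drop a leading byte order mark if there is one.
    let units = match units.first() {
        Some(&0xFEFF) => &units[1..],
        _ => &units[..],
    };
    String::from_utf16(units).ok()
}

/// Encode a &str as UTF-16LE bytes, standing in for a file written on Windows.
fn string_to_utf16le(text: &str) -> Vec<u8> {
    text.encode_utf16().flat_map(|unit| unit.to_le_bytes()).collect()
}

fn main() {
    let file_bytes = string_to_utf16le("größe: 42\nsize: 7\n");
    let text = utf16le_to_string(&file_bytes).expect("valid UTF-16LE");
    // Once the text is UTF-8, an ordinary substring search works.
    for line in text.lines().filter(|line| line.contains("größe")) {
        println!("{}", line);
    }
}
```

The round trip loses nothing for well-formed input, which is the point being made above; malformed input (a lone surrogate, a stray trailing byte) is rejected rather than guessed at.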
posted by idiopath at 6:02 AM on September 24, 2016


> if you like fancy regexes

...then get the hell back in the pentagram so I can banish you properly.
posted by Mr. Bad Example at 6:23 AM on September 24, 2016 [10 favorites]


Rust is a delightful programming language, with an even more delightful community. Super chill, helpful, inclusive. If you ask a dumb noob question on IRC, experts and even core team devs will jump in and help you. (also, here are the signs on the bathrooms at a recent conference).

It's a non-garbage-collected language, speed competitive with C/C++ (see benchmarks game, which at least is suggestive even if it is not everything). It borrows its type system and many of its design patterns from functional languages like ML/Scala. Much of the time you can treat it like a high level functional language.

The coolest thing about Rust, though, is that the type system is set up in such a way that most possible memory errors and concurrency errors will be caught at compile time. It should be nearly impossible to segfault or use-after-free in a Rust program, even though memory management is completely deterministic and totally under the control of the programmer. Likewise you won't be able to have data races in concurrent code. (There is an "unsafe" keyword if you need to do something dangerous that these checks would prevent you from doing -- there are sometimes "false positives" where legitimate code is seen as doing something dangerous. More often when this happens to me I was about to make a very bad mistake, though.)
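
To make the data race point concrete, here is a small sketch (the function name and the numbers are invented for the example). It shows the shape the compiler accepts for a counter shared between threads; the comments note what it would reject.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Increment one shared counter from several threads and return the total.
/// (Hypothetical example function.)
fn parallel_count(threads: usize, per_thread: u64) -> u64 {
    // Arc gives shared ownership across threads; Mutex makes mutation exclusive.
    // Capturing a plain `let mut counter = 0u64` by mutable reference in
    // several spawned threads does not compile, so the unsynchronized version
    // of this program cannot be built in safe Rust.
    let counter = Arc::new(Mutex::new(0u64));

    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..per_thread {
                    *counter.lock().unwrap() += 1;
                }
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }

    let total = *counter.lock().unwrap();
    total
}

fn main() {
    println!("{}", parallel_count(4, 1000));
}
```

The lock is not free at run time, of course; the compile-time part is that forgetting it is a type error rather than a heisenbug.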

It has a package manager/build tool called cargo that is just as easy to use as something like npm.

If you want to try out ripgrep, you should head over to https://www.rustup.rs/ and install rust. Then just "cargo install ripgrep" and you're good to go.

The author of ripgrep has another great tool for dealing with csv files called xsv. "cargo install xsv" if you want to check that out.
posted by vogon_poet at 7:01 AM on September 24, 2016 [6 favorites]


> It should be nearly impossible to segfault or use-after-free in a Rust program

That sounds like a challenge.
posted by octothorpe at 7:45 AM on September 24, 2016 [5 favorites]


> That sounds like a challenge.

Rust does have "hold my beer and watch this" sections (boringly called "unsafe"), in which you can do exactly that, but the language imposes strict rules on how you can use stuff so this should be a safe claim. Those rules also exclude a lot of totally legal and useful things (hence the need for "unsafe"), but most of those can be worked around.

The trick is not so much that it can do this, but that it can do this without any run-time overhead. All the validation is done at compile time.
posted by It's Never Lurgi at 8:15 AM on September 24, 2016 [2 favorites]


> but does anyone ever use it in text files?

Windows! UTF-16 used to be the standard text encoding in Windows. I think they finally ditched it for Windows 10.
posted by a mirror and an encyclopedia at 10:05 AM on September 24, 2016


On one hand, this is great. The writeup is fantastic and it looks like the software was thoughtfully engineered. I'm happy this exists. Huzzah for great documentation!

On the other hand, "grep is too slow" isn't a phrase I've ever used and the temptation to be lazy about things that don't matter to me is strong.
posted by eotvos at 10:52 AM on September 24, 2016


> ...then get the hell back in the pentagram so I can banish you properly.

More of my life than I really care to think about has been spent wallowing in the Turing tarpit that is fancy regular expressions, and these days I watch for them in other people's code the way you watch for rattlesnakes in the southwestern brush, but their availability in a pattern matching tool is probably a reasonable criterion for evaluation.
posted by brennen at 11:07 AM on September 24, 2016 [1 favorite]


> That sounds like a challenge.

my impression from hanging out in rust IRC is that if you did manage to get safe rust code to segfault, it would be treated as a code red high-priority language bug and fixed ASAP. i know next to nothing about PL theory but my understanding is that at least theoretically the type system should completely guarantee safety. also they have runtime array bounds checks which was seen as a reasonable compromise for security. you need to use "unsafe" to avoid them.
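
A rough illustration of the bounds-check trade-off, using standard slice methods (`get` and `get_unchecked`); the wrapper functions are just names picked for this sketch.

```rust
/// Checked lookup: returns None instead of reading past the end of the slice.
fn element_checked(data: &[u32], i: usize) -> Option<u32> {
    data.get(i).copied()
}

/// Sum a slice while skipping the per-access bounds check.
/// Opting out of the check requires an `unsafe` block.
fn sum_unchecked(data: &[u32]) -> u32 {
    let mut total = 0u32;
    for i in 0..data.len() {
        // SAFETY: the loop range guarantees i < data.len().
        total += unsafe { *data.get_unchecked(i) };
    }
    total
}

fn main() {
    let data = [10u32, 20, 30];
    println!("{:?}", element_checked(&data, 1));
    println!("{:?}", element_checked(&data, 7));
    println!("{}", sum_unchecked(&data));
}
```

Plain indexing (`data[i]`) with an out-of-range `i` panics at run time rather than touching memory outside the slice, which is the compromise described above. In practice an iterator (`data.iter().sum()`) is the idiomatic way to write the sum without reaching for `unsafe` at all.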

(the guarantees come at the expense of a learning curve and occasional "oh COME ON i just want to do this one reasonable thing". i've also heard that you need to develop stockholm syndrome with the rust compiler. it's not trying to punish you, it's for your own good.)
posted by vogon_poet at 11:48 AM on September 24, 2016 [2 favorites]


FWIW, if you're on a mac, and do a fair amount of grepping, you should probably install GNU grep from Homebrew -- in many cases, it's significantly faster than the BSD grep that comes bundled with OS X.

Be warned that Homebrew will alias GNU Grep to ggrep to avoid conflicts with system utilities that depend on the behavior of BSD Grep.

(Many of the other GNU utils are also available on Homebrew, and are typically more mature than their Mac OS counterparts)
posted by schmod at 11:59 AM on September 24, 2016 [3 favorites]


> On the other hand, "grep is too slow" isn't a phrase I've ever used and the temptation to be lazy about things that don't matter to me is strong.

Oh man, you haven't run grep recursively against super large directory structures then. It's a dog compared to Silver Searcher or even Ack. Excited to give ripgrep a try.
posted by mcstayinskool at 1:55 PM on September 24, 2016


The other thing about ack (which has to get the lion's share of credit for this whole category of tool existing) is that it's in Perl, and as such gives you access to actual Perl regexps, which must bid fair to be the single most feature-complete and user-friendly regex implementation in existence. (User friendly is, you know, relative.)

Anyway, yeah, when I first installed it, ack was moderately lifechanging. It's not like I'm ever likely to stop using grep, but there's definitely a place for more smarts.
posted by brennen at 4:20 PM on September 24, 2016


> Much of the time you can treat it like a high level functional language.

This really depends on what you mean by "high level functional language". It certainly does have a nice type system, with sum/product types and sophisticated inference and lots of other niceties. But defining persistent data structures is a pain: you have to use a lot of Rc or Arc or the like, precisely because of the approach to memory management. And for similar reasons higher-order functions aren't super convenient, and returning non-Boxed functions is not very flexible (different functions being returned technically have different types, so you can't do if a { |x| x + 1 } else { |x| x + 2 }, even though those both look like they should be u32 -> u32 or whatever the default integer type is).
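
For reference, a sketch of the Boxed workaround, assuming u32 arguments (the function name `pick` is invented here). Each closure keeps its own anonymous type, but both can be hidden behind the same trait object.

```rust
/// Return one of two closures behind a common boxed trait-object type.
/// (Hypothetical example function.)
fn pick(a: bool) -> Box<dyn Fn(u32) -> u32> {
    if a {
        Box::new(|x: u32| x + 1)
    } else {
        Box::new(|x: u32| x + 2)
    }
}

fn main() {
    let f = pick(true);
    let g = pick(false);
    println!("{} {}", f(1), g(1));
}
```

The cost is a heap allocation and a dynamic call per closure, which is the inflexibility being complained about: the unboxed version has no single type to put in the return position.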
posted by kenko at 8:37 PM on September 24, 2016 [2 favorites]


Link for the closure example in my comment. If you comment out the definition of bar, it will compile.
posted by kenko at 8:41 PM on September 24, 2016


I remember trying ack, and not being impressed. The .gitignore support is interesting (or, more generally, ignoring unversioned files), but it's pretty hard to get my fingers to forget the whole "find -name *.c/cpp/java/js | xargs grep foo". Big paradigm.
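
For comparison, here is roughly what that find-and-grep habit amounts to when written out as a program: a minimal Rust sketch with the standard library only. It does a literal substring match, skips symlinks and non-UTF-8 files, and has no .gitignore handling; `search_tree` is a name invented for this example and is not how any of the tools above are implemented.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Walk `root` recursively and return (path, line number, line) for every
/// line containing `needle` in files whose extension is listed in `exts`.
fn search_tree(
    root: &Path,
    exts: &[&str],
    needle: &str,
) -> io::Result<Vec<(PathBuf, usize, String)>> {
    let mut hits = Vec::new();
    let mut pending = vec![root.to_path_buf()];

    while let Some(dir) = pending.pop() {
        for entry in fs::read_dir(&dir)? {
            let entry = entry?;
            let path = entry.path();
            let file_type = entry.file_type()?;

            if file_type.is_dir() {
                pending.push(path);
            } else if file_type.is_file() {
                let wanted = path
                    .extension()
                    .and_then(|ext| ext.to_str())
                    .map_or(false, |ext| exts.contains(&ext));
                if !wanted {
                    continue;
                }
                // Files that are not valid UTF-8 are silently skipped.
                if let Ok(text) = fs::read_to_string(&path) {
                    for (i, line) in text.lines().enumerate() {
                        if line.contains(needle) {
                            hits.push((path.clone(), i + 1, line.to_string()));
                        }
                    }
                }
            }
        }
    }
    Ok(hits)
}

fn main() -> io::Result<()> {
    let exts = ["c", "cpp", "java", "js"];
    for (path, line_no, line) in search_tree(Path::new("."), &exts, "foo")? {
        println!("{}:{}:{}", path.display(), line_no, line);
    }
    Ok(())
}
```

Everything the dedicated tools add (ignore files, regexes, parallel directory walking, memory maps) is layered on top of a loop like this one.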
posted by k5.user at 8:56 AM on September 26, 2016



