The R Project for Statistical Computing
February 15, 2010 4:07 PM   Subscribe

R is quickly becoming the programming language for data analysis and statistics. R (an implementation of S) is free, open-source, and has hundreds of packages available. You can use it on the command-line, through a GUI, or in your favorite text editor. Use it with Python, Perl, or Java. Sweave R code into LaTeX documents for reproducible research.

Good places to get started: official manuals and FAQ, Quick-R, R help, and the R Wiki. (R translations exist for MATLAB users and Octave users.) Handy resources: RSeek and the R reference card (PDF). See also intros to R for psychology and must-have R packages for social scientists. There's even some books on R.

Check out some neat graphs (with R source code). Popular graphics packages for R include lattice, ggplot2, and RGL or rggobi for more complex data visualization.

Melt down and recast your data with the reshape package.

Many R GUIs are available, including R Commander, JGR, Rattle for data mining, and even an R GUI Generator. Or use R through text editors: Emacs Speaks Statistics, a vim plugin, SciViews-K for Komodo, and Tinn-R. The Omega Project has some other interesting R interfaces.

Happy data crunching!
posted by parudox (114 comments total) 227 users marked this as a favorite
 
I love R.

That is all.
posted by Jimbob at 4:10 PM on February 15, 2010 [1 favorite]


All the best pirate statisticians swear by it
posted by nervousfritz at 4:14 PM on February 15, 2010 [18 favorites]


R.

Hard to Google. Effective as any statistician could want.

Although, I must say, awfully late to the anti-aliasing game, when it comes to graphs.
posted by tmcw at 4:16 PM on February 15, 2010


Although, I must say, awfully late to the anti-aliasing game, when it comes to graphs.

This is true for the default graphics device - just output to PDF and it looks beautiful.
posted by Jimbob at 4:18 PM on February 15, 2010
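[A minimal sketch of the trick Jimbob describes: send the plot to the PDF device instead of the default screen device. The filename is illustrative.]

```r
# Vector output, so lines come out smooth regardless of the screen device.
pdf("myplot.pdf", width = 6, height = 4)
plot(rnorm(100), type = "l", main = "Smooth lines in PDF")
dev.off()  # close the device to finish writing the file
```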


My R experience:

Compile, compile, compile.

Troubleshoot, troubleshoot, troubleshoot.

...6 hours later...

It's alive!

Now what the hell do I do.

Reboot in XP

Load Minitab.

...10 minutes later...

I'm done!
posted by 517 at 4:29 PM on February 15, 2010


In my haste, I almost posted aspersions that R has dynamic scope, which I find terrifying.

I went and found a pdf file which states differently. That said, it does have some interesting scoping rules (walk up enclosing scopes, then walk down a search list of all loaded libraries, in which user load order matters).

So, I guess I find the scoping rules of R terrifying, just for a different reason than the one I thought motivated my horror.

That said, some of my friends love the hell out of this creation.
posted by enkiwa at 4:32 PM on February 15, 2010


Yay, R! So cool to see this on the blue, and thanks for including so many helpful links!

IMO, it can be a very steep learning curve, even for programmers, due to its truly funky syntax (<- for an assignment operator, WTF?) and documentation written mostly by and for academic statisticians. But the more time you spend with and, the more you discover -- and the amount of stuff there is to discover in the world of R is truly astonishing.
posted by treepour at 4:34 PM on February 15, 2010


R is the one thing I use for 6+ hours every day. Nice solid post.

Also, ggplot2 and I have known each other (almost biblically) for several months now. We plan to name our first son Akaike.
posted by special-k at 4:35 PM on February 15, 2010 [5 favorites]


Guess I'm not gonna get laid for a while after that last comment.
posted by special-k at 4:37 PM on February 15, 2010 [12 favorites]


s/with and,/with it/
posted by treepour at 4:37 PM on February 15, 2010


R is awesome. So is Sweave. I did almost my entire PhD thesis, and my last major peer-reviewed manuscript in Sweave. Need to change the dataset used for the entire paper? No problem, just change one line, type make and the whole document is remade, all the figures are adjusted, all the numbers referred to in the text are corrected, and you're done.

Searching for R with Google is hard. Use Rseek, linked above. Yes, the learning curve is steep. But once you've learned a few things you can do some very complicated stuff very quickly.
posted by grouse at 4:45 PM on February 15, 2010 [2 favorites]
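[A minimal sketch of the workflow grouse describes: a Sweave (.Rnw) file is LaTeX with embedded R chunks, so figures and inline numbers are regenerated on every build. File and variable names here are hypothetical.]

```latex
\documentclass{article}
\begin{document}
<<echo=FALSE>>=
dat <- read.csv("experiment.csv")  # change this one line to swap datasets
fit <- lm(y ~ x, data = dat)
@
The slope estimate is \Sexpr{round(coef(fit)[2], 3)}.
<<fig=TRUE, echo=FALSE>>=
plot(y ~ x, data = dat)
abline(fit)
@
\end{document}
```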


Does anyone in the industry still use S-PLUS or has R eaten their lunch?
posted by Rhomboid at 4:46 PM on February 15, 2010


So, I guess I find the scoping rules of R terrifying, just for a different reason than the one I thought motivated my horror.

I definitely feel that it's probably not a programming language for people who are already seriously into programming languages, because a lot of stuff it does is kind of awkward and weird. Although I do like that you can define default values in function definitions ie.
myfunction <>

And the <>real programming languages have fits.

But it seems to work. The fact that I can, in about four lines of code, load in data, run a GAM model on it and plot the results, makes me very happy.

posted by Jimbob at 4:46 PM on February 15, 2010
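[Roughly the four lines Jimbob means, sketched with hypothetical file and column names; assumes the mgcv package is installed.]

```r
library(mgcv)                                   # provides gam()
dat <- read.csv("field_data.csv")               # load the data
fit <- gam(abundance ~ s(temperature), data = dat)  # fit a smooth term
plot(fit)                                       # plot the fitted smooth
```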


517: "Compile, compile, compile.

Troubleshoot, troubleshoot, troubleshoot.

...6 hours later...
"

Installing programs with *nix is vastly easier than with Windows if you use a package manager.
posted by idiopath at 4:47 PM on February 15, 2010


Damn my trying to be tricky with my tags. What I tried to say was it's very nice that you can do:

myfunction <>real programmers have a fit. Anyway. Back to work.
posted by Jimbob at 4:49 PM on February 15, 2010


I'll shut the hell up now. My inability to use HTML entities makes anything I have to say ignorant and worthless.
posted by Jimbob at 4:50 PM on February 15, 2010 [3 favorites]


Heh, from the first article:
“I think it addresses a niche market for high-end data analysts that want free, readily available code," said Anne H. Milley, director of technology product marketing at SAS. She adds, “We have customers who build engines for aircraft. I am happy they are not using freeware when I get on a jet.”
That's the sound a marketroid makes just before getting hit by a locomotive.
posted by Malor at 4:50 PM on February 15, 2010 [19 favorites]


it's probably not a programming language for people who are already seriously into programming languages, because a lot of stuff it does is kind of awkward and weird

Actually, I worry about it becoming the first programming language for some people because it will rot their brain in a big way. Still love it, but it's worrisome.
posted by grouse at 4:52 PM on February 15, 2010


This post is a great way to out all teh R geeks.
posted by special-k at 4:54 PM on February 15, 2010 [2 favorites]


Also, ladies, if it helps my case any I am using R at a brewery down the road from my office. They have a recession monday all day happy hour.
posted by special-k at 4:57 PM on February 15, 2010 [4 favorites]


I'm surprised Anne H. Milley is still spouting that shit after this fiasco, where bugs in S were found to have generated bogus results in a number of air pollution studies.
posted by Jimbob at 4:58 PM on February 15, 2010 [1 favorite]


"Installing programs with *nix is vastly easier than with Windows if you use a package manager."

I was. Go try to get a version of R set up with a few different GUIs to choose from, and do it in the same amount of time it takes to use a Windows installer. You won't. Especially when R, and most of the R GUIs, weren't in the normal repositories at the time I tried to set it up.

The point of my post is that R, like most things open source, is not friendly to the new user.
posted by 517 at 4:58 PM on February 15, 2010


I should've pointed out that R is a functional programming language - which may contribute to difficulties for those not used to functional programming.
posted by parudox at 5:07 PM on February 15, 2010


517: on Debian, and presumably Ubuntu, it looks like installing the metapackage "r-recommended" should install an X-Windows version with a whole bunch of stuff.

In the package manager, if you search for 'statistical', that will give you a very long list of prepackaged R libraries and utilities. Searching for R ends up being a bit over-broad, but 'statistical' seems to be a good keyword.

I can see these things available in both 'testing' and 'stable', so it's been in Debian, at least, for several years.

I don't know a damn thing about R or what you actually needed, but there's a ton of stuff all ready to go, once you find the right keyword.
posted by Malor at 5:07 PM on February 15, 2010


Let's say I am good at math but bad at computers and I want to run R on my (OSX 10.6) Mac. I don't know what python, perl, or java are. I'm comfortable with working from a command line (e.g. this is how I use Maple and MAGMA.) What do I do?
posted by escabeche at 5:10 PM on February 15, 2010


517: "Go try to get a version of R set-up with a few different GUIs to choose from, and do it in the same amount of time it takes to use a windows installer. "

I just did it, since I don't have it installed on this computer. The download took much longer than the install did, closely followed by the time needed for the fonts to install. And the install wasn't even interactive, all I had to do was wait for it to finish. This is on debian stable.
posted by idiopath at 5:11 PM on February 15, 2010


TheWhiteHat <- ("Loves him some R")
posted by The White Hat at 5:11 PM on February 15, 2010 [1 favorite]


I use it, but I've yet to warm to it. The main problem is that it just takes forever, in two senses.

First, when writing my own stuff or using something moderately arcane, I tend to need to google something every few hours; googling R is a real pain, not just because "R" as a search term is useless (RSeek helps a bit with that), and not just because it has always been useless and therefore help forums have not grown up around similar seekers, but because what help forums there are tend to be dominated by low-social-skill statisticians whose objective is often to "help" with the minimum possible advice. When compounded with man files that are usually the bare minimum with at most a very few largely unexplained examples, figuring out what in the end turns out to be an easily-solved stumbling block can be an hour-long exercise in frustration. For whatever reason, other statistical software (like Stata, which I also use) is not just easier to search for help for, but also tends to have more actually helpful help when you do find it.

The other component of "takes forever" is that, as a high-level interpreted language, R is dog slow for complicated calculations. If you're trying to write a complex but low-level program, my experience suggests that R can often be 100 to 1000 times slower than C. Again, it's well worth it for high-level or prepackaged stuff, but it can be a bit frustrating to realize that your cool statistical package written in some other language needs to be rewritten in R for widespread adoption, and as a result will probably run 100 times more slowly.
posted by chortly at 5:11 PM on February 15, 2010 [2 favorites]
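[The slowness chortly describes bites hardest in explicit loops; the usual mitigation is to lean on vectorized operations, which dispatch to compiled code. A small sketch:]

```r
x <- runif(1e6)

# slow: an explicit interpreted loop
s <- 0
for (i in seq_along(x)) s <- s + x[i]^2

# fast: the vectorized equivalent, computed in compiled code
s2 <- sum(x^2)
```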


I tend to think of software like a bicycle.

Sure, you can go get a nice-looking bike down at the local shop with a big cushy seat, and which is adjusted so that you can put your heels flat on the ground without getting off of the saddle every time you reach a stoplight... And then you can be confounded by the terrible ergonomics of those choices within a couple weeks, and go back to driving the car to Wal-Mart every time you need a quart of milk, thus avoiding the travails of painful knees and chafed thighs.

Or you can go with the road bike (perhaps with panniers), with its relatively hard, narrow seat and long reach to the pedals and the ground. In the short term, it takes a hell of a lot more getting used to starting and stopping and steering, but eventually it's faster and far more user-friendly than the cruiser. Suddenly you're using the bike for everything, and never bother to renew your driver's license...
posted by kaibutsu at 5:12 PM on February 15, 2010 [6 favorites]


escabeche: "What do I do?"

probably darwinports is going to be the easiest route.
posted by idiopath at 5:12 PM on February 15, 2010 [1 favorite]


I'm working my way through the new O'Reilly book R in a Nutshell right now! I have a very thin programming background (some IDL and even less Python), and things seem to be clicking.
posted by nowoutside at 5:14 PM on February 15, 2010


(I don't make any claims about R, though, as I don't use it, though I've been half-heartedly meaning to for a while... I just don't deal with statistics very much at all, unless you mean like permutation statistics, in which case I can tell you very many things that have nothing to do with statistics as it's usually understood. I do use linux exclusively these days, though, and use Sage (the open-source alternative to mathematica and maple) in my research.
posted by kaibutsu at 5:15 PM on February 15, 2010


I also don't claim to always be perfect about closing my parenthetical statements, which probably goes a long way towards explaining my deficiencies as a programmer.)
posted by kaibutsu at 5:16 PM on February 15, 2010 [6 favorites]


R is very pretty, but I'm too invested in Stata.
posted by scunning at 5:18 PM on February 15, 2010


Is there something like Sweave for Matlab...? Or, does R generally have the random/arcane functions you can find in Matlab toolboxes? Make my life easier, please.
posted by zeek321 at 5:21 PM on February 15, 2010


Guess I'm not gonna get laid for a while after that last comment.

I guarantee you that somewhere out there is a frustrated biologist who will love you forever if you can show them how to change the axis tick marks.
posted by penguinliz at 5:24 PM on February 15, 2010 [7 favorites]
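[For that frustrated biologist, the usual base-graphics recipe: suppress the default axis, then draw your own ticks with axis(). Data and tick positions here are made up.]

```r
x <- 1:20
y <- rnorm(20)
plot(x, y, xaxt = "n")            # draw the plot with no x axis
axis(1, at = seq(0, 20, by = 5))  # side 1 = bottom; ticks every 5 units
```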


Here's some interesting discussion on the merits of various stats packages from Feb 2009. Unsure if anything has changed on the memory limitations of R in the last year.

I'm not conversant in enough stats languages to defend R syntax, but it's much less painful than SAS code.
posted by benzenedream at 5:25 PM on February 15, 2010


Regarding R's memory limitations - not really. Even 64-bit implementations run into memory problems because of the use of 32-bit integers in some Fortran code that's involved in R. That's the big downside of R, for my purposes... "Failed to allocate vector of size 1.3gb". But I usually manage to get around it, somehow.
posted by Jimbob at 5:30 PM on February 15, 2010


This is perhaps a good time to introduce scipy, which has much of R's functionality but runs 80 billion times faster, on account of it being mostly made of Fortran (but having a friendly Python interface). Matplotlib is pretty useful too. Did I mention I like Python?
posted by Popular Ethics at 5:33 PM on February 15, 2010 [3 favorites]


Yeah, the biggest problem with R syntax AFAICT is that it's unconventional. I mean that both in the sense that it violates the expectations you've built up using other languages (the aforementioned "->" and "<-" as assignment operators are the worst offenders here, although of course "=" is still available if you want it) and in the sense that it lacks the sort of internal naming conventions that make a language easier to learn. For each built-in function with a two-word name, you have to remember whether those words are camel-cased or all lowercase, whether one's abbreviated, whether there's a period separating them — there's simply no pattern. That's not a serious language design problem, but it's deeply irritating on a day-to-day basis.

Well, and yeah, those memory allocation glitches. I tend to use Weka for serious machine learning these days, because that's where the memory problems seemed to come up the most for me.
posted by nebulawindphone at 5:34 PM on February 15, 2010
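[nebulawindphone's naming complaint is easy to illustrate with real base-R function names; there is no single convention to memorize:]

```r
# All of these exist in base R -- note the mix of styles:
read.csv     # dot-separated, lowercase
rowSums      # camelCase
is.numeric   # dot-separated again
Sys.time     # capitalized prefix
nchar        # abbreviated, all lowercase
```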


Thanks, parudox! I use R on a daily basis for my Linguistics labs, but I hadn't had the chance to explore additional packages and GUIs.
posted by HopperFan at 5:43 PM on February 15, 2010


In general, I've found the issue with R to be the issue with any open source project that does more than one complex thing or has obscure extensions. The core is fairly solid, and there are a lot of cool beginner resources and add-ons for moderately complex operations. However, the more obscure the feature, the more you're expected to know exactly how to apply it. So your basic linear models are fairly easy, but when you do something obscure, you'd better have a solid grounding in the math or you won't know what assumptions and tests are needed.

Whereas commercial programs tend to be written assuming it's your first time doing something. Their help is focused on the beginner user who doesn't know the math. They tend to be simplistic and make it hard to do anything complex or obscure. And their extensibility sucks. Seriously, I've written SAS macros that took 12 lines and several hours of research to do something that in R is one simple line of commands.

This makes R an excellent tool for creating your own analysis or for doing moderately complex statistics. However you better know the math if you want to do something interesting.

Frankly this makes sense, since people tend to write add-ons for stuff they know how to do and want to do regularly, or make it easier for others to do.

All in all R is a useful tool, and a free one. But, especially in extensions, the ability of commercial software to pay people to document how to do stuff and make it as simple as possible can undercut it.
posted by gryftir at 5:46 PM on February 15, 2010


"I just did it, since I don't have it installed on this computer."

Okay, now get it to run a multiple regression in R commander.
posted by 517 at 5:48 PM on February 15, 2010


It's somehow surprising that no one's done R before on Metafilter. Good post.

Oh, also, no one's mentioned Crantastic yet; it's a nice alternate interface for searching through (and tagging) CRAN packages.
posted by thisjax at 5:49 PM on February 15, 2010


And since nobody's mentioned it yet: Bioconductor, the singing, dancing genomics package collection.
posted by benzenedream at 5:56 PM on February 15, 2010


What do people do when their datasets get too large, or the computations take too long? Just leave for the day and come back 24 hours later?
posted by enkiwa at 6:08 PM on February 15, 2010


What do people do when their datasets get too large

Think carefully about my data, rearrange it and try again. Although packages like biglm are useful.

or the computations take too long? Just leave for the day and come back 24 hours later?

Or 2 weeks later, in the case of some things I've done. But that's what you get for dealing with big data.
posted by Jimbob at 6:10 PM on February 15, 2010
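[A sketch of the biglm approach Jimbob mentions: fit on one chunk, then update the fit with further chunks so the whole file never has to sit in memory. File name, column names, and chunk size are all hypothetical.]

```r
library(biglm)
chunk1 <- read.csv("big.csv", nrows = 100000)
fit <- biglm(y ~ x1 + x2, data = chunk1)

# skip the header line plus the 100000 rows already read
chunk2 <- read.csv("big.csv", skip = 100001, nrows = 100000,
                   header = FALSE, col.names = names(chunk1))
fit <- update(fit, chunk2)  # fold the new chunk into the fit
summary(fit)
```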


What's the deal with that <>The Art of Computer Programming and Cormen et al.'s Introduction to Algorithms use it, but are there any programming languages that use it? Maybe something to do with the punch card era?
posted by scose at 6:30 PM on February 15, 2010


Oops, something bad happened with my HTML. I was talking about the <- operator for assignment.
posted by scose at 6:32 PM on February 15, 2010


The R mailing list would make a pretty good drinking game: a sip for every time someone gets a terse, Calvin-Coolidge-esque reply; a sip for every time Brian Ripley or Peter Dalgaard posts to the list; two sips for every time someone gets referred to "a basic statistics class"; a shot for every time the entire text of someone's reply is "?command"...

Anyway, great post. I thought I would contribute another link because I really haven't seen anything like it before: a bunch of shiny Web apps that use R to dynamically render graphs.

> What do people do when their datasets get too large

?filehash

(couldn't resist, sorry)
posted by en forme de poire at 6:37 PM on February 15, 2010 [1 favorite]


Yes R doth truly rock.
posted by jeffburdges at 6:38 PM on February 15, 2010


?filehash

(couldn't resist, sorry)

Go on.

> ?filehash
No documentation for 'filehash' in specified packages and libraries:
you could try '??filehash'
posted by a robot made out of meat at 6:50 PM on February 15, 2010


scose: "What's the deal with that <-"

The idea is that equality should not be mistaken for assignment, and in programming assignment is the act of storing something for later reference, so an arrow makes sense for that.

In algol family languages, "=" where the programmer meant "==" is still a common enough bug to lead to silliness like if( 3 == x ).
posted by idiopath at 6:52 PM on February 15, 2010 [2 favorites]
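[The forms idiopath is contrasting, side by side:]

```r
x <- 3    # standard assignment ("x gets 3")
3 -> x    # the arrow works in the other direction too
x = 3     # also allowed at top level, though convention favors <-
if (x == 3) print("equality is a separate operator entirely")
```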


R has been on my fiddle-with list for a while now. Interesting that the assignment operator is Knuth's "gets" (x <- y meaning "x gets y"); I thought when I learned about it that "gets" was a nice notation, but I didn't until recently realize there was a popular language that used it. Makes me wonder what other interesting language knobblies there are.

Sweave looks terribly useful. Is Sweave (perhaps with make) smart enough to do caching, or does it re-run your data analysis on every compile?
posted by fantabulous timewaster at 6:55 PM on February 15, 2010


Fantabulous, I present the cacheSweave package.
posted by Jimbob at 7:04 PM on February 15, 2010 [1 favorite]


I was. Go try to get a version of R set-up with a few different GUIs to choose from, and do it in the same amount of time it takes to use a windows installer. You won't. Especially when R, and most of the R GUIs weren't in the normal repositories at the time I tried to set it up.

Was this 15 years ago? You could almost certainly have a linear regression done in fifteen minutes if you googled "R linear regression tutorial". R is less easy than minitab or JMP because it does more. Try doing anything in your first ten minutes at a SAS terminal. Open source has nothing to do with it.

Stata gets a lot of credit for being crazy easy and having nice menus. Side effect is watching people do things very wrong because it made it too easy.
posted by a robot made out of meat at 7:08 PM on February 15, 2010


Oh, and a tip for the R fans out there - if you haven't discovered it already, you should check out sqldf - it uses SQLite to do seamless SQL operations on data frames in R. I don't know how I survived for so long without it.
posted by Jimbob at 7:11 PM on February 15, 2010 [5 favorites]
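[A tiny sketch of the sqldf idea: query a data frame as if it were a SQL table. The data frame and column names are made up.]

```r
library(sqldf)
df <- data.frame(species = c("a", "b", "a"), n = c(10, 3, 7))
# aggregate with plain SQL instead of tapply/aggregate:
sqldf("SELECT species, SUM(n) AS total FROM df GROUP BY species")
```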


In algol family languages, "=" where the programmer meant "==" is still a common enough bug

That's why I prefer languages such as algol, where assignment is ":=" /smartass
posted by Monday, stony Monday at 7:16 PM on February 15, 2010 [1 favorite]


robot: hoist on my own petard, there. Try here: filehash (or, install.packages("filehash"); library("filehash"))

A useful blog entry about it is here.
posted by en forme de poire at 7:17 PM on February 15, 2010


> Okay, now get it to run a multiple regression in R commander.

If you have data formatted as, say:
case,response,var1,var2,var3
1,0,1,-3,0.2
2,0.1,2,-5,0.4
3,2,4,-2,0.01
...
where rows are cases and columns give you the value of your regressors in each case, then the following should give you a multiple regression:
your.data <- read.csv("your-data.csv", header = TRUE, row.names = 1)
model <- lm(response ~ var1 + var2 + var3, data = your.data)
summary(model)
A useful PDF is here.
posted by en forme de poire at 7:26 PM on February 15, 2010 [2 favorites]


So, I have, uh, a friend, who is pretty uneducated about statistics. I think I, I mean, my friend would be much more likely to learn some stats if he could learn by doing stuff in R.

Any recommendations for a "learn by doing" tutorial for R, one which assumes no knowledge of R or statistics, and aims to build both together?
posted by orthogonality at 8:03 PM on February 15, 2010


Let's say I am good at math but bad at computers and I want to run R on my (OSX 10.6) Mac. I don't know what python, perl, or java are. I'm comfortable with working from a command line (e.g. this is how I use Maple and MAGMA.) What do I do?
posted by escabeche at 5:10 PM on February 15


Install it straight from here. The mac version is quite a bit nicer, I think.

I've previously failed in getting RPy to work. Is it worth it, and any tips about installing it on a Mac?
posted by a womble is an active kind of sloth at 8:12 PM on February 15, 2010


What fun it is to troubleshoot 1-based array indices....
posted by Blazecock Pileon at 8:14 PM on February 15, 2010


R saved my bacon when I gave up getting anyone to look at my data sets in Matlab. If an ME who knows programming only in Fortran 77 and C can learn R, anyone can. Many thanks for the GUI recommendations.
posted by jet_silver at 8:19 PM on February 15, 2010


"Any recommendations for a "learn by doing" tutorial for R, one which assumes no knowledge of R or statistics, and aims to build both together?"

Using R for Introductory Statistics (pdf).

Introductory statistics with R By Peter Dalgaard

Statistics: an introduction using R
By Michael J. Crawley

Analyzing linguistic data: a practical introduction to statistics using R By R. Harald Baayen

Learning Applied Statistics with R (stack overflow question)

I also like A first course in statistical programming with R, but that might be a bit at the intermediate level, as there is a lot of assumed math.
posted by ollyollyoxenfree at 8:42 PM on February 15, 2010 [39 favorites]


While I've never used R, I just set my tutoring student up with it. He's a middle-aged fellow who's decided, on a lark, to go back to school for financial modeling.

After several months' hiatus, he set up a session with me a couple weeks ago. I met him, and his major question was "Should I install this GNU SCL thing I found on google? Or should I copy the statistics functions out of the book?" It seems that, since his beginning programming classes were in C++, he was of the impression that it was a good language to use for running a derivative pricing model. Mind you, this is the same fellow who needed help in his first tutoring session with what a "text file" was.

So, setting aside the relative merits of R versus other numerical analysis languages, I'd still say it's a giant fucking improvement over C++.
posted by Netzapper at 8:47 PM on February 15, 2010 [1 favorite]


I was chatting with a guy who I guess is a statistician or something. He's a professor I think, looks really scholarly, programs in R.

And he made me feel all bad because I was talking about some code I'd written that used, like, 3 loops or something (yeah, I know n^3 or something but seriously okay because this was a function in the real world, where we're not processing an infinite number of items, just like 1000 or something, and three nested loops will handle that just fine). And he's all like "You shouldn't be using loops. A real programming language, you shouldn't have to use loops to do [whatever it was we were talking about, I can't remember]" And I felt really bad. I mean, I was thinking, what the heck is wrong with me that unless I'm using loops and a bunch of if-thens I can't figure it out.

And finally I asked, "Okay, so how exactly would the computer figure this thing out?" And he's like, "Well, I guess, behind the scenes, it's probably using loops."

That made me feel a bit better, and I know writing a web app or something in R would probably be pretty dumb. But still, there's a part of me that wonders what's wrong with me that I can't figure it out unless I have my loops.
posted by Deathalicious at 8:50 PM on February 15, 2010


I want to note this wasn't R. I don't know R yet and probably never will, as statistics are, to me, not the most interesting thing I can do using computers.
posted by Deathalicious at 8:52 PM on February 15, 2010


Actually if you want to use R's feature set in a more conventional/modern language, you can use Rpy to call R functions from Python. That would probably be helpful for people who already know Python.
posted by delmoi at 8:57 PM on February 15, 2010


And finally I asked, "Okay, so how exactly would the computer figure this thing out?" And he's like, "Well, I guess, behind the scenes, it's probably using loops."

Sure, but it's also more likely to be able to parallelize your code. If you use a loop, the CPU can only work on one index at a time, whereas if you use a map/reduce or filter style call where you pass a function that gets applied to every element, it can run on lots of CPU cores or even whole clusters of machines. That's one of the reasons people are pushing functional programming so much these days, since it's easier to parallelize.
posted by delmoi at 9:06 PM on February 15, 2010 [3 favorites]
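[The two styles delmoi is contrasting, in R; sapply could be swapped for mclapply from the parallel package to spread work across cores.]

```r
xs <- 1:1000

# imperative: explicit loop over indices
out <- numeric(length(xs))
for (i in seq_along(xs)) out[i] <- xs[i]^2

# functional: pass a function to be applied to every element
out2 <- sapply(xs, function(x) x^2)
```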


That made me feel a bit better, and I know writing a web app or something in R would probably be pretty dumb. But still, there's a part of me that wonders what's wrong with me that I can't figure it out unless I have my loops.

This seems to be one of the major impediments for teaching OO and functional programmers SQL. "No, stop trying to tell the computer how to retrieve the data, tell it what you want."
posted by rodgerd at 9:38 PM on February 15, 2010 [2 favorites]


functional programmers

That should, of course, be procedural. Stupid thinkos.
posted by rodgerd at 9:38 PM on February 15, 2010


Usually "for loop aversion" is just a smokescreen for scripting languages that are too slow to do any real work themselves (e.g., Matlab), so to get any sort of performance they have to ship off large batch operations to "real code" written in C, which will then promptly run a for loop.

Parallelization is fine, but functional programming is hardly a magic bullet; you can write perfectly functional (e.g., heavily recursive) code that is still impossible to parallelize, and there are imperative constructions (like parfor in Matlab) that are equivalent to map while maintaining common for loop syntax.
posted by Pyry at 9:45 PM on February 15, 2010


I've previously failed in getting RPy to work. Is it worth it, and any tips about installing it on a Mac?

Unless you need your R objects to be accessible within a Python environment or vice versa, it can be painful to troubleshoot and perhaps more work than should be required. You also need to understand R and Python data structures pretty well. IMO, for maintainability, consider using R as a scripting language directly, instead, using Rscript.
posted by Blazecock Pileon at 10:03 PM on February 15, 2010
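[What Blazecock Pileon suggests looks something like this: a hypothetical standalone script, run as e.g. ./summarize.R data.csv after making it executable.]

```r
#!/usr/bin/env Rscript
# Read a CSV named on the command line and print a summary.
args <- commandArgs(trailingOnly = TRUE)
dat <- read.csv(args[1])
print(summary(dat))
```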


I've used R here and there for school projects, and found it to be a very powerful and really neat, if somewhat difficult to start with, language. It was certainly satisfying to be able to write 5 lines of code to do the same thing that my peers were doing in 50+ lines of Java.

If you're like me and have a certain affinity for paper reference books, the following helped me a lot with the basics of R/S: The Basics of S-Plus.

I think the school's library had the 2nd edition when I checked it out, so you will probably do just fine with the (much cheaper) 3rd edition.
posted by grandsham at 11:09 PM on February 15, 2010


Another vote for R here. Recently, I've been combining R and C, writing the backend of my analysis (I work with Bayesian hierarchical models) in C and interfacing with R. This can speed things up by orders of magnitude in the types of analyses I do. I get the best of both worlds that way.
posted by Philosopher Dirtbike at 12:48 AM on February 16, 2010
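[A sketch of the kind of R/C interfacing Philosopher Dirtbike describes, via the .C foreign function interface. The C function vec_sq here is hypothetical; it would be compiled with "R CMD SHLIB vec_sq.c" and would need the signature void vec_sq(double *x, int *n).]

```r
dyn.load("vec_sq.so")                 # load the compiled shared library
x <- as.double(1:5)
res <- .C("vec_sq", x = x, n = as.integer(length(x)))
res$x  # the vector as modified in place by the C code
```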


I just use the Data Analysis add in for Excel.
posted by Damienmce at 1:01 AM on February 16, 2010 [1 favorite]


I cannot even begin to express how perfect the timing of your post is. Seriously, I was just having a conversation with a friend earlier tonight about this very thing, including all of my anxiety and overwhelmedness about needing to understand R and all its labyrinthine workings. All of the answers I need are right here. You have saved me hours of research that I can now spend on figuring out how to actually do my real research.

I must spouse you now.

Also, I just bought the Baayen book from Amazon yesterday. It's great to hear MeFites recommending it, as well as R itself.
posted by iamkimiam at 1:41 AM on February 16, 2010


This seems to be one of the major impediments for teaching OO and functional programmers SQL. "No, stop trying to tell the computer how to retrieve the data, tell it what you want."

How can there even be an impediment to learning SQL? It's about the easiest thing ever. I bought an "SQL in 24 hours" book in high school and I think I actually only spent about 4 hours reading it.
posted by delmoi at 3:24 AM on February 16, 2010


Well, the fact that you would turn to a book for a theoretical grounding puts you in an upper percentile, unfortunately. A lot of programmers with little prior database experience tend to approach it as if the database were just a replacement backing store for arrays that also has some fancy filtering. So they'll do dumb things like fetch all rows matching A from table X and store them in an array, then fetch all rows matching B from table Y and store them in a second array, and then they'll loop through each element in the first array comparing it to each element in the second array to find the relations. They do this because they don't know what a join is (or they do but they get confused by all the different kinds of joins) or how to specify one to the database. So they instead do what they know, which is to just plow ahead and implement things the way they would if they had no database and just a bunch of flat files.
posted by Rhomboid at 3:53 AM on February 16, 2010 [2 favorites]
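The R translation of that anti-pattern's fix, with two toy tables invented for the example: merge() is R's spelling of a JOIN, and it replaces the nested-loop comparison entirely:

```r
# Two tables related by a key.
orders    <- data.frame(id = 1:3, customer = c("a", "b", "a"))
customers <- data.frame(customer = c("a", "b"), city = c("Oslo", "Lima"))

# Instead of looping over every pair of rows by hand,
# declare the relation and let merge() do the join.
joined <- merge(orders, customers, by = "customer")
joined[order(joined$id), ]
```

The all.x and all.y arguments to merge() give you left and right outer joins when you need them.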


Years ago I unwittingly joined a cult called SAS, and now I'm trapped paying exorbitant licensing fees and going to expensive courses. Would R be a way out of this cult? Seriously: I've put years of my life into learning SAS; is there an advantage to leaving?
posted by acrasis at 4:19 AM on February 16, 2010


FWIW, SAS recently announced the R Interface to SAS/IML Studio.

Disclaimer: I work for SAS.
posted by kurmbox at 5:15 AM on February 16, 2010


I bought the ggplot2 book by Wickham a few months ago and I highly recommend it. It enables amazingly good plots to be produced. There are pdf versions of bits of it floating around online, but I wanted all the information in one place and to support his great package.
posted by a womble is an active kind of sloth at 6:31 AM on February 16, 2010


Needs more cowbell.
posted by warbaby at 6:50 AM on February 16, 2010


Awesomeness <- Emacs + ESS + R

I use R every day and I am still learning. It's a pain to figure stuff out when you don't know the appropriate function name, but when you do there is a perverse sense of enjoyment (I still remember the day I discovered alpha shading in graphs!). Now if only I could convince my PI to stop working in Excel...
posted by Hutch at 7:07 AM on February 16, 2010
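For anyone chasing the alpha-shading trick mentioned above: in base graphics it's just the fourth argument to rgb(), taking a value between 0 and 1. The data below are invented for the example:

```r
# Semi-transparent points: alpha is rgb()'s fourth argument,
# so densely overplotted regions come out darker.
set.seed(42)
x <- rnorm(2000)
y <- x + rnorm(2000)
plot(x, y, pch = 16, col = rgb(0, 0, 1, alpha = 0.2))
```

Note that alpha blending only works on graphics devices that support it, such as pdf() and png(); some older devices silently drop the transparency.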


Now that's timely. My fully paid-for, fully activated Minitab just threw a fit about not being activated, and the license server kicked me off, telling me that the activation code is already in use. Support will fix it eventually, I guess, but meanwhile I'll have another look at R. Last time I tried it, it seemed like one of those sprawling, academia-grown packages that are unusable for casual users like myself: I only do stats once in a while, and I need an interface that I can figure out immediately. Having to learn a full language (and to remember which cryptically-named package does what) seemed like overkill, so after playing a little with R Commander I quickly came back to the commercial comfort of Minitab. But it looks like I don't have much choice now.
posted by elgilito at 7:29 AM on February 16, 2010


@Popular Ethics: > This is perhaps a good time to introduce scipy

The Scipy/Numpy/Matplotlib triumvirate is awesome as well! Matlab's matrix-friendliness and plotting capabilities meet Python's everything else.

I have to say, though, good luck compiling any of them from source. I had to run some Scipy scripts on a cluster, and so I couldn't just use a package manager because I didn't have root access. There was an older version of Scipy that was installed, but it threw some kind of error when you imported scipy.stats and so it was kind of unusable for my purposes. Anyway, compiling was not an experience I would wish on anyone (mostly, to be fair, it was getting the BLAS/LAPACK part to compile). In case anyone has to suffer through this, garnumpy helps a lot, although the experience was still far from headache-free. R, in contrast, was pretty much a breeze to install locally.

I've also had weird experiences with things not working properly or not being implemented in Scipy (I seem to remember having some bizarre problem with the hypergeometric CDF implementation, though I don't remember exactly what was wrong), so I've mostly had to resort to RPy for my stats-in-Python needs. Also, R is still the natural choice for me if I have to do some kind of statistical modeling because the sheer number of packages is so extensive and the modeling commands themselves have a pretty simple syntax. Scipy's a great project, though, and I'll definitely be keeping an eye on it.
posted by en forme de poire at 8:43 AM on February 16, 2010 [1 favorite]
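For comparison, the hypergeometric CDF that misbehaved in Scipy is a single call in R. A small worked case, with numbers picked by hand: drawing 3 balls from an urn holding 4 white and 6 black, P(at most 1 white) = 1/6 + 1/2 = 2/3.

```r
# phyper(q, m, n, k): m white balls in the urn, n black, k drawn.
phyper(1, m = 4, n = 6, k = 3)  # 0.6666667
```

The d/p/q/r naming convention (dhyper, phyper, qhyper, rhyper) holds across all of R's built-in distributions, which is part of why modeling in R feels so uniform.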


139 favorites so far, parudox. Excellent!
posted by bukvich at 9:06 AM on February 16, 2010


Awesomeness <>

So I hear, but as a non-initiated Emacs person, I spent a few hours on Sunday trying to install this (on a Mac). It's pretty hard to figure out what is going on based on the scant instructions online, so I returned to my pedestrian approach of copying bits of scripts into the R console and running them. I feel like there must be a better way....

posted by a womble is an active kind of sloth at 9:32 AM on February 16, 2010


How can there even be an impediment to learning SQL? It's about the easiest thing ever. I bought an "SQL in 24 hours" book in highschool and I think I actually only spent about 4 even reading it.

You might like to think that, but it simply isn't true. In the OO world, especially, there are a slew of people who consider themselves programmers who claim learning the basics of SQL and relational models is too hard. It's quite frightening, really, especially when you look at some of the crapulent ORMs, frameworks, and OO DBs developed to cater to people who are too dumb to code[1].

So they'll do dumb things like fetch all rows matching A from table X and store them in an array, then fetch all rows matching B from table Y and store them in a second array, and then they'll loop through each element in the first array comparing it to each element in the second array to find the relations. They do this because they don't know what a join is (or they do but they get confused by all the different kinds of joins) or how to specify one to the database.

Oh god this. Still, it can let you look like a hero; I once spent an afternoon getting a two-order-of-magnitude improvement in a codebase that was doing this for every web page the app served.

As I like to point out to people when I find them doing aggregation or sorting in $LANGUAGE when they have their data in a perfectly good RDBMS, if you really think you can code these sorts of algorithms better than a decent database server, you should talk to Larry, because he'll make you a millionaire.

[1] And yes, I'm well aware there are good and useful cases where OO DBs make a lot of sense, and are certainly more appropriate to a given problem than a relational one.
posted by rodgerd at 9:32 AM on February 16, 2010


sorry bout the italics... I was trying to quote "Awesomeness <- Emacs + ESS + R"
posted by a womble is an active kind of sloth at 9:33 AM on February 16, 2010


I do a lot of work in SPSS syntax but have not done anything that looked like real programming for over a decade. What got me into R was a great book by Joseph Adler, Baseball Hacks. Not only does it act as a great R primer (full intro chapter and examples through the rest of the book), but it also gives an intro chapter on MySQL and one on Perl, so that by the end of the book you are getting them all to play nicely with one another. Not an exhaustive education on any of the three, but enough for me to get down to actually doing a little work with it.
posted by cgk at 9:38 AM on February 16, 2010


R is a terrible programming language. It is a wonderfully efficient and comprehensive collection of tools. I use it every day and find something to hate most days... and yet I use it every day.

A lot of people seem to want R to be a language for building applications. That's a terrible idea. Call out to R, have it do marvelous things quickly, load the results back into some reasonable language, and move on.
posted by gurple at 10:26 AM on February 16, 2010 [2 favorites]


Also, ladies, if it helps my case any I am using R at a brewery down the road from my office.

You know who else did statistics in a brewery? Yep, that's right, W. S. Gosset. Or "Student" to you, Guinness.
posted by Mental Wimp at 11:01 AM on February 16, 2010 [5 favorites]


Some notes from a package developer who is used to working on clusters in R:

Sooner or later you need to call out to C/C++ (usually with OpenMP directives if you can) and/or a relational database, 'ff'/'bigmemory', or possibly Hadoop/MapReduce, to do heavy lifting. That's OK -- it's the same path that most Matlab programs eventually tread as they grow. The Incanter project (in Clojure, with its mapreduce hooks and RIncanter bindings) looks like a possible next step in R's evolution -- the biggest difference between Matlab and R is in the core, where Matlab can easily take care of SMP/MPP with checkpointing, and R isn't there yet. You end up with a slapdash assembly of 'doMPI', 'foreach', etc. and C/OpenMP function calls, which while it works OK, isn't exactly a gentle learning curve. But large-scale numeric computing rarely is.

Anyways, hope this provides some ideas for people that are running out of time or RAM in R. Various combinations of the above have solved most of my R-related issues along the way. In a few weeks, I will hopefully be putting a semi-flexible mixture model package on CRAN, with a more restrictive, but faster, core E-M loop than 'flexmix'. I hope it will offer some ideas for using OpenMP directives to radically speed up a common task (which I feel should be made easier).


To whoever posted about sqldf:

You do realize that sqldf selects are typically on the order of a hundred times slower than using R's native hybrid indexing, yes? As time goes by, I grow more and more convinced that using SQL on data.frames is somewhat pointless. merge() handles the typical JOIN operations, and SELECT is easily handled by redefining

"%u%" <- union
"%d%" <- setdiff
"%i%" <- intersect
"%nin%" <- function(x, table) match(x, table, nomatch=0) == 0
"%notin%" <- function(x, table) match(x, table, nomatch=0) == 0

Try it. I bet you'll find that it's easier and faster after a short while. I tried sqldf for a while (after about 15 years of working on relational databases and 5 years of using R off-and-on). Then I figured out how to do the same things with indexing. The latter is faster and usually clearer. R is not fast to begin with; making it slower is a good way to experience frustration.
posted by apathy at 12:08 PM on February 16, 2010 [1 favorite]
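A quick demonstration of those infix operators on toy vectors, next to the indexing they enable:

```r
# Infix wrappers around R's set functions, as defined in the comment above.
"%i%"   <- intersect
"%nin%" <- function(x, table) match(x, table, nomatch = 0) == 0

a <- c(1, 2, 3, 4)
b <- c(3, 4, 5)
a %i% b        # 3 4 -- the WHERE x IN (...) case
a[a %nin% b]   # 1 2 -- the NOT IN case, done by logical indexing
```

Because %nin% returns a logical vector, it slots directly into data.frame row indexing, which is where the speed advantage over an sqldf round-trip comes from.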


nowoutside: I'm working my way through the new O'Reilly book R in a Nutshell right now! I have a very thin programing background (some IDL and even less Python), and things seem to be clicking.

Oh man, there's an O'Reilly book for this? That bears closer examination.


Deathalicious: And he's all like "You shouldn't be using loops. A real programming language, you shouldn't have to use loops to do [whatever it was we were talking about, I can't remember]" And I felt really bad.

My take is that he's trying to talk up his own favorite language, & coming off as a language zealot.

This keeps popping up here, so: use &lt; to get "<" & &gt; to get ">." (They stand for "less than" & "greater than," not, say, "left something" & "right something," which totally doesn't work.)
posted by Pronoiac at 12:35 PM on February 16, 2010 [1 favorite]


<ooh! carry on.>
posted by iamkimiam at 12:45 PM on February 16, 2010 [1 favorite]


To whoever posted about sqldf:

You do realize that sqldf selects are typically on the order of a hundred times slower than using R's native hybrid indexing, yes?


Oh, I realize that, but it hasn't impacted me in any way I've noticed. I do it mainly for the syntax. SQL's syntax is so easy; I can figure out sorts and joins and selects so much more easily than doing it in R with merge() and subset() and so forth, simply because I've only been using R for about 3 years, while I've been using SQL for about 12. I didn't recommend sqldf for speed; I recommended it because there might be other people out there who like to deal with SQL as well. However, I will endeavor to give your suggestions a try.
posted by Jimbob at 1:29 PM on February 16, 2010


517: Okay, now get it to run a multiple regression in R commander.

Or, hey, I have an idea! Why don't you do it yourself?

We've told you it's all right there and how to get to it. I don't know about idiopath, but I don't know enough R to do your experimentation for you. I can, however, see one hundred and fifty seven separate packages for R in Debian. Seven are the base installation of R, and the rest seem to be libraries. The "r-recommended" metapackage will add a number of the most common libraries. All you have to do is point and click a little, and voila.

You've already made up your mind that "it's hard!", when it's not hard at all. You don't have to compile anything on Debian or Ubuntu to get a fully-functioning R installation with a huge library of prepackaged utilities. All you have to do is click on some checkboxes for which features you want, wait awhile for them to download/install, and then use them. And if any bugs are found or new features are rolled out, your distro's automatic update feature will notify you and install them if you like.

The keywords you need to search with: r-base, r-doc, r-cran, r-noncran (for one package). From the command line, you can also use "^r-" as your search term... meaning any package that starts with r-. I don't know if that works in the GUI.

It would be a lot easier to find in the giant Debian archives if they'd called the language something longer than one letter. Searching for just r returns over 13,000 results.
posted by Malor at 3:46 PM on February 16, 2010


Malor: ""^r-" as your search term... meaning any package that starts with r-. I don't know if that works in the GUI."

Yeah, that works in aptitude. Regarding the little assignment, I got as far as tinkering with R-Cmdr and then realized I was spending more of my time figuring out statistics than figuring out R.
posted by idiopath at 4:35 PM on February 16, 2010


R!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
posted by brendano at 4:57 PM on February 16, 2010


Actually, I fired up my Ubuntu image, and ^r- doesn't work properly from the GUI. It seems to just ignore the ^. So I suggest using r-base, r-cran, and r-doc. I had trouble finding R Commander; as it turns out, the package is named r-cran-rcmdr. Can't search on "commander", either. Grr.

Trying to figure out what it was called, I checked the R Commander website, and noticed that they say Ubuntu's version is typically out of date. They recommend using their repository instead.

To do that, add a mirror in Administration/Software Sources/Other Software, Add button. Enter:

deb http://cran.stat.ucla.edu/bin/linux/ubuntu karmic/

You can choose other mirrors if you want, that's just what I picked. Make sure the checkbox to use that source is checked.

When you close that program, you'll get a warning about a missing key. You can ignore this if you wish; you'll get warnings about untrusted software when you proceed with your R installation. If this worries you, don't install yet. If you don't care, go ahead and install "r-recommended", and whatever else you want.

Adding their key is hard to do from the GUI. The GUI requires a key file, and the R team doesn't directly provide one. Rather, they publish their key through the keyserver network. This is safer, because if their website gets compromised, the bad guys can't hand out both a bogus key file and malware signed with that key. Keeping the key out-of-band is much more robust. But, since the GUI can't talk to a keyserver, you pretty much have to hit the command line.

The published method on the project Ubuntu archive is a little more complex than it needs to be. This is a slightly simpler solution:

sudo apt-key adv --keyserver keys.gnupg.net --recv-keys E2A11821

Note: that command is a little dangerous, above and beyond trusting the R developer, because you don't know if I'm giving you the correct key ID. I could be an evil miscreant, or I could just have made a typo. For maximum safety, double-check the key code on the R project site. It's just past the heading "SECURE APT".
posted by Malor at 5:34 PM on February 16, 2010


R's mailing list is not pleasant. R's community on StackOverflow is really, really helpful - and you can occasionally get help from the person who wrote the package you're asking about (Hadley Wickham OMG <3).

Also, when R does something that pisses you off, you're never in doubt about what to scream. Try that with SPSS.
posted by McBearclaw at 7:51 PM on February 16, 2010 [1 favorite]


Also: do people really use GUIs for R? That seems barbaric to me.
posted by McBearclaw at 7:52 PM on February 16, 2010


You do realize that sqldf selects are typically on the order of a hundred times slower than using R's native hybrid indexing, yes? As time goes by, I grow more and more convinced that using SQL on data.frames is somewhat pointless. merge() handles the typical JOIN operations, and SELECT is easily handled by redefining

Have you looked at PL/R for Postgres? Same issues?
posted by rodgerd at 1:12 AM on February 17, 2010


Or, hey, I have an idea! Why don't you do it yourself?

Aaaah, just like the R mailing list!

*sighs contentedly*
posted by Jimbob at 1:51 AM on February 17, 2010 [2 favorites]


Jimbob, he was asking us to run a statistical analysis for him. We told him how to do it, and in fact I just posted a fairly detailed set of instructions on how to get the most recent version running on Ubuntu.

I am not, however, going to do his homework.
posted by Malor at 2:37 AM on February 17, 2010


Oh I know, I was just kidding. Most of the similarly terse replies on the R mailing list are in response to similarly pathetic questions ;)
posted by Jimbob at 2:44 AM on February 17, 2010 [1 favorite]


Usually "for loop aversion" is just a smokescreen for scripting languages that are too slow to do any real work themselves (e.g., Matlab), so to get any sort of performance they have to ship off large batch operations to "real code" written in C, which will then promptly run a for loop.
I think Matlab now uses just-in-time compilation to evaluate loops, so that the loop body is (slowly) parsed only once rather than on every pass through the loop. There's occasionally discussion on the Octave mailing lists about adding this feature, but it seems unlikely to happen soon for a number of reasons.

Are there other languages that have this limitation? I guess in Perl you'd see it in a loop containing "eval 'code'" rather than "eval { code }", but I haven't done any benchmark to see what the penalty is.
posted by fantabulous timewaster at 10:56 AM on February 17, 2010


The problem is unlikely to be just parsing repeatedly. The C code for (i = 0; i < n; i++) x[i] += y is going to run much, much faster than anything that involves repeated dispatch of even bytecode one array element at a time. Not only do you save the instructions needed for the comparisons and jumping, but you also make cache misses much less likely, avoid bad branch prediction, allow SIMD, etc. So for any array-understanding scripting language (R, MATLAB, NumPy),

for (i in seq(length(x))) {
    x[i] <- x[i] + 1
    x[i] <- x[i] * 2
}


is going to be a lot slower than

x <- (x + 1) * 2

even if you pull out all the stops with compilation, unless you are also going to do the magic analysis needed to recognize that the two forms are mathematically equivalent. And that's fairly difficult to do.

The latter is really easier to type and understand, too, but people who are stuck in an Algol-like paradigm sometimes have a hard time wrapping their heads around the possibility.

You almost never need a for loop in R. I think I've used about two of them in the last five years or so.
posted by grouse at 11:10 AM on February 17, 2010 [2 favorites]
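The gap is easy to measure with system.time(), and both forms compute the same thing:

```r
x <- runif(1e6)

# Element-by-element update, dispatched through the interpreter each time.
loop_version <- function(x) {
  for (i in seq_along(x)) {
    x[i] <- (x[i] + 1) * 2
  }
  x
}

t_loop <- system.time(a <- loop_version(x))  # noticeably slower
t_vec  <- system.time(b <- (x + 1) * 2)      # one vectorized pass in C
all.equal(a, b)                              # TRUE
```

The exact ratio depends on the machine and the R version, but the vectorized form reliably wins by a wide margin on inputs this size.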


fantabulous timewaster: I guess in Perl you'd see it in a loop containing "eval 'code'" rather than "eval { code }", but I haven't done any benchmark to see what the penalty is.

In Perl, I'd try putting the loop into eval 'code'. I just compared this way to something running eval for every line in a file, & there's a good 20-30x slowdown that way.

eval {code} is more for robustness in the face of errors, I think.
posted by Pronoiac at 3:29 PM on February 17, 2010


USGS has a free R course, which focused more on command-line work in 2008 and covers more GUI tools (R Commander and Rattle) this year. It's a good place to start for beginners, but I also learned a lot about using the various GUIs that make it somewhat easier to create graphics without having to look up all the necessary commands.
posted by stinker at 9:21 AM on February 20, 2010 [1 favorite]


What do people do when their datasets get too large, or the computations take too long? Just leave for the day and come back 24 hours later?

A) We often don't load everything into memory at once. Find ways to subset the data and work on a piece of it at a time.

B) We parallelize the code, so that we can split it over many processors, often on a cluster. As we speak, I have 200 cores crunching some genomic data for me.

C) Yes, sometimes there's no way around long jobs. I've certainly done analyses that take, say, a week to run.

Bioinformaticians are a good group to ask these questions. Dealing with whole genomes has made us learn lots of tricks for dealing with very large data sets. (Two other good groups are astronomers and high-energy physicists.)
posted by chrisamiller at 3:59 PM on February 20, 2010
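A sketch of point B using what base R now ships as the parallel package (in 2010 you would have reached for the multicore or snow packages, which it later absorbed). The crunch() function is a made-up stand-in for a real analysis, and mclapply's fork-based workers need Unix:

```r
library(parallel)

# Stand-in for an expensive per-chunk computation.
crunch <- function(chunk) sum(sqrt(chunk))

# Split the work into four chunks and fork workers over them.
chunks  <- split(1:1e5, rep(1:4, length.out = 1e5))
results <- mclapply(chunks, crunch, mc.cores = 2)  # on Windows: parLapply
total   <- Reduce(`+`, results)
```

The same split/apply/combine shape scales up to a cluster by swapping mclapply for a snow- or MPI-backed apply, which is how the 200-core runs mentioned above are usually structured.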


Also, R is the language I love to hate. The power is great. The syntax makes me cry, especially when switching from my standby language of Ruby.
posted by chrisamiller at 4:00 PM on February 20, 2010




This thread has been archived and is closed to new comments