the threat R poses
April 30, 2022 5:13 PM

 
This post could probably benefit from background and context. The link is VERY jargon- and acronym-heavy, and probably not of interest to the MetaFilter audience without some further explanation of why this matters or what it means.
posted by jonathanhughes at 5:40 PM on April 30 [4 favorites]


I loved this piece and it was immediately interesting and understandable to me, and I'm going to be sharing it with colleagues!
posted by brainwane at 5:53 PM on April 30 [5 favorites]


jonathanhughes: Here's some background context:

There's a programming language called R. Researchers often use it to do statistical analysis and similar stuff in the sciences.

Most people writing software programs these days don't write everything from scratch. They reuse programs other people built, using them like LEGO bricks, gathering and building upon them.

But -- unlike a LEGO brick -- often, the program ("package") that you're reusing gets upgrades on an ongoing basis. You reused version 1.2 but then version 1.3 comes out. And maybe that brick is shaped a little differently and the castle falls down.

In the sciences, it's good practice for your work to be reproducible -- someone else should be able to run the same experiment you did and get the same results. If you say "here's my paper and here's the code I wrote to run the experiment and analyze stuff," ideally, other people should be able to run that same code on the same data and get the same results. And when you publish your code, it includes an inventory of the names of all the bricks you built on.

But if you did your experiment on Feb 2nd, and then I'm reading the paper on Dec 12th, then probably, between those times, the developers who manage the bricks in question have released a lot of version upgrades, changing how things work. So if I just copy your code and run it, things probably won't work.

So the author of this post is suggesting a way that a researcher can sort of freeze their bricks inventory in time through a special keyword, and then future people will be able to reproduce their work more easily.
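In code, the idea looks roughly like this (a sketch; the package name and date are just examples):

```r
# Normally you'd write library(dplyr), which loads whatever version
# happens to be installed today. With groundhog you instead write:
library(groundhog)
groundhog.library("dplyr", "2022-02-02")
# This loads the version of dplyr that was current on CRAN on that date,
# installing it first if necessary, so the same "bricks" get used later.
```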
posted by brainwane at 6:04 PM on April 30 [54 favorites]


thanks!
posted by jonathanhughes at 6:10 PM on April 30 [1 favorite]


So there already seems to be a mechanism in R for versioned library dependencies, where you can construct a list that says things like
  • awesome_library v1.3.4
  • unstable_library v0.1.74.alpha.final.noreallyfinal
and I assume there are tools for reconstructing an R analysis environment from such a list. The complaint here is that scientists, being busy people who may not actually care very much about their computers and their version numbers, frequently skip this step. So the linked tool is a multi-versioning library loader whose parameter is just a date. If you have some code that used to work but now doesn’t, or which your colleague claims produced some output but now produces different output for the same input, you can say “well, it worked last July,” and have a reasonable chance of reproducing the original environment.

I suspect it works because R has a reasonably complete centralized repository, the CRAN, with a sensible mirroring schedule. This gives a better correspondence between dates and version numbers than some catastrophe like searching all of GitHub.

It’s not as good a solution as actually recording all of the version numbers in your software stack. But it frequently happens that a “quick script” which “shouldn’t need” such documentation grows out of control and becomes important and brittle. A clever invention.
posted by fantabulous timewaster at 6:19 PM on April 30 [6 favorites]


This feels like reinventing the wheel - specifically, a wheel labeled "Maven".
posted by NoxAeternum at 6:21 PM on April 30


Oh that's pretty cool. Describing the problem as a threat that R, specifically, poses to reproducible research seems... I dunno, kinda clickbaity to me, since this is a pretty general problem with any analysis software. But this package seems like a great solution for R. It actually would have seriously helped me out with an issue I had a year or two ago.

Does pip have a similar mechanism for getting packages by date rather than version number? That would be super handy.
posted by biogeo at 6:35 PM on April 30 [4 favorites]


This Maven? That’s way outside the comfort zone of even many very good R users of my acquaintance.

The brilliance of the groundhog approach is that it can be done well after the fact using an estimate of the date by someone else. Probably that won’t get all the analyses running again, but it’s low-effort.
posted by clew at 6:45 PM on April 30 [16 favorites]


biogeo, I am fairly sure pip does not have something built-in to let you do this at the command line because PEP 508, "Dependency specification for Python Software Packages", doesn't cover that, but I bet there's a third-party tool or a pyproject.toml flag or something that does something a bit like this.... I'm diving back into Python packaging work for dayjob reasons and will probably have this question in the back of my head till I find an answer!
posted by brainwane at 6:46 PM on April 30 [2 favorites]


The brilliance of the groundhog approach is that it can be done well after the fact using an estimate of the date by someone else. Probably that won’t get all the analyses running again, but it’s low-effort.

The problem is that it's dodging the actual issue: just as nobody would accept a paper that did not document its sources properly, in a manner that would allow following researchers to grab those sources, so too should code in research be handled so that its sources can be properly retrieved. The issue with groundhog is that it "works" because R has a relatively monolithic repository structure, so it's possible to ask that repository "hey, what was the state of package X on date Y?" But that's still a bad idea built on faulty assumptions like "was the researcher actually using the most current module when they created the script?" Basically it's offloading the responsibility of maintaining the code's object model to someone else and hoping they'll get it in the ballpark, instead of saying "hey, your POM is basically a bibliography for your code."
posted by NoxAeternum at 7:05 PM on April 30 [1 favorite]


The ability to search by date is of use to reproducibility of results, and groundhog sounds like a pretty good idea.
I'm here to bang the drum for GNU Guix, a package manager / build system / OS that tracks packages and all their dependencies by individual commits (dates are in there too of course, as part of the metadata).

It's a lot more cognitive overhead than just using a simpler language-specific package manager, at least until you need to support multiple languages. It's programmed and configured in Lisp, and "functional" in a way that guarantees that a particular "derivation" of a version of your package will compile to a specific, reproducible binary.

From what I've heard it's mostly used in scientific / academic computing. It's not a very practical desktop OS.
posted by Rev. Irreverent Revenant at 7:12 PM on April 30 [2 favorites]


What does this do for things like the following scenario in say npm:

1. I import/install react@18 explicitly
2. I do the same with material-ui@latest, which in turn relies on, say, react@17

I believe in npm this uses react@18 at the top level and then loads react@17 for the packages that call for it. I forgot what this scenario is called right now, but it's pretty common. If I recall correctly, it gets more complex if a package asks for the latest version of a dependency that it doesn't actually support.

Furthermore, there have been packages I've used (in npm, if not one of the multitude of other package managers) that are no longer viable due to OS changes. Usually this means that the OS patched a security vulnerability and the package didn't upgrade whatever it needed. Docker alleviates this a bit, as you can specify the exact OS image, at the expense of leaving the security vulnerability in place.

The point being, there are a lot of ways to approach this, and you still have significant gaps if you're using something that's not maintained 3, 6, or 12 months after it was written. Recent security vulnerabilities seem to have exacerbated this, and node package management is notoriously not great, but even languages that seem to cultivate careful programmers, like Rust, have similar issues. Or I'm just bad at package management.
posted by geoff. at 8:25 PM on April 30


The complaint here is that scientists, being busy people who may not actually care very much about their computers and their version numbers

I imagine they only care about those things insomuch as they enable publication. Until your paper is accepted, the computer working is very important, and transitively, the versions of things you use the week before the submission deadline probably need to be the same as the versions you get the day/hour before.

Basically, until the Nobel committees start testing whether published software reproduces results, I see no incentive to fix this. Unlike OSS tools that see repeated use in society, I don't expect a lot of bug fixing to go on. Especially when bug fixes to your data analysis script could end up nullifying your published result!
posted by pwnguin at 8:28 PM on April 30


I mean, yes, Maven sucks. That's why there are a billion other solutions out there for handling build dependencies and versioning of said dependencies in a software ecosystem. I'd recommend starting with Gradle, if you're willing to learn a small amount of Groovy. This is not a new problem, and the author's impulse to go write his own custom library to handle versioned dependencies, without actually referring to any of the well-known prior art, makes me suspect that he (a) is uniquely unqualified to be writing this thing, and (b) has no idea what kind of minefield he's stepping into (hello, do you know more than 1 developer? Did you know that if you ask them to talk amongst themselves about their preferred way of handling minor and subminor release numbers, you will get an effect not unlike dropping a pellet of elemental sodium into water?)
posted by Mayor West at 8:47 PM on April 30 [3 favorites]


The threat that R poses is that it is frequently written by people who actually use a phrase ("functional programming") to describe something special, above and beyond what they default to.

I did not realize this until I was sitting in machine learning courses with them. Their code is terrifying, and I can only imagine that choosing versions of libraries is just deciding which set of bizarre side effects you want to endure.
posted by Tell Me No Lies at 8:59 PM on April 30


(a) is uniquely unqualified to be writing this thing, and (b) has no idea what kind of minefield he's stepping into (hello, do you know more than 1 developer?

I'd agree with you except nearly every single language tries to reinvent this wheel, and even within a language there are multiple package management systems. Though this seems uniquely amateurish. The big glaring problem I see with this, versus something resembling semantic versioning, is that we can both be using the same package (say 1.0), but I use it with a date of May 1 and you have a date of May 20. Having the same package and same code but different dates is really confusing when determining whether we are using the same version of the library.

Frankly, dates are not what the rest of the software development world uses for versioning, and if you're going to buck some of the basics of versioning you'd better have a good reason to.
posted by geoff. at 9:00 PM on April 30 [4 favorites]


I'm not going to take over this thread, but some quick googling shows there's renv, which does exactly what this attempts to do using a more standard methodology, including lock files. It even brings up the exact same caveats I brought up:
The results produced by a particular project might depend on other components of the system it’s being run on – for example, the operating system itself, the versions of system libraries in use, the compiler(s) used to compile R and the R packages used, and so on. Keeping a ‘stable’ machine image is a separate challenge, but Docker is one popular solution. See also vignette("docker", package = "renv") for recommendations on how Docker can be used together with renv.
If this is such a problem that research scientists can't be bothered to learn some of the basics of package management, then maybe the approach should be to create an IDE that helps with usability. As the renv page describes more eloquently than I could, at a certain point some package management decisions need to be made by the researcher/developer and can't be guessed or automated.
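For reference, renv's lock-file workflow is roughly this (a sketch based on its documented API):

```r
# In the project directory:
renv::init()      # create a project-local library plus an renv.lock file
# ...install packages and write your analysis as usual...
renv::snapshot()  # record the exact version of every package in renv.lock
# Later, on another machine or after upgrades elsewhere:
renv::restore()   # reinstall exactly the versions recorded in the lock file
```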

The alternative, I guess, is to take a snapshot of a complete VM of the environment it ran in. That might not be a bad idea for longevity, as restoring a VM will likely be more doable than relying on a package management system that itself might be out of date or undocumented in 3 years' time.
posted by geoff. at 9:23 PM on April 30 [3 favorites]


It's a huge problem for everyone. A trivial package that really no one should even have used (left-pad) broke enough code that it took down builds across much of the internet when it was pulled. So it turns out versioning is a hard problem for everyone (omg ruby nightmares).

So for science to be good science, papers don't just need reproducible results; they need to include code and the entire environment that ran the code. Probably add in the chips that ran the code, the batch and die if available, cloud vendor, the ambient humidity of the machine room, and a record of any power glitches. Joking about the humidity, but at some point... oh, and don't forget versioning the training sets.
posted by sammyo at 10:24 PM on April 30


I think it's important to keep in mind that the target audience for a project like this is not software developers, but researchers with varying levels of programming competency. Being able to reference the version of a library by date-of-repository-access is not intended to replace proper versioning or package management, but to give a more simplified pathway for researchers to do the right thing when publishing their work, which is to report the exact version of all analysis packages that were used in producing the work. If I run an analysis suite that relies, directly or indirectly, on dozens of packages in R, I ought to report the exact version of every package that was used, but in fact this sort of thing is pretty rare in my experience. Even the shift towards people making their analysis code publicly available as a standard part of the scientific publishing process is still relatively new.

In the R ecosystem, things are complicated further by the fact that it's not uncommon for published packages to have unstable APIs that change between releases, sometimes in significant ways. This bit me in the ass a year or two ago, when I found that a whole bunch of analysis code that I'd written for a project a few years earlier no longer worked at all when I returned to it, because in the mean time I'd updated various packages both within and outside of R that meant I couldn't even roll back to the earlier version of the package. So knowing the versions of all the R packages used for an analysis is really the bare minimum.

The advantage of the approach with this Groundhog package is that I, as a researcher, can update everything I use to a specific known state, based on a date, prior to publication, and verify that my code works as intended in that state. Then I can report that the versions of all packages were those retrieved from CRAN via Groundhog using that specific date stamp (which can be any time I choose up to the present). This means that (a) I'm more likely to actually do this rather than collect the version information of all my installed packages, (b) the journal is more likely to be happy to publish that one sentence rather than have to include a table of all R packages and version numbers, and (c) other researchers have an easy way of replicating my work without necessarily having access to my analysis code (and thus possibly a file indicating all package version numbers), if they want to write equivalent code based on the same packages.
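For comparison, the manual bookkeeping being replaced is something like this (a sketch; the output file name is arbitrary):

```r
# Collecting version information by hand for a paper's supplement:
versions <- installed.packages()[, "Version"]  # named vector of versions
writeLines(capture.output(sessionInfo()), "sessionInfo.txt")
# sessionInfo() also records the R version and platform, but someone still
# has to turn this dump into a table the journal will print.
```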

This by no means solves everything, and I think a lot of attention in the reproducible research community is rightly on containerization tools like Docker to ensure that the complete analysis ecosystem can be packaged and distributed for replication. But that still requires a level of technical sophistication that is beyond most researchers, even ones who are fairly competent programmers in their analysis environment of choice. I think this serves as a good step in the right direction for improving the general state of things in a way that has a sufficiently low barrier to entry that it may actually get some buy-in from the community.
posted by biogeo at 10:37 PM on April 30 [22 favorites]


I think it's important to keep in mind that the target audience for a project like this is not software developers, but researchers with varying levels of programming competency.

To which my argument is twofold:

One, these researchers are demonstrating a level of laxity and cavalierness that they would not tolerate in a sphere closer to their own. The OP literally has a section where the author demonstrates the "value" of his system by pointing out that a list of the packages and versions used could be replaced by a single "oh, this command will give you the packages used" - something that I doubt they would find acceptable for a proper bibliography. The problem is not "programming competency" - it's not grasping that versioning is a key part of proper documentation; if addressed from that point, the system the OP proposes would be seen as inadequate.

Two, it turns out that replication is pretty important to programmers as well, because we like to be able to get programs to work in new locations as a matter of course. As such, programmers have come up with a number of systems that address the importance of versioning to this, as well as why relying solely on dates is a non-starter, given how poorly dates map onto the issues of versioning. Again, this isn't about "competency", but about the importance of versioning in making code portable.
posted by NoxAeternum at 10:59 PM on April 30 [2 favorites]


The OP literally has a section where the author points out the "value" of his system by pointing out that a list of the packages and versions used could be replaced by a single "oh, this command will give the packages used" - something that I doubt they would find acceptable for a proper bibliography.

People routinely cite websites using the date they were accessed.
posted by en forme de poire at 11:29 PM on April 30 [10 favorites]


My ridiculously unpopular opinion is that we should write all of the science-things in TensorFlow.

No really, hear me out.

It's baaasically a bunch of linear algebra and probability primitives which have been bent to all kinds of purposes ranging from statistical analysis to streaming audio processing. Because of the vast array of downstream users, the primitives carry few assumptions, operate on a wide variety of platforms, and have been heavily vetted for correctness, numerical stability, memory safety, and so on.

Reproducibility becomes a matter of using forward-compatible interfaces, instead of relying on fly-by-night libraries written for who-knows-what specific purpose by who-knows-which over-stressed graduate student. If you can't express your computation as a graph of primitive transformations, do you really understand what you're doing?

(I'm only 85% joking here.)
posted by kaibutsu at 11:40 PM on April 30 [2 favorites]


> This feels like reinventing the wheel - specifically, a wheel labeled "Maven".

when it comes to being excited about installing uncontrolled versions of the latest exciting and useful software dependencies over the internet, i'd argue that there are two categories of people in the world: people who have spent at least a couple of years working in a team on a single long-running software project maintaining CI pipelines, triaging each morning's new exotic build failure mode, and having the occasional biannual panic attack trying to figure out which precise build of the product actually shipped to the clients when assessing the impact of a major product defect or security vulnerability that is discovered in the wild a long time after it actually shipped, ... and then there's everyone else.

people in the former group who haven't burned out and quit the industry yet or escaped into management may have quirky preferences like checking all the package dependencies and the compiler toolchain into version control, ranting about it being important to be able to build and test the product from source without a network connection, may regard programming language specific package managers as an anti-pattern and reach for the whisky each time they see a new one announced, and want as little internet-connected software in their personal lives as possible.

people in the latter group might write throwaway scripts to perform analyses or automate once-off tasks in software-adjacent fields or are new to the software industry, bright eyed and bushy tailed, full of optimism about the potential of technology, and haven't accumulated enough broken build fatigue yet, or maybe have quite a few years of experience but haven't yet worked on any single code base for a long period.

the latter group is probably much, much larger than the former group, and that's okay! & that's why articles like this are helpful.
posted by are-coral-made at 1:33 AM on May 1 [8 favorites]


> a lot of attention in the reproducible research community is rightly on containerization tools like Docker to ensure that the complete analysis ecosystem can be packaged and distributed for replication

introducing docker may make things more reproducible provided:

1. you actually save the manufactured container images to storage somewhere
2. the container images are designed so that when the container starts it doesn't try to download and install the latest and greatest packages from the internet
3. you keep a durable log of which container version was used for what, when
4. (perhaps) you also keep a durable log of the version of the docker client & host OS & maybe even the host processor architecture that was used for what, when
5. any input data required by the container is also saved to storage somewhere, and kept track of (if it isn't baked into the container image)

Alas, what I actually see quite often in industry, working in teams of software professionals (not researchers), is:

1. people often rig automated scripts to install the _latest_ version of a container image, not a specific pinned one
2. when authoring a Dockerfile, people often don't fix a specific version of the base container image, they implicitly ask for the latest one
3. when authoring a Dockerfile, people often use operating system level or programming language specific dependency managers to install packages, but don't specify pinned versions of those dependencies to install, they instead implicitly ask to install the latest ones
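Concretely, the pinned and unpinned habits look something like this (the image tag and package version here are illustrative assumptions, not recommendations):

```dockerfile
# Unpinned: "latest" means something different every time you rebuild
# FROM ubuntu:latest
# RUN apt-get update && apt-get install -y r-base

# Pinned: a fixed base image tag and a fixed package version
FROM ubuntu:20.04
RUN apt-get update && apt-get install -y r-base-core=3.6.3-2
```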

So unless you have already decided that you value reproducible builds and are diligently thinking about each step and how it could go wrong, adding docker doesn't actually fix the problem. It just wraps the problem in a container without solving it, and creates the new problems of how to reproducibly build container images and how to reproducibly pull and run the same container image version you did way back when.
posted by are-coral-made at 1:55 AM on May 1 [7 favorites]


the other side of the coin of pinning specific versions of all your packages so you can reliably rebuild your software product is that, over time, people find security vulnerabilities in all the old package versions you pinned, and then if you care about the security of your product, you need to upgrade your old package version to a newly published version that has an emergency hotfix applied.

but, if you try to jump from an ancient package version to the new hotfixed one, the new one probably is incompatible with all your code, or some other package you depend on, or some transitive dependency it depends on, so your product doesn't even build, and you're looking at someone spending maybe days of effort to even be able to estimate how long it will take you to actually make it work.

so to reduce the risk of ending up in that situation where you must upgrade but upgrading doesn't work, you've created a new job for someone: regularly and diligently attempt bumping each pinned package version to a newer one and re-test that the product still builds and works, so that if you do need to upgrade to an emergency hotfix of some package dependency, it is a low-risk incremental change.
posted by are-coral-made at 2:33 AM on May 1 [2 favorites]


Mayor West wrote

the author's impulse to go write his own custom library to handle versioned dependencies, without actually referring to any of the well-known prior art, makes me suspect...

geoff. wrote:
there's renv which does exactly what this attempts to do using a more standard methodology, including using lock files.

Uri Simonsohn discusses why he believes renv, checkpoint, and Docker aren't quite right for his design goals in this section of his previous blog post about groundhog, and links to that blog post in the first sentence of the entry clew linked to in the original post ("About a year ago I wrote Colada[95]").

The groundhog website includes a comparison of groundhog with renv.
posted by brainwane at 3:40 AM on May 1 [5 favorites]


biogeo, IIUC you're the only one in this thread who has mentioned personal experience writing R for research purposes so I'm particularly interested in your point of view. Has the Carpentries made a dent at all in better R practices in your field? I see their lessons include "R for Reproducible Scientific Analysis" so I hold out a tiny bit of hope.

This thread has reminded me to catch up a bit on what the fine Reproducible Builds folks are up to so now I know about diffoscope, which "tries to get to the bottom of what makes files or directories different. It will recursively unpack archives of many kinds and transform various binary formats into more human-readable form to compare them. It can compare two tarballs, ISO images, or PDF just as easily." Neat!
posted by brainwane at 4:19 AM on May 1 [2 favorites]


Because it's Sunday and waaaay down the thread, I'll share my colon cancer coal-face re-do the Rnalysis after several years story.
posted by BobTheScientist at 4:29 AM on May 1 [3 favorites]


People routinely cite websites using the date they were accessed.

Yes, but they also note things like the URI, page name, etc. as well.

Uri Simonsohn discusses why he believes renv, checkpoint, and Docker aren't quite right for his design goals in this section of his previous blog post about groundhog, and links to that blog post in the first sentence of the entry clew linked to in the original post ("About a year ago I wrote Colada[95]").

Again, as a working programmer who is currently dealing with the long term ramifications of getting sloppy with references (which has not been fun), I find his arguments uncompelling. Basically, his argument is that there should be a system that makes an R script more robust as a standalone item, while ignoring that the issue is making sure that analysis code for research is well documented. He notes that he doesn't use projects, while ignoring that the whole reason the project paradigm exists in programming is to deal with the specific issues he's struggling with.
posted by NoxAeternum at 5:23 AM on May 1


Yes, but they also note things like the URI, page name, etc. as well.

You still need to name the packages you’re using at some point. As far as URI goes, CRAN is a centralized repository, and if you were using this for GitHub-hosted packages you would still need to provide a URI. More to the point, though, would any of the above be different if you cited by version number instead of by date of access?
posted by en forme de poire at 8:12 AM on May 1


I like this article as someone who's currently learning R and wallowing in dependency hell; I've also had this with virtually all languages.

As said above, it comes down to what is valued: future compatibility, ease of use, or proper development.

Although, like proper technical documentation, I rarely see managers who really care about things like this.
posted by cowlick at 8:35 AM on May 1 [2 favorites]


I have a war story of the one time I was tasked to get some old research code put back together and working. Took about four days of little else but sleep. It was for the High Performance Computing Cluster and did some sort of planetary core magma flow simulation: lots of Fortran and C, and it needed an old Python and bunches of modules that did weird data formats and things. The basic final simple test took many hours to run depending on how many nodes you're using. Turned it into a directory of sources, patches, and a build script that could recreate the whole thing. Everything but the kernel and compilers. All snugly contained in a single directory. Could probably turn it into a Docker or VM sort of thing pretty easily.

I'd probably be good for some sort of research reproducibility something-or-another. But like others have said, it's mostly the systems programmer/developer types who know how to keep things running, or at least easily fixed, or at least just a bit more tightly wrapped up in a nice little box.
posted by zengargoyle at 8:42 AM on May 1 [1 favorite]


The reactions here remind me of the response to the totally absurd discovery that Microsoft Excel (and workalike software) mangle the names of genes like SEPT4 and MARCH1, which are “helpfully” converted into dates and stored as the integer number of days between the end of 1899 and the corresponding date in (probably) the current year.
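In R terms, the mangling works something like this (a sketch; the exact epoch depends on the spreadsheet's date system and its well-known 1900 leap-year quirk):

```r
# What the spreadsheet does to the gene name "SEPT4":
d <- as.Date("2022-09-04")             # silently autocorrected to September 4
as.numeric(d - as.Date("1899-12-30"))  # then stored as a serial day count
```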

That problem was first mentioned in the scientific literature in 2004. Of course the “correct” solution is not to use a business spreadsheet like Excel as a database. Another “correct” solution is for the person doing data entry to be very careful to select “format as text” for the cells in their spreadsheet-slash-database which contain gene names. According to the linked article, a dozen years later (2016), at least a fifth of published supplementary data had such conversion errors. You could turn that around and say that 80% of published data sets were protected from this stupid problem successfully. A university educator might turn an 80% success rate into a C+ or a B–. Okay, but not thrilling.

After a few more years of unsuccessful community outreach, the genomics community has (starting in 2020) taken the step of changing the standard names of these genes, so that the community can use its existing software toolchain without tripping over this common problem.

Is this solution stupid? Yes, absolutely. But the problem is that scientists are people, and computer programmers are people, and people are stupid, and software is stupid, and somehow we make it all kind of work anyway.
posted by fantabulous timewaster at 8:57 AM on May 1 [11 favorites]


If this is such a problem that research scientists can't be bothered to learn some of the basics of package management then maybe the approach should be to create an IDE that helps with usability

Now you have two problems ...

… needed an old Python and bunches of modules

The "Python 2 is deprecated!" thing causes all sorts of problems, and it's only faintly bearable now because everyone's going through it. Give it a few years of link rot, and trying to get old Python running might be a hellscape. Numerics are handled completely differently between Python 2 and 3 (try print(2/3)) and some of the pragmatism for auto-converting text values to numbers has been lost in 3.
posted by scruss at 9:12 AM on May 1 [1 favorite]


Wow I had not heard of the R programming language.

I don't understand why we have reinvented make so many times. It is like software carcinisation.
posted by NoThisIsPatrick at 9:17 AM on May 1 [6 favorites]


Yes, I think in these discussions programmers working in industry are often not clear on the very different set of constraints under which academic researchers are operating.

As far as documentation goes, scientists are actually big adopters of tools that encourage a literate style, like Rmarkdown and Jupyter notebooks. I think that illustrates the value of carpentry-style tools that allow people to adapt their existing workflows to include better reproducibility and documentation practices.

I think a lightweight tweak that gets a working scientist most of the benefits of a more complex solution is much more likely to be adopted, and will be more likely to have a major positive impact, than insisting academic scientists adopt methods wholesale from a particular industry (especially if in that industry, people actually do very different types of work under very different conditions).
posted by en forme de poire at 9:37 AM on May 1 [11 favorites]


The groundhog website includes a comparison of groundhog with renv.

Unfortunately I noticed this right after I posted but didn't want to put a wall of comments up.

Even running under the fallacious assumption that a researcher who hasn't bothered to understand package management can create a good Dockerfile, in the time range the article mentions (5 years!) that Dockerfile will likely itself be deprecated. Put simply, software needs to be actively maintained or it will rot. And that's before we even introduce issues with new chipsets like the M1. The more I think about it, the more I think a solo programmer would be better off capturing a giant 50GB VM for reference, which itself faces being outdated, but perhaps at a slower rate. Or, more optimistically, if the libraries researchers use solve similar problems, how much would it cost to create a $500 million fund to just maintain a standard library of sorts? Like the poster above who somewhat jokingly referred to TensorFlow.

than insisting academic scientists adopt methods wholesale from a particular industry (especially if in that industry, people actually do very different types of work under very different conditions).

I agree, and I don't think it is limited to researchers. I think it is a problem programmers who have not worked on teams or long-term projects fall into. It is a lesson you don't really take seriously or fully understand until you've lived it, as are-coral-made pointed out.

I think what the industry programmers are pointing out is that the R researchers are facing a very, very difficult problem that most professional programmers actually don't run into: rerunning a specific version of software that's years old. The more I think of it, the more I think just capturing a giant snapshot of the development environment is best, as even the most disciplined programmer can't foresee future changes in the industry, even with the best documentation.
posted by geoff. at 9:45 AM on May 1 [3 favorites]


The biggest issue with articles like this is that they frame reproducibility as an R issue or a Python issue or a Java issue, instead of one that encompasses the whole development environment.

Suppose that you meticulously track the versions of all your R package dependencies (maybe by keeping the original source code under version control) and you also pin your environment at a specific version of the R distribution. While this is a step in the right direction, it ignores many sources of variability like: a) the versions of all libraries that R itself depends on; b) the configuration options that enable/disable certain R features; c) the version of the C compiler used to build R; etc. All of these can influence the results of a complex computation.
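You can see some of this hidden state from inside R itself with sessionInfo(); the values shown will of course differ per machine, which is the point:

```r
# A pinned package list says nothing about the layers underneath it.
# sessionInfo() exposes some of them: the R build, the platform, and
# (on R >= 3.4) the BLAS/LAPACK libraries actually linked in.
si <- sessionInfo()
cat(si$R.version$version.string, "\n")  # which R build produced the results
cat(si$platform, "\n")                  # OS / architecture triplet
cat(si$BLAS, "\n")                      # linear-algebra backend in use
cat(si$LAPACK, "\n")                    # a different one can change numerics
```

None of those values appear in a package inventory, yet any of them can shift the results of a long-running computation.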

Basically, his argument is that there should be a system that makes an R script more robust as a standalone item, while ignoring that the issue is making sure that analysis code for research is well documented.

IMO high-quality documentation is necessary but not sufficient, due to the problem illustrated above: a meaningful level of reproducibility is very hard to achieve by hand. Systems like Guix and Nix attempt to address this issue by automatically tracking the version numbers and fingerprints of source files used to build every file in the development environment, and by keeping this metadata under source control.

At some point you need actual binaries to bootstrap the build process, but there's ongoing effort to make this bootstrap seed smaller. This is in contrast with systems like Docker which actively encourage people to depend on huge pre-built blobs instead of the original, verifiable source code.

One challenge for Guix and Nix is improving ease-of-use and broadening support to enough build systems that mainstream researchers will see a clear benefit. But articles like this one suggest that they also have an awareness problem: many developers don't fully appreciate what reproducibility means, or why some tools address the issue more comprehensively than others.
posted by SaurianNotSaurian at 9:59 AM on May 1 [2 favorites]


brainwane, the Carpentries have definitely had some impact in my field(s), though not yet as much as I think would be ideal. I have at least one colleague who's really into their approach, and encourages his students to take Carpentry classes. Others have probably never heard of them. There tends to be quite a spectrum in the research domains I intersect with.

On the one extreme, people working on the more theoretical side of computational neuroscience tend to be comparatively sophisticated in their use of software tools, and recognize that a lot of what they're doing is effectively amateur software engineering and therefore they can benefit by following advice from the pros. These are the folks most likely to be aware of Carpentries.

In the middle, and more common, are experimentalists who learn enough Matlab, R, or Python (generally only one of the above) to do the work that they need to do. Most of these folks feel they've already invested a significant effort in learning the tools they use and are only interested in changing if there's a significant demonstrated need.

On the other extreme, and more common than I think any of us would like to admit, are people who "program in Excel." I've seen things you people wouldn't believe... Jerry-rigged databases corrupted on shared drive access... I watched text data converted to dates near the Tannhauser Gate. All those moments will be lost in time, like tears in rain... Time to die.
posted by biogeo at 10:09 AM on May 1 [12 favorites]


I also write reproducible research code in R. I think this is a neat package because it's very simple and can go a long way toward fixing code that is already broken.

This is the first time I have heard of the Carpentries, fwiw.
posted by zug at 10:32 AM on May 1 [5 favorites]


I'm not too surprised to see only sparse discussion of unit testing as a means to improve reproducibility: tests are boring but a mainstay for sanity.

testthat seems popular (https://cloud.r-project.org/web/packages/testthat/index.html), but I wonder what percentage of popular R packages have tests of their own.

The testing chapter looks pretty good: https://r-pkgs.org/tests.html
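For anyone who hasn't used it, a minimal sketch of the testthat style (the standardize function here is a made-up stand-in for a real analysis step):

```r
library(testthat)

# Hypothetical helper standing in for a real analysis function.
standardize <- function(x) (x - mean(x)) / sd(x)

test_that("standardize centers and scales its input", {
  z <- standardize(c(1, 2, 3, 4, 5))
  expect_equal(mean(z), 0)  # centered (within floating-point tolerance)
  expect_equal(sd(z), 1)    # unit variance
  expect_length(z, 5)       # same shape as the input
})
```

Even a handful of checks like this make it much easier to notice when a package upgrade has quietly changed what your analysis code does.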
posted by the Real Dan at 10:42 AM on May 1 [4 favorites]


This is fascinating for me, writing in a corporate environment where reproducibility is only important for the current fiscal year, or maybe even quarter. And if the report breaks, you fix it by updating the code; the admins are too busy keeping the database growing correctly to worry about updating the server-side packages or maintaining older instances after an update.

In writing this I realize I've picked up some really bad habits as a result of corporate culture. Especially because I assume there are lots of people like me, who bother to update packages only when needed or when moving from one version of R to the next. In this way, groundhog would not be helpful, as I'm sure that on my old personal laptop the versions of basic packages (the tidyverse and the like) date to the installation of R 4.2, which was a reluctant upgrade from version 3. Then again, this is for personal projects, nothing that is published or really seen by anyone but my wife or dog.

This is all to say that this is stuff I've never thought about and realize that I really should and it's a little embarrassing that this was never talked about while I was working on an MS in data science.
posted by Hactar at 11:30 AM on May 1 [3 favorites]


This is a really cool package and definitely incredibly useful for people like me. I both use R in my own research and teach the use of R for statistical analysis to biology and environmental science undergraduates with no programming experience. I have interacted with a number of Carpentries folks and used some of their resources in teaching. Exposure to the Carpentries' work has definitely improved my own organization of my coding as well as the way I teach R. I've gotten much better at writing Carpentries-style scripts for my students to modify as needed rather than expecting them to write their own code from scratch, and it would be very easy to include groundhog in these scripts to make sure the scripts I wrote for my Spring 2022 Ecology class will still work for my Fall 2022 Ecology class.

I think what folks may be missing about groundhog is how many packages an R user is often running in a single script, how often especially those of us who teach natural sciences, not programming, need to run an old bit of code, and how often our code is simply a single statistical test (and its accompanying graph, tests for normal residuals, etc.), not an application. Yes, I could go through a long list of package versions and revert my entire computer back to those old versions and the old version of R. Or, my understanding is, groundhog could do all of that for me in a single step so I can run a statistical test and get on with my life.

That means with groundhog I can for example 1) compare new results to old results confident that nothing has changed other than my data, 2) run old code without getting 5000 errors from broken or lost packages, 3) help a student run a script they found in a paper for something that I don't know how to do but that they need to do for a single analysis for a single class project, or 4) run a script that uses an old package no longer available on CRAN until I figure out a substitute for that package.
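For readers who haven't seen it, that "single step" looks roughly like this (my understanding of groundhog's interface; the package names and date are just examples):

```r
# Instead of plain library(pkg), groundhog loads packages as they
# existed on CRAN on a given date, installing those versions into its
# own library folder without touching your regular installation.
library(groundhog)
groundhog.library("ggplot2", "2022-05-01")
groundhog.library("dplyr",   "2022-05-01")

# From here on, the script runs against the 2022-05-01 versions,
# whether it's executed this spring or next fall.
```

The date travels with the script, so whoever reruns it later gets the same package versions without any manual archaeology.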
posted by hydropsyche at 11:47 AM on May 1 [10 favorites]


it's a little embarrassing that this was never talked about while I was working on an MS in data science.

Huh! when? I mean, I've talked to a couple computer science students who didn't know about source code control, but it turned out they were the kind of computer science students who use the good chalk instead of computers.

the Real Dan, I was thinking of test cases too. Even if each paper was represented by an actual physical read-only computer and OS and code, stored forever and available to dump old or new data through -- the more important question is more about how correct the code is, not how unchanged it is. And test cases make it a lot easier to reason through what the code is actually doing. (Which is infinitely harder than even rerunning it from an image, so it isn't the problem we solve first.) That's a great-looking chapter on testing R you link.

NIST does truly glorious software testing, and I don't know how professions can keep accepting analysis-in-Excel given how clearly NIST has said that they don't test Excel as statistical software because it isn't good enough to start with. Although I don't find that on the NIST page now; instead there are discussions of Excel-front-end-to-open-source-analysis bridges. Maybe the bridges protect the original data from Excel rewrites.
posted by clew at 12:00 PM on May 1


I think another thing to keep in mind when we're talking about reproducibility in scientific computing is that while it's important, what scientists really care about is replication. If I make a claim that the analysis I ran provides evidence for one statistical model (and thus scientific model) versus another, reproduction means that someone else can follow the exact steps that I followed to get the same answer, while replication means that someone else following the same methodology that I followed will get the same answer.

Reproducibility is important when (a) you want to know exactly what steps someone followed in order to understand the overall methodology, (b) you have questions about the correctness of those steps, or (c) you want to re-use someone else's work in a different context. Those things are all certainly important. But suppose I run some analysis that results in me deciding to, say, reject a null hypothesis on the basis of a significance test, and it turns out that someone else running the same code on a machine with a slightly different floating-point architecture gets a different answer, due to differences in round-off that get amplified during the analysis procedure. As long as a machine with the appropriate floating-point architecture is available, I could write code that is perfectly reproducible. But it fails to replicate when changing a variable that ought not to have mattered (the architecture of the machine running the analysis). This means my result is not robust and thus is probably not true.

There is some sense in which being overly concerned about reproducibility is focusing on the wrong thing. (Though to be clear, I think the scientific community as a whole is nowhere near that point right now, and more focus on reproducibility is really important.) As long as the code performing the analyses we want is correct in the sense of accurately representing the analysis methods we intend them to, having different researchers running the same analysis on different machines is in a sense a form of replication, that helps confirm that the results we report from our analyses are robust to the myriad of variables present in a computer that we hope shouldn't matter.
posted by biogeo at 12:21 PM on May 1 [4 favorites]


And to clarify my point, "different machines" also means "machines with the same basic architecture but different operating systems, installed software environments, etc.," not just machines with different underlying hardware architecture (which is only a tiny fraction of the cases the actually matter, I just picked that example as a stark case).
posted by biogeo at 12:41 PM on May 1


Clew, we talked about source control in terms of github and teaching your own code, but if anyone mentioned being sure to track which version of packages you were using, I missed it.
posted by Hactar at 12:51 PM on May 1 [1 favorite]


My CS degree was in 2012, and we only had one class where they lumped together a lot of the issues larger software teams face. Surprise! It was the only class that mentioned software version control.

Really, all the homework and small projects we did (as individuals) could be called "Toy projects/programs" and we didn't have a lot of the issues that teams had. We only worked in Teams for a couple of classes.

They covered (in different years) programming in C (briefly), C++ (about 2 years), and Java (the last two years). I always hated the beginning of those, because that's when they would throw some ungodly mashup of a setup document at us that was supposed to create a working environment for these programs. That was always the point where "just do it" and other forms of "setup magic" were commonplace. I'd been working in industry for quite a while, and it was very frustrating to see the various ad hoc solutions for a working programming environment that were common.

But from working in the field (and reading) I knew this was a *hard* part that they were skipping over.
posted by aleph at 1:09 PM on May 1


aleph, though: Computer Science isn't Software Engineering.

... which is a reductive and glib response that should be unsurprising when you've read through the licence terms, which promise at best no warranty of merchantability or fitness for any purpose.

I'm professionally ashamed of that.

It's at least good that the "reproducible build" pattern is coming to science done in R, using version control, file hashes and interface contracts to be sure that code at stage X using packages Y with cryptographic hash Z can run the functions your script calls for.
posted by k3ninho at 3:18 PM on May 1


Of course the “correct” solution is not to use a business spreadsheet like Excel as a database.

Excel isn’t a spreadsheet. Excel is a robust and full-featured VM running a powerful Smalltalk-inspired IDE whose display layer happens to resemble a spreadsheet. We don’t describe it like that because if we did we’d have to admit that most of the most important programming in the world is done by underpaid women in pink collar jobs.

For what it’s worth, I like this groundhog idea a lot, mostly because by virtue of its simplicity it is extremely accessible, and by virtue of the fact that it doesn’t insist that it be the only solution in play, gives non-specialists a way to add defence-in-depth resiliency to their code base.

I mean, yes, pin your dependencies, also add them to your own version control if you can. Then also do this.

There is not one right answer in the fight against entropy; redundancy, repairability and constant vigilance are all necessary, and no one thing is sufficient.
posted by mhoye at 4:08 PM on May 1 [12 favorites]


I'm professionally ashamed of that.

This field is less than a century old, lives entirely at the intersection of math, engineering, art and superstition, and still gets amazing things done.

I mean, bridges still fall over now and then, and the civil engineering people have something like a 3000 year head start on us. I think we’re doing ok. Could be doing better, though, programmers still believe a lot of nonsense. But ok.
posted by mhoye at 4:17 PM on May 1 [1 favorite]


scruss, I don't think it was a Python 3/2 thing like that. Fuzzy memory, but probably RHEL 4/5 and probably configuration options or patching to support Myrinet; the system Python just wasn't an option. I'd have used it if it was.

University-wise, it's probably worth giving an enterprise-level developer X% of their time just to manage researchers' code habits. My job at a university for almost two decades was to manage and keep the infrastructure up in the service of letting researchers and such do their thing.

A chunk more than that one-off "get borrowed for a thing" in extra help, or even an advisory role... Eh, probably management. Wonder if there's a job description for that, along the lines of "we run your resources and the internet and have consultants and even a couple of people to make your code story sane". Like experts (relatively) in that area, the way you are experts in your area. Bad organization/management. :)
posted by zengargoyle at 4:27 PM on May 1


> Wonder if there's a job description for that along the lines of "we run your resources and the internet and have consultants and even a couple of people to make your code story sane"

in some organisations that provide services used by a lot of business-oriented or product-oriented software development teams, this part of the org might be called "SRE" or "platform"
posted by are-coral-made at 10:23 PM on May 1


Thanks for posting this; at least now I'm aware of this "groundhog" project before I come across it years after it has been abandoned!

I work in the informatics group of an organization with a population of geneticists/epidemiologists where all of the postdocs/fellows/PIs have gleefully and wildly abandoned SAS for R over the course of the last few years. To some extent I have sympathy for this move, but as someone coming from the Java/Maven ecosystem into the R ecosystem I've watched this transition happen with no small amount of horror. (Derail: of course Maven has warts, but at least it's a coherent effort at providing some kind of portable project management solution that has real utility.)

I see two main problems. First, these scientists, at nearly every level of their education, have been thrust into an environment that requires some kind of computing/programming literacy without any effort on the part of their parent institution to provide that education. It's functionally impossible to do any kind of modern genetic/epidemiological research without computational analysis, and yet the level of computer literacy/competency that these researchers have is often shockingly low. So the researcher ends up just getting tossed into the deep end of the pool without knowing how to swim, and all they can focus on is the pure immediacy of the analysis task in front of them. Once the paper is published, they never have to return to the code.

Second - there's a kind of a willful denial of the need to adopt the most basic software development techniques that seems to stem from the attitude of "we're not software engineers, we're data scientists!" I encounter this time and time again, and not just from members of my organization, but from what I see across the relevant parts of the R blogosphere at large. Getting the data analysts to do the very basics of versioning, source code management (i.e., git), and unit testing has been an uphill struggle to say the least. They simply just don't (or can't, or won't) see the utility of using these techniques, even when they have specific complaints about specific issues that would be trivially solved by adopting these techniques. If there's no R package for their problem, there's no solution.

It will be really interesting to see if this "groundhog" approach - which imo is a very noble but fragile approach to solving R's reproducibility problem (which also imo is very real) - will gain any traction, or at least do something to raise awareness of this issue. I hope it does, but I fear it won't.

(edit: grammar)
posted by the painkiller at 7:27 AM on May 2 [2 favorites]


I think this is neat and could see it being useful! I dunno about the rest of y'all but even within a context of trying to do most things right, I end up trying to resurrect old code (both other people's and my own) pretty regularly, and this could be a handy tool to make that less painful.
posted by quaking fajita at 8:32 AM on May 2 [2 favorites]


the painkiller, my sympathies! You are SO not alone in your frustration. Carpentries workshops (including "Data Carpentry" for data scientists) are particularly targeted at orgs like yours, are really low-cost to organize, and "Analysis of The Carpentries Long-Term Surveys (April 2020)" shows that 73-78% of workshop participants say that the tools they learned in Carpentries workshops "are improving my ability to manage data / are improving my ability to analyze data / are improving my overall efficiency" (see p. 8). I'm having trouble scrounging up the stat right now, but I've seen pre- and post-assessment data saying that doing this 2-day workshop saves a researcher a day a week for the rest of their working life. So if you haven't already, please consider pinging the Carpentries folks -- they have been in your shoes and likely have tips for how to sneakily appeal to your colleagues and get them to want to learn this stuff. And, regardless, my condolences!!
posted by brainwane at 8:44 AM on May 2 [2 favorites]


My lab does a lot of analysis in R, and I worry about reproducibility (very much so). We create clinical tools, and are required to be able to audit issues with very high precision. I think this proposal is using the wrong tools for the job. Containerization is necessary because of changes in R itself (although they rarely break backward compatibility in core), system libraries, python libraries, and external software (e.g. torch, TF). Once you're containerizing, it's just as easy to use a specific CRAN snapshot as to enumerate all the package versions. Unfortunately, I don't know of an equivalent date snapshot for PyPI, and have to generate the pinned list with pip-tools.
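For anyone wanting to try the snapshot approach, it's a one-liner in the container's R setup (the dated mirror URL below is illustrative; substitute whichever snapshot service your group relies on):

```r
# Point the session at a dated CRAN snapshot; every subsequent
# install.packages() call then resolves against the archive as it
# existed on that date, not against today's CRAN.
options(repos = c(CRAN = "https://packagemanager.posit.co/cran/2022-05-01"))
install.packages("data.table")  # the version current on 2022-05-01
```

Baking that options() line into the image means everyone building from the same Dockerfile gets the same package versions for free.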

What I have more-or-less mandated as SOP is
- Use a dockerfile / singularity with fixed versions of everything
- Use a date-specific cran-snapshot or pull fixed commits from GH for GH packages
- Use pip-tools or conda-lock to pin python libraries
- Analysis code and dockerfile / singularity in a git repo
- Container image gets uploaded
- Include in output the image id, session info and package list from R, the pip package list, the code commit id.
- Try to include in the output a signature of the input files, always include the date. A lot of our data is protected and ought not be stored in repos, but the central storage has date-tagged backups.
- Try to include substantive automatic output in an rmarkdown or notebook file
- Tag any work product with the commit ID

This means that for any analysis
- It's obvious exactly what code and data created a product
- The OS, libraries, all packages, and code can be reconstituted
- The data can be reconstituted in the rare event that it's been changed
- You don't have to worry about different versions of anything causing conflicts

This doesn't have to be fully followed while an analysis / product is in its early stages, but it isn't that much work once you've done it once. Our HPC wants everything in a container anyway, and for analysis-centered people just getting used to containers is most of the overhead. It is still possible for hardware-specific reproducibility failures to occur (this has mostly been GPU problems); hypervisor changes or sandboxing failures could happen also, but these are uncommon edge cases.
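The "include in output" step above can be as little as a few lines at the end of each analysis script. A sketch (paths are illustrative, and the last step assumes the script runs from inside a git checkout):

```r
# Stamp the output directory with the provenance the SOP calls for.
writeLines(capture.output(sessionInfo()), "output/session_info.txt")

# Full package inventory, name + version:
pkgs <- as.data.frame(installed.packages()[, c("Package", "Version")])
write.csv(pkgs, "output/package_versions.csv", row.names = FALSE)

# Code commit id, so any work product can be traced back to the code:
writeLines(system("git rev-parse HEAD", intern = TRUE),
           "output/commit_id.txt")
```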
posted by a robot made out of meat at 11:31 AM on May 2 [6 favorites]


That is so far beyond the consciousness, let alone the ability, of a bunch of (R1! Eminent! But mostly not human biology) labs I’ve kibitzed with that I am wildly curious how you got the will and the funding. Does "clinical tools" mean medical funding? Are there outside bodies with opinions on how you achieve reproducibility and audit?

Also, you don’t mention test or V&V anywhere, but I assume you’ve got plenty - how do they propagate between containers?
posted by clew at 12:57 PM on May 3


Yes, we make AI tools for clinical groups, and the clinical operations people are very conscious of the potential for regulators (or litigants) to demand to know where a tool's decision came from. This level of detail is compatible with FDA's guidance on software as a medical device. We are exempt from having FDA IND filings, but it's more or less what they would need. There isn't any real funding needed for this process; it takes a little time to learn but saves an enormous amount of time when you inevitably have to merge / roll back code.

It was an evolution for me. Many years ago, doing bioinformatics work, making scripts into subversion repositories was necessary to track the analysis and code base going with versions of papers, especially since I had a tool shared with collaborators that I needed to track. It helped a lot that I came from hard sciences where latex (which fits really well with source control) is standard for documentation. SVN was eventually replaced with git and pure latex with rmarkdown, but the utility of being able to roll back code and documents was clear 15+ years ago. Recording package versions and reinstalling was something I ran into pretty early, since my scripts would often not run after sitting for a while.

I was aware of the reproducible-research community pushing containerization, but didn't fall in line for a long time. Our high-performance-computing people were really the impetus for adopting docker-based workflows. After yet another incident where an analyst lost a laptop with fully identified participant information, the institution put down the hammer: protected health information has to live in storage meeting very high level access control requirements. Very few labs would be able to comply other than by using the storage provided by the HPC group. From there, using the HPC compute services was dramatically faster for I/O-bound large-dataset analysis. Our HPC center requires containerized analysis. It took probably a full time 2 weeks to learn the docker and cluster tools, but the power for isolating analysis was obvious.

As is always the case, getting people facile with this toolset is easier with junior people who aren't really intimidated by spending a few weeks onboarding to a seemingly arbitrary workflow. It's also much more natural for people coming from python, where virtual environments are standard and notebooks are interactive but enforce running as a complete script and outputting documentation. I try not to look like a dictator by blaming the HPC people. We have great support from a collaboration of our library and HPC center with live workshops and references on git, docker / singularity, and jupyter notebooks.

I haven't figured out what the best way to handle validation and unit testing is. I encourage these, but they are very specific to what a particular analysis or code base is trying to do.
posted by a robot made out of meat at 1:01 AM on May 4 [3 favorites]


a robot made out of meat: it's wonderful to hear about the practices in place at your organization!
There isn't any real funding needed for this process; it takes a little time to learn but saves an enormous amount of time
I get what you're saying, about how you haven't experienced any need to set up separate budgets for any of the training or ongoing process maintenance, but what I do read in your description is that you have institutional support (stemming from the protected health information requirements + the HPC group's pre-existing process requirements) that in some other organizations would run aground amidst objections about resource constraints.
It took probably a full time 2 weeks to learn the docker and cluster tools

junior people ... spending a few weeks onboarding

We have great support from a collaboration of our library and HPC center with live workshops and references on git, docker / singularity, and jupyter notebooks
You and your colleagues have okayed spending resources on all of that, in particular the workshops and references. Which is great! In the spirit of appreciative inquiry, I'm not saying "waaah no one else can do that [gives up]" but rather "it's really useful to understand a case study where compliance requirements can be parlayed into genuinely better process" (because I am guessing there are a ton of orgs out there where similar compliance requirements instead triggered a bunch of useless paperwork stapled onto the same cumbersome toil-laden patchwork processes).
posted by brainwane at 7:34 AM on May 4


I wish there were a graceful way to deprecate R packages; just because you write a package in grad school doesn't mean you want to support it for the rest of your life, especially not with CRAN's ever-changing policies and other dependencies shifting underneath you.

It is not so uncommon for CRAN to send you an email that says "fix this in 48 hours or we will take down your package" when it isn't even your package that's changed, but someone found a memory leak in it on the Solaris build of the development version of R, so down it goes.
posted by nfultz at 10:34 PM on May 4 [3 favorites]


brainwane, I 100% appreciate that significant help (esp in the form of training) makes this possible. I don't want to oversell it either; many people I work with also hand me a pile of scripts that may or may not run. I hope that as our collective library of guides and improved tools make this easier, it will diffuse out. I also sometimes feel like the bad player because, despite all this, I usually cannot share underlying data. There is a team here that uses GAN-simulated data or differential-privacy censored + imputed data to help make full data and code releases, but I haven't learned those techniques myself. I also appreciate that "make highly reproducible analysis" is similar to "write well-documented and maintainable code" in that there isn't a line item on the budget for it, but it certainly takes more money.

I've both had a package-from-grad-school-that-isn't-worth-maintaining and been disappointed when a dependency went away, literally as you say because of a problem on Solaris (Dear R Core: there are 2 Solaris users. Just stop.). Other than flatpak-style self-contained packages, I don't know the solution.
posted by a robot made out of meat at 12:47 AM on May 5 [1 favorite]


In the medium-old days when we shipped actual shrinkwrapped software, one of the cries for help/absolution of a developer was "It works on my machine!" Producing software that would run as expected on any machine of a given spec meant tracking down subtle differences between instances, and the ways the software was relying on something it shouldn't. Those bugs often arose from errors in our understanding of what the software, including the OS, did.

This really thorough virtualization is a truly impressive feat of being able to always run on the original developer's machine, but that skips debugging a class of errors -- and some of them were significant! (We had a lot of trouble with flavors of pseudorandomness, which seems like it would hink up a lot of statistical analysis.)
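(As a concrete instance of the pseudorandomness problem: R 3.6.0 changed the default algorithm behind `sample()`, so the same seed can yield different draws on different R versions. A minimal sketch; the specific calls below are standard base R, not from the linked piece:)

```r
# Same seed, potentially different results across R versions:
set.seed(42)
sample(10, 3)   # uses the "Rejection" sampler on R >= 3.6.0

# Opting back into the pre-3.6 behavior lets old analyses replicate:
suppressWarnings(RNGkind(sample.kind = "Rounding"))
set.seed(42)
sample(10, 3)   # matches what R < 3.6.0 would have produced

RNGkind(sample.kind = "Rejection")  # restore the current default
```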

So I'm pretty weirded out by this paragraph: "I haven't figured out what the best way to handle validation and unit testing are. I encourage these, but they are very specific to what a particular analysis or code base is trying to do." I... ah... isn't it much worse to have analysis software run without errors and emit results if it hasn't been V&V'd against specifically what the analysis or code base is meant to do?
posted by clew at 5:06 PM on May 15


clew: the issue is that they're all different, so I don't know how to write it into a procedure for everyone to follow. Most of the time I don't know a priori what the right answer is, or even exactly the right analysis to do. I encourage periodic checks and checks after "major" steps, but I can't pin down how to properly define those things. This can include checking that well-known associations replicate in the data, that marginal distributions are appropriate, and that simple-but-less-valid approaches aren't too far off. Where there are toy datasets that are well understood, I'll use them, but it isn't always the case that there is one handy and fit for purpose. Sometimes simulated data is the right answer, but simulated data rarely pokes the corner cases or data validation problems, precisely because you didn't think of them.
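(To make that concrete, here's a hypothetical sketch of those kinds of checks using the built-in `mtcars` data; the thresholds are illustrative, not anything from the thread:)

```r
data(mtcars)

# A well-known association should replicate: heavier cars get worse mileage.
stopifnot(cor(mtcars$wt, mtcars$mpg) < 0)

# Marginal distributions should be plausible: no nonpositive weights or mileages.
stopifnot(all(mtcars$wt > 0), all(mtcars$mpg > 0))

# A simple-but-less-valid model shouldn't be wildly far from a richer one.
simple <- coef(lm(mpg ~ wt, data = mtcars))["wt"]
richer <- coef(lm(mpg ~ wt + hp, data = mtcars))["wt"]
stopifnot(abs(simple - richer) < 5)  # loose, illustrative tolerance
```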
posted by a robot made out of meat at 7:16 AM on May 16



