computational notebooks
April 5, 2018 7:25 AM

The Scientific Paper Is Obsolete "This is, of course, the whole problem of scientific communication in a nutshell: Scientific results today are as often as not found with the help of computers. That’s because the ideas are complex, dynamic, hard to grab ahold of in your mind’s eye. And yet by far the most popular tool we have for communicating these results is the PDF—literally a simulation of a piece of paper. Maybe we can do better." [via]
posted by dhruva (26 comments total) 23 users marked this as a favorite
See also observable for easily creating interactive notebooks online.
posted by Jpfed at 7:49 AM on April 5, 2018 [4 favorites]

There are better ways of communicating some information than a PDF. Publishing a database, a 3D model, or a video, for example. However, the entire citation system that is used in research is based on the cited information being in a fixed form that can be found forever. Databases, 3D models, and video tend to be transient. In 100 years, the web site they are posted to likely will no longer be here. As a researcher, I routinely pull up 100 year old papers, then want to see the papers in their references.

Furthermore, as information is added to a database, what it says may change. If I read a paper that bases its results on a database, I should be able to see the database that the paper is based on, not the current one. Most databases are not set up to see archival information as of a given date.

Maybe these things can be fixed. But it isn't going to happen in the short term. The Internet Archive tries to archive the web, but who knows if that will even be around, and it can't archive some things.
posted by Xoc at 7:50 AM on April 5, 2018 [14 favorites]
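
Xoc's point about reading a database "as of a given date" can be made concrete with a small sketch. This is only an illustration under stated assumptions: the `VersionedTable` class, the `sample_count` key, and the values are all hypothetical, and a real archive would need durable storage rather than an in-memory dictionary. The idea is that nothing is ever overwritten, so a paper can cite the state of the data on the day it was written.

```python
from bisect import bisect_right
from datetime import date


class VersionedTable:
    """Append-only store: every write is kept, so old states stay readable."""

    def __init__(self):
        # key -> list of (effective_date, value), kept sorted by date
        self._history = {}

    def put(self, key, value, effective):
        """Record a new value for `key` that takes effect on `effective`."""
        versions = self._history.setdefault(key, [])
        versions.append((effective, value))
        versions.sort(key=lambda pair: pair[0])

    def get_as_of(self, key, when):
        """Return the value current on `when`, or None if nothing was recorded yet."""
        versions = self._history.get(key, [])
        dates = [d for d, _ in versions]
        idx = bisect_right(dates, when)
        return versions[idx - 1][1] if idx else None


# Made-up example data: the same field is revised three years later.
table = VersionedTable()
table.put("sample_count", 120, date(2015, 1, 1))
table.put("sample_count", 135, date(2018, 4, 5))

print(table.get_as_of("sample_count", date(2016, 6, 1)))  # 120
```

A reader checking a 2016 paper would query with the paper's date and get the value its authors saw, not the later revision.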

I'm working in science and feel Bret Victor's Principle that “creators need an immediate connection to what they create” so acutely. The levels of abstraction between yourself and the systems you're interrogating are so many. Seeing a question answered takes so long. And it feels like the base assumptions behind scientific communication really don't respect this.
posted by little onion at 7:55 AM on April 5, 2018

Maybe these things can be fixed.
All of these things can be fixed, you just need a bigger database.
posted by b1tr0t at 7:55 AM on April 5, 2018 [1 favorite]

“At this point, nobody in their sane mind challenges the fact that the praxis of scientific research is under major upheaval,” Pérez, the creator of Jupyter, wrote in a blog post in 2013. As science becomes more about computation, the skills required to be a good scientist become increasingly attractive in industry. Universities lose their best people to start-ups, to Google and Microsoft. “I have seen many talented colleagues leave academia in frustration over the last decade,” he wrote, “and I can’t think of a single one who wasn’t happier years later.”
Is me.
posted by rlk at 7:58 AM on April 5, 2018

I don't know if I'd say the scientific paper is obsolete, or if it needs to evolve. Thirty years ago, before every journal was online, the major constraint on how one put papers together was the need to prep them to be printed for a journal's print issue. There are still plenty of journals that print hardcopy, but all of them have gone online at this point, and I cannot think of a time in graduate school or as a postdoc where I've had to seek out a recent paper in hardcopy. So, a modern paper should, probably, embrace the fact that most if not all of the readership is looking at it electronically, and use it to their advantage.

In my field (vision science) there's quite a bit of this already: journals encourage you to submit GIFs or video of your stimuli, which makes it much easier to actually understand what the experimenters did. There's also a simultaneous push that I'm seeing for open science and open data (e.g., make your experimental code, your analytic code and your anonymized datasets available), coupled with a push for general registration of studies before they're begun. Which is something of a long way of saying that a lot of the ideas put forward in the article are working their way into the field, which promises to improve the work we do.

I don't think I buy the "replace papers with software notebooks" ideas, though, because it seems like it would make it harder to integrate anyone's particular results with the larger body of work. I'd rather see a "here's the peer-reviewed paper, here's the code and data repository" approach, and if you've got an issue with someone's work, you do what we've done for decades (if not really centuries) and you write a paper rebutting them.

Implicit in the ideas in the article is also a general distaste for journals as they currently exist, and I've got a pretty serious bone to pick with that. I have no interest in having publication become "just stick it up on ArXiv, and quality will win out" because there are no authors who do not benefit from multiple sets of eyes on a work. Peer review isn't perfect (and should probably be a paid component of research, given the work involved in doing it well), but it's useful.

Similarly, I don't think we move science forward by, say, giving every paper a comment section. Disagreement is absolutely part of science, but there are existing mechanisms for doing it well and with thought: write a response, and get it through review, or run your own experiment to refute work you disagree with, write it up, and get it through review.

On preview, Xoc makes an excellent point: the best argument we have for journals is that they are citable and relatively stable. I cite papers and books in my work going back to the 19th century with some regularity, but I'd be very twitchy about citing someone's blogpost visualization (or software notebook hosted by company XYZ) in a paper. I can imagine a repository for such things similar to NIH PubMed (or where is headed), but you'd want it to be about as stable as things get before it could be trusted the way a library's holdings are.
posted by Making You Bored For Science at 8:03 AM on April 5, 2018 [11 favorites]

This article is right up my alley. I "got" Jupyter Notebooks a couple of years ago and use them a lot for exploratory programming. They're so fun! I love the idea of scientists publishing notebooks as part of their academic CV. I'm not sure Jupyter is quite there to support it; it's not easy to package up all the third party libraries and data files you use along with the notebook. But the readable / rerunnable code part is amazingly great.

Seconding the plug for Observable. It's brand new, and to my eye is taking the best part of Jupyter Notebooks and combining it with the efficient dependency execution of spreadsheet applications. All in a browser native Javascript environment. Built by Metafilter's own Tom MacWright and Mike Bostock, the genius behind D3.js.
posted by Nelson at 8:11 AM on April 5, 2018 [5 favorites]
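
The "dependency execution of spreadsheet applications" Nelson mentions can be sketched in a few lines. This is not how Observable is implemented; it is a minimal Python illustration, with hypothetical cell names, of evaluating cells in dependency order rather than in the order they appear on the page.

```python
import math
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each cell lists the cells it reads, plus a function that computes its value.
# The cell names and formulas are invented for this example.
cells = {
    "report": (("area",), lambda area: f"area = {area:.2f}"),
    "area": (("radius",), lambda radius: math.pi * radius ** 2),
    "radius": ((), lambda: 2.0),
}


def evaluate(cells):
    """Compute every cell after the cells it depends on, spreadsheet-style."""
    graph = {name: set(deps) for name, (deps, _) in cells.items()}
    values = {}
    for name in TopologicalSorter(graph).static_order():
        deps, compute = cells[name]
        values[name] = compute(*(values[d] for d in deps))
    return values


print(evaluate(cells)["report"])  # area = 12.57
```

Note that "report" is defined first but computed last. This sketch recomputes everything on each call; a reactive notebook would presumably re-run only the cells downstream of a change.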

The only way to ensure that the results of a study are accurately represented is to make those results immutable. The printed page is perfect for that, and the (uneditable) PDF is the best electronic substitute. Adding a database (or other variable datastore) into the mix immediately compromises the integrity of those results.
posted by grumpybear69 at 8:38 AM on April 5, 2018 [3 favorites]

I have some issues with the article.

1: Claim that all stats software is "sloppily written" - citation please? In my experience, there is largely nothing wrong with the stats packages, but there is definitely an argument to be made that many, many people have no idea what analyses might actually be appropriate for a given dataset. Very few stats programs attempt to teach you what is or is not appropriate. A poorly trained person can use even a highly refined tool in dumb, inappropriate ways - it does not mean there is anything wrong with the tool.

2: For all the verbiage about how ubiquitous Mathematica is, I have no familiarity. Have never seen it used. Have never had a colleague who used it, to my knowledge. I'm trained in and/or collaborate with people who do neuroscience, zoology, behavioral biology, cognitive research, obesity, metabolomics, healthcare research... the program may be "as ubiquitous as MS Word" in certain fields but it is not everywhere. Which the article itself states as well - the comment later in the section discussing Jupyter notes that Mathematica "remained relatively obscure". Which directly contradicts the "as ubiquitous as MS Word" statement that preceded it. I've seen this happen before. People in math, physics, and comp sci are pretty sure that EVERYONE uses LaTeX and no one uses Word. Except that a whole shit-ton of research writing is done with Word. I've never worked with anyone who used LaTeX, across multiple disciplines and departments in 3 major universities and a large research hospital. That doesn't mean no one uses it - but it does mean that not everyone does. Likewise, I have actually heard of NumPy (in passing, no context) but not Jupyter or its predecessor. R, yes. Matlab, absolutely. Coding in science? Perl, Python, etc. etc., I have done it, and I've seen it done, but the skillset to do it is severely limited in a lot of fields: aside from people like me who are willing to learn a new language, most rely on tools that come with a GUI. Time is limited and resources are not infinite.

3. Absolutely agree that a dataset that isn't "frozen" can't be used to make a statement. A notebook format is great for sharing raw data (allowing replication, re-analysis, metaanalysis, etc.) but the interpretation of the data is what goes into the paper. The point of the manuscript is to communicate the results and provide your interpretation of the significance and meaning. A notebook format might help with the former but does nothing to provide context for the latter, especially for anyone outside of the field. I could dump a pile of data into the comment box here and you could all look at it, but unless you were aware of the specific "language" used in my field of research, and had the background and training to understand what it was, it would be meaningless. Key words and symbols and etc. used as shorthand in data or results are supposed to be expounded upon and explained and defined in the text. As a reader you are supposed to critically evaluate the results, and decide for yourself (based on context and your own knowledge) whether the conclusion I considered is actually sound or flawed. The peer review process - while itself often problematic - is intended to ensure that the experiment itself was conducted with integrity, and that any obviously flawed interpretations are caught prior to publication. How do you review a notebook?

What is absolutely broken with the current publishing system is the fee basis and the artificial limitations.

Fee basis: I pay for the study, I pay for the publishing costs, someone else does the review for free, and you pay for the results if you want to read it. This doesn't make sense. Quality editors are not cheap - but the publishing companies are making money on both ends off of the work done by the scientific community. And after all that, 90% of the time I still have to do the legwork to make the manuscript compliant with NIH reporting and open access rules. We need a nationalized publication mechanism. PLoS model, where everything done with scientific rigor gets published. PubMed Central model, where anything published is free for all and made available immediately. Nationalize the publishers, run them as a nonprofit, think of how much money your university sinks into licensing fees! How much would be saved, even if there was still a fee to publish to support the costs of the editorial staff? Not to mention the immediate reduction in shit predatory journals that exist only to collect fees and publish garbage.

Artificial limitations: Nobody prints journals any more. Yet a huge number of the most respected journals act as if they do. Papers with dozens of microphotographs, ostensibly showing a result but printed in postage-stamp size - how are we as readers to interpret the integrity of that image? Publishers need to start treating electronic submissions as electronic, not paper. Page limits on text? Sure, otherwise some folks will ramble on forever (you may notice brevity is not my strong point!). Limits on data? No. High resolution images, as part of the article, not as supplemental data that must be hunted down and found separately. One could even include full datasets, in common file formats, as attachments within the file (you can do this with PDF - and the format will have to stay PDF until/unless a commonly acceptable, open, and free alternative format is developed, and it will have to be one that is suitable for long-term archiving or it's a nonstarter). COULD a notebook format be used for this? Very likely, again pending development of an acceptable, packaged format. If it remains a static snapshot of what was known when it was published.

*last but not least: Breathless comment about how the one PI "actually measured neuronal activity with implanted electrodes" - uh, yeah, that is one thing electrophysiologists do. I have colleagues here doing that. I'm struggling to find out what is so exciting about this approach that made the author go all starry eyed over it, because to be honest an implanted multielectrode recording array is not exactly new or groundbreaking - implantable microwire arrays were developed in the 1950s...
posted by caution live frogs at 8:39 AM on April 5, 2018 [14 favorites]

I'm one of the people who's done the most work to try to make notebook-based research distribution viable for my field (linguistics / semantics), and while I think it could conceivably happen in the not-too-distant-future, there is still just so much to do before this is anything close to a viable substitution for an actual research paper. I've used it in classes and research contexts, given talks about it, etc. Here are some of my main concerns:
  • The existing well-supported tools in this space are very appropriate only for certain kinds of math / models / graphics (and basic stats of course); granted ones that are very popular right now. For fields that don't use these, there's a ton of effort involved in getting anything runnable in a notebook, let alone something that looks good. This is where I've spent most of my time. If you don't have something runnable, there's no good reason to use a notebook format over something else.
  • Formatting capabilities are still far behind e.g. LaTeX. Many standard diagrams / data notations in my field are ones I can't easily recreate in any notebook format. It's not impossible, but the effort involved is on the order of hiring an undergraduate RA to do some javascript coding for a semester. I can always create the diagrams elsewhere and embed them of course. Mathjax is great, but there are still easy ways to hit its limits, at least for the stuff I do.
  • You don't need to be a programmer to do research in my field, so my potential audience has many non-programmers, and I think people who are used to these formats tend to underestimate the startup cost / effort to get these things working and understand how to actually use them. Most of my potential target audience doesn't know how to use the command line.
  • Installation: Jupyter notebook is in a much better place than it's ever been, but for years just basic installation problems have been a massive headache for me. It's at this point simplified to "install anaconda, download this other thing", but even this is weird for most "normal" people. It's also a perpetual frustration to me that there's no good, simple way to get double click to open work for the jupyter notebook format. Part of the issue is that the "pythonic" ways are really not in line with the rest of the world in terms of distributing software, so they're very alien to those not embedded in the space.
  • Minimal bibliography support (and what there is is kind of a pain to use). Anyone who thinks a notebook format is a viable wholesale replacement for a scientific paper format without this, to be blunt, ought to think hard about their scholarly practices. Fields have different standards for this of course, and there are probably still those people out there who do the cut and paste word bib, but in LaTeX-oriented fields that do expect you to have good citation practices, this is a really painful gap.
  • In most fields there is no way for something released in notebook format to "count" for your CV. This is a kind of standard problem that new open-access journals need to navigate as well. I tend to think that we need journal-like entities that accept notebook formats, but the formats aren't really there yet.
Right now IMO notebooks are best used in the place of appendices/supplementary material, and that's how I use them in my own research; it's a long road before this will change.
posted by advil at 9:29 AM on April 5, 2018 [11 favorites]

I'm willing to be convinced of the value of notebooks. (Well, perhaps not Mathematica notebooks. But Jupyter's pretty neat.) So far, the interfaces I've met are so incredibly clunky that I only copy things into them in order to share them with others after doing the work in a fully-customizable text editor. Which mostly just adds work, but has been really useful at times.

But, like nearly all open data proposals, this seems applicable to a very narrow vision of what it means to do science. If you're running a statistical regression on a thousand survey responses or a dozen plate counts, I can believe open data is useful. If you're inviting people to explore the implications of a semi-analytic model fitted to dozens of different combinations of small data sets, I can believe a notebook presentation is fantastic.

But, when I write papers, my goal is usually either to convince others that I've properly accounted for systematic noise sources in a complicated cryogenic electronics system, or that a data set consisting of many tens of terabytes of timestreams passes conceptually-trivial algebraic tests after many person-years of low-level processing and cuts. Doing either in the form of a python notebook would not make the task any easier.

There are many things I dislike about the academic publishing industry. Its reliance on papers filled with static text and equations isn't high on the list.
posted by eotvos at 9:34 AM on April 5, 2018 [1 favorite]

advil, could you post some links to samples of your work? I'd love to see.

Some of the distribution and usability problems could be solved by a good Notebook hosting system. There's no real reason to run the Notebook on your own computer with software you install yourself. I'm not up to date with the state of the art here, but I know folks are working on it. (Truthfully there's little reason to run a Notebook yourself at all if you're just reading a paper. But the fact that you could run it and tinker with it is important.)
posted by Nelson at 9:47 AM on April 5, 2018

advil, could you post some links to samples of your work? I'd love to see.

Sure, hopefully the self-linking is ok. My main project along these lines is You can see a pre-executed demo notebook here, though this is written for practicing formal semanticists (i.e. it presupposes graduate level training) and may not be very exciting otherwise. I recommend switching to the SVG mathjax renderer (right click on any formula).

Outside this project, a good place to find samples is the material here, the site for a course that a collaborator and I taught at a logic & linguistics summer school last year that was notebook-based. This course is more ML / data-science based, so the notebooks are more typical for the sorts of models the OP article is targeting.
posted by advil at 9:59 AM on April 5, 2018 [2 favorites]

Thanks, it's fascinating to see notebooks applied to this domain. I mean computational linguistics is still computation, right? But such a different specialty than statistics or physics. You and your software library are the ones making Jupyter a viable tool for your field.
posted by Nelson at 10:11 AM on April 5, 2018

For archival purposes, I think JavaScript is less of a moving target than Python. But history has shown that DOM-rendering is not, uhh, "beautifully preserved" across browser generations. PDFs can be translated into very simple and preservable bitmaps; can't say the same about code notebooks.

Ironically there is a mature Turing-complete language embedded inside of every PDF viewer. Can PostScript do animations? (more of a modest proposal than serious...)
posted by RobotVoodooPower at 10:44 AM on April 5, 2018 [2 favorites]

"PostScript is a Turing-complete programming language, belonging to the concatenative group. " So, yes, Postscript can do animations. Implementation is left as an exercise for the reader.
posted by fantabulous timewaster at 11:01 AM on April 5, 2018 [1 favorite]

more of a modest proposal than serious
I'm ashamed to admit how useful that parenthetical comment proved.

But, the problem with postscript visualization is that it's still an inefficient two step process. Distributing your work as plain text with an in-line, emacs-lisp-based analysis pipeline, on the other hand, saves you multiple document processing steps.
posted by eotvos at 11:06 AM on April 5, 2018

This is a good article with a terribly misleading headline. Not just methods but computer code (R, Python, etc.) and data should be included with academic articles to have a complete record of the research that has been done. And integrating writing, data processing and visualization (like with RMarkdown documents) can improve workflow and reduce errors. But clearly written academic articles themselves are still needed.

This article is absolutely right that computation is (and should be) integrated into many scientific fields. The real problem is that the incentives for maintenance of software and data archives are not there. The incentives for writing papers are there (although there are many problems there, especially in publishing), but there are very few for writing good tools or organizing your data in a way that will make sense for other researchers.
posted by demiurge at 11:07 AM on April 5, 2018 [2 favorites]

This is, of course, the whole problem of scientific communication in a nutshell: Scientific results today are as often as not found with the help of computers.

I think the author does a disservice by implying that this is the only reason a more dynamic medium than paper would be better for communicating scientific ideas. So many non computer-derived ideas are "complex, dynamic, hard to grab ahold of in your mind’s eye," and would be better simulated and communicated with something more dynamic than immovable print.
posted by little onion at 12:10 PM on April 5, 2018

containerization is a big part of the solution. using something like Singularity you can package up all dependencies, data, etc into essentially a lightweight VM image that will just run your analysis or a notebook program or whatever. publish a hash of it in some really long term archive to make sure people can at least check they have the right image.

it’s still not easy or seamless for non computer nerds but it will work.
posted by vogon_poet at 1:39 PM on April 5, 2018
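
The "publish a hash of it" step in vogon_poet's comment could look something like the following. It is a minimal sketch that assumes SHA-256 as the hash and treats the container image as an ordinary file on disk; the function names are hypothetical and nothing here is specific to Singularity.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so a multi-gigabyte image never sits in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, published_hex):
    """True if the local file has the same SHA-256 as the published value."""
    return sha256_of_file(path) == published_hex.strip().lower()
```

An author would run `sha256_of_file` once on the image and print the result in the paper; a reader who later obtains a copy calls `matches_published_hash` with that value to confirm it is the same image, byte for byte.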

The Theodore Gray profiled in this article as the inventor of the Mathematica notebook is the partner/husband(?) of stone cold animation genius Nina Paley. He acts as her assistant in her short films, cartoons, and textile art. (Yes, every single frame of that central animation was individually embroidered.)
posted by Harvey Kilobit at 1:48 AM on April 6, 2018 [2 favorites]

There's a non-garbage discussion of this on Hacker News.
posted by vogon_poet at 5:05 AM on April 6, 2018

(lots of horror stories about how making research replicable is often actively avoided by researchers, because it's making life easier for your "competitors", i.e. the rest of the scientific community.)
posted by vogon_poet at 5:07 AM on April 6, 2018

Science has to be somewhat conservative, at least until a clear answer comes along. Paper (and PDF which is as near a faithful recreation of that digitally as we have) is durable. I don't need to worry about technological obsolescence, indeed I can and do reference papers from up to a century ago, with relatively easy access to the primary sources.

Would I be able to do that with a piece of Mathematica code in 2120, or even in 2030 or 2040? With an open source code that appears to be even more of a moving target? Blink and we could have a whole new set of "software notebooks" even a few years from now. I've wrestled with a number of software transitions now, even just in the authoring, graphic and bibliographical software packages and It Has Not Been Fun, for the official record. With luck, I'll still be active in 20 years or so. I don't want to have to port my papers over to a new software framework again.

This is not to say that publications should stay the exact same forever. Increasingly, papers are becoming short adverts for the real meat of a study which is typically reported in the supplemental information. Unfortunately, the SI is too often in PDF form as well, so to access the datasets, we've had to go so far as to physically retype and validate the data. The same goes with code. There's a real problem that needs solving, but I'm not sure idiosyncratic software wrappers are really it.

I'm much more in favour of open release of datasets and analytical tools. I'm in charge of a few myself at work, and we've been working hard for the last decade to wrestle an old paper-only dataset into a persistent on-line form (e.g. DOI referenced) with some visualizations. I actually see more benefit there, in making datasets open with supporting information, in open and easily ported formats, along with pretty visualization in less portable formats (which can be fully regenerated from the source data).

In addition, these datasets comprise a number of studies (dozens now, not all ours), and so constitute a reservoir of information that's additive over time and not just single studies. Again, there is great value in being able to put together work from multiple groups and studies into a single format for subsequent analysis by other researchers and end-users. I don't know how we'd use these workbooks in a larger summary context. Would there be a way to machine extract datasets in a uniform way? They seem pretty atomic to me, and so little better than the existing paper or PDF papers. This doesn't seem to do much to solve the json/xml data description issues that are at the heart of wrestling with multiple data sets.

I'm a bit skeptical of the value of these ideas. I agree we have issues to fix with the current way of putting scientific info out there, but I'm not certain code notebooks are the best way forward. We're putting our money instead into making data and files available in persistent data stores in well-documented formats, less about the code, more about preserving data and supporting info.
posted by bonehead at 5:52 AM on April 6, 2018 [2 favorites]
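
On bonehead's question about machine extraction: a Jupyter notebook file is JSON, so its cells at least can be read without running anything. The sketch below uses only the standard library and assumes the version 4 notebook layout (a top-level "cells" list whose entries carry "cell_type" and "source"); the sample notebook is made up. It does not address the harder problem bonehead raises, which is describing the data inside those cells in a uniform way.

```python
import json


def code_cells(notebook_text):
    """Return the source text of each code cell in a notebook JSON document."""
    notebook = json.loads(notebook_text)
    sources = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        # "source" may be stored as one string or as a list of lines.
        sources.append("".join(source) if isinstance(source, list) else source)
    return sources


# A tiny hand-written notebook, invented for this example.
EXAMPLE_NOTEBOOK = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 2,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Methods\n"]},
        {"cell_type": "code", "metadata": {}, "outputs": [],
         "execution_count": None,
         "source": ["x = [1, 2, 3]\n", "total = sum(x)"]},
    ],
})

print(code_cells(EXAMPLE_NOTEBOOK))
```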

Jupyter, Mathematica, and the Future of the Research Paper
The Atlantic has a great article on new ways to share research results. Its three parts make three points:
  1. A graphical user interface (GUI) can facilitate better technical writing.
  2. Wolfram's proprietary notebook showcased innovative technology, but decades after its introduction, still has few users.
  3. Jupyter is a new open-source alternative that is well on the way to becoming a standard for exchanging research results.
... There is an independent social dimension, where the metrics assess the interactions between people. Does it increase trust? Does it increase the importance that people attach to a reputation for integrity?

It is along this social dimension that open source unambiguously dominates the proprietary model. Moreover, at a time when trust and truth are in retreat, the social dimension is the one that matters.

Jupyter rewards transparency; Mathematica rationalizes secrecy. Jupyter encourages individual integrity; Mathematica lets individuals hide behind corporate evasion. Jupyter exemplifies the social systems that emerged from the Scientific Revolution and the Enlightenment, systems that make it possible for people to cooperate by committing to objective truth; Mathematica exemplifies the horde of new Vandals whose pursuit of private gain threatens a far greater public loss–the collapse of social systems that took centuries to build.

Membership in an open source community is like membership in the community of science. There is a straightforward process for finding a true answer to any question. People disagree in public conversations. They must explain clearly and listen to those who respond with equal clarity. Members of the community pay more attention to those who have been right in the past, and to those who enhance their reputation for integrity by admitting in public when they are wrong. They shun those who mislead. There is no court of final appeal. The only recourse is to the facts.

It’s a messy process but it works, the only one in all of human history that ever has. No other has ever achieved consensus at scale without recourse to coercion.

In science, anyone can experiment. In open source, anyone can access the facts of the code...
public goods provision contest! inclusive alliance vs excludable rivalry :P
posted by kliuless at 3:56 AM on April 23, 2018
