Good news for webhosters (and scientists)
February 25, 2014 12:15 PM

PLOS’ New Data Policy: Public Access to Data "PLOS has always required that authors make their data available to other academic researchers who wish to replicate, reanalyze, or build upon the findings published in our journals. In an effort to increase access to this data, we are now revising our data-sharing policy for all PLOS journals: authors must make all data publicly available, without restriction, immediately upon publication of the article. Beginning March 3rd, 2014, all authors who submit to a PLOS journal will be asked to provide a Data Availability Statement, describing where and how others can access each dataset that underlies the findings." also have a primer on why open science data is important.
posted by jaduncan (20 comments total) 29 users marked this as a favorite

Whoa. Awesome.
posted by PMdixon at 12:16 PM on February 25

Nice. I was around for the creation of PLOS and this was always their goal, to not only foster open access to journals, but make all scientific publishing use open data as well.
posted by mathowie at 12:24 PM on February 25

Black box science publishing allows a lot of sloppy data handling, because there is less accountability. At the very least, having to share data means getting it into a form that is shareable, which will push the improvement of formats and data interchange standards. Even if your data are public, if I can't read them, is your experiment reproducible? One question is where the data are archived and who pays for that, which might suggest some future utility of for-fee publishing, where the publishing house hosts the data internally. I think Nature does this.
posted by Blazecock Pileon at 12:32 PM on February 25 [3 favorites]

One question is where the data are archived and who pays for that, which might suggest some future utility of for-fee publishing, where the publishing house hosts the data internally

If the research was publicly funded, I think that it should be archived by the government entity that sponsored it.
posted by Dr. Twist at 1:03 PM on February 25

If the research was publicly funded, I think that it should be archived by the government entity that sponsored it.

That would probably be horrible.
posted by sonic meat machine at 1:26 PM on February 25 [1 favorite]

The NCBI is horrible? (great work PLOS)
posted by benzenedream at 1:32 PM on February 25

No, I'm just picturing a massive data warehouse administered by government contractors and academics. More specifically, trying to interface with that data warehouse. I think it should probably be in text files in some standard format, easily mirrored by a thousand universities.
posted by sonic meat machine at 1:39 PM on February 25

That would probably be horrible.

I can think of a number of reasons why you think so (hey, gov't shutdown!), but I regularly upload raw data - data that wasn't sponsored by the government - to the NASA global change data archive. I use it as a secondary data archive (as a good ol' what if kind of thing) and I love it. I often browse there for data - for fun, without having a specific data set in mind, and then reverse the process (find the paper, etc.) and have a look. It's worked fine as long as I've been using it.

Probably the largest data set in the world for earth science is the ocean drilling program data sets, which have been archived and accessible since 1968. It IS a massive data warehouse administered by contractors and academics, and yes, sometimes interfacing with that data is a pain in the ass. But that's true of any data set - data itself is a pain in the ass, and most people who work with it regularly will have a moment where their hands are plunged deep in their hair, muttering obscenities at their computer or their notebook, trying to figure out the best way to collect and organize their data. (Let alone display it.) A government database isn't really going to make a difference, as long as the data is there and it's the raw data.

And yes, if I read your paper and I can't find your data anywhere - even if it's your own university website - I will judge you.
posted by barchan at 1:50 PM on February 25 [5 favorites]

I'm all for this sort of thing, and it's one of the reasons that I like to submit to PLoS journals.

Unfortunately, I'm probably one of the exceptions that they note. These days I work mostly with data collected by the government about individual farms in the UK. These data are a pain to get access to and are heavily protected by privacy laws. I need to sign lots of forms promising that I won't give it away or display or publish it in a way that could identify a farm or a small group of farms.

Which is all meant to say: please don't judge all of us who can't publish our data freely too harshly. Sometimes we don't have much of a choice.
posted by magicicada at 2:35 PM on February 25 [5 favorites]

A great presentation on why gov't/university partnerships are necessary for data archiving, and why we can't rely on gov't alone. Slide 3 blows my mind.
posted by unknowncommand at 3:01 PM on February 25

While I support the idea, I wonder if this is even feasible. In my field (experimental fluid mechanics), certain experiments can generate Terabytes of raw data. If you have four high-speed cameras running at the same time, you generate tens of Gigabytes per second. This data cannot be compressed, at least not efficiently without loss of information, for our application (PIV). We process the data to get velocity fields, which are naturally much smaller. I don't mind supplying those in a convenient plain text format, but the actual data? I am sure other fields run into similar problems.
posted by swordfishtrombones at 3:10 PM on February 25 [2 favorites]

swordfishtrombones, those are similar to problems we face in bioinformatics with the advent of relatively cheap and high-throughput sequencing technology. The Sanger Institute produces around 1Tb/day of sequencing data. NCBI does have a somewhat-tenuously-funded archive called the SRA where you can deposit read files. The raw image files from the experiment are closer to 1Tb/run, which only reduces to ~300Gb of raw sequence data, though there are tricks to squeeze multiple samples into one sequencing run at the cost of coverage.

As noted here, if you're just sequencing DNA from human patients, then yeah, you can eventually compress this down by just looking at confidently-predicted differences from a reference genome. But then you have to trust whatever algorithm is distilling the reads down into variants or matches, and because it's a young technology, these algorithms are still in rapid and active development. And if you throw away the wrong bits of the data, even making decisions that seem totally reasonable at the time (like say, throwing out everything but coding regions in an RNAseq experiment), you can potentially lose the ability to go back and look for exciting new stuff like lincRNAs and circular RNAs. Plus if you're sequencing, say, a population of microbes, some of which may not even have reference genomes, then this type of data reduction becomes very sketchy very quickly.

It's definitely not a problem anyone's solved yet in the general case. I heard a talk where someone mentioned that we're now in the somewhat absurd position (for biology) where the data storage can cost more than repeating the actual experiment. It's a big problem for computational biologists especially, who make their living (and more importantly, scientific discoveries!) in part by being able to re-analyze and integrate existing, sometimes disparate sources of biological data.
posted by en forme de poire at 4:12 PM on February 25 [4 favorites]

This is a great idea, in the abstract.

Unfortunately, I could never do it. (I've published in PLOS ONE.) My agreements with the participants in my research studies prohibit presenting anything other than aggregated statistics, etc. I can't see my research ethics board okaying uploading case-level data (which power my statistical models and, thus, my findings) anywhere, ever.
posted by docgonzo at 5:16 PM on February 25 [1 favorite]

docgonzo, that's a real problem and one that people are thinking about. Also, from the announcement:
Do we allow any exceptions?

Yes, but only in specific cases. We are aware that it is not ethical to make all datasets fully public, including private patient data, or specific information relating to endangered species. Some authors also obtain data from third parties and therefore do not have the right to make that dataset publicly available. In such cases, authors must state that “Data is available upon request”, and identify the person, group or committee to whom requests should be submitted. The authors themselves should not be the only point of contact for requesting data.
A lot of genomic data repositories (dbGAP, TCGA, etc.) have had to deal with how to balance allowing access to data to enable discoveries and ensuring patient privacy; there's some interesting discussion about that here with a good bibliography (a lot of the works cited in the intro are really interesting in their own right).
posted by en forme de poire at 6:15 PM on February 25 [1 favorite]

Ah, thanks -- I didn't think my situation was all that unique!
posted by docgonzo at 7:05 PM on February 25

Well done.
posted by homunculus at 7:29 PM on February 25

While I like the idea of this, I still don't understand the exceptions policy.

In such cases, authors must state that “Data is available upon request”, and identify the person, group or committee to whom requests should be submitted. The authors themselves should not be the only point of contact for requesting data.

How is this supposed to work with human subjects? I can't authorize someone else to release my data. I can't even authorize myself to release my data. The only person who can grant such authorization doesn't have the data; and I can't identify the person to whom those requests should be submitted.
posted by yeolcoatl at 5:45 AM on February 26

Yeolcoatl, you might want to check the link I posted above for examples of how data sharing can be reconciled with patient privacy. In particular there's a review by the Gerstein group in the bibliography that is a pretty good and clear summary (on my phone or would link directly).
posted by en forme de poire at 10:32 AM on February 26

Ah, but my problem is that my data is video, not medical. I literally can't share any raw data and still protect my subjects' privacy. I can share transcripts, but that's not raw data. People who do work in gestures, for example, can't use it. So the raw data requirement isn't met.

Now the controlled access option is interesting, but the problem with that is that's not the form my subjects signed. They never signed any forms granting a third party the right to grant access to identifying data. They signed forms that grant me personally the right to share non-identifying data.
posted by yeolcoatl at 4:11 PM on February 26

It might be too late for this particular set of studies, but this is more of a question for a journal editor. The type of human research you're describing is also pretty different from, e.g., collecting clinical variables like blood glucose or even DNA sequences, so it's not a type of research I have any real direct experience with. There are also ways of anonymizing video, but you're of course right that this would have to be worked out with the subjects beforehand.

I would guess that whether sharing the transcripts would be "enough" under this policy would probably depend on what exactly the study was, how the videos were transcribed, how subjective the transcription was, and how relevant the non-transcribed elements would have been to replicating any/all conclusions. But again, that's something it would be worth asking the editorial staff of PLoS. I think the way they word it makes it pretty clear that patient privacy is the most important variable in how something like this would be implemented in any particular case.
posted by en forme de poire at 7:28 PM on February 26

