Gene name errors are widespread in the scientific literature
August 25, 2016 11:23 AM Subscribe

Microsoft Excel blamed for gene study errors. [G]ene symbols such as SEPT2 (Septin 2) and MARCH1 [Membrane-Associated Ring Finger (C3HC4) 1, E3 Ubiquitin Protein Ligase] are converted by default to ‘2-Sep’ and ‘1-Mar’, respectively.
posted by He Is Only The Imposter (135 comments total) 37 users marked this as a favorite

Don't use Excel for bioinformatics
posted by a lungful of dragon at 11:25 AM on August 25, 2016 [17 favorites]

Oh my god, this is so true. I mean, I don't prepare important data for papers, but I do prepare a lot of lab order spreadsheets and one has to get one's peptides and oligos and so on from somewhere.
posted by Frowner at 11:26 AM on August 25, 2016 [2 favorites]

That's not Excel error, it's user error.

Format > Cells... > Text

Smooth move(s), smartypants science dudes.
posted by dersins at 11:28 AM on August 25, 2016 [36 favorites]

Yeah well open office converts all my important appointments into gene symbols
posted by ian1977 at 11:32 AM on August 25, 2016 [32 favorites]

That's not Excel error, it's user error.

Yes. The error the user made was using Excel when they should have used a database.
posted by escape from the potato planet at 11:34 AM on August 25, 2016 [70 favorites]

That too.
posted by dersins at 11:34 AM on August 25, 2016 [2 favorites]

Don't use Excel for bioinformatics

Uh huh. And the alternative is... what, again?
posted by Mitrovarr at 11:34 AM on August 25, 2016 [2 favorites]

Saw this on Slashdot. I want to file under PEBKAC except using Excel for everything is all-but encouraged these days. Hell, at work it's used as an alternative to Word to avoid thinking about tables/formatting.
posted by comealongpole at 11:37 AM on August 25, 2016 [1 favorite]

(For all my Gen X cynicism I'm just naive enough to imagine that other workplaces are different)
posted by comealongpole at 11:38 AM on August 25, 2016

escape from the potato planet: Yes. The error the user made was using Excel when they should have used a database.

The average biologist is going to screw up far more making some amateur trainwreck of an access database than anything excel could hope to accomplish.
posted by Mitrovarr at 11:40 AM on August 25, 2016 [19 favorites]

That's not Excel error, it's user error.

I hear this, having dealt with it as both user and manager in my own different work history for decades, but I can also really relate to the users here, since Excel is so goddamn frustrating about re-re-reverting my format preferences and choices (for example, deciding to interpret things like 6-2 and even 6' 2" as dates, rather than heights, no matter how many times I set format)... and it does so, as near as I can tell, whenever I leave my desk for coffee.

As for biostats work, I suspect these people should really be working in R, but that's a bit of a steep curve, especially for 'wet' lab-people without 'dry' technical backgrounds.
posted by rokusan at 11:40 AM on August 25, 2016 [13 favorites]

Excel's date handling is suck garbage. Even if you open a .csv file, which is an ancient spreadsheet format based on formatted text with basically zero options, Excel will auto-convert any date-like strings you input into Date format and alters the value to match its default notation. If you then switch them back to Text format (because anythng else is meaningless in a .csv, and you need your dates formatted a certain way because whatever script you're using to analyze the table is expecting a particular format) they then become some kind of coded string that bears no obvious resemblance to your original input. There is no way to disable this behavior—I've searched for one to no avail. It's fucking maddening.

Excel is a garbage program in so many ways. So much of getting good at it involves learning how to route around its terrible design. It's the de facto standard across pretty much every spreadsheet-using occupation though, and since its file formats are proprietary and nobody wants to risk having all their .xlsx files subtly mangled by a third-party program, we seem to be stuck with it. So much impotent rage.
posted by Anticipation Of A New Lover's Arrival, The at 11:42 AM on August 25, 2016 [37 favorites]

Ngargh. I use Excel even when I probably shouldn't because it is right there and easy to work with, so I have a certain sympathy for the scientists involved.... but come the fuck on, that's why you doublecheck your outputs before you submit the damn paper and disable all autoformatting first.
posted by sciatrix at 11:42 AM on August 25, 2016 [3 favorites]

Yes. The error the user made was using Excel when they should have used a database.

Depending what they were doing, a db might have been overkill, and even if not they probably didn't have the budget to hire a programmer to make the db have a nice front end. But formatting your column cell types correctly is kind of like declaring your variables properly, ignore it at your own risk.
posted by aught at 11:43 AM on August 25, 2016 [2 favorites]

A lot of the time, biostatisticians use Excel to create their data frames, save them as .csvs, and then manipulate them in R. Have you tried transcribing hand-written field notes into a data frame directly from R anytime lately? Excel is a nightmare, but that sounds like pure hell to me.
posted by Anticipation Of A New Lover's Arrival, The at 11:44 AM on August 25, 2016 [5 favorites]

using Excel for everything is all-but encouraged these days.

I do remote customer support for a certain software vendor. More than once I've received screen shots of application errors as images embedded in spreadsheet cells. If all you know how to use is a hammer...
posted by Greg_Ace at 11:49 AM on August 25, 2016 [4 favorites]

"But it’s worse than that. Excel can actually change the format of the dates in a non-Excel file (e.g. CSV file) without your permission."
posted by pemberkins at 11:49 AM on August 25, 2016 [15 favorites]

Also I can really sympathize because Excel is really, really resistant to letting you format your cell types for some reason, and insists on repeatedly reverting them to whatever it thinks is best. The Date format appears to be the worst offender, in my experience. Yes, the researchers should have caught it. But the program is so fucking frustratingly godawful about quietly disregarding the user's explicitly-stated desires that I'm not at all surprised that this slipped through.

Enough people are using Excel in this way that it's inevitable that somebody would fall afoul of this godawful, hostile, brain-dead design eventually. People are not going to be perfect. They are especially not going to be perfect when their tools are quietly working against them all the time. It's unreasonable to blame the users for not perfectly compensating 100% of the time for such an inexcusably foul design.
posted by Anticipation Of A New Lover's Arrival, The at 11:50 AM on August 25, 2016 [28 favorites]

for example, deciding to interpret things like 6-2 and even 6' 2" as dates, rather than heights, no matter how many times I set format

Put a apostrophe first when entering any data excel might mangle. It will absolutely, positively consider it text.
posted by Mitrovarr at 11:50 AM on August 25, 2016 [5 favorites]

See, the thing is, you won't catch every new entry/page/column. It's not that one does not notice that something is auto-formatted, or that one does not know to set columns as "text"; it's that if you're doing a lot of stuff, you miss it a few times and errors creep in.

I don't know enough about excel to know if it's possible to set "text" as the default column - a quick poke around suggests not, but this would be the most sensible thing.
posted by Frowner at 11:51 AM on August 25, 2016 [2 favorites]

Excel's date handling is suck garbage. ... we seem to be stuck with it.

Partly because of the zillions of business spreadsheets full of macros that depend on its shitty, head-to-toe broken date handling. I often work with historical data involving a lot of pre-1900 dates (another thing Excel handles supremely poorly), and I'll sooner write a custom Python script to digest a CSV file than risk Excel fucking everything up.

I wish Microsoft had the courage to fundamentally remake Excel, removing all of the legacy cruft and allowing scripting using any language that can target (a carefully sandboxed version of) .NET. It could be so, so much better than it is.
posted by jedicus at 11:52 AM on August 25, 2016 [10 favorites]

If you're opening a .csv file in Excel, use the Import Data function rather than the Open File function, and be sure to tell it to treat all the cells as Text. (This is not the default option.) Otherwise, you will find all your dates reformatted—possibly misformatted—and if you then save the .csv before you notice that this has happened, you're in for a fun time manually re-entering all of them.
posted by Anticipation Of A New Lover's Arrival, The at 11:53 AM on August 25, 2016 [10 favorites]

Bad design, allows actual calculations to be affected by formatting options, "IS BASICALLY SAME AS SHITS WHERE HE EATS".
posted by hyperbolic at 11:53 AM on August 25, 2016 [4 favorites]

There is no way to disable this behavior—I've searched for one to no avail.

If you open Excel, then use File > Open and select the .csv file, you should get a series of dialog boxes that allow you to specify the delimiter and the formatting for each column before it mangles the original text.
posted by Greg_Ace at 11:53 AM on August 25, 2016 [1 favorite]

No, the only column default option is the worse-than-useless General, aka "whatever Excel thinks it is." The only way around this is to use Import Data or to kick off a new spreadsheet by hitting ctrl-A and setting everything to Text.
posted by Anticipation Of A New Lover's Arrival, The at 11:55 AM on August 25, 2016 [2 favorites]

Greg_Ace, that's not how my excel works. It just opens any csv as a table and converts date looking things to dates regardless of what I say.
posted by teleri025 at 11:57 AM on August 25, 2016

Not in my copy of Excel 2013 you don't, Greg_Ace. I just checked. You have to use Data > Get External Data > From Text to get that dialog.
posted by Anticipation Of A New Lover's Arrival, The at 11:59 AM on August 25, 2016 [4 favorites]

I wish Microsoft had the courage to fundamentally remake Excel, removing all of the legacy cruft and allowing scripting using any language that can target (a carefully sandboxed version of) .NET. It could be so, so much better than it is.

Well - they are kind of working on it - this is why Excel is now available on iOS and Android devices...

ActiveX / COM / OLE component support is disappearing... however, that means VBA and .NET too ... Because it was never "sandboxed" for Office - Microsoft toyed with a .NET sandbox for SharePoint that was an interesting idea, but even ended-up "deprecating" it, the future apparently is... JavaScript/TypeScript...

Why? Well - for them, the future is also cloud... And the last thing they want is your "arbitrary-quality" code executing on their shared-service offerings (Office 365, Office Online Server, SharePoint Online, etc.)... Therefore... offload that to the client via JavaScript, and voila...
posted by jkaczor at 11:59 AM on August 25, 2016 [1 favorite]

I know it works in Excel 2007 and earlier, I assumed the behavior remained for later versions....
posted by Greg_Ace at 11:59 AM on August 25, 2016

The error the user made was using Excel when they should have used a database.

There's a set of work tasks at my job that are routinely done by dumping a whole bunch of text (and only text) into Excel and it is one of the goddamn most frustrating and stupid things about my job. Stop using Excel to format text! It's terrible at it and causes more problems than it could possibly solve. Spreadsheets have a number of perfectly good uses, almost all of which happen to involve numbers. If you're not using numbers, you probably want either a database or just a damn chart like you can make in Word.
posted by Copronymus at 11:59 AM on August 25, 2016 [2 favorites]

Defaulting to silent type coercion is easy to explain: while Excel does heavy lifting for a huge number of organizations so people don't have to admit that they are programming, it was actually designed as an more specialized tool by the fucking devil.
posted by The Gaffer at 12:03 PM on August 25, 2016 [2 favorites]

I do remote customer support for a certain software vendor. More than once I've received screen shots of application errors as images embedded in spreadsheet cells. If all you know how to use is a hammer...

Oh, man, does that bring back (bad) memories.
I used to work for a small software startup as the one-man art department. One of my jobs was assisting the UI design. I would regularly ask a dev to send me a screenshot of the page he was working on. Invariably, I would get an email back from them containing a Word doc, with the screenshot embedded in it.

I once asked why don't you just paste the screenshot in the email and be done with it? They were genuinely baffled by my question. It was just rote for them to drop everything into a Word doc first.
posted by Thorzdad at 12:07 PM on August 25, 2016 [3 favorites]

Uh huh. And the alternative is... what, again?

The alternative is reproducible science.
posted by a lungful of dragon at 12:14 PM on August 25, 2016 [7 favorites]

a lungful of dragon: The alternative is reproducible science.

No, seriously, what program? Your average biologist didn't multiclass programmer and doesn't have access to one, so any kind of coding is out (including R).
posted by Mitrovarr at 12:19 PM on August 25, 2016 [4 favorites]

I dunno, all the grad-student biologists I knew when I was doing that stuff were learning R and maybe some Python on the side. R has so many incredibly potent bioinformatics packages that it's really a must-have skillset. But nobody was using it for data entry. The workflow went Raw Data > Excel > CSV > Import into R > Analyze. A certain amount of reformatting and data-wrangling was done in R, but not the initial tables. You'd have to look over your .csv files in a text editor first though to reality-check them, and validate them in R as well before attempting analysis. Largely that was to guard against errors inteoduced by, you guessed it, Excel.
posted by Anticipation Of A New Lover's Arrival, The at 12:27 PM on August 25, 2016 [8 favorites]

Your average biologist, especially one who works in clinical trials, ought to be at least SOMEWHAT familiar with data science standards. FFS this is why you have to submit a data management plan for your grant.

I study how bugs evolve to be *so pretty* and I can still write a Perl script and do some R-hacking. Your .csv files should just stay .csv.

Especially because you can't afford a nice new laptop and Excel just crashes when you try to handle your data in it so you actually are forced, FORCED to use vim... but that could just be me.
posted by Made of Star Stuff at 12:40 PM on August 25, 2016 [6 favorites]

Windows opens .csv files with Excel, by default, to make matters worse.
posted by thelonius at 12:42 PM on August 25, 2016

A level of programming skills sufficient to use a few scripts in R/Python is very important in much of molecular biology/genetics. I wouldn't say it's a pre-req but the current generation is moving in that direction from what I can tell. Yes, you (or someone else) still need to validate in the test tube/animal model/clinic/whatever, but *omics is such a powerful tool for 99% of the work that can be done before breaking out the eppendorf tubes.

Source: I'm an ex-bench biologist, now part-time tech writer for an *omics diagnostics company, and have seen firsthand PHD biologists without *omics/NGS experience struggle to find jobs.
posted by aperturescientist at 12:43 PM on August 25, 2016 [2 favorites]

I just started learning R for data analysis last year, four years into a biology PhD, and felt very behind the times at that. (I'd held off so long because I was developing fluency in Matlab first for acoustic shit and it was easier to stick with what I knew; and then I was picking up Python.) I don't know any biologists training now who aren't heavily encouraged (if not forced) to pick up at least one programming language enough well enough to tool around in.

That said, Anticipation of a New Lover's Arrival's workflow is pretty much what EVERYONE I know uses. (I know a lot of people, also, who enter their raw data into excel because it's less annoying than making a .csv in a text editor.) For things like bulk analysis of acoustic data in Matlab, we also put file names into Excel .csv files and import them to work with manually. As I understand it, the problem is that Excel will actually fuck with whatever you put into the .csv even if the data never sees .xlsx format? UGH UGH UGH that is annoying as balls.
posted by sciatrix at 12:43 PM on August 25, 2016 [3 favorites]

That's why once you have your .csv file you change it to a .txt. R can handle "this text file is a csv" just fine, and once it's built you generally aren't going to want to bring it back to Excel again if you can help it. That, or you change the default program setting in Windows.
posted by Anticipation Of A New Lover's Arrival, The at 12:44 PM on August 25, 2016 [3 favorites]

Your average biologist didn't multiclass programmer and doesn't have access to one, so any kind of coding is out (including R).

Then your average biologist shouldn't handle data that goes into a paper. No, seriously! You need to know the right tools for the job. Use RStudio or R Commander if you need a GUI.
posted by a lungful of dragon at 12:46 PM on August 25, 2016 [3 favorites]

Good to know, thanks! Excel hasn't fucked with any of my datasets, but most of them are common English words or simple binary numbers. I'll... watch out for it deciding to change that.
posted by sciatrix at 12:47 PM on August 25, 2016

(What I meant by raw data above was the handwritten stuff in your lab book or field notebook, or the files being output by your instruments. That kind of thing.)
posted by Anticipation Of A New Lover's Arrival, The at 12:47 PM on August 25, 2016 [1 favorite]

I'm going to go against the flow a bit here and say that Excel is one of the least unreliable pieces of software I've ever used.

The flaws it has, and to be clear, some are definitely frustrating, are nothing compared to the trash fires of the rest of Office, and nothing compared to the learning curve of asking one of your team members to just run MySQL and learn R instead.

I would love to see a good, quality spreadsheet and/or data analysis software competitor to Excel, but there isn't one out there. Numbers is a toy calculator, Google Sheets isn't much better than Numbers, Lotus 1-2-3 is dead and buried.

A huge chunk of this thread reads like "get a real OS, luser" from the days when the heyday of Linux on the Desktop was just around the corner. And guess what, Fusion is still 20 years away, and the heyday of Linux on the Desktop is still just around the corner. In the meantime, I'm sticking with Excel.
posted by tclark at 12:55 PM on August 25, 2016 [21 favorites]

Lungful of dragon, you either have no idea what the reality of being a biologist is like, or else your experience is very different from what mine and everyone I knew was. Biologists can't just farm out their analysis to someone else. Outside of the top 0.1% of grants, there just isn't money for that.

Plus, data analysis is what scientists do. If you can do experimentation and/or observation but not analysis, you're only half a scientist. These days, that means you have to be a coder on the side. There are huge amounts of biology-specific coding resources out there, and working with R is a totally standard skill nowadays, to the point that it would be a big weakness if you graduated your program without it. And anyway, the amount of involvement that an outside analyst would need to have to be able to do their job right would make them basically a co-author.

That's not to say that projects don't often have co-authors who were largely responsible for the analysis, but those people are usually biologists as well, ones who just happen to have extra experience with the relevant analyses. Sometimes they are the inventors of whatever software package or statistical technique is being used as the centerpiece of the project's analytical aspect. Biologist-as-coder is a very normal thing nowadays.

Same with writing papers. People argue all the time that scientists should have someone else do the writing for them, but the reality is that there's no money for it and unless you've been intimately involved with the project from the beginning, chances are you do not understand it well enough to be able to communicate about it effectively. This is brand new knowledge we're talking about, generally really complicated stuff at that, and until the researcher writes about it it exists only in their head. Who else would be more qualified?
posted by Anticipation Of A New Lover's Arrival, The at 12:58 PM on August 25, 2016 [7 favorites]

Greg_Ace: If all you know how to use is a hammer...

But on the other hand, when you don't have a hammer, pretty soon everything else starts to look like one. (Pliers make good hammers, right?)

Or, to misquote someone else, "If Excel did not exist, we would have to invent it." Build your database, and the first thing that your user will ask for is an Excel-like interface.
posted by clawsoon at 1:04 PM on August 25, 2016 [1 favorite]

Excel is not a tool I would trust for scientific research. Too many ways to make mistakes, too hard to reproduce results.

http://software-carpentry.org/ is an organization that teaches basic lab skills for research computing. Which means Python, R, version control etc.
posted by mdoar at 1:06 PM on August 25, 2016 [4 favorites]

About 25 years ago, I temped for a week doing data entry. We worked from (sometimes blurry) photocopies of journal articles. Most of it was A, C, T, G sequences; I think the others were sequences of three letters, but since I knew (and knew) little about biology, I'm not sure. I joked that my typo might have caused some sort of mutant, but it actually seemed pretty reliable: two people independently entering the data, plus a supervisor visually checking. I don't think we used Excel.
posted by Mr.Know-it-some at 1:10 PM on August 25, 2016

That’s not Excel error, it's user error.

The fact that Excel sucks is Microsoft’s fault. While it’s no doubt true that academics should move away from Excel, there is surely also some (significant) burden on Microsoft to realize that their product has become so widely used that the basic assumptions they are making about users are wrong.

That is, Excel is trying to help out the user by doing these transforms. Excel isn’t helping the user though - it’s hurting them. Ergo, Excel is to blame.
posted by Going To Maine at 1:10 PM on August 25, 2016 [2 favorites]

Also, Fuck R. Python forever.
posted by Going To Maine at 1:11 PM on August 25, 2016 [2 favorites]

Microsoft itself considers that "this is very frustrating when you enter something that you don't want changed to a date". The problem has been annoying people (including me) for decades, MS is well aware of it so I don't know why there's no option to turn it off. I guess that there's a deep legacy / backward compatibility issue involving some quarter century old-spaghetti code nobody wants to touch. Excel 2016 has still options for Lotus 1-2-3 compatibility...
posted by elgilito at 1:12 PM on August 25, 2016 [4 favorites]

Also, Fuck R. Python forever.

Them's fightin' words.
posted by pemberkins at 1:15 PM on August 25, 2016

Plus, data analysis is what scientists do. If you can do experimentation and/or observation but not analysis, you're only half a scientist.

As an organic chemist (i.e. a knuckle-dragger) it's strange to me that, from the sound of it, the typical biologist has far more knowledge of data analysis than one of my brethren, yet we still (wrongly, I am sure) consider ours to be the "harder" science. Where I work, even knowing how to do basic data manipulation in Excel is the exception, not the rule. Open up the latest issue of my favorite journal, Organic Process Research and Development, and I can guarantee you that at least half of the graphs were clearly made in Excel, and this is a journal where many of the contributors are chemical engineers.
posted by Peter J. Prufrock at 1:19 PM on August 25, 2016 [1 favorite]

Well, for whatever it is worth, I think biologists should learn a little coding on the side (on my graduate project's team, I was the closest one we had to a computer expert). But the thing is, if you use some weird program for your data handling you'll be doing it alone. Maybe some people will be able to access it with R, but not all of them, least likely of which is your project lead or advisor (odds are good with newer sort of graduates, but worse with older scientists). But you use anything weirder and nobody will be able to handle your data files.

On the other hand, basically everyone has excel, and everyone can use excel. It's universal. Other people will be able to open your files. Things can be maintained once you move on. It's good for that at least.

Also, most other methods of data entry are slow and horrible. And excel makes it incredibly easy to do simple manipulations to the data. I do a lot of data handling in excel, and despite having it mess up my data many, many, MANY times, I still prefer to work in it. As bad as it is, every other way I've tried is worse.

Anyways you should be sanity checking your data every major step of the way.
posted by Mitrovarr at 1:26 PM on August 25, 2016

This does not surprise me at all. Excel is not a suitable tool for scientific use, and I'm perpetually disappointed (but not surprised) at how many of my colleagues use it. I've seen it lead to problems pretty much exactly like what's being described here. In one particularly memorable case, the data set included a column of three-character codes with a mix of letters and digits, including some like "7E3". The data was recorded by one research group using one version of Excel, and opened by my colleague using a different version of Excel. Even though the original researchers had marked the column as text, somehow during the file format conversion, Excel decided to drop that. (We're not talking hugely different versions, either, maybe Excel 2007 vs Excel 2010). When my colleague opened the spreadsheet, most of the values were fine, but the ones like "7E3" had been interpreted as numbers in scientific notation -- and then reformatted back to text, as "7000".

You should never let Excel touch your scientific data.

As for biostats work, I suspect these people should really be working in R, but that's a bit of a steep curve, especially for 'wet' lab-people without 'dry' technical backgrounds.

I am sympathetic to that point, but the fact is that as a scientist you have a responsibility to understand and use your tools appropriately, from data collection, to data management, to data analysis and presentation. Decades ago the "data management" side of things was your well-kept lab notebook. Today it includes the digital organization, structuring, and transmission of your data. Not everyone needs to be a coder, but learning some basics is pretty much a necessity at this point. If you're capable of writing and following a wet-lab protocol, you're definitely capable of doing the same for digital data management, even if you don't ever really learn to code.

I use Excel even when I probably shouldn't because it is right there and easy to work with, so I have a certain sympathy for the scientists involved.... but come the fuck on, that's why you doublecheck your outputs before you submit the damn paper and disable all autoformatting first.

If you're working with really large datasets, as is common in bioinformatics, double checking your data set by eye isn't necessarily an option. You may have many many thousands of rows , and if 1% of them are subtly screwed up in a way that doesn't lead to your analysis outright failing, you might never know.

No, seriously, what program? Your average biologist didn't multiclass programmer and doesn't have access to one, so any kind of coding is out (including R).

Honestly, if you have nothing else, then a text editor. Just enter your data one row per line, columns separated by commas. Save it with the ".csv" extension and open it with your favorite stats package. But many stats programs will let you enter tabular data through a graphical interface, without going through Excel. Matlab definitely does, and I've seen similar interfaces in more graphical stats software like JMP (though I haven't used it myself). The problem isn't the graphical interface (that's obviously an asset), the problem is Excel making irreversible decisions about your raw data for you.

And the reality is, in 2016, your average biologist does pick up at least one level of programmer during grad school, and if they don't, they're seriously hamstringing their own training. I've taught basic programming to many grad students in biological sciences. Few of them are ever going to enjoy it or be great at it, but understanding the bare minimum to store and manipulate your data set while maintaining its integrity is not hard for most people. And pretty much everyone I know who's serious about pursuing an academic career is comfortable analyzing their data entirely in Matlab, R, or Python (or all three).
posted by biogeo at 1:32 PM on August 25, 2016 [4 favorites]

On the other hand, basically everyone has excel, and everyone can use excel. It's universal. Other people will be able to open your files. Things can be maintained once you move on. It's good for that at least.

Also, I think my story above illustrates that Excel actually provides only the illusion of universality. Yes, other people will be able to open your Excel file, but there's actually no guarantee that what they open is identical to what you sent them. Text formats like CSV or TSV are much more likely to be universal (though again, if they are opening with Excel, there are no guarantees).
posted by biogeo at 1:35 PM on August 25, 2016 [2 favorites]

That's not Excel error, it's user error.

Format > Cells... > Text

Right. So now you've got an enormous database in which some of the text strings that are supposed to be "SEPT2" and "MARCH1" are now "42615" and "42430" instead of "2-Sept" and "1-Mar." Yay, you fixed it.
posted by straight at 1:36 PM on August 25, 2016 [16 favorites]

As an organic chemist (i.e. a knuckle-dragger) it's strange to me that, from the sound of it, the typical biologist has far more knowledge of data analysis than one of my brethren, yet we still (wrongly, I am sure) consider ours to be the "harder" science.

That's pretty funny. I think for biology, there may be a similar relationship to the social sciences. Most biologists would consider ours a "harder" science than sociology, but as far as I understand it, the idea that someone could be a social scientist without being able to run their data through ANOVAs and GLMs using R, SAS, or something similar would get them laughed out of the academy.
posted by biogeo at 1:41 PM on August 25, 2016 [5 favorites]

If you're working with really large datasets, as is common in bioinformatics, double checking your data set by eye isn't necessarily an option. You may have many many thousands of rows , and if 1% of them are subtly screwed up in a way that doesn't lead to your analysis outright failing, you might never know.

Ooh, good point there. I do have some very very large datasets for the genome analysis I'm working on, but those never see Excel at any point in their little lives. They get analyzed exclusively on the cluster computer system, because they're too big for my little laptop to do anything interesting with. (That said, I mostly use Excel for stuff where hand-entering data is a Thing I would like to do, and the hell with doing that for one of those datasets.)

Also.... what happened to JMP and SAS? I haven't used either since undergrad, since even in 2012 it was clear that R was where the biologists were headed, but both are spreadsheeting programs where IIRC it's not too hard to hand-enter your data if you want to on top of being capable of much better statistical analyses than Excel. Also, how are you people using Excel for those analyses at all? I did an ANOVA by hand once in my undergrad in Excel, before my PI at the time showed me how to use JMP, and I think she laughed herself silly once she found out what I had done and all the time I'd wasted before showing me how to do something better.
posted by sciatrix at 1:42 PM on August 25, 2016 [2 favorites]

I haven't really used JMP myself, but I've worked with a grad student who has used it heavily for her analysis. She's a bit R-phobic, but I think I helped cure her of it a bit when I showed her that for the slightly complex analysis we wanted to do on her data, the process in JMP of "click-click-click-click, error message, click-click-click-click, nonsensical result, click-click-click-click..." could be replaced with about three lines of fairly easy-to-understand R code. Frankly I find the JMP interface much more baffling than just coding some R, but YMMV.
posted by biogeo at 1:47 PM on August 25, 2016 [1 favorite]

Most biologists would consider ours a "harder" science than sociology, but as far as I understand it, the idea that someone could be a social scientist without being able to run their data through ANOVAs and GLMs using R, SAS, or something similar would get them laughed out of the academy.

Yeah, the joke is that because psych is so much squishier than molecular biology, psych researchers have to understand advanced statistics in order to make their data sets say something other than ¯\_(ツ)_/¯

/I keed because I love. In addition to deficiencies in programming, many biologists should up their deeper understanding of statistics beyong <point> look!! p < 0.5 !!!one!
posted by porpoise at 1:49 PM on August 25, 2016 [2 favorites]

dersins, you could like, read the article. This happens transparently and destructively even if you just open and save a file.

“opaquely”, I think.
posted by Going To Maine at 1:51 PM on August 25, 2016

I keed because I love. In addition to deficiencies in programming, many biologists should up their deeper understanding of statistics beyong look!! p < 0.5 !!!one!

I managed to write my entire dissertation without reporting a single p-value, despite doing a lot of heavy statistics. My committee was cool with this. I have a certain amount of pride about this fact.
posted by biogeo at 1:54 PM on August 25, 2016 [4 favorites]

p < 0.5 !!!one!

If you’re happy about p < 0.5, there are some deeper issues with statistics.
posted by Going To Maine at 2:15 PM on August 25, 2016 [10 favorites]

p < 0.5 means I'm more likely right than not, right?
posted by biogeo at 2:21 PM on August 25, 2016 [3 favorites]

My face will melt.
posted by Going To Maine at 2:27 PM on August 25, 2016 [6 favorites]

I'm surprised that RStudio doesn't have a GUI helper for data-entry, apparently.
posted by clew at 2:29 PM on August 25, 2016

Data entry in Excel may be bad, but when someone decides to do their data entry in hundreds of slightly-structured Word "questionnaire" documents, you wish they'd used Excel.

Step 1: Write VBScript to dump all the Word documents to text
Step 2: Write a hundred regexps in Python which capture nearly all the original data, dump to csv
Step 3: Make sure that CSV opens in Excel, because that's definitely where the eventual analysis is going to happen
posted by BungaDunga at 2:37 PM on August 25, 2016 [1 favorite]

Yes, other people will be able to open your Excel file, but there's actually no guarantee that what they open is identical to what you sent them. Text formats like CSV or TSV are much more likely to be universal (though again, if they are opening with Excel, there are no guarantees).

And when they open your CSV in Excel and resave as CSV who knows what you'll get!
posted by BungaDunga at 2:38 PM on August 25, 2016 [2 favorites]

Don't use Excel for bioinformatics

A problem that has been known for at least 12 years.
posted by MikeKD at 2:53 PM on August 25, 2016 [1 favorite]

Oh! RStudio *does* have a simple data entry GUI, it just opened up in a wierd place for me. And also the `scan` function and... oh, hey, as usual, the parts of R that aren't publishable stats results feel like a dozen fifty-percent solutions. Well, fair enough, I don't want to write that GUI either.
posted by clew at 2:53 PM on August 25, 2016

The thing that will deny me tenure: teaching evals from students who say 'we're bio majors, you can't expect us to (learn to) program!!!11!!'

profound head:desk

I tried.
posted by Dashy at 2:54 PM on August 25, 2016 [7 favorites]

Actually, text formats are not consistent! I have had to deal with windows to Mac or Linux encoding errors at least as much as excel munged data. Hell, I once spent a day troubleshooting a program error caused by macs using the 'wrong' hard return character!

That's true, but the CR vs. CR+LF thing has been around forever, and any good text editor should have no trouble reading either and resolving it for your platform. It's such a trivial issue to resolve that a program producing an error because of the wrong endline representation should basically be treated as a program bug, not a problem with the file. I shuffle text files back and forth between Mac OS, Windows, and Linux all the time, and I can't remember the last time I was even forced to think about it.

As for actual text encoding differences between platforms, with the advent of UTF-8 Unicode those issues are in my experience mostly a thing of the past. Thank God, because I vividly recall how obnoxious it was opening up a simple file that had been created on Windows in Mac OS, and all of the curled quote marks were interpreted as é or ¿ or something like that.
posted by biogeo at 3:15 PM on August 25, 2016 [2 favorites]

Unless we engage in No true Scotsman fallacies, there are many kinds of scientists and not everyone has to handle massive datasets, perform complex statistics or learn computer languages to do their jobs. There's a large number of use cases and the reason why Excel is such a common tool in science is that if performs adequately in many, if not most, of these cases (though not in genomics and bioinformatics apparently... and for anyone interested, the built-in Solver has some precision issues). The date conversion problem is annoying and MS should do something about it, but I wouldn't be surprised that for many scientists, the default behaviour is actually what they need and that their datasets are small enough that the occasional bad conversion is not such a problem. This discussion about Excel is similar to that about Word ("Word sucks, everyone should use LaTeX to write papers!"): your use case is not everyone's use case.
posted by elgilito at 3:23 PM on August 25, 2016 [6 favorites]

As for actual text encoding differences between platforms, with the advent of UTF-8 Unicode those issues are in my experience mostly a thing of the past

UTF-8 support in Excel is as perilous as its auto column format conversion.
posted by ethansr at 3:25 PM on August 25, 2016 [1 favorite]

simple binary numbers. I'll... watch out for it deciding to change that.

Excel would very much like to remove the leading zeros from your binary numbers: that is probably the first thing to beware of.
---
I guess that there's a deep legacy / backward compatibility issue involving some quarter century old-spaghetti code nobody wants to touch. Excel 2016 has still options for Lotus 1-2-3 compatibility..

I think I remember reading on Joel Splotsky's blog that Excel thinks (incorrectly) that 1900 was a leap year, intentionally, because Lotus 1-2-3 had that same error and they wanted to make sure that any Lotus 1-2-3 users could carry over their work without any issues.

Leap years come every 4 years except for years divisible by 100, unless those years are also divisible by 400
posted by thelonius at 3:29 PM on August 25, 2016 [1 favorite]

Some time ago I learned about Category Errors, which are problems caused by treating something as something it fundamentally isn't, and consequently having issues caused by expecting something unreasonable of it.

In my experience as a freelance translator, Microsoft Office is to category errors what a can opener is to a house full of cats.

There's nothing more fun than translating recipes where I need to be able to type "1/2" in a cell next to the one that says "tbsp" and having Excel "helpfully" convert it to January 2nd, despite telling it over and over that NO that cell is NOT a DATE.

Don't even get me started on page layout. The only thing worse than people trying to use PowerPoint for page layout is people trying to use Word for page layout.

Grump grump grump
posted by DoctorFedora at 3:30 PM on August 25, 2016 [3 favorites]

The second answer here gives various ways of handling this problem, including in Excel add-in that gives you an "Open CSV" button. Not saying it isn't an issue, but it might be helpful to those that need to use Excel and this causes problems for.
posted by markr at 3:32 PM on August 25, 2016

This discussion about Excel is similar to that about Word ("Word sucks, everyone should use LaTeX to write papers!"): your use case is not everyone's use case.

I agree with your point that not all scientists need the same tools or skill sets, but this is qualitatively different. Choosing to write your papers in Word rather than LaTeX doesn't affect the integrity of your data. Letting Excel touch your data potentially does, in subtle ways that can be hard to detect. Not everyone needs to code R, Matlab, Python, SAS, or whatever, but no one should use Excel. Of course they will (I'm currently working in a lab that is heavily wedded to Excel for a lot of its work and isn't going to move away from it any time soon, though I don't use it at all for my work), but there are many reasons why this is a bad idea. If a conventional spreadsheet really is the best way of working for a class of researchers, then they need to demand a research-grade spreadsheet application. Excel isn't it, and isn't going to be.
posted by biogeo at 3:36 PM on August 25, 2016 [5 favorites]

UTF-8 support in Excel is as perilous as its auto column format conversion.

I don't know why I'm surprised by this.
posted by biogeo at 3:37 PM on August 25, 2016

From the article: "The systemic error was not, however, present in Google Sheets."

If you haven't used Google Sheets recently, they've come a very long way as a viable Excel alternative.
posted by treepour at 3:46 PM on August 25, 2016 [1 favorite]

All programs, not just Microsoft's, are made to default to the features most users want. If you're not most users then you should familiarize yourself with File->Options.
posted by rocket88 at 3:57 PM on August 25, 2016 [2 favorites]

I'm currently working in a lab that is heavily wedded to Excel for a lot of its work and isn't going to move away from it any time soon, though I don't use it at all for my work
This is what I don't understand: why "no one should use Excel" if your lab colleagues are actually comfortable with it? What is Excel intrinsic "wrongness"outside the problems cited above? To find Excel wrong for certain types of work is OK, but I don't get why it has to be such a general principle, applicable to all scientists.
posted by elgilito at 4:04 PM on August 25, 2016 [1 favorite]

This whole thread is giving me cold sweats and flashbacks to when I worked as a lab tech in the faculty of agriculture here. I analyzed soil samples all day long and they were typically labeled with sets of two digit numbers: e.g. 01-02 or maybe 01-02-03. These corresponded to particular plots or something (I do chemistry not soil science). Anyways Excel naturally thought these were dates too and just buggered up all the spreadsheets for fun, but Excel is so much easier to do data entry into (like straight from the lab book into the computer), and it was compatible between different labs doing different things and with differing levels of technical competence, so that's what we used, always to be careful and put an apostrophe before the sample id.

The big issue we had was with historical data, which may have been saved in any number of old excel formats or csv, and had inherited whatever excel pox along the way. More than once I had to track down old lab notebooks to figure out what that line of data was. Being agricultural experiments there was often a need to go back through years of results (one set of agricultural experiments we dealt with has been going on since the 1920s for example) and data management was done primarily in a shared drive full of excel spreadsheets. Thank god we kept our lab notebooks (in a big pile in our filing cabinet!).

I see that as being a big problem going forward. It's great that (apparently) everyone is using R these days, but the curse of Excel is still there in all those historical datasets. Doomed to plague us for all eternity.
posted by selenized at 4:12 PM on August 25, 2016 [4 favorites]

All programs, not just Microsoft's, are made to default to the features most users want. If you're not most users then you should familiarize yourself with File->Options.

According to everything in this thread, Excel's behavior around automatic conversions cannot be turned off globally.
posted by BungaDunga at 4:15 PM on August 25, 2016 [6 favorites]

That's not Excel error, it's user error.

Format > Cells... > Text
Right. So now you've got an enormous database in which some of the text strings that are supposed to be "SEPT2" and "MARCH1" are now "42615" and "42430" instead of "2-Sept" and "1-Mar."

Ouch, that sucks. Maybe next time try entering or importing your data into a pre-formatted sheet instead of trying to reformat after it gets fucked up.

Yay, you fixed it.

I know!
posted by dersins at 4:19 PM on August 25, 2016

This is what I don't understand: why “no one should use Excel” if your lab colleagues are actually comfortable with it? What is Excel intrinsic “wrongness” outside the problems cited above? To find Excel wrong for certain types of work is OK, but I don't get why it has to be such a general principle, applicable to all scientists.

biogeo has it above, really. If the potential mangling of dates and values won’t really affect you, go to town. If they might, it’s worth looking elsewhere. (Also, at one point, its stats were not good. Maybe that’s been rectified.)

That said, this comment seems to be missing the forest for the trees: the issues described above are giving people troubles that they don’t realize. So at least some scientists should really be getting away from Excel because it is screwing things up for them and they don’t even know it.
posted by Going To Maine at 4:20 PM on August 25, 2016 [1 favorite]

All programs, not just Microsoft’s, are made to default to the features most users want. If you're not most users then you should familiarize yourself with File->Options.

User-blaming is not cool.
posted by Going To Maine at 4:22 PM on August 25, 2016 [2 favorites]

Last time I checked, Open/Libreoffice imports your .csv as plain text without alteration. And it has regex search-and-replace to manipulate your date/whatever formatting if necessary, as opposed to Excel (once more, last time I checked; which was admittedly quite a while ago).
posted by farlukar at 4:30 PM on August 25, 2016 [2 favorites]

It's such a trivial issue to resolve that a program producing an error because of the wrong endline representation should basically be treated as a program bug, not a problem with the file.

Yep. On the other hand, what is not trivial is figuring out why a program that takes a text file as input (formatted in a special way) is failing mysteriously despite having an input file that is correct as far as you can tell.
posted by Mitrovarr at 4:43 PM on August 25, 2016

This is what I don't understand: why "no one should use Excel" if your lab colleagues are actually comfortable with it?

Because it's constantly screwing up their data in ways that they have to be on guard against, and while I trust their skill and integrity, it's really, really hard for a human to manually validate that much data. If there are subtle problems introduced by the fact that Excel is a general-purpose rather than scientific data management program and makes strong assumptions about the data itself (not just its structure or format), I am not convinced that even good scientists will necessarily catch it. This whole article is about the fact that apparently, 20% of the time, they don't.

Not to mention the time wasted by finding and fixing these errors which could be better spent on science. A lot of my colleagues recognize that Excel has problems, but believe the extra time and effort to learn something more suitable isn't worth it in the end. But when you account for the time and effort spent cleaning up Excel's mistakes (seriously, I have sat through 20 minute conversations during lab meetings in which my colleagues are trying to resolve issues in an Excel spreadsheet that they share, which stem from trying to force a spreadsheet program to act as a relational database engine), I'm pretty sure that learning a tool more appropriate for their needs would pay off pretty quickly. And that's not even accounting for problems that they don't notice.
posted by biogeo at 4:44 PM on August 25, 2016 [2 favorites]

Yep. On the other hand, what is not trivial is figuring out why a program that takes a text file as input (formatted in a special way) is failing mysteriously despite having an input file that is correct as far as you can tell.

I am definitely sympathetic to this. A lot of special-purpose scientific computing software is absolutely riddled with bugs, quirks, and unhelpful or nonexistent error messages. Which is a very important counterpoint to my call for a research-grade spreadsheet application, I suppose.

Oddly enough, I find open-source science software tends to be less prone to this than commercial science software. Or maybe I'm just less annoyed by a bug in the software package that a grad student wrote in the third year of her Ph.D. that I downloaded for free from Github than I am by a bug in the software package the lab paid several thousand dollars for.
posted by biogeo at 4:51 PM on August 25, 2016 [2 favorites]

A common defence of Excel is that it is misused, and is an otherwise fine tool for finance and accounting. This is a profoundly mistaken belief that has caused no small amount of grief. I dealt with the consequences of these errors every day. Given the other gaping holes in accountability for the financial sector, Excel's habit of subtly changing your data could lead to things like people getting hit on their credit score if lucky, subpoenaed if slightly less lucky, or a default judgment plus lien on their house if especially unlucky.

Consider the following example. Something similar to below is actually representative of lots of data files that are used as the sole representation of actual accounts in lots of cases, like when a merger requires moving accounts to a new system, or when debt is bought and sold on a secondary market:

"CID","Exp","Opened","LastName","GivenNames","DoB",
"4520230012345678","04/10","200102","Gates","William Henry","1955-10-28"
"4520230012345677","06/12","200410","Ballmer","Steven Anthony","1956-03-24"
"4520230012345670","08/12","200812","Developer","Developer Developer","1970-01-01"

Pretty basic, right? A little odd that the expiry and the opened dates are different, but these are both common ways of storing month and year in files like these. Both hew to pretty standard conventions. One is exactly how it's seen on the card, the other could possibly be confused as ddmmyy, so is not ideal, but usually the format of these things is a known quantity.

So what happens if someone opens this file in Excel, does nothing, then saves the file? Now it looks like this:

CID,Exp,Opened,LastName,GivenNames,DoB

4.52023E+15,10-Apr,200102,Gates,William Henry,1955-10-28

4.52023E+15,12-Jun,200410,Ballmer,Steven Anthony,1956-03-24

4.52023E+15,12-Aug,200812,Developer,Developer Developer,1970-01-01

Apparently Excel assumes the credit card numbers—that were enclosed in otherwise redundant quotes to mark as text—are just big integers. Wouldn't you rather treat these as floats? I'll just get rid of half of the significand for you. You didn't need it anyway. Also, these numbers look like dates. I'll just assume they are mm/dd. Hmm, this file is missing carriage returns before all these line breaks, I'll just drop those in. You don't actually need quotes here, let me get rid of them for you.

Your observant user sees that things don't look right in Excel, so what to do? Well, formatting the card number as text doesn't do anything, maybe as a number with zero decimal places? That looks right. The dates are also displayed wrong, let's format them both as mm/yy, because that's what our system expects when importing a CSV. It looks right, so save. This is the result:

CID,Exp,Opened,LastName,GivenNames,DoB
4520230012345670,04/16,11/47,Gates,William Henry,1955-10-28
4520230012345670,06/16,09/48,Ballmer,Steven Anthony,1956-03-24
4520230012345670,08/16,10/49,Developer,Developer Developer,1970-01-01

Now it looks superficially right, but your card number has the last digit helpfully rounded to zero. Whereas before for the dates, only the expiry was changed without our knowledge, now the opened date looks bizarre too. What happened? Well, actually, both dates have suffered data loss. Someone getting the first mangled file who knew what to expect could figure out that 10-Apr was actually year and month, and not day and month as Excel assumed. Now that information is lost, as we are stuck with the current year instead of the correct one. The next problem is even more baffling. It turns out that if you date format a cell containing yyyymm values, Excel will take this ambiguous value and make the most reasonable assumption, that it is the number of days between 1900-01-01 and some date in a Buck Rogers storyline. You might be forgiven for seeing 11/47 and thinking "that's crazy, treating that date as 1947", but that would be wrong. It isn't even 2047, but rather 2447. Logical, right? I worked with files like these on a regular basis between 2008 and 2012, which really helped things in the ambiguity department.

Well, that's just a silly default, right? You can change settings so it opens CSVs with all columns treated like text, right? No. You do not have this option. You must instead go to Data > From Text. Now select your file using the open file dialog and use the Wizard. Uncheck tab and check comma for your delimiters. You see next to it that the default 'Text qualifier' is double quotes. Good thing your file has that, so everything will be treated as text, right? Wrong. It still assumes 'General' for every field where every value is enclosed in quotes. Select all the columns that need to be formatted as text and click on the 'Text' radio button. Easy as that.

In conclusion, Excel is a land of contrasts between what your data is, and what Excel thinks it should be changed to. Seriously, fuck Excel. I worked with it every day for years. I have written thousands of lines of VBA for it. I have made locked down workbooks full of sheets of named data ranges and lookup formulae and pivot tables and external data references that take 30 minutes to calculate because the higher-ups demanded all the work be done in Excel. Workbooks where all tables were third normal form, because it was the only way they could be managed. I made a spreadsheet that actually works as an effective mandlebrot set explorer because I was bored one day. I know how to use Excel properly. Often, people ask me for advice with Excel. My first piece of advice is find any other way to accomplish your task. Excel should be your very last resort.

Anyone who saw the date 1970-01-01 and was immediately suspicious is my people. I have yet to encounter anyone born on that date. Instead, their parents improperly cast null to zero.
posted by [expletive deleted] at 5:02 PM on August 25, 2016 [39 favorites]

> A problem that has been known for at least 12 years.

Or, put another way, a bug that has not been patched in at least 12 years.

The link posted by pemberkins above contains an exercise as instructive as it is terrifying. To paraphrase it:

With a text editor (or cat or whatever, but NOT excel), create a file with this data; call it foo.csv:

Gene Symbol,Long Name,Value
SEPT2,Septin 2,0.5
MARCH1,Membrane Associated Ring Finger 1,2.3
DEC1,Deleted in Esophageal Cancer 1,1.1

Open foo.csv in excel (open, not import). Observe that excel has decided that the "Gene Symbol" column contains dates, and has reformatted it to something "prettier". Make mental note of what that "prettier" version is.
Click the save button, but when it asks for confirmation, back out & DO NOT LET IT SAVE. Instead, when excel prompts you (because it doesn't want to save a .csv file), hit escape or click cancel. DO NOT SAVE. If it pops up another save dialogue, cancel out of those as well. DO NOT SAVE.
Quit excel, and again DON'T LET IT SAVE. When it asks if you want to discard/lose changes, say yes. DO NOT SAVE.
Observe that we DID NOT SAVE. We opened foo.csv, idly thought about saving, reconsidered, and then told excel -- not once but twice -- NOT to save the file.
Pop quiz: what should foo.csv look like, given that we DID NOT SAVE after opening it in excel?
Look at foo.csv with a text editor (or less or whatever) -- anything other than excel.

And THAT is a major problem. The user has no expectation that their file will have been modified -- they specifically told excel "no" multiple times -- and yet it has been, overwriting the original data.

(In case you are wondering: the write happens as soon as you click save, before the confirmation box pops up. The "cancel" button is a complete ruse at that point; the csv has already been written.)
posted by Westringia F. at 5:10 PM on August 25, 2016 [29 favorites]

why "no one should use Excel" if your lab colleagues are actually comfortable with it? What is Excel intrinsic "wrongness"outside the problems cited above?

If you do this, people who don't have the hard won knowledge of what Excel can do to your data if you are not careful will encounter all of the issues discussed above.

Excel is the "missing stair" of data tools. He's mostly a great guy, but he's a little weird around dates and large integers. Lots of people really like him, and why should we kick him out of our lab since anyone who's been here a while knows what he's like?
posted by [expletive deleted] at 5:18 PM on August 25, 2016 [5 favorites]

Excel is the lingua franca of data tools. Throwing it out because it has problems is like throwing out the English language; even if you find a better language somewhere, like Esperanto or something, good luck ever working with anyone else using that.
posted by Mitrovarr at 5:23 PM on August 25, 2016 [2 favorites]

biogeo, I just read your earlier comment more carefully (having largely skimmed it the first time, sorry) and you make a pretty compelling argument. I'm out of the game at this point, but if I weren't, I'd be thinking hard about cutting Excel out of my workflow and improving my R skills to compensate.
posted by Anticipation Of A New Lover's Arrival, The at 5:46 PM on August 25, 2016 [2 favorites]

Word sucks, everyone should use LaTeX to write papers!

LaTeX is a garbage fire. TeX is ok, except that Don Knuth is exhibit A on why people who think like a computer should never be allowed to design software that is going to be used by humans.
posted by ennui.bz at 6:21 PM on August 25, 2016 [3 favorites]

Honestly, this is only halfway there, can I add:

Don't use names for bioinformatics, use IDs.

(...but then, I already don't use excel, so I have no idea what horrible thing it would do with ENSG00000186522 and ENSG00000144583)
posted by kevin is... at 6:25 PM on August 25, 2016 [1 favorite]

What you should do with an example like the one above from [expletive deleted], in a Microsoft-Office-only ecosystem is, import the .csv file into Access, where you can better control the data type of the columns in the target table, and then export to Excel. But a lot of installs of Office don't come with Access and far fewer people know how to use it than are familiar with Excel. It has a well-deserved poor reputation as a multi-user database solution, but it is quite convenient for things like data importing scut work.
posted by thelonius at 6:37 PM on August 25, 2016 [1 favorite]

Excel is the lingua franca of data tools.

That’s pretty discipline-dependent; it was always interesting to me that I can find solutions for R and python problems quite easily on Stackoverflow but questions about Excel Macros took me to strange, weird corners of the Internet.

Further, Excel isn’t the lingua franca - graphical table manipulation tools are the lingua franca. As noted in the article Google Sheets doesn’t have the same bug as Excel does here - but for all practical purposes its interface is the same. Let’s not confuse the product’s design with the product itself.
posted by Going To Maine at 7:06 PM on August 25, 2016 [3 favorites]

Tutorial: A few words for R beginners
posted by a lungful of dragon at 7:07 PM on August 25, 2016 [1 favorite]

(...but then, I already don't use excel, so I have no idea what horrible thing it would do with ENSG00000186522 and ENSG00000144583)

Symbol names for genes, transcripts etc. are already a pain in the ass. Not much time left to cure cancer when Excel turns your Nobel Prize-winning genes of interest into lousy Smarch weather.
posted by a lungful of dragon at 7:12 PM on August 25, 2016

I like Excel for quick'n'dirty analytics, but the moment the dataset is financially critical I high-tail it to plain text editor land.
posted by Doleful Creature at 7:20 PM on August 25, 2016

We use Excel (and, urp, Google Sheets) for historical archaeology data. It does indeed mess up dates and "helpfully" convert plain text to other things.

On the other hand, none of us have the time to learn R just to enter transcriptions from 150 year old census documents.
posted by teponaztli at 7:28 PM on August 25, 2016

I think it's a bit of a bummer that Google Refine (now Open Refine, I guess) seems to have stalled out. It seemed like a nice compromise between graphical editing and reproducibility.
posted by kevin is... at 7:31 PM on August 25, 2016 [1 favorite]

This thread was totally worth the price of #NerdFight admission.

And I mean that with love.
posted by mudpuppie at 7:46 PM on August 25, 2016 [3 favorites]

I think the "move away from Excel" advice misses a problem. This isn't about how I handle data -- it's about how my lab mate and my predecessor and my summer intern handle data. If Excel alters .csv files easily, the only way to trust that a .csv file hasn't been modified is to have a full audit trail about who opened the file when and for what purpose.

Which is easy, when it's my data on my hard drive. It's next to impossible when it's the data collected by my predecessor (whom I never met, but everyone says was a fantastic researcher) that's stored on whatever server we're using. (Ironically, this is one case where a stack of burnt CDs might be safer in the long run.) Can I be sure the summer intern (who didn't speak English fluently) never opened the file in Excel? Or the first-year grad student, preparing for his first group meeting, who wanted to use my co-worker's final data set to compare to his own?

If the answer to all of these is yes, you have never worked in the kinds of labs I worked in.
posted by steady-state strawberry at 7:53 PM on August 25, 2016 [3 favorites]

> Excel is the lingua franca of data tools.

That’s pretty discipline-dependent

Seconded. My work tends to intersect with several different disciplines and subdisciplines. Some people use Excel as a matter of course, and assume everyone uses it because everyone they know does. Other people are shocked to discover that anyone would even consider it for data analysis, because no one they know does.
posted by biogeo at 7:57 PM on August 25, 2016

Ways forward:

Johns Hopkins' Reproducible Research course is interesting (and just starting up again).

A couple resources for organising a data workflow that embraces spreadsheets.
posted by ethansr at 8:09 PM on August 25, 2016 [2 favorites]

This isn't about how I handle data -- it's about how my lab mate and my predecessor and my summer intern handle data. If Excel alters .csv files easily, the only way to trust that a .csv file hasn't been modified is to have a full audit trail about who opened the file when and for what purpose.

This summarizes it perfectly. The majority of the time, Excel isn't going to do anything nasty to your data, especially if you stay entirely within the Excel ecosystem. But in certain cases, Excel will irreversibly modify your raw data without asking or warning, and the cases in which this may happen are not obvious, particularly if you're a new trainee with a hundred other things to think about. Even if you adhere perfectly to good practices, your collaborators may not.
posted by biogeo at 8:28 PM on August 25, 2016 [2 favorites]

I believe that when it is reasonable to do so, original data should be stored in human-readable plain text. Issues may arise (like the endline problem Mitrovarr brought up), but then at least you always have the option of looking at the raw file to understand what went wrong and develop a strategy for recovery. Humans are also really good at automatically inferring the structure of data in a plain text file, so if in ten years someone wants to reanalyze your data but has no idea what program they need to open that proprietary .pdq file you stored your data in, having a plain text copy means your data set is still analyzable, and still has life.

CSV isn't perfect as a plain-text format by any means, but it does satisfy the human-readability requirement. JSON and YAML are also good plain-text formats for scientific data, but it's more likely that you'll need some custom software to analyze it or import it into an analysis package. Just about everything can open CSV (though there are different "dialects" which need to be considered in some cases), so I lean towards it as the original data format for my raw data.
posted by biogeo at 8:39 PM on August 25, 2016

I came in here to say I HATE EXCEL with a passion for all the reasons described (including that it comes bundled with every single computer my university buys ....grrrr). But after reading all the comments I also have to say, coming from the super super super duper soft science of Cognitive Science, that grad students who don't learn how to analyze their data in R are royally behind the eight ball when they go on the job market. R is the defacto standard for doing simple (linear mixed models) and complex sampling (like divergence analysis). Data analysis with R after years of using a drop down menu from a GUI program was like moving from the stone age of data analysis. I'm being kind of amazed that you super super super duper hard science people don't use R (or python or matlab) as a defacto standard!
posted by bluesky43 at 9:01 PM on August 25, 2016 [1 favorite]

I do everything in R, have for years, and strongly encourage all my students to do the same. But yes, getting your data into R to begin with is still a pain. I long for a program that's just a CSV editor - oh sure there are some out there, but most of them are overkill (require you to define the column types before you start etc.). I just want a grid of cells. I just want Excel to have a big red button marked "Turn Off EVERYTHING".

A super evil thing about Excel is the way it encourages doofuses to encode data as cell colours.
posted by Jimbob at 9:39 PM on August 25, 2016 [2 favorites]

R is the defacto standard for doing simple (linear mixed models) and complex sampling (like divergence analysis).

Yeah, okay, but this is really more about data entry, data storage, and simple data manipulation than it is about actual analysis. I mean, I use R/Stata/HLM for analysis, but most all of my data was entered in excel and stored as csvs, and I do simple merging and manipulation in excel.
posted by ROU_Xenophobe at 10:03 PM on August 25, 2016

I'm on the 'meh' side of the Excel hating. It's a tool, I don't like it much, and prefer other things far more, but even as an informatics guy in the sciences there are a few things I'll use it for. It can definitely slide people into a bad situations without them realizing it though (like using it for "programming".)

I wouldn't take money that the average scientist I work with is less likely to introduce an error with R or Python by using it for a few hours of data handling. If you come up with ways to check your data integrity after a series of operations (or before importing) you can catch them and fix them, but then the same thing goes for Excel manipulations too . . .

Excel is the lingua franca of data tools.

Heh. More like the pidgin English or pantomime of data tools.
posted by mark k at 10:10 PM on August 25, 2016 [1 favorite]

Fun fact: the most recent version of Excel has three different ways to read CSV files, each with different behavior.

1) Opening the file directly
2) Importing the file into the current workbook
3) Importing the file into the current workbook using "Power Query"

I work on the Power Query functionality, which is also available in Excel 2013 as an addin (though not yet supported on OS X). We avoid the "magic" conversion functionality that's so frustrating and hard to change in the first two paths by doing the same kind of inference on e.g. data types but then using it to generate a query. If the inference was wrong, you just delete the step from the query.

Unfortunately, this is aimed squarely at bringing the data into a new workbook in a repeatable fashion. If what you want is to open the CSV file, edit it, then save it in the CSV format, you're still in the hell of behavior #1.
posted by Slothrup at 10:51 PM on August 25, 2016 [1 favorite]

BLAME CHARLES SIMONYI
posted by iffthen at 12:55 AM on August 26, 2016

This is just ... not news. This was a well known issue when I was writing my PhD thesis six years ago, and for at least the two years before that when I was working with the data. I even wrote a sentence acknowledging it in my thesis methods.

I did also have my data in both R (from the bioinformatician) and in an Access database, but they were both either inappropriate or total overkill for what I was doing at the end. I just needed to sort and filter data to format into Word tables for my thesis and publications. Whereas Excal worked just fine, as long as I took the time to put the little ' in front of those pesky date-like gene names the first time I put the data into a spreadsheet.
posted by shelleycat at 12:57 AM on August 26, 2016 [1 favorite]

Or maybe I'm just less annoyed by a bug in the software package that a grad student wrote in the third year of her Ph.D. that I downloaded for free from Github than I am by a bug in the software package the lab paid several thousand dollars for.

Slightly OT but this, above and beyond anything else, is why I've switched from encouraging my students to learn ArcGIS to encouraging them to learn QGIS (and R). ESRI charges through the nose for software that has bugs that haven't been fixed in a decade. Can you convert floating point polygons to rasters yet? Or do you still have to multiply them by 1,000, convert the field to integer, convert to raster, then divide by 1,000?
posted by Jimbob at 1:00 AM on August 26, 2016 [2 favorites]

A few years back I got pissed off enough at Excel's tendencies to automangle data that I went looking for a deployable Group Policy that would turn that shit off by default for all new files created / opened after that point. There was no way -- you could turn off the autotype conversion only on a per xlsx basis, once the file had been created.

Also, R is no great shakes when it comes to automangling data. There are a hell of a lot of "helpful" features in R dataframe manipulation that can bite you (see The R Inferno for more details).
posted by benzenedream at 1:31 AM on August 26, 2016 [1 favorite]

As an academic biologist that ended up in industry, using Excel is far more sophisticated and less prone to error generation than the hand written lab notebooks that the company regulatory group and FDA required, even for huge datasets.
posted by waving at 3:00 AM on August 26, 2016

So the comments here got me googlin', and I eventually found a program called Ron's Editor... it looks interesting and like it might solve a lot of problems in the world of CSV/flat-file data entry. Anybody here ever tried it?
posted by Doleful Creature at 6:03 AM on August 26, 2016

(Also, at one point, its stats were not good. Maybe that’s been rectified.)
Well, that's what getting me, because it fucking hasn't been! I've used Excel for stats, especially right before I sucked it up and went to go practice my ancient R skills so I could get some actual decent analyses done. You can't do fuck-all for stats in Excel without manually calculating them using several sub-formulas or, IDK, effectively writing your own macros! It's half a step up from manually calculating your stats using a fucking hand calculator! So all the comments here about people actually using it for statistical analysis (as opposed to, basically, a way to create a .csv without having to think too hard) are completely blowing my tiny little mind. You guys, even with the learning curve, picking up some basic programming skills in R is so much better, I promise!

As an academic biologist that ended up in industry, using Excel is far more sophisticated and less prone to error generation than the hand written lab notebooks that the company regulatory group and FDA required, even for huge datasets.
Do not get me started on this. Why is it that the university insists on hand written lab notebooks for coding pipelines? Why can't I just, IDK, have a PDF that I print off and bind every few months or something? I don't want to take the time to handwrite my code and the things I'm trying, especially when I'm already keeping a typed electronic record somewhere accessible to the lab and stored in eighty bajillion places for actual use! grump grump grump
posted by sciatrix at 6:53 AM on August 26, 2016 [2 favorites]

The only thing you can do is Data > Get External Data > Import Text File. This lets you format the column as text BEFORE the damage is done.

That was news to me too, and it's probably about the same as the Import utility in Access that I was promoting above.
posted by thelonius at 9:19 AM on August 26, 2016

I can't believe nobody has yet linked to the definitive data on Using Excel as a Database.

(Yeah it's old, but I love it.)
posted by w0mbat at 9:45 AM on August 26, 2016 [1 favorite]

So the comments here got me googlin', and I eventually found a program called Ron's Editor...

Yep I have, it's one I described in my comment above as "overkill", requiring (as far as I could tell while trying it) you to define the number and type of each column before you can start entering data.

Now, I understand completely that this is actually a Good Thing. Forcing you into a strict format, where each colum's, say, date format is defined in advance and it holds you to it, is a good way to make sure your data entry is correct and standardised and makes you think about what you're doing. But it's not a drop-in replacement for Excel. After I installed it, it was set up as the default file handler for CSV files, and when I double-clicked CSV files I found myself thinking "oh shit it's going to open in Ron'a Editor isn't it...abort abort!"

I'm sure it would be good if I spent time getting used to it.
posted by Jimbob at 1:57 PM on August 26, 2016

I can't think of anything that deserves statistical analysis and doesn't deserve defining and checking the column types, Jimbob. Seems like a great way to catch some errors and no work that I wouldn't have to do at some point anyway.
posted by clew at 2:25 PM on August 26, 2016

Yeah I should give it another go.
posted by Jimbob at 2:40 PM on August 26, 2016

I remember I laughed at this when it came up a while back, but much later I had a bunch of hexadecimal numbers that I had to do bulk math on, and I thought, "fuck it, just use excel".

Then there was cursing and much digging to figure out how to get numbers that started with 0E to not convert to 0.
posted by ckape at 12:42 AM on August 27, 2016 [2 favorites]

Google Sheets is literally tens of thousands of times slower than Excel. It's not a legitimate alternative.
posted by Yowser at 6:50 PM on August 27, 2016 [1 favorite]

I'm gonna side with tclark way above; Excel is one of the least unreliable pieces of software I use. Is it great for the lab? I dunno. But it's kind of a swiss army knife/lowest-common-denominator tool that is on pretty much every windows machine, and pretty much the only MS product I still rely on.

The errors are obviously not all on the user, and Excel does some weird shit, but I suspect the kind of people who aren't noticing their data getting reformatted in the cell wouldn't do any better with an actual database.
posted by aspersioncast at 9:53 PM on August 28, 2016

Google Sheets is literally tens of thousands of times slower than Excel. It's not a legitimate alternative.

I'm super curious whether you meant that "literally" literally and if so, what kind of crazy data manipulation you'd have to be doing to know that.
posted by straight at 1:13 PM on August 30, 2016

Scientific community capitulates to Microsoft, officially changes all gene names to dates
posted by grouse at 5:01 AM on September 9, 2016 [1 favorite]

« Older All mixed up | "I encourage ESPYs to…change the category to Best... Newer »

This thread has been archived and is closed to new comments

MetaFilter

Gene name errors are widespread in the scientific literature
August 25, 2016 11:23 AM Subscribe

Tags

Share

Gene name errors are widespread in the scientific literature August 25, 2016 11:23 AM Subscribe

Tags

Share

Gene name errors are widespread in the scientific literature
August 25, 2016 11:23 AM Subscribe