Skip

"No crime is so great as daring to excel." -- Winston Churchill
March 4, 2009 2:00 AM   Subscribe

A little detective work traced the problem to default date format conversions and floating-point format conversions in the very useful Excel program package. The date conversions affect at least 30 gene names; the floating-point conversions affect at least 2,000 if Riken identifiers are included. These conversions are irreversible; the original gene names cannot be recovered.
Yet another reason not to use Excel as your "database".
posted by orthogonality (153 comments total) 39 users marked this as a favorite

 
I do some work in bioinformatics and this kind of thing is common. Biologists need tools to handle their data and haven't had anyone really build them the things they need. The state of software in that field is atrocious. They end up using old bad software meant to do the job, or off the shelf solutions that weren't meant for the job.
posted by magikker at 2:08 AM on March 4, 2009 [6 favorites]


huh? was anybody on earth ever stupid enough to use excel as anything more than a slightly glorified calculator?!??
posted by UbuRoivas at 2:09 AM on March 4, 2009


big news in 2004
posted by mary8nne at 2:22 AM on March 4, 2009 [1 favorite]


Is it ok if I use it as a "spreadsheet?"
posted by fixedgear at 2:27 AM on March 4, 2009 [10 favorites]


Ubu, I'd guess that about half the fortune 500 companies use Excel for all their business analysis.

Most of the things that instantly raise my blood pressure to boiling rage relate to computer software anticipating my needs, and failing utterly to guess correctly. Whether it's turning (c) into ©, refusing to let me capitalize the second letter in a text string, or guessing how I want dates formatted. It's not hard to find the override, but I'd rather be able to customize features on or off as I choose.
posted by BrotherCaine at 2:36 AM on March 4, 2009 [15 favorites]


huh? was anybody on earth ever stupid enough to use excel as anything more than a slightly glorified calculator?!??

That's pretty harsh. It's easy to use, flexible, ubiquitous. I can't think of a single business I've worked in that didn't base their day to day operations round excel spreadsheets.
posted by MuffinMan at 2:37 AM on March 4, 2009 [1 favorite]


MuffinMan: it is notable, however, that they do not base them around Excel databases...

Excel is probably the single software package that sees the most abuse, I reckon. The number of databases (and even ugly hacks to get psuedo-relational ones), graphic representations using cell colour, and even interactive applications (based around a "database" in sheet two of an Excel file) that I've seen...
posted by Dysk at 2:43 AM on March 4, 2009


BrotherCaine: I guess that the line of reasoning is that 1) the vast majority of users actually need these sort of default behavior, 2) they would be lost if it didn't work like this out of the box and 3) they wouldn't know how to turn it on without external help. The sort of people who want to write stuff like ABc or (c) are also the sort of people who can find the override easily. I've been there but as a person who tend to switch between these modes I don't think that there's an easy solution to this problem.
posted by elgilito at 2:52 AM on March 4, 2009


to clarify: D2D business, ok.

but anything requiring any kind of precision around things like decimals, dates, etc...excel does too much behind the scenes that would easily fuck up anybody who wasn't already aware - mostly through hard experience - of what it was doing.

to give just one example: if you reduce the number of decimals in a cell or range of cells, it only reduces the *displayed* number. the underlying .14159265358979323846 is still there influencing any calculations.

in its favour, it's nowhere near as evil as ms project in the bizarro random, uncontrollable shit it foists upon its user.
posted by UbuRoivas at 2:54 AM on March 4, 2009


GAAAAATES!!!!
/khanfilter
posted by nudar at 2:56 AM on March 4, 2009 [10 favorites]


Before Excel, there was 123, and it, too, was abused. Problem is, too many people learned the spreadsheet, and only the spreadsheet. It's like the old saying goes, "When your only tool is a hammer, every problem looks like a nail" (or something like that).

Way back in the mid 80's, I had worked on a project with some guys from Martin Marrietta. They were doing elaborate planning for upgrading the US air-traffic control system. The part I worked on involved calculating man-hours. (I was 'just a temp').

They had created a huge and elaborate Lotus spreadsheet. It required some input, then the most insane, time consuming, macro would chug away, to produce the desired printout. This wasn't the first time I'd seen spreadsheets abused into doing things other platforms would have been better suited, but it was the most egregious.

I explained the issue to them, and urged them to give me a call, the next time they needed something that elaborate, to see about doing it more efficiently. (yes, I was breaking the rules, trying to drum up some consulting business over a temp placement). I liked these guys, and was fascinated by their work. Their response was to tell my agency I was no good (fortunately, my agency knew better. The thing I liked best about temping was, I was hugely over qualified and was usually seen as one of their best people).
posted by Goofyy at 2:59 AM on March 4, 2009 [3 favorites]


So, you are saying that reducing the precision of calculated cells based upon display format is the way to go? Bad example, methinks, but point taken. /derail
posted by sfts2 at 3:02 AM on March 4, 2009 [1 favorite]


sfts2: from a typical user perspective, WYSIWYG still rules. reduce the decimals, see the values round up or down = assume that the underlying value has changed accordingly. not so.

try calculating the difference between date1 & date2, when the display in the cell expresses it in ddmmyyyy but the underlying program thinks of it in terms of some kind of universal astronomic time where it's down to the last second since the big bang, and you'll realise that excel is *completely fucking useless* for any kind of calculation that needs to meet a standard of precise scientific accuracy.
posted by UbuRoivas at 3:09 AM on March 4, 2009


Well, damn and blast. I've been trying to sequence viable DNA from this data for years. This certainly must explain why most of my clones are unviable, cancerous goo, and the rest only have 4 or less arms. It's supposed to be a twelve-armed hulk 30 feet tall with skin like a battleship and... I've said way too much. I'm trying to clone bunnies. Bunnies with quasi-human intelligence. Talking bunnies. For pets. Not to be shot out of a flaming bunny-launching gatling gun made out of bone and teeth. Not at all.
posted by loquacious at 3:10 AM on March 4, 2009 [38 favorites]


posted by PenDevil at 3:12 AM on March 4, 2009 [5 favorites]


aw, how cute!

my pizza was delivered by a little three-armed bunny with no teeth but an impeccable english accent.
posted by UbuRoivas at 3:17 AM on March 4, 2009 [6 favorites]


Brother Dysk: it is notable, however, that they do not base them around Excel databases

Certainly not for many core processes, where the co-ordination of functions and the business case for process improvement or synchronisation is obvious.

But databases evolve - any collection of data will eventually become, through size, complexity, frequency and variety of usage into something that requires a more suitable software package.

And a lot of SMEs, and functions within larger companies, lack time, budget, resources or expertise to make the step up from an excel sheet of a manageable shape and size to something better.

From my experience, users will carry on with an unwieldy excel sheet for far longer than is optimal because the burden of shifting to a proper database package is high.
posted by MuffinMan at 3:21 AM on March 4, 2009


Hello? The problem isn't that Excel is or is not an appropriate tool for bioinformatics. The problem is that Excel uses imprecise math. Which is, and always has been, disgusting. If a tool is ostensibly designed for precise and accurate work, but does not actually perform in that way, that is a fucked up tool.
posted by seanmpuckett at 3:43 AM on March 4, 2009 [12 favorites]


What tool would recommend instead of Excel for this sort of work?
posted by Brandon Blatcher at 3:48 AM on March 4, 2009


SPSS would have to be better, for a start.
posted by UbuRoivas at 3:59 AM on March 4, 2009 [1 favorite]


I do bioinformatics, and I'd say I spend a fair amount of time checking inputs and outputs in the pipelines I both develop and use. For example, I've been caught by pipelines that do interpolation on integer values, so that discrete event counts become continuous, decimal values. Stuff like that gets cycled through a lot of computation and the end result is, well, inaccurate. It happens all the time, and as the article alludes it can be tough to catch.

If Excel is part of a lab's automated pipeline for any critical analysis, God forbid, or even if it is not, that lab should be setting up the equivalent of "unit tests" to make sure the records that go into a pipeline step, and the ones that come out of that step, are expected.

Even then, with the sheer quantity of data that informaticists handle, like a game of Telephone, it's easy for things to go wrong. Microarrays, proteome arrays and high-throughput genomic sequencing handle huge amounts of data. I'm sure its the same issue for particle physicists.

Excel just happens to be an exceptionally dangerous tool to use, because its users keep buying upgrades from Microsoft that add new "features" and cause unexpected behaviors. The problem isn't that Excel is used as a database, it's that it can do stuff of its own volition in an attempt to "correct" your input, regardless of what you use it for.
posted by Blazecock Pileon at 4:11 AM on March 4, 2009 [10 favorites]


This is why I prefer a stripped-down spreadsheet like gnumeric for the rare occasion that I want a spreadsheet. It'll bloat up eventually, too.
posted by a robot made out of meat at 4:22 AM on March 4, 2009


With regards to the discussion above, I don't think it's fair to blame Excel for the issues revolving around floating point calculations. If the precision of the values you are dealing with is important then you need a customized solution built by software engineers who understand the issues involved and the business needs of your application.

More broadly though? Excel is for managers to make pretty graphs of data to present to higher level managers. Precision (in the fine grained sense) is not only not required, but possibly detrimental. (Before any Excel wizards flip out, yes I realize you can make this shitty piece of software do a lot more than that). It is absolutely, jaw-droppingly mind boggling to me that anyone would use Excel for managing data where the actual data and any results you get from applying algorithms to that data is important to anyone.

I think what the academic world needs at this point is a pairing off of sophisticated software engineers with pure research types who don't have the time to devote to the software aspect of it all. Basically what I'm trying to say here is that deciding Excel was the right tool for this job was fucking retarded.
posted by cj_ at 4:37 AM on March 4, 2009 [3 favorites]


I concur with the opinion that Fortune 500 companies likely do a lot of their "real" work in Excel. Also, according to my sources, biology isn't the only academic field that uses it. Apparently astronomers/astrodynamicists do the same. The really really advanced one might use matlab scripts. (The entity in charge of tracking satellites doesn't do this itself that I know of, but theoretical replacements are often modeled this way. That's part of the reason the existing system is so obsolete and possibly why we saw a collision recently. They are trying to develop modern ideas using non-modern tools.)

Very few of them really know how to program and even fewer hire an actual software engineer. The code that results, if you are lucky enough to even have any code, is atrocious and unmaintainable (I know, I've seen it).

That said, they are not entirely to blame. Software engineers are being trained, whether cathedrally or bazaarly, to build The One Final Solution To Our Well Defined Problem. That's great in the business world, where mortgages or control systems are a well understood problem with well defined parameters and so forth. It's not so great in a research setting where you don't really know what algorithm is going to be needed. You kind of have to balance speed of initial development with later maintainability.

The approach I take is to let my boss, the main algorithm guy, develop a new program the first time around. Once we are actually getting results, I then take his (atrocious and unmaintainable) output and clean it up with some abstractions, optimizations, new algorithms, clearer interfaces, etc. But my situation is somewhat unique and a lot of the programmers around me on other projects can't/won't/aren't doing this. It's a big problem.

We're supposedly in the information age but 90% or more of the information we have (which is a tiny fraction of what's out there) is being very poorly processed.
posted by DU at 4:42 AM on March 4, 2009 [5 favorites]


Dude, you always set the columns to "text" and not "general" when importing. I've had this issue in finance with CUSIP numbers.
posted by FuManchu at 4:43 AM on March 4, 2009 [7 favorites]


People want their software to think for them, and then they complain when it does. Those bleeding hearts in "human factors" have conditioned them to be helpless and ignorant.

This is the way the world works. Anything else is a lie told by programmers to their users.
posted by 0xdeadc0de at 4:44 AM on March 4, 2009 [2 favorites]


If the precision of the values you are dealing with is important then you need a customized solution built by software engineers who understand the issues involved and the business needs of your application.

Which costs a whole lot more money initially than just using the spreadsheet program that you already have installed on your desktop. 9 times out of 10, people working on fixed budgets are going to go for the solution that doesn't involve a massive software development project.
posted by octothorpe at 4:46 AM on March 4, 2009


The problem is that Excel uses imprecise math. Which is, and always has been, disgusting. If a tool is ostensibly designed for precise and accurate work, but does not actually perform in that way, that is a fucked up tool.

If you'd care to expand on this mathematical imprecision in Excel then I'd certainly be interested. I've not encountered it myself so I presume that it relates to a specific area of maths that I've not had to tackle in Excel.

Nonetheless, the issues cited in the posted article are to do with data conversion and not maths. More specifically, the issues are to do with awareness of implicit data conversion and I'm therefore tempted to label them as user errors - no one manipulating large datasets should be making assumptions about the transfer of that data between systems/applications.
posted by MUD at 4:58 AM on March 4, 2009


Strikes me that the problem is not so much that Excel itself is buggy, name me one software development language or tool that doesn't have bugs in it. Rather it's that things built in Excel tend not to be built by people who understand a good software delivery process.

If you follow a reasonable process, and be sure to test inputs and outputs and consider the edge cases and data quality and all that boring stuff that makes software development projects much more expensive, then it's possible to build stable, solid applications in Excel suitable for the task at hand. If you don't then you wind up with buggy software.

I remember reading something a few years ago (can't find the link now, my google-fu fails me) saying that the vast majority Excel spreadsheets pressed into use in businesses around the world had a high number of latent defects as they'd never been tested and no-one had really noticed the problem yet.

And when I build not so complex spreadsheets to inform me in making decisions, and those decisions are made almost entirely on the basis of the output from the spreadsheet, it makes me wonder how many bad business decisions, flawed research etc. get carried out every day because of these flaws.
posted by grahamspankee at 5:00 AM on March 4, 2009 [2 favorites]


Also, according to my sources, biology isn't the only academic field that uses it. Apparently astronomers/astrodynamicists do the same.

Really? When I wrote astronomical software (over 10 years ago) everything was FITS.
posted by vacapinta at 5:05 AM on March 4, 2009


A little googling turns up an organisation dedicated to helping people avoid precisely these kind of mistakes.

Lovely quote from one of their papers Stop that Subversive Spreadsheet (pdf):

"The use of spreadsheets in business is a little like Christmas for children. They are too excited to get on with the game to read or think about the 'rules' which are generally boring and not sexy"
posted by grahamspankee at 5:05 AM on March 4, 2009 [11 favorites]


Strikes me that the problem is not so much that Excel itself is buggy, name me one software development language or tool that doesn't have bugs in it. Rather it's that things built in Excel tend not to be built by people who understand a good software delivery process.

This.
I work for a strategy consulting firm and had to help my colleagues debug a cashflow forecasting tool that they'd built in excel. The problem? They were incrementing the month by doing '+30' to a date. The display options for the date were set to show only the month and year, so it looked fine, but the dates were all wrong and this caused a whole bunch of odd errors.
posted by atrazine at 5:09 AM on March 4, 2009 [2 favorites]


Uh-oh. AC/DC ate my gene sequence.
posted by pracowity at 5:09 AM on March 4, 2009 [1 favorite]


FITS is an image file format, not a programming environment. And I'm talking about orbital dynamics. That said, I don't have this knowledge firsthand, although it does consistent with firsthand knowledge.
posted by DU at 5:11 AM on March 4, 2009


in its favour, it's nowhere near as evil as ms project in the bizarro random, uncontrollable shit it foists upon its user.

Second that.
posted by ZenMasterThis at 5:13 AM on March 4, 2009


If you'd care to expand on this mathematical imprecision in Excel then I'd certainly be interested.

MSDN: Calculation Issue Update (via)

MSDN: Understanding Floating Point Precision, aka “Why does Excel Give Me Seemingly Wrong Answers?”
posted by Blazecock Pileon at 5:13 AM on March 4, 2009 [8 favorites]


Also, according to my sources, biology isn't the only academic field that uses it. Apparently astronomers/astrodynamicists do the same.

Really? When I wrote astronomical software (over 10 years ago) everything was FITS.


Yes. When I was doing astrophysics (two years ago) we used FITS data for just about everything, typically manipulated using either Matlab or custom C code.
I've never seen excel used in astro, but it's quite possibly that people could use it to quickly visualise something.
posted by atrazine at 5:13 AM on March 4, 2009


People keep ragging on Excel, calling it a poor choice for this particular domain. Perhaps, but it could be perfectly adequate. The spreadsheet is one of the most brilliant pieces of software design ever conceived. It is a poor database, but it models the vast majority of databases as most people experience them, as they have existed since we first started putting marks on clay tablets.

If there is no domain-specific software available, and you are unwilling to hire someone trained in programming, or acquire that training yourself, you have no business using a "real" database as you will certainly get hurt much worse than this.

The error in this case is that the user did not understand what an import would do, or that Excel did not adequately convey what it was doing. The munging of the date-like fields is not a bug. It is an unwise default action for import, but one the user could have changed.
posted by 0xdeadc0de at 5:18 AM on March 4, 2009 [7 favorites]


In my experience astronomers never use Excel except for trivial tasks. We roll our own buggy code.
posted by edd at 5:19 AM on March 4, 2009 [4 favorites]


Damn straight, you should be using Perl. CPAN Bio Modules.
posted by zengargoyle at 5:20 AM on March 4, 2009 [1 favorite]


I guess I should inform my source that he's wrong about astronomers. Although all the talk about FITS leaves me with the distinct impression we are talking about totally different things.
posted by DU at 5:25 AM on March 4, 2009


i don't get it - if it's good enough for the mortgage and banking industries, it ought to be good enough for them
posted by pyramid termite at 5:29 AM on March 4, 2009 [1 favorite]


My whole college career was spent preparing to be a software toolsmith for scientists. I had dreams of working for JPL or Fermilab.

Instead I make e-commerce websites and use my intellectual prowess to "make that button more blue."

*weeps*
posted by Mur at 5:31 AM on March 4, 2009 [19 favorites]


The existence of this bug is awesome. It's a testimony to how great a tool Excel and spreadsheets in general are that it can even occur.
posted by Nelson at 5:37 AM on March 4, 2009


Yikes. Glad I don't use Excel for anything ore complicated than just displaying and organizing my data, for manipulations / analysis I use an actual statistics package.

Still blows my mind that none of the Office programs have a keyboard shortcut for "paste without formatting". OpenOffice does, why doesn't MS pick up on this? Even better, the "paste transpose" keyboard shortcut from my favorite stats package would also be incredibly useful in Excel. But no, I'm stuck with right-click, paste special, select options... every damn time... *sigh*
posted by caution live frogs at 5:41 AM on March 4, 2009 [1 favorite]


A spreadsheet is a simple, easy to understand interface for storing data. Rows and columns, all laid out right where you can see them.

The problem is that Microsoft refuses to acknowledge they build a flat-file database front-end, and insist on treating it as a numerical analysis tool. It's a really good numerical analysis tool, mind you, but that's not how most people use it.

So, you can either insist the Mountain come to Mohammed (Use Access or Filemaker Pro, or hire a programmer for a custom application) or you can face reality, and structure your product according to how your customers actually use it.

In this case, build it with the assumption that people are going to use it as a quick-and-dirty database, and apply the numerical/financial analysis stuff on an as-needed basis only, axe auto-formatting unless expressly invoked by the user, and roll in unlimited levels of undo. We have terrabyte-sized drives now, and nobody's drive ever got filled up with Excel files.

This being Microsoft, I'm not holding my breath.
posted by Slap*Happy at 5:45 AM on March 4, 2009 [3 favorites]


Oh wait. There was that one time I did a 17,000+ cell manual bootstrap procedure in Excel... but that was necessary, as my stats package can't do a bootstrap, and it was few enough permutations that I could set it up by hand and double-check it. Other than that, the most complicated thing I regularly do in Excel is average things and occasionally concatenate stuff.

On the bright side, because I was using conditional statements to eliminate zero-value cells in my bootstrap, I uncovered a bug in Excel 08 that only crops up when conditional statements depend on earlier conditional statements, which Microsoft was happy to hear about and has since fixed. So, if it is important to you, you can thank me for that, I suppose.

posted by caution live frogs at 5:46 AM on March 4, 2009


Excel is probably the single software package that sees the most abuse, I reckon

once upon a time i knew an office manager that used excel to visually create a new office floor plan prior to a move; in other words, like graph paper, blacking out some cells and leaving others blank.

sometimes the mark of good software is how many ways it can be usefully used "incorrectly."
posted by 0x029a at 5:50 AM on March 4, 2009 [5 favorites]


It just occurred to me this bug is probably locale dependent. Thanks to Excel's date identification heuristics, American researchers can't study the gene OCT-1. But I'm guessing German researchers are fine with that, just have a hard time studying OKT-1. Chinese researchers probably get off scott-free.
posted by Nelson at 6:04 AM on March 4, 2009



in its favour, it's nowhere near as evil as ms project in the bizarro random, uncontrollable shit it foists upon its user.


Amen to that..if I ever find the fucker that built that application (or fuckers more accurately) I will curse them like a drunken sailor and may even throw a shoe in their general direction. Assholes.
posted by spicynuts at 6:07 AM on March 4, 2009 [2 favorites]


ms. lester is doing some coding for her brother's website. he sent her pictures ... in an excel file. file size was more then 1/2 gb, and took 20 minutes to load.

as a temp, i've seen a lot of abuses. but i've also seen people's eyes glaze over when i say the word 'database.'
posted by lester's sock puppet at 6:07 AM on March 4, 2009 [1 favorite]


ABSTRACT
We are idiots who don't know our tools.
posted by fleacircus at 6:14 AM on March 4, 2009 [8 favorites]


Yikes. Glad I don't use Excel for anything ore complicated than just displaying and organizing my data, for manipulations / analysis I use an actual statistics package.

This might well have caught you too, even just using Excel for data entry and storage.

Even better, the "paste transpose" keyboard shortcut from my favorite stats package would also be incredibly useful in Excel. But no

alt-e s e.
posted by ROU_Xenophobe at 6:15 AM on March 4, 2009


ROU - I'm just putting in numbers, whether whole-number behavioral activity counts or tissue weights with known numbers of decimal places, usually, so it shouldn't be screwed up. The rare times I have to enter timestamped data I'm very careful to check that only timestamps are actually converted.

alt-e s e: Does this work in Mac Office? And if so, why the hell isn't it documented anywhere? That would be a major timesaver.
posted by caution live frogs at 6:21 AM on March 4, 2009


The server is temporarily unable to service your request. Please try again. If you continue to receive this error, please contact info@ncbi.nlm.nih.gov.
posted by Marisa Stole the Precious Thing at 6:22 AM on March 4, 2009


But no, I'm stuck with right-click, paste special, select options... every damn time... *sigh*

Amateur....

ALT+E+S+V for values
ALT+E+S+F for formulas
ALT+E+S+E for transpose
And so on....
posted by loquax at 6:25 AM on March 4, 2009 [10 favorites]


Bioinformatician-in-training... the problem as I've seen it isn't the use of Excel by those doing the hands-on work with the data, it's the use of Excel by people (often traditional wet-bench biologists, but not always) using Excel as a viewer. I get files that someone has taken a great deal of time, painstakingly compiling line by line, reflecting an entire experimental series. Why they chose to do it this way is a topic beyond the scope of this post, fortunately. They then ask me to "do magic, find the pattern, retrieve all data related to X". At this point, saying "Excel is ebil" to them doesn't really help, and it doesn't mean I don't get to spend a certain amount of time looking for unexpected Excel traps.

On the other end of the workflow, I do a fair amount of microarray analysis using the R statistical package. The biologists on the end of that analysis don't want to deal with data matrices and the like, and it just isn't realistic to ask them to install R, BioPerl, BioWhatever to view or manage the output data. They want something very much like an Excel spreadsheet, something where they can look at an array of data, genes in rows, statistical values in columns. Then they re-arrange things, add rows of data from other experiments, or what have you. Blaming Excel, while convenient and maybe a teeny bit gratifying, isn't the problem here. The problem is that we have very professional people with large lacunae in their computer literacy, and while a better Excel-alternative might ameliorate that, it won't totally solve the problem.
posted by vesper at 6:29 AM on March 4, 2009 [6 favorites]


once upon a time i knew an office manager that used excel to visually create a new office floor plan prior to a move; in other words, like graph paper, blacking out some cells and leaving others blank.

That's nothing, I've seen ads built in Excel and intended for magazine and/or newspaper output. They still haunt my nightmares.
posted by Brandon Blatcher at 6:33 AM on March 4, 2009 [6 favorites]


Link is down. Google cache.
posted by CheeseDigestsAll at 6:39 AM on March 4, 2009


The problem is that we have very professional people with large lacunae in their computer literacy, and while a better Excel-alternative might ameliorate that, it won't totally solve the problem.

This is fundamentally the issue. I see it too in chemical analysis (and am guilty of it myself). Nobody has really cracked the problem yet of how to input, present and manage large volumes of data in a way that makes sense to visual and textual thinkers. The spreadsheet metaphor is the best currently available.

It's great that Matlab, R and BioPerl, etc.. are out there, but these tools require a lot of effort to learn, which most bench scientists can't justify sparing from their day to day work. It seems like way to much work for way too little result, when they can "just punch the whole thing into Excel".

The solution, I think, is to start integrating the informatics tools into the lower-year stats requirements that most undergraduates have. Make them all learn R, for example. Of course, R will need a front end that doesn't come from 1971, but people seem to be working on that. Informatics needs to be seen as integral to science education, not just something that students are expected to learn by osmosis, as my generation was expected to.
posted by bonehead at 6:46 AM on March 4, 2009 [3 favorites]


Slap*Happy: The problem is that Microsoft refuses to acknowledge they build a flat-file database front-end, and insist on treating it as a numerical analysis tool. It's a really good numerical analysis tool, mind you, but that's not how most people use it.

This wouldn't even be so goddamned stupid if MS hadn't quashed excellent competitors who actually understood this perfectly. Like, to give a good example, the excellent and still-useful Paradox.
posted by koeselitz at 6:46 AM on March 4, 2009


Save your raw data somewhere, read only, and DON'T SCREW WITH IT! (In the world of drugs we are required to do this, but we can only black box test what's going on between our samples and our network shares so if some instrument vendor decides to do us a favor...)

In related news, just yesterday my group was talking about buying a cell counter in lieu of having humans sit there looking through a microscope and pressing the button on a little clicker. I said, "we have a microscope with an automated x-y stage and a camera that links straight to a computer. Why not get some open source image recognition software, DIY, and save $100,000 or so.

I was looked at like I'd asked for a weasel and quietly told that it was virtually impossible to prove the the FDA's satisfaction that such a home brew was a qualified instrument. I didn't ask what made a magic black box so magical, just ground my teeth.
posted by Kid Charlemagne at 6:51 AM on March 4, 2009 [8 favorites]


From the abstract:

A little detective work traced the problem to default date format conversions and floating-point format conversions in the very useful Excel program package.

I fucking hate that feature.
posted by KokuRyu at 6:52 AM on March 4, 2009


once upon a time i knew an office manager that used excel to visually create a new office floor plan prior to a move; in other words, like graph paper, blacking out some cells and leaving others blank.

sometimes the mark of good software is how many ways it can be usefully used "incorrectly."


I did this once: I had to explain the layout of a room once to six people scattered across a continent on a conference call. With a mix of software packages, I had no guarantee there was any image that everyone could open, but everyone had Excel. In three minutes I put together the floorplan and sent it off to everyone, and our conversation continued unimpeded.
posted by ricochet biscuit at 6:54 AM on March 4, 2009


In my experience astronomers never use Excel except for trivial tasks. We roll our own buggy code.

In Fortran 77 (implicitly-typed).
posted by dirigibleman at 6:57 AM on March 4, 2009 [6 favorites]


Nobody has really cracked the problem yet of how to input, present and manage large volumes of data in a way that makes sense to visual and textual thinkers.

You're very, very right about that. Spreadsheets are an amazingly flexible and intuitive tool. There's nothing really wrong with using them to manipulate this kind of gene data, either, you just have to be very careful that Excel understands your datatypes.

I'm scratching my head thinking of recent software that's advanced informatics. I think Mathematica is probably the best example; it has hard core analytic features and also a lot of attention paid to visualization. But it's aimed at algebraic analysis, not statistical manipulation.

I'm one of those nerds who can stare at a SQL schema or a bunch of Matlab code (or when pressed, some R) and understand what it means. But that's because I can see The Matrix. We need tools for ordinary people who have better things to do with their intelligence than learn low level representations of numbers.
posted by Nelson at 7:10 AM on March 4, 2009 [3 favorites]


The problem is that we have very professional people with large lacunae in their computer literacy

Heh, I'm seeing an MRI of my collaborators and pointing out "here is where the relational-database-center should be located. As you can see, it has been replaced by a cyst."
posted by a robot made out of meat at 7:31 AM on March 4, 2009 [9 favorites]


"I can't think of a single business I've worked in that didn't base their day to day operations round excel spreadsheets."

Not the business which employs me (ISP), and not any of the clients that I've done work for (IT/network services). We run our business on Red Hat servers using in-house billing and database systems.
posted by krinklyfig at 7:34 AM on March 4, 2009


I'm far more concerned about people who use the error-ridden numerical routines in Excel, which fit well into the purpose for which it was ostensibly written.
posted by grouse at 7:35 AM on March 4, 2009


One of the biggest problems I see is the decoupling of data from calculation. The data is stored here, the programmatic part is there and you can never see the two at the same time. Excel (and lims and other kinds of DBs) emphasize data and tuck the formulas out of view, R and Matlab are all about the formulas and hide the data. It's this kind of two-headed view of data separate and distinct from calculation that caused the original error in the post. I've done it myself.

(Fortunately, I always follow Kid Charlemagne's advice, but he doesn't go far enough. Store TWO copies of your data IN DIFFERENT PLACES and maybe, just maybe, bitrot won't take you).
posted by bonehead at 7:36 AM on March 4, 2009


"The problem is that Microsoft refuses to acknowledge they build a flat-file database front-end, and insist on treating it as a numerical analysis tool. It's a really good numerical analysis tool, mind you, but that's not how most people use it."

To MS' credit, they did not invent the spreadsheet, and it was never intended to replace databases. They produce Access for people who want a database with that sort of visual front-end. Not that I'm endorsing this stuff, but, for instance, a lot of software applications will use Access as a quick-and-dirty substitute for a full-fledged SQL database, particularly if there's not a huge amount of data involved, such as an install I just did of CounterSpy Enterprise. It defaults to Access, but you can use SQL if you have a need to.
posted by krinklyfig at 7:39 AM on March 4, 2009


had i only known that this sort of thing was publishable, about a decade ago i would have been able to start a journal dedicated to the cause of reporting solely my own informatics bugs. published weekly!
posted by the painkiller at 7:44 AM on March 4, 2009 [3 favorites]


It's a really good numerical analysis tool

It's great, except when it gives you the wrong answer.
posted by grouse at 7:50 AM on March 4, 2009 [1 favorite]


>People want their software to think for them, and then they complain when it does. Those bleeding hearts in "human factors" have conditioned them to be helpless and ignorant.

People were helpless and ignorant long before us bleeding-heart human factors guys were around. The real problem here was a lack of critical thinking on the part of the scientists, who one would think should have known better. Why would Excel, a tool deliberately designed for list-making and not for complex financial modeling, be the right thing to use when data integrity and calculation precision are your highest priorities? And why the hell did ANYONE think it would be a good idea to use this tool on the original dataset? I wouldn't want these guys running a junior high school science lab, never mind a drug lab at a cancer research center.

This is the way the world works. Anything else is a lie told by programmers to their users.

That might be how the world works, but (as far as non-niche apps go) automatically generating a user interface without any input from UI/UX guys is a good recipe for an ugly, confusing mess that will scare away large portions of your prospective userbase.
posted by xbonesgt at 7:52 AM on March 4, 2009


MS didn't invent the spreadsheet, I might even suggest that Visicalc was the first for the PC/Apple II, but I also remember using one on a Vax.
posted by sfts2 at 7:53 AM on March 4, 2009


ooops misread krinklyfigs post pls hope me...
posted by sfts2 at 7:56 AM on March 4, 2009


Hallelujah. I can tell you exactly why scientists don't have better tools. I used to be one, at least at the postdoc level, and decided to go into the software business for myself when I found myself spending more time, and having more fun, designing a database to keep track of my complex crosses between various transgenic and knockout mouse lines. I've had some success selling that application, but extending that to more general purpose utilities to allow people at the bench to, for example, keep track of the samples they produce and store, and the protocols they use to carry out experiments, has been endlessly frustrating.

One problem is that science and business don't mix well. Given the size of the market, the mouse genetics program I make would probably sell for many thousands or even tens of thousands of dollars if it were targeted toward a comparable business market. However, scientists don't feel like they should have to pay that much for software. Even though everything you buy to equip and run a lab is insanely expensive, you can buy Word and Excel for a few hundred dollars, so these borrowed business tools set the bar for the price of any software. An exception is the software the runs expensive specialized equipment like cell sorters and the like. That software is very expensive, buts it's psychologically folded into the enormous price of the equipment, so it's more acceptable. And anyway, they can't run the equipment without the software, so they don't have a choice.

If I were dealing with a business user, I could make a convincing argument that investing X dollars in my software will save them X*Y dollars, allowing them to justify the purchase. But scientists don't do what they do to make money, so they don't think in dollars and cents terms. They know what they think a reasonable price is for something, and they have no other way of approaching it. So even if they're paying hundreds of thousands of dollars a year to house their mouse colony, that's just the way it is, and paying $2000 to save $10,000 or $30,000 is just, like...whatever.

So it's a difficult market. It's also foreign to businesspeople, and that's who you need to talk to to raise money. I've made several unsuccessful attempts to attract investors, and while a large part of my lack of success undoubtedly stems from my own limitations, it has also become extremely clear that science is foreign to business types: they don't really understand an enterprise that is carried out on a non-monetary basis and that's enough to make them pass and look elsewhere for a place to put their money.

The third major reason this is a very difficult market is that scientists are very resistant to changing the basic way they do things. There is a very good reason for this. Most scientists are working at the very limit of what it is possible to do technically. That means that a technique, say whole-mount in situ hybridization, that is working for you today might stop working for you tomorrow, for a reason that might take you several months to figure out. Thus, the scientist or lab who declares one day "We're doing so well with mice, I think we'll switch to Drosophila!" would be committing professional suicide. This leads to a mentality of "well, even though the way we keep track of our samples, in which the majority get lost over time as people leave, we can't read their handwriting on the tube, and can't find the experiment in their notebook where they describe making it, is probably not optimal, it's what we do, and its good enough." Another related source of resistance is that labs tend to be very egalitarian, so any change that's resisted by any portion of the lab staff tends to go away and never happen.

But I'm committed to putting this software out. Why? Because, as magikker started off this thread, the state of scientific software is dismal. Scientists have none of the specialized software tools to help them carry out their day-to-day work in the lab the businesspeople take for granted. And, oh, the money there is to be made! Science is a multi-billion dollar industry, in which everything is expensive. Crack that nut and you'll not only make yourself rich, but you'll be giving a gift to the world. Rationalizing scientific data record keeping, wrenching it out of the handwritten notebook in a form where it can be stored, searched, easily retrieved and shared, will transform the enterprise.

I'll stop now.
posted by Turtles all the way down at 7:57 AM on March 4, 2009 [39 favorites]


I think an argument could be made that the desire for interfacially-simplistic but internally complex and flexible tools is part of what got us into this mess in the first place, and that using these tools -- even ones better decigne than excel -- will be fraught with hazard if it's not accompanied by a cultural shift toward better interdisciplinary collaboration.

The fact is that if you have a simple-looking tool and don't understand what's happening under the hood, you won't be equipped to notice when it's returning nonsense, and you'll be liable to misuse it when you need something that it's not quite designed to do. Biologists shouldn't have to be expert in R/matlab/&c in order to do their analysis; their are people with that expertise (and who, in turn, lack the biological knowledge to generate the data or interpret the results) who can collaborate with them. Yet, too often I hear biologists poo-pooing statisticians and computer scientists -- the very ones who detected this particular problem -- as being excessively wedded to rigor (and the mathy types poo-pooing the biologists for demanding interpretability). The complexity of modern biomedical research necessitates a team with an understanding that is so broad and deep that a single person's schooling will not adequately cover it, and yet many researchers are still holding on to the old idea that they should be able to do their work start to finish (even if they outsource parts of it to techs and grad students).

[I'm a physicist by training currently doing a genomics postdoc that would best be described as computational systems biology, and I bump my head on this every. damn. day.]
posted by Westringia F. at 8:01 AM on March 4, 2009 [3 favorites]


s/decigne/designed/ (preview, wot?)
posted by Westringia F. at 8:04 AM on March 4, 2009


0x029a: ... sometimes the mark of good software is how many ways it can be usefully used "incorrectly."

And sometimes the mark of bad software is that its core usefulness hasn't expanded in eighteen long fucking years, while in many ways it actually becomes less useful as more and more crufty "features" which don't actually serve any purpose.

I started using Excel in 1995. I was using it professionally in 1997. I've been using Excel 2007 for a while. You know what? I can't remember a time in 1997 when I said to myself, "gee, I keep seeing spreadsheets with grammatical errors and spelling errors - I'd be willing to give my left nut if that ever happened again." The vast and ridiculous extents to which Microsoft has tried to implement automated spelling and grammar correction make me gag. And yet, you know what I was thinking in 1997? I was thinking, "you know what? This would be a really good front-end for some kind of database - wouldn't it be sweet if the worked in some relational databasing features, maybe a more developed and standardized query function and a few other tricks." I think Bill Gates might have muttered this to a retarded assistant at some point, because when I look in my Excel 2003 book here's a whole section on the so-called "DATABASE FUNCTIONS" - only they're not really database functions at all, at least not proper ones, and were probably invented by somebody who either didn't know the difference or didn't want to confuse the "common user." But Microsoft is so large and so unwilling to accept risk in one segment for the benefit of another that Excel has been hamstrung from the beginning.

I mean, every competent database professional has a little litany in his head of all the ways that Excel could have been made better; but we know in our hearts why none of those things could ever be brought to pass. We'd love it if there was some interoperability within Excel with something like SQL - not that SQL is the best database paradigm, but it is dominant. But we know that if that's ever even considered in Redmond, a manager reminds everybody that they'd be threatening Microsoft Access' market segment. So, fine: why not put some of Microsoft Access (not a terrible product, by Microsoft standards) into Excel so that Excel has at least a little bit of relational database functionality? No, they're two separate products, and you don't merge them if it will mean less market segment for both. And coming up with something that actually does the things that people who really manage data need it to? Improving functionality so that there's an actual power behind the software? Not happening. (I'd even like it if they'd do some stupid little improvements in function processing. Have you ever tried to do a VLOOKUP query on a key that was to the right, rather than to the left, of the target? Or tried to do a COUNTIF function with more than one parameter? Ugh.) I would hate to be an Excel developer; I imagine that means you're doomed by the ubiquity of the software to be stuck doing endless graphic redesigns and barred from any significant revision of the actual product. "What - you want to add a range of functions? You want to reenvision the back-end so that it's more integrated and offers portals to something more? Sorry, buddy - this is one of the most popular computer programs ever, and you're not going to fuck with something that's working."

I'm confident that actual improvement of Excel would be popular in the long-term. I'm confident that many, many people like me would rejoice if Microsoft took initiative to fix these things about Excel and move it forward; and, though that rejoicing would be relatively silent, gradually, over time, all those people who are now misusing spreadsheets might gradually come either to use spreadsheets less as a sheet of numbers and more as a database, or at least to understand that they weren't intended for what they're used as. I'm confident that this would actually improve the world, given how huge Excel is; if the many, many people who use numbers badly every day via Excel slowly started using numbers correctly, the world would become a better place in ways that would be large enough for us all to notice. Orthogonality's post here details an obvious example; the amount of scientific error that misuse of Excel has introduced is probably astronomical, and removing that error would be a fantastic benefit to science and to humanity in general. The ubiquity of Excel shouldn't be forgotten: it's really hard to overestimate its impact.

Of course, I'm also confident that Excel will never change until it dies.

Those who use Excel for everyday things, wondering why those of us who like things like databases are always whinging about Excel and saying that it sucks, wondering why we can't be happy with something simple that works the same every time, would do well to think about this one, simple, solitary fact: Excel has not changed significantly in almost thirty years. Seriously: sure, VisiCalc is crap compared to Excel, but by 1985, Excel was out, and with most of the functions we hold so dear. This has been around for twenty-five years - with no improvements and a whole host of newly-introduced problems? When, in the history of computers - hell, not just computers, the history of technology in general - when has one program not improved but rather degraded over thirty years and still remained popular?
posted by koeselitz at 8:06 AM on March 4, 2009 [1 favorite]


"One problem is that science and business don't mix well. Given the size of the market, the mouse genetics program I make would probably sell for many thousands or even tens of thousands of dollars if it were targeted toward a comparable business market. However, scientists don't feel like they should have to pay that much for software. Even though everything you buy to equip and run a lab is insanely expensive, you can buy Word and Excel for a few hundred dollars, so these borrowed business tools set the bar for the price of any software. An exception is the software the runs expensive specialized equipment like cell sorters and the like. That software is very expensive, buts it's psychologically folded into the enormous price of the equipment, so it's more acceptable. And anyway, they can't run the equipment without the software, so they don't have a choice."

It sounds like you need to convince the scientific community that virtual tools like software are still tools. Not that I envy anyone who would attempt to do so. I have a hard enough time convincing my business clients that some software tools are worth paying for, if it's the right tool for the job. At least a cost-benefit analysis will convince the suits, most of the time anyway.
posted by krinklyfig at 8:06 AM on March 4, 2009


"When, in the history of computers - hell, not just computers, the history of technology in general - when has one program not improved but rather degraded over thirty years and still remained popular?"

There is no need to "improve" Excel. Well, except perhaps the accuracy of floating point calculations. But it is what it is: a spreadsheet application. Anything they do to it now above and beyond that is bloating and unnecessary. Trying to bend it to the use of someone who doesn't really need a spreadsheet is not going to improve Excel for those who do need a spreadsheet. That said, if I find the need I usually stick to the spreadsheet in OpenOffice, except when I have to open a complex .xls file with graphs and such.
posted by krinklyfig at 8:10 AM on March 4, 2009


I'm going to send this link to my proteomics lab. Not that they don't already know -- everyone knows they shouldn't be using Excel, and they have a sense of humor about the fact that they do.

My feeling is that there's more and more respect for software development in the scientific community these days, but the day of budgeting time for tasks that sound like pure software development into grants is still far away.
posted by gurple at 8:15 AM on March 4, 2009


>

I was just talking about this last night, funny that Metafilter should bring it up this morning.

The most outrageous abuse of Excel I can recall was at the Fairfax County regional science fair, "computer science" category. I had somehow gotten first place at my high school, and went to regionals. Basically you get to stand in a super noisy gym for 8 hours surrounded by the most socially awkward boys ever, repeating your 30sec spiel ad-nauseam.

Some guy decided he could calculate the probability that there was life on mars with excel. Aside from being an absurd, un-testable hypothesis, (that the judges weren't buying), I believe he had a slight creationist agenda as well. He loved showing off that some of his cells contained over a hundred lines of code. Every single judge asked him, "Why excell?", and I don't think he had an answer.
posted by fontophilic at 8:17 AM on March 4, 2009 [3 favorites]


The third major reason this is a very difficult market is that scientists are very resistant to changing the basic way they do things.

A very expensive Java + Oracle-based LIMS system — a record-keeping system — was built from scratch to implement the equivalent of a digital notebook for a lab I worked in. But the front-end was inflexible, the developers had moved on to other contracts, and most of the scientists went back to more flexible and fragile tools.

Digital record keeping is a hard problem. Scientists keep changing their needs because the science changes so fast, and the questions they ask change. Most software development efforts can't really keep up.

What it seems like, from my perspective, is that you have a handful of informaticists who have a toolbox of software, programming skills, and some biology knowledge, who can glue together more complex analytical tools on demand. If the results can be published or monetized, then some effort will be put into place for refinement.

But, by and large, abstracting a lab's processes is not a trivial enough task that you can buy some product off the shelf and start using it. Excel only works for this problem, because non-programmers basically bludgeon a solution out of it. If a process is repeatable or abstract enough that its details can be re-implemented in software form, only then does it seem possible to make software to support it, but that's no guarantee it will be made or even used.
posted by Blazecock Pileon at 8:18 AM on March 4, 2009 [1 favorite]


Basically you get to stand in a super noisy gym for 8 hours surrounded by the most socially awkward boys ever, repeating your 30sec spiel ad-nauseam.

That sounds like the poster session at all of the conferences I have ever been to (except that there are now thankfully socially awkward girls as well, and sometimes there is wine).
posted by grouse at 8:20 AM on March 4, 2009 [1 favorite]


Used to work in bioinformatics, and later at SPSS.

The hard truth is software any more suited for the scientific task at hand is simply too work intensive to develop, and too specific to be useful anywhere else. Hence why tools and packages attempting to address the problem either a) suck or b) are so expensive that no one buys them.

Enter excel. HUGE amounts of science (heck, work in general - business especially) is going to use just barely good enough tools. Screw scale. Screw accuracy. Just get it done.

I wish I was still writing scientific software - but it doesn't pay cause people won't pay for it. Maybe after I win the lottery. There was a brief shining moment early in this decade when people thought bioinformatics might be the next internet, and money flowed accordingly - but the return (and there will be a return) didn't come fast enough.

Full disclosure: I've actually uninstalled excel from scientists' computers, with words to the effect of "Stop it. No more. Not for you."
posted by NoRelationToLea at 8:27 AM on March 4, 2009 [5 favorites]


Oh, and designing a database schema that accommodates patterns of genetic inheritance is a very high-order task. If I had a nickel for every potential customer who told me "Oh, we had a summer student a couple of years ago who designed an Excel spreadsheet and it works great!", well, I, well I don't know what I'd do, but I'd have a lot of nickels, let me tell you.
posted by Turtles all the way down at 8:27 AM on March 4, 2009 [1 favorite]


Excel is kind of thrilling. Like riding atop a crowded bus on a vertiginous, narrow mountain road with no guardrail, a driver who takes a puff from his opium pipe every 10 kilometers, chickens and goats in the back, mothers giving birth in the front, wheels missing lugnuts, the whole works. Just enjoy the crisp mountain air and scenery and expect to get to your destination and don't worry that the next turn very well may result in a death plunge.
posted by Burhanistan at 8:32 AM on March 4, 2009 [5 favorites]


MSDN: Understanding Floating Point Precision, aka “Why does Excel Give Me Seemingly Wrong Answers?”

According to those documents, the problem comes from MS following IEEE standards, not because they randomly wrote the software to be inaccurate.
posted by jmd82 at 8:40 AM on March 4, 2009


Though this particular problem was about formatting, another thing people should always remember is that computer math is not the same as arithmetic.

For example on most systems, (x / y) * y is NOT guaranteed to be equal to (x * y) / y. In algebra, both examples equal x. In computer math, the former will create a floating point fraction that may not resolve when multiplied.

Additionally, because most floating point hardware uses exponential representation, it appears that very large numbers can be represented, e.g., 10^50. However, it is not actually holding 50 digits of precision and so if you set x = 10^50, you may find that x + 1 is equal to x.

There is a whole field of numerical analysis that deals with properly structuring computational algorithms to deal with these problems, but only a few CS majors ever get exposed to it.
posted by CheeseDigestsAll at 8:41 AM on March 4, 2009 [3 favorites]


fontphilic: He loved showing off that some of his cells contained over a hundred lines of code.

Awesome. Just awesome.
posted by koeselitz at 8:45 AM on March 4, 2009


According to those documents, the problem comes from MS following IEEE standards, not because they randomly wrote the software to be inaccurate.

Yes, Microsoft would say that, wouldn't they? Of course, if you read these documents, you will see that sometimes the problem is because Microsoft just programmed poor algorithms or did so inaccurately.
posted by grouse at 8:45 AM on March 4, 2009


Also, according to my sources, biology isn't the only academic field that uses it. Apparently astronomers/astrodynamicists do the same.

I work in a cognitive neuroscience lab, and we use Excel to design presentation materials and analyze behavioral data. I use it every single day, and don't know what sort of alternatives are out there. I'd love to hear about some.
posted by solipsophistocracy at 8:57 AM on March 4, 2009


krinklyfig: There is no need to "improve" Excel. Well, except perhaps the accuracy of floating point calculations.

Yes, there is. Like I said, have you ever tried to run a VLOOKUP query keyed to a range to the right rather than the left of the target? That's just one tiny example of a function that works a little bit if at all.

I agree with you on this, though: they should take out all those ridiculously crufty functions and either replace them with functions that are streamlined or leave them out entirely. Most of the process of "improving" Excel would consist in deleting useless features.
posted by koeselitz at 8:57 AM on March 4, 2009


When, in the history of computers - hell, not just computers, the history of technology in general - when has one program not improved but rather degraded over thirty years and still remained popular?

Emacs.

(g, r, d, :wq!)
posted by eriko at 9:01 AM on March 4, 2009 [2 favorites]


According to those documents, the problem comes from MS following IEEE standards, not because they randomly wrote the software to be inaccurate.

Before they were following IEEE standards, whose standards were they following? I'm not clear what point this is making, other than that Microsoft recently engineered one known inaccuracy into its product and some users were left to scratch their heads at the output?
posted by Blazecock Pileon at 9:03 AM on March 4, 2009


This conversation would be more interesting if folks assumed the team writing Excel at Microsoft were not idiots. Because honestly, they're not.

Koeselitz, your rant is awesome, but you overlook a couple of things. Excel has changed a lot in 30 years, not the least of which is the integration of VB and VB.NET. And the precise reason why Excel is not a database, why it's not just like Access, is because it's a spreadsheet, not a database. Furthermore you're perfectly welcome to build an application integrating Excel and Access, thanks in no small part to the wonders of VB.NET. Now we can argue about whether all this stuff is easy to use, or works very well, but it's unfair to suggest Microsoft hasn't tried.

the amount of scientific error that misuse of Excel has introduced is probably astronomical

I'm not sure how to interpret the hyperbole, but I think it's fair to say the amount of science that's been accomplished with Excel is super-duper astronomical.

PS: the biggest use/abuse of Excel I saw while at Google was the way some customers would manage entire ad campaigns as spreadsheets. One row per targeted keyword, or worse one row per keyword/url pair. And whole sets of software tools built around manipulating spreadsheets, data passed around as CSV (with all the quoting nightmares), everything. And it all fell apart as soon as an advertiser's campaign got more than 65,535 keywords. Because, yes, row indices were limited to 16 bit numbers. That got improved in Excel 2007 to 20 bits, 1 million rows. But it's still a big giant grid, not an open ended datastore.
posted by Nelson at 9:09 AM on March 4, 2009 [2 favorites]


But it's still ok to use Excel as a gaming system, yes?
posted by mazola at 9:14 AM on March 4, 2009 [1 favorite]


Wow, I've know about this for >10 years. I had no idea I could get a publication out of it. Next time...
posted by batou_ at 9:21 AM on March 4, 2009 [1 favorite]




"Yes, there is. Like I said, have you ever tried to run a VLOOKUP query keyed to a range to the right rather than the left of the target? That's just one tiny example of a function that works a little bit if at all."

What I'm saying is, there is no need to make Excel into something it's not. Make the spreadsheet better and more accurate, sure, but don't expect it to act like a database, and don't expect it to shine in dealing with complex scientific applications.
posted by krinklyfig at 9:34 AM on March 4, 2009 [1 favorite]


Oh, my god. That's shameful, and horrifying, and the stuff of my nightmares. My work uses Excel for a lot of things (though we always keep a copy of whatever data we're using in its native format), and I'm pretty sure it's only a matter of time before something goes horribly wrong.
posted by OverlappingElvis at 9:40 AM on March 4, 2009


Funny thing is that historians - who actually have to work with dates - HATE the date function in spreadsheets (and it's not just Excel - OpenOffice is just as bad). We have to make up workarounds to keep data from being converted to the date format.

I use a text marker to make sure that all my dates stay as text or, if I'm doing serious date stuff rather than just tagging an entry, I use a separate fields for year, month and day. Because Excel really can't handle dates like "1648/49-01-30" or "1672-07-??".

But if you use separate fields, or ISO date styles and keep them as text, it's very easy to sort by dates - even with messed up Old Style years,
posted by jb at 9:47 AM on March 4, 2009 [1 favorite]


The best ever use of Excel isMortgage Pig.
posted by vespabelle at 9:59 AM on March 4, 2009


Still blows my mind that none of the Office programs have a keyboard shortcut for "paste without formatting"

Besides the awkward multiple keystroke menu access it is dead easy to assign your own short cut to any menu item in Office. I use CTRL+G for paste unformatted. Can't remember what CTRL+G does but I never use it, one is free to select another combination or even a chord.
posted by Mitheral at 10:00 AM on March 4, 2009


I'm confused, how is ther original gene name not recoverable?
posted by fistynuts at 10:03 AM on March 4, 2009


I think part of the problem with getting better software tools, and tools in general, in biology is that there is not a lot of support or respect for technique development. In academic biology, the result is king, and if you can get the big scientific result, it doesn't really matter how you got there. I tell biologists all the time, "well, you could get better data if you did x, y, and z" and more often than not they're uninterested, because what they're doing is working well enough. It's related to what Turtles all the way down said in his excellent comment - so much in biology is truly hard that you don't want to mess with anything if its working. Pretty much every biologist I know has a story of how their experiments quit working for 6 months or a year due to some maddening technical quirk that they may never even have figured out. In that environment it's hard to convince people to change things when they're working well enough.

The specific problem with Excel in biology is due to the fact that most biologists have zero computer savvy. There is still essentially no training in software tools in biology programs either at the undergraduate level or the graduate level (except for bioinformatics programs and similar) so most biologists probably don't even really realize that Excel isn't the right tool for the job. I once had a colleague who was manipulating image data in Excel and I told him that if he spent a week learning Matlab he could do everything better and faster. His response? "This works well enough for now."
posted by pombe at 10:03 AM on March 4, 2009 [2 favorites]


I'm a CS grad student from the CS department with a CS adviser but our audience is often Biologists. I really feel for those guys... They end up using terrible stuff.

Turtles all the way down, I personally know Biologists seem to be willing to buy software, well, at least in one area. When they are dealing with a topic that excel, word, or similar simply will not handle. For instance if you are working in phylogeny. Nothing off the shelf handles trees well. You really have to buy or download some tools, and the options are all terrible. If there is a market it's in those darkest corners of science.

Interesting anecdote.
My genetics major girlfriend ended up spending a semester consolidating data from a bunch of text files onto a wiki. This was done by hand. While I applaud the effort to move into the 2000's by using a wiki. Assigning a student a semester of moving stuff by hand instead of borrowing a CS student that knows a little perl/python/ruby/whatever is really appalling. The craziest thing is my girlfriend wasn't the only one doing this, and the cut and paste went on for multiple semesters.
posted by magikker at 10:07 AM on March 4, 2009 [2 favorites]


me: When, in the history of computers - hell, not just computers, the history of technology in general - when has one program not improved but rather degraded over thirty years and still remained popular?

eriko: Emacs. (g, r, d, :wq!)


Good point, but you seem to have forgotten that last word of my question.

Nelson: This conversation would be more interesting if folks assumed the team writing Excel at Microsoft were not idiots. Because honestly, they're not.

I don't assume that, by the way - I thought I'd assumed that they were pretty competent people. It was their fate that I lamented. There is no greater curse than being a developer of a product that is already common enough that it's hard to say if any improvement will help its market share; I doubt that anybody at Microsoft with the forthrightness to try to build on Excel could ever get clearance to do something so daring and risky, given the unlikelihood (at least in the eyes of someone like Steve Ballmer) that improvements will be anything but detrimental to the market share.

Koeselitz, your rant is awesome, but you overlook a couple of things. Excel has changed a lot in 30 years, not the least of which is the integration of VB and VB.NET.

I haven't really delved into VB.NET at all yet, but Excel was functioning with VB in 1993; that was a long time ago. And pardon me for saying so: I'm not really a big fan of its scripting capabilities.

And the precise reason why Excel is not a database, why it's not just like Access, is because it's a spreadsheet, not a database.

If spreadsheets are going to exist at all, they should coexist cleanly and easily with larger databases. I know your point because I've probably had to make it a billion times, too; how many times have you had to explain to somebody the difference between a spreadsheet and a database? After the last few decades, I think it's become a bit more clear that the last thing we need is more people like us patiently explaining to people what the difference between two arbitrarily distinguished pieces of software are. What we need is software that draws the distinction for us. This isn't 1993; in the face of things like SQLite, it seems as though it should be possible to introduce spreadsheet "power users" who are scraping the ceiling of Excel's capabilities to more appropriate options in a more organic way.

Furthermore you're perfectly welcome to build an application integrating Excel and Access, thanks in no small part to the wonders of VB.NET. Now we can argue about whether all this stuff is easy to use, or works very well, but it's unfair to suggest Microsoft hasn't tried.

Hold on a moment: you want me to build an application integrating Excel and Access in VB.NET? I guess I could do that - though I don't really know anybody who would, and though if I were going to do such a thing I'd probably go in different routes with Python. But the average user of Excel - and even the average user of Access, in my experience - isn't going to learn a programming language.

Isn't there some way in which spreadsheets and databases can be linked organically? If not, then the best thing for Excel would be to strip off the database-like functionality that it sometimes almost attains. But that will never happen, not because the Excel developers are idiots, but because their bosses know that they'll lose market share if they strip out that stuff that's encouraging people to do silly things.

Sometimes I think we should have grown out of spreadsheets years ago.
posted by koeselitz at 10:09 AM on March 4, 2009


For some reason everything I do winds up, eventually, being a flat file database. Spreadsheets are OK but I actually prefer using a comma delimited format. I currently use CSVed as a graphical front-end for users and do all my reporting with awk scripts. This is actually much simpler than maintaining a spreadsheet, at least for me.
posted by jim in austin at 10:13 AM on March 4, 2009 [4 favorites]


I'm confused, how is ther original gene name not recoverable?

You might have two different strings — denoting two different genes — that are mapped to the same output. You can't undo that process.

Hypothetical example:

1. Open Excel (say, Excel 2008!)
2. Type "Sep2" and press Return
3. Type "Sept2" and press Return

Both fields are now "2-Sep". There is no way to know how to get back to "Sep2" and "Sept2" from this transformation. Your data is now screwed.
posted by Blazecock Pileon at 10:37 AM on March 4, 2009


LONG LIVE AWK.

it's sad that a better tool was invented in the 70's for manipulating this kind of data yet most people ignore it.

Why do biologists ignore computer science? If a comp sci major said they didn't need to know any chemistry or biology there'd be all sorts of counter-arguments about being well-rounded and having an educated citizenry, etc. But not knowing how to write AWK scripts? It's dismissed as nerd stuff or that they don't have time. It ain't that hard. Doing bioinfomatics in Excel is like renting a billboard with your photo the the words "IDIOT". Given the importance of this data you'd think these people would treat it more carefully.

And even if you don't like AWK, doing it in Python isn't very hard either and that's about as full-blown a development environment as it gets. I mean, chances are you used software written in Python already today.
posted by GuyZero at 10:38 AM on March 4, 2009 [4 favorites]


Sometimes I think we should have grown out of spreadsheets years ago.

Spreadsheets are fine, but every piece of software is written for a specific audience. Excel is still great for basic financial modeling. That it doesn't work for genetics researchers is not really a huge indictment of the spreadsheet in general.
posted by GuyZero at 10:40 AM on March 4, 2009 [3 favorites]


Given the size of the market, the mouse genetics program I make would probably sell for many thousands or even tens of thousands of dollars if it were targeted toward a comparable business market. However, scientists don't feel like they should have to pay that much for software.

You should see how much I pay for licenses every year. On the other hand, if your project budget is $50k to $100k, it's very hard to justify paying $10k to $75k per annum to some programmer in another country who is barely civil to you when you have a support issue. Rather a worst case (real life) example, but. Also, these people tend to disappear after a few years leaving you with a very expensive unsupported piece of software. We do need better tools, but most aren't willing to take that risk.
posted by bonehead at 10:41 AM on March 4, 2009


Why do biologists ignore computer science?

Because biology (chemistry) is already hard/interesting enough to occupy all of your attention? Because you are already expected to be an expert in statistics as well as your subject area? Because, in addition to being a dual-career scientist-statistician, you are also expected to be a part-time entrepreneur and seek funding for and manage your project team and facilities? And be part-time mechanic/machinist/electronics tech, so that you can fix the instruments that are always in need of repair?

Adding a fourth of fifth major requirement is more than most can manage. Most scientists aren't interested in informatics as such, they just want tools to facilitate science.
posted by bonehead at 10:47 AM on March 4, 2009 [4 favorites]


Excel is still great for basic financial modeling.

Actually, I should correct that. I have seen a $20M company run extremely detailed budgets completely on Excel. It was insane but it worked better than any dedicated financial package I've ever seen. It's really a great tool if you data is money.
posted by GuyZero at 10:49 AM on March 4, 2009


I've taught a few non-IT folks how to use Access to do the database things they were trying to do in Excel. They are all smart folks with no programming experience, and found it fairly easy to get the hang of Access. Pretty quickly they were doing things in seconds that would have taken days in Excel.

All of these people were on board the moment they saw how the 'accidentally sort one column of your table' mistake is really tough to make in Access. I think people continue to use Excel for the wrong tasks because they don't know they have other reasonable options, or have found the initial learning curve for something like Access just a bit too steep to climb on their own.
posted by FishBike at 10:49 AM on March 4, 2009 [2 favorites]


Turtles, you have some good points there but you're forgetting one thing about biologists and software: To replace the notebook, your solution has to be as simple and flexible as writing things down in a notebook.

The only solutions I can see are the note-taking software solutions that allow you to drop file fragments, images, etc. into your program, as if you were pasting things into a notebook, but I haven't ever used them - because the paper notebook is still easier for me when I'm in the middle of a procedure. Last thing I want near me during some messy procedures is my laptop! If I get blood / chemicals / radioactive crap on a sheet of paper, no big deal, but my laptop? No way.

Aside from that, design better tools and people will be grateful. I know half the software I use looks like it was designed by engineers with no biology training. And never underestimate the value of a nice-looking interface - ugly but functional is hard to sell.

(My best science software example is GraphPad Prism. I actually bought a copy, after running the demo. They really seem to get it - it does what it should, it looks nice, it has some really well-thought out features, and bug reports are promptly responded to, often by the company president.)
posted by caution live frogs at 10:57 AM on March 4, 2009 [1 favorite]


When I was doing temp work in Silicon valley, I used to see Excel used for all kinds of weird database applications, like the company that had all it's parts lists and suppliers listed on an incredibly complex multi-page Excel spreadsheet. IIRC they wanted their overseas partners to easily update without having a dedicated person enter the information...and I eventually told them that what they really needed to do was convert everything to an Access database. My job ended the next day.


The main reason I think Excel is misused as a database is that Access is such a pain to set things up with. You have to front-load all the design work at the beginning, sort out the relationships...it's one of the most annoying systems I've ever worked with, outside of dedicated government databases. Excel on the other hand, gives you an inviting grid, just waiting for you to put things in rows, and make it look like everything's properly interrelated. It's only when systems get complicated that the kludgyness of Excel increases exponentially.
posted by happyroach at 11:17 AM on March 4, 2009 [1 favorite]


You have to front-load all the design work at the beginning, sort out the relationships...it's one of the most annoying systems I've ever worked with

Why IT is hard in a nutshell.
posted by GuyZero at 11:20 AM on March 4, 2009 [1 favorite]


Ah, caution live frogs just beat me to the Graphpad Prism recommendation.

One of my hardest-fought battles was getting rid of Excel for data analysis in a biology department that shall remain nameless. We brought in Graphpad Prism, did several trainings, appointed a few guru users to serve as resources for the rest of the department ... then waited two years for the last Excel holdout to capitulate under peer pressure and increasing isolation from the rest of the company. This was because s/he had a lot of highly customized and VB-ridden spreadsheets from long ago which had long ago turned into a magic data mill. Enter data, and magic numbers come out 5 worksheets down.

The best thing about Prism is that it makes doing good statistics easy, and bad statistics hard. Sigmoidal curve fitting when the data is good and clean? Four clicks. Sigmoidal curve fitting when your data is crap? 15 clicks to disable the data guards and manually enter min/max values. It took a while, but biologists are now confident enough with the system to just do a better experiment rather than attempt data torture.

It is not perfect -- its XML output and scripting capabilities are limited, and any programmer will leap for R or Matlab instead -- but it hits a sweet spot between robustness and flexibility that works for most biologists.
posted by benzenedream at 11:26 AM on March 4, 2009 [1 favorite]


Because you are already expected to be an expert in statistics... in addition to being a dual-career scientist-statistician... part-time entrepreneur... And be part-time mechanic/machinist/electronics tech

I'm being half-snarky and half-serious, but have scientists ever heard of "teams"? Or the idea of specialized roles? I spent numerous years as a product manager, but no one ever expected me to actually execute every task I needed done personally. I understand science lab budgets are tight compared to the business world, but at some point there's simply no other way. Unless you are comfortable doing part of your job in a half-ass manner that threatens to render all the other work useless (like, say, having your data eaten by Excel's auto-date-formatting).
posted by GuyZero at 11:27 AM on March 4, 2009


I've also run into this problem with our lead researcher using excel, its incredibly frustrating to deal with someone who "doesn't have the time to learn another tool" Who doesn't have the time to do precise calculations if its part of your job? Doesn't it take way more time to get excel work they way you want it to in the first place? How do you even do QA on it? Or check the answers? Half the time you can't even tell if you have a rounding error. Its incredibly hard to document too. The amount of time people spend futzing around with it to get their sub-optimal results boggles my mind.

Perhaps there needs to be some sort of annual "Review of Computing Tools" for each major area of study along with the usual research publications.
posted by captaincrouton at 11:31 AM on March 4, 2009


Turtles, you have some good points there but you're forgetting one thing about biologists and software: To replace the notebook, your solution has to be as simple and flexible as writing things down in a notebook.

caution live frogs: I agree completely. Nothing is going to get rid of the notebook. It's convenient, always at hand, can serve as a legal document, and there's something satisfying about sitting at your desk at the end of the day and doing essentially the same thing as Isaac Newton and Louis Pasteur did.

But it's a really crappy way to record information in a form that's searchable and retrievable, and the degree to which the information is recorded is utterly dependent on the skill of the operator and the mood they happen to be in that day. The software I'm designing is integrated with barcoded labels: stick one label on the microfuge tube/plate/film/cryotube, a matching numbered label in your notebook, and you've established a permanent connection between sample and a notebook page. The app just requires you then to enter, at a minimum, the notebook and page in question, and something about the location of the sample. If you find a hoary old frosted tube in the -70º one day, scrape off that frost, scan the label into the software, and it will lead you to someone's notebook and the page in which they described the experiment. Even if they left the lab 10 years ago and nothing else they wrote on the label is legible.

Do that, and the several hundred dollars in enzyme and materials that experiment might have cost doesn't need to be wasted. From a software engineering perspective, this is kind of a prosaic application to develop. But, done right, taking care not to make the scientist expend one iota of effort more than is necessary at the end of the day to put the stuff into the computer, it's the first step to giving scientists the simple tools they very much need.
posted by Turtles all the way down at 11:46 AM on March 4, 2009 [2 favorites]


Seems like an opportunity for interdisciplinary collaboration in the university environment.

CS grad students developing advanced computing environments for biologists, Statistics grad students crunching real-world data and consulting on experiment design, biologists designing gene therapies to make CS students more social; surely utopia is at hand!
posted by Crabby Appleton at 11:51 AM on March 4, 2009


I'm being half-snarky and half-serious, but have scientists ever heard of "teams"? Or the idea of specialized roles?

Of course, but informatics is pretty far down the totem pole as to what needs to get done on a daily basis. I don't generally ask the bench scientists to machine their own custom parts either (though, back in the day...). Corporate "IT" will install Excel for you, but don't ask them to help setup a database or to write an awk script. Often, I'm left with the question: do I spend that last $15k on supplies or on a db contractor? I've tried, but have never seen contracted-out custom software work very well---domain knowledge transfer problems. Most of the time I have to do it myself to get it right.
posted by bonehead at 11:55 AM on March 4, 2009 [1 favorite]


There are too many smart people in this thread. I guess because it is a thread about being a professional scientist? WHATEVER! Here's excel pac-man.
posted by damehex at 11:55 AM on March 4, 2009


Of course, but informatics is pretty far down the totem pole as to what needs to get done on a daily basis.

heh, it sounds like QA in a small software company. I probably don't have to go into detail on how under-investing in QA usually works out.
posted by GuyZero at 12:07 PM on March 4, 2009


We have this procedure, which I shall not name, written by a guy who wasn't a techie. It's a user/fee thing, quite complex, over tens of thousands of rows, double-alphabet columns, and multiple worksheets. Well, he left, and it turns out that a lone guy just kept following this elaborate set of steps, which includes getting into data into Excel and running this macro, running that macro, changing a few things, then you're done.

He has a detailed set of procedures listing a great deal of what to do, but with no explanation of why each step is being done. No comments in anything. Lone guy moves to Office 2007, the macro will no longer do anything but generate gibberish. So we have a separate machine for just that task, running Office 2003.

I could probably write something to do what is required, but nobody remains who knows what the reasoning is behind all of the complex shuffling of unlabeled columns, the strange sorting, and whatnot is all about. We don't even know. This is just a black box which requires special care and elaborate, incomprehensible ritual to satisfy. I'm surprised we don't have to lop the head off of a dove once a month.

My point, in a roundabout way, is that Excel often the first thing that a non-techie reaches for when they want to automate something, anything, and it is done without a lot of the rigor needed for long-term maintenance, much less notification of others as to what this is all about. Excel is a great spreadsheet tool (and I still like Office 97, durnit), but people do not use it well.
posted by adipocere at 12:45 PM on March 4, 2009 [2 favorites]


"Adding a fourth of fifth major requirement is more than most can manage. Most scientists aren't interested in informatics as such, they just want tools to facilitate science."

Understandable, but if you work in a medical facility, for instance, someone has to know how to use the billing and records software. Saying you're too busy is not really a good response to the problem of not understanding it immediately. Someone has to be trained. In an academic environment, I can see where cross-disciplinary collaboration would really be helpful. But using the wrong tool and blaming it on the people who made the tool is not productive.

I once took an ASL class from someone who was developing a software dictionary for the language. He was talking about grammar, and how some people don't like to learn it, because it's boring and may not be all that critical to understanding the language. He said that he was interested in computers, but he didn't much like math, although in order to develop the software he wanted he needed to understand math (calculus, to be precise). So he took classes in order to get him to the point where he could use the necessary math in writing his software. And that's how he got it done.
posted by krinklyfig at 12:49 PM on March 4, 2009


the problem comes from MS following IEEE standards

The problem comes from the fact that they're using floating point numbers for the internal representation of numbers that really have no business being anything but fixed point.
posted by zsazsa at 12:58 PM on March 4, 2009


If I were dealing with a business user, I could make a convincing argument that investing X dollars in my software will save them X*Y dollars, allowing them to justify the purchase. But scientists don't do what they do to make money, so they don't think in dollars and cents terms. They know what they think a reasonable price is for something, and they have no other way of approaching it. So even if they're paying hundreds of thousands of dollars a year to house their mouse colony, that's just the way it is, and paying $2000 to save $10,000 or $30,000 is just, like...whatever.

The other angle to this is that very often, our hands are tied as to how we're able to spend our grant money. At my institution and very probably others, and particularly in cases were federal dollars are involved, you have to allocate your expenses into discrete categories. Buying software and often, computers robust enough to run it is viewed as a frill and not directly related to research, and is often a very minor component of a grant for that reason. I've had to write specialized mini-grants to get funding to buy hardware and software for my work, because we weren't permitted to spend out of our existing budget to buy it. In many cases labs may want to buy that great LIMS system or whatnot, it's just that they can't with their current funding allocations.


Seems like an opportunity for interdisciplinary collaboration in the university environment.

I actually tried this early in my graduate program. I had a situation that at the time was beyond my means. I arranged a meeting -- through a friend in the program, not just as some random whiner wanting help -- with the comp sci program at our fairly well-regarded university. The meeting went nowhere: the comp sci people either lacked the interest or the capability to understand what my proposal was, and basically smiled, said "wow, interesting" and offered me no real solutions. That meeting is why I'm now a half-biologist, half-informatician, and why I've spent years adding Perl, MySQL, R, and other tools to my arsenal. The idea of a collaboration is great on paper; in practice, not so easy to execute. All parties have to be invested in the relationship, have to benefit from the relationship, for it to pan out. In this case, I'm sure the people I was appealing to thought there was no immediate benefit for them to develop a relational database and front-end data retrieval utility for some random grad student. Given that modern science is essentially a network of interlocking collaborations, I think it's a bit unfair to say the problem is biologists don't play well with others.


Of course, but informatics is pretty far down the totem pole as to what needs to get done on a daily basis.

While I agree with a lot of what you've said in this thread, I respectfully disagree with this statement. Biology and related disciplines are becoming increasingly informatics-oriented. We're generating too much data to handle with a notebook and a pencil, and we need to start using the tools available, training people in their proper use, and developing utilities for the next wave of data. This recent paper from Lincoln Stein pretty neatly makes the case I'm clumsily trying to argue here.
posted by vesper at 1:06 PM on March 4, 2009 [2 favorites]


Most of the process of "improving" Excel would consist in deleting useless features.

Ditto for Word. But then, you could have a similar discussion on the rampant and regrettable but unstoppable use of Word as a layout program when Quark or InDesign or other software is so much better for it.

Thankfully, only aesthetics is usually harmed there, not science.
posted by emjaybee at 1:27 PM on March 4, 2009 [2 favorites]


I`d just like to say that that the person at Microsoft who decided to completely change the Excel interface for the 2007 is a moron. I`m sure the development team was told to do it by some manager (hey lets make this thing shinier with doo-dads and such I’m sure they said). I can`t tell you what a stupid decision this was. They have built up a huge user base that has grown used to the system and know where to find things in Excel up to the 2003 version and then they reinvent the whole thing for 2007 so you have to figure out where everything has moved too.

And so ends my rant.
posted by Sargas at 1:40 PM on March 4, 2009


Biology and related disciplines are becoming increasingly informatics-oriented. We're generating too much data to handle with a notebook and a pencil...

Oh I understand where you're coming from. I generate megabytes of data every time I run a sample. Not as much as microbiologists do, but not out of the ballpark either. Fundamentally, it's a resource issue, and one not given high priority by many scientists. Requests for supplies, instruments or people are all well understood, but IT gets turned into knots. I've been to more than six "kick-off" meetings for information management systems in the past five years; I'm going to another in a couple of months. Data management is a huge problem for us, but nobody, we scientists, the vendors, or IT, has any real clue for handling it.
posted by bonehead at 2:07 PM on March 4, 2009


Data management is a huge problem for us, but nobody, we scientists, the vendors, or IT, has any real clue for handling it.

Buy more storage arrays.
posted by Blazecock Pileon at 2:14 PM on March 4, 2009


I concur with the opinion that Fortune 500 companies likely do a lot of their "real" work in Excel. Also, according to my sources, biology isn't the only academic field that uses it. Apparently astronomers/astrodynamicists do the same.

I've been in astronomy for a while now and I've never run across anyone using Excel. I know this part of the conversation is kind of over, but I couldn't let this one slide :-)

Most people use some combination of C/Fortran/IDL/Matlab/Python. Personally, I do almost everything in IDL backed up by the awesome resource that is the IDL Astro Library. IDL is pretty awesome for dealing with images, which is something that we have to do a lot. Its also really easy to learn (although difficult to master). For those interested, you can learn the deep secrets of IDL programming here: Fanning IDL Guide.
posted by kms at 2:14 PM on March 4, 2009


Wow this is a great thread. Some of my recent AskMetafilter questions have been just about this very topic (err, minus the bio-academia slant). Excel is a very, very powerful tool. I'm not a developer, but I've worked with a lot of different software packages. None seem to work at as a general purpose tool quite like Excel. If I'm doing something once, or just starting out with something that requires me to store and manipulate data, Excel is the very first thing I'll go to. It is just there, like a giant monolith. Come play with me! Oh look you can make headers look so professional, so quickly. Hey I can do your formula real neat, just throw it in this cell over here and reference in this one and drag and drop ... whoa there, look what you did kid!

So yeah I'll design databases in Excel, it is a great testing tool. The other day (which prompted my question), I thought, shit if other people in this company start using this tool, it'lll fill up fast and they won't be as careful as me. Well wouldn't it be neato if I could throw this into an SQL database and more formally define all the fields, tables, etc. Then I can do data manipulation, analysis, etc.

Hey great, except this isn't easy. Because I'd be the only one who understands SQL and even though it is not "hard," Excel works so well until it doesn't, it is hard to encourage upper management to use it. Why go to all that trouble? Why are we hiring a dba to do this? Hey looks great in Excel. Oh, Ms. Admin from downstairs copied a bunch of pictures into Excel, except they refer to a location on her local desktop? And now when she sends it won't work and she'll spend 20 minutes screaming about she spent 3 hours doing it wrong but it wasn't wrong until she sent it so someone needs to fix it right now.

The problem really, lies in the fact IT works so well until it doesn't, and to make it work really well you need to sit down and do some designing from an abstract level and then drill down into the technical level, hopefully the technical level will change but that's okay because you have an abstract representation of it all and at that point it is just a matter of hiring a code monkey. The problem is that this is often not appreciated in organizations and if you try to explain to people why, say, importing important information into Excel won't work (when clearly you copy and paste and it appears to work just great) gives you cock-eyed stares like, why don't you go back to your data center dungeon and argue about Simpsons episodes.

I think we'll start to see change. I have friends who are not IT, who are my age, who understand that Computers are Complex, and are complex in a math / logic sort of way. They understand the basic concepts behind computers, how data is manipulated and how to avoid problems like this. Eventually they'll be at the top, but if you have any organization where IT isn't important from the top-down, you'll create a culture where bench scientists or middle managers are doing this they really should be consulting IT for. Really good organizations do this, but in smaller organizations or organizations with independent research arms, it often does not make economic sense to hire someone at $75-$150 hr to look over and formalize your data when that money could immediately be spent on sales or a new fancy microscope.

This is not really a phenomena related to IT, see Taleb's rants on the Black Swan. High impact, low probability events are hard to get people to act on. I bet whoever screwed this up used Excel without problems every day in his or her life. I bet they never saw problems like this, and I bet no one they worked with ever saw problems like this. Those of us who know how the "black box" of computers works will cry with or geekery as to why this is bad, but it is really, really hard to convince people to sit down and spend time correcting a problem that will probably never encounter.
posted by geoff. at 2:15 PM on March 4, 2009


I'm an ecologist, who's also lucky to have a bit programming skill, which seems to make me incredibly valuable as the scientist who, you know, knows how to run computers and stuff.

And I use Excel. Often because it's just what people give me data in. But some times because it's just a bloody quick way of fixing up the flat files other software gives you. Quickly delete and rename columns to an appropriate format, using the text-to-columns feature. It really is a tool within a broader toolset for me.

And that broader toolset can be pretty enormous. Analyses I've done recently have involved:

CSV -> Excel -> CSV -> ArcGIS -> DBF -> Access -> R

or:

Some weird, never before seen GRID file -> Python -> Postgres -> R -> CSV -> Excel -> ArcGIS

I feel lucky I know how to use all that stuff, because I don't know how other people doing the sorts of things I do get by. Access, by the way, is very useful as a quick-and-dirty SQL database. And it's also a stepping stone from Excel - people who are used to putting their data in Excel can generally pick up Access easily, and once it's in Access you have the whole ODBC interface that will let it talk to anything else.

When I was doing my PhD, some colleagues were trying to measure leaf area, by scanning leaves and running the scans through software. They were about to pay over $2,000 for some over-capable image analysis software for the task. I spent 2 hours writing a little program in Python for them that would clean up the image and count pixels for them, then multiply by the DPI to give a leaf area. Maybe I'm the reason you can't sell your software, Turtles. Scientists just can't turn down free stuff, particularly in cash-strapped fields like Ecology. I'm also a big advocate for R, which I think every scientist everywhere should be locked in a room for a month or so and forced to learn to use.
posted by Jimbob at 2:32 PM on March 4, 2009 [1 favorite]


If only there was some kind of specialised software they could use instead.
posted by turgid dahlia at 3:13 PM on March 4, 2009


I'm being half-snarky and half-serious, but have scientists ever heard of "teams"? Or the idea of specialized roles? I spent numerous years as a product manager, but no one ever expected me to actually execute every task I needed done personally. I understand science lab budgets are tight compared to the business world, but at some point there's simply no other way. Unless you are comfortable doing part of your job in a half-ass manner that threatens to render all the other work useless (like, say, having your data eaten by Excel's auto-date-formatting).

I'm talkin' out my ass here, as I have no experience with a lab environment, but I suspect that this is partly an economy of scale issue. Not only do you need to hire someone with informatics expertise, but you need someone on staff who can ask the interviewees the right questions in two or three domains so that you can hire the right person. Ideally, you also need to hire someone to document your systems/procedures as you go on as well so that when that person inevitably leaves you don't have to maintain every system they developed as a black box. I'd think the larger organizations would have a lot of advantage over smaller labs for this.

But using the wrong tool and blaming it on the people who made the tool is not productive.

Blaming Excel for screwing up things that should have been in a more specialized tool is dubious, but blaming Excel for errors that would have affected its core market as well is legitimate. I guarantee you that some business somewhere has had a cost analysis spreadsheet with a huge list of part numbers affected by a similar reformatting issue. The difference is that the business world doesn't publicize its informatics errors. Chalk this up to yet another area where I'm surprised open source isn't crushing closed source. I remember hearing that one of the reasons Fortran was used so extensively and for so long was the relatively bulletproof well tested math libraries. If you stick with open source solutions (or closed source where you have leverage over your vendors), your operating environment seems to be much less chaotic over the long term. Instead of being forced to upgrade software on arbitrary schedules with constant undocumented changes to the infrastructure your tools run on, you can preserve a relatively static environment.
posted by BrotherCaine at 3:49 PM on March 4, 2009


I forgot to say, using open source, you can also run multiple versions of your software on every machine you want without paying for a license for each version; your IT guy may need therapy though.
posted by BrotherCaine at 3:53 PM on March 4, 2009


(I'd even like it if they'd do some stupid little improvements in function processing. Have you ever tried to do a VLOOKUP query on a key that was to the right, rather than to the left, of the target? Or tried to do a COUNTIF function with more than one parameter? Ugh.)

You want MATCH and INDEX for the vlookup (faster too), and you want an array-formula of =SUM(IF(AND(whatever))) for your counting.

*ducks*
posted by pompomtom at 4:32 PM on March 4, 2009


I'm being half-snarky and half-serious, but have scientists ever heard of "teams"? Or the idea of specialized roles?

The way science is funded, and the way universities are run these days means that teams are indeed the norm. Small teams, who often have little to do with one another within a wider department. I work in a "lab" with 4 other research associates (two of which actually work at the other side of the country), 1 PhD student, and 2 honours students, all of whom are mostly working on their own discrete projects, and a technical officer who manly handles glasshouse facilities, equipment and purchasing. We are extremely lucky to have a tech-o. We are extremely lucky, within the context of our department, to have the staffing and grants that we do. And, if I can say, we're also lucky that myself and one of the other researchers have the computing background that we do, because most other labs don't. I know this, because I get honours students, PhD students, and even the occasional professor knocking on my door asking for my help in doing things like getting ArcGIS to produce a simple map.

The fact is, in the situation where I work, people are employed on contracts based on grants, and when you apply for the grant and specify who you want to hire, putting down "IT Professional" as one of the positions probably isn't going to work, except on seriously, seriously large, rare grants.

And, as I've eluded to here before, the IT provided by the university doesn't go much beyond guys who know how to re-image a PC.
posted by Jimbob at 4:34 PM on March 4, 2009


Because biology (chemistry) is already hard/interesting enough to occupy all of your attention? Because you are already expected to be an expert in statistics as well as your subject area?

This is the case for every field of science. If you do science without math or statistics, you are a mere philosopher. Nothing of what you described is unique to biology, and yet biologists seem to be the worst offenders, in my highly biased and unrepresentative sample of scientists: I've met dual major CS/math, CS/physics, and CS/music, but biologists seem to be allergic to computers.

I ignorantly attribute it to bad selection bias; people who like nature go into the field, even though the nuts and bolts of progress today are statistical.
posted by pwnguin at 4:59 PM on March 4, 2009 [3 favorites]


pwnquin: my boss is cross-appointed statistics and human genetics. He remembers R quirks on a regular basis. It happens.
posted by a robot made out of meat at 5:55 PM on March 4, 2009


FYI, this article was written in 2004...
posted by pwnguin at 6:09 PM on March 4, 2009


Interesting thread.

I'm a programmer for website back-ends and one thing I've ended up specializing in is oddball data transformations and imports. One project I worked on was for a national auto-service chain who had their their entire database of tires and cross-references to vehicles on one Excel spreadsheet.

There was one plausible reason for this: they programmed some auto-lookup stuff into it so that they could then send it to all their outlets. Anyhoo, this just ended up their master format as well.

Likewise I've found big companies who ended up with important data stored in ONE Access db on a junior's desktop, simply because that's how it was started.

I respect Excel, which is more than I can say for the rest of the Office suite ( my personal bias, I admit it). At home I use OpenOffice, and I now prefer their spreadsheet.
posted by Artful Codger at 6:47 PM on March 4, 2009 [1 favorite]


There's nothing wrong with the concept of spreadsheets — there's a reason they were a true killer app back in the 80s when they first came out. They're useful.

Excel's problems are due almost entirely to it having a near-total monopoly of the spreadsheet market. It tries to do everything, and ends up doing much poorly, and few things really well. Ten or twelve years ago, when there was more competition, you had various packages filling particular market niches. Now there's basically just Excel, at least for most people. There's no competition anymore.

So instead of getting an ecosystem of products tailored to various applications, we instead get the Borg.

The end result is that if your problem happens to be something that Excel can do — that its designers have anticipated, basically — you will probably come out okay. But the day you try to do something different, you and your data will both be screwed. And chances are, since you've been using Excel for years, you won't know anything else.

The solution as I see it is, if you're not comfortable letting your problem space be dictated by the whims of the Excel team, start learning the ins and outs of some other tools, so that you're not wed to Excel when the day arrives that it doesn't do what you want. For people who are actually doing statistics in Excel, I'd say learn SPSS (at least it retains the spreadsheety look-and-feel). If you use Excel macros, try Python or some other high-level programming language and IDE (there are libraries to suck data right out of Excel files now). Etc.

Excel isn't that bad when you just use it for making lists, sorting them, or doing basic financial calculations (reconciling a year's worth of expense receipts? it's probably a fair choice), but the second you start to feel limited in any way, run. You're using the wrong tool. And while you're at it, check all your data — in many cases, by the time you start to feel constrained, it's already started to pinch your data in subtle yet damning ways.
posted by Kadin2048 at 7:02 PM on March 4, 2009


Data management is a huge problem for us, but nobody, we scientists, the vendors, or IT, has any real clue for handling it.

Another place where (light, informal) collaboration can help -- your HEP colleagues have had to deal with huge gobs of data for decades, and have been writing support for it into their grants. It's not necessary to reinvent the wheel. Yet -- and I'm not accusing you/your lab of this -- I've encountered some very strange resistance to trying known solutions.

By way of example, solutions for backing up large data sets en mass in an automated fashion have existed for years, and yet the MO in most labs at the NIH is for researchers to back up their own data to CDs/DVDs by hand. The NIH once offered a TSM service but retired it because not enough labs were taking advantage of it -- thus forgoing what could have been a great economy of scale and forcing the CD/DVD backup "system" on labs that didn't want to invest in their own tape robots (ie the majority).

How much work do you think gets regularly archived? Work that could cure cancer? Work that's supported through your tax bux?
posted by Westringia F. at 4:27 AM on March 5, 2009


• MSDN: Understanding Floating Point Precision, aka “Why does Excel Give Me Seemingly Wrong Answers?”
posted by Blazecock Pileon at 8:13 AM on March 4 [7 favorites +] [!]


I love the wording here. It's not that the answers Excel gives are wrong, it's that you just don't understand how right they are.
posted by Who_Am_I at 5:41 AM on March 5, 2009 [1 favorite]


This gives me renewed hope for my own paper, Accidentally Hitting the Caps Lock Key Can Result in Words of Improper Case.
posted by troybob at 1:31 AM on March 6, 2009 [2 favorites]


« Older Kutiman mixes YouTube   |   Neither Steam Nor Punk Newer »


This thread has been archived and is closed to new comments



Post