Network Nonsense
February 12, 2014 9:12 AM
Open warfare erupts in the world of mathematical biology, as Lior Pachter of UC Berkeley writes three blog posts attacking two papers in Nature Biotechnology, accusing one of them of being "dishonest and fraudulent": The Network Nonsense of Albert-László Barabási, The Network Nonsense of Manolis Kellis, and Why I Read the Network Nonsense Papers. Kellis (MIT) and his co-authors respond (PDF).
This appears to be a squabble between some academics over whether ones model is (a combination of) fraud, over-hyped, inaccurate or poorly reviewed, with some grudge-match. A callout doesn't usually make for a good post.
Some additional context would help - who are these folks, what are their models really proposing, etc.
posted by k5.user at 9:31 AM on February 12, 2014
Would be nice to have more context for this; even the blog posts are pretty inside-baseball-ish.
posted by Wretch729 at 9:34 AM on February 12, 2014
This is interesting. The overall story is: when you're doing really complicated science (computational molecular biology) and someone thinks they've found a glaring error in a paper, what do they do about it? In their view, it's an objective error, not just a subjective opinion. But the underlying maths makes the objection hard for most people to understand, and the paper-approval process for that specific conference is so opaque, that the objections are essentially ignored. The paper then goes on to win awards. So the bigger issue is: how complex can science get before peer review breaks down?
posted by memebake at 9:39 AM on February 12, 2014 [4 favorites]
This appears to be a squabble between some academics
It appears it might be a bit more than that, where a paper was changed in a significant (or perhaps not significant) way after publication. Nonetheless, going after Kellis is a big deal — it's basically like a theoretical physicist calling someone of Hawking's professional stature a fraud.
posted by Blazecock Pileon at 9:44 AM on February 12, 2014
The combinatorial argument on the motifs was interesting to me: you're seeing these structures because of combinatorics, not because of anything special in your actual data. I have less to say about the squabble over the following papers.
posted by kaibutsu at 9:46 AM on February 12, 2014 [1 favorite]
What's fascinating to me is that these top-tier journals did nothing to correct the errors. I'm assuming, of course, that they really are errors--criticisms of Lior Pachter's responses seem to be limited to the tone, with no one calling into question the facts of the allegations.
What a wretched and ridiculous way to squander your position as a leading journal. I hope they fall in prestige rapidly.
This bit impressed me:
Unlike some journals, where, say, a supplement can be revised while having the original removed (see the figure switching of Feizi et al.), arXiv preprints are permanent.
Now that's integrity. Way to go, arXiv.
posted by jsturgill at 9:52 AM on February 12, 2014
There is a versioning system on the arXiv; if you upload a revision of a preprint, it gets saved and becomes the default link, though the previous versions will also still be completely accessible. Thus, it's possible to throw a retraction note at the head of the default version of a paper.
posted by kaibutsu at 10:05 AM on February 12, 2014 [1 favorite]
Kinda reminds me of the Reinhart/Rogoff economics paper that turned out to have the wrong ranges on its Excel formulas.
posted by memebake at 10:16 AM on February 12, 2014
Right, kaibutsu. You can also 'withdraw' arXiv preprints, which makes the latest version show up as 'Article Withdrawn', but previous versions still exist. The 'preprints are permanent' statement, while technically correct, is somewhat disingenuous.
This is also a good thing. Science should be as transparent as possible and have a paper trail. ArXiv is really in no way behaving badly here.
posted by McSwaggers at 10:17 AM on February 12, 2014 [1 favorite]
Kellis' response doesn't address Pachter's accusation of "fraud": switching out a figure in their correction to the supplement without acknowledging that the figure was changed. I'm not going to comment on whether this is in fact fraudulent (they claimed the figure was "clarified"), but this is the most deadly accusation Pachter makes. It's clear, simple to understand, and documented. Kellis really, really needs to respond to this specific accusation.
posted by mr_roboto at 10:21 AM on February 12, 2014 [2 favorites]
Well I'm clearly unqualified to determine who is in the right here. I see no alternative but to fall back on the judgment of the divine and so I propose a trial by combat between the various disputants, using a single-elimination tournament structure, with the winner being understood to have been blessed by God, in mathematics as in combat.
posted by Naberius at 10:35 AM on February 12, 2014 [2 favorites]
Oh my god this is awesome. All my grad school grumblings put on paper by a real academic.
When I get a chance I'll try and give a rundown from the perspective of someone who was an insider five years ago. Basically this is about construction of null hypotheses.
posted by PMdixon at 10:42 AM on February 12, 2014 [2 favorites]
An argument about matrix math? <pops popcorn, pulls up chair>
posted by benito.strauss at 11:22 AM on February 12, 2014 [3 favorites]
There are two principal reasons I don't submit my work to boutique journals like Science and Nature. The first is that my work is nowhere near sexy enough to appeal to the editors and readership, and that's a high bar to clear. The second is that I have serious concerns that the reputation of those journals is no longer deserved; if it's not, then the Herculean effort required to publish in them is mostly wasted. I've seen some very sketchy stuff in top-shelf publications that the rigorous review process is supposed to filter out, but if your name is big enough that doesn't seem to matter.
posted by wintermind at 1:22 PM on February 12, 2014 [6 favorites]
One thing I always heard is that Science and Nature have a heavy bias for sexy findings, more than good ones.
posted by PMdixon at 1:41 PM on February 12, 2014 [4 favorites]
It seems to be a problem with accepting papers from any field(?), and having a very high filter(?).
posted by grobstein at 2:26 PM on February 12, 2014 [1 favorite]
Okay, here's my rundown on the Barabási piece. This is all based on my knowledge as of leaving grad school 5 years ago, but this was the thing my dissertation was gonna be on, so I'm calling dibs on the expert hat till someone takes it. Note: This is solely based on the pieces by Pachter. I don't have database access anymore, and haven't read the reply. This is a paraphrase, and hopefully any editorial asides are clear as such. My biases are very much aligned with Pachter on this bit, but they've not been in an environment to be challenged in 5 years.
First, the background. In the last 10-15 years, there's been a series of new experimental techniques in biology known as "high-throughput methods," that is, they generate lots and lots of data at once. Not quite particle accelerator level, but only 3-4 orders of magnitude off. Microarrays are the best-known such method, though there are others.
To be crude, biologists are bad at math. Also biology is hard. Also time-resolution of most of the high-throughput methods is pretty expensive, so you're typically sampling at least 15 minutes apart. Also biology is hard. So no one really has figured out how best to use these numbers. Okay, I have a snapshot of the expression of 10,000+ genes at one point in time --- so what?
One of the so what's is known as gene regulatory network reconstruction. Gene regulation is the process by which the expression of one gene affects the expression of another. Easy examples are activation and inhibition: Gene A expressing causes higher (lower) expression of gene B. A regulatory network is just a graph, a la connect the dots, where there's an edge between two nodes depending on whether there's a regulatory relationship between the genes they represent---a directed network attempts to represent which way influence flows, while an undirected one just says there's a relationship. To "reconstruct" this is to make an inference from data about the regulatory structure of the genes. (Reconstruct is sort of a misnomer, if we're good empiricists and don't reify our constructs. There is no real genetic network. Just a bunch of chemical reactions happening.)
This particular question has a conference centered around it, DREAM, with an associated competition: some folks provide data from somewhere (some in silico pseudo-data, some in vitro; one year there was actually data from genes artificially inserted into E. coli), and people try to reconstruct what it came from. I actually competed one year! My method was basically the mutual information one that ranked 19/65 in DREAM5.
One thing that is common to mathematical biology, as in many other fields, is theoretical physicists. I don't understand the sociology of it, but there's this thing where people come in with some model derived from some completely other behavior, and slap it on the thing at hand. When all you have is a hammer etc. "Scale-free networks" are one of these hammers. More on that later. See Cosma Shalizi for power law details. Barabasi is aligned with that crowd.
Okay, so the details of this paper: a perturbation experiment works like this. Set up your system. Let it sit for a while so that whatever things you did to it to make it observable should be leveled out. (This is called "at equilibrium".) In my case, my system was Arabidopsis, and I think they were putting leaf shoots in solution? IDK. Measure your variables. Then change whatever input. Mine was citrate; the one described in the link is directly changing expression levels, maybe by RNA interference, IDK. (This is one of the things theoretical physics-types often do: give approximate solutions to an estimate obtained by an impossible experiment.) Measure your stuff again.
In this case, the ith row of matrix G is the change in the expression of each gene divided by the amount of change in the ith gene. Basically it's an estimate of the derivative of each gene's expression level with respect to the other genes' expression levels. This is why the diagonal is definitionally 1.
Okay, here's where we get to the math part, and the bitchiness.
So B&B claim that given our observed matrix G, we should be interested in an S that is related to G in the following way: Gij = (SG)ij for i != j, with the diagonal of G identically 1. Pachter doesn't take issue with their claim that we should want S. However, B&B claim that we can't explicitly calculate S from G and have to approximate it. Pachter says not only can we explicitly calculate it, we can give a nice easy 1-line formula.
I cannot possibly overstate how big a slap in the face this is. This is probably more insulting than a literal slap in the face would be.
I have not tried to check the math by hand, nor will I, cuz this isn't my job anymore. It looks plausible.
So that's the big bomb. From there there's a bit where he says that as soon as it comes into contact with dirty data (all the data is dirty), the model collapses, and then says they cherry-picked weak methods to compare themselves to.
Color commentary: Barabasi is awful. The dude can't wipe his ass without finding a scale-free network. (Please read all of Cosma Shalizi, linked above, for an understanding of why you should mock most people who claim something follows a power law.) He makes giant sweeping claims, and publishes the same god-damn paper over and over. If he finally gets nailed on something this egregious that'd be awesome. I am hugely distrustful of people who claim that we "see the same patterns everywhere" (mostly false) and then claim that because two things are distributed similarly it means they were generated by a similar process. To the extent this embarrasses them I'm super excited.
posted by PMdixon at 3:46 PM on February 12, 2014 [29 favorites]
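The model as sketched above is small enough to play with directly. Below is a minimal numpy sketch of the whole round trip, assuming the constraints that emerge over the rest of the thread (S has zero diagonal, G has unit diagonal, and SG agrees with G off the diagonal); the sizes, random data, and forward construction are illustrative assumptions, not anything from the papers:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6

# Toy "direct effects" matrix S with zero diagonal.
S_true = 0.2 * rng.standard_normal((n, n))
np.fill_diagonal(S_true, 0.0)

# Forward model: choose G with unit diagonal such that SG and G agree off
# the diagonal. Writing (I - S)G = D for a diagonal D and picking D so
# that diag(G) = 1 gives such a G.
M = np.linalg.inv(np.eye(n) - S_true)
G = M @ np.diag(1.0 / np.diag(M))
assert np.allclose(np.diag(G), 1.0)

# Exact recovery, per the one-line formula derived further down the
# thread: S = I - D(1/diag(G^{-1})) G^{-1}.
Ginv = np.linalg.inv(G)
S_hat = np.eye(n) - np.diag(1.0 / np.diag(Ginv)) @ Ginv

print(np.allclose(S_hat, S_true))  # True: S recovered exactly from G
```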
Oh, god. It's really as brutal as PMdixon describes. From the Barabási take-down:
While it would be nice for us to claim that our managing to quickly supersede the main result of a paper published in Nature Biotechnology was due to some sort of genius, in fact the entire exercise would be suitable for an undergraduate linear algebra homework problem. Barabási likes to compare himself to the great physicist and Nobel laureate Subrahmanyan Chandrasekhar, but it is difficult to imagine the great Chandrasekhar having similar difficulties. [Links in original.]
posted by mr_roboto at 4:14 PM on February 12, 2014 [2 favorites]
From the "compare himself to Chandrasekhar" link (brought to you, because of course, by edge.org):
One of the shocking things that I discovered through my son is that from a very young age, I kept saying "do you want to be an astronomer? Do you want to go to the moon?" Always he always said, "no, I don't want to go." But he would like to go to work for Google, he would like to go to work for Facebook. We have a generation that is growing up for whom the traditional goals of going to the moon, of flying to faraway stars, don't exist anymore. That's not what excites them. What excites them is data, networks, social systems and all of these things that were really not part of the thinking. We don't have a goal. We don't have a computational social science department, at any university.
posted by junco at 4:26 PM on February 12, 2014
[...] right now, much of my support is piggybacking on traditional disciplines. I cannot get a network science grant. I have to piggyback on lots of other things that we do, and sell it as physics, sell it as biology, sell it as many other things so that I can fit in the traditional funding system, in the traditional department system, in the university system.
I hope people remember this when they hear people pushing the line that "the idea of university departments is outdated" and "we need to establish new organizational methods to solve 'big picture' problems".
posted by junco at 4:28 PM on February 12, 2014
Also, happy to do the physicsmatt thing and answer questions about background context if people have them. I promise to qualify most answers mostly appropriately and declare as many biases as I'm aware of.
On preview: Yeah, one of the problems with interdisciplinary work is a persistent lack of awareness of interdisciplinary history. Like, I'm pretty sure I read someone taking a bunch of articles on social network inference dealies from the arXiv and matching them up against sociological texts from the first couple decades of the 20th century.
posted by PMdixon at 4:31 PM on February 12, 2014
k5.user: "This appears to be a squabble between some academics over whether ones model is (a combination of) fraud, over-hyped, inaccurate or poorly reviewed, with some grudge-match. A callout doesn't usually make for a good post."
I don't have expertise in these fields, but I still found it an interesting set of things to read on my longish bus ride today, and it all built to a really great conclusion in the final blog post that probably ought to have been highlighted in either the FPP or by the blog author. The conclusion wasn't really inside baseball at all, and has applicability to a tremendous number of fields. It's worth digging up at the end of ... augh no anchor tags in the post ... you know what? Fukkit, probably easiest to just quote it and highlight some of the good bits (with apologies to Pachter; but c'mon, man! Foreground this stuff!):
Speaking frankly, it was difficult work to write the blog posts about these articles. In addition to the time it took, it was exhausting and exasperating to discover the flaws, fallacies and frauds. Both Nick and I prefer to do research. But we felt a responsibility to spell out in detail what had happened here. Manolis Kellis is not just any scientist. He has, and continues to play leading roles in major consortium projects such as mod-ENCODE and ENCODE, and he has served on numerous advisory committees for the NHGRI. He is a member of the GCAT (Genomics, Computational Biology and Technology) study section until 2018. That any person would swap out a key figure in a published paper without publishing a correction, and without informing the editor is astonishing. That a person with great responsibility towards scientists is an abuser of science is unacceptable.
Manolis Kellis’ behavior is part of a systemic problem in computational biology. The cross-fertilization of ideas between mathematics, statistics, computer science and biology is both an opportunity and a danger. It is not hard to peddle incoherent math to biologists, many of whom are literally math phobic. For example, a number of responses I’ve received to the Feizi et al. blog post have started with comments such as
“I don’t have the expertise to judge the math, …”
Similarly, it isn’t hard to fool mathematicians into believing biological fables. Many mathematicians throughout the country were recently convinced by Jonathan Rothberg to donate samples of their DNA so that they might find out “what makes them a genius”. Such mathematicians, and their colleagues in computer science and statistics, take at face value statements such as “we have figured out what makes a human human”. In the midst of such confusion, it is easy for an enterprising “computational person” to take advantage of the situation, and Kellis has.
I believe the solution for this problem is for computational biologists to start taking themselves more seriously. Whether serving as reviewers for journals, as panel members for funding agencies, on hiring/tenure committees, or writing articles, all of us have to tone down the hype and pay closer attention to the science. There are many examples of what this means: a review of a math/stats containing paper cannot be a single paragraph long and based on a hunch, and similarly computational biologists shouldn’t claim, as have many of the authors of papers I’ve reviewed in these posts, pathways to cure disease and explanations for what makes humans human. Don’t fool the biologists. Don’t fool the computer scientists, statisticians, and mathematicians.
The possibilities for computational methods in biology are unlimited. The future is exciting, and there are possibilities for significant advances in areas ranging from molecular and evolutionary biology to medicine. But money, citations and fame cannot rule the day. The details of the #methodsmatter.
posted by barnacles at 5:29 PM on February 12, 2014 [5 favorites]
Also, happy to do the physicsmatt thing and answer questions about background context if people have the...
You are awesome for taking the time to explain all this PM...thanks so much. I had read one of Barabasi's papers having to do with network attacks a few years ago, but I was not aware of his attempts to wrap himself in multidisciplinary glory.
One question, if you are willing. "Gene regulatory network reconstruction"...is there a good introductory reference to read more about this? I don't get what the overarching goal is. Is the hope that if the math keeps identifying some subset of the network under different perturbations, you would learn something about what part of the biology is important and what isn't? Thanks...
posted by superelastic at 5:33 PM on February 12, 2014
Also, don't miss the awesome academic smack talk in the first post about how a system was "trivially tractable" and how his student was able to figure out a far simpler equation in less time than it took Pachter for a toilet break.
posted by barnacles at 5:33 PM on February 12, 2014
Hey PMDixon.
I'm a math nerd so I think I understand the matrix G you described. It's like the edge weights of the incidence graph connecting genes (or maybe relative weights), no?
If I've got that right, can you tell me what the matrix S is supposed to represent? I didn't get it from the first post linked in the FPP, and from what you've described, it seems the n x n Identity matrix would work for S. Thanks.
posted by benito.strauss at 5:50 PM on February 12, 2014 [1 favorite]
Actually, I too am confused by this, benito.strauss! In both Pachter's writeup, and in the Barabasi paper being attacked, I don't understand why S = id doesn't serve as a solution. There must be some other condition on S I'm not seeing that rules this out? PMDixon, help us!
posted by escabeche at 6:23 PM on February 12, 2014 [1 favorite]
The way I read it, they're looking for effects of genes on other genes, so only the non-diagonal entries are of interest, so they force Sii=0 ∀ i. So S = I is ruled out.
posted by mr_roboto at 6:40 PM on February 12, 2014 [4 favorites]
Hey,
So G is something like the observed-Jacobian-of-the-equilibrium-state-relative-to-a-forcing-on-each-gene. benito.strauss, S is actually the matrix describing (direct) edge weights. What the weights represent depends on the model; at a guess, I'd say it's something like the relative derivative [(dxi/dxj)/xj], so that you get that the response to an input perturbation vector x is something like exp(S*x)? Not sure exactly, it's a work night. But basically, G is the outcome considering downstream effects, and we want to estimate from that the "first order" effects, which S represents under some parameterization. So yeah, direct self-regulation (while totally biologically meaningful!) is ruled out from these sorts of models. Also hella hard to detect in general.
superelastic: This paper by one of my ex-supervisors is I think a pretty good intro to a toy version of the question for someone with a strong general math background. Let me dig into my old lit review and see what I can find. (Never thought I'd say that.) But yeah, basically the goal is to translate the gigs and gigs of raw data into a meaningful statement about the underlying biology, for the usual reasons we care about the underlying biology.
posted by PMdixon at 6:46 PM on February 12, 2014
mr_roboto is right, the requirement that S have zero diagonal is right there in Pachter's post (and presumably somewhere in the Nature paper too.)
posted by escabeche at 6:48 PM on February 12, 2014
Ah, we both mis-read/mis-remembered 0-diagonal as 1-diagonal. Oops.
posted by benito.strauss at 7:16 PM on February 12, 2014
And actually, what's up with the "Sherman-Morrison formula?" This is not something I know, but you don't need any formula to get Bray's bathroom-break computation for S. Like this: since SG and G agree off the diagonal, we know that
SG = G - D
for some diagonal matrix D. Multiplying on the right by G^{-1} we get
S = I - DG^{-1}.
But D is determined uniquely by the fact that S has zero diagonal; namely, it has to be the diagonal matrix D(1/G^{-1}) in Pachter's notation. Done.
posted by escabeche at 7:29 PM on February 12, 2014 [7 favorites]
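escabeche's three lines are easy to sanity-check numerically; here is a minimal sketch with a toy random G (the sizes and data are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8
G = np.eye(n) + 0.3 * rng.standard_normal((n, n))
np.fill_diagonal(G, 1.0)  # G has unit diagonal by construction

Ginv = np.linalg.inv(G)
D = np.diag(1.0 / np.diag(Ginv))  # the diagonal matrix D(1/G^{-1})
S = np.eye(n) - D @ Ginv

off = ~np.eye(n, dtype=bool)
print(np.allclose(np.diag(S), 0.0))        # S has zero diagonal
print(np.allclose((S @ G)[off], G[off]))   # SG and G agree off the diagonal
```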
But D is determined uniquely by the fact that S has zero diagonal; namely, it has to be the diagonal matrix D(1/G^{-1}) in Pachter's notation.
I think some math happened somewhere around "namely."
posted by PMdixon at 7:49 PM on February 12, 2014 [1 favorite]
Wow, escabeche, way to make me look like an idiot. Thanks a bunch.
No, seriously, though, I can't believe I didn't see that. It's a great reminder of the dangers of getting stuck on one way of looking at a problem. In this case, I started off looking at it as a system of equations and never stopped and just looked at the matrix equation that it was all coming from.
It's pointless now, of course, but the Sherman-Morrison formula expresses how the inverse of a matrix changes when you add a rank-one matrix (for non-mathy folks, a matrix which is very simple in some sense) to it. After noticing that the system of N^2 equations actually decoupled into N systems of equations, I realized that the matrices for those systems of equations were all just G plus a different rank-one matrix and the Sherman-Morrison formula let you express all their solutions in terms of G^{-1}.
When I saw how simple the solution ended up being, I knew that there must be an easier way to get there but never got around to trying. That'll show me.
posted by nicolas.bray at 8:06 PM on February 12, 2014 [4 favorites]
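For reference, the Sherman-Morrison formula itself is short enough to state and check numerically; a minimal sketch on toy data:

```python
# Sherman-Morrison: (A + u v^T)^{-1} = A^{-1} - (A^{-1} u v^T A^{-1}) / (1 + v^T A^{-1} u),
# valid when A is invertible and 1 + v^T A^{-1} u != 0.
import numpy as np

rng = np.random.default_rng(2)
n = 5
A = np.eye(n) + 0.1 * rng.standard_normal((n, n))
u = rng.standard_normal((n, 1))
v = rng.standard_normal((n, 1))

Ainv = np.linalg.inv(A)
denom = 1.0 + (v.T @ Ainv @ u).item()
update = (Ainv @ u @ v.T @ Ainv) / denom

lhs = np.linalg.inv(A + u @ v.T)  # direct inverse of the rank-one update
rhs = Ainv - update               # Sherman-Morrison expression
print(np.allclose(lhs, rhs))      # True
```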
Well there goes my expert hat.
posted by PMdixon at 8:08 PM on February 12, 2014 [4 favorites]
Welcome to MetaFilter, Nicolas! The real idea -- that the computation is massively easier than the Nature paper makes it look -- is yours, obviously; it's just a question of how much massively easier.
(Are you related to Hugh Bray, by the way?)
posted by escabeche at 8:17 PM on February 12, 2014
Escabeche: Thanks for the welcome! I felt compelled to come prostrate myself. Also: big fan of your jalapeños. And, no, no known relation to Hubert Bray (I assume that's Hugh). There are few enough of us around that it's exciting to see another one, but enough that it actually does happen from time to time.
PMdixon: There are many kinds of expert hats! I liked your comments upthread. I knew nothing about Barabasi going into this and Cosma Shalizi (whose site I've long enjoyed) provided a lot of background for me.
posted by nicolas.bray at 8:30 PM on February 12, 2014 [1 favorite]
Yeah, Cosma Shalizi has some wonderful takedowns of the whole "scale-free everything" nonsense.
The switching out of a supplementary figure is definitely disconcerting. I'm wondering if it was a difference between the "accepted/advance online" and the "finalized" version of the paper, because in my experience it's not uncommon for even somewhat significant errors to be silently corrected between those two versions. I agree that not to mention it in the corrigendum is weird. It's also the kind of thing that an editor should have caught (wouldn't an editor have to approve the change to begin with, if the SI is hosted on NatBiotech's website?).
Fraud to me is more about deliberately tampering with facts, and I'm not convinced that this was an attempt to commit fraud as much as it was a case of sloppiness and a rush to publish combined with an editor who was phoning it in.
To be honest, I'm starting to get an impression of this whole Broad-proximal cohort of computational biologists as being "publish first, ask questions later" cowboys (remember this whole business about the "maximal information coefficient"? or this classification study of malaria from clinical samples that turned out to be in large part due to a batch effect?). What's extra distressing is that this seems to be a strategy that pays off: it appears to be way too easy to get a not-very-impressive computational biology method into CNS or CNS Jr. if you make it sound abstruse and broad-ranging enough. It's almost as if there's a U-shaped relationship between rigor and the sexiness of the publication -- blatantly non-rigorous methods won't (well, shouldn't) pass peer review, but papers that make more guarded, careful claims may tend to get editorial rejections for not meeting the sexiness threshold.
posted by en forme de poire at 9:21 AM on February 13, 2014 [2 favorites]
Great to see this being discussed even if I can't follow all of it.
For anyone else wondering what just happened, nicolas.bray is the same Bray who worked with Pachter on the analysis of the Barabasi and Kellis papers.
And escabeche just did something clever with matrices or something that he liked.
Wow. Such matrix. So me-fi.
posted by memebake at 3:04 PM on February 13, 2014
It's pointless now, of course, but the Sherman-Morrison formula expresses how the inverse of a matrix changes when you add a rank-one matrix (for non-mathy folks, a matrix which is very simple in some sense) to it. ...
When I saw how simple the solution ended up being, I knew that there must be an easier way to get there but never got around to trying. That'll show me.
Eh, there's no kill like overkill. Or, why use a flyswatter when you've got a perfectly good elephant gun?
(I'm reminded of the number theory class in which we worked out a proof of the infinitude of primes that relied on Fermat's Last Theorem, or as the professor referred to it, "the celebrated result of Wiles.")
posted by PMdixon at 4:09 PM on February 13, 2014
Without venturing a comment on Barabasi, Kellis, networks, or any of the bigger issues here, it seems worth pointing out that, for the Barabasi critique at least, the fact that someone is able to come up with a quick approach that renders a long calculation totally obsolete and a little silly looking isn't prima facie evidence that the previous work was "nonsense". For instance, escabeche improved upon Bray's result in a way that made even it seem rather long-winded. This doesn't diminish the utility of the Pachter & Bray result, but it may diminish their suggestion that just because someone else found a quicker and much more direct method, anyone who missed it is peddling nonsense. This happens in science all the time -- we're always missing the short-cut that should have been so obvious. That's not to say that the rest of the paper might not have serious flaws as detailed in the rest of the blog post. But the fact that they missed something "obvious" in the matrix algebra doesn't in itself mean there's been misconduct, any more than the miniature version that happened here means it.
posted by chortly at 7:26 PM on February 14, 2014 [1 favorite]
So two things:
a) MeFi comments are not a peer reviewed publication. (or rather they are but only post publication)
b) There is a huge difference between taking a more roundabout route to an exact solution and claiming that no exact solution is feasible.
posted by PMdixon at 7:34 PM on February 14, 2014
Both true.
But as I took it, the question was less about where it was published, than about how we scale the mistake that was made. The infeasibility claim was certainly wrong, but only occupies about a line of the paper ("Equation (3) is exact and the sum accounts for all network paths connecting i and j ... It is of limited use, however, as it requires us to solve N^2 coupled algebraic equations.") The question is whether this was a huge mistake showing the work to be nonsense (and casting significant doubt on the reviewers and journal), or whether it was a more common mistake: stupid and embarrassing in retrospect, but more a part of the standard scientific process. What I liked about the exchange upthread was that, blog or journal, it showed that process of improvement and mild embarrassment in miniature. Oversights or mild mistakes aren't in and of themselves a disgrace. Nor is claiming that something can't be solved and then presenting an approximation, when in fact it could be solved; this too happens all the time. Maybe this is worse than that, but saying so I think requires that Pachter & Bray do a little more work explaining why this mistake is an especial violation of the scientific method. The fact that they solved it while one of them was in the bathroom isn't in itself an explanation (as well as being slightly belittling of Bray's work, to be honest).
posted by chortly at 8:07 PM on February 14, 2014
Without speaking for P&B, I would take it as a sign they don't take their work seriously, as it were. When such was my job, two pretty basic aspects of any model that you pay a ton of attention to were a) whether you could solve exactly or numerical approximation was required and b) sensitivity to noise. To the extent that P&B are correct in their claims, I take that as strong evidence that B (not interested in picking on grad students, so no claims re: other B) has a fundamentally unserious attitude and methodology. To lose one parent may be regarded as a misfortune; to lose both looks like carelessness, etc.
posted by PMdixon at 8:42 PM on February 14, 2014
I should clarify, none of my comments were meant to belittle Bray's work. He was the one who came up with the exact solution first, and deserves real credit for that; my only complaint in fact was that Pachter's larger point about Barabasi causes him to minimize Bray's work in order to be able to claim that Bray's solution is trivially obvious and therefore an egregious error by Barabasi. The fact that escabeche found an even simpler solution is just normal science, and in no way diminishes Bray's work. There's always a faster, better solution that one should have thought of; that's science. The only question was whether such a view covers the original paper's mistakes as well. But I don't mean to diminish Bray or P&B by comparing them to the original paper, especially for those who hold the original paper in low regard.
posted by chortly at 9:26 PM on February 14, 2014
I'm late to the party (or maybe, more oddly, coming back after it's over) but a few points:
1) "The infeasibility claim was certainly wrong, but only occupies about a line of the paper ... "
The claim of infeasibility is only a line but the entire paper is based on applying this model (kind of) and in particular the approximation that they come up with. You're right that science always involves improving on what came before but you need to consider the scale here: this was not some deep, complicated thing that they missed. This was sophomore-level math that should not take anyone more than a few hours. (Really, as escabeche points out, it shouldn't take more than a few minutes.)
And yet they published two papers in Nature about this model without managing to solve this, not to mention that it was first published in 2009. I think PMdixon's last comment here is spot-on.
2) "do a little more work explaining why this mistake is an especial violation of the scientific method"
While their failure to solve their model is a pretty large embarrassment for the authors, I don't think it's the biggest problem with their paper that Pachter and I pointed out. We also have:
*) Their results are very, very weak.
*) They use their model on data that doesn't arise from it.
*) Their noise assumptions seem very unrealistic.
*) The model itself may have some acute sensitivities to noise.
3) "my only complaint in fact was that Pachter's larger point about Barabasi causes him to minimize Bray's work"
A lot of people seem to be missing the fact that I co-wrote those posts with Pachter so there's no worries about him unfairly representing my contributions. And, believe me, the work I did on solving that model can't really be minimized because it was already fairly minimal. It was a kind of fun little application of the Sherman-Morrison formula but nothing more.
posted by nicolas.bray at 4:05 PM on February 18, 2014 [2 favorites]
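The noise-sensitivity point is easy to illustrate in the abstract: recovering S means inverting G, and inversion amplifies measurement error roughly in proportion to G's condition number. A toy illustration (my own construction, not anything from the paper):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50

def recover_S(G):
    # The exact recovery discussed upthread: S = I - D(1/diag(G^{-1})) G^{-1}.
    Ginv = np.linalg.inv(G)
    return np.eye(len(G)) - np.diag(1.0 / np.diag(Ginv)) @ Ginv

noise = 1e-3 * rng.standard_normal((n, n))  # small measurement error in G

for a in (0.1, 0.98):  # weak vs strong off-diagonal coupling
    # Off-diagonal entries all equal to a; a near 1 makes G ill-conditioned.
    G = np.eye(n) + a * (np.ones((n, n)) - np.eye(n))
    S_clean = recover_S(G)
    S_noisy = recover_S(G + noise)
    rel_err = np.linalg.norm(S_noisy - S_clean) / np.linalg.norm(S_clean)
    print(f"a={a}: cond(G)={np.linalg.cond(G):.1e}, rel. error in S={rel_err:.1e}")
```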
1) "The infeasibility claim was certainly wrong, but only occupies about a line of the paper ... "
The claim of infeasibility is only a line but the entire paper is based on applying this model (kind of) and in particular the approximation that they come up with. You're right that science always involves improving on what came before but you need to consider the scale here: this was not some deep, complicated thing that they missed. This was sophomore-level math that should not take anyone more than a few hours. (Really, as escabeche points out, it shouldn't take more than a few minutes.)
And yet they published two papers in Nature about this model without managing to solve this, not to mention that it was first published in 2009. I think PMdixon's last comment here is spot-on.
2) "do a little more work explaining why this mistake is an especial violation of the scientific method"
While their failure to solve their model is a pretty large embarrassment for the authors, I don't think it's the biggest problem with their paper that Pachter and I pointed out. We also have:
*) Their results are very, very weak.
*) They use their model on data that doesn't arise from it.
*) Their noise assumptions seem very unrealistic.
*) The model itself may have some acute sensitivities to noise.
3) "my only complaint in fact was that Pachter's larger point about Barabasi causes him to minimize Bray's work"
A lot of people seem to be missing the fact that I co-wrote those posts with Pachter so there's no worries about him unfairly representing my contributions. And, believe me, the work I did on solving that model can't really be minimized because it was already fairly minimal. It was a kind of fun little application of the Sherman-Morrison formula but nothing more.
posted by nicolas.bray at 4:05 PM on February 18, 2014 [2 favorites]
There's a new post up where Pachter/Bray try to explain deconvolution to readers who don't understand maths very well (unsuccessfully in my case), then respond to the response to one of their network nonsense barrages. I still have very little idea what deconvolution is but I find the back-and-forth fascinating (as far as I can tell Pachter/Bray seem to be winning).
posted by A Thousand Baited Hooks at 4:32 AM on February 19, 2014
I lol'd a bit at "only requires high school algebra" and yet they're talking about eigenvalues. Maybe eigen-decomposition is covered in Hungarian high schools, but not usually here in the USA.
posted by en forme de poire at 8:59 AM on February 19, 2014 [1 favorite]
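For anyone curious what the eigen-decomposition actually does there: as I read the Feizi et al. setup described in Pachter and Bray's deconvolution post, network deconvolution inverts the series G_obs = G_dir + G_dir^2 + ..., which has the closed form G_dir = G_obs(I + G_obs)^{-1}; for a symmetric matrix that amounts to rescaling each eigenvalue lambda to lambda/(1 + lambda). A minimal numpy sketch of that closed form on toy data (my own reconstruction, not the published code):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 6

# Toy symmetric "direct" network, scaled so its spectral radius is 1/2
# and the geometric series of indirect effects converges.
A = rng.standard_normal((n, n))
G_dir = (A + A.T) / 2
G_dir /= 2 * np.max(np.abs(np.linalg.eigvalsh(G_dir)))

# Observed network: direct effects plus all indirect (multi-step) ones.
G_obs = G_dir @ np.linalg.inv(np.eye(n) - G_dir)

# Deconvolution: rescale each eigenvalue of G_obs from lambda to
# lambda / (1 + lambda), i.e. apply G_obs (I + G_obs)^{-1}.
lam, U = np.linalg.eigh(G_obs)
G_hat = U @ np.diag(lam / (1 + lam)) @ U.T

print(np.allclose(G_hat, G_dir))  # True: direct network recovered
```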
(I don't know if either Pachter/Bray spent any time in Hungary, btw. That was supposed to be a joke about Hungarians producing a lot of mathematicians but it came off a little opaque, sorry.)
posted by en forme de poire at 9:00 AM on February 19, 2014
FWIW, I do still think that relying on the dictionary definition of fraud is a little inappropriate here, because as I understand it "scientific fraud" is really a term of art for a specific type of ethical breach, usually limited to data falsification or fabrication. That's the kind of stuff that will get a lab shut down and will permanently end careers. I still think this is more of a case of sloppy work and overselling the results than either of those two latter scenarios, and I think it's actually enough to say that the paper is misleading and incorrect - that's bad enough.
I also worry a little that calling it "fraudulent" might diminish the more general problem of overselling claims in computational biology. It could make it sound like this is about one or two bad actors committing a clear, open-and-shut rules violation, rather than an overall worrying trend in bioinformatics of burying important details in a 100 page supplement that gets very little rigorous attention; blustering papers past reviewers and editors who are intimidated by, as opposed to competent to assess, the methods; pumping out hyperbolic press releases that imply that your method is more universal than it is; ignoring alternative approaches instead of comparing to them, etc., etc. These are problems that high-profile bioinformatic methods are plagued by and it moves the competitive bar unfairly (I think my trust in a bioinformatic method is almost anticorrelated with impact factor now, or at least moves in that U-shaped curve I described above).
That said I think Pachter and Bray have done really good work here. I know how frustrating, time-consuming, and unrewarding it can be to figure out what, exactly, one of these methods is doing "under the hood" and whether that matches up to the mathematical and biological claims it's making, and they seem to have done an extremely thorough job here. (Another case of "forensic bioinformatics"?)
posted by en forme de poire at 9:17 AM on February 19, 2014 [3 favorites]
Yeah, unfortunately I think if the problem were actually due to intentional bad actors it would be much easier to deal with than the terrible incentives that actually drive this.
posted by PMdixon at 10:19 AM on February 19, 2014 [2 favorites]
This thread has been archived and is closed to new comments