DNA’s Dirty Little Secret
March 6, 2010 1:47 PM

 
Related post.
posted by homunculus at 1:47 PM on March 6, 2010


I teach probability to first-year college students and let me tell you, you can't get a good grade on my exam if you can't explain the Prosecutor's Fallacy.

Great article, one which unfortunately will have to be written again and again until the general public gets used to the main ideas of conditional probability.
posted by escabeche at 2:03 PM on March 6, 2010 [4 favorites]


The California DNA database now contains approximately 1.2 million convicted people. And it's about to grow dramatically. In January 2009, Attorney General (and now gubernatorial candidate) Jerry Brown announced a "major expansion" of the California database, as the state began to add the DNA of not just those convicted of a crime, but those merely arrested.
posted by CheeseDigestsAll at 2:06 PM on March 6, 2010


The more I read about forensics, the more I think the whole field is a bad joke.
posted by sevenyearlurk at 2:19 PM on March 6, 2010


Reading the article, it seems like the example that they're using was more a case of sloppy forensics than anything else. They didn't adhere to their own standards for matches, using only five and a half markers, when the standard is 12 markers and California's felony database requires 7 for a match. So it's not the fault of the tool; it's the fault of police, forensics experts and prosecutors using a tool incorrectly.
posted by Punkey at 2:22 PM on March 6, 2010 [3 favorites]


Related (interesting/funny) TED Talk. (Peter Donnelly on stats and juries)

also:
It's not the "forensic sciences" that are a joke... it's the ability of the people involved, after the fact, to twist statistics. A misinformed public is, once again, the problem.
or:
Don't hate computers because they have the potential to get a virus; improve the knowledge of users, so they don't click randomly every time something pops up on screen.
posted by infinite intimation at 2:25 PM on March 6, 2010 [1 favorite]


Bonnie Cheng, the crime lab technician who did the analysis, argued that this was not a significant stumbling block—outside of the one marker, where she acknowledged a mixture was present, she testified that it was "highly unlikely" that there was much mingling of genetic material. This runs counter to the views of most experts, who insist that mixed samples tend to be blended throughout, making it exceedingly difficult to separate one person’s DNA from another.

This sort of thing seems to happen so often that I'd be tempted to conclude, were I on a jury and I heard an expert testify that p, that almost certainly ¬p. Especially if p concerned the validity of the testing procedure itself.

And this …

Rather than try to sort out the disparities between its numbers and database findings, the FBI has fought to keep this information under wraps.

Way to serve justice, guys.
posted by kenko at 2:27 PM on March 6, 2010 [5 favorites]


Punkey, the article also establishes that analyses of the DNA databases are starting to reveal significant numbers of pairs of people matching 9 or more markers, randomly. Sloppy forensics aside, the growing size of these databases means that it becomes more and more likely that any one person's DNA markers could randomly match another's to a misleading degree.
posted by chudmonkey at 2:28 PM on March 6, 2010


NTM the judge's bizarre conduct.
posted by kenko at 2:31 PM on March 6, 2010


It sounds like part of the problem was the 1/1,100,000 figure. That's the probability of the DNA matching any particular other sample by accident. But people got confused and thought it was the chance of matching any one of the entire set of samples.

So in fact the probability of at least one accidental match would have been, I think, 1 - (1 - 1/1,100,000)^n, where n is the size of the database. With 1 million samples, that's actually 59%, and if you go up to 10 million samples it's almost certain.

(I think that's the formula, because the odds of no match are (1 - p), the odds of no match in n samples are (1 - p)^n, and the odds of one or more matches would then be 1 - (1 - p)^n.)
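
A quick sanity check of that formula, sketched in Python (using delmoi's p and the two database sizes above; this assumes the samples are independent):

    # Probability of at least one accidental match among n independent
    # samples, each matching with probability p = 1/1,100,000.
    p = 1 / 1_100_000
    for n in (1_000_000, 10_000_000):
        print(n, 1 - (1 - p) ** n)
    # 1000000  0.5971...  (delmoi's 59%)
    # 10000000 0.9998...  (almost certain)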

So yeah, people not understanding statistics is a big problem.
posted by delmoi at 2:36 PM on March 6, 2010 [3 favorites]


That's why the standard is 12, chudmonkey. 7 or 9 is simply enough to say "Hey, it might be this person, you might want to talk with them". Anything lower than a full 12-marker match should not be presented as proof of guilt in a court of law, and doing so is slipshod prosecution and police work.

That's the real problem at stake here. DNA testing, when done correctly, is as good as advertised; the math bears it out. But prosecutors and police, in their zeal to find, arrest and convict someone, get ahead of the evidence, and the public tends to go along with them. Like kenko said, way to serve justice, guys.
posted by Punkey at 2:39 PM on March 6, 2010


Punkey, we don't disagree about the legitimacy of DNA testing or the need for law enforcement to behave better, but if one database of ~1,000,000 people has 122 9-marker-matching pairs in it, and the database itself is set to grow (potentially severalfold), then a random 12-marker match is hardly outside the bounds of reality. (And yes, I get how much less statistically likely it is to match 12 markers than 9, but still...)

The point is that these "cold hits" shouldn't be allowed. If there isn't some other testimony or evidence connecting an individual to a crime, the DNA matches shouldn't be allowed as the only evidence. Like with OJ Simpson... if OJ hadn't been the disgruntled ex-husband of one of the victims, the DNA match would have been way too flimsy on a scientific basis.
posted by chudmonkey at 2:49 PM on March 6, 2010


...And obviously, jurors should be allowed to have ALL the information on the statistical likelihood of matches vs. mis-matches, etc.
posted by chudmonkey at 2:51 PM on March 6, 2010


Exactly! I agree completely, chudmonkey. It's always gratifying to find out you're arguing two sides of the same point. I never said that DNA hits, even the full-on 12-marker hits, should be the end-all-be-all of evidence in a trial, even if it's for a speeding ticket. DNA just completes one-third of the Motive-Means-Opportunity triangle; you still have to explain why they did it and how. If we really want to be cheesy about it, identical twins provide all the proof you'll ever need. Just because you're a DNA hit doesn't mean you did it; it just means that you were there.
posted by Punkey at 3:14 PM on March 6, 2010


Maybe now would be a good time to stop executing people! Since, you know, we're human and unable to be perfect even with DNA evidence.
posted by sallybrown at 3:14 PM on March 6, 2010 [5 favorites]


DNA’s Dirty Little Secret: A forensic tool renowned for exonerating the innocent may actually be putting them in prison, ensuring that violent criminals are free to commit again.

You see, if you frame it that way and convince the "low-information voters" of it, then you actually have some chance of accomplishing something. People don't generally care if the innocent are accused, no matter what the rule of law says. They just want to know that they/their sister/their wife/their daughter is not going to be the next victim.
posted by swimming naked when the tide goes out at 3:16 PM on March 6, 2010 [4 favorites]


He also volunteered to give them a fresh sample of his DNA. When it matched the sample in the database, he was arrested and charged with first-degree murder.

Never volunteer information to police when it may incriminate you. Ever. If they want it, make them compel you.
posted by secret about box at 3:16 PM on March 6, 2010 [8 favorites]


Punkey wrote: "Anything lower than a full-point match should not be presented as proof of guilt in a court of law, and doing so is slip-shod prosecution and police work."

If there is already reason to suspect a particular person, less than a full match may be indicative of guilt, presuming that the remaining markers were merely untestable, rather than a negative match. It should not, however, be relied on as the sole basis for a conviction.

That said, as long as so-called experts are overinflating the statistical probability that a given sample is in fact the defendant's DNA, it probably shouldn't be allowed. As soon as they are willing to go up there and say "it's probably the defendant's DNA, but there's a 30% chance it's not," I think it would again have some validity.
posted by wierdo at 3:26 PM on March 6, 2010


That's why the standard is 12, chudmonkey.

Off-topic and all, but I love that this sentence is actually not impolite.
posted by kenko at 3:34 PM on March 6, 2010 [6 favorites]


...And obviously, jurors should be allowed to have ALL the information on the statistical likelihood of matches vs. mis-matches, etc.

From a different article regarding the same case:
"Through dueling experts, the prosecution and defense offered jurors a dizzying array of numbers to consider in weighing the DNA match. (...)

The defense team countered with its own experts, who made different calculations based on more conservative assumptions. Their numbers, 1 in 16,400 and 1 in 40,000, showed a greater, albeit still remote, chance of finding a coincidental match. Jurors would have to decide for themselves which numbers best fit the evidence. (...)

Interviewed outside court after the verdict, jurors said they had struggled to weigh the different statistics. One said that the "likelihood ratio" was appealing because its name made sense in plain English.

In the end, however, jurors said they found the 1-in-1.1-million general-population statistic Merin had emphasized to have been the most "credible" and "conservative." It was what allowed them to reach a unanimous verdict."

If there isn't some other testimony or evidence connecting an individual to a crime, the DNA matches shouldn't be allowed as the only evidence.

From the article linked in the original post:
"Still, the prosecution had some chilling circumstantial evidence to present; in 1977, Puckett had been convicted of raping two women and sexually assaulting a third, crimes for which he later served eight years in prison. Because the revelation of past offenses is highly prejudicial, most courts keep these details from jurors. But California allows prosecutors to present this information to show that a crime matches a pattern of offense. In Puckett’s case, all three victims were brought in to testify. Each of them described how Puckett had conned his way into their cars by posing as a police officer and got them to drive out to a deserted area. Using a knife or an ice pick as a weapon, he then forced them to perform oral sex. "He … grabbed my throat, and I started to scream," recalled one victim. "He started to squeeze and telling me to shut up, and then I felt a knife at my throat."

Also, this article by David H. Kaye:
"The probability of a coincidental match means the chance that Mr. Puckett's DNA would match (and no other DNA in the database would) if he were not the killer and if he were unrelated to the killer.

This definition refers to the probability of the DNA evidence given the hypothesis of coincidence. Again, neither 1 in 1.1 million nor 1 in 3 expresses this value, but 1 in 1.1 million is a far closer estimate than is 1 in 3. (...)

The media portrait of the database-trawl issue bears but a faint resemblance to the peer-reviewed statistical literature on the subject."
posted by iviken at 3:40 PM on March 6, 2010


the growing size of these databases means that it becomes more and more likely that any one person's DNA markers could randomly match another's to a misleading degree.

...well, yes! exactly! Plus, any "match" with odds of one in 1.1 million means that about 300 people (probably more) within the United States alone will be a match. Meaning that using this as a basis for proving guilt easily fails the "beyond a reasonable doubt" standard.
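
The back-of-the-envelope version (assuming a rough 2010 US population of 300 million; partial profiles and relatives would push the count higher):

    # Expected number of Americans matching a 1-in-1.1-million profile by chance
    print(300_000_000 / 1_100_000)  # about 273 people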

This needs to be made public as widely as possible so potential jury members know not to fall for this misleading tactic.
posted by hippybear at 3:43 PM on March 6, 2010


If you tried a lot of the things they do in forensics in the pharma sector, an FDA auditor would rip out your heart and take a bite out of it before you had time to fall down. Just saying.

Also, this.
posted by Kid Charlemagne at 4:17 PM on March 6, 2010


I thought I was reasonably well informed about this kind of thing, but I had no idea that the defense wouldn't be allowed to challenge the prosecution's statistics. That's completely unjustifiable. I thought that they could mount any reasonable defense that they wanted. It's been a while since I've been truly shocked by some new revelation about our messed up society, but that did it.
posted by Pater Aletheias at 4:21 PM on March 6, 2010


Mikey-san: Never volunteer information to police when it may incriminate you. Ever. If they want it, make them compel you.

Holy shit, that should be an FPP!
posted by P.o.B. at 4:24 PM on March 6, 2010


In particular, they recommend multiplying the FBI’s rarity statistic by the number of profiles in the database, to arrive at a figure known as the Database Match Probability. When this formula is applied to Puckett’s case (where a profile with a rarity of one in 1.1 million was run through a database of 338,000 offenders) the chances of a coincidental match climb to one in three.

This is amazing. I consider myself to be a pretty savvy and well informed person, but the things one can do with statistics just boggle the mind.
posted by SLC Mom at 4:29 PM on March 6, 2010


iviken wrote: "Each of them described how Puckett had conned his way into their cars by posing as a police officer and got them to drive out to a deserted area. Using a knife or an ice pick as a weapon, he then forced them to perform oral sex. "He … grabbed my throat, and I started to scream,""

That testimony should never have been allowed, and even if it was, the jury should have disregarded it, as the MO in the two cases is completely different.
posted by wierdo at 4:50 PM on March 6, 2010


I have been sitting here puzzling over the half marker, not understanding what a half marker is. I should understand what a half marker is since in my lab we use the same markers to monitor patients who have received a bone marrow transplant.

Routinely we test only four markers: since we have already tested both the marrow donor and the pre-transplant patient, this is most of the time sufficient to establish whether we have a pure donor specimen, a pure recipient specimen, or a mixture of donor and recipient.

Donor profile: ________|___|_____ (one locus)


Recipient profile: _____|___|_________


After transplant: _____|___|___|_____

Each of the vertical lines represents one allele, and normal people have two alleles (let's not discuss the number of alleles at the moment; I'm aware that occasionally there can be more than two). After transplant we have at this locus the presence of three alleles, and the height of the peaks would indicate the % of donor and recipient. For example, if the middle peak were twice the height of the two other ones, the mixture would be 50-50 (it's a bit more complicated, since we actually measure peak area). The horizontal line represents how far the allele traveled on an electrophoretic medium.
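
As a toy illustration of that percentage calculation, sketched in Python (hypothetical peak areas; a real assay also has to handle shared alleles, stutter, and amplification bias):

    def percent_donor(donor_only_area, recipient_only_area):
        # Estimate the donor fraction at one locus from the informative
        # (non-shared) peaks only -- a simplification of the real method.
        return donor_only_area / (donor_only_area + recipient_only_area)

    # Donor-specific and recipient-specific peaks of equal area (with the
    # shared middle peak roughly twice their height) -> a 50-50 mixture.
    print(percent_donor(1000.0, 1000.0))  # 0.5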

The markers were originally chosen because of their variety of alleles (13 or more different alleles, their DNA composition differing in the number of junk repeats they contain, thus traveling further on the electrophoretic medium or lagging behind other alleles).

So, at any given locus you can have anything from one peak to four: one peak is totally uninformative, because it means there is only one variety of allele (the same on each of the two donor chromosomes and the two recipient chromosomes), while four peaks appear when the two alleles of the donor differ from each other and from the alleles of the recipient.

Half a marker? I'm stumped.
posted by francesca too at 5:09 PM on March 6, 2010 [2 favorites]


iviken, the bit you quoted isn't very relevant to the situation at hand.
"The probability of a coincidental match means the chance that Mr. Puckett's DNA would match (and no other DNA in the database would) if he were not the killer and if he were unrelated to the killer.

This definition refers to the probability of the DNA evidence given the hypothesis of coincidence. Again, neither 1 in 1.1 million nor 1 in 3 expresses this value, but 1 in 1.1 million is a far closer estimate than is 1 in 3. (...)
I'm not sure what exactly the criticism is. It's true that the probability of a coincidental match for that specific person is 1 in 1.1 million, and the probability of any coincidental match is one in three.

Imagine handing out lottery tickets to 338,000 people, each with a 1/1,100,000 chance of winning. Then there's a 1 in 3 chance that someone will win, except instead of a million dollars, they get life in jail.

That's essentially what people think happened. It could have been Puckett, or it could have been someone else.
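
A rough Monte Carlo version of that lottery, sketched in Python with NumPy (it treats the 338,000 "tickets" as independent draws, which is an assumption):

    import numpy as np

    # Hand each of the n database profiles a 1-in-1.1-million "ticket" and
    # count how often at least one of them "wins" a coincidental match.
    rng = np.random.default_rng(0)
    p, n, trials = 1 / 1_100_000, 338_000, 100_000
    wins = rng.binomial(n, p, size=trials)  # coincidental matches per search
    print((wins >= 1).mean())  # about 0.26: between 1 in 3 and 1 in 4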


Then there's this:
Definition 2. The probability of a coincidental match means the chance that Mr. Puckett's DNA would match (and no other DNA in the database would) if he were not the killer and if he were unrelated to the killer.
...
This definition refers to the probability of the DNA evidence given the hypothesis of coincidence. Again, neither 1 in 1.1 million nor 1 in 3 expresses this value, but 1 in 1.1 million is a far closer estimate than is 1 in 3.
That would be nice, but according to the linked article that didn't happen; in fact, they found 9 other matches, or possibly 122. Remember, with this 'half marker' or whatever, there were additional possible sequences.
Definition 3. The probability of a coincidental match means the chance that one or more DNA profiles in the database would match if no one in the database is the killer.

This definition refers to the probability of one or more hits in the database given that the database is innocent. This probability is approximately 1 in 3.
Exactly. 1 in 3. As far as I can tell, the "definition 2" provided is totally spurious.
posted by delmoi at 5:27 PM on March 6, 2010


In Puckett’s case the actual chance of a false match is a staggering one in three, according to the formula endorsed by the FBI’s DNA advisory board and the National Research Council, a body created by Congress to advise the government and the public on scientific issues. But the jury that decided Puckett’s fate never heard that figure. In fact, his lawyers were explicitly barred from bringing it up.

What the fuck is up with a judge who won't let the jury hear something like that?

Bonnie Cheng, the crime lab technician who did the analysis, argued that this was not a significant stumbling block

Reminds me of the case that was on an episode of This American Life, where a forensics tech testified at the trials of three different defendants that the blood type didn't match that defendant, but that it did match the other two.

Perjurious witnesses and the prosecutors who enlist them need jail time. Instead, they walk away. On rare occasions they are disbarred, but that seems to be the limit. And even then they cry about the injustice of expecting them to be honest.
posted by Jimmy Havok at 6:01 PM on March 6, 2010


Well, this is all coming from the FBI's supposed expertise, just like the miracle tool marks, etc., ad nauseam. It's policy, carved in stone and set in concrete, that the FBI will lie, cheat, etc., to never be found to be wrong about anything. It's an institution not just averse to reform, but immune to it.

Given the current political climate, it's unlikely a bill could get through Congress requiring an audit of the FBI database. Maybe, but not likely, since only a bleeding heart liberal who loves letting violent criminals roam the streets would ever question the nation's political police.

It could come about as the result of a lawsuit or be ordered by a humane and intelligent federal judge. There are a few of those and one is enough, unlike the congresscritters. How that would come about, I have no idea.

Just another example of things going wrong and enough people in power with a vested interest in keeping them that way.

But there ought to be some remedy.
posted by warbaby at 6:57 PM on March 6, 2010


That would be nice, but according to the linked article that didn't happen; in fact, they found 9 other matches, or possibly 122. Remember, with this 'half marker' or whatever, there were additional possible sequences.
That's not what the article said at all. The article said that when the evidence from the Sylvester case was run against the California DNA database, there was "a" match -- Puckett.

A search of the Arizona database had yielded 122 pairs of partially-matching (9 or more markers) records, in all. This was unrelated to the Sylvester case.
posted by planet at 7:07 PM on March 6, 2010


"A search of the Arizona database had yielded 122 pairs of partially-matching (9 or more markers) records, in all. This was unrelated to the Sylvester case."

Unrelated, but concerning, no? Just because there were no extra candidates in the California case doesn't mean that they pinned the right guy.
posted by Sportbilly at 8:30 PM on March 6, 2010


The judge's pretrial rulings seem completely wrong. I'll be interested to see what the appellate courts do on this one -- though likely, by the time the case makes its way through the system, the defendant will be dead.

I wonder if the defense attorneys tried to make an offer of proof by using another expert, such as a statistician, who could have then explained the evidence in such a way that the judge could be convinced a jury would understand it.
posted by Happydaz at 9:28 PM on March 6, 2010


The case here, not allowing an explanation of the information, seems very odd, like the woman recently who could not speak to the fact that she was assaulted... Situations where juries are given less than full information seem like bad ideas, especially with the number of people who turn around and say that "only the jury really 'knew' what happened in the courtroom," and who suggest that outsiders are "just Monday-morning quarter-jurying"... sometimes the outside world really does know more than the jury did.
Really (re)recommending the TED Talk I mentioned above... The speaker applies statistical methods to genetic data by day, spurring advances in disease treatment and insight into our evolution. (I severely undersold his explanation of the issues involved; he explains the numbers games at work.) I think he gets at the basis of the question with this...

He speaks to a very simple slip in logic/maths we all make (the TED audience raised their hands to say that the two patterns below would take, on average, the same number of flips to come up in a sequence of coin tosses, which he explains is a common misconception about randomness).
He details:

Where (H) is heads and (T) is tails:
HTH is a self-overlapping pattern (its final H can begin the next occurrence)...
HTT is not.

Therefore, when you seek a particular chunk of code to "match" as a goal, one pattern or the other can become more or less common: on average it takes 10 flips before HTH first appears, but only 8 before HTT does, even though any fixed three-flip window is equally likely to show either. Random isn't so chancy once you are searching a stream for particular patterns, and the size and self-overlap of the chunk you are selecting can increase or decrease the chance of seeing a coincidental hit.
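
A small simulation of that point, sketched in Python (the 10-versus-8 averages are the standard result Donnelly describes):

    import random

    # Average number of coin flips until a pattern first appears. HTH overlaps
    # with itself (its final H can start the next occurrence), HTT does not,
    # so HTH takes longer to show up for the first time: about 10 flips vs. 8.
    def flips_until(pattern, rng):
        seq = ""
        while not seq.endswith(pattern):
            seq += rng.choice("HT")
        return len(seq)

    rng = random.Random(0)
    for pattern in ("HTH", "HTT"):
        avg = sum(flips_until(pattern, rng) for _ in range(50_000)) / 50_000
        print(pattern, round(avg, 1))  # HTH ~ 10.0, HTT ~ 8.0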



Simple Sequence Length Polymorphisms (SSLPs) are used as genetic markers with Polymerase Chain Reaction (PCR). An SSLP is a type of polymorphism: a difference in DNA sequence amongst individuals. SSLPs are repeated sequences over varying base lengths in intergenic regions of Deoxyribonucleic Acid (DNA). Variance in the length of SSLPs can be used to understand genetic variance between two individuals in a certain species.


The term restriction fragment length polymorphism, or RFLP (commonly pronounced “rif-lip”), refers to a difference between two or more samples of homologous DNA molecules arising from differing locations of restriction sites, and to a related laboratory technique by which these segments can be distinguished. In RFLP analysis the DNA sample is broken into pieces (digested) by restriction enzymes and the resulting restriction fragments are separated according to their lengths by gel electrophoresis.
(Wow: with no context and a skilled lawyer, I can see how the sciences involved can be made to resemble reading tea leaves, or conversely be venerated as flawless and infallible determinants of guilt. As pointed out above, a logically sound determination of guilt beyond all reasonable doubt must still account for Motive-Means-Opportunity, with full explanations. These genetic sciences are actually extremely interesting once you read more of the science and realize the vast variety of scales that can be examined through DNA: an individual's history, familial relations, ancestry, and even markers from specific events that can signify where a particular distant ancestor was at some point in time.)
posted by infinite intimation at 11:36 PM on March 6, 2010


Mikey-san: Never volunteer information to police when it may incriminate you. Ever. If they want it, make them compel you.

P.o.B.: Holy shit, that should be an FPP!

It was an FPP, and there is some extra stuff in that post besides the excellent video of the "Don't Talk to the Police" lecture given by Law professor James Duane.
posted by electricinca at 6:22 AM on March 7, 2010 [2 favorites]


I'm glad they don't have jury trials here...
posted by Pendragon at 8:05 AM on March 7, 2010


In particular, they recommend multiplying the FBI’s rarity statistic by the number of profiles in the database, to arrive at a figure known as the Database Match Probability. When this formula is applied to Puckett’s case (where a profile with a rarity of one in 1.1 million was run through a database of 338,000 offenders) the chances of a coincidental match climb to one in three.

Isn't this completely wrong? By that calculation, running the profile through a database of 1.2 million would give a probability of over one. I hope that's the article's fault and not an accurate representation of the metric they use.
posted by revfitz at 1:06 PM on March 7, 2010


The more I read about forensics, the more I think the whole field is a bad joke.

No, no, no, it's just like CSI, with all the cool monitors and modern looking offices and really, really good scientists who can do any kind of chemistry or physical test, even the ones that don't really exist, except that Hollywood spends more on this shit than the states do, and Hollywood's stuff is just props that don't really function. So, yeah, maybe.
posted by Mental Wimp at 2:15 PM on March 7, 2010


the things one can do with statistics just boggle the mind.

...which is why I became a statistician. I try to use my powers for good only.
posted by Mental Wimp at 2:25 PM on March 7, 2010


revfitz: It's actually a reasonably good approximation. A more precise calculation would be this: suppose the chance of two given profiles matching is p, in this case 1/1,100,000, and suppose the database contains n profiles, in this case 338,000. Then the probability of your sample not matching a specific profile in the database is (1 - p), so the probability that your sample matches none of the database profiles is (1 - p)^n. This can be expanded using the binomial theorem; you then get a sum that begins 1 - pn + p^2 n(n-1)/2 - ... The third term there is approximately (pn)^2/2, the fourth term about (pn)^3/6, and so on. This means that as long as pn is a good amount smaller than 1, the entire sum can be approximated as 1 - pn. This was the probability of your sample not matching anything in the database, so the probability of your sample matching at least one profile will be about pn. In this case, pn is about 1/3, which is smaller than 1, but still big enough that the approximation starts to break down. Wolfram Alpha tells me that the actual probability, using this calculation, is about 1/4.
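
Here is that comparison numerically, sketched in Python (the 1,200,000 figure is the California database size mentioned earlier in the thread):

    # Exact probability of at least one coincidental hit, next to the
    # first-order "Database Match Probability" approximation pn.
    p = 1 / 1_100_000
    for n in (338_000, 1_200_000):
        exact = 1 - (1 - p) ** n
        print(f"n = {n:>9,}:  pn = {p * n:.3f}   exact = {exact:.3f}")
    # n =   338,000:  pn = 0.307   exact = 0.265
    # n = 1,200,000:  pn = 1.091   exact = 0.664  (pn overshoots 1, as revfitz noted)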

By the way, the number pn has another important role -- it is the expected value of the binomial distribution, the model I used to do the calculation above. This means that if you took a bunch of samples and ran them against the database, noting the number of matches you got for each sample, then the average number of matches will be approximately pn.
posted by Tau Wedel at 3:32 PM on March 7, 2010



