The dangers of A/B testing
February 23, 2014 2:39 PM

A/B testing has become a familiar term for most people running web sites, especially e-commerce sites. Unfortunately, most A/B test results are illusory (PDF, 312 kB). Here's how not to run an A/B test. Do use this sample size calculator or this weird trick.
posted by Foci for Analysis (38 comments total) 81 users marked this as a favorite
 
Some of my friends are web analysts for major sites, and they were half talking shop/half teaching me about the shop this week, and one of the big takeaways from one of them was that if you're doing A/B testing, you've got to do it over the course of several months at the least to know anything worthwhile. (Maybe that's a really simple observation. I don't know. Not my field.)
posted by Navelgazer at 4:27 PM on February 23, 2014 [4 favorites]


In this article I’ll show that badly performed A/B tests can produce winning results which are more likely to be false than true.

Sounds like I should just look at the results of A/B tests and do the opposite.
posted by charlie don't surf at 4:38 PM on February 23, 2014 [2 favorites]


Well, everyone seems to have trouble with stats. But it's still the right direction to at least try to measure rather than just guess.
posted by sammyo at 5:46 PM on February 23, 2014 [1 favorite]


This seems less about how A/B tests are bad and more how people don't understand how to run experiments. Or, perhaps more troubling, that people building the testing platforms don't know how to run experiments.
posted by Going To Maine at 5:56 PM on February 23, 2014 [4 favorites]


I dunno, A/B testing works for us in terms of optimizing conversion rates on-site, or reducing ad spend, etc.

I think the key is to regard A/B testing as just one indicator, but to rely, when sensible, on common sense, and, to some, small extent, intuition.

It's really more important to "do the right things" and measure over several months. In our experience, we see an uptick in traffic and conversions.

I have also noticed that data from A/B testing is pretty useful for internal influence within an organization. It can be pretty difficult to implement any sort of (according to web marketers) positive change internally, but the data, as they say, does not lie. Heh.

So for some orgs there is a political element to deploying A/B tests.
posted by KokuRyu at 6:09 PM on February 23, 2014 [4 favorites]


I think some commenters are responding to a misquote of the linked white paper. The real title is "Most Winning A/B Test Results are Illusory." The implication is not that A/B tests are worthless. The implication is that, even if you do an A/A test, random chance dictates that there will sometimes be "wins" that show that product A does better than itself. A properly set up test needs to account for that.
posted by muddgirl at 6:20 PM on February 23, 2014 [4 favorites]
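
To illustrate muddgirl's point, here's a minimal sketch (illustrative Python, not taken from the paper) that runs a batch of simulated A/A tests -- the same true conversion rate in both arms -- and counts how often a naive significance test declares a "winner" anyway:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    p, n, trials = 0.05, 10_000, 1_000       # identical conversion rate in both arms
    false_wins = 0
    for _ in range(trials):
        a = rng.binomial(n, p)                # conversions in arm A
        b = rng.binomial(n, p)                # conversions in arm B
        pooled = (a + b) / (2 * n)
        se = np.sqrt(pooled * (1 - pooled) * 2 / n)
        z = (b / n - a / n) / se
        if 2 * stats.norm.sf(abs(z)) < 0.05:  # two-sided p-value under 0.05
            false_wins += 1
    print(f"A/A tests that 'won': {false_wins / trials:.1%}")   # roughly 5%

With a 5% significance threshold, roughly one in twenty A/A tests comes up as a "win" purely by chance, which is the baseline rate of spurious results the comment is describing.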


After reading the article this seems to be a continuation of the theme that people always assume test results are infallible. This is especially relevant in my field, medicine, but I see it everywhere.

Junk in, junk out.
posted by hobo gitano de queretaro at 6:23 PM on February 23, 2014 [2 favorites]


If that one weird trick isn't acai berries, I'm going to be disappointed.
posted by bicyclefish at 6:25 PM on February 23, 2014 [11 favorites]


What's an A/B test?
posted by Sand at 6:39 PM on February 23, 2014 [1 favorite]


I hate, loathe and utterly detest A/B testing.

I work for a company that has decided all features need to be A/B tested before being released to 100% of our user base. I work on the mobile product (specifically iPhone) and A/B testing has effectively guaranteed that the mobile product feels fractured, confusing and inconsistent.

On top of that I feel A/B testing discourages boldness and risk taking. You end up with all features becoming small, incremental, uninnovative changes developed over a short development period (why take the risk of working on something large if the data might prove negative?).

I miss the days of just making something awesome and polishing it up with internal betas.
posted by schwa at 6:49 PM on February 23, 2014 [9 favorites]


A/B testing is the practice of creating two versions of an element on a webpage (say, the logo at the top of Metafilter) and showing each version to a segment of your audience to see which one performs better (i.e. gets more clicks on the logo). Let's say you're not sure if the Metafilter logo should be blue/greenish or purple. You can set up an A/B test that shows half your audience the purple logo and the other half the blue/green logo, and then measure how many people from each audience click the logo.

There's more to A/B tests than just that but that's the basic idea.
posted by chrominance at 6:51 PM on February 23, 2014 [1 favorite]
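
For the curious, the mechanics chrominance describes fit in a few lines. This is a toy sketch with made-up names (variant, record_pageview), not any particular tool's API:

    import hashlib

    def variant(user_id: str) -> str:
        """Deterministically split users 50/50 between the two logos."""
        digest = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
        return "A" if digest % 2 == 0 else "B"

    views = {"A": 0, "B": 0}
    clicks = {"A": 0, "B": 0}

    def record_pageview(user_id: str, clicked_logo: bool) -> None:
        v = variant(user_id)
        views[v] += 1
        clicks[v] += int(clicked_logo)

    # After the test has run long enough, compare click-through rates:
    # for v in ("A", "B"):
    #     print(v, clicks[v] / views[v] if views[v] else 0.0)

The hashing keeps each returning user in the same bucket; deciding when the test has "run long enough" is exactly where the linked paper says most people go wrong.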


To give them credit, a vendor of A/B testing software has recently been trying to come to terms with this problem [10]. It would appear that their customers have been asking why even A/A tests seem to produce just as many winning results as A/B tests!
Well that's a clever way to test these tools!
posted by pwnguin at 6:53 PM on February 23, 2014 [1 favorite]


You end up with all features becoming small, incremental, uninnovative changes

But surely, if you're talking about updating a program, most people would actually prefer that? I know I certainly frigging would, e.g. with Windows, Office and the OS.
posted by smoke at 7:04 PM on February 23, 2014 [4 favorites]


Sand, Chrominance

...and the concept isn't limited to web page optimization, either.

Popcap Games recently ran a much-maligned A/B Test to determine whether they could squeeze more microtransaction money from Plants vs. Zombies 2 fans. Unfortunately for Popcap, the infringement on a venerable element of the series' gameplay generated such disgust and distaste even among the limited test group that they were forced to withdraw it... with prejudice, I'd wager.
posted by The Confessor at 7:05 PM on February 23, 2014 [2 favorites]


But surely, if you're talking about updating a program, most people would actually prefer that?

Not if you're talking constant change. And even though the changes could be small, they could still cause confusion. Maybe moving some button to some other screen can increase "engagement" but it can still confuse users who had learnt where that button was.
posted by schwa at 7:18 PM on February 23, 2014


In this article I’ll show that badly performed A/B tests can produce winning results which are more likely to be false than true.

Sounds like I should just look at the results of A/B tests and do the opposite.


That's because you're not attending to the vital word "can" in that sentence.
posted by straight at 7:27 PM on February 23, 2014 [1 favorite]


Testing can be busy make-work or it can actually pay off. I'm thinking of a Jared Spool lecture I saw a couple of years ago about A/B testing at Amazon, where changing one thing about reviews (featuring which reviews were most helpful) brought in millions of dollars in additional revenue.

Of course, when you're Amazon, you do so many transactions a day that you can get significant results in minutes. But that scale of testing is a bit beyond the scope of this paper.

Personally, I consider it a win if people even think about testing a thing before just launching into whatever the HiPPO (highest paid person's opinion) decided this week.
posted by fifteen schnitzengruben is my limit at 8:35 PM on February 23, 2014 [3 favorites]


I think critiques of A/B testing are warranted and good things but, if we're going to have this somewhat tedious conversation, it's worth looking at why A/B tests became popular. In Silicon Valley (and probably elsewhere), as recently as the early-to-mid 2000s, it seems to me that there was very much a "Let's just build this awesome thing/feature! People will love it!" mentality. Obviously, you can hit a few home runs this way, but it may also be the case that your way of interacting with the world/website/product is very different from that of most users. Your awesome idea might be awesome for you, but it could be terrible for everyone but you. To make matters worse, people who have allegedly great and visionary ideas about what to build, what changes to make, etc., are sometimes very charismatic people who relentlessly evangelize their ideas until everyone's on board.

I think A/B testing became a big thing when investors realized that it could bring an objective, scientific measure of whether or not an idea is really a good one.

A/B testing culture certainly has its problems -- people who don't really know stats are interpreting the results and designing the tests, companies that rely too much on A/B testing may squash true innovation, and so on -- but I think the adoption of A/B testing was a response to the incredibly difficult problem of the sway that personal charisma and/or bullying can have on organizations, their investments, and directions.

That said, we desperately need to get better at them and also have active conversations about when to consider other factors when making product/design decisions.
posted by treepour at 8:53 PM on February 23, 2014 [11 favorites]


I know I certainly frigging would, e.g. with Windows, Office and the OS.

Eh, I hated the Office Ribbon for a solid week back in 2008, and have found that I prefer it since then. Did you really want to see the Office 1995 interface reach the age of majority?
posted by pullayup at 8:55 PM on February 23, 2014


I was in charge of A/B testing on the data side at my last job. We were seeing the same problems as those mentioned earlier on this page -- non-reproducible results, false negatives, false positives, the whole works. I actually showed a statistical consultant this exact calculator from Evan Miller's page, arguing that we weren't using a large enough sample size, but he (convincingly) stated that Miller's sample sizes were far too large and that if we were truly getting a representative sample, we should be able to get significance in a few thousand results.

I don't have a statistics PhD, though I was conversant enough to do some testing, but this whole exercise has convinced me that I literally know just enough to be dangerous. I have no idea what to think now.
posted by thewumpusisdead at 9:04 PM on February 23, 2014 [1 favorite]
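
For reference, calculators like Evan Miller's are (as far as I understand) built on the standard two-proportion sample-size formula. Here's a rough Python sketch of that formula, which shows why the numbers get so large for small lifts:

    import math
    from scipy.stats import norm

    def sample_size_per_group(p1, p2, alpha=0.05, power=0.8):
        """Per-arm n for a two-proportion z-test (the textbook formula)."""
        z_alpha = norm.ppf(1 - alpha / 2)
        z_beta = norm.ppf(power)
        p_bar = (p1 + p2) / 2
        numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                     + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
        return math.ceil(numerator / (p1 - p2) ** 2)

    # Detecting a lift from a 5% to a 6% conversion rate:
    print(sample_size_per_group(0.05, 0.06))   # on the order of 8,000 per arm

"A few thousand results" is only enough if the effect you're chasing is large; for the one-percentage-point lifts most sites actually see, the formula demands far more traffic.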


Foci for Analysis: "this weird trick."

My distrust for any link that says that is so thoroughly ingrained that it's hard to force myself to click it.
posted by double block and bleed at 9:11 PM on February 23, 2014 [6 favorites]


It does seem to really boil down to having a large enough sample size. Having worked on these types of projects, I can say it is very tempting to stop testing at what seems like a "clear win" once a seemingly large number of users have been part of the test.

A lot of this is driven by the seductive idea that small changes here and there can make significant differences to a company's bottom line. The reality is that despite the shiny case studies where a green button beat a red button by 25%, most incremental changes have incremental results. Over time, lots of incremental results can make a big difference, so A/B testing is valuable, but our experience has been that for true accuracy you need more samples than most of the off-the-shelf tools suggest.
posted by cell divide at 11:46 PM on February 23, 2014


When I was looking at jobs recently I considered moving into marketing. Looking at person specifications I noticed that I needed to understand A/B testing. I was somewhat concerned that I had never heard of this, and was amused to discover, after some googling, that it just referred to statistical hypothesis testing!

There seems to be a lot of low hanging fruit for statisticians to grab at the moment. That "one weird trick" linked in the op is essentially the same as if clinical trials recruited patients as they stepped into the hospital and were deeply disappointed when most of them were not coming for cancer treatment.

Of course increasing your sample size won't solve your problems if most of the things you are testing between are the same anyway: if 99% of tests are A/A then (if you are using a 5% significance level) for every 1000 tests you'll have ten true positives (assuming 100% power!) and 49.5 false positives!
posted by Cannon Fodder at 12:14 AM on February 24, 2014 [1 favorite]
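
Cannon Fodder's arithmetic, as a tiny script (the 1% prior on "this change actually does something" is just the illustrative assumption from the comment above):

    tests = 1000
    prior_real = 0.01      # assume only 1% of tested changes actually do anything
    alpha, power = 0.05, 1.0

    true_positives = tests * prior_real * power            # 10
    false_positives = tests * (1 - prior_real) * alpha     # 49.5
    share_real = true_positives / (true_positives + false_positives)
    print(true_positives, false_positives, f"{share_real:.0%} of 'wins' are real")

Under those assumptions only about one in six "winning" tests reflects a real improvement, which is essentially the paper's headline claim.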


It's been a while since I studied any probability and statistics formally, but if I remember correctly there is one particular result in statistics that applies here. It states that the chances of getting a result that is n standard deviations from the mean approach 1 as the sample size becomes unbounded.

In practical terms, I think it means this: if you hypothetically ran A/A testing on your website, you can always let it run until it shows a statistical significance, and it always eventually will.
posted by cotterpin at 3:02 AM on February 24, 2014


What's an A/B test?

That's when you go into the hi-fi store and listen to the same piece of music on two sets of speakers side-by-side to decide which you like better.

Or maybe play the same guitar through two different amplifiers (or vice versa) for the same reason.

Or a similar exercise in any domain allowing slightly different experiences over time.

Often involving a switch labeled "A / B".


Apparently the term has been recently colonised by web 'optimizers'.
 
posted by Herodios at 3:53 AM on February 24, 2014 [1 favorite]


The thing about A/B testing that everyone gets wrong is that it's got to be A/B testing, not A/Dog-riding-a-snowboard testing. You can't test two wildly different features that should never be compared in the first place. It is completely meaningless to say "users converted better when the logo was white instead of blue, and the header looked like this, and this button was over here, and this menu had these links in it instead of these, and we put a picture of a dog riding a snowboard at the bottom". You can't A/B test a million variables at the same time, and you have to give users time to adjust to remove the shock factor from the equation. Unfortunately, product managers often have the scientific background of an aquatic mammal, which leads to all manner of ridiculous misuses of these techniques.
posted by deathpanels at 5:00 AM on February 24, 2014 [3 favorites]


The triumph of A/B testing is a classic case of precision trumping accuracy.

I've been in the business since 1997, a lot of that time in either marketing departments or straight-up advertising, and I can't remember any time during which advertising revenues weren't increasing or people weren't willing to loudly predict the imminent demise of online advertising.

Why were the revenues increasing? Because people could measure the performance.

Back when web advertising first started out, it was often a terrible deal for most of the people doing it. They were very likely not reaching their audience as effectively or efficiently as they would through other media -- but the publishers in other media couldn't give the advertisers precise numbers. Publishers on the web could give you SUPER precise numbers (and they've only gotten more so). That was extremely sexy to people who had to go and make that internal case that KokuRyu was talking about up-thread.

A/B testing is no panacea. Overall it's better that people are willing to test, but you'd like them to understand that not all tests are created equal, and precision is no substitute for accuracy. Absent that understanding, really terrible things can happen, ranging from badly built landing pages up to global financial policy.
posted by lodurr at 6:41 AM on February 24, 2014 [1 favorite]


I like the idea of measuring the effectiveness of anything with statistical accuracy, but I'm leery of those who do measurements for the sake of doing them. It's like the seed of another cargo cult.
posted by ZeusHumms at 6:47 AM on February 24, 2014 [1 favorite]


mildly amazed (and chagrined) that it took this long for someone to use the term 'cargo cult.'
posted by lodurr at 7:01 AM on February 24, 2014 [1 favorite]


I first encountered the term "A/B testing" in the field of high-end audio equipment. Back in the early 2000s it was almost sacrilege to assert that the best way to tell if a new high-end gold-wire plutonium-shielded cable was actually doing anything was to blindly compare the audio quality to your old cable. But yes, it's about as basic as hypothesis testing can get.

I think the white paper is a good summary of where A/B tests tend to go wrong, but I don't think it is perfectly clear about the underlying issues for non-statisticians, especially when talking about statistical power. Statistics Done Wrong (previously) is the best resource I've seen for laypersons.

I also find the author's repeated praises of other scientific disciplines to be rhetorically effective but sort of hilarious given that scientific and medical studies often get their stats wrong, either due to ignorance or real-world constraints.
posted by muddgirl at 7:16 AM on February 24, 2014 [1 favorite]


A/B testing in the audio world can be hilarious. I personally know people who've A/B tested directional audio cables and insist that they can tell a difference.
posted by lodurr at 7:28 AM on February 24, 2014


There seems to be a lot of low hanging fruit for statisticians to grab at the moment.

More low hanging fruit than there are statisticians, as far as I can tell. For most people good at math, the inclination is to do ANYTHING BUT STATISTICS as a first choice, because you've got lots of options to choose from, and almost all of them are sexier than a nameplate that reads 'statistician'.

That said, I've recently been greatly enjoying learning some stats under the guise of machine learning, and finding it really fun. For a mathematician, stats is the medium by which "math you are interested in" and "anything else in the world you are interested in" connect to one another. So long as you have an 'anything else,' it's pretty easy and painless to learn some stats in that applied context.

You can't A/B test a million variables at the same time, and you have to give users time to adjust to remove the shock factor from the equation.

Interestingly, there are cool ways to track ten factors at once.... You just then have to vary _all_ of the factors and get way larger sample sizes. But then you get interesting information about how the different factors interact with one another. (There's a really nice bit of representation theory involved in the analysis, too, which makes me pretty happy. As a mathematician...)
posted by kaibutsu at 8:55 AM on February 24, 2014 [2 favorites]
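
A rough sketch of what kaibutsu is describing -- a small full-factorial design where every combination of factors gets traffic and you fit main effects plus pairwise interactions afterward. The factor names and effect sizes here are made up purely for illustration:

    import itertools
    import numpy as np

    factors = ["logo_color", "button_copy", "layout"]             # hypothetical factors
    cells = list(itertools.product([0, 1], repeat=len(factors)))  # 2^3 = 8 cells
    rng = np.random.default_rng(1)
    n_per_cell = 5_000

    # Simulate conversion rates with one main effect and one interaction baked in
    y = []
    for cell in cells:
        p = 0.05 + 0.01 * cell[0] + 0.005 * (cell[0] * cell[1])
        y.append(rng.binomial(n_per_cell, p) / n_per_cell)

    X = np.array(cells, dtype=float)
    columns = [np.ones(len(X))] + [X[:, i] for i in range(X.shape[1])]
    columns += [X[:, i] * X[:, j]
                for i, j in itertools.combinations(range(X.shape[1]), 2)]
    design = np.column_stack(columns)
    coefs, *_ = np.linalg.lstsq(design, np.array(y), rcond=None)
    print(np.round(coefs, 4))   # intercept, main effects, pairwise interactions

As kaibutsu notes, the price of recovering those interaction terms is traffic: every extra factor multiplies the number of cells you have to fill with a decent sample.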


And most published research results are false.
posted by madcaptenor at 9:24 AM on February 24, 2014


I first encountered the term "A/B testing" in the field of high-end audio equipment. Back in the early 2000s

Same here -- only more like 1970.

More history:

The History of A/B and Multivariate Testing
[In 1747] British Royal Navy ship surgeon, Mr. James Lind, started what would be the beginning of A/B and multivariate testing. By giving specific crew members different solutions, then testing the results of those variations over time, he came to the conclusion that citrus fruits were the cure to scurvy.
Introduction to A/B Testing
[T]he first recorded clinical trial was an A/B test. In 1747, scotsman James Lind (a physician in the British Royal Navy) divided twelve scurvy afflicted sailors into six pairs. He then provided these groups different variations of citrus fruit, vinegar, cider, etc. and through the results found that Vitamin C rich citrus fruits help in preventing and curing scurvy.
See also: ABX testing (mostly audio)

wp on ABX testing.

Why ABX Testing Usually Produces Null Results with Human Subjects
ABX testing has been extremely controversial since it was introduced decades ago. . . . The "X" in the ABX is either A or B, randomly selected, the listener needs to identify whether that "X" is "A" or "B". Unfortunately human beings do not have the ability to compare three sonic events sequentially. . . . Thus ABX tests usually get null results, and cause listening fatigue.
Testing Audiophile Claims and Myths
 
posted by Herodios at 9:46 AM on February 24, 2014 [3 favorites]


Apparently the term has been recently colonised by web 'optimizers'.

Actually Jakob Nielsen has been pushing for (at least) multivariate testing in usability for about twenty years. A/B testing hit big about a decade ago with fast development platforms like Ruby on Rails making the development investment worthwhile for even smaller sites. High-traffic sites like Google have been A/B testing almost from the beginning. I'm sorry your personal jargon world has been invaded.
posted by dhartung at 5:57 PM on February 24, 2014


Actually Jakob Nielsen has been pushing for (at least) multivariate testing in usability for about twenty years.
[In 1747] British Royal Navy ship surgeon, Mr. James Lind, started what would be the beginning of A/B and multivariate testing.
I accept your apology.
 
posted by Herodios at 6:16 PM on February 24, 2014


'A/B testing' is the term we use for this practice in web development and web-based marketing, and it's what the FP refers to.

So while pointers to the etymology of the term are certainly welcome, any assertion of greater legitimacy because someone else uses the term for some other related practice just isn't really relevant to the discussion.
posted by lodurr at 8:45 AM on February 25, 2014


It states that the chances of getting a result that is n standard deviations from the mean approach 1 as the sample size becomes unbounded.

Yes, for example, if adult Americans have an average annual income of $30000, with a standard deviation of $10000, and those incomes were distributed per a bell curve (spoiler: they're not), if you had an infinite number of adult Americans to sample from, you would, if you sampled enough of them, find one with an annual income of 38 billion dollars, or 943 quintillion dollars, or -516 octillion dollars, or any other number, positive or negative, that you'd care to name.

In practical terms, I think it means this: if you hypothetically ran A/A testing on your website, you can always let it run until it shows a statistical significance, and it always eventually will.

No, statistical significance is something different. If you found a $38 billion-per-year American far more often than your model predicted, that would be a statistically significant difference from your model. But if you find that $38 billion-per-year American as often as your model indicated, the mere existence of that person does not constitute a statistically significant difference from your model.

The linked paper, in fact, directly addresses this point: if you simply run a test until you see a "statistically significant" difference—"peeking," as the article calls it—you are especially prone to generate a false positive, but it won't always happen. In fact, you would get a positive result 41% of the time in A/A testing by "peeking." 59% of the time you would never show a statistically significant difference, even if you let the test run ad infinitum.
posted by DevilsAdvocate at 1:00 PM on February 25, 2014
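
A quick simulation of the "peeking" behaviour DevilsAdvocate and the paper describe -- an A/A test checked for significance after every batch of visitors, stopping at the first nominally significant difference. The batch size and number of peeks here are arbitrary choices for illustration:

    import numpy as np

    rng = np.random.default_rng(2)
    p, batch, peeks, trials = 0.05, 1_000, 50, 500
    ever_won = 0
    for _ in range(trials):
        a = b = n = 0
        for _ in range(peeks):                  # "peek" after every batch
            a += rng.binomial(batch, p)
            b += rng.binomial(batch, p)
            n += batch
            pooled = (a + b) / (2 * n)
            se = np.sqrt(pooled * (1 - pooled) * 2 / n)
            if se > 0 and abs(a - b) / n / se > 1.96:
                ever_won += 1
                break                           # stop at the first "significant" peek
    print(f"A/A tests that ever declared a winner: {ever_won / trials:.0%}")

With enough peeks the share of A/A tests that ever declares a winner climbs well past the nominal 5%, even though nothing differs between the arms -- though, as DevilsAdvocate says, it never reaches 100%.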



