Test Everything
May 9, 2012 8:28 AM

In 2007, Google project manager Dan Siroker took a leave of absence from Google, moved to Chicago, and joined up with Obama’s campaign as a digital adviser.
"At first he wasn’t sure how he could help. But he recalled something else Obama had said to the Googlers: “I am a big believer in reason and facts and evidence and science and feedback—everything that allows you to do what you do. That’s what we should be doing in our government.” And so Siroker decided he would introduce Obama’s campaign to a crucial technique—almost a governing ethos—that Google relies on in developing and refining its products. He showed them how to A/B test."

If you spend any time on the web, you've probably been an unwitting subject in what’s called an A/B test. It’s the practice of performing real-time experiments on a site’s live traffic, showing different content and formatting to different users and observing which performs better.
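
In its simplest form it looks something like this; a minimal sketch in Python (the headlines, the 50/50 split, and the click metric are all invented for illustration):

import hashlib

HEADLINES = {"A": "Sign up today", "B": "Join the community"}  # invented copy
clicks = {"A": 0, "B": 0}

def assign(visitor_id: str) -> str:
    # Hash the visitor ID so the same person always sees the same variant.
    h = int(hashlib.md5(visitor_id.encode()).hexdigest(), 16)
    return "A" if h % 100 < 50 else "B"  # 50/50 split

def serve_page(visitor_id: str) -> str:
    return HEADLINES[assign(visitor_id)]  # different content per bucket

def record_click(visitor_id: str) -> None:
    clicks[assign(visitor_id)] += 1  # observe which variant performs better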

Bonus Link: Gaming site IGN made 7 small tweaks to its homepage. Employing the A/B testing methodology, IGN noted how those tiny changes can have a big effect.
posted by Petrot (82 comments total) 36 users marked this as a favorite
 
The BBC also tried it, with an example of the code used.
posted by Petrot at 8:30 AM on May 9, 2012


Which Americans are in the B test for drug policy and public health care insurance?
posted by DU at 8:32 AM on May 9, 2012 [17 favorites]


Canada is the B test.
posted by Meatbomb at 8:36 AM on May 9, 2012 [20 favorites]


The rich.
posted by TwelveTwo at 8:37 AM on May 9, 2012 [6 favorites]


I'm pretty sure that A/B testing is the reason I get all these stupid emails in my inbox from Democrats with "Hey, what's up" as the subject, instead of something meaningful. It actually pissed me off to the point where I started unsubscribing from all of those lists.
posted by empath at 8:43 AM on May 9, 2012 [9 favorites]


A/B testing is crucial when you absolutely, positively want to avoid having to take responsibility for a decision.
posted by Thorzdad at 8:50 AM on May 9, 2012 [31 favorites]


As useful as it is, this is the danger of A/B testing. Some people feel like it liberates them from choosing appropriate variants for both A and B. If something nonsensical gets more clicks, how can it be wrong?
posted by the jam at 8:50 AM on May 9, 2012 [1 favorite]


Clicks good. Bounce bad.
posted by TwelveTwo at 8:50 AM on May 9, 2012 [1 favorite]


Conversion best.
posted by TwelveTwo at 8:51 AM on May 9, 2012 [2 favorites]


I'm A/B testing Presidents. Some of you will get Mitt Romney, others Obama, still others will have Reanimated Millard Fillmore.
posted by zippy at 8:53 AM on May 9, 2012 [5 favorites]


Which Americans are in the B test for drug policy and public health care insurance?

Romney's the B. Hence, the A is better.
posted by Ironmouth at 8:54 AM on May 9, 2012


Maybe Romney will hire someone from Wikipedia to do A/B testing for his campaign and they'll end up with creepy unshaven closeups of him all over his website.
posted by burnmp3s at 8:56 AM on May 9, 2012 [2 favorites]


A/B testing is also really great if you just want to think about the next few weeks and tiny incremental improvements, not investigate how something works long term or know why it works.

Yes, making that tiny change means that people do X faster now, but it prevents you from making some other change that would have been better. Or it only helps the n00bs do X faster but not the experts, whom it harms. Or it's a statistical help because now some slow subset is excluded (I bet using internet slang would test great, for instance, because slow old people would drop out).

You really need to know what you are doing, not just generate random changes and A/B test them. Otherwise you get something that works, but far from optimally, like evolutionarily developed retinas with their backwards nerves.
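
A toy illustration of that trap, with an invented one-dimensional "quality" landscape: greedily accepting any small change that tests better strands you on the minor peak and never finds the big one.

def quality(x: float) -> float:
    # Invented landscape: a small hill peaking at x=1, a much bigger one at x=6.
    return max(0.0, 1 - (x - 1) ** 2) + max(0.0, 3 - (x - 6) ** 2)

x, step = 0.0, 0.1
while quality(x + step) > quality(x) or quality(x - step) > quality(x):
    x += step if quality(x + step) > quality(x) else -step

print(round(x, 1), round(quality(x), 2))  # 1.0 1.0; never reaches x=6, where quality is 3.0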
posted by DU at 8:59 AM on May 9, 2012 [9 favorites]


Of course we get scientific methods applied to the problem of political campaigning. Of course we will never get a scientific approach to policy.
posted by grobstein at 9:00 AM on May 9, 2012 [25 favorites]


As far as Google and testing is concerned...
posted by Thorzdad at 9:02 AM on May 9, 2012 [1 favorite]


So, DU, as a n00b, if I understand what you are saying correctly, it implies the need to do A/B testing within the contextual knowledge of the users and their operating environment, gained from more qualitative observations/insights, rather than basing decisions on metrics alone in isolation?
posted by infini at 9:05 AM on May 9, 2012 [1 favorite]


I guess short-term decisions can't really be blamed on A/B testing. It's really just the same concept as the control group in science. I'm just sick of UIs designed on "scientific" principles that ask the wrong questions (like "how intuitive is it"--how long are your users going to be in the learning stage? why not optimize for the bulk of their time, not first impressions?) and so forth. The website design angle, especially from Google who've made every wrong decision possible in the last 10 years, pressed my button.
posted by DU at 9:09 AM on May 9, 2012 [2 favorites]


I'm starting to see people refer to A/B testing as some kind of techno-geek magic. A/B testing is the new "social" or the new "interactive."

We were A/B testing back in 1997. It was called "let's see which one works better."
posted by Cool Papa Bell at 9:11 AM on May 9, 2012 [5 favorites]


Yes, making that tiny change means that people do X faster now, but it prevents you from making some other change that would have been better. Or it only helps the n00bs do X faster but not the experts, whom it harms.

A/B is only the beginning. Google actually runs multiple experiments simultaneously (to the extent that they don't conflict). You can be included in multiple experiments. GA will separate your site into new and return users; and you're free to segment your users further. And intelligent students of CS will tell you that you need to test big, possibly negative changes to avoid falling into local maxima. Furthermore, without A/B testing, we have little reason to believe the "it works" part of "why it works."
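
For instance, here's a rough sketch of how layered experiments can stay independent, assuming assignment is done by hashing the user ID with a per-experiment salt (the experiment names and user counts are made up):

import hashlib
from collections import Counter

def bucket(user_id: int, experiment: str) -> str:
    # Salting the hash with the experiment name gives each experiment
    # its own independent coin flip per user.
    h = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return "A" if int(h, 16) % 2 == 0 else "B"

joint = Counter((bucket(u, "exp1"), bucket(u, "exp2")) for u in range(10_000))
print(joint)  # roughly 2,500 users in each of the four (exp1, exp2) cells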

I do wish that Wired would include links to their sources. At some point I'd like to introduce A/B to our small nonprofit website, and learn more about this stuff, but I'd like to learn the lessons of others' failures before inventing my own failures.
posted by pwnguin at 9:13 AM on May 9, 2012 [1 favorite]


And so Siroker decided he would introduce Obama’s campaign to a crucial technique—almost a governing ethos—that Google relies on in developing and refining its products. He showed them how to A/B test."


"Due to a technical glitch, the experiment was a disaster."
posted by chavenet at 9:13 AM on May 9, 2012


"We've replaced this candidate's website with NEW Folger's Crystals...let's see if these voters notice the difference."
posted by briank at 9:14 AM on May 9, 2012 [4 favorites]


you need to test big, possibly negative changes to avoid falling into local maxima

This is true of more than just A/B testing, and it should be stapled to the forehead of every new MBA as a condition of graduating.
posted by aramaic at 9:16 AM on May 9, 2012 [4 favorites]


Cool Papa Bell: "It was called "let's see which one works better.""

Yeah, I don't get it. Likewise, every hot new software engineering process that comes down the pike looks the same to me:

1. Make a list of stuff to do, sorted by priority.
2. March down the list, top to bottom, until you have to ship.
3. Ship.
4. GOTO 1.
posted by Rat Spatula at 9:17 AM on May 9, 2012 [2 favorites]



like "how intuitive is it"--how long are your users going to be in the learning stage? why not optimize for the bulk of their time, not first impressions?


Interesting. (serious question) Do you think that learning how to use it in the first place might be a barrier to adoption?
posted by infini at 9:19 AM on May 9, 2012


I'm just sick of UIs designed on "scientific" principles that ask the wrong questions (like "how intuitive is it"--how long are your users going to be in the learning stage? why not optimize for the bulk of their time, not first impressions?)

Because if users give up in frustration before they learn how to do anything then it doesn't matter how great the product is for experienced users.

The website design angle, especially from Google who've made every wrong decision possible in the last 10 years, pressed my button.

Depends on the metric you use. If your testing uses profitability as the metric, then Google is apparently doing pretty great.
posted by jedicus at 9:19 AM on May 9, 2012


Or what jedicus said, except for the part about Google. I hate their new redesign but I must be on the lunatic fringe of outliers as the internet expands its bell curve to absorb more billions.
posted by infini at 9:21 AM on May 9, 2012


Amazon does something like 100 A/B tests a day. The beauty of their traffic is they can do a test for 30 seconds and get a realistic sample size. (For more, watch Jared Spool's fantastic presentation on Design Treasures From the Amazon, which talks about this in more detail.)

A/B testing works great if you're trying to figure out which language or shape of a button is the best one. It's not so great when your boss says, "Hey, let's do A/B testing?" and you reply, "Fantastic! What would you like to test?" and they reply, "You mean we have to come up with variants?" and then the conversation dies for six months.
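
To put rough numbers on that traffic advantage, here's a back-of-the-envelope sample-size calculation (the standard two-proportion formula; the baseline rate and lift are invented, and it assumes scipy is available):

from scipy.stats import norm

def sample_size_per_arm(p_base, p_test, alpha=0.05, power=0.8):
    # Standard two-proportion formula for a two-sided test.
    z_a = norm.ppf(1 - alpha / 2)
    z_b = norm.ppf(power)
    p_bar = (p_base + p_test) / 2
    num = (z_a * (2 * p_bar * (1 - p_bar)) ** 0.5
           + z_b * (p_base * (1 - p_base) + p_test * (1 - p_test)) ** 0.5) ** 2
    return num / (p_base - p_test) ** 2

# Detecting a 3.0% -> 3.3% lift takes ~50,000 users per arm: seconds for
# Amazon, potentially months for a small site.
print(round(sample_size_per_arm(0.03, 0.033)))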
posted by fifteen schnitzengruben is my limit at 9:26 AM on May 9, 2012 [8 favorites]


Do you think that learning how to use it in the first place might be a barrier to adoption?

That can depend on the situation. But in general, if the tool really does work well once you know how to use it, people use it. Look at emacs, vi, blender and for that matter *nix in general. Or manual cameras. Or machine tools. Very few people will argue that these are "intuitive" but everybody who uses them swears by them.

Furthermore, those same people complain mostly about attempts to "simplify" or "make it more intuitive" to lower that barrier. Not because of the unwashed masses coming in but because it removes the power they once had.

I think it's the very rare task that can be solved by a tool that is both simple and powerful; therefore most tools that are simple are terrible.
posted by DU at 9:28 AM on May 9, 2012


There's more than one way to make an easy learning curve. Most tools do it by making the plateau at the top very low. We call these tools "intuitive". Good tools do it by making the time very long. Of these tools we say "you have to get used to it".
posted by DU at 9:31 AM on May 9, 2012


If your testing uses profitability as the metric, then Google is apparently doing pretty great.

It doesn't show weakness to competition, however.

Let's say company X is dominant in the market, and their leadership is so strong that absent an incredibly bad site change, users would still go to them.

An A/B test in this case isn't going to show potential defections to a competitor. The site's not going to run a test where the front page is defaced with a shock image. They're probably not going to do a test where half the users get a "server unavailable" message, or double-billed transactions.

Because of this blind spot, a dominant site may accumulate many weaknesses. It may be impossible to tell that a local optimum is far from a strong position, because absent a Really Bad Event, there's no way to measure it.

Through these untestable weaknesses, user resentment and the potential for defection grows. Then, during a real stumble, there's a competitor with something you could not have tested for: a viable and radical alternative. You're broken, they work. You're stale, they're not. You're optimized for last year, they're five years in the future and desirable.

Sony mp3 players / Apple iPods
Nokia smart phones / Apple iPhones
Yahoo circa 1999 / Google circa 1999
DSL / Fiber
posted by zippy at 9:31 AM on May 9, 2012 [7 favorites]


The website design angle, especially from Google who've made every wrong decision possible in the last 10 years, pressed my button.

It's good something pressed your button, DU. Everyone else was pressing Google's.

How large is your startup, to the nearest $1B, again?
posted by IAmBroom at 9:35 AM on May 9, 2012 [4 favorites]


More Money = Better Than
posted by DU at 9:38 AM on May 9, 2012 [4 favorites]


We were A/B testing back in 1997. It was called "let's see which one works better."

Yes, but, as you've read the article, you know that it's the real-time multivariate testing that is the real game changer here.

The optimization might seem trivial, but when you're working on a scale like Amazon or one of the big sites (or a presidential election), it's absolutely essential and now almost mandatory. I sure wish Gore A/B tested his Florida strategy...

But yeah, Obama did this back in 2007 to tremendous effect apparently.

How Obama Raised $60 Million by Running a Simple Experiment

I'm in the industry, so this trend towards optimization (site and search) in the recent decade has been pretty depressing for me, but in a corporation it is nigh impossible to argue against.

I'm glad the slightly better presidential candidate is probably slightly better than his opponent at optimization ... and then I'm incredibly depressed by the thought that today's BlackHatSEO chumps are the campaign managers of 2016.
posted by mrgrimm at 9:39 AM on May 9, 2012


How large is your startup, to the nearest $1B, again?

So only owners of billion dollar companies are qualified to talk about testing metrics and effectiveness? That's just dumb.
posted by patrick54 at 9:41 AM on May 9, 2012 [2 favorites]


But in general, if the tool really does work well once you know how to use it, people use it. Look at emacs, vi, blender and for that matter *nix in general. Or manual cameras. Or machine tools. Very few people will argue that these are "intuitive" but everybody who uses them swears by them.

Right, the small percentage of the population that could be bothered to get over the learning curve swears by them. But none of those things have been embraced by a majority of users (except maybe machine tools, but that's a niche market to start with).

Compare market share for traditional text editors with Word, for example, even for pure text-editing tasks. Or the various proprietary 3D editors compared to Blender. Or Windows & Mac for Unix. Or point & shoots for dSLRs (even complex point & shoots that are just as expensive as entry-level dSLRs).

So do you want to make a powerful tool for a small niche of dedicated users? Or do you want to have a product that most people will bother to use? The former is not really a route to profitability and growth, especially in the context of open source software, since those dedicated users will feel perfectly comfortable installing from source and acting as their own tech support.

Anyway, bringing this back 'round to politics: A/B testing all depends on the metric you use. Google probably used something tied to profitability. The Obama campaign used something tied to getting votes. Thus, its platform and policies are optimized for winning an election in the short term, not doing what's best for the country in the long term.

As was pointed out above, bringing scientific methods to the campaign has nothing to do with bringing scientific methods to governance.
posted by jedicus at 9:42 AM on May 9, 2012 [1 favorite]


I liked this article. Thanks.
posted by salvia at 9:44 AM on May 9, 2012 [5 favorites]


I liked this article. Thanks!
posted by salvia at 9:45 AM on May 9, 2012 [25 favorites]


A/B testing and its bastard cousin MVP are pretty much the reason I hate modern web page design.

Turns out that the optimum route to conversions is stupid SEO shit your conscience would stop you from doing if it didn't have a cool, Google-approved name.

Of course I should have A/B tested this comment to ensure maximum conversions. I should have watered it down and pivoted the most popular elements of it. I should have said the opposite of what I wanted to say because the metrics don't lie.

I should have pre-tested it on twitter and eked out every. Single. Possible. Favourite.

And then I should have quit the human race. Because who wants to create anything so soulless and empty?
posted by zoo at 9:49 AM on May 9, 2012 [2 favorites]


I'm A/B testing Presidents. Some of you will get Mitt Romney, others Obama, still others will have Reanimated Millard Fillmore.

I, and my fellow libertarians, are holding out for a) Zombie Barry Goldwater or b) Skeletal William Howard Taft.
posted by THAT William Mize at 9:52 AM on May 9, 2012 [1 favorite]


So do you want to make a powerful tool for a small niche of dedicated users? Or do you want to have a product that most people will bother to use? The former is not really a route to profitability and growth...

Well I think that's my point. Are you making a good tool or one that will be widespread? If your goal is money, the answer is obvious.
posted by DU at 9:53 AM on May 9, 2012


A/B testing and its bastard cousin MVP are pretty much the reason I hate modern web page design.

A/B posted by zoo on May 9 [One Beeeelion favorites +][!]
posted by zippy at 9:55 AM on May 9, 2012


The MetaFilterB version of this thread is much better, I do not regret paying the $10 at all.
posted by burnmp3s at 9:55 AM on May 9, 2012


Yes, but, as you've read the article, you know that it's the real-time multivariate testing that is the real game changer here.

I did read the article (when it came out a few weeks ago), and we were doing real-time testing back then, too!

It's really easy to read a log file, dude.
posted by Cool Papa Bell at 9:56 AM on May 9, 2012


So do you want to make a powerful tool for a small niche of dedicated users? Or do you want to have a product that most people will bother to use? The former is not really a route to profitability and growth...

Well I think that's my point. Are you making a good tool or one that will be widespread? If your goal is money, the answer is obvious



Context. Uber alles.

Now, for example, I work with populations popularly known as the "Next 4 Billion" or "Next Billion" (as in "How to bring the next billion users online" type of next billion) aka the lower income segments outside the Global North.

Are you making a good tool or one that will be widespread?

It's not an either/or in this context: the tool must be good. It must be easy to understand and use. And it could conceivably be used by billions.

People will not bother to use it if they can't figure out how this tech (even if it's just a stove, not software) works.

Or, will make the effort, if it offers something life-changing, like the mobile phone did. But once learnt, that small niche of dedicated users turned into a virtual torrent, many of whom simply memorized the Nokia UI as a pattern due to stronger retention skills (if illiterate).
posted by infini at 10:11 AM on May 9, 2012


The MetaFilterB version of this thread is much better, I do not regret paying the $10 at all.

And all of us look so much better with goatees.
posted by zombieflanders at 10:22 AM on May 9, 2012 [2 favorites]


Not seeing why optimized delivery is soulless and empty -- because it's calculated? Is this an art versus science thing? Because I'm one of those analytical nerds who totally gets excited by the idea of +/- 10% variance on something as natural as a button label choice.
posted by cavalier at 10:24 AM on May 9, 2012


Not seeing why optimized delivery is soulless and empty

Optimization isn't bad. Optimization without thought is, as it's a form of design by committee and tends to lack any particular direction.

See also The Most Wanted Painting (here's the USA winner, available in dishwasher and television sizes)
posted by zippy at 10:41 AM on May 9, 2012 [2 favorites]


Which Americans are in the B test for drug policy and public health care insurance?
It would be nice to see this philosophy applied to political decisions rather than just political campaigns, wouldn't it? Perhaps we could divide up into say, 50 subgroups, each of which would get to make its own decisions about things like drug laws and public health care...
posted by roystgnr at 10:49 AM on May 9, 2012


How large is your startup, to the nearest $1B, again?

So only owners of billion dollar companies are qualified to talk about testing metrics and effectiveness? That's just dumb.


No, patrick54, "dumb" is claiming that Google "has made every wrong decision possible in the last 10 years", when they are the very paradigm of successful 21st-century startups.

If revenue growth of a web-only business isn't some sort of metric about how good their web design decisions are, your metrics are self-congratulatory at best.
posted by IAmBroom at 11:03 AM on May 9, 2012


Which Americans are in the B test for drug policy and public health care insurance?

It would be nice to see this philosophy applied to political decisions rather than just political campaigns, wouldn't it? Perhaps we could divide up into say, 50 subgroups, each of which would get to make its own decisions about things like drug laws and public health care...


Which has long been supposed as part of the motivation, roystgnr, except that the states have wildly differing starting conditions and control variables (access to waterways and coastlines, mineral wealth, climate), which makes it virtually impossible to draw meaningful conclusions from the decisions and their effects.

Which plays into the hands of both political sides, while they're campaigning.
posted by IAmBroom at 11:06 AM on May 9, 2012 [1 favorite]


Google is profitable, yes, but their sense of design leaves a lot to be desired. And as evidence I would point to the myriad of Google products that have completely failed over the years. (Looking at you, Wave.) Google Places? Ugly and unusable. Google Offers? Poorly thought-out and executed. G+? Surviving but far from flourishing. Each example is marked by deep flaws.

But Google's core profitability comes from one or two segments of their business that are so huge they paper over all the bad decisions. That's the value of being huge.
posted by elwoodwiles at 11:15 AM on May 9, 2012 [2 favorites]


That last sentence is the B version.
posted by elwoodwiles at 11:16 AM on May 9, 2012 [1 favorite]


No, patrick54, "dumb" is claiming that Google "has made every wrong decision possible in the last 10 years", when they are the very paradigm of successful 21st-century startups.

By your own standards, since you haven't shown that you own a multi-billion-dollar startup, your input on this issue is not really worth discussing.

BTW it's not really necessary to address me specifically. The point I'm making is not particular to my person.
posted by patrick54 at 11:17 AM on May 9, 2012


Next to last, ack, nevermind....
posted by elwoodwiles at 11:17 AM on May 9, 2012


I was with a company once that came up with a project which we tried to sell to Microsoft. They wanted some way to automate the choice of ads on the pages of MSN. So we made a front end script that would pull keywords from a given article on MSN, and then we essentially A/B tested different types of ads against those keywords, trying to find previously unpredicted relations between content and product type. Microsoft let us do a proof-of-concept run. MSN has a huge amount of traffic, and it only took a few days to get statistically significant data. After those days we presented our findings to Microsoft:

US: We're done testing the relationship between content and ads.
MSFT: What were the results?
US: We found that no matter what the content was, more people always clicked on Victoria's Secret ads.
MSFT: So you're saying that people like women in lingerie?
US: Uh, yeah... I guess so.
MSFT: We knew that already.

They didn't give us any more money.
posted by twoleftfeet at 11:17 AM on May 9, 2012 [13 favorites]


Hey, at least you can consider your findings confirmed.
posted by Holy Zarquon's Singing Fish at 11:37 AM on May 9, 2012


I worry that results from A/B testing could be misunderstood in the same way that other scientific data can be.

There's a well-known publishing effect: scientists publish the studies that have statistically significant results.

Also, regression to the mean happens: you notice an effect, it's significant at however many sigma, but then you do another study and the effect is less strong.

Also, of course, stats itself gets misused: people who don't really understand a stats package can still get the package to make some kind of claim about the significance of their results.

It seems to me all of these issues are present in A/B testing too.
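
For the record, the readout behind a typical A/B result is just a two-proportion z-test, and it's easy to see where the misuse creeps in; a sketch with invented numbers (assumes scipy is available):

from math import sqrt
from scipy.stats import norm

def two_proportion_p(conv_a, n_a, conv_b, n_b):
    # Pooled two-proportion z-test; returns a two-sided p-value.
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return 2 * norm.sf(abs((p_b - p_a) / se))

# 120/4000 vs 150/4000 conversions: p ~ 0.06. Peek at this every hour and
# stop the moment it dips under 0.05, and the error guarantees evaporate.
print(two_proportion_p(120, 4000, 150, 4000))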
posted by nat at 11:52 AM on May 9, 2012


That was a joke, right?
posted by cavalier at 11:55 AM on May 9, 2012


A/B testing can go wrong in very interesting ways. At one point we did the split based on userId modulo 2. At first the results looked promising; then someone noticed a weird 'coincidence': no matter what change we implemented, we would get better retention in bucket 1.

After an all-nighter we figured out that the load balancer also used userId modulo some multiple of 2 to assign servers. The users in the first bucket landed on the newer, faster boxes.

Another time, to teach the MBA PM about sample size and statistical significance, an engineer with a math background made A identical to B for a one-day, one-percent test. The PM found performance differences and wrote a whole design doc based on his findings.

Bad A/B testing is worse than no testing, but good A/B testing is a very powerful tool.
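
To make the first failure concrete, a toy version of the modulo collision (the server layout is invented):

def experiment_bucket(user_id: int) -> int:
    return user_id % 2  # 0 = control, 1 = treatment

def server(user_id: int) -> int:
    return user_id % 4  # four servers; say 1 and 3 are the new, fast boxes

fast_users = [u for u in range(1000) if server(u) in (1, 3)]
print(all(experiment_bucket(u) == 1 for u in fast_users))  # True: treatment gets every fast server

Hashing the user ID with an experiment-specific salt, rather than using the raw ID, is one way to decorrelate bucketing from anything else keyed off the same ID.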
posted by Ayn Rand and God at 12:38 PM on May 9, 2012 [1 favorite]


Bad A/B testing is worse than no testing, but good A/B testing is a very powerful tool.

But wouldn't this be true of any user research?
posted by infini at 12:42 PM on May 9, 2012


No. Lying to users or invading their privacy in the name of good research is not better than no research at all, even if it would be a very powerful tool.
posted by Ayn Rand and God at 12:46 PM on May 9, 2012 [1 favorite]


I guess Google plus was A/B tested. I KEEED!

Clinton did this in the 90s, it was called triangulation, now we just have "Do what Axelrod says." Oh man, I'll be here all day!

If you haven't read this article by Jim Manzi on experimentation in the world, do. He makes a really great point about the difficulty of knowing the milieu of what you are actually experimenting within. Losing the forest for the walking ent trees, I guess.

"But clinical trials place an enormous burden on being sure that the treatment under evaluation is the only difference between the two groups. And as experiments began to move from fields like classical physics to fields like therapeutic biology, the number and complexity of potential causes of the outcome of interest—what I term “causal density”—rose substantially. It became difficult even to identify, never mind actually hold constant, all these causes."
posted by stratastar at 1:26 PM on May 9, 2012 [2 favorites]


The success of Capital One, Fairbank told Fast Company, was predicated on its “ability to turn a business into a scientific laboratory where every decision about product design, marketing, channels of communication, credit lines, customer selection, collection policies and cross-selling decisions could be subjected to systematic testing using thousands of experiments.” By 2000, Capital One was reportedly running more than 60,000 tests per year. And by 2009, it had gone from an idea in a conference room to a public corporation worth $35 billion.

Do you think this approach can be viable for diametrically different contexts?

The article is excellent btw, thank you. Though I must confess to not liking the dehumanizing aspect of RCTs.
posted by infini at 1:47 PM on May 9, 2012


I worked on the system that allows what they call "experiments" when I was an engineer at Google.

It's a pretty awesome system, but I wasn't the only one to wonder about its overall efficacy. It has the trap common to all greedy algorithms, which is that you can get to a local maximum that's much smaller than the global maximum and have no way to get out.

It also doesn't deal with issues of aesthetics and coherency - and the worst part is that it's always interpreting the response at one distance removed. Consider for example two cases - in one case you get a response, click on it, and it's perfect - success! so you close the search page. In the other case you click on a response, and it's so bad you just give up. But these two cases are necessarily marked in the same bucket.

As for "Google having made every decision wrong in the last ten years" - this is obviously a hyperbolic statement, but Google haven't actually come out with any really successful new products since 2004's Gmail, and even that one is probably still a net money loser for Google, it simply keeps users in their orbit so they can use AdWords and AdSense.
posted by lupus_yonderboy at 1:56 PM on May 9, 2012 [2 favorites]


Why do people like the ottoman better if it appears to the left of the throw rug than if it appears to the right? There’s no time to ask the question, and no reason to answer it. After all, what does it matter if you can get the right result?
No reason to answer it? It's really better to flail around in the dark and not direct your efforts at all? Understanding allows for even better, more focused, and more responsive design. By all means try things that you don't expect to work, lest your thinking become ossified and miss some new trend, but to not have enough curiosity to both want to know why and understand why asking why is important? What a sad life some people must live.
posted by wierdo at 2:42 PM on May 9, 2012


Google haven't actually come out with any really successful new products since 2004's Gmail

Android has a majority share of the smartphone market. Obviously Google doesn't directly profit from this what with the giving it away to phone manufacturers, but they give away Gmail too.
posted by Holy Zarquon's Singing Fish at 2:59 PM on May 9, 2012 [2 favorites]


Many of the objections to split-testing are, in practice, not that great an issue: Split-tests frequently have multiple goals, each of which is separately scored. So if making the button blue rather than green increases clicks, but fewer of those that click wind up buying, or fewer buy the second-order, upsell offerings... well, an effectively designed split-test can usually ferret that out, and then you can do the math on what is most profitable/effective, over a given period of time.

For that matter, tests often are not simply A/B, but multi-variate (composed of multiple A/B/C/N combinations), or meta-splits, with MVT #1 competing against MVT #2 and MVT #3.

Obviously, the bigger the sample size, the faster and better you can get useful data out of such a test.
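
A rough sketch of that multi-goal scoring, with invented metric names and economics:

from collections import defaultdict

funnel = defaultdict(lambda: {"clicks": 0, "purchases": 0, "upsells": 0})

def record(variant: str, event: str) -> None:
    funnel[variant][event] += 1

def value_per_visitor(variant: str, visitors: int) -> float:
    # Weigh each goal by what it's actually worth, not by raw click counts.
    f = funnel[variant]
    return (f["clicks"] + 20 * f["purchases"] + 50 * f["upsells"]) / visitors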
posted by darth_tedious at 3:44 PM on May 9, 2012


As someone who has run (and is currently running) all sorts of A/B tests and experiments at Google, I can say that some people in this thread have the wrong idea.

We don't just make up random ideas and throw them into experiments with no understanding of what is going on. But there is sometimes a disconnect between what you expect users to do and what they actually do, and this is where it comes in: validating what you expect your change to accomplish. If you don't have a working hypothesis, you can't really say what your experiment did.

Also, much of the time this is used for fairly technical things, so it's more like: "did this change break anything?". Especially for stuff that might break based on end-user environments, where there is a limit to how much you can test in a lab (some of the HTML5 stuff I've worked on comes to mind).

So I agree with DU: "You really need to know what you are doing, not just generate random changes and A/B test them". Which is why we don't do that, even though I sometimes think that's the external perception.
posted by wildcrdj at 4:04 PM on May 9, 2012 [1 favorite]


>As useful as it is, this is the danger of A/B testing. Some people feel like it liberates them from choosing appropriate variants for both A and B. If something nonsensical gets more clicks, how can it be wrong?

>Romney's the B. Hence, the A is better.

These comments are made for each other.
posted by pompomtom at 4:35 PM on May 9, 2012 [1 favorite]


If you want to know the limitations of A/B testing, I'd say take a look at the disaster that is Gmail's user interface.
posted by storybored at 8:52 PM on May 9, 2012 [1 favorite]


storybored: "If you want to know the limitations of A/B testing, I'd say take a look at the disaster that is Gmail's user interface"

Let's imagine two people randomly assigned to use email only via A) Gmail UI or B) Yahoo UI. Which person would you prefer to be?
posted by pwnguin at 10:48 PM on May 9, 2012 [1 favorite]


The dedicated child who borrows Dad's Eudora CD?
posted by infini at 10:58 PM on May 9, 2012


Google haven't actually come out with any really successful new products since 2004's Gmail

Android has a majority share of the smartphone market. Obviously Google doesn't directly profit from this what with the giving it away to phone manufacturers, but they give away Gmail too.


Google Analytics.
posted by mrgrimm at 11:26 PM on May 9, 2012 [1 favorite]


If there are any serious usability testing geeks around here, I've got to say that A/B testing is dramatically less efficient, and possibly much more misleading, than the statistical methods of experimental design.

Here's the problem. Traditional A/B testing of web pages manipulates one variable at a time; you make the button say "Click Me" or "Learn More" or "Touch Me" or whatever. You initiate by presenting options equally randomly, and refine selections based on feedback from click-throughs or some other metric. So you have data associated with the options presented for that variable. Then you move on to some other web page variable; "Move the logo to the left", "Switch to sans-serif instead of serif", "Make the background color brighter." Really, there are a zillion possible variables, and the only way anything gets done is that people have genuine insights that can limit the variables you need to A/B test.

But my point is that even if you can narrow down the variables to a reasonably small number, the traditional methodology of A/B testing can be downright misleading. By testing one variable at a time, and then moving on to a different variable, the possible relationships between the variables are ignored. So your tests tell you that the button should say "Touch Me" and the logo should be more red, but they don't tell you that the combination of the two is terrible.

There are well-developed methods in statistics, in experimental design, which allow you to test, say, 8 or 24 variables at the same time, in an efficient sequence, that consider relationships between the variables. There are many services that aim to increase usability or conversion rate, but if all they're doing is A/B testing then they can actually make things worse, or at least suboptimal.
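
A tiny 2x2 example of the interaction problem (all rates invented): two separate tests against the baseline each pick a winner, but the combination of those winners is the worst cell; a full factorial design catches this, one-variable-at-a-time testing doesn't.

import itertools

conversion = {  # (button, logo) -> conversion rate
    ("Click Me", "blue"): 0.030,  # baseline
    ("Touch Me", "blue"): 0.036,  # button test: Touch Me beats the baseline
    ("Click Me", "red"):  0.034,  # logo test: red beats the baseline
    ("Touch Me", "red"):  0.025,  # the two "winners" combined: worst of all
}

for combo in itertools.product(["Click Me", "Touch Me"], ["blue", "red"]):
    print(combo, conversion[combo])

print(max(conversion, key=conversion.get))  # ('Touch Me', 'blue'); one-at-a-time testing ships ('Touch Me', 'red')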
posted by twoleftfeet at 12:13 AM on May 10, 2012


So your tests tell you that the button should say "Touch Me" and the logo should be more red, but they don't tell you that the combination of the two is terrible.

Yes. They do. That's kinda the whole point of multivariate testing. You get every combination of variants.

Yes, of course, the variants need consideration, but these combinations are definitely considered.
posted by mrgrimm at 6:20 AM on May 10, 2012 [1 favorite]


The article is excellent btw, thank you. Though I must confess to not liking the dehumanizing aspect of RCTs.

That's it, innit? Someone above asked why optimization was depressing.

It's b/c SEO has become more important than quality original content.

It's because of F2P, 69 tracking cookies on every page, and Facebook spam.

The marketers are winning.
posted by mrgrimm at 7:24 AM on May 10, 2012 [1 favorite]


... and not to continue the very distantly related derail ..

Google example: If you use GA on your website, Google would not only know all of the IP addresses (and other browser-unique identifiers) of visitors to your website (and which pages they looked at on your site), but because most other websites use GA as well (or another Google product), Google would know, for a visitor to your site, e.g. the 6 other websites that person visited earlier today and the 367 websites he looked at in the last month. Because more than 50% of all websites on the Internet use Google AdSense or another Google product using tracking beacons, Google is able to build a very accurate picture of most websites visited by any given user.

Y'all realize how many sites use GA? And that Google is merging all their privacy policies? I'd say it's their most successful product yet.
posted by mrgrimm at 7:33 AM on May 10, 2012


If you want to know the limitations of A/B testing, I'd say take a look at the disaster that is Gmail's user interface.

If you're referring to the recent UI change, this was _not_ done via A/B testing, quite explicitly. It was the result of actual designers.

Now, obviously it's been met with mixed reviews, but it's actually an example of Google not doing the n-shades-of-blue approach.
posted by wildcrdj at 4:18 PM on May 10, 2012


If you're referring to the recent UI change, this was _not_ done via A/B testing, quite explicitly. It was the result of actual designers.

Now, obviously it's been met with mixed reviews, but it's actually an example of Google not doing the n-shades-of-blue approach.


With all due respect to you, wildcrdj, and this isn't personal. But

FAIL.
posted by infini at 11:00 PM on May 10, 2012


I like the gmail interface. The interface is not the problem. Missing functionality and occasional service issues are the problems for me.
posted by mrgrimm at 10:42 AM on May 11, 2012


I was surprised to find so many dismissive and sarcastic comments about A/B testing. Most stuff gets built and thrown at users with hardly any testing and a cursory "looks good to me" comment from the boss. I suppose you can overdo anything, but I would like to see more testing done to find out what actually works instead of finger-to-the-wind speculation.
posted by dgran at 1:59 PM on May 30, 2012






This thread has been archived and is closed to new comments