Comments on: Test Everything

Test Everything

Petrot — Wed, 09 May 2012 08:28:34 -0800

In 2007, Google project manager Dan Siroker took a leave of absence from Google, moved to Chicago, and joined up with Obama's campaign as a digital adviser.

"At first he wasn't sure how he could help. But he recalled something else Obama had said to the Googlers: "I am a big believer in reason and facts and evidence and science and feedback—everything that allows you to do what you do. That's what we should be doing in our government." And so Siroker decided he would introduce Obama's campaign to a crucial technique—almost a governing ethos—that Google relies on in developing and refining its products. He showed them how to A/B test."

If you spend any time on the web, you've probably been an unwitting subject in what's called an A/B test. It's the practice of performing real-time experiments on a site's live traffic, showing different content and formatting to different users and observing which performs better. Bonus Link: Gaming site IGN made 7 small tweaks to its homepage. Employing the A/B testing methodology, IGN noted how those tiny changes can have a big effect.

By: Petrot

Petrot — Wed, 09 May 2012 08:30:10 -0800

The BBC also tried it, with an example of the code used.

By: DU

DU — Wed, 09 May 2012 08:32:26 -0800

Which Americans are in the B test for drug policy and public health care insurance?

By: Meatbomb

Meatbomb — Wed, 09 May 2012 08:36:54 -0800

Canada is the B test.

By: TwelveTwo

TwelveTwo — Wed, 09 May 2012 08:37:02 -0800

The rich.

By: empath

empath — Wed, 09 May 2012 08:43:44 -0800

I'm pretty sure that A/B testing is the reason I get all these stupid emails in my inbox from Democrats with "Hey, what's up" as the subject, instead of something meaningful. It actually pissed me off to the point where I started unsubscribing from all of those lists.

By: Thorzdad

Thorzdad — Wed, 09 May 2012 08:50:08 -0800

A/B testing is crucial when you absolutely, positively want to avoid having to take responsibility for a decision.

By: the jam

the jam — Wed, 09 May 2012 08:50:35 -0800

As useful as it is, this is the danger of A/B testing. Some people feel like it liberates them from choosing appropriate variants for both A and B. If something nonsensical gets more clicks, how can it be wrong?

By: TwelveTwo

TwelveTwo — Wed, 09 May 2012 08:50:59 -0800

Clicks good. Bounce bad.

By: TwelveTwo

TwelveTwo — Wed, 09 May 2012 08:51:15 -0800

Conversion best.

By: zippy

zippy — Wed, 09 May 2012 08:53:52 -0800

I'm A/B testing Presidents. Some of you will get Mitt Romney, others Obama, still others will have Reanimated Millard Fillmore.

By: Ironmouth

Ironmouth — Wed, 09 May 2012 08:54:36 -0800

Which Americans are in the B test for drug policy and public health care insurance? Romney's the B. Hence, the A is better.

By: burnmp3s

burnmp3s — Wed, 09 May 2012 08:56:58 -0800

Maybe Romney will hire someone from Wikipedia to do A/B testing for his campaign and they'll end up with creepy unshaven closeups of him all over his website.

By: DU

DU — Wed, 09 May 2012 08:59:42 -0800

A/B testing is also really great if you just want to think about the next few weeks and tiny incremental improvements, not investigate how something works long term or know why it works. Yes, making that tiny change means that people do X faster now, but it prevents you from making some other change that would have been better. Or it only helps the n00bs do X faster but not the experts, whom it harms. Or it's a statistical help because now some slow subset is excluded (I bet using internet slang would test great, for instance, because slow old people would drop out). You really need to know what you are doing, not just generate random changes and A/B test them. Otherwise you get something that works, but far from optimally, like evolutionarily-developed retinas with their backwards nerves.

By: grobstein

grobstein — Wed, 09 May 2012 09:00:47 -0800

Of course we get scientific methods applied to the problem of political campaigning. Of course we will never get a scientific approach to policy.

By: Thorzdad

Thorzdad — Wed, 09 May 2012 09:02:41 -0800

As far as Google and testing is concerned...

By: infini

infini — Wed, 09 May 2012 09:05:53 -0800

So, DU, as a n00b, if I understand what you are saying correctly, it implies the need to do A/B testing within the contextual knowledge of the users and their operating environment gained from more qualitative observations/insights rather than in isolation basing decisions on metrics alone?

By: DU

DU — Wed, 09 May 2012 09:09:26 -0800

I guess short-term decisions can't really be blamed on A/B testing. It's really just the same concept as the control group in science. I'm just sick of UIs designed on "scientific" principles that ask the wrong questions (like "how intuitive is it"--how long are your users going to be in the learning stage? why not optimize for the bulk of their time, not first impressions?) and so forth. The website design angle, especially from Google who've made every wrong decision possible in the last 10 years, pressed my button.

By: Cool Papa Bell

Cool Papa Bell — Wed, 09 May 2012 09:11:17 -0800

I'm starting to see people refer to A/B testing as some kind of techno-geek magic. A/B testing is the new "social" or the new "interactive." We were A/B testing back in 1997. It was called "let's see which one works better."

By: pwnguin

pwnguin — Wed, 09 May 2012 09:13:13 -0800

Yes, making that tiny change means that people do X faster now, but it prevents you from making some other change that would have been better. Or it only helps the n00bs do X faster but not the experts, whom it harms. A/B is only the beginning. Google actually runs multiple experiments simultaneously (to the extent that they don't conflict). You can be included in multiple experiments. GA will separate your site into new and return users; and you're free to segment your users further. And intelligent students of CS will tell you that you need to test big, possibly negative changes to avoid falling into local maxima. Furthermore, without A/B testing, we have little reason to believe the "it works" part of "why it works." I do wish that Wired would include links to their sources. At some point I'd like to introduce A/B to our small nonprofit website, and learn more about this stuff, but I'd like to learn the lessons of others' failures before inventing my own failures.

By: chavenet

chavenet — Wed, 09 May 2012 09:13:41 -0800

And so Siroker decided he would introduce Obama's campaign to a crucial technique—almost a governing ethos—that Google relies on in developing and refining its products. He showed them how to A/B test." "Due to a technical glitch, the experiment was a disaster."

By: briank

briank — Wed, 09 May 2012 09:14:07 -0800

"We've replaced this candidate's website with NEW Folger's Crystals...let's see if these voters notice the difference."

By: aramaic

aramaic — Wed, 09 May 2012 09:16:06 -0800

you need to test big, possibly negative changes to avoid falling into local maxima This is true of more than just A/B testing, and it should be stapled to the forehead of every new MBA as a condition of graduating.

By: Rat Spatula

Rat Spatula — Wed, 09 May 2012 09:17:11 -0800

Cool Papa Bell: "It was called "let's see which one works better."" Yeah, I don't get it. Likewise, every hot new software engineering process that comes down the pike looks the same to me:
1. Make a list of stuff to do, sorted by priority.
2. March down the list, top to bottom, until you have to ship.
3. Ship.
4. GOTO 1.

By: infini

infini — Wed, 09 May 2012 09:19:06 -0800

like "how intuitive is it"--how long are your users going to be in the learning stage? why not optimize for the bulk of their time, not first impressions? Interesting. (serious questions) Do you think that learning how to use it in the first place might be a barrier to adoption?

By: jedicus

jedicus — Wed, 09 May 2012 09:19:10 -0800

I'm just sick of UIs designed on "scientific" principles that ask the wrong questions (like "how intuitive is it"--how long are your users going to be in the learning stage? why not optimize for the bulk of their time, not first impressions?) Because if users give up in frustration before they learn how to do anything then it doesn't matter how great the product is for experienced users. The website design angle, especially from Google who've made every wrong decision possible in the last 10 years, pressed my button. Depends on the metric you use. If your testing uses profitability as the metric, then Google is apparently doing pretty great.

By: infini

infini — Wed, 09 May 2012 09:21:27 -0800

Or what jedicus said, except for the part about Google. I hate their new redesign but I must be on the lunatic fringe of outliers as the internet expands its bell curve to absorb more billions.

By: fifteen schnitzengruben is my limit

fifteen schnitzengruben is my limit — Wed, 09 May 2012 09:26:05 -0800

Amazon does something like 100 A/B tests a day. The beauty of their traffic is they can do a test for 30 seconds and get a realistic sample size. (For more, watch Jared Spool's fantastic presentation on Design Treasures From the Amazon, which talks about this in more detail.) A/B testing works great if you're trying to figure out which language or shape of a button is the best one. It's not so great when your boss says, "Hey, let's do A/B testing?" and you reply, "Fantastic! What would you like to test?" and they reply, "You mean we have to come up with variants?" and then the conversation dies for six months.

By: DU

DU — Wed, 09 May 2012 09:28:46 -0800

Do you think that learning how to use it in the first place might be a barrier to adoption? That can depend on the situation. But in general, if the tool really does work well once you know how to use it, people use it. Look at emacs, vi, blender and for that matter *nix in general. Or manual cameras. Or machine tools. Very few people will argue that these are "intuitive" but everybody who uses them swears by them. Furthermore, those same people complain mostly about attempts to "simplify" or "make it more intuitive" to lower that barrier. Not because of the unwashed masses coming in but because it removes the power they once had. I think it's the very rare task that can be solved by a tool that is both simple and powerful, therefore most tools that are simple are terrible.

By: DU

DU — Wed, 09 May 2012 09:31:23 -0800

There's more than one way to make an easy learning curve. Most tools do it by making the plateau at the top very low. We call these tools "intuitive". Good tools do it by making the time very long. Of these tools we say "you have to get used to it".

By: zippy

zippy — Wed, 09 May 2012 09:31:56 -0800

If your testing uses profitability as the metric, then Google is apparently doing pretty great. It doesn't show weakness to competition, however. Let's say company X is dominant in the market, and their leadership is so strong that absent an incredibly bad site change, users would still go to them. An A/B test in this case isn't going to show potential defections to a competitor. The site's not going to run a test where the front page is defaced with a shock image. They're probably not going to do a test where half the users get a "server unavailable" message, or double-billed transactions. Because of this blind spot, a dominant site may accumulate many weaknesses. It may be impossible to tell that a local optimum is far from a strong position, because absent a Really Bad Event, there's no way to measure it. Through these untestable weaknesses, user resentment and the potential for defection grows. Then, during a real stumble, there's a competitor with something you could not have tested for: a viable and radical alternative. You're broken, they work. You're stale, they're not. You're optimized for last year, they're five years in the future and desirable.
Sony mp3 players / Apple iPods
Nokia smart phones / Apple iPhones
Yahoo circa 1999 / Google circa 1999
DSL / Fiber

By: IAmBroom

IAmBroom — Wed, 09 May 2012 09:35:19 -0800

The website design angle, especially from Google who've made every wrong decision possible in the last 10 years, pressed my button. It's good something pressed your button, DU. Everyone else was pressing Google's. How large is your startup, to the nearest $1B, again?

By: DU

DU — Wed, 09 May 2012 09:38:37 -0800

More Money = Better Than

By: mrgrimm

mrgrimm — Wed, 09 May 2012 09:39:41 -0800

We were A/B testing back in 1997. It was called "let's see which one works better." Yes, but, as you've read the article, you know that it's the real-time multivariate testing that is the real game changer here. The optimization might seem trivial, but when you're working on a scale like Amazon or one of the big sites (or a presidential election), it's absolutely essential and now almost mandatory. I sure wish Gore A/B tested his Florida strategy... But yeah, Obama did this back in 2007 to tremendous effect apparently. How Obama Raised $60 Million by Running a Simple Experiment I'm in the industry, so this trend towards optimization (site and search) in the recent decade has been pretty depressing for me, but in a corporation it is nigh impossible to argue against. I'm glad the slightly better presidential candidate is probably slightly better than his opponent at optimization ... and then I'm incredibly depressed by the thought that today's BlackHatSEO chumps are the campaign managers of 2016.

By: patrick54

patrick54 — Wed, 09 May 2012 09:41:18 -0800

How large is your startup, to the nearest $1B, again? So only owners of billion dollar companies are qualified to talk about testing metrics and effectiveness? That's just dumb.

By: jedicus

jedicus — Wed, 09 May 2012 09:42:32 -0800

But in general, if the tool really does work well once you know how to use it, people use it. Look at emacs, vi, blender and for that matter *nix in general. Or manual cameras. Or machine tools. Very few people will argue that these are "intuitive" but everybody who uses them swears by them. Right, the small percentage of the population that could be bothered to get over the learning curve swears by them. But none of those things have been embraced by a majority of users (except maybe machine tools, but that's a niche market to start with). Compare market share for traditional text editors with Word, for example, even for pure text-editing tasks. Or the various proprietary 3D editors compared to Blender. Or Windows & Mac for Unix. Or point & shoots for dSLRs (even complex point & shoots that are just as expensive as entry-level dSLRs). So do you want to make a powerful tool for a small niche of dedicated users? Or do you want to have a product that most people will bother to use? The former is not really a route to profitability and growth, especially in the context of open source software, since those dedicated users will feel perfectly comfortable installing from source and acting as their own tech support. Anyway, bringing this back 'round to politics: A/B testing all depends on the metric you use. Google probably used something tied to profitability. The Obama campaign used something tied to getting votes. Thus, its platform and policies are optimized for winning an election in the short term, not doing what's best for the country in the long term. As was pointed out above, bringing scientific methods to the campaign has nothing to do with bringing scientific methods to governance.

By: salvia

salvia — Wed, 09 May 2012 09:44:47 -0800

I liked this article. Thanks.

By: salvia

salvia — Wed, 09 May 2012 09:45:08 -0800

I liked this article. Thanks!

By: zoo

zoo — Wed, 09 May 2012 09:49:53 -0800

A/B testing and its bastard cousin MVP are pretty much the reason I hate modern web page design. Turns out that the optimum route to conversions is stupid SEO shit your conscience would stop you doing if it didn't have a cool, google approved name. Of course I should have A/B tested this comment to ensure maximum conversions. I should have watered it down and pivoted the most popular elements of it. I should have said the opposite of what I wanted to say because the metrics don't lie. I should have pre-tested it on twitter and eked out every. Single. Possible. Favourite. And then I should have quit the human race. Because who wants to create anything so soulless and empty.

By: THAT William Mize

THAT William Mize — Wed, 09 May 2012 09:52:35 -0800

I'm A/B testing Presidents. Some of you will get Mitt Romney, others Obama, still others will have Reanimated Millard Fillmore. I, and my fellow libertarians, are holding out for a) Zombie Barry Goldwater or b) Skeletal William Howard Taft.

By: DU

DU — Wed, 09 May 2012 09:53:01 -0800

By: zippy

zippy — Wed, 09 May 2012 09:55:00 -0800

A/B testing and ~~its bastard cousin~~ MVP are pretty ~~much the reason I hate~~ modern web ~~page~~ design. A/B posted by zoo on May 9 [One Beeeelion favorites +][!]

By: burnmp3s

burnmp3s — Wed, 09 May 2012 09:55:41 -0800

The MetaFilterB version of this thread is much better, I do not regret paying the $10 at all.

By: Cool Papa Bell

Cool Papa Bell — Wed, 09 May 2012 09:56:11 -0800

Yes, but, as you've read the article, you know that it's the real-time multivariate testing that is the real game changer here. I did read the article (when it came out a few weeks ago), and we were doing real-time testing back then, too! It's really easy to read a log file, dude.

By: infini

infini — Wed, 09 May 2012 10:11:22 -0800

So do you want to make a powerful tool for a small niche of dedicated users? Or do you want to have a product that most people will bother to use? The former is not really a route to profitability and growth... Well I think that's my point. Are you making a good tool or one that will be widespread? If your goal is money, the answer is obvious Context. Uber alles. Now, for example, I work with populations popularly known as the "Next 4 Billion" or "Next Billion" (as in "How to bring the next billion users online" type of next billion) aka the lower income segments outside the Global North. Are you making a good tool or one that will be widespread? It's not an either/or in this context - the tool must be good. It must be easy to understand and use. And it could conceivably be used by billions. People will not bother to use it if they can't figure out how this tech (even if it's just a stove, not software) works. Or, will make the effort, if it offers something lifechanging - like the mobile phone did. But once learnt, that small niche of dedicated users turned into the virtual torrent, many of whom simply memorized the Nokia UI as a pattern due to stronger retention skills (if illiterate).

By: zombieflanders

zombieflanders — Wed, 09 May 2012 10:22:54 -0800

The MetaFilterB version of this thread is much better, I do not regret paying the $10 at all. And all of us look so much better with goatees.

By: cavalier

cavalier — Wed, 09 May 2012 10:24:14 -0800

Not seeing why optimized delivery is soulless and empty -- because it's calculated? Is this an art versus science thing? Because I'm one of those analytical nerds who totally gets excited by the idea of +/- 10% variance on something as natural as a button label choice.

By: zippy

zippy — Wed, 09 May 2012 10:41:14 -0800

Not seeing why optimized delivery is soulless and empty Optimization isn't bad. Optimization without thought is, as it's a form of design by committee and tends to lack any particular direction. See also The Most Wanted Painting (here's the USA winner, available in dishwasher and television sizes)

By: roystgnr

roystgnr — Wed, 09 May 2012 10:49:50 -0800

Which Americans are in the B test for drug policy and public health care insurance?

It would be nice to see this philosophy applied to political decisions rather than just political campaigns, wouldn't it? Perhaps we could divide up into say, 50 subgroups, each of which would get to make its own decisions about things like drug laws and public health care...

By: IAmBroom

IAmBroom — Wed, 09 May 2012 11:03:15 -0800

How large is your startup, to the nearest $1B, again? So only owners of billion dollar companies are qualified to talk about testing metrics and effectiveness? That's just dumb. No, patrick54, "dumb" is claiming that Google "has made every wrong decision possible in the last 10 years", when they are the very paradigm of successful 21st-century startups. If revenue growth of a web-only business isn't some sort of metric about how good their web design decisions are, your metrics are self-congratulatory at best.

By: IAmBroom

IAmBroom — Wed, 09 May 2012 11:06:54 -0800

Which Americans are in the B test for drug policy and public health care insurance? It would be nice to see this philosophy applied to political decisions rather than just political campaigns, wouldn't it? Perhaps we could divide up into say, 50 subgroups, each of which would get to make its own decisions about things like drug laws and public health care... Which has long been supposed as part of the motivation, roystgnr, except that: the states have wildly differing starting conditions and control variables (access to waterways and coastlines, mineral wealth, climate), which makes it virtually impossible to draw meaningful conclusions from the decisions and their effects. Which plays into the hands of both political sides, while they're campaigning.

By: elwoodwiles

elwoodwiles — Wed, 09 May 2012 11:15:27 -0800

Google is profitable, yes, but their sense of design leaves a lot to be desired. And as evidence I would point to the myriad of google products that have completely failed over the years. (Looking at you, wave.) Google places? Ugly and unusable. Google Offers? Poorly thought-out and executed. G+? Surviving but far from flourishing. Each example is marked by deep flaws. But google's core profitability comes from one or two segments of their business that are so huge they paper over all the bad decisions. That's the value of being huge.

By: elwoodwiles

elwoodwiles — Wed, 09 May 2012 11:16:54 -0800

That last sentence is the B version.

By: patrick54

patrick54 — Wed, 09 May 2012 11:17:20 -0800

No, patrick54, "dumb" is claiming that Google "has made every wrong decision possible in the last 10 years", when they are the very paradigm of successful 21st-century startups. By your own standards since you haven't shown that you are owning a multi-billion dollar startup your input on this issue is not really worth discussing. BTW it's not really necessary to address me specifically. The point I'm making is not particular to my person.

By: elwoodwiles

elwoodwiles — Wed, 09 May 2012 11:17:22 -0800

Next to last, ack, nevermind....

By: twoleftfeet

twoleftfeet — Wed, 09 May 2012 11:17:36 -0800

I was with a company once that came up with a project which we tried to sell to Microsoft. They wanted some way to automate the choice of ads on the pages of MSN. So we made a front end script that would pull keywords from a given article on MSN, and then we essentially A/B tested different types of ads against those keywords, trying to find previously unpredicted relations between content and product type. Microsoft let us do a proof of concept run. MSN has a huge amount of traffic, and it only took a few days to get statistically significant data. After those days we presented our findings to Microsoft:
US: We're done testing the relationship between content and ads.
MSFT: What were the results?
US: We found that no matter what the content was, more people always clicked on Victoria's Secret ads.
MSFT: So you're saying that people like women in lingerie?
US: Uh, yeah... I guess so.
MSFT: We knew that already.
They didn't give us any more money.

By: Holy Zarquon's Singing Fish

Holy Zarquon's Singing Fish — Wed, 09 May 2012 11:37:59 -0800

Hey, at least you can consider your findings confirmed.

By: nat

nat — Wed, 09 May 2012 11:52:29 -0800

I worry that results from A/B testing could be misunderstood in the same way that other scientific data can be. There's a well-known publishing effect: scientists publish the studies that have statistically significant results. Also regression to the mean happens; you notice an effect, it's significant at however many sigma-- but then you do another study and the effect is less strong. Also of course stats itself gets misused: people who don't really understand a stats package can still get the package to make some kind of claim about the significance of their results. It seems to me all of these issues are present in a/b too.
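For readers who want to see what a significance claim for a split test rests on, here is a minimal sketch of a pooled two-proportion z-test using only the standard library. The function name and all the visitor and conversion counts are made up for illustration; they are not from any of the tests discussed in this thread.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rate.

    conv_a / conv_b are conversion counts, n_a / n_b are visitor counts.
    Uses the pooled standard error and a normal approximation.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided tail probability of a standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: the same 12% vs 18% rates at two sample sizes.
z_small, p_small = two_proportion_z_test(12, 100, 18, 100)
z_large, p_large = two_proportion_z_test(1200, 10_000, 1800, 10_000)

print(f"100 visitors per arm:    z={z_small:.2f}, p={p_small:.3f}")
print(f"10,000 visitors per arm: z={z_large:.2f}, p={p_large:.3g}")
```

The same observed gap is unconvincing at 100 visitors per arm and overwhelming at 10,000, which is why a one-off "significant" result from a small test tends to shrink when the test is repeated.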

By: cavalier

cavalier — Wed, 09 May 2012 11:55:57 -0800

That was a joke, right?

By: Ayn Rand and God

Ayn Rand and God — Wed, 09 May 2012 12:38:06 -0800

A/B testing can go wrong in very interesting ways. At one point we did the split based on userId modulo 2. At first the results looked promising, then someone noticed a weird 'coincidence'. No matter what change we implemented, we would get better retention in bucket 1. After an all-nighter we figured out that the load balancer also used userId modulo some multiple of 2 to assign servers. The users in the first bucket landed on the newer faster boxes. Another time, to teach the MBA p.m. about sample size and statistical significance a mathematician engineer made A identical to B for a one day one percent test. The P.M. found performance differences and wrote a whole design doc based on his findings. Bad A/B testing is worse than no testing, but good A/B testing is a very powerful tool.
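The bucket/load-balancer collision can be reproduced in a few lines. This is only a sketch under stated assumptions: the four-server balancer, the experiment name "button-color", and the salted-hash alternative are all hypothetical, not the system described above. It shows one common way to keep bucket assignment independent of anything else keyed on the user id.

```python
import hashlib

def bucket_modulo(user_id: int) -> int:
    """Naive split: the bucket is just the parity of the user id."""
    return user_id % 2

def bucket_hashed(user_id: int, experiment: str) -> int:
    """Salted split: hash the experiment name together with the user id."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).digest()
    return digest[0] % 2

def server_for(user_id: int, n_servers: int = 4) -> int:
    """Hypothetical load balancer: user id modulo the number of servers."""
    return user_id % n_servers

def servers_seen(bucket_fn, users):
    """Map each bucket to the set of servers its users land on."""
    seen = {0: set(), 1: set()}
    for u in users:
        seen[bucket_fn(u)].add(server_for(u))
    return seen

users = range(10_000)
naive = servers_seen(bucket_modulo, users)
salted = servers_seen(lambda u: bucket_hashed(u, "button-color"), users)

print("modulo split:", naive)   # each bucket is pinned to half the servers
print("salted split:", salted)  # both buckets are spread over every server
```

With the modulo split, bucket 0 only ever reaches servers 0 and 2 and bucket 1 only servers 1 and 3, so any speed difference between boxes shows up as a fake treatment effect. Salting per experiment also keeps the same users from always landing in the same bucket across different tests.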

By: infini

infini — Wed, 09 May 2012 12:42:57 -0800

Bad A/B testing is worse than no testing, but good A/B testing is a very powerful tool. But wouldn't this be true of any user research?

By: Ayn Rand and God

Ayn Rand and God — Wed, 09 May 2012 12:46:23 -0800

No. Lying to users or invading their privacy in the name of good research is not better than no research at all, even if it would be a very powerful tool.

By: stratastar

stratastar — Wed, 09 May 2012 13:26:04 -0800

I guess Google plus was A/B tested. I KEEED! Clinton did this in the 90s, it was called triangulation, now we just have "Do what Axelrod says." Oh man, I'll be here all day! If you haven't read this article by Jim Manzi on experimentation in the world, do. He makes a really great point about the difficulty of knowing the milieu of what you are actually experimenting within. Losing the forest for the walking ent trees, I guess. "But clinical trials place an enormous burden on being sure that the treatment under evaluation is the only difference between the two groups. And as experiments began to move from fields like classical physics to fields like therapeutic biology, the number and complexity of potential causes of the outcome of interest—what I term "causal density"—rose substantially. It became difficult even to identify, never mind actually hold constant, all these causes."

By: infini

infini — Wed, 09 May 2012 13:47:56 -0800

The success of Capital One, Fairbank told Fast Company, was predicated on its "ability to turn a business into a scientific laboratory where every decision about product design, marketing, channels of communication, credit lines, customer selection, collection policies and cross-selling decisions could be subjected to systematic testing using thousands of experiments." By 2000, Capital One was reportedly running more than 60,000 tests per year. And by 2009, it had gone from an idea in a conference room to a public corporation worth $35 billion. Do you think this approach can be viable for diametrically different contexts? The article is excellent btw, thank you. Though I must confess to not liking the dehumanizing aspect of RCTs.

By: lupus_yonderboy

lupus_yonderboy — Wed, 09 May 2012 13:56:29 -0800

I worked on the system that allows what they call "experiments" when I was an engineer at Google. It's a pretty awesome system, but I wasn't the only one to wonder about its overall efficacy. It has the trap common to all greedy algorithms which is that you can get to a local maximum that's much smaller than the global maximum and have no way to get out. It also doesn't deal with issues of aesthetics and coherency - and the worst part is that it's always interpreting the response at one distance removed. Consider for example two cases - in one case you get a response, click on it, and it's perfect - success! so you close the search page. In the other case you click on a response, and it's so bad you just give up. But these two cases are necessarily marked in the same bucket. As for "Google having made every decision wrong in the last ten years" - this is obviously a hyperbolic statement, but Google haven't actually come out with any really successful new products since 2004's Gmail, and even that one is probably still a net money loser for Google, it simply keeps users in their orbit so they can use AdWords and AdSense.
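The local-maximum trap is easy to show with a toy. The sketch below is plain greedy hill climbing over a made-up row of design variants, where each variant is a small tweak of its neighbours; the scores are hypothetical conversion rates and the code is not a description of Google's experiment system.

```python
def hill_climb(scores, start):
    """Greedy search: move to a neighbouring variant only if it scores higher."""
    pos = start
    while True:
        neighbours = [i for i in (pos - 1, pos + 1) if 0 <= i < len(scores)]
        best = max(neighbours, key=lambda i: scores[i])
        if scores[best] <= scores[pos]:
            return pos  # no neighbour is better, so the search stops here
        pos = best

# Hypothetical conversion rates (%) for a row of design variants.
scores = [1.0, 2.0, 3.0, 2.5, 2.0, 4.0, 6.0, 9.0, 7.0]

stuck = hill_climb(scores, start=0)   # only ever accepts small improvements
jumped = hill_climb(scores, start=5)  # starts from a big, initially odd-looking change

print("small steps from variant 0 stop at", stuck, "scoring", scores[stuck])
print("starting from variant 5 reaches", jumped, "scoring", scores[jumped])
```

Starting from variant 0, every step looks like a win until the search stalls at 3.0%, and the 9.0% variant is never seen because getting there means first accepting a worse result. That is the case for occasionally testing large changes rather than only adjacent tweaks.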

By: wierdo

wierdo — Wed, 09 May 2012 14:42:26 -0800

Why do people like the ottoman better if it appears to the left of the throw rug than if it appears to the right? There's no time to ask the question, and no reason to answer it. After all, what does it matter if you can get the right result?

No reason to answer it? It's really better to flail around in the dark and not direct your efforts at all? Understanding allows for even better, more focused, and more responsive design. By all means try things that you don't expect to work, lest your thinking become ossified and miss some new trend, but to not have enough curiosity to both want to know why and understand why asking why is important? What a sad life some people must live.

By: Holy Zarquon's Singing Fish

Holy Zarquon's Singing Fish — Wed, 09 May 2012 14:59:17 -0800

Google haven't actually come out with any really successful new products since 2004's Gmail Android has a majority share of the smartphone market. Obviously Google doesn't directly profit from this what with the giving it away to phone manufacturers, but they give away Gmail too.

By: darth_tedious

darth_tedious — Wed, 09 May 2012 15:44:05 -0800

Many of the objections to split-testing are, in practice, not that great an issue: Split-tests frequently have multiple goals, each of which is separately scored. So if making the button blue rather than green increases clicks, but fewer of those that click wind up buying, or fewer buy the second-order, upsell offerings... well, an effectively designed split-test can usually ferret that out, and then you can do the math on what is most profitable/effective, over a given period of time. For that matter, tests often are not simply A/B, but multi-variate (composed of multiple A/B/C/N combinations), or meta-splits, with MVT #1 competing against MVT #2 and MVT #3. Obviously, the bigger the sample size, the faster and better you can get useful data out of such a test.
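As a rough illustration of scoring several goals separately, the sketch below compares two button colours on click rate and on revenue per visitor. The class, the field names, and every number are invented for the example.

```python
from dataclasses import dataclass

@dataclass
class VariantResult:
    name: str
    visitors: int
    clicks: int
    purchases: int
    revenue: float  # total revenue, upsells included

    @property
    def click_rate(self) -> float:
        return self.clicks / self.visitors

    @property
    def purchase_rate(self) -> float:
        return self.purchases / self.visitors

    @property
    def revenue_per_visitor(self) -> float:
        return self.revenue / self.visitors

# Hypothetical results: blue gets more clicks, green gets more buyers.
green = VariantResult("green", visitors=10_000, clicks=800, purchases=120, revenue=6000.0)
blue = VariantResult("blue", visitors=10_000, clicks=1000, purchases=90, revenue=4500.0)

variants = [green, blue]
best_by_clicks = max(variants, key=lambda v: v.click_rate)
best_by_revenue = max(variants, key=lambda v: v.revenue_per_visitor)

for v in variants:
    print(f"{v.name}: clicks {v.click_rate:.1%}, "
          f"purchases {v.purchase_rate:.1%}, "
          f"revenue/visitor ${v.revenue_per_visitor:.2f}")
print("winner on clicks: ", best_by_clicks.name)
print("winner on revenue:", best_by_revenue.name)
```

Here the two goals disagree, which is the situation described above: the click metric alone would pick blue, while the downstream metric picks green.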

By: wildcrdj

wildcrdj — Wed, 09 May 2012 16:04:00 -0800

As someone who has run (and is currently running) all sorts of A/B tests and experiments at Google, I can say that some people in this thread have the wrong idea. We don't just make up random ideas and throw them into experiments with no understanding of what is going on. But there is sometimes a disconnect between what you expect users to do and what they actually do, and this is where it comes in: validating what you expect your change to accomplish. If you don't have a working hypothesis, you can't really say what your experiment did. Also, much of the time this is used for fairly technical things, so it's more like: "did this change break anything?". Especially for stuff that might break based on end-user environments, where there is a limit to how much you can test in a lab (some of the HTML5 stuff I've worked on comes to mind). So I agree with DU: "You really need to know what you are doing, not just generate random changes and A/B test them". Which is why we don't do that, even though I sometimes think that's the external perception.

By: pompomtom

pompomtom — Wed, 09 May 2012 16:35:46 -0800

>As useful as it is, this is the danger of A/B testing. Some people feel like it liberates them from choosing appropriate variants for both A and B. If something nonsensical gets more clicks, how can it be wrong?
>Romney's the B. Hence, the A is better.
These comments are made for each other.

By: storybored

storybored — Wed, 09 May 2012 20:52:25 -0800

If you want to know the limitations of A/B testing, I'd say take a look at the disaster that is Gmail's user interface.

By: pwnguin

pwnguin — Wed, 09 May 2012 22:48:17 -0800

storybored: "If you want to know the limitations of A/B testing, I'd say take a look at the disaster that is Gmail's user interface."
Let's imagine two people randomly assigned to use email only via A) Gmail UI or B) Yahoo UI. Which person would you prefer to be?

By: infini

infini — Wed, 09 May 2012 22:58:48 -0800

The dedicated child who borrows Dad's Eudora CD?

By: mrgrimm

mrgrimm — Wed, 09 May 2012 23:26:42 -0800

>Google haven't actually come out with any really successful new products since 2004's Gmail
>Android has a majority share of the smartphone market. Obviously Google doesn't directly profit from this what with the giving it away to phone manufacturers, but they give away Gmail too.
Google Analytics.

By: twoleftfeet

twoleftfeet — Thu, 10 May 2012 00:13:40 -0800

If there are any serious usability testing geeks around here, I've got to say that A/B testing is dramatically less efficient, and possibly much more misleading, than the statistical methods of experimental design.

Here's the problem. Traditional A/B testing of web pages manipulates one variable at a time; you make the button say "Click Me" or "Learn More" or "Touch Me" or whatever. You start by presenting the options at random with equal probability, then refine the selection based on feedback from click-throughs or some other metric. So you have data associated with the options presented for that variable. Then you move on to some other web page variable: "Move the logo to the left", "Switch to sans-serif instead of serif", "Make the background color brighter."

Really, there are a zillion possible variables, and the only way anything gets done is that people have genuine insights that can limit the variables you need to A/B test. But my point is that even if you can narrow down the variables to a reasonably small number, the traditional methodology of A/B testing can be downright misleading. By testing one variable at a time, and then moving on to a different variable, the possible relationships between the variables are ignored. So your tests tell you that the button should say "Touch Me" and the logo should be more red, but they don't tell you that the combination of the two is terrible.

There are well-developed methods in statistics, in experimental design, which allow you to test, say, 8 or 24 variables at the same time, in an efficient sequence, and which consider relationships between the variables. There are many services that aim to increase usability or conversion rate, but if all they're doing is A/B testing then they can actually make things worse, or at least suboptimal.
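The "Touch Me" plus red logo problem can be sketched with a 2x2 full factorial, the smallest design that measures every combination. The conversion rates below are invented for illustration, and the sketch assumes each one-variable test is run against the same unchanged baseline page; `one_at_a_time`, `full_factorial` and `interaction` are hypothetical helper names, not part of any testing product.

```python
# Hypothetical conversion rates for every (button text, logo colour) cell.
rates = {
    ("Click Me", "original"): 0.050,  # baseline page
    ("Touch Me", "original"): 0.060,
    ("Click Me", "red"): 0.058,
    ("Touch Me", "red"): 0.040,
}
baseline = ("Click Me", "original")


def one_at_a_time(rates, baseline):
    """Pick each factor's best level while the other stays at baseline."""
    buttons = {button for button, _ in rates}
    logos = {logo for _, logo in rates}
    best_button = max(buttons, key=lambda b: rates[(b, baseline[1])])
    best_logo = max(logos, key=lambda l: rates[(baseline[0], l)])
    return (best_button, best_logo)


def full_factorial(rates):
    """Pick the best cell after measuring every combination."""
    return max(rates, key=rates.get)


def interaction(rates):
    """Effect of the button change with a red logo minus its effect without."""
    with_red = rates[("Touch Me", "red")] - rates[("Click Me", "red")]
    with_original = rates[("Touch Me", "original")] - rates[("Click Me", "original")]
    return with_red - with_original


oaat_choice = one_at_a_time(rates, baseline)
factorial_choice = full_factorial(rates)

print("one at a time picks:", oaat_choice, rates[oaat_choice])
print("full factorial picks:", factorial_choice, rates[factorial_choice])
print("interaction:", round(interaction(rates), 3))
```

With these made-up numbers, the two separate tests each report an improvement, yet the page that combines both "winners" converts worse than the baseline. The nonzero interaction term is what the one-variable tests cannot see.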

By: mrgrimm

mrgrimm — Thu, 10 May 2012 06:20:54 -0800

>So your tests tell you that the button should say "Touch Me" and the logo should be more red, but they don't tell you that the combination of the two is terrible.
Yes. They do. That's kinda the whole point of multivariate testing. You get every combination of variants. Yes, of course, the variants need consideration, but these combinations are definitely considered.

By: mrgrimm

mrgrimm — Thu, 10 May 2012 07:24:58 -0800

>The article is excellent btw, thank you. Though I must confess to not liking the dehumanizing aspect of RCTs.
That's it, innit? Someone above asked why optimization was depressing. It's b/c SEO has become more important than quality original content. It's because of F2P, 69 tracking cookies on every page, and Facebook spam. The marketers are winning.

By: mrgrimm

mrgrimm — Thu, 10 May 2012 07:33:23 -0800

... and not to continue the very distantly related derail ... Google example:
>If you use GA on your website, Google would not only know all of the IP addresses (and other browser unique identifiers) of visitors to your website (and which pages they looked at on your site), but because most other websites use GA as well (or another Google product), Google would know for a visitor to your site, e.g. the 6 other websites that person visited earlier today and the 367 websites he looked at in the last month. Because more than 50% of all websites on the Internet use Google Adsense or another Google product using tracking beacons, Google is able to build a very accurate picture of m websites visited by any given user.
Y'all realize how many sites use GA? And that Google is merging all their privacy policies? I'd say it's their most successful product yet.

By: wildcrdj

wildcrdj — Thu, 10 May 2012 16:18:21 -0800

>If you want to know the limitations of A/B testing, I'd say take a look at the disaster that is Gmail's user interface.
If you're referring to the recent UI change, this was _not_ done via A/B testing, quite explicitly. It was the result of actual designers. Now, obviously it's been met with mixed reviews, but it's actually an example of Google not doing the n-shades-of-blue approach.

By: infini

infini — Thu, 10 May 2012 23:00:38 -0800

>If you're referring to the recent UI change, this was _not_ done via A/B testing, quite explicitly. It was the result of actual designers. Now, obviously it's been met with mixed reviews, but it's actually an example of Google not doing the n-shades-of-blue approach.
With all due respect to you, wildcrdj, and this isn't personal. But FAIL.

By: mrgrimm

mrgrimm — Fri, 11 May 2012 10:42:21 -0800

I like the gmail interface. The interface is not the problem. Missing functionality and occasional service issues are the problems for me.

By: dgran

dgran — Wed, 30 May 2012 13:59:01 -0800

I was surprised to find so many dismissive and sarcastic comments about A/B testing. Most stuff gets built and thrown at users with hardly any testing and a cursory "looks good to me" comment from the boss. I suppose you can overdo anything, but I would like to see more testing done to find out what actually works instead of finger-to-the-wind speculation.

By: jeffburdges

jeffburdges — Fri, 08 Jun 2012 08:03:03 -0800

"The trouble with the world is that the stupid are cocksure and the intelligent are full of doubt." -Bertram Russell