Numbers average up but words pile on.
December 12, 2019 12:00 PM

Combining Probability Forecasts: 60% and 60% Is 60%, but Likely and Likely Is Very Likely. "... imagine that you are purchasing a plane ticket for your next vacation and you check two websites, Kayak and Hopper, to see if they predict any future price changes. If both websites say that there is a 60% chance that prices will increase, you would typically average the two and also believe there is a 60% chance. However, if both sites say that it is “likely” that prices will increase, you would act as if you are “counting” each prediction as a positive signal, becoming more confident in your prediction and believing that a price increase is “very likely.” " Also: Verbal probabilities: Very likely to be somewhat more confusing than numbers.
posted by storybored (14 comments total) 26 users marked this as a favorite
 
Have only skimmed the paper, but it seems to make intuitive sense. If a bunch of independent forecasters all say X is 60% likely, I'll probably conclude that it is very likely that the likelihood is 60%. It's relatively easy to keep numbers and words separate.

But if they all say it is "likely," I have to take a lot of care not to collapse my judgment that X is very likely to be likely into a simple judgment that X is very likely.

This seems like it might be closely related to the trick prosecutors like to play, where they get a conviction not by proving beyond a reasonable doubt that the defendant committed the crime, but by proving beyond a reasonable doubt that the defendant probably committed the crime.

Given the "high variability" in how the subjects interpreted verbal probability reports, I wonder if relative accuracy might correlate with tolerance for ambiguity?
posted by Not A Thing at 12:41 PM on December 12, 2019 [1 favorite]


Our brains appear to be broken in specific ways. The sad part is that people are figuring out what these ways are, and are exploiting this for their own ends.
posted by tallmiddleagedgeek at 12:52 PM on December 12, 2019 [5 favorites]


Have only skimmed the paper, but it seems to make intuitive sense.
posted by Not A Thing at 3:41 PM on December 12


Having also just skimmed this paper, it seems likely that this paper is on to something.

The next logical statement is: this must be true!
posted by Nanukthedog at 12:55 PM on December 12, 2019 [11 favorites]


Depending on the circumstances, the verbal approach of adding probabilities may actually be correct.

If two people have access to the same data and perspectives and both conclude that the probability of an event is 60%, averaging their estimates makes sense. But if they have access to different pieces of information and they both conclude that the probability is 60%, the actual probability is likely to be higher than 60%.

Philip Tetlock explores this idea in one of his books (I think it's Superforecasting). I think his example was about the intelligence on bin Laden's location when Obama authorized the mission that killed him. Imagine that the CIA, military, and State Department leaders all believe that bin Laden is in the compound, but each with confidence only barely better than a coin flip. Putting all their perspectives together gives more information than any one of them has individually, because each is making a judgment based on a different set of information. So if they each give it a 60% probability, the actual likelihood is higher, because their areas of information complement each other.
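
Here's a quick back-of-the-envelope Python sketch of that (the flat 50% prior and the full independence of the three assessments are my assumptions, not Tetlock's): treat each agency's 60% as an independent piece of evidence and combine in odds space rather than averaging.

    # Combine independent probability reports that each start from the
    # same prior, by multiplying likelihood ratios in odds space.
    def pool_independent(reports, prior=0.5):
        prior_odds = prior / (1 - prior)
        odds = prior_odds
        for p in reports:
            odds *= (p / (1 - p)) / prior_odds  # each report's likelihood ratio
        return odds / (1 + odds)

    print(pool_independent([0.6, 0.6, 0.6]))  # ~0.77 -- three 60%s beat one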
posted by philosophygeek at 2:58 PM on December 12, 2019 [7 favorites]


Here we get into a question of what probability even means. The frequentist interpretation is the easiest to understand and work with, but it fails when we ask for the probability that this particular thing (which is definitely either true or false) turns out to be true. In that case, if the belief estimates rest on the same body of evidence, you can only average them; but if different evidence is marshalled, you can sometimes validly conclude greater confidence.
posted by sjswitzer at 3:40 PM on December 12, 2019 [1 favorite]


Oh duh, I was quoting from the wrong paper. Which a clever person could probably turn into some sort of example of the challenges humans face in processing multiple ideas at once.

This paper (Mislavsky & Gaertig) is thoroughly unconcerned with individual variability. They phrase all their conclusions in terms of "people," whatever those are. Which makes sense because each of their 7 studies contains some variation on:
We recruited 205 participants (35.0% female, mean age = 33.7 years) on Amazon’s Mechanical Turk (MTurk). Participants who completed the survey were paid $0.35.
Given those criteria, it would be hard to reach any conclusions about anything more specific than "people."
posted by Not A Thing at 5:20 PM on December 12, 2019


Just came here to echo what philosophygeek said. Unless two sources have identical information and processing, the correct inference after two reports of 60% is not 60%, but higher. That doesn't change the fact that numbers and words are being treated inconsistently, but the incorrect inference is the numeric example, not the verbal example.
posted by chortly at 6:26 PM on December 12, 2019


the correct inference after two reports of 60% is not 60%, but higher.

So the correct inference from two reports of 40% would be less than 40%? (Even if 40% were, in context, a surprisingly optimistic projection?) This puzzles me.
posted by Not A Thing at 7:11 PM on December 12, 2019 [1 favorite]


Huh. Nifty. This is definitely one of those papers where at first I'm just impressed someone got a paper out of something so trivial, and then I realize I hadn't quite grasped the implications, so good on them.

(If you care, IMO the best object of this class is Gelman's Differences in Statistical Significance are not Themselves Statistically Significant.)

Unless two sources have identical information and processing, the correct inference after two reports of 60% is not 60%, but higher

This is certainly not true as a blanket claim. I'd say it's usually false. If multiple reports are actually estimating a probability independently, the estimates should bracket the true probability.

I'd say it's most true not when a genuine probability is in play but when what people mean is closer to "the percentage of the evidence I'd need in order to believe something." If you're 60% sure I killed someone because you saw me with a smoking gun, and philosophygeek is 60% sure I did it because he saw me walk in and out of the victim's room at the time the murder happened, you've got a pretty good case. But that's a pretty narrow situation, I'd argue.

IME it's equally likely to go the other way in practice. "A lot of people 60% sure something is true" is a warning sign to me--if so many people believe it why don't we have better evidence? Why can't we get better than that?
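
To see the bracketing point, here's a toy simulation (all numbers made up): two forecasters independently estimate the same underlying probability from their own samples, and averaging them gets you closer to the truth rather than licensing a more extreme answer.

    import random

    random.seed(0)
    true_p = 0.6
    solo_err = avg_err = 0.0
    trials = 10000
    for _ in range(trials):
        # each forecaster sees their own 50 independent observations
        est1 = sum(random.random() < true_p for _ in range(50)) / 50
        est2 = sum(random.random() < true_p for _ in range(50)) / 50
        solo_err += abs(est1 - true_p)
        avg_err += abs((est1 + est2) / 2 - true_p)

    print(solo_err / trials)  # typical single-forecaster error
    print(avg_err / trials)   # smaller: the average brackets the truth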
posted by mark k at 8:00 PM on December 12, 2019 [3 favorites]


Yeah, I guess I was thinking more in the second sense. If I see a coin come up heads 7 out of 10 times and you see it come up heads 5 out of 10 times, obviously the best guess is 60%. But if Kayak predicts that a price rise is 60% probable based on price history, and Hopper predicts a price rise is 60% probable based on weather forecasts, I would conclude that a price rise is more than 60% probable. But I don't know how to formalize that, and it may be just another way that our brains screw up probability.
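
One textbook way to formalize it, for what it's worth (my sketch, and the shared 50% prior is an assumption): when the evidence is the same kind, pool the raw counts; when the evidence streams are independent, multiply likelihood ratios in odds space.

    # Case 1: same kind of evidence -- pool the raw counts.
    print((7 + 5) / (10 + 10))  # 0.6, i.e., averaging is right here

    # Case 2: independent evidence streams -- treat each site's 60% as an
    # update from a shared 50% prior (assumption) and multiply odds.
    odds = 1.0  # even prior odds
    for report in (0.6, 0.6):
        odds *= report / (1 - report)  # likelihood ratio vs. even odds
    print(odds / (1 + odds))  # ~0.69: more than 60%, as intuition says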
posted by chortly at 8:59 PM on December 12, 2019


So the correct inference from two reports of 40% would be less than 40%? (Even if 40% were, in context, a surprisingly optimistic projection?) This puzzles me.

It depends on who regards 40% as optimistic.

Bayesian inference, which is what is being glossed here, tells us how to update our beliefs in the light of new information, but not how to form our initial beliefs. This shortcoming is traditionally papered over by the principle of indifference, which states that in the absence of any information, a 50% probability should be assigned to a yes/no proposition. This rule, to the extent that it is actually observed, gives 50% its flimsy status as the dividing line between "encouraging" and "discouraging" forecasts.

If I know a forecaster is following this rule, then their sub-50% forecast is an indication to me that they possess information that caused them to revise their expectation downward -- information that argues against the event being forecast. It is then rational for me to revise my own expectation downward, wherever it stands currently. Even if my own forecast prior to that time is 25%, a forecast of 40% from someone who started at 50% should cause me to drop my forecast still lower, not bring it up toward 40%. Another forecast of 40% from a second, differently-informed agent who also started at 50% should have the same effect again.

On the other hand, a forecast of 40% from someone who initially believed the likelihood of the event to be less than 40% is a positive signal indicating that they got new information that argues in favor of the event being forecast. So I can't properly make use of a "40% forecast" (integrating it with other information) without knowing something of the methodology that produced it.
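
To put numbers on that (a sketch; my 25% starting point and the shared-prior model are illustrative assumptions):

    # A 40% forecast from someone who started at 50% carries likelihood
    # ratio (0.4/0.6) / (0.5/0.5) = 2/3 -- evidence against the event.
    def update(my_prior, their_prior, their_forecast):
        # likelihood ratio implied by their move from prior to forecast
        lr = (their_forecast / (1 - their_forecast)) / (their_prior / (1 - their_prior))
        odds = my_prior / (1 - my_prior) * lr
        return odds / (1 + odds)

    p = update(0.25, 0.5, 0.4)
    print(p)                    # ~0.18: my 25% goes down, not up toward 40%
    print(update(p, 0.5, 0.4))  # ~0.13: a second independent 40% lowers it again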
posted by aws17576 at 10:02 PM on December 12, 2019 [3 favorites]


If the above sounds weirdly dependent on subjective factors, maybe this will clarify. When I teach Bayes' Theorem, I give students a contrived problem where we're trying to figure out if some (fictional) coffee beans are decaf or regular. The "evidence" comes from repeated observations of whether a (fictional) subject acts wired after drinking the coffee; we assume the subject will appear wired some percentage of the time after drinking regular coffee, and some smaller percentage of the time after drinking decaf.

Now here's the twist: I ask each student to start with a different prior belief about the probability that the coffee is regular. Some might start at 20 percent, some at 40, and some at 75. But they all hear the same evidence, and use the same math (Bayes' Theorem) to update their belief after each observation. What happens is that each observation causes all the students to revise their belief in the same direction -- up each time the subject acts wired, down each time the subject doesn't act wired -- regardless of starting point. The ones who assigned the highest initial probabilities to the regular coffee hypothesis end up with the highest final probabilities, but everyone agrees on which way the evidence tended (and then we discuss how a sufficiently large number of observations would cause everyone's belief to incline toward zero or one).

If one student in this exercise wasn't listening when I said "Day 4: Subject acts wired," they could figure it out by looking at a neighbor's paper. It wouldn't help them to know that the neighbor currently assigned a 71 percent probability to the regular coffee hypothesis. But it would help them to know that the neighbor had, on the basis of what I said, revised their belief from 65 to 71 percent. The direction is the tell.
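
In code, the exercise looks roughly like this (the 80%/30% "wired" rates and the observation sequence are made-up numbers, and in class the priors are the students' own):

    # Sequential Bayes updates: same evidence, same likelihoods,
    # different priors. Everyone moves in the same direction.
    P_WIRED = {"regular": 0.8, "decaf": 0.3}  # assumed likelihoods

    def update(p_regular, wired):
        like_reg = P_WIRED["regular"] if wired else 1 - P_WIRED["regular"]
        like_dec = P_WIRED["decaf"] if wired else 1 - P_WIRED["decaf"]
        num = p_regular * like_reg
        return num / (num + (1 - p_regular) * like_dec)

    beliefs = [0.20, 0.40, 0.75]  # three students' priors
    for wired in [True, True, False, True]:
        beliefs = [update(p, wired) for p in beliefs]
        print([round(p, 2) for p in beliefs])  # all rise or all fall together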
posted by aws17576 at 10:27 PM on December 12, 2019 [2 favorites]


It does seem like there's a fundamental difference between situations like that, where the "true" probability of the coffee being decaf is either 0 or 1, and forecasting situations like "what is the likelihood that the coffee will be decaf tomorrow," where the "true" probability (as of today) is almost certainly not 0 or 1.

To take a more common forecasting situation, if two weather forecasters using different models both predict that there is a 60% chance of a hailstorm in [City] tomorrow, even if I would ordinarily expect the probability of hail to be < 1%, my inference would be not that it is certain to hail tomorrow, but that the best evidence as of today indicates there's about a 60% chance. Not sure if I'm thinking about this right, but that's the kind of scenario I was imagining in the context of the pricing forecasts in the OP.
posted by Not A Thing at 8:00 AM on December 13, 2019


Any excuse to post my favorite declassified CIA publication, The Definition of Some Estimative Expressions:
Finished intelligence, particularly in making estimative statements, uses a number of modifiers like "highly probable," "unlikely," "possible" that can be thought of as expressing a range of odds or a mathematical probability, and these are supplemented by various other expressions, especially verb forms, conveying the sense of probability less directly -- "may," "could," "we believe." Certain other words express not probability but quantity, imprecisely but perhaps within definable ranges -- "few," "several," "considerable." Some people object to any effort to define the odds or quantities meant by such words. They argue that context always modifies the meaning of words and, more broadly, that rigid definitions deprive language of the freedom to adapt to changing needs.

It is possible, however, to state the definitions in quantitative terms without making them artificially precise. And if two-thirds of the users and readers of the word probably, for example, feel it conveys a range of odds between 6 and 8 out of 10, then it is more useful to give it this definition than to define it more or less tautologically in terms of other words of probability. This would not deny to context its proper role as the arbiter of value, but only limit the range of its influence. Nor would it freeze the language in perpetuity; as the meanings of the words evolved the quantitative ranges could be changed.

This article describes the results of a survey undertaken to determine if such words are indeed understood as measurable quantities and if so to ascertain the extent to which there is a consensus about the quantitative range of each. A three-part questionnaire on the subject was distributed in the intelligence community -- to INR/State, the DIA Office of Estimates, and five CIA offices -- and a simplified version of it was sent to policy staffs in the White House, State, and the Pentagon. Responses were received from 240 intelligence analysts and 63 policy officers.

The responses showed a satisfactory consensus with respect to various usages of likely and probable, phrases expressing greater certainty than these, and modifications of chance -- good, better-than-even, slight. There was no satisfactory agreement on the meaning of possible or a wide variety of verb forms such as we believe and might. There was also little agreement on the non-odds quantitative words such as few and many. The policy offices consistently assigned lower probabilities than intelligence analysts did. Correlation between values assigned in and out of context was good.
posted by caek at 9:34 AM on December 14, 2019 [2 favorites]



