Raising the forecasting waterline (part 1)

Previously: Raising the waterline, see also: 1001 PredictionBook Nights (LW copy), Techniques for probability estimates

Low waterlines imply that it's relatively easy for a novice to outperform the competition. (In poker, as discussed in Nate Silver's book, the "fish" are those who can't master basic techniques such as folding when they have a poor hand, or calculating even roughly the expected value of a pot.) Does this apply to the domain of making predictions? It's early days, but it looks as if a smallish set of tools - a conscious status quo bias, respecting probability axioms when considering alternatives, considering reference classes, leaving yourself a line of retreat, detaching from sunk costs, and a few more - can at least place you in a good position.

A bit of backstory

Like perhaps many LessWrongers, my first encounter with the notion of calibrated confidence was "A Technical Explanation of Technical Explanation". My first serious stab at publicly expressing my own beliefs as quantified probabilities was the Amanda Knox case - an eye-opener, waking me up to how everyday opinions could correspond to degrees of certainty, and how these had consequences. By the following year, I was trying to improve my calibration for work-related purposes, and playing with various Web sites, like PredictionBook or Guessum (now defunct).

Then the Good Judgment Project was announced on Less Wrong. Like several of us, I applied, unexpectedly got in, and started taking forecasting more seriously. (I tend to apply myself somewhat better to learning when there is a competitive element - not an attitude I'm particularly proud of, but being aware of that is useful.)

The GJP is both a contest and an experimental study, in fact a group of related studies: several distinct groups of researchers (1,2,3,4) are being funded by IARPA to each run their own experimental program. Within each, a small or large number of participants have been recruited, allocated to different experimental conditions, and encouraged to compete with each other (or even, as far as I know, for some experimental conditions, to collaborate with each other). The goal is to make predictions about "world events" - and if possible to get them more right, collectively, than we would individually.1

Tool 1: Favor the status quo

The first hint I got that my approach to forecasting needed more explicit thinking tools was a blog post by Paul Hewitt I came across late in the first season. My scores in that period (summer 2011 to spring 2012) had been decent but not fantastic; I ended up 5th on my team, which itself placed quite modestly in the contest.

Hewitt pointed out that in general, you could do better than most other forecasters by favoring the status quo outcome.2 This may not quite be on the same order of effectiveness as the poker advice to "err on the side of folding mediocre hands more often", but it makes a lot of sense, at least for the Good Judgment Project (and possibly for many of the questions we might worry about). Many of the GJP questions refer to possibilities that loom large in the media at a given time, that are highly available - in the sense of the availability heuristic. This results in a tendency to favor forecasts of change from the status quo.

For instance, one of the Season 1 questions was "Will Marine LePen cease to be a candidate for President of France before 10 April 2012?" (also on PredictionBook). Just because the question is being asked doesn't mean that you should assign "yes" and "no" equal probabilities of 50%, or even close to 50%, any more than you should assign 50% to the proposition "I will win the lottery".

Rather, you might start from a relatively low prior probability that anyone who undertakes something as significant as a bid for national presidency would throw in the towel before the contest even starts. Then, try to find evidence that positively favors a change. In this particular case, there was such evidence - the National Front, of which she was the candidate, consistently reports difficulties rounding up the endorsements required to register a candidate legally. However, only once in the past (1981) had this resulted in their candidate being barred (admittedly a very small sample). It would have been a mistake to weigh that evidence excessively. (I got a good score on that question, compared to the team, but definitely owing to a "home ground advantage" as a French citizen rather than to any superior forecasting skill.)

Tool 2: Flip the question around

The next technique I try to apply consistently is respecting the axioms of probability. If the probability of event A is 70%, then the probability of not-A is 30%.

This may strike everyone as obvious... it's not. In Season 2, several of my team-mates are on record as assigning a 75% probability to the proposition "The number of registered Syrian conflict refugees reported by the UNHCR will exceed 250,000 at any point before 1 April 2013".

That number was reached today, six months in advance of the deadline. This was clear as early as August. The trend in the past few months has been an increase of 1000 to 2000 a day, and the UNHCR have recently provided estimates that this number will eventually reach 700,000. The kicker is that this number is only the count of people who are fully processed by the UNHCR administration and officially in their database; there are tens of thousands more in the camps who only have "appointments to be registered".

I've been finding it hard to understand why my team-mates haven't been updating to, maybe not 100%, but at least 99%; and how one wouldn't see these as the only answers worth considering. At any point in the past few weeks, to state your probability as 85% or 91% (as some have quite recently) was to say, "There is still a one in ten chance that the Syrian conflict will suddenly stop and all these people will go home, maybe next week?"

This is kind of like saying "There is a one in ten chance Santa Claus will be the one distributing the presents this year." It feels like a huge "clack".

I can only speculate as to what's going on there. Queried for a probability, people are translating something like "Sure, A is happening" into a biggish number, and reporting that. They are totally failing to flip the question around and explicitly consider what it would take for not-A to happen. (Perhaps, too, people have been so strongly cautioned, by Tetlock and others, against being overconfident that they reflexively shy away from the extreme numbers.)

Just because you're expressing beliefs as percentages doesn't mean that you are automatically applying the axioms of probability. Just because you use "75%" as a shorthand for "I'm pretty sure" doesn't mean you are thinking probabilistically; you must train the skill of seeing that for some events, the complementary "25%" also counts as "I'm pretty sure". The axioms are more important than the use of numbers - in fact for this sort of forecast "91%" strikes me as needlessly precise; increments of 5% are more than enough, away from the extremes.
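
To make the "flip" concrete, here is a minimal Python sketch that restates a stated probability of A as the implied chance of not-A. The helper names `complement` and `one_in_n_against` are made up for illustration; the stated probabilities are the ones mentioned above.

```python
def complement(p_yes: float) -> float:
    """Probability of not-A implied by a stated probability of A."""
    return 1.0 - p_yes


def one_in_n_against(p_yes: float) -> float:
    """Express P(not-A) as a 'one in N' chance (hypothetical helper)."""
    return 1.0 / complement(p_yes)


# Stated probabilities taken from the refugee question discussed above.
for stated in (0.75, 0.85, 0.91, 0.99):
    print(
        f"P(A) = {stated:.0%} -> P(not-A) = {complement(stated):.0%}, "
        f"roughly one in {one_in_n_against(stated):.0f}"
    )
```

Reading the right-hand column aloud ("roughly one in four that the refugee count stays below 250,000") is the whole exercise: if that sentence sounds absurd, the number on the left was too low.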

Tool 3: Reference class forecasting

The order in which I'm discussing these "basics of forecasting" reflects not so much their importance, as the order in which I tend to run through them when encountering a new question. (This might not be the optimal order, or even very good - but that should matter little if the waterline is indeed low.)

Using reference classes was actually part of the "training package" of the GJP. From the linked post comes the warning that "deciding what's the proper reference class is not straightforward". And in fact, this tool only applies in some cases, not systematically. One of our recently closed questions was "Will any government force gain control of the Somali town of Kismayo before 1 November 2012?". Clearly, you could spend quite a while trying to figure out an appropriate reference class here. (In fact, this question also stands as a counter-example to the "Favor status quo" tool, and flipping the question around might not have been too useful either. All these tools require some discrimination.)

On the other hand, it came in rather handy in assessing the short-term question we got in late September: "What change will occur in the FAO Food Price index during September 2012?" - with barely two weeks to go before the FAO was to post the updated index in early October. More generally, it's a useful tool when you're asked to make predictions regarding a numerical indicator, for which you can observe past data.

The FAO price data can be retrieved as a spreadsheet (.xls download). Our forecast question divided the outcomes into five: A) an increase of 3% or more, B) an increase of less than 3%, C) a decrease of less than 3%, D) a decrease of more than 3%, E) "no change" - meaning a change too small to alter the value rounded to the nearest integer.

It's not clear from the chart that there is any consistent seasonal variation. A change of 3% would have been about 6.4 points; since August 2011 there had been four month-on-month changes of that magnitude, three decreases and one increase. Based on that reference class, the probability of a small change (B+C+E) came out to about 2/3, and the probability of "no change" (E) to about 1/12 (the August price was the same as the July price). The probability of an increase (A+B) was roughly the same as that of a decrease (C+D). My first-cut forecast allocated the probability mass across A/B/C/D/E as follows: 15/30/30/15/10.

However, I figured I did need to apply a correction, based on reports of a drought in the US that could lead to some food shortages. I took 10% probability mass from the "decrease" outcomes and allocated it to the "increase" outcomes. My final forecast was 20/35/25/10/10. I didn't mess around with it any more than that. As it turned out, the actual outcome was B! My score was bettered by only 3 forecasters, out of a total of 9.
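
The arithmetic behind this forecast can be replayed in a few lines of Python. This is only a sketch under stated assumptions: the 12-month window is inferred from the "1/12" figure, the even split of the 10-point drought correction is inferred from the final numbers, and `shift_mass` is a hypothetical helper, not anything the GJP provides.

```python
from fractions import Fraction

# Reference class: month-on-month changes in the FAO index since August 2011.
# ASSUMPTION: 12 monthly observations, inferred from the "1/12" figure.
months = 12
large_moves = 4   # changes of about 3% or more (three decreases, one increase)
unchanged = 1     # July -> August, same rounded value

p_large = Fraction(large_moves, months)     # outcomes A + D
p_small = 1 - p_large                       # outcomes B + C + E
p_unchanged = Fraction(unchanged, months)   # outcome E
print(p_large, p_small, p_unchanged)

# First-cut forecast from the post, in percentage points.
first_cut = {"A": 15, "B": 30, "C": 30, "D": 15, "E": 10}


def shift_mass(forecast, moves):
    """Return a copy of `forecast` after moving points between outcomes.

    `moves` is a list of (source, destination, points) tuples.
    """
    adjusted = dict(forecast)
    for source, destination, points in moves:
        adjusted[source] -= points
        adjusted[destination] += points
    # A forecast must keep the same total and never go negative.
    assert sum(adjusted.values()) == sum(forecast.values())
    assert all(value >= 0 for value in adjusted.values())
    return adjusted


# Drought correction: 10 points from the decreases to the increases.
# ASSUMPTION: split evenly, 5 points each from D to A and from C to B.
final = shift_mass(first_cut, [("D", "A", 5), ("C", "B", 5)])
print(final)
```

Keeping the base rates and the adjustment as separate steps makes it easy to see how much of the final forecast comes from the reference class and how much from the news-driven correction.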

Next up: lines of retreat, ditching sunk costs, loss functions

This post has grown long enough, and I still have 3+ tools I want to cover. Stay tuned for Part 2!



1 The GJP is being run by Phil Tetlock, known for his "hedgehog and fox" analysis of forecasting. At that time I wasn't aware of the competing groups - one of them, DAGGRE, is run by Robin Hanson (of OB fame) among others, which might have made it an appealing alternate choice if I'd known about it.

2 Unfortunately, the experimental condition Paul belonged to used a prediction market where forecasters played virtual money by "betting" on predictions; this makes it hard to translate the numbers he provides into probabilities. The general point is still interesting.

Comments


I've been finding it hard to understand why my team-mates haven't been updating to, maybe not 100%, but at least 99%; and how one wouldn't see these as the only answers worth considering. At any point in the past few weeks, to state your probability as 85% or 91% (as some have quite recently) was to say, "There is still a one in ten chance that the Syrian conflict will suddenly stop and all these people will go home, maybe next week?"

GJP has pointed out already that forecasters are not updating as fast as they could. I assume a lot of forecasters are like me in rarely updating their predictions.

(In season 2, out of distaste for the new UI, I've barely been participating at all.)

Talking about increments of 5% runs counter to my intuitions regarding good thinking about probability estimates. For most purposes, the difference between 90% and 95% is significantly larger than the difference between 50% and 55%. Think in logs.

Yes, near the extremes it makes a difference - but we're using a Brier scoring rule, averaged over all days a forecast is open. That makes thinking in logs less important - 99% isn't much worse than 100% on errors. I'll discuss that in pt.2 under 'loss function'.
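
A rough sketch of why the scoring rule matters here. It assumes the simple one-outcome form of the Brier penalty, (p - outcome)^2, which may differ from the exact variant the GJP uses, and shows a logarithmic penalty for contrast. Both function names are illustrative.

```python
import math


def brier_penalty(p_yes: float, happened: bool) -> float:
    """Squared error between the stated probability and the 0/1 outcome."""
    outcome = 1.0 if happened else 0.0
    return (p_yes - outcome) ** 2


def log_penalty(p_yes: float, happened: bool) -> float:
    """Negative log of the probability assigned to what actually happened."""
    p_assigned = p_yes if happened else 1.0 - p_yes
    return math.inf if p_assigned == 0.0 else -math.log(p_assigned)


# Penalty for being confidently WRONG at various stated probabilities.
for p in (0.90, 0.95, 0.99, 1.0):
    print(
        f"stated {p:.0%}, event did not happen: "
        f"Brier {brier_penalty(p, False):.4f}, log {log_penalty(p, False):.2f}"
    )
```

Under the squared-error rule the penalty for a wrong forecast is bounded at 1, so moving from 99% to 100% changes little; under a log rule the same move takes the worst-case penalty from finite to infinite.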

I really, really want to answer "shortly" and leave it at that.

But since you ask, 45% chance I'll do it tomorrow, 25% over the week-end, 10% Monday, and 20% later than Monday.

Predictions involving what I'm gonna do are trickier for me, because there's a feedback loop between the act of making a prediction, and my likelihood of taking the corresponding actions once the prediction has turned them into a public commitment; it's a complicated one which sometimes triggers procrastination, sometimes increased motivation.

Thanks. Your prediction is now recorded on PredictionBook.

I hope you don't take it personally, but my estimate that you'll have the essay ready by tomorrow is lower than yours. Even those who, like Kahneman, know that the inside view yields overoptimistic estimates in cases of this sort tend to rely on it more than they should.

Of course, the fact that I'm making this prediction might also enter into the feedback loop you describe. I suspect the overall effect is that your prediction is now more likely to be true as a consequence of my having publicly given a lower estimate than you did.

By the way, when you say "I really, really want to answer 'shortly'", is this just because you sometimes dislike giving precise estimates, or do you think there is sometimes a rational justification for this reluctance? Without having thought about the matter carefully, it seems to me that the only valid reason for abstaining from giving precise estimates is that one's audience might make assumptions about the reliability of the estimate from the fact that it is expressed in precise language (more precision suggests higher reliability). But provided one gives independent reliability measures (by e.g. being explicit about one's confidence intervals), can this reluctance still be justified?

is this just because you sometimes dislike giving precise estimates

It's because it's now 3am and I've stuck a knife in the back of my tomorrow-self, who will wake up sleep deprived, so that my present-self (with a 1500 word first draft completed) can enjoy the certainty of hitting an estimate which was only that, not a commitment. Hyperbolic discounting is a royal pain.

It's because I'm a sucker for this kind of thing, as are many of my colleagues working in software development. :-/

Hewitt pointed out that in general, you could do better than most other forecasters by favoring the status quo outcome.

I vaguely recall some academic work showing this to be true, or more generally if you're predicting the variable X_t over time, the previous period's value tends to be a better predictor than more complicated models. Can anyone confirm/deny my memory? And maybe provide a citation?

This is a theme of multiple papers in the 2001 anthology Principles of Forecasting (a PDF of which is findable online), to give a specific citation.

I vaguely recall some academic work showing this to be true, or more generally if you're predicting the variable X_t over time, the previous period's value tends to be a better predictor than more complicated models.

These get called AR(1) models, for autoregressive 1.

Most complicated models that I'm familiar with include both the previous value and other factors (since there is generally more going on than a random walk).

provided I think a question and its negation are equally likely to have been asked, there is a 50% chance that the answer to the question you have asked is yes.

Well, yes. But ought I believe that a yes/no question I have no idea about is as likely as its negation to have been asked? (Especially if it's being asked implicitly by a situation, rather than explicitly by a human?)

Ratio of true statements to false ones: low. Probability TraderJoe wants to make TheOtherDave look foolish: moderate, slightly on the higher end. Ratio of the probability that giving an obviously false statement an answer of relatively high probability would make TheOtherDave look foolish to the probability that giving an obviously true statement a relatively low probability would make TheOtherDave look foolish: moderately high. Probability that the statement is neither true nor false: low.

Conclusion: أنا من (أمريك is most likely false.

I assume you mean without looking it up.

My answer is roughly the same as TimS's... it mostly depends on "Would TraderJoe pick a true statement in this context or a false one?" Which in turn mostly depends on "Would a randomly selected LWer pick a true statement in this context or a false one?" since I don't know much about you as a distinct individual.

I seem to have a prior probability somewhat above 50% for "true", though thinking about it I'm not sure why exactly that is.

Looking it up, it amuses me to discover that I'm still not sure if it's true.

Just because you use "75%" as a shorthand for "I'm pretty sure" doesn't mean you are thinking probabilistically; you must train the skill of seeing that for some events, its complement "25%" also counts as "I'm pretty sure".

Would expressing these things in terms of odds rather than probability make it easier to avoid this error?