The debate on the proper use of inside and outside views has raged for some time now. I suggest a way forward, building on a family of methods commonly used in statistics and machine learning to address this issue — an approach I'll call "model combination and adjustment."
Inside and outside views: a quick review
1. There are two ways you might predict outcomes for a phenomenon. If you make your predictions using a detailed visualization of how something works, you're using an inside view. If instead you ignore the details of how something works, and instead make your predictions by assuming that a phenomenon will behave roughly like other similar phenomena, you're using an outside view (also called reference class forecasting).
Inside view examples:
- "When I break the project into steps and visualize how long each step will take, it looks like the project will take 6 weeks"
- "When I combine what I know of physics and computation, it looks like the serial speed formulation of Moore's Law will break down around 2005, because we haven't been able to scale down energy-use-per-computation as quickly as we've scaled up computations per second, which means the serial speed formulation of Moore's Law will run into roadblocks from energy consumption and heat dissipation somewhere around 2005."
Outside view examples:
- "I'm going to ignore the details of this project, and instead compare my project to similar projects. Other projects like this have taken 3 months, so that's probably about how long my project will take."
- "The serial speed formulation of Moore's Law has held up for several decades, through several different physical architectures, so it'll probably continue to hold through the next shift in physical architectures."
See also chapter 23 in Kahneman (2011); Planning Fallacy; Reference class forecasting. Note that, after several decades of past success, the serial speed formulation of Moore's Law did in fact break down in 2004 for the reasons described (Fuller & Millett 2011).
2. An outside view works best when using a reference class with a similar causal structure to the thing you're trying to predict. An inside view works best when a phenomenon's causal structure is well-understood, and when (to your knowledge) there are very few phenomena with a similar causal structure that you can use to predict things about the phenomenon you're investigating. See: The Outside View's Domain.
When writing a textbook that's much like other textbooks, you're probably best off predicting the cost and duration of the project by looking at similar textbook-writing projects. When you're predicting the trajectory of the serial speed formulation of Moore's Law, or predicting which spaceship designs will successfully land humans on the moon for the first time, you're probably best off using an (intensely informed) inside view.
3. Some things aren't very predictable with either an outside view or an inside view. Sometimes, the thing you're trying to predict seems to have a significantly different causal structure than other things, and you don't understand its causal structure very well. What should we do in such cases? This remains a matter of debate.
Eliezer Yudkowsky recommends a weak inside view for such cases:
On problems that are drawn from a barrel of causally similar problems, where human optimism runs rampant and unforeseen troubles are common, the Outside View beats the Inside View... [But] on problems that are new things under the Sun, where there's a huge change of context and a structural change in underlying causal forces, the Outside View also fails - try to use it, and you'll just get into arguments about what is the proper domain of "similar historical cases" or what conclusions can be drawn therefrom. In this case, the best we can do is use the Weak Inside View — visualizing the causal process — to produce loose qualitative conclusions about only those issues where there seems to be lopsided support.
In contrast, Robin Hanson recommends an outside view for difficult cases:
It is easy, way too easy, to generate new mechanisms, accounts, theories, and abstractions. To see if such things are useful, we need to vet them, and that is easiest "nearby", where we know a lot. When we want to deal with or understand things "far", where we know little, we have little choice other than to rely on mechanisms, theories, and concepts that have worked well near. Far is just the wrong place to try new things.
There are a bazillion possible abstractions we could apply to the world. For each abstraction, the question is not whether one can divide up the world that way, but whether it "carves nature at its joints", giving useful insight not easily gained via other abstractions. We should be wary of inventing new abstractions just to make sense of things far; we should insist they first show their value nearby.
In Yudkowsky (2013), sec. 2.1, Yudkowsky offers a reply to these paragraphs, and continues to advocate for a weak inside view. He also adds:
the other major problem I have with the “outside view” is that everyone who uses it seems to come up with a different reference class and a different answer.
This is the problem of "reference class tennis": each participant in the debate claims their own reference class is most appropriate for predicting the phenomenon under discussion, and if disagreement remains, they might each say "I’m taking my reference class and going home."
Responding to the same point made elsewhere, Robin Hanson wrote:
[Earlier, I] warned against over-reliance on “unvetted” abstractions. I wasn’t at all trying to claim there is one true analogy and all others are false. Instead, I argue for preferring to rely on abstractions, including categories and similarity maps, that have been found useful by a substantial intellectual community working on related problems.
Multiple reference classes
Yudkowsky (2013) adds one more complaint about reference class forecasting in difficult forecasting circumstances:
A final problem I have with many cases of 'reference class forecasting' is that... [the] final answers [generated from this process] often seem more specific than I think our state of knowledge should allow. [For example,] I don’t think you should be able to tell me that the next major growth mode will have a doubling time of between a month and a year. The alleged outside viewer claims to know too much, once they stake their all on a single preferred reference class.
Both this comment and Hanson's last comment above point to the vulnerability of relying on any single reference class, at least for difficult forecasting problems. Beware brittle arguments, says Paul Christiano.
One obvious solution is to use multiple reference classes, and weight them by how relevant you think they are to the phenomenon you're trying to predict. Holden Karnofsky writes of investigating things from "many different angles." Jonah Sinick refers to "many weak arguments." Statisticians call this "model combination." Machine learning researchers call it "ensemble learning" or "classifier combination."
In other words, we can use many outside views.
Nate Silver does this when he predicts elections (see Silver 2012, ch. 2). Venture capitalists do this when they evaluate startups. The best political forecasters studied in Tetlock (2005), the "foxes," tended to do this.
In fact, most of us do this regularly.
How do you predict which restaurant's food you'll most enjoy, when visiting San Francisco for the first time? One outside view comes from the restaurant's Yelp reviews. Another outside view comes from your friend Jade's opinion. Another outside view comes from the fact that you usually enjoy Asian cuisines more than other cuisines. And so on. Then you combine these different models of the situation, weighting them by how robustly they each tend to predict your eating enjoyment, and you grab a taxi to Osha Thai.
(Technical note: I say "model combination" rather than "model averaging" on purpose.)
Model combination and adjustment
You can probably do even better than this, though — if you know some things about the phenomenon and you're very careful. Once you've combined a handful of models to arrive at a qualitative or quantitative judgment, you should still be able to "adjust" the judgment in some cases using an inside view.
For example, suppose I used the above process, and I plan to visit Osha Thai for dinner. Then, somebody gives me my first taste of the Synsepalum dulcificum fruit. I happen to know that this fruit contains a molecule called miraculin which binds to one's tastebuds and makes sour foods taste sweet, and that this effect lasts for about an hour (Koizumi et al. 2011). Despite the results of my earlier model combination, I predict I won't particularly enjoy Osha Thai at the moment. Instead, I decide to try some tabasco sauce, to see whether it now tastes like doughnut glaze.
In some cases, you might also need to adjust for your prior over, say, "expected enjoyment of restaurant food," if for some reason your original model combination procedure didn't capture your prior properly.
Against "the outside view"
There is a lot more to say about model combination and adjustment (e.g. this), but for now let me make a suggestion about language usage.
Sometimes, small changes to our language can help us think more accurately. For example, gender-neutral language can reduce male bias in our associations (Stahlberg et al. 2007). In this spirit, I recommend we retire the phrase "the outside view..", and instead use phrases like "some outside views..." and "an outside view..."
My reasons are:
-
Speaking of "the" outside view privileges a particular reference class, which could make us overconfident of that particular model's predictions, and leave model uncertainty unaccounted for.
-
Speaking of "the" outside view can act as a conversation-stopper, whereas speaking of multiple outside views encourages further discussion about how much weight each model should be given, and what each of them implies about the phenomenon under discussion.
(cont'd from previous comment)
As I have mentioned at the beginning, the reports up to 2005 contained highly overoptimistic projections for on-chip frequency and supply voltage, which became dramatically more pessimistic in the 2007 edition. The reports clearly state, however, that these numbers are meant as targets and are not necessarily "on the road to sure implementation", especially where it has been highlighted that solutions were needed and not yet known. They can therefore not necessarily serve as a clear indictment of the ITRS' predictive powers, but I remain puzzled by some of their projections and comments on these before 2007. Getting clarification on this from industry insiders was the next thing I had planned for this project before we paused it.
Specifically, tables 4c and 4d in the Overall Roadmap Technology Characteristics, found in a subsection of the Executive Summary titled Performance of Packaged Chips, contain on-chip frequency forecasts in MHz, which became dramatically more pessimistic in 2007 than they had been in the previous 3 editions. A footnote in the 2007 edition states:
Later editions seem to have reduced the expected scaling factor even further (1.04 in the 2011 edition), but there were also changes made to the metric employed, so I am not sure how to interpret the numbers (though I would expect the scaling factor to be unaffected by those changes).
Relatedly, a paragraph in the System Drivers document titled Maximum on-chip (global) clock frequency states that the on-chip clock frequency would not continue scaling at a factor of 2 per generation for several reasons. The 2001 edition states 3 reasons for this, the 2003 and 2005 edition state 4. But only in 2007 was the limitation from maximum allowable power dissipation added to this list of reasons. This strikes me as very puzzling. The paragraph, as it appears in the 2007 edition, is (emphasis added):
Finally, the Overall Roadmap Technology Characteristics tables 6a and 6b (found in a subsection titled Power Supply and Power Dissipation in the Executive Summary) contains projected values of the supply power (
) which also became dramatically more pessimistic in the 2007 edition.
I have indicated my puzzlement at these points in an email I have sent out to a number of industry insiders, then asking:
I have received some very kind replies to those emails, but most have focused on the technical reasons for the "breakdown" in Dennard scaling. The only comment I have received on this last question was from Robert Dennard, who sent me a particularly thoughtful email that came with 4 attachments (which mainly provided more technical detail on transistor design, however). At the end of his email, he wrote:
Indeed, which bets it is most rational to make depends on expected payoff ratios as well as on probability estimates. This distinction between targets and mere predictions complicates the question quite a bit.
This was an interesting project, it would be great to pick it up again.