The best laid schemes of mice and men
Go often askew,
And leave us nothing but grief and pain,
For promised joy!

- Robert Burns (translated)

 

Consider the following question:

A team of decision analysts has just presented the results of a complex analysis to the executive responsible for making the decision. The analysts recommend making an innovative investment and claim that, although the investment is not without risks, it has a large positive expected net present value... While the analysis seems fair and unbiased, she can’t help but feel a bit skeptical. Is her skepticism justified?1

Or, suppose Holden Karnofsky of charity-evaluator GiveWell has been presented with a complex analysis of why an intervention that reduces existential risks from artificial intelligence has astronomical expected value and is therefore the type of intervention that should receive marginal philanthropic dollars. Holden feels skeptical about this 'explicit estimated expected value' approach; is his skepticism justified?

Suppose you're a business executive considering n alternatives whose 'true' expected values are μ1, ..., μn. By 'true' expected value I mean the expected value you would calculate if you could devote unlimited time, money, and computational resources to making the expected value calculation.2 But you only have three months and $50,000 with which to produce the estimate, and this limited study produces estimated expected values for the alternatives V1, ..., Vn.

Of course, you choose the alternative i* that has the highest estimated expected value Vi*. You implement the chosen alternative, and get the realized value xi*.

Let's call the difference xi* - Vi* the 'postdecision surprise'.3 A positive surprise means your option brought about more value than your analysis predicted; a negative surprise means you were disappointed.

Assume, too kindly, that your estimates are unbiased. And suppose you use this decision procedure many times, for many different decisions. It seems reasonable to expect that, on average, you will receive the estimated expected value of each decision made this way: sometimes you'll be positively surprised, sometimes negatively surprised, but the surprises should cancel out.

Alas, this is not so; your outcome will usually be worse than what you predicted, even if your estimate was unbiased!

Why?

...consider a decision problem in which there are k choices, each of which has true estimated [expected value] of 0. Suppose that the error in each [expected value] estimate has zero mean and standard deviation of 1, shown as the bold curve [below]. Now, as we actually start to generate the estimates, some of the errors will be negative (pessimistic) and some will be positive (optimistic). Because we select the action with the highest [expected value] estimate, we are obviously favoring overly optimistic estimates, and that is the source of the bias... The curve in [the figure below] for k = 3 has a mean around 0.85, so the average disappointment will be about 85% of the standard deviation in [expected value] estimates. With more choices, extremely optimistic estimates are more likely to arise: for k = 30, the disappointment will be around twice the standard deviation in the estimates.4
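The quoted figures are easy to check numerically. Here is a quick Monte Carlo sketch (my own illustration, not from Russell & Norvig):

```python
import numpy as np

# k choices, each with true expected value 0; each estimate carries
# N(0, 1) error, and we always pick the choice with the highest estimate.
rng = np.random.default_rng(0)
trials = 100_000
for k in (3, 30):
    estimates = rng.normal(0.0, 1.0, size=(trials, k))
    avg_max = estimates.max(axis=1).mean()
    # the realized value is always 0, so the average disappointment
    # equals the average of the maximum estimate
    print(f"k = {k:2d}: average postdecision surprise = -{avg_max:.2f}")
```

This prints roughly -0.85 for k = 3 and -2.04 for k = 30, matching the figures above.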

This is "the optimizer's curse." See Smith & Winkler (2006) for the proof.

 

The Solution

The solution to the optimizer's curse is rather straightforward.

...[we] model the uncertainty in the value estimates explicitly and use Bayesian methods to interpret these value estimates. Specifically, we assign a prior distribution on the vector of true values μ = (μ1, ..., μn) and describe the accuracy of the value estimates V = (V1, ..., Vn) by a conditional distribution V|μ. Then, rather than ranking alternatives based on the value estimates, after we have done the decision analysis and observed the value estimates V, we use Bayes’ rule to determine the posterior distribution for μ|V and rank and choose among alternatives based on the posterior means...

The key to overcoming the optimizer’s curse is conceptually very simple: treat the results of the analysis as uncertain and combine these results with prior estimates of value using Bayes’ rule before choosing an alternative. This process formally recognizes the uncertainty in value estimates and corrects for the bias that is built into the optimization process by adjusting high estimated values downward. To adjust values properly, we need to understand the degree of uncertainty in these estimates and in the true values...5

To return to our original question: Yes, some skepticism is justified when considering the option before you with the highest expected value. To minimize your prediction error, treat the results of your decision analysis as uncertain and use Bayes' Theorem to combine its results with an appropriate prior.
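For concreteness, here is a minimal sketch of that correction in the conjugate normal case (the prior and noise parameters are illustrative assumptions, not from Smith & Winkler):

```python
import numpy as np

# Prior: mu_i ~ N(m, tau^2). Estimates: V_i | mu_i ~ N(mu_i, sigma_i^2).
# The posterior mean shrinks each estimate toward the prior mean:
#   E[mu_i | V_i] = m + tau^2 / (tau^2 + sigma_i^2) * (V_i - m)
rng = np.random.default_rng(42)
m, tau, n = 0.0, 1.0, 10
sigma = rng.uniform(0.5, 3.0, n)          # each alternative has its own noise level
mu = rng.normal(m, tau, n)                # true expected values
V = mu + sigma * rng.normal(0.0, 1.0, n)  # unbiased but noisy estimates

posterior = m + tau**2 / (tau**2 + sigma**2) * (V - m)

naive, bayes = V.argmax(), posterior.argmax()
print(f"naive pick: estimate {V[naive]:5.2f}, true value {mu[naive]:5.2f}")
print(f"bayes pick: estimate {posterior[bayes]:5.2f}, true value {mu[bayes]:5.2f}")
```

When every estimate has the same noise level, the shrinkage leaves the ranking unchanged and merely deflates the winner's predicted value; with unequal noise, as above, it can also change which alternative you pick.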

 

Notes

1 Smith & Winkler (2006).

2 Lindley et al. (1979) and Lindley (1986) talk about 'true' expected values in this way.

3 Following Harrison & March (1984).

4 Quote and (adapted) image from Russell & Norvig (2009), pp. 618-619.

5 Smith & Winkler (2006).

 

References

Harrison & March (1984). Decision making and postdecision surprises. Administrative Science Quarterly, 29: 26–42.

Lindley, Tversky, & Brown (1979). On the reconciliation of probability assessments. Journal of the Royal Statistical Society, Series A, 142: 146–180.

Lindley (1986). The reconciliation of decision analyses. Operations Research, 34: 289–295.

Russell & Norvig (2009). Artificial Intelligence: A Modern Approach, Third Edition. Prentice Hall.

Smith & Winkler (2006). The optimizer's curse: Skepticism and postdecision surprise in decision analysis. Management Science, 52: 311–322.

Comments

But all you've done after "adjusting" the expected value estimates is produce a new batch of expected value estimates, which just shows that the original expected value estimates were not done very carefully (if there was an improvement), or that you face the same problem all over again...

Am I missing something?

2orthonormal
I'm thinking of this as "updating on whether I actually occupy the epistemic state that I think I occupy", which one hopes would be less of a problem for a superintelligence than for a human. It reminds me of Yvain's Confidence Levels Inside and Outside an Argument.
3NancyLebovitz
I expect it to be a problem -- probably as serious -- for superintelligence. The universe will always be bigger and more complex than any model of it, and I'm pretty sure a mind can't fully model itself. Superintelligences will presumably have epistemic problems we can't understand, and probably better tools for working on them, but unless I'm missing something, there's no way to make the problem go away.
2orthonormal
Yeah, but at least it shouldn't have all the subconscious signaling problems that compromise conscious reasoning in humans - at least I hope nobody would be dumb enough to build a superintelligence that deceives itself on account of social adaptations that don't update when the context changes...
1EliasHasle
I must admit that I did not understand everything in the paper, but I think this excerpt summarizes a crucial point:

"The key issue here is proper conditioning. The unbiasedness of the value estimates V_i discussed in §1 is unbiasedness conditional on mu. In contrast, we might think of the revised estimates ^v_i as being unbiased conditional on V. At the time we optimize and make the decision, we know V but we do not know mu, so proper conditioning dictates that we work with distributions and estimates conditional on V."

The proposed "solution" converts n independent evaluations into n evaluations (estimates) that respect the selection process, but, as far as I can tell, they still rest on prior value estimates and prior knowledge about the uncertainty of those estimates... Which means the "solution" at best limits introduction of optimizer bias, and at worst... masks old mistakes?
0CynicalOptimist
Well in some circumstances, this kind of reasoning would actually change the decision you make. For example, you might have one option with a high estimate and very high confidence, and another option with an even higher estimate, but lower confidence. After applying the approach described in the article, those two options might end up switching position in the rankings.

BUT: Most of the time, I don't think this approach will make you choose a different option. If all other factors are equal, then you'll probably still pick the option that has the highest expected value.

I think that what we learn from this article is more about something else: It's about understanding that the final result will probably be lower than your supposedly "unbiased" estimate. And when you understand that, you can budget accordingly.
1EliasHasle
The big problem arises when the number of choices is huge and sparsely explored, such as when optimizing a neural network.

But restricting ourselves to n superficially evaluated choices with known estimate variance in each evaluation and with independent errors/noise, then if – as in realistic cases like Monte Carlo Tree Search – we are allowed to perform some additional "measurements" to narrow down the uncertainty, it will be wise to scrutinize the high-expectance choices most – in a way trying to "falsify" their greatness, while increasing the certainty of their greatness if the falsification "fails". This is the effect of using heuristics like the Upper Confidence Bound for experiment/branch selection.

UCB is also described as "optimism in the face of uncertainty", which kind of defeats the point I am making if it is deployed as decision policy. What I mean is that in research, preparations and planning (with tree search in perfect information games as a formal example where UCB can be applied), one should put a lot of effort into finding out whether the seemingly best choice (of path, policy, etc.) really is that good, and then make a final choice that penalizes remaining uncertainty.

I would like to throw in a Wikipedia article on a relevant topic, which I came across while reading about the related "Winner's curse": https://en.wikipedia.org/wiki/Order_statistic

The math for order statistics is quite neat as long as the variables are independently sampled from the same distribution. In real life, "sadly", choice evaluations may not always be from the same distribution... Rather, they are by definition conditional upon the choices. (https://en.wikipedia.org/wiki/Bapat%E2%80%93Beg_theorem provides a kind of solution in the form of an intractable colossus of a calculation.) That is not to say that there can be found no valuable/informative approximations.
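For readers who haven't met it, here is a minimal sketch of the UCB1 rule mentioned above (the three-armed setup and its reward distributions are made-up illustrations):

```python
import math
import random

def ucb1(true_means, n_rounds=10_000):
    """Sample arms by empirical mean plus an exploration bonus."""
    k = len(true_means)
    counts = [0] * k    # times each arm has been sampled
    totals = [0.0] * k  # summed rewards per arm
    for t in range(1, n_rounds + 1):
        if t <= k:
            arm = t - 1  # sample every arm once to initialize
        else:
            # the bonus shrinks as an arm accumulates samples, so
            # promising-but-uncertain arms get scrutinized the most
            arm = max(range(k), key=lambda i: totals[i] / counts[i]
                      + math.sqrt(2 * math.log(t) / counts[i]))
        reward = random.gauss(true_means[arm], 1.0)  # noisy evaluation
        counts[arm] += 1
        totals[arm] += reward
    return counts

random.seed(0)
print(ucb1([0.0, 0.5, 1.0]))  # the best arm ends up with most of the samples
```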

In statistics the solution you describe is called Hierarchical or Multilevel Modeling. You assume that your data is drawn from a set of distributions which have their parameters drawn from another distribution. This automatically shrinks your estimates of the distributions towards the mean. I think it's a pretty useful trick to know and I think it would be good to do a writeup, but I think you might need to have a decent grasp of bayesian statistics first.
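A minimal empirical-Bayes sketch of that shrinkage (all numbers made up; the variance of the true values is estimated from the spread of the estimates themselves):

```python
import numpy as np

# Each estimate is V_i = mu_i + noise, with the true values mu_i drawn
# from a population distribution whose variance we estimate from the data.
rng = np.random.default_rng(0)
n, sigma = 50, 2.0
mu = rng.normal(0.0, 1.0, n)          # true values
V = mu + rng.normal(0.0, sigma, n)    # unbiased but noisy estimates

tau2 = max(V.var() - sigma**2, 1e-9)  # method-of-moments estimate of Var(mu)
shrink = tau2 / (tau2 + sigma**2)     # weight on the data vs the pooled mean
shrunk = V.mean() + shrink * (V - V.mean())

i, j = V.argmax(), shrunk.argmax()
print(f"raw winner:    estimated {V[i]:5.2f}, actually {mu[i]:5.2f}")
print(f"shrunk winner: estimated {shrunk[j]:5.2f}, actually {mu[j]:5.2f}")
```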

4Pagw
Here's an example, with code, for anyone interested (it's not by me, I add): http://sl8r000.github.io/ab_testing_statistics/use_a_hierarchical_model/

The central point of the optimizer's curse is not one I have seen before and is a very interesting point.

The solution however leaves me feeling slightly unhappy. It isn't obvious to me what prior one should use in this sort of context. I suspect that a rough rule of thumb (the more complicated a logical chain, the more likely there is a problem somewhere in it) might do similar work at a weaker level.

Have you tried to apply this sort of reasoning explicitly to various existential risk considerations? If so, what did you get?

gwern210

The central point of the optimizer's curse is not one I have seen before and is a very interesting point.

Reminds me of the winner's curse in auctions - the selected bid is the one that is the highest and so most likely to be due to overconfidence/bias.
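A toy common-value auction makes the analogy concrete (my own numbers): every bidder gets an unbiased estimate of the same true value and bids it, so the winning bid is the maximum estimate.

```python
import numpy as np

rng = np.random.default_rng(3)
trials, bidders = 100_000, 10
true_value = 100.0
estimates = true_value + rng.normal(0.0, 10.0, size=(trials, bidders))
winning_bids = estimates.max(axis=1)  # the highest estimate wins the auction
print(f"true value: {true_value}, average winning bid: {winning_bids.mean():.1f}")
# with 10 bidders the winner overpays by about 1.5 standard deviations
```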

7malthrin
Yes, I recognized that similarity as well. As an aside, Fantasy Football (especially with an auction draft) is a great example to use when explaining these overestimation effects to laypeople.
7lessdazed
--2001: A Space Odyssey (Homer, translated from ancient Latin)
8malthrin
Interesting sourcing on that quote. I'm not sure what you meant to say with it, so I'll elaborate.

In fantasy sports, you begin by calculating an expected value for each player over the upcoming season. These values are used to construct your team in a draft, which is either turn-based (A picks a player, then B, then C) or auction-based (A, B, and C bid on players from a fixed initial pool of money). As the season goes on, you update your expected values with evidence from the past week's games in order to decide which players will be active and accrue points for your fantasy team.

The analogy should be obvious for most folks here. You're combining evidence to form a probability (how good was he last season? Is the new coach's game plan going to help or hurt his stats? Is he a particularly high injury risk?) and multiplying by utility to form a preference ranking. In an auction draft, the pricing mechanism even requires you to explicitly compute the expected utility values. When games are played, you update on evidence and revise your rankings.

Most people have a hard time relating to decision theory because it doesn't "feel like" what goes on in their head when they make decisions. Fantasy sports is a useful example because it makes the process explicit. I didn't fully realize how good a fit it is before this conversation - maybe I should write up an introductory rationality piece on this foundation.
3lessdazed
The quote is from Orwell's 1984. The proles are generally ignorant, but good at tracking lottery numbers because it is a game. That's right, I just generalized from fictional evidence! I figured if people are going to complain about the Burns quote, I'd give them something to really complain about. Wrong book with a date as a title, wrong author of an Odyssey, wrong language. Fantasy sports is a great example of where this would be useful, and I can't think of a better analogy.

Am I missing something, or does the post just say that we shouldn't use frequentist "unbiased estimators" as if they were Bayesian posterior expected values?

5jsalvatier
Not quite. If you were to do individual Bayesian estimates, you would have the same problem, because there is shared prior information that would remain unmodeled.
6cousin_it
Are you pointing out that each individual Bayesian estimate must be conditioned on all the information available, or is it more subtle than that?
3jsalvatier
Nope, that's it.

consider a decision problem in which there are k choices, each of which has true estimated [expected value] of 0.

Lukeprog, if I've understood you correctly, then this is no good; this is a corner case. The question to be answered here is whether we should expect a "common sense" executive who favors plans with a high prior estimate to do better than a "technical" analyst who favors plans that perform well according to the formal estimation criteria. The assumption that all prior estimates are identical except for bias ensures that the technical analyst will win. This, however, begs the question. One could just as easily assume that there is large variation in the true expected values, and that the formal criteria will always produce an estimate of 0, in which case the common sense executive will always win.

Am I missing something? I like the topic; I would enjoy reading about which approach we should expect to perform better in a typical situation.

Nisan100

I think the case where all the choices have a "true expected value" of 0 is picked out merely to illustrate the problem.

2lukeprog
Yes.
4Mass_Driver
That's fine; you're more than welcome to illustrate the problem, and your analysis does in fact do that. It does it very well; your writing, as always, is very lucid. However, you finish the article by claiming that Bayesian analysis can correct for the problem, and this is something that (I don't think) you even begin to show. Bayesian analysis solves the corner case, but does it bring any traction at all on a typical case?
5RobinZ
I think it's worse than that: Karnofsky's problem is that he has to compare moderate-mean low-variance estimates to large-mean large-variance estimates, but lukeprog's solution is for comparing the estimate to the result in cases where the variance is equal across the board.
4[anonymous]
Put another way, the higher the variance in the true payoffs, the less relevant the curse. This is the flipside of: the more accurate the estimates, the less relevant the curse.

Is there an example where applying this correction to the expected values changes the decision?

In any group there's going to be random noise, and if you choose an extreme value, chances are that value was inflated by noise. In Bayesian terms: given that something has the highest value, it probably had positive noise, not just positive signal. So the correction is to subtract out the expected positive noise you get from explicitly choosing the highest value. Naturally, this correction is greater when the noise is bigger.

So imagine choosing between black boxes. Each black box has some number of gold coins in it, and also two numbers written on it. The first number, A, on the box is like the estimated expected value, and the second number, B, is like the variance. What happened is that someone rolled two distinct dice with B sides, subtracted die 1 from die 2, and added that to the number of gold coins in the box.

So if you see a box with 40, 3 written on it, you know that it has an expected value of 40 gold coins, but might have as few as 37 or as many as 43.

Now comes the problem: I put 10 boxes in front of you, and tell you to choose the one with the most gold coins. The first box is 50, 1 - a very low-variance box. But the last 9 boxes are all high-uncertainty, all with ...
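A small simulation of the game (the setup lost to the truncation is reconstructed from the replies below: nine B = 20 boxes that all actually contain 45 coins, against the safe 50, 1 box):

```python
import random

def label(coins, B):
    # the label is coins + (die 2 - die 1) for two B-sided dice
    return coins + random.randint(1, B) - random.randint(1, B)

random.seed(0)
trials, label_sum, coin_sum = 10_000, 0, 0
for _ in range(trials):
    boxes = [(label(50, 1), 50)] + [(label(45, 20), 45) for _ in range(9)]
    top_label, top_coins = max(boxes)  # choose the box with the highest label
    label_sum += top_label
    coin_sum += top_coins
print(f"average label of the chosen box: {label_sum / trials:.1f}")  # well above 50
print(f"average coins actually inside:   {coin_sum / trials:.1f}")   # close to 45
```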

4JGWeissman
That is a good example of how the optimizer's curse causes an overestimate of the maximum expected value, and even reliably causes a wrong choice to be associated with the maximum expected value. But how do I apply the correction mathematically, so I can know, for given expected values on the high-uncertainty boxes, whether the best of them should be expected to be better or worse than the low-uncertainty box? Even better, how can I deal with situations where the uncertainties of the expected values are not so conveniently categorized (and whose actual values aren't conveniently uniform)?
2Manfred
Oh - I learned how, by the way. You start with some prior over how you expect the actual coins to be distributed, and then you convolve in the noise distribution of each box to get the combined distribution for each box. Then, given where the number on the outside of each box falls on the combined distribution, you can assign how much of that you expect to be signal and how much you expect to be noise by distributing improbability equally between signal and noise. Then you subtract out the expected noise.
0Manfred
I'm not sure. It's probably in the paper.
0Brickman
I'm trying to figure out why, from the rules you gave at the start, we can assume that box 60 has more noise than the other boxes with variance of 20. You didn't, at the outset of the problem, say anything about what the values in the boxes actually were. I would not, taking this experiment, have been surprised to see a box labeled "200", with a variance of 20, because the rules didn't say anything about values being close to 50, just close to A. Well, I would've been surprised with you as a test-giver, but it wouldn't have violated what I understood the rules to be and I wouldn't have any reason to doubt that box was the right choice.

The box with 60 stands out among the boxes with high variance, but you did not say that those boxes were generated with the same algorithm and thus have the same actual value. In fact you implied the opposite. You just told me that 60 was an estimate of its expected value, and 37 was an estimate of one of the other boxes' expected values. So I would assign a very high probability to it being worth more than the box labeled 37.

I understand that the variance is being effectively applied twice to go between the number on the box to the real number of coins (The real number of 45 could make an estimate anywhere from 25 to 65, but if it hit 25 I'd be assigning the real number a lower bound of 5 and if it hit 65 I'd be assigning the real number an upper bound of 85, which is twice that range). (Actually for that reason I'm not sure your algorithm really means there's a variance of 20 from what you state the expected value to be, but I don't feel like doing all the math to verify that since it's tangential to the message I'm hearing from you or what I'm saying). But that doesn't change the average. The range of values that my box labeled 60 could really contain from being higher than the range the box labeled 37 could really contain, to the best of my knowledge, and both are most likely to fall within a couple coins of the center of that r
2Manfred
The key factor is that the 60,20 box is not in isolation - it is the top box, and so not only do you expect it to have more "signal" (gold) than average, you also expect it to have more noise than average.

You can think of the numbers on the boxes as drawn from a probability distribution. If there was 0 noise, this probability distribution would just be how the gold in the boxes was distributed. But if you add noise, it's like adding two probability distributions together. If you're not familiar with what happens, go look it up on wikipedia, but the upshot is that the combined distribution is more spread out than the original. This combined distribution isn't just noise or just signal, it's the probability of having some number be written on the outside of the box. And so if something is the top, very highest box, where should it be located on the combined distribution?

Now, if you have something that's high on the combined distribution, how much of that is due to signal, and how much of it is due to noise? This is a tougher question, but the essential insight is that the noise shouldn't be more improbable than the signal, or vice versa - that is, they should both be about the same number of standard deviations from their means. This means that if the standard deviation of the noise is bigger, then the probable contribution of the noise is greater.

Me saying the same thing a different way can be found here.
2Brickman
Oh, I understand now. Even if we don't know how it's distributed, if it's the top among 9 choices with the same variance that puts it in the 80th percentile for specialness, and signal and noise contribute to that equally. So it's likely to be in the 80th percentile of noise. It might have been clearer if you'd instead made the boxes actually contain coins normally distributed about 40 with variance 15 and B=30, and made an alternative of 50/1, since you'd have been holding yourself to more proper unbiased generation of the numbers and still, in all likelihood, come up with a highest-labeled box that contained less than the sure thing. You have to basically divide your distance from the norm by the ratio of specialness you expect to get from signal and noise. The "all 45" thing just makes it feel like a trick.
0CynicalOptimist
I think there's some value in that observation that "the all 45 thing makes it feel like a trick". I believe that's a big part of why this feels like a paradox. If you have a box with the numbers "60" and "20" as described above, then I can see two main ways that you could interpret the numbers:

A: The number of coins in this box was drawn from a probability distribution with a mean of 60, and a range of 20.

B: The number of coins in this box was drawn from an unknown probability distribution. Our best estimate of the number of coins in this box is 60, based on certain information that we have available. We are certain that the actual value is within 20 gold coins of this.

With regards to understanding the example, and understanding how to apply the kind of Bayesian reasoning that the article recommends, it's important to understand that the example was based on B. And in real life, B describes situations that we're far more likely to encounter.

With regards to understanding human psychology, human biases, and why this feels like a paradox, it's important to understand that we instinctively tend towards "A". I don't know if all humans would tend to think in terms of A rather than B, but I suspect the bias applies widely amongst people who've studied any kind of formal probability. "A" is much closer to the kind of questions that would be set as exercises in a probability class.
0Manfred
That's true - when I wrote the post you replied to I still didn't really understand the solution - though it did make a good example for JGWeissman's question. By the time I wrote the post I linked to, I had figured it out and didn't have to cheat.
0Oscar_Cunningham
But if you don't know that all the high variance boxes have the same mean then 60 is the one to go with. And if you do know they have the same mean, then its expected value is no longer 60.
1Manfred
Imagine putting gold coins into a bunch of boxes by having them normally distributed about 50 gold coins with standard deviation 10. Then we'll add some Gaussian noise to the estimates on the boxes - but we'll split them into 2 groups. Ten boxes will have noise with standard deviation of 5, while the other ten will have a standard deviation of 25. But since I've still kept the simple situation where we just have 2 groups, you can get the overall biggest by just picking the biggest from each group and comparing them. So we can treat the groups independently for a bit.

The biggest one is going to have the biggest positive deviation from 50, combined signal and noise. Because I used normal distributions this time, the combined prior+noise distribution is just a bigger normal distribution. So given that something is big or small by this combined distribution, how do we expect the signal and noise distributions to shift? Well, it would be silly to expect one of them to be more improbable than the other, so we expect their means to shift by about the same number of standard deviations for each distribution. This right there means that the bigger the noise, the more of the variation we should attribute to noise. And also the bigger the element in the combined distribution, the larger we should expect its noise to be.
0Oscar_Cunningham
But if you know the boxes were originally drawn from N(50,100) then the number on the box is no longer the correct Bayesian mean. All I'm arguing is that once you have your Bayesian expected value you don't need to update it any further.
3Manfred
That's pretty uncontroversial, but in practice it means that you end up penalizing high-noise boxes with high values (and boosting high-noise boxes with low values), which I think is a nontrivial result.
1Johnicholas
I'm trying to imagine a scenario. Possibly the decider knows that people sometimes make multiplicative errors, transposing numbers or misplacing decimals, and is confronted with a set of estimates hovering around, say, 0.05 (and that is plausible according to the decider's prior) and a few estimates at around 0.5 and 5.0. Would the correction effectively trim the outliers back to almost exactly 0.05 (because we can't learn much information from an estimate that probably had at least one mistake in it), so that the decider should go with the highest of the "plausible" numbers? It seems to me like the conditional distributions that would lead to actually changing your decision are nearly as likely to be a source of error as a correction.

Would this issue also apply to picking a contractor for a project based on the lowest bid?

3Solvent
No, because the lowest bid is a commitment from the contractor, not an estimate. This particular problem arises from trying to pick the best option from several estimates.

Sometimes contractors run out of money before finishing and you have to pay more or they leave you with a half-finished project :(

2PhilGoetz
It would probably lead to contractors selected that way often going over budget.

I'm not sure how exactly this differs from the GiveWell blog post along the same lines? You both seem to be dealing with roughly the same problem (decision making under uncertainty), and reach the same conclusion (pay attention to the standard deviation, use Bayesian updates).

I did find your graph in the middle a rather useful illustration, but otherwise don't feel like I've come away with anything really new...

Well, to start with, Luke has provided an actual mechanism for this mistake to occur by.

This is interesting, but I don't see how to apply the solution. Presumably I either have no priors, or the priors are going to be generated by the same process I use to generate the values I am combining them with.

The resulting bias should be smaller if you choose the top 2 or 3 alternatives. E.g., give to 3 charities, not to 1.

How do market traders deal with this problem?

If I understand this correctly, there's an empirical problem.

How optimistic your most optimistic estimate turns out to be is a matter of temperament and knowledge for individuals, and of group culture for groups. It seems to me that the correction would need to be determined by experience. Or is this the "appropriate prior" problem?

When I'd only seen the title for this article, I thought it was going to be about the question of how much effort you should put into optimizing.

[anonymous]20

This is nit-picky, but I don't think you should attribute to Robert Burns anything other than the words he actually wrote. Meanings change a lot in translation, and it's not quite fair to do that through invisible sleight of hand. "Robert Burns (standard English translation)" would serve to CYA.

wnoise100

The original lines:

The best laid schemes o' Mice an' Men,
Gang aft agley,
An' lea'e us nought but grief an' pain,
For promis'd joy!

are little different from the version Luke quoted, and are mostly understandable (with the exception of "gang aft agley") to a sophisticated English reader with no special knowledge. I am somewhat inclined to call that version a rewrite rather than a translation, just as I would consider some modernized versions of Shakespeare to not be translations, but rewrites.

The standard problem of drawing lines in a continuum rears its head again. There are some reasonable arguments for calling Scots from this time a dialect of English, and many others for calling it a separate language. This is complicated by people's personal and national identities being involved. Questions like these generally end up being settled more by politics than by details of the different linguistic varieties involved.

6lukeprog
Okay, I added '(translated)'.
5komponisto
Would you say the same thing if a translation had been quoted of a poem originally in Latin or French? (My guess: probably not. No one talks about a "standard English translation" of Catullus or Baudelaire. Instead, they credit the translator by name, or simply take the liberty of using the translation as if it were the original author's words.)
0[anonymous]
The translator should absolutely be credited by name if he or she is known. Burns has passed kind of into folk status, and is a special case. I would never quote Catullus or Baudelaire in English as if it were the original author's words. No. It's wrong (deprives the translator of rightful credit) -- and, FWIW, it's also low-status.
8komponisto
What matters, obviously, is not whether Burns has passed into folk status, but whether the particular translation has. The latter seems an implausible claim (since printed translations can presumably be traced and attributed), but if it were true, then there would be no need for acknowledgement (almost by definition of "folk status").

My comment arose from the suspicion that you reacted as if Burns had been paraphrased, as opposed to translated -- because the original language looks similar enough to English that a translation will tend to look like a paraphrase. I find it unlikely that you would actually have made this comment if lukeprog had quoted Catullus without mentioning the translator; and on the other hand I suspect you would have commented if he had taken the liberty of paraphrasing (or "translating") a passage from Shakespeare into contemporary English without acknowledging he had done so. My point being that the case of Burns should be treated like the former scenario, rather than the latter, whereas I suspect you intuitively perceived the opposite.

All translation is paraphrase, of course -- but there is a difference of connotation that corresponds to a difference in etiquette. When one is dealing with an author writing in the same language as oneself, there is a certain obligation to the original words that does not (cannot) exist in the case of an author writing in a different language. So basically, I saw your comment as not-acknowledging that Burns was writing in a different language.

I don't see it as lowering the status of the quoter; the status dynamic that I perceive is that it grants very high status to the original author, status so high that we're willing to overlook the original author's handicap of speaking a different language. In effect, it grants them honorary in-group status. For example: Descartes has high enough status that the content of his saying "I think therefore I am" is more important to us than the fact that his actual wor
6Bill_McGrath
Google has let me down in finding this quote, both in English and in roughly-translated German. Where is this from?
0komponisto
A statement like this is attributed to Schoenberg by a number of people, but I can't find a specific reference either. Perhaps it was just something he said orally, without ever writing it anywhere.
2garethrees
The earliest reference I can track down is from 1952. In Roger Sessions: a biography (2008), Andrea Olmstead writes: (The work that Sessions had performed this role in appears to have been Man who ate the popermack in the mid-1920s.) Sessions' essay (originally published in The Score and then collected in Roger Sessions on Music) begins: An entertaining later reference to this quotation appears in Dialogues and a diary by Igor Stravinsky and Robert Craft (1963), where Stravinsky tabulates the differences between himself and Schoenberg, culminating in this comparison:
0[anonymous]
This seems to have been Stravinsky's playful characterization of Schoenberg. See Dialogues by Igor Stravinsky and Robert Craft, p. 108, where Stravinsky tabulates the differences between himself and Schoenberg, culminating in: I guess it's possible that Stravinsky is quoting Schoenberg here, but the parallelism suggests not, and when he does quote Schoenberg (as in row 1 in the table), he gives a citation.
5wnoise
Right. But there are no hard-and-fast lines for "same language as oneself". You and I both brought up comparisons with Shakespeare. Both can be difficult to read for a struggling reader. For a sophisticated reader, the gist of both can be gotten with a modicum of effort. Full understanding of either requires a specialized dictionary, as vocabulary is different. So was Shakespeare writing in a different language? Was Burns? What's the purpose of this distinction? If it's weighing understanding vs adherence to the original wording, the trade-off is fairly close to the same place for the two. On the other hand, if it's to acknowledge the politic linguistic classification that Scots is a separate language from Modern English, there is a distinction, as no one cares whether Early Modern English is treated as a separate language from Modern English. (EDIT: I should say that I do think it's often more useful to consider Scots a separate language. Just because Burns was mostly intelligible to the English does not mean that other authors or speakers generally were.) Meditations was first published in Latin.
1[anonymous]
My comment arose from the suspicion that you reacted as if Burns had been paraphrased, as opposed to translated

I don't know what to tell you except that you're wrong. I know the original poem pretty well ("Gang aft agley" is a famous phrase in some circles). Burns isn't my specific field, but my impression, backed by a cursory Wikipedia search, is that the name of the original translator has been lost to the mists of history. If anyone can correct me and supply the original translator's name, I'll be truly grateful.

I don't see it as lowering the status of the quote

Yes, you wouldn't, and I can't prove it to you except by assembling a conclave of Ivy League-educated snooty New York poets who happen to not be here right now. I will tell you -- and you can update scantily, since you don't trust the source -- that the high-status thing to do is to provide quotes in the original language without translation. You are thereby signalling that not only do YOU read Scots Gaelic (fluently, of course), but you expect everyone you come into contact with socially to ALSO be fluent in Scots Gaelic.

The medium-status thing to do is at least to credit or somehow mark the translator, so that people think you are following standard academic rules for citation. The reason that quoting translations without crediting them as such is low-status is that it leaves you open to charges of not understanding the original source material.
wnoise140

You are thereby signalling that not only do YOU read Scots Gaelic (fluently, of course), but you expect everyone you come into contact with socially to ALSO be fluent in Scots Gaelic.

Scots Gaelic is not Scots (is not Scottish English, though modern speakers of Scots do generally code switch into it with ease, sometimes in a continuous way). Scots Gaelic is a Gaelic, Celtic language. Scots is Germanic. Burns wrote in Scots.

4[anonymous]
You're right, and thanks for the clarification. As I said, Burns isn't really my field.
[anonymous]110

Scots Gaelic is a thing, but it is not the language in which Burns wrote. That's just called Scots. I wouldn't ordinarily have mentioned it, but... you're coming off as a bit snobby here. (O wad some Power the giftie gie us, am I right?)

that the high-status thing to do is to provide quotes in the original language without translation

This may be high status in certain social circles (having interacted with the snooty Ivy League educated New York poets also, they certainly think so), but to a lot of people doing so comes across as obnoxious and pretentious; that is, an attempt to blatantly signal high status in a way that actually signals low status.

The highest status thing to do (and just optimal as far as I can tell for actually conveying information) is to include the original and the translation also.

3[anonymous]
I agree that this is probably optimal. My own class background is academics and published writers (both my parents are tenured professors). It's actually hard trying to explain in a codified way what one knows at a gut level: I know that translations need to be credited, and for status reasons, but press me on the reasons and I'm probably not terribly reliable.
gwern110

I find it interesting that everyone here is focusing on status; couldn't it just be that crediting translations is absolutely necessary for the basic scholarly purpose of judging the authority and trustworthiness of the translation and even the original text? And that failing to provide attribution demonstrates a lack of academic expertise, general ignorance of the slipperiness of translation ('hey, how important could it be?'), and other such problems.

I know I find such information indispensable for my anime Evangelion research (I treat translations coming from ADV very differently from translations by Olivier Hague and that different from translations by Bochan_bird, and so on, to give a few examples), so how much more so for real scholarship?

[anonymous]100

Well, what I originally [see edit] wrote was "It's wrong (deprives the translator of rightful credit) -- and, FWIW, it's also low-status." I think people found the "low-status" part of my claim more interesting, but it wasn't the primary reason I reacted badly to seeing a translation uncredited as such.

Edit: on reflection, this wasn't my original justification. I simply reacted with gut-level intuition, knowing it was wrong. Every other explanation is after-the-fact, and therefore suspect.

3JoshuaZ
Upvoting for realizing that a rationalization wasn't your actual reason.
1JoshuaZ
Yes, agreed. I did note above that including the translation details with the original was optimal for conveying information but I didn't emphasize it. I think that part of why people have been emphasizing status issues over serious research in this context is that the start of the discussion was about what to do with epigraphs. Since they really are just for rhetorical impact, the status issue matters more for them.
2A1987dM
This was the case until about a decade ago, but nowadays it merely signals that you expect the audience to know how (and be willing) to use Google. (The favourite quotations section in my Facebook profile contains quotations in maths, Italian, English, Irish and German and none of them is translated into any other language.)
0ArisKatsaris
Status is in the map, not in the territory, siduri. The map of "snooty New-York poets" needn't be our own map.
2JoshuaZ
Yes but being aware of what signals one is sending out is helpful. Given that humans play status games it is helpful to be aware of how those games function so one doesn't send signals out that cause people to pay less attention or create other barriers to communication.
7Hey
Agreed, but it takes a high degree of luminosity to distinguish between tactical use of status to attain a specific objective, and getting emotionally involved and reactive to the signals of others (inducing this state of confusion is pretty much the function of status-signals for most humans, though). Tactical = dress up, display "irrational confidence", and play up your achievements to maximize attraction in potential romantic partners, or do well at a job interview. Emotional-reactive = seeking, and worrying about, the approval of perceived social betters even though there is no logical reason.
0prase
Are you saying that whenever a sentence is translated, its author must have high status or gains high status at the moment of translation, because the default attitude is to ignore anything originally uttered in a foreign language? If this is what you mean, I find it surprising. I have probably never been in a situation where someone was ignored because he spoke incomprehensible gibberish and that fact was more important than the content of his words.

Of course, translation may be costly and people generally pay only for things they deem valuable, which is where the status comes into play. But it doesn't mean that with low-status people it is more important that they speak gibberish than what they say.

(A thought experiment: A Gujarati-speaking beggar approaches a rich English gentleman, says something and goes away. The Englishman's wife, who is accompanying him at the moment, happens to understand Gujarati. The man can recognise the language but doesn't understand a word. What is the probability that he asks his wife "what did he say"? As a control group, imagine the same with an English beggar, this time the gentleman didn't understand because when the beggar had spoken, a large truck had passed by. Is the probability of asking "what did he say" any different from the first group?)
2komponisto
Yes. More generally, the default attitude is to ignore anything uttered by a member of an outgroup. By calling attention to the fact that a sentence has been translated, one is calling attention to the fact that the author speaks a foreign language and thus to the author's outgroup status. Omitting mention of a person's outgroup status is a courtesy extended to those we wish to privilege above typical outgroup members.

Curiosity about what a low-status person says does not imply that one thinks the content of their words is a more important fact about them than their low status. With high probability, the most salient aspect of the beggar from the perspective of the Englishman is that he is a beggar (and, in the first case, a foreign beggar at that). Whatever the beggar said, if the Englishman finds out and deems it worthy of recounting later, I would be willing to bet that he will not omit mention of the fact that he heard it from a beggar.

Note Carl Shulman's counterargument to the assumption of a normal prior here and the comments traded between Holden and Carl.

"If your prior was that charity cost-effectiveness levels were normally distributed, then no conceivable evidence could convince you that a charity could be 100x as good as the 90th percentile charity. The probability of systematic error or hoax would always be ludicrously larger than the chance of such an effective charity. One could not believe, even in hindsight, that paying for Norman Borlaug’s team to work on the Green Revo... (read more)

1Mass_Driver
The problem with this analysis is that it assumes that the prior should be given the same weight both ex ante and ex post. I might well decide to evenly weight my prior (intuitive) distribution showing a normal curve and my posterior (informed) distribution showing a huge peak for the Green Revolution, in which case I'd only think the Green Revolution was one of the best charitable options, and would accordingly give it moderate funding, rather than all available funding for all foreign aid.

But, then, ten years later, with the benefit of hindsight, I now factor in a third distribution, showing the same huge peak for the Green Revolution. And, because the third distribution is based not on intuition or abstract predictive analysis but on actual past results -- it's entitled to much more weight. I might calculate a Bayesian update based on observing my intuition once, my analysis once, and the historical track record ten or twenty times. At that point, I would have no trouble believing that a charity was 100x as good as the 90th percentile. That's an extraordinary claim, but the extraordinary evidence to support it is well at hand.

By contrast, no amount of ex ante analysis would persuade me that your proposed favorite charity is 100x better than the current 90th percentile, and I have no problem with that level of cynicism. If your charity's so damn good, run a pilot study and show me. Then I'll believe you.

Quick feedback or question.

In this part: Assume, too kindly, that your estimates are unbiased. And suppose you use this decision procedure many times, for many different decisions, and your estimates are unbiased.

the second time you mention the unbiased makes no sense to me and looks like a typo.

If X = Skill + Luck, with Skill and Luck both random variables, then selecting max(X) will get you something that has high Skill and high Luck.

If Estimate = TrueVal + Error, then max(Estimate) will have both high TrueVal and high Error.

This obvious insight has many applications, especially when the selection is done over a very large number of entities, e.g. trying to emulate the habits of billionaires in order to become rich.
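A toy check of that claim (my own numbers):

```python
import numpy as np

# Select max(Estimate) where Estimate = TrueVal + Error; the winner
# turns out to be high on *both* components, by about the same amount.
rng = np.random.default_rng(1)
trials, k = 100_000, 20
skill = rng.normal(size=(trials, k))  # "TrueVal"
luck = rng.normal(size=(trials, k))   # "Error"
winner = (skill + luck).argmax(axis=1)
rows = np.arange(trials)
print(f"winners' average skill: {skill[rows, winner].mean():.2f}")  # ~ +1.3 sd
print(f"winners' average luck:  {luck[rows, winner].mean():.2f}")   # ~ +1.3 sd
```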

Very interesting. I'm going to try my hand at a short summary:

Assume that you have a number of different options you can choose, that you want to estimate the value of each option and you have to make your best guess as to which option is most valuable. In step one, you generate individual estimates using whatever procedure you think is best. In step 2 you make the final decision, by choosing the option that had the highest estimate in step one.

The point is: even if you have unbiased procedures for creating the individual estimates in step one (ie procedur...