Mini sequence: ~30min of rationality writing per day for 30 days (2/30)

Claim 1: Goodhart's Law is true.

Goodhart's Law (which is incredibly appropriately named) reads "any measure which becomes a metric ceases to be a good measure." Another way to say this is "proxies are leaky," i.e. the proxy never quite gets you the thing it was intended to get you. If you want to be able to differentiate between promising math students and less-promising ones, you can try out a range of questions and challenges until you cobble together a test that the 100 best students do well on and the following 900 do worse on. But as soon as you make that test the test, it's going to start leaking. In the tenth batch of a thousand students, the 100 best ones will still do quite well, but you'll also get a bunch of people who don't have the generalized math skill, but who did get good at answering the specific, known questions. Your top 100 will no longer be composed only of the 100 actual-best math students.

This is analogous to what's happened with Western diets and sugar. Prehistoric primates who happened to have a preference for sweet things (fruit) also happened to get a lot more vitamins and minerals, and therefore they survived and thrived at higher rates than those sugar-ambivalent primates who failed to become our ancestors and died out. The process of natural selection turned a measure for nutrition (sweetness) into a metric (having a sweet tooth/implicit hardwired assumption that more sugar → more utility), which was fine until we learned to separate the sugar from the nutrients (teaching to the test) and discovered that our preferences were hardwired to the proxy rather than to the Actual Good Thing.

Claim 2: When attempting to do operant conditioning with a given reward or punishment, for any desired strength-of-conditioning-effect, ∃ ("there exists") a sufficiently small delay between behavior and consequence to produce that effect.

This one is not literally true. For it to be true, the hyperbolic nature of discounting (such that closer rewards are disproportionately more effective in creating reinforcement) would have to extend off to absurdity, such that an infinitesimally small reward or response could produce an arbitrarily large conditioning effect if it was immediately proximal to the relevant behavior. If that were true, then clicker training (in which you use a click sound that's been associated with treats and compliments and other rewards to signal to a dog that you like what it just did) wouldn't reinforce the distant behavior of rolling over but would instead reinforce something like the last blink of the dog's eye before the soundwave of the click reached the dog's ear.

However, I claim it is effectively true, for rewards as small as fleeting thoughts or shifts in emotion, and for time scales as small as hundredths of a second. If I want an anti-Oreo conditioning effect that is as strong as the pleasure-burst I receive from eating an Oreo, I can get it, even with a stimulus as small as a thought—provided that thought pops up fast enough.

(This is actually why clicker training is a thing—because you literally cannot deliver a treat fast enough to produce effects of the size you can get through the much-tighter feedback loop provided by the audio channel. If you can make a click into a positive reward for a dog, then you're better off clicking than tossing cheese cubes.)

(For a model of why hyperbolic discounting, consider the bits of data required to locate and confirm a causal link between Behavior #736 and a reward that doesn't appear until after Behavior #755, compared to the bits required to be confident of the link when the reward appears only one or two Behaviors later.)

Claim 3: Our S1s aggregate and analyze a tremendous amount of sensory data into implicit causal models, and those causal models produce binary approach-avoid signals when we encounter new stimuli, based on whether or not (according to those models) those stimuli will be helpful or hurtful re: progress toward our goals.

I think this is what Anna Salamon is after when she talks about "taste." Imagine a veteran doctor who has, in their long career, chased down the explanations for hundreds of confusing, confounded, or hitherto-unknown ailments. In investigating a thousand hypotheses, maybe 100 of them panned out, and 800 of them led to brick walls, and 100 of them remain inconclusive. The part of their brain that builds and maintains a rich, inner model of the universe is (quietly, under the hood) drawing connections between those investigations, noting the elements that the successful ones had in common versus the elements that the unsuccessful ones had in common. When our doctor encounters a new patient and starts investigating, some part of their system makes a lightning-fast comparison—does this new line of research "feel like" or "resemble" those ones which previously paid off, or is it more reminiscent of those ones that ended in frustration?

That information gets compressed into a quick yes-or-no, good-or-bad, approach-or-avoid signal—a gut sense of doom or optimism, interest or disinterest. To the extent that there's been lots of relevant experience and the new situation is in the same class as the old ones, this sense can be extremely accurate and valuable—what we call taste or intuition or second nature—and even when there's been very little training data, this sense can still provide useful insight.

Claim 4: Our brains condition us, often without us noticing.

In brief: there were studies with monkeys whose brains were hooked up to detectors and who had straws positioned to squirt juice into their mouths. When those monkeys exhibited desired behaviors, the scientists would give them a shot of juice, and the detectors would register a dopamine spike.

After a while, though, the dopamine spike migrated. It became associated with a "victory!" screen that the scientists would flash whenever the monkey performed a desired behavior, just like a dog begins to associate clicks with treats and other rewards.

Pause to let yourself be confused for a second. Don't gloss over this.

What. The. Heck.

The dopamine spike moves? How? Why?

I claim that what's going on is that the monkey's brain, separate from the monkey/the monkey's S2/any sapient or strategic awareness that the monkey has, is conditioning the monkey. Remember, a system that is capable of learning from its environment and meaningfully updating on that learning is more likely to survive and thrive than one that does not, so it makes sense that the monkey has some functional, adaptive processes in place to shape its own behavior. Basically, the monkey's brain has access to a) a ton of data, and b) carrots-and-sticks, in the form of pleasure and pain responses. The brain is sitting there wondering how the heck it can get this monkey to perform adaptive behavior, just like a human is sitting there wondering how the heck it can get the dog to roll over. The brain has a model of what sorts of behaviors will lead to success and thriving, just as the human has a model of what cute doggy behavior looks like.

And the brain knows that, with a shot of pleasure, the monkey is vastly more likely to repeat the action it just tried. Things that lead to juice are hard-wired to produce a spike of pleasure, so that juice-seeking behavior will be reinforced. But then the brain slowly starts to notice that there's no decision-tree node between a victory screen and juice—once the screen flashes, juice is inevitable.

So the relevant behavior must be further back. The brain starts reinforcing victory screens as a proxy for juice (which itself is a primordial proxy for calories and micronutrients). Whenever the victory screen appears, the monkey is rewarded by its own brain, such that it becomes more likely to do whatever it was doing just before the screen appeared. And all of this is happening below the level of conscious attention for the monkey—all it knows is it likes juice and it likes being happy and it does things that previously led to juice and happiness. Eventually, the monkey's brain starts rewarding behavior even further back (though probably with a lighter wash of anticipatory exhilaration rather than a sharp spike of pleasure): game actions that lead to victory screens that lead to juice that lead to happiness.

Conclusion: Your brain is conditioning you, all the time, often beneath your notice, toward proxies that, based on past experience, are likely to take you closer to your goals rather than farther away from them. Furthermore, by the combination of Claims 2 and 3, this conditioning is effective—it actually influences behavior to a meaningful degree.

Shitty corollary: Because proxies are always leaky, your brain is conditioning you wrong.

Case in point: Hypothetical Me is trying to lose weight (which is just another proxy), and I've decided to weigh myself every day because what gets measured gets managed (ha). My brain isn't explicitly smart, just implicitly clever, and it's on my side. It slowly starts to figure out that high scale numbers = bad, and low scale numbers = good, and it decides to do whatever it can with that information and its ability to send me visceral signals.

But I've had a few high scale number days, and because humans are risk-averse and loss-averse, those high scale number days hurt pretty badly and they get bumped up in the priority list. So my brain is sitting there with mirror-twin goals of maximize exposure to low scale numbers and minimize exposure to high scale numbers, and it doesn't really know how to do the former, but it sure as heck can do something about the latter, which is the one that seems more urgent anyway.

So I glance toward my bathroom scale, and (often at a level too low to grab my conscious attention) my brain deals me a helpful "owch" that disincentivizes the glance I just made. And because the owch was near-instantaneous, it works (see Claim 2). After a few iterations of this, I'm successfully conditioned into developing a big ol' blind spot where my bathroom scale is, such that I never even notice it anymore (and often such that I don't even notice that I'm not noticing).

If I'm lucky, eventually my train of thought wanders, and my real goal floats back up to the front of my mind, and I realize what's going on, and I say "thanks for trying, brain," (because it really is doing heroic work; don't beat it up for getting it just a tiny bit wrong because guess what, the beating-up is far closer to the noticing than it is to the mistake-making that you're actually trying to disincentivize, think about the implications aaaaaaahhhhhhhhh) and then I do a quick meditation on what the incentives ought to be and try to produce an S1 shift in the right direction.

But if I'm not lucky, this just becomes a part of my blind spot forever.

(Caveat: epistemic status of all of this is somewhat tentative, but even if you assign e.g. only 70% confidence in each claim (which seems reasonable) and you assign a 50% hit to the reasoning from sheer skepticism, naively multiplying it out as if all of the claims were independent still leaves you with a 12% chance that your brain is doing this to you, which seems at least worth a few cycles of trying to think about it and ameliorate the situation.)
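The multiplication in that caveat can be checked in a couple of lines. This is only a sketch of the naive calculation described above: it treats the four claims as independent, and the variable names are made up for illustration.

```python
# Numbers are taken from the caveat above; variable names are illustrative.
per_claim_confidence = 0.70   # confidence assigned to each claim
number_of_claims = 4          # Claims 1 through 4
skepticism_discount = 0.50    # the "50% hit" for sheer skepticism

# Naive joint probability, assuming the claims are independent.
joint = per_claim_confidence ** number_of_claims * skepticism_discount
print(f"{joint:.1%}")  # prints 12.0%
```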

27 comments

An important thing to notice about Goodhart's law is that we roll three different phenomena together. This isn't entirely bad because the three phenomena are very similar, but sometimes it helps to think about them differently.

Goodhart Level 1: Following the sugar/fruit example, imagine there are a bunch of different fruits with different levels of nutrients and sugar, and the nutrient and sugar levels are very correlated. It is still the case that if you optimize for the most sugar, you will get reliably less nutrients than if you optimize for the most nutrients. (This is just the cost of using a proxy. I barely even want to call this Goodhart's law.)

Goodhart Level 2: It is possible that foods with lots of sugar are usually very correlated with foods with lots of nutrients, but there is one type of fruit that is pure sugar with no nutrients. If this fruit occurred in nature, but only very rarely, it would not mess with the statistical correlation between sugar and nutrients very much. Thus when an agent optimizes very hard for sugar, they end up with no nutrients, but if they optimized only slightly, they would have found a normal fruit with lots of sugar and nutrients.

Note that this is more nuanced than just saying we didn't have pure sugar in the ancestral environment and we do now, so the actions that were good in the ancestral environment are a bad proxy for the actions that are good now. (Maybe just using a bad proxy should be called Goodhart Level 0) The point is that the reason that the environment is different now is that we optimized for sugar. We pushed to the section of possible worlds with lots of sugar using our sugar optimization, and the correlation mostly only existed in the worlds with a moderate amount of sugar.
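A toy version of Levels 1 and 2 may make the distinction concrete. The fruits and their numbers are invented for illustration (they are not data from anywhere), and `best_by` is a hypothetical helper that stands in for "optimizing hard" on one attribute.

```python
# Invented (sugar, nutrients) scores; in this toy set the two roughly track each other.
fruits = {
    "lime": (2, 3),
    "melon": (5, 5),
    "apple": (8, 9),
    "fig": (10, 8),
}

def best_by(options, index):
    """Return the name of the option that maximizes one attribute (0 = sugar, 1 = nutrients)."""
    return max(options, key=lambda name: options[name][index])

# Level 1: maximizing the proxy gets close to, but not exactly, the best true outcome.
sugar_pick = best_by(fruits, 0)      # "fig", which has 8 nutrients
nutrient_pick = best_by(fruits, 1)   # "apple", which has 9 nutrients

# Level 2: one rare option breaks the correlation, and hard optimization lands on it.
with_rare_fruit = dict(fruits, pure_sugar=(100, 0))
extreme_pick = best_by(with_rare_fruit, 0)   # "pure_sugar", which has 0 nutrients

print(sugar_pick, nutrient_pick, extreme_pick)
```

In the Level 1 case the cost of the proxy is small (8 nutrients instead of 9); in the Level 2 case the same selection rule returns something with no nutrients at all.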

Goodhart Level 3: Say we live in a world with only a fixed amount of nutrients, and someone else wants a larger share of the nutrients. If you are using the proxy of sugar, and other people know this, an adversary might invent candy, and then trade you their candy for some of your fruit. Another agent had a goal that was in conflict with your true goal, but not in conflict with your proxy, so they exploited your proxy and created options that satisfied your proxy but not your true goal.

I say more about this (in math) here:

Many instances of people Goodharting themselves fall in Level 2 (if I don't step on the scale, I optimize out of the worlds where the scale number is correlated with my weight). However, I claim that some instances might be at Level 3, in particular rationalization. Maybe part of me wants to save the world and uses the ability to produce justifications for why an action saves the world as a proxy for what saves the world. Then, a different part of me wants to goof off and produces a justification for why goofing off is actually the most world-saving action.

Awesome. Thanks for adding. I particularly like the inclusion of adversarial behavior into the mix—I hadn't thought of the goal structure of the candymakers as undercutting/exploiting/taking advantage of the goal structure of the humans.

Oh wow, I had never before thought of modern people over-consuming sugar, as being an application of Goodhart's Law. But it is. That's brilliant.

I very much agree with the ideas presented in this post; for people who are interested in finding out more, I very much recommend the book Don't Shoot the Dog, and maybe also The Power of Habit. That said, those books are pretty much written from a behaviorist perspective, so they don't go very much into the way that mental and abstract concepts become associated with value, as in your doctor example.

A couple of minor suggestions on how to improve on your post further: 1) I think that the ∃ in your Claim 2 is meant to be interpreted as "for a given effect and size, there exists a sufficiently small delay such that the desired result is produced", but I wouldn't have understood that notation if I hadn't had math as a minor in my degree, and probably not all readers have. 2) It might be good to quickly explain clicker training in a couple of sentences to people who haven't heard about it before.

Observing the link between wireheading and Goodhart's law seems to be an instance of what Paul Graham recommends in his latest essay. He claims that the most valuable insights are both general and surprising, but that those insights are very hard to find. So instead one is often better off searching for surprising takes on established general ideas, as OP seems to have done. :)

Thanks for the reading recommendations and the suggestions! I decided to leave ∃ for somewhat snarky incentivize-people-to-go-learn-a-thing reasons, but I linked to a clicker training video and will add a couple of sentences.


Note that correctly interpreting the ∃ thing isn't just about knowing that ∃ stands for "there exists"; it also takes a bit of additional knowledge to correctly unpack "for a given effect and size, there exists a sufficiently small delay" as "we can arbitrarily pick a certain effect and size that we want our intervention to have, and regardless of what we pick we can make our intervention satisfy those properties by making the delay small enough".
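Written out with explicit quantifiers, that reading of Claim 2 might look like the following. This is one possible formalization, not the post's own notation: $\mathrm{effect}(r, d)$ is a hypothetical function giving the strength of conditioning produced by reward $r$ delivered after delay $d$, and $E$ is the desired strength.

```latex
\forall r \;\; \forall E \;\; \exists d_0 > 0 \;\; \forall d \in (0, d_0) : \quad \mathrm{effect}(r, d) \ge E
```

As in an epsilon-delta proof, the target $E$ is chosen first and the bound $d_0$ is allowed to depend on it.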

In fact, in the first version of my comment I wrote something like "I'm interpreting ∃ to stand for 'there exists', but directly substituting that in to make the sentence read 'there exists a sufficiently small delay' doesn't create a sensible sentence", until after I thought that oh right, he means that there exists a duration of delay which makes this come true, and 'makes this come true' is defined as an inequality the way that it's defined when you're doing epsilon-delta proofs! Even if I'd otherwise known what "it exists" means, I don't know if I'd have managed to correctly interpret the sentence if I hadn't taken that analysis course and learned how to think about it.

Of course, I might just be particularly dense, and maybe everyone else would have understood it anyway. :-)

Hmmm ... that seems sensible, and produced a shift, but not enough to move my overall weighing. Cue metacognitive doubts about whether I'm just status-quo biasing into protecting my original decision. :-)

Note also that non-alphanumeric symbols are hard to google. I kind of guessed it from context but couldn't confirm until I saw Kaj's comment.

FWIW, I went through pretty much the same sequence of thoughts, which jarred me out of what was otherwise a pleasant/flowing read. Given the difficulty people unfamiliar with the notation faced in looking it up, maybe you could say "∃ (there exists)", and/or link to the relevant Wiki page (

If you're comfortable rephrasing the sentence a little more for clarity, I'd suggest replacing the part after the quantifier with something like "some length of delay between behavior and consequence which is short enough to produce the effect."

I also didn't know what it meant, and it didn't seem worth my time to look it up, it just made the post harder to read.

I claim that what's going on is that the monkey's brain, separate from the monkey/the monkey's S2/any sapient or strategic awareness that the monkey has, is conditioning the monkey.

I think this claim is confusing at best and false at worst. The shifting dopamine response is well-recognized in the neuroscience literature, and explained by Sutton and Barto's Temporal-Difference model.

First, it should be emphasized that midbrain dopamine does not signal reward. The monkey can experience a ton of pleasure without any dopamine reaction. Midbrain dopamine signals reward prediction error, the difference between actual and expected reward. It signals a kind of surprise.

Now the TD model is quite Bayesian. Whereas the Rescorla-Wagner model -- the previously dominant theory of reinforcement -- viewed the prediction error as the difference between actual and expected current reward, the TD model instead views it as the difference between all actual and expected future rewards (properly discounted).

So when the dopamine signal shifts, the monkey is just conserving expected evidence. Initially, it is positively surprised to receive juice. But eventually, it learns that the screen perfectly predicts the juice, and so it is the appearance of the screen itself that becomes the positive surprise. On a classical model of reinforcement, these events are different, as OP seems to recognize. But on the TD model, these are just instances of the very same kind of conditioning event.
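The shift described here can be reproduced with a very small tabular TD(0) simulation. This is a simplified sketch, not the model from the cited literature: the function name, the five-step gap between screen and juice, and the learning rate are all assumptions chosen for illustration, and the value of the pre-screen state is pinned at zero on the assumption that the screen arrives unpredictably.

```python
def run_td_trials(n_steps=5, n_trials=500, alpha=0.2, gamma=1.0):
    """Return (error_at_screen, error_at_juice) for each trial of a toy TD(0) learner."""
    values = [0.0] * n_steps  # one learned value per time step after the screen appears
    history = []
    for _ in range(n_trials):
        # Prediction error when the screen appears: the pre-screen value is held at 0.
        screen_error = gamma * values[0]
        for i in range(n_steps):
            is_last = i == n_steps - 1
            reward = 1.0 if is_last else 0.0            # juice arrives on the final step
            next_value = 0.0 if is_last else values[i + 1]
            delta = reward + gamma * next_value - values[i]
            values[i] += alpha * delta                  # standard TD(0) update
        juice_error = delta                             # error on the step that delivers juice
        history.append((screen_error, juice_error))
    return history

history = run_td_trials()
print(history[0])    # first trial: no surprise at the screen, full surprise at the juice
print(history[-1])   # late trial: the surprise has moved to the screen
```

On the first trial the only prediction error is at the juice; after many trials the error at the juice shrinks toward zero and the error at the screen grows toward one, which is the same qualitative pattern as the migrating spike.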

For further reference, see the section "Two Dopamine Responses and One Theory" of Glimcher PW (2011) Understanding dopamine and reinforcement learning: the dopamine reward prediction error hypothesis.

OP seems to recognize all this, but these observations seem to be complemented with somewhat unfounded interpretations and elaborations.

[Epistemic status: confident OP will be confusing to those without RL background knowledge, but still non-negligible credence that OP is explaining exactly the above but from a different perspective]

Thanks for the info! I think the diff between my explanation and yours largely falls out "true" in your favor, and I'm glad you have additional clarification (correction?) here.

Nitpick: as far as I can tell, you're describing discounting in general, not hyperbolic discounting. Hyperbolic discounting refers specifically to the surprising result that human discounting isn't exponential, like we expected it to be based on economics research.
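One way to see the difference between the two kinds of discounting is the preference reversal that a hyperbolic curve produces and an exponential curve does not. The sketch below uses the common textbook forms, amount / (1 + k * delay) and amount * exp(-k * delay); the reward sizes, delays, and k values are arbitrary choices for illustration.

```python
import math

def hyperbolic(amount, delay, k=1.0):
    """Hyperbolic discounting: value falls off as 1 / (1 + k * delay)."""
    return amount / (1.0 + k * delay)

def exponential(amount, delay, k=0.2):
    """Exponential discounting: value falls off as exp(-k * delay)."""
    return amount * math.exp(-k * delay)

def prefers_larger_later(discount, shift):
    """Choice between 50 units at `shift` and 100 units at `shift + 5`."""
    return discount(100, shift + 5) > discount(50, shift)

# Hyperbolic: the preference flips when both options are pushed into the future.
print(prefers_larger_later(hyperbolic, 0), prefers_larger_later(hyperbolic, 20))

# Exponential: the preference is the same no matter how far both options are pushed.
print(prefers_larger_later(exponential, 0), prefers_larger_later(exponential, 20))
```

With these numbers the hyperbolic discounter takes the smaller reward when it is available immediately but the larger one when both are twenty steps further away, while the exponential discounter makes the same choice at both horizons.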

I agree with the overall claim you're making, though I think you're making it out to be a stronger force than I expect it is. In general, I think a good prior is "what your brain is doing by default is pretty damn close to right, and you have to understand it pretty well before you can recognize the actual flaws in its behavior." I suspect you've mistaken where in the chain something is going wrong - I think it's something related to social permission to make mistakes, social permission to succeed/fail at goals, etc. In other words, you get a punishment from your internalized model of other people when you fail to have lost weight.

I would absolutely expect internalized models to be a part of the thing (to be one of the abstractions or simplifications that your S1 uses to understand all of the data it's ever experienced). I wouldn't be surprised to find out that they're the generator of a lot of the "this is serving my goals" or "this is threatening/dangerous" conclusions that lead to positive and negative pings. I would, however, be surprised to find out that they're the only thing, or even the dominant one. I think we might disagree on type or hierarchy?

I'm positing that the social stuff you're pointing out is like one of many "states" in the larger "nation" of brain-models-that-inform-the-brain's-decision-to-punish-or-reward, whereas if I'm understanding you correctly you're claiming either that the social modeling is the only model, or that the reward/punishment is always delivered through the social modeling channels (it always "comes from" some person-shaped thing in the head).

Please correct if I've misunderstood. I note that I wouldn't be surprised if it's like that for some people, but according to my introspection the social dynamic just doesn't have that much power for me personally.

So, (I claim that) machine learning models provide a pretty good basis for comparison of the dopamine-moving-earlier thing: eg, this is what you'd expect from a system that does a local reinforce-positive update on the policy net as soon as the value net starts predicting a higher future expected value. See something about actor-critic, eg section 3.2.1 of this pdf. Because we're starting from the prior that the brain is well enough designed to get pretty damn close to working, seeing that policy rewards move earlier is not evidence that should update us away from models where the brain is doing correct temporal difference learning (section 2.3.3 in that pdf).

The social thing I'm suggesting is that the expected value that the value function is predicting on seeing "oh, I gained weight" is a correct representation of future reward, even though it's a very simple approximation. I don't mean to say that I think a complicated, multi-step model is being run, just that the usual approximation is approximating a reasoning process that if done in full using the verbal loop, would look something like:

  1. I have higher weight

  2. I now know that I have higher weight

  3. I now have less justified ability to claim high status

  4. When I next interact with someone, I will have less claim to be valuable in their eyes

  5. I will therefore expect them to express slightly less approval toward me, because I won't be able to hide that I know I feel I have less justified ability to claim status

I am saying that I don't think implementation of TD-learning is the problem here.

Got it. That makes sense. I think I still disagree, but if I've understood you right I can agree that that hypothesis also clearly deserves to be in the mix.

This seems largely correct to me, although I think hyperbolic discounting of rewards/punishments over time may be less pronounced in human conditioning as compared to animals being conditioned by humans. Humans can think "I'm now rewarding myself for Action A I took earlier" or "I'm being punished for Action B", which seems, at least in my experience, to decrease the effect of the temporal distance, whereas animals seem less able to conceptualize the connection over time. Because of this difference, I think the temporal difference of reward/punishment is less important in people for conditioning as long as the individual is mentally associating the stimulus with the action, although it is still significant.

Also what's the name of the paper for the monkeys and juice study? I'd like to look at it because the result did surprise me.

Yeah, it makes a lot of sense to me that explicit cognition can interfere with the underlying, more "automatic" conditioning. Narrative framing and preforming intentions and focusing attention on the link between X and Y seem to have a strong influence on how conditioning does or doesn't work, and I don't know what the mechanisms are.

That being said, I think we agree that, in situations where there's not a lot of conscious attention on what's happening, the conditioning proceeds something like "normally," where "normal" is "comparable to what happens in less sapient animals"?

I couldn't dig up the original study from my phone but I found this, which references it:

For the specific case of weighing yourself, could you create a scale that only gives the positive reward, not the negative one? Like, it only tells you your weight if it's lower than yesterday, or better yet if the trend in your weight is downward over the past week? Maybe it displays a cheerful message and plays a soothing sound when you weigh yourself, and it emails you later at random if you've been losing weight.
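A rough sketch of that display logic, under one possible reading of "trend": today's reading is compared with the average of the previous week. The function name, the message text, and the seven-day window are all hypothetical choices, not a description of any real product.

```python
def scale_display(readings, window=7):
    """Return a cheerful message if the latest reading is below the recent average, else None."""
    if len(readings) < window + 1:
        return None  # not enough history yet, so show nothing
    today = readings[-1]
    previous = readings[-(window + 1):-1]   # the `window` readings before today
    baseline = sum(previous) / window
    if today < baseline:
        return f"Nice work: {today:.1f}"
    return None  # stay silent rather than deliver a negative signal

print(scale_display([80.0] * 7 + [79.5]))  # prints Nice work: 79.5
print(scale_display([80.0] * 7 + [80.5]))  # prints None
```

Returning `None` on an upward or flat trend is the design choice that matters here: the scale never shows a number that could act as a punishment.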

Yeah, those seem like ameliorative measures that are likely to help the brain adopt better goals beneath the hood.

"So my brain is sitting there with mirror-twin goals of maximize exposure to low scale numbers and minimize exposure to high scale numbers, "

I think my answer would be don't give it those goals. Give it rule following goals and model updating goals. Create rules from the models. If following the rules doesn't actually get you slimmer, update the model and update the rules.


I think you missed the central claim? I'm not saying those rules are good, nor that they are consciously installed and reflectively endorsed. I'm saying that your subconscious has goals like that that you, whpearson's conscious verbal loop, aren't fully aware of and don't notice and are manipulated (or at least influenced) by. This isn't a thing that's fixed by simply deciding on different rules—S1 doesn't communicate verbally, except indirectly by responding to stories and narratives.

Also, the idea that the solution to Goodhart's Law is "create rules" makes me feel like I failed to communicate Claim 1.

I was not saying that they were consciously endorsed, just that they were a product of taking a particular mindset which was consciously endorsed. E.g. my goal is to lose weight.

What I am suggesting, which is not a panacea but might not suck too much, is taking a different mindset. So "My goal is to understand the relationship between my activities and weight". So the scientific point of view. Once you have gotten a good understanding, then you can actually try and optimise your weight. More details on what I am trying to get across can be found in this post, which introduces it, and this post, which gives a hypothetical example of its application to a field and has a more formal description of what I mean in one of the comments.

Yeah, but I guess I'm still not communicating the part where a very large and important section of your brain doesn't just adopt the goals you consciously give it. The stuff you said above is true, but irrelevant in this context/overwhelmed or undermined by the effect the post is pointing at, which makes me continue to feel like you're not receiving the point I'm trying to convey.

We may be agreeing, but not being clear!

I one hundred percent agree that most of the brain is below conscious control and doesn't adopt your goals. What I think Goodhart's law should be guiding us towards is how we set the bits of the brain that are.

For example, losing weight by measuring weight and using that as a metric is *literally* setting the metric and measure the same. I was trying to point out a way that a measure could be used in pursuit of a goal but not be treated like a metric.

I didn't get the impression from your post that you thought that there was any sort of conscious thing you could do beforehand to try and head off the worst bits of Goodhart's. It was only post-hoc noticing that things are going wrong. Like noticing "hey I am about to start on a poorly understood problem, that my subconscious brain will optimise wrong if I just set solving it as my goal. Maybe instead I should try to understand it first, using the measure I was about to use as a metric."