Mini sequence: ~30min of rationality writing per day for 30 days (2/30)
Claim 1: Goodhart's Law is true.
Goodhart's Law (which is incredibly appropriately named) is usually stated as "When a measure becomes a target, it ceases to be a good measure." Another way to say this is "proxies are leaky," i.e. the proxy never quite gets you the thing it was intended to get you. If you want to be able to differentiate between promising math students and less-promising ones, you can try out a range of questions and challenges until you cobble together a test that the 100 best students do well on and the remaining 900 do worse on. But as soon as you make that test the test, it's going to start leaking. In the tenth batch of a thousand students, the 100 best ones will still do quite well, but you'll also get a bunch of people who don't have the generalized math skill, but who did get good at answering the specific, known questions. Your top 100 will no longer be composed only of the 100 actual-best math students.
This is analogous to what's happened with Western diets and sugar. Prehistoric primates who happened to have a preference for sweet things (fruit) also happened to get a lot more vitamins and minerals, and therefore they survived and thrived at higher rates than the sugar-ambivalent primates who failed to become our ancestors and died out. The process of natural selection turned a measure of nutrition (sweetness) into a target (a sweet tooth/implicit hardwired assumption that more sugar → more utility), which was fine until we learned to separate the sugar from the nutrients (teaching to the test) and discovered that our preferences were hardwired to the proxy rather than to the Actual Good Thing.
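The test-leakage story can be made concrete with a toy simulation (the scoring model and all numbers here are invented for illustration, not an empirical claim): each student has a true skill level plus some amount of "grind" on the now-public questions, and we check how many of the test's top 100 are actually the 100 best.

```python
import random

random.seed(0)  # arbitrary seed; the point holds for any batch

# 1000 hypothetical students: (true_skill, grind), both drawn uniformly,
# where "grind" is how well they've memorized the now-known questions.
students = [(random.random(), random.random()) for _ in range(1000)]

def overlap_with_truly_best(leak):
    """How many of the test's top 100 are among the 100 actual-best students?
    `leak` is how much memorizing the known questions boosts your score."""
    by_score = sorted(students, key=lambda s: s[0] + leak * s[1], reverse=True)
    truly_best = set(sorted(students, key=lambda s: s[0], reverse=True)[:100])
    return sum(1 for s in by_score[:100] if s in truly_best)

print(overlap_with_truly_best(0.0))  # 100: a fresh test selects exactly the best
print(overlap_with_truly_best(1.0))  # fewer than 100: grinders displace real talent
```

The moment the grind term enters the score (the test becomes the target), ranking by the proxy stops coinciding with ranking by the thing the proxy was built to find.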
Claim 2: When attempting to do operant conditioning with a given reward or punishment, for any desired strength-of-conditioning-effect, ∃ ("there exists") a sufficiently small delay between behavior and consequence to produce that effect.
This one is not literally true. For it to be true, the hyperbolic nature of discounting (closer rewards are disproportionately more effective at creating reinforcement) would have to extend off to absurdity, such that an infinitesimally small reward could produce an arbitrarily large conditioning effect if it were immediately proximal to the relevant behavior. And if that were true, then clicker training (in which you use a click sound that's been associated with treats and compliments and other rewards to signal to a dog that you like what it just did) wouldn't reinforce the distant behavior of rolling over; it would instead reinforce something like the last blink of the dog's eye before the soundwave of the click reached the dog's ear.
However, I claim it is effectively true, for rewards as small as fleeting thoughts or shifts in emotion, and for time scales as small as hundredths of a second. If I want an anti-Oreo conditioning effect that is as strong as the pleasure-burst I receive from eating an Oreo, I can get it, even with a stimulus as small as a thought—provided that thought pops up fast enough.
(This is actually why clicker training is a thing—because you literally cannot deliver a treat fast enough to produce effects of the size you can get through the much-tighter feedback loop provided by the audio channel. If you can make a click into a positive reward for a dog, then you're better off clicking than tossing cheese cubes.)
(For a model of why discounting is hyperbolic, consider the bits of data required to locate and confirm a causal link between Behavior #736 and a reward that doesn't appear until after Behavior #755, compared to the bits required to be confident of the link when the reward appears only one or two behaviors later.)
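The clicker-versus-cheese tradeoff can be sketched with the standard hyperbolic discount curve, value / (1 + k·delay); the function name, the discount rate k, and all the numbers below are invented for illustration:

```python
def conditioning_strength(reward_value, delay_s, k=10.0):
    # Hyperbolic discounting: the conditioning effect of a reward falls
    # off as 1 / (1 + k * delay). k is a made-up discount rate here,
    # not an empirically fitted constant.
    return reward_value / (1.0 + k * delay_s)

# A tiny, near-instant reward can out-condition a big, late one,
# which is the clicker-training logic from above:
click = conditioning_strength(reward_value=1.0, delay_s=0.05)  # ~0.67
treat = conditioning_strength(reward_value=5.0, delay_s=2.0)   # ~0.24
```

Under this curve, a reward one-fifth the size delivered forty times faster still wins, which is why tightening the feedback loop beats fattening the reward.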
Claim 3: Our S1s aggregate and analyze a tremendous amount of sensory data into implicit causal models, and those causal models produce binary approach-avoid signals when we encounter new stimuli, based on whether or not (according to those models) those stimuli will be helpful or hurtful re: progress toward our goals.
I think this is what Anna Salamon is after when she talks about "taste." Imagine a veteran doctor who has, in their long career, chased down the explanations for hundreds of confusing, confounded, or hitherto-unknown ailments. In investigating a thousand hypotheses, maybe 100 of them panned out, and 800 of them led to brick walls, and 100 of them remain inconclusive. The part of their brain that builds and maintains a rich, inner model of the universe is (quietly, under the hood) drawing connections between those investigations, noting the elements that the successful ones had in common versus the elements that the unsuccessful ones had in common. When our doctor encounters a new patient and starts investigating, some part of their system makes a lightning-fast comparison—does this new line of research "feel like" or "resemble" the ones that previously paid off, or is it more reminiscent of the ones that ended in frustration?
That information gets compressed into a quick yes-or-no, good-or-bad, approach-or-avoid signal—a gut sense of doom or optimism, interest or disinterest. To the extent that there's been lots of relevant experience and the new situation is in the same class as the old ones, this sense can be extremely accurate and valuable—what we call taste or intuition or second nature—and even when there's been very little training data, this sense can still provide useful insight.
Claim 4: Our brains condition us, often without us noticing.
In brief: there were studies with monkeys whose brains were hooked up to detectors and who had straws positioned to squirt juice into their mouths. When those monkeys exhibited desired behaviors, the scientists would give them a shot of juice, and the detectors would register a dopamine spike.
After a while, though, the dopamine spike migrated. It became associated with a "victory!" screen that the scientists would flash whenever the monkey performed a desired behavior, just like a dog begins to associate clicks with treats and other rewards.
Pause to let yourself be confused for a second. Don't gloss over this.
What. The. Heck.
The dopamine spike moves? How? Why?
I claim that what's going on is that the monkey's brain, separate from the monkey/the monkey's S2/any sapient or strategic awareness that the monkey has, is conditioning the monkey. Remember, a system that is capable of learning from its environment and meaningfully updating on that learning is more likely to survive and thrive than one that does not, so it makes sense that the monkey has some functional, adaptive processes in place to shape its own behavior. Basically, the monkey's brain has access to a) a ton of data, and b) carrots-and-sticks, in the form of pleasure and pain responses. The brain is sitting there wondering how the heck it can get this monkey to perform adaptive behavior, just like a human is sitting there wondering how the heck it can get the dog to roll over. The brain has a model of what sorts of behaviors will lead to success and thriving, just as the human has a model of what cute doggy behavior looks like.
And the brain knows that, with a shot of pleasure, the monkey is vastly more likely to repeat the action it just tried. Things that lead to juice are hard-wired to produce a spike of pleasure, so that juice-seeking behavior will be reinforced. But then the brain slowly starts to notice that there's no decision-tree node between a victory screen and juice—once the screen flashes, juice is inevitable.
So the relevant behavior must be further back. The brain starts reinforcing victory screens as a proxy for juice (which itself is a primordial proxy for calories and micronutrients). Whenever the victory screen appears, the monkey is rewarded by its own brain, such that it becomes more likely to do whatever it was doing just before the screen appeared. And all of this is happening below the level of conscious attention for the monkey—all it knows is it likes juice and it likes being happy and it does things that previously led to juice and happiness. Eventually, the monkey's brain starts rewarding behavior even further back (though probably with a lighter wash of anticipatory exhilaration rather than a sharp spike of pleasure): game actions that lead to victory screens that lead to juice that lead to happiness.
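One standard computational story for this backward migration is temporal-difference learning, in which the "surprise" (prediction error) at each step plays the role of the dopamine spike and updates the learned value of that step, so the surprise drifts back toward the earliest reliable predictor. A toy sketch, with the learning rate and trial count invented:

```python
# States within each trial, in order. The prediction error ("delta")
# stands in for the dopamine spike.
states = ["game_action", "victory_screen", "juice"]
reward = {"juice": 1.0}
V = {s: 0.0 for s in states}   # learned value of each state
alpha = 0.5                    # learning rate (arbitrary)
biggest_surprise = []          # state with the largest prediction error, per trial

for trial in range(30):
    deltas = {}
    for i, s in enumerate(states):
        v_next = V[states[i + 1]] if i + 1 < len(states) else 0.0
        deltas[s] = reward.get(s, 0.0) + v_next - V[s]  # prediction error
    for s in states:
        V[s] += alpha * deltas[s]
    biggest_surprise.append(max(states, key=lambda s: deltas[s]))

print(biggest_surprise[0])    # "juice": at first, only the juice is surprising
print(biggest_surprise[-1])   # "game_action": the spike has migrated backward
```

Early on, only the juice itself is unexpected; once the victory screen reliably predicts juice, the screen inherits the surprise; once the game action reliably predicts the screen, the surprise moves all the way back, just as with the monkeys.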
Conclusion: Your brain is conditioning you, all the time, often beneath your notice, toward proxies that, based on past experience, are likely to take you closer to your goals rather than farther away from them. Furthermore, by the combination of Claims 2 and 3, this conditioning is effective—it actually influences behavior to a meaningful degree.
Shitty corollary: Because proxies are always leaky, your brain is conditioning you wrong.
Case in point: Hypothetical Me is trying to lose weight (which is just another proxy), and I've decided to weigh myself every day because what gets measured gets managed (ha). My brain isn't explicitly smart, just implicitly clever, and it's on my side. It slowly starts to figure out that high scale numbers = bad, and low scale numbers = good, and it decides to do whatever it can with that information and its ability to send me visceral signals.
But I've had a few high-scale-number days, and because humans are risk-averse and loss-averse, those days hurt pretty badly and get bumped up in the priority list. So my brain is sitting there with the mirror-twin goals of "maximize exposure to low scale numbers" and "minimize exposure to high scale numbers," and it doesn't really know how to do the former, but it sure as heck can do something about the latter, which is the one that seems more urgent anyway.
So I glance toward my bathroom scale, and—often at a level too low to grab my conscious attention—my brain deals me a helpful "owch" that disincentivizes the glance I just made. And because the owch was near-instantaneous, it works (see Claim 2). After a few iterations of this, I'm successfully conditioned into developing a big ol' blind spot where my bathroom scale is, such that I never even notice it anymore (and often such that I don't even notice that I'm not noticing).
If I'm lucky, eventually my train of thought wanders, and my real goal floats back up to the front of my mind, and I realize what's going on, and I say "thanks for trying, brain," (because it really is doing heroic work; don't beat it up for getting it just a tiny bit wrong because guess what, the beating-up is far closer to the noticing than it is to the mistake-making that you're actually trying to disincentivize, think about the implications aaaaaaahhhhhhhhh) and then I do a quick meditation on what the incentives ought to be and try to produce an S1 shift in the right direction.
But if I'm not lucky, this just becomes a part of my blind spot forever.
(Caveat: the epistemic status of all of this is somewhat tentative, but even if you assign e.g. only 70% confidence to each of the four claims (which seems reasonable) and take a further 50% hit to the reasoning from sheer skepticism, naively multiplying it out as if all of the claims were independent still leaves you with a 12% chance that your brain is doing this to you, which seems at least worth a few cycles of trying to think about it and ameliorate the situation.)
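The caveat's arithmetic checks out; as a quick sanity check, treating the four claims as independent the way the text does:

```python
confidence_per_claim = 0.70   # confidence assigned to each of the four claims
skepticism_discount = 0.50    # additional hit to the overall reasoning

p = confidence_per_claim ** 4 * skepticism_discount
print(round(p, 2))  # 0.12
```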