Statistical Prediction Rules Out-Perform Expert Human Judgments

A parole board considers the release of a prisoner: Will he be violent again? A hiring officer considers a job candidate: Will she be a valuable asset to the company? A young couple considers marriage: Will they have a happy marriage?

The cached wisdom for making such high-stakes predictions is to have experts gather as much evidence as possible, weigh this evidence, and make a judgment. But 60 years of research has shown that in hundreds of cases, a simple formula called a statistical prediction rule (SPR) makes better predictions than leading experts do. Or, more exactly:

When based on the same evidence, the predictions of SPRs are at least as reliable as, and are typically more reliable than, the predictions of human experts for problems of social prediction.1

For example, one SPR developed in 1995 predicts the price of mature Bordeaux red wines at auction better than expert wine tasters do. Reaction from the wine-tasting industry to such wine-predicting SPRs has been "somewhere between violent and hysterical."

How does the SPR work? This particular SPR is called a proper linear model, which has the form:

P = w1(c1) + w2(c2) + w3(c3) + ... + wn(cn)

The model calculates the summed result P, which aims to predict a target property such as wine price, on the basis of a series of cues. Above, cn is the value of the nth cue, and wn is the weight assigned to the nth cue.2

In the wine-predicting SPR, c1 reflects the age of the vintage, and other cues reflect relevant climatic features where the grapes were grown. The weights for the cues were assigned on the basis of a comparison of these cues to a large set of data on past market prices for mature Bordeaux wines.3
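To make the arithmetic concrete, here is a minimal Python sketch of a proper linear model. The cue names, cue values, and weights below are invented for illustration only; Ashenfelter's actual coefficients came from regressing past auction prices on the climate data.

```python
def spr_predict(weights, cues):
    """Proper linear model: P = w1*c1 + w2*c2 + ... + wn*cn."""
    if len(weights) != len(cues):
        raise ValueError("need exactly one weight per cue")
    return sum(w * c for w, c in zip(weights, cues))

# Hypothetical cues for one vintage: [age of vintage (years),
# growing-season temperature (deg C), harvest rainfall (mm)]
weights = [0.02, 0.6, -0.004]  # invented weights, not Ashenfelter's
cues = [10, 17.5, 300]
score = spr_predict(weights, cues)  # 0.2 + 10.5 - 1.2 = 9.5 on this toy scale
```

The only "expertise" in the model lives in the weights; once those are fit to historical data, prediction is a single weighted sum.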

There are other ways to construct SPRs, but rather than survey those construction details, I will turn to the incredible success of SPRs.

  • Howard and Dawes (1976) found they can reliably predict marital happiness with one of the simplest SPRs ever conceived, using only two cues: P = [rate of lovemaking] - [rate of fighting]. The reliability of this SPR was confirmed by Edwards & Edwards (1977) and by Thornton (1977).
  • Unstructured interviews reliably degrade the decisions of gatekeepers (e.g. hiring and admissions officers, parole boards, etc.). Gatekeepers (and SPRs) make better decisions on the basis of dossiers alone than on the basis of dossiers plus unstructured interviews (Bloom & Brundage 1947; DeVaul et al. 1957; Oskamp 1965; Milstein et al. 1981; Hunter & Hunter 1984; Wiesner & Cronshaw 1988). If you're hiring, you're probably better off not doing interviews.
  • Wittman (1941) constructed an SPR that predicted the success of electroshock therapy for patients more reliably than the medical or psychological staff.
  • Carroll et al. (1988) found an SPR that predicts criminal recidivism better than expert criminologists.
  • An SPR constructed by Goldberg (1968) did a better job of diagnosing patients as neurotic or psychotic than did trained clinical psychologists.
  • SPRs regularly predict academic performance better than admissions officers do, whether for medical schools (DeVaul et al. 1957), law schools (Swets, Dawes & Monahan 2000), or graduate school in psychology (Dawes 1971).
  • SPRs predict loan and credit risk better than bank officers do (Stillwell et al. 1983).
  • SPRs predict which newborns are at risk for Sudden Infant Death Syndrome better than human experts do (Lowry 1975; Carpenter et al. 1977; Golding et al. 1985).
  • SPRs are better at predicting who is prone to violence than are forensic psychologists (Faust & Ziskin 1988).
  • Libby (1976) found a simple SPR that predicted firm bankruptcy better than experienced loan officers.

And that is barely scratching the surface.

If this is not amazing enough, consider the fact that even when experts are given the results of SPRs, they still can't outperform those SPRs (Leli & Filskov 1984; Goldberg 1968).

So why aren't SPRs in use everywhere? Probably, suggest Bishop & Trout, we deny or ignore the success of SPRs because of deep-seated cognitive biases, such as overconfidence in our own judgments. But if these SPRs work as well as or better than human judgments, shouldn't we use them?

Robyn Dawes (2002) drew out the normative implications of such studies:

If a well-validated SPR that is superior to professional judgment exists in a relevant decision making context, professionals should use it, totally absenting themselves from the prediction.

Sometimes, being rational is easy. When there exists a reliable statistical prediction rule for the problem you're considering, you need not waste your brain power trying to make a careful judgment. Just take an outside view and use the damn SPR.4

 

 

Recommended Reading

 

Notes

1 Bishop & Trout, Epistemology and the Psychology of Human Judgment, p. 27. The definitive case for this claim is made in a 1996 study by Grove & Meehl that surveyed 136 studies yielding 617 comparisons between the judgments of human experts and SPRs (in which humans and SPRs made predictions about the same cases and the SPRs never had more information than the humans). Grove & Meehl found that of the 136 studies, 64 favored the SPR, 64 showed roughly equal accuracy, and 8 favored human judgment. Since these last 8 studies "do not form a pocket of predictive excellence in which [experts] could profitably specialize," Grove and Meehl speculated that these 8 outliers may be due to random sampling error.

2 Readers of Less Wrong may recognize SPRs as a relatively simple type of expert system.

3 But, see Anatoly_Vorobey's fine objections.

4 There are occasional exceptions, usually referred to as "broken leg" cases. Suppose an SPR reliably predicts an individual's movie attendance, but then you learn he has a broken leg. In this case it may be wise to abandon the SPR. The problem is that there is no general rule for when experts should abandon the SPR. When they are allowed to do so, they abandon the SPR far too frequently, and thus would have been better off sticking strictly to the SPR, even for legitimate "broken leg" instances (Goldberg 1968; Sawyer 1966; Leli and Filskov 1984).

 

References

Bloom & Brundage (1947). "Predictions of Success in Elementary School for Enlisted Personnel", Personnel Research and Test Development in the Bureau of Naval Personnel, ed. D.B. Stuit, 233-61. Princeton: Princeton University Press.

Carpenter, Gardner, McWeeny, & Emery (1977). "Multistage scoring system for identifying infants at risk of unexpected death", Arch. Dis. Childh., 53: 606-612.

Carroll, Winer, Coates, Galegher, & Alibrio (1988). "Evaluation, Diagnosis, and Prediction in Parole Decision-Making", Law and Society Review, 17: 199-228.

Dawes (1971). "A Case Study of Graduate Admissions: Applications of Three Principles of Human Decision-Making", American Psychologist, 26: 180-88.

Dawes (2002). "The Ethics of Using or Not Using Statistical Prediction Rules in Psychological Practice and Related Consulting Activities", Philosophy of Science, 69: S178-S184.

DeVaul, Jervey, Chappell, Carver, Short, & O'Keefe (1957). "Medical School Performance of Initially Rejected Students", Journal of the American Medical Association, 257: 47-51.

Edwards & Edwards (1977). "Marriage: Direct and Continuous Measurement", Bulletin of the Psychonomic Society, 10: 187-88.

Faust & Ziskin (1988). "The expert witness in psychology and psychiatry", Science, 241: 1143-1144.

Goldberg (1968). "Simple Models or Simple Processes? Some Research on Clinical Judgments", American Psychologist, 23: 483-96.

Golding, Limerick, & MacFarlane (1985). Sudden Infant Death. Somerset: Open Books.

Howard & Dawes (1976). "Linear Prediction of Marital Happiness", Personality and Social Psychology Bulletin, 2: 478-80.

Hunter & Hunter (1984). "Validity and utility of alternate predictors of job performance", Psychological Bulletin, 96: 72-98

Leli & Filskov (1984). "Clinical Detection of Intellectual Deterioration Associated with Brain Damage", Journal of Clinical Psychology, 40: 1435–1441.

Libby (1976). "Man versus model of man: Some conflicting evidence", Organizational Behavior and Human Performance, 16: 1-12.

Lowry (1975). "The identification of infants at high risk of early death", Med. Stats. Report, London School of Hygiene and Tropical Medicine.

Milstein, Wilkinson, Burrow, & Kessen (1981). "Admission Decisions and Performance during Medical School", Journal of Medical Education, 56: 77-82.

Oskamp (1965). "Overconfidence in Case Study Judgments", Journal of Consulting Psychology, 63: 81-97.

Sawyer (1966). "Measurement and Prediction, Clinical and Statistical", Psychological Bulletin, 66: 178-200.

Stillwell, Barron, & Edwards (1983). "Evaluating Credit Applications: A Validation of Multiattribute Utility Weight Elicitation Techniques", Organizational Behavior and Human Performance, 32: 87-108.

Swets, Dawes, & Monahan (2000). "Psychological Science Can Improve Diagnostic Decisions", Psychological Science in the Public Interest, 1: 1–26.

Thornton (1977). "Linear Prediction of Marital Happiness: A Replication", Personality and Social Psychology Bulletin, 3: 674-76.

Wiesner & Cronshaw (1988). "A meta-analytic investigation of the impact of interview format and degree of structure on the validity of the employment interview", Journal of Occupational Psychology, 61: 275-290.

Wittman (1941). "A Scale for Measuring Prognosis in Schizophrenic Patients", Elgin Papers 4: 20-33.

Comments


I'm skeptical, and will now proceed to question some of the assertions made/references cited. Note that I'm not trained in statistics.

Unfortunately, most of the articles cited are not easily available. I would have liked to check the methodology of a few more of them.

For example, one SPR developed in 1995 predicts the price of mature Bordeaux red wines at auction better than expert wine tasters do.

The paper doesn't actually establish what you say it does. There is no statistical analysis of expert wine tasters, only one or two anecdotal statements of their fury at the whole idea. Instead, the SPR is compared to actual market prices - not to experts' predictions. I think it's fair to say that the claim I quoted overreaches.

Now, about the model and its fit to data. Note that the SPR is older than 1995, when the paper was published. The NYTimes article about it which you reference is from 1990 (the paper bizarrely dates it to 1995; I'm not sure what's going on there).

The fact that there's a linear model - not specified precisely anywhere in the article - which is a good fit to wine prices for vintages of 1961-1972 (Table 3 in the paper) is not, I think, very significant on its own. To judge the model, we should look at what it predicts for upcoming years. Both the paper and the NYTimes article make two specific predictions. First, the 1986 vintage, claimed to be extolled by experts early on, will prove mediocre because of the weather conditions that year (see Figure 3 in the paper - 1986 is clearly the worst of the '80s). NYTimes says "When the dust settles, he predicts, it will be judged the worst vintage of the 1980's, and no better than the unmemorable 1974's or 1969's". The 1995 paper says, more modestly, "We should expect that, in due course, the prices of these wines will decline relative to the prices of most of the other vintages of the 1980s". Second, the 1989 and 1990 vintages are predicted to be "outstanding" (paper), "stunningly good" (NYTimes), and, "adjusted for age, will outsell at a significant premium the great 1961 vintage" (NYTimes).

It's now 16 years later. How do we test these predictions?

First, I've stumbled on a different paper from the primary author, Prof. Ashenfelter, from 2007. Published 12 years later than the one you reference, this paper has substantially the same contents, with whole pages copied verbatim from the earlier one. That, by itself, worries me. Even more worrying is the fact that the 1986 prediction, prominent in the 1990 article and the 1995 paper, is completely missing from the 2007 paper (the data below might indicate why). And most worrying of all is the change of language regarding the 1989/1990 prediction. The 1995 paper says about its prediction that the 1989/1990 will turn out to be outstanding, "Many wine writers have made the same predictions in the trade magazines". The 2007 paper says "Ironically, many professional wine writers did not concur with this prediction at the time. In the years that have followed minds have been changed; and there is now virtually unanimous agreement that 1989 and 1990 are two of the outstanding vintages of the last 50 years."

Uhm. Right. Well, because the claims aren't stated very strongly, the two papers do not exactly contradict each other, but the change leaves a bad taste. I don't think I should give much trust to these papers' claims.

The data I could find quickly to test the predictions is here. The prices are broken down by the chateaux, by the vintage year, the packaging (I've always chosen BT - bottle), and the auction year (I've always chosen the last year available, typically 2004). Unfortunately, Ashenfelter underspecifies how he came up with the aggregate prices for a given year - he says he chose a package of the best 15 wineries, but doesn't say which ones or how the prices are combined. I used 5 wineries that are specified as the best in the 2007 paper, and looked up the prices for years 1981-1990. The data is in this spreadsheet. I haven't tried to statistically analyze it, but even from a quick glance, I think the following is clear. 1986 did not stabilize as the worst year of the 1980s. It is frequently second- or third-best of the decade. It is always much better than either 1984 or 1987, which are supposed to be vastly better according to the 1995 paper's weather data (see Figure 3). 1989/1990 did turn out well, especially 1990. Still, they're both nearly always less expensive than 1982, which is again vastly inferior in the weather data (it isn't even in the best quarter). Overall, I fail to see much correlation between the weather data in the paper for the 1980s, the specific claims about 1986 and 1989/1990, and the market prices as of 2004. I wouldn't recommend using this SPR to predict market prices.

Now, this was the first example in your post, and I found what I believe to be substantial problems with its methodology and the quality of its SPR. If I were to proceed and examine every example you cite in the same detail, would I encounter many such problems? It's difficult to tell, but my prediction is "yes". I anticipate overfitting and shoddy methodology. I anticipate huge influence of the selection bias - the authors that publish these kinds of papers will not publish a paper that says "The experts were better than our SPR". And finally, I anticipate overreaching claims of wide-reaching applicability of the models, based on papers that actually indicate modest effect in a very specific situation with a small sample size.

I've looked at your second example:

Howard and Dawes (1976) found they can reliably predict marital happiness with one of the simplest SPRs ever conceived, using only two cues: P = [rate of lovemaking] - [rate of fighting].

I couldn't find the original paper, but the results are summarised in Dawes (1979). Looking at it, it turns out that when you say "predict marital happiness", it really means "predicts one of the partners' subjective opinion of their marital happiness" - as opposed to, e.g., the stability of the marriage over time. There's no indication as to how the partner to question was chosen from each pair (e.g. whether the experimenter knew the rate when they chose). There was very good correlation with the binary outcome (happy/unhappy), but when a finer scale of 7 degrees of happiness was used, the correlation was weak - about 0.4. In a follow-up experiment, the correlation went up to 0.8, but there the subjects looked at the lovemaking/fighting statistics before opining on their degree of happiness, thus contaminating their decision. And even in the earlier experiment, the subjects had been recording those lovemaking/fighting statistics in the first place, so it would make sense for them to recall those events when asked to assess whether their marriage is a happy one. Overall, the model is witty and naively appears to be useful, but the suspect methodology and the relatively weak correlation encourage me to discount the analysis.

Finally, the following claim is the single most objectionable one in your post, to my taste:

If you're hiring, you're probably better off not doing interviews.

My own experience strongly suggests to me that this claim is inane - and is highly dangerous advice. I'm not able to view the papers you base it on, but if they're anything like the first and second examples, they're far, far away from convincing me of the truth of this claim, which I in any case strongly suspect overreaches gigantically beyond what the papers prove. It may be true, for example, that a very large body of hiring decision-makers in a huge organisation or a state on average make poorer decisions based on their professional judgement during interviews than they would have made based purely on the resume. I can see how this claim might be true, because any such very large body must be largely incompetent. But it doesn't follow that it's good advice for you to abstain from interviewing - it would only follow if you believe yourself to be no more competent than the average hiring manager in such a body, or in the papers you reference. My personal experience from interviewing many, many candidates for a large company suggests that interviewing is crucial (though I will freely grant that different kinds of interviews vary wildly in their effectiveness).

If you're hiring, you're probably better off not doing interviews.

My own experience strongly suggests to me that this claim is inane - and is highly dangerous advice... My personal experience from interviewing many, many candidates for a large company suggests that interviewing is crucial (though I will freely grant that different kinds of interviews vary wildly in their effectiveness).

The whole point of this article is that experts often think themselves better than SPRs when actually they perform no better than SPRs on average. Here we have an expert telling us that he thinks he would perform better than an SPR. Why should we be interested?

Because I didn't just state a blanket opinion. I dug into the studies, looked for data to test one of them in depth, and found it to be highly flawed. I called into question the methodology employed by the studies, as well as overgeneralizing and overreaching conclusions they're drummed up to support. The evidence that at least some studies are flawed and the methodology is shoddy should make you question the universal claim "... actually they perform no better than SPRs on average". That's why you should be interested.

My personal experience with interviewing is certainly not as important a piece of evidence against the article as the specific criticisms of the studies. It's just another anecdotal data point. That's why I didn't expand on it as much as I did on the wine study, although I do believe it can be made more convincing through further elucidation.

Cool, I'll look into these points.

I made one small change so far. The above article now reads: "Reaction from the wine-tasting industry to such wine-predicting SPRs has been 'somewhere between violent and hysterical.'"

Also, I'll post links to the specific papers when I have time to visit UCLA and grab them.

Psychology is not my field, but my understanding is that the 'interview effect' for unstructured interviews is a very robust finding across many decades. For more, you can listen to my interview with Michael Bishop. But hey, maybe he's wrong!

Update 1: If I read the 1995 study correctly, they judged the accuracy of wine tasters by comparing the price of immature wines to those of mature wines, but I'm not sure. The way I phrased that is from Bishop & Trout, and that is how Bishop recalls it, though it's been several years now since he co-wrote Epistemology and the Psychology of Human Judgment.

My own experience strongly suggests to me that this claim is inane ... it would only follow if you believe yourself to be no more competent than the average hiring manager in such a body, or in the papers you reference.

What evidence do you have that you are better than average?

My personal experience from interviewing many, many candidates for a large company suggests that interviewing is crucial

"It is difficult to get a man to understand something, when his salary depends upon his not understanding it!"

I have heard of one job interview that I felt constituted a useful tool that could not effectively be replaced by resume examination and statistical analysis. A friend of mine got a job working for a company that provides mathematical modeling services for other companies, and his "interview" was a several hour test to create a number of mathematical models, and then explaining to the examiner in layman's terms how and why the models worked.

Most job interviews are really not a demonstration of job skills and aptitude, and it's possible to simply bullshit your way through them. On the other hand, if you have a simple and direct way to test the competence of your applicants, then by all means use it.

That isn't an interview, it's a test. Tests are extremely useful. IQ tests are an excellent predictor of job performance, maybe the best one available. Regardless, IQ tests are usually de facto illegal in the US due to disparate impact.

I'm most familiar with interviews for programming jobs, where an interview that doesn't ask the candidate to demonstrate job-specific skills, knowledge and aptitude is nearly worthless. These jobs are also startlingly prone to resume distortion that can make vastly different candidates look similar, especially recent graduates.

Asking for coding samples and calling previous employers, especially if coupled with a request for code solving a new (requested) problem, could potentially replace interviews. However, judging the quality of code still requires a person, so that doesn't seem to really change things to me.

So why aren't SPRs in use everywhere? Probably, we deny or ignore the success of SPRs because of deep-seated cognitive biases, such as overconfidence in our own judgments. But if these SPRs work as well as or better than human judgments, shouldn't we use them?

Without even getting into the concrete details of these models, I'm surprised that nobody so far has pointed out the elephant in the room: in contemporary society, statistical inference about human behavior and characteristics is a topic bearing tremendous political, ideological, and legal weight. [*] Nowadays there exists a firm mainstream consensus that the use of certain sorts of conditional probabilities to make statistical predictions about people is discriminatory and therefore evil, and doing so may result not only in loss of reputation, but also in serious legal consequences. (Note that even if none of the forbidden criteria are built into your decision-making explicitly, that still doesn't leave you off the hook -- just search for "disparate impact" if you don't know what I'm talking about.)

Now of course, making any prediction about people at all necessarily involves one sort of statistical discrimination or another. The boundaries between the types of statistical discrimination that are considered OK and those that are considered evil and risk legal liability are an arbitrary result of cultural, political, and ideological factors. (They would certainly look strange and arbitrary to someone who isn't immersed in the culture that generated them to the point where they appear common-sensical or at least explicable.) Therefore, while your model may well be accurate in estimating the probability of recidivism, job performance, etc., it's unlikely that it will be able to navigate the social conventions that determine these forbidden lines. A lot of the seemingly absurd and ineffective rituals and regulations in modern business, government, academia, etc. exist exactly for the purpose of satisfying these complex constraints, even if they're not commonly thought of as such.

--

[*] Edit: I missed the comment below in which the commenter Student_UK already raised a similar point.

If the best way to choose who to hire is with a statistical analysis of legally forbidden criteria, then keep your reasons secret and shred your work. Is that so hard?

That doesn't close the loophole, it adds a constraint. And it's only significant for those who both hire enough people to be vulnerable to statistical analysis of their hiring practices, and receive too many bad applicants from protected classes. If it is a significant constraint, you want to find that out from the data, not from guesswork, and apply the minimum legally acceptable correction factor.

Besides, it's not like muggles are a protected class. And if they were? Just keep them from applying in the first place, by building your office somewhere they can't get to. There aren't any legal restrictions on that.

Besides, it's not like muggles are a protected class. And if they were? Just keep them from applying in the first place, by building your office somewhere they can't get to. There aren't any legal restrictions on that.

You joke, but the world [1] really is choking with inefficient, kludgey workarounds for the legal prohibition of effective employment screening. For example, the entire higher education market has become, basically, a case of employers passing off tests to universities that they can't legally administer themselves. You're a terrorist if you give an IQ test to applicants, but not if you require a completely irrelevant college degree that requires taking the SAT (or the military's ASVAB or whatever they call it now).

It feels so good to ban discrimination, as long as you don't have to directly face the tradeoff you're making.

[1] Per MattherW's correction, this should read "Western developed economies" instead of "the world" -- though I'm sure the phenomenon I've described is more general than the form it takes in the West.

That doesn't close the loophole, it adds a constraint.

Yes, it does close the loophole. You say conceal the cause (intent to discriminate) and you can get away with as much effect (disproportionate exclusion) as you want. Except the law already specifies that the effect is punishable as well as the cause.

So now the best you can do, assuming the populations are equally competent and suited for the job, is 20% discrimination.

And of course, in the real world, populations usually differ in their suitability for the job. Blacks tend not to have as many CS degrees as whites, for example. So if you are an employer of CS degrees, you may not be able to get away with any discrimination before you have breached the 20% limit, and may need to discriminate against the non-blacks in order to be compliant.

Besides, it's not like muggles are a protected class.

I would suspect that if the US Muggle legal system had anything to say about it, they would be. If magical-ness is conferred by genes, then it's violating either the general racial guideline or it's violating recent laws (signed by GWB, IIRC) forbidding employer discrimination based on genetics (in the context of genome sequencing, true, but probably general). If it's not conferred by genes, then there may be a general cultural basis on which to sue (Muggles as disabled because they lack an ability necessary for basic functioning in Wizarding society, perhaps).

Also, if I may be permitted to make a more general criticism in response to this post, I would say that while the article appears to be well-researched, it has demonstrated some of the worst problems I commonly notice on this forum. The same goes for the majority of the comments, even though many are knowledgeable and informative. What I have in mind is the fixation on concocting theories about human behavior and society based on various idées fixes and leitmotifs that are parts of the intellectual folklore here, while failing to notice issues suggested by basic common sense that are likely to be far more important.

Thus the poster notices that these models are not used in practice despite considerable evidence in their favor, and rushes to propose cognitive biases à la Kahneman & Tversky as the likely explanation. This without even stopping to think of two questions that just scream for attention. First, what is the importance of the fact that just about any issue of sorting out people is nowadays likely to be ideologically charged and legally dangerous? Second, what about the fact that these models are supposed to throw some high-status people out of work, and in a way that makes them look like they've been incompetent all along?

Regardless of whether various hypotheses based on these questions have any merit, the fact that someone could write a post without even giving them the slightest passing attention, offering instead a blinkered explanation involving the standard old LW/OB folklore, and still get upvoted to +40 is, in my opinion, indicative of some severe and widespread biases.

While this post has +40 upvotes, the majority of the top-voted comments are skeptical of it. I think this represents confusion as to how to upvote, although this is merely a hypothesis. The article surveys a very interesting topic that is right in the sweet spot of interest for this community, and it appears scholarly; however, the conclusions synthesized by the author strike me as naive, and I suspect that's also the conclusion of the majority. Whether it deserves an upvote is debatable. I downvoted.

My intent was to summarize the literature on SPRs, not provide an account for why they are not used more widely. I almost didn't include that sentence at all. Surely, more analysis would be important to have in a post intending to discuss the psychological issues involved in our reaction to SPRs, but that was not my subject.

In pointing to cognitive biases as an explanation, I was merely repeating what Bishop & Trout & Dawes have suggested on the matter, not making up my own explanations in light of LW lore.

In fact, the arrows point the other way. Many of the authors cited in my article worked closely with people like Kahneman who are the original academic sources of much of LW lore.

Edit: I've added a clause about the source of the "cognitive biases" suggestion, in case others are tempted to make the same mistaken assumption as you made.

An interesting story that I think I remember reading:

One study found that relatively inexperienced psychiatrists were more accurate at diagnosing mental illness than experienced ones. This is because inexperienced psychiatrists stuck closely to checklists rather than rely on their own judgment, and whether or not a diagnosis was considered "accurate" was based on how closely the reported symptoms matched the checklist. ;)

If this is not amazing enough, consider the fact that even when experts are given the results of SPRs, they still can't outperform those SPRs (Leli & Filskov 1984; Goldberg 1968).

Now THAT part is just plain embarrassing. I mean, it's truly a mark of shame upon us if we have a tool that we know works, we are given access to the tool, and we still can't do better than the tool itself, unaided. (EDIT: By "we", I mean "the experts in the relevant fields"... which I guess isn't really a "we" as such, but you know what I mean)

Anyways, are there any nice online indexes or whatever of SPRs that make it easy to put in a class of problem and have it find an SPR that's been verified to work for that sort of problem?

If anybody would like to try some statistical machine learning at home, it's actually not that hard. The tough part is getting a data set. Once that's done, most of the examples in this article are things you could just feed to some software like Weka, press a few buttons, and get a statistical model. BAM!

Let's try an example. Here is some breast cancer diagnostic data, showing a bunch of observations of people with breast cancer (age, size of tumors, etc.) and whether or not the cancer recurred after treatment. Can we predict cancer recurrence?

If you look at it with a decision tree, it turns out that you can get about 70% accuracy by observing just two of the several factors recorded, in a very simple decision procedure. You can do a little better with something more sophisticated, like a naive Bayes classifier. These models also show us which factors matter most, and in what way.

If you're interested, go ahead and play around. It's pretty easy to get started. Obviously, take everything with a grain of salt, but still, basic machine learning is surprisingly easy.
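To make the "very simple decision procedure" concrete, here is a minimal pure-Python sketch of a two-cue rule of the kind a decision tree might learn. Everything here is invented for illustration: the cue names, the thresholds, and the sample records are hypothetical, not taken from the actual breast cancer dataset.

```python
# A toy two-cue decision rule for predicting cancer recurrence, in the
# spirit of the simple decision tree described above. The thresholds
# and records below are made up for illustration.

def predict_recurrence(age, tumor_size_mm):
    """Hypothetical two-cue decision procedure."""
    if tumor_size_mm >= 30:
        return True          # large tumor: predict recurrence
    return age < 40          # small tumor: predict recurrence only for younger patients

def accuracy(records):
    """Fraction of records where the rule matches the observed outcome."""
    hits = sum(predict_recurrence(age, size) == recurred
               for age, size, recurred in records)
    return hits / len(records)

# Invented sample data: (age, tumor size in mm, did the cancer recur?)
sample = [
    (35, 12, False), (62, 8, False), (50, 45, False), (41, 10, False),
    (29, 22, True), (70, 35, True), (55, 5, False), (33, 40, True),
    (48, 15, False), (60, 28, True),
]

print(round(accuracy(sample), 2))  # -> 0.7 on this invented data
```

The point is just how little machinery such a rule needs: two comparisons, and you already have something whose accuracy you can measure against held-out data.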

I second the advice.

Let me brag a bit. Once, in a friendly discussion, the following question came up: how do you predict, for an unknown first name, whether it is a male or female name? This was in a context of Hungarian names, as all of us were Hungarians. I had a list of Hungarian first names in digital format. The discussion turned into a bet: I said I could write a program in half an hour that tells with at least 70% precision the sex of a first name it has never seen before. I am quite fast at writing small scripts. It wasn't even close: it took me 9 minutes to

  • split my sets of 1000 male and 1000 female names into a random 1000-1000 train-test split,
  • split each name into character 1-, 2-, and 3-grams, e.g. Luca was turned into ^L u c a$ ^Lu uc ca$ ^Luc uca$,
  • feed the training data into a command line tool to train a maxent model,
  • test the accuracy of the model on the unseen test data.

The model reached an accuracy of 90%. In retrospect, this is not surprising at all. Looking into the linear model, the most important feature it identified was whether the name ends with an 'a'. This trivial model alone reaches some 80% precision for Hungarian names, so if I knew this in advance, I could have won the bet in 30 seconds instead of 9 minutes, with the sed command s/a$/a FEMALE/.
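For anyone who wants to replicate the fun part at home, here is a pure-Python sketch of the two pieces described above: the character-n-gram split (matching the Luca example) and the trivial ends-with-'a' baseline. The function names are mine, and the real version fed the n-grams to a command-line maxent tool rather than anything shown here.

```python
def char_ngrams(name, max_n=3):
    """Split a name into character 1- to max_n-grams, marking word
    boundaries: grams at the start get a '^' prefix and grams at the
    end get a '$' suffix, so 'Luca' yields ^L u c a$ ^Lu uc ca$ ^Luc uca$."""
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(name) - n + 1):
            gram = name[i:i + n]
            if i == 0:
                gram = '^' + gram
            if i + n == len(name):
                gram = gram + '$'
            grams.append(gram)
    return grams

def guess_sex(name):
    """The trivial ~80%-precision baseline the maxent model rediscovered:
    Hungarian female names overwhelmingly end in 'a'."""
    return 'FEMALE' if name.lower().endswith('a') else 'MALE'

print(char_ngrams('Luca'))           # ['^L', 'u', 'c', 'a$', '^Lu', 'uc', 'ca$', '^Luc', 'uca$']
print(guess_sex('Luca'), guess_sex('Gabor'))  # FEMALE MALE
```

The n-gram features are what let the maxent model go beyond the one-letter rule: the last-character gram carries the 'a$' signal, and the other grams pick up the residual patterns that close the gap from 80% to 90%.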

This is a great article, but it only lists studies where SPRs have succeeded. In fairness, it would be good to know if there were any studies that showed SPRs failing (and also consider publication bias, etc.).

I have two concerns about the practical implementation of this sort of thing:

  1. It seems like there are cases where, if a rule is known to be in use, people could game it: in job applications or admissions to medical schools, for example. A better understanding of how the rule relates to what it predicts would be needed.

If X + Y predicts Z, does that mean enhancing X and Y will raise the probability of Z? Not necessarily; consider the example of happy marriages. Will having more sex make your relationship happier? Or does the rule work because happy couples tend to have more sex?

  2. It is not true in every case that we equally value all true beliefs, and equally value all false beliefs. Certain rules might work better if we take into consideration a person's race, sex, religion, and nationality. But most people find this sort of thing unpalatable because it can lead to the systematic persecution of subgroups, even if it results in more true, and fewer false, beliefs overall. It also might be the case that some of these rules discriminate against groups of people in more subtle ways that won't be immediately obvious.

Of course neither of these problems mean that there won't be perfectly good cases where these rules would improve decision making a lot.

Yes, several of these models look like they're likely to run into trouble of the Goodhart's law type ("Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes").

Will having more sex make your relationship happier?

Obviously, yes.

It probably depends somewhat on with whom you are having it.

My gut reaction is that this doesn't demonstrate that SPRs are good, just that humans are bad. There are tons of statistical modeling algorithms that are more sophisticated than SPRs.

Unless, of course, SPR is another word for "any statistical modeling algorithm", in which case this is just the claim that statistical machine learning is a good approach, which anyone as Bayesian as the average LessWronger probably agrees with.

Well, SPRs can plausibly outperform average expertise. That's because most of the expertise is an utter and complete sham.

Take the recidivism example.

The judges, psychologists, and the like: what in the world makes them experts at predicting which criminals will reoffend? Did they study an unbiased sample of recidivism outcomes? Did they get any practice, with feedback, at predicting it? Anything?

A resounding no. They never in their lives did anything that should have earned them expert status on this task. They did other stuff that puts them first on the list when you're looking for "experts" on a topic for which there are no experts.

They are about as much experts on this task as a court janitor is an expert on law. He too has done nothing related to law; he has merely cleaned the courtroom.

Besides the legal issues with discrimination and disparate impact, another important issue here is that jobs that involve making decisions about people tend to be high-status. As a very general tendency, the higher-status a profession is, the more its practitioners are likely to organize in a guild-like way and resist intrusive innovations by outsiders -- especially innovations involving performance metrics that show the current standards of the profession in a bad light, or even worse, those that threaten a change in the way their work is done that might lower its status.

Discussions of such cases in medicine are a regular feature on Overcoming Bias, but the phenomenon exists in a more or less pronounced form in every other high-status profession too. How much it accounts for the specific cases discussed in the above article is a complex question, but it should certainly be considered as a plausible part of the explanation.

Sometimes, being rational is easy. When there exists a reliable statistical prediction rule for the problem you're considering, you need not waste your brain power trying to make a careful judgment.

Unfortunately, for most situations, ready-made linear models are simply not available. The dozen or so in the literature are the exception, not the rule.