Really interesting analysis of social science papers and replication markets. Some excerpts: 

Over the past year, I have skimmed through 2578 social science papers, spending about 2.5 minutes on each one. This was due to my participation in Replication Markets, a part of DARPA's SCORE program, whose goal is to evaluate the reliability of social science research. 3000 studies were split up into 10 rounds of ~300 studies each. Starting in August 2019, each round consisted of one week of surveys followed by two weeks of market trading. I finished in first place in 3 out of 10 survey rounds and 6 out of 10 market rounds. In total, about $200,000 in prize money will be awarded.

The studies were sourced from all social science disciplines (economics, psychology, sociology, management, etc.) and were published between 2009 and 2018 (in other words, most of the sample came from the post-replication-crisis era).

The average replication probability in the market was 54%; while the replication results are not out yet (175 of the 3000 papers will be replicated), previous experiments have shown that prediction markets work well.

This is what the distribution of my own predictions looks like:

[...]


Check out this crazy chart from Yang et al. (2020):

Yes, you're reading that right: studies that replicate are cited at the same rate as studies that do not. Publishing your own weak papers is one thing, but citing other people's weak papers? This seemed implausible, so I decided to do my own analysis with a sample of 250 articles from the Replication Markets project. The correlation between citations per year and (market-estimated) probability of replication was -0.05!
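For concreteness, the check described above takes only a few lines of pandas; this is a minimal sketch, and the file and column names are hypothetical, since the post does not publish its data layout.

```python
import pandas as pd

# Hypothetical file and column names -- the post does not publish its data layout.
df = pd.read_csv("replication_markets_sample.csv")
df["citations_per_year"] = df["total_citations"] / df["years_since_publication"]
r = df["citations_per_year"].corr(df["market_replication_prob"])  # Pearson by default
print(f"r = {r:.2f}")  # the post reports roughly -0.05 on its 250-paper sample
```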

You might hypothesize that the citations of non-replicating papers are negative, but negative citations are extremely rare. One study puts the rate at 2.4%. Astonishingly, even after retraction the vast majority of citations are positive, and those positive citations continue for decades after retraction.

As in all affairs of man, it once again comes down to Hanlon's Razor. Either:

  1. Malice: they know which results are likely false but cite them anyway.
  2. Stupidity: they can't tell which papers will replicate even though it's quite easy.

Accepting the first option would require a level of cynicism that even I struggle to muster. But the alternative doesn't seem much better: how can they not know? I, an idiot with no relevant credentials or knowledge, can fairly accurately determine good research from bad, but all the tenured experts cannot? How can they not tell which papers are retracted?

I think the most plausible explanation is that scientists don't read the papers they cite, which I suppose involves both malice and stupidity. Gwern has an interesting write-up on this question, citing some ingenious bibliographic analyses: "Simkin & Roychowdhury venture a guess that as many as 80% of authors citing a paper have not actually read the original". Once a paper is out there nobody bothers to check it, even though they know there's a 50-50 chance it's false!


Having read the original article, I was surprised at how long it was (compared to the brief excerpts), and how scathing it was, and how funny it was <3

Criticizing bad science from an abstract, 10000-foot view is pleasant: you hear about some stuff that doesn't replicate, some methodologies that seem a bit silly. "They should improve their methods", "p-hacking is bad", "we must change the incentives", you declare Zeuslike from your throne in the clouds, and then go on with your day.
But actually diving into the sea of trash that is social science gives you a more tangible perspective, a more visceral revulsion, and perhaps even a sense of Lovecraftian awe at the sheer magnitude of it all: a vast landfill—a great agglomeration of garbage extending as far as the eye can see, effluvious waves crashing and throwing up a foul foam of p=0.049 papers. As you walk up to the diving platform, the deformed attendant hands you a pair of flippers. Noticing your reticence, he gives a subtle nod as if to say: "come on then, jump in".

A typical paper doesn't just contain factual claims about standard questions, but also theoretical discussion and a point of view on the ideas that form the fabric of a field. Papers are often referenced to clarify the meaning of a theoretical discussion, or to give credit for inspiring the direction in which the discussion moves. This aspect doesn't significantly depend on the truth of particular studies' findings, because an interesting concept motivates many studies that both experimentally investigate and theoretically discuss it. Some of the studies will be factually bogus, but the theoretical discussion in them might still be relevant to the concept, and useful for subsequent good studies.

So a classification of citations into positive and negative ignores this important third category, something like conceptual reference citation.

Maybe we need a yearly award for the scientist who cites the most retracted papers?

I appreciated the analysis of what does and doesn't replicate, but the author has clearly never been in academia and many of their recommendations are off base. Put another way, the "what's wrong with social science" part is great, and the "how to fix it" is not.

Which specific parts did you have in mind?

My claims are really just for CS, idk how much they apply to the social sciences, but the post gives me no reason to think they aren't true for the social sciences as well.

  • Just stop citing bad research, I shouldn't need to tell you this, jesus christ what the fuck is wrong with you people.

This doesn't work unless it's common knowledge that the research is bad, since reviewers are looking for reasons to reject and "you didn't cite this related work" is a classic one (and your paper might be reviewed by the author of the bad work). When I was early in my PhD, I had a paper rejected where it sounded like a major contributing factor was not citing a paper that I specifically thought was not related but the reviewer thought was.

  • Read the papers you cite. Or at least make your grad students do it for you. It doesn't need to be exhaustive: the abstract, a quick look at the descriptive stats, a good look at the table with the main regression results, and then a skim of the conclusions. Maybe a glance at the methodology if they're doing something unusual. It won't take more than a couple of minutes. And you owe it not only to SCIENCE!, but also to yourself: the ability to discriminate between what is real and what is not is rather useful if you want to produce good research.

I think the point of this recommendation is to get people to stop citing bad research. I doubt it will make a difference since as argued above the cause isn't "we can't tell which research is bad" but "despite knowing what's bad we have to cite it anyway".

  • When doing peer review, reject claims that are likely to be false. The base replication rate for studies with p>.001 is below 50%. When reviewing a paper whose central claim has a p-value above that, you should recommend against publication unless the paper is exceptional (good methodology, high prior likelihood, etc.). If we're going to have publication bias, at least let that be a bias for true positives. Remember to subtract another 10 percentage points for interaction effects. You don't need to be complicit in the publication of false claims.

I have issues with this, but they aren't related to me knowing more about academia than the author, so I'll skip it. (And it's more like, I'm uncertain about how good an idea this would be.)

  • Stop assuming good faith. I'm not saying every academic interaction should be hostile and adversarial, but the good guys are behaving like dodos right now and the predators are running wild.

The evidence in the post suggesting that people aren't acting in good faith is roughly "if you know statistics then it's obvious that the papers you're writing won't replicate". My guess is that many social scientists don't know statistics and/or don't apply it intuitively, so I don't see a reason to reject the (a priori more plausible to me) hypothesis that most people are acting in okay-to-good faith.

I don't really understand the author's model here, but my guess is that they are assuming that academics primarily think about "here's the dataset and here are the analysis results and here are the conclusions". I can't speak to social science, but when I'm trying to figure out some complicated thing (e.g. "why does my algorithm work in setting X but not setting Y") I spend most of my time staring at data, generating hypotheses, making predictions with them, etc. which is very very conducive to the garden of forking paths that the author dismisses out of hand.

EDIT: Added some discussion of the other recommendations below, though I know much less about them, and here I'm just relying more on my own intuition rather than my knowledge about academia:

Earmark 60% of funding for registered reports (ie accepted for publication based on the preregistered design only, not results). For some types of work this isn't feasible, but for ¾ of the papers I skimmed it's possible. In one fell swoop, p-hacking and publication bias would be virtually eliminated.

I'd be shocked if 3/4 of social science papers could have been preregistered. My guess is that what happens is that researchers collect data, do a bunch of analyses, figure out some hypotheses, and only then write the paper.

Possibly the suggestion here is that all this exploratory work should be done first, then a study should be preregistered, and then the results are reported. My weak guess is that this wouldn't actually help replicability very much -- my understanding is that researchers are often able to replicate their own results, even when others can't. (Which makes sense! If I try to describe to a CHAI intern an algorithm they should try running, I often have the experience that they do something differently than I was expecting. Ideally in social science results would be robust to small variations, but in practice they aren't, and I wouldn't strongly expect preregistration to help, though plausibly it would.)

An NSF/NIH inquisition that makes sure the published studies match the pre-registration (there's so much """"""""""QRP"""""""""" in this area you wouldn't believe). The SEC has the power to ban people from the financial industry—let's extend that model to academia.

My general qualms about preregistration apply here too, but if we assume that we're going to have a preregistration model, then this seems good to me.

Earmark 10% of funding for replications. When the majority of publications are registered reports, replications will be far less valuable than they are today. However, intelligently targeted replications still need to happen.

This seems good to me (though idk if 10% is the right number, I could see both higher and lower).

Increase sample sizes and lower the significance threshold to .005. This one needs to be targeted: studies of small effects probably need to quadruple their sample sizes in order to get their power to reasonable levels. The median study would only need 2x or so. Lowering alpha is generally preferable to increasing power. "But Alvaro, doesn't that mean that fewer grants would be funded?" Yes.

Personally, I don't like the idea of significance thresholds and required sample sizes. I like having quantitative data because it informs my intuitions; I can't just specify a hard decision rule based on how some quantitative data will play out.

Even if this were implemented, I wouldn't predict much effect on reproducibility: I expect that what happens is the papers we get have even more contingent effects that only the original researchers can reproduce, which happens via them traversing the garden of forking paths even more. Here's an example with p-values of .002 and .006.

Andrew Gelman makes a similar case.
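To get a rough feel for the sample-size arithmetic in the quoted recommendation, here is an illustrative power calculation; the effect sizes and the 80% power target are my own assumptions, not the author's numbers.

```python
from statsmodels.stats.power import TTestIndPower

# Required sample size per group for a two-sample t-test at 80% power,
# comparing alpha = .05 with the proposed alpha = .005.
analysis = TTestIndPower()
for d in (0.2, 0.5):  # "small" and "medium" standardized effect sizes (illustrative)
    n_05 = analysis.solve_power(effect_size=d, alpha=0.05, power=0.8)
    n_005 = analysis.solve_power(effect_size=d, alpha=0.005, power=0.8)
    print(f"d={d}: ~{n_05:.0f} per group at alpha=.05, ~{n_005:.0f} at alpha=.005")
```

In this sketch, tightening alpha from .05 to .005 at fixed power costs roughly 1.7x the sample size regardless of effect size; the larger multipliers in the quote presumably come from studies that start out badly underpowered.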

Ignore citation counts. Given that citations are unrelated to (easily-predictable) replicability, let alone any subtler quality aspects, their use as an evaluative tool should stop immediately.

I am very on board with citation counts being terrible, but what should be used instead? If you evaluate based on predicted replicability, you incentivize research that says obvious things, e.g. "rain is correlated with wet sidewalks".

I suspect that you probably could build a better and still cost-efficient evaluation tool, but it's not obvious how.

Open data, enforced by the NSF/NIH. There are problems with privacy but I would be tempted to go as far as possible with this. Open data helps detect fraud. And let's have everyone share their code, too—anything that makes replication/reproduction easier is a step in the right direction.

Seems good, though I'd want to first understand what purpose IRBs serve (you'd have to severely roll back IRBs for open data to become a norm).

Financial incentives for universities and journals to police fraud. It's not easy to structure this well because on the one hand you want to incentivize them to minimize the frauds published, but on the other hand you want to maximize the frauds being caught. Beware Goodhart's law!

I approve of the goal "minimize fraud". This recommendation is too vague for me to comment on the strategy.

Why not do away with the journal system altogether? The NSF could run its own centralized, open website; grants would require publication there. Journals are objectively not doing their job as gatekeepers of quality or truth, so what even is a journal? A combination of taxonomy and reputation. The former is better solved by a simple tag system, and the latter is actually misleading. Peer review is unpaid work anyway, it could continue as is. Attach a replication prediction market (with the estimated probability displayed in gargantuan neon-red font right next to the paper title) and you're golden. Without the crutch of "high ranked journals" maybe we could move to better ways of evaluating scientific output. No more editors refusing to publish replications. You can't shift the incentives: academics want to publish in "high-impact" journals, and journals want to selectively publish "high-impact" research. So just make it impossible. Plus as a bonus side-effect this would finally sink Elsevier.

This seems to assume that the NSF would be more competent than journals for some reason. I don't think the problem is with journals per se, I think the problem is with peer review, so if the NSF continues to use peer review as the author suggests, I don't expect this to fix anything.

The author also suggests using a replication prediction market; as I mentioned above you don't want to optimize just for replicability. Possibly you could have replication + some method of incentivizing novelty / importance. The author does note this issue elsewhere but just says "it's a solvable problem". I am not so optimistic. I feel like similar a priori reasoning could have led to the author saying "reproducibility will be a solvable problem".

some method of incentivizing novelty / importance

Citation count clearly isn't a good measure of accuracy, but it's likely a good measure of importance in a field. So we could run some kind of expected value calculation where the usefulness of a paper is measured by P(result is true) * (# of citations) - P(result is false) * (# of citations) = (# of citations) * [P(result is true) - P(result is false)].

Edit: where the probabilities are approximated by replication markets. I think this function gives us what we actually want, so optimizing institutions to maximize it seems like a good idea.

Edit: This doesn't fully represent what we want, since journals could just force everyone to cite the same well-replicated study to inflate its citation count. It's a reasonable measurement of what we want, but not a goal we should optimize institutions to maximize.
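A toy sketch of the proposed score, with made-up numbers, just to show how the weighting behaves:

```python
# Toy illustration of the score above: citations weighted by the
# (market-estimated) probability of replication. All numbers are made up.

def paper_score(citations: int, p_replicates: float) -> float:
    # citations * [P(true) - P(false)] = citations * (2 * p_replicates - 1)
    return citations * (2 * p_replicates - 1)

print(paper_score(citations=100, p_replicates=0.9))  #  80.0
print(paper_score(citations=100, p_replicates=0.5))  #   0.0 -- coin-flip papers contribute nothing
print(paper_score(citations=100, p_replicates=0.2))  # -60.0 -- likely-false papers count against you
```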

some method of incentivizing novelty / importance

Lesswrong upvote count.

Slightly more seriously: Propagation through the academic segments of centerless curation networks. The author might be anticipating continued advances in social media technology, conventions of use, and uptake. Uptake and improvements in conventions of use, at least, seem to be visibly occurring. Advances in technology seem less assured, but I will do what I can.

This problem seems to me to have the flavor of Moloch and/or inadequate equilibria. Your criticisms have two parts: the pre-edit part, based on your personal experience, in which you explain why the personal actions the author recommends aren't really possible because of the inadequate equilibria (i.e. because of academic incentives); and the criticism of the author's proposed non-personal actions, which you say is based just on intuition.

I think the author would be unsurprised that the personal actions are not reasonable. They have already said this problem requires government intervention, basically to resolve the incentive problem. But maybe at the margin you can take some of the actions that the author refers to in the personal actions. If a paper is on the cusp of "needing to be cited" but you think it won't replicate, take that into account! Or if reviewing a paper, at least take into account the probability of replication in your decision.

I think you are maybe reading the author's claim to "stop assuming good faith" too literally. In the subsequent sentence they are basically refining that to the idea that most people are acting in good faith, but are not competent enough for good faith to be a useful assumption, which seems reasonable to me.

If a paper is on the cusp of "needing to be cited" but you think it won't replicate, take that into account! Or if reviewing a paper, at least take into account the probability of replication in your decision.

Why do you think people don't already do this?

In general, if you want to make a recommendation on the margin, you have to talk about what the current margin is.

I think you are maybe reading the author's claim to "stop assuming good faith" too literally. In the subsequent sentence they are basically refining that to the idea that most people are acting in good faith, but are not competent enough for good faith to be a useful assumption

Huh? The sentence I see is

I'm not saying every academic interaction should be hostile and adversarial, but the good guys are behaving like dodos right now and the predators are running wild.

"the predators are running wild" does not mean "most people are acting in good faith, but are not competent enough for good faith to be a useful assumption".

Why do you think people don't already do this?

They have to do it to some extent, otherwise replicability would be literally uncorrelated with publishability, which probably isn't the case. But because of the outcomes, we can see that people aren't doing it enough at the margin, so encouraging people to move as far in that direction as they can seems like a useful reminder.

There are two models here. One is that everyone is a homo economicus when citing papers, so no amount of persuasion is going to adjust people's citations: they are already making the optimal tradeoff based on their utility function of their personal interests vs. society's interests. The other is that people are subject to biases and blind spots, or just haven't even really considered whether they have the OPTION of not citing something that is questionable, in which case reminding them is a useful affordance.

I'm trying to be charitable to the author here, to recover useful advice. They didn't say things in the way I'm saying them. But they may have been pointing in a useful direction, and I'm trying to steelman that.

"the predators are running wild" does not mean "most people are acting in good faith, but are not competent enough for good faith to be a useful assumption".

Even upon careful rereading of that sentence, I disagree. But to parse this out based on this little sentence is too pointless for me. Like I said, I'm trying to focus on finding useful substance, not nitpicking the author, or you!

What are some of the recommendations that seem most off base to you?

Replied to John below

Even the crème de la crème of economics journals barely manage a ⅔ expected replication rate.

Is a two-thirds replication rate necessarily bad? This is an honest question, since I don't know what the optimal replication rate would be. Seems worth noting that a) a 100% replication rate seems too high, since it would indicate that people were only doing boring experiments that were certain to replicate, and b) "replication rate" seems to mean "does the first replication attempt succeed", and some fraction of replication attempts will fail due to random chance even if the effect is genuine.

I think there's an idea that a paper with a p=0.05 finding should replicate 95% of the time. If it doesn't then the p-value was wrong. 

That's not really what a p-value means though, right? The actual replication rate should depend on the prior and the power of the studies.
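To make that concrete, here is a back-of-the-envelope calculation; the prior, power, and alpha values are made-up illustrations, not numbers from the thread.

```python
# Expected replication rate for p < .05 findings, given a prior on true effects,
# the original studies' power, and the replication studies' power.

def expected_replication_rate(prior, alpha, power_orig, power_rep):
    # Share of published positives that are true effects (positive predictive value).
    true_pos = prior * power_orig
    false_pos = (1 - prior) * alpha
    ppv = true_pos / (true_pos + false_pos)
    # True effects replicate at the replication study's power;
    # false positives "replicate" only at rate alpha.
    return ppv * power_rep + (1 - ppv) * alpha

# E.g. 10% of tested hypotheses are true, originals run at 50% power,
# replications at 95% power:
print(expected_replication_rate(prior=0.10, alpha=0.05, power_orig=0.5, power_rep=0.95))
# ~0.52 -- in the same ballpark as the replication rates discussed above
```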

I don't think a high replication rate necessarily implies the experiments were boring. Suppose you do 10 experiments, but they're all speculative and unlikely to be true: let's say only one of them is looking at a true effect, BUT your sample sizes are enormous and you have a low significance cutoff. So you detect the one effect and get 9 nulls on the others. When people try to replicate them, they have a 100% success rate on both the positive and the negative results.

The fraction of attempts that will fail due to random chance depends on the power, and replicators tend to go for very high levels of power, so typically you'd have about 5% false negatives or so in the replications.

I think the most plausible explanation is that scientists don't read the papers they cite

Indeed. Reading an abstract and skimming intro/discussion is as far as it goes in most cases. Sometimes it's just the title that is enough to trigger a citation. Often it's "reciting", copying the references from someone else's paper on the topic. My guess is that maybe 5% of references in a given paper have actually been read by the authors.

Andrew Gelman's take here.

I think there is an important (and obvious) third alternative to the two options presented at the end (of the snippet, rather early in the full piece), namely that many scientists are not very interested in the truth value of the papers they cite. This is neither malice nor stupidity. There is simply no mechanism to punish scientists who cite bad science (and it is not clear there should be, in my opinion). If a paper passes the initial hurdle of peer review it is officially Good Enough to be cited as well, even if it is later retracted (or, put differently, "I'm not responsible for the mistakes the people I cited make, the review committee should have caught it!").

If you're a scientist, your job is ostensibly to uncover the truth about your field of study, so I think being uninterested in the truth of the papers you cite is at least a little bit malicious.

Certainly, but it's not malicious in the sense of deliberately citing bad science. More like negligence.