Hearsay, Double Hearsay, and Bayesian Updates

Application of: How Much Evidence Does It Take?

(trigger warning: some description of domestic violence)

Summary: I discuss the strengths and weaknesses of one way that the American legal system tries to assess and cope with the unreliability of certain kinds of evidence. After explaining the relevant rules with references to a few recent famous cases and a non-notable case that I'm working on now, I briefly consider whether this part of the evidence code is above or below the sanity waterline, and suggest an incremental improvement.


Recently, I got to the point in my legal career where people are trusting me to write evidentiary briefs, i.e., to argue in front of a judge about what kinds of evidence are reliable enough to be safely presented to a jury. There is an odd division of epistemological labor in the American court system: judges are thought to be better than juries at resisting passionate or manipulative oratory, and juries are thought to be better than judges at resisting bribery and (pre-existing) personal hatred. As a result, potentially inflammatory or unreliable evidence is presented first to a judge, who (much like one of Eliezer's Confessors) is supposed to sift the exhibit to see if normal people can handle it without losing their tenuous grip on sanity. If and only if the evidence seems safe for ordinary human consumption, the judge will allow the lawyers to argue about that evidence in front of the jury. Otherwise, the evidence sits in a cardboard box in an unheated warehouse, safely away from the eyes of the jury, until it's time for an appeal.

The Hearsay Rule

By way of a concrete example, one famous recent case featured a recorded 911 call made by a domestic violence victim to the emergency phone operator. The operator asked questions about the location and identity of the person who was accused of beating the caller. The caller answered the questions on tape, explicitly identifying her abuser as Mr. Adrian Martell Davis, and the answers were used first to find and arrest the suspect, and ultimately to convict him. The victim was apparently too intimidated to testify in open court, and so her recorded statement as to the name of her abuser was absolutely necessary to support a conviction -- no recording, no conviction. Under the 400-year-old hearsay rule, recorded testimony typically is not allowed to be presented to a jury -- courts are concerned that the person giving the recorded statement might be pressured by the police in ways that wouldn't show up on tape, and that allowing a witness to testify without showing up in court unfairly deprives the defendant of a chance to (a) cross-examine the witness, and (b) have the jury see any facial tics, body language, etc. that undercut the witness's credibility. In the 911 case, though, the Court faced a straight choice between finding an exception to the hearsay rule and letting an apparent abuser go free.

In making this choice, the US Supreme Court managed to ignore a variety of emotionally salient but epistemologically irrelevant distractions, such as the seriousness of the crime, the relative helplessness of the victim, and the respectability of the 911 operator. Instead, the Court focused on the purpose for which the 911 statements were obtained. If the statements were obtained to help gather information needed to safely resolve an ongoing emergency, they could be used at trial. If the statements, however, were obtained to gather information about a past event, they could *not* be used at trial.

The theory supporting this distinction seems to have been that the right to cross-examine and the right to have the jury see body language are fungible elements of a more general reliability test. A stranger's assertion, without more, could be true or could be false. It doesn't count as very much evidence. To turn an assertion into enough evidence to convict someone beyond a reasonable doubt, you need to show that the assertion comes with "indicia of reliability." Two of these indicia are cross-examination and body language -- if a story checks out despite a vigorous unfriendly interview and the peer pressure of having to tell the story while physically in the room with other people from your community, then that's pretty good evidence. But you might have reasons to believe a story even if you don't get cross-examination or body language. In the case of the 911 call, one might think that the caller had a strong motive to tell the truth, because if she didn't, then the police would go looking for the wrong guy, and her abuser would come find her and continue hurting her. Similarly, one might think that the operators had a strong motive to ask fair, non-leading questions, because if they didn't get the right answer, then the police might show up in the wrong neighborhood or with the wrong expectations, and there could be an unnecessary firefight. Finally, one could argue that a recorded statement made as events were unfolding is inherently more reliable (in some ways) than a narrative given months or years after the event; human memory gets corrupted faster than 8-track tapes.

Some combination of these factors convinced the Court to admit the evidence. Other, very similar cases have been decided differently. Whether they got that particular decision right or wrong, though, the framework of "indicia of reliability" is hard-coded into American evidence law, especially for civil cases. If you want to present evidence to a jury based on a statement that was made outside of court, you have to give at least one reason why the statement is nevertheless reliable.

Double and Triple Hearsay

Here's where things really get interesting: if your out-of-court statement quotes another out-of-court statement, the evidence is called "double hearsay," and you need to independently verify each statement. If any link in the chain breaks, the whole document gets excluded. For example, in the case I'm working on now, the defendants want to show the jury a report filled out by California's Occupational Health and Safety Administration ("OSHA"). The OSHA report is based almost entirely on an accident report form filled out by a private corporation. That report form, in turn, is based almost entirely on an informal interview of the only eyewitness to an accident. So the defendants can use the OSHA report if and only if the OSHA report (A), the accident report (B), and the informal interview (C) are all reliable: usable ↔ (A ∧ B ∧ C).
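One way to see why each added link matters: if the layers can fail independently, the chance that the whole chain faithfully transmits the eyewitness's account is the product of the per-layer reliabilities, which shrinks with every layer. A minimal sketch -- the 0.9 figures are made-up illustrative numbers, not estimates from the actual case:

```python
# Hypothetical per-layer reliabilities for the three-layer chain:
# OSHA report (A), corporate accident report (B), eyewitness interview (C).
reliabilities = {"OSHA report": 0.9, "accident report": 0.9, "interview": 0.9}

# If the layers fail independently, the whole chain is faithful only
# when every layer is, so the probabilities multiply.
chain_reliability = 1.0
for layer, r in reliabilities.items():
    chain_reliability *= r

print(f"P(every layer faithful) = {chain_reliability:.3f}")  # 0.9^3 = 0.729
```

Even with each layer 90% reliable, the chain as a whole is faithful barely 73% of the time.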

To try to qualify the OSHA report, the defendants are arguing that the OSHA report is reliable under the public record exception to the hearsay rule, meaning that the public officials who prepared it had a stronger interest in accurately reporting public information than they did in the outcome of the accident victim's private case. To get the accident report form in, the defendants are arguing that it is reliable under the business record exception to the hearsay rule, meaning that the corporate officials who prepared it had a stronger interest in making sure their company had access to accurate information about safety risks than they did in the outcome of any one customer's lawsuit. As for the informal interview...well, I honestly have no idea how they plan to justify its reliability. But, then again, I'm biased. My professional interest lies in making sure that the whole string of unhelpful quotations stays in a cardboard box in a dank garage, far away from any juries.

Do the Rules Work?

So far, I've been pleasantly surprised at how well the American legal system handles some of these challenges. The fact that we have a two-tiered system of evaluating evidence at all is a cut above average -- imagine, e.g., the doctor who examines you taking notes on your condition, filtering out any subjective comments you make about how you're sure it's just a cold, and reporting only your objective symptoms to a second doctor, who then renders a diagnosis. Or imagine a team of business consultants who interview a Fortune 500 company's leadership team, and then pass their written notes back to a team at HQ (who has never met the executives) so that HQ can catch any obvious mistakes in reasoning before sending out recommendations. We know, intellectually, that meeting people tends to make us friendlier toward them and more likely to adopt their point of view even if we encounter no Bayesian evidence that increases the plausibility of their opinions, but our institutions rarely take steps to guard against that bias.

I think my biggest criticism of the American evidence code is that it doesn't account for uncertainty in the model. For instance, if I read the headline on a piece of science journalism saying that (e.g.) coffee consumption reduces the risk of prostate cancer, or that receiving spankings in childhood is negatively correlated with conscientiousness as an adult, there are at least six layers of 'hearsay' -- I might have misunderstood the headline, the headline might have mis-summarized the article, the article might have misquoted the scientist, the scientist might have misinterpreted the recorded data, the recorded data might not faithfully reflect what actually happened during the experiment, and the experiment might not faithfully replicate the real-world conditions that interest us.

Even if I can articulate plausible reasons why each step in the transmission of information was "reliable," I should be very skeptical that my *model* of the transmission is accurate. I only have to be wrong about one of the six steps for my estimate of the information's plausibility to be untrustworthy. If the information would only provide a few decibels of evidence even if it were perfectly reliable, then trying to calculate how many points a semi-reliable piece of evidence is worth can fail because of a low signal-to-noise ratio. E.g., suppose I learn that neither the suspect nor the actual criminal were redheads - I might be absolutely certain of this new piece of information, but that's still nowhere near enough evidence to support a conviction. If instead I learn that there is probably something like a 60% chance that neither the suspect nor the criminal had red hair, that datum really doesn't tell me anything at all -- the info shouldn't shift my prior enough for my prior to be noticeably different.
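To make the red-hair arithmetic concrete, here is a minimal sketch. The 1% prior and the 90% non-redhead rate are made-up illustrative numbers, and the "uncertain" update is a crude mixture of the updated posterior with the prior, not a full Jeffrey conditionalization:

```python
import math

def update(prior, lr):
    """Bayes in odds form: posterior odds = prior odds * likelihood ratio."""
    odds = prior / (1 - prior) * lr
    return odds / (1 + odds)

def decibels(lr):
    """Evidence strength in decibels, 10 * log10 of the likelihood ratio."""
    return 10 * math.log10(lr)

prior = 0.01      # hypothetical prior that the suspect is the criminal
p_nonred = 0.9    # assume ~90% of the population is not red-haired

# If we are CERTAIN neither the suspect nor the criminal is red-haired, the
# suspect merely matches a trait shared by 90% of people: LR = 1 / 0.9.
lr_certain = 1 / p_nonred
post_certain = update(prior, lr_certain)

# If we are only 60% sure of that datum, blend the updated posterior with
# the unmoved prior: the already-tiny shift nearly vanishes.
q = 0.6
post_uncertain = q * post_certain + (1 - q) * prior

print(f"evidence strength: {decibels(lr_certain):.2f} dB")
print(f"prior {prior:.4f} -> certain {post_certain:.4f} -> uncertain {post_uncertain:.4f}")
```

The certain datum is worth under half a decibel, and the 60%-confident version moves the prior even less -- numerically invisible next to the ~30+ decibels a conviction demands.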

Although courts are allowed to consider the extent to which an unduly long chain of inferences makes evidence less "trustworthy," I think that on balance decisions would be more accurate if there were a firm limit -- say, three layers -- beyond which evidence was simply inadmissible as a matter of law. If A says that B says that C says that D shot someone, then no matter how reliable we think A, B, and C are, we should probably keep that evidence away from the jury unless we can haul at least one of B, C, or D into court to answer cross-examination.

Comments


I am sorry I did not manage to comment on this earlier; I did not suspect it would get promoted.

In short, your treatment of hearsay, and how the legal system addresses it, is simply wrong. Most of what you talk about is actually about the Confrontation Clause. I don't know if this is due to an intentional simplification of your examples, but the cases you use just don't work that way.

The main case you talk about, Davis v. Washington, is not a case about hearsay; just look at the wikipedia summary. It is a case about the confrontation clause. This is a clause that says that those accused of crimes have the right to confront the witnesses against them; if someone talks to the police under certain circumstances, that testimony may not be entered. It does not matter how reliable it is. See Crawford v. Washington. The "indicia of reliability test" was abandoned in Crawford, because it was completely circular - it was compared to doing away with a jury trial because the defendant was obviously guilty.

More generally, there is almost never a balancing test in hearsay. Hearsay is a series of rules that are applied systematically. Out of court statements are considered unreliable principally because the declarant is not under oath; there is no particular reason to believe they were being truthful. There is a series of rules that allow certain statements in for this purpose. The idea behind these rules is that they indicate the evidence is reliable. However, they operate purely formalistically: if something someone said was a statement for the purpose of medical diagnosis, it is admissible hearsay, even if the circumstances strongly demonstrate they were lying. The jury is permitted to figure that out.

The basic idea behind hearsay, and indeed behind evidence law generally, is that certain statements are more likely to mislead the jury than to aid in finding the truth. However, your whole discussion of "indicia of reliability" seems to me to address an obsolete doctrine on the Confrontation Clause. Hearsay, in the vast majority of circumstances, does not involve any kind of balancing test or similar determination. It either meets a rule, or it doesn't (though there is a catch-all rule that gives the court some discretion - it can actually be somewhat problematic, because courts often get things wrong).

As to the issue of double hearsay -- which I am used to hearing referred to as "hearsay within hearsay" -- a per se rule against a certain number of levels doesn't make a lot of sense. In the example you use, the bottom level of hearsay is very likely inadmissible; that's enough to keep it out. But the circumstances under which one could admit multi-layer hearsay are pretty limited; it would have to have an applicable exception for every level. You don't discuss any inadequacies with the exceptions, so I just don't see why it follows that their repeat application should be unreliable.

I don't know if this is due to an intentional simplification of your examples

Yes, it is. Lawyers and judges have a tendency to invent dozens of fuzzily overlapping concepts without even considering whether one or two concepts could do just as much useful intellectual work. I could tease out the difference between testimonial and nontestimonial evidence, assertions and non-assertions, matters offered for the truth of the matter asserted, matters offered for other purposes, matters pretextually offered for other purposes, matters honestly offered for other purposes but with an unacceptable tendency to prejudice the jury...but I'm not writing a law review article; I'm writing a Less Wrong post. I tried to focus on what I thought the audience would find relevant.

What interests me here is the distinction between the truth of evidence (does the content of this document describe reality?) and the reliability of evidence (would we ordinarily expect documents like this one to describe reality?). Anything further would be an explanation of the law for its own sake.

Davis v. Washington, is not a case about hearsay; just look at the wikipedia summary. The "indicia of reliability test" was abandoned in Crawford.

Give me a little credit, here; don't you think I looked at the Wikipedia summary before publishing the post? I also linked to Michigan v. Bryant, a newer Supreme Court case which extensively discusses Crawford. I think the cases I linked to provide a discussion of evidentiary reliability that illustrates some important Bayesian concerns. Whether every doctrine in every case I cite is still good law is not really the point.

Hearsay, in the vast majority of circumstances, does not involve any kind of balancing test or similar determination.

I may not have been clear on this point -- I'm not claiming that judges weigh evidence to see if it should be considered hearsay. Rather, the very process of determining whether evidence is hearsay appears to be designed so as to indirectly prompt judges to weigh whether evidence is reliable. By systematically applying the rules about what counts as hearsay, judges consciously or unconsciously wind up admitting only evidence that the system views as reliable. If you like, we could say that the people who write the laws of evidence in the first place are the ones who perform the actual balancing test.

The issue is that Confrontation clause != hearsay. Confrontation rights belong to criminal defendants only, while hearsay is an issue in any trial. As you note, hearsay is conceptually a reliability indicator, while Confrontation clause analysis is trying to determine when the government must go through the time and effort to produce a witness at the actual trial.

In general, criminal defendant rights are not well correlated with reliability. For example, suppression of illegally obtained evidence is anti-correlated with accuracy. This piece makes a good point about chaining evidence. As a lawyer, I thought the piece did a great job of highlighting when the legal system does a better job of truth discovery than society as a whole, and the more frequent occurrence when the legal system is just as misguided as ordinary Joe Citizen.

In short, please accept the word of an expert that the discussion under the heading The Hearsay Rule is not about the hearsay rule and is unrelated to the remainder of the excellent piece.

How does the legal system normally deal with cases where someone has a chain of logic where each link seems strong but there are a dangerously large number of links? This seems like a special case of a more general issue that the court must face regularly.

Would an argument to the judge like "even if each of these reports comes from a person trying to do a good job in passing along the truth, there are too many places where any of these people could have made a simple error" stand a chance?

Is there any evidence that the American or any other legal system is significantly better than chance at what it does? Or even not significantly worse than chance? (by being biased instead of just random)

That's the first question we should be asking, before concerning ourselves with minor issues about admissibility of evidence.

That's an excellent question. The answer depends on exactly what you mean by "better than chance." If you mean "more than half of those convicted of a crime are guilty of that crime," then I'd say yes, there is excellent reason to think that they are. Prosecutors usually have access to several times more reports of crime than they can afford to go out and prosecute. Prosecutors are often explicitly or implicitly evaluated on their win ratio -- they have strong incentives to pick the 'easy' cases where there is abundant evidence that the suspect is guilty. Most defense lawyers will cheerfully concede that the vast majority of their clients are guilty -- either the clients admit as much to their lawyers, or the clients insist on implausible stories that don't pass muster, which the lawyers have to disguise in order to get their clients to go free. Although as a matter of law and rhetoric people are presumed innocent until proven guilty, as a matter of cold statistics, someone who has been lawfully indicted in America is probably more likely to be guilty than innocent. In fact, there are probably so many guilty suspects in Court that the legal system does strictly worse than what social scientists call a "naive predictor" -- i.e., just assuming that everyone is guilty. Of course, that wouldn't be a sustainable policy -- prosecutors choose easy cases because they know that they'll be required to win those cases in a relatively challenging environment. If the rule were that everyone is guilty, prosecutors would start choosing cases based on other criteria, and the percentage of indicted suspects who were actually guilty would go down.

Suppose you survey defense attorneys, and conclude that, say, roughly 80% of indicted suspects are guilty. Could you somehow measure whether the legal system does better than a "mixed strategy predictor" that guessed that a suspect was guilty with probability 0.8 and guessed that a suspect was innocent with probability 0.2? The mixed-strategy predictor would get an accurate result (0.8)^2 + (0.2)^2 = 68% of the time. To assess whether the legal system is better than a mixed-strategy predictor, you would need to have a way of validating at least a sample of actual cases. I really have no idea how you would start to do that. It's not clear that self-reported guilt or defense-attorney-assessed guilt will correlate strongly enough with actual guilt that we can figure out which individual cases the legal system gets right and which ones it gets wrong. But if you can't measure accuracy in individual cases, how do you figure out the system's overall accuracy rate? It's not clear that looking at appellate results or DNA exonerations, etc. would help either. A reversal on appeal is no guarantee of innocence, because a sentence can be reversed (a) if the evidence is still strong but not strong enough to remove all reasonable doubt as well as (b) when the prosecution or police have used inappropriate but reliable tactics (such as using high-tech cameras to take pictures of the inside of your home without a warrant).
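The mixed-strategy arithmetic checks out; a quick sketch, taking the 80% figure as an assumed input rather than a measured fact:

```python
def mixed_strategy_accuracy(p_guilty):
    """Accuracy of a predictor that guesses 'guilty' with probability
    p_guilty when the true base rate of guilt is also p_guilty: the guess
    and the truth agree when both say guilty or both say innocent."""
    return p_guilty**2 + (1 - p_guilty)**2

print(mixed_strategy_accuracy(0.8))  # 0.68
```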

Finally, there is "better than chance" in the sense of specific forensic techniques being verifiably better than, say, a Ouija board. There are several pretty good techniques, such as document analysis, DNA analysis, electronic tracing, and perhaps even paired-question polygraph testing. Whether or not the system interprets the evidence correctly, a typical trial at least contains sufficient evidence for a rational evaluator to beat chance.

[T]here are probably so many guilty suspects (...) that the legal system does strictly worse than (...) just assuming that everyone is guilty.

Careful, there: the economic damage of not locking up a thief is much lower than the economic damage of incorrectly locking up a non-thief. "It's better that X guilty people go free than that one innocent person goes to prison" is a good principle.

(Note that X is likely to have different values for weed users, thieves and serial killers.)

If a random 80% of suspects are guilty, the appropriate naive predictor is one that always votes "guilty", not one that tries to match probabilities by choosing a random 80% of suspects to call guilty. Then you get an accurate result 80% of the time, which is a lot better than 68%. That seems to me a more appropriate benchmark.

(Alternatively, you might consider a predictor that matches its probabilities not to the proportion of defendants who are guilty but to the proportion who are convicted. There might be something to be said for that.)
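A quick simulation bears this out: against an 80% base rate of guilt (the assumed survey figure from upthread), always voting "guilty" handily beats probability matching:

```python
import random

random.seed(0)
p_guilty = 0.8      # assumed base rate of guilt among indicted suspects
trials = 100_000

# Simulated ground truth: each suspect is guilty with probability 0.8.
truth = [random.random() < p_guilty for _ in range(trials)]

# Probability matching: independently guess 'guilty' with probability 0.8.
matching = sum(t == (random.random() < p_guilty) for t in truth) / trials

# Majority rule: always guess the more common outcome ('guilty').
majority = sum(truth) / trials

print(f"probability matching ≈ {matching:.3f}")  # roughly 0.68
print(f"always 'guilty'     ≈ {majority:.3f}")   # roughly 0.80
```

Probability matching pays for its occasional "innocent" votes twice: it sometimes acquits the guilty and sometimes its random "guilty" votes land on the innocent.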

I think the intended question is whether the legal system adds anything beyond a pure chance element. Somehow we'd need a gold standard of actually guilty and innocent suspects, then we'd need to measure whether p(guilty|convicted) > 80%. You could also ask if p(innocent|acquitted) > 20%, but that's the same question.

Thank you! Intended or not, it's a fantastic question, and I don't know where to look up the answer. I'm not even sure that anyone has seriously tried to answer that question. If they haven't, then I want to. I'll look into it.

By "better than chance" do you mean whether when investigating e.g. a murder, the American police and legal system have more than P(1/population of America) of locating and punishing the actual guilty party?

I like the idea of capping the length of an admissible chain of hearsay, but whenever I hear about a rule like that, I always think of the risk that you'll miss an obviously true conclusion just because the evidence wasn't admissible. Of course, that's a silly argument, since we have lots of such limits and they're not something I disagree with.

The obvious solution to this entire debate is to teach people a basic understanding of practical probability, but I guess you work with what you've got...

Incidentally, is the title a deliberate play on "Lies, damn lies, and statistics"? I couldn't work it out.

A few comments:

  1. It is somewhat confusing (at least to legal readers) that you use legal terms in non-standard ways. Conflating confrontation with hearsay issues is confusing because making people available for cross-examination solves the confrontation problem but not always the hearsay one.

  2. I like your emphasis on the filtering function of evidentiary rules. Keep in mind, however, that these rules have little effect in bench trials (which are more common than jury trials in state courts of general jurisdiction). And relatively few cases reach trial at all; more are disposed of by pretrial motions or by settlements. (For some data, you could check out this paper by Marc Galanter.) So this filtering process is only rarely applied in real-world cases!

  3. Before suggesting that we should exclude evidence of low reliability, you should probably take more time to think about substitution effects. If lawyers cannot use multiply embedded hearsay, what will juries hear instead? Also, you would want to establish that juries would systematically err in their use of such evidence. It is not a problem to have unreliable evidence come in if juries in fact recognize its unreliability.

  4. I've recently spent some time thinking about how we might apply the scientific method towards designing better rules of legal procedure and evidence. It turns out to be trickier than you might think, largely because it is hard to measure the impact of legal rules on the accuracy of case resolutions. If you are curious about such things (and with apologies for blatant self promotion), you might want to read some of what I wrote here, particularly parts 2-4.

The "legal system" is concerned, above all else, that citizens regard its workings as legitimate. The appearance of inevitability promotes the sense of legitimacy, and any procedures that appear arbitrary interfere with it. Thus, the law would exclude all "hearsay within hearsay" before it would impose a three-level limit. Statistical evidence might show that three levels is optimal (or that some other cutoff is), but the provision's artificiality is patent. "I was treated unjustly because my evidence consisted of four levels of hearsay" sounds unjust because "arbitrary" limitations denude the law of evidence of the sense that it's natural.