Diseased disciplines: the strange case of the inverted chart

Imagine the following situation: you have come across numerous references to a paper purporting to show that the chances of successfully treating a disease contracted at age 10 are substantially lower if the disease is detected later: somewhat lower at age 20 to very poor at age 50. Every author draws more or less the same bar chart to depict this situation: the picture below, showing rising mortality from left to right.

Rising mortality, left to right

You search for the original paper, which proves to be a long quest: the conference publisher has lost some of their archives in several moves, several people citing the paper turn out to no longer have a copy, and so on. You finally locate a copy of the paper (let's call it G99) thanks to a helpful friend with great scholarly connections.

And you find out some interesting things.

The most striking is what the author's original chart actually depicts: the chances of successfully treating the disease when it is detected at age 50 become substantially lower the earlier it was contracted; mortality is highest if the disease was contracted at age 10 and lowest if it was contracted at age 40. The chart showing this is the picture below: decreasing mortality from top to bottom, with the ages at which the disease was contracted on the vertical axis.

Decreasing mortality, top to bottom

Not only is the representation topsy-turvy; the two diagrams can't be about the same thing, since what is held constant in the first (the age at which the disease was contracted) varies in the second, and what varies in the first (the age at which the disease was detected) is held constant in the second.

Now, as you research the issue a little more, you find out that authors prior to G99 have often used the first diagram to report their findings; reportedly, several different studies on different populations (dating back to the eighties) have yielded similar results.

But when citing G99, nobody reproduces the actual diagram from G99; they all reproduce the older diagram (or some variant of it).

You are tempted to conclude that the authors citing G99 are citing "from memory": they are aware of the earlier research, and they have a vague recollection that G99 contains results that are not totally at odds with it. Same difference, they reason: G99 is one more confirmation of the earlier research, which is adequately summarized by the standard diagram.

And then you come across a paper by the same author, but from 10 years earlier. Let's call it G89. There is a strong presumption that the study in G99 is the same as the one described in G89, for the following reasons: a) by the time of G99 the researcher had already retired from the institution where the results were obtained; b) the G99 "paper" isn't in fact a paper, it's a PowerPoint deck summarizing previous results obtained by the author.

And in G89, you read the following: "This study didn't accurately record the mortality rates at various ages after contracting the disease, so we will use average rates summarized from several other studies."

So basically everyone who has been citing G99 has been building castles on sand.

Suppose that, far from some exotic disease affecting a few individuals each year, the disease in question was one of the world's major killers (say, tuberculosis, the world's leader in infectious disease mortality), and the reason why everyone is citing either G99 or some of the earlier research is to lend support to the standard strategies for fighting the disease.

When you look at the earlier research, you find nothing to allay your worries: the earlier studies are described only summarily, in broad overview papers or secondary sources; the numbers don't seem to match up, and so on. In effect you are discovering, about thirty years later, that what was taken for granted as a major finding on one of the principal topics of the discipline in fact has "sloppy academic practice" written all over it.

If this story were true, and this were medicine we were talking about, what would you expect (or at least hope for, if you haven't become too cynical), should this story come to light? In a well-functioning discipline, a wave of retractions, public apologies, general embarrassment and a major re-evaluation of public health policies concerning this disease would follow.

 

The story is substantially true, but the field isn't medicine: it is software engineering.

I have transposed the story to medicine, temporarily, as an act of benign deception, to which I now confess. My intention was to bring out the structure of this story, and if, while thinking it was about health, you felt outraged at this miscarriage of academic process, you should still feel outraged upon learning that it is in fact about software.

The "disease" isn't some exotic oddity, but the software equivalent of tuberculosis - the cost of fixing defects (a.k.a. bugs).

The original claim was that "defects introduced in early phases cost more to fix the later they are detected". The chart actually in G99 - the one everyone misquotes - says this instead: "defects detected in the operations phase (once software is in the field) cost more to fix the earlier they were introduced".
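To make the inversion concrete, here is a minimal sketch in Python - with entirely invented numbers, none of which come from Grady or any study - of how the two claims slice the same kind of cost table along different axes:

    # Hypothetical cost-to-fix table, indexed by (phase introduced, phase detected).
    # Every number below is invented purely for illustration; none of this is Grady's data.
    PHASES = ["requirements", "design", "code", "test", "operation"]

    cost = {
        ("requirements", "requirements"): 1,
        ("requirements", "design"): 3,
        ("requirements", "code"): 7,
        ("requirements", "test"): 15,
        ("requirements", "operation"): 100,
        ("design", "operation"): 60,
        ("code", "operation"): 30,
        ("test", "operation"): 10,
    }

    # The chart everyone reproduces: hold "introduced during requirements" constant
    # and vary the phase in which the defect is detected (cost rises left to right).
    chart_detected = [cost[("requirements", d)] for d in PHASES]

    # The chart actually in G99: hold "detected in operation" constant and vary the
    # phase in which the defect was introduced (cost falls top to bottom).
    chart_introduced = [cost[(i, "operation")] for i in PHASES[:-1]]

    print(chart_detected)    # [1, 3, 7, 15, 100]
    print(chart_introduced)  # [100, 60, 30, 10]

The two charts are different cross-sections of a two-dimensional quantity, not two drawings of the same result.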

Any result concerning the "disease" of software bugs counts as a major result, because it affects very large fractions of the population, and accounts for a major fraction of the total "morbidity" (i.e. lack of quality, project failure) in the population (of software programs).

The earlier article by the same author contained the following confession: "This study didn't accurately record the engineering times to fix the defects, so we will use average times summarized from several other studies to weight the defect origins".

Not only is this one major result suspect; the same pattern of "citogenesis" turns up when investigating several other important claims.

 

Software engineering is a diseased discipline.

 

 


The publication I've labeled "G99" is generally cited as: Robert B. Grady, An Economic Release Decision Model: Insights into Software Project Management, in proceedings of Applications of Software Measurement (1999). The second diagram is from a photograph of a hard copy of the proceedings.

Here is one typical publication citing Grady 1999, from which the first diagram is extracted. You can find many more via a Google search. The "this study didn't accurately record" quote is discussed here, and can be found in "Dissecting Software Failures" by Grady, in the April 1989 issue of the "Hewlett-Packard Journal"; you can still find one copy of the original source on the Web, as of early 2013, but link rot is threatening it with extinction.

A more extensive analysis of the "defect cost increase" claim is available in my book-in-progress, "The Leprechauns of Software Engineering".

Here is how the axes were originally labeled; first diagram:

  • vertical: "Relative Cost to Correct a Defect"
  • horizontal: "Development Phase" (values "Requirements", "Design", "Code", "Test", "Operation" from left to right)
  • figure label: "Relative cost to correct a requirement defect depending on when it is discovered"

Second diagram:

  • vertical: "Activity When Defect was Created" (values "Specifications", "Design", "Code", "Test" from top to bottom)
  • horizontal: "Relative cost to fix a defect after release to customers compared to the cost of fixing it shortly after it was created"
  • figure label: "Relative Costs to Fix Defects"

Comments


Reminds me of the bit in "Cargo Cult Science" by Richard Feynman:

Other kinds of errors are more characteristic of poor science. When I was at Cornell, I often talked to the people in the psychology department. One of the students told me she wanted to do an experiment that went something like this--it had been found by others that under certain circumstances, X, rats did something, A. She was curious as to whether, if she changed the circumstances to Y, they would still do A. So her proposal was to do the experiment under circumstances Y and see if they still did A.

I explained to her that it was necessary first to repeat in her laboratory the experiment of the other person--to do it under condition X to see if she could also get result A, and then change to Y and see if A changed. Then she would know that the real difference was the thing she thought she had under control.

She was very delighted with this new idea, and went to her professor. And his reply was, no, you cannot do that, because the experiment has already been done and you would be wasting time. This was in about 1947 or so, and it seems to have been the general policy then to not try to repeat psychological experiments, but only to change the conditions and see what happens.

This is more common than most people would like to think, I think. I experienced this tracking down sunk cost studies recently - everyone kept citing the same studies or reviews showing sunk cost in real world situations, but when you actually tracked back to the experiments, you saw they weren't all that great even though they had been cited so much.

Did you find more recent replications as well? Or was the scientific community in question content that the old, not-so-good studies had found something real?

I didn't find very many recent replications that supported the standard wisdom, but that may just be because it's easier to find old much-cited material than new contrarian material, by following the citations in the prominent standard papers that I started with.

My father, a respected general surgeon and an acute reader of the medical literature, claimed that almost all the studies on early detection of cancer confuse degree of disease at time of detection with "early detection". That is, a typical study assumes that a small cancer must have been caught early, and thus counts it as a win for early detection.

An obvious alternate explanation is that fast-growing malignant cancers are likely to kill you even in the unlikely case that you are able to detect them before they are large, whereas slow-growing benign cancers are likely to sit there until you get around to detecting them but are not particularly dangerous in any case. My father's claim was that this explanation accounts for most studies' findings, and makes something of a nonsense of the huge push for early detection.
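As a toy illustration of that confound (a minimal sketch with entirely made-up numbers - this is not oncology data), here is a simulation in which size at detection has, by construction, no causal effect on survival, yet a large apparent "early detection" benefit appears anyway:

    import random

    random.seed(0)

    def simulate(n=100_000):
        # Two invented tumour types: aggressive (fast-growing, often lethal) and
        # indolent (slow-growing, rarely lethal). None of these numbers are real.
        survived = {"small": 0, "large": 0}
        count = {"small": 0, "large": 0}
        for _ in range(n):
            aggressive = random.random() < 0.3
            # Aggressive tumours are usually already large when found;
            # indolent ones are usually still small.
            size = "small" if random.random() < (0.2 if aggressive else 0.8) else "large"
            # By construction, survival depends ONLY on tumour type - the size at
            # detection (our stand-in for "early detection") has no causal effect.
            survives = random.random() < (0.3 if aggressive else 0.95)
            count[size] += 1
            survived[size] += survives
        for size in ("small", "large"):
            print(size, "at detection:", round(survived[size] / count[size], 2))

    simulate()
    # Prints roughly: small 0.89, large 0.54 - a big apparent "early detection"
    # benefit produced entirely by the confound described above.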

Interesting. Is he (or anyone else) looking at getting something published that either picks apart the matter or encourages others to?

In trying to think of an easy experiment that might help distinguish, all that's coming to mind is adding a delay to one group and not to another. It seems unlikely this could be done ethically with humans, but animal studies may help shed some light.

Well, it seems to me that disciplines become diseased when the people who demand answers can't check those answers, or even make use of them.

In the case of your software engineering example:

The software development workflow is such that some of the work affects the later work in very clear ways, and flaws in that early work end up affecting a growing number of lines of code. Obviously, the less work you have to re-do to fix a flaw, the cheaper the fix.

That is extremely solid reasoning right here. Very high confidence result, high fidelity too - you know what kind of error will cost more and more to fix later, not just 'earlier error'.

The use of statistics from a huge number of projects to conclude this about a specific project is, however, a case of extremely low grade reasoning, giving a very low fidelity result as well.

In the presence of extremely high grade reasoning, why do we need the low fidelity result from extremely low grade reasoning? We don't.

To make an analogy:

The software engineer has no more need for this statistical study than a barber has for data on the correlation of head size with distance between the eyes, and of distance between the eyes with height. The barber works on a specific client's head, which has a specific shape, rather than inferring the shape from the client's height via both correlations and then either snapping the scissors uselessly in the air or embedding them in the client's skull.

Only if the barber were an entirely blind, rigid, high-powered automaton that knew nothing but the client's height would the barber benefit from such correlational 'science' - and such an automaton would still end up killing the client almost half of the time.

However, management demands science, and in the eyes of management, science is what they were taught in high school: making hypotheses, testing hypotheses, doing some statistics. They were never taught the value of reasoning. They are the barbershop's idiot manager who demands that the barber use some 'science', in the form of the head-size-to-height correlation, to make sure the barber is not embedding scissors in the client's head or snapping them in the air.

Nicely put, overall. Your points on the demands of "management" echo this other Dijkstra screed.

However I have to disagree with your claims of "high grade reasoning" in this particular instance. You say "some of the work affects the later work in very clear ways": what ways?

The ways most people in the profession know of are less than clear; otherwise we wouldn't see all the arguing back and forth about which language, which OS, or which framework best serves the interests of "later work", or about which technique is best suited to prevent defects, or any of a number of debates that have been ongoing for over four decades.

I've been doing this work for over twenty years now, and I still find the determinants of software quality elusive on the scale of whole programs. Sure, for small or "toy" examples you can always show the advantages of this or that technique: unit testing, UML models, static typing, immutability, lazy evaluation, propagation, whatever. (I've perhaps forgotten more of these tricks than some "hotshot" young programmers have ever learned.) But at larger scales of programming efforts these advantages start being swamped by the effects of other forces, that mostly have to do with the social nature of large scale software development.

Is it really valid to conclude that software engineering is diseased based on one propagating mistake? Could you provide other examples of flawed scholarship in the field? (I'm not saying I disagree, but I don't think your argument is particularly convincing.)

Can you comment on Making Software by Andy Oram and Greg Wilson (Eds.)? What do you think of Jorge Aranda and Greg Wilson's blog, It Will Never Work in Theory?

To anyone interested in the subject, I recommend Greg Wilson's talk on the subject, which you can view here.

I'm a regular reader of Jorge and Greg's blog, and even had a very modest contribution there. It's a wonderful effort.

"Making Software" is well worth reading overall, and I applaud the intention, but it's not the Bible. When you read it with a critical mind, you notice that parts of it are horrible, for instance the chapter on "10x software developers".

Reading that chapter was in fact largely responsible for my starting (about a year ago) to really dig into some of the most-cited studies in our field and gradually realizing that it's permeated with poorly supported folklore.

In 2009, Greg Wilson wrote that nearly all of the 10x "studies" would most likely be rejected if submitted to an academic publication. The 10x notion is another example of propagation of extremely unconvincing claims, that nevertheless have had a large influence in shaping the discipline's underlying assumptions.

But Greg had no problem including the 10x chapter which rests mostly on these studies, when he became the editor of "Making Software". As you can see from Greg's frosty tone in that link, we don't see eye to eye on this issue. I'm partially to blame for that, insofar as one of my early articles on the topic carelessly implied that the author of the chapter in question had "cheated" by using these citations. (I do think the citations are bogus, but I now believe that they were used in good faith - which worries me even more than cheating would.)

Another interesting example is the "Making Software" chapter on the NASA Software Engineering Laboratory (SEL), which trumpets the lab as "a vibrant testbed for empirical research". Vibrant? In fact by the author's own admission the SEL was all but shut down by NASA eight years earlier, but this isn't mentioned in the chapter at all.

It's well known on Less Wrong that I'm not a fan of "status" and "signaling" explanations for human behaviour. But this is one case where it's tempting... The book is one instance of a recent trend in the discipline, where people want to be seen to call for better empirical support for claimed findings, and at least pay overt homage to "evidence based software engineering". The problem is that actual behaviours belie this intention, as in the case of not providing information about the NASA SEL that no reader would fail to see as significant - the administrative failure of the SEL is definitely evidence of some kind, and Basili thought it significant enough to write an article about its "rise and fall".

Other examples of propagation of poorly supported claims, that I discuss in my ebook, are the "Cone of Uncertainty" and the "Software Crisis". I regularly stumble across more - it's sometimes a strange feeling to discover that so much that I thought was solid history or solid fact is in fact so much thin air. I sincerely hope that I'll eventually find some part of the discipline's foundations that doesn't feel like quicksand; a place to stand.

What would you suggest if asked for a major, well-supported result in software engineering?

Thanks for taking the time to reply thoughtfully. That was some good reading, especially for a non-expert like me. Here are my thoughts after taking the time to read through all of those comment threads and your original blog post. I'll admit that I haven't read the original McConnell chapter yet, so keep that in mind. Also, keep in mind that I'm trying to improve the quality of this discussion, not spark an argument we'll regret. This is a topic dear to my heart and I'm really glad it ended up on LW.

What would you suggest if asked for a major, well-supported result in software engineering?

Based on Steve McConnell's blog post (and the ensuing comment thread), I think the order-of-magnitude claim is reasonably well-supported -- there are a handful of mediocre studies that triangulate to a reasonable belief in the order-of-magnitude claim. In none of those comment threads do you present a solid argument that the claim is not well-supported. Instead, you are mostly putting forth the claim that the citations were sloppy and "unfair." You seem to be somewhat correct -- which Steve acknowledged -- but I think you're overreaching with your conclusions.

The book is one instance of a recent trend in the discipline, where people want to be seen to call for better empirical support for claimed findings, and at least pay overt homage to "evidence based software engineering".

We could look at your own arguments in the same light. In all those long comment threads, you failed to engage in a useful discussion, relying on argumentative "cover fire" that distracted from the main point of discussion (i.e. instead of burying your claims in citations, you're burying your claims in unrelated claims). You claim that "The software profession has a problem, widely recognized but which nobody seems willing to do anything about," even though you acknowledge that Wilson et al are indeed doing quite a bit about it. This looks a lot like narrative/confirmation bias, where you're a detective unearthing a juicy conspiracy. Many of your points are valid, and I'm really, really glad for your more productive contributions to the discussion, but you must admit that you are being stubborn about the McConnell piece, no?

Regarding Greg Wilson's frosty tone, I don't think that has anything to do with the fact that you disagree about what constitutes good evidence. He's very clearly annoyed that your article is accusing Steve McConnell of "pulling a fast one." But really, the disagreement is about your rather extreme take on what evidence we can consider.

Considering how consistently you complained about the academic paywalls, it's strange that you're putting the substance of your piece behind your own paywall. It makes this post read rather like a thinly veiled advert for the book.

I'm not disagreeing altogether or trying to attack you, but I do think you have pretty extreme conclusions. Your views of the "10x" chapter and the "SEL" chapter are not enough to conclude that the broad discipline is "diseased." I think your suggestion that Making Software is only paying "overt homage" to real scholarly discipline is somewhat silly and the two reasons you cite aren't enough to damn it so thoroughly. Moreover, your criticism should (and does, unintentionally) augment and refine Making Software, instead of throwing it away completely because of a few tedious criticisms.

Is it really valid to conclude that software engineering is diseased based on one propagating mistake?

How about based on the fact that the discipline relies on propagating results rather than reproducing them?

"Should Computer Scientists Experiment More? 16 excuses to avoid experimentation":

In [15], 400 papers were classified. Only those papers were considered further whose claims required empirical evaluation. For example, papers that proved theorems were excluded, because mathematical theory needs no experiment. In a random sample of all papers ACM published in 1993, the study found that of the papers with claims that would need empirical backup, 40% had none at all. In journals related to software, this fraction was 50%. The same study also analyzed a non-CS journal, Optical Engineering, and found that in this journal, the fraction of papers lacking quantitative evaluation was merely 15%.

The study by Zelkowitz and Wallace [17] found similar results. When applying consistent classification schemes, both studies report between 40% and 50% unvalidated papers in software engineering. Zelkowitz and Wallace also surveyed journals in physics, psychology, and anthropology and again found much smaller percentages of unvalidated papers there than in computer science.

...Here are some examples. For about twenty years, it was thought that meetings were essential for software reviews. However, recently Porter and Johnson found that reviews without meetings are neither substantially more nor less effective than those with meetings [11]. Meeting-less reviews also cost less and cause fewer delays, which can lead to a more effective inspection process overall. Another example where observation contradicts conventional wisdom is that small software components are proportionally less reliable than larger ones. This observation was first reported by Basili [1] and has been confirmed by a number of disparate sources; see Hatton [6] for summaries and an explanatory theory. As mentioned, the failure probabilities of multi-version programs were incorrectly believed to be the product of the failure probabilities of the component versions. Another example is type checking in programming languages. Type checking is thought to reveal programming errors, but there are contexts when it does not help [12]. Pfleeger et al. [10] provides further discussion of the pitfalls of intuition.

This strikes me as particularly galling because I have in fact repeated this claim to someone new to the field. I think I prefaced it with "studies have conclusively shown...". Of course, it would have been unreasonable of me to think that what is being touted by so many as well-researched was not, in fact, so.

Mind, it seems to me that defects do follow both patterns: introducing defects earlier and/or fixing them later should come at a higher dollar cost; that just makes sense. However, it could be the same type of "makes sense" that made Aristotle conclude that heavy objects fall faster than light objects - getting actual data would be much better than reasoning alone, especially as it would tell us just how large, if they exist at all, these differences are - an actual precise tool rather than a crude (and uncertain) rule of thumb.

I do have one nagging worry about this example: These days a lot of projects collect a lot of metrics. It seems dubious to me that no one has tried to replicate these results.

These days a lot of projects collect a lot of metrics.

Mostly the ones that are easy to collect: a classic case of "looking under the lamppost where there is light rather than where you actually lost your keys".

it could be the same type of "makes sense"

Now we're starting to think. Could we (I don't have a prefabricated answer to this one) think of a cheap and easy to run experiment that would help us see more clearly what's going on?

You're ascribing diseases to an entity that does not exist.

Software engineering is not a discipline, at least not like physics or even computer science. Most software engineers, out there writing software, do not attend conferences. They do not write papers. They do not read journal articles. Their information comes from management practices, consulting, and the occasional architect, and a whole heapin' helpin' of tribal wisdom - like the statistic you show. At NASA on the Shuttle software, we were told this statistic regularly, to justify all the process and rigor and requirements reviews and design reviews and code reviews and review reviews that we underwent.

Software engineering is to computer science what mechanical engineering is to physics.

Can you suggest examples, in mechanical engineering, of "tribal wisdom" claims that persist as a result of poor academic practice? (Or perhaps I am misunderstanding your analogy.)

Discussed in this oft-quoted (here, anyway) talk.

A number of these phenomena have been bundled under the name "Software Engineering". As economics is known as "The Miserable Science", software engineering should be known as "The Doomed Discipline", doomed because it cannot even approach its goal since its goal is self-contradictory. Software engineering, of course, presents itself as another worthy cause, but that is eyewash: if you carefully read its literature and analyse what its devotees actually do, you will discover that software engineering has accepted as its charter "How to program if you cannot.".

If you haven't already, you should probably read that entire essay.

There is a similar story in The Trouble with Physics. I think it was about whether a class of string theories had been proven to contain no infinities, where the articles cited only proved there weren't any in the low-order terms.

As an aside it is an interesting book on how a subject can be dominated by a beguiling theory that isn't easily testable, due to funding and other social issues. Worth reading if we want to avoid the same issues in existential risk reduction research.

While the example given is not the main point of the article, I'd still like to share a bit of actual data. Especially since I'm kind of annoyed at having spouted this rule as gospel without having a source, before.

A study done at IBM shows that a defect found during the coding stage costs about $25 to fix (basically the engineer-hours used to find and fix it).

This cost quadruples to $100 during the build phase, presumably because a defect there can bottleneck a lot of other people trying to submit their code, if you happen to break the build.

The cost roughly quadruples again for bugs found during the QA/testing phase, to $450. I'm guessing this includes tester time, developer time, additional tools used to facilitate bug tracking... investments the company might have made anyway, but not if testing did not catch bugs that would otherwise go out to market.

Bugs discovered once the product has been released are the next milestone, and here the jump is huge: each bug costs $16k, about 35 times the cost of a tester-found bug. I'm not sure if this includes revenue lost due to bad publicity, but I'm guessing probably not; I think only tangible investments were tracked.

Critical bugs discovered by customers that do not result in a general recall cost about 10x that much (this seems to be the only step where the x10 factor actually appears), at $158k per defect. This increases to $241k for recalled products.

My own company also noticed that externally found bugs typically take twice as long to fix as internally found bugs (~59h vs. ~30h) in a certain division.

So this "rule of thumb" seems real enough... The x10 rule is not quite right, it's more like a x4 rule with a huge jump once your product goes to market. But the general gist seems to be correct.

Note this is all more in line with the quoted graph than its extrapolation: Bugs detected late cost more to fix. It tells us nothing about the stage they were introduced in.

Go data-driven conclusions! :)

Your core claim is very nearly conventional wisdom in some quarters. You might want to articulate some remedies.

A few thoughts --

One metric for disease you didn't mention is the gap between research and practice. My impression is that in graphics, systems, networking and some other healthy parts of the CS academy, there's an intensive flow of ideas back and forth between researchers and practitioners. That's much rarer in software engineering. There are fewer start-ups by SE researchers. There are few academic ideas and artifacts that have become widely adopted. (I don't have numerical metrics for this claim, unfortunately.) And this is a sign that either the researchers are irrelevant, or the practitioners refuse to learn, or both.

I can vouch from personal observation that software engineering is often a marginal part of academic computer science. It's not well respected as a sub-field. Software engineering that studies developer behavior is especially marginal -- as a result, the leading conferences tend to be dominated by applied program analysis papers. Which are nice, but typically very low impact.

I see that you have a book about this, but if this error is egregious enough, why not submit papers to that effect? Surely one can only demonstrate that software engineering is diseased if, once the community has read your claims, it refuses to react?

You don't call a person diseased because they fail to respond to a cure: you call them diseased because they show certain symptoms.

This disease is widespread in the community, and has even been shown to cross the layman-scientist barrier.

Very cool analysis, and I think especially relevant to LW.

I'd love to see more articles like this, forming a series. Programming and software design is a good testing ground for rationality. It's just about simple enough mathematically to be subject to precise analysis, but its rough human side makes it really tricky to determine exactly what sort of analyses to do, and what changes in behavior the results should inspire.

Thanks! This work owes a substantial debt to LW, and this post (and possible subsequent ones) are my small way of trying to repay that debt.

So far, though, I've mostly been blogging about this on G+, because I'm not quite sure how to address the particular audience that LW consists of - but then I also don't know exactly how to present LW-type ideas to the community I'm most strongly identified with. I'd appreciate any advice on bridging that gap.

I see that there is a problem, but it seems that both charts support the same conclusion: the longer a problem goes undetected, the more problems it brings.

Are there any methodological recommendations which are supported by one chart, but not the other?

As Software Engineering is too far from being a science anyway, the correct sign of the correlation seems to be all that matters, because exact numbers can always be fine-tuned given the lack of controlled experiments.

seems that both charts support the same conclusion: the longer a problem goes undetected, the more problems it brings

Yes, there's a sort of vague consistency between the charts. That's precisely where the problem is: people already believe that "the longer a bug is undetected, the harder it is to fix" and therefore they do not look closely at the studies anymore.

In this situation, the studies themselves become suspect: this much carelessness should lead you (or at least, it leads me) to doubt the original data; and in fact the original data appears to be suspect.

As Software Engineering is too far from being a science anyway

Yes, that's where I end up. :)