Is statistics beyond introductory statistics important for general reasoning?

Ideas such as regression to the mean, the principle that correlation does not imply causation, and the base rate fallacy are very important for reasoning about the world in general. One gets these from a deep understanding of statistics 101 and the basics of the Bayesian statistical paradigm. Up until one year ago, I was under the impression that more advanced statistics is technical elaboration that doesn't offer major additional insights into thinking about the world in general.

Nothing could be further from the truth: ideas from advanced statistics are essential for reasoning about the world, even on a day-to-day level. In hindsight my prior belief seems very naive – as far as I can tell, my only reason for holding it is that I hadn't heard anyone say otherwise. But I hadn't actually looked at advanced statistics to see whether or not my impression was justified :D.

Since then, I've learned some advanced statistics and machine learning, and the ideas that I've learned have radically altered my worldview. The "official" prerequisites for this material are calculus, multivariable differential calculus, and linear algebra. But one doesn't actually need to have detailed knowledge of these to understand ideas from advanced statistics well enough to benefit from them. The problem is pedagogical: I need to figure out how to communicate them in an accessible way.

Advanced statistics enables one to reach nonobvious conclusions

To give a bird's eye view of the perspective that I've arrived at, in practice, the ideas from "basic" statistics are generally useful primarily for disproving hypotheses. This pushes in the direction of a state of radical agnosticism: the idea that one can't really know anything for sure about lots of important questions. More advanced statistics enables one to become justifiably confident in nonobvious conclusions, often even in the absence of formal evidence coming from the standard scientific practice.

IQ research and PCA as a case study

In the early 20th century, the psychologist and statistician Charles Spearman discovered the g-factor, which is what IQ tests are designed to measure. The g-factor is one of the most powerful constructs that's come out of psychology research. There are many factors that played a role in enabling Bill Gates's ability to save perhaps millions of lives, but one of the most salient factors is his IQ being in the top ~1% of his class at Harvard. IQ research helped the Gates Foundation to recognize iodine supplementation as a nutritional intervention that would improve socioeconomic prospects for children in the developing world.

The work of Spearman and his successors on IQ constitutes one of the pinnacles of achievement in the social sciences. But while Spearman's discovery of IQ was a great discovery, it wasn't his greatest discovery. His greatest discovery was a discovery about how to do social science research. He pioneered the use of factor analysis, a close relative of principal component analysis (PCA).

The philosophy of dimensionality reduction

PCA is a dimensionality reduction method. Real world data often has the surprising property of "dimensionality reduction": a small number of latent variables explain a large fraction of the variance in data.

This is related to the effectiveness of Occam's razor: it turns out to be possible to describe a surprisingly large amount of what we see around us in terms of a small number of variables. Only, the variables that explain a lot usually aren't the variables that are immediately visible; instead they're hidden from us, and in order to model reality, we need to discover them, which is the function that PCA serves. The small number of variables that drive a large fraction of variance in data can be thought of as a sort of "backbone" of the data. That enables one to understand the data at a "macro / big picture / structural" level.
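As a rough illustration of the "small number of latent variables" claim, here is a minimal sketch on synthetic data. Everything in it is an assumption made for the example: the sample size, the choice of 2 hidden variables driving 10 observed ones, the noise level, and all variable names. It shows one way PCA can recover the fact that most of the variance lives in a few directions; it is not a claim about any real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

n_samples, n_observed, n_latent = 1000, 10, 2

# Hypothetical setup: 2 hidden variables drive 10 observed ones, plus noise.
latent = rng.normal(size=(n_samples, n_latent))
loadings = rng.normal(size=(n_latent, n_observed))
noise = 0.3 * rng.normal(size=(n_samples, n_observed))
observed = latent @ loadings + noise

# PCA via eigen-decomposition of the covariance matrix of the observed data.
centered = observed - observed.mean(axis=0)
covariance = np.cov(centered, rowvar=False)
eigenvalues = np.linalg.eigvalsh(covariance)[::-1]  # largest first

# Fraction of total variance carried by each principal component.
explained = eigenvalues / eigenvalues.sum()
top_two_share = explained[:2].sum()

print(np.round(explained, 3))
print(f"share of variance in the first 2 components: {top_two_share:.2f}")
```

In this toy setup none of the 10 observed columns is itself a latent variable, yet the first two components should carry most of the variance, which is the "backbone" idea in miniature.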

This is a very long story that will take a long time to flesh out, and doing so is one of my main goals. 

Comments


"impression that more advanced statistics is technical elaboration that doesn't offer major additional insights"

Why did you have this impression?

Sorry for the off-topic, but I see this a lot on LessWrong (as a casual reader). People seem to focus on textual, deep-sounding, wow-inducing expositions, but often dislike the technicalities, getting their hands dirty with actually understanding calculations, equations, formulas, details of algorithms etc. (calculations that don't tickle those wow-receptors that we all have). As if these were merely some minor additions over the really important big picture view. As I see it this movement seems to try to build up a new backbone of knowledge from scratch. But in doing this they repeat the mistakes of past philosophers. For example, going for the "deep", outlook-transforming texts that often give a delusional feeling of "oh now I understand the whole world". It's easy to have wow-moments without actually having understood something new.

So yes, PCA is useful and most statistics and maths and computer science is useful for understanding stuff. But then you swing to the other extreme and say "ideas from advanced statistics are essential for reasoning about the world, even on a day-to-day level". Tell me how exactly you're planning to use PCA day-to-day? I think you may mean you want to use some "insight" that you gained from it. But I'm not sure what that would be. It seems to be a cartoonish distortion that makes it fit into an ideology.

Anyway, mainstream machine learning is very useful. And it's usually much more intricate and complicated than to be able to produce a deep everyday insight out of it. I think the sooner you lose the need for everything to resonate deeply or have a concise insightful summary, the better.

Why did you have this impression?

Probably because of the human tendency to overestimate the importance of any knowledge one happens to have and underestimate the importance of any knowledge one doesn't. (Is there a name for this bias?)

Why did you have this impression?

Groupthink I guess: other people who I knew didn't think that it's so important (despite being people who are very well educated by conventional standards, top ~1% of elite colleges).

Tell me how exactly you're planning to use PCA day-to-day?

Disclaimer: I know that I'm not giving enough evidence to convince you: I've thought about this for thousands of hours (including working through many quantitative examples) and it's taking me a long time to figure out how to organize what I've learned.

I already have been using dimensionality reduction (qualitatively) in my day to day life, and I've found that it's greatly improved my interpersonal relationships because it's made it much easier to guess where people are coming from (before people's social behavior had seemed like a complicated blur because I saw so many variables without having started to correctly identify the latent ones).

I think the sooner you lose the need for everything to resonate deeply or have a concise insightful summary, the better.

You seem to be making overly strong assumptions with insufficient evidence: how would you know whether this was the case, never having met me? ;-)

Qualitative day-to-day dimensionality reduction sounds like woo to me. Not a bit more convincing than quantum woo (Deepak Chopra et al.). Whatever you're doing, it's surely not like doing SVD on a data matrix or eigen-decomposition on the covariance matrix of your observations.

Of course, you can often identify motivations behind people's actions. A lot of psychology is basically trying to uncover these motivations. Basically an intentional interpretation and a theory of mind are examples of dimensionality reduction in some sense. Instead of explaining behavior by reasoning about receptors and neurons, you imagine a conscious agent with beliefs, desires and intentions. You could also link it to data compression (dimensionality reduction is a sort of lossy data compression). But I wouldn't say I'm using advanced data compression algorithms when playing with my dog. It just sounds pretentious and shows a desperate need to signal smartness.

So, what is the evidence that you are consciously doing something similar to PCA in social life? Do you write down variables and numbers, or how should I imagine qualitative dimensionality reduction? How is it different from somebody just getting an opinion intuitively and then justifying it afterwards?

I think having the concept of PCA prevents some mistakes in reasoning on an intuitive, day-to-day level. It nudges me towards fox thinking instead of hedgehog thinking. Normal folk intuition grasps at the most cognitively available and obvious variable to explain causes, and then our System 1 acts as if that variable explains most if not all the variance. Looking at PCAs many times (and being surprised by them) makes me less likely to jump to conclusions about the causal structure of clusters of related events. So maybe I could characterize it as giving a System 1 intuition for not making the post hoc ergo propter hoc fallacy.

Maybe part of the problem Jonah is running into in explaining it is that having done many, many example problems with System 2 loaded it into his System 1, and the System 1 knowledge is what he really wants to communicate?

What do you mean by getting surprised by PCAs? Say you have some data, you compute the principal components (eigenvectors of the covariance matrix) and the corresponding eigenvalues. Were you surprised that a few principal components were enough to explain a large percentage of the variance of the data? Or were you surprised about what those vectors were?
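For concreteness, the computation described here can be sketched as follows. This is a minimal sketch on made-up data (the 200 by 4 matrix, the random mixing, and all names are assumptions for illustration). It computes principal components two ways, by eigen-decomposition of the covariance matrix and by SVD of the centered data matrix, and checks that the two routes agree up to sign.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: 200 observations of 4 correlated variables.
mixing = rng.normal(size=(4, 4))
data = rng.normal(size=(200, 4)) @ mixing
centered = data - data.mean(axis=0)
n = len(centered)

# Route 1: eigenvectors and eigenvalues of the covariance matrix.
covariance = centered.T @ centered / (n - 1)
eigenvalues, eigenvectors = np.linalg.eigh(covariance)
order = np.argsort(eigenvalues)[::-1]  # sort largest first
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Route 2: SVD of the centered data matrix.
_, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
svd_eigenvalues = singular_values**2 / (n - 1)

# The variances match, and the directions match up to a sign flip.
same_variances = np.allclose(eigenvalues, svd_eigenvalues)
same_directions = np.allclose(np.abs(eigenvectors.T @ vt.T), np.eye(4), atol=1e-6)

print("explained variance ratio:", np.round(eigenvalues / eigenvalues.sum(), 3))
print("same variances:", same_variances, "| same directions:", same_directions)
```

The explained variance ratio answers the first kind of question (how much variance a few components capture); the columns of `eigenvectors` answer the second (what those directions are).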

I think this is not really PCA or even dimensionality reduction specific. It's simply the idea of latent variables. You could gain the same intuition from studying probabilistic graphical models, for example generative models.

I don't believe you can obtain an understanding of the idea that "correlation does not imply causation" from even a very deep appreciation of the material in Statistics 101. These courses usually make no attempt to define confounding, comparability etc. If they try to define confounding, they tend to use incoherent criteria based on changes in the estimate. Any understanding is almost certainly going to have to originate from outside of Statistics 101; unless you take a course on causal inference based on directed acyclic graphs, it will be very challenging to get beyond memorizing the teacher's password.

Agree completely, and I'll also point out that at least for me, a very shallow understanding of the ideas in Causality did much more to help me understand correlation vs. causation, confounding etc. than any amount of work with Statistics 101. And this was enormously practical–I was able to make significantly better financial decisions at Fundation due to understanding concepts like Simpson's Paradox on a system 1 level.

To chime in as well: my own understanding of 'correlation does not imply causation' does not come from the basic statistics courses and articles and tutorials I read. While I knew the saying and the concepts and a little bit about causal graphs, it took years of failed self-experiments and the intensely frustrating experience of seeing correlate after correlate fail randomized experiments before I truly accepted it.

I don't know how helpful, exactly, this has been on a practical level, but at least it's good for me on an epistemic level in that I have since accepted many fewer new beliefs than I would otherwise have.

Me four.


Although you know, there is no reason in principle you couldn't get all that stuff Anders_H is talking about from intro stats, it's just that stats isn't taught as well as it can be.

PCA and other dimensionality reduction techniques are great, but there's another very useful technique that most people (even statisticians) are unaware of: dimensional analysis, and in particular, the Buckingham pi theorem. For some reason, this technique is used primarily by engineers in fluid dynamics and heat transfer despite its broad applicability. This is the technique that allows scale models like wind tunnels to work, but it's more useful than just allowing for scaling. I find it very useful to reduce the number of variables when developing models and conducting experiments.

Dimensional analysis recognizes a few basic axioms about models with dimensions and sees what they imply. You can use these to construct new variables from the old variables. The model is usually complete in a smaller number of these new variables. The technique does not tell you which variables are "correct", just how many independent ones are needed. Identifying "correct" variables requires data, domain knowledge, or both. (And sometimes, there's no clear "best" variable; multiple work equivalently well.)

Dimensional analysis does not help with categorical variables, or numbers which are already dimensionless (though by luck, sometimes combinations of dimensionless variables are actually what's "correct"). This is the main restriction that applies. And you can expect at best a reduction in the number of variables of about 3. Dimensional analysis is most useful for physical problems with maybe 3 to 10 variables.

The basic idea is this: Dimensions are some sort of metadata which can tell you something about the structure of the problem. You can always rewrite a dimensional equation, for example, to be dimensionless on both sides. You should notice that some terms become constants when this is done, and that simplifies the equation.

Here's a physical example: Let's say you want to measure the drag on a sphere (units: N). You know this depends on the air speed (units: m/s), kinematic viscosity (units: m^2/s), air density (units: kg/m^3), and the diameter of the sphere (units: m). So, you have 5 variables in total: the drag and 4 independent variables. Let's say you want to do a factorial design with 4 levels in each independent variable, with no replications. You'd have to do 4^4 = 256 experiments. This is clearly too complicated.

What fluid dynamicists have recognized is that you can rewrite the relationship in terms of different variables, and nothing is missing. The Buckingham pi theorem mentioned previously says that we only need 2 dimensionless variables given our 5 dimensional variables. So, instead of the drag force, you use the drag coefficient, and instead of the speed, viscosity, etc., you use the Reynolds number. Now, you only need to do 4 experiments to get the same level of representation.
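The counting in this example can be checked mechanically. The sketch below is an illustration under assumptions, not a general dimensional-analysis tool: it writes the units of the five variables as a matrix of exponents (taking N = kg·m/s^2 and the viscosity as kinematic, per its m^2/s units), applies the pi theorem's rule that the number of dimensionless groups is the number of variables minus the rank of that matrix, and verifies that the Reynolds number and a drag coefficient are indeed dimensionless. All variable names are made up for the example.

```python
import numpy as np

# Columns: drag force F, air speed U, kinematic viscosity nu, density rho, diameter D.
# Rows: exponents of kg, m, s in each variable's units.
dimension_matrix = np.array([
    [ 1,  0,  0,  1, 0],   # kg
    [ 1,  1,  2, -3, 1],   # m
    [-2, -1, -1,  0, 0],   # s
])

n_variables = dimension_matrix.shape[1]
rank = np.linalg.matrix_rank(dimension_matrix)
n_groups = n_variables - rank  # Buckingham pi theorem

# Exponent vectors (same column order) of two familiar dimensionless groups.
reynolds = np.array([0, 1, -1, 0, 1])            # U * D / nu
drag_coefficient = np.array([1, -2, 0, -1, -2])  # F / (rho * U^2 * D^2), up to a constant

# A group is dimensionless when every unit's exponent sums to zero.
reynolds_is_dimensionless = not (dimension_matrix @ reynolds).any()
drag_is_dimensionless = not (dimension_matrix @ drag_coefficient).any()

# Factorial designs with 4 levels per independent variable.
levels = 4
runs_dimensional = levels ** (n_variables - 1)   # 4 independent dimensional variables
runs_dimensionless = levels ** (n_groups - 1)    # 1 independent dimensionless variable

print(n_groups, reynolds_is_dimensionless, drag_is_dimensionless)
print(runs_dimensional, runs_dimensionless)
```

Note that the rank computation only gives the number of groups; choosing the Reynolds number and drag coefficient as the groups still takes domain knowledge, as described above.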

As it turns out, you can use techniques like PCA on top of dimensional analysis to determine that certain dimensionless parameters are unimportant (there are other ways too). This further simplifies models.

There's a lot more on this topic than what I have covered and mentioned here. I would recommend reading the book Dimensional analysis and the theory of models for more details and the proof of the pi theorem.

(Another advantage of dimensional analysis: If you discover a useful dimensionless variable, you can get it named after yourself.)

In general, if your problem displays any kind of symmetry* you can exploit that to simplify things. I think most people are capable of doing this intuitively when the symmetry is obvious. The Buckingham pi theorem is a great example of a systematic way to find and exploit a symmetry that isn't so obvious.

* By "symmetry" I really mean "invariance under a group of transformations".

I've always been amazed at the power of dimensional analysis. To me the best example is the problem of calculating the period of an oscillating mass on a spring. The relevant values are the spring constant K (kg/s^2) and the mass M (kg), and the period T is in (s). The only way to combine K and M to obtain a value with dimensions of (s) is sqrt(M/K), and that's the correct form of the actual answer - no calculus required!
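The "only way to combine K and M" step amounts to solving a small linear system for the exponents. A minimal sketch, with the unit bookkeeping (rows for kg and s, columns for K and M) chosen for this example:

```python
import numpy as np

# Rows: kg, s. Columns: spring constant K (kg/s^2), mass M (kg).
units = np.array([
    [ 1.0, 1.0],   # kg exponents of K and M
    [-2.0, 0.0],   # s exponents of K and M
])
target = np.array([0.0, 1.0])  # a period has units kg^0 * s^1

# Find a, b such that K^a * M^b has the units of time.
a, b = np.linalg.solve(units, target)
print(f"T is proportional to K^{a:g} * M^{b:g}")
```

The solution is unique because the unit matrix is invertible, which is why sqrt(M/K) is the only candidate. The dimensionless prefactor is not determined by this argument.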

Actually, there's another parameter, the displacement. It turns out that the spring period does not depend on the displacement, but that's a miracle that is special to springs. Instead, look at the pendulum. The same dimensional analysis gives the square root of the length divided by gravitational acceleration. That's off by a dimensionless constant, 2π. Moreover, even that is only approximately correct. The real answer depends on the displacement in a complicated way.
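To see how the pendulum's period depends on the displacement, here is a small sketch comparing the small-angle formula with the exact period of an ideal pendulum. It assumes a point mass on a rigid massless rod, released from rest at an amplitude below 180 degrees, and uses the standard closed form in terms of the arithmetic-geometric mean (AGM). The function names and the 1 m length are choices made for this example.

```python
import math

def agm(x, y, tol=1e-12):
    """Arithmetic-geometric mean of two positive numbers."""
    while abs(x - y) > tol * x:
        x, y = (x + y) / 2, math.sqrt(x * y)
    return x

def pendulum_period(length, amplitude_rad, g=9.81):
    """Period of an ideal pendulum released from rest at amplitude_rad (< pi)."""
    small_angle = 2 * math.pi * math.sqrt(length / g)
    # Exact result: the small-angle period divided by AGM(1, cos(amplitude / 2)).
    return small_angle / agm(1.0, math.cos(amplitude_rad / 2))

small_angle_period = 2 * math.pi * math.sqrt(1.0 / 9.81)
for degrees in (1, 30, 90):
    ratio = pendulum_period(1.0, math.radians(degrees)) / small_angle_period
    print(f"amplitude {degrees:>2} deg: exact / small-angle = {ratio:.4f}")
```

The ratio is a function of the amplitude alone, which is consistent with dimensional analysis: the amplitude is an angle, already dimensionless, so the argument cannot constrain how the period depends on it.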

That's because outside of physics (and possibly chemistry) there are enough constants running around that all quantities are effectively dimensionless. I'm having a hard time seeing a situation in, say, biology where I could propose dimensional analysis with a straight face, to say nothing of softer sciences.

What resources would you recommend for learning advanced statistics?