Collider Bias in Statistics

Scott Lyden
Sep 1, 2022


In bio-statistics there is a consistent pattern in perinatal data: Among babies born with low birth-weight, infants born to smokers have lower first-year mortality than those born to non-smokers. At first (or even second) blush, it might appear that smoking is beneficial to underweight infants. Don’t laugh; this idea was actually seriously proposed in the medical literature about fifty years ago!

We’ve all heard that correlation does not imply causation. The correlation in this case is clear, but is there a causal connection? Should obstetricians prescribe cigarettes to patients who appear headed for a low birth-weight delivery? We all know that would be nuts, but why is it nuts? (Obviously, it’s a bad idea from the point of view of the mother’s health; we’re just focusing on the baby’s health with this question.)

Provocative questions like this set the correlation-causation conundrum in high relief and nicely highlight the complexity of isolating causal relationships. In this case, an inference of a causal relationship would be invalid due to a phenomenon statisticians call collider bias.

Collider bias is the subtle, insidious cousin of confounder bias. I think confounder bias is simpler to understand (and fix!) than collider bias. Also, collider bias is often the product of misguided attempts to address confounder bias, so let’s lay the foundation for an exploration of collider bias by first reviewing confounder bias.

Confounding bias occurs when another variable, a confounder, influences both the cause and the effect in a causal relationship. To continue with the example of the connection between smoking and infant mortality, let’s think about a vastly oversimplified model where smoking and infant mortality might both be driven by a lone third variable. One can imagine a lot of hard-to-measure candidates for such a variable, but education is relatively easy to measure and is routinely included as a possible confounder in models looking at the connection between smoking and infant mortality.

It can be useful to illustrate even greatly over-simplified causal relationships like this in a directed acyclic graph. The graph is directed because the arrows illustrate the direction of postulated causes and effects. It’s acyclic because a world where effects can cycle back to cause their causes would just be weird. Here, two arrows are both running away from education to influence both the exposure (smoking) and the outcome (infant mortality).

Any study attempting to measure the effect of smoking on infant mortality without statistically controlling for education (or, better yet, for all the variables we’d really like to measure) would attribute to smoking alone all of the mortality effects that travel with low education (poor nutrition, drug use, etc., in addition to smoking itself), almost surely producing an exaggerated (upwardly biased) estimate of the pure, isolated effect of smoking.
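To make the direction of the bias concrete, here is a minimal sketch that computes the naive and the education-adjusted risk differences exactly. Every probability in it is a hypothetical number I made up for illustration (none comes from real perinatal data), and the names `P_LOW_EDU`, `P_SMOKE_GIVEN_EDU`, and `p_death` are my own assumptions.

```python
# Hypothetical numbers chosen only for illustration; none come from real data.
P_LOW_EDU = 0.5                        # share of mothers with low education
P_SMOKE_GIVEN_EDU = {1: 0.4, 0: 0.1}   # P(smoker | low_edu = 1 or 0)


def p_death(smoke, low_edu):
    """Assumed mortality risk: smoking and low education each add risk."""
    return 0.004 + 0.002 * smoke + 0.004 * low_edu


def p_death_given_smoke(smoke):
    """Mortality among smokers (or non-smokers), ignoring education."""
    num = den = 0.0
    for low_edu in (0, 1):
        p_edu = P_LOW_EDU if low_edu else 1 - P_LOW_EDU
        p_s = P_SMOKE_GIVEN_EDU[low_edu] if smoke else 1 - P_SMOKE_GIVEN_EDU[low_edu]
        weight = p_edu * p_s           # P(low_edu, smoke)
        num += weight * p_death(smoke, low_edu)
        den += weight
    return num / den


# Naive comparison: smokers vs. non-smokers, education left uncontrolled.
naive = p_death_given_smoke(1) - p_death_given_smoke(0)

# Adjusted comparison: compare within each education level, then average.
adjusted = sum(
    (P_LOW_EDU if e else 1 - P_LOW_EDU) * (p_death(1, e) - p_death(0, e))
    for e in (0, 1)
)

print(f"naive risk difference:    {naive:.4f}")
print(f"adjusted risk difference: {adjusted:.4f}")
```

Under these assumed numbers the naive difference comes out to 0.0036, while the adjusted difference recovers the 0.0020 that was built into `p_death`. The gap is the part of the education effect that the naive comparison wrongly credits to smoking.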

I think confounder bias is easy enough to understand in this case, but now let’s slowly complicate things.

Let’s start by thinking about two variables, A and B, that independently act to cause a third variable, C.

The arrows from A and B both collide at (both point to) the variable C. For simplicity, let’s think of all three variables as binary (on or off) variables, where C is on if either A or B is on. To make the example even sharper, imagine that C is on only if exactly one of A and B is on (A but not B, or B but not A). Then, if we observe C, we know that either A is true and B is not or that B is true and A is not. Put another way, even though A and B may be statistically independent unconditionally, they are perfectly negatively correlated conditional on C.
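The A, B, C example is small enough to enumerate exactly. The sketch below assumes, purely for illustration, that A and B are independent fair coin flips (`P_A` and `P_B` are my assumptions, as are the function names), and it compares the "either A or B" rule with the sharper "exactly one of A and B" rule.

```python
from itertools import product

P_A = 0.5  # assumed P(A is on)
P_B = 0.5  # assumed P(B is on)


def joint(collide):
    """Return {(a, b, c): probability} for independent A, B and C = collide(a, b)."""
    table = {}
    for a, b in product((0, 1), repeat=2):
        p = (P_A if a else 1 - P_A) * (P_B if b else 1 - P_B)
        table[(a, b, collide(a, b))] = p
    return table


def p_b_given(table, a, c=None):
    """P(B = 1 | A = a), optionally also conditioning on C = c."""
    keep = {k: p for k, p in table.items()
            if k[0] == a and (c is None or k[2] == c)}
    total = sum(keep.values())
    return sum(p for k, p in keep.items() if k[1] == 1) / total


either = joint(lambda a, b: int(a or b))        # C is on if A or B is on
exactly_one = joint(lambda a, b: int(a != b))   # C is on if exactly one is on

for name, table in (("either", either), ("exactly one", exactly_one)):
    print(name)
    print("  unconditional: P(B|A=0) =", p_b_given(table, 0),
          " P(B|A=1) =", p_b_given(table, 1))
    print("  given C = 1:   P(B|A=0) =", p_b_given(table, 0, c=1),
          " P(B|A=1) =", p_b_given(table, 1, c=1))
```

Without conditioning, knowing A tells you nothing about B: both probabilities are 0.5. Once we look only at cases where C is on, A = 0 forces B = 1 under both rules, and under the "exactly one" rule A = 1 also forces B = 0, which is the perfect negative correlation described above.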

Let’s apply all this to the example we began with, the birth-weight paradox. Infants born to smoking mothers have an elevated risk of both low birth weight and death within the first year of life. Yet, low birth-weight infants have better survival chances if their mothers smoked than if they didn’t.

By now, I’ll bet you’re starting to fill in the explanation yourself, but let’s spell it out. (Hint: In the following, think of low birth weight in the role of C above.)

Let’s imagine that low birth weight is only caused by two factors: smoking and a catch-all factor we’ll call “birth defects.” In this model, low birth-weight infants whose mothers didn’t smoke have a higher likelihood of birth defects, given that they are underweight and their low birth weight can’t be attributed to smoking. If birth defects have a strong effect on mortality, failing to smoke 😃 could be associated with elevated mortality risk for the sub-population of low birth-weight babies. Note that this is the case whether or not there is a causal connection between smoking and infant mortality at all. That is, we could delete the arrow running directly from smoking to mortality in the DAG and a spurious connection between smoking and mortality risk could still exist if we control for, or select on, birth weight. (Make sure you understand this point!)
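Here is a minimal sketch of that two-cause model, computed exactly rather than simulated. All of the probabilities are hypothetical values I picked for illustration, smoking and birth defects are assumed to be independent of each other, and `smoke_effect` is an assumed knob for the direct effect of smoking on mortality (zero means the smoking-to-mortality arrow has been deleted).

```python
# Hypothetical probabilities for illustration only; none come from real data.
P_DEFECT = 0.05                       # assumed P(birth defect), independent of smoking
P_LBW = {                             # assumed P(low birth weight | smoke, defect)
    (0, 0): 0.02,
    (1, 0): 0.20,
    (0, 1): 0.60,
    (1, 1): 0.70,
}


def p_death(smoke, defect, smoke_effect=0.0):
    """Assumed first-year mortality risk; smoke_effect=0 removes the direct arrow."""
    return 0.005 + 0.195 * defect + smoke_effect * smoke


def mortality(smoke, lbw_only, smoke_effect=0.0):
    """Mortality for smokers or non-smokers, overall or among low birth-weight infants."""
    num = den = 0.0
    for defect in (0, 1):
        weight = P_DEFECT if defect else 1 - P_DEFECT
        if lbw_only:
            # Selecting on the collider: keep only low birth-weight infants.
            weight *= P_LBW[(smoke, defect)]
        num += weight * p_death(smoke, defect, smoke_effect)
        den += weight
    return num / den


for effect in (0.0, 0.01):
    print(f"direct smoking effect on mortality = {effect}")
    for lbw_only in (False, True):
        label = "low birth weight only" if lbw_only else "all infants"
        print(f"  {label:22s} smokers: {mortality(1, lbw_only, effect):.4f}"
              f"  non-smokers: {mortality(0, lbw_only, effect):.4f}")
```

With the direct effect set to zero, smokers and non-smokers have identical mortality across all infants, yet among low birth-weight infants the smokers' babies show roughly 0.035 against roughly 0.124 for non-smokers. Setting `smoke_effect` to 0.01 makes smoking harmful overall while leaving it apparently protective within the low birth-weight group, which is the shape of the paradox.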

In contrast to confounding bias, the problem here is that the researcher has controlled for (in this case, selected the sample based on) low birth weight, a variable that is caused by one or more study variables that are also linked to the outcome. Here, controlling for the problematic variable has caused the bias, whereas controlling for a confounding variable alleviates study bias, so the distinction between colliders and confounders has crucial implications for researchers.

To review, confounding occurs when a common cause affects treatment and outcome variables jointly. Here, you address the problem by controlling for the variable in question. Collider bias, on the other hand, occurs when one or more of the treatment variables causes variation in another variable that is (erroneously) used in sample selection or another form of statistical control. That is, controlling for such a variable is exactly what you don’t want to do.

Wrapping up, I just want to say a brief word about how all this relates to selection bias. Many readers will recognize that the particular form of collider bias I’ve used for illustration is a form of selection bias: Only low birth-weight infants were selected for analysis. Collider bias is more general than selection bias, though. The same basic class of problems would arise if data were stratified by birth weight (e.g., low, medium, high) or if birth weight were added as a continuous-valued control in a regression study.

Selection bias can also be driven by confounders. Generalizing from the selection bias examples that I can think of, it seems like selection bias is more likely to be driven by confounders when the study subjects select themselves, and colliders are likely to be involved when study subjects are selected by researchers, but please leave a comment if you can think of a good counterexample!

Finally, there’s another classic conundrum from the bio-statistical literature that is often viewed as an instance of collider bias; it’s called the “obesity paradox.” In the obesity paradox, patients with chronic conditions tend to have better outcomes when they are obese. (Take two cheeseburgers and call me in the morning?) The literature on both of these statistical paradoxes is vast, but the two papers cited and linked below are great references to start exploring these topics in greater depth.

[1] Hernández-Díaz, Sonia, Enrique F. Schisterman, and Miguel A. Hernán. “The Birth Weight ‘Paradox’ Uncovered?” American Journal of Epidemiology 164, no. 11 (2006): 1115–1120.

[2] Banack, H. R., and A. Stokes. “The ‘Obesity Paradox’ May Not Be a Paradox at All.” International Journal of Obesity 41 (2017): 1162–1163.


