“Lung cancer occurs as a result of smoking” is an example of causal reasoning

Smoking cigarettes, as we know today, causes lung cancer. However, that fact was not entirely clear in the 1950s, when the first studies showing a correlation between smoking and lung cancer were published. One of the skeptics was statistician R.A. Fisher, who reasoned that the causality could be the other way around:

“Is it possible then, that lung cancer (that is to say, the pre-cancerous condition which must exist and is known to exist for years in those who are going to show overt lung cancer) is one of the causes of smoking cigarettes? I don’t think it can be excluded.”

To be clear, Fisher was not only a statistician but also a heavy smoker, so his view was probably biased. Nevertheless, he had a point: correlation alone is not enough to establish causation. What else can explain a correlation?

Cigarettes cause lung cancer

Generally, if two variables A and B are correlated, there are at least four possible explanations:

  1. A causes B
  2. B causes A
  3. A and B are both caused by a third variable, C.
  4. Chance (the correlation is spurious).
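To make the first three explanations in the list above concrete, here is a minimal Python sketch. It generates synthetic data under each mechanism and shows that all three yield essentially the same correlation coefficient, so the coefficient alone cannot tell them apart. The mechanisms, noise levels, and function names are illustrative assumptions, not taken from any study mentioned here.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Assumption: noise level chosen so that all three mechanisms share the
# same theoretical correlation, 1/sqrt(2) (about 0.71).
SHARED_CAUSE_NOISE = (2 ** 0.5 - 1) ** 0.5


def draw_pair(mechanism, rng):
    """Return one (a, b) sample from a hypothetical causal mechanism."""
    if mechanism == "A causes B":
        a = rng.gauss(0, 1)
        b = a + rng.gauss(0, 1)
    elif mechanism == "B causes A":
        b = rng.gauss(0, 1)
        a = b + rng.gauss(0, 1)
    elif mechanism == "C causes A and B":
        c = rng.gauss(0, 1)  # the third variable, never shown to the analyst
        a = c + rng.gauss(0, SHARED_CAUSE_NOISE)
        b = c + rng.gauss(0, SHARED_CAUSE_NOISE)
    else:
        raise ValueError(f"unknown mechanism: {mechanism}")
    return a, b


def correlation_under(mechanism, n=20_000, seed=0):
    """Correlation between A and B in n synthetic samples."""
    rng = random.Random(seed)
    pairs = [draw_pair(mechanism, rng) for _ in range(n)]
    a_values, b_values = zip(*pairs)
    return pearson(a_values, b_values)


for mechanism in ("A causes B", "B causes A", "C causes A and B"):
    print(f"{mechanism:>17}: r = {correlation_under(mechanism):.2f}")
```

The Pearson coefficient is symmetric in its two arguments, which is why it carries no information about the direction of the arrow.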

So how was the causal link between cigarettes and lung cancer established? In the 1950s and 1960s, a large number of studies confirmed the correlation. Studies also showed that heavier smokers developed cancer more often than lighter smokers, and that pipe smokers developed more lip cancer while cigarette smokers developed more lung cancer. Taken together, that evidence made the case clear. In 1964, the U.S. Surgeon General, Luther Terry, made the causal connection official:

“In view of the continuing and mounting evidence from many sources, it is the judgment of the Committee that cigarette smoking contributes substantially to mortality from certain specific diseases and to the overall death rate.”

Smoking, Terry concluded, is a health hazard.

Correlation caused by a third variable

Sometimes two variables are correlated simply because both are caused by a third, unobserved variable. One of the textbook examples is the correlation between ice-cream sales and murder rates in New York City. Obviously, this correlation is caused by a third variable: the season. Summertime is prime time for both ice cream and crime.
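The ice-cream example can be sketched as a small simulation. In the Python snippet below, the season drives both quantities and neither affects the other. All numbers (sales levels, murder rates, noise) are made up for illustration. The correlation is strong across the whole year but roughly vanishes once we look within a single season.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate_days(n_days=6_000, seed=1):
    """Return (is_summer, ice_cream_sales, murder_rate) for hypothetical days."""
    rng = random.Random(seed)
    days = []
    for _ in range(n_days):
        is_summer = rng.random() < 0.5
        # Assumption: both quantities depend on the season only,
        # plus independent noise. Units are arbitrary.
        ice_cream_sales = (200 if is_summer else 80) + rng.gauss(0, 20)
        murder_rate = (12 if is_summer else 8) + rng.gauss(0, 2)
        days.append((is_summer, ice_cream_sales, murder_rate))
    return days


def correlation(days):
    """Correlation between ice-cream sales and murder rate over the given days."""
    return pearson([d[1] for d in days], [d[2] for d in days])


days = simulate_days()
overall = correlation(days)
summer_only = correlation([d for d in days if d[0]])
winter_only = correlation([d for d in days if not d[0]])

print(f"all days:    r = {overall:.2f}")
print(f"summer only: r = {summer_only:.2f}")
print(f"winter only: r = {winter_only:.2f}")
```

Splitting the data by the suspected third variable (stratification) is only possible when that variable has been measured, which is exactly what fails in the less obvious cases.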

Other times, a correlation caused by a third variable can be less obvious. Consider the link between estrogen level and heart disease: in the 1990s, studies showed that women’s estrogen levels are negatively correlated with the risk of heart disease. This is an important issue, since heart disease is the leading cause of death for women over 65. So why not, by default, recommend hormone replacement therapy for post-menopausal, low-estrogen women? In fact, this was the common wisdom before the turn of the millennium.

Then the Women’s Health Initiative reported the results of a long-term controlled study, involving more than 160,000 women, that rebutted the common wisdom: hormone replacement therapy did not decrease the risk of heart disease, and in some cases it even increased it. In this case a third variable, menopause, affected both the rate of heart disease and the estrogen level, producing an observed correlation that was mistaken for causation.
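Why can a controlled study overturn an observational correlation? The Python sketch below is a toy model, not the Women’s Health Initiative data. A hidden risk factor influences both who ends up treated and who gets sick, while the treatment itself does nothing. In the observational setting the treated group looks healthier; with random assignment the gap disappears. All probabilities and names are hypothetical.

```python
import random


def simulate(randomized, n=50_000, seed=2):
    """Return a list of (treated, diseased) records for n hypothetical people."""
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        low_risk = rng.random() < 0.5  # hidden third variable
        if randomized:
            # Controlled study: a coin flip decides, independent of risk.
            treated = rng.random() < 0.5
        else:
            # Assumption: low-risk people are more likely to be treated.
            treated = rng.random() < (0.8 if low_risk else 0.2)
        # The treatment has NO effect: disease depends only on hidden risk.
        diseased = rng.random() < (0.05 if low_risk else 0.15)
        records.append((treated, diseased))
    return records


def disease_rate(records, treated):
    """Share of diseased people among those with the given treatment status."""
    group = [diseased for t, diseased in records if t == treated]
    return sum(group) / len(group)


def rate_gap(records):
    """Disease rate of the treated minus disease rate of the untreated."""
    return disease_rate(records, True) - disease_rate(records, False)


print(f"observational gap: {rate_gap(simulate(randomized=False)):+.3f}")
print(f"randomized gap:    {rate_gap(simulate(randomized=True)):+.3f}")
```

Random assignment breaks the link between the hidden variable and the treatment, so any remaining difference between the groups can be attributed to the treatment itself (up to chance).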

Another notable example is the apparent link between vaccines and autism that sparked the anti-vaccine movement, which continues to this day. In 1998, the medical journal The Lancet published Dr. Andrew Wakefield’s research claiming to have found a link between autism and the MMR (measles, mumps, rubella) vaccine. There were a number of problems with that paper, most importantly the small, apparently hand-selected sample of a mere 12 children. A third variable can also be at work here: the first autistic symptoms and the first vaccinations both fall in early childhood, so a temporal correlation is to be expected.

The Lancet eventually retracted Wakefield’s paper in 2010, after other researchers had pointed out the flaws in the study. Wakefield was later accused not only of bad science but also of deliberate fraud.

Spurious correlations

Did you know that the yearly number of people who drowned by falling into a pool and the number of movies featuring Nicolas Cage were correlated from 1999 to 2009?

Are his movies so bad that they make viewers want to commit suicide in their own pools? No, this is an example of a spurious correlation, a pure coincidence. Tyler Vigen has a number of such examples on his website.

Spurious correlations are a serious problem when researchers test for a large number of possible connections. In The statistics of the improbable, I mentioned the Swedish study from 1992 that linked living near power lines to childhood leukemia: the researchers surveyed everyone living within 300 meters of high-voltage power lines over a 25-year period and looked for statistically significant increases in the rates of over 800 different ailments. Of course, by looking at so many different possibilities at once, they were pretty much guaranteed to find at least one statistically significant correlation simply by chance. This is the so-called look-elsewhere effect.
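A back-of-the-envelope calculation shows how strong this effect is. The Python sketch below assumes 800 independent tests and the conventional significance threshold of 0.05; neither the threshold nor the independence is stated for the Swedish study, so treat both as assumptions. It computes the chance of at least one false positive, then simulates a world in which none of the ailments is actually linked to power lines.

```python
import random


def chance_of_any_false_positive(n_tests, alpha=0.05):
    """Probability that at least one of n independent null tests is 'significant'."""
    return 1 - (1 - alpha) ** n_tests


def count_false_positives(n_tests, alpha=0.05, seed=3):
    """Simulate n_tests with no real effect and count the 'significant' ones.

    When there is no real effect, a p-value is uniformly distributed on [0, 1],
    so each test falls below alpha with probability alpha.
    """
    rng = random.Random(seed)
    return sum(rng.random() < alpha for _ in range(n_tests))


N_AILMENTS = 800  # round number standing in for "over 800"
ALPHA = 0.05      # assumed significance threshold

print("chance of at least one false positive:",
      chance_of_any_false_positive(N_AILMENTS, ALPHA))
print("expected number of false positives:", N_AILMENTS * ALPHA)
print("false positives in one simulated study:",
      count_false_positives(N_AILMENTS, ALPHA))

# A common (conservative) remedy is the Bonferroni correction:
# divide the threshold by the number of tests.
print("Bonferroni-corrected threshold:", ALPHA / N_AILMENTS)
```

Under these assumptions a study of this kind should expect dozens of "significant" findings even if power lines have no effect at all, which is why a single hit among hundreds of tests is weak evidence.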

Conclusion: be careful with correlations

Correlation alone does not imply causation. Sometimes the two correlated variables are the result of a third, unobserved variable, such as the link between estrogen levels and risk of heart disease. Sometimes the correlation can be spurious, such as the link between power lines and childhood leukemia.

Establishing causation requires a lot more work than finding a correlation because it is a much stronger statement. This is the problem of inference: causation can only be inferred, never exactly known.

Is the link between smoking and cancer correlation or causation?

Cigarette smoking can cause cancer almost anywhere in the body. Cigarette smoking causes cancer of the mouth and throat, esophagus, stomach, colon, rectum, liver, pancreas, voicebox (larynx), trachea, bronchus, kidney and renal pelvis, urinary bladder, and cervix, and causes acute myeloid leukemia.

Is smoking a case of causation or correlation?

A correlation may also be observed when there is causality behind it. For example, it is well established that cigarette smoking not only correlates with lung cancer but actually causes it.

How is lung cancer caused by smoking?

Doctors believe smoking causes lung cancer by damaging the cells that line the lungs. When you inhale cigarette smoke, which is full of cancer-causing substances (carcinogens), changes in the lung tissue begin almost immediately. At first your body may be able to repair this damage.

What causal mechanism can cause lung disease and lung cancer?

Smoking tobacco is by far the leading cause of lung cancer. About 80% of lung cancer deaths are caused by smoking, and many others are caused by exposure to secondhand smoke. Smoking is clearly the strongest risk factor for lung cancer, but it often interacts with other factors.