## Archive for **December 17th, 2022**

### Consider a child that prefers ice-cream ‘a’ to ice-cream ‘b’. The same child prefers ice-cream ‘b’ to ice-cream ‘c’. Does this mean that the child prefers ice-cream ‘a’ to ice-cream ‘c’?

Posted December 17, 2022

on: **Do you tend to apply the “Transitivity” relationship to living species?**

How to decide whether a relationship is Random and Not connected?

Should the next phase be the design of a **controlled multivariable experiment** in order to verify any Causative effects when correlations are observed?

**Controlled experiments are an Art of their own, and their comprehension does Not come easily.**

By Hemanth September 5, 2022

# Correlation: One Of The Most Misunderstood Concepts In Science

## The counter-intuitive math behind regression and correlation!

Be it the medical sciences or the social sciences, correlation is at the heart of scientific discovery. (**Meaning: it gives the inspiration to research the observation further.**)

Say that you wish to invent a new drug to cure a disease. You could just gather a bunch of **bio markers** that have a **positive correlation** with curing the said disease.

Then, you just test plausible chemical combinations that aid this said “positive correlation”. (**Trial and error can be your friend** if you neglect other more **effective alternatives**).

At some point, you should have a new drug that cures the disease. (**Assuming the governing health institution is failing in its job of safeguarding the public.**)

Now, say that you wish to analyse if the *price* of your new drug would have any effect on your company’s *stock market price or vice-versa*. You perform a “linear-regression” analysis comprising the two variables (based on some past data).

This is standard practice in the biz, and your analysis shows that the price of your drugs and your company’s stock market price are **uncorrelated**. So, you could go ahead and price your latest drug without consideration to your company’s stock market price. Pretty cool, right?

Well, not quite.

The results of your drug development method as well as your linear regression analysis might be misleading you. And your misunderstanding of the word “correlation” is to blame.

This essay aims to correct that. We will be starting with the history of how correlation was discovered. Following this, we will swing back and forth between **regression** and *correlation* with a slight touch of **analytic geometry**.

It would help you if you have already read **my essay on regression**. At the end of this essay, you should have a better understanding of both correlation and regression. This, in turn, would help you avoid the common pitfalls. Without any further ado, let us begin!

## Regression Vs. Heredity — Galton’s Genius

Our story begins with the renowned polymath, **Francis Galton**. Among his many adventures, he was studying the nature of heredity. He had already figured out that children of tall parents tended to be taller than average, but still shorter than their parents.

On the other hand, children of short parents tended to be shorter than average, but still taller than their parents. Galton termed this phenomenon ‘regression’. In essence, he had discovered *regression to the mean*.

But he wasn’t satisfied with just that. He wanted to quantify the effect of regression. Clearly, it wasn’t just regression which decided a child’s height. Heredity also played a role. But how much did each affect the final height? That’s pretty much the question he was trying to answer.
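The regression effect described above can be reproduced with a small simulation. This is only a sketch under loudly stated assumptions: the population mean, the spread, and the `HEREDITY` weight are made-up numbers, and the toy model (a child's height is the population mean, plus part of the parent's deviation, plus chance) is an illustration, not Galton's actual data or method.

```python
import random

random.seed(42)  # fixed seed so the run is repeatable

MEAN, SD = 68.0, 2.5  # assumed population mean and spread in inches (made up)
HEREDITY = 0.5        # assumed weight of the parent's deviation (made up)

def simulate_child(parent):
    """Toy model: population mean + part of the parent's deviation + chance."""
    chance = random.gauss(0.0, SD) * (1 - HEREDITY**2) ** 0.5
    return MEAN + HEREDITY * (parent - MEAN) + chance

parents = [random.gauss(MEAN, SD) for _ in range(20_000)]
children = [simulate_child(p) for p in parents]

# Keep only families whose parent is clearly tall (more than one SD above the mean).
tall = [(p, c) for p, c in zip(parents, children) if p > MEAN + SD]
avg_tall_parent = sum(p for p, _ in tall) / len(tall)
avg_their_child = sum(c for _, c in tall) / len(tall)

print(f"population mean:        {MEAN:.2f}")
print(f"tall parents average:   {avg_tall_parent:.2f}")
print(f"their children average: {avg_their_child:.2f}")
```

In this toy world, the children of tall parents come out taller than the population average, yet shorter on average than their parents, which is the pattern Galton called regression.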

## Correlation — Galton’s Ellipse

Galton collected parent-child height data and created a table which featured a map between parents’ heights and adult children’s heights. I’m skipping the fine details here on how exactly he computed these, but this suffices for our requirements in this essay.

He noticed something peculiar. He began seeing an elliptical shape that did not appear random. Here is a diagram he published based on his data in his 1886 paper “**Regression Towards Mediocrity in Hereditary Stature**”:

This diagram reveals the interplay between heredity and chance. In the forthcoming section, we will briefly see how variations in both of these variables could affect this interplay.

Galton then decided to plot each parent-child height pair as a point in a two-dimensional plane; he considered parents’ heights on a horizontal axis and children’s heights on a vertical axis. Mind you, back then Cartesian graphs were not remotely close to the norm.

By doing this, Galton had in fact invented what we know today as **scatter plots**. While we are on the topic of scatter plots, why don’t we see how heredity and chance affect the outcome of the graphical results?

## Introduction to Correlation via Scatter Plots

Let us now say that chance has no role to play in Galton’s parent-child dataset. Heredity alone would then govern the adult child’s height, and every child would be exactly as tall as the parent. The scatter plot would look like this:

It is no wonder that we see just points along a diagonal line here, because we have a situation where (x = y). Having said that, note that the points are more scattered toward the peripheries than in the middle.

Now, let us say that heredity has no role to play in Galton’s parent-child dataset. In this realm, chance governs the outcome 100%. The corresponding scatter plot would look like this:

This scatter plot shows no affinity to the diagonal. In other words, the child’s height and parent’s height are independent of each other. So, regardless of the parent’s genes, the child’s characteristics are 100% luck of the draw (chance).

As you can imagine, both of these are extreme cases. The reality slots somewhere in between. Galton’s dataset led to a scatter plot that looked like this:

Here, we can clearly see the ellipse running from the bottom-left corner to the top-right corner. Galton was so methodical with confirming this result that he went to the trouble of concealing the data’s background (to remove prejudices) and consulting a mathematician to confirm his observation; old-school peer review, if you will.

## Drawing Conclusions from the Scatter Plots

Comparing all three scatter plots, we could arrive at the following three empirical conclusions:

1. When the outcome is completely deterministic (that is, controlled 100% by heredity), the ellipse collapses into a straight (diagonal) line.

2. When the outcome is completely governed by chance, the ellipse expands to become a circle (roughly speaking).

3. When the outcome is governed by a mix of both, an ellipse of ‘some’ level of eccentricity results.

Galton termed the measure of this eccentricity (of the ellipse) correlation. Over time, the measure of correlation has been advanced by impressive contributions from people such as **Karl Pearson**. Today, we apply the concept of correlation to datasets that span multiple dimensions (a topic for another day).
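To put numbers on the three conclusions, here is a minimal sketch that computes the modern Pearson correlation coefficient for the three scenarios. It is not Galton's original eccentricity measure. The `pearson` and `children_for` helpers, the height figures, and the `heredity` weight are all assumptions made up for illustration.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

random.seed(0)  # fixed seed so the run is repeatable
MEAN, SD = 68.0, 2.5  # assumed population mean and spread in inches (made up)
parents = [random.gauss(MEAN, SD) for _ in range(5_000)]

def children_for(heredity):
    """Toy model: heredity=1 copies the parent, heredity=0 is pure chance."""
    scale = math.sqrt(1 - heredity**2)
    return [MEAN + heredity * (p - MEAN) + scale * random.gauss(0.0, SD)
            for p in parents]

r_heredity_only = pearson(parents, children_for(1.0))  # the diagonal line
r_chance_only = pearson(parents, children_for(0.0))    # the round blob
r_mixed = pearson(parents, children_for(0.5))          # the ellipse

print(f"heredity only: r = {r_heredity_only:.3f}")
print(f"chance only:   r = {r_chance_only:.3f}")
print(f"mixed:         r = {r_mixed:.3f}")
```

The coefficient sits at 1 for the straight line, close to 0 for the circle, and somewhere in between for the ellipse.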

Back to our main challenge: Why are the methods you followed for your drug company misleading you? Let’s jump right into the answers.

## Correlation: One of the Most Misunderstood Concepts in Science

You might have heard of the phrase “Correlation does not mean causation”. This phrase has almost become mainstream. What it means is that two correlated variables need not necessarily have causal links.

For instance, it could be that the biological markers you have gathered are positively correlated with curing the said disease. However, it is NOT a given that they are causally linked. In other words, having the required bio markers does not necessarily guarantee a cure for the said disease.

Since this issue has become more or less mainstream, many researchers are able to wrap their heads around reading correlation information without imposing causal biases. However, the **intransitivity of correlation** is something that still catches many researchers off guard.

To understand transitivity, let us consider the following relationship: (a > b > c). From this, we see that ‘a’ is greater than ‘b’ and ‘b’ is greater than ‘c’. Based on this, we could say for sure that ‘a’ is greater than ‘c’.

This property, where a relationship that holds between ‘a’ and ‘b’ and between ‘b’ and ‘c’ must also hold between ‘a’ and ‘c’, is known as transitivity. “Greater than” relationships are transitive. However, **correlation is NOT.**

Here’s your situation: your new drug boosts the bio marker readouts. These bio markers are in turn positively correlated with the disease’s cure. Based on this, the intuitive conclusion many would make is that the new drug helps cure the disease. But you see, this is **NOT** a given.
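A tiny numeric counterexample makes this concrete. The four "patients" below are entirely hypothetical, and the names `drug`, `cure`, and `marker` are made up for illustration. The marker readout is built to rise both with the drug and with recovery, while the drug itself has nothing to do with recovery.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical toy data for four patients (illustrative numbers only).
drug = [1, 1, -1, -1]   # drug exposure, centred on zero
cure = [1, -1, 1, -1]   # recovery score, centred on zero
marker = [d + c for d, c in zip(drug, cure)]  # readout rises with both

print(round(pearson(drug, marker), 3))  # 0.707
print(round(pearson(marker, cure), 3))  # 0.707
print(round(pearson(drug, cure), 3))    # 0.0
```

The drug is positively correlated with the marker, and the marker is positively correlated with the cure, yet the drug and the cure are completely uncorrelated.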

To drive home this point, consider a child that prefers ice-cream ‘a’ to ice-cream ‘b’. The same child prefers ice-cream ‘b’ to ice-cream ‘c’. Does this mean that the child prefers ice-cream ‘a’ to ice-cream ‘c’?

Anyone who has interacted with an ice-cream-loving picky child would not make that “**assumption**”. But wait, there’s more!

## Uncorrelated variables are NOT Always Unrelated

In the introduction, we saw that the price of your drugs and your company’s stock market price were *uncorrelated*. This might well be the case, but that does NOT mean that they are unrelated.

The dirty little secret of quite a few of the regression/correlation analyses conducted in scientific research is that they look for **linear relationships**. This is a conceptual simplification. Not all relationships are linear, and not all uncorrelated variables are unrelated.

The non-linear relationship between your drug prices and your company’s stock market price could take you for a ride if your decisions do not take into account the possibility of these two variables “becoming” correlated in the future or beyond the realm of your data set.

**A linear analysis that says two variables are uncorrelated simply says that they do not have a linear relationship.**

Many researchers still fall prey to concluding that the two variables in question are independent of each other.
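Here is a minimal sketch of how that trap looks in numbers. The data is invented: `price_change` and `stock_move` are hypothetical variables, and the quadratic link between them is an assumption chosen only because it is perfectly deterministic and yet invisible to a linear measure.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: the stock move is fully determined by the price change,
# but through a squared (non-linear) relationship.
price_change = [-3, -2, -1, 0, 1, 2, 3]
stock_move = [p * p for p in price_change]

r_full = pearson(price_change, stock_move)
print(round(r_full, 3))  # 0.0

# Looking at only part of the range tells a very different story.
r_half = pearson(price_change[3:], stock_move[3:])
print(round(r_half, 3))  # 0.958
```

Over the full range the correlation is zero even though one variable fully determines the other. Restricted to non-negative price changes, the same two variables look strongly correlated.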