Epi Explained: Understanding Correlation

A portrait of Diogenes holding an oil lamp above some owls

Correlation is a fundamental statistical concept that measures the strength and direction of a relationship between two quantitative variables. This article will demystify correlation for beginners, providing a comprehensive understanding of its mathematical and scientific principles, historical context, and some practice problems. By the end of this Epi Explained, you’ll not only grasp what correlation is but also how to interpret its values and apply this understanding in various contexts.

Introduction to Correlation

At its core, correlation quantifies the degree to which two variables move in relation to each other. When we talk about variables, we’re referring to anything that can take on different quantities or values, like height, weight, temperature, or even the number of hours studied. Correlation is pivotal in data analysis, helping public health workers uncover relationships among variables, predict outcomes, and make informed decisions.

The Concept of Correlation

Correlation is summarized by the correlation coefficient, a number on a scale from -1 to 1. Its sign tells us the direction of the relationship, and its magnitude tells us the strength:

Positive Correlation (between 0 and +1): As one variable increases, the other tends to increase as well. For example, the more hours spent studying, the higher the grades (typically). A value of exactly +1 indicates a perfect positive linear relationship.
Negative Correlation (between -1 and 0): As one variable increases, the other tends to decrease. An example could be the relationship between the number of hours spent on leisure activities and academic performance. A value of exactly -1 indicates a perfect negative linear relationship.
No Correlation (0): There is no discernible linear relationship between the movements of the two variables. For instance, the number of hours slept and the grades might not show any consistent pattern across different students.

Pearson Correlation Coefficient

The most commonly used measure of correlation is the Pearson correlation coefficient, denoted as [math] r [/math]. It quantifies the degree of linear relationship between two continuous variables. The formula for Pearson’s [math] r [/math] is:

[math]r = \frac{\sum (X_{i} - \overline{X}) (Y_{i} - \overline{Y})}{\sqrt{\sum (X_{i} - \overline{X})^{2} \sum (Y_{i} - \overline{Y})^{2}}}[/math]

Where:

[math] X_i [/math] and [math] Y_i [/math] are the individual sample points for the two variables.
[math] \overline{X} [/math] and [math] \overline{Y} [/math] are the means of the [math] X [/math] and [math] Y [/math] variables, respectively.

In simple terms, for each observation we multiply the difference between [math] X_i [/math] and the mean of [math] X [/math] by the difference between [math] Y_i [/math] and the mean of [math] Y [/math], then add up these products. We then divide that sum by the square root of the product of two quantities: the sum of squared differences for [math] X [/math] and the sum of squared differences for [math] Y [/math].
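The formula can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `pearson_r` and the small `hours` and `score` lists are made up for this example.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r for two equal-length sequences of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Numerator: sum of the products of paired deviations from the means.
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator: square root of the product of the summed squared deviations.
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("r is undefined when a variable has no variation")
    return cross / sqrt(ss_x * ss_y)

# Hypothetical data: score is exactly twice hours, a perfectly linear relationship.
hours = [1, 2, 3]
score = [2, 4, 6]
r_perfect = pearson_r(hours, score)         # perfect positive relationship
r_reversed = pearson_r(hours, score[::-1])  # same values, reversed order
print(r_perfect, r_reversed)
```

Because the made-up scores lie exactly on a straight line, the first result sits at the top of the scale, and reversing the order of the scores flips it to the bottom.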

Spearman’s Rank Correlation

In situations where the relationship between variables is not linear or when dealing with ordinal data (data that is ranked or ordered), Spearman’s rank correlation coefficient is used. It assesses how well the relationship between two variables can be described by a monotonic function, meaning one that consistently increases or consistently decreases. When there are no tied ranks, it can be calculated using the following formula:

[math]r_s = 1 - \frac{6 \sum d^2}{n(n^2 - 1)}[/math]

where:

[math] r_s [/math] is Spearman’s rank correlation coefficient.
[math] n [/math] is the number of observations.
[math]\sum{d^2}[/math] represents the sum of the squares of the differences in ranks between the two variables for each observation. Here’s how you calculate it step-by-step:

Rank the Observations: For both variables you’re comparing, assign ranks to each observation. If you’re looking at, say, height and weight, rank all individuals from the shortest to tallest (for height) and from the lightest to heaviest (for weight). In case of ties (equal values), assign the average rank that these values would have received had they been ordered.
Calculate the Differences in Ranks ([math] d [/math]): For each pair of observations, calculate the difference between the ranks. This is done by subtracting the rank of one variable from the rank of the other variable for each observation. Mathematically, if [math] R_1 [/math] is the rank of the observation in the first variable and [math] R_2 [/math] is the rank in the second variable, then the difference in ranks [math] d [/math] for each observation is [math] d = R_1 - R_2 [/math].
Square the Differences ([math]d^2[/math]): Square the difference in ranks for each observation. This step is crucial because it treats positive and negative differences equally, focusing purely on the magnitude of the difference, not its direction. Squaring also amplifies larger discrepancies.
Sum Up the Squared Differences ([math] \sum{d^2} [/math]): Finally, add up all the squared differences.
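The four steps above can be sketched in Python as follows. The helper names `average_ranks` and `spearman_rs` and the height and weight values are hypothetical, chosen only for illustration. Note that the shortcut formula is exact only when there are no tied ranks; with ties it gives an approximation.

```python
def average_ranks(values):
    """Rank values from smallest (rank 1) to largest; ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next sorted value is tied with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        # Average of the 1-based positions start+1 .. end+1.
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

def spearman_rs(xs, ys):
    """Spearman's r_s via the sum-of-squared-rank-differences formula."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")
    rank_x = average_ranks(xs)                                        # step 1
    rank_y = average_ranks(ys)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))    # steps 2-4
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

# Hypothetical heights (cm) and weights (kg) for five people, with no ties.
heights = [150, 160, 170, 180, 190]
weights = [50, 65, 60, 80, 75]
rs = spearman_rs(heights, weights)
print(round(rs, 2))  # rank differences are 0, -1, 1, -1, 1, so sum d^2 = 4 and r_s = 0.8
```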

Historical Context

The concept of correlation dates back to the late 19th century, with Francis Galton, a cousin of Charles Darwin, being one of the pioneers in studying correlation and regression. Galton’s work laid the foundation for Karl Pearson, who later formalized the calculation of the correlation coefficient that we use today. This development marked a significant advancement in statistical methods, allowing for more precise analysis of relationships between variables.

Interpreting Correlation Coefficients

While correlation coefficients provide valuable insights, it’s crucial to remember that correlation does not imply causation. A high correlation between two variables does not mean that one variable causes the other to change. There could be other underlying factors or a third variable influencing both.
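A small simulation can make the third-variable point concrete. In this sketch, which uses entirely made-up variables and numbers, a hypothetical `temperature` variable drives two outcomes that have no direct effect on each other, yet the two outcomes end up strongly correlated.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r, computed directly from the deviation formula."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cross / sqrt(ss_x * ss_y)

random.seed(42)  # fixed seed so the sketch is repeatable
n = 1000

# Hypothetical third variable, in standardized units.
temperature = [random.gauss(0, 1) for _ in range(n)]

# Each outcome depends on temperature plus its own independent noise.
# Neither outcome appears in the other's formula.
ice_cream_sales = [t + random.gauss(0, 0.5) for t in temperature]
swimming_injuries = [t + random.gauss(0, 0.5) for t in temperature]

r_confounded = pearson_r(ice_cream_sales, swimming_injuries)
print(round(r_confounded, 2))
```

With these assumed noise levels the coefficient should land near 0.8, even though changing one outcome in the simulation would do nothing to the other.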

Limitations and Considerations

Correlation analysis also has its limitations. Pearson’s coefficient assumes a linear relationship and Spearman’s assumes a monotonic one, so neither captures more complex patterns, such as a U-shaped relationship, well. Moreover, outliers can significantly impact the correlation coefficient, potentially leading to misleading interpretations.
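Both limitations are easy to demonstrate with a few invented data points. The sketch below is illustrative only, and the variable names are hypothetical. It shows a perfectly U-shaped relationship that Pearson’s coefficient misses entirely, and a single outlier that reverses the sign of an otherwise perfect negative correlation.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r, computed directly from the deviation formula."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cross / sqrt(ss_x * ss_y)

# 1) A U-shaped relationship: y is fully determined by x, but not linearly.
x_sym = [-2, -1, 0, 1, 2]
y_u = [x ** 2 for x in x_sym]
r_u_shape = pearson_r(x_sym, y_u)   # positive and negative products cancel out

# 2) One outlier: the first five points fall on a perfectly decreasing line.
x_clean = [1, 2, 3, 4, 5]
y_clean = [5, 4, 3, 2, 1]
r_clean = pearson_r(x_clean, y_clean)

# Add a single extreme point far from the rest of the data.
r_outlier = pearson_r(x_clean + [20], y_clean + [20])

print(round(r_u_shape, 2), round(r_clean, 2), round(r_outlier, 2))
```

In the first case the coefficient is zero despite a clear pattern. In the second, one added point moves the coefficient from a perfect negative value to a strongly positive one. This is why plotting the data before interpreting a coefficient is a good habit.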

Practice Problem

Consider the following data points for variables X and Y:

X: 1, 2, 3, 4, 5
Y: 2, 3, 5, 7, 8

Based on the given data, what is the Pearson correlation coefficient (r) between X and Y?

Options:

A) 0.95

B) 0.85

C) 0.99

D) 0.90

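Answer Key

C) 0.99

The means are 3 for X and 5 for Y. The sum of the products of the deviations is 16, and the sums of squared deviations are 10 for X and 26 for Y, so [math] r = \frac{16}{\sqrt{10 \times 26}} \approx 0.99 [/math].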

Conclusion

Correlation is a powerful statistical tool that offers a window into the relationships between variables. By understanding and correctly interpreting correlation coefficients, we can uncover insights from data that inform decision-making and hypothesis testing. However, it’s important to approach correlation analysis with a critical eye, recognizing its limitations and the potential for confounding variables.

Humanities Moment

This article’s featured image was Diogenes by Johann Carl Loth (German, 1632 – 1698). Johann Carl Loth was a German Baroque painter renowned for his history paintings, often featuring crowded scenes from classical mythology or the Old Testament, and spent the majority of his career in Venice where he influenced and was associated with numerous artists. Born in Munich and trained by his father, Loth’s notable commissions included work for Emperor Leopold I in Vienna, and his legacy is marked by a significant number of pupils and collaborations within the Venetian art scene until his death in 1698.
