PyFriday: How to Calculate Correlation in Python

Correlation is a statistical measure that describes the extent to which two variables change together. In data analysis tasks, understanding correlation can often help in identifying relationships between variables. This tutorial will guide you through calculating and visualizing correlation using Python.

Understanding Correlation in Epidemiology: Context and Limitations

Before we cover correlation analysis in Python, we should first establish what correlation is, what types of correlation exist, potential uses and the limitations of the measure. This understanding is crucial for applying correlation analysis appropriately and interpreting its results accurately in any context, but especially when it comes to research that might directly impact the lives of entire populations.

What is Correlation?

Correlation is a measure that illustrates the degree to which two variables are associated with each other. For example, one might look into an association between environmental factors (like air pollution levels) and health outcomes (such as asthma rates). Understanding these correlations is fundamental in identifying potential risk factors and health trends.

What are the Types of Correlation?

Positive Correlation: If an increase in one variable (e.g., tobacco use) tends to be associated with an increase in another (like lung cancer incidence), this is positive correlation.
Negative Correlation: Conversely, a negative correlation occurs when an increase in one variable (e.g., vaccination rates) is associated with a decrease in another (like the incidence of the targeted infectious disease).
No Correlation: There are instances where no discernible pattern exists between two variables, suggesting no apparent relationship.

Generally, correlation is denoted as ‘r’, with a value from -1 (total negative correlation) to +1 (total positive correlation), with values closer to 0 being seen as having little or no correlation. It should be noted though that what might constitutes a weak correlation or no correlation will likely depend on multiple factors.

Relevance in Epidemiological Studies

Correlation is a vital statistical tool in epidemiology for:

Identifying Risk Factors: It helps in detecting potential risk factors associated with diseases.
Public Health Planning: Correlation studies inform preventive strategies and healthcare interventions.
Hypothesis Generation: Observing correlations can lead to generating hypotheses for causal research or avenues for intervention.

Key Limitations in Correlation Analysis

As with any statistical measure, the value of correlation can vary and comes with significant caveats:

Correlation Does Not Imply Causation: A fundamental limitation is that a correlation between two variables does not necessarily mean one causes the other. Confounding factors often play a role.
Susceptibility to Outliers: Correlation can be skewed by outliers, leading to misleading conclusions.
Simplification of Complex Relationships: Many health outcomes are influenced by multiple factors, and simple correlation may oversimplify these relationships.

For more information on Correlation, please check out our Epi Explained article on Correlation.

How to Perform a Correlation in Python

With the key background out of the way, let’s dig into how to do a correlation in Python with some data that is similar to what you may encounter in the field. If you’d like to follow along with the code, please check out Cody’s Github and download the folder for the Correlation Tutorial.

Our task during this tutorial is fairly simple. You have been assigned a quick analysis to see if there is a correlation between people getting sick with Cholera or Influenza and their proximity to a suspected source of illness, Well A. We can look at the raw correlation figure, as well as do some graphing that can illustrate any potential relationships.

Step 1: Read in the Libraries and Data.

Starting out, let’s first install some key packages and then read them into the environment. Please note this tutorial assumes you already have Python and some IDE (Integrated Development Environment) installed, and if you’ve yet to do so, please do so now.

Packages (or Libraries, both are used interchangeably in most cases) are basically bundles of code, functions and data that can be imported into your project that don’t require you to do everything by hand. Here, we’re importing pandas and numpy to do some data formatting and numbers work, while matplotlib , seaborn and plotly all have different graphing capabilities we’ll use together. It should be noted that the part of the code that goes as pd ,etc. is a way to shorten your call to specific libraries and is a fairly standard practice in Python.

Next, we need to read in our data. this is fairly simple as we call pd.read_csv from pandas , and tell it to go into the Data folder, and in that folder find our disease_dataset.

We can take a look at what we’re dealing with now.

Step 2: Data Cleaning

As we can see, we have two columns, one with the name of the disease and the other with the distance of the well marked. Further looking into it, there appears to be 3 noted diseases, both influenza and cholera which we expected, as well as a group called “No Enteric Diseases”. These entries aren’t particularly of use to us, so we can ignore them. What we can’t really ignore is that the data isn’t formulated in a way that makes life easy for us. To do a proper correlation, we will need to do the following steps:

Separate out the diseases into unique groups.
Take these groups and see if they can be further grouped together by distance to Well A.
Summarize the number of visits by distance.

To do this, we can write what’s known as a function. Functions are pieces of code that can be used for the same purpose more than once, and are a great way to shorten up analysis files.

The group_by_distance function first filters the dataset for the specified disease, then groups the filtered data by distance from Well A, and counts the occurrences of the disease at each distance. The result is a new dataframe, which is grouped by the distance from Well A, by number of occurrences, and creates a new column called count_A which shows the number of cases for each distance.
We then call our function on our dataframe, to create both Cholera_Distance and Flu_Distance, which have the distance as the first column, and then the count of occurrences of the disease as the next column.

Correlation and Graphing

Once our data is all set up, correlation is a remarkably easy task. We can just call the correcoef function from numpy . What we’re doing is feeding in the distance variable and the count variable we created, and then grabbing a part of the resulting array where there is an examination of the relationship between the two variables. It should be noted that by default, this function gives an array of the correlation of all variables and how they relate to each other (including themselves) so checking on output order of the results is fairly important. For the last bit of the script, we’re just printing out the correlation values for both Influenza and Cholera.

We can see that the correlation between flu and distance from the well is very weak at .09, indicating no real correlation is taking place. That said, Cholera has a decent negative correlation of -0.47, which indicates that the farther someone is away from the well, they are also less likely to get sick with cholera.

Say though that simple figures aren’t quite enough and we need some graphs to put into a report, and an additional graph another Epidemiologist can interact with for more detailed information. We can finally play with graphing capabilities. First, let’s knock out the static graphs.

There’s a lot going on with the code segment above, so let’s break it down. For both graphs, we’re first setting how big our graphs will be (10 inches by 6 inches) by using plt.figure . Then, we’re using Seaborn to make a a scatter plot of our two variables of interest, as well as creating a regression line over it. Next, we’re formatting the title and labels before finally printing our finished plot using plt.show()

Lastly, we can do a similar operation using plotly to create some interactive graphs. This is quite a bit more condensed but follows the same general idea of defining the x and y axis, setting a trend line, and then making a title.

This can then be printed, or used in a dashboard, or embedded like it is here.

Conclusion

In this tutorial, we’ve covered:

Correlation as a concept
How to read in real-world data
How to clean and prepare that data for analysis
How to calculate Correlation in Python
How to Graph Correlation in Python

If you’ve found this article helpful, please feel free to drop by again next and every friday for another PyFriday Tutorial. Or, check in every day for new research, core concepts, and programming content.

Humanities Moment

The featured image of this article is Odysseus And Polyphemus (1896) by Arnold Böcklin (Swiss, 1827-1901). Böcklin, a Swiss symbolist painter known for his imaginative interpretations of classical and mythological subjects, is most celebrated for his “Isle of the Dead” series, evoking themes of death and mortality within fantastical settings. His work, embodying the romantic and symbolist aesthetics of the late 19th century, has been both admired for its technical mastery and critiqued for its anachronistic style.

Epidemiology, Broadly Speaking