Reading (ca. 1860) by Honoré Daumier (French, 1808-1879) is a pencil drawing of two seated men, one of them reading.

Introduction to Relative Risk

Relative Risk (RR) is one of the most fundamental measures in public health, offering insight into the strength of association between an exposure (like smoking) and an outcome (such as lung cancer). Here, we’ll quickly cover what Relative Risk is and how to calculate it in Python.

Understanding the Mathematical Concept of Relative Risk

Before we delve into the Python programming, it’s essential to grasp the mathematical foundation of Relative Risk. RR is a ratio that compares the probability of an event occurring in two different groups: those exposed to a certain factor versus those not exposed.

The Basic Formula

The formula for calculating Relative Risk is relatively straightforward:


[math] \text{RR} = \frac{\text{Re}}{\text{Ru}} [/math]



  • Risk in Exposed Group (Re) is the probability of an event occurring in the group exposed to a certain factor.
  • Risk in Unexposed Group (Ru) is the probability of the same event in a group not exposed to that factor.

Now we have to ask ourselves how to calculate Re and Ru. Simply put, the risk in a group is just the number of individuals in that group who have the condition of interest divided by all individuals observed in that group.

[math] \text{Re} = \frac{\text{Number of Cases in Exposed Group}}{\text{Total Exposed Group}} [/math]

[math] \text{Ru} = \frac{\text{Number of Cases in Unexposed Group}}{\text{Total Unexposed Group}} [/math]
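As a quick sanity check on the formulas above, here is a minimal sketch in plain Python. The counts are made up purely for illustration (they are not from the tutorial dataset), and relative_risk is a hypothetical helper name.

```python
def relative_risk(exposed_cases, exposed_total, unexposed_cases, unexposed_total):
    """Return (Re, Ru, RR) computed from raw counts."""
    risk_exposed = exposed_cases / exposed_total        # Re
    risk_unexposed = unexposed_cases / unexposed_total  # Ru
    return risk_exposed, risk_unexposed, risk_exposed / risk_unexposed


# Hypothetical counts: 50 cases among 100 exposed, 25 cases among 100 unexposed.
re_, ru_, rr = relative_risk(50, 100, 25, 100)
print(re_, ru_, rr)  # 0.5 0.25 2.0
```

With these illustrative numbers, the exposed group has twice the risk of the unexposed group.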


Interpreting Relative Risk

  • RR = 1: Indicates no association between exposure and outcome.
  • RR > 1: Suggests a higher risk of the event with exposure.
  • RR < 1: Implies a lower risk with exposure, potentially indicating a protective factor.

Relative Risk provides a quantifiable measure to understand the strength and direction of the association, making it an invaluable tool alongside techniques like Odds Ratios.


Project Introduction

For this scenario, imagine a senior epi has called you in to figure out if there’s a significant risk of developing two kinds of cancer due to smoking. They hand you a CSV with 3500 entries, including smoking status and a field for diagnosis codes that looks a bit tricky. There was some talk about further data cleaning and organization, but that would take days to sift through manually. No worries though: we’ll get it all sorted out, and answer the question of whether there’s significant risk, in a matter of minutes. As usual, if you’d like to follow along, go ahead and download the Relative Risk folder from Cody’s GitHub.


Preparing the Environment: Installing and Loading Necessary Python Packages and Data

To begin, we need to set up our Python environment by installing and loading the required packages. We’ll primarily be using the most popular data science package for Python, pandas, imported under its conventional short alias, pd, to make it easier to call.

We’ll also want to read in our dataset for this project, which is a CSV file called smoking_survey.csv.
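The setup described above might look roughly like the sketch below. Since the real file isn't available here, it reads a tiny in-memory stand-in instead. The column names (smoking_status, diagnosis_codes) and the ";" separator inside the diagnosis field are assumptions for illustration, so check them against the actual file.

```python
import io

import pandas as pd

# Tiny in-memory stand-in so the sketch runs without the real download.
# Column names and the ";" separator are assumptions, not the real schema.
sample_csv = io.StringIO(
    "smoking_status,diagnosis_codes\n"
    "yes,C34.90;I10\n"
    "no,E11.9\n"
    "yes,J45.909\n"
)
df = pd.read_csv(sample_csv)

# With the real file in the working directory, the call would instead be:
# df = pd.read_csv("smoking_survey.csv")

print(df.head())
```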

Applying Functions to Data: Using .apply Function to Process Data

Now that we have our data in, we can write a quick function that makes any further data cleaning unnecessary. For our purposes, the only ICD-10 codes that really matter are C34.90 and C96.29. We don’t need to worry about the presence or absence of any other codes, and either one or both of these codes counts as a case. As such, we can use fairly simple logic in Python.

Here, we first define a function called has_lung_cancer, which takes in a variable called codes. That variable is compared against a list we define called lung_cancer_codes. From there, a logic statement returns a value of “yes” if any of our codes match the list, and a value of “no” otherwise. We then .apply this function across the dataframe (specifically to diagnosis_codes) to define a new column in our dataframe called lung_cancer.
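A minimal sketch of that function follows. The names has_lung_cancer, codes, lung_cancer_codes, diagnosis_codes, and lung_cancer come from the description above. How the codes are packed into the field is an assumption here (a ";"-separated string), as is the handling of missing values, so adapt the splitting to the real data.

```python
import pandas as pd

# The two codes of interest named in the text.
lung_cancer_codes = ["C34.90", "C96.29"]


def has_lung_cancer(codes):
    """Return "yes" if any code of interest appears in the field, else "no"."""
    if pd.isna(codes):
        return "no"  # assumption: a missing field counts as a non-case
    # Assumption: codes are separated by ";" inside one string.
    found = [code.strip() for code in str(codes).split(";")]
    return "yes" if any(code in lung_cancer_codes for code in found) else "no"


# Illustrative rows only, not the tutorial dataset.
df = pd.DataFrame({"diagnosis_codes": ["C34.90;I10", "E11.9", "I10; C96.29", None]})
df["lung_cancer"] = df["diagnosis_codes"].apply(has_lung_cancer)
print(df)
```

Matching on whole codes after splitting, rather than searching for substrings, avoids accidentally flagging a longer code that merely contains one of the targets.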

Formatting: Selecting the Data and Data Types, Creating Contingency Tables

Now that we have our list of cases and exposures, we can get the data into a format that makes it easy to calculate our relative risk. To do this we need to take three steps. First, we need to reach into our dataframe and pick out only the columns that matter to us. Then, we need to make those columns categorical and order their categories so they are always presented in a consistent order. Lastly, we need to use the crosstab function from pandas to create a proper contingency table of our organized variables to accurately count up our exposed and positive, exposed and negative, unexposed and positive, and unexposed and negative groups. We can do that using the following code:
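The three steps could be sketched as follows. The smoking_status column name and the "yes"/"no" labels for it are assumptions, and the eight rows are invented for illustration. Putting "yes" first in both categoricals places the exposed group in the first row and the cases in the first column.

```python
import pandas as pd

# Illustrative data only; "smoking_status" is an assumed column name.
df = pd.DataFrame({
    "smoking_status": ["yes", "yes", "yes", "no", "no", "no", "no", "yes"],
    "lung_cancer":    ["yes", "yes", "no",  "no", "no", "yes", "no", "no"],
})

# Step 1: keep only the columns that matter.
subset = df[["smoking_status", "lung_cancer"]].copy()

# Step 2: make both columns ordered categoricals with "yes" first.
for col in subset.columns:
    subset[col] = pd.Categorical(subset[col], categories=["yes", "no"], ordered=True)

# Step 3: cross-tabulate exposure (rows) against outcome (columns).
table = pd.crosstab(subset["smoking_status"], subset["lung_cancer"])
print(table)
```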


Calculate Relative Risk in Python

We’ve finally hit the point where we can properly calculate Relative Risk. For this, we turn our contingency table into a 2×2 table, then read the .riskratio attribute (risk ratio is another name for Relative Risk) from that preformatted table, and lastly we calculate and save the lower and upper bounds of the 95% confidence interval using this code snippet: rr_ci_lower, rr_ci_upper = rr_table.riskratio_confint(). Let’s dig more into that line in particular, as it might look a bit weird to new Python users.
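A sketch of this step is shown here, assuming the 2×2 table comes from statsmodels' Table2x2 class (which provides riskratio and riskratio_confint()). The counts are hypothetical, laid out with the exposed group in the first row and cases in the first column, because the risk ratio is taken as first-row risk over second-row risk.

```python
import numpy as np
from statsmodels.stats.contingency_tables import Table2x2

# Hypothetical counts, not the tutorial data:
#                 cases  non-cases
counts = np.array([[50, 50],    # exposed
                   [25, 75]])   # unexposed

rr_table = Table2x2(counts)

# Point estimate: risk in row 0 divided by risk in row 1.
rr = rr_table.riskratio

# 95% confidence interval by default (alpha=0.05), returned as (lower, upper).
rr_ci_lower, rr_ci_upper = rr_table.riskratio_confint()

print(rr, rr_ci_lower, rr_ci_upper)
```

If you start from the pandas crosstab, passing its underlying array (for example table.values) should work the same way, provided the rows and columns are in the order described.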

So, for that tricky bit of code, here’s what’s going on in detail, with the right side of the assignment going first. What we are doing is calling a method, .riskratio_confint(), on our table. It returns the confidence interval bounds in a datatype called a tuple. Tuples are a data format where multiple pieces of information are stored in sequence and can be unpacked by essentially telling Python “put the first value in this variable, then the second here” and so on. Since we’re only dealing with two values, the lower and upper limits, we need only worry about those two. For this method, the lower estimate is presented first, then the upper. So, we define our variables in that order as rr_ci_lower, rr_ci_upper.
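To make the unpacking concrete without any library, here is a by-hand sketch that returns a (lower, upper) tuple using the common normal approximation on the log scale. The function name risk_ratio_ci and the counts are hypothetical, and whether this exactly matches the library's default method is an assumption worth verifying.

```python
import math


def risk_ratio_ci(a, b, c, d, z=1.96):
    """Approximate 95% CI for the risk ratio from a 2x2 table.

    a: exposed cases,   b: exposed non-cases
    c: unexposed cases, d: unexposed non-cases
    """
    rr = (a / (a + b)) / (c / (c + d))
    # Standard error of log(RR) under the usual normal approximation.
    se_log_rr = math.sqrt(1 / a - 1 / (a + b) + 1 / c - 1 / (c + d))
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return lower, upper  # a tuple: lower bound first, then upper


# Tuple unpacking: first value goes to rr_ci_lower, second to rr_ci_upper.
rr_ci_lower, rr_ci_upper = risk_ratio_ci(50, 50, 25, 75)
print(rr_ci_lower, rr_ci_upper)
```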


To check our values, we can just print the variable names and see that the relative risk in this sample is roughly 3.85, with lower and upper bounds of 3.31 and 4.47. Since the estimate is well above 1, and so is the lower bound, we can say that there’s a significant increase in risk associated with smoking when it comes to our cancers of interest.

Here, we learned what Relative Risk is, how to calculate it, and then how to prepare data and process it for Relative Risk in Python.


Humanities Moment

For this PyFriday tutorial, the featured image was Reading (ca. 1860) by Honoré Daumier (French, 1808-1879). Daumier was a prominent French artist, known for his paintings, sculptures, and printmaking. He became renowned for his critical and satirical depictions of political figures and social commentary, reflecting on France’s turbulent period from the 1830 Revolution to the fall of the second Napoleonic Empire in 1870. Daumier’s work, especially his caricatures published in newspapers, highlighted his strong republican democratic beliefs, often targeting the bourgeoisie, the church, and the monarchy. Despite facing imprisonment for his controversial work, including a notable piece caricaturing King Louis-Philippe, Daumier’s influence extended beyond satire, as he was also a serious painter associated with realism. His artistry, which included over 100 sculptures, 500 paintings, and thousands of prints, was initially underappreciated by the public and critics, though later generations recognized him as a significant figure in 19th-century French art. Daumier’s personal life was marked by financial struggles, but his legacy was cemented by his impactful contributions to political satire and art, influencing subsequent generations of artists.
