Introduction to Relative Risk

Relative Risk (RR) is one of the most fundamental measures in public health, offering insight into the strength of association between an exposure (like smoking) and an outcome (such as lung cancer). Here, we’ll quickly cover what Relative Risk is and how to calculate it in R.

Understanding the Mathematical Concept of Relative Risk

Before we delve into the R programming aspects, it’s essential to grasp the mathematical foundation of Relative Risk. RR is a ratio that compares the probability of an event occurring in two different groups: those exposed to a certain factor versus those not exposed.

The Basic Formula

The formula for calculating Relative Risk is relatively straightforward:


[math] \text{RR} = \frac{\text{Re}}{\text{Ru}} [/math]



  • Risk in Exposed Group (Re) is the probability of an event occurring in the group exposed to a certain factor.
  • Risk in Unexposed Group (Ru) is the probability of the same event in a group not exposed to that factor.

Next, we need to calculate Re and Ru. Within each group, risk is simply the number of individuals who have the condition of interest divided by the total number of individuals observed in that group.

[math] \text{Re} = \frac{\text{Number of Cases in Exposed Group}}{\text{Total Exposed Group}} [/math]

[math] \text{Ru} = \frac{\text{Number of Cases in Unexposed Group}}{\text{Total Unexposed Group}} [/math]
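As a minimal sketch, the three formulas above can be wrapped in a small R function. The function name relative_risk, its argument names, and the counts in the example call are illustrative assumptions, not values from the article’s dataset.

```r
# Hypothetical helper: relative risk from four counts
relative_risk <- function(cases_exposed, total_exposed,
                          cases_unexposed, total_unexposed) {
  risk_exposed   <- cases_exposed / total_exposed      # Re
  risk_unexposed <- cases_unexposed / total_unexposed  # Ru
  risk_exposed / risk_unexposed                        # RR = Re / Ru
}

# Made-up counts for illustration only:
# 30 of 100 exposed and 10 of 100 unexposed people develop the outcome
rr_example <- relative_risk(30, 100, 10, 100)
print(rr_example)
```

With these made-up counts, Re is 0.3 and Ru is 0.1, so the exposed group has roughly three times the risk of the unexposed group.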


Interpreting Relative Risk

  • RR = 1: Indicates no association between exposure and outcome.
  • RR > 1: Suggests a higher risk of the event with exposure.
  • RR < 1: Implies a lower risk with exposure, potentially indicating a protective factor.

Relative Risk provides a quantifiable measure to understand the strength and direction of the association, making it an invaluable tool alongside techniques like Odds Ratios.


Project Introduction

For this scenario, imagine a senior epidemiologist has called you in to figure out whether smoking carries a significant risk of developing two kinds of cancer. They hand you a CSV with 3500 entries, including smoking status and a diagnosis-codes field that looks a bit tricky. There was some talk about further data cleaning and organization, but that would take days to sift through manually. No worries though: in a matter of minutes we’ll have the data sorted out and an answer to whether the risk is significant. If you want to follow along, feel free to download the relative risk folder from Cody’s GitHub.

Preparing the Environment: Installing and Loading Necessary R Packages

To begin, we need to set up our R environment by installing and loading the required packages. We’ll use dplyr for data manipulation and epiR for analysis. Installing these packages is straightforward:
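A typical setup looks something like the sketch below. It assumes both packages are pulled from CRAN and that the cloud mirror URL is reachable from your machine; the required_packages and missing_packages names are only illustrative.

```r
# Packages named in the text
required_packages <- c("dplyr", "epiR")

# Install only what is missing (needs an internet connection)
missing_packages <- required_packages[
  !vapply(required_packages, requireNamespace, logical(1), quietly = TRUE)
]
if (length(missing_packages) > 0) {
  install.packages(missing_packages, repos = "https://cloud.r-project.org")
}

# Load in every new R session
library(dplyr)  # data manipulation
library(epiR)   # epidemiological analysis, including epi.2by2()
```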


Data Preparation: Importing Data into R

Our analysis begins by importing the dataset into R. We’ll use a practice file, smoking_survey.csv, which contains each person’s smoking status, diagnosis codes, age, and the zip code in which they reside.
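The import can be sketched as follows, assuming smoking_survey.csv sits in the current working directory. The object name smoking_data is an assumption used for illustration.

```r
# Assumed location: the current working directory. Adjust csv_path otherwise.
csv_path <- "smoking_survey.csv"

if (file.exists(csv_path)) {
  # Keep text columns as plain character strings rather than factors
  smoking_data <- read.csv(csv_path, stringsAsFactors = FALSE)
  str(smoking_data)          # column names and types
  print(head(smoking_data))  # first few rows
} else {
  message("Could not find ", csv_path, " in ", getwd())
}
```

Checking for the file first gives a readable message instead of a read error if the working directory is not where the download landed.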



For our work, we really only need to know if the person was a smoker, and if they have a diagnosis code of interest.

Identifying the Variables: Selecting Relevant Variables for Analysis

We focus on two key variables: smoking_status and diagnosis_codes. To simplify our dataset, we’ll select only these columns:
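A sketch of the column selection with dplyr is shown below. So that it runs on its own, it builds a tiny made-up stand-in for the survey: the rows, the age and zip_code column names, and the smoking_selected object name are all assumptions. Only smoking_status and diagnosis_codes are named in the text.

```r
library(dplyr)

# Toy stand-in for the imported survey (made-up rows, not the real data)
smoking_data <- data.frame(
  smoking_status  = c("smoker", "non-smoker", "smoker"),
  diagnosis_codes = c("C34.90, I10", "E11.9", "J45.909"),
  age             = c(61, 45, 38),
  zip_code        = c("00001", "00002", "00003"),
  stringsAsFactors = FALSE
)

# Keep only the two columns needed for the analysis
smoking_selected <- smoking_data %>%
  select(smoking_status, diagnosis_codes)

print(names(smoking_selected))
```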

Applying Functions to Data: Using apply Function to Process Data

Next, we classify each individual’s lung cancer status based on diagnosis codes. We define a function, has_lung_cancer, which returns “yes” if the codes match either of the specific lung cancer diagnoses and “no” otherwise. Then, we apply this function to our data:
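The classification step might look like the sketch below. The text does not list the two target diagnosis codes, so the values in lung_cancer_codes are placeholders, and the toy rows, the "non-smoker" label, and the assumption that codes arrive as one free-text string per person are illustrative as well. It uses vapply(), a type-checked member of the apply family, rather than apply() itself.

```r
# Placeholder codes (assumption): swap in the two diagnoses of interest
lung_cancer_codes <- c("C34.90", "C34.91")

# Returns "yes" if the free-text code string contains either target code
has_lung_cancer <- function(codes) {
  matches <- vapply(lung_cancer_codes,
                    function(code) grepl(code, codes, fixed = TRUE),
                    logical(1))
  if (any(matches)) "yes" else "no"
}

# Toy stand-in for the selected columns (made-up rows)
smoking_selected <- data.frame(
  smoking_status  = c("smoker", "non-smoker", "smoker", "non-smoker"),
  diagnosis_codes = c("C34.90, I10", "E11.9", "J45.909; C34.91", "I10"),
  stringsAsFactors = FALSE
)

# Apply the function to every person's code string
smoking_selected$lung_cancer <- vapply(smoking_selected$diagnosis_codes,
                                       has_lung_cancer, character(1),
                                       USE.NAMES = FALSE)

# Fix the level order so "smoker" and "yes" come first in later tables
smoking_selected$smoking_status <- factor(smoking_selected$smoking_status,
                                          levels = c("smoker", "non-smoker"))
smoking_selected$lung_cancer <- factor(smoking_selected$lung_cancer,
                                       levels = c("yes", "no"))

print(smoking_selected[, c("smoking_status", "lung_cancer")])
```

Matching with fixed = TRUE treats each code as a literal string, so the dot in a code is not interpreted as a regular-expression wildcard.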

From this point we can do another quick column selection to only keep what we really need. Now we can go about calculating relative risk.

Calculating Relative Risk: Manual Method

To calculate relative risk, we first create a contingency table using our selected variables:
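A sketch of that step, using a handful of made-up rows rather than the survey data (the smoking_selected and smoking_table names and the "non-smoker" label are assumptions):

```r
# Toy data with the factor level order fixed (made-up rows)
smoking_selected <- data.frame(
  smoking_status = factor(c("smoker", "smoker", "non-smoker", "non-smoker", "smoker"),
                          levels = c("smoker", "non-smoker")),
  lung_cancer    = factor(c("yes", "no", "no", "yes", "yes"),
                          levels = c("yes", "no"))
)

# First argument -> rows (exposure), second argument -> columns (outcome)
smoking_table <- table(smoking_selected$smoking_status,
                       smoking_selected$lung_cancer)
print(smoking_table)
```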

This is done with the table() function, where smoking_status supplies the rows and lung_cancer supplies the columns. We turned both into factors earlier in this script to ensure that “smoker” is the first level of smoking_status and “yes” is the first level of lung_cancer. Factors are categorical values with an explicit level order; without them, R falls back on its default alphabetical ordering, which would put the table cells in the wrong positions and give us a messed-up statistic.

Then, we calculate the risk for both exposed (smokers) and unexposed (non-smokers) groups:
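The calculation can be sketched like this. The 2x2 counts are hypothetical and chosen only to make the arithmetic easy to follow; they are not the survey’s counts, and the variable names are assumptions.

```r
# Hypothetical 2x2 counts: rows = exposure, columns = outcome
smoking_table <- as.table(matrix(
  c(30, 70,
    10, 90),
  nrow = 2, byrow = TRUE,
  dimnames = list(smoking_status = c("smoker", "non-smoker"),
                  lung_cancer    = c("yes", "no"))
))

# Re: smokers with lung cancer divided by all smokers (row 1)
risk_exposed <- smoking_table[1, 1] / sum(smoking_table[1, ])

# Ru: non-smokers with lung cancer divided by all non-smokers (row 2)
risk_unexposed <- smoking_table[2, 1] / sum(smoking_table[2, ])

# RR = Re / Ru
relative_risk <- risk_exposed / risk_unexposed
print(relative_risk)
```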

Let’s break down the syntax of what we just did. Because our data is already formatted as factors, we can point to positions in the table instead of using clunky variable references. The values within the brackets [X,Y] follow R’s convention: the first value, X, is the row, and the second, Y, is the column. You might notice that sum() takes only one argument, smoking_table[X,]. Leaving the column position empty asks R to collect every value in row X, and sum() adds them up, giving the total number of people in that exposure group. After computing the risk for exposed persons (Re) and the risk for unexposed persons (Ru), we divide the two to get a relative risk of ~3.85, which at first blush indicates a much higher risk of cancer among smokers than among non-smokers. But let’s confirm this with a more advanced function from a package made for epidemiology.

Calculating Relative Risk: Using epiR

epiR simplifies this process. We can just take our table and pass it through the epi.2by2 function:
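A sketch of the call is shown below, again on hypothetical counts rather than the survey data. The argument values follow the description in this section (cohort counts, outcomes in columns); the rr_result name is an assumption.

```r
library(epiR)

# Hypothetical 2x2 counts: rows = exposure, columns = outcome
smoking_table <- as.table(matrix(
  c(30, 70,
    10, 90),
  nrow = 2, byrow = TRUE,
  dimnames = list(smoking_status = c("smoker", "non-smoker"),
                  lung_cancer    = c("yes", "no"))
))

# Cohort-style counts, with outcome (cancer yes/no) in the columns
rr_result <- epi.2by2(dat = smoking_table,
                      method = "cohort.count",
                      conf.level = 0.95,
                      outcome = "as.columns")
print(rr_result)
```

The function reads the table cell by cell, so the exposed group must be in the first row and the positive outcome in the first column, which is why the factor level order matters.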

Here, we first tell the function what data to use, then indicate that the data should be treated as cohort counts, and finally that our outcomes (the counts of people with and without cancer) are in the columns. This code not only calculates relative risk but also provides other useful statistics, including levels of significance, the Odds Ratio (which is covered in a separate article), and confidence intervals.

Interpreting the Results: Understanding the Meaning of Relative Risk Values

The relative risk value helps us understand the likelihood of lung cancer in smokers compared to non-smokers. Values greater than 1 suggest a higher risk in the exposed group, while values less than 1 suggest a lower risk. Unsurprisingly given our scenario, our relative risk is far higher than 1, and using the added functionality of the epiR package, we also found these results to be highly significant.



Here, we covered how to take fairly raw data, including some free-form string data, turn it into a simple table, and calculate the relative risk in R.


Humanities Moment

The featured image for this article is Smoke Rings (1887) by Georgios Jakobides (Greek, 1853-1932). Aside from completing around 200 oil paintings which often fetched high prices at time of creation, Jakobides also created contemporary designs for currency in his native Greece.

