## Introduction

Odds Ratio (OR) calculations are a cornerstone of public health research, providing insight into the strength of association between an exposure and an outcome. In this tutorial, we’ll explore what an Odds Ratio is, as well as how to calculate it in Python.

## Understanding Odds Ratio

Odds Ratio (OR) is a measure used in epidemiology to compare the odds of an event occurring in one group to the odds of it occurring in another group. This is especially useful in case-control studies where you’re comparing the odds of exposure in cases (those with the disease or outcome of interest) to the odds of exposure in controls (those without the disease).

## Odds as a Concept

Before diving into odds ratios, it’s important to understand what “odds” are. In a health context, odds are a way of representing the likelihood of an event happening. If the probability of an event happening is **P**, the odds are calculated as:

[math]{\text{Odds} = \frac{P}{1 - P}}[/math]

Simply put, this equation reads as “Odds can be calculated as the probability of an event occurring [math]{(P)}[/math], divided by the probability of the event not occurring [math]{(1 - P)}[/math]”.
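As a quick illustration of the formula above, here is a minimal sketch in Python. The function name `odds_from_probability` is our own choice for this example, not part of any library.

```python
def odds_from_probability(p: float) -> float:
    """Convert a probability P into odds using P / (1 - P)."""
    # Odds are undefined when P equals 1 (division by zero),
    # and probabilities outside [0, 1) make no sense here.
    if not 0 <= p < 1:
        raise ValueError("p must be at least 0 and less than 1")
    return p / (1 - p)


# A probability of 0.75 corresponds to odds of 3 (that is, 3 to 1).
print(odds_from_probability(0.75))  # 3.0
```

Note that odds and probability are close to each other when P is small, and drift apart as P grows.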

## Calculating the Odds Ratio

Now, consider a case-control study with the following data:

- [math]{A}[/math]: Number of cases (people with the disease) who were exposed to a certain risk factor.
- [math]{B}[/math]: Number of controls (people without the disease) who were exposed to the same risk factor.
- [math]{C}[/math]: Number of cases who were not exposed to the risk factor.
- [math]{D}[/math]: Number of controls who were not exposed to the risk factor.

The odds of exposure among the cases is [math]{A/C}[/math], and the odds of exposure among the controls is [math]{B/D}[/math]. The odds ratio is calculated as:

[math]{\text{Odds Ratio (OR)} = \frac{\text{Odds of exposure in cases}}{\text{Odds of exposure in controls}} = \frac{A/C}{B/D} = \frac{A \times D}{B \times C}}[/math]

## Interpretation of the Odds Ratio

- **OR = 1**: This suggests there is no association between the exposure and the outcome.
- **OR > 1**: This indicates a positive association, meaning the exposure might increase the odds of the outcome.
- **OR < 1**: This implies a negative association, suggesting the exposure might decrease the odds of the outcome.
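The calculation and the three interpretation rules can be sketched in a few lines of Python. The counts below are made up purely for illustration, and the function names `odds_ratio` and `interpret` are our own.

```python
def odds_ratio(a: int, b: int, c: int, d: int) -> float:
    """Odds ratio for a 2x2 case-control table: (A * D) / (B * C)."""
    # A = exposed cases, B = exposed controls,
    # C = unexposed cases, D = unexposed controls.
    if b == 0 or c == 0:
        raise ValueError("B and C must be non-zero to compute the odds ratio")
    return (a * d) / (b * c)


def interpret(or_value: float) -> str:
    """Map an odds ratio to the direction of association."""
    if or_value > 1:
        return "positive association"
    if or_value < 1:
        return "negative association"
    return "no association"


# Hypothetical counts, invented for this example only.
a, b, c, d = 20, 10, 80, 90
result = odds_ratio(a, b, c, d)
print(result, interpret(result))  # 2.25 positive association
```

In practice an OR is rarely exactly 1, so the direction should be read alongside a confidence interval or p-value rather than on its own.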

## Limitations

It’s important to remember that odds ratios can overestimate the relative risk, especially when the outcome is common. They also do not, on their own, imply causation. For more explanation on the underlying mathematics and mechanics of Odds Ratios, please check out our Epi Explained series! For now, let’s get on with the calculation of Odds Ratios in Python.

## Practical Example: Smoking and Cancer

Let’s analyze a dataset to determine if smoking is associated with higher odds of lung cancer. To follow along, please download the folder for Odds Ratio on Cody’s GitHub.

#### Step 1: Load Python Libraries

As we get started, there are a few packages you’ll need ready. If you haven’t already, set up a virtual environment (venv) for this project by pointing your command line at the project folder and typing `python -m venv .venv`.

This keeps your project management cleaner: the project only sees the packages you explicitly install for it, rather than relying on one shared location for everything, where a single change can break other projects. Next, we’ll install some packages for our project: `pandas` for basic data science functions, `numpy` for numerical calculations, and `scipy` for statistical work. To bring these in, activate your virtual environment using `source .venv/bin/activate` (on Windows, `.venv\Scripts\activate`) and then enter `python -m pip install X`, where X can be replaced by any of the packages mentioned.

With all that out of the way, we can finally get to the proper coding. Starting out, we load our packages into the script and give them short aliases. Note that for `scipy` we only need to pull in a specific part of the package.
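A minimal sketch of the import step, assuming the conventional aliases `pd` and `np` (the aliases used in the original script may differ):

```python
# Load the three packages used in this tutorial.
import pandas as pd   # data loading and table manipulation
import numpy as np    # numerical helpers

# Only the Fisher's exact test function is needed from scipy.
from scipy.stats import fisher_exact
```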

#### Importing and Preparing the Dataset

We’ll use the `pandas` library to import our dataset (named “smoking_survey.csv”) and select the relevant columns, in our case `smoking_status` and `diagnosis_codes`.
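The loading step might look like the sketch below. Because the real file is not reproduced in this article, the example reads a tiny in-memory stand-in. Its rows, the `respondent_id` column, the category labels, and the semicolon-separated code format are all assumptions made for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for smoking_survey.csv. The real file's rows,
# extra columns, and value formats may differ.
csv_text = """respondent_id,smoking_status,diagnosis_codes
1,Smoker,C34.90;J44.9
2,Non-smoker,I10
3,Smoker,E11.9
"""

# With the real file you would pass the path instead, for example:
#   pd.read_csv("smoking_survey.csv", usecols=[...])
survey = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["smoking_status", "diagnosis_codes"],  # keep only what we need
)
print(survey.head())
```

Selecting columns at read time with `usecols` avoids loading fields the analysis never touches, which matters more as the survey file grows.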

#### Data Transformation for Analysis

Next, we need to define a quick function to look through our `diagnosis_codes` column and see if we have either of our two ICD codes of interest. In Python, this is fairly straightforward using a combination of `if/else`, `in`, and `or` statements. We can then `.apply` our function directly to the column of interest, which runs it on every value in that column and gives us a new field called `lung_cancer`. From this point, we can retain what are now our fields of interest, `smoking_status` and `lung_cancer`.

We now only need to make one more adjustment before we can build our contingency table: converting our variables from character (string) type to categorical. This helps structure later table work and lets us rely on the order in which the categories appear, which would otherwise depend on how the raw string values happen to sort.
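The transformation described above can be sketched as follows. The two ICD codes, the category labels (`Smoker`/`Non-smoker`, `Yes`/`No`), and the sample rows are placeholders chosen for illustration, since the article does not list the actual values. Substitute the ones used in your data.

```python
import pandas as pd

# Placeholder ICD-10 codes (assumption): replace with your codes of interest.
CODE_A = "C34.90"
CODE_B = "C34.91"


def has_lung_cancer(codes) -> str:
    """Return "Yes" if either code of interest appears in the code string."""
    # Missing values (None/NaN) are not strings, so they fall through to "No".
    if isinstance(codes, str) and (CODE_A in codes or CODE_B in codes):
        return "Yes"
    else:
        return "No"


# Hypothetical rows standing in for the survey data.
survey = pd.DataFrame({
    "smoking_status": ["Smoker", "Non-smoker", "Smoker", "Non-smoker"],
    "diagnosis_codes": ["C34.90;J44.9", "I10", "E11.9;C34.91", None],
})

# Run the function on every value in the column.
survey["lung_cancer"] = survey["diagnosis_codes"].apply(has_lung_cancer)

# Keep only the fields of interest.
survey = survey[["smoking_status", "lung_cancer"]].copy()

# Convert to categoricals with an explicit order, exposed/affected first.
survey["smoking_status"] = pd.Categorical(
    survey["smoking_status"], categories=["Smoker", "Non-smoker"]
)
survey["lung_cancer"] = pd.Categorical(
    survey["lung_cancer"], categories=["Yes", "No"]
)
print(survey)
```

Putting the exposed group and the positive outcome first means the resulting table lines up with the A, B, C, D layout used earlier.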

#### Creating a Contingency Table

Using the `pandas.crosstab` function, we create a contingency table to visualize the distribution of smoking status and lung cancer occurrence.

This creates a table with smoking status along one axis and lung cancer status along the other, where each cell counts the people who fall into that combination.
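Here is a minimal sketch of the cross-tabulation step. The ten rows and the category labels are invented for illustration, so the counts bear no relation to the real survey.

```python
import pandas as pd

# Hypothetical data: 5 smokers (3 with lung cancer), 5 non-smokers (1 with).
survey = pd.DataFrame({
    "smoking_status": pd.Categorical(
        ["Smoker"] * 5 + ["Non-smoker"] * 5,
        categories=["Smoker", "Non-smoker"],
    ),
    "lung_cancer": pd.Categorical(
        ["Yes", "Yes", "Yes", "No", "No", "Yes", "No", "No", "No", "No"],
        categories=["Yes", "No"],
    ),
})

# Rows are smoking status, columns are lung cancer status.
table = pd.crosstab(survey["smoking_status"], survey["lung_cancer"])
print(table)
```

Individual cells can be read back with label-based indexing, for example `table.loc["Smoker", "Yes"]`.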

#### Calculating the Odds Ratio

We employ the `fisher_exact` method from `scipy.stats` to calculate the Odds Ratio and obtain a p-value for statistical significance.

Here, we’re assigning two variables, `odds_ratio` and `p_value`, in a single line from the two values returned by the `fisher_exact` function. It’s worth mentioning that Python functions can return multiple values at once, so if you’re not familiar with this pattern, printing the result before unpacking it can help you see what comes back.
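A short sketch of this step, using a small hypothetical 2x2 table rather than the survey data (so the numbers it prints will not match the results reported for the real dataset):

```python
from scipy.stats import fisher_exact

# Hypothetical counts. Rows: smokers, non-smokers.
# Columns: lung cancer yes, lung cancer no.
table = [[30, 70],
         [10, 90]]

# fisher_exact returns two values, unpacked here in a single line.
odds_ratio, p_value = fisher_exact(table)

print(f"Odds Ratio: {odds_ratio:.2f}")
print(f"p-value: {p_value:.3g}")
```

With this layout, the odds ratio reported by `fisher_exact` is the same cross-product ratio (A × D) / (B × C) described earlier. A `pandas.crosstab` result can be passed in directly in place of the nested list.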

In printing these values, we can see there’s an OR value of 5.77 and a p-value of 1.4131903723243655e-95 (roughly 1.4 × 10⁻⁹⁵), which is **far** under our threshold cut-off of 0.05, meaning the association is statistically significant.

## Conclusion

This tutorial provided an understanding of Odds Ratios and demonstrated how to calculate them in Python using real-world data. This skill is essential for public health researchers and epidemiologists to assess associations in epidemiological studies.

## Humanities Moment

The featured image for this article is Ships Riding on the Seine at Rouen (1872-1873), by Claude Monet (French, 1840-1926). Monet is perhaps the most famous French Impressionist, and the movement itself takes its name from his painting Impression, Sunrise. Unlike many artists of his time, Monet was wildly popular during both his lifetime and throughout the 1900s, with caricatures, portraits and landscapes all being featured in museums and exhibits worldwide.