Interior view of a sunlit artist’s studio: a bearded man sits on a wooden chair at left, holding a palette and brushes as he works on a landscape canvas propped near a bed draped with rumpled white linens, warm light streaming across the room.

CPH Focus: Evidence-Based Approaches to Public Health: Regression Analysis: Linear Regression

In this tutorial, we will cover the fundamental concepts of linear regression, a key statistical method used in public health for examining relationships between variables. Linear regression is widely used to model the relationship between a dependent variable and one or more independent variables, making it an essential tool for public health professionals to understand and apply. By the end of this tutorial, you will have a solid grasp of linear regression and how it can be used in public health research. We will also provide practice questions to help reinforce your understanding and prepare for the Certified in Public Health (CPH) exam.

Table of Contents:

  1. Introduction to Regression Analysis
  2. What Is Linear Regression?
    • Definition of Linear Regression
    • Assumptions of Linear Regression
  3. The Linear Regression Equation
  4. Interpretation of Regression Coefficients
  5. Assessing Model Fit: R-Squared
  6. Practice Questions
  7. Conclusion

1. Introduction to Regression Analysis

Regression analysis is a statistical method used to understand the relationship between a dependent variable (outcome) and one or more independent variables (predictors). In public health research, regression analysis helps researchers explore how factors such as age, income, or smoking status relate to health outcomes like blood pressure, disease incidence, or body mass index (BMI). As an aside, BMI is an imperfect measure and is best reserved for situations where no better measure is available.

The most common type of regression is linear regression, which assumes a linear relationship between the dependent and independent variables. Linear regression is a fundamental tool for identifying trends, making predictions, and adjusting for confounding variables in public health studies.


2. What Is Linear Regression?

Linear regression is a statistical technique used to model the relationship between a continuous dependent variable and one or more independent variables. The goal of linear regression is to find the line (or hyperplane in multiple regression) that best fits the data and describes the relationship between the variables.

2.1 Definition of Linear Regression

Linear regression models the relationship between a dependent variable (Y) and one or more independent variables (X) by fitting a linear equation to the observed data. In simple linear regression (one independent variable), the equation is:

[math] Y = \beta_0 + \beta_1 X + \epsilon [/math]

Where:

  • Y is the dependent variable.
  • X is the independent variable.
  • β0 is the y-intercept (the value of Y when X = 0).
  • β1 is the slope of the line (how much Y changes for a one-unit change in X).
  • ε is the error term, the part of Y that the linear equation does not explain. In a fitted model it is estimated by the residual, the difference between the observed and predicted values of Y.

In multiple linear regression, the equation includes more than one independent variable:

[math] Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n + \epsilon [/math]

2.2 Assumptions of Linear Regression

For linear regression to provide reliable results, several assumptions must be met:

  • Linearity: The relationship between the independent variable(s) and the dependent variable should be linear.
  • Independence: The observations should be independent of each other.
  • Homoscedasticity: The variance of the residuals (errors) should be constant across all levels of the independent variable(s).
  • Normality of residuals: The residuals should be normally distributed.
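One of these assumptions, homoscedasticity, can be screened roughly in code. The sketch below is an illustration only, not a formal test: it fits a line to hypothetical data (invented here so that the scatter widens as X grows) and compares the average squared residual in the upper half of X with the lower half. The function names `fit_line` and `spread_ratio` are made up for this example; in practice analysts usually inspect residual plots or use a dedicated statistical package.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for one predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def spread_ratio(x, y):
    """Mean squared residual in the upper half of x divided by the lower half.

    A ratio far from 1 hints that the residual variance changes with x
    (heteroscedasticity). This is a rough screen, not a formal test.
    """
    intercept, slope = fit_line(x, y)
    # Pair each x with its residual, then order the pairs by x.
    pairs = sorted((xi, yi - (intercept + slope * xi)) for xi, yi in zip(x, y))
    half = len(pairs) // 2
    lower = [r ** 2 for _, r in pairs[:half]]
    upper = [r ** 2 for _, r in pairs[-half:]]
    return (sum(upper) / len(upper)) / (sum(lower) / len(lower))


# Hypothetical data: the scatter around the trend widens as x grows.
x = list(range(1, 11))
noise = [0.1, -0.1, 0.2, -0.2, 0.3, -1.0, 1.2, -1.5, 1.8, -2.0]
y = [2 * xi + e for xi, e in zip(x, noise)]

ratio = spread_ratio(x, y)
print(f"upper/lower residual spread ratio: {ratio:.1f}")
```

A ratio well above 1, as this fanning-out data produces, would be a prompt to look more closely at the residuals before trusting the model's standard errors.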

3. The Linear Regression Equation

The linear regression equation summarizes the relationship between the dependent and independent variables. The equation for simple linear regression is:

[math] Y = \beta_0 + \beta_1 X + \epsilon [/math]

The goal of linear regression is to estimate the values of β0 and β1 that minimize the sum of the squared differences between the observed and predicted values of Y. This method is called ordinary least squares (OLS).
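For a single predictor, the OLS estimates have a closed form: the slope is the sum of cross-deviations of X and Y divided by the sum of squared deviations of X, and the intercept follows from the two means. The sketch below applies this to a small hypothetical dataset (ages and systolic blood pressures invented for illustration) and then checks that moving either estimate away from the OLS solution increases the sum of squared residuals. The function names are made up for this example.

```python
def fit_simple_ols(x, y):
    """Return (intercept, slope) that minimize the sum of squared residuals."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x must take more than one distinct value")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def sum_squared_residuals(x, y, intercept, slope):
    """Sum of squared differences between observed and predicted y."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


# Hypothetical data: age in years and systolic blood pressure in mmHg.
age = [30, 40, 50, 60, 70]
sbp = [118, 121, 127, 128, 136]

b0, b1 = fit_simple_ols(age, sbp)
best = sum_squared_residuals(age, sbp, b0, b1)
print(f"intercept = {b0:.2f}, slope = {b1:.2f}")  # intercept = 104.50, slope = 0.43

# Nudging either estimate away from the OLS solution makes the fit worse.
print(best < sum_squared_residuals(age, sbp, b0 + 1.0, b1))   # True
print(best < sum_squared_residuals(age, sbp, b0, b1 + 0.05))  # True
```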

For example, in a study examining the relationship between exercise (X) and weight loss (Y), the regression equation might be:

[math] \text{Weight Loss} = 2 + 0.5 \times (\text{Hours of Exercise}) [/math]

This equation indicates that for each additional hour of exercise, weight loss increases by 0.5 units.
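As a worked instance of this equation, consider a hypothetical participant who exercises for 4 hours (a value chosen only for illustration):

[math] \text{Weight Loss} = 2 + 0.5 \times 4 = 4 [/math]

The intercept of 2 is the predicted weight loss at zero hours of exercise. A participant exercising for 5 hours would be predicted to lose 4.5 units, which is 0.5 more than at 4 hours, matching the slope.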


4. Interpretation of Regression Coefficients

The regression coefficients provide information about the relationship between the independent variable(s) and the dependent variable.

  • Intercept (β0): The intercept represents the value of Y when X = 0. It is the point where the regression line crosses the Y-axis.
  • Slope (β1): The slope represents the change in Y for a one-unit change in X. In the context of public health, this might be the change in blood pressure for each additional year of age.

In multiple regression, each coefficient represents the effect of one independent variable on the dependent variable, holding all other variables constant.
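The "holding all other variables constant" interpretation can be illustrated with a small sketch. The data below are hypothetical and are constructed to follow blood pressure = 100 + 0.5 × age − 1.0 × income exactly, so the recovered coefficients are known in advance. The fit solves the normal equations with a hand-written Gaussian elimination; this is one simple way to obtain OLS estimates, not how statistical packages necessarily do it, and all function names are made up for this example.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are not modified.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("predictors are perfectly collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_multiple_ols(rows, y):
    """Return [b0, b1, ..., bk] from the normal equations (X'X) b = X'y."""
    design = [[1.0] + list(r) for r in rows]  # leading 1 for the intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(coefs, values):
    """Predicted outcome for one set of predictor values."""
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], values))


# Hypothetical predictors: (age in years, income in tens of thousands).
rows = [(30, 4), (40, 6), (50, 3), (60, 8), (70, 5), (45, 7)]
# Outcome built from a known equation so the true coefficients are known.
y = [100 + 0.5 * a - 1.0 * inc for a, inc in rows]

coefs = fit_multiple_ols(rows, y)
print([round(c, 3) for c in coefs])

# Two people with the same income whose ages differ by one year:
diff = predict(coefs, (51, 5)) - predict(coefs, (50, 5))
print(round(diff, 3))
```

The difference between the two predictions equals the age coefficient, which is what "the effect of age, holding income constant" means in a linear model.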


5. Assessing Model Fit: R-Squared

The R-squared value is a measure of how well the regression model fits the data. It represents the proportion of the variance in the dependent variable that is explained by the independent variable(s).

[math] R^2 = 1 - \frac{\text{Sum of Squared Residuals}}{\text{Total Sum of Squares}} [/math]

An R-squared value of 1 indicates that the model explains all the variance in the dependent variable, while an R-squared value of 0 indicates that the model explains none of the variance. In public health research, an R-squared value closer to 1 suggests a better-fitting model, but even lower values can provide useful insights, especially in complex biological or social systems.
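The R-squared formula translates directly into a few lines of code. In the sketch below, the observed values are hypothetical, and the predicted values come from the least-squares line 2.2 + 0.6X fitted to them at X = 1 through 5. The function name `r_squared` is made up for this example.

```python
def r_squared(observed, predicted):
    """R^2 = 1 - (sum of squared residuals) / (total sum of squares)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("observed values must vary")
    return 1 - ss_res / ss_tot


# Hypothetical observed outcomes at X = 1, 2, 3, 4, 5.
observed = [2, 4, 5, 4, 5]
# Predictions from the least-squares line 2.2 + 0.6 * X for the same X values.
predicted = [2.8, 3.4, 4.0, 4.6, 5.2]

r2 = r_squared(observed, predicted)
print(round(r2, 3))  # 0.6
```

Here the sum of squared residuals is 2.4 and the total sum of squares is 6, so the model explains 60% of the variance in the outcome.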


6. Practice Questions

Test your understanding of linear regression with these practice questions. Try answering them before checking the solutions.

Question 1:

A study examines the relationship between daily physical activity (measured in steps) and body mass index (BMI). The regression equation is given as BMI = 30 – 0.01 × (Steps). What does the slope of -0.01 indicate?

Answer 1:

The slope of -0.01 indicates that for each additional step taken per day, BMI decreases by 0.01 units. This suggests that increased physical activity is associated with a lower BMI.


Question 2:

In a linear regression model examining the effect of age and income on blood pressure, the R-squared value is 0.65. What does this tell you about the model?

Answer 2:

An R-squared value of 0.65 means that 65% of the variance in blood pressure is explained by the independent variables (age and income) in the model, while the remaining 35% is unexplained. This indicates that the model fits the data reasonably well, accounting for a substantial portion of the variation in blood pressure.


Question 3:

A linear regression analysis finds that the p-value for the slope coefficient is 0.03. What does this p-value indicate about the relationship between the independent and dependent variables?

Answer 3:

A p-value of 0.03 is below the conventional 0.05 significance level, so the slope is statistically significantly different from zero. In other words, if the true slope were zero, an estimate at least this far from zero would be expected in only about 3% of samples. This supports an association between the independent and dependent variables. On its own, it does not show that the relationship is causal or that the effect is large.


7. Conclusion

Linear regression is a powerful and widely used statistical method for analyzing relationships between variables in public health research. By understanding how to interpret regression coefficients, assess model fit using R-squared, and check the assumptions of linear regression, public health professionals can make informed decisions based on data and draw meaningful conclusions about health outcomes.

Final Tip for the CPH Exam:

Make sure you understand the basics of linear regression, including how to interpret regression coefficients and assess model fit. Practice solving problems involving the derivation of linear regression equations and interpreting p-values or R-squared values in the context of public health research.

Humanities Moment

The featured artwork for this CPH Focus is An Artist in His Studio (1904) by John Singer Sargent (American, 1856–1925). Sargent was a preeminent portrait painter renowned for his mastery in capturing the elegance and personality of his subjects during the Edwardian era. Born in Florence and trained in Paris, Sargent spent much of his career in Europe, producing a remarkable oeuvre that includes over 900 oil paintings and 2,000 watercolors, documenting his extensive travels and artistic versatility. Once controversial for his bold style and complex identity, Sargent is now celebrated for his technical brilliance and lasting impact on portraiture and modern art.
