
Finding Causal Links in Years of Potential Life Lost

Identifying Factors that Contribute or Mitigate Premature Death

The Dataset

The data used for this assessment was collected from the 2019 County Health Rankings & Roadmaps. The master dataset was downloaded from the County Health Rankings website and processed by our analysts to include the selected variables:

  • Years of Potential Life Lost
  • Percent Smokers
  • Percent Physically Inactive
  • Percent Excessive Drinking
  • Primary Care Physician (PCP) Rate
  • Mental Health Rate
  • Preventable Hospital Stays
  • High School Graduation Rate
  • Percent Unemployed
  • Percent Children in Poverty
  • Social Association Rate
  • Violent Crime Rate
  • Injury Death Rate
  • Average Daily Pollution (PM2.5)
  • Population
  • Years of Potential Life Lost Per Capita [Calculated]

The raw values of the indicators were also transformed into natural logs, zero-centered values, and percent differences from the average value. The natural logs were used for the multivariate regression model, while the raw values, zero-centered data, and percent differences support exploratory data analysis and visualizations. The natural log transformation is used because, with both the dependent and independent variables on the log scale, coefficients can be interpreted directly as approximate percentage changes. For example, with a significant regression coefficient of 0.06, a 1% increase in the independent variable is interpreted as an approximately 0.06% increase in the dependent variable.
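
The three transformations described above can be sketched as follows. This is an illustrative Python sketch rather than the analysts' actual processing code (the analysis itself was carried out in R); the function name `transform` and the sample values are hypothetical.

```python
import math

def transform(values):
    """Return the three derived scales for one indicator (all values must be > 0)."""
    mean = sum(values) / len(values)
    return {
        "ln": [math.log(v) for v in values],                     # natural log
        "centered": [v - mean for v in values],                  # zero-centered
        "pct_diff": [100 * (v - mean) / mean for v in values],   # percent difference from the average
    }

# Hypothetical smoking rates (percent) for three counties, for illustration only.
smokers = [15.0, 20.0, 25.0]
scales = transform(smokers)
print(scales["centered"])   # [-5.0, 0.0, 5.0]
print(scales["pct_diff"])   # [-25.0, 0.0, 25.0]
```

The natural log is only defined for positive values, so any indicator with zeros would need separate handling before this step.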

    The full dataset can be found here.

    The Model

    Defining the Hypothesis

    To test the validity of this model, the following hypotheses are proposed:

    H0: None of the identified predictors has a significant relationship with YPLL Per Capita.
    H1: At least one of the identified predictors has a significant relationship with YPLL Per Capita at the 99% confidence level (p < 0.01).

    Multivariate Regression

    Using R, the statistical programming language, and RStudio, the natural log (LN) dataset was loaded into a data frame, omitting any rows with missing values. This reduced the count from 3142 to 2532 observations. The formal regression model was then built as:

    lnYPLL = B0 + B1lnPercSmokers + B2lnPercPhysInactive + B3lnPercExcessiveDrinking + B4lnPcpRate + B5lnMentHealthRate + B6lnPreventabHospStays + B7lnHsGradRate + B8lnPercUnemployed + B9lnPercChildrenInPoverty + B10lnSocialAssociationRate + B11lnViolentCrimeRate + B12lnInjuryDeathRate + B13lnAverageDailyPm2.5Pollution + Error
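
The two steps just described, dropping incomplete rows and fitting a linear model with an intercept, can be sketched in a few lines. The analysis itself was done in R (where `na.omit()` and `lm()` are the usual tools, though the source does not say which functions were used); the following is only an illustrative pure-Python sketch on synthetic data, and the helper names `drop_incomplete`, `solve`, and `ols` are hypothetical.

```python
def drop_incomplete(rows):
    """Keep rows with no missing (None) values, similar in spirit to R's na.omit()."""
    return [r for r in rows if all(v is not None for v in r)]

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):                       # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def ols(xs, y):
    """Least-squares coefficients [B0, B1, ...] via the normal equations."""
    design = [[1.0] + list(row) for row in xs]         # prepend the intercept column
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)

# Synthetic rows of (x1, x2, y) with y = 1 + 2*x1 - 3*x2 exactly; two rows are incomplete.
raw = [
    (0.0, 0.0, 1.0),
    (1.0, 0.0, 3.0),
    (0.0, 1.0, -2.0),
    (None, 1.0, 0.0),
    (1.0, 1.0, 0.0),
    (2.0, 1.0, 2.0),
    (3.0, None, 4.0),
]
rows = drop_incomplete(raw)
coef = ols([r[:2] for r in rows], [r[2] for r in rows])
print(len(raw), "->", len(rows))
print([round(c, 6) for c in coef])
```

Solving the normal equations directly is fine for a small illustration; statistical packages generally use more numerically stable decompositions.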

    Prediagnostics

    Histograms

    To ensure a valid model, the histogram of each variable was reviewed for skewness and kurtosis consistent with a normal distribution. The following visualization compiles each variable into an overall histogram. From here, a variable can be isolated and viewed independently. Double-click an indicator to change the view.


    Reviewing the histograms, it seems that most variables are normally distributed. The mental health rate has low kurtosis with uneven tails, and the high school graduation rate is extremely left-skewed. Significance testing and regression diagnostics will be used to check the validity of these variables if they are found significant.
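
Skewness and kurtosis can also be checked numerically rather than by eye. A minimal sketch, assuming the population (moment-based) definitions and using made-up sample values; `skew_kurtosis` is a hypothetical helper, not part of the original analysis:

```python
def skew_kurtosis(values):
    """Moment-based skewness and excess kurtosis (both are 0 for a normal distribution)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3

# Made-up samples: one symmetric, one with a long left tail.
symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]
left_tailed = [1.0, 8.0, 9.0, 9.0, 10.0]

sym_skew, sym_kurt = skew_kurtosis(symmetric)
left_skew, left_kurt = skew_kurtosis(left_tailed)
print(sym_skew, sym_kurt)
print(left_skew, left_kurt)
```

A negative skewness corresponds to the left-skewed shape noted for the high school graduation rate, and a negative excess kurtosis to a flatter-than-normal distribution.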

    Next, to check that there is minimal collinearity among the variables, a scatterplot matrix was also developed, as seen below. The change in gradient from teal to magenta shows the change from a low YPLL value to a high one. Hovering over a point will show the values of the two variables with the YPLL per Capita value below them. Ideally, the scatterplots among independent variables will exhibit no linear trend.


    Reviewing the scatterplot matrix, most variables seem to be independent of each other. Physical inactivity and smoking seem to be correlated, but these variables are understood to be separate lifestyle choices. The child poverty rate and unemployment also seem to exhibit a linear trend. While these two variables may be collinear, they may inform different policy decisions and are both included. The child poverty rate also seems to be connected to physical inactivity and the smoking rate.
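
The visual check for linear trends between predictors has a simple numeric counterpart: the Pearson correlation, and from it the variance inflation factor (VIF). The sketch below uses hypothetical values for two predictors; for a pair of predictors the VIF reduces to 1 / (1 - r^2), while with more predictors it would be computed from a regression of each predictor on all the others.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def vif_two_predictors(r):
    """Variance inflation factor when the model has exactly two predictors."""
    return 1 / (1 - r ** 2)

# Hypothetical county values (percent), for illustration only.
smokers = [14.0, 17.0, 19.0, 22.0, 25.0]
inactive = [20.0, 24.0, 25.0, 29.0, 33.0]

r = pearson(smokers, inactive)
print(r, vif_two_predictors(r))
```

A correlation near zero (VIF near 1) matches the "no linear trend" ideal described above; larger values signal that coefficient estimates for the pair may be unstable.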

    Regression Results

     

    Residuals:
    Min       1Q        Median    3Q       Max
    -2.79     -0.4882   -0.0119   0.4958   3.4533

    Coefficients:
                                  Estimate   Std. Error  t value  Pr(>|t|)
    (Intercept)                   -3.26197   1.190231    -2.741   0.00618   **
    lnPercSmokers                 0.782523   0.125858    6.218    5.89E-10  ***
    lnPercPhysInactive            0.64718    0.126102    5.132    3.08E-07  ***
    lnPercExcessiveDrinking       -1.10901   0.115211    -9.626   < 2e-16   ***
    lnPcpRate                     -0.39724   0.03252     -12.215  < 2e-16   ***
    lnMentHealthRate              -0.28589   0.019635    -14.56   < 2e-16   ***
    lnPreventabHospStays          -0.31323   0.055442    -5.65    1.79E-08  ***
    lnHsGradRate                  1.103129   0.212901    5.181    2.38E-07  ***
    lnPercUnemployed              -0.00242   0.069899    -0.035   0.97236
    lnPercChildrenInPoverty       0.588918   0.065734    8.959    < 2e-16   ***
    lnSocialAssociationRate       1.007705   0.040206    25.064   < 2e-16   ***
    lnViolentCrimeRate            -0.42364   0.024273    -17.453  < 2e-16   ***
    lnInjuryDeathRate             1.124204   0.073406    15.315   < 2e-16   ***
    lnAverageDailyPm2.5Pollution  -2.0251    0.087925    -23.032  < 2e-16   ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 0.7999 on 2518 degrees of freedom
    Multiple R-squared:  0.7003, Adjusted R-squared:  0.6987
    F-statistic: 452.6 on 13 and 2518 DF,  p-value: < 2.2e-16

    The above is the regression output from R, which provides the regression coefficients, standard errors, significance values, and R-squared values. All variables and the intercept are found to be significant except for Percent Unemployed.
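
The columns of the output are related: each t value is the estimate divided by its standard error, and the p-value follows from the t value. The sketch below recomputes three rows from the reported figures. It is an illustrative Python check, not part of the original analysis, and it assumes a normal approximation to the t distribution (reasonable with 2518 degrees of freedom), so the p-values will differ slightly from R's.

```python
import math

def t_value(estimate, std_error):
    """t statistic for the null hypothesis that the coefficient is zero."""
    return estimate / std_error

def two_sided_p_normal(t):
    """Two-sided p-value using the normal approximation to the t distribution."""
    return math.erfc(abs(t) / math.sqrt(2))

# (estimate, standard error) copied from the regression output above
reported = {
    "(Intercept)": (-3.26197, 1.190231),
    "lnPercSmokers": (0.782523, 0.125858),
    "lnPercUnemployed": (-0.00242, 0.069899),
}

for name, (est, se) in reported.items():
    t = t_value(est, se)
    p = two_sided_p_normal(t)
    print(f"{name:18s} t = {t:8.3f}  p ~ {p:.5f}  significant at 0.01: {p < 0.01}")
```

Under this approximation the intercept and the smoking coefficient fall below the 0.01 threshold while unemployment does not, in line with the reported output.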

    Regression Diagnostics

    Leverage vs Residuals

    Identifies influential cases and outliers

    Scale Location

    Shows whether the residuals are spread evenly across the range of fitted values

    Q-Q Plot

    Identifies whether the residuals are normally distributed

    Residuals vs Fitted

    Identifies whether the residuals show a non-linear pattern
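
As a small illustration of what the leverage-based diagnostic measures, the sketch below computes leverage (hat values) for a regression with a single predictor and an intercept. The data are hypothetical, `leverage` is an illustrative helper, and the 2p/n cutoff is a common rule of thumb rather than something taken from the original analysis.

```python
def leverage(x):
    """Hat values for a simple regression of y on x with an intercept."""
    n = len(x)
    mean = sum(x) / n
    sxx = sum((v - mean) ** 2 for v in x)
    return [1 / n + (v - mean) ** 2 / sxx for v in x]

# Hypothetical predictor values; the last observation sits far from the rest.
x = [1.0, 2.0, 3.0, 4.0, 20.0]
h = leverage(x)

p = 2                                   # parameters: intercept and slope
cutoff = 2 * p / len(x)                 # rule-of-thumb threshold (assumption)
flagged = [i for i, hi in enumerate(h) if hi > cutoff]
print(h)
print(flagged)
```

Leverage depends only on the predictor values; a point becomes influential when high leverage is combined with a large residual, which is what the plot is used to spot.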


    Analysis

    With all variables except unemployment found significant at the 99% confidence level, the null hypothesis is rejected. The adjusted R-squared value is 0.70, which suggests that the model explains 70% of the variance in Years of Potential Life Lost per Capita. Despite this, many variables relate to years of potential life lost per capita in unexpected directions, which calls into question the validity of the model. The following significant variables are rank-ordered from the most positive to the most negative coefficient, that is, from the largest contributing factors to the largest mitigating factors:
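
The adjusted R-squared and the residual degrees of freedom can be reproduced from the figures already reported (2532 observations, 13 predictors, multiple R-squared of 0.7003). A short Python check, offered as a sketch using the standard adjustment formula:

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R-squared for n observations and k predictors (plus an intercept)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

n, k = 2532, 13
residual_df = n - k - 1
adj = adjusted_r_squared(0.7003, n, k)
print(residual_df)
print(adj)
```

Because the reported R-squared is itself rounded, the recomputed value agrees with the reported 0.6987 only to about three decimal places.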

    lnInjuryDeathRate 1.124204
    lnHsGradRate 1.103129
    lnSocialAssociationRate 1.007705
    lnPercSmokers 0.782523
    lnPercPhysInactive 0.64718
    lnPercChildrenInPoverty 0.588918
    lnMentHealthRate -0.28589
    lnPreventabHospStays -0.31323
    lnPcpRate -0.39724
    lnViolentCrimeRate -0.42364
    lnPercExcessiveDrinking -1.10901
    lnAverageDailyPm2.5Pollution -2.0251
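
The ordering above can be reproduced by filtering the coefficient table to the predictors significant at the 0.01 level and sorting by estimate. A minimal Python sketch; the dictionary simply restates the reported estimates, with a flag for whether the reported p-value is below 0.01:

```python
# name: (estimate, significant at p < 0.01), copied from the regression output
coefficients = {
    "lnPercSmokers": (0.782523, True),
    "lnPercPhysInactive": (0.64718, True),
    "lnPercExcessiveDrinking": (-1.10901, True),
    "lnPcpRate": (-0.39724, True),
    "lnMentHealthRate": (-0.28589, True),
    "lnPreventabHospStays": (-0.31323, True),
    "lnHsGradRate": (1.103129, True),
    "lnPercUnemployed": (-0.00242, False),
    "lnPercChildrenInPoverty": (0.588918, True),
    "lnSocialAssociationRate": (1.007705, True),
    "lnViolentCrimeRate": (-0.42364, True),
    "lnInjuryDeathRate": (1.124204, True),
    "lnAverageDailyPm2.5Pollution": (-2.0251, True),
}

# Keep significant predictors, then sort from most positive to most negative.
ranked = sorted(
    ((name, est) for name, (est, significant) in coefficients.items() if significant),
    key=lambda item: item[1],
    reverse=True,
)
for name, est in ranked:
    print(f"{name:30s} {est: .6f}")
```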

    According to the model, the largest contributor to premature death is the Injury Death Rate, but the second largest is the High School Graduation Rate. Social Association is the third largest contributor to years of potential life lost. Smoking, inactivity, and children in poverty also predict increased years of potential life lost. Conversely, access to mental health and primary care services appears to mitigate premature death, but so do violent crime, excessive drinking, and pollution. It is expected that there are unobserved variables connected to wealth and urbanization that help explain the unexpected relationships between high school graduation rates, violent crime, pollution, and YPLL. Excessive drinking, however, is usually thought to be correlated with socialization; here the two have coefficients of opposite sign, and each runs opposite to its expected effect.
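
Because the model is fit on natural logs, each coefficient can be read as the approximate percent change in YPLL per capita associated with a 1% change in the predictor. For larger changes the exact figure differs a little from coefficient times percent change. The sketch below shows the conversion for two of the reported coefficients; `pct_change_in_y` is an illustrative helper, and the 10% change is an arbitrary example.

```python
def pct_change_in_y(beta, pct_change_in_x):
    """Percent change in y implied by a log-log coefficient beta for a given percent change in x."""
    return 100 * ((1 + pct_change_in_x / 100) ** beta - 1)

# Reported coefficients, applied to a hypothetical 10% increase in each predictor.
smoking_effect = pct_change_in_y(0.782523, 10)   # lnPercSmokers
pcp_effect = pct_change_in_y(-0.39724, 10)       # lnPcpRate
print(smoking_effect)
print(pcp_effect)
```

These are associations implied by the fitted model, holding the other predictors fixed.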