JoeBank
Investing in America's Future
Finding Causal Links in Years of Potential Life Lost
Identifying Factors that Contribute or Mitigate Premature Death
The Dataset
The data used for this assessment was collected from the 2019 County Health Rankings & Roadmaps. The master dataset was downloaded from the County Health Rankings website and processed by our analysts to include the selected variables:
The absolute values of the indicators were also transformed into natural logs, zero-centered values, and percent differences from the average value. The natural logs were used for the multivariate regression model, while the absolute values, zero-centered data, and percent differences support exploratory data analysis and visualizations. The natural log transformation is used because coefficients on this scale can be directly interpreted as approximate percentage changes. For example, with a significant regression coefficient of 0.06, a 1% increase in the independent variable is interpreted as an approximately 0.06% increase in the dependent variable.
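The three transformations described above can be sketched as follows. This is a minimal illustration in Python rather than the analysts' actual processing script; the function name `transform` and the sample rates are hypothetical.

```python
import math

def transform(values):
    """Return the natural-log, zero-centered, and percent-difference versions
    of a list of strictly positive indicator values."""
    mean = sum(values) / len(values)
    return {
        "ln": [math.log(v) for v in values],            # used by the regression
        "centered": [v - mean for v in values],         # value minus the average
        "pct_diff": [100.0 * (v - mean) / mean for v in values],  # % from average
    }

# Hypothetical smoking rates (percent) for four counties, not real data
rates = [10.0, 15.0, 20.0, 35.0]
out = transform(rates)
print(out["centered"])   # deviations from the mean of 20
print(out["pct_diff"])   # the same deviations expressed as a percent of the mean
```

Note that the natural log is only defined for positive values, so a zero or missing indicator cannot be carried into the log-scale dataset.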
The full dataset can be found here.
The Model
Defining the Hypothesis
To test the validity of this model, the following hypotheses are proposed:
H0: There is no significant relationship between YPLL per Capita and any of the identified predictors.
H1: At least one of the identified predictors has a significant relationship with YPLL per Capita at the 99% confidence level (alpha = 0.01).
Multivariate Regression
Using R, the statistical programming language, and RStudio, the natural log (LN) dataset was loaded into a data frame while omitting any rows with missing values. This reduced the count from 3142 to 2532 observations. The formal regression model was then built as:
lnYPLL = B0 + B1*lnPercSmokers + B2*lnPercPhysInactive + B3*lnPercExcessiveDrinking + B4*lnPcpRate + B5*lnMentHealthRate + B6*lnPreventabHospStays + B7*lnHsGradRate + B8*lnPercUnemployed + B9*lnPercChildrenInPoverty + B10*lnSocialAssociationRate + B11*lnViolentCrimeRate + B12*lnInjuryDeathRate + B13*lnAverageDailyPm2.5Pollution + Error
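The two steps just described, dropping incomplete rows and fitting a linear model with an intercept, can be sketched in a few lines. The original analysis was run in R; the Python below is only an illustrative stand-in that solves the ordinary least squares normal equations directly. The helper names `drop_incomplete` and `ols` and the toy data are assumptions, not part of the original workflow.

```python
import math

def drop_incomplete(rows):
    """Listwise deletion: keep only rows with no missing (None or NaN) values."""
    def present(v):
        return v is not None and not (isinstance(v, float) and math.isnan(v))
    return [r for r in rows if all(present(v) for v in r)]

def ols(X, y):
    """Fit y = b0 + b1*x1 + ... by solving the normal equations (X'X) b = X'y."""
    rows = [[1.0] + list(r) for r in X]      # prepend the intercept column
    k = len(rows[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda idx: abs(A[idx][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

# Hypothetical log-scale rows (lnY, lnX1, lnX2); one row has a missing value.
# The complete rows follow lnY = 1 + 2*lnX1 - 3*lnX2 exactly.
data = [
    (1.0, 0.0, 0.0),
    (3.0, 1.0, 0.0),
    (-2.0, 0.0, 1.0),
    (None, 2.0, 2.0),
    (0.0, 1.0, 1.0),
    (2.0, 2.0, 1.0),
]
complete = drop_incomplete(data)
coefs = ols([r[1:] for r in complete], [r[0] for r in complete])
print(len(complete), [round(c, 6) for c in coefs])
```

Solving the normal equations is fine for a small, well-conditioned example; statistical software generally uses more numerically stable decompositions for real data.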
Prediagnostics
Histograms
To ensure a valid model, the histogram of each variable was reviewed for skewness and kurtosis consistent with a normal distribution. The following visualization compiles each variable into an overall histogram. From here, a variable can be isolated and viewed independently by double-clicking on the indicator.
Reviewing the histograms, it seems that most variables are normally distributed. The mental health rate has low kurtosis with uneven tails, and the high school graduation rate is extremely left-skewed. Significance testing and regression diagnostics will help confirm the validity of these variables if they are found significant.
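The visual review of skewness and kurtosis can be complemented with numeric estimates. The sketch below uses the simple moment-based definitions (skewness as the third standardized moment, excess kurtosis as the fourth standardized moment minus 3); other software may apply small-sample corrections and return slightly different values. The function name and sample data are hypothetical.

```python
def skew_kurtosis(values):
    """Return (skewness, excess kurtosis) using population central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0

# Symmetric toy data: skewness should be zero
sym_skew, sym_kurt = skew_kurtosis([1, 2, 3, 4, 5])

# Toy data with a long left tail, loosely mimicking a left-skewed rate
left_skew, _ = skew_kurtosis([1, 8, 9, 9, 10])
print(sym_skew, sym_kurt, left_skew)
```

A negative skewness indicates a longer left tail, and a negative excess kurtosis indicates lighter tails than a normal distribution.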
Next, to ensure there is minimal collinearity among the variables, a scatterplot matrix was also developed as seen below. The change in gradient from teal to magenta shows the change from a low YPLL value to a high one. Hovering over a point will show the coordinates between the two variables with the YPLL per Capita value below it. Ideally, the scatterplots among independent variables will exhibit no linear trend.
Reviewing the scatterplot matrix, most variables seem to be independent of each other. Physical inactivity and smoking seem to be correlated, but these variables are understood to be separate lifestyle choices. The child poverty rate and unemployment also seem to exhibit a linear trend. While these two variables may be collinear, they may inform different policy decisions, so both are included. The child poverty rate also seems to be connected to physical inactivity and the smoking rate.
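A scatterplot matrix is a visual check; a pairwise Pearson correlation gives a rough numeric counterpart. The sketch below flags predictor pairs whose absolute correlation exceeds a cutoff. The 0.8 threshold, the function names, and the sample values are all illustrative assumptions and are not taken from the original analysis.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def flag_collinear(columns, threshold=0.8):
    """Return (name_a, name_b, r) for each pair with |r| >= threshold."""
    names = list(columns)
    flagged = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) >= threshold:
                flagged.append((a, b, round(r, 3)))
    return flagged

# Hypothetical log-scale values for five counties, not real data
columns = {
    "lnPercUnemployed": [1.0, 1.2, 1.5, 1.9, 2.3],
    "lnPercChildrenInPoverty": [2.5, 2.7, 3.0, 3.3, 3.8],
    "lnPcpRate": [4.0, 3.1, 4.4, 3.0, 3.9],
}
flagged = flag_collinear(columns)
print(flagged)
```

Pairwise correlation only detects collinearity between two predictors at a time; a variable that is a combination of several others would need a different check.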
Regression Results
Residuals:
Min | 1Q | Median | 3Q | Max
-2.79 | -0.4882 | -0.0119 | 0.4958 | 3.4533

Coefficients:
Term | Estimate | Std. Error | t value | Pr(>|t|) | Signif.
(Intercept) | -3.26197 | 1.190231 | -2.741 | 0.00618 | **
lnPercSmokers | 0.782523 | 0.125858 | 6.218 | 5.89E-10 | ***
lnPercPhysInactive | 0.64718 | 0.126102 | 5.132 | 3.08E-07 | ***
lnPercExcessiveDrinking | -1.10901 | 0.115211 | -9.626 | < 2e-16 | ***
lnPcpRate | -0.39724 | 0.03252 | -12.215 | < 2e-16 | ***
lnMentHealthRate | -0.28589 | 0.019635 | -14.56 | < 2e-16 | ***
lnPreventabHospStays | -0.31323 | 0.055442 | -5.65 | 1.79E-08 | ***
lnHsGradRate | 1.103129 | 0.212901 | 5.181 | 2.38E-07 | ***
lnPercUnemployed | -0.00242 | 0.069899 | -0.035 | 0.97236 |
lnPercChildrenInPoverty | 0.588918 | 0.065734 | 8.959 | < 2e-16 | ***
lnSocialAssociationRate | 1.007705 | 0.040206 | 25.064 | < 2e-16 | ***
lnViolentCrimeRate | -0.42364 | 0.024273 | -17.453 | < 2e-16 | ***
lnInjuryDeathRate | 1.124204 | 0.073406 | 15.315 | < 2e-16 | ***
lnAverageDailyPm2.5Pollution | -2.0251 | 0.087925 | -23.032 | < 2e-16 | ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’
Residual standard error: 0.7999 on 2518 degrees of freedom
Multiple R-squared: 0.7003, Adjusted R-squared: 0.6987
F-statistic: 452.6 on 13 and 2518 DF, p-value: < 2.2e-16
The above is the regression output from R, which provides the regression coefficients, standard errors, significance values, and R-squared values. All variables and the intercept are found to be significant at the 0.01 level except for the Percent Unemployed.
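Several figures in the output can be cross-checked by hand from the other reported values. The short sketch below recomputes the t value for lnPercSmokers, the residual degrees of freedom, and the adjusted R-squared, using only numbers printed in the table and the observation count stated earlier. Because the inputs are rounded, the results agree with the output only approximately.

```python
# Values copied from the regression output and the text above
n_obs = 2532            # observations after dropping incomplete rows
n_predictors = 13       # predictors in the model, excluding the intercept
estimate = 0.782523     # lnPercSmokers coefficient
std_error = 0.125858    # lnPercSmokers standard error
r_squared = 0.7003      # multiple R-squared

# t value is the estimate divided by its standard error
t_value = estimate / std_error

# Residual degrees of freedom: observations minus predictors minus intercept
df_resid = n_obs - n_predictors - 1

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 = 1 - (1 - r_squared) * (n_obs - 1) / df_resid

print(round(t_value, 3), df_resid, round(adj_r2, 4))
```

The recomputed t value and degrees of freedom match the reported 6.218 and 2518, and the adjusted R-squared lands within rounding distance of the reported 0.6987.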
Regression Diagnostics
Leverage vs Residuals
Finds influential cases and outliers
Scale-Location
Shows whether residuals are spread evenly across the range of fitted values
Q-Q Plot
Identifies whether residuals are normally distributed
Residuals vs Fitted
Identifies whether the residuals show a non-linear pattern
Analysis
With all variables except unemployment found significant at the 99% confidence level, the null hypothesis is rejected. The adjusted R-squared value is 0.70, which suggests that the model explains 70% of the variance in Years of Potential Life Lost per Capita on the log scale. Despite this, many variables relate to years of potential life lost per capita in unexpected directions, which calls into question the validity of the model. The following are the significant variables rank-ordered from the most positive to the most negative coefficient, that is, from the largest contributing factors to the largest mitigating factors:
lnInjuryDeathRate | 1.124204 |
lnHsGradRate | 1.103129 |
lnSocialAssociationRate | 1.007705 |
lnPercSmokers | 0.782523 |
lnPercPhysInactive | 0.64718 |
lnPercChildrenInPoverty | 0.588918 |
lnMentHealthRate | -0.28589 |
lnPreventabHospStays | -0.31323 |
lnPcpRate | -0.39724 |
lnViolentCrimeRate | -0.42364 |
lnPercExcessiveDrinking | -1.10901 |
lnAverageDailyPm2.5Pollution | -2.0251 |
According to the model, the largest contributor to premature death is the Injury Death Rate, but the second largest is the High School Graduation Rate. Social Association is the third largest contributor to years of potential life lost. Smoking, inactivity, and children in poverty also predict increased years of potential life lost. Conversely, access to mental health and primary care services mitigates premature death, but so do violent crime, excessive drinking, and pollution. It is expected that there are unobserved variables connected to wealth and urbanization that help explain the unexpected relationships between high school graduation rates, violent crime, pollution, and YPLL. Excessive drinking, however, is usually thought to be correlated with socialization; here the two do not move together, and each shows the opposite of its expected effect.
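Because both sides of the model are in natural logs, each coefficient can be read as an elasticity: the approximate percent change in YPLL per capita associated with a 1% change in the predictor, holding the others fixed. The sketch below converts a coefficient into the implied percent change for an arbitrary percent change in the predictor. The function name is hypothetical, and the result describes an association under the fitted model, not a demonstrated causal effect.

```python
def implied_pct_change(beta, pct_change_x):
    """Percent change in y implied by a log-log coefficient `beta`
    when x changes by `pct_change_x` percent."""
    return ((1 + pct_change_x / 100.0) ** beta - 1) * 100.0

beta_smokers = 0.782523   # lnPercSmokers coefficient from the output

# For a 1% change the result is close to the coefficient itself
one_pct = implied_pct_change(beta_smokers, 1.0)

# For larger changes the exact formula drifts from the linear approximation
ten_pct = implied_pct_change(beta_smokers, 10.0)

print(round(one_pct, 3), round(ten_pct, 3))
```

The same conversion applies to the negative coefficients, where the implied change in YPLL per capita is a decrease.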