Causal Model

Finding Causal Links in Years of Potential Life Lost

Identifying Factors that Contribute or Mitigate Premature Death

The Dataset

The data used for this assessment were collected from the 2019 County Health Rankings & Roadmaps. The master dataset was downloaded from the County Health Rankings website and processed by our analysts to include the selected variables:

Years of Potential Life Lost

Percent Smokers

Percent Physically Inactive

Percent Excessive Drinking

Primary Care Physician (PCP) Rate

Mental Health Rate

Preventable Hospital Stays

High School Graduation Rate

Percent Unemployed

Percent Children in Poverty

Social Association Rate

Violent Crime Rate

Injury Death Rate

Average Daily Pollution (PM2.5)

Population

Years of Potential Life Lost Per Capita [Calculated]

The absolute values of the indicators were also transformed into natural logs, zero-centered values, and the percent difference from the average value. The natural logs were used for the multivariate regression model, while the absolute values, zero-centered data, and percent differences support exploratory data analysis and visualizations. The natural log transformation is used because, with both the dependent and independent variables on the log scale, coefficients can be interpreted directly as approximate percentage changes. For example, with a significant regression coefficient of 0.06, a 1% increase in the independent variable is interpreted as an approximately 0.06% increase in the dependent variable.
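The three transformations and the log-log reading of a coefficient can be sketched in a few lines of Python. This is an illustrative sketch only: the function names and sample values are hypothetical, and the original processing was not necessarily done this way.

```python
import math

def transform(values):
    """Return the natural log, zero-centered, and percent-difference versions of a column."""
    mean = sum(values) / len(values)
    return {
        "ln": [math.log(v) for v in values],                   # requires strictly positive values
        "centered": [v - mean for v in values],                # value minus the column average
        "pct_diff": [100 * (v - mean) / mean for v in values]  # percent difference from the average
    }

def pct_change_in_y(coef, pct_change_in_x):
    """Exact percent change in y implied by a log-log coefficient for a given percent change in x."""
    return 100 * ((1 + pct_change_in_x / 100) ** coef - 1)

# Hypothetical indicator values for three counties.
example = transform([10.0, 20.0, 30.0])

# A coefficient of 0.06 implies that a 1% rise in x moves y by roughly 0.06%.
approx_effect = pct_change_in_y(0.06, 1.0)
```

The approximation "coefficient equals percent change" holds well for small changes in x; for larger changes the exact formula in `pct_change_in_y` is the safer reading.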

The full dataset can be found here.

The Model

Defining the Hypothesis

To test the validity of this model, the following hypotheses are proposed:

H0: None of the identified predictors has a significant relationship with YPLL Per Capita.
H1: At least one of the identified predictors has a significant relationship with YPLL Per Capita at the 99% confidence level (alpha = 0.01).

Multivariate Regression

Using R, the statistical programming language, and RStudio, the natural log (LN) dataset was loaded into a data frame, omitting any rows with missing values. This reduced the count from 3142 to 2532 observations. The formal regression model was then built as:

lnYPLL = B0 + B1lnPercSmokers + B2lnPercPhysInactive + B3lnPercExcessiveDrinking + B4lnPcpRate + B5lnMentHealthRate + B6lnPreventabHospStays + B7lnHsGradRate + B8lnPercUnemployed + B9lnPercChildrenInPoverty + B10lnSocialAssociationRate + B11lnViolentCrimeRate + B12lnInjuryDeathRate + B13lnAverageDailyPm2.5Pollution + Error
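The workflow described here (drop incomplete rows, take natural logs, fit by ordinary least squares) can be illustrated with a small Python sketch. The analysis itself was run in R; the helper names, the single predictor, and the toy rows below are assumptions made purely for illustration, and the fit solves the normal equations directly rather than calling a statistics package.

```python
import math

def drop_incomplete(rows):
    """Listwise deletion: keep only rows in which every field is present."""
    return [r for r in rows if all(v is not None for v in r.values())]

def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y,
    solved by Gauss-Jordan elimination with partial pivoting."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))]
         for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

# Toy data (hypothetical): y = 2 * sqrt(x), with one incomplete row.
rows = [
    {"x": 1.0, "y": 2.0},
    {"x": 4.0, "y": 4.0},
    {"x": None, "y": 3.0},   # dropped by listwise deletion
    {"x": 9.0, "y": 6.0},
    {"x": 16.0, "y": 8.0},
]
complete = drop_incomplete(rows)

# Design matrix with an intercept column and one log predictor.
X = [[1.0, math.log(r["x"])] for r in complete]
ln_y = [math.log(r["y"]) for r in complete]
intercept, slope = ols(X, ln_y)
```

On this toy data the fitted slope recovers the exponent 0.5 and the intercept recovers ln(2), which is the log-log analogue of the B coefficients in the model above.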

Prediagnostics

Histograms

To ensure a valid model, the histogram of each variable was reviewed for approximately normal skewness and kurtosis. The following visualization compiles each variable into an overall histogram. From here, a variable can be isolated and viewed independently. Please double-click on the indicator to change the view.

Reviewing the histograms, it seems that most variables are normally distributed. The mental health rate has low kurtosis with uneven tails, and the high school graduation rate is extremely left-skewed. Significance testing and regression diagnostics will help check the validity of these variables if they are found significant.
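The visual check can be complemented with numeric skewness and kurtosis. The following is a minimal Python sketch using the population moment definitions; the function name and sample values are hypothetical and are not taken from the dataset.

```python
def skewness_and_excess_kurtosis(values):
    """Moment-based skewness and excess kurtosis (both are 0 for a normal distribution)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n   # second central moment (variance)
    m3 = sum((v - mean) ** 3 for v in values) / n   # third central moment
    m4 = sum((v - mean) ** 4 for v in values) / n   # fourth central moment
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0

# A symmetric sample has zero skewness.
sym_skew, sym_kurt = skewness_and_excess_kurtosis([1, 2, 3, 4, 5])

# A sample bunched near its maximum with a long lower tail is left-skewed (negative skewness).
left_skew, _ = skewness_and_excess_kurtosis([1, 8, 9, 9, 10, 10, 10])
```

A strongly negative skewness corresponds to the kind of left-skewed histogram described above, while a negative excess kurtosis indicates a flatter distribution with lighter tails than the normal.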

Next, to ensure there is minimal collinearity among the variables, a scatterplot matrix was also developed as seen below. The change in gradient from teal to magenta shows the change from a low YPLL value to a high one. Hovering over a point will show the coordinates between the two variables with the YPLL per Capita value below it. Ideally, the scatterplots among independent variables will exhibit no linear trend.

Reviewing the scatterplot matrix, most variables seem to be independent of each other. Physical inactivity and smoking seem to be correlated, but these variables are understood to be separate lifestyle choices. The child poverty rate and unemployment also seem to exhibit a linear trend. While these two variables may be collinear, they may inform different policy decisions, so both are included. The child poverty rate also seems to be connected to physical inactivity and the smoking rate.
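A numeric companion to the scatterplot matrix is the pairwise Pearson correlation. The Python sketch below is illustrative only: the data are made up, and the variance inflation factor shown is the special case for a model with exactly two predictors, where it reduces to 1 / (1 - r^2).

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def vif_two_predictors(r):
    """Variance inflation factor when the model has exactly two predictors."""
    return 1.0 / (1.0 - r ** 2)

# Hypothetical values standing in for two predictors, e.g. child poverty and unemployment.
child_poverty = [1.0, 2.0, 3.0, 4.0, 5.0]
unemployment = [2.0, 1.0, 4.0, 3.0, 5.0]

r = pearson_r(child_poverty, unemployment)
vif = vif_two_predictors(r)
```

With more than two predictors, each variable's inflation factor would instead come from regressing it on all the other predictors; a common rule of thumb treats values well above 5 to 10 as a sign of problematic collinearity.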

Regression Results

Residuals:
Min	1Q	Median	3Q	Max
-2.79	-0.4882	-0.0119	0.4958	3.4533

	Estimate	Std. Error	t value	Pr(>|t|)
(Intercept)	-3.26197	1.190231	-2.741	0.00618	**
lnPercSmokers	0.782523	0.125858	6.218	5.89E-10	***
lnPercPhysInactive	0.64718	0.126102	5.132	3.08E-07	***
lnPercExcessiveDrinking	-1.10901	0.115211	-9.626	< 2e-16	***
lnPcpRate	-0.39724	0.03252	-12.215	< 2e-16	***
lnMentHealthRate	-0.28589	0.019635	-14.56	< 2e-16	***
lnPreventabHospStays	-0.31323	0.055442	-5.65	1.79E-08	***
lnHsGradRate	1.103129	0.212901	5.181	2.38E-07	***
lnPercUnemployed	-0.00242	0.069899	-0.035	0.97236
lnPercChildrenInPoverty	0.588918	0.065734	8.959	< 2e-16	***
lnSocialAssociationRate	1.007705	0.040206	25.064	< 2e-16	***
lnViolentCrimeRate	-0.42364	0.024273	-17.453	< 2e-16	***
lnInjuryDeathRate	1.124204	0.073406	15.315	< 2e-16	***
lnAverageDailyPm2.5Pollution	-2.0251	0.087925	-23.032	< 2e-16	***
---
Signif. codes:	0 ‘***’	0.001 ‘**’	0.01 ‘*’	0.05 ‘.’	0.1 ‘ ’	1

Residual standard error: 0.7999 on 2518 degrees of freedom
Multiple R-squared: 0.7003,	Adjusted R-squared: 0.6987
F-statistic: 452.6 on 13 and 2518 DF, p-value: < 2.2e-16

The above is the regression output from R, which provides the regression coefficients, standard errors, significance values, and R-squared values. All variables and the intercept are found to be significant except for Percent Unemployed.
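The summary statistics in the output can be cross-checked from one another. The Python sketch below recomputes the residual degrees of freedom, adjusted R-squared, F-statistic, and one t value from the reported figures; the variable names are illustrative, and only a few of the p-values are copied in to show the comparison against alpha = 0.01.

```python
# Values copied from the reported regression output.
n_obs = 2532
n_predictors = 13
r_squared = 0.7003

# Residual degrees of freedom: observations minus predictors minus the intercept.
df_resid = n_obs - n_predictors - 1

# Adjusted R-squared penalizes R-squared for the number of predictors.
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / df_resid

# Overall F-statistic on (n_predictors, df_resid) degrees of freedom.
f_stat = (r_squared / n_predictors) / ((1 - r_squared) / df_resid)

# Each t value is the estimate divided by its standard error (lnPercSmokers shown).
t_smokers = 0.782523 / 0.125858

# Flag terms that fail the 99% confidence threshold.
alpha = 0.01
p_values = {"(Intercept)": 0.00618, "lnPercSmokers": 5.89e-10, "lnPercUnemployed": 0.97236}
not_significant = [name for name, p in p_values.items() if p >= alpha]
```

The recomputed values agree with the printed output up to rounding of the reported R-squared.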

Regression Diagnostics

Residuals vs Leverage

Finds influential cases and outliers

Scale-Location

Shows whether residuals are spread evenly across the range of fitted values

Q-Q Plot

Identifies whether residuals are normally distributed

Residuals vs Fitted

Identifies whether the residuals show a non-linear pattern
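The quantities behind these diagnostic plots can be computed by hand for a simple one-predictor regression. The Python sketch below is an illustration under that simplifying assumption (the reported model has 13 predictors), with hypothetical data and function names.

```python
import math

def simple_ols_diagnostics(x, y):
    """Fitted values, residuals, leverage, and standardized residuals for y = a + b*x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    fitted = [intercept + slope * xi for xi in x]
    resid = [yi - fi for yi, fi in zip(y, fitted)]
    # Leverage (hat value): how far each x lies from the mean of x.
    leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    # Residual standard error with n - 2 degrees of freedom.
    sigma = math.sqrt(sum(e * e for e in resid) / (n - 2))
    std_resid = [e / (sigma * math.sqrt(1 - h)) for e, h in zip(resid, leverage)]
    return fitted, resid, leverage, std_resid

# Hypothetical data with one point far from the others in x.
x = [1.0, 2.0, 3.0, 4.0, 10.0]
y = [1.1, 1.9, 3.2, 3.9, 10.5]
fitted, resid, leverage, std_resid = simple_ols_diagnostics(x, y)
```

Residuals against fitted values reveal non-linearity, the square root of the absolute standardized residuals against fitted values gives the scale-location view, sorted standardized residuals against normal quantiles give the Q-Q plot, and standardized residuals against leverage highlight influential points such as the last observation here.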

Analysis

With all variables except unemployment found significant at the 99% confidence level, the null hypothesis is rejected. The adjusted R-squared value is 0.70, which suggests that the model explains roughly 70% of the variance in Years of Potential Life Lost per Capita. Despite this, many variables relate to years of potential life lost per capita in unexpected directions, which calls into question the validity of the model. The significant variables are listed below, ordered from the largest positive coefficient to the largest negative coefficient, that is, from the largest contributing factors to the largest mitigating factors:

lnInjuryDeathRate	1.124204
lnHsGradRate	1.103129
lnSocialAssociationRate	1.007705
lnPercSmokers	0.782523
lnPercPhysInactive	0.64718
lnPercChildrenInPoverty	0.588918
lnMentHealthRate	-0.28589
lnPreventabHospStays	-0.31323
lnPcpRate	-0.39724
lnViolentCrimeRate	-0.42364
lnPercExcessiveDrinking	-1.10901
lnAverageDailyPm2.5Pollution	-2.0251

According to the model, the largest contributor to premature death is the Injury Death Rate, but the second largest is the High School Graduation Rate. Social Association has the third largest positive coefficient for years of potential life lost. Smoking, inactivity, and children in poverty also predict increased years of potential life lost. Conversely, access to mental health and primary care services is associated with less premature death, but so are violent crime, excessive drinking, and pollution. It is expected that unobserved variables connected to wealth and urbanization help explain the counterintuitive relationships between high school graduation rates, violent crime, pollution, and YPLL. Excessive drinking, however, is usually thought to be correlated with socialization; here the two have opposite signs, and each runs contrary to its expected effect.
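The ranking and the size of each effect can be reproduced from the reported coefficients. The Python sketch below sorts them and converts each into the percent change in YPLL per capita implied by a 10% increase in the predictor; the 10% figure and the helper name are illustrative choices, not part of the original analysis.

```python
# Significant coefficients copied from the regression output.
coefficients = {
    "lnInjuryDeathRate": 1.124204,
    "lnHsGradRate": 1.103129,
    "lnSocialAssociationRate": 1.007705,
    "lnPercSmokers": 0.782523,
    "lnPercPhysInactive": 0.64718,
    "lnPercChildrenInPoverty": 0.588918,
    "lnMentHealthRate": -0.28589,
    "lnPreventabHospStays": -0.31323,
    "lnPcpRate": -0.39724,
    "lnViolentCrimeRate": -0.42364,
    "lnPercExcessiveDrinking": -1.10901,
    "lnAverageDailyPm2.5Pollution": -2.0251,
}

def effect_of_increase(coef, pct_increase=10.0):
    """Percent change in the dependent variable implied by a log-log coefficient."""
    return 100 * ((1 + pct_increase / 100) ** coef - 1)

# Largest contributing factor first, largest mitigating factor last.
ranked = sorted(coefficients.items(), key=lambda item: item[1], reverse=True)

for name, coef in ranked:
    print(f"{name:32s} {coef:9.5f} {effect_of_increase(coef):8.2f}%")
```

Reading the coefficients this way makes the implausible results easier to see: under the model, a 10% rise in average daily PM2.5 would be associated with a large drop in premature death, which supports the concern about omitted variables.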

JoeBank

Investing in America's Future
