Multilevel analysis of infant mortality data
An analysis of infant mortality rates as predicted by national income, region/continent, and whether the country exports oil. I found that a multilevel model fits the data better than either a completely pooled regression or an independent regression for each region.
I fit four models to the Leinhardt dataset (https://rdrr.io/cran/carData/man/Leinhardt.html).
The first model pooled all regions together and simply regressed infant mortality on each country's income and whether it exports oil.
The second model fit a separate independent regression for each region.
The third and fourth models were hierarchical: the third fit a separate intercept for each region with common slopes for income and oil, while the fourth fit separate intercepts and separate slopes for income, with a common slope for oil.
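In equation form, the four models can be sketched roughly as follows. The notation is illustrative shorthand and not necessarily what the notebook uses: y_i is the infant mortality outcome for country i, x_i is its income, o_i is an oil-exporter indicator, and j[i] is the region of country i. Any transformation of the variables is left unspecified, and the normal distributions on the region-level terms are an assumption about how such a hierarchy is typically set up.

```latex
\begin{align*}
\text{Model 1 (complete pooling):}\quad
  & y_i = \alpha + \beta x_i + \gamma o_i + \varepsilon_i \\
\text{Model 2 (no pooling):}\quad
  & y_i = \alpha_{j[i]} + \beta_{j[i]} x_i + \gamma_{j[i]} o_i + \varepsilon_i,
    \quad \text{each region fit separately} \\
\text{Model 3 (varying intercepts):}\quad
  & y_i = \alpha_{j[i]} + \beta x_i + \gamma o_i + \varepsilon_i,
    \quad \alpha_j \sim \mathcal{N}(\mu_\alpha, \sigma_\alpha^2) \\
\text{Model 4 (varying intercepts and slopes):}\quad
  & y_i = \alpha_{j[i]} + \beta_{j[i]} x_i + \gamma o_i + \varepsilon_i,
    \quad \alpha_j \sim \mathcal{N}(\mu_\alpha, \sigma_\alpha^2),\;
          \beta_j \sim \mathcal{N}(\mu_\beta, \sigma_\beta^2)
\end{align*}
```

In each case the residual is taken to be \varepsilon_i \sim \mathcal{N}(0, \sigma_y^2). The difference between models 2 and 4 is that model 4 ties the region-level intercepts and income slopes together through shared distributions instead of estimating each region in isolation.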
I find that the fourth model fits the data best, particularly for Africa.
I've plotted three of the four fits. The blue line indicates the hierarchical model with varying intercepts and slopes. For most of the regions, there's not a major difference between the model fits. The fits for Africa, however, differ noticeably.
The independent fit for each region (orange) misses any sort of income effect on infant mortality in Africa, likely because of the influence of "outliers" and the general shape of the data there.
Modeling every country the same (green) tends to overestimate the impact of income on the dependent variable. This might mislead us into thinking that a rise in income will have a greater impact on infant mortality than it really will, specifically in Africa.
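The two non-hierarchical fits discussed above can be sketched with plain least squares. This is a minimal illustration, not the notebook's code: the function names are hypothetical, the data are synthetic stand-ins for the Leinhardt data, and the oil indicator is left out to keep the example short.

```python
import numpy as np

def fit_line(x, y):
    """Ordinary least squares fit of y = a + b * x; returns (a, b)."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(coef[0]), float(coef[1])

def fit_pooled_and_by_region(x, y, region):
    """Return the pooled (a, b) fit and a dict of per-region (a, b) fits."""
    pooled = fit_line(x, y)
    by_region = {
        str(r): fit_line(x[region == r], y[region == r])
        for r in np.unique(region)
    }
    return pooled, by_region

# Synthetic data (an assumption for illustration): two regions whose
# income ranges and income slopes differ.
rng = np.random.default_rng(0)
region = np.repeat(np.array(["A", "B"]), 30)
x = np.concatenate([rng.uniform(4, 6, 30), rng.uniform(6, 9, 30)])
true_intercept = np.where(region == "A", 5.0, 7.0)
true_slope = np.where(region == "A", -0.1, -0.6)
y = true_intercept + true_slope * x + rng.normal(0, 0.2, 60)

pooled, by_region = fit_pooled_and_by_region(x, y, region)
print("pooled (intercept, slope):", pooled)
for name, (a, b) in by_region.items():
    print(f"region {name} (intercept, slope): ({a:.3f}, {b:.3f})")
```

The pooled fit forces a single slope on every region, while the per-region fits use only each region's own countries, which makes them sensitive to a few unusual points when a region is small or noisy.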
The hierarchical model falls between these two extremes, giving us (hopefully) a more reasonable estimate of income's impact on infant mortality. It also models what is distinctive about Africa and how it may differ from the rest of the world, even though the basic starting assumptions about the system probably still hold.
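One way to see this averaging is the standard partial-pooling result for a varying-intercept model with known variances: a group's estimate is a precision-weighted average of its own mean and the overall mean. The sketch below is only an illustration of that idea. The function name and all numbers are made up, and a full hierarchical fit estimates the variances from the data instead of taking them as given.

```python
def partial_pool_mean(group_mean, n, grand_mean, sigma_y, sigma_alpha):
    """Precision-weighted compromise between a group's mean and the grand mean.

    sigma_y:     within-group standard deviation (assumed known here)
    sigma_alpha: between-group standard deviation (assumed known here)
    """
    w_group = n / sigma_y**2        # precision of the group's own mean
    w_grand = 1.0 / sigma_alpha**2  # precision from the group-level distribution
    return (w_group * group_mean + w_grand * grand_mean) / (w_group + w_grand)

# Made-up numbers: the same group mean, observed with few or many countries.
small = partial_pool_mean(group_mean=6.0, n=3, grand_mean=4.0,
                          sigma_y=1.0, sigma_alpha=0.5)
large = partial_pool_mean(group_mean=6.0, n=300, grand_mean=4.0,
                          sigma_y=1.0, sigma_alpha=0.5)
print(small, large)
```

With few observations the estimate is pulled strongly toward the overall mean. With many it stays close to the group's own mean. The same logic applies, loosely, to region-specific slopes in a varying-slopes model.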
Here are all regions superimposed on the same graph with their hierarchical fits (solid), individual fits (dashed), and the complete-pooling fit (grey dashed).
RMSE for each of the models on the dataset is as follows:
Regression Model: 0.41524
Region Model: 0.32629
Region Hierarchical Model: 0.32937
Varying Slopes Model: 0.31604
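For reference, RMSE is the square root of the mean squared difference between observed and predicted values. A minimal sketch, with a hypothetical helper name and toy numbers that do not come from the analysis:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy values for illustration only.
print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```

Because the values above are computed on the same data the models were fit to, more flexible models will tend to score lower, so small differences between them should be read with some caution.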
The notebook for my analysis is here: https://github.com/btbyrnes/multilevel_models/blob/main/leinhardt%20heirarchical.ipynb