Regression Modeling in Practice: Logistic Regression Model

Mon, 22 August 2016 by Ernő Gólya

Logistic regression is closely related to the linear regression model, so the basic idea is the same as in a multiple regression analysis. Unlike the multiple regression model, however, the logistic regression model is designed for binary response variables. We can use a logistic regression model, together with odds ratios and confidence intervals, to determine the magnitude of the association between our explanatory variables and the response variable.

Our research question is whether there is a relationship between the female education level and the under-5 child mortality rate (U5MR) of a country. Additionally, we would like to find out how other factors, such as income per person and total expenditure on health, can potentially modify this relationship. Because we are using a logistic regression model this time, the response variable needs to be binary (categorical with 2 categories). As the child mortality rate is a quantitative variable, we have to bin it into 2 categories. The frequency distribution of child mortality across countries is skewed to the right, so we will use the global median of the under5mort variable as the cut point.

u5_abovemedian:
0: under5mort <= 19.5 (median of U5MR rates of 154 countries)
1: under5mort > 19.5 (median of U5MR rates of 154 countries)
(Data source: Gapminder)
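
The median split described above can be sketched in a few lines. This is only an illustration: the mortality values are made up (they are not the Gapminder data), and the function name `above_median` is hypothetical.

```python
from statistics import median

def above_median(values):
    """Return (cutoff, flags): flags[i] is 1 if values[i] > median, else 0."""
    cutoff = median(values)
    return cutoff, [1 if v > cutoff else 0 for v in values]

# made-up under-5 mortality rates, for illustration only
rates = [3.1, 7.4, 12.0, 28.0, 64.5, 110.2]
cutoff, flags = above_median(rates)
print(cutoff, flags)
```

Values exactly equal to the median fall into category 0, which matches the `<=` in the definition of u5_abovemedian.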

Null Hypothesis: There is no association between the female education level and the under-five child mortality rate being higher (or lower) than the median value of all countries.
Alternate Hypothesis: There is an association between the female education level and the under-five child mortality rate being higher (or lower) than the median value of all countries.

In order to better answer our research question, we will interpret odds ratios rather than raw coefficients. An odds ratio compares the odds of an event occurring in one group with the odds of it occurring in another group; for a quantitative explanatory variable, it compares the odds after a one-unit increase with the odds before it. Odds ratios are obtained by exponentiating our parameter estimates, so they are expressed on the odds scale rather than on a linear scale. An odds ratio can range from zero to positive infinity, and is centered around the value one. If we ran our model and got an odds ratio of one, it would mean that the child mortality rate is equally likely to be above the global median in countries with different levels of female education. It is then also likely that our model would be statistically non-significant. If the odds ratio is greater than one, the odds of a country reporting under5mort > 19.5 increase with more schooling for women. In contrast, if the odds ratio is below one, the odds of having child mortality above the given rate decrease in countries with better female education compared to those with fewer mean years in school.
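
To make the link between probabilities, odds and coefficients concrete, here is a minimal sketch. The two group probabilities are hypothetical numbers chosen for illustration; the coefficient is the rounded c_womenschool estimate from the first model in the Results section.

```python
import math

def odds(p):
    """Convert a probability into odds."""
    return p / (1.0 - p)

# hypothetical probabilities of above-median mortality in two groups
p_low_edu, p_high_edu = 0.60, 0.30
odds_ratio = odds(p_high_edu) / odds(p_low_edu)

# an odds ratio from a logistic regression is exp(coefficient);
# -0.6890 is the rounded c_womenschool coefficient of the first model
coef = -0.6890
or_from_coef = math.exp(coef)

print(round(odds_ratio, 3), round(or_from_coef, 3))
```

Because the coefficient is rounded in the summary table, the exponentiated value agrees with the reported odds ratio only to about three decimal places.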

Results

Logistic regression model for the association between female education and child mortality rate
                           Logit Regression Results                           
==============================================================================
Dep. Variable:         u5_abovemedian   No. Observations:                  154
Model:                          Logit   Df Residuals:                      152
Method:                           MLE   Df Model:                            1
Date:                Mon, 22 Aug 2016   Pseudo R-squ.:                  0.4281
Time:                        12:24:43   Log-Likelihood:                -61.051
converged:                       True   LL-Null:                       -106.74
                                        LLR p-value:                 1.181e-21
=================================================================================
                    coef    std err          z      P>|z|      [95.0% Conf. Int.]
---------------------------------------------------------------------------------
Intercept         0.3204      0.248      1.291      0.197        -0.166     0.807
c_womenschool    -0.6890      0.107     -6.429      0.000        -0.899    -0.479
=================================================================================

Odds Ratios
Intercept        1.377696
c_womenschool    0.502055
dtype: float64

Odds ratios with 95% confidence intervals
               Lower CI  Upper CI        OR
Intercept      0.847138  2.240539  1.377696
c_womenschool  0.406933  0.619412  0.502055

The logistic regression model for the association between the primary explanatory variable c_womenschool and the response variable u5_abovemedian shows a negative association. Mean years in school is significantly associated with child mortality, such that countries with a higher female education level are significantly less likely to have a child mortality rate above the global median (OR=0.50, 95% CI=0.41-0.62, p<0.001).
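
The confidence interval of an odds ratio can be reproduced approximately from the coefficient and its standard error. The sketch below assumes the usual normal approximation (coefficient plus or minus 1.96 standard errors) and uses the rounded values from the summary table, so it should land close to, but not exactly on, the interval in the odds ratio table.

```python
import math

coef, std_err = -0.6890, 0.107  # rounded c_womenschool values from the summary table
z = 1.96                        # approximate 97.5th percentile of the standard normal

# build the interval on the log-odds scale, then exponentiate both ends
lower = math.exp(coef - z * std_err)
upper = math.exp(coef + z * std_err)

print(round(lower, 3), round(upper, 3))
```

The interval is symmetric around the coefficient on the log-odds scale but not around the odds ratio itself, because exponentiation is not linear.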

As with multiple regression, when using logistic regression, we can continue to add variables to our model in order to evaluate multiple predictors of our binary categorical response variable.

                           Logit Regression Results                           
==============================================================================
Dep. Variable:         u5_abovemedian   No. Observations:                  154
Model:                          Logit   Df Residuals:                      151
Method:                           MLE   Df Model:                            2
Date:                Mon, 22 Aug 2016   Pseudo R-squ.:                  0.5569
Time:                        12:44:08   Log-Likelihood:                -47.298
converged:                       True   LL-Null:                       -106.74
                                        LLR p-value:                 1.523e-26
=====================================================================================
                        coef    std err          z      P>|z|      [95.0% Conf. Int.]
-------------------------------------------------------------------------------------
Intercept            -1.2607      0.607     -2.078      0.038        -2.450    -0.071
c_womenschool        -0.5102      0.120     -4.240      0.000        -0.746    -0.274
c_incomeperperson    -0.0004      0.000     -3.417      0.001        -0.001    -0.000
=====================================================================================

Odds Ratios
Intercept            0.283442
c_womenschool        0.600365
c_incomeperperson    0.999578
dtype: float64

Odds ratios with 95% confidence intervals
                   Lower CI  Upper CI        OR
Intercept          0.086280  0.931146  0.283442
c_womenschool      0.474219  0.760067  0.600365
c_incomeperperson  0.999336  0.999820  0.999578


                           Logit Regression Results                           
==============================================================================
Dep. Variable:         u5_abovemedian   No. Observations:                  154
Model:                          Logit   Df Residuals:                      151
Method:                           MLE   Df Model:                            2
Date:                Mon, 22 Aug 2016   Pseudo R-squ.:                  0.5569
Time:                        12:48:40   Log-Likelihood:                -47.294
converged:                       True   LL-Null:                       -106.74
                                        LLR p-value:                 1.517e-26
==================================================================================
                     coef    std err          z      P>|z|      [95.0% Conf. Int.]
----------------------------------------------------------------------------------
Intercept         -2.2097      0.811     -2.726      0.006        -3.798    -0.621
c_womenschool     -0.4604      0.119     -3.862      0.000        -0.694    -0.227
c_healthexpend    -0.0039      0.001     -3.614      0.000        -0.006    -0.002
==================================================================================

Odds Ratios
Intercept         0.109733
c_womenschool     0.631044
c_healthexpend    0.996112
dtype: float64

Odds ratios with 95% confidence intervals
                Lower CI  Upper CI        OR
Intercept       0.022409  0.537333  0.109733
c_womenschool   0.499565  0.797126  0.631044
c_healthexpend  0.994010  0.998218  0.996112

After adding income per person and per capita expenditure on health to the model separately, both secondary variables show a negative, significant relationship with the child mortality response variable. Income per person: OR=0.9996, 95% CI=0.9993-0.9998, p=0.001. Per capita expenditure on health: OR=0.996, 95% CI=0.994-0.998, p<0.001. These odds ratios are very close to 1, which at first sight suggests a very weak association with the dependent variable; note, however, that each odds ratio refers to a one-unit change in its explanatory variable, so its closeness to 1 partly reflects the scale of measurement. Their confidence intervals do not overlap with that of the primary explanatory variable (c_womenschool), which remains significant in both models, so they are not confounders in the relationship between mean years in school and under-five mortality.
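
Since an odds ratio describes a one-unit change, it can be rescaled to a larger step by raising it to a power. The sketch below takes the reported odds ratios and rescales them; the step sizes (1000 units of income, 100 units of health expenditure) are arbitrary choices for illustration, and reading one unit as one dollar is an assumption about the data set.

```python
# odds ratios per one-unit change, taken from the tables above
or_income_per_unit = 0.999578
or_health_per_unit = 0.996112

# OR for a k-unit change is OR ** k, because exp(k * coef) == exp(coef) ** k
or_income_per_1000 = or_income_per_unit ** 1000  # assumed: 1000 dollars of income
or_health_per_100 = or_health_per_unit ** 100    # assumed: 100 dollars of health spending

print(round(or_income_per_1000, 2), round(or_health_per_100, 2))
```

On these larger steps the odds ratios move well away from 1, which is why the per-unit values alone say little about the strength of the association.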

                           Logit Regression Results                           
==============================================================================
Dep. Variable:         u5_abovemedian   No. Observations:                  154
Model:                          Logit   Df Residuals:                      150
Method:                           MLE   Df Model:                            3
Date:                Mon, 22 Aug 2016   Pseudo R-squ.:                  0.5651
Time:                        12:50:21   Log-Likelihood:                -46.424
converged:                       True   LL-Null:                       -106.74
                                        LLR p-value:                 5.612e-26
=====================================================================================
                        coef    std err          z      P>|z|      [95.0% Conf. Int.]
-------------------------------------------------------------------------------------
Intercept            -1.9172      0.827     -2.318      0.020        -3.538    -0.296
c_womenschool        -0.4783      0.123     -3.884      0.000        -0.720    -0.237
c_incomeperperson    -0.0002      0.000     -1.227      0.220        -0.001     0.000
c_healthexpend       -0.0022      0.002     -1.300      0.194        -0.005     0.001
=====================================================================================

Odds Ratios
Intercept            0.147023
c_womenschool        0.619839
c_incomeperperson    0.999779
c_healthexpend       0.997829
dtype: float64

Odds ratios with 95% confidence intervals
                   Lower CI  Upper CI        OR
Intercept          0.029058  0.743886  0.147023
c_womenschool      0.486908  0.789061  0.619839
c_incomeperperson  0.999427  1.000132  0.999779
c_healthexpend     0.994565  1.001105  0.997829

What happens if we control for both income per person and health expenditure? As we can see, income per person (OR=0.9998, 95% CI=0.9994-1.0001, p=0.22) and per capita health expenditure (OR=0.998, 95% CI=0.995-1.001, p=0.19) are no longer significantly associated with child mortality. Here we have an example of confounding. Income per person and total expenditure on health show strong multicollinearity with each other. We would say that income per person confounds the relationship between expenditure on health and child mortality, while expenditure on health confounds the relationship between income per person and the response variable, because neither p-value remains significant when the other variable is included in the model. Further, because these variables are no longer significantly associated with under-five mortality, we would not interpret the corresponding odds ratios.
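
A quick way to check for this kind of collinearity is to look at the correlation between the two explanatory variables. The sketch below computes a Pearson correlation by hand on made-up income and health expenditure values (not the Gapminder data); the helper name `pearson_r` is hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# made-up values for five countries, for illustration only
income = [500, 1500, 4000, 12000, 30000]
health = [20, 60, 250, 900, 3000]

r = pearson_r(income, health)
print(round(r, 3))
```

A correlation close to 1 between two predictors means the model can hardly separate their individual contributions, which inflates their standard errors when both are included.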

Meanwhile, the relationship between female education and child mortality is still significant and strong (OR=0.62, 95% CI=0.49-0.79, p<0.001). These results support the hypothesis of an association between our primary explanatory variable and our response variable.

Python code

# -*- coding: utf-8 -*-
# Created on 22/08/2016
# Author Ernő Gólya

# IPython/Jupyter magic; remove this line when running as a plain Python script
%matplotlib inline
# import libraries
import pandas as pd
import numpy as np
import seaborn
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
import statsmodels.api as sm

# read in data file
data = pd.read_csv('custom_gapminder_2.csv', low_memory=False)

# set variables to numeric
data["incomeperperson"] = pd.to_numeric(data["incomeperperson"],errors='coerce')
data["under5mort"] = pd.to_numeric(data["under5mort"],errors='coerce')
data["womenschool"] = pd.to_numeric(data["womenschool"],errors='coerce')
data["healthexpend"] = pd.to_numeric(data["healthexpend"],errors='coerce')

# remove observations with NaN values in any variables of interest
data = data.dropna()

# use country names as row names/indices for plotting purposes
# (set_index also removes the "country" column; the original drop() call discarded its result)
data = data.set_index("country")

# create copy of data set
sub1 = data.copy()

# center explanatory variables
sub1['c_womenschool'] = data['womenschool'] - data['womenschool'].mean()
sub1['c_incomeperperson'] = data['incomeperperson'] - data['incomeperperson'].mean()
sub1['c_healthexpend'] = data['healthexpend'] - data['healthexpend'].mean()

# median global child mortality
under5median = sub1['under5mort'].median()

# create u5_abovemedian variable (value = 1 if under5mort > under5median, otherwise value = 0)
sub1['u5_abovemedian'] = (sub1['under5mort'] > under5median).astype(int)

# print the model summary, the odds ratios and the odds ratios with 95% confidence intervals
def report(model):
    print(model.summary())
    print("")

    # odds ratios
    print("Odds Ratios")
    print(np.exp(model.params))
    print("")

    # odds ratios with 95% confidence intervals
    print("Odds ratios with 95% confidence intervals")
    conf = model.conf_int()
    conf['OR'] = model.params
    conf.columns = ['Lower CI', 'Upper CI', 'OR']
    print(np.exp(conf))
    print("")

# logistic regression model for female education and child mortality
print("Logistic regression model for the association between female education and child mortality rate")
model1 = smf.logit('u5_abovemedian ~ c_womenschool', data=sub1).fit()
report(model1)

# logistic regression model with female education and income per person
model4 = smf.logit('u5_abovemedian ~ c_womenschool + c_incomeperperson', data=sub1).fit()
report(model4)

# logistic regression model with female education and expenditure on health
model5 = smf.logit('u5_abovemedian ~ c_womenschool + c_healthexpend', data=sub1).fit()
report(model5)

# logistic regression model with female education, income per person and expenditure on health
model6 = smf.logit('u5_abovemedian ~ c_womenschool + c_incomeperperson + c_healthexpend', data=sub1).fit()
report(model6)