Regression Modeling in Practice: Describing Data

Sun 07 August 2016 by Ernő Gólya

The data set I use for the Data Analysis and Interpretation Specialization courses by Wesleyan University is an excerpt of the data available at the Gapminder site.
Gapminder is a non-profit foundation promoting sustainable global development and achievement of the United Nations Millennium Development Goals. The indicators collected here provide an insight into the social, economic and environmental development at local, national and global levels, and allow longitudinal assessment of these variables over many years.

Sample

My sample provides values for the under-five child mortality rate (the response variable), mean years in school for women, per capita total expenditure on health, and income per person (explanatory variables and moderators) for 167 countries, using data from 2009 and 2010. In my analyses the data is usually filtered so that only observations without missing values are used, which in most cases leaves 154 observations. This consolidated and organized human population data is recorded separately for each country, and country is the unique identifier. Each row contains information about one observation (one year in one country). All variables are quantitative.
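
The drop from the full sample to the complete observations can be checked before filtering. The following is a minimal sketch on a made-up four-row table (the country labels and most values are invented for illustration, not taken from the real data set); it counts missing values per variable and the rows that survive `dropna()`.

```python
import pandas as pd

nan = float("nan")

# Toy table with invented values; the real data set has 167 rows.
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "under5mort": [105.0, 16.6, 27.4, 95.1],
    "womenschool": [0.8, 10.7, nan, 9.0],
    "healthexpend": [37.7, 240.8, 178.2, nan],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

# Keep only rows with no missing value in any column.
complete = toy.dropna()

print(missing_per_column)
print(len(toy), "rows in total,", len(complete), "complete rows")
```

Running the same two calls on the real table should show which indicator accounts for most of the excluded countries.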

Procedure

Gapminder has unified many databases from several sources, including the Institute for Health Metrics and Evaluation, US Census Bureau’s International Database, United Nations Statistics Division, World Bank, universities and non-governmental organizations, into a single database with a consistent format. Its purpose is to promote greater understanding and use of statistics and other information by making time series freely available and by showing major global development trends in videos, presentations and charts. The Gapminder database includes over 600 variables, with data from 258 current and former countries and territories. All variables used here were generated by data reporting and are observational (child mortality includes estimations). The database is under active development and is updated frequently, so identical analyses run at different times can produce different results.

The data is arranged in spreadsheet format and is available as Excel files. These files include both detailed metadata and the actual observations. There is normally one documentation sheet for each indicator, but sometimes several indicators are described together. I converted the Excel spreadsheets containing the required indicators to CSV files and then joined them into one custom data set.

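    import pandas as pd

    # one CSV file per indicator, each with a 'country' column
    f1 = pd.read_csv('csv/under5mortality.csv')
    f2 = pd.read_csv('csv/schoolyears.csv')
    f3 = pd.read_csv('csv/health.csv')
    f4 = pd.read_csv('csv/gapminder.csv')

    # inner joins on 'country' keep only the countries present in all four files
    merged_inner = f1.merge(f2, on='country').merge(f3, on='country').merge(f4, on='country')

    # show the variables of interest
    merged_inner[['country', 'under5mort', 'womenschool', 'healthexpend', 'incomeperperson']]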

Sample from table

index	country	under5mort	womenschool	healthexpend	incomeperperson
0	Afghanistan	105.0	0.8	37.666786
1	Albania	16.6	10.7	240.824785	1914.99655094922
2	Algeria	27.4	7.1	178.245066	2231.99333515006
...	...	...	...	...	...
165	Zambia	84.8	6.7	72.884346	432.226336974583
166	Zimbabwe	95.1	9.0	NaN	320.771889948584
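
Because the indicator files cover different numbers of countries, an inner join on `country` silently drops any country missing from one of them. The sketch below illustrates this on two made-up tables (the country labels are placeholders; the column names follow the ones used in this post) and uses an outer join with `indicator=True` to list what the inner join left out.

```python
import pandas as pd

# Toy tables; country labels are placeholders for illustration only.
mortality = pd.DataFrame({"country": ["A", "B", "C"],
                          "under5mort": [105.0, 16.6, 27.4]})
schooling = pd.DataFrame({"country": ["A", "B"],
                          "womenschool": [0.8, 10.7]})

# Inner join (the default for DataFrame.merge): only countries in both tables.
inner = mortality.merge(schooling, on="country")

# Outer join with indicator=True adds a '_merge' column naming the source table(s).
outer = mortality.merge(schooling, on="country", how="outer", indicator=True)
dropped = outer.loc[outer["_merge"] != "both", "country"].tolist()

print(inner)
print("Dropped by the inner join:", dropped)
```

The same check on the real files would show which countries are lost at each merge step.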

Measures

Data management was applied to remove countries for which one or more variables were not measured. I also converted the variables to numeric type. Where categorical data was needed, variables were grouped into categories.

    # assumption: 'data' holds the variables of interest from the merged table
    variables = ["under5mort", "womenschool", "healthexpend", "incomeperperson"]
    data = merged_inner[["country"] + variables].copy()

    # set variables to numeric; values that cannot be parsed become NaN
    for col in variables:
        data[col] = pd.to_numeric(data[col], errors='coerce')

    # remove observations with NaN values in any variable of interest
    data = data.dropna()

    # create categories (pd.cut bins are open on the left, closed on the right)
    data["incomeperperson_cat"] = pd.cut(
        data["incomeperperson"], [1, 1005, 3975, 12275, 65000],
        labels=["Low", "Lower middle", "Upper middle", "High"])
    data["under5mort_cat"] = pd.cut(
        data["under5mort"], [1, 10, 50, 100, 220],
        labels=["0-10", "10-50", "50-100", "100-220"])
    data["womenschool_cat"] = pd.cut(
        data["womenschool"], [0, 4, 8, 12, 16],
        labels=["0-4 years", "4-8 years", "8-12 years", "12-16 years"])
    data["healthexpend_cat"] = pd.cut(
        data["healthexpend"], [1, 100, 300, 1000, 5000, 9000],
        labels=["1-100", "100-300", "300-1000", "1000-5000", "5000-9000"])

    # median global child mortality
    under5median = data["under5mort"].median()

    # create u5_abovemedian variable (1 if under5mort > under5median, otherwise 0)
    def u5_abovemedian(row):
        return 1 if row["under5mort"] > under5median else 0

    data["u5_abovemedian"] = data.apply(u5_abovemedian, axis=1)
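
The row-wise `apply` used for the median split can also be written as a single column comparison. This is only a sketch on made-up mortality values, with the cutoff fixed at the 19.5 median reported in this post rather than computed from real data; it shows that both forms give the same 0/1 flag.

```python
import pandas as pd

# Toy mortality rates per 1,000 live births (invented values).
toy = pd.DataFrame({"under5mort": [5.0, 19.5, 19.6, 84.8]})
cutoff = 19.5  # assumed cutoff, taken from the median quoted in the text

# Row-wise version: call a function once per row.
row_wise = toy.apply(lambda row: 1 if row["under5mort"] > cutoff else 0, axis=1)

# Vectorised version: compare the whole column, then cast True/False to 1/0.
vectorised = (toy["under5mort"] > cutoff).astype(int)

print(vectorised.tolist())
```

A value exactly equal to the cutoff lands in group 0 in both versions, since the comparison is strict.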

The under-five mortality data set has been compiled from several sources, some of which are estimates: CME Info estimates, the Human Mortality Database, Gapminder’s IMR series, data extrapolated back to 1800 based on Gapminder life expectancy combined with model life tables, and other guesstimates. This response variable measures the probability that a child born in a specific year will die before reaching the age of five if subject to current age-specific mortality rates. It is expressed as a rate per 1,000 live births in each country. The data covers 261 nations and territories for the years 1800 through 2015, from which I chose the year 2010. It was binned into four categories (0-10, 10-50, 50-100, 100-220) for frequency distribution calculations. For the Chi-Square test I divided the continuous data into two categories using the median as a cutoff point: 0 if under5mort <= 19.5, 1 if under5mort > 19.5.

Mean years in school is the average number of years of school attended by women of reproductive age (15 to 44), including primary, secondary and tertiary education, during 2009. The source organization is the Institute for Health Metrics and Evaluation. This data set contains data from 175 countries from 1970 to 2009. For some explorations it was binned into four categories: 0-4, 4-8, 8-12 and 12-16 years. This is the main explanatory variable.

Income per person data comes with 213 observations from 2010. The source is the World Bank World Development Indicators. It represents the Gross Domestic Product per capita in constant 2000 US$ for the year 2010. Inflation has been taken into account, but differences in the cost of living between countries have not. It is an explanatory variable, also used as a moderator in my previous Pearson correlation test. For several analyses it was binned into four categories based on World Bank income groups (low: 1-1,005 US$, lower-middle: 1,006-3,975 US$, upper-middle: 3,976-12,275 US$, and high: > 12,275 US$).
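
How values on a bin edge are treated matters for the income groups. The sketch below applies the bin edges and labels from the code in this post to a handful of illustrative income values (chosen to sit on or near the edges, not real country figures). By default `pd.cut` builds intervals that are open on the left and closed on the right, and values outside the outermost edges become NaN.

```python
import pandas as pd

# Illustrative incomes in US$, picked to probe the bin edges.
income = pd.Series([1.0, 320.77, 1005.0, 1914.99, 12275.0, 70000.0])

bins = [1, 1005, 3975, 12275, 65000]
labels = ["Low", "Lower middle", "Upper middle", "High"]

# Default right=True: intervals are (1, 1005], (1005, 3975], and so on.
groups = pd.cut(income, bins, labels=labels)

print(pd.DataFrame({"income": income, "group": groups}))
```

Here 1,005 falls in "Low" and 12,275 in "Upper middle", while 1.0 and 70,000 are left unassigned; passing `include_lowest=True` or widening the outer edges would be one way to keep such values, if that is the intended behaviour.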

Per capita total expenditure on health is expressed in US$ at the average exchange rate for that year, in current prices, as reported by the World Health Organization. The data extends back to 1995 and includes 259 nations and territories. For the analyses performed on this explanatory variable, I use data from 2010. It is quantitative data, binned into five categories (1-100 US$, 100-300 US$, 300-1000 US$, 1000-5000 US$, 5000-9000 US$) when needed.