Data Management & Visualization: Creating Graphs for Our Data

Thu 28 July 2016 by Ernő Gólya

This is part of my coursework for the Data Management and Visualization course by Wesleyan University on Coursera.

After making the data management decisions, it is time to create visual representations of our data that help us display our findings by graphing the variables we study. My research goal is to examine whether there is a relationship between child mortality under 5 years of age and the average number of years women spend in school. I also want to look at other variables that may affect the child mortality rates of different countries. Does child mortality depend on other factors such as income level and total expenditure on health?
Based on what I read in several publications on this topic, I developed the following hypothesis: better education of women reduces the rate of death among children younger than five. Of course, there are many other factors that may or may not have a significant effect on child mortality (e.g., total spending on health or the income level of the country).

For this week's assignment, I use visual tools to display the variables and the relationships between them.

Selected variables:

under5mort, womenschool, healthexpend, incomeperperson

Plotting univariate distributions

Child mortality

Description of the under5mort variable

Describe child mortality under 5
count    154.000000
mean      39.708442
std       41.935490
min        2.400000
25%        8.350000
50%       19.500000
75%       62.750000
max      208.800000
Name: under5mort, dtype: float64

Univariate graph for child mortality

Spread: range from 2.4 to 208.8 (under-five mortality per 1,000 live births). The mean of the variable is 39.708442 with a standard deviation of 41.935490, which is larger than the mean itself.

This graph is unimodal, with its highest peak at the lowest category of 1-20. The graph is strongly skewed to the right as most of the observations are located in the lowest category with fewer observations in the other categories.

Mean years in school

Description of the womenschool variable

Describe mean years in school
count    154.000000
mean       9.011039
std        3.483417
min        1.300000
25%        6.225000
50%        9.900000
75%       11.900000
max       14.700000
Name: womenschool, dtype: float64

Univariate graph for mean years of schooling for women

Spread: range from 1.3 to 14.7 years. The mean is 9.011039, standard deviation is 3.483417.

The unimodal graph has a peak at 10-12 years. It is a left-skewed distribution, with frequency tending to increase from the lower to the higher categories.

Health expenditure

Description of the healthexpend variable

Describe health expenditures
count     154.000000
mean     1026.758235
std      1781.976820
min        11.903500
25%        73.886563
50%       270.191643
75%       859.051710
max      8361.732117
Name: healthexpend, dtype: float64

Univariate graph for total expenditure on health

Spread: range from 11.903500 to 8361.732117 (per capita US$). The mean of the variable is 1026.758235 with a high standard deviation of 1781.976820.

The unimodal graph shows an extremely right-skewed distribution with a peak at around 1-300 US$.

Income per person

Description of the incomeperperson variable

Describe income per person
count      154.000000
mean      7082.552210
std      10429.868464
min        103.775857
25%        726.172683
50%       2288.445126
75%       8020.798175
max      52301.587179
Name: incomeperperson, dtype: float64

Univariate graph for income per person

Spread: range from 103.775857 to 52301.587179 (per capita US$). The mean of the variable is 7082.552210 with a standard deviation of 10429.868464.

This graph is unimodal, with its highest peak at the first bin of Income per Person. It is skewed to the right, as most countries fall in the low and lower-middle income levels.
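
The skew directions read off the histograms can be cross-checked against the summary statistics already printed above. The sketch below is one rough way to do that, not part of the original assignment: it applies Pearson's second skewness coefficient, 3 * (mean - median) / std, to the reported mean, median (the 50% row) and standard deviation of each variable. The helper names `pearson_median_skewness` and `skew_direction` are introduced here for illustration only.

```python
# Summary statistics copied from the describe() outputs above
# (median = the "50%" row).
summary = {
    "under5mort":      {"mean": 39.708442,   "median": 19.5,        "std": 41.935490},
    "womenschool":     {"mean": 9.011039,    "median": 9.9,         "std": 3.483417},
    "healthexpend":    {"mean": 1026.758235, "median": 270.191643,  "std": 1781.976820},
    "incomeperperson": {"mean": 7082.552210, "median": 2288.445126, "std": 10429.868464},
}


def pearson_median_skewness(mean, median, std):
    """Pearson's second skewness coefficient: 3 * (mean - median) / std."""
    return 3.0 * (mean - median) / std


def skew_direction(coefficient):
    """Label the sign of the coefficient (a rough heuristic, not a formal test)."""
    if coefficient > 0:
        return "right"
    if coefficient < 0:
        return "left"
    return "none"


skewness = {name: pearson_median_skewness(**stats) for name, stats in summary.items()}
for name, coefficient in skewness.items():
    print("%-16s %+.2f  skewed %s" % (name, coefficient, skew_direction(coefficient)))
```

On the reported figures the coefficient is positive for under5mort, healthexpend and incomeperperson and negative for womenschool, which agrees with the directions described for the four histograms.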

Plotting bivariate distributions

Female education and child mortality

The bivariate analysis of under5mort (response variable) against womenschool (explanatory variable) shows a negative relationship between the two variables. The scatterplot shows that higher educational attainment among women is associated with lower child mortality rates across countries. So my original hypothesis, "...more education for women is an important factor in reducing the rate of death among children...", seems to be supported by these data.
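
The direction of a relationship seen in a scatterplot can also be summarised with a single number, the Pearson correlation coefficient. The sketch below shows the calculation on a handful of made-up values chosen only to illustrate a negative association. They are not the Gapminder observations, and the function name `pearson_r` is an assumption of this example.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up illustrative values, NOT taken from the Gapminder data set.
toy_womenschool = [2.0, 4.5, 6.0, 8.5, 10.0, 12.5, 14.0]
toy_under5mort = [150.0, 110.0, 80.0, 45.0, 25.0, 10.0, 5.0]

r = pearson_r(toy_womenschool, toy_under5mort)
print("Pearson r on toy data: %.3f" % r)
```

With the real DataFrame, the same quantity should be available as `data['womenschool'].corr(data['under5mort'])`. A negative value would be consistent with the downward-sloping regression line.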

Health care and child mortality

The second analysis compares health expenditure with child mortality. The graph shows a negative relationship between the variables: higher expenditure on health generally goes along with lower levels of child mortality. According to the previous frequency analysis, per capita expenditure on health is below 1000 US$ in 75.44% of the studied countries. Further examination of this group could reveal more details about the relationship.
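
A share such as "percentage of countries below 1000 US$" can be computed directly from a boolean mask, and the same mask selects the subgroup for a closer look. The sketch below uses a small made-up table (not the Gapminder figures) purely to show the mechanics. The names `toy`, `threshold` and `low_spenders` are assumptions of this example.

```python
import pandas as pd

# Made-up illustrative values, NOT the Gapminder figures.
toy = pd.DataFrame({
    "healthexpend": [25.0, 80.0, 300.0, 650.0, 950.0, 1500.0, 4200.0, 8000.0],
    "under5mort": [140.0, 95.0, 60.0, 30.0, 18.0, 9.0, 5.0, 4.0],
})

threshold = 1000  # US$ per capita, the cut-off mentioned in the text

# A boolean mask averages to the fraction of True values.
below = toy["healthexpend"] < threshold
share_below = below.mean() * 100
print("Share below %d US$: %.2f%%" % (threshold, share_below))

# Subset for the follow-up examination of the low-expenditure group.
low_spenders = toy[below]
print(low_spenders.describe())
```

Applied to the real data, `(data["healthexpend"] < 1000).mean() * 100` would give the corresponding share, and the filtered frame could be plotted on its own to see the relationship within the low-expenditure group.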

Income per person and child mortality

The income level of a country shows a negative relationship with its child mortality rate: the poorest countries have the highest levels of child mortality, and the countries with the highest income have the lowest rates.

The two bar graphs below show that in the African region, where the per capita income is the lowest, child mortality is at an extremely high level compared to the other regions. Of course there are many factors other than GDP per capita that contribute to this result, but those variables are beyond the scope of this brief study.

Child mortality by WHO Regions

Per capita income by WHO Regions
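
The bar heights in the two regional graphs are group means, so the underlying numbers can be tabulated with a `groupby`. The sketch below shows this on a few made-up rows. The region labels and values are placeholders, not the contents of the actual data file.

```python
import pandas as pd

# Made-up illustrative rows; region names and values are placeholders.
toy = pd.DataFrame({
    "who_region": ["Africa", "Africa", "Europe", "Europe", "Americas"],
    "under5mort": [120.0, 90.0, 6.0, 4.0, 20.0],
    "incomeperperson": [400.0, 600.0, 30000.0, 20000.0, 8000.0],
})

# Mean of each quantitative variable per region, as drawn by the bar graphs.
region_means = toy.groupby("who_region")[["under5mort", "incomeperperson"]].mean()
print(region_means.sort_values("under5mort", ascending=False))
```

Printing the table next to the graphs makes it easier to quote exact regional averages instead of reading them off the axis.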

Python code

    # -*- coding: utf-8 -*-
    # Created on 28/07/2016
    # Author Ernő Gólya

    # IPython/Jupyter magic: render plots inline (remove when running as a plain script)
    %matplotlib inline
    # Import libraries
    import pandas as pd
    import numpy as np
    import seaborn
    import matplotlib.pyplot as plt

    # Reading in data file
    data = pd.read_csv('custom_gapminder_2.csv', low_memory=False)

    #Set PANDAS to show all columns in DataFrame
    pd.set_option('display.max_columns', None)
    #Set PANDAS to show all rows in DataFrame
    pd.set_option('display.max_rows', None)

    # bug fix for display formats to avoid run time errors
    #pd.set_option('display.float_format', lambda x:'%f'%x)

    # Setting variables to numeric
    data["incomeperperson"] = pd.to_numeric(data["incomeperperson"],errors='coerce')
    data["under5mort"] = pd.to_numeric(data["under5mort"],errors='coerce')
    data["womenschool"] = pd.to_numeric(data["womenschool"],errors='coerce')
    data["healthexpend"] = pd.to_numeric(data["healthexpend"],errors='coerce')

    # Remove observations with NaN values in any variables of interest
    # Describe function returns NaN for percentiles if dataset contains NaN
    data = data.dropna()

    # Creating categories for quantitative variables
    data["incomeperperson_cat"] = pd.cut(data.incomeperperson, [1, 1000, 4000, 12000, 65000],   
                                        labels=["Low", "Lower middle", "Upper middle", "High" ])
    data["under5mort_cat"] = pd.cut(data.under5mort, [1, 40, 80, 120, 160, 220], 
                                    labels=["0-40","40-80", "80-120", "120-160", "160-220"])
    data["womenschool_cat"] = pd.cut(data.womenschool, [0, 4, 8, 12, 16],  
                                    labels=["0-4 years", "4-8 years", "8-12 years", "12-16 years"])
    data["healthexpend_cat"] = pd.cut(data.healthexpend, [1, 500, 1000, 2000, 5000, 9000],
                                         labels=["1-500", "500-1000", "1000-2000", "2000-5000", "5000-9000"])

    # Descriptive statistics for quantitative variables
    desc1 = data["under5mort"].describe()
    desc2 = data["womenschool"].describe()
    desc3 = data["healthexpend"].describe()
    desc4 = data["incomeperperson"].describe()
    # print() with a single argument behaves the same in Python 2 and Python 3
    print("Describe child mortality under 5")
    print(desc1)
    print("")
    print("Describe mean years in school")
    print(desc2)
    print("")
    print("Describe health expenditures")
    print(desc3)
    print("")
    print("Describe income per person")
    print(desc4)

    # Univariate bar graph for categorical variable - not printed
    fig=plt.figure(figsize=(10, 7), dpi= 80, facecolor='w', edgecolor='k')
    seaborn.countplot(x="under5mort_cat", data=data);
    plt.xlabel('Under-five mortality rate (per 1,000 live births) categories');
    plt.ylabel('Number of countries')
    plt.title('Child mortality under the age of 5 - frequency distribution');

    # Univariate histogram for quantitative variable
    fig=plt.figure(figsize=(10, 7), dpi= 80, facecolor='w', edgecolor='k')
    seaborn.distplot(data["under5mort"].dropna(), kde=False, rug=False);
    plt.xlabel('Under-five mortality rate (per 1,000 live births)');
    plt.ylabel('Number of countries')
    plt.title('Child mortality under the age of 5 - frequency distribution');

    # Univariate bar graph for categorical variable - not printed
    fig=plt.figure(figsize=(10, 7), dpi= 80, facecolor='w', edgecolor='k')
    seaborn.countplot(x="womenschool_cat", data=data);
    plt.ylabel('Number of countries')
    plt.xlabel('Mean years of schooling for women categories');
    plt.title('Mean years of schooling for women of reproductive age, 15 to 44 - frequency distribution');

    # Univariate histogram for quantitative variable
    fig=plt.figure(figsize=(10, 7), dpi= 80, facecolor='w', edgecolor='k')
    seaborn.distplot(data["womenschool"].dropna(), kde=False, rug=False);
    plt.ylabel('Number of countries')
    plt.xlabel('Mean years of schooling for women');
    plt.title('Mean years of schooling for women of reproductive age, 15 to 44 - frequency distribution');

    # Univariate bar graph for categorical variable - not printed
    fig=plt.figure(figsize=(10, 7), dpi= 80, facecolor='w', edgecolor='k')
    seaborn.countplot(x="healthexpend_cat", data=data);
    plt.ylabel('Number of countries')
    plt.xlabel('Per capita total expenditure on health (US$) categories');
    plt.title('Per capita total expenditure on health expressed at average exchange rate in US$ - frequency distribution');

    # Univariate histogram for quantitative variable
    fig=plt.figure(figsize=(10, 7), dpi= 80, facecolor='w', edgecolor='k')
    seaborn.distplot(data["healthexpend"].dropna(), kde=False, rug=False);
    plt.ylabel('Number of countries')
    plt.xlabel('Per capita total expenditure on health (US$)');
    plt.title('Per capita total expenditure on health expressed at average exchange rate in US$ - frequency distribution');

    # Univariate bar graph for categorical variable - not printed
    fig=plt.figure(figsize=(10, 7), dpi= 80, facecolor='w', edgecolor='k')
    seaborn.countplot(x="incomeperperson_cat", data=data);
    plt.ylabel('Number of countries')
    plt.xlabel('Income per person (US$) level');
    plt.title('2010 Gross Domestic Product per capita in constant 2000 US$ - frequency distribution');

    # Univariate histogram for quantitative variable
    fig=plt.figure(figsize=(10, 7), dpi= 80, facecolor='w', edgecolor='k')
    seaborn.distplot(data["incomeperperson"].dropna(), kde=False, rug=False, bins=15);
    plt.ylabel('Number of countries')
    plt.xlabel('Income per person (US$)');
    plt.title('2010 Gross Domestic Product per capita in constant 2000 US$ - frequency distribution');

    # basic scatterplot:  Q->Q
    fig=plt.figure(figsize=(10, 7), dpi= 80, facecolor='w', edgecolor='k')
    scat1 = seaborn.regplot(x='womenschool', y='under5mort', fit_reg=True, data=data)
    plt.xlabel('Mean years of schooling for women')
    plt.ylabel('Children dying before the age of 5 per 1,000 live births')
    plt.title('Association between mean years of schooling for women (age 15 to 44) and child mortality');

    # basic scatterplot:  Q->Q
    fig=plt.figure(figsize=(10, 7), dpi= 80, facecolor='w', edgecolor='k')
    scat2 = seaborn.regplot(x='healthexpend', y='under5mort', fit_reg=True, data=data)
    plt.xlabel('Per capita expenditure on health (US$)')
    plt.ylabel('Children dying before the age of 5 per 1,000 live births')
    plt.title('Association between per capita total expenditure on health and child mortality');

    # basic scatterplot:  Q->Q
    fig=plt.figure(figsize=(10, 7), dpi= 80, facecolor='w', edgecolor='k')
    scat3 = seaborn.regplot(x='incomeperperson', y='under5mort', fit_reg=True, data=data)
    plt.xlabel('2010 GDP per capita (US$)')
    plt.ylabel('Children dying before the age of 5 per 1,000 live births')
    plt.title('Association between income per person (US$) and child mortality');

    # bivariate bar graph C->Q
    # (seaborn 0.9 renamed factorplot to catplot and its size argument to height)
    seaborn.catplot(x='who_region', y='under5mort', data=data, kind='bar', ci=None, aspect=2, height=6)
    plt.xlabel('WHO Regions')
    plt.ylabel('Under-five mortality rate (per 1,000 live births)')
    plt.title('Average child mortality rate by WHO Regions');

    # bivariate bar graph C->Q
    # (seaborn 0.9 renamed factorplot to catplot and its size argument to height)
    seaborn.catplot(x='who_region', y='incomeperperson', data=data, kind='bar', ci=None, aspect=2, height=6)
    plt.xlabel('WHO Regions')
    plt.ylabel('2010 GDP per capita (US$)')
    plt.title('Average per capita income by WHO Regions');
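
A note on the binning step in the code above: `pd.cut` builds right-closed intervals by default, so with edges starting at 1 a value of exactly 1 would not fall into any bin unless `include_lowest=True` is passed. The reported minimums (for example 2.4 for under5mort) lie above the first edge, so this should not affect the results here. The sketch below probes the edges with a few hypothetical boundary values to show the behaviour.

```python
import pandas as pd

edges = [1, 40, 80, 120, 160, 220]
labels = ["0-40", "40-80", "80-120", "120-160", "160-220"]

# Hypothetical boundary values chosen to probe the bin edges.
probe = pd.Series([1.0, 2.4, 40.0, 40.1, 208.8])

# Default: intervals are (1, 40], (40, 80], ... so 1.0 is left unbinned.
default_bins = pd.cut(probe, edges, labels=labels)
# include_lowest=True closes the first interval on the left as well.
with_lowest = pd.cut(probe, edges, labels=labels, include_lowest=True)

print(pd.DataFrame({"value": probe,
                    "default": default_bins,
                    "include_lowest": with_lowest}))
```

The value 40.0 lands in the "0-40" category and 40.1 in "40-80", which shows that each label covers its upper edge but not its lower one.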