Data Management & Visualization: Making Data Management Decisions

k 26 július 2016 by Ernő Gólya

The following Python code is an example of how to make and implement data management decisions. In the previous assignment I already created categories for the relevant variables, as the dataset I study mostly have continuous data. I revise and alter some of the categories to be more similar to those used by World Bank and WHO.

World Bank income groups: WHO Member States are grouped into 4 income groups (low, lower-middle, upper-middle, and high) based on the World Bank list of analytical income classification of economies. Although it uses national income per capita estimates, for sake of demonstration I will apply these categories to the GDP per capita (incomeperperson) variable with different thresholds.

I will add a new variable to my subset: WHO regions. WHO Member States are grouped into 6 WHO regions: African Region, Region of the Americas, South-East Asia Region, European Region, Eastern Mediterranean Region, and Western Pacific Region. I create crosstabs to highlight the differenes between these areas.

I use a subset of the dataframe including only the variables of my interest. Frequency distribution calculations exclude countries with missing values (NaN) of the specific variable (dropna=False parameter is omitted, default value is True). Minimum, maximum, mean, median calculations are also presented.

# -*- coding: utf-8 -*-
# Created on 24/07/2016
# Author Ernő Gólya

# Import libraries
import pandas as pd
import numpy as np

# Reading in data file
data = pd.read_csv('custom_gapminder.csv', low_memory=False)

# Setting variables to numeric
data["incomeperperson"] = pd.to_numeric(data["incomeperperson"],errors='coerce')
data["under5mort"] = pd.to_numeric(data["under5mort"],errors='coerce')
data["womenschool"] = pd.to_numeric(data["womenschool"],errors='coerce')
data["healthexpend"] = pd.to_numeric(data["healthexpend"],errors='coerce')

# Create subset including only variables of interest
sub1 = data[['country','under5mort','womenschool','healthexpend','incomeperperson']]

# Make a copy of subsetted data
sub2 = sub1.copy()

# Number of countries in subset
print "Subset contains:", len(sub2.index), "countries."
print ""

# preview dataset
print sub2.head(n=10), "\n"

# Creating categories
sub2["incomeperperson_cat"] = pd.cut(sub2.incomeperperson, [1, 1000, 4000, 12000, 65000],   
labels=["Low", "Lower middle", "Upper middle", "High" ])
sub2["under5mort_cat"] = pd.cut(sub2.under5mort, [1, 10, 50, 100, 220], labels=["0-10", "10-50", "50-100", "100-220"])
sub2["womenschool_cat"] = pd.cut(sub2.womenschool, [0, 4, 8, 12, 16],  
                                    labels=["0-4 years", "4-8 years", "8-12 years", "12-16 years"])
sub2["healthexpend_cat"] = pd.cut(sub2.healthexpend, [1, 100, 300, 1000, 5000, 9000],   
                                        labels=["1-100", "100-300", "300-1000", "1000-5000", "5000-9000" ])

#frequency distributions with minimum, maximum, mean, median calculation

#frequency distribution for incomeperperson_cat
print "GDP value range in dataset: ", sub2['incomeperperson'].min(),  " - ", sub2['incomeperperson'].max()
print "Mean: ",np.nanmean(sub2.incomeperperson)
print "Median: ",np.nanmedian(sub2.incomeperperson)
print "Count for incomeperperson. 2010 Gross Domestic Product per capita in constant 2000 USD."
c4 = sub2['incomeperperson_cat'].value_counts(sort=False)
print c4

print "Percentage for incomeperperson. 2010 Gross Domestic Product per capita in constant 2000 USD."
p4 = sub2['incomeperperson_cat'].value_counts(sort=False, normalize=True)
print p4 * 100
print "\n"

#frequency distribution for under5mort_cat
print "Child mortality value range in dataset: ", sub2['under5mort'].min(),  " - ", sub2['under5mort'].max()
print "Mean: ",np.nanmean(sub2.under5mort)
print "Median: ",np.nanmedian(sub2.under5mort)
print "Count for child mortality under the age of 5. Expressed as a rate per 1,000 live births."
c1 = sub2['under5mort_cat'].value_counts(sort=False)
print c1

print "Percentage for child mortality under the age of 5. Expressed as a rate per 1,000 live births."
p1 = sub2['under5mort_cat'].value_counts(sort=False, normalize=True)
print p1 * 100
print "\n"

#frequency distribution for womenschool_cat
print "Mean years in school value range in dataset: ", sub2['womenschool'].min(),  " - ", sub2['womenschool'].max()
print "Mean: ",np.nanmean(sub2.womenschool)
print "Median: ",np.nanmedian(sub2.womenschool)
print "Count for Mean years in school. Women of reproductive age, 15 to 44."
c2 = sub2['womenschool_cat'].value_counts(sort=False)
print c2

print "Percentage for Mean years in school. Women of reproductive age, 15 to 44."
p2 = sub2['womenschool_cat'].value_counts(sort=False, normalize=True)
print p2 * 100
print "\n"

#frequency distribution for healthexpend_cat
print "Per capita total expenditure on health value range in dataset: ", sub2['healthexpend'].min(),   
   " - ", sub2['healthexpend'].max()
print "Mean: ",np.nanmean(sub2.healthexpend)
print "Median: ",np.nanmedian(sub2.healthexpend)
print "Count for healthexpend. Per capita total expenditure on health expressed in US$."
c3 = sub2['healthexpend_cat'].value_counts(sort=False)
print c3

print "Percentage for healthexpend. Per capita total expenditure on health expressed in US$."
p3 = sub2['healthexpend_cat'].value_counts(sort=False, normalize=True)
print p3 * 100

Frequency tables

    Subset contains: 167 countries.

    GDP value range in dataset:  103.775857241  -  52301.587179
    Mean:  6876.91364439
    Median:  2226.50571315
    Count for incomeperperson. 2010 Gross Domestic Product per capita in constant 2000 USD.
    Low             50
    Lower middle    51
    Upper middle    30
    High            29
    Name: incomeperperson_cat, dtype: int64
    Percentage for incomeperperson. 2010 Gross Domestic Product per capita in constant 2000 USD.
    Low             31.250
    Lower middle    31.875
    Upper middle    18.750
    High            18.125
    Name: incomeperperson_cat, dtype: float64


    Child mortality value range in dataset:  2.4  -  208.8
    Mean:  40.8915662651
    Median:  21.3
    Count for child mortality under the age of 5. Expressed as a rate per 1,000 live births.
    0-10       45
    10-50      67
    50-100     35
    100-220    19
    Name: under5mort_cat, dtype: int64
    Percentage for child mortality under the age of 5. Expressed as a rate per 1,000 live births.
    0-10       27.108434
    10-50      40.361446
    50-100     21.084337
    100-220    11.445783
    Name: under5mort_cat, dtype: float64


    Mean years in school value range in dataset:  0.8  -  14.7
    Mean:  8.84850299401
    Median:  9.6
    Count for Mean years in school. Women of reproductive age, 15 to 44.
    0-4 years      21
    4-8 years      39
    8-12 years     72
    12-16 years    35
    Name: womenschool_cat, dtype: int64
    Percentage for Mean years in school. Women of reproductive age, 15 to 44.
    0-4 years      12.574850
    4-8 years      23.353293
    8-12 years     43.113772
    12-16 years    20.958084
    Name: womenschool_cat, dtype: float64


    Per capita total expenditure on health value range in dataset:  11.90349972  -  8361.732117
    Mean:  1009.4234372
    Median:  258.01889645
    Count for healthexpend. Per capita total expenditure on health expressed in US$.
    1-100        50
    100-300      31
    300-1000     45
    1000-5000    25
    5000-9000     7
    Name: healthexpend_cat, dtype: int64
    Percentage for healthexpend. Per capita total expenditure on health expressed in US$.
    1-100        31.645570
    100-300      19.620253
    300-1000     28.481013
    1000-5000    15.822785
    5000-9000     4.430380
    Name: healthexpend_cat, dtype: float64

Use crosstab() function to create cross counts for variables

    #crosstabs evaluating counts in womenshool and under5mort categories
    print "Cross-tabulation of womenshool and under5mort categories"
    print pd.crosstab(sub2['womenschool_cat'], sub2['under5mort_cat'])
    print ""

    #crosstabs evaluating counts in healthexpend and under5mort categories
    print "Cross-tabulation of healthexpend and under5mort categories"
    print pd.crosstab(sub2['healthexpend_cat'], sub2['under5mort_cat'])
    print ""

    #crosstabs evaluating counts in incomeperperson and under5mort categories
    print "Cross-tabulation of incomeperperson and under5mort categories"
    print pd.crosstab(sub2['incomeperperson_cat'], sub2['under5mort_cat'])

Crosstabs

    Cross-tabulation of womenshool and under5mort categories
    under5mort_cat   0-10  10-50  50-100  100-220
    womenschool_cat                              
    0-4 years           0      1       8       12
    4-8 years           0     16      17        6
    8-12 years         20     41      10        1
    12-16 years        25      9       0        0

    Cross-tabulation of healthexpend and under5mort categories
    under5mort_cat    0-10  10-50  50-100  100-220
    healthexpend_cat                              
    1-100                0      8      27       15
    100-300              0     26       3        2
    300-1000            15     25       4        1
    1000-5000           23      2       0        0
    5000-9000            7      0       0        0

    Cross-tabulation of incomeperperson and under5mort categories
    under5mort_cat       0-10  10-50  50-100  100-220
    incomeperperson_cat                              
    Low                     0      9      26       15
    Lower middle            5     40       5        1
    Upper middle           12     15       2        1
    High                   28      1       0        0

Add new variable (coloumn) to subset

# WHO regions dictionary
who_regions = {
        'African Region': 
            ['Algeria', 'Angola', 'Benin', 'Botswana', 'Burkina Faso', 'Burundi', 'Cameroon',
             'Cape Verde', 'Central African Republic', 'Chad', 'Comoros', "Cote d'Ivoire", 
             'Congo, Dem. Rep.', 'Equatorial Guinea', 'Eritrea', 'Ethiopia', 'Gabon', 'Gambia',
             'Ghana', 'Guinea', 'Guinea-Bissau', 'Kenya', 'Lesotho', 'Liberia', 'Madagascar',
             'Malawi', 'Mali', 'Mauritania', 'Mauritius', 'Mozambique', 'Namibia', 'Niger',
             'Nigeria', 'Congo, Rep.', 'Rwanda', 'Sao Tome and Principe', 'Senegal', 'Seychelles',
             'Sierra Leone', 'South Africa', 'Swaziland', 'Togo', 'Uganda', 'Tanzania', 'Zambia', 'Zimbabwe'],
        'Region of the Americas':
            ['Antigua and Barbuda', 'Argentina', 'Bahamas', 'Barbados', 'Belize', 'Bolivia', 'Brazil', 'Canada',
             'Chile', 'Colombia', 'Costa Rica', 'Cuba', 'Dominica', 'Dominican Republic', 'Ecuador', 'El Salvador',
             'Grenada', 'Guatemala', 'Guyana', 'Haiti', 'Honduras', 'Jamaica', 'Mexico', 'Nicaragua', 'Panama',
             'Paraguay', 'Peru', 'Saint Kitts and Nevis', 'Saint Lucia', 'Saint Vincent and the Grenadines',
             'Suriname', 'Trinidad and Tobago', 'United States', 'Uruguay', 'Venezuela'],
        'South-East Asia Region': 
            ['Bangladesh', 'Bhutan', 'Democratic People\xe2\x80\x99s Republic of Korea', 'India', 'Indonesia',
             'Maldives', 'Myanmar', 'Nepal', 'Sri Lanka', 'Thailand', 'Timor-Leste'],
        'European Region': 
            ['Albania', 'Andorra', 'Armenia', 'Austria', 'Azerbaijan', 'Belarus', 'Belgium', 'Bosnia and Herzegovina',
             'Bulgaria', 'Croatia', 'Cyprus', 'Czech Republic', 'Denmark', 'Estonia', 'Finland', 'France', 'Georgia',
             'Germany', 'Greece', 'Hungary', 'Iceland', 'Ireland', 'Israel', 'Italy', 'Kazakhstan', 'Kyrgyzstan',
             'Latvia', 'Lithuania', 'Luxembourg', 'Malta', 'Monaco', 'Montenegro', 'Netherlands', 'Norway', 'Poland',
             'Portugal', 'Moldova', 'Romania', 'Russia', 'San Marino', 'Serbia', 'Slovak Republic', 'Slovenia',
             'Spain', 'Sweden', 'Switzerland', 'Tajikistan', 'Macedonia, FYR', 'Turkey', 'Turkmenistan', 'Ukraine',
             'United Kingdom', 'Uzbekistan'], 
        'Eastern Mediterranean Region': 
            ['Afghanistan', 'Bahrain', 'Djibouti', 'Egypt', 'Iran', 'Iraq', 'Jordan', 'Kuwait', 'Lebanon',
             'Libyan Arab Jamahiriya', 'Morocco', 'Oman', 'Pakistan', 'Qatar', 'Saudi Arabia', 'Somalia',
             'Sudan', 'Syria', 'Tunisia', 'United Arab Emirates', 'Yemen'], 
        'Western Pacific Region': 
            ['Australia', 'Brunei Darussalam', 'Cambodia', 'China', 'Cook Islands', 'Fiji', 'Japan',
             'Kiribati', 'Lao People\xe2\x80\x99s Democratic Republic', 'Malaysia', 'Marshall Islands',
             'Micronesia (Federated States of)', 'Mongolia', 'Nauru', 'New Zealand', 'Niue', 'Palau',
             'Papua New Guinea', 'Philippines', 'Republic of Korea', 'Samoa', 'Singapore', 'Solomon Islands',
             'Taiwan', 'Tonga', 'Tuvalu', 'Vanuatu', 'Vietnam']}

# Swap keys - values in who_regions dictionary
d = {}
for key in who_regions:
    for value in who_regions[key]:
        d[value] = key

# Add coloumn who_region to subset
sub2['who_region']= sub2['country'].map(d)
print "Sample of subset with new variable"
sub2[['country','under5mort','womenschool','healthexpend','incomeperperson', 'who_region']].tail(n=5)

Sample of subset with new variable

	country	under5mort	womenschool	healthexpend	incomeperperson	who_region
162	Venezuela	16.6	10.5	663.387571	5528.363114	Region of the Americas
163	West Bank and Gaza	23.8	11.0	NaN	NaN	NaN
164	Vietnam	24.8	8.4	82.872177	722.807559	Western Pacific Region
165	Zambia	84.8	6.7	72.884346	432.226337	African Region
166	Zimbabwe	95.1	9.0	NaN	320.771890	African Region

Crosstabs with new variable

    #crosstabs evaluating counts in who_region and under5mort categories
    print "Cross-tabulation of who_region and under5mort categories"
    print pd.crosstab(sub2['who_region'], sub2['under5mort_cat'], normalize=False)
    print ""

    #crosstabs evaluating counts in who_region and womenshool categories
    print "Cross-tabulation of who_region and womenshool categories"
    print pd.crosstab(sub2['who_region'], sub2['womenschool_cat'], normalize=False)
    print ""

    #crosstabs evaluating counts in who_region and healthexpend categories
    print "Cross-tabulation of who_region and healthexpend categories"
    print pd.crosstab(sub2['who_region'], sub2['healthexpend_cat'], normalize=False)
    print ""

    #crosstabs evaluating counts in who_region and incomeperperson categories
    print "Cross-tabulation of who_region and incomeperperson categories"
    print pd.crosstab(sub2['who_region'], sub2['incomeperperson_cat'], normalize=False)

Cross-tabulation of who_region and under5mort categories

under5mort_cat	0-10	10-50	50-100	100-220
who_region
African Region	0	4	25	16
Eastern Mediterranean Region	3	11	3	2
European Region	32	12	2	0
Region of the Americas	5	22	0	1
South-East Asia Region	0	6	3	0
Western Pacific Region	5	11	2	0

Cross-tabulation of who_region and womenshool categories

womenschool_cat	0-4 years	4-8 years	8-12 years	12-16 years
who_region
African Region	17	18	10	0
Eastern Mediterranean Region	3	7	9	0
European Region	0	0	21	25
Region of the Americas	0	5	19	4
South-East Asia Region	1	5	3	0
Western Pacific Region	0	4	9	6

Cross-tabulation of who_region and healthexpend categories

healthexpend_cat	1-100	100-300	300-1000	1000-5000	5000-9000
who_region
African Region	32	5	7	0	0
Eastern Mediterranean Region	5	4	6	3	0
European Region	2	6	16	17	5
Region of the Americas	2	7	13	1	2
South-East Asia Region	5	1	1	0	0
Western Pacific Region	4	8	2	4	0

Cross-tabulation of who_region and incomeperperson categories

incomeperperson_cat	Low	Lower middle	Upper middle	High
who_region
African Region	32	7	5	0
Eastern Mediterranean Region	4	6	3	3
European Region	3	15	9	19
Region of the Americas	2	12	11	3
South-East Asia Region	4	3	1	0
Western Pacific Region	5	8	1	4

The most common range of child mortality rate is 10-50 death per 1,000 live births. 67 countries (40.36%) fell into this group. 27.1 % of the observations have a value less than or equal to 10, while about third of the countries pesented higher values (50 or higher mortality rate).

In 72 countries (43.11%) the women spend 8-10 years in schools. This is the most frequent category.

According to the crosstab calculations child mortality is higher in countries where women spend less time in education and per capita health expenditure is also on a lower level. High income countries have definitely better child survival rates.

Child mortality is especially high in the African region. In 35 countries out of 40, women spend 0-8 years in school while in the European region 8 years is the minimum of the average value.