Data Management & Visualization: Making Data Management Decisions
k 26 július 2016 by Ernő GólyaThe following Python code is an example of how to make and implement data management decisions. In the previous assignment I already created categories for the relevant variables, as the dataset I study mostly have continuous data. I revise and alter some of the categories to be more similar to those used by World Bank and WHO.
World Bank income groups: WHO Member States are grouped into 4 income groups (low, lower-middle, upper-middle, and high) based on the World Bank list of analytical income classification of economies. Although it uses national income per capita estimates, for sake of demonstration I will apply these categories to the GDP per capita (incomeperperson) variable with different thresholds.
I will add a new variable to my subset: WHO regions. WHO Member States are grouped into 6 WHO regions: African Region, Region of the Americas, South-East Asia Region, European Region, Eastern Mediterranean Region, and Western Pacific Region. I create crosstabs to highlight the differenes between these areas.
I use a subset of the dataframe including only the variables of my
interest. Frequency distribution calculations exclude countries with
missing values (NaN) of the specific variable (dropna=False parameter is
omitted, default value is True). Minimum, maximum,
mean, median calculations are also presented.
# -*- coding: utf-8 -*-
# Created on 24/07/2016
# Author Ernő Gólya
# Import libraries
import pandas as pd
import numpy as np
# Reading in data file
data = pd.read_csv('custom_gapminder.csv', low_memory=False)
# Setting variables to numeric
data["incomeperperson"] = pd.to_numeric(data["incomeperperson"],errors='coerce')
data["under5mort"] = pd.to_numeric(data["under5mort"],errors='coerce')
data["womenschool"] = pd.to_numeric(data["womenschool"],errors='coerce')
data["healthexpend"] = pd.to_numeric(data["healthexpend"],errors='coerce')
# Create subset including only variables of interest
sub1 = data[['country','under5mort','womenschool','healthexpend','incomeperperson']]
# Make a copy of subsetted data
sub2 = sub1.copy()
# Number of countries in subset
print "Subset contains:", len(sub2.index), "countries."
print ""
# preview dataset
print sub2.head(n=10), "\n"
# Creating categories
sub2["incomeperperson_cat"] = pd.cut(sub2.incomeperperson, [1, 1000, 4000, 12000, 65000],
labels=["Low", "Lower middle", "Upper middle", "High" ])
sub2["under5mort_cat"] = pd.cut(sub2.under5mort, [1, 10, 50, 100, 220], labels=["0-10", "10-50", "50-100", "100-220"])
sub2["womenschool_cat"] = pd.cut(sub2.womenschool, [0, 4, 8, 12, 16],
labels=["0-4 years", "4-8 years", "8-12 years", "12-16 years"])
sub2["healthexpend_cat"] = pd.cut(sub2.healthexpend, [1, 100, 300, 1000, 5000, 9000],
labels=["1-100", "100-300", "300-1000", "1000-5000", "5000-9000" ])
#frequency distributions with minimum, maximum, mean, median calculation
#frequency distribution for incomeperperson_cat
print "GDP value range in dataset: ", sub2['incomeperperson'].min(), " - ", sub2['incomeperperson'].max()
print "Mean: ",np.nanmean(sub2.incomeperperson)
print "Median: ",np.nanmedian(sub2.incomeperperson)
print "Count for incomeperperson. 2010 Gross Domestic Product per capita in constant 2000 USD."
c4 = sub2['incomeperperson_cat'].value_counts(sort=False)
print c4
print "Percentage for incomeperperson. 2010 Gross Domestic Product per capita in constant 2000 USD."
p4 = sub2['incomeperperson_cat'].value_counts(sort=False, normalize=True)
print p4 * 100
print "\n"
#frequency distribution for under5mort_cat
print "Child mortality value range in dataset: ", sub2['under5mort'].min(), " - ", sub2['under5mort'].max()
print "Mean: ",np.nanmean(sub2.under5mort)
print "Median: ",np.nanmedian(sub2.under5mort)
print "Count for child mortality under the age of 5. Expressed as a rate per 1,000 live births."
c1 = sub2['under5mort_cat'].value_counts(sort=False)
print c1
print "Percentage for child mortality under the age of 5. Expressed as a rate per 1,000 live births."
p1 = sub2['under5mort_cat'].value_counts(sort=False, normalize=True)
print p1 * 100
print "\n"
#frequency distribution for womenschool_cat
print "Mean years in school value range in dataset: ", sub2['womenschool'].min(), " - ", sub2['womenschool'].max()
print "Mean: ",np.nanmean(sub2.womenschool)
print "Median: ",np.nanmedian(sub2.womenschool)
print "Count for Mean years in school. Women of reproductive age, 15 to 44."
c2 = sub2['womenschool_cat'].value_counts(sort=False)
print c2
print "Percentage for Mean years in school. Women of reproductive age, 15 to 44."
p2 = sub2['womenschool_cat'].value_counts(sort=False, normalize=True)
print p2 * 100
print "\n"
#frequency distribution for healthexpend_cat
print "Per capita total expenditure on health value range in dataset: ", sub2['healthexpend'].min(),
" - ", sub2['healthexpend'].max()
print "Mean: ",np.nanmean(sub2.healthexpend)
print "Median: ",np.nanmedian(sub2.healthexpend)
print "Count for healthexpend. Per capita total expenditure on health expressed in US$."
c3 = sub2['healthexpend_cat'].value_counts(sort=False)
print c3
print "Percentage for healthexpend. Per capita total expenditure on health expressed in US$."
p3 = sub2['healthexpend_cat'].value_counts(sort=False, normalize=True)
print p3 * 100
Frequency tables
Subset contains: 167 countries.
GDP value range in dataset: 103.775857241 - 52301.587179
Mean: 6876.91364439
Median: 2226.50571315
Count for incomeperperson. 2010 Gross Domestic Product per capita in constant 2000 USD.
Low 50
Lower middle 51
Upper middle 30
High 29
Name: incomeperperson_cat, dtype: int64
Percentage for incomeperperson. 2010 Gross Domestic Product per capita in constant 2000 USD.
Low 31.250
Lower middle 31.875
Upper middle 18.750
High 18.125
Name: incomeperperson_cat, dtype: float64
Child mortality value range in dataset: 2.4 - 208.8
Mean: 40.8915662651
Median: 21.3
Count for child mortality under the age of 5. Expressed as a rate per 1,000 live births.
0-10 45
10-50 67
50-100 35
100-220 19
Name: under5mort_cat, dtype: int64
Percentage for child mortality under the age of 5. Expressed as a rate per 1,000 live births.
0-10 27.108434
10-50 40.361446
50-100 21.084337
100-220 11.445783
Name: under5mort_cat, dtype: float64
Mean years in school value range in dataset: 0.8 - 14.7
Mean: 8.84850299401
Median: 9.6
Count for Mean years in school. Women of reproductive age, 15 to 44.
0-4 years 21
4-8 years 39
8-12 years 72
12-16 years 35
Name: womenschool_cat, dtype: int64
Percentage for Mean years in school. Women of reproductive age, 15 to 44.
0-4 years 12.574850
4-8 years 23.353293
8-12 years 43.113772
12-16 years 20.958084
Name: womenschool_cat, dtype: float64
Per capita total expenditure on health value range in dataset: 11.90349972 - 8361.732117
Mean: 1009.4234372
Median: 258.01889645
Count for healthexpend. Per capita total expenditure on health expressed in US$.
1-100 50
100-300 31
300-1000 45
1000-5000 25
5000-9000 7
Name: healthexpend_cat, dtype: int64
Percentage for healthexpend. Per capita total expenditure on health expressed in US$.
1-100 31.645570
100-300 19.620253
300-1000 28.481013
1000-5000 15.822785
5000-9000 4.430380
Name: healthexpend_cat, dtype: float64
Use crosstab() function to create cross counts for variables
#crosstabs evaluating counts in womenshool and under5mort categories
print "Cross-tabulation of womenshool and under5mort categories"
print pd.crosstab(sub2['womenschool_cat'], sub2['under5mort_cat'])
print ""
#crosstabs evaluating counts in healthexpend and under5mort categories
print "Cross-tabulation of healthexpend and under5mort categories"
print pd.crosstab(sub2['healthexpend_cat'], sub2['under5mort_cat'])
print ""
#crosstabs evaluating counts in incomeperperson and under5mort categories
print "Cross-tabulation of incomeperperson and under5mort categories"
print pd.crosstab(sub2['incomeperperson_cat'], sub2['under5mort_cat'])
Crosstabs
Cross-tabulation of womenshool and under5mort categories
under5mort_cat 0-10 10-50 50-100 100-220
womenschool_cat
0-4 years 0 1 8 12
4-8 years 0 16 17 6
8-12 years 20 41 10 1
12-16 years 25 9 0 0
Cross-tabulation of healthexpend and under5mort categories
under5mort_cat 0-10 10-50 50-100 100-220
healthexpend_cat
1-100 0 8 27 15
100-300 0 26 3 2
300-1000 15 25 4 1
1000-5000 23 2 0 0
5000-9000 7 0 0 0
Cross-tabulation of incomeperperson and under5mort categories
under5mort_cat 0-10 10-50 50-100 100-220
incomeperperson_cat
Low 0 9 26 15
Lower middle 5 40 5 1
Upper middle 12 15 2 1
High 28 1 0 0
Add new variable (coloumn) to subset
# WHO regions dictionary
who_regions = {
'African Region':
['Algeria', 'Angola', 'Benin', 'Botswana', 'Burkina Faso', 'Burundi', 'Cameroon',
'Cape Verde', 'Central African Republic', 'Chad', 'Comoros', "Cote d'Ivoire",
'Congo, Dem. Rep.', 'Equatorial Guinea', 'Eritrea', 'Ethiopia', 'Gabon', 'Gambia',
'Ghana', 'Guinea', 'Guinea-Bissau', 'Kenya', 'Lesotho', 'Liberia', 'Madagascar',
'Malawi', 'Mali', 'Mauritania', 'Mauritius', 'Mozambique', 'Namibia', 'Niger',
'Nigeria', 'Congo, Rep.', 'Rwanda', 'Sao Tome and Principe', 'Senegal', 'Seychelles',
'Sierra Leone', 'South Africa', 'Swaziland', 'Togo', 'Uganda', 'Tanzania', 'Zambia', 'Zimbabwe'],
'Region of the Americas':
['Antigua and Barbuda', 'Argentina', 'Bahamas', 'Barbados', 'Belize', 'Bolivia', 'Brazil', 'Canada',
'Chile', 'Colombia', 'Costa Rica', 'Cuba', 'Dominica', 'Dominican Republic', 'Ecuador', 'El Salvador',
'Grenada', 'Guatemala', 'Guyana', 'Haiti', 'Honduras', 'Jamaica', 'Mexico', 'Nicaragua', 'Panama',
'Paraguay', 'Peru', 'Saint Kitts and Nevis', 'Saint Lucia', 'Saint Vincent and the Grenadines',
'Suriname', 'Trinidad and Tobago', 'United States', 'Uruguay', 'Venezuela'],
'South-East Asia Region':
['Bangladesh', 'Bhutan', 'Democratic People\xe2\x80\x99s Republic of Korea', 'India', 'Indonesia',
'Maldives', 'Myanmar', 'Nepal', 'Sri Lanka', 'Thailand', 'Timor-Leste'],
'European Region':
['Albania', 'Andorra', 'Armenia', 'Austria', 'Azerbaijan', 'Belarus', 'Belgium', 'Bosnia and Herzegovina',
'Bulgaria', 'Croatia', 'Cyprus', 'Czech Republic', 'Denmark', 'Estonia', 'Finland', 'France', 'Georgia',
'Germany', 'Greece', 'Hungary', 'Iceland', 'Ireland', 'Israel', 'Italy', 'Kazakhstan', 'Kyrgyzstan',
'Latvia', 'Lithuania', 'Luxembourg', 'Malta', 'Monaco', 'Montenegro', 'Netherlands', 'Norway', 'Poland',
'Portugal', 'Moldova', 'Romania', 'Russia', 'San Marino', 'Serbia', 'Slovak Republic', 'Slovenia',
'Spain', 'Sweden', 'Switzerland', 'Tajikistan', 'Macedonia, FYR', 'Turkey', 'Turkmenistan', 'Ukraine',
'United Kingdom', 'Uzbekistan'],
'Eastern Mediterranean Region':
['Afghanistan', 'Bahrain', 'Djibouti', 'Egypt', 'Iran', 'Iraq', 'Jordan', 'Kuwait', 'Lebanon',
'Libyan Arab Jamahiriya', 'Morocco', 'Oman', 'Pakistan', 'Qatar', 'Saudi Arabia', 'Somalia',
'Sudan', 'Syria', 'Tunisia', 'United Arab Emirates', 'Yemen'],
'Western Pacific Region':
['Australia', 'Brunei Darussalam', 'Cambodia', 'China', 'Cook Islands', 'Fiji', 'Japan',
'Kiribati', 'Lao People\xe2\x80\x99s Democratic Republic', 'Malaysia', 'Marshall Islands',
'Micronesia (Federated States of)', 'Mongolia', 'Nauru', 'New Zealand', 'Niue', 'Palau',
'Papua New Guinea', 'Philippines', 'Republic of Korea', 'Samoa', 'Singapore', 'Solomon Islands',
'Taiwan', 'Tonga', 'Tuvalu', 'Vanuatu', 'Vietnam']}
# Swap keys - values in who_regions dictionary
d = {}
for key in who_regions:
for value in who_regions[key]:
d[value] = key
# Add coloumn who_region to subset
sub2['who_region']= sub2['country'].map(d)
print "Sample of subset with new variable"
sub2[['country','under5mort','womenschool','healthexpend','incomeperperson', 'who_region']].tail(n=5)
Sample of subset with new variable
country | under5mort | womenschool | healthexpend | incomeperperson | who_region | |
---|---|---|---|---|---|---|
162 | Venezuela | 16.6 | 10.5 | 663.387571 | 5528.363114 | Region of the Americas |
163 | West Bank and Gaza | 23.8 | 11.0 | NaN | NaN | NaN |
164 | Vietnam | 24.8 | 8.4 | 82.872177 | 722.807559 | Western Pacific Region |
165 | Zambia | 84.8 | 6.7 | 72.884346 | 432.226337 | African Region |
166 | Zimbabwe | 95.1 | 9.0 | NaN | 320.771890 | African Region |
Crosstabs with new variable
#crosstabs evaluating counts in who_region and under5mort categories
print "Cross-tabulation of who_region and under5mort categories"
print pd.crosstab(sub2['who_region'], sub2['under5mort_cat'], normalize=False)
print ""
#crosstabs evaluating counts in who_region and womenshool categories
print "Cross-tabulation of who_region and womenshool categories"
print pd.crosstab(sub2['who_region'], sub2['womenschool_cat'], normalize=False)
print ""
#crosstabs evaluating counts in who_region and healthexpend categories
print "Cross-tabulation of who_region and healthexpend categories"
print pd.crosstab(sub2['who_region'], sub2['healthexpend_cat'], normalize=False)
print ""
#crosstabs evaluating counts in who_region and incomeperperson categories
print "Cross-tabulation of who_region and incomeperperson categories"
print pd.crosstab(sub2['who_region'], sub2['incomeperperson_cat'], normalize=False)
Cross-tabulation of who_region and under5mort categories
under5mort_cat | 0-10 | 10-50 | 50-100 | 100-220 |
---|---|---|---|---|
who_region | ||||
African Region | 0 | 4 | 25 | 16 |
Eastern Mediterranean Region | 3 | 11 | 3 | 2 |
European Region | 32 | 12 | 2 | 0 |
Region of the Americas | 5 | 22 | 0 | 1 |
South-East Asia Region | 0 | 6 | 3 | 0 |
Western Pacific Region | 5 | 11 | 2 | 0 |
Cross-tabulation of who_region and womenshool categories
womenschool_cat | 0-4 years | 4-8 years | 8-12 years | 12-16 years |
---|---|---|---|---|
who_region | ||||
African Region | 17 | 18 | 10 | 0 |
Eastern Mediterranean Region | 3 | 7 | 9 | 0 |
European Region | 0 | 0 | 21 | 25 |
Region of the Americas | 0 | 5 | 19 | 4 |
South-East Asia Region | 1 | 5 | 3 | 0 |
Western Pacific Region | 0 | 4 | 9 | 6 |
Cross-tabulation of who_region and healthexpend categories
healthexpend_cat | 1-100 | 100-300 | 300-1000 | 1000-5000 | 5000-9000 |
---|---|---|---|---|---|
who_region | |||||
African Region | 32 | 5 | 7 | 0 | 0 |
Eastern Mediterranean Region | 5 | 4 | 6 | 3 | 0 |
European Region | 2 | 6 | 16 | 17 | 5 |
Region of the Americas | 2 | 7 | 13 | 1 | 2 |
South-East Asia Region | 5 | 1 | 1 | 0 | 0 |
Western Pacific Region | 4 | 8 | 2 | 4 | 0 |
Cross-tabulation of who_region and incomeperperson categories
incomeperperson_cat | Low | Lower middle | Upper middle | High |
---|---|---|---|---|
who_region | ||||
African Region | 32 | 7 | 5 | 0 |
Eastern Mediterranean Region | 4 | 6 | 3 | 3 |
European Region | 3 | 15 | 9 | 19 |
Region of the Americas | 2 | 12 | 11 | 3 |
South-East Asia Region | 4 | 3 | 1 | 0 |
Western Pacific Region | 5 | 8 | 1 | 4 |
The most common range of child mortality rate is 10-50 death per 1,000 live births. 67 countries (40.36%) fell into this group. 27.1 % of the observations have a value less than or equal to 10, while about third of the countries pesented higher values (50 or higher mortality rate).
In 72 countries (43.11%) the women spend 8-10 years in schools. This is the most frequent category.
According to the crosstab calculations child mortality is higher in countries where women spend less time in education and per capita health expenditure is also on a lower level. High income countries have definitely better child survival rates.
Child mortality is especially high in the African region. In 35 countries out of 40, women spend 0-8 years in school while in the European region 8 years is the minimum of the average value.