Data Management & Visualization: Frequency Distributions
cs 21 július 2016 by Ernő GólyaThis week I will call in a dataset and run frequency distributions for my chosen variables. I will also demonstrate how to create a subset of the data rows (i.e. observations).
During the course I use Gapminder dataset, which contains quantitative variables. Calculating frequency distribution with continuous data in this case is not very useful. For a continuous variable if we take a class for each distinct value of the variable, the number of classes may become very large, thus defeating the purpose of tabulation. Since a continuous variable can assume an infinite number of values within its range of variation, the classification of such data is necessary.
I use Pandas cut.() function to sort the data into categories. The ranges I choose are arbitrary as at this point the sole purpose of this transformation is to get more discernible results as opposed to display all distinct values separately.
# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
# Reading in data file
data = pd.read_csv('custom_gapminder.csv', low_memory=False)
# Show the number of rows and columns
print len(data) # Number of countries in dataframe
print len(data.columns) # Number of variables in dataframe
# print len(data.index) # Another option for displaying (number of) rows in a Dataframe
Output:
167
19
My customized dataset contains data for 167 countries (the inner join merge included only those countries that are presented in every data files I used as a source).
The number of variables is 19, which are the variables from the original Gapminder file complemented with under5mort, womenschool and healthexpend.
# Replace all missing values with 0 by replace() function
#data["incomeperperson"] = data["incomeperperson"].replace(0, numpy.nan)
#data["under5mort"] = data["under5mort"].replace(0, numpy.nan)
#data["womenschool"] = data["womenschool"].replace(0, numpy.nan)
#data["healthexpend"] = data["healthexpend"].replace(0, numpy.nan)
I commented out the rows above, because missing values are displayed as NaN already.
# Creating categories for each variable
data["incomeperperson"] = pd.to_numeric(data["incomeperperson"],errors='coerce')
data["incomeperperson_cat"] = pd.cut(data.incomeperperson, range(0, 56000, 6000))
data["under5mort"] = pd.to_numeric(data["under5mort"],errors='coerce')
data["under5mort_cat"] = pd.cut(data.under5mort, range(0, 240, 20))
data["womenschool"] = pd.to_numeric(data["womenschool"],errors='coerce')
data["womenschool_cat"] = pd.cut(data.womenschool, range(0, 20, 2))
data["healthexpend"] = pd.to_numeric(data["healthexpend"],errors='coerce')
data["healthexpend_cat"] = pd.cut(data.healthexpend, range(0, 10000, 1000))
Incomeperperson is sorted in a range of 0 - 56000 in increments of 6000. Underfivemort has a range from 0 to 240 with increments of 20. In the case of womenschool variable I set the maximum of range to 20, with 2 years steps, while healthexpand values vary between an extended 0 - 10000 range divided to 9 categories with an increment of 1000.
# Generate frequency distributions using the value\_counts() function
# Child Mortality frequency tables
print "Child mortality value range in dataset: ", data['under5mort'].min(), " - ", data['under5mort'].max(), "\n"
print "Count for child mortality under the age of 5. Expressed as a rate per 1,000 live births"
c1 = data['under5mort_cat'].value_counts(sort=False, dropna=False)
print "Category:" + "\tFrequency"
print c1,"\n"
print "Percentage for child mortality under the age of 5. Expressed as a rate per 1,000 live births"
p1 = data['under5mort_cat'].value_counts(sort=False, dropna=False, normalize=True)
print "Category:" + "\tFrequency %"
print p1 * 100
Child mortality value range in dataset: 2.4 - 208.8
Count for child mortality under the age of 5. Expressed as a rate per 1,000 live births
Category: Frequency
(0, 20] 81
(20, 40] 25
(40, 60] 13
(60, 80] 16
(80, 100] 12
(100, 120] 11
(120, 140] 3
(140, 160] 0
(160, 180] 3
(180, 200] 1
(200, 220] 1
NaN 1
Name: under5mort_cat, dtype: int64
Percentage for child mortality under the age of 5. Expressed as a rate per 1,000 live births
Category: Frequency %
(0, 20] 48.502994
(20, 40] 14.970060
(40, 60] 7.784431
(60, 80] 9.580838
(80, 100] 7.185629
(100, 120] 6.586826
(120, 140] 1.796407
(140, 160] 0.000000
(160, 180] 1.796407
(180, 200] 0.598802
(200, 220] 0.598802
NaN 0.598802
Name: under5mort_cat, dtype: float64
Child mortality rate under 5 years is the probability that a child born in a specific year will die before reaching the age of five if subject to current age-specific mortality rates. Expressed as a rate per 1,000 live births. In the sample data the minimum value is 2.4 the maximum is 208.8.
In almost half of the countries (48.5%) this value is under 20. In about 39,5% of the countries this rate is above 20 but under 100, while the rest (around 12%) has a value above 100. There is 1 instance of missing data.
# Mean Years In School frequency tables
print "Mean years in school value range in dataset: ", data['womenschool'].min(), " - ", data['womenschool'].max(),"\n"
print "Count for Mean years in school. Women of reproductive age, 15 to 44."
c2 = data['womenschool_cat'].value_counts(sort=False, dropna=False)
print "Category:" + "\tFrequency"
print c2,"\n"
print "Percentage for Mean years in school. Women of reproductive age, 15 to 44."
p2 = data['womenschool_cat'].value_counts(sort=False, dropna=False, normalize=True)
print "Category:" + "\tFrequency %"
print p2 * 100
Mean years in school value range in dataset: 0.8 - 14.7
Count for Mean years in school. Women of reproductive age, 15 to 44.
Category: Frequency
(0, 2] 7
(2, 4] 14
(4, 6] 22
(6, 8] 17
(8, 10] 30
(10, 12] 42
(12, 14] 33
(14, 16] 2
(16, 18] 0
Name: womenschool_cat, dtype: int64
Percentage for Mean years in school. Women of reproductive age, 15 to 44.
Category: Frequency %
(0, 2] 4.191617
(2, 4] 8.383234
(4, 6] 13.173653
(6, 8] 10.179641
(8, 10] 17.964072
(10, 12] 25.149701
(12, 14] 19.760479
(14, 16] 1.197605
(16, 18] 0.000000
Name: womenschool_cat, dtype: float64
Mean years in school (women of reproductive age, 15 to 44): the average
number of years of school attended by all people in the age and gender
group specified, including primary, secondary and tertiary education.
Minimum value in dataset is 0.8, maximum is 14.7. In 21 countries
(12.5%) the average time women spend in schools is less than 4 years.
There is only two countries (1.19%) where this value is more than 14
years.
As we can see the most common category is 10 - 12 years with 25.14%.
That means 42 observations. More than 62% of the countries falls in the
8 - 14 years range.
# Expenditure On Health frequency tables
print "Per capita total expenditure on health value range in dataset: ", data['healthexpend'].min(),
" - ", data['healthexpend'].max(),"\n"
print "Count for healthexpend. Per capita total expenditure on health expressed at average exchange rate for that year in US$"
c3 = data['healthexpend_cat'].value_counts(sort=False, dropna=False)
print "Category:" + "\tFrequency"
print c3,"\n"
print "Percentage for healthexpend. Per capita total expenditure on health expressed at average exchange rate for that year in US$"
p3 = data['healthexpend_cat'].value_counts(sort=False, dropna=False, normalize=True)
print "Category:" + "\tFrequency %"
print p3 * 100
Per capita total expenditure on health value range in dataset: 11.90349972 - 8361.732117
Count for healthexpend. Per capita total expenditure on health expressed at average exchange rate for that year in US$
Category: Frequency
(0, 1000] 126
(1000, 2000] 8
(2000, 3000] 5
(3000, 4000] 4
(4000, 5000] 8
(5000, 6000] 2
(6000, 7000] 1
(7000, 8000] 1
(8000, 9000] 3
NaN 9
Name: healthexpend_cat, dtype: int64
Percentage for healthexpend. Per capita total expenditure on health expressed at average exchange rate for that year in US$
Category: Frequency %
(0, 1000] 75.449102
(1000, 2000] 4.790419
(2000, 3000] 2.994012
(3000, 4000] 2.395210
(4000, 5000] 4.790419
(5000, 6000] 1.197605
(6000, 7000] 0.598802
(7000, 8000] 0.598802
(8000, 9000] 1.796407
NaN 5.389222
Name: healthexpend_cat, dtype: float64
Per capita total expenditure on health: per capita total expenditure on
health expressed at average exchange rate for that year in US\$.
Range is 11.90 - 8361.73. 9 countries has no data in the dataset.
According to the frequency table in most countries (count: 126, that is
75.44%) the per capita expenditure on health is below 1000 US\$.
One quarter of the observations varies between 1000 - 8361.73 US\$.
# Income Per Person frequency tables
print "GDP value range in dataset: ", data['incomeperperson'].min(), " - ", data['incomeperperson'].max(), "\n"
print "Count for incomeperperson. 2010 Gross Domestic Product per capita in constant 2000 USD"
c4 = data['incomeperperson_cat'].value_counts(sort=False, dropna=False)
print "Category:" + "\tFrequency"
print c4,"\n"
print "Percentage for incomeperperson. 2010 Gross Domestic Product per capita in constant 2000 USD"
p4 = data['incomeperperson_cat'].value_counts(sort=False, dropna=False, normalize=True)
print "Category:" + "\tFrequency %"
print p4 * 100
GDP value range in dataset: 103.775857241 - 52301.587179
Count for incomeperperson. 2010 Gross Domestic Product per capita in constant 2000 USD
Category: Frequency
(0, 6000] 115
(6000, 12000] 16
(12000, 18000] 6
(18000, 24000] 5
(24000, 30000] 9
(30000, 36000] 4
(36000, 42000] 4
(42000, 48000] 0
(48000, 54000] 1
NaN 7
Name: incomeperperson_cat, dtype: int64
Percentage for incomeperperson. 2010 Gross Domestic Product per capita in constant 2000 USD
Category: Frequency %
(0, 6000] 68.862275
(6000, 12000] 9.580838
(12000, 18000] 3.592814
(18000, 24000] 2.994012
(24000, 30000] 5.389222
(30000, 36000] 2.395210
(36000, 42000] 2.395210
(42000, 48000] 0.000000
(48000, 54000] 0.598802
NaN 4.191617
Name: incomeperperson_cat, dtype: float64
Incomeperperson: 2010 Gross Domestic Product per capita in constant 2000
US$. The inflation but not the differences in the cost of living
between countries has been taken into account.
The lowest reported value is 103.77 while the highest is 52301.58 (504
times more than the minimal value).
Data is missing from 7 countries.
The lowest range (0 - 6000) is the most common among the countries: 115
countries which is 68.86% of the observations.
The next highest category is 6000 - 12000 with 16 countries (9.58%).
The remaining 29 countries (21,56%) are distributed in the 12000 -
52301.58 range.
Creating subset of a dataframe
This is an example of narrowing down the original dataframe to a subset of rows that meet a certain criteria.
In this case the frequency analysis revealed that 68.86% of the countries in the dataframe fell into the lowest (artificial) category of the Income per person variable.
The subset below contains those countries where the value of this indicator is below 6,000, while the Per capita total expenditure on health is under 1,000 and Child mortality under 5 years is less than 20 per 1,000 live births.
# Create subset of countries with the lowest Income per person and Health expenditure per person values
sub1 = data[(data['incomeperperson'] < 6000) & (data['healthexpend'] < 1000) & (data['under5mort'] < 20)]
sub2 = sub1.copy()
sub2[['country','under5mort','womenschool','healthexpend','incomeperperson']]
# Number of countries in subset
print len(sub2.index)
Subset of countries
country | under5mort | womenschool | healthexpend | incomeperperson | |
---|---|---|---|---|---|
1 | Albania | 16.6 | 10.7 | 240.824785 | 1914.996551 |
6 | Armenia | 18.0 | 11.4 | 133.476761 | 1326.741757 |
13 | Belarus | 6.1 | 12.4 | 319.640401 | 2737.670379 |
15 | Belize | 19.0 | 9.4 | 238.823592 | 3545.652174 |
18 | Bosnia and Herzegovina | 7.5 | 9.7 | 499.207872 | 2183.344867 |
20 | Brazil | 16.7 | 9.0 | 990.394075 | 4699.411262 |
21 | Bulgaria | 13.0 | 12.4 | 434.893145 | 2549.558474 |
30 | China | 15.7 | 8.5 | 220.876645 | 2425.471293 |
31 | Colombia | 18.5 | 8.4 | 472.152223 | 3233.423780 |
35 | Costa Rica | 10.2 | 10.4 | 811.438052 | 5188.900935 |
38 | Cuba | 6.6 | 12.2 | 607.027219 | 4495.046262 |
54 | Georgia | 16.6 | 12.7 | 271.625472 | 1258.762596 |
64 | Hungary | 6.6 | 12.1 | 942.344999 | 5634.003948 |
67 | Iran | 19.2 | 8.2 | 316.925387 | 2161.546510 |
72 | Jamaica | 18.1 | 11.6 | 247.279979 | 3665.348369 |
79 | Latvia | 9.4 | 12.3 | 717.580855 | 5011.219456 |
83 | Lithuania | 6.8 | 12.6 | 781.419680 | 5332.238591 |
85 | Macedonia, FYR | 9.9 | 11.2 | 316.870773 | 2221.185664 |
88 | Malaysia | 8.3 | 11.2 | 367.918603 | 5184.709328 |
89 | Maldives | 13.0 | 5.9 | 382.475004 | 4038.857818 |
93 | Mauritius | 15.2 | 10.1 | 448.927451 | 5182.143721 |
95 | Moldova | 17.1 | 12.4 | 190.413037 | 595.874535 |
97 | Montenegro | 6.8 | 11.6 | 578.348772 | 2222.335052 |
111 | Panama | 19.8 | 11.4 | 616.386537 | 5900.616944 |
119 | Romania | 13.9 | 12.5 | 427.999858 | 2636.787800 |
120 | Russia | 12.0 | 13.2 | 525.318074 | 2923.144355 |
122 | Samoa | 18.7 | 12.9 | 203.978640 | 1784.071284 |
126 | Serbia | 7.6 | 10.9 | 546.027710 | 1194.711433 |
136 | Sri Lanka | 10.6 | 10.6 | 69.959105 | 1295.742686 |
142 | Syria | 15.3 | 7.6 | 96.585010 | 1525.780116 |
146 | Thailand | 14.5 | 8.9 | 179.146529 | 2712.517199 |
149 | Tonga | 17.4 | 12.6 | 171.799052 | 2025.282665 |
151 | Tunisia | 17.4 | 7.1 | 237.837072 | 3164.927693 |
152 | Turkey | 19.1 | 8.2 | 678.113325 | 5348.597192 |
155 | Ukraine | 11.8 | 12.9 | 234.356614 | 1036.830725 |
162 | Venezuela | 16.6 | 10.5 | 663.387571 | 5528.363114 |
Number of countries in subset: 36
Creating new ranges for the variables
sub2["incomeperperson_scat"] = pd.cut(sub2.incomeperperson, range(0, 6000, 1000))
sub2["under5mort_scat"] = pd.cut(sub2.under5mort, range(0, 20, 2))
sub2["womenschool_scat"] = pd.cut(sub2.womenschool, range(0, 20, 2))
sub2["healthexpend_scat"] = pd.cut(sub2.healthexpend, range(0, 1000, 100))
Example frequency distribution of subset
print "Child mortality value range in subset: ", sub2['under5mort'].min(), " - ", sub2['under5mort'].max(), "\n"
print "Count for child mortality under the age of 5. Expressed as a rate per 1,000 live births"
cs1 = sub2['under5mort_scat'].value_counts(sort=False, dropna=False)
print "Category:" + "\tFrequency"
print cs1,"\n"
print "Percentage for child mortality under the age of 5. Expressed as a rate per 1,000 live births"
ps1 = sub2['under5mort_scat'].value_counts(sort=False, dropna=False, normalize=True)
print "Category:" + "\tFrequency %"
print ps1 * 100
Child mortality value range in subset: 6.1 - 19.8
Count for child mortality under the age of 5. Expressed as a rate per 1,000 live births
Category: Frequency
(0, 2] 0
(2, 4] 0
(4, 6] 0
(6, 8] 7
(8, 10] 3
(10, 12] 4
(12, 14] 3
(14, 16] 4
(16, 18] 8
NaN 7
Name: under5mort_scat, dtype: int64
Percentage for child mortality under the age of 5. Expressed as a rate per 1,000 live births
Category: Frequency %
(0, 2] 0.000000
(2, 4] 0.000000
(4, 6] 0.000000
(6, 8] 19.444444
(8, 10] 8.333333
(10, 12] 11.111111
(12, 14] 8.333333
(14, 16] 11.111111
(16, 18] 22.222222
NaN 19.444444
Name: under5mort_scat, dtype: float64
print "Mean years in school value range in subaset: ", sub2['womenschool'].min(), " - ", sub2['womenschool'].max(),"\n"
print "Count for Mean years in school. Women of reproductive age, 15 to 44."
cs2 = sub2['womenschool_scat'].value_counts(sort=False, dropna=False)
print "Category:" + "\tFrequency"
print cs2,"\n"
print "Percentage for Mean years in school. Women of reproductive age, 15 to 44."
ps2 = sub2['womenschool_scat'].value_counts(sort=False, dropna=False, normalize=True)
print "Category:" + "\tFrequency %"
print ps2 * 100
Mean years in school value range in subaset: 5.9 - 13.2
Count for Mean years in school. Women of reproductive age, 15 to 44.
Category: Frequency
(0, 2] 0
(2, 4] 0
(4, 6] 1
(6, 8] 2
(8, 10] 8
(10, 12] 12
(12, 14] 13
(14, 16] 0
(16, 18] 0
Name: womenschool_scat, dtype: int64
Percentage for Mean years in school. Women of reproductive age, 15 to 44.
Category: Frequency %
(0, 2] 0.000000
(2, 4] 0.000000
(4, 6] 2.777778
(6, 8] 5.555556
(8, 10] 22.222222
(10, 12] 33.333333
(12, 14] 36.111111
(14, 16] 0.000000
(16, 18] 0.000000
Name: womenschool_scat, dtype: float64
36 countries in the original dataframe have an Income per person less than 6,000 USD and Per capita health expenditure under 1,000 USD with less than 20 child mortality per 1,000 live births. 7 observations has no data of under5mort. Child mortality value is between 6.1 - 19.8, the most common range is 16 - 18 with 22.22% (8 countries). Mean years in schools is 5.9 - 13.2 with the highest ratio of 36.11% (13 countries) in the 12 - 14 years range.