Data Management & Visualization: Frequency Distributions

cs 21 július 2016 by Ernő Gólya

This week I will call in a dataset and run frequency distributions for my chosen variables. I will also demonstrate how to create a subset of the data rows (i.e. observations).

During the course I use Gapminder dataset, which contains quantitative variables. Calculating frequency distribution with continuous data in this case is not very useful. For a continuous variable if we take a class for each distinct value of the variable, the number of classes may become very large, thus defeating the purpose of tabulation. Since a continuous variable can assume an infinite number of values within its range of variation, the classification of such data is necessary.

I use Pandas cut.() function to sort the data into categories. The ranges I choose are arbitrary as at this point the sole purpose of this transformation is to get more discernible results as opposed to display all distinct values separately.

    # -*- coding: utf-8 -*-
    import pandas as pd
    import numpy as np
    # Reading in data file
    data = pd.read_csv('custom_gapminder.csv', low_memory=False)
    # Show the number of rows and columns 
    print len(data) # Number of countries in dataframe
    print len(data.columns) # Number of variables in dataframe
    # print len(data.index) # Another option for displaying (number of) rows in a Dataframe

Output:

    167
    19

My customized dataset contains data for 167 countries (the inner join merge included only those countries that are presented in every data files I used as a source).

The number of variables is 19, which are the variables from the original Gapminder file complemented with under5mort, womenschool and healthexpend.

    # Replace all missing values with 0 by replace() function 
    #data["incomeperperson"] = data["incomeperperson"].replace(0, numpy.nan)
    #data["under5mort"] = data["under5mort"].replace(0, numpy.nan)
    #data["womenschool"] = data["womenschool"].replace(0, numpy.nan)
    #data["healthexpend"] = data["healthexpend"].replace(0, numpy.nan)

I commented out the rows above, because missing values are displayed as NaN already.

    # Creating categories for each variable

    data["incomeperperson"] = pd.to_numeric(data["incomeperperson"],errors='coerce')
    data["incomeperperson_cat"] = pd.cut(data.incomeperperson, range(0, 56000, 6000))

    data["under5mort"] = pd.to_numeric(data["under5mort"],errors='coerce')
    data["under5mort_cat"] = pd.cut(data.under5mort, range(0, 240, 20))

    data["womenschool"] = pd.to_numeric(data["womenschool"],errors='coerce')
    data["womenschool_cat"] = pd.cut(data.womenschool, range(0, 20, 2))

    data["healthexpend"] = pd.to_numeric(data["healthexpend"],errors='coerce')
    data["healthexpend_cat"] = pd.cut(data.healthexpend, range(0, 10000, 1000))

Incomeperperson is sorted in a range of 0 - 56000 in increments of 6000. Underfivemort has a range from 0 to 240 with increments of 20. In the case of womenschool variable I set the maximum of range to 20, with 2 years steps, while healthexpand values vary between an extended 0 - 10000 range divided to 9 categories with an increment of 1000.

    # Generate frequency distributions using the value\_counts() function 
    # Child Mortality frequency tables
    print "Child mortality value range in dataset: ", data['under5mort'].min(),  " - ", data['under5mort'].max(), "\n"
    print "Count for child mortality under the age of 5. Expressed as a rate per 1,000 live births"
    c1 = data['under5mort_cat'].value_counts(sort=False, dropna=False)
    print "Category:" + "\tFrequency"
    print c1,"\n"
    print "Percentage for child mortality under the age of 5. Expressed as a rate per 1,000 live births"
    p1 = data['under5mort_cat'].value_counts(sort=False, dropna=False, normalize=True)
    print "Category:" + "\tFrequency %"
    print p1 * 100
Child mortality value range in dataset:  2.4  -  208.8

Count for child mortality under the age of 5. Expressed as a rate per 1,000 live births
Category:   Frequency
(0, 20]       81
(20, 40]      25
(40, 60]      13
(60, 80]      16
(80, 100]     12
(100, 120]    11
(120, 140]     3
(140, 160]     0
(160, 180]     3
(180, 200]     1
(200, 220]     1
NaN            1
Name: under5mort_cat, dtype: int64

Percentage for child mortality under the age of 5. Expressed as a rate per 1,000 live births
Category:   Frequency %
(0, 20]       48.502994
(20, 40]      14.970060
(40, 60]       7.784431
(60, 80]       9.580838
(80, 100]      7.185629
(100, 120]     6.586826
(120, 140]     1.796407
(140, 160]     0.000000
(160, 180]     1.796407
(180, 200]     0.598802
(200, 220]     0.598802
NaN            0.598802
Name: under5mort_cat, dtype: float64

Child mortality rate under 5 years is the probability that a child born in a specific year will die before reaching the age of five if subject to current age-specific mortality rates. Expressed as a rate per 1,000 live births. In the sample data the minimum value is 2.4 the maximum is 208.8.

In almost half of the countries (48.5%) this value is under 20. In about 39,5% of the countries this rate is above 20 but under 100, while the rest (around 12%) has a value above 100. There is 1 instance of missing data.

    # Mean Years In School frequency tables
    print "Mean years in school value range in dataset: ", data['womenschool'].min(),  " - ", data['womenschool'].max(),"\n"
    print "Count for Mean years in school. Women of reproductive age, 15 to 44."
    c2 = data['womenschool_cat'].value_counts(sort=False, dropna=False)
    print "Category:" + "\tFrequency"
    print c2,"\n"
    print "Percentage for Mean years in school. Women of reproductive age, 15 to 44."
    p2 = data['womenschool_cat'].value_counts(sort=False, dropna=False, normalize=True)
    print "Category:" + "\tFrequency %"
    print p2 * 100
Mean years in school value range in dataset:  0.8  -  14.7

Count for Mean years in school. Women of reproductive age, 15 to 44.
Category:   Frequency
(0, 2]       7
(2, 4]      14
(4, 6]      22
(6, 8]      17
(8, 10]     30
(10, 12]    42
(12, 14]    33
(14, 16]     2
(16, 18]     0
Name: womenschool_cat, dtype: int64

Percentage for Mean years in school. Women of reproductive age, 15 to 44.
Category:   Frequency %
(0, 2]       4.191617
(2, 4]       8.383234
(4, 6]      13.173653
(6, 8]      10.179641
(8, 10]     17.964072
(10, 12]    25.149701
(12, 14]    19.760479
(14, 16]     1.197605
(16, 18]     0.000000
Name: womenschool_cat, dtype: float64

Mean years in school (women of reproductive age, 15 to 44): the average number of years of school attended by all people in the age and gender group specified, including primary, secondary and tertiary education.
Minimum value in dataset is 0.8, maximum is 14.7. In 21 countries (12.5%) the average time women spend in schools is less than 4 years. There is only two countries (1.19%) where this value is more than 14 years.
As we can see the most common category is 10 - 12 years with 25.14%. That means 42 observations. More than 62% of the countries falls in the 8 - 14 years range.

    # Expenditure On Health frequency tables
    print "Per capita total expenditure on health value range in dataset:  ", data['healthexpend'].min(), 
    " - ", data['healthexpend'].max(),"\n"
    print "Count for healthexpend. Per capita total expenditure on health expressed at average exchange rate for that year in US$"
    c3 = data['healthexpend_cat'].value_counts(sort=False, dropna=False)
    print "Category:" + "\tFrequency"
    print c3,"\n"
    print "Percentage for healthexpend. Per capita total expenditure on health expressed at average exchange rate for that year in US$"
    p3 = data['healthexpend_cat'].value_counts(sort=False, dropna=False, normalize=True)
    print "Category:" + "\tFrequency %"
    print p3 * 100
Per capita total expenditure on health value range in dataset:  11.90349972  -  8361.732117

Count for healthexpend. Per capita total expenditure on health expressed at average exchange rate for that year in US$
Category:   Frequency
(0, 1000]       126
(1000, 2000]      8
(2000, 3000]      5
(3000, 4000]      4
(4000, 5000]      8
(5000, 6000]      2
(6000, 7000]      1
(7000, 8000]      1
(8000, 9000]      3
NaN               9
Name: healthexpend_cat, dtype: int64

Percentage for healthexpend. Per capita total expenditure on health expressed at average exchange rate for that year in US$
Category:   Frequency %
(0, 1000]       75.449102
(1000, 2000]     4.790419
(2000, 3000]     2.994012
(3000, 4000]     2.395210
(4000, 5000]     4.790419
(5000, 6000]     1.197605
(6000, 7000]     0.598802
(7000, 8000]     0.598802
(8000, 9000]     1.796407
NaN              5.389222
Name: healthexpend_cat, dtype: float64

Per capita total expenditure on health: per capita total expenditure on health expressed at average exchange rate for that year in US\$.
Range is 11.90 - 8361.73. 9 countries has no data in the dataset.
According to the frequency table in most countries (count: 126, that is 75.44%) the per capita expenditure on health is below 1000 US\$.
One quarter of the observations varies between 1000 - 8361.73 US\$.

    # Income Per Person frequency tables
    print "GDP value range in dataset: ", data['incomeperperson'].min(),  " - ", data['incomeperperson'].max(), "\n"
    print "Count for incomeperperson. 2010 Gross Domestic Product per capita in constant 2000 USD"
    c4 = data['incomeperperson_cat'].value_counts(sort=False, dropna=False)
    print "Category:" + "\tFrequency"
    print c4,"\n"
    print "Percentage for incomeperperson. 2010 Gross Domestic Product per capita in constant 2000 USD"
    p4 = data['incomeperperson_cat'].value_counts(sort=False, dropna=False, normalize=True)
    print "Category:" + "\tFrequency %"
    print p4 * 100
GDP value range in dataset:  103.775857241  -  52301.587179

Count for incomeperperson. 2010 Gross Domestic Product per capita in constant 2000 USD
Category:   Frequency
(0, 6000]         115
(6000, 12000]      16
(12000, 18000]      6
(18000, 24000]      5
(24000, 30000]      9
(30000, 36000]      4
(36000, 42000]      4
(42000, 48000]      0
(48000, 54000]      1
NaN                 7
Name: incomeperperson_cat, dtype: int64

Percentage for incomeperperson. 2010 Gross Domestic Product per capita in constant 2000 USD
Category:   Frequency %
(0, 6000]         68.862275
(6000, 12000]      9.580838
(12000, 18000]     3.592814
(18000, 24000]     2.994012
(24000, 30000]     5.389222
(30000, 36000]     2.395210
(36000, 42000]     2.395210
(42000, 48000]     0.000000
(48000, 54000]     0.598802
NaN                4.191617
Name: incomeperperson_cat, dtype: float64

Incomeperperson: 2010 Gross Domestic Product per capita in constant 2000 US$. The inflation but not the differences in the cost of living between countries has been taken into account.
The lowest reported value is 103.77 while the highest is 52301.58 (504 times more than the minimal value).
Data is missing from 7 countries.
The lowest range (0 - 6000) is the most common among the countries: 115 countries which is 68.86% of the observations.
The next highest category is 6000 - 12000 with 16 countries (9.58%).
The remaining 29 countries (21,56%) are distributed in the 12000 - 52301.58 range.

Creating subset of a dataframe

This is an example of narrowing down the original dataframe to a subset of rows that meet a certain criteria.

In this case the frequency analysis revealed that 68.86% of the countries in the dataframe fell into the lowest (artificial) category of the Income per person variable.

The subset below contains those countries where the value of this indicator is below 6,000, while the Per capita total expenditure on health is under 1,000 and Child mortality under 5 years is less than 20 per 1,000 live births.

# Create subset of countries with the lowest Income per person and Health expenditure per person values
sub1 = data[(data['incomeperperson'] < 6000) & (data['healthexpend'] < 1000) & (data['under5mort'] < 20)]
sub2 = sub1.copy()
sub2[['country','under5mort','womenschool','healthexpend','incomeperperson']]
# Number of countries in subset
print len(sub2.index)

Subset of countries

country under5mort womenschool healthexpend incomeperperson
1 Albania 16.6 10.7 240.824785 1914.996551
6 Armenia 18.0 11.4 133.476761 1326.741757
13 Belarus 6.1 12.4 319.640401 2737.670379
15 Belize 19.0 9.4 238.823592 3545.652174
18 Bosnia and Herzegovina 7.5 9.7 499.207872 2183.344867
20 Brazil 16.7 9.0 990.394075 4699.411262
21 Bulgaria 13.0 12.4 434.893145 2549.558474
30 China 15.7 8.5 220.876645 2425.471293
31 Colombia 18.5 8.4 472.152223 3233.423780
35 Costa Rica 10.2 10.4 811.438052 5188.900935
38 Cuba 6.6 12.2 607.027219 4495.046262
54 Georgia 16.6 12.7 271.625472 1258.762596
64 Hungary 6.6 12.1 942.344999 5634.003948
67 Iran 19.2 8.2 316.925387 2161.546510
72 Jamaica 18.1 11.6 247.279979 3665.348369
79 Latvia 9.4 12.3 717.580855 5011.219456
83 Lithuania 6.8 12.6 781.419680 5332.238591
85 Macedonia, FYR 9.9 11.2 316.870773 2221.185664
88 Malaysia 8.3 11.2 367.918603 5184.709328
89 Maldives 13.0 5.9 382.475004 4038.857818
93 Mauritius 15.2 10.1 448.927451 5182.143721
95 Moldova 17.1 12.4 190.413037 595.874535
97 Montenegro 6.8 11.6 578.348772 2222.335052
111 Panama 19.8 11.4 616.386537 5900.616944
119 Romania 13.9 12.5 427.999858 2636.787800
120 Russia 12.0 13.2 525.318074 2923.144355
122 Samoa 18.7 12.9 203.978640 1784.071284
126 Serbia 7.6 10.9 546.027710 1194.711433
136 Sri Lanka 10.6 10.6 69.959105 1295.742686
142 Syria 15.3 7.6 96.585010 1525.780116
146 Thailand 14.5 8.9 179.146529 2712.517199
149 Tonga 17.4 12.6 171.799052 2025.282665
151 Tunisia 17.4 7.1 237.837072 3164.927693
152 Turkey 19.1 8.2 678.113325 5348.597192
155 Ukraine 11.8 12.9 234.356614 1036.830725
162 Venezuela 16.6 10.5 663.387571 5528.363114

Number of countries in subset: 36

Creating new ranges for the variables

    sub2["incomeperperson_scat"] = pd.cut(sub2.incomeperperson, range(0, 6000, 1000))
    sub2["under5mort_scat"] = pd.cut(sub2.under5mort, range(0, 20, 2))
    sub2["womenschool_scat"] = pd.cut(sub2.womenschool, range(0, 20, 2))
    sub2["healthexpend_scat"] = pd.cut(sub2.healthexpend, range(0, 1000, 100))

Example frequency distribution of subset

    print "Child mortality value range in subset: ", sub2['under5mort'].min(),  " - ", sub2['under5mort'].max(), "\n"
    print "Count for child mortality under the age of 5. Expressed as a rate per 1,000 live births"
    cs1 = sub2['under5mort_scat'].value_counts(sort=False, dropna=False)
    print "Category:" + "\tFrequency"
    print cs1,"\n"
    print "Percentage for child mortality under the age of 5. Expressed as a rate per 1,000 live births"
    ps1 = sub2['under5mort_scat'].value_counts(sort=False, dropna=False, normalize=True)
    print "Category:" + "\tFrequency %"
    print ps1 * 100
    Child mortality value range in subset:  6.1  -  19.8 

    Count for child mortality under the age of 5. Expressed as a rate per 1,000 live births
    Category:   Frequency
    (0, 2]      0
    (2, 4]      0
    (4, 6]      0
    (6, 8]      7
    (8, 10]     3
    (10, 12]    4
    (12, 14]    3
    (14, 16]    4
    (16, 18]    8
    NaN         7
    Name: under5mort_scat, dtype: int64 

    Percentage for child mortality under the age of 5. Expressed as a rate per 1,000 live births
    Category:   Frequency %
    (0, 2]       0.000000
    (2, 4]       0.000000
    (4, 6]       0.000000
    (6, 8]      19.444444
    (8, 10]      8.333333
    (10, 12]    11.111111
    (12, 14]     8.333333
    (14, 16]    11.111111
    (16, 18]    22.222222
    NaN         19.444444
    Name: under5mort_scat, dtype: float64
    print "Mean years in school value range in subaset: ", sub2['womenschool'].min(),  " - ", sub2['womenschool'].max(),"\n"
    print "Count for Mean years in school. Women of reproductive age, 15 to 44."
    cs2 = sub2['womenschool_scat'].value_counts(sort=False, dropna=False)
    print "Category:" + "\tFrequency"
    print cs2,"\n"
    print "Percentage for Mean years in school. Women of reproductive age, 15 to 44."
    ps2 = sub2['womenschool_scat'].value_counts(sort=False, dropna=False, normalize=True)
    print "Category:" + "\tFrequency %"
    print ps2 * 100
    Mean years in school value range in subaset:  5.9  -  13.2 

    Count for Mean years in school. Women of reproductive age, 15 to 44.
    Category:   Frequency
    (0, 2]       0
    (2, 4]       0
    (4, 6]       1
    (6, 8]       2
    (8, 10]      8
    (10, 12]    12
    (12, 14]    13
    (14, 16]     0
    (16, 18]     0
    Name: womenschool_scat, dtype: int64 

    Percentage for Mean years in school. Women of reproductive age, 15 to 44.
    Category:   Frequency %
    (0, 2]       0.000000
    (2, 4]       0.000000
    (4, 6]       2.777778
    (6, 8]       5.555556
    (8, 10]     22.222222
    (10, 12]    33.333333
    (12, 14]    36.111111
    (14, 16]     0.000000
    (16, 18]     0.000000
    Name: womenschool_scat, dtype: float64

36 countries in the original dataframe have an Income per person less than 6,000 USD and Per capita health expenditure under 1,000 USD with less than 20 child mortality per 1,000 live births. 7 observations has no data of under5mort. Child mortality value is between 6.1 - 19.8, the most common range is 16 - 18 with 22.22% (8 countries). Mean years in schools is 5.9 - 13.2 with the highest ratio of 36.11% (13 countries) in the 12 - 14 years range.