Data Management & Visualization: Frequency Distributions

cs 21 július 2016 by Ernő Gólya

This week I will call in a dataset and run frequency distributions for my chosen variables. I will also demonstrate how to create a subset of the data rows (i.e. observations).

During the course I use Gapminder dataset, which contains quantitative variables. Calculating frequency distribution with continuous data in this case is not very useful. For a continuous variable if we take a class for each distinct value of the variable, the number of classes may become very large, thus defeating the purpose of tabulation. Since a continuous variable can assume an infinite number of values within its range of variation, the classification of such data is necessary.

I use Pandas cut.() function to sort the data into categories. The ranges I choose are arbitrary as at this point the sole purpose of this transformation is to get more discernible results as opposed to display all distinct values separately.

    # -*- coding: utf-8 -*-
    import pandas as pd
    import numpy as np
    # Reading in data file
    data = pd.read_csv('custom_gapminder.csv', low_memory=False)
    # Show the number of rows and columns 
    print len(data) # Number of countries in dataframe
    print len(data.columns) # Number of variables in dataframe
    # print len(data.index) # Another option for displaying (number of) rows in a Dataframe

Output:

    167
    19

My customized dataset contains data for 167 countries (the inner join merge included only those countries that are presented in every data files I used as a source).

The number of variables is 19, which are the variables from the original Gapminder file complemented with under5mort, womenschool and healthexpend.

    # Replace all missing values with 0 by replace() function 
    #data["incomeperperson"] = data["incomeperperson"].replace(0, numpy.nan)
    #data["under5mort"] = data["under5mort"].replace(0, numpy.nan)
    #data["womenschool"] = data["womenschool"].replace(0, numpy.nan)
    #data["healthexpend"] = data["healthexpend"].replace(0, numpy.nan)

I commented out the rows above, because missing values are displayed as NaN already.

    # Creating categories for each variable

    data["incomeperperson"] = pd.to_numeric(data["incomeperperson"],errors='coerce')
    data["incomeperperson_cat"] = pd.cut(data.incomeperperson, range(0, 56000, 6000))

    data["under5mort"] = pd.to_numeric(data["under5mort"],errors='coerce')
    data["under5mort_cat"] = pd.cut(data.under5mort, range(0, 240, 20))

    data["womenschool"] = pd.to_numeric(data["womenschool"],errors='coerce')
    data["womenschool_cat"] = pd.cut(data.womenschool, range(0, 20, 2))

    data["healthexpend"] = pd.to_numeric(data["healthexpend"],errors='coerce')
    data["healthexpend_cat"] = pd.cut(data.healthexpend, range(0, 10000, 1000))

Incomeperperson is sorted in a range of 0 - 56000 in increments of 6000. Underfivemort has a range from 0 to 240 with increments of 20. In the case of womenschool variable I set the maximum of range to 20, with 2 years steps, while healthexpand values vary between an extended 0 - 10000 range divided to 9 categories with an increment of 1000.

    # Generate frequency distributions using the value\_counts() function 
    # Child Mortality frequency tables
    print "Child mortality value range in dataset: ", data['under5mort'].min(),  " - ", data['under5mort'].max(), "\n"
    print "Count for child mortality under the age of 5. Expressed as a rate per 1,000 live births"
    c1 = data['under5mort_cat'].value_counts(sort=False, dropna=False)
    print "Category:" + "\tFrequency"
    print c1,"\n"
    print "Percentage for child mortality under the age of 5. Expressed as a rate per 1,000 live births"
    p1 = data['under5mort_cat'].value_counts(sort=False, dropna=False, normalize=True)
    print "Category:" + "\tFrequency %"
    print p1 * 100

Child mortality value range in dataset:  2.4  -  208.8

Count for child mortality under the age of 5. Expressed as a rate per 1,000 live births
Category:   Frequency
(0, 20]       81
(20, 40]      25
(40, 60]      13
(60, 80]      16
(80, 100]     12
(100, 120]    11
(120, 140]     3
(140, 160]     0
(160, 180]     3
(180, 200]     1
(200, 220]     1
NaN            1
Name: under5mort_cat, dtype: int64

Percentage for child mortality under the age of 5. Expressed as a rate per 1,000 live births
Category:   Frequency %
(0, 20]       48.502994
(20, 40]      14.970060
(40, 60]       7.784431
(60, 80]       9.580838
(80, 100]      7.185629
(100, 120]     6.586826
(120, 140]     1.796407
(140, 160]     0.000000
(160, 180]     1.796407
(180, 200]     0.598802
(200, 220]     0.598802
NaN            0.598802
Name: under5mort_cat, dtype: float64

Child mortality rate under 5 years is the probability that a child born in a specific year will die before reaching the age of five if subject to current age-specific mortality rates. Expressed as a rate per 1,000 live births. In the sample data the minimum value is 2.4 the maximum is 208.8.

In almost half of the countries (48.5%) this value is under 20. In about 39,5% of the countries this rate is above 20 but under 100, while the rest (around 12%) has a value above 100. There is 1 instance of missing data.

    # Mean Years In School frequency tables
    print "Mean years in school value range in dataset: ", data['womenschool'].min(),  " - ", data['womenschool'].max(),"\n"
    print "Count for Mean years in school. Women of reproductive age, 15 to 44."
    c2 = data['womenschool_cat'].value_counts(sort=False, dropna=False)
    print "Category:" + "\tFrequency"
    print c2,"\n"
    print "Percentage for Mean years in school. Women of reproductive age, 15 to 44."
    p2 = data['womenschool_cat'].value_counts(sort=False, dropna=False, normalize=True)
    print "Category:" + "\tFrequency %"
    print p2 * 100

Mean years in school value range in dataset:  0.8  -  14.7

Count for Mean years in school. Women of reproductive age, 15 to 44.
Category:   Frequency
(0, 2]       7
(2, 4]      14
(4, 6]      22
(6, 8]      17
(8, 10]     30
(10, 12]    42
(12, 14]    33
(14, 16]     2
(16, 18]     0
Name: womenschool_cat, dtype: int64

Percentage for Mean years in school. Women of reproductive age, 15 to 44.
Category:   Frequency %
(0, 2]       4.191617
(2, 4]       8.383234
(4, 6]      13.173653
(6, 8]      10.179641
(8, 10]     17.964072
(10, 12]    25.149701
(12, 14]    19.760479
(14, 16]     1.197605
(16, 18]     0.000000
Name: womenschool_cat, dtype: float64

Mean years in school (women of reproductive age, 15 to 44): the average number of years of school attended by all people in the age and gender group specified, including primary, secondary and tertiary education.
Minimum value in dataset is 0.8, maximum is 14.7. In 21 countries (12.5%) the average time women spend in schools is less than 4 years. There is only two countries (1.19%) where this value is more than 14 years.
As we can see the most common category is 10 - 12 years with 25.14%. That means 42 observations. More than 62% of the countries falls in the 8 - 14 years range.

    # Expenditure On Health frequency tables
    print "Per capita total expenditure on health value range in dataset:  ", data['healthexpend'].min(), 
    " - ", data['healthexpend'].max(),"\n"
    print "Count for healthexpend. Per capita total expenditure on health expressed at average exchange rate for that year in US$"
    c3 = data['healthexpend_cat'].value_counts(sort=False, dropna=False)
    print "Category:" + "\tFrequency"
    print c3,"\n"
    print "Percentage for healthexpend. Per capita total expenditure on health expressed at average exchange rate for that year in US$"
    p3 = data['healthexpend_cat'].value_counts(sort=False, dropna=False, normalize=True)
    print "Category:" + "\tFrequency %"
    print p3 * 100

Per capita total expenditure on health value range in dataset:  11.90349972  -  8361.732117

Count for healthexpend. Per capita total expenditure on health expressed at average exchange rate for that year in US$
Category:   Frequency
(0, 1000]       126
(1000, 2000]      8
(2000, 3000]      5
(3000, 4000]      4
(4000, 5000]      8
(5000, 6000]      2
(6000, 7000]      1
(7000, 8000]      1
(8000, 9000]      3
NaN               9
Name: healthexpend_cat, dtype: int64

Percentage for healthexpend. Per capita total expenditure on health expressed at average exchange rate for that year in US$
Category:   Frequency %
(0, 1000]       75.449102
(1000, 2000]     4.790419
(2000, 3000]     2.994012
(3000, 4000]     2.395210
(4000, 5000]     4.790419
(5000, 6000]     1.197605
(6000, 7000]     0.598802
(7000, 8000]     0.598802
(8000, 9000]     1.796407
NaN              5.389222
Name: healthexpend_cat, dtype: float64

Per capita total expenditure on health: per capita total expenditure on health expressed at average exchange rate for that year in US\$.
Range is 11.90 - 8361.73. 9 countries has no data in the dataset.
According to the frequency table in most countries (count: 126, that is 75.44%) the per capita expenditure on health is below 1000 US\$.
One quarter of the observations varies between 1000 - 8361.73 US\$.

    # Income Per Person frequency tables
    print "GDP value range in dataset: ", data['incomeperperson'].min(),  " - ", data['incomeperperson'].max(), "\n"
    print "Count for incomeperperson. 2010 Gross Domestic Product per capita in constant 2000 USD"
    c4 = data['incomeperperson_cat'].value_counts(sort=False, dropna=False)
    print "Category:" + "\tFrequency"
    print c4,"\n"
    print "Percentage for incomeperperson. 2010 Gross Domestic Product per capita in constant 2000 USD"
    p4 = data['incomeperperson_cat'].value_counts(sort=False, dropna=False, normalize=True)
    print "Category:" + "\tFrequency %"
    print p4 * 100

GDP value range in dataset:  103.775857241  -  52301.587179

Count for incomeperperson. 2010 Gross Domestic Product per capita in constant 2000 USD
Category:   Frequency
(0, 6000]         115
(6000, 12000]      16
(12000, 18000]      6
(18000, 24000]      5
(24000, 30000]      9
(30000, 36000]      4
(36000, 42000]      4
(42000, 48000]      0
(48000, 54000]      1
NaN                 7
Name: incomeperperson_cat, dtype: int64

Percentage for incomeperperson. 2010 Gross Domestic Product per capita in constant 2000 USD
Category:   Frequency %
(0, 6000]         68.862275
(6000, 12000]      9.580838
(12000, 18000]     3.592814
(18000, 24000]     2.994012
(24000, 30000]     5.389222
(30000, 36000]     2.395210
(36000, 42000]     2.395210
(42000, 48000]     0.000000
(48000, 54000]     0.598802
NaN                4.191617
Name: incomeperperson_cat, dtype: float64

Incomeperperson: 2010 Gross Domestic Product per capita in constant 2000 US$. The inflation but not the differences in the cost of living between countries has been taken into account.
The lowest reported value is 103.77 while the highest is 52301.58 (504 times more than the minimal value).
Data is missing from 7 countries.
The lowest range (0 - 6000) is the most common among the countries: 115 countries which is 68.86% of the observations.
The next highest category is 6000 - 12000 with 16 countries (9.58%).
The remaining 29 countries (21,56%) are distributed in the 12000 - 52301.58 range.

Creating subset of a dataframe

This is an example of narrowing down the original dataframe to a subset of rows that meet a certain criteria.

In this case the frequency analysis revealed that 68.86% of the countries in the dataframe fell into the lowest (artificial) category of the Income per person variable.

The subset below contains those countries where the value of this indicator is below 6,000, while the Per capita total expenditure on health is under 1,000 and Child mortality under 5 years is less than 20 per 1,000 live births.

# Create subset of countries with the lowest Income per person and Health expenditure per person values
sub1 = data[(data['incomeperperson'] < 6000) & (data['healthexpend'] < 1000) & (data['under5mort'] < 20)]
sub2 = sub1.copy()
sub2[['country','under5mort','womenschool','healthexpend','incomeperperson']]
# Number of countries in subset
print len(sub2.index)

Subset of countries

	country	under5mort	womenschool	healthexpend	incomeperperson
1	Albania	16.6	10.7	240.824785	1914.996551
6	Armenia	18.0	11.4	133.476761	1326.741757
13	Belarus	6.1	12.4	319.640401	2737.670379
15	Belize	19.0	9.4	238.823592	3545.652174
18	Bosnia and Herzegovina	7.5	9.7	499.207872	2183.344867
20	Brazil	16.7	9.0	990.394075	4699.411262
21	Bulgaria	13.0	12.4	434.893145	2549.558474
30	China	15.7	8.5	220.876645	2425.471293
31	Colombia	18.5	8.4	472.152223	3233.423780
35	Costa Rica	10.2	10.4	811.438052	5188.900935
38	Cuba	6.6	12.2	607.027219	4495.046262
54	Georgia	16.6	12.7	271.625472	1258.762596
64	Hungary	6.6	12.1	942.344999	5634.003948
67	Iran	19.2	8.2	316.925387	2161.546510
72	Jamaica	18.1	11.6	247.279979	3665.348369
79	Latvia	9.4	12.3	717.580855	5011.219456
83	Lithuania	6.8	12.6	781.419680	5332.238591
85	Macedonia, FYR	9.9	11.2	316.870773	2221.185664
88	Malaysia	8.3	11.2	367.918603	5184.709328
89	Maldives	13.0	5.9	382.475004	4038.857818
93	Mauritius	15.2	10.1	448.927451	5182.143721
95	Moldova	17.1	12.4	190.413037	595.874535
97	Montenegro	6.8	11.6	578.348772	2222.335052
111	Panama	19.8	11.4	616.386537	5900.616944
119	Romania	13.9	12.5	427.999858	2636.787800
120	Russia	12.0	13.2	525.318074	2923.144355
122	Samoa	18.7	12.9	203.978640	1784.071284
126	Serbia	7.6	10.9	546.027710	1194.711433
136	Sri Lanka	10.6	10.6	69.959105	1295.742686
142	Syria	15.3	7.6	96.585010	1525.780116
146	Thailand	14.5	8.9	179.146529	2712.517199
149	Tonga	17.4	12.6	171.799052	2025.282665
151	Tunisia	17.4	7.1	237.837072	3164.927693
152	Turkey	19.1	8.2	678.113325	5348.597192
155	Ukraine	11.8	12.9	234.356614	1036.830725
162	Venezuela	16.6	10.5	663.387571	5528.363114

Number of countries in subset: 36

Creating new ranges for the variables

    sub2["incomeperperson_scat"] = pd.cut(sub2.incomeperperson, range(0, 6000, 1000))
    sub2["under5mort_scat"] = pd.cut(sub2.under5mort, range(0, 20, 2))
    sub2["womenschool_scat"] = pd.cut(sub2.womenschool, range(0, 20, 2))
    sub2["healthexpend_scat"] = pd.cut(sub2.healthexpend, range(0, 1000, 100))

Example frequency distribution of subset

    print "Child mortality value range in subset: ", sub2['under5mort'].min(),  " - ", sub2['under5mort'].max(), "\n"
    print "Count for child mortality under the age of 5. Expressed as a rate per 1,000 live births"
    cs1 = sub2['under5mort_scat'].value_counts(sort=False, dropna=False)
    print "Category:" + "\tFrequency"
    print cs1,"\n"
    print "Percentage for child mortality under the age of 5. Expressed as a rate per 1,000 live births"
    ps1 = sub2['under5mort_scat'].value_counts(sort=False, dropna=False, normalize=True)
    print "Category:" + "\tFrequency %"
    print ps1 * 100

    Child mortality value range in subset:  6.1  -  19.8 

    Count for child mortality under the age of 5. Expressed as a rate per 1,000 live births
    Category:   Frequency
    (0, 2]      0
    (2, 4]      0
    (4, 6]      0
    (6, 8]      7
    (8, 10]     3
    (10, 12]    4
    (12, 14]    3
    (14, 16]    4
    (16, 18]    8
    NaN         7
    Name: under5mort_scat, dtype: int64 

    Percentage for child mortality under the age of 5. Expressed as a rate per 1,000 live births
    Category:   Frequency %
    (0, 2]       0.000000
    (2, 4]       0.000000
    (4, 6]       0.000000
    (6, 8]      19.444444
    (8, 10]      8.333333
    (10, 12]    11.111111
    (12, 14]     8.333333
    (14, 16]    11.111111
    (16, 18]    22.222222
    NaN         19.444444
    Name: under5mort_scat, dtype: float64

    print "Mean years in school value range in subaset: ", sub2['womenschool'].min(),  " - ", sub2['womenschool'].max(),"\n"
    print "Count for Mean years in school. Women of reproductive age, 15 to 44."
    cs2 = sub2['womenschool_scat'].value_counts(sort=False, dropna=False)
    print "Category:" + "\tFrequency"
    print cs2,"\n"
    print "Percentage for Mean years in school. Women of reproductive age, 15 to 44."
    ps2 = sub2['womenschool_scat'].value_counts(sort=False, dropna=False, normalize=True)
    print "Category:" + "\tFrequency %"
    print ps2 * 100

    Mean years in school value range in subaset:  5.9  -  13.2 

    Count for Mean years in school. Women of reproductive age, 15 to 44.
    Category:   Frequency
    (0, 2]       0
    (2, 4]       0
    (4, 6]       1
    (6, 8]       2
    (8, 10]      8
    (10, 12]    12
    (12, 14]    13
    (14, 16]     0
    (16, 18]     0
    Name: womenschool_scat, dtype: int64 

    Percentage for Mean years in school. Women of reproductive age, 15 to 44.
    Category:   Frequency %
    (0, 2]       0.000000
    (2, 4]       0.000000
    (4, 6]       2.777778
    (6, 8]       5.555556
    (8, 10]     22.222222
    (10, 12]    33.333333
    (12, 14]    36.111111
    (14, 16]     0.000000
    (16, 18]     0.000000
    Name: womenschool_scat, dtype: float64

36 countries in the original dataframe have an Income per person less than 6,000 USD and Per capita health expenditure under 1,000 USD with less than 20 child mortality per 1,000 live births. 7 observations has no data of under5mort. Child mortality value is between 6.1 - 19.8, the most common range is 16 - 18 with 22.22% (8 countries). Mean years in schools is 5.9 - 13.2 with the highest ratio of 36.11% (13 countries) in the 12 - 14 years range.