Pearson Correlation

Wed 03 August 2016 by Ernő Gólya

This is part of my coursework for the Data Analysis Tools course by Wesleyan University on Coursera.

Week 3 assignment: Generate a correlation coefficient.

The Pearson correlation coefficient (r) measures the strength and direction of the linear association between two quantitative variables. Its value ranges from -1.0 to 1.0; a calculated correlation outside that range means a mistake has been made. A correlation of -1.0 indicates a perfect negative linear relationship between the two variables, while a correlation of +1.0 indicates a perfect positive linear relationship. In both cases, knowing the value of one variable lets one predict the value of the other exactly.
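
To make the definition concrete, here is a minimal sketch that computes r directly from its formula: the sum of the products of the deviations from the means, divided by the square root of the product of the summed squared deviations. The function name `pearson_r` and the toy arrays are illustrative assumptions and are not part of the assignment data.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson r computed from deviations around the means."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))

# Hypothetical toy data, not taken from Gapminder
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_up = 2.0 * x + 1.0       # perfect positive linear relationship
y_down = -3.0 * x + 10.0   # perfect negative linear relationship

r_up = pearson_r(x, y_up)
r_down = pearson_r(x, y_down)
print(r_up, r_down)        # values at (or within rounding error of) +1 and -1
```

For real data the result should agree with `np.corrcoef(x, y)[0, 1]` and with the first value returned by `scipy.stats.pearsonr`.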

Data source: Gapminder World.

I want to compare child mortality rates against years of schooling for women for the 154 countries in the data set. Both the response variable (under-five mortality rate per 1,000 live births) and the explanatory variable (mean years of schooling for women, age 15 to 44) are quantitative, so the Pearson correlation coefficient (r) can be used.

The scatterplot for the two variables seems to show a negative linear correlation:

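[Figure: scatterplot of under-five mortality (per 1,000 live births) against mean years of schooling for women, with fitted regression line]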

    Output of pearsonr() function:
    Association between mean years of schooling for women (age 15 to 44) and child mortality
    (-0.78566944724891918, 1.6315641063941191e-33)

The correlation coefficient is approximately -0.79, indicating a strong negative linear relationship.

The r^2^ value (coefficient of determination) is 0.617, meaning that 61.7% of the variability in the child mortality rate is explained by the variation in female education.
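
The coefficient of determination is simply the square of r, so it can be checked with a couple of lines. This sketch uses the unrounded coefficient from the pearsonr() output above. Note that rounding or truncating r before squaring shifts the result: (-0.78)^2 is about 0.608, while the unrounded coefficient gives about 0.617.

```python
# Correlation coefficient as printed by pearsonr() in the output above
r = -0.78566944724891918

r_squared = r ** 2
print("r^2 = {:.3f}".format(r_squared))                       # r^2 = 0.617
print("{:.1f}% of variability explained".format(100 * r_squared))  # 61.7%
```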

The p-value is 1.63e-33, far below the conventional 0.05 threshold, indicating that the correlation is statistically significant.
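
For readers wondering where such a p-value comes from, the sketch below shows the textbook route: convert r into a t statistic with n - 2 degrees of freedom and take the two-sided tail probability. This is an illustration under the usual normality assumption, not a description of scipy's internal code; the helper name `pearson_p_value` and the synthetic data are hypothetical.

```python
import numpy as np
from scipy import stats

def pearson_p_value(r, n):
    """Two-sided p-value for H0: no linear correlation, via a t statistic."""
    t_stat = r * np.sqrt((n - 2) / (1.0 - r ** 2))
    return 2.0 * stats.t.sf(abs(t_stat), df=n - 2)

# Hypothetical synthetic data, used only to compare with scipy's result
rng = np.random.RandomState(0)
x = rng.normal(size=50)
y = 0.5 * x + rng.normal(size=50)

r, p_scipy = stats.pearsonr(x, y)
p_manual = pearson_p_value(r, len(x))
print(p_scipy, p_manual)   # the two values should agree closely
```

The larger the sample and the further r is from zero, the larger the t statistic becomes and the smaller the resulting p-value.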

This suggests that a higher level of female education in a country is correlated with a lower recorded child mortality rate, and the strength of association between the variables is high (r = -0.79).

Note that Pearson's r is sensitive to outliers, which can have a very large effect on the line of best fit and on the correlation coefficient, leading to very different conclusions about the data. It is therefore best if there are no outliers, or if they are kept to a minimum. Dealing with outliers is outside the scope of this post.
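
The effect of a single outlier is easy to demonstrate on made-up numbers. In the sketch below (hypothetical toy data, not the Gapminder set), ten points lie exactly on a falling line, and adding one extreme point is enough to flip the sign of r.

```python
import numpy as np

# Hypothetical toy data: ten points on a perfectly falling line
x = np.arange(1.0, 11.0)   # 1, 2, ..., 10
y = 11.0 - x               # 10, 9, ..., 1

r_clean = np.corrcoef(x, y)[0, 1]

# Add a single extreme observation far from the rest of the data
x_out = np.append(x, 30.0)
y_out = np.append(y, 30.0)

r_outlier = np.corrcoef(x_out, y_out)[0, 1]

print("without outlier: r = {:.2f}".format(r_clean))
print("with one outlier: r = {:.2f}".format(r_outlier))
```

Here r moves from -1 to roughly +0.74, which is why a scatterplot should always be inspected alongside the coefficient.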

Python code

    # -*- coding: utf-8 -*-
    # Created on 03/08/2016
    # Author Ernő Gólya

    # IPython/Jupyter magic: render plots inline (remove when running as a plain script)
    %matplotlib inline
    # import libraries
    import pandas as pd
    import numpy as np
    import seaborn
    import scipy.stats  # pearsonr lives in the stats submodule; a bare "import scipy" may not load it
    import matplotlib.pyplot as plt

    # read in data file
    data = pd.read_csv('custom_gapminder_2.csv', low_memory=False)

    # set variables to numeric
    data["incomeperperson"] = pd.to_numeric(data["incomeperperson"], errors='coerce')
    data["under5mort"] = pd.to_numeric(data["under5mort"], errors='coerce')
    data["womenschool"] = pd.to_numeric(data["womenschool"], errors='coerce')
    data["healthexpend"] = pd.to_numeric(data["healthexpend"], errors='coerce')

    # remove observations with NaN values (dropna() checks every column of the file,
    # not only the four variables converted above)
    data = data.dropna()

    # generate correlation coefficients; pearsonr() returns (r, p-value)
    print("Association between mean years of schooling for women (age 15 to 44) and child mortality")
    print(scipy.stats.pearsonr(data['womenschool'], data['under5mort']))
    print("")
    print("Association between income per person (US$) and child mortality")
    print(scipy.stats.pearsonr(data['incomeperperson'], data['under5mort']))
    print("")
    print("Association between per capita total expenditure on health and child mortality")
    print(scipy.stats.pearsonr(data['healthexpend'], data['under5mort']))

    # scatterplot with fitted regression line: women's schooling vs. child mortality
    fig = plt.figure(figsize=(10, 7), dpi=80, facecolor='w', edgecolor='k')
    scat1 = seaborn.regplot(x='womenschool', y='under5mort', fit_reg=True, data=data)
    plt.xlabel('Mean years of schooling for women')
    plt.ylabel('Children dying before the age of 5 per 1,000 live births')
    plt.title('Association between mean years of schooling for women (age 15 to 44) and child mortality');

    # scatterplot with fitted regression line: income per person vs. child mortality
    fig = plt.figure(figsize=(10, 7), dpi=80, facecolor='w', edgecolor='k')
    scat3 = seaborn.regplot(x='incomeperperson', y='under5mort', fit_reg=True, data=data)
    plt.xlabel('2010 GDP per capita (US$)')
    plt.ylabel('Children dying before the age of 5 per 1,000 live births')
    plt.title('Association between income per person (US$) and child mortality');

    # scatterplot with fitted regression line: health expenditure vs. child mortality
    fig = plt.figure(figsize=(10, 7), dpi=80, facecolor='w', edgecolor='k')
    scat2 = seaborn.regplot(x='healthexpend', y='under5mort', fit_reg=True, data=data)
    plt.xlabel('Per capita expenditure on health (US$)')
    plt.ylabel('Children dying before the age of 5 per 1,000 live births')
    plt.title('Association between per capita total expenditure on health and child mortality');