Data analysis tools - Pearson correlation coefficient

For the Week 3 assignment I'm testing the association between a country's average income per person and its average oil usage per person.



In [1]:

    
%matplotlib inline

import pandas
import numpy
import seaborn
import scipy.stats  # the stats submodule must be imported explicitly for pearsonr
import matplotlib.pyplot as plt

data = pandas.read_csv('gapminder.csv', low_memory=False)



In [2]:

    
# Convert both variables to numeric; errors='coerce' turns blank or
# non-numeric entries into NaN, so no separate replace(' ', NaN) step is needed
data['oilperperson'] = pandas.to_numeric(data['oilperperson'], errors='coerce')
data['incomeperperson'] = pandas.to_numeric(data['incomeperperson'], errors='coerce')
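The conversion above relies on `errors='coerce'` to deal with blank entries. A minimal sketch of that behaviour, using a hypothetical toy column rather than the real gapminder data:

```python
import pandas

# Hypothetical toy values: two numeric strings and one blank entry
raw = pandas.Series(['1200.5', ' ', '830'])

# Entries that cannot be parsed as numbers become NaN instead of raising
converted = pandas.to_numeric(raw, errors='coerce')
n_missing = int(converted.isna().sum())

print(converted)
print("Missing values after conversion:", n_missing)
```

Counting the NaN values this way is a quick check of how many observations each variable loses before any analysis is run.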

The following block draws a scatterplot of the two variables with a fitted regression line.



In [3]:

    
scat1 = seaborn.regplot(x="incomeperperson", y="oilperperson", fit_reg=True, data=data)
plt.xlabel('Income per person')
plt.ylabel('Oil usage per person')
plt.title('Scatterplot for the Association Between "Income per person" and "Oil per person"')
plt.show()

The diagram suggests that the association between income per person and oil usage per person is positive, but it doesn't look very strong.



In [4]:

    
# pearsonr cannot handle NaN, so drop incomplete rows first
# (dropna() without arguments drops rows with a missing value in any column)
data_clean = data.dropna()
pearson_coeff, p_value = scipy.stats.pearsonr(data_clean['incomeperperson'], data_clean['oilperperson'])
print('Association between "Income per person" and "Oil per person" values')
print("Pearson correlation coefficient: ", pearson_coeff)
print("P_Value: ", p_value)

Association between "Income per person" and "Oil per person" values
Pearson correlation coefficient:  0.541892508737
P_Value:  6.4737526716e-06
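The cell above calls `dropna()` without arguments, which discards any row that has a missing value in any column, not only in the two columns being correlated. A minimal sketch of the difference, using a hypothetical toy frame (`othercolumn` is a made-up name); applying the `subset` variant to the real data may keep more countries and therefore give different numbers from those printed above:

```python
import numpy
import pandas

# Hypothetical toy frame; 'othercolumn' stands in for unrelated variables
toy = pandas.DataFrame({
    'incomeperperson': [1000.0, 2500.0, numpy.nan, 8000.0],
    'oilperperson': [0.2, 0.5, 0.9, numpy.nan],
    'othercolumn': [numpy.nan, 1.0, 2.0, 3.0],
})

# Drops every row that has a NaN in any column
all_cols = toy.dropna()

# Drops only rows missing one of the two variables of interest
two_cols = toy.dropna(subset=['incomeperperson', 'oilperperson'])

print("Rows kept by dropna():", len(all_cols))
print("Rows kept by dropna(subset=...):", len(two_cols))
```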

The p-value is much smaller than 0.05, so the relationship is statistically significant: it's highly unlikely that an association of this magnitude would be due to chance alone. The Pearson coefficient of 0.54 indicates that the association is positive but only moderate in strength.
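One common way to put the strength of the association in perspective is to square the coefficient, which gives the fraction of variability in one variable that a linear relationship with the other would account for. A small sketch using the coefficient printed above:

```python
# Coefficient copied from the printed output above
pearson_coeff = 0.541892508737

# r squared: share of variance accounted for by the linear relationship
r_squared = pearson_coeff ** 2

print("r squared: %.3f" % r_squared)
```

This comes to roughly 0.29, so under a linear model income per person would account for a little under a third of the variability in oil usage per person.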


