For the Week 3 assignment, I am testing the association between a country's average income per person and its average oil usage per person.
In [1]:
%matplotlib inline
import pandas
import numpy
import seaborn
import scipy.stats  # importing the submodule explicitly so scipy.stats.pearsonr is available
import matplotlib.pyplot as plt
data = pandas.read_csv('gapminder.csv', low_memory=False)
In [2]:
# Convert both variables to numeric; errors='coerce' turns blanks and other
# non-numeric entries into NaN, so no separate replace(' ', NaN) step is needed.
data['oilperperson'] = pandas.to_numeric(data['oilperperson'], errors='coerce')
data['incomeperperson'] = pandas.to_numeric(data['incomeperperson'], errors='coerce')
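To illustrate what the conversion step does with missing entries, here is a minimal sketch. The values in `raw` are made up for the example and are not taken from gapminder.csv; I am assuming missing observations show up as blank or non-numeric strings.

```python
import pandas

# Hypothetical raw column values: numbers stored as strings, plus two
# entries that cannot be parsed as numbers (assumed stand-ins for missing data).
raw = pandas.Series(['1.5', ' ', '3.25', 'n/a'])

# errors='coerce' replaces anything that cannot be parsed with NaN
# instead of raising an error.
converted = pandas.to_numeric(raw, errors='coerce')

print(converted.tolist())
print('missing values: {}'.format(converted.isna().sum()))
```

The converted column is numeric, and the unparseable entries are already marked as missing.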
The following block draws the scatterplot.
In [3]:
scat1 = seaborn.regplot(x="incomeperperson", y="oilperperson", fit_reg=True, data=data)
plt.xlabel('Income per person')
plt.ylabel('Oil usage per person')
plt.title('Scatterplot for the Association Between "Income per person" and "Oil per person"')
plt.show()
The scatterplot indicates that the association between income per person and oil usage per person is positive, but it does not look very strong.
In [4]:
# dropna() removes every row that has a missing value in any column
data_clean = data.dropna()
pearson_coeff, p_value = scipy.stats.pearsonr(data_clean['incomeperperson'], data_clean['oilperperson'])
print('Association between "Income per person" and "Oil per person" values')
print('Pearson correlation coefficient: {}'.format(pearson_coeff))
print('P_Value: {}'.format(p_value))
The p-value is much smaller than 0.05 and the Pearson coefficient is 0.54, so the relationship is statistically significant, although the association is only moderate in strength. It is highly unlikely that a relationship of this magnitude would be due to chance alone.
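Two follow-ups may help when reading this result. First, squaring the coefficient gives the share of variability in one variable that is accounted for by the other. Second, calling `dropna()` on the whole data frame discards rows with gaps in any column, so restricting it to the two variables of interest can keep more observations. The sketch below shows both on a small made-up data frame (`toy` and `othercolumn` are hypothetical names, and the values are not from gapminder.csv). I have not checked how much the second point changes the result on the real data.

```python
import pandas
import scipy.stats

# Made-up illustrative values, NOT taken from gapminder.csv.
toy = pandas.DataFrame({
    'incomeperperson': [1000.0, 2500.0, None, 8000.0, 15000.0, 30000.0],
    'oilperperson': [0.2, 0.4, 0.5, None, 1.1, 1.6],
    'othercolumn': [None, 1.0, 2.0, 3.0, 4.0, 5.0],  # unrelated column with a gap
})

# Drop a row only when one of the two variables being correlated is missing.
pairs = toy.dropna(subset=['incomeperperson', 'oilperperson'])
print('rows kept with subset: {}'.format(len(pairs)))
print('rows kept without subset: {}'.format(len(toy.dropna())))

# Pearson correlation and its square (the coefficient of determination).
r, p = scipy.stats.pearsonr(pairs['incomeperperson'], pairs['oilperperson'])
r_squared = r ** 2
print('r = {:.3f}, r^2 = {:.3f}, p = {:.4f}'.format(r, r_squared, p))

# The same squaring applied to the coefficient reported above.
print('r^2 for r = 0.54: {}'.format(round(0.54 ** 2, 2)))
```

For the coefficient of 0.54 reported above, the squared value is about 0.29, so roughly 29% of the variability in oil usage per person is accounted for by income per person.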