July 2015
Written by Susan Chen at NYU Stern with help from Professor David Backus
Contact: jiachen2017@u.northwestern.edu
Since 2000, the Programme for International Student Assessment (PISA) has been administered every three years to evaluate education systems around the world. It also gathers family and education background information through surveys. The test, which assesses 15-year-old students in reading, math, and science, is administered to a total of around 510,000 students in 65 countries. The duration of the test is two hours, and it contains a mix of open-ended and multiple-choice questions. Learn more about the test here.
I am interested in seeing whether a nation's wealth is correlated with its PISA scores. Do wealthier countries generally attain higher scores, and if so, to what extent? I use GDP per capita as the measure of wealth because total GDP is sensitive to population size; GDP per capita should, in theory, allow us to compare large countries (in terms of geography or population) with small ones.
For the relationship between the log of GDP per capita and each component of the PISA, the r-squared values from an OLS regression, which measure the share of the variation in scores that the model explains, are 0.57, 0.63, and 0.57 for reading, math, and science, respectively. Qatar and Vietnam are outliers and are excluded from the model.
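To make the r-squared figure concrete, here is a minimal sketch of how it can be computed by hand for a one-variable regression. The function name `r_squared` and the toy numbers are assumptions made for illustration; they are not the PISA or World Bank data used in this notebook.

```python
import numpy as np

def r_squared(x, y):
    """R-squared of a one-variable least-squares fit of y on x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)   # fit the least-squares line
    residuals = y - (intercept + slope * x)
    ss_res = np.sum(residuals ** 2)          # variation left unexplained
    ss_tot = np.sum((y - y.mean()) ** 2)     # total variation in y
    return 1.0 - ss_res / ss_tot

# Made-up numbers for illustration only (not real scores or GDP figures)
toy_log_gdp = np.array([8.5, 9.0, 9.5, 10.0, 10.5, 11.0])
toy_score = np.array([400.0, 430.0, 445.0, 480.0, 490.0, 520.0])
toy_r2 = r_squared(toy_log_gdp, toy_score)
print(round(toy_r2, 3))
```

A value near 1 means the fitted line accounts for most of the variation in scores; a value near 0 means it accounts for very little.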
I use matplotlib.pyplot to draw scatter plots. I use pandas, a Python package that allows for fast data manipulation and analysis, to organize my dataset. I access World Bank data through the pandas remote data access API, pandas.io.wb. I also use numpy, a Python package for scientific computing, for the log transformation needed to fit the data more appropriately. Lastly, I use statsmodels.formula.api, a Python module for statistical modeling, to run the OLS linear regressions.
In [4]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
from pandas.io import wb
PISA 2012 scores were downloaded as an Excel file from the statlink on page 21 of the published PISA key findings. I deleted the explanatory text surrounding the table, kept only the "Mean Score in PISA 2012" column for each subject, and saved the file as a CSV. I then read the file into pandas and renamed the columns.
In [5]:
file1 = '/users/susan/desktop/PISA/PISA2012clean.csv' # file location
df1 = pd.read_csv(file1)
#pandas remote data access API for World Bank GDP per capita data
df2 = wb.download(indicator='NY.GDP.PCAP.PP.KD', country='all', start=2012, end=2012)
In [6]:
df1
Out[6]:
In [466]:
#drop multilevel index
df2.index = df2.index.droplevel('year')
In [467]:
df1.columns = ['Country','Math','Reading','Science']
df2.columns = ['GDPpc']
In [468]:
#combine PISA and GDP datasets based on country column
df3 = pd.merge(df1, df2, how='left', left_on = 'Country', right_index = True)
In [469]:
df3.columns = ['Country','Math','Reading','Science','GDPpc']
In [470]:
#drop rows with missing GDP per capita values
df3 = df3[pd.notnull(df3['GDPpc'])]
In [471]:
print(df3)
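A left merge on country names leaves GDP missing wherever the two sources spell a name differently, and the drop step above then removes those rows without comment. As a hedged sketch, the `indicator` option of `pd.merge` can list the unmatched names before they are dropped. The frames `pisa` and `gdp` and all of their values are made up for illustration; the name mismatch shown is only a plausible example.

```python
import pandas as pd

# Made-up values for illustration only; not the notebook's data
pisa = pd.DataFrame({
    'Country': ['Finland', 'Korea', 'Mexico'],
    'Math': [500, 550, 410],
})
gdp = pd.DataFrame(
    {'GDPpc': [38000.0, 32000.0, 16000.0]},
    index=pd.Index(['Finland', 'Korea, Rep.', 'Mexico'], name='country'),
)

# Same style of merge as above, with a _merge column recording the match status
merged = pd.merge(pisa, gdp, how='left', left_on='Country',
                  right_index=True, indicator=True)

# Rows found only in the left (PISA) table have no GDP value
unmatched = merged.loc[merged['_merge'] == 'left_only', 'Country'].tolist()
print(unmatched)
```

Any names listed this way could be renamed in one of the tables before merging, so that a country is not lost only because of a spelling difference.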
I initially plotted the data and ran the regression without excluding any outliers. The resulting r-squared values for reading, math, and science were 0.29, 0.32, and 0.27, respectively. Looking at the scatter plot, there seem to be two obvious outliers, Qatar and Vietnam. I decided to exclude the data for these two countries because the remaining countries do seem to form a trend. I found upon excluding them that the correlation between GDP per capita and scores was much higher.
Qatar is an outlier because it placed relatively low, 63rd out of the 65 countries, despite a relatively high GDP per capita of about $131,000. Qatar has a high GDP per capita for a country of just 1.8 million people, only 13% of whom are Qatari nationals. It is a high-income economy because it holds some of the world's largest natural gas and oil reserves.
Vietnam is an outlier because it placed relatively high, 17th out of the 65 countries, despite a relatively low GDP per capita of about $4,900. Vietnam's high score may be due to the government's investment in education and to the uniformity of classroom professionalism and discipline found across the country. At the same time, rote learning is emphasized much more than creative thinking, and many disadvantaged students are forced to drop out before they would be tested; both factors may also help account for the high score.
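The effect of a couple of extreme points on r-squared can be illustrated with a small synthetic example. This is a sketch on made-up data, not a re-run of the analysis; it relies on the fact that, for a regression with a single explanatory variable, r-squared equals the squared Pearson correlation. The helper name `r2` is an assumption for illustration.

```python
import numpy as np

def r2(x, y):
    """Squared Pearson correlation, equal to OLS r-squared with one regressor."""
    return np.corrcoef(x, y)[0, 1] ** 2

# Ten synthetic points lying close to a straight line
x = np.arange(10, dtype=float)
noise = np.array([0.5, -0.5, 0.3, -0.3, 0.4, -0.4, 0.2, -0.2, 0.1, -0.1])
y = 3.0 * x + 5.0 + noise

# Add two points far from the trend: one high at low x, one low at high x
x_all = np.append(x, [0.5, 9.5])
y_all = np.append(y, [40.0, 0.0])

r2_clean = r2(x, y)
r2_all = r2(x_all, y_all)
print(round(r2_clean, 2), round(r2_all, 2))
```

The two added points pull the fit away from the trend that the other points share, which is the same pattern seen when Qatar and Vietnam are left in the sample.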
In [472]:
df3.index = df3.Country #set country column as the index
df3 = df3.drop(['Qatar', 'Vietnam']) # drop the two outliers
In [473]:
Reading = df3.Reading
Science = df3.Science
Math = df3.Math
GDP = np.log(df3.GDPpc)
#PISA reading vs GDP per capita
plt.scatter(x = GDP, y = Reading, color = 'r')
plt.title('PISA 2012 Reading scores vs. GDP per capita')
plt.xlabel('GDP per capita (log)')
plt.ylabel('PISA Reading Score')
plt.show()
#PISA math vs GDP per capita
plt.scatter(x = GDP, y = Math, color = 'b')
plt.title('PISA 2012 Math scores vs. GDP per capita')
plt.xlabel('GDP per capita (log)')
plt.ylabel('PISA Math Score')
plt.show()
#PISA science vs GDP per capita
plt.scatter(x = GDP, y = Science, color = 'g')
plt.title('PISA 2012 Science scores vs. GDP per capita')
plt.xlabel('GDP per capita (log)')
plt.ylabel('PISA Science Score')
plt.show()
In [474]:
lm = smf.ols(formula='Reading ~ GDP', data=df3).fit()
lm.params
lm.summary()
Out[474]:
In [475]:
lm2 = smf.ols(formula='Math ~ GDP', data=df3).fit()
lm2.params
lm2.summary()
Out[475]:
In [476]:
lm3 = smf.ols(formula='Science ~ GDP', data=df3).fit()
lm3.params
lm3.summary()
Out[476]:
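Rather than reading r-squared off each summary table, the value can be pulled from a fitted statsmodels result through its `rsquared` attribute. The sketch below does this for three subjects in a loop. It runs on synthetic data, and the frame `toy` and the column name `logGDP` are assumptions for illustration, not the notebook's variables.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data for illustration only (seeded so the run is repeatable)
rng = np.random.default_rng(0)
n = 30
log_gdp = np.linspace(8.5, 11.5, n)
toy = pd.DataFrame({'logGDP': log_gdp})
subjects = ['Reading', 'Math', 'Science']
for subject in subjects:
    toy[subject] = 50.0 + 40.0 * log_gdp + rng.normal(0.0, 20.0, n)

# Fit one OLS model per subject and keep only its r-squared
r2_by_subject = {
    subject: smf.ols(formula=subject + ' ~ logGDP', data=toy).fit().rsquared
    for subject in subjects
}
for subject, value in r2_by_subject.items():
    print(subject, round(value, 2))
```

Putting the log of GDP per capita in its own column of the data frame, as `logGDP` is here, also keeps the formula from depending on a separate variable defined elsewhere in the notebook.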
The results show that countries with a higher GDP per capita tend to score higher on the PISA, although correlation does not imply causation. GDP per capita reflects only a country's potential to direct financial resources towards education, not how much is actually allocated to it. While the correlation is not weak, it is not strong enough to conclude that greater national wealth leads to a better education system. Deviations from the trend line show that countries with similar PISA performance can differ greatly in GDP per capita; the two outliers, Vietnam and Qatar, are examples of this. At the same time, high scores are not necessarily indicative of a great educational system. Many other factors, such as secondary school enrollment, need to be taken into consideration when evaluating a country's educational system, and this provides a great opportunity for further research.
PISA 2012 scores are downloaded from the statlink on page 21 of the published PISA key findings.
GDP per capita data is accessed through the World Bank API for pandas. Documentation is found here. GDP per capita is based on PPP and is in constant 2011 international dollars (indicator: NY.GDP.PCAP.PP.KD).