Is there a relationship between GDP per capita and PISA scores?

July 2015

Written by Susan Chen at NYU Stern with help from Professor David Backus

Contact: jiachen2017@u.northwestern.edu

About PISA

The Programme for International Student Assessment (PISA) has been administered every three years since 2000 to evaluate education systems around the world. The test assesses 15-year-old students in reading, math, and science, and accompanying surveys gather information on their family and educational background. It is administered to around 510,000 students in 65 countries, lasts two hours, and contains a mix of open-ended and multiple-choice questions. Learn more about the test here.

I am interested in whether a nation's wealth is correlated with its PISA scores. Do wealthier countries generally attain higher scores, and if so, to what extent? I use GDP per capita as the measure of wealth because total GDP depends heavily on population size. Dividing by population should, in theory, allow larger countries (in terms of geography or population) to be compared with smaller ones.

Abstract

An OLS regression of each PISA component on the log of GDP per capita gives r-squared values, which reflect how well the model fits the data, of 0.57, 0.63, and 0.57 for reading, math, and science, respectively. Two outliers, Qatar and Vietnam, are excluded from the model.

Packages Imported

I use matplotlib.pyplot to draw scatter plots. I use pandas, a Python package for fast data manipulation and analysis, to organize my dataset. I access World Bank data through the remote data access API for pandas, pandas.io.wb. I also use numpy, a Python package for scientific computing, to take the log of GDP per capita so that the model fits the data more appropriately. Lastly, I use statsmodels.formula.api, a Python module for statistical modeling, to run the OLS linear regressions.


In [4]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
from pandas.io import wb

Creating the Dataset

PISA 2012 scores are downloaded as an Excel file from the statlink on page 21 of the published PISA key findings. I deleted the explanatory text surrounding the table, kept only the "Mean Score in PISA 2012" column for each subject, and saved the file as a CSV. Then I read the file into pandas and renamed the columns.
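As an alternative to renaming the columns in a later cell, the names could be assigned while the file is read. The sketch below is illustrative only: it reads a three-row, in-memory stand-in for the cleaned CSV (the header text is an assumption based on the table printed in Out[6]), and the name `scores` is hypothetical.

```python
import io
import pandas as pd

# In-memory stand-in for the cleaned CSV; the real file has 66 rows
raw = io.StringIO(
    ",Math Mean score in PISA 2012,Reading Mean score in PISA 2012,Science Mean score in PISA 2012\n"
    "OECD average,494,496,501\n"
    "Shanghai-China,613,570,580\n"
    "Singapore,573,542,551\n"
)

# header=0 tells pandas to discard the file's own header row and use `names` instead
scores = pd.read_csv(raw, header=0, names=["Country", "Math", "Reading", "Science"])
print(scores)
```

Passing `header=0` together with `names` replaces the long original headers in one step, so no separate renaming cell is needed.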


In [5]:
file1 = '/users/susan/desktop/PISA/PISA2012clean.csv' # file location
df1 = pd.read_csv(file1)

#pandas remote data access API for World Bank GDP per capita data
df2 = wb.download(indicator='NY.GDP.PCAP.PP.KD', country='all', start=2012, end=2012)

In [6]:
df1


Out[6]:
Unnamed: 0 Math Mean score in PISA 2012 Reading Mean score in PISA 2012 Science Mean score in PISA 2012
0 OECD average 494 496 501
1 Shanghai-China 613 570 580
2 Singapore 573 542 551
3 Hong Kong SAR, China 561 545 555
4 Chinese Taipei 560 523 523
5 Korea, Rep. 554 536 538
6 Macao SAR, China 538 509 521
7 Japan 536 538 547
8 Liechtenstein 535 516 525
9 Switzerland 531 509 515
10 Netherlands 523 511 522
11 Estonia 521 516 541
12 Finland 519 524 545
13 Canada 518 523 525
14 Poland 518 518 526
15 Belgium 515 509 505
16 Germany 514 508 524
17 Vietnam 511 508 528
18 Austria 506 490 506
19 Australia 504 512 521
20 Ireland 501 523 522
21 Slovenia 501 481 514
22 Denmark 500 496 498
23 New Zealand 500 512 516
24 Czech Republic 499 493 508
25 France 495 505 499
26 United Kingdom 494 499 514
27 Iceland 493 483 478
28 Latvia 491 489 502
29 Luxembourg 490 488 491
... ... ... ... ...
36 United States 481 498 497
37 Lithuania 479 477 496
38 Sweden 478 483 485
39 Hungary 477 488 494
40 Croatia 471 485 491
41 Israel 466 486 470
42 Greece 453 477 467
43 Serbia 449 446 445
44 Turkey 448 475 463
45 Romania 445 438 439
46 Cyprus 440 449 438
47 Bulgaria 439 436 446
48 United Arab Emirates 434 442 448
49 Kazakhstan 432 393 425
50 Thailand 427 441 444
51 Chile 423 441 445
52 Malaysia 421 398 420
53 Mexico 413 424 415
54 Montenegro 410 422 410
55 Uruguay 409 411 416
56 Costa Rica 407 441 429
57 Albania 394 394 397
58 Brazil 391 410 405
59 Argentina 388 396 406
60 Tunisia 388 404 398
61 Jordan 386 399 409
62 Colombia 376 403 399
63 Qatar 376 388 384
64 Indonesia 375 396 382
65 Peru 368 384 373

66 rows × 4 columns


In [466]:
#drop multilevel index 
df2.index = df2.index.droplevel('year')
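To see what dropping the `year` level does, here is a minimal sketch on a two-row stand-in for the World Bank download. The toy frame is built by hand (values rounded from the table printed further down), and the level names `country` and `year` are assumed to match what `wb.download` returns.

```python
import pandas as pd

# Hand-built stand-in for the downloaded frame: one row per (country, year) pair
idx = pd.MultiIndex.from_tuples(
    [("Japan", "2012"), ("Peru", "2012")], names=["country", "year"]
)
toy = pd.DataFrame({"NY.GDP.PCAP.PP.KD": [34987.6, 10912.6]}, index=idx)

# Only one year was requested, so the year level carries no information
toy.index = toy.index.droplevel("year")
print(toy.index.tolist())
```

After the drop, the index holds country names only, which is what allows the later merge to match on the country column.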

In [467]:
df1.columns = ['Country','Math','Reading','Science']
df2.columns = ['GDPpc']

In [468]:
#combine PISA and GDP datasets based on country column  
df3 = pd.merge(df1, df2, how='left', left_on = 'Country', right_index = True)

In [469]:
df3.columns = ['Country','Math','Reading','Science','GDPpc']

In [470]:
#drop rows with missing GDP per capita values
df3 = df3[pd.notnull(df3['GDPpc'])]
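Rows can lose their GDP value simply because the two sources spell a country's name differently, so it may be worth listing which rows failed to match before dropping them. The following is a sketch on hypothetical three-row data, not the notebook's actual frames. It uses the `indicator` option of `pd.merge`, and the names `scores`, `gdp`, `unmatched`, and `kept` are made up for illustration.

```python
import pandas as pd

# Hypothetical miniature versions of the PISA and GDP tables
scores = pd.DataFrame(
    {"Country": ["Japan", "Chinese Taipei", "Peru"], "Math": [536, 560, 368]}
)
gdp = pd.DataFrame({"GDPpc": [34987.6, 10912.6]}, index=["Japan", "Peru"])

# indicator=True adds a _merge column recording where each row came from
merged = pd.merge(
    scores, gdp, how="left", left_on="Country", right_index=True, indicator=True
)

# Rows marked left_only had no GDP match and would be dropped in the next step
unmatched = merged.loc[merged["_merge"] == "left_only", "Country"].tolist()
print(unmatched)

# Same filtering as above, written with dropna
kept = merged.dropna(subset=["GDPpc"]).drop(columns="_merge")
print(kept)
```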

In [471]:
print (df3)


                 Country  Math  Reading  Science          GDPpc
2              Singapore   573      542      551   75630.362887
3   Hong Kong SAR, China   561      545      555   50346.649117
5            Korea, Rep.   554      536      538   31901.072927
6       Macao SAR, China   538      509      521  125515.187908
7                  Japan   536      538      547   34987.608282
9            Switzerland   531      509      515   54573.197208
10           Netherlands   523      511      522   45484.082518
11               Estonia   521      516      541   24760.559867
12               Finland   519      524      545   39488.977536
13                Canada   518      523      525   41865.045517
14                Poland   518      518      526   22740.224933
15               Belgium   515      509      505   40687.440797
16               Germany   514      508      524   42958.772318
17               Vietnam   511      508      528    4912.322206
18               Austria   506      490      506   44215.863721
19             Australia   504      512      521   42521.714986
20               Ireland   501      523      522   44673.440616
21              Slovenia   501      481      514   27681.668389
22               Denmark   500      496      498   42868.645335
23           New Zealand   500      512      516   32805.756325
24        Czech Republic   499      493      508   28332.591265
25                France   495      505      499   37226.961708
26        United Kingdom   494      499      514   36535.374819
27               Iceland   493      483      478   39925.369977
28                Latvia   491      489      502   20597.467915
29            Luxembourg   490      488      491   89153.059332
30                Norway   489      504      495   63620.043290
31              Portugal   487      488      489   25952.506623
32                 Italy   485      490      494   34812.683721
33                 Spain   484      488      496   31970.720637
..                   ...   ...      ...      ...            ...
35       Slovak Republic   482      463      471   25424.359256
36         United States   481      498      497   50549.187136
37             Lithuania   479      477      496   23710.978341
38                Sweden   478      483      485   43262.833999
39               Hungary   477      488      494   22305.839751
40               Croatia   471      485      491   20182.830175
41                Israel   466      486      470   30518.063020
42                Greece   453      477      467   24990.769394
43                Serbia   449      446      445   12504.790356
44                Turkey   448      475      463   18057.160189
45               Romania   445      438      439   17502.038714
46                Cyprus   440      449      438   31709.765453
47              Bulgaria   439      436      446   15442.810193
48  United Arab Emirates   434      442      448   57028.224546
49            Kazakhstan   432      393      425   21505.651217
50              Thailand   427      441      444   13586.110882
51                 Chile   423      441      445   21049.519788
52              Malaysia   421      398      420   21920.222472
53                Mexico   413      424      415   16320.144646
54            Montenegro   410      422      410   13711.739482
55               Uruguay   409      411      416   18447.275504
56            Costa Rica   407      441      429   13161.713132
57               Albania   394      394      397    9811.150106
58                Brazil   391      410      405   15233.719298
60               Tunisia   388      404      398   10609.051596
61                Jordan   386      399      409   11340.226773
62              Colombia   376      403      399   11635.983719
63                 Qatar   376      388      384  130989.826004
64             Indonesia   375      396      382    9326.838608
65                  Peru   368      384      373   10912.561600

[61 rows x 5 columns]

Excluding Outliers

I initially plotted the data and ran the regression without excluding any outliers. The resulting r-squared values for reading, math, and science were 0.29, 0.32, and 0.27, respectively. Looking at the scatter plot, there seem to be two obvious outliers, Qatar and Vietnam. I decided to exclude the data for these two countries because the remaining countries do seem to form a trend. I found upon excluding them that the correlation between GDP per capita and scores was much higher.

Qatar is an outlier because it placed relatively low, 63rd out of the 65 countries, despite a very high GDP per capita of about $131,000. Its GDP per capita is high for a country of just 1.8 million people, only 13% of whom are Qatari nationals. Qatar is a high-income economy because it holds some of the world's largest natural gas and oil reserves.

Vietnam is an outlier because it placed relatively high, 17th out of the 65 countries, despite a relatively low GDP per capita of about $4,900. Vietnam's high score may be due to government investment in education and the uniform classroom professionalism and discipline found across the country. At the same time, rote learning is emphasized much more than creative thinking, and many disadvantaged students are forced to drop out and so may not be tested. Both factors may help account for the high score.
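The effect a single extreme point can have on r-squared can be sketched with made-up numbers. The data below are hypothetical (five points on a rising trend plus one rich, low-scoring point loosely modeled on Qatar), and `r_squared` is an illustrative helper rather than part of the analysis.

```python
import numpy as np

def r_squared(x, y):
    """R-squared of a one-variable OLS fit, computed as the squared Pearson correlation."""
    return np.corrcoef(x, y)[0, 1] ** 2

# Hypothetical data: the last point has high log GDP per capita but a low score
log_gdp = np.array([9.0, 9.5, 10.0, 10.5, 11.0, 11.8])
score = np.array([400.0, 430.0, 470.0, 500.0, 530.0, 380.0])

r2_all = r_squared(log_gdp, score)                # fit with every point
r2_trimmed = r_squared(log_gdp[:-1], score[:-1])  # fit without the last point
print(round(r2_all, 2), round(r2_trimmed, 2))
```

Because dropping points that sit far from the line will raise r-squared almost by construction, the exclusion is best justified on substantive grounds, as in the two paragraphs above, rather than by the improved fit alone.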


In [472]:
df3.index = df3.Country #set country column as the index 
df3 = df3.drop(['Qatar', 'Vietnam']) # drop outlier

Plotting the Data

I use the log of the GDP per capita to plot against each component of the PISA on a scatter plot.
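One way to see why the log is useful: on a log scale, equal proportional differences in GDP per capita become equal distances, so a few very rich countries no longer stretch the x-axis. A small sketch, using four GDP per capita values rounded from the table above (Indonesia, Poland, Germany, Luxembourg); the variable names are illustrative.

```python
import numpy as np

# GDP per capita (rounded) for Indonesia, Poland, Germany, and Luxembourg
gdp = np.array([9327.0, 22740.0, 42959.0, 89153.0])
log_gdp = np.log(gdp)  # natural log, as in the plotting cell
print(np.round(log_gdp, 2))

# Doubling GDP per capita adds the same amount, ln(2), wherever it starts
step_low = np.log(20000) - np.log(10000)
step_high = np.log(80000) - np.log(40000)
print(np.isclose(step_low, step_high))
```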


In [473]:
Reading = df3.Reading
Science = df3.Science
Math = df3.Math
GDP = np.log(df3.GDPpc)

#PISA reading vs GDP per capita
plt.scatter(x = GDP, y = Reading, color = 'r') 
plt.title('PISA 2012 Reading scores vs. GDP per capita')
plt.xlabel('GDP per capita (log)')
plt.ylabel('PISA Reading Score')
plt.show()

#PISA math vs GDP per capita
plt.scatter(x = GDP, y = Math, color = 'b')
plt.title('PISA 2012 Math scores vs. GDP per capita')
plt.xlabel('GDP per capita (log)')
plt.ylabel('PISA Math Score')
plt.show()

#PISA science vs GDP per capita
plt.scatter(x = GDP, y = Science, color = 'g')
plt.title('PISA 2012 Science scores vs. GDP per capita')
plt.xlabel('GDP per capita (log)')
plt.ylabel('PISA Science Score')
plt.show()


Regression Analysis

The OLS regression results give r-squared values of 0.57 for reading, 0.63 for math, and 0.57 for science when each score is regressed on the log of GDP per capita. In other words, log GDP per capita accounts for a little over half of the cross-country variation in scores.
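The figures 0.57, 0.63, and 0.57 are r-squared values taken from the regression summaries, not correlation coefficients. In a regression with a single explanatory variable, r-squared equals the square of the Pearson correlation, so the implied correlations are the square roots (positive here, since all three slopes are positive). The sketch below does that conversion and then checks the identity on hypothetical data; the arrays `x` and `y` are made up.

```python
import math
import numpy as np

# R-squared values reported in the regression summaries
reported_r2 = {"Reading": 0.573, "Math": 0.628, "Science": 0.569}
implied_r = {name: round(math.sqrt(r2), 2) for name, r2 in reported_r2.items()}
print(implied_r)  # {'Reading': 0.76, 'Math': 0.79, 'Science': 0.75}

# Check R^2 == r^2 for a one-variable least-squares fit on hypothetical data
x = np.array([9.0, 9.5, 10.0, 10.5, 11.0])
y = np.array([410.0, 425.0, 470.0, 495.0, 540.0])
slope, intercept = np.polyfit(x, y, 1)
ss_res = np.sum((y - (slope * x + intercept)) ** 2)  # unexplained variation
ss_tot = np.sum((y - y.mean()) ** 2)                 # total variation
r2_from_fit = 1 - ss_res / ss_tot
r2_from_corr = np.corrcoef(x, y)[0, 1] ** 2
print(np.isclose(r2_from_fit, r2_from_corr))
```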


In [474]:
# GDP is the log of GDP per capita defined in the plotting cell
lm = smf.ols(formula='Reading ~ GDP', data=df3).fit()
lm.params
lm.summary()


Out[474]:
OLS Regression Results
Dep. Variable: Reading R-squared: 0.573
Model: OLS Adj. R-squared: 0.566
Method: Least Squares F-statistic: 76.56
Date: Thu, 27 Aug 2015 Prob (F-statistic): 3.97e-12
Time: 23:11:11 Log-Likelihood: -281.78
No. Observations: 59 AIC: 567.6
Df Residuals: 57 BIC: 571.7
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept -120.7619 67.968 -1.777 0.081 -256.866 15.342
GDP 58.1105 6.641 8.750 0.000 44.811 71.410
Omnibus: 2.433 Durbin-Watson: 1.507
Prob(Omnibus): 0.296 Jarque-Bera (JB): 1.893
Skew: -0.436 Prob(JB): 0.388
Kurtosis: 3.091 Cond. No. 185.

In [475]:
lm2 = smf.ols(formula='Math ~ GDP', data=df3).fit()
lm2.params
lm2.summary()


Out[475]:
OLS Regression Results
Dep. Variable: Math R-squared: 0.628
Model: OLS Adj. R-squared: 0.621
Method: Least Squares F-statistic: 96.02
Date: Thu, 27 Aug 2015 Prob (F-statistic): 7.87e-14
Time: 23:11:11 Log-Likelihood: -285.45
No. Observations: 59 AIC: 574.9
Df Residuals: 57 BIC: 579.0
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept -236.7352 72.330 -3.273 0.002 -381.573 -91.898
GDP 69.2546 7.068 9.799 0.000 55.102 83.407
Omnibus: 0.740 Durbin-Watson: 1.545
Prob(Omnibus): 0.691 Jarque-Bera (JB): 0.281
Skew: -0.138 Prob(JB): 0.869
Kurtosis: 3.196 Cond. No. 185.

In [476]:
lm3 = smf.ols(formula='Science ~ GDP', data=df3).fit()
lm3.params
lm3.summary()


Out[476]:
OLS Regression Results
Dep. Variable: Science R-squared: 0.569
Model: OLS Adj. R-squared: 0.562
Method: Least Squares F-statistic: 75.30
Date: Thu, 27 Aug 2015 Prob (F-statistic): 5.22e-12
Time: 23:11:11 Log-Likelihood: -286.69
No. Observations: 59 AIC: 577.4
Df Residuals: 57 BIC: 581.5
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept -162.6998 73.871 -2.202 0.032 -310.624 -14.776
GDP 62.6344 7.218 8.677 0.000 48.180 77.088
Omnibus: 0.232 Durbin-Watson: 1.591
Prob(Omnibus): 0.891 Jarque-Bera (JB): 0.426
Skew: -0.046 Prob(JB): 0.808
Kurtosis: 2.594 Cond. No. 185.
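Because the explanatory variable is the natural log of GDP per capita, each slope is easiest to read in terms of proportional changes: multiplying GDP per capita by some ratio changes the predicted score by the slope times the log of that ratio. The sketch below works this out for a doubling, using the slope coefficients from the three summaries; `score_change` is an illustrative helper.

```python
import math

# Slope coefficients on log GDP per capita, copied from the OLS summaries
slopes = {"Reading": 58.1105, "Math": 69.2546, "Science": 62.6344}

def score_change(slope, ratio):
    """Predicted change in score when GDP per capita is multiplied by `ratio`."""
    return slope * math.log(ratio)

doubling = {name: round(score_change(b, 2.0), 1) for name, b in slopes.items()}
print(doubling)  # {'Reading': 40.3, 'Math': 48.0, 'Science': 43.4}
```

So within this sample, a country with twice the GDP per capita of another is predicted to score roughly 40 to 48 points higher, which describes an association across countries rather than the effect of any one country becoming richer.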

Conclusion

The results show that countries with a higher GDP per capita tend to score higher on the PISA, although correlation does not imply causation. GDP per capita only reflects a country's potential to direct financial resources toward education, not how much is actually allocated to it. While the correlation is not weak, it is not strong enough to conclude that greater national wealth leads to a better education system. Deviations from the trend line show that countries with similar PISA performance can vary greatly in GDP per capita. The two outliers, Vietnam and Qatar, are examples of such deviations. At the same time, high scores are not necessarily indicative of a great educational system. Many other factors, such as secondary school enrollment, need to be taken into consideration when evaluating a country's educational system, and this provides a great opportunity for further research.

Data Sources

PISA 2012 scores are downloaded from the statlink on page 21 of the published PISA key findings.

GDP per capita data is accessed through the World Bank API for Pandas. Documentation is found here. GDP per capita is based on PPP and is in constant 2011 international dollars (indicator: NY.GDP.PCAP.PP.KD).