Is there a relationship between GDP per capita and PISA scores?

July 2015

Written by Susan Chen at NYU Stern with help from Professor David Backus

Contact: jiachen2017@u.northwestern.edu

About PISA

Since 2000, the Programme for International Student Assessment (PISA) has been administered every three years to evaluate education systems around the world. It also gathers family and education background information through surveys. The test, which assesses 15-year-old students in reading, math, and science, is administered to a total of around 510,000 students in 65 countries. The duration of the test is two hours, and it contains a mix of open-ended and multiple-choice questions. Learn more about the test here.

I am interested in whether a nation's wealth is correlated with its PISA scores. Do wealthier countries generally attain higher scores, and if so, to what extent? I use GDP per capita as the measure of wealth because total GDP depends heavily on population size; dividing by population should, in theory, allow larger countries (in terms of geography or population) to be compared with small ones.

Abstract

Regressing each component of the PISA on the log of GDP per capita with an OLS model gives r-squared values of 0.57, 0.63, and 0.57 for reading, math, and science, respectively. R-squared reflects how much of the variation in scores the model explains. Two outliers, Qatar and Vietnam, are excluded from the model.

Packages Imported

I use matplotlib.pyplot to draw scatter plots. I use pandas, a Python package for fast data manipulation and analysis, to organize my dataset. I access World Bank data through the remote data access API for pandas, pandas.io.wb. I also use numpy, a Python package for scientific computing, to take the log of GDP per capita so that the data fit a linear model more appropriately. Lastly, I use statsmodels.formula.api, a Python module for a variety of statistical computations, to run an OLS linear regression.



In [4]:

    
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
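# World Bank remote data access; in later pandas releases this module moved to
# the separate pandas-datareader package (from pandas_datareader import wb)
from pandas.io import wb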

Creating the Dataset

PISA 2012 scores are downloaded as an Excel file from the statlink on page 21 of the published PISA key findings. I deleted the explanatory text surrounding the table, kept only the "Mean Score in PISA 2012" column for each subject, and saved the file as a CSV. Then I read the file into pandas and renamed the columns.



In [5]:

    
file1 = '/users/susan/desktop/PISA/PISA2012clean.csv' # file location
df1 = pd.read_csv(file1)

#pandas remote data access API for World Bank GDP per capita data
df2 = wb.download(indicator='NY.GDP.PCAP.PP.KD', country='all', start=2012, end=2012)



In [6]:

    
df1









    Out[6]:






  
    
      
              Unnamed: 0  Math Mean score in PISA 2012  Reading Mean score in PISA 2012  Science Mean score in PISA 2012
0           OECD average                           494                              496                              501
1         Shanghai-China                           613                              570                              580
2              Singapore                           573                              542                              551
3   Hong Kong SAR, China                           561                              545                              555
4         Chinese Taipei                           560                              523                              523
5            Korea, Rep.                           554                              536                              538
6       Macao SAR, China                           538                              509                              521
7                  Japan                           536                              538                              547
8          Liechtenstein                           535                              516                              525
9            Switzerland                           531                              509                              515
10           Netherlands                           523                              511                              522
11               Estonia                           521                              516                              541
12               Finland                           519                              524                              545
13                Canada                           518                              523                              525
14                Poland                           518                              518                              526
15               Belgium                           515                              509                              505
16               Germany                           514                              508                              524
17               Vietnam                           511                              508                              528
18               Austria                           506                              490                              506
19             Australia                           504                              512                              521
20               Ireland                           501                              523                              522
21              Slovenia                           501                              481                              514
22               Denmark                           500                              496                              498
23           New Zealand                           500                              512                              516
24        Czech Republic                           499                              493                              508
25                France                           495                              505                              499
26        United Kingdom                           494                              499                              514
27               Iceland                           493                              483                              478
28                Latvia                           491                              489                              502
29            Luxembourg                           490                              488                              491
..                   ...                           ...                              ...                              ...
36         United States                           481                              498                              497
37             Lithuania                           479                              477                              496
38                Sweden                           478                              483                              485
39               Hungary                           477                              488                              494
40               Croatia                           471                              485                              491
41                Israel                           466                              486                              470
42                Greece                           453                              477                              467
43                Serbia                           449                              446                              445
44                Turkey                           448                              475                              463
45               Romania                           445                              438                              439
46                Cyprus                           440                              449                              438
47              Bulgaria                           439                              436                              446
48  United Arab Emirates                           434                              442                              448
49            Kazakhstan                           432                              393                              425
50              Thailand                           427                              441                              444
51                 Chile                           423                              441                              445
52              Malaysia                           421                              398                              420
53                Mexico                           413                              424                              415
54            Montenegro                           410                              422                              410
55               Uruguay                           409                              411                              416
56            Costa Rica                           407                              441                              429
57               Albania                           394                              394                              397
58                Brazil                           391                              410                              405
59             Argentina                           388                              396                              406
60               Tunisia                           388                              404                              398
61                Jordan                           386                              399                              409
62              Colombia                           376                              403                              399
63                 Qatar                           376                              388                              384
64             Indonesia                           375                              396                              382
65                  Peru                           368                              384                              373
    
  

66 rows × 4 columns



In [466]:

    
#drop the 'year' level of the (country, year) MultiIndex so that rows are indexed by country name only
df2.index = df2.index.droplevel('year')



In [467]:

    
df1.columns = ['Country','Math','Reading','Science']
df2.columns = ['GDPpc']



In [468]:

    
#combine PISA and GDP datasets based on country column  
df3 = pd.merge(df1, df2, how='left', left_on = 'Country', right_index = True)



In [469]:

    
df3.columns = ['Country','Math','Reading','Science','GDPpc']



In [470]:

    
#drop rows with missing GDP per capita values
df3 = df3[pd.notnull(df3['GDPpc'])]
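
Comparing the two printed tables, five of the 66 rows disappear at this step: OECD average, Shanghai-China, Chinese Taipei, Liechtenstein, and Argentina. A row can be lost either because its name has no exact match in the World Bank index or because the World Bank value itself is missing. One way to see which rows found no match is the `indicator` option of `pd.merge`. The sketch below uses small made-up frames (`pisa`, `gdp`, and the rounded values are illustrative assumptions, not the notebook's data).

```python
import pandas as pd

# Toy stand-ins for df1 (PISA scores) and df2 (GDP indexed by country name).
pisa = pd.DataFrame({
    'Country': ['Singapore', 'Shanghai-China', 'Japan'],
    'Math': [573, 613, 536],
})
gdp = pd.DataFrame(
    {'GDPpc': [75630.4, 34987.6]},  # rounded, illustrative values
    index=['Singapore', 'Japan'],
)

# indicator=True adds a '_merge' column recording where each row came from.
merged = pd.merge(pisa, gdp, how='left', left_on='Country',
                  right_index=True, indicator=True)

# Rows marked 'left_only' had no match in the GDP table, so their GDPpc is NaN.
unmatched = merged.loc[merged['_merge'] == 'left_only', 'Country'].tolist()
print(unmatched)
```

Listing the unmatched names before dropping them makes it easier to tell a genuine data gap from a naming mismatch that could be fixed by renaming.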



In [471]:

    
print (df3)









    



                 Country  Math  Reading  Science          GDPpc
2              Singapore   573      542      551   75630.362887
3   Hong Kong SAR, China   561      545      555   50346.649117
5            Korea, Rep.   554      536      538   31901.072927
6       Macao SAR, China   538      509      521  125515.187908
7                  Japan   536      538      547   34987.608282
9            Switzerland   531      509      515   54573.197208
10           Netherlands   523      511      522   45484.082518
11               Estonia   521      516      541   24760.559867
12               Finland   519      524      545   39488.977536
13                Canada   518      523      525   41865.045517
14                Poland   518      518      526   22740.224933
15               Belgium   515      509      505   40687.440797
16               Germany   514      508      524   42958.772318
17               Vietnam   511      508      528    4912.322206
18               Austria   506      490      506   44215.863721
19             Australia   504      512      521   42521.714986
20               Ireland   501      523      522   44673.440616
21              Slovenia   501      481      514   27681.668389
22               Denmark   500      496      498   42868.645335
23           New Zealand   500      512      516   32805.756325
24        Czech Republic   499      493      508   28332.591265
25                France   495      505      499   37226.961708
26        United Kingdom   494      499      514   36535.374819
27               Iceland   493      483      478   39925.369977
28                Latvia   491      489      502   20597.467915
29            Luxembourg   490      488      491   89153.059332
30                Norway   489      504      495   63620.043290
31              Portugal   487      488      489   25952.506623
32                 Italy   485      490      494   34812.683721
33                 Spain   484      488      496   31970.720637
..                   ...   ...      ...      ...            ...
35       Slovak Republic   482      463      471   25424.359256
36         United States   481      498      497   50549.187136
37             Lithuania   479      477      496   23710.978341
38                Sweden   478      483      485   43262.833999
39               Hungary   477      488      494   22305.839751
40               Croatia   471      485      491   20182.830175
41                Israel   466      486      470   30518.063020
42                Greece   453      477      467   24990.769394
43                Serbia   449      446      445   12504.790356
44                Turkey   448      475      463   18057.160189
45               Romania   445      438      439   17502.038714
46                Cyprus   440      449      438   31709.765453
47              Bulgaria   439      436      446   15442.810193
48  United Arab Emirates   434      442      448   57028.224546
49            Kazakhstan   432      393      425   21505.651217
50              Thailand   427      441      444   13586.110882
51                 Chile   423      441      445   21049.519788
52              Malaysia   421      398      420   21920.222472
53                Mexico   413      424      415   16320.144646
54            Montenegro   410      422      410   13711.739482
55               Uruguay   409      411      416   18447.275504
56            Costa Rica   407      441      429   13161.713132
57               Albania   394      394      397    9811.150106
58                Brazil   391      410      405   15233.719298
60               Tunisia   388      404      398   10609.051596
61                Jordan   386      399      409   11340.226773
62              Colombia   376      403      399   11635.983719
63                 Qatar   376      388      384  130989.826004
64             Indonesia   375      396      382    9326.838608
65                  Peru   368      384      373   10912.561600

[61 rows x 5 columns]

Excluding Outliers

I initially plotted the data and ran the regression without excluding any outliers. The resulting r-squared values for reading, math, and science were 0.29, 0.32, and 0.27, respectively. Looking at the scatter plot, there seem to be two obvious outliers, Qatar and Vietnam. I decided to exclude the data for these two countries because the remaining countries do seem to form a trend. I found upon excluding them that the correlation between GDP per capita and scores was much higher.

Qatar is an outlier because it placed relatively low, 63rd out of the 65 countries, despite a relatively high GDP per capita of about $131,000. Qatar has a population of just 1.8 million people, only 13% of whom are Qatari nationals. It is a high-income economy because it holds some of the world's largest natural gas and oil reserves.

Vietnam is an outlier because it placed relatively high, 17th out of the 65 countries, despite a relatively low GDP per capita of about $4,900. Vietnam's high score may be due to government investment in education and to the uniform professionalism and discipline found in classrooms across the country. At the same time, rote learning is emphasized much more than creative thinking, and many disadvantaged students are forced to drop out, which may also help account for the high score.



In [472]:

    
df3.index = df3.Country # set the country column as the index
df3 = df3.drop(['Qatar', 'Vietnam']) # drop the two outliers by index label
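
The two outliers here were chosen by eye from the scatter plot. A rough numerical alternative is to fit a line and flag points whose residuals are unusually large. The sketch below is only an illustration of that idea on made-up numbers: the labels, the data, and the cutoff of 2 standard deviations are all assumptions, and it is not the procedure used in this notebook.

```python
import numpy as np

# Made-up data: eight points close to a line, plus point 'E' far above it.
labels = np.array(list('ABCDEFGHI'))
x = np.arange(1.0, 10.0)
y = np.array([2.1, 3.9, 6.2, 7.8, 30.0, 12.0, 13.9, 16.1, 18.0])

# Least-squares line through all points.
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)

# Scale residuals by their standard deviation and flag the large ones.
z = residuals / residuals.std()
flagged = labels[np.abs(z) > 2.0].tolist()
print(flagged)  # prints ['E']
```

This check is crude: an extreme point inflates the standard deviation it is judged against, so it works best as a prompt for a closer look rather than as an automatic rule.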

Plotting the Data

I use the log of the GDP per capita to plot against each component of the PISA on a scatter plot.



In [473]:

    
Reading = df3.Reading
Science = df3.Science
Math = df3.Math
GDP = np.log(df3.GDPpc)

#PISA reading vs GDP per capita
plt.scatter(x = GDP, y = Reading, color = 'r') 
plt.title('PISA 2012 Reading scores vs. GDP per capita')
plt.xlabel('GDP per capita (log)')
plt.ylabel('PISA Reading Score')
plt.show()

#PISA math vs GDP per capita
plt.scatter(x = GDP, y = Math, color = 'b')
plt.title('PISA 2012 Math scores vs. GDP per capita')
plt.xlabel('GDP per capita (log)')
plt.ylabel('PISA Math Score')
plt.show()

#PISA science vs GDP per capita
plt.scatter(x = GDP, y = Science, color = 'g')
plt.title('PISA 2012 Science scores vs. GDP per capita')
plt.xlabel('GDP per capita (log)')
plt.ylabel('PISA Science Score')
plt.show()

Regression Analysis

The OLS regression results give an r-squared of 0.57 for reading scores regressed on the log of GDP per capita, 0.63 for math scores, and 0.57 for science scores.
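
For a regression with a single predictor, r-squared is the square of the Pearson correlation coefficient, so the values of 0.57, 0.63, and 0.57 correspond to correlations of roughly 0.76, 0.79, and 0.75. The sketch below checks that identity on made-up data (`x`, `y`, and the numbers used to generate them are illustrative assumptions, not the PISA dataset).

```python
import numpy as np

# Made-up data: y depends linearly on x, plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(9.0, 11.5, size=50)
y = -100.0 + 60.0 * x + rng.normal(0.0, 30.0, size=50)

# Fit y = a + b*x by least squares.
b, a = np.polyfit(x, y, 1)
fitted = a + b * x

# R-squared: share of the variance of y explained by the fitted line.
ss_res = np.sum((y - fitted) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

# Pearson correlation between x and y.
r = np.corrcoef(x, y)[0, 1]

print(np.isclose(r_squared, r ** 2))
```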



In [474]:

    
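# GDP is the logged series defined above; it is not a column of df3, so the formula picks it up from the notebook namespace
lm = smf.ols(formula='Reading ~ GDP', data=df3).fit()
lm.params
lm.summary()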









    Out[474]:





OLS Regression Results

Dep. Variable:            Reading   R-squared:               0.573
Model:                        OLS   Adj. R-squared:          0.566
Method:             Least Squares   F-statistic:             76.56
Date:            Thu, 27 Aug 2015   Prob (F-statistic):   3.97e-12
Time:                    23:11:11   Log-Likelihood:        -281.78
No. Observations:              59   AIC:                     567.6
Df Residuals:                  57   BIC:                     571.7
Df Model:                       1
Covariance Type:        nonrobust

                coef   std err        t   P>|t|    [95.0% Conf. Int.]
Intercept  -120.7619    67.968   -1.777   0.081   -256.866     15.342
GDP          58.1105     6.641    8.750   0.000     44.811     71.410

Omnibus:          2.433   Durbin-Watson:      1.507
Prob(Omnibus):    0.296   Jarque-Bera (JB):   1.893
Skew:            -0.436   Prob(JB):           0.388
Kurtosis:         3.091   Cond. No.            185.



In [475]:

    
lm2 = smf.ols(formula='Math ~ GDP', data=df3).fit()
lm2.params
lm2.summary()









    Out[475]:





OLS Regression Results

Dep. Variable:               Math   R-squared:               0.628
Model:                        OLS   Adj. R-squared:          0.621
Method:             Least Squares   F-statistic:             96.02
Date:            Thu, 27 Aug 2015   Prob (F-statistic):   7.87e-14
Time:                    23:11:11   Log-Likelihood:        -285.45
No. Observations:              59   AIC:                     574.9
Df Residuals:                  57   BIC:                     579.0
Df Model:                       1
Covariance Type:        nonrobust

                coef   std err        t   P>|t|    [95.0% Conf. Int.]
Intercept  -236.7352    72.330   -3.273   0.002   -381.573    -91.898
GDP          69.2546     7.068    9.799   0.000     55.102     83.407

Omnibus:          0.740   Durbin-Watson:      1.545
Prob(Omnibus):    0.691   Jarque-Bera (JB):   0.281
Skew:            -0.138   Prob(JB):           0.869
Kurtosis:         3.196   Cond. No.            185.



In [476]:

    
lm3 = smf.ols(formula='Science ~ GDP', data=df3).fit()
lm3.params
lm3.summary()









    Out[476]:





OLS Regression Results

Dep. Variable:            Science   R-squared:               0.569
Model:                        OLS   Adj. R-squared:          0.562
Method:             Least Squares   F-statistic:             75.30
Date:            Thu, 27 Aug 2015   Prob (F-statistic):   5.22e-12
Time:                    23:11:11   Log-Likelihood:        -286.69
No. Observations:              59   AIC:                     577.4
Df Residuals:                  57   BIC:                     581.5
Df Model:                       1
Covariance Type:        nonrobust

                coef   std err        t   P>|t|    [95.0% Conf. Int.]
Intercept  -162.6998    73.871   -2.202   0.032   -310.624    -14.776
GDP          62.6344     7.218    8.677   0.000     48.180     77.088

Omnibus:          0.232   Durbin-Watson:      1.591
Prob(Omnibus):    0.891   Jarque-Bera (JB):   0.426
Skew:            -0.046   Prob(JB):           0.808
Kurtosis:         2.594   Cond. No.            185.
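
Because the predictor is the natural log of GDP per capita, each slope is read in terms of proportional changes in GDP: multiplying GDP per capita by a factor k shifts the predicted score by slope × ln(k). The sketch below applies this to the three reported slopes for a doubling of GDP per capita. The helper `score_change` is a hypothetical name introduced for illustration, and the result describes the fitted line only, not a causal effect.

```python
import math

# Slope coefficients on log GDP per capita, copied from the regression summaries.
slopes = {'Reading': 58.1105, 'Math': 69.2546, 'Science': 62.6344}

def score_change(slope, gdp_ratio):
    """Predicted score difference when GDP per capita is multiplied by gdp_ratio."""
    return slope * math.log(gdp_ratio)

# Predicted difference in score between two countries, one with twice the GDP per capita.
doubling = {subject: round(score_change(slope, 2.0), 1)
            for subject, slope in slopes.items()}
print(doubling)  # prints {'Reading': 40.3, 'Math': 48.0, 'Science': 43.4}
```

On this reading, a country with twice the GDP per capita of another is predicted to score roughly 40 to 48 points higher, depending on the subject.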

Conclusion

The results show that countries with a higher GDP per capita tend to score higher, although correlation does not imply causation. GDP per capita reflects only a country's potential to direct financial resources towards education, not how much is actually allocated to it. While the correlation is not weak, it is not strong enough to conclude that a country's greater wealth will lead to a better education system. Deviations from the trend line show that countries with similar performance on the PISA can vary greatly in GDP per capita; the two outliers, Vietnam and Qatar, are examples of that. At the same time, high scores are not necessarily indicative of a great educational system. Many other factors, such as secondary school enrollment, need to be taken into consideration when evaluating a country's educational system, and this provides a great opportunity for further research.

Data Sources

PISA 2012 scores are downloaded from the statlink on page 21 of the published PISA key findings.

GDP per capita data is accessed through the World Bank API for Pandas. Documentation is found here. GDP per capita is based on PPP and is in constant 2011 international dollars (indicator: NY.GDP.PCAP.PP.KD).
