Analysis of the IDEB data

DISCLAIMER: THIS IS STILL WORK IN PROGRESS

In the middle of an election year, several candidates for state governments and for the presidency of the republic have addressed the topic of education. It is common for candidates to use an indicator such as IDEB to evaluate their own performance and that of their opponents, while circulating proposals to improve the indicator, such as building new schools, hiring more teachers, and so on. Our view is that in a developing country like Brazil, even factors not directly linked to education, such as violence or income, bear a relationship to IDEB performance; in this notebook we try to measure the impact that socio-economic factors have on state-level IDEB results.

To that end we will use the following public sources:

Ideally we would work with data for all factors taken from the same period (the year 2015); this is an improvement we intend to make in the future.

Analysis of the IDEB data

  • What IDEB is and how it is calculated
  • Which questions do we want to answer?
  • Add a map with the states
  • Analysis 1:
  • Analysis 2:

In [1]:
import pandas as pd
import numpy as np
import scipy as sp
import seaborn as sns
import statsmodels.api as sm
import scikits.bootstrap as bootstrap
import matplotlib.pyplot as plt
import warnings
import patsy
import folium
from statsmodels.formula.api import ols, rlm

%matplotlib inline
sns.set(color_codes=True)
warnings.filterwarnings('ignore')

Load the data


In [2]:
df = pd.read_csv('IDEB.csv', sep=';', encoding='cp1252')  # 'ansi' is a Windows-only alias; cp1252 is the portable spelling
df


Out[2]:
Ano UF ESTADO IDEB_AI IDEB_AF PE_MA2 PE_MA4 TH_MA2 TH_MA4
0 2017 AC Acre 5.7 4.6 29.125 24.2375 54.15 41.175
1 2017 AL Alagoas 4.9 3.9 24.675 23.3125 55.55 56.550
2 2017 AP Amapá 4.4 3.5 22.225 16.1625 51.30 43.725
3 2017 AM Amazonas 5.3 4.4 29.275 23.2875 33.80 34.250
4 2017 BA Bahia 4.7 3.4 23.950 20.4000 46.00 42.875
5 2017 CE Ceará 6.1 4.9 23.675 21.5625 49.85 49.675
6 2017 DF Distrito Federal 6.0 4.3 5.500 4.4000 21.85 24.700
7 2017 ES Espírito Santo 5.7 4.4 9.825 7.9625 34.70 36.925
8 2017 GO Goiás 5.9 5.1 6.950 5.6000 42.30 43.550
9 2017 MA Maranhão 4.5 3.7 31.250 28.5750 32.00 33.800
10 2017 MT Mato Grosso 5.7 4.7 6.750 5.3000 33.60 36.525
11 2017 MS Mato Grosso do Sul 5.5 4.6 5.950 4.5750 22.90 24.100
12 2017 MG Minas Gerais 6.3 4.5 9.475 8.0375 20.80 21.525
13 2017 PA Paraná 4.5 3.6 5.650 4.7750 25.00 25.800
14 2017 PB Paraíba 4.7 3.6 21.475 18.5375 32.90 35.850
15 2017 PR Pará 6.3 4.7 26.525 21.5125 52.10 47.975
16 2017 PE Pernambuco 4.8 4.1 21.775 19.1375 52.30 45.500
17 2017 PI Piauí 5.0 4.2 25.300 22.1000 21.00 21.175
18 2017 RJ Rio de Janeiro 5.3 4.2 7.075 5.5375 38.40 35.525
19 2017 RN Rio Grande do Norte 4.5 3.4 18.350 16.9000 60.70 53.325
20 2017 RS Rio Grande do Sul 5.6 4.4 4.950 4.6000 27.65 26.450
21 2017 RO Rondônia 5.7 4.8 12.400 10.4750 33.70 33.600
22 2017 RR Roraima 5.4 4.0 15.400 12.3750 41.85 38.900
23 2017 SC Santa Catarina 6.3 5.0 3.600 2.8500 15.35 14.550
24 2017 SE Sergipe 4.3 3.4 22.350 19.3000 60.20 56.975
25 2017 SP São Paulo 6.5 4.9 4.725 3.9625 10.80 11.950
26 2017 TO Tocantins 5.4 4.5 16.175 14.1375 32.10 30.725
27 2015 AC Acre 5.3 4.4 19.350 19.8750 28.20 28.475
28 2015 AL Alagoas 4.3 3.2 21.950 22.1750 57.55 61.200
29 2015 AP Amapá 4.3 3.5 10.100 12.7250 36.15 34.775
... ... ... ... ... ... ... ... ... ...
132 2009 SE Sergipe 3.4 2.8 20.900 20.4000 30.05 28.750
133 2009 SP São Paulo 5.3 4.3 3.900 4.0000 15.60 16.750
134 2009 TO Tocantins 4.4 3.9 13.150 15.6000 20.45 18.675
135 2007 AC Acre 3.7 3.7 25.250 24.2000 21.25 19.925
136 2007 AL Alagoas 3.1 2.6 31.550 32.3750 56.30 46.900
137 2007 AP Amapá 3.3 3.4 14.250 14.6000 29.90 31.025
138 2007 AM Amazonas 3.4 3.2 19.800 18.0500 21.10 19.400
139 2007 BA Bahia 3.2 2.8 23.800 23.4750 24.85 21.800
140 2007 CE Ceará 3.5 3.3 25.800 26.0250 22.50 21.500
141 2007 DF Distrito Federal 4.8 3.5 5.350 6.2750 28.45 30.400
142 2007 ES Espírito Santo 4.3 3.7 8.950 9.4000 52.10 50.150
143 2007 GO Goiás 4.1 3.5 7.300 7.2000 26.15 26.200
144 2007 MA Maranhão 3.5 3.2 31.950 32.1750 16.85 15.175
145 2007 MT Mato Grosso 4.3 3.7 8.650 8.5500 30.95 31.600
146 2007 MS Mato Grosso do Sul 4.1 3.7 6.800 7.4000 30.10 29.425
147 2007 MG Minas Gerais 4.6 3.8 9.000 9.0250 21.15 21.725
148 2007 PA Paraná 3.0 3.1 6.100 6.3250 29.65 29.100
149 2007 PB Paraíba 3.3 2.8 22.000 22.7000 23.25 21.450
150 2007 PR Pará 4.8 4.0 18.050 17.8250 29.75 27.450
151 2007 PE Pernambuco 3.3 2.6 24.650 24.8500 52.80 51.950
152 2007 PI Piauí 3.3 3.2 29.250 29.7250 13.15 12.575
153 2007 RJ Rio de Janeiro 4.1 3.5 5.550 5.4000 44.55 46.625
154 2007 RN Rio Grande do Norte 3.2 2.8 20.800 21.3500 17.00 14.800
155 2007 RS Rio Grande do Sul 4.5 3.7 6.350 6.3000 18.95 18.750
156 2007 RO Rondônia 3.9 3.3 13.350 12.0750 32.30 34.700
157 2007 RR Roraima 4.1 3.5 15.250 20.1000 27.70 25.575
158 2007 SC Santa Catarina 4.7 4.1 2.550 2.7250 10.80 10.875
159 2007 SE Sergipe 3.2 2.8 19.900 19.8000 27.45 26.000
160 2007 SP São Paulo 4.8 4.0 4.100 4.3250 17.90 21.575
161 2007 TO Tocantins 4.0 3.6 18.050 17.3250 16.90 16.200

162 rows × 9 columns

Exploratory analysis

Descriptive statistics


In [3]:
df.describe()


Out[3]:
Ano IDEB_AI IDEB_AF PE_MA2 PE_MA4 TH_MA2 TH_MA4
count 162.000000 162.000000 162.000000 162.000000 162.000000 162.000000 162.000000
mean 2012.000000 4.661111 3.779012 13.741821 13.742978 32.315432 31.386574
std 3.426241 0.827347 0.556769 8.166125 8.036941 12.165066 11.615080
min 2007.000000 3.000000 2.600000 2.100000 2.300000 10.800000 10.875000
25% 2009.000000 4.000000 3.400000 5.725000 5.756250 23.312500 22.387500
50% 2012.000000 4.650000 3.800000 14.150000 14.143750 32.050000 31.050000
75% 2015.000000 5.300000 4.175000 19.993750 19.875000 39.300000 37.756250
max 2017.000000 6.500000 5.100000 31.950000 32.375000 69.150000 67.000000

Difference between the means of the final and early years in the dataset


In [4]:
plt.figure()
sns.distplot(df["IDEB_AI"], hist=False, rug=True, label="IDEB_AI")
sns.distplot(df["IDEB_AF"], hist=False, rug=True, label="IDEB_AF")


Out[4]:
<matplotlib.axes._subplots.AxesSubplot at 0x181e11c84e0>

As the descriptive statistics show, the mean of the IDEB Early Years scores (IDEB_AI) is higher than the mean of the IDEB Final Years scores (IDEB_AF): 4.66 versus 3.78. But is the difference between these means statistically significant? Let us compute a 95% confidence interval for each mean and plot the result:


In [5]:
ci_mean_ideb_ai = bootstrap.ci(data=df["IDEB_AI"], alpha=0.05)  
ci_mean_ideb_af = bootstrap.ci(data=df["IDEB_AF"], alpha=0.05)

fig, ax = plt.subplots()
ax.plot(["IDEB_AI", "IDEB_AI"], ci_mean_ideb_ai, label="IDEB_AI")
ax.plot(["IDEB_AF", "IDEB_AF"], ci_mean_ideb_af, label="IDEB_AF")
ax.legend()

plt.show()


The lack of overlap between the intervals above shows that, at the chosen confidence level (95%), the difference between the means is statistically significant. (Non-overlap of confidence intervals is a conservative criterion: it implies significance, although overlapping intervals would not by themselves imply the opposite.)
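
As a complementary check (a sketch using the same scikits.bootstrap API imported above), we can bootstrap a 95% interval for the paired difference directly; an interval that excludes zero leads to the same conclusion.


In [ ]:
# Sketch: bootstrap CI for the paired difference IDEB_AI - IDEB_AF.
# If the interval excludes 0, the difference between means is significant.
diff = df["IDEB_AI"] - df["IDEB_AF"]
ci_diff = bootstrap.ci(data=diff, statfunction=np.mean, alpha=0.05)
print("95% CI for mean(IDEB_AI - IDEB_AF):", ci_diff)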

IDEB Early Years - distribution by year


In [6]:
g = sns.catplot(y='IDEB_AI', x='Ano', kind='violin', inner=None, data=df)
sns.swarmplot(y='IDEB_AI', x='Ano', color='k', size=3, data=df, ax=g.ax)
sns.lmplot(data=df, y='IDEB_AI', x='Ano')


Out[6]:
<seaborn.axisgrid.FacetGrid at 0x181e5390240>

IDEB Early Years - distribution by state in 2017


In [7]:
#https://github.com/python-visualization/folium/tree/master/examples/data

for t in ("AI", "AF"):
    for year in range(2007, 2018, 2):
        m = folium.Map(location=[-16, -55], zoom_start=4, tiles='cartodbpositron')
        state_data = df[df['Ano']==year][["UF", "IDEB_"+t]]
        m.choropleth(
            geo_data=open('uf.json', encoding="latin-1").read(),
            name='choropleth',
            data=state_data,
            columns=['UF', 'IDEB_'+t],
            key_on='feature.id',
            fill_color='YlGn',
            fill_opacity=0.7,
            line_opacity=0.2,
            legend_name='IDEB {} ({})'.format(t, year)
        )
        folium.LayerControl().add_to(m)
        m.save(outfile="UF_{}_{}.html".format(t, year))
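
Note: `Map.choropleth` was deprecated in later folium releases in favor of the `folium.Choropleth` class; an equivalent call (a sketch, shown for a single map, reusing `m` and `state_data` from the loop above) would be:


In [ ]:
# Sketch: same choropleth with the newer folium.Choropleth class.
folium.Choropleth(
    geo_data=open('uf.json', encoding='latin-1').read(),
    name='choropleth',
    data=state_data,
    columns=['UF', 'IDEB_AI'],
    key_on='feature.id',
    fill_color='YlGn',
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name='IDEB AI (2017)'
).add_to(m)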

In [8]:
from IPython.display import IFrame
IFrame(src='UF_AI_2017.html', width=500, height=600)  # note the case: the file was saved as UF_AI_2017.html


Out[8]:

In [9]:
# Pre-create all six standardized columns with a float dtype
df['IDEB_AI_STD'] = 0.0
df['IDEB_AF_STD'] = 0.0
df['PE_MA2_STD'] = 0.0
df['TH_MA2_STD'] = 0.0
df['PE_MA4_STD'] = 0.0
df['TH_MA4_STD'] = 0.0

In [10]:
# Standardize each indicator within its edition (per-year z-scores)
for year in range(2007, 2018, 2):
    ideb_ai_year = df[df['Ano']==year]['IDEB_AI']
    ideb_af_year = df[df['Ano']==year]['IDEB_AF']
    pe_ma2_year  = df[df['Ano']==year]['PE_MA2']
    th_ma2_year  = df[df['Ano']==year]['TH_MA2']
    pe_ma4_year  = df[df['Ano']==year]['PE_MA4']
    th_ma4_year  = df[df['Ano']==year]['TH_MA4']
    df.loc[df['Ano']==year, 'IDEB_AI_STD'] = (ideb_ai_year - ideb_ai_year.mean()) / ideb_ai_year.std()
    df.loc[df['Ano']==year, 'IDEB_AF_STD'] = (ideb_af_year - ideb_af_year.mean()) / ideb_af_year.std()
    df.loc[df['Ano']==year, 'PE_MA2_STD']  = (pe_ma2_year  - pe_ma2_year.mean() ) / pe_ma2_year.std()
    df.loc[df['Ano']==year, 'TH_MA2_STD']  = (th_ma2_year  - th_ma2_year.mean() ) / th_ma2_year.std()
    df.loc[df['Ano']==year, 'PE_MA4_STD']  = (pe_ma4_year  - pe_ma4_year.mean() ) / pe_ma4_year.std()
    df.loc[df['Ano']==year, 'TH_MA4_STD']  = (th_ma4_year  - th_ma4_year.mean() ) / th_ma4_year.std()
    
df.tail()


Out[10]:
Ano UF ESTADO IDEB_AI IDEB_AF PE_MA2 PE_MA4 TH_MA2 TH_MA4 IDEB_AI_STD IDEB_AF_STD PE_MA2_STD TH_MA2_STD PE_MA4_STD TH_MA4_STD
157 2007 RR Roraima 4.1 3.5 15.25 20.100 27.70 25.575 0.408135 0.292109 -0.051595 0.012938 0.459939 -0.106998
158 2007 SC Santa Catarina 4.7 4.1 2.55 2.725 10.80 10.875 1.409920 1.683923 -1.450158 -1.426934 -1.447410 -1.420758
159 2007 SE Sergipe 3.2 2.8 19.90 19.800 27.45 26.000 -1.094543 -1.331674 0.460478 -0.008362 0.427006 -0.069015
160 2007 SP São Paulo 4.8 4.0 4.10 4.325 17.90 21.575 1.576884 1.451954 -1.279467 -0.822017 -1.271769 -0.464483
161 2007 TO Tocantins 4.0 3.6 18.05 17.325 16.90 16.200 0.241171 0.524078 0.256750 -0.907217 0.155312 -0.944855
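
The loop above can be written more compactly with groupby/transform; this sketch produces the same six *_STD columns (Series.std uses the same ddof=1 default):


In [ ]:
# Sketch: equivalent per-year z-scores via groupby/transform.
for col in ['IDEB_AI', 'IDEB_AF', 'PE_MA2', 'TH_MA2', 'PE_MA4', 'TH_MA4']:
    df[col + '_STD'] = df.groupby('Ano')[col].transform(
        lambda s: (s - s.mean()) / s.std())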

In [11]:
sns.lmplot(data=df, y='IDEB_AI_STD', x='Ano')


Out[11]:
<seaborn.axisgrid.FacetGrid at 0x181e321e978>

IDEB Final Years - distribution by year


In [12]:
g = sns.catplot(y='IDEB_AF', x='Ano', kind='violin', inner=None, data=df)
sns.swarmplot(y='IDEB_AF', x='Ano', color='k', size=3, data=df, ax=g.ax)
sns.lmplot(data=df, y='IDEB_AF', x='Ano')


Out[12]:
<seaborn.axisgrid.FacetGrid at 0x181e59e53c8>

In [13]:
from IPython.display import IFrame
IFrame(src='UF_AF_2017.html', width=500, height=600)


Out[13]:

In [14]:
sns.lmplot(data=df, y='IDEB_AF_STD', x='Ano')


Out[14]:
<seaborn.axisgrid.FacetGrid at 0x181e5a39ba8>

Note that in both the Early Years and the Final Years the median score trends upward from one edition to the next, together with a progressive widening of the score range and a slight upward shift of the density.
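
The trend in the medians can be quantified directly (a sketch):


In [ ]:
# Sketch: median IDEB per edition, for both stages.
df.groupby('Ano')[['IDEB_AI', 'IDEB_AF']].median()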


In [15]:
plt.figure()
sns.kdeplot(df["PE_MA2"], bw=1.5, label="% Pobreza Extrema MA(2)")
sns.kdeplot(df["PE_MA4"], bw=1.5, label="% Pobreza Extrema MA(4)")


Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x181e5a9e080>

In [16]:
sns.catplot(y='PE_MA2', x='Ano', kind='box', data=df)


Out[16]:
<seaborn.axisgrid.FacetGrid at 0x181e59b0ef0>

In [17]:
sns.catplot(y='PE_MA4', x='Ano', kind='box', data=df)


Out[17]:
<seaborn.axisgrid.FacetGrid at 0x181e5c9fba8>

In [18]:
plt.figure()
sns.distplot(df["TH_MA2"], hist=False, rug=True, label="Homicídios/1000 hab MA(2)")
sns.distplot(df["TH_MA4"], hist=False, rug=True, label="Homicídios/1000 hab MA(4)")


Out[18]:
<matplotlib.axes._subplots.AxesSubplot at 0x181e5d09780>

In [19]:
sns.catplot(y='TH_MA2', x='Ano', kind='box', data=df)


Out[19]:
<seaborn.axisgrid.FacetGrid at 0x181e5d04c18>

In [20]:
sns.catplot(y='TH_MA4', x='Ano', kind='box', data=df)


Out[20]:
<seaborn.axisgrid.FacetGrid at 0x181e5e1f898>

Searching for associations between IDEB and the socio-economic variables

IDEB Early Years (IDEB_AI)


In [21]:
# lmplot creates its own FacetGrid/figure, so no plt.figure() call is needed
sns.lmplot(data=df, x="PE_MA2", y="IDEB_AI_STD")
sns.lmplot(data=df, x="PE_MA2", y="IDEB_AI_STD", hue="Ano")
sns.lmplot(data=df, x="PE_MA4", y="IDEB_AI_STD", hue="Ano")
sns.lmplot(data=df, x="TH_MA2", y="IDEB_AI_STD")
sns.lmplot(data=df, x="TH_MA2", y="IDEB_AI_STD", hue="Ano")
sns.lmplot(data=df, x="TH_MA4", y="IDEB_AI_STD", hue="Ano")


Out[21]:
<seaborn.axisgrid.FacetGrid at 0x181e6038ac8>

In [22]:
formula = "IDEB_AI ~ TH_MA2 * PE_MA2"
model = ols(formula, df).fit()
model.summary()


Out[22]:
OLS Regression Results
Dep. Variable: IDEB_AI R-squared: 0.292
Model: OLS Adj. R-squared: 0.278
Method: Least Squares F-statistic: 21.68
Date: Sat, 22 Sep 2018 Prob (F-statistic): 8.21e-12
Time: 00:19:53 Log-Likelihood: -170.74
No. Observations: 162 AIC: 349.5
Df Residuals: 158 BIC: 361.8
Df Model: 3
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 5.9981 0.300 19.972 0.000 5.405 6.591
TH_MA2 -0.0219 0.010 -2.199 0.029 -0.042 -0.002
PE_MA2 -0.0989 0.017 -5.790 0.000 -0.133 -0.065
TH_MA2:PE_MA2 0.0015 0.001 2.995 0.003 0.001 0.003
Omnibus: 0.732 Durbin-Watson: 1.380
Prob(Omnibus): 0.694 Jarque-Bera (JB): 0.748
Skew: 0.159 Prob(JB): 0.688
Kurtosis: 2.901 Cond. No. 3.33e+03


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.33e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

In [23]:
formula = "IDEB_AI_STD ~ TH_MA2 * PE_MA2"
model = ols(formula, df).fit()
model.summary()


Out[23]:
OLS Regression Results
Dep. Variable: IDEB_AI_STD R-squared: 0.396
Model: OLS Adj. R-squared: 0.384
Method: Least Squares F-statistic: 34.50
Date: Sat, 22 Sep 2018 Prob (F-statistic): 3.32e-17
Time: 00:19:53 Log-Likelihood: -186.00
No. Observations: 162 AIC: 380.0
Df Residuals: 158 BIC: 392.4
Df Model: 3
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 1.9995 0.330 6.059 0.000 1.348 2.651
TH_MA2 -0.0367 0.011 -3.353 0.001 -0.058 -0.015
PE_MA2 -0.1087 0.019 -5.790 0.000 -0.146 -0.072
TH_MA2:PE_MA2 0.0014 0.001 2.538 0.012 0.000 0.003
Omnibus: 4.309 Durbin-Watson: 2.236
Prob(Omnibus): 0.116 Jarque-Bera (JB): 4.534
Skew: 0.206 Prob(JB): 0.104
Kurtosis: 3.709 Cond. No. 3.33e+03


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.33e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

In [24]:
formula = "IDEB_AI_STD ~ TH_MA2_STD * PE_MA2_STD"
model = ols(formula, df).fit()
model.summary()


Out[24]:
OLS Regression Results
Dep. Variable: IDEB_AI_STD R-squared: 0.401
Model: OLS Adj. R-squared: 0.389
Method: Least Squares F-statistic: 35.21
Date: Sat, 22 Sep 2018 Prob (F-statistic): 1.76e-17
Time: 00:19:53 Log-Likelihood: -185.35
No. Observations: 162 AIC: 378.7
Df Residuals: 158 BIC: 391.0
Df Model: 3
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept -0.0305 0.063 -0.484 0.629 -0.155 0.094
TH_MA2_STD -0.1671 0.067 -2.478 0.014 -0.300 -0.034
PE_MA2_STD -0.5490 0.066 -8.308 0.000 -0.680 -0.418
TH_MA2_STD:PE_MA2_STD 0.0913 0.055 1.665 0.098 -0.017 0.200
Omnibus: 5.587 Durbin-Watson: 2.307
Prob(Omnibus): 0.061 Jarque-Bera (JB): 6.910
Skew: 0.205 Prob(JB): 0.0316
Kurtosis: 3.925 Cond. No. 1.71


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
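
`rlm` was imported at the top but never used; as a robustness check (a sketch), the same standardized model can be refit with a robust linear model (Huber weighting by default). Coefficients close to the OLS ones would suggest the fit is not driven by a few outlier states.


In [ ]:
# Sketch: robust refit of the standardized MA(2) model.
robust = rlm("IDEB_AI_STD ~ TH_MA2_STD * PE_MA2_STD", df).fit()
print(robust.params)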

In [25]:
# Residuals by year; model.resid is index-aligned with df
df_resids = pd.DataFrame({'year': df['Ano'].values,
                          'residual': model.resid.values})
df_resids.plot.scatter(x='year', y='residual', c='DarkBlue')


Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0x181e6428e48>
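
A formal version of the visual check above (a sketch): regress the residuals on the year; a slope near zero means no remaining trend across editions.


In [ ]:
from scipy import stats

# Sketch: near-zero slope and a high p-value = no remaining year trend.
slope, intercept, r, p, se = stats.linregress(df_resids['year'], df_resids['residual'])
print("slope={:.4f}, r={:.3f}, p-value={:.3f}".format(slope, r, p))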

Models with the MA(4) variables


In [26]:
formula = "IDEB_AI ~ TH_MA4 * PE_MA4"
model = ols(formula, df).fit()
model.summary()


Out[26]:
OLS Regression Results
Dep. Variable: IDEB_AI R-squared: 0.340
Model: OLS Adj. R-squared: 0.327
Method: Least Squares F-statistic: 27.12
Date: Sat, 22 Sep 2018 Prob (F-statistic): 3.33e-14
Time: 00:19:53 Log-Likelihood: -165.01
No. Observations: 162 AIC: 338.0
Df Residuals: 158 BIC: 350.4
Df Model: 3
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 5.8949 0.294 20.059 0.000 5.314 6.475
TH_MA4 -0.0154 0.010 -1.566 0.119 -0.035 0.004
PE_MA4 -0.0927 0.016 -5.662 0.000 -0.125 -0.060
TH_MA4:PE_MA4 0.0012 0.001 2.255 0.026 0.000 0.002
Omnibus: 1.196 Durbin-Watson: 1.465
Prob(Omnibus): 0.550 Jarque-Bera (JB): 0.854
Skew: 0.155 Prob(JB): 0.653
Kurtosis: 3.174 Cond. No. 3.16e+03


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.16e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

In [27]:
formula = "IDEB_AI_STD ~ TH_MA4 * PE_MA4"
model = ols(formula, df).fit()
model.summary()


Out[27]:
OLS Regression Results
Dep. Variable: IDEB_AI_STD R-squared: 0.409
Model: OLS Adj. R-squared: 0.398
Method: Least Squares F-statistic: 36.51
Date: Sat, 22 Sep 2018 Prob (F-statistic): 5.54e-18
Time: 00:19:53 Log-Likelihood: -184.15
No. Observations: 162 AIC: 376.3
Df Residuals: 158 BIC: 388.7
Df Model: 3
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 1.9054 0.331 5.761 0.000 1.252 2.559
TH_MA4 -0.0318 0.011 -2.878 0.005 -0.054 -0.010
PE_MA4 -0.1014 0.018 -5.503 0.000 -0.138 -0.065
TH_MA4:PE_MA4 0.0011 0.001 1.863 0.064 -6.49e-05 0.002
Omnibus: 4.422 Durbin-Watson: 2.241
Prob(Omnibus): 0.110 Jarque-Bera (JB): 5.069
Skew: 0.165 Prob(JB): 0.0793
Kurtosis: 3.801 Cond. No. 3.16e+03


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.16e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

In [28]:
formula = "IDEB_AI_STD ~ TH_MA4_STD * PE_MA4_STD"
model = ols(formula, df).fit()
model.summary()


Out[28]:
OLS Regression Results
Dep. Variable: IDEB_AI_STD R-squared: 0.407
Model: OLS Adj. R-squared: 0.396
Method: Least Squares F-statistic: 36.14
Date: Sat, 22 Sep 2018 Prob (F-statistic): 7.71e-18
Time: 00:19:53 Log-Likelihood: -184.49
No. Observations: 162 AIC: 377.0
Df Residuals: 158 BIC: 389.3
Df Model: 3
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept -0.0221 0.062 -0.357 0.722 -0.144 0.100
TH_MA4_STD -0.1560 0.065 -2.382 0.018 -0.285 -0.027
PE_MA4_STD -0.5704 0.064 -8.884 0.000 -0.697 -0.444
TH_MA4_STD:PE_MA4_STD 0.0800 0.053 1.509 0.133 -0.025 0.185
Omnibus: 5.126 Durbin-Watson: 2.291
Prob(Omnibus): 0.077 Jarque-Bera (JB): 6.389
Skew: 0.169 Prob(JB): 0.0410
Kurtosis: 3.912 Cond. No. 1.61


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In [29]:
# Residuals by year; model.resid is index-aligned with df
df_resids = pd.DataFrame({'year': df['Ano'].values,
                          'residual': model.resid.values})
df_resids.plot.scatter(x='year', y='residual', c='DarkBlue')


Out[29]:
<matplotlib.axes._subplots.AxesSubplot at 0x181e64c5c18>

IDEB Final Years (IDEB_AF)


In [30]:
formula = "IDEB_AF_STD ~  TH_MA2 * PE_MA2"
model = ols(formula, df).fit()
model.summary()


Out[30]:
OLS Regression Results
Dep. Variable: IDEB_AF_STD R-squared: 0.362
Model: OLS Adj. R-squared: 0.350
Method: Least Squares F-statistic: 29.87
Date: Sat, 22 Sep 2018 Prob (F-statistic): 2.37e-15
Time: 00:19:53 Log-Likelihood: -190.42
No. Observations: 162 AIC: 388.8
Df Residuals: 158 BIC: 401.2
Df Model: 3
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 1.7100 0.339 5.042 0.000 1.040 2.380
TH_MA2 -0.0314 0.011 -2.796 0.006 -0.054 -0.009
PE_MA2 -0.0656 0.019 -3.402 0.001 -0.104 -0.028
TH_MA2:PE_MA2 0.0004 0.001 0.755 0.452 -0.001 0.002
Omnibus: 2.326 Durbin-Watson: 2.398
Prob(Omnibus): 0.313 Jarque-Bera (JB): 2.382
Skew: 0.276 Prob(JB): 0.304
Kurtosis: 2.783 Cond. No. 3.33e+03


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.33e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

In [31]:
formula = "IDEB_AF_STD ~  TH_MA4_STD * PE_MA4_STD"
model = ols(formula, df).fit()
model.summary()


Out[31]:
OLS Regression Results
Dep. Variable: IDEB_AF_STD R-squared: 0.385
Model: OLS Adj. R-squared: 0.374
Method: Least Squares F-statistic: 33.03
Date: Sat, 22 Sep 2018 Prob (F-statistic): 1.25e-16
Time: 00:19:53 Log-Likelihood: -187.38
No. Observations: 162 AIC: 382.8
Df Residuals: 158 BIC: 395.1
Df Model: 3
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 0.0073 0.063 0.116 0.908 -0.117 0.132
TH_MA4_STD -0.2838 0.067 -4.258 0.000 -0.415 -0.152
PE_MA4_STD -0.4735 0.065 -7.244 0.000 -0.603 -0.344
TH_MA4_STD:PE_MA4_STD -0.0265 0.054 -0.492 0.623 -0.133 0.080
Omnibus: 0.897 Durbin-Watson: 2.416
Prob(Omnibus): 0.638 Jarque-Bera (JB): 1.004
Skew: 0.162 Prob(JB): 0.605
Kurtosis: 2.792 Cond. No. 1.61


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Bayesian approach


In [32]:
from pymc3 import *


WARNING (theano.configdefaults): g++ not available, if using conda: `conda install m2w64-toolchain`
WARNING (theano.configdefaults): g++ not detected ! Theano will be unable to execute optimized C-implementations (for both CPU and GPU) and will default to Python implementations. Performance will be severely degraded. To remove this warning, set Theano flags cxx to an empty string.
WARNING (theano.tensor.blas): Using NumPy C-API based implementation for BLAS functions.

In [39]:
with Model() as unpooled_model:
    # GLM.from_formula builds the priors, the linear predictor and a Normal
    # likelihood directly from the patsy formula.
    GLM.from_formula("IDEB_AF_STD ~ TH_MA2_STD * PE_MA2_STD", df)

    # Inference
    trace = sample(1000, tune=500, cores=3)


Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (3 chains in 3 jobs)
NUTS: [sd, TH_MA2_STD:PE_MA2_STD, PE_MA2_STD, TH_MA2_STD, Intercept]
Sampling 3 chains: 100%|████████████████████████████████████████████████████████| 4500/4500 [02:26<00:00, 30.73draws/s]
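
Before plotting, the posterior can be summarized numerically (a sketch; `summary` comes from the star import above, as used later in this notebook):


In [ ]:
# Sketch: posterior means, sds and credible intervals for the GLM terms.
summary(trace)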

In [63]:
# traceplot creates its own figure; overlay posterior means as lines
traceplot(trace,
          lines={k: v['mean'] for k, v in summary(trace[-1000:]).iterrows()})



In [61]:
years = sorted(df.Ano.unique())
n_years = len(years)
# Map each edition (2007, 2009, ..., 2017) onto an index 0..n_years-1
year_idx = (df.Ano.values - 2007) // 2

with Model() as hmodel:
    # Hyper-Priors
    sigma_intercept = HalfCauchy('sigma_intercept', 2.)
    sigma_pe = HalfCauchy('sigma_pe', 2.)
    sigma_th = HalfCauchy('sigma_th', 2.)
    sigma_peth = HalfCauchy('sigma_peth', 2.)
    
    mu_intercept = Normal('mu_intercept', mu=0., sd=4)
    mu_pe = Normal('mu_pe', mu=-0.3, sd=2)
    mu_th = Normal('mu_th', mu=-0.5, sd=2)
    mu_peth = Normal('mu_peth', mu=-0.2, sd=2)
    
    # Priors
    intercept = Normal('intercept', mu=mu_intercept, sd=sigma_intercept, shape=n_years)
    beta_pe = Normal('beta_pe', mu=mu_pe, sd=sigma_pe, shape=n_years)
    beta_th = Normal('beta_th', mu=mu_th, sd=sigma_th, shape=n_years)
    beta_peth = Normal('beta_peth', mu=mu_peth, sd=sigma_peth, shape=n_years)

    eps = HalfCauchy('eps', 2.)
    # Model
    ideb = intercept[year_idx] +  beta_pe[year_idx] * df.PE_MA2_STD.values  +  beta_th[year_idx] * df.TH_MA2_STD.values + beta_peth[year_idx] * df.TH_MA2_STD.values * df.PE_MA2_STD.values
    
    # Likelihood
    likelihood = Normal('y', mu=ideb, sd=eps, observed=df.IDEB_AI_STD)
    
    
    
    # Inference
    trace = sample(500, tune=200, cores=1)


Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [eps, beta_peth, beta_th, beta_pe, intercept, mu_peth, mu_th, mu_pe, mu_intercept, sigma_peth, sigma_th, sigma_pe, sigma_intercept]
Sampling 2 chains: 100%|████████████████████████████████████████████████████████| 1400/1400 [34:18<00:00,  6.26draws/s]
There were 162 divergences after tuning. Increase `target_accept` or reparameterize.
The acceptance probability does not match the target. It is 0.5572116887634, but should be close to 0.8. Try to increase the number of tuning steps.
There were 23 divergences after tuning. Increase `target_accept` or reparameterize.
The gelman-rubin statistic is larger than 1.05 for some parameters. This indicates slight problems during sampling.
The estimated number of effective samples is smaller than 200 for some parameters.
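
Given the divergences reported above, the hierarchical estimates should be read with caution. The per-year coefficients can still be inspected directly (a sketch):


In [ ]:
# Sketch: posterior mean of the extreme-poverty coefficient per edition.
for yr, b in zip(years, trace['beta_pe'].mean(axis=0)):
    print(yr, round(float(b), 3))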

In [66]:
# traceplot creates its own figure; overlay posterior means as lines
traceplot(trace,
          lines={k: v['mean'] for k, v in summary(trace[-300:]).iterrows()})



Conclusion

The analysis carried out here provides evidence of the impact that socio-economic factors have on an indicator such as IDEB across the Brazilian states. These impacts may even be larger than those of factors directly linked to education, especially in the early years.

