Gini Coefficient Across the United States

Zohaib Anwar N10143738

My plan is to look at inequality across the U.S., specifically at the state level. I want to firstly, deduce which states have higher or lower inequality (as measured by the Gini index). Thereafter I want to determine what is driving the inequality across states, such as differing levels of education. I will use a variable for income earned in the past 12 months (wagp) as answered by each respondent of the data set described below, to construct a Gini coefficient by state (the state in which the individual resides is another variable st), and for the entire U.S. Then I will compare the results between states and the entire U.S. The other variables analyzed are age, nationality, years of schooling, sex, race, number of Hispanics, and whether the individuals work in STEM related fields. I will also check correlations between other variables and the Gini coefficient to determine whether they are affecting which states have higher or lower inequality. After doing this I hope to be better able to explain inequality across the U.S., which states have more/less of it, and why.

Data

In this project I will use the Public Use Microdata Sample, which contains data from the American Community Survey, from the years 2011-2015. This data source has information on both personal and household levels of individuals across the U.S. The data is released yearly, and contains information on about millions of individuals (about 295,000 individuals are surveyed monthly). I will use the individual records. The data itself is responses to several survey questions given by these individuals. It contains several variables. I will use wagp, or the variable for income earned in the past 12 months as answered by each respondent, to construct a Gini coefficient by state (the state in which the individual resides is another variable st), and for the entire U.S. Then I will compare the results between states and the entire U.S. I will also check correlations between other variables and the Gini coefficient to determine whether they are affecting which states have higher or lower inequality. I will look at years of schooling, property tax, age, and the number of minorities and immigrants.

Below is the code I have used to access the data. Most of the columns from the data set are not needed for this analysis, hence I have specified some of the specific columns I am interested in and removed the rest of them.



In [83]:

    
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import zipfile as zf
import numpy as np
import requests, io 
import shutil
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity="all"
import html5lib
import plotly 
from plotly.offline import iplot, iplot_mpl 
import plotly.graph_objs as go
%matplotlib inline
plotly.offline.init_notebook_mode(connected=True)
from bs4 import BeautifulSoup
import statsmodels.api as sm



In [2]:

    
url= "https://www2.census.gov/programs-surveys/acs/data/pums/2015/5-Year/csv_pus.zip"
r=requests.get(url,timeout=1)                                                    #requesting the zipfile from the website



In [3]:

    
mlz = zf.ZipFile(io.BytesIO(r.content)) 
mlz.namelist()









    Out[3]:





['ss15pusa.csv',
 'ss15pusb.csv',
 'ss15pusc.csv',
 'ss15pusd.csv',
 'ACS2011_2015_PUMS_README.pdf']



In [4]:

    
cols=['ADJINC','PUMA00','ST','AGEP','CIT','ENG','MAR','MIL','SCHL','SEX','WAGP','HISP','NATIVITY','PINCP','RAC1P','SCIENGRLP']
na=['b','bb','bbb','bbbb','bbbbb','bbbbb','bbbbbb','bbbbbbb','bbbbbbbb']
url1='https://gist.githubusercontent.com/JeffPaine/3083347/'
url2='raw/7460b8beb99f717efb87c6e388903852fc52caf9/us_state_abbreviations.py'
r=requests.get(url1+url2) 
exec(BeautifulSoup(r.content,'lxml').get_text())  
                                                  
#Creating a list of states in a variable conveniently named State by scraping
#a ready-made list from github.



In [5]:

    
PUMS1=pd.read_csv(mlz.open(mlz.namelist()[0]),usecols=cols,na_values=na)
PUMS2=pd.read_csv(mlz.open(mlz.namelist()[1]),usecols=cols,na_values=na)
PUMS3=pd.read_csv(mlz.open(mlz.namelist()[2]),usecols=cols,na_values=na)
PUMS4=pd.read_csv(mlz.open(mlz.namelist()[3]),usecols=cols,na_values=na)         #Saving the seperate csv files into variables



In [6]:

    
PUMS=pd.concat([PUMS1,PUMS2,PUMS3,PUMS4])                         #Combining the files and renaming the columns and indices
PUMS['ADJINC']=PUMS['ADJINC'].map({1073094:2011,1042852:2012,1025215:2013,1009585:2014,1001264:2015})
PUMS['ST']=PUMS['ST'].map({1:"AL", 2:"AK", 4:"AZ", 5:"AR", 6:"CA", 8:"CO", 9:"CT",11:"DC" ,10:"DE", 12:"FL", 13:"GA", 
          15:"HI", 16:"ID", 17:"IL", 18:"IN", 19:"IA", 20:"KS", 21:"KY", 22:"LA", 23:"ME", 24:"MD", 
          25:"MA", 26:"MI", 27:"MN", 28:"MS", 29:"MO", 30:"MT", 31:"NE", 32:"NV", 33:"NH", 34:"NJ", 
          35:"NM", 36:"NY", 37:"NC", 38:"ND", 39:"OH", 40:"OK", 41:"OR", 42:"PA", 44:"RI", 45:"SC", 
          46:"SD", 47:"TN", 48:"TX", 49:"UT", 50:"VT", 51:"VA", 53:"WA", 54:"WV", 55:"WI", 56:"WY",72:"PR"})



In [7]:

    
#Creating a new dataframe with the required index and states
Gini=pd.DataFrame([x for x in list(range(2011,2016)) for _ in range(52)],
                  columns=['Year']).merge(pd.DataFrame((['US']+([x for x in states]))*5,columns=['State']),
                                          left_index=True,right_index=True).set_index(['Year','State'])
#Calculating Gini coefficient for entire U.S. and by state, by finding 
#the area under the Lorenz curve and the line of equality.
Gini['Gini']=np.nan
def Gini_calculation(year,st=None):
    if st==None:
        Gini1=PUMS[(PUMS['ADJINC']==year) & (PUMS['WAGP']>0)].sort_values('WAGP')
        Gini1['Equality']=np.linspace(0.0,1.0,len(Gini1['WAGP']))
        Gini1['%']=Gini1['WAGP']/sum(Gini1['WAGP'])
        Gini1['cumulative']=np.cumsum(Gini1['%'])
        area_perf=np.trapz(Gini1['Equality'],Gini1['Equality'])
        area_lorenz=np.trapz(Gini1['cumulative'],Gini1['Equality'])
        Gini_US=(area_perf-area_lorenz)/area_perf
        Gini.loc[(year,'US'),'Gini']=Gini_US
    else:
        Gini1=PUMS[(PUMS['ADJINC']==year) & (PUMS['ST']==st) & (PUMS['WAGP']>0)].sort_values('WAGP')
        Gini1['Equality']=np.linspace(0.0,1.0,len(Gini1['WAGP']))
        Gini1['%']=Gini1['WAGP']/sum(Gini1['WAGP'])
        Gini1['cumulative']=np.cumsum(Gini1['%'])
        area_perf=np.trapz(Gini1['Equality'],Gini1['Equality'])
        area_lorenz=np.trapz(Gini1['cumulative'],Gini1['Equality'])
        Gini_ST=(area_perf-area_lorenz)/area_perf
        Gini.loc[(year,st),'Gini']=Gini_ST

for x in list(range(2011,2016)):
    Gini_calculation(x)
    for y in states:
        Gini_calculation(x,st=y)



In [8]:

    
#Creating a Quartile and Scale variable to determine whether the 
#Gini coefficient is below, above or equal to the
#coefficient of the U.S. for that year.
Gini['Scale']=np.NaN
Gini['Scale']=Gini['Scale'].astype(object)
def Gini_Scale(x):
    for g,i in enumerate(Gini.loc[x]['Gini']):
        a=Gini.loc[x].index[g]
        if i>Gini.loc[(x,'US'),'Gini']:
            Gini.loc[(x,a),'Scale':'Scale']='Larger'
        elif i<Gini.loc[(x,'US'),'Gini']:
            Gini.loc[(x,a),'Scale':'Scale']='Smaller'
        elif i==Gini.loc[(x,'US'),'Gini']:
            Gini.loc[(x,a),'Scale':'Scale']='Equal'
for year in list(range(2011,2016)):
    Gini_Scale(year)



In [9]:

    
#Mapping the Gini Coefficient for the latest year of data to get 
#a sense of the spread.
data = [ dict(
        type='choropleth',
        locations = states,
        z =Gini.xs(2015,level=0)['Gini'].drop('US'),
        locationmode = 'USA-states',
        text = Gini.xs(2012,level=0)['Gini'].drop('US'),
        marker = dict(
            line = dict (
                color = 'rgb(255,255,255)',
                width = 2
            ) ),
        colorbar = dict(
            title = "Gini Coefficient")
        ) ]

layout = dict(
        title = 'Gini Coefficient<br>(Hover for breakdown)',
        geo = dict(
            scope='usa',
            projection=dict( type='albers usa' ),
            showlakes = False,
            lakecolor = 'rgb(255, 255, 255)'),
             )
iplot(go.Figure(data=data, layout=layout), link_text="")

Here it is interesting to note that most of the states, in 2015, with higher Gini coefficients are those along the coast such as California, Texas, New York and New Jersey. It is also obvious that not many states have relatively large or lower Gini coefficients, in fact only a few seem to stand out. I will explore this more below.



In [10]:

    
#Here I create columns for each of the relevant variables
#and a function that calculates the proportion of
#Minority, Immigrants, Hispanic, people in STEM related fields,
#as well as average schooling and age for each state and year,
#well as for the U.S.as
PEd=PUMS.set_index(['ADJINC','ST']).sort_index(level='ST') 
Gini['P_Sex']=np.nan                                        
Gini['P_Min']=np.nan                                       
Gini['P_Im']=np.nan                                        
Gini['P_Hisp']=np.nan                                      
Gini['Stem']=np.nan                                        
Gini['Age']=np.nan
Gini['Schooling']=np.nan
def Proportion_Calc(x,y=None):
    if y==None:
        Gini.loc[(x,'US'),'P_Sex']=100*PEd.loc[x]['SEX'][PEd.loc[x]['SEX']==2].count()/PEd.\
        loc[x].shape[0]
        Gini.loc[(x,'US'),'P_Min']=100*PEd.loc[x]['RAC1P'][PEd.loc[x]['RAC1P'].isin([2,3,4,5,7])].count()/PEd.\
        loc[x].shape[0]
        Gini.loc[(x,'US'),'P_Im']=100*PEd.loc[x]['NATIVITY'][PEd.loc[x]['NATIVITY']==2].count()/PEd.loc[x]. \
        shape[0]
        Gini.loc[(x,'US'),'P_Hisp']=100*PEd.loc[x]['HISP'][PEd.loc[x]['HISP']>1].count()/PEd.loc[x]. \
        shape[0]
        Gini.loc[(x,'US'),'Stem']=100*PEd.loc[x]['SCIENGRLP'][PEd.loc[x]['SCIENGRLP']==1].count()/PEd.loc[x]. \
        shape[0]
        Gini.loc[(x,'US'),'Age']=PEd.loc[x]['AGEP'].mean()
        Gini.loc[(x,'US'),'Schooling']=PEd.loc[x]['SCHL'].mean()
    if y!=None:
        Gini.loc[(x,y),'P_Sex']=100*PEd.loc[x][y:y]['SEX'][PEd.loc[x][y:y]['SEX']==2].count()/PEd. \
        loc[x][y:y].shape[0]
        Gini.loc[(x,y),'P_Min']=100*PEd.loc[x][y:y]['RAC1P'][PEd.loc[x][y:y]['RAC1P'].isin([2,3,4,5,7])].count()/PEd. \
        loc[x][y:y].shape[0]
        Gini.loc[(x,y),'P_Im']=100*PEd.loc[x][y:y]['NATIVITY'][PEd.loc[x][y:y]['NATIVITY']==2].count()/PEd.loc[x][y:y]. \
        shape[0]
        Gini.loc[(x,y),'P_Hisp']=100*PEd.loc[x][y:y]['HISP'][PEd.loc[x][y:y]['HISP']>1].count()/PEd.loc[x][y:y]. \
        shape[0]
        Gini.loc[(x,y),'Stem']=100*PEd.loc[x][y:y]['SCIENGRLP'][PEd.loc[x][y:y]['SCIENGRLP']==1].count()/PEd.loc[x][y:y]. \
        shape[0]
        Gini.loc[(x,y),'Age']=PEd.loc[x][y:y]['AGEP'].mean()
        Gini.loc[(x,y),'Schooling']=PEd.loc[x][y:y]['SCHL'].mean()



In [11]:

    
for year in list(range(2011,2016)):
    Proportion_Calc(year)
    for st in states:
        Proportion_Calc(year,y=st)



In [13]:

    
#Creating sorted variables to determine which states have higher
#values for each variable
def Sorted(z):
    for year in list(range(2011,2016)):
        globals()['{}_{}'.format(z,year)]=Gini.loc[year].sort_values('{}'.format(z),ascending=False)
        globals()['{}_{}'.format(z,year)]['Scale']=np.nan
        for g,i in enumerate(globals()['{}_{}'.format(z,year)]['{}'.format(z)]):
            a=globals()['{}_{}'.format(z,year)].index[g]
            if i>globals()['{}_{}'.format(z,year)].loc['US','{}'.format(z)]:
                globals()['{}_{}'.format(z,year)].loc[a,'Scale':'Scale']='Larger'
            elif i<globals()['{}_{}'.format(z,year)].loc['US','{}'.format(z)]:
                globals()['{}_{}'.format(z,year)].loc[a,'Scale':'Scale']='Smaller'
            elif i==globals()['{}_{}'.format(z,year)].loc['US','{}'.format(z)]:
                globals()['{}_{}'.format(z,year)].loc[a,'Scale':'Scale']='Equal'
    globals()['All_{}'.format(z)]=[]
    for year in list(range(2011,2016)):
        globals()['Order_{}_{}'.format(z,year)]=pd.DataFrame()
        globals()['Order_{}_{}'.format(z,year)]['Scale']=globals()['{}_{}'.format(z,year)]['Scale'].drop('US')
        globals()['All_{}'.format(z)].append(globals()['Order_{}_{}'.format(z,year)])
    globals()['C_{}'.format(z)]=pd.DataFrame()
    globals()['D_{}'.format(z)]=pd.DataFrame()
    globals()['E_{}'.format(z)]=pd.DataFrame()
    globals()['F_{}'.format(z)]=pd.DataFrame()
    for value in globals()['All_{}'.format(z)]:
        globals()['C_{}'.format(z)]=pd.concat([globals()['C_{}'.format(z)],pd.Series(value[value['Scale']=='Larger'].index.values).head(5).reset_index(drop=True)],axis=1,ignore_index=True)
        globals()['D_{}'.format(z)]=pd.concat([globals()['D_{}'.format(z)],pd.Series(value[value['Scale']=='Smaller'].index.values).tail(5).reset_index(drop=True)],axis=1,ignore_index=True)
        globals()['E_{}'.format(z)]=pd.concat([globals()['E_{}'.format(z)],pd.Series(value[value['Scale']=='Larger'].index.values)],axis=1,ignore_index=True)
        globals()['F_{}'.format(z)]=pd.concat([globals()['F_{}'.format(z)],pd.Series(value[value['Scale']=='Smaller'].index.values)],axis=1,ignore_index=True)



In [14]:

    
Sorted('Gini')
Sorted('P_Sex')
Sorted('P_Min')
Sorted('P_Im')
Sorted('P_Hisp')
Sorted('Stem')
Sorted('Age')
Sorted('Schooling')



In [15]:

    
#Creating tables to display states that have both a higher Gini coefficient
#and higher values for each of the variables of interest
Upper=pd.DataFrame('State: ',columns=list(range(2011,2016)),index=['Fem','Min','Hisp','Imm','Stem','Age','Schooling'])
for i in C_Gini.columns:
    for a in C_Gini[i]:
        if a in list(E_P_Sex[i]):
            Upper.ix['Fem':'Fem',i+2011:i+2011]=Upper.ix['Fem':'Fem',i+2011:i+2011]+' {}'.format(a)
    for a in C_Gini[i]:
        if a in list(E_P_Min[i]):
            Upper.ix['Min':'Min',i+2011:i+2011]=Upper.ix['Min':'Min',i+2011:i+2011]+' {}'.format(a)
    for a in C_Gini[i]:
        if a in list(E_P_Im[i]):
            Upper.ix['Imm':'Imm',i+2011:i+2011]=Upper.ix['Imm':'Imm',i+2011:i+2011]+' {}'.format(a)
    for a in C_Gini[i]:
        if a in list(E_P_Hisp[i]):
            Upper.ix['Hisp':'Hisp',i+2011:i+2011]=Upper.ix['Hisp':'Hisp',i+2011:i+2011]+' {}'.format(a)
    for a in C_Gini[i]:
        if a in list(E_Schooling[i]):
            Upper.ix['Schooling':'Schooling',i+2011:i+2011]=Upper.ix['Schooling':'Schooling',i+2011:i+2011]+' {}'.format(a)
    for a in C_Gini[i]:
        if a in list(E_Stem[i]):
            Upper.ix['Stem':'Stem',i+2011:i+2011]=Upper.ix['Stem':'Stem',i+2011:i+2011]+' {}'.format(a)
    for a in C_Gini[i]:
        if a in list(E_Age[i]):
            Upper.ix['Age':'Age',i+2011:i+2011]=Upper.ix['Age':'Age',i+2011:i+2011]+' {}'.format(a)
Lower=pd.DataFrame('State: ',columns=list(range(2011,2016)),index=['Fem','Min','Hisp','Imm','Stem','Age','Schooling'])
for i in D_Gini.columns:
    for a in D_Gini[i]:
        if a in list(F_P_Sex[i]):
            Lower.ix['Fem':'Fem',i+2011:i+2011]=Lower.ix['Fem':'Fem',i+2011:i+2011]+' {}'.format(a)
    for a in D_Gini[i]:
        if a in list(F_P_Min[i]):
            Lower.ix['Min':'Min',i+2011:i+2011]=Lower.ix['Min':'Min',i+2011:i+2011]+' {}'.format(a)
    for a in D_Gini[i]:
        if a in list(F_P_Im[i]):
            Lower.ix['Imm':'Imm',i+2011:i+2011]=Lower.ix['Imm':'Imm',i+2011:i+2011]+' {}'.format(a)
    for a in D_Gini[i]:
        if a in list(F_P_Hisp[i]):
            Lower.ix['Hisp':'Hisp',i+2011:i+2011]=Lower.ix['Hisp':'Hisp',i+2011:i+2011]+' {}'.format(a)
    for a in D_Gini[i]:
        if a in list(F_Schooling[i]):
            Lower.ix['Schooling':'Schooling',i+2011:i+2011]=Lower.ix['Schooling':'Schooling',i+2011:i+2011]+' {}'.format(a)
    for a in D_Gini[i]:
        if a in list(F_Stem[i]):
            Lower.ix['Stem':'Stem',i+2011:i+2011]=Lower.ix['Stem':'Stem',i+2011:i+2011]+' {}'.format(a)
    for a in D_Gini[i]:
        if a in list(F_Age[i]):
            Lower.ix['Age':'Age',i+2011:i+2011]=Lower.ix['Age':'Age',i+2011:i+2011]+' {}'.format(a)



In [16]:

    
Upper









    Out[16]:






  
    
      
      2011
      2012
      2013
      2014
      2015
    
  
  
    
      Fem
      State:  CT NY MA
      State:  CT MA NY DC
      State:  CT NY DC MA
      State:  CT NY MA NJ
      State:  CT MA NY
    
    
      Min
      State:  NY
      State:  NY DC
      State:  NY DC
      State:  NY NJ
      State:  NY
    
    
      Hisp
      State:  CA
      State:  CA
      State:  CA
      State:  CA NJ
      State:  CA
    
    
      Imm
      State:  CT NY CA MA
      State:  CT MA CA NY DC
      State:  CT NY CA DC MA
      State:  CT NY MA CA NJ
      State:  CT MA NY CA
    
    
      Stem
      State:  CT NY MA
      State:  CT MA NY
      State:  CT NY MA
      State:  CT NY MA NJ
      State:  CT MA NY UT
    
    
      Age
      State:  CT NY MA
      State:  CT MA NY
      State:  CT NY MA
      State:  CT NY MA NJ
      State:  CT MA NY
    
    
      Schooling
      State:  CT NY MA
      State:  CT MA NY DC
      State:  CT NY DC MA
      State:  CT NY MA NJ
      State:  CT MA NY

Here I have displayed the states the states with the highest Gini Coefficient (top 5), and whether or not they also have higher proportions and averages for each of the variables than the United States. As it can be seen most of the variables have somewhere from 2-5 states that also have higher values than the United States. This implies that most of the variables, proportion female, minority, hispanic, immigrant, people with a degree in a STEM related field, average age and schooling all may have a positive effect on inequality. This is is as expected for most of the variables as one would expect number of females, minority, hispanic, immigrants and those in a STEM field to increase inequality in a state, as well as states with higher average age. Average schooling on the other hand, I would expect to negatively affect inequality, since a higher level of schooling implies that more people are educated, but it could also imply that there are more skilled versus unskilled workers in the state leading to higher inequality.



In [17]:

    
Lower









    Out[17]:






  
    
      
      2011
      2012
      2013
      2014
      2015
    
  
  
    
      Fem
      State:  MT IA WY HI SD
      State:  NV IA WY SD HI
      State:  ND IA SD HI WY
      State:  IA WY NV SD HI
      State:  WI IA SD ND HI
    
    
      Min
      State:  MT IA WY HI SD
      State:  NV IA WY SD
      State:  ND IA SD WY
      State:  IA WY NV SD
      State:  WI IA SD ND
    
    
      Hisp
      State:  MT IA WY HI SD
      State:  IA WY SD HI
      State:  ND IA SD HI WY
      State:  IA WY SD HI
      State:  WI IA SD ND HI
    
    
      Imm
      State:  MT IA WY SD
      State:  IA WY SD
      State:  ND IA SD WY
      State:  IA WY SD
      State:  WI IA SD ND
    
    
      Stem
      State:  IA
      State:  NV IA WY HI
      State:  IA WY
      State:  IA WY NV
      State:  IA SD HI
    
    
      Age
      State:  WY
      State:  NV WY
      State:  SD WY
      State:  WY NV SD
      State:  SD ND
    
    
      Schooling
      State:  SD
      State:  NV SD
      State:  IA SD
      State:  NV SD
      State:  IA SD ND

A similar conclusion can be drawn from this table displaying states with the lowest (bottom 5) Gini coefficients, and whether or not they also have lower values than the United States for each of the variables. The results almost mirror that of the Upper table.



In [69]:

    
fig,ax= plt.subplots(7,1, figsize=(20,30))
sns.violinplot(x='Year',y='P_Sex',hue='Scale', ax=ax[0],\
               data=Gini.reset_index()[Gini.reset_index()['State']!='US']\
              )
sns.violinplot(x='Year',y='P_Min',hue='Scale', ax=ax[1],\
               data=Gini.reset_index()[Gini.reset_index()['State']!='US']\
              )
sns.violinplot(x='Year',y='P_Im',hue='Scale', ax=ax[2],\
               data=Gini.reset_index()[Gini.reset_index()['State']!='US']\
              )
sns.violinplot(x='Year',y='P_Hisp',hue='Scale', ax=ax[3],\
               data=Gini.reset_index()[Gini.reset_index()['State']!='US']\
              )
sns.violinplot(x='Year',y='Stem',hue='Scale', ax=ax[4],\
               data=Gini.reset_index()[Gini.reset_index()['State']!='US']\
              )
sns.violinplot(x='Year',y='Age',hue='Scale', ax=ax[5],\
               data=Gini.reset_index()[Gini.reset_index()['State']!='US']\
              )
sns.violinplot(x='Year',y='Schooling',hue='Scale', ax=ax[6],\
               data=Gini.reset_index()[Gini.reset_index()['State']!='US']\
              )
ax[0].set_title('Larger versus Smaller',fontsize='x-large')
ax[0].legend(loc='upper left',fontsize='large')
ax[1].legend(loc='upper left')
ax[2].legend(loc='upper left')
ax[3].legend(loc='upper left')
ax[4].legend(loc='upper left')
ax[5].legend(loc='upper left')
ax[6].legend(loc='upper left')
ax[0].set_ylabel('Sex',fontsize='x-large')
ax[0].set_xlabel('Year',fontsize='x-large')
ax[0].tick_params('x',labelsize='large')
ax[1].set_ylabel('Minority',fontsize='x-large')
ax[1].set_xlabel('Year',fontsize='x-large')
ax[1].tick_params('x',labelsize='large')
ax[2].set_ylabel('Immigrant',fontsize='x-large')
ax[2].set_xlabel('Year',fontsize='x-large')
ax[2].tick_params('x',labelsize='large')
ax[3].set_ylabel('Hispanic',fontsize='x-large')
ax[3].set_xlabel('Year',fontsize='x-large')
ax[3].tick_params('x',labelsize='large')
ax[4].set_ylabel('Stem',fontsize='x-large')
ax[4].set_xlabel('Year',fontsize='x-large')
ax[4].tick_params('x',labelsize='large')
ax[5].set_ylabel('Age',fontsize='x-large')
ax[5].set_xlabel('Year',fontsize='x-large')
ax[5].tick_params('x',labelsize='large')
ax[6].set_ylabel('Schooling',fontsize='x-large')
ax[6].set_xlabel('Year',fontsize='x-large')
ax[6].tick_params('x',labelsize='large');

These are violin plots for each of the variables separating the data sets into a group for those states with a larger Gini Coefficient than the U.S. and those with lower. As one can see here, it seems a larger Gini coefficient is associated with a higher chance (and median) of getting a larger propotion for pretty much all of the variables except Age. Hence, higher Gini coefficient, these plots imply, is associated with a higher average schooling, a higher propotion of those with a STEM related degree, Hispanic, immigrant, minority, and female. Age it seems is saying that a higher Gini coefficient is associated with a lower average age, though this does not conceptually seem sound as more inequality should imply higher age as the older are more likely to have unequaly income, although this may say otherwise.



In [75]:

    
sns.jointplot(x='P_Sex',y='Gini',\
               data=Gini_2011.drop('US'), kind="kde");









    



C:\Users\za369\AppData\Local\Continuum\Anaconda3\lib\site-packages\statsmodels\nonparametric\kdetools.py:20: VisibleDeprecationWarning:

using a non-integer number instead of an integer will result in an error in the future



In [76]:

    
sns.jointplot(x='P_Im',y='Gini',\
               data=Gini_2011.drop('US'), kind="kde");









    



C:\Users\za369\AppData\Local\Continuum\Anaconda3\lib\site-packages\statsmodels\nonparametric\kdetools.py:20: VisibleDeprecationWarning:

using a non-integer number instead of an integer will result in an error in the future



In [77]:

    
sns.jointplot(x='P_Min',y='Gini',\
               data=Gini_2011.drop('US'), kind="kde");









    



C:\Users\za369\AppData\Local\Continuum\Anaconda3\lib\site-packages\statsmodels\nonparametric\kdetools.py:20: VisibleDeprecationWarning:

using a non-integer number instead of an integer will result in an error in the future



In [78]:

    
sns.jointplot(x='P_Hisp',y='Gini',\
               data=Gini_2011.drop('US'), kind="kde");









    



C:\Users\za369\AppData\Local\Continuum\Anaconda3\lib\site-packages\statsmodels\nonparametric\kdetools.py:20: VisibleDeprecationWarning:

using a non-integer number instead of an integer will result in an error in the future



In [79]:

    
sns.jointplot(x='Stem',y='Gini',\
               data=Gini_2011.drop('US'), kind="kde");









    



C:\Users\za369\AppData\Local\Continuum\Anaconda3\lib\site-packages\statsmodels\nonparametric\kdetools.py:20: VisibleDeprecationWarning:

using a non-integer number instead of an integer will result in an error in the future



In [80]:

    
sns.jointplot(x='Age',y='Gini',\
               data=Gini_2011.drop('US'), kind="kde");









    



C:\Users\za369\AppData\Local\Continuum\Anaconda3\lib\site-packages\statsmodels\nonparametric\kdetools.py:20: VisibleDeprecationWarning:

using a non-integer number instead of an integer will result in an error in the future



In [81]:

    
sns.jointplot(x='Schooling',y='Gini',\
               data=Gini_2011.drop('US'), kind="kde");









    



C:\Users\za369\AppData\Local\Continuum\Anaconda3\lib\site-packages\statsmodels\nonparametric\kdetools.py:20: VisibleDeprecationWarning:

using a non-integer number instead of an integer will result in an error in the future

The above figures are heatmaps detailing the correlation between Gini and the variables in question. As one can see they confirm the results from before, as all of the variables have positive correlation, except agae again, which is questionable.



In [86]:

    
X=Gini[['P_Sex','Schooling','Age','P_Hisp','P_Min','P_Im','Stem']]
X = sm.add_constant(X)
est=sm.OLS(Gini['Gini'],X).fit()
est.summary()









    Out[86]:





OLS Regression Results

  Dep. Variable:           Gini          R-squared:             0.419


  Model:                    OLS          Adj. R-squared:        0.403


  Method:              Least Squares     F-statistic:           25.99


  Date:              Fri, 12 May 2017    Prob (F-statistic):  1.18e-26


  Time:                  16:51:06        Log-Likelihood:       748.58


  No. Observations:          260         AIC:                  -1481.


  Df Residuals:              252         BIC:                  -1453.


  Df Model:                    7                                     


  Covariance Type:       nonrobust                                   




               coef      std err       t       P>|t|  [95.0% Conf. Int.] 


  const          0.2346      0.055      4.283   0.000      0.127     0.342


  P_Sex          0.0068      0.001      6.121   0.000      0.005     0.009


  Schooling      0.0012      0.003      0.422   0.673     -0.004     0.007


  Age           -0.0037      0.001     -5.706   0.000     -0.005    -0.002


  P_Hisp         0.0003      0.000      1.610   0.109  -5.66e-05     0.001


  P_Min      -6.222e-05      0.000     -0.559   0.577     -0.000     0.000


  P_Im           0.0007      0.000      2.633   0.009      0.000     0.001


  Stem           0.0186      0.004      4.563   0.000      0.011     0.027




  Omnibus:        25.764    Durbin-Watson:         2.218


  Prob(Omnibus):   0.000    Jarque-Bera (JB):     33.574


  Skew:           -0.691    Prob(JB):           5.12e-08


  Kurtosis:        4.090    Cond. No.           4.45e+03

Here I provide simple regression results. The coefficients for Sex, Schooling, Age, immigration, and Stem are all significant, and also confirm our previous results. The R-Squared is .419 implying that 41.9% of the variation in the Gini Coefficient is explained by the variables

These results are questionable though as they are a simple regression on multi-level and multi-year data. What really needs to be done is a panel regression.

Conclusion

Through this project I have calculated the Gini coefficient across the states, and for the U.S., displayed these results in a map, observed what a higher Gini coefficiecnt means in terms of the other variables, and have provided regression results. For the most part it seems all of the variables, proportion of females in a state, minority, hispanics, and those with STEM degree as well as higher average schooling all imply higher inequality in the state, while the results show that age is associated with lower inequality.

These results are somewhat questionable as a panel regression is not provided. Similarly, the topcoding in the PUMS dataset, especially for wage, is not accounted for. Other variables not from this data set should also be observed, as there are definitely other causes of inequality not addressed here. Furthermore, one can compare across countries, or even across smaller units of area such as Metropolitan Statistical Areas. These are all areas of further improvement.

	2011	2012	2013	2014	2015
Fem	State: CT NY MA	State: CT MA NY DC	State: CT NY DC MA	State: CT NY MA NJ	State: CT MA NY
Min	State: NY	State: NY DC	State: NY DC	State: NY NJ	State: NY
Hisp	State: CA	State: CA	State: CA	State: CA NJ	State: CA
Imm	State: CT NY CA MA	State: CT MA CA NY DC	State: CT NY CA DC MA	State: CT NY MA CA NJ	State: CT MA NY CA
Stem	State: CT NY MA	State: CT MA NY	State: CT NY MA	State: CT NY MA NJ	State: CT MA NY UT
Age	State: CT NY MA	State: CT MA NY	State: CT NY MA	State: CT NY MA NJ	State: CT MA NY
Schooling	State: CT NY MA	State: CT MA NY DC	State: CT NY DC MA	State: CT NY MA NJ	State: CT MA NY

	2011	2012	2013	2014	2015
Fem	State: MT IA WY HI SD	State: NV IA WY SD HI	State: ND IA SD HI WY	State: IA WY NV SD HI	State: WI IA SD ND HI
Min	State: MT IA WY HI SD	State: NV IA WY SD	State: ND IA SD WY	State: IA WY NV SD	State: WI IA SD ND
Hisp	State: MT IA WY HI SD	State: IA WY SD HI	State: ND IA SD HI WY	State: IA WY SD HI	State: WI IA SD ND HI
Imm	State: MT IA WY SD	State: IA WY SD	State: ND IA SD WY	State: IA WY SD	State: WI IA SD ND
Stem	State: IA	State: NV IA WY HI	State: IA WY	State: IA WY NV	State: IA SD HI
Age	State: WY	State: NV WY	State: SD WY	State: WY NV SD	State: SD ND
Schooling	State: SD	State: NV SD	State: IA SD	State: NV SD	State: IA SD ND

Dep. Variable:	Gini	R-squared:	0.419
Model:	OLS	Adj. R-squared:	0.403
Method:	Least Squares	F-statistic:	25.99
Date:	Fri, 12 May 2017	Prob (F-statistic):	1.18e-26
Time:	16:51:06	Log-Likelihood:	748.58
No. Observations:	260	AIC:	-1481.
Df Residuals:	252	BIC:	-1453.
Df Model:	7
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[95.0% Conf. Int.]
const	0.2346	0.055	4.283	0.000	0.127 0.342
P_Sex	0.0068	0.001	6.121	0.000	0.005 0.009
Schooling	0.0012	0.003	0.422	0.673	-0.004 0.007
Age	-0.0037	0.001	-5.706	0.000	-0.005 -0.002
P_Hisp	0.0003	0.000	1.610	0.109	-5.66e-05 0.001
P_Min	-6.222e-05	0.000	-0.559	0.577	-0.000 0.000
P_Im	0.0007	0.000	2.633	0.009	0.000 0.001
Stem	0.0186	0.004	4.563	0.000	0.011 0.027

Omnibus:	25.764	Durbin-Watson:	2.218
Prob(Omnibus):	0.000	Jarque-Bera (JB):	33.574
Skew:	-0.691	Prob(JB):	5.12e-08
Kurtosis:	4.090	Cond. No.	4.45e+03