Income Inequality between high earners and low earners

A critique of http://www.informationisbeautiful.net/visualizations/what-are-wallst-protestors-angry-about/



In [1]:

    
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
sns.set(palette = sns.dark_palette("skyblue", 8, reverse=True))









    



/usr/local/lib/python3.4/site-packages/matplotlib/__init__.py:872: UserWarning: axes.color_cycle is deprecated and replaced with axes.prop_cycle; please use the latter.
  warnings.warn(self.msg_depr % (key, alt_key))

Getting the data

Before going into the purely visual aspects and how effectively they convey a story, I want to understand what data we are dealing with. At the bottom of the graph, there is a bit.ly URL that points to a Google Drive document. Appending export?format=xlsx to the document URL lets us download it as an Excel spreadsheet, which can then be sliced and diced easily with the pandas analytics module.



In [2]:

    
!wget 'https://docs.google.com/spreadsheets/d/1N_Hc-xKr7DQc8bZAvLROGWr5Cr-A6MfGnH91fFW3ZwA/export?format=xlsx&id=1N_Hc-xKr7DQc8bZAvLROGWr5Cr-A6MfGnH91fFW3ZwA' -O wallstreet.xlsx









    



--2015-11-18 10:07:04--  https://docs.google.com/spreadsheets/d/1N_Hc-xKr7DQc8bZAvLROGWr5Cr-A6MfGnH91fFW3ZwA/export?format=xlsx&id=1N_Hc-xKr7DQc8bZAvLROGWr5Cr-A6MfGnH91fFW3ZwA
Resolving docs.google.com... 216.58.219.238, 2607:f8b0:4006:80f::200e
Connecting to docs.google.com|216.58.219.238|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/vnd.openxmlformats-officedocument.spreadsheetml.sheet]
Saving to: “wallstreet.xlsx”

    [ <=>                                   ] 9,302       --.-K/s   in 0.002s  

2015-11-18 10:07:05 (4.77 MB/s) - “wallstreet.xlsx” saved [9302]
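
For reference, the export URL used with wget above can also be assembled in Python. This is only a sketch: the helper name `export_url` is hypothetical, and the URL pattern is simply copied from the wget command rather than taken from any official API description.

```python
from urllib.parse import urlencode


def export_url(doc_id, fmt="xlsx"):
    """Build a spreadsheet export URL following the pattern used in the wget call."""
    base = "https://docs.google.com/spreadsheets/d/{}/export".format(doc_id)
    # A list of pairs keeps the query parameters in a fixed order.
    return base + "?" + urlencode([("format", fmt), ("id", doc_id)])


DOC_ID = "1N_Hc-xKr7DQc8bZAvLROGWr5Cr-A6MfGnH91fFW3ZwA"
url = export_url(DOC_ID)
print(url)

# To actually download the file (needs network access):
# from urllib.request import urlretrieve
# urlretrieve(url, "wallstreet.xlsx")
```

Keeping the document id in one place makes it easy to request another format by changing `fmt`, assuming the service supports it.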



In [3]:

    
df = pd.read_excel('wallstreet.xlsx', skiprows=1, index_col = 'Country')



In [4]:

    
df.describe()









    Out[4]:






  
    
      
      Gini
      Year
      Unnamed: 4
      Unnamed: 5
      Unnamed: 6
    
  
  
    
      count
      135.000000
      135.000000
      0
      0
      0
    
    
      mean
      39.993333
      2005.607407
      NaN
      NaN
      NaN
    
    
      std
      9.920762
      3.919138
      NaN
      NaN
      NaN
    
    
      min
      23.000000
      1989.000000
      NaN
      NaN
      NaN
    
    
      25%
      32.350000
      2005.000000
      NaN
      NaN
      NaN
    
    
      50%
      39.000000
      2007.000000
      NaN
      NaN
      NaN
    
    
      75%
      45.800000
      2008.000000
      NaN
      NaN
      NaN
    
    
      max
      70.700000
      2010.000000
      NaN
      NaN
      NaN

The first issue with the data is visible right away: the years span a wide range, from 1989 to 2010. Let's look at how the years are distributed. We would probably want to use only 2010, if it has enough data points. We will also make a note of 39.99 as the average Gini coefficient over all those years.



In [5]:

    
df['Year'].hist(bins=22)  # 22 bins: one per year from 1989 to 2010, so each year gets its own count









    Out[5]:





<matplotlib.axes._subplots.AxesSubplot at 0x7fe2861112e8>
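
The choice of 22 bins follows from the year range reported by describe() (1989 to 2010 inclusive). As a small sketch, the bin edges can also be spelled out explicitly so that each calendar year is centered in its own bin; the helper name `year_bin_edges` is hypothetical and not part of the original notebook.

```python
def year_bin_edges(first_year, last_year):
    """Return histogram edges placed on half-years, one bin per calendar year."""
    # One more edge than bins: from first_year - 0.5 up to last_year + 0.5.
    return [year - 0.5 for year in range(first_year, last_year + 2)]


edges = year_bin_edges(1989, 2010)
n_bins = len(edges) - 1
print(n_bins, edges[0], edges[-1])

# Possible usage with the notebook's frame:
# df['Year'].hist(bins=edges)
```

Explicit edges keep the "one bin per year" intent intact even if the range of years in the data changes.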

We will keep just the data for 2009. Not only is it recent, but it also has plenty of data points to show at once. This also addresses the other issue with the data: in its raw form, it is too voluminous and would overload the reader if presented as is. We will also include the US row, since the graphic is supposed to tell the story of 'Occupy Wall Street'. If we are missing further critical data, we can always add a specific data point later, as we are keeping the original data frame untouched.



In [6]:

    
gini_df = df[(df.Year==2009)|(df.index=='United States')]['Gini']  # Rows from 2009 plus the United States row; keep only the Gini column (Country stays as the index)



In [7]:

    
gini_df









    Out[7]:





Country
Hungary          24.7
Kazakhstan       26.7
Ireland          29.3
Estonia          31.4
Korea, South     31.4
Indonesia        36.8
Georgia          40.8
Venezuela        41.0
Russia           42.2
Uruguay          42.4
Uganda           44.3
United States    45.0
Argentina        45.8
Malaysia         46.2
Singapore        47.8
Peru             48.0
Costa Rica       50.3
Chile            52.1
Paraguay         53.2
Thailand         53.6
Brazil           53.9
Bolivia          58.2
Colombia         58.5
Name: Gini, dtype: float64



In [8]:

    
current_ax = gini_df.plot(kind='barh', color=sns.color_palette()[0])
current_ax.set_title('Gini index (%) in 2009')
current_ax.vlines(39.99, 0, len(gini_df), color=sns.color_palette()[2])









    









    Out[8]:





<matplotlib.collections.LineCollection at 0x7fe285e4ec18>

This is already much easier to compare than the original infographic. Perhaps not as snazzy, but it gives us a start in understanding the data. It is just that, though: a start. One angle would be to investigate how far above average the US Gini is. For that, I would want all the measures, including the average, to come from the same year. A quick comparison of the two distributions (2009 vs. all the data) shows how sampling on 2009 skews toward a higher Gini.



In [9]:

    
ax = df['Gini'].plot(kind='kde')
gini_df.plot(kind='kde', ax=ax)  #overlay 2009 vs all years/countries









    Out[9]:





<matplotlib.axes._subplots.AxesSubplot at 0x7fe285cfc6d8>
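
The same-year average mentioned earlier could be computed along these lines. This is a sketch on a toy frame with made-up values and placeholder country names (A to D), not the real spreadsheet; the helper name `mean_gini` is hypothetical, and only the column names Gini and Year match the notebook.

```python
import pandas as pd

# Hypothetical toy values, for illustration only (not the real data).
toy = pd.DataFrame(
    {"Gini": [30.0, 40.0, 50.0, 20.0], "Year": [2009, 2009, 2009, 2005]},
    index=pd.Index(["A", "B", "C", "D"], name="Country"),
)


def mean_gini(frame, year=None):
    """Mean Gini over all rows, or only over the rows from `year` if given."""
    if year is not None:
        frame = frame[frame["Year"] == year]
    return frame["Gini"].mean()


overall = mean_gini(toy)          # all years pooled together
same_year = mean_gini(toy, 2009)  # only the 2009 rows
print(overall, same_year)
```

With the real frame, `mean_gini(df, 2009)` would give a reference line drawn from the same year as the bars, instead of the pooled 39.99.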

Comparing Gini with GDP, population, gender inequality, or even subjective "satisfaction indexes" and the like would be much more interesting. To tell a real story, we need to show some correlation and provide some narrative and/or visualization to explain Gini. At the end of the day, perhaps the real story is that Gini is not a great universal indicator.
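
As one small step toward explaining Gini, here is a sketch of the textbook definition: half the mean absolute difference between all pairs of incomes, divided by the mean income, scaled to 0-100 to match the spreadsheet. The function name `gini_percent` and the sample incomes are made up for illustration; the published figures are derived from national income data, not from a short list like this.

```python
def gini_percent(incomes):
    """Gini index on a 0-100 scale from a list of non-negative incomes."""
    n = len(incomes)
    total = sum(incomes)
    if n == 0 or total == 0:
        raise ValueError("need at least one non-zero income")
    # Sum of absolute differences over all ordered pairs (i, j).
    abs_diffs = sum(abs(a - b) for a in incomes for b in incomes)
    # G = sum|xi - xj| / (2 * n^2 * mean), and n * mean equals total.
    return 100.0 * abs_diffs / (2.0 * n * total)


equal = gini_percent([1, 1, 1, 1])     # everyone earns the same
skewed = gini_percent([0, 0, 0, 10])   # one person earns everything
print(equal, skewed)
```

The double loop is quadratic in the number of incomes, which is fine for a demonstration but not for large samples.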

Colors

The graph at http://www.informationisbeautiful.net/visualizations/what-are-wallst-protestors-angry-about/ uses a very gradual change in hue based on the value. This is redundant, since the bar width and the number already show the value, and the change is so subtle that two consecutive rows are practically indistinguishable.

A better use of color is to highlight our focus or to mark reference lines. With that in mind, let's enhance our bar plot with a judicious use of color, so that the US data is quicker to spot.



In [10]:

    
current_ax = gini_df.plot(kind='barh', color=sns.color_palette()[0])
current_ax.patches[list(gini_df.index).index("United States")].set_facecolor('#cc5555')
current_ax.set_title('Gini index (%) in 2009')
current_ax.vlines(39.99, 0, len(gini_df), color=sns.color_palette()[2])
current_ax.annotate('Average for\n1989-2010',
                    (40, 2),
                    xytext=(20, 10), 
                    textcoords='offset points',
                    arrowprops=dict(arrowstyle='-|>'))









    









    Out[10]:





<matplotlib.text.Annotation at 0x7fe285c64c18>
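
An alternative to recoloring a patch after plotting is to build the list of bar colors up front. The sketch below assumes a hypothetical helper `bar_colors` and an arbitrary base color; only the highlight color '#cc5555' comes from the cell above.

```python
def bar_colors(labels, focus, base="#8fbcdb", highlight="#cc5555"):
    """Return one color per label, using `highlight` only for the focus label."""
    return [highlight if label == focus else base for label in labels]


labels = ["Uganda", "United States", "Argentina"]
colors = bar_colors(labels, "United States")
print(colors)

# Possible usage with matplotlib, which accepts one color per bar:
# fig, ax = plt.subplots()
# ax.barh(range(len(gini_df)), gini_df.values,
#         color=bar_colors(gini_df.index, "United States"))
# ax.set_yticks(range(len(gini_df)))
# ax.set_yticklabels(gini_df.index)
```

This keeps the highlighting rule in one place, which helps if more than one country needs emphasis later.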


