A critique of http://www.informationisbeautiful.net/visualizations/what-are-wallst-protestors-angry-about/
In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
sns.set(palette = sns.dark_palette("skyblue", 8, reverse=True))
Before going into the purely visual aspects and how effectively they convey a story, I want to understand what data we are dealing with. At the bottom of the graph, there is a bit.ly URL that points to a Google Drive document. Appending export?format=xlsx to the document URL lets us download it as an Excel spreadsheet, which can then be sliced and diced easily with the pandas data analysis library.
In [2]:
!wget 'https://docs.google.com/spreadsheets/d/1N_Hc-xKr7DQc8bZAvLROGWr5Cr-A6MfGnH91fFW3ZwA/export?format=xlsx&id=1N_Hc-xKr7DQc8bZAvLROGWr5Cr-A6MfGnH91fFW3ZwA' -O wallstreet.xlsx
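As a small sketch of the URL trick described above, the export address can be assembled from the document id. The helper name `sheet_export_url` and its `fmt` parameter are my own, not part of any library, and I am assuming the shorter form without the extra `id` query parameter used in the wget call is enough.

```python
def sheet_export_url(doc_id, fmt="xlsx"):
    """Build a Google Sheets export URL (hypothetical helper, for illustration)."""
    return f"https://docs.google.com/spreadsheets/d/{doc_id}/export?format={fmt}"


# Document id taken from the wget command in this notebook.
DOC_ID = "1N_Hc-xKr7DQc8bZAvLROGWr5Cr-A6MfGnH91fFW3ZwA"
url = sheet_export_url(DOC_ID)
print(url)
```

With network access, the resulting URL could be handed to wget as above or passed to `pd.read_excel`, which accepts URLs.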
In [3]:
df = pd.read_excel('wallstreet.xlsx', skiprows=1, index_col = 'Country')
In [4]:
df.describe()
Out[4]:
First issue with the data: right away we can see the wide range of years. Let's look at how the records are distributed across years. We would probably want to use only the most recent year, 2010, if it has enough data points. We will also make a note of 39.99, the average Gini coefficient over all those years.
In [5]:
df['Year'].hist(bins=22)  # 22 bins, so that every year gets its own bar
Out[5]:
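The histogram can be cross-checked with exact counts per year. This is a minimal sketch on a toy frame with made-up rows (the country names and values are placeholders, not the spreadsheet's contents); on the real data the same two lines would be applied to `df`.

```python
import pandas as pd

# Toy stand-in for df: made-up rows, NOT the real spreadsheet.
toy = pd.DataFrame(
    {"Year": [2009, 2009, 2009, 2007, 2010, 2000],
     "Gini": [30.0, 45.0, 38.0, 52.0, 41.0, 47.0]},
    index=pd.Index(["A", "B", "C", "D", "E", "F"], name="Country"),
)

# Number of records per year, in chronological order.
counts = toy["Year"].value_counts().sort_index()
print(counts)

# The year with the most records is a candidate for a single-year snapshot.
best_year = counts.idxmax()
print(best_year)
```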
We will keep just the data for 2009. Not only is it recent, it also has plenty of data points to show at once. This also addresses the other issue with the data: in raw form, there are too many rows, and they would overload the reader if presented as is. We will also keep the US row, since the graphic is supposed to tell the story of 'Occupy Wall Street'. If we are missing further critical data, we can always add a specific data point later, as we are keeping the original data frame untouched.
In [6]:
# Keep the 2009 rows plus the United States, then select only the Gini column
# (the result is a Series indexed by country).
gini_df = df[(df.Year == 2009) | (df.index == 'United States')]['Gini']
In [7]:
gini_df
Out[7]:
In [8]:
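# Horizontal bars, one per country, in the first palette color
current_ax = gini_df.plot(kind='barh', color=sns.color_palette()[0])
current_ax.set_title('Gini index (%) in 2009')
# Reference line at 39.99, the all-years average noted from df.describe()
current_ax.vlines(39.99, 0, len(gini_df), color=sns.color_palette()[2])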
Out[8]:
This is already much easier to compare than the original infographic. Perhaps not as snazzy, but at least it gives us a start in trying to understand the data. But it is just that, a start. One angle would be to investigate how far above average the US Gini is. For that, though, I would want all the measures, including the average, to come from the same year. A quick comparison of the two distributions (2009 vs. all the data) shows how sampling on 2009 skews toward a higher Gini.
In [9]:
ax = df['Gini'].plot(kind='kde')  # density of Gini over all years/countries
gini_df.plot(kind='kde', ax=ax)  # overlay the 2009 subset (plus the US row)
Out[9]:
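The shift between the two distributions can also be put into a single number by comparing means. Again a sketch on made-up toy values, so the figures below say nothing about the real data; the variable names are my own.

```python
import pandas as pd

# Toy stand-in for df: made-up rows, NOT the real spreadsheet.
toy = pd.DataFrame(
    {"Year": [2009, 2009, 2005, 2000],
     "Gini": [40.0, 50.0, 30.0, 40.0]},
    index=pd.Index(["A", "B", "C", "D"], name="Country"),
)

# Mean over every row versus mean over the single-year subset.
overall_mean = toy["Gini"].mean()
subset_mean = toy.loc[toy["Year"] == 2009, "Gini"].mean()

# A positive shift means the subset sits above the all-years average.
shift = subset_mean - overall_mean
print(overall_mean, subset_mean, shift)
```

A same-year mean computed this way would also be a fairer reference line for the bar plot than the all-years 39.99.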
Comparing Gini with GDP, population, gender inequality, even subjective "satisfaction indexes" and the like would be much more interesting. To tell a real story, we need to show some correlation, and provide some narrative and/or visualization to explain Gini. At the end of the day, perhaps the real story is that Gini is not a great universal indicator.
The graph at http://www.informationisbeautiful.net/visualizations/what-are-wallst-protestors-angry-about/ uses a very gradual change in hue based on the value. This is redundant, since the bar width and the printed number already show the value, and the change is so subtle that two consecutive rows are practically indistinguishable.
A better use of color is to highlight the focus of the story or to mark reference lines. With that in mind, let's enhance our bar plot with a judicious use of color so that the US bar is quicker to spot.
In [10]:
current_ax = gini_df.plot(kind='barh', color=sns.color_palette()[0])
# Bars are drawn in index order, so the position of the label gives the US bar
current_ax.patches[list(gini_df.index).index("United States")].set_facecolor('#cc5555')
current_ax.set_title('Gini index (%) in 2009')
# Reference line at the all-years average
current_ax.vlines(39.99, 0, len(gini_df), color=sns.color_palette()[2])
# Label the reference line with an arrow pointing at it
current_ax.annotate('Average for\n1989-2010',
                    (40, 2),
                    xytext=(20, 10),
                    textcoords='offset points',
                    arrowprops=dict(arrowstyle='-|>'))
Out[10]:
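One possible refinement is to derive the reference value from the data instead of typing 39.99 by hand, and to look the US bar up by label. The sketch below uses a toy Series with made-up values and placeholder country names, and takes the mean of the plotted values as the reference, which is my choice rather than the all-years average used above.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line inside the notebook
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.colors import to_hex

# Toy stand-in for gini_df: made-up values, NOT the real spreadsheet.
toy_gini = pd.Series(
    [30.0, 45.0, 39.0],
    index=pd.Index(["Country A", "United States", "Country B"], name="Country"),
    name="Gini",
)

# Reference value computed from the same rows that are plotted.
reference = toy_gini.mean()

fig, ax = plt.subplots()
toy_gini.plot(kind="barh", ax=ax, color="skyblue")

# Bars are drawn in index order, so the label position identifies the US bar.
us_pos = toy_gini.index.get_loc("United States")
ax.patches[us_pos].set_facecolor("#cc5555")

# axvline spans the full height of the axes, whatever the number of bars.
ax.axvline(reference, color="gray")

us_color = to_hex(ax.patches[us_pos].get_facecolor())
```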