In [1]:
import pandas as pd
%pylab inline
In [2]:
df = pd.read_csv("data/eu_trade_sums.csv")
In [4]:
df.head(4)
Out[4]:
In [5]:
df.dtypes
Out[5]:
In [6]:
df = df.set_index('geo')
We are only interested the year range from 2002 - 2006
In [7]:
yrs = [str(yr) for yr in range(2002, 2016)]
Let's filter out the following types of record:
In [8]:
export_df = df[(df['trade_type'] == 'Export') &
(df['partner'] == 'EXT_EU28')
].loc[['EU28', 'UK']][yrs]
In [9]:
export_df.head(4)
Out[9]:
Let's transpoe this to get 2 columns of series data:
In [10]:
export_df = export_df.T
In [11]:
export_df.head(4)
Out[11]:
Let's rename the columns to clarify these columns related to export from these entities:
In [12]:
export_df = export_df.rename(columns={'EU28': 'EU28_TO_EXT', 'UK': 'UK_TO_EXT'})
In [13]:
export_df.head(4)
Out[13]:
Now, let's get the columns from UK and EU28 to those partners inside EU28
In [14]:
int_df = df[(df['trade_type'] == 'Export') &
(df['partner'] == 'EU28')
].loc[['EU28', 'UK']][yrs]
In [15]:
int_df.head(4)
Out[15]:
In [16]:
int_df = int_df.T
In [17]:
int_df.head(4)
Out[17]:
Let's now combine these 2 new columns to the exports to outside UK and EU28
In [18]:
export_df = pd.concat([export_df, int_df], axis=1)
In [19]:
export_df.head(4)
Out[19]:
In [20]:
export_df = export_df.rename(columns={'EU28': 'EU28_TO_INT',
'UK' : 'UK_TO_INT'})
In [21]:
export_df.head(4)
Out[21]:
In [22]:
export_df.plot(legend=False)
Out[22]:
In [23]:
export_df.plot()
Out[23]:
In [24]:
export_df[['UK_TO_EXT', 'UK_TO_INT']].plot()
Out[24]:
In [37]:
from bokeh.plotting import figure, output_file, show
from bokeh.layouts import gridplot
In [27]:
TOOLS = 'resize,pan,wheel_zoom,box_zoom,reset,hover'
In [29]:
p = figure(tools=TOOLS, x_range=(2002, 2015), y_range=(200000, 500000),
title="UK Import Export Trends from 2002-2014")
In [30]:
p.yaxis.axis_label = "Value in $1000"
In [34]:
p.line(yrs, export_df['UK_TO_EXT'], color='#A6CEE3', legend='UK_TO_EXT')
p.line(yrs, export_df['UK_TO_INT'], color='#B2DF8A', legend='UK_TO_INT')
p.legend.location = 'top_left'
In [35]:
output_file("uk_grade.html", title="UK Trade from 2002-2014")
In [40]:
# open a browser
show(p)
Let look at % change.
First, let's remove the aggregate sum (by identifying the aggregate key 'EU28'. Remember that we have set the index to "geo" already.
In [25]:
df = df[~ df.index.isin(['EU28'])]
In [75]:
df.head(4)
Out[75]:
In [76]:
pct_change_df = df.copy()
Recall that yrs column is of type "str" even though they supposedly represent the year number.
In [77]:
for yr in yrs:
pct_change_df[yr] = (df[yr] - df[str(int(yr)-1)]) / df[str(int(yr)-1)]
In [78]:
pct_change_df.head(4)
Out[78]:
What is the year with the largest spread
In [79]:
[(yr, abs(pct_change_df[yr].max() - pct_change_df[yr].min(0))) for yr in yrs]
Out[79]:
2010 seems to have a big % change in recent years.
Let's find some outliers by using standard deviations.
In [81]:
pct_change_df['2010'].std()
Out[81]:
In [82]:
pct_change_df['2010'].mean()
Out[82]:
Let's define outliers are those > 2 standard deviations from the mean.
In [84]:
pct_change_df[pct_change_df['2010'].abs() >=
(pct_change_df['2010'].mean() + 2*pct_change_df['2010'].std())]
Out[84]:
Looks like these 3 countries are outliers defined as having % change > 2 standard deviations from their means.
Let's use sorting to see the range of values for 2010
In [85]:
pct_change_df['2010'].sort_values()
Out[85]:
There are very few countries with negative % change values for 2010. Let's separate out those values.
In [86]:
pct_change_df[pct_change_df['2010'] < 0]
Out[86]:
Looks like Greece, Hungary, and Ireland all shrunk in imports for 2010. Luxumberg shrunk in both imports & exports in 2010.
Also looks like very few countries have % change values > 0.4. Let's examine those values for 2010.
In [87]:
pct_change_df[pct_change_df['2010'] > 0.4]
Out[87]:
Looks like Lithuania has grown both import & export by > 40% that year.
For next steps in outliers analysis, we will next dig up news articles or reviews about why Lithuania and Luxumberg have such outlier behaviors during 2010.
In [ ]: