In [1]:
import pandas as pd
%pylab inline


Populating the interactive namespace from numpy and matplotlib

In [2]:
df = pd.read_csv("data/eu_trade_sums.csv")

In [4]:
df.head(4)


Out[4]:
geo trade_type partner 2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 2005 2004 2003 2002 2001 2000 1999
0 AT Export EU28 193043.0 187548.0 184865.0 181689.0 181889.0 166299.0 143052.0 181180.0 176327.0 159517.0 147064.0 142481.0 131439.0 126666.0 NaN NaN NaN
1 AT Export EXT_EU28 82468.0 80797.0 78903.0 77668.0 73038.0 63858.0 53376.0 65339.0 62447.0 58309.0 53871.0 46923.0 40317.0 39731.0 NaN NaN NaN
2 AT Import EU28 215023.0 210492.0 211444.0 212546.0 212734.0 187011.0 160875.0 197137.0 190105.0 175463.0 165908.0 160883.0 144982.0 134624.0 NaN NaN NaN
3 AT Import EXT_EU28 65242.0 63511.0 64554.0 65337.0 62293.0 52877.0 44262.0 53467.0 47817.0 43097.0 38657.0 31628.0 30994.0 30983.0 NaN NaN NaN

In [5]:
df.dtypes


Out[5]:
geo            object
trade_type     object
partner        object
2015          float64
2014          float64
2013          float64
2012          float64
2011          float64
2010          float64
2009          float64
2008          float64
2007          float64
2006          float64
2005          float64
2004          float64
2003          float64
2002          float64
2001          float64
2000          float64
1999          float64
dtype: object

In [6]:
df = df.set_index('geo')

We are only interested the year range from 2002 - 2006


In [7]:
yrs = [str(yr) for yr in range(2002, 2016)]

Let's filter out the following types of record:

  1. Export only
  2. Partners are those with the UK and outside the EU28

In [8]:
export_df = df[(df['trade_type'] == 'Export') &
               (df['partner'] == 'EXT_EU28')
              ].loc[['EU28', 'UK']][yrs]

In [9]:
export_df.head(4)


Out[9]:
2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015
geo
EU28 1770688.0 1723332.0 1890370.0 2098946.0 2304970.0 2468964.0 2618295.0 2187922.0 2706392.0 3108358.0 3368521.0 3473297.0 3406038.0 3581303.0
UK 228493.0 219982.0 229696.0 272441.0 267942.0 271456.0 284998.0 228765.0 296174.0 362484.0 365917.0 459670.0 396446.0 460978.0

Let's transpoe this to get 2 columns of series data:


In [10]:
export_df = export_df.T

In [11]:
export_df.head(4)


Out[11]:
geo EU28 UK
2002 1770688.0 228493.0
2003 1723332.0 219982.0
2004 1890370.0 229696.0
2005 2098946.0 272441.0

Let's rename the columns to clarify these columns related to export from these entities:


In [12]:
export_df = export_df.rename(columns={'EU28': 'EU28_TO_EXT', 'UK': 'UK_TO_EXT'})

In [13]:
export_df.head(4)


Out[13]:
geo EU28_TO_EXT UK_TO_EXT
2002 1770688.0 228493.0
2003 1723332.0 219982.0
2004 1890370.0 229696.0
2005 2098946.0 272441.0

Now, let's get the columns from UK and EU28 to those partners inside EU28


In [14]:
int_df = df[(df['trade_type'] == 'Export') &
            (df['partner'] == 'EU28')
           ].loc[['EU28', 'UK']][yrs]

In [15]:
int_df.head(4)


Out[15]:
2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015
geo
EU28 3818366.0 3854353.0 4173739.0 4464325.0 5031897.0 5361878.0 5479161.0 4427973.0 5112932.0 5644778.0 5673476.0 5677316.0 5864543.0 6140607.0
UK 364138.0 320375.0 328835.0 355831.0 450290.0 373316.0 357057.0 280643.0 331356.0 365346.0 370063.0 354449.0 364118.0 368544.0

In [16]:
int_df = int_df.T

In [17]:
int_df.head(4)


Out[17]:
geo EU28 UK
2002 3818366.0 364138.0
2003 3854353.0 320375.0
2004 4173739.0 328835.0
2005 4464325.0 355831.0

Let's now combine these 2 new columns to the exports to outside UK and EU28


In [18]:
export_df = pd.concat([export_df, int_df], axis=1)

In [19]:
export_df.head(4)


Out[19]:
geo EU28_TO_EXT UK_TO_EXT EU28 UK
2002 1770688.0 228493.0 3818366.0 364138.0
2003 1723332.0 219982.0 3854353.0 320375.0
2004 1890370.0 229696.0 4173739.0 328835.0
2005 2098946.0 272441.0 4464325.0 355831.0

In [20]:
export_df = export_df.rename(columns={'EU28': 'EU28_TO_INT', 
                                     'UK' : 'UK_TO_INT'})

In [21]:
export_df.head(4)


Out[21]:
geo EU28_TO_EXT UK_TO_EXT EU28_TO_INT UK_TO_INT
2002 1770688.0 228493.0 3818366.0 364138.0
2003 1723332.0 219982.0 3854353.0 320375.0
2004 1890370.0 229696.0 4173739.0 328835.0
2005 2098946.0 272441.0 4464325.0 355831.0

Trends

Let's now plot to see any trends


In [22]:
export_df.plot(legend=False)


Out[22]:
<matplotlib.axes._subplots.AxesSubplot at 0x10d2a56d0>

In [23]:
export_df.plot()


Out[23]:
<matplotlib.axes._subplots.AxesSubplot at 0x10d3b5150>
Looks like there is significantly more external trade outside EU28. Let's just look at the UK data

In [24]:
export_df[['UK_TO_EXT', 'UK_TO_INT']].plot()


Out[24]:
<matplotlib.axes._subplots.AxesSubplot at 0x10d2f38d0>

Interactive Plot


In [37]:
from bokeh.plotting import figure, output_file, show
from bokeh.layouts import gridplot

In [27]:
TOOLS = 'resize,pan,wheel_zoom,box_zoom,reset,hover'

In [29]:
p = figure(tools=TOOLS, x_range=(2002, 2015), y_range=(200000, 500000),
           title="UK Import Export Trends from 2002-2014")

In [30]:
p.yaxis.axis_label = "Value in $1000"

In [34]:
p.line(yrs, export_df['UK_TO_EXT'], color='#A6CEE3', legend='UK_TO_EXT')
p.line(yrs, export_df['UK_TO_INT'], color='#B2DF8A', legend='UK_TO_INT')
p.legend.location = 'top_left'

In [35]:
output_file("uk_grade.html", title="UK Trade from 2002-2014")

In [40]:
# open a browser
show(p)

Outliers

Let look at % change.

First, let's remove the aggregate sum (by identifying the aggregate key 'EU28'. Remember that we have set the index to "geo" already.


In [25]:
df = df[~ df.index.isin(['EU28'])]

In [75]:
df.head(4)


Out[75]:
trade_type partner 2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 2005 2004 2003 2002 2001 2000 1999
geo
AT Export EU28 193043.0 187548.0 184865.0 181689.0 181889.0 166299.0 143052.0 181180.0 176327.0 159517.0 147064.0 142481.0 131439.0 126666.0 NaN NaN NaN
AT Export EXT_EU28 82468.0 80797.0 78903.0 77668.0 73038.0 63858.0 53376.0 65339.0 62447.0 58309.0 53871.0 46923.0 40317.0 39731.0 NaN NaN NaN
AT Import EU28 215023.0 210492.0 211444.0 212546.0 212734.0 187011.0 160875.0 197137.0 190105.0 175463.0 165908.0 160883.0 144982.0 134624.0 NaN NaN NaN
AT Import EXT_EU28 65242.0 63511.0 64554.0 65337.0 62293.0 52877.0 44262.0 53467.0 47817.0 43097.0 38657.0 31628.0 30994.0 30983.0 NaN NaN NaN

In [76]:
pct_change_df = df.copy()

Recall that yrs column is of type "str" even though they supposedly represent the year number.


In [77]:
for yr in yrs:
    pct_change_df[yr] = (df[yr] - df[str(int(yr)-1)]) / df[str(int(yr)-1)]

In [78]:
pct_change_df.head(4)


Out[78]:
trade_type partner 2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 2005 2004 2003 2002 2001 2000 1999
geo
AT Export EU28 0.029299 0.014513 0.017480 -0.001100 0.093747 0.162507 -0.210443 0.027523 0.105381 0.084677 0.032166 0.084009 0.037682 NaN NaN NaN NaN
AT Export EXT_EU28 0.020681 0.024004 0.015901 0.063392 0.143756 0.196380 -0.183091 0.046311 0.070967 0.082382 0.148072 0.163851 0.014749 NaN NaN NaN NaN
AT Import EU28 0.021526 -0.004502 -0.005185 -0.000884 0.137548 0.162462 -0.183943 0.036990 0.083448 0.057592 0.031234 0.109676 0.076940 NaN NaN NaN NaN
AT Import EXT_EU28 0.027255 -0.016157 -0.011984 0.048866 0.178074 0.194636 -0.172162 0.118159 0.109520 0.114856 0.222240 0.020456 0.000355 NaN NaN NaN NaN

What is the year with the largest spread


In [79]:
[(yr, abs(pct_change_df[yr].max() - pct_change_df[yr].min(0))) for yr in yrs]


Out[79]:
[('2002', nan),
 ('2003', 0.922437290580437),
 ('2004', 1.1973918362807252),
 ('2005', 0.77510836760343016),
 ('2006', 1.0974189015338094),
 ('2007', 0.68337596565606351),
 ('2008', 0.73242893213138349),
 ('2009', 0.41701830776207),
 ('2010', 0.77620824423939028),
 ('2011', 0.64468658836173809),
 ('2012', 0.40757073398065757),
 ('2013', 0.50770193608178382),
 ('2014', 0.69787182660594926),
 ('2015', 0.65402845613708527)]

2010 seems to have a big % change in recent years.

Let's find some outliers by using standard deviations.


In [81]:
pct_change_df['2010'].std()


Out[81]:
0.11778424716089249

In [82]:
pct_change_df['2010'].mean()


Out[82]:
0.20447136781293507

Let's define outliers are those > 2 standard deviations from the mean.


In [84]:
pct_change_df[pct_change_df['2010'].abs() >= 
              (pct_change_df['2010'].mean() + 2*pct_change_df['2010'].std())]


Out[84]:
trade_type partner 2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 2005 2004 2003 2002 2001 2000 1999
geo
BG Export EXT_EU28 0.000785 -0.071965 0.045409 0.126386 0.260671 0.489159 -0.322789 0.153392 0.169818 0.266743 0.288433 0.119338 0.058218 NaN NaN NaN NaN
CZ Import EXT_EU28 0.102882 0.043694 -0.070514 -0.022636 0.158539 0.452684 -0.259793 0.301929 0.185348 0.266972 0.032504 -0.154460 0.107936 NaN NaN NaN NaN
LT Export EXT_EU28 -0.191402 0.006403 0.198597 0.172429 0.276142 0.449822 -0.340340 0.449244 0.075822 0.257226 0.331423 0.067060 0.348454 NaN NaN NaN NaN

Looks like these 3 countries are outliers defined as having % change > 2 standard deviations from their means.

Let's use sorting to see the range of values for 2010


In [85]:
pct_change_df['2010'].sort_values()


Out[85]:
geo
LU   -0.287050
LU   -0.118591
EL   -0.113044
HR   -0.045526
IE   -0.042371
IE    0.003527
IE    0.039697
DK    0.047805
DK    0.050215
DK    0.053122
HR    0.062120
EL    0.093680
FR    0.106570
PT    0.109449
ES    0.110758
BG    0.111965
MT    0.112542
BE    0.115844
CY    0.117683
FR    0.123650
EL    0.132565
FI    0.136728
DE    0.139136
BE    0.139652
IE    0.142192
DK    0.146256
SI    0.146373
IT    0.147951
SI    0.148808
LU    0.157511
        ...   
PL    0.266329
NL    0.268815
ES    0.277191
BE    0.281675
EL    0.284633
LT    0.288320
SE    0.289761
LV    0.294582
UK    0.294665
MT    0.294668
CZ    0.306317
IT    0.312230
EE    0.312303
SE    0.315566
LV    0.316998
HU    0.318835
SI    0.325969
EE    0.329750
PL    0.336837
SE    0.344466
SK    0.351815
MT    0.357491
SK    0.372937
RO    0.387858
EE    0.389522
MT    0.420200
LT    0.427507
LT    0.449822
CZ    0.452684
BG    0.489159
Name: 2010, dtype: float64

There are very few countries with negative % change values for 2010. Let's separate out those values.


In [86]:
pct_change_df[pct_change_df['2010'] < 0]


Out[86]:
trade_type partner 2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 2005 2004 2003 2002 2001 2000 1999
geo
EL Import EU28 -0.011554 0.052153 -0.024641 -0.092378 -0.064558 -0.113044 -0.167735 0.034348 0.121114 0.105525 -0.002684 0.223572 0.242409 NaN NaN NaN NaN
HR Import EU28 0.103244 0.178629 0.094336 0.006855 0.104890 -0.045526 -0.285067 0.094356 0.061801 0.131649 0.070830 0.033353 0.126327 NaN NaN NaN NaN
IE Import EXT_EU28 0.095668 0.146735 0.097808 0.108564 -0.016647 -0.042371 -0.097875 -0.064441 0.001091 -0.000545 0.082031 -0.042978 -0.028298 NaN NaN NaN NaN
LU Export EU28 0.092843 0.062869 -0.027817 -0.037380 0.021101 -0.118591 -0.135602 0.051926 -0.102930 0.190753 0.166059 0.122882 0.100969 NaN NaN NaN NaN
LU Import EXT_EU28 0.452790 -0.048928 -0.136559 0.279079 0.032977 -0.287050 -0.062342 0.028338 -0.153985 0.269407 0.294330 0.201769 0.441710 NaN NaN NaN NaN

Looks like Greece, Hungary, and Ireland all shrunk in imports for 2010. Luxumberg shrunk in both imports & exports in 2010.

Also looks like very few countries have % change values > 0.4. Let's examine those values for 2010.


In [87]:
pct_change_df[pct_change_df['2010'] > 0.4]


Out[87]:
trade_type partner 2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 2005 2004 2003 2002 2001 2000 1999
geo
BG Export EXT_EU28 0.000785 -0.071965 0.045409 0.126386 0.260671 0.489159 -0.322789 0.153392 0.169818 0.266743 0.288433 0.119338 0.058218 NaN NaN NaN NaN
CZ Import EXT_EU28 0.102882 0.043694 -0.070514 -0.022636 0.158539 0.452684 -0.259793 0.301929 0.185348 0.266972 0.032504 -0.154460 0.107936 NaN NaN NaN NaN
LT Export EXT_EU28 -0.191402 0.006403 0.198597 0.172429 0.276142 0.449822 -0.340340 0.449244 0.075822 0.257226 0.331423 0.067060 0.348454 NaN NaN NaN NaN
LT Import EXT_EU28 -0.058658 -0.143427 -0.013144 0.067575 0.288577 0.427507 -0.401784 0.592402 -0.019068 0.135555 0.393113 -0.029152 0.089929 NaN NaN NaN NaN
MT Import EXT_EU28 -0.054859 0.427825 0.138429 -0.004228 0.038191 0.420200 -0.055915 -0.068020 -0.092583 0.393204 -0.078005 -0.138767 0.019652 NaN NaN NaN NaN

Looks like Lithuania has grown both import & export by > 40% that year.

For next steps in outliers analysis, we will next dig up news articles or reviews about why Lithuania and Luxumberg have such outlier behaviors during 2010.


In [ ]: