Exploring the ICIJ Offshore Leaks database

Kyle Willett

20 Jul 2016

One basic question is to assess the demographics of the countries associated with the companies involved, and whether they fit historical definitions of tax havens: countries with small populations, overseas territories, and unstable or non-democratic governments.

I examined the data in the ICIJ Offshore Leaks database to see whether the companies involved are associated with countries fitting these characteristics, or whether a broader set of locations is involved. Matching the address nodes against population identifies a group of roughly a dozen nations responsible for a disproportionate fraction of offshore businesses, dominated by small island nations and British overseas dependencies. The British Virgin Islands has by far the highest ratio of offshore businesses to total population. Institutions in these countries should be high-priority candidates for follow-up investigations into illegal offshore activity.

While processing the data, I also devised a method to clean the data further and merge address nodes that were originally designated as separate locations. Using text-similarity measures, roughly 3% of the address nodes in the ICIJ database can be merged.


In [1]:
%matplotlib inline

# Import packages

import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt

In [2]:
# Load in data from the Panama Papers ICIJ site (set of CSV files)

addresses = pd.read_csv("Addresses.csv",
                        dtype={'address':str,
                               'icij_id':str,
                               'valid_until':str,
                               'country_codes':str,
                               'countries':str,
                               'node_id':np.int64,
                               'sourceID':str})

Look briefly at the structure and content of the addresses file.


In [3]:
addresses.head()


Out[3]:
address icij_id valid_until country_codes countries node_id sourceID
0 -\t27 ROSEWOOD DRIVE #16-19 SINGAPORE 737920 6991059DFFB057DF310B9BF31CC4A0E6 The Panama Papers data is current through 2015 SGP Singapore 14000001 Panama Papers
1 "Almaly Village" v.5, Almaty Kazakhstan C65A269424AA1A8BBAC416FA30150938 The Panama Papers data is current through 2015 KAZ Kazakhstan 14000002 Panama Papers
2 "Cantonia" South Road St Georges Hill Weybridg... 240EE44DFB70AF775E6CD02AF8CB889B The Panama Papers data is current through 2015 GBR United Kingdom 14000003 Panama Papers
3 "CAY-OS" NEW ROAD; ST.SAMPSON; GUERNSEY; CHANN... 340C6D3D2FC60AE58C53BB3618129C43 The Panama Papers data is current through 2015 GGY Guernsey 14000004 Panama Papers
4 "Chirag" Plot No 652; Mwamba Road; Kizingo; Mo... 6DBCD74F5AAB63EABC166885CE8CD7CD The Panama Papers data is current through 2015 KEN Kenya 14000005 Panama Papers

While the number of addresses and companies at each location can be counted, that alone doesn't tell us whether smaller countries (with more distinctive characteristics) are represented at unusual frequencies. As an initial normalization, we'll compare the address count to each country's total population (future metrics could include things like GDP, international memberships, number of banks, etc.).


In [4]:
# Load 2015 population data from World Bank

popfile = "API_SP.POP.TOTL_DS2_en_csv_v2.csv"
country_name = []
country_code = []
pop_2015 = []
with open(popfile,'r') as f:
    for idx,line in enumerate(f):
        # Skip the metadata/header lines at the top of the World Bank file
        if idx > 5:
            s = line.split(",")
            # Fields are quoted, so strip the surrounding quote characters
            country_name.append(s[0].strip()[1:-1])
            country_code.append(s[1].strip()[1:-1])
            # The second-to-last field holds the 2015 population; drop the decimal part
            pop_2015.append(int((s[-2].strip()[1:-1]).split(".")[0]))
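
One caveat with splitting on raw commas: country names that themselves contain a comma (e.g. "Korea, Rep.") shift the leading fields, so those rows end up with a garbled country code and silently drop out of the join below. A more robust sketch is to let pandas handle the quoted fields; note that the skiprows value and the '2015' column label are assumptions about the layout of the downloaded file and may need adjusting:

# Hedged alternative: pandas respects quoted fields, so names containing commas stay intact.
# skiprows=4 assumes the usual World Bank layout (four metadata lines before the header row).
wb = pd.read_csv(popfile, skiprows=4)
dfpop_alt = (wb[['Country Name', 'Country Code', '2015']]
               .rename(columns={'Country Name': 'country_name',
                                'Country Code': 'country_code',
                                '2015': 'population'})
               .dropna(subset=['population'])
               .set_index('country_code'))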

Link the two datasets (Panama Papers and population)


In [5]:
# Join the dataframes

dfpop = pd.DataFrame(index=country_code,data={"country_name":country_name,"population":pop_2015})

# Add a new column that counts the number of addresses in each country

counted = addresses.groupby('country_codes').count()['countries']
joined = dfpop.join(counted)
joined.rename(columns = {'countries':'address_count'}, inplace = True)

Now that we have all the data in a single frame, let's look at the results. Start by simply plotting the population against the number of addresses per country. If there's no strong preference toward tax havens, the two values should be strongly correlated.


In [6]:
# Plot as a scatter plot

good = np.isfinite(joined.address_count)
joined[good].plot(x='population',y='address_count',
                  kind='scatter',
                  loglog=True,
                  figsize=(8,8));


There's definitely a trend (even in log space), but also quite a lot of scatter and outliers. For example, one country in the top 10 for overall address count is in the bottom 5 for population.


In [7]:
# Check the actual correlation value

joined.corr()


Out[7]:
               population  address_count
population       1.000000       0.561871
address_count    0.561871       1.000000

This value (0.562) indicates a moderate positive correlation between population and address count. With this many countries the correlation is statistically significant, but it explains only about a third of the variance, so population alone doesn't determine where offshore addresses appear.
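
Correlation strength and statistical significance are separate questions; scipy's pearsonr returns the coefficient together with a two-sided p-value. A quick side check, reusing the good mask defined for the scatter plot above:

# Pearson r and its p-value, restricted to countries with at least one address
from scipy.stats import pearsonr
r, p = pearsonr(joined[good].population, joined[good].address_count)
print "r = {:.3f}, p-value = {:.2g}".format(r, p)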

That gives the overall distribution; let's look at the countries dominating the high and low ends of the data.


In [8]:
print "Countries with the highest number of addresses\n\n",joined.sort_values(
    by="address_count",ascending=False).head()
print ""
print "Countries with the lowest number of addresses\n\n",joined.sort_values(
    by="address_count",ascending=True).head()


Countries with the highest number of addresses

      country_name  population  address_count
CHN          China  1371220000          28073
HKG      Hong Kong     7305700          21041
USA  United States   321418820           6860
SGP      Singapore     5535002           5728
RUS         Russia   144096812           5534

Countries with the lowest number of addresses

                 country_name  population  address_count
MNP  Northern Mariana Islands       55070              1
GNB             Guinea-Bissau     1844325              1
BTN                    Bhutan      774830              1
SMR                San Marino       31781              1
GNQ         Equatorial Guinea      845060              1

So the top five comprise three of the most populous countries in the world (China, the US, Russia) and two smaller economies known as business and financial hubs (Hong Kong, Singapore). The bottom five include a US dependency, two of the smallest countries on the African mainland, a European micronation, and the relatively isolated Asian nation of Bhutan.

So let's sharpen the question and move away from absolute numbers to a ratio: which countries are disproportionately over-represented in terms of the number of offshore addresses relative to their population?


In [9]:
# Offshore addresses per resident (despite its name, pop_bus_ratio is address_count / population)
ratio = joined.address_count / joined.population
joined['pop_bus_ratio'] = ratio

In [10]:
# Plot the results (in log space) as a histogram

fig,ax = plt.subplots(1,1,figsize=(8,6))
p = sns.distplot(np.log10(joined[good].pop_bus_ratio), kde=False, color="b",ax=ax)
p.set_xlabel('log ratio of number of addresses to population')
p.set_ylabel('Count');


So even by eye, there's a clear skew and a long tail to the right of the distribution. High values indicate countries with many more offshore addresses per capita than expected.


In [11]:
# What are the names of these countries?

joined.sort_values("pop_bus_ratio",ascending=False)[:15]


Out[11]:
                 country_name  population  address_count  pop_bus_ratio
VGB    British Virgin Islands       30117           4125       0.136966
LIE             Liechtenstein       37531            690       0.018385
MCO                    Monaco       37731            589       0.015611
GIB                 Gibraltar       32217            440       0.013657
IMN               Isle of Man       87780           1171       0.013340
CYM            Cayman Islands       59967            646       0.010773
SYC                Seychelles       92900            871       0.009376
KNA       St. Kitts and Nevis       55572            347       0.006244
BMU                   Bermuda       65235            218       0.003342
HKG                 Hong Kong     7305700          21041       0.002880
BHS                   Bahamas      388019            954       0.002459
CYP                    Cyprus     1165300           2385       0.002047
MHL          Marshall Islands       52993             97       0.001830
TCA  Turks and Caicos Islands       34339             53       0.001543
NRU                     Nauru       10222             15       0.001467

This looks much more like the definition above: numerous dependencies, plenty of micronations and small island countries, and very small populations overall.


In [12]:
# Plot the total population versus the ratio constructed above.

joined[good].plot(x='population',y='pop_bus_ratio',
                  kind='scatter',
                  loglog=True,
                  figsize=(8,8));


So now the data are separating more clearly. Instead of picking a sample by eye, let's try a clustering algorithm on these features and see if the extremes select a similar set of data.


In [13]:
# Use the k-means clustering algorithm from scikit-learn

from sklearn.cluster import KMeans

n_clusters = 5
X = np.log10(np.array(joined[good][['population','pop_bus_ratio']]))
y_pred = KMeans(n_clusters=n_clusters).fit_predict(X)
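
A side note: k-means uses Euclidean distances, so it is sensitive to the relative scale of the two features. A variant worth trying is to standardize the log features first (a sketch using scikit-learn's StandardScaler; the resulting clusters should be broadly similar here):

from sklearn.preprocessing import StandardScaler

# Rescale each feature to zero mean and unit variance so that neither
# log-population nor the log ratio dominates the distance calculation
X_scaled = StandardScaler().fit_transform(X)
y_pred_scaled = KMeans(n_clusters=n_clusters).fit_predict(X_scaled)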

In [14]:
plt.figure(figsize=(8, 8))
plt.subplot(111)
plt.scatter(X[:, 0], X[:, 1], c=y_pred,cmap=plt.cm.Set1,s=40)
plt.xlabel('population')
plt.ylabel('pop_bus_ratio');



In [15]:
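# Sanity check: mean log10 population of the points assigned to k-means cluster 1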
np.mean(X[y_pred == 1,0])


Out[15]:
5.4996211164501014

In [16]:
# Look at the countries corresponding to the top-left cluster,
# i.e. the cluster with the lowest mean log-population

cluster_topleft = 0
xmean = max(X[:,0])
for i in range(n_clusters):
    avgpop = np.mean(X[y_pred == i,0])
    if avgpop < xmean:
        xmean = avgpop
        cluster_topleft = i

print joined[good][y_pred == cluster_topleft]


                 country_name  population  address_count  pop_bus_ratio
BMU                   Bermuda       65235            218       0.003342
CYM            Cayman Islands       59967            646       0.010773
DMA                  Dominica       72680             90       0.001238
GIB                 Gibraltar       32217            440       0.013657
IMN               Isle of Man       87780           1171       0.013340
KNA       St. Kitts and Nevis       55572            347       0.006244
LIE             Liechtenstein       37531            690       0.018385
MCO                    Monaco       37731            589       0.015611
MHL          Marshall Islands       52993             97       0.001830
NRU                     Nauru       10222             15       0.001467
SYC                Seychelles       92900            871       0.009376
TCA  Turks and Caicos Islands       34339             53       0.001543
VGB    British Virgin Islands       30117           4125       0.136966

Almost uniformly, these are countries with very small populations (< 100,000). 11/14 are small island countries, and 6/14 are dependencies of the United Kingdom (but with different tax and financial status, which presumably strongly affects the likelihood of choosing to incorporate there). The latter category is really over-represented; the UN List of Non-Self-Governing Territories has only 17 total entries, of which 6 appear in this smaller list of 14. The odds of that happening independently are essentially zero ($p\simeq10^{-20}$).

The (British) Virgin Islands deserve special mention; their ratio of offshore addresses to population is roughly 7.5 times higher than any other country in the database, with 1 Offshore Leaks business for roughly every 7.3 inhabitants. There could be many reasons for this (including geographical proximity to the Mossack Fonseca offices in Panama), but it's a very clear starting point for further investigations into a wide range of companies.


In [17]:
# Here are the countries from the cluster above colored in bright orange on a world map. Spot any?

# Made with https://www.amcharts.com/visited_countries/#MC,LI,IM,GI,KN,TC,VG,BM,DM,KY,SC,MH,NR

from IPython.display import Image
Image(url="http://i.imgur.com/hpgZsx1.png")


Out[17]:

In [18]:
# Zooming way into the Caribbean, a few of the entries are finally visible.

Image(url="http://i.imgur.com/dt5anpZ.png")


Out[18]:

Address disambiguation

Iterating a little more on the addresses themselves would potentially be very useful; small differences between strings can point to the same place, which causes overcounting and makes it harder to merge records that should point to the same node.

Can we clean the data any further beyond the original state?
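
As a reminder of what the metric measures, the Damerau-Levenshtein (D-L) distance counts the minimum number of single-character insertions, deletions, substitutions, and adjacent transpositions needed to turn one string into the other. A quick illustration on a real pair from the Zimbabwean addresses examined below:

import jellyfish

# A single inserted character ("Argyl" -> "Argyll") gives a distance of 1,
# flagging the two strings as near-certain duplicates of the same address.
a = u"11 Argyl Drive Highlands Harare Zimbabwe"
b = u"11 Argyll Drive Highlands Harare Zimbabwe"
print jellyfish.damerau_levenshtein_distance(a, b)   # -> 1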


In [19]:
import jellyfish

In [20]:
# Group by country; there shouldn't be any need to compare addresses in different countries.
# Doing so would be computationally expensive, and a cross-country match would require a large
# change in the text strings anyway (it's extremely unlikely that changing one letter turns a
# real address in North Korea into one in South Africa).

grouped = addresses.groupby("country_codes")

In [21]:
# As a test case, let's look at addresses in the last country in the group: Zimbabwe.

zw = list(addresses[addresses['country_codes'] == 'ZWE'].address)
print len(zw)


225

In [22]:
# Pairwise Damerau-Levenshtein distances between all Zimbabwean addresses
dl = []
for idx,z in enumerate(zw):
    for z2 in zw[idx+1:]:
        try:
            dl_ = jellyfish.damerau_levenshtein_distance(unicode(z,'utf-8'),unicode(z2,'utf-8'))
            dl.append(dl_)
        except:
            # Skip entries that can't be decoded as UTF-8 strings
            pass

In [23]:
fig,ax = plt.subplots(1,1,figsize=(8,6))
p = sns.distplot(dl, kde=False, color="b",ax=ax)
p.set_xlabel('Damerau-Levenshtein distance')
p.set_ylabel('Count of Zimbabwean addresses');


There seems to be a clear excess of near-duplicates at a D-L distance of only a couple ($<3$). Try a few different thresholds and manually evaluate the true positive and false positive rates.


In [24]:
# For each threshold, print only the new pairs at exactly that distance
# (matches at smaller distances already appear under the lower thresholds)
for n in (1,2,3,4,5):
    print "\n DL <= {}".format(n)
    for idx,z in enumerate(zw):
        for z2 in zw[idx+1:]:
            try:
                dl_ = jellyfish.damerau_levenshtein_distance(unicode(z,'utf-8'),unicode(z2,'utf-8'))
                if dl_ == n:
                    print "{} -> {}".format(z,z2)
                    break
            except:
                pass


 DL <= 1
11 Argyl Drive Highlands Harare Zimbabwe -> 11 Argyll Drive Highlands Harare Zimbabwe
5 CARNARVON ROAD HILLSIDE BULAWAYO ZIMBABWE -> 5 CARNAVON ROAD HILLSIDE BULAWAYO ZIMBABWE
5 CARNAVON ROAD HILLSIDE BULAWAUYO ZIMBABWE -> 5 CARNAVON ROAD HILLSIDE BULAWAYO ZIMBABWE
85 Harare Drive; Northwood; Harare; Zimbabwe -> 85 Harere Drive; Northwood; Harare; Zimbabwe
The Lodge; 24b Arnold Edmunds Drive; Glen Horne; Harare; Zimbabwe -> The Lodge; 24b Arnold Edmunds Drive; Glen Lorne; Harare; Zimbabwe

 DL <= 2
20 Basset Crescent; Alexander Park; Harare; Zimbabwe -> 20 Basset Crescent; Alexandra Park; Harare; Zimbabwe
22 Pevensey Road; Highlands; Harare; Zimbabwe -> 23 Pevensey Road; Highlands; Harare; Zimbawe
5 CARNARVON ROAD HILLSIDE BULAWAYO ZIMBABWE -> 5 CARNAVON ROAD HILLSIDE BULAWAUYO ZIMBABWE
First Floor CABS Building; Northend Close; Northridge Park; Harare; Zimabwe -> First Floor CABS Building; Northend Close; Northridge Park; Harare; Zimbabwe.
Jacobin Investments Limite; 18 Kenilworth Road; Newlands; Harare; Zimbabwe -> Jacobin Investments Limited18 Kenilworth Road; Newlands; Harare; Zimbabwe

 DL <= 3
37 Orange grove Drive; Highlands; Harare; Zimbabwe -> 18 Orange Grove Drive; Highlands; Harare; Zimbabwe
632 Price Drive Borrowdale Brooke Harare Zimbabwe -> 638 Price Drive; Borrowdale Brooke Harare; Zimbabwe
Box CH 92; Chisipite; Harare; Zimbabwe -> PO Box CH 92; Chisipite; Harare; Zimbabwe
First Floor CABS Building; Northend Close; Northridge Park; Harare; Zimbabwe. -> First Floor; CABS Building; Northend Close; Northridge Park; Harare; Zimbawe
52 Orange Grove Drive; Highlands Harare; Zimbabwe -> 18 Orange Grove Drive; Highlands; Harare; Zimbabwe
12 HUNT ROAD BORROWDALE, HARARE -> 12 HUNT ROAD; BORRROWDALE; HARARE

 DL <= 4
37 Orange grove Drive; Highlands; Harare; Zimbabwe -> 52 Orange Grove Drive; Highlands Harare; Zimbabwe
P.O Box CH 760; Chisipite; Harare; Zimbabwe -> PO Box CH 92; Chisipite; Harare; Zimbabwe

 DL <= 5
11 CONCARDE WAY; RIDGEVIEW; HARARE; ZIMBABWE -> 11 CONCORDE WAY; RIDGEWAY; HARARE; ZIMBABWE
23 Willowmead Lane; Colne Valley; Harare; Zimbabwe -> 23 Willowmead; Colne Valley; Harare; Zimbabwe
27 Maiden Drive; Highlands; Harare; Zimbabwe -> 33 Maiden Drive Highlands Harare Zimbabwe
79 Kendal Close; Rolf Valley; Harare; Zimbabwe -> 79 Kendal Close Rozfe Valley Harare Zimbabwe
9 Warthog Close; Borrowdale; Harare; Zimbabwe -> 9 Warthog Close; Borrowdale West; Harare; Zimbabwe
The Lodge; 24b Arnold Edmunds Drive; Glen Lorne; Harare; Zimbabwe -> The Lodge; Arnold Edmonds Drive; Glen Lorne; Harare; Zimbabwe

Assess each address pair manually; simple transpositions or errors are a true positive. A situation where the house number changed, though, could be either a transcription error OR two truly different locations on the same street. To be conservative, the latter will be counted as a false positive.

DL <= 1: TP = 5, FP = 0
DL <= 2: TP = 9, FP = 1
DL <= 3: TP = 12, FP = 4
DL <= 4: TP = 12, FP = 6
DL <= 5: TP = 17, FP = 7

The false positive rate increases sharply at DL <= 3. To be conservative, we'll use a D-L distance threshold of 2 for the full sample.
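
To make that concrete, the precision TP/(TP+FP) implied by the tallies above drops from 1.0 at DL <= 1 and 0.9 at DL <= 2 to 0.75 at DL <= 3:

# Precision = TP / (TP + FP) at each cumulative threshold evaluated above
tallies = {1: (5, 0), 2: (9, 1), 3: (12, 4), 4: (12, 6), 5: (17, 7)}
for n in sorted(tallies):
    tp, fp = tallies[n]
    print "DL <= {}: precision = {:.2f}".format(n, tp / float(tp + fp))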


In [ ]:
# Run algorithm on the entire sample. How many nodes can we merge in the set?

dl_threshold = 2

duplicate_addresses = {}
failed_to_parse = 0
for g in grouped:
    name = g[0]
    print "Processing {}".format(name)
    duplicate_addresses[name] = 0
    lst = list(g[1].address)
    if len(lst) > 1:
        for idx,add1 in enumerate(lst):
            # Don't compare twice and double-count.
            for add2 in lst[idx+1:]:
                try:
                    dl = jellyfish.damerau_levenshtein_distance(unicode(add1,'utf-8'),unicode(add2,'utf-8'))
                    if dl <= dl_threshold:
                        duplicate_addresses[name] += 1
                        break
                except:
                    failed_to_parse += 1

In [26]:
print duplicate_addresses


{'AGO': 3, 'DZA': 0, 'EGY': 5, 'BGD': 0, 'LIE': 71, 'NAM': 1, 'BGR': 0, 'BOL': 0, 'GHA': 1, 'PAK': 9, 'CPV': 1, 'JOR': 17, 'LBR': 5, 'LBY': 1, 'MYS': 71, 'PRI': 0, 'SXM': 0, 'PRK': 0, 'PSE': 1, 'TZA': 1, 'BWA': 2, 'KHM': 0, 'TTO': 0, 'ETH': 0, 'PRY': 0, 'HKG': 575, 'SAU': 21, 'LBN': 17, 'SVN': 1, 'BFA': 0, 'CHE': 123, 'HRV': 0, 'CHL': 2, 'CHN': 439, 'KNA': 40, 'SLE': 0, 'JAM': 1, 'GIB': 60, 'DJI': 0, 'GIN': 0, 'FIN': 0, 'URY': 23, 'THA': 40, 'SYC': 132, 'NPL': 0, 'LAO': 2, 'YEM': 0, 'PHL': 11, 'ZAF': 51, 'NIC': 1, 'ROU': 2, 'VIR': 0, 'SYR': 0, 'MAC': 14, 'MAF': 0, 'MLT': 8, 'KAZ': 1, 'TCA': 4, 'PYF': 0, 'NIU': 1, 'DMA': 7, 'BEN': 0, 'NGA': 0, 'BEL': 9, 'TGO': 0, 'DEU': 6, 'GUM': 0, 'LKA': 1, 'GBR': 148, 'GUY': 0, 'CRI': 5, 'CMR': 0, 'MAR': 0, 'MNP': 0, 'LSO': 1, 'HUN': 6, 'TKM': 0, 'SUR': 0, 'NLD': 7, 'BMU': 15, 'TCD': 1, 'GEO': 3, 'MNE': 3, 'MNG': 2, 'MHL': 10, 'MTQ': 0, 'BLZ': 22, 'MMR': 0, 'NFK': 0, 'VGB': 482, 'BLR': 1, 'GRD': 0, 'GRC': 7, 'AND': 1, 'MOZ': 0, 'TJK': 0, 'HTI': 4, 'MEX': 9, 'ZWE': 9, 'LCA': 1, 'IND': 14, 'LVA': 1, 'BTN': 0, 'VCT': 2, 'VNM': 4, 'NOR': 1, 'CZE': 6, 'ATG': 0, 'FJI': 0, 'HND': 1, 'MUS': 59, 'DOM': 6, 'LUX': 41, 'ISR': 17, 'SMR': 0, 'PER': 13, 'REU': 0, 'IDN': 66, 'VUT': 8, 'MKD': 0, 'COD': 0, 'ISL': 6, 'COK': 9, 'NER': 0, 'COL': 16, 'PRT': 9, 'MDA': 0, 'GGY': 140, 'MDG': 1, 'ECU': 10, 'SEN': 1, 'MDV': 0, 'ASM': 0, 'CUW': 7, 'FRA': 18, 'LTU': 0, 'RWA': 0, 'ZMB': 1, 'GMB': 1, 'JEY': 241, 'GTM': 11, 'DNK': 2, 'IMN': 114, 'AUS': 33, 'AUT': 1, 'VEN': 14, 'KEN': 0, 'WSM': 11, 'TUR': 14, 'ALB': 0, 'OMN': 1, 'ITA': 13, 'BRN': 2, 'TUN': 0, 'RUS': 98, 'BRB': 2, 'BRA': 33, 'CIV': 0, 'SRB': 0, 'GNQ': 0, 'USA': 89, 'QAT': 2, 'SWE': 3, 'AZE': 1, 'GNB': 0, 'SWZ': 1, 'TON': 0, 'CAN': 27, 'UKR': 10, 'KOR': 6, 'AIA': 5, 'CAF': 0, 'SVK': 0, 'CYP': 92, 'BIH': 0, 'SGP': 291, 'TWN': 490, 'XXX': 17, 'SOM': 0, 'UZB': 4, 'POL': 2, 'KWT': 3, 'GAB': 1, 'CYM': 28, 'EST': 1, 'MWI': 1, 'ESP': 5, 'IRQ': 0, 'SLV': 0, 'MLI': 0, 'IRL': 4, 'IRN': 3, 'ABW': 0, 'PNG': 0, 'PAN': 136, 'SDN': 0, 'SLB': 0, 'NZL': 17, 'MCO': 20, 'JPN': 10, 'KGZ': 0, 'UGA': 0, 'ARE': 82, 'ARG': 8, 'BHS': 72, 'BHR': 7, 'ARM': 2, 'NRU': 0, 'CUB': 0}

In [35]:
print "Able to merge {} addresses in system ({:.1f}%).".format(sum(duplicate_addresses.values()),
                                      sum(duplicate_addresses.values())*100./len(addresses))


Able to merge 5006 addresses in the database (3.3%).
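
The loop above only counts merge candidates; actually collapsing them would mean pointing each near-duplicate's node_id at a canonical node. A minimal sketch of that step, reusing the per-country groups and the DL <= 2 threshold (the canonical_node dictionary is a name introduced here purely for illustration):

dl_threshold = 2
canonical_node = {}   # maps a duplicate node_id -> the node_id it should merge into

# This mirrors the counting loop above, so it is similarly expensive for the largest countries.
for name, group in grouped:
    rows = list(zip(group.node_id, group.address))
    for i, (id1, add1) in enumerate(rows):
        for id2, add2 in rows[i+1:]:
            try:
                d = jellyfish.damerau_levenshtein_distance(unicode(add1,'utf-8'),
                                                           unicode(add2,'utf-8'))
            except:
                continue
            if d <= dl_threshold and id2 not in canonical_node:
                # Point the later node at the earlier one (or at its own canonical target)
                canonical_node[id2] = canonical_node.get(id1, id1)

print "{} address nodes would be folded into an existing node".format(len(canonical_node))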
