In this notebook we demonstrate a few of the Python ecosystem tools that enable research in areas that are difficult to tackle with traditional, fit-for-purpose tools such as Stata.
The agility of a full programming language environment allows for a high degree of flexibility, and the Python ecosystem provides a vast toolkit for staying productive.
Oil (3330) has a large world export share, but it is not strongly co-exported (i.e. connected in the network) with any products other than LNG.
Machinery, Electronics, and Garments are all sectors with a high degree of co-export potential with other related products, and they form part of a densely connected core of the network.
Developing economies typically occupy products in the weakly connected periphery of the network, and new products tend to emerge close to existing products in the network (a result established from analysis using the Product Space network). Middle-income countries manage to diffuse into the densely connected core of the product space.
Interest in studying networks is increasing within economics, with recent publications building network-type features into their models or using network analysis to uncover structural features of data that may otherwise go unexplored.
Many people who have interacted with tools from network analysis have done so via the idea of Social Network Analysis (SNA).
A graph is a way of specifying relationships among a collection of items.
It consists of a collection of nodes (or vertices) that are joined together by edges.
In [1]:
import networkx as nx
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('seaborn')
In [2]:
g = nx.Graph()
In [3]:
g.add_nodes_from(["A", "B", "C", "D"])
In [4]:
nx.draw_networkx(g)
plt.show()
In [5]:
g.add_edge("A", "B") #Add Edge between Nodes A and B
g.add_edge("A", "C") #Add Edge between Nodes A and C
g.add_edge("A", "D") #Add Edge between Nodes A and D
In [6]:
nx.draw_networkx(g)
plt.show()
In [7]:
nx.degree_centrality(g)
Out[7]:
In [8]:
g.add_edge("C","D")
In [9]:
nx.draw_networkx(g)
plt.show()
In [10]:
nx.degree_centrality(g)
Out[10]:
One early example of Social Network Analysis was conducted by Zachary (1977), who set out to use network analysis to explain factional dynamics and to understand fission in small groups. A network of friendships was used to identify how a karate club eventually split in two following an initial conflict between two of its members.
The structure of these relationships can be exploited to uncover new insights into the data:
One visualization (Cao, 2013) demonstrates how algorithmic analysis, based only on simple relational information about friendship between pairs of individuals, can reveal meaningful structure and clearly identify the roles played by certain individuals.
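As an illustrative aside (not part of the analysis that follows), networkx ships Zachary's karate club network as a built-in example graph, so the flavour of this result can be reproduced in a few lines: degree centrality alone already singles out the two individuals at the centre of the conflict (the instructor and the club administrator).
In [ ]:
import networkx as nx

#-Zachary's karate club ships with networkx as a built-in example graph-#
kc = nx.karate_club_graph()
print("# of Nodes: {}".format(kc.number_of_nodes()))   #34 club members
print("# of Edges: {}".format(kc.number_of_edges()))   #78 friendship ties

#-The two central figures in the conflict stand out on degree centrality alone-#
centrality = nx.degree_centrality(kc)
print(sorted(centrality, key=centrality.get, reverse=True)[:2])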
Let's focus on an application of network analysis to international trade data, replicating some of the results contained in the Hidalgo (2007) paper and later in The Atlas of Complexity and The Observatory of Economic Complexity.
The Hidalgo (2007) paper is used as a motivating example to demonstrate various tools that are available in the Python ecosystem.
In this setting we want to look at a characterisation of international trade data by considering the following:
Assumption: if products are highly co-exported across countries, then those products are revealed to be more likely to share the similar factors of production (or capabilities) required to produce them. For example, shirts and pants require a similar set of skills and therefore lend themselves to being co-exported, while shirts and cars are much more dissimilar.
This relational information between products can be represented by edge weights.
A high edge weight means two products have a high likelihood of being co-exported.
In [11]:
g = nx.Graph()
g.add_edge("P1", "P2")
pos = nx.spring_layout(g)
nx.draw_networkx(g, pos=pos, node_size=600, font_size=14)
nx.draw_networkx_edge_labels(g, pos=pos, edge_labels={("P1", "P2") : "Coexport Probability: Min{P(P1 | P2), P(P2 | P1)}"}, font_size=14)
plt.show()
In [12]:
adj = pd.read_csv("./data/simple_productspace.csv", names=["P1", "P2", "weight"])
adj
Out[12]:
In [13]:
g = nx.read_weighted_edgelist("./data/simple_productspace.csv", delimiter=",")
In [14]:
g.nodes()
Out[14]:
In [15]:
#Visualize the Network
pos = nx.spring_layout(g)
weights = [g[u][v]['weight']*10 for u,v in g.edges()]
nx.draw_networkx(g, node_size=1200, pos=pos, width=weights)
plt.show()
We now want to compute the edge weights to explore the full product space network derived from product level international trade data by computing the proximity matrix:
$$ \phi_{ij} = \min \left\{ P(RCA_i \geq 1 \mid RCA_j \geq 1),\; P(RCA_j \geq 1 \mid RCA_i \geq 1) \right\} $$

Proximity: a high proximity value suggests any two products are exported by a similar set of countries.
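As a worked illustration with made-up numbers: suppose 20 countries export shirts with $RCA \geq 1$, 16 countries export pants with $RCA \geq 1$, and 12 countries export both. Then $P(\text{shirts} \mid \text{pants}) = 12/16 = 0.75$ and $P(\text{pants} \mid \text{shirts}) = 12/20 = 0.60$, so $\phi = \min\{0.75, 0.60\} = 0.60$.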
The tasks involve:
1. computing Revealed Comparative Advantage (RCA) for each country-product pair,
2. converting the RCA values into binary $M_{cp}$ matrices,
3. computing the proximity matrix $\phi_{ij}$ from the yearly $M_{cp}$ matrices, and
4. building and visualizing the resulting product space network.
In [16]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from numba import jit
import networkx as nx
from bokeh.io import output_notebook
In [17]:
#-Load Jupyter Extensions-#
%matplotlib inline
output_notebook()
International Trade Data is largely available in SITC and HS product classification systems.
In this notebook we will focus on SITC revision 2 Level 4 data with 786 defined products.
| Classification | Level | Products |
|---|---|---|
| SITC | 4 | 786 |
| HS | 6 | 5016 |
Note: we use SITC data in this seminar, but as you can see, code performance becomes even more important when working with fully disaggregated HS international trade data.
In [18]:
fl = "./data/year_origin_sitc_rev2.csv"
data = pd.read_csv(fl, converters={'sitc':str}) #Import SITC codes as strings to preserve formatting
In [19]:
data.head()
Out[19]:
In [ ]:
data['year'].unique()
In [ ]:
data.shape[0]
In [ ]:
data['origin'].unique()
In [20]:
data[(data['year'] == 2000)&(data['origin']=="AUS")].head()
Out[20]:
In [21]:
data[data['origin'] == 'AUS'].set_index(["year","sitc"])["export"].unstack(level="year").head()
Out[21]:
The literature uses the standard Balassa definition of Revealed Comparative Advantage:
$$ \large RCA_{cpt} = \frac{\frac{E_{cpt}}{E_{ct}}}{\frac{E_{pt}}{E_{t}}} $$

where $E_{cpt}$ is the value of exports of product $p$ by country $c$ at time $t$, $E_{ct}$ is the total exports of country $c$, $E_{pt}$ is total world exports of product $p$, and $E_{t}$ is total world exports.
Reference: Balassa, B. (1965), Trade Liberalisation and Revealed Comparative Advantage, The Manchester School, 33, 99-123.
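As a worked illustration with hypothetical figures: if a country exports 10 of a product out of 100 in total exports (in some common currency unit), while that product accounts for 500 out of 20,000 in world trade, then $RCA = \frac{10/100}{500/20{,}000} = \frac{0.10}{0.025} = 4$, a strong revealed comparative advantage.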
To compute RCA we need to aggregate the data at different levels to obtain each component of the fraction defined above.
Let's break the equation down to figure out what needs to be computed:
$$ \large E_{ct} = \sum_{p}{E_{cpt}} $$
In [22]:
cntry_export = data[["year", "origin", "export"]].groupby(by=["year", "origin"]).sum()
cntry_export.head(n=2)
Out[22]:
This gives us a pandas.DataFrame indexed by a MultiIndex object. This can be very useful, but we would like to bring these totals back into the original data table, with one value for each product exported at time t by each country. Starting from this new object we could either:
- merge the data back into the original data DataFrame (sketched below), or
- use transform to request an object that is the same shape as the original data DataFrame.
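A hedged sketch of the merge-based route (the intermediate names cntry_totals and data_merged are illustrative; the notebook itself uses transform below):
In [ ]:
#-Illustrative merge-based alternative to transform (not used in the rest of the notebook)-#
cntry_totals = (data[["year", "origin", "export"]]
                .groupby(["year", "origin"], as_index=False)
                .sum()
                .rename(columns={"export": "cntry_export"}))
data_merged = data.merge(cntry_totals, on=["year", "origin"], how="left")
data_merged.head()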
In [23]:
data["cntry_export"] = data[["year", "origin", "export"]].groupby(by=["year", "origin"]).transform(np.sum)
data["prod_export"] = data[["year", "sitc", "export"]].groupby(by=["year", "sitc"]).transform(np.sum)
data["world_export"] = data[["year", "export"]].groupby(by=["year"]).transform(np.sum)
Now that the components of the equation have been computed, we can simply calculate $RCA$ as expressed by the original fraction.
In [24]:
data["rca"] = (data["export"] / data["cntry_export"]) / (data["prod_export"] / data["world_export"])
In [25]:
data.head()
Out[25]:
$RCA \geq 1$ indicates that country $c$ has a revealed comparative advantage in product $p$ at time $t$.
Therefore we can define the matrix $M_{cp}$:
$$ M_{cp} = \begin{cases} 1 & \text{if } RCA \geq 1 \\ 0 & \text{if } RCA < 1 \end{cases} $$

We can first construct the $RCA$ matrices and then compute $M_{cp}$ using a conditional map.
In [26]:
#-Generate Yearly RCA Matrices and store them in a Dictionary-#
rca = {}
for year in data.year.unique():
    yr = data[data.year == year].set_index(['origin', 'sitc']).unstack('sitc')['rca']
    rca[year] = yr
In [27]:
rca[2000].head()
Out[27]:
In [28]:
#-Generate Yearly Binary Mcp Matrices-#
mcp = {}
for year in rca.keys():
    mcp[year] = rca[year].fillna(0.0).applymap(lambda x: 1 if x >= 1.0 else 0.0)
In [29]:
mcp[2000].head()
Out[29]:
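As an aside, the same binary matrix can be built without applymap using a vectorized comparison; the sketch below is assumed to be equivalent (up to dtype, since the applymap above mixes int and float values):
In [ ]:
#-Vectorized alternative for building Mcp (illustrative; dtype may differ from the applymap version)-#
mcp_fast = {year: (rca[year].fillna(0.0) >= 1.0).astype(float) for year in rca}
(mcp_fast[2000].values == mcp[2000].values).all()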
In [ ]:
products = mcp[1998].loc['AUS']
products[products == 1.0]
Proximity: A high proximity value suggests any two products are exported by a similar set of countries.
$$ \phi_{ij} = \min \left\{ P(RCA_i \geq 1 \mid RCA_j \geq 1),\; P(RCA_j \geq 1 \mid RCA_i \geq 1) \right\} $$

The minimum conditional probability of co-export can be computed as:

$$ \phi_{ij} = \frac{\sum_c M_{cp_i} \cdot M_{cp_j}}{\max \{ k_{p_i}, k_{p_j} \}} $$

where $k_{p_i} = \sum_c M_{cp_i}$ is the ubiquity of product $i$ (the number of countries that export it with $RCA \geq 1$).

The $\phi_{ij}$ matrix is therefore computed over all pairwise combinations of the column vectors of $M_{cp}$, which is computationally intensive.
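Before looking at the loop-based implementations below, it is worth noting (as an aside, and as an assumed-equivalent formulation) that the numerator is just the matrix product $M^{T}M$, so the whole proximity matrix can also be written in a handful of vectorized NumPy operations:
In [ ]:
#-Illustrative vectorized computation of the proximity matrix (assumed equivalent to the loop versions below)-#
def proximity_matrix_vectorized(mcp):
    M = mcp.values.astype(float)              #countries x products
    k = M.sum(axis=0)                         #product ubiquity k_p
    numerator = M.T.dot(M)                    #pairwise co-export counts
    denominator = np.maximum.outer(k, k)      #max{k_i, k_j}
    with np.errstate(divide='ignore', invalid='ignore'):
        phi = np.where(denominator > 0, numerator / denominator, np.nan)
    return pd.DataFrame(phi, index=mcp.columns, columns=mcp.columns)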
In [30]:
def proximity_matrix_pandas(mcp, fillna=True):
    products = sorted(list(mcp.columns))
    sum_products = mcp.sum(axis=0)
    proximity = pd.DataFrame(index=products, columns=products)
    for i, product1 in enumerate(products):
        for j, product2 in enumerate(products):
            if j > i:  #Symmetric Matrix Condition
                continue
            numerator = (mcp[product1] * mcp[product2]).sum()
            denominator = max(sum_products[product1], sum_products[product2])
            if denominator == 0:
                cond_prob = np.nan
            else:
                cond_prob = numerator / denominator
            proximity.set_value(index=product1, col=product2, value=cond_prob)
            proximity.set_value(index=product2, col=product1, value=cond_prob)
    if fillna:
        proximity = proximity.fillna(0.0)
    return proximity
In [31]:
%%time
prox_2000 = proximity_matrix_pandas(mcp[2000])
In [32]:
prox_2000.unstack().describe()
Out[32]:
In [33]:
prox_2000.unstack().hist()
Out[33]:
At ~1 minute per year this takes a reasonably long time to compute for a single cross-section, which makes working with this data in an agile way problematic: computing all 50 years would take around an hour. While this was easy to implement, it isn't very fast!
Let's profile this code to get an understanding of where we spend most of our time.
For this line to run you will need to install line_profiler by running:
conda install line_profiler
In [34]:
# !conda install line_profiler
In [35]:
import line_profiler
%load_ext line_profiler
In [36]:
# %lprun -f proximity_matrix_pandas proximity_matrix_pandas(mcp[2000])
In [37]:
def proximity_matrix_numpy(mcp, fillna=False):
    products = sorted(list(mcp.columns))
    num_products = len(products)
    proximity = np.empty((num_products, num_products))
    col_sums = mcp.sum().values
    data = mcp.T.as_matrix()  #This generates a p x c numpy array (one row per product)
    for index1 in range(0, num_products):
        for index2 in range(0, num_products):
            if index2 > index1:
                continue
            numerator = (data[index1] * data[index2]).sum()
            denominator = max(col_sums[index1], col_sums[index2])
            if denominator == 0.0:
                cond_prob = np.nan
            else:
                cond_prob = numerator / denominator
            proximity[index1][index2] = cond_prob
            proximity[index2][index1] = cond_prob
    # Return DataFrame Representation #
    proximity = pd.DataFrame(proximity, index=products, columns=products)
    proximity.index.name = 'productcode1'
    proximity.columns.name = 'productcode2'
    if fillna:
        proximity = proximity.fillna(0.0)
    return proximity
In [38]:
%%time
prox_2000_numpy = proximity_matrix_numpy(mcp[2000])
In [39]:
prox_2000.equals(prox_2000_numpy)
Out[39]:
Numba is a package you can use to accelerate your code using a technique called just-in-time (JIT) compilation. It converts your high-level Python code into low-level LLVM-compiled machine code so that it runs much closer to raw machine speed.
nopython=True ensures the JIT compiles the function without falling back on any Python objects. If it cannot achieve this, it will raise an error.
Numba now supports much of the NumPy API; the list of supported features can be checked in the Numba documentation.
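As a minimal, illustrative sketch of what nopython mode accepts: a function that only touches NumPy arrays compiles cleanly, whereas passing a pandas DataFrame directly would fail, which is why the helper below extracts a raw NumPy array before calling the JIT-compiled function.
In [ ]:
#-Minimal nopython-mode sketch (illustrative only)-#
from numba import jit
import numpy as np

@jit(nopython=True)
def row_sums(a):
    out = np.empty(a.shape[0])
    for i in range(a.shape[0]):
        out[i] = a[i].sum()   #array slicing and reductions are supported in nopython mode
    return out

row_sums(np.random.rand(3, 4))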
In [40]:
@jit(nopython=True)
def coexport_probability(data, num_products, col_sums):
    proximity = np.empty((num_products, num_products))
    for index1 in range(0, num_products):
        for index2 in range(0, num_products):
            if index2 > index1:
                continue
            numerator = (data[index1] * data[index2]).sum()
            denominator = max(col_sums[index1], col_sums[index2])
            if denominator == 0.0:
                cond_prob = np.nan
            else:
                cond_prob = numerator / denominator
            proximity[index1][index2] = cond_prob
            proximity[index2][index1] = cond_prob
    return proximity

def proximity_matrix_numba(mcp, fillna=False):
    products = sorted(list(mcp.columns))
    num_products = len(products)
    col_sums = mcp.sum().values
    data = mcp.T.as_matrix()
    proximity = coexport_probability(data, num_products, col_sums)  #Call Jit Function
    # Return DataFrame Representation #
    proximity = pd.DataFrame(proximity, index=products, columns=products)
    proximity.index.name = 'productcode1'
    proximity.columns.name = 'productcode2'
    if fillna:
        proximity = proximity.fillna(0.0)
    return proximity
In [41]:
prox_2000_numba = proximity_matrix_numba(mcp[2000])
In [42]:
%%timeit
prox_2000_numba = proximity_matrix_numba(mcp[2000])
In [43]:
prox_2000_numba.equals(prox_2000)
Out[43]:
In [44]:
%%time
proximity = {}
for year in mcp.keys():
    proximity[year] = proximity_matrix_numba(mcp[year])
In [45]:
proximity[2000].head()
Out[45]:
Now that we have a fast single-year computation, we can compute all cross-sections serially using a loop, as above.
Alternatively, we can parallelize these operations using Dask to delay the computation and then ask the Dask scheduler to coordinate the work over the number of cores available to you. This is particularly useful when using HS data.
Note: this simple approach to parallelization does carry some overhead to coordinate the computations, so you won't get a full 4x speed-up on a 4-core machine.
In [46]:
import dask
from distributed import Client
Client()
Out[46]:
In [47]:
#-Setup the Computations as a Collection of Tasks-#
collection = []
for year in sorted(mcp.keys()):
    collection.append((year, dask.delayed(proximity_matrix_numba)(mcp[year])))
In [48]:
%%time
#-Compute the Results-#
result = dask.compute(*collection)
In [49]:
#-Organise the list of returned tuples into a convenient dictionary-#
results = {}
for year, df in result:
    results[year] = df
In [50]:
results[2000].equals(prox_2000)
Out[50]:
In [51]:
results[2000].head()
Out[51]:
Note: Dask does a lot more than this and is worth looking into for medium- to large-scale computations.
In [52]:
#-Save Results into a HDF5 File-#
fl = "data/sitcr2l4_proximity.h5"
store = pd.HDFStore(fl, mode='w')
for year in results.keys():
    store["Y{}".format(year)] = results[year]
store.close()
In [53]:
%%html
<style>
table {margin-left: 0 !important;}
</style>
For SITC Data: (786 Products, 229 Countries, 52 Years)
| Function | Time/Year | Total Time | Speedup |
|---|---|---|---|
| pandas | 220 seconds | ~177 minutes | - |
| pandas_symmetric | 104 seconds | ~84 minutes | BASE |
| numpy | 2.5 seconds | 120 seconds | ~41x |
| numba | 124 milliseconds | 6 seconds | ~800x |
| numba + dask | N/A | 5 seconds | - |
For HS Data: (5016 Products, 222 Countries, 20 Years)
| Function | Time/Year | Total Time | Speedup |
|---|---|---|---|
| pandas | 1 Hour 25 minutes | - | - |
| pandas_symmetric | 43 minutes | - | BASE |
| numpy | 1 min 37 seconds | - | ~28x |
| numba | 5 seconds | 1min 45 seconds | ~516x |
| numba + dask | N/A | 45 seconds | - |
These were run on the following machine:
| Item | Details |
|---|---|
| Processor | Xeon E5 @ 3.6GHz |
| Cores | 8 |
| RAM | 32 GB |
| Python | Python 3.6 |
In [54]:
prox = pd.read_hdf("data/sitcr2l4_proximity.h5", key="Y2000")
In [55]:
prox.head()
Out[55]:
In [56]:
edge_list = prox.unstack()
In [57]:
#-Construct Sequence of node pairs as a pd.Series-#
edge_list.head()
Out[57]:
In [58]:
#-Remove Self Loops (compare the two index levels rather than dropping every weight == 1.0)-#
edge_list = edge_list[edge_list.index.get_level_values(0) != edge_list.index.get_level_values(1)]
In [59]:
edge_list.head()
Out[59]:
In [60]:
#-Construct DataFrame-#
edge_list = edge_list.reset_index()
edge_list.columns = ["P1","P2","weight"]
In [61]:
edge_list["inv_weight"] = 1 - edge_list['weight'] #Useful when working with minimum spanning tree in networkx
In [62]:
edge_list.head()
Out[62]:
In [63]:
edge_list[["weight","inv_weight"]].hist();
In [64]:
import networkx as nx
In [65]:
#-Construct the complete network-#
g = nx.from_pandas_dataframe(edge_list, source="P1", target="P2", edge_attr=["weight", "inv_weight"])
print("# of Nodes: {}".format(g.number_of_nodes()))
print("# of Edges: {}".format(g.number_of_edges()))
In [66]:
mst = nx.minimum_spanning_tree(g, weight='inv_weight') #Minimum spanning tree on inv_weight = maximum spanning tree on proximity
print("# of Nodes: {}".format(mst.number_of_nodes()))
print("# of Edges: {}".format(mst.number_of_edges()))
In [67]:
mst["0011"]
Out[67]:
In [68]:
#-Build Maximum Spanning Tree + Keep Edges >= 0.50-#
ps = nx.Graph()
#Add MST ('weight' attribute only)
for u, v, w in mst.edges_iter(data=True):
    ps.add_edge(u, v, attr_dict={'weight': w["weight"]})
#Add Edges >= 0.50
for u, v, w in g.edges_iter(data=True):
    if w['weight'] >= 0.50:
        ps.add_edge(u, v, attr_dict={'weight': w["weight"]})
In [69]:
print("# of Nodes: {}".format(ps.number_of_nodes()))
print("# of Edges: {}".format(ps.number_of_edges()))
In [70]:
ps_nodes = pd.read_csv("data/PS_SITC_nodes", sep="\t", converters={'sitc' : str},
names=["sitc", "community", "x", "y", "nodesize","leamer","pname","ncolor"])
ps_edges = pd.read_csv("data/PS_SITC_edges", sep="\t", converters={'sourceid' : str, 'targetid' : str},
names=["sourceid", "sourcex", "sourcey","targetid","targetx","targety", "width","color"])
In [71]:
ps_nodes.head()
Out[71]:
In [72]:
ps_nodes.shape
Out[72]:
In [73]:
def normalize(df, column):
    max_value = df[column].max()
    min_value = df[column].min()
    df[column + "_scaled"] = (df[column] - min_value) / (max_value - min_value)
    return df
In [74]:
#Preprocess Coordinates to be Normalized between 0,1
ps_nodes = normalize(ps_nodes, 'x')
ps_nodes = normalize(ps_nodes, 'y')
In [75]:
ps_nodes.head()
Out[75]:
In [76]:
import numpy as np
#-Obtain Dictionary of Coordinates-#
coord = {}
xy = ps_nodes[["x_scaled","y_scaled"]].values
for idx, productcode in enumerate(ps_nodes["sitc"]):
    coord[productcode] = xy[idx]
#-Add Missing Nodes-#
coord['6784'] = np.array([0,0])
In [77]:
coord
Out[77]:
In [78]:
#-Check Entry-#
ps_nodes[ps_nodes.sitc == "0011"]
Out[78]:
In [79]:
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(50,20))
ax = fig.gca()
nx.draw_networkx(ps, ax=ax, pos=coord, node_size=750, width=1)
plt.savefig("productspace1.png")
Note: the file is saved locally so that the network can be viewed in greater detail.
In [80]:
#-Let's See where Apparel Chapter 84 Nodes are Located-#
def choose_color(x):
    if x[:2] == "84":
        return "b"
    else:
        return "r"
nodes = pd.DataFrame(sorted(list(coord.keys())), columns=["nodeid"])
nodes["color"] = nodes["nodeid"].apply(lambda x: choose_color(x))
In [81]:
nodes[nodes.color == 'b'].head()
Out[81]:
In [82]:
#-Get the Order of Nodes the Same as the Network Node List-#
order = pd.DataFrame(ps.nodes()).reset_index()
order.columns = ["order", "nodeid"]
nodes = nodes.merge(order, how="inner", on="nodeid")
nodes = nodes.sort_values(by="order")
In [83]:
fig = plt.figure(figsize=(50,20))
ax = fig.gca()
nx.draw_networkx(ps, ax=ax, pos=coord, node_size=750, width=1, node_color=nodes.color.values)
plt.savefig("productspace2.png")
In [84]:
#Can Output to use with Gephi / Cytoscape (Exploratory Network Tools)
nx.write_gml(ps, "product_space.gml")
[1] Zachary, W. (1977), "An Information Flow Model for Conflict and Fission in Small Groups", Journal of Anthropological Research, Vol. 33, No. 4 (Winter, 1977), pp. 452-473
[2] Cao, X., Wang X., Jin D., Cao Y. & He, D. (2013), "Identifying overlapping communities as well as hubs and outliers via nonnegative matrix factorization", Scientific Reports, Vol 3, Issue 2993
[3] Hidalgo, C.A., Klinger, B., Barabasi, A.-L., Hausmann, R. (2007), "The Product Space Conditions the Development of Nations", Science, Vol 317, pp 482-487
[4] Atlas of Complexity (http://atlas.cid.harvard.edu/)
[5] The Observatory of Economic Complexity (http://atlas.media.mit.edu/en/)
[6] Atlas of Complexity Grid Points for Nodes sourced from http://www.michelecoscia.com/?page_id=223
[7] Balassa, B. (1965), "Trade Liberalisation and Revealed Comparative Advantage", The Manchester School, 33, 99-123.
In [ ]: