This notebook contains some guidelines for the students to prepare programming environments for the upcomming seminars and home assignments. During our seminars we will mostly work in IPython Notebook, however sometimes we may switch to MatLab. Homeworks also can be submitted only as IPython Notebooks or $\LaTeX$ reports with Python source code.
(We are using Python 2.7.8, not 3.x.x)
I suggest you to install free Anaconda Python distribution which already includes all necessary Python packages and modules - IPython, NumPy, SciPy, IPython, Matplotlib and NetworkX. Simply download appropriate installer for your OS and follow the proposed steps.
A small tutorial for the NetworkX and other useful Python modules will be just in a minute.. Now let's deal with MatLab.
MatLab is a great computational environment, however it does not poses good instruments to operate with networks and graphs. Fortunately you can download MatLab Boost Graph Library which contains almost all graph utilities that will help you with assignments. To start using MatLabBGL algorithms simply set path to the downloaded folder.
Matlab is also sick in graph visualisation. To this purpose you can download GraphViz software and Matlab interface for it. All install instructions are presented on the corresponding web-pages.
During this course you will mostly operate with graph structures and matrices. Also you should be able to plot some figures and make conclusions based on them.
In [ ]:
import networkx as nx
import matplotlib.pyplot as plt
# plt.xkcd()
import numpy as np
%matplotlib inline
Networks can be described in several formats e.g. adjacency matrix, edge list, etc. One of the most complete graph representations is GML format which look more or less like XML. See an example:
graph
[
node
[
id A
label "Node A"
]
node
[
id B
label "Node B"
]
node
[
id C
label "Node C"
]
edge
[
source B
target A
label "Edge B to A"
]
edge
[
source C
target A
label "Edge C to A"
]
]
Thankfully, NetworkX supports almost all of recent network representations. Let's load a very famous Zachary's Karate Club network. Nodes in this network represent members of the club and edges between them occur if two persons also frequently communicate outside club's activities. Sadly, the organisation has splitted out into two groups due to conflict and misunderstanding between its leaders.
In [ ]:
G = nx.read_gml(path = 'data/karate.gml')
nx.draw_networkx(G)
draw_network command has several options to improve the view of the network. For instance, let's make nodes with greater degree bigger and more colorfull:
In [ ]:
k = nx.degree(G)
plt.figure(1, figsize=(10,7))
coord = nx.spring_layout(G)
nx.draw_networkx(G,
pos=coord,
nodelist=k.keys(),
node_size = [d*100 for d in k.values()],
node_color=k.values(),
font_size=8,
cmap=plt.cm.Reds,
)
Now we can clearly observe two nodes with highest degree centrality: 1 and 34. Exactly these two most important club members are responsible for the breakdown.
We also might be interested in node degree distribution and network's adjacency matrix:
In [ ]:
# load adjacency matrix
A = nx.to_numpy_matrix(G, dtype=np.bool)
k = G.degree()
# alternatevly you can find k as k = np.sum(A, axis=0)
# sort nodes according to degree
id_sort = np.array(sorted(k, key = k.get, reverse=True))
k_sort = np.array(k.values())
k_sort = k_sort[id_sort-1]
# show adjacency matrix
plt.figure(1, figsize=(6, 6))
plt.imshow(A,
cmap="Greys",
interpolation="none"
)
plt.title('Zachary\'s club adjacency matrix', fontsize=14)
In [ ]:
# show node degrees
plt.figure(1, figsize=(10, 5))
plt.bar(np.arange(34), k_sort)
plt.xticks(np.arange(34), id_sort)
plt.margins(0.2)
plt.ylabel('Degree')
plt.xlabel('Node id')
Now we have meet some basics of plotting and data manipulations in networkX. For more details visit the following web-pages:
Download dataset and import it with NumPy. There you will find a table D with 2901 rows and 5 columns. Set y = D[:,0] and X = D[:, 1:].
For this task you have to fit a least-squares linear regression model $\hat{y} = X\hat{\beta}$, where
$$ \hat{\beta} = (X^T X)^{-1} X^T y $$The residuals of the model are agreed to be calculated as $$ \text{res} = y - \hat{y} $$
1. Download dataset
2. Fit the model
3. Make two plots:
3.1. Choose a feature of dataset and plot fitted line against it
3.2. Make residuals against prediction plot
hint: DO NOT USE NUMPY's MATRICES. Use arrays and np.dot(A,B) instead of matrix multiplication A*B
In [ ]:
import numpy as np;
import numpy.linalg as la;
with file( "data/bikinis.txt", "r" ) as f:
data = np.loadtxt( f )
## set the dependent variable and create the model design matrix
Y = data[:,0 ]
## Probably one should include the intercept...
X = data[:,1:]
## Using a weel-known result, compute the OLS estimates of the linear coefficients.
beta = np.dot(
## (X'X)^{-1} (X'Y)
la.inv(
np.dot( X.T, X ) ),
np.dot( X.T, Y ) )
## Show everyone the triumph of math over the puny numbers!!
beta
## C ompute the fitted valeus of the dependent variable and the residuals
yhat = X.dot( beta )
resid = np.array( Y - yhat )
In [ ]:
## Import thr plotting library (once again)
import matplotlib.pyplot as plt
%matplotlib inline
## Set the style. Yes, XKCD is great, but not one should
## be careful not to overuse it like Comic Sans.
# plt.xkcd()
print plt.style.available
plt.style.use( "ggplot" )
## Set the dimension of the plotting area and draw a scatter plot
plt.figure( figsize = (10, 10) )
plt.plot( yhat, resid, 'xb' )
To download .gml file with your facebook friend network you may use getnet app.
1. Import your Facebook network
2. Now play with it!
2.1. Draw your network
2.2. Output your network's order and size
2.3. Draw the degree distribution (histogram)
2.4. Find nodes with the highest degrees. Who are these people?
2.5. Find connected components of your network. Interpret them.
In [ ]:
## Import our workhorse and call it "nx"!
import networkx as nx;
## Read the GML graph of facebook friends. Beware of the issues with UTF8
## The studid GML standard explicitly uses ASCII. This is just silly!
## UTF8 should be stored in HTM entitites &uXXXX;
## Used the first hit in Google to conver the UTF8 into HTML enities:
## http://konieczny.be/unicode.html
G = nx.read_gml( path = "data/huge_100004196072232_2015_01_15_17_39_6242c23baf56ef9592db0b9453072491.gml" )
## Get the degrees: returns the degree of each vertex
deg = G.degree( )
## Setup the plotting area and draw the network
plt.figure( figsize = (10, 10) )
nx.draw_networkx( G,
pos = nx.spring_layout( G ),
nodelist = deg.keys(),
node_size = [ d*100 for d in deg.values() ],
node_color = deg.values(),
font_size = 8,
cmap = plt.cm.Reds,)
In [ ]:
## The order of a graph is $|V|$ and the size is $|E|$
print "The network G is of the order %d. Its size is %d." % ( G.order( ), G.size( ) )
In [ ]:
## Build a historgarm of the empirical distribution of vertex degrees.
f, m, p = plt.hist( deg.values( ), bins = 12 ) # [x for k,v in deg.items() for x in v*[k]]
In [ ]:
## Get the TOP 10 vertices with the highest degree
# top10 = sorted( deg.keys( ), key = lambda k: deg[k], reverse = True )[:10]
top10_idx = sorted( deg, key = deg.get, reverse = True )[:10]
## See who are the hubs. (Actually checking
## the existence of keys in unnecessary).
nodes = [ G.node[k] for k in top10_idx ] # if k in deg ]
## Get their degrees
degrees = [ deg[k] for k in top10_idx ]
## Transform
text = [ "-- \"%s\" befriended %d other people" % (n['label'], v) for n, v in zip( nodes, degrees ) ]
## Show the culprits
print "Top 10 most friendly peoply within my network:"
print "\n".join( text )
In [ ]:
## Find the connected components using the BFS graph traversal
CC = nx.connected_components( G )
## Commected components are most likely the acquaintances i met in various communities
## However a better community detection algorithm might consider the volume of paths
## through any particular node, with the view to not fusing weakly connected groups.
communities = [ [G.node[ n ][ 'label' ] for n in c] for c in CC ]
## Print each member of every community.
print "\n\n".join( [ ", ".join( sorted( group ) ) for group in communities ] )
In [ ]:
## Study the largest connected component in the graph
H = [ G.subgraph( c ) for c in nx.connected_components( G ) ]
deg = H[ 0 ].degree( )
## Show the radius of the component and plot it
print "The largest community has radius %d" % nx.radius( H[ 0 ] )
plt.figure( figsize = (10, 10) )
nx.draw_networkx( H[ 0 ],
pos = nx.spring_layout( H[ 0 ] ),
nodelist = sorted( deg, key = deg.get ),
# nodelist = deg.keys( ),
# node_size = [ d * 10 for d in deg.values( ) ],
node_color = deg.values( ),
font_size = 8,
cmap = plt.cm.Purples,)
In [ ]: