scona is a tool to perform network analysis over structural covariance networks of brain regions. This tutorial will go through the basic functionality of scona, taking us from our inputs (a matrix of structural regional measures over subjects) to a report of local network measures for each brain region, and network-level comparisons to a cohort of random graphs of the same degree.
In [1]:
import matplotlib.pyplot as plt
%matplotlib inline
import networkx as nx
import numpy as np
import seaborn as sns
sns.set(context="notebook", font_scale=1.5, style="white")
import scona as scn
import scona.datasets as datasets
from scona.scripts.visualisation_commands import view_corr_mat
A scona analysis starts with four inputs.
regional_measures: A pandas DataFrame with subjects as rows. The columns should include structural measures for each brain region, as well as any subject-wise covariates.
names: A list of names of the brain regions. This will be used to specify which columns of the regional_measures matrix you want to correlate over.
covars (optional): A list of your covariates. This will be used to specify which columns of regional_measures you wish to correct for.
centroids: A list of tuples representing the cartesian coordinates of brain regions. This list should be in the same order as the list of brain regions so that coordinates are accurately assigned to regions. The coordinates are expected to obey the convention that the x=0 plane separates the left and right hemispheres of the brain.
In [2]:
# Read in sample data from the NSPN WhitakerVertes PNAS 2016 paper.
df, names, covars, centroids = datasets.NSPN_WhitakerVertes_PNAS2016.import_data()
df is the regional_measures file. Each row is a participant in the study, and each column is a measure associated with that person.
names are the columns in the regional measures data frame that you care about. The regional measures file can contain as many columns as you'd like. Some will be demographic measures (such as centre, age_scan, or male), others will be brain measures that you care about (see the names list below), and there can even be useless ones such as Unnamed: and lh_unknown_part1 that are created as part of your preprocessing.
In [3]:
# The first 5 rows of our regional measures data frame
df.head()
Out[3]:
In [4]:
# The top 10 regional names
names[:10]
Out[4]:
centroids are the x, y, z coordinates of the centres of each of the brain regions (listed in names).
In [5]:
# The centroids for the top 10 regions
centroids[:10]
Out[5]:
In this case covars is an empty list because the analyses published by Whitaker, Vertes et al did not correct for any measures, but you could give it a list of column names containing information that you'd like to covary out of the correlation matrices.
A good example would be age_scan and male, so let's set up that alternate covars list (covars_age_male) to compare the results as we go through this tutorial.
In [6]:
# Show the covars list that ships with the example data set
covars
Out[6]:
In [7]:
# Create a list of covariates to check their effect
covars_age_male = ['age_scan', 'male']
covars_age_male
Out[7]:
The scona command create_residuals_df will calculate the residual variation in each of the brain regions after correcting for the covariates defined in covars.
We give the function the data frame (df), the columns that correspond to the brain regions we care about (names), and the list of columns we want to covary out (covars or covars_age_male).
Note that it's totally fine to pass an empty list for covars. In fact, that's the default behaviour! In this case the values will simply be de-meaned.
You can see the slight variations in the values when we remove the (linear) effects of age and gender (df_res_age_male) compared to no covariates (df_res).
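Under the hood this is an ordinary least squares residualisation: each brain region is regressed on the covariates and the residuals are kept. Here is a rough sketch of the idea for a single region using plain numpy (illustrative only, not scona's actual implementation; X, y, beta and residuals are just names for this sketch):

# Sketch: regress one region on age and gender, keep the residuals
X = np.column_stack([np.ones(len(df)), df['age_scan'], df['male']])
y = df[names[0]].values
beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta  # roughly what create_residuals_df stores for this column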
In [8]:
df_res = scn.create_residuals_df(df, names, covars)
df_res.head()
Out[8]:
In [9]:
df_res_age_male = scn.create_residuals_df(df, names, covars_age_male)
df_res_age_male.head()
Out[9]:
Now we're ready to create a correlation matrix.
You'll notice (above) that df_res doesn't contain the other measures from df; it only has the brain measures we're interested in. The default behaviour of create_corrmat is to correlate all of the columns. You can pass an optional names argument if you want to correlate only a subset of measures.
The default method of creating a correlation matrix is Pearson correlation. If you'd prefer, you can use any of the options implemented by pandas, which are pearson, kendall and spearman at the time of writing.
Our correlation matrix M is a square pandas data frame with the rows and columns both corresponding to the names of the regions you're interested in.
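Because df_res is a regular pandas DataFrame, you can get a feel for what create_corrmat is doing with pandas alone (a sketch, not guaranteed to match scona's output exactly):

# Roughly equivalent Pearson correlation matrix straight from pandas
M_sketch = df_res[names].corr(method='pearson')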
In [10]:
M = scn.create_corrmat(df_res)
M.head()
Out[10]:
In [11]:
M_age_male = scn.create_corrmat(df_res_age_male)
In [12]:
# Take a look at the two different correlation matrices
fig, ax = plt.subplots(1, 2, figsize=(10, 5))
ax[0].imshow(M, cmap='RdBu_r', vmin=-0.8, vmax=0.8)
ax[0].set_title('No covariates')
ax[1].imshow(M_age_male, cmap='RdBu_r', vmin=-0.8, vmax=0.8)
ax[1].set_title('Remove age and gender effects')
plt.show()
From our correlation matrix we're ready to make a graph!
A short sidenote on the BrainNetwork
class. This is a very lightweight subclass of the Networkx.Graph
class. This means that any methods you can use on a Networkx.Graph
object can also be used on a BrainNetwork
object, although the reverse is not true. We have added various methods that allow us to keep track of measures that have already been calculated. This is particularly useful later on when one is dealing with 1000 random graphs (or more!) and saves a lot of time.
Initialise a weighted graph G
from the correlation matrix M
. The parcellation
and centroids
arguments are used to label nodes with names and coordinates respectively.
G
can be created from a pandas data frame (as we're doing here), a numpy array, or a networkx graph. So this is a great place to jump in from another processing package if you have a preferred way of creating your correlation matrix.
The scn.BrainNetwork
options parcellation
and centroids
are optional, but recommended. If you don't give scn.BrainNetwork
the names of your regions or the coordinates of each region, there will be some graph measures that can't be calculated (for example euclidean distance) and the output might be somewhat harder to interpret (you won't have a nice mapping between the nodes back to your regional data).
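For example, if your own pipeline hands you a plain numpy array, a call along these lines should work just as well (a sketch; G_from_array is an illustrative name, and M.values is simply the array underlying our pandas data frame):

# Sketch: build the same network from a numpy array instead of a DataFrame
G_from_array = scn.BrainNetwork(network=M.values, parcellation=names, centroids=centroids)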
In this section of the tutorial we'll just take the data with no covariates forward (to help with readability), but a comparison of values with and without the age and gender covariates is at the very end of this notebook :)
In [13]:
G = scn.BrainNetwork(network=M, parcellation=names, centroids=centroids)
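Because BrainNetwork subclasses Networkx.Graph (as noted above), standard networkx functions can be called on G directly, for example:

# Plain networkx calls work on a BrainNetwork
nx.number_of_nodes(G)  # how many brain regions we have
nx.density(G)          # 1.0 for the complete correlation graph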
In [14]:
# Take a look at the information stored for the first node
# of the graph we created
G.nodes[0]
Out[14]:
In [15]:
# Or you can see the weight (the correlation strength)
# of the edge between nodes 0 and 1
G.edges[(0,1)]
Out[15]:
Next we'll threshold G at cost 10 to create a binary graph with 10% as many edges as the complete graph G.
Ordinarily when thresholding one takes the 10% of edges with the highest weight. However, this can sometimes leave a disconnected graph, which in the context of brain networks is neither biologically plausible nor easily interpretable. Therefore, in order to ensure that the resulting graph is connected, we will calculate a minimum spanning tree first. A "tree" is the smallest possible graph where you can travel from every node to every other node. The minimum spanning tree builds that tree from the strongest available edges.
This calculation happens within the BrainNetwork.threshold method. If you want to omit the minimum spanning tree step, you can pass the argument mst=False to G.threshold.
The threshold method does not edit objects in place.
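To get a feel for what threshold is doing, here is a rough sketch of MST-based thresholding in plain networkx (scona's implementation differs in its details, and additionally binarises the surviving edges; n, target, mst, rest and H are just names for this sketch):

# Sketch of MST-based thresholding at cost 10; not scona's exact code
n = G.number_of_nodes()
target = int(0.10 * n * (n - 1) / 2)  # 10% of the complete graph's edges
# The spanning tree of the strongest edges, i.e. the maximum spanning tree by weight
mst = nx.maximum_spanning_tree(G, weight='weight')
rest = sorted((e for e in G.edges(data=True) if not mst.has_edge(e[0], e[1])),
              key=lambda e: e[2]['weight'], reverse=True)
H = nx.Graph(mst)
H.add_edges_from(rest[:target - mst.number_of_edges()])  # top up with the strongest remaining edges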
In [16]:
G10 = G.threshold(10)
In [17]:
# Note that edges that still exist now have a weight of 1
G10.edges[(0,1)]
Out[17]:
In [18]:
# Some of these edges don't exist any more!
# (90% of them to be exact)
try:
    G10.edges[(0,50)]
except KeyError:
    print("This edge does not exist!")
The BrainNetwork method calculate_nodal_measures computes and records the following nodal measures by default:
betweenness: betweenness centrality
closeness: closeness centrality
clustering: clustering coefficient
degree: number of edges from that node
module: which module this node belongs to according to the Louvain community detection algorithm (starting from 0)
participation_coefficient: participation coefficient according to the partition defined above
shortest_path_length: shortest path length from this node to any other in the network
Let's start by seeing the nodal attributes we already have. The useful method report_nodal_measures returns all these values in a DataFrame.
In [19]:
G10.report_nodal_measures().head()
Out[19]:
We can run calculate_nodal_measures to fill in a lot more information in that DataFrame.
This method takes around 6 seconds to run, which isn't long but adds up when you're creating random graphs for comparison with your real network data. Therefore this BrainNetwork method will check to see if the measures already exist before it re-calculates them. If they're already there it will just move on to the next one.
You can re-calculate the nodal measures by setting the option force=True.
In [20]:
%%timeit -n 1
G10.calculate_nodal_measures()
In [21]:
%%timeit -n 1
G10.calculate_nodal_measures()
In [22]:
%%timeit -n 1
G10.calculate_nodal_measures(force=True)
The calculate_nodal_measures method assigns these calculated values to the BrainNetwork graph. We can look at a nice summary using report_nodal_measures, which outputs a pandas DataFrame.
In [23]:
G10.report_nodal_measures().head()
Out[23]:
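These are all standard graph theory measures; betweenness centrality, for instance, could equally be computed for the same graph with plain networkx:

# Betweenness centrality straight from networkx, for comparison
bc = nx.betweenness_centrality(G10)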
The method calculate_spatial_measures calculates nodal and edge values that relate to the euclidean distances between nodes. They can only be calculated if the centroids file is available.
The edge attributes are:
euclidean: the distance between two nodes (in the units given in the centroids file, likely mm)
interhem: whether the edge connects regions in two different hemispheres (1 if yes, 0 if no)
The nodal measures are:
average_dist: the average distance across all edges connected to that node
total_dist: the total distance of all edges connected to that node
hemisphere: whether the node is in the left (L) or right (R) hemisphere
interhem: number of interhemispheric edges from that node
interhem_proportion: proportion of interhemispheric edges from that node
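Both spatial edge attributes follow directly from the centroids. A sketch for one pair of regions, using the x=0 hemisphere convention described at the top of this notebook (a and b are just names for this sketch):

# Sketch: euclidean distance and interhemispheric flag for one pair of centroids
a, b = np.array(centroids[0]), np.array(centroids[1])
euclidean = np.linalg.norm(a - b)  # straight-line distance in the units of the centroids
interhem = int(a[0] * b[0] < 0)    # 1 if the x coordinates have opposite signs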
In [24]:
G10.calculate_spatial_measures()
In [25]:
# Let's take a look at all the nodal measures again
G10.report_nodal_measures().head()
Out[25]:
It's quite likely that you may want to attribute additional values to each node. For example, in their paper in 2016, Whitaker, Vertes and colleagues correlated the cross-sectional decrease in cortical thickness between ages 14 and 24 for each region with its degree. They showed that the hubs (the best connected regions) changed the most in this late adolescent age range.
As we're using this data set, we can compute a similar measure here: the correlation of each regional measure with age (called dCT below). We can add measures as one might normally add nodal attributes to a networkx graph.
In [26]:
# Correlate age at scan with each regional measure to get dCT
dCT = df.loc[:, ['age_scan'] + names].corr().iloc[0, 1:]
dCT.head()
Out[26]:
In [27]:
nx.set_node_attributes(G10, name="dCT", values={node: dCT[G10.nodes[node]['name']] for node in G10.nodes()})
These show up in our DataFrame too.
In [28]:
nodal_df = G10.report_nodal_measures(columns=['name', 'degree', 'dCT'])
nodal_df.head()
Out[28]:
In [29]:
nodal_df = G10.report_nodal_measures(columns=['degree', 'dCT'])
sns.regplot(data=nodal_df, y='dCT', x='degree')
sns.despine()
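So far we've focussed on nodal measures. We can also calculate summary measures for the network as a whole with calculate_global_measures, and the rich club coefficient at each degree with rich_club: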
In [30]:
G10.calculate_global_measures()
Out[30]:
In [31]:
G10.rich_club()
Out[31]:
In [32]:
brain_bundle = scn.GraphBundle([G10], ['NSPN_cost10'])
This creates a dictionary-like object with the BrainNetwork G10 keyed by 'NSPN_cost10'.
In [33]:
brain_bundle
Out[33]:
Now add a series of random graphs created by edge swap randomisation of G10 (keyed by 'NSPN_cost10').
The create_random_graphs method of the GraphBundle class takes in a real network (in our case G10) and creates a number (5 in the example below) of random graphs. The output is a dictionary of all these graphs. The original one has the name you gave it (here: "NSPN_cost10") and the random graphs have a suffix of _Ri where i is an integer counter starting at 0.
What's pretty cool is that this function again saves the work you've done before. So if you generate a few random graphs and save them to the brain_bundle dictionary and then later want to add some more, the create_random_graphs method will recognise that there are some random graphs there already and append the new ones to the brain_bundle. The index for these additional random graphs will start counting from where the original run left off.
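Edge swap randomisation repeatedly swaps pairs of edges so that every node keeps its degree while the rest of the network structure is scrambled. A rough sketch of a single randomisation using networkx's double_edge_swap (not necessarily what scona does internally; R is just a name for this sketch):

# Sketch: one degree-preserving random graph via edge swaps
R = nx.Graph(G10)
nx.double_edge_swap(R, nswap=10 * R.number_of_edges(),
                    max_tries=100 * R.number_of_edges())
sorted(d for _, d in R.degree()) == sorted(d for _, d in G10.degree())  # degrees unchanged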
In [34]:
# Note that 5 is not usually a sufficient number of random graphs to do meaningful analysis;
# it is used here for time considerations
brain_bundle.create_random_graphs('NSPN_cost10', 5)
In [35]:
brain_bundle
Out[35]:
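With the random graphs in place, report_global_measures tabulates the global measures for every graph in the bundle, so the real network can be compared against the random distribution: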
In [36]:
brain_bundle.report_global_measures()
Out[36]:
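Similarly, report_rich_club gathers the rich club coefficients of every graph into a single DataFrame (here we peek at a slice of the degree range):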
In [44]:
rich_club_df = brain_bundle.report_rich_club()
rich_club_df.iloc[55:65, :]
Out[44]:
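The GraphBundle.apply method maps a function over every graph in the bundle and returns the results in a dictionary keyed by graph name. For example, we can calculate closeness centrality for the real network and each random graph, then report it: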
In [38]:
calc_closeness = lambda x: x.calculate_nodal_measures(measure_list=["closeness"])
report_closeness = lambda x: x.report_nodal_measures(columns=["name", "closeness"])
brain_bundle.apply(graph_function=calc_closeness)
closeness_dict = brain_bundle.apply(graph_function=report_closeness)
for graph_key, closeness_df in closeness_dict.items():
    print(closeness_df.head())
In [39]:
calc_degree = lambda x: x.calculate_nodal_measures(measure_list=["degree"])
report_degree = lambda x: x.report_nodal_measures(columns=["name", "degree"])
brain_bundle.apply(graph_function=calc_degree)
degree_dict = brain_bundle.apply(graph_function=report_degree)
for graph_key, degree_df in degree_dict.items():
    print(degree_dict["NSPN_cost10"].head() == degree_df.head())
In [42]:
degree = G10.report_nodal_measures(columns=["degree"])
sns.distplot(degree)
np.percentile(degree, 90)
Out[42]: