In [9]:
# settings and modules
%config InlineBackend.figure_format = 'retina'
%pylab inline
from nsaba.nsaba import nsaba
NSABA (NeuroSynth Allen Brain Atlas) is a python module with a suite of tools for analyzing fMRI meta-analyses from Neurosynth (http://neurosynth.org) and brain-wide, genome-scale human gene experession data from the Allen Brain Institute (http://human.brain-map.org/)
This tool can be used to estimate gene expression or term association with given coordinates, visualize relationships between gene expression and term association, and perform several analyses on these data
In other words, this tool can ask questions such as, "Is the D2 receptor expressed significantly higher in brain regions that are activated in studies that talk about 'reward'?" Moreover, this tool can be used to algorithmically generate novel hypotheses about genes and functional networks
This notebook will demonstrate some of the functionality of the Nsaba toolbox to explore the relationship between D2 receptors and parts of the brain associated with reward. This is the long demo. For a shorter walkthrough see the Short Demo notebook in this Repo.
Estimation Methods:
estimate_aba_ge(entrez_ids, coords=None, **kwargs): Retrieves, estimates and stores gene expression coefficients in ABA dictionary based on a passed list of NIH Entrez IDs.
estimate_ns_act(self, term, coords=None, **kwargs): Uses radius neighbors regression to estimate Neurosynth term activation (tf-idf) at specified coordinates. If no coordinates are passed, ABA sampled locations in corresponding NsabaBase are used.
Organization Methods
matrix_builder(ns_terms=None,entrez_ids=None): Generates a np.array() matrix of pre-estimated term activations and gene expression coefficients.
ge_ratio(self, entrez_ids, coords=None, **kwargs): Calculates the ratio of gene expression at each ABA sampled MNI coordinate or custom coordinates.
Initialization Methods:
aba_load(aba_path='.', csv_names=None): loads data from the ABA into dataframes.
ns_load(ns_path='.', ns_files=None): loads data from NS into dataframes.
load_ge_pkl(pkl_file="Nsaba_ABA_ge.pkl", output_dir='.'): loads a previously saved pkl file of genetic expression. Faster than aba_load for a series of genes.
load_ns_pkl(pkl_file="Nsaba_NS_act.pkl", output_dir='.'): loads a previously saved pkl file of term activation. Faster than ns_load for a series of terms.
Bookkeeping Methods:
_check_static_members(): Reminds the user to run aba_load and ns_load before doing anything
check_entrez_struct(entrez_ids): checks that entrez_id's are entered in the correct format
is_term(term): checks that a term exists in the NS term library
is_coord((x,y,z)): checks if a coordinate set has a matching neurosynth study ID
is_id(study_id): checks if a study ID matches the Neurosynth database. The two dataframes, one with coordinates and one with activation vectors have some mismatched ID's.
coord_to_ids((x,y,z)): finds study IDs with coordinates that match (x,y,z)
coord_to_terms((x,y,z)): returns the mean activation of all terms at coordinates (x,y,z). If there is no activation data at these coordinates, it returns []
_coord_to_ge(coord, ids, search_radii, k): returns the estimated gene expression of a gene at a coordinate using the knn method (described below)
_id_to_terms(study_id): returns the vector of term activations for a NS study
id_to_coords. Not a method. Use id_dict.
_term_to_coords(term, thresh=0): returns coordinates where a term activation is above a given threshold. set thresh to -1 for all activation.
pickle_ge(pkl_file="Nsaba_ABA_ge.pkl", path='.'): Saves Current ABA data to a pkl file.
pickle_ns(pkl_file="Nsaba_NS_act.pkl", path='.'): Saves Current term data to a pkl file.
ns_load_id_dict(): builds a dictionary of NS study ids with their corresponding coordinates. Helpful for NS methods where lots of overlapping data has to be accessed.
In [10]:
#local paths to data
ns_path = "/Users/Torben/Documents/ABI analysis/current_data_new/"
aba_path = '/Users/Torben/Documents/ABI analysis/normalized_microarray_donor9861/'
#initialization methods
nsaba.Nsaba.ns_load(ns_path)
nsaba.Nsaba.aba_load(aba_path)
#Creating Nsaba object and loading previously estimated gene activation values
N = nsaba.Nsaba()
N.load_ge_pickle(pkl_file='/Users/Torben/Documents/ABI analysis/normalized_microarray_donor9861/Nsaba_ABA_ge.pkl')
#N.load_ns_pickle(pkl_file= '/Users/Torben/Documents/ABI analysis/Nsaba_NS_act.pkl')
Code from the cell below will initialize data for these terms and genes.
In [3]:
from nsaba.nsaba import geneinfo
term1 = 'reward'
term2 = 'motivation'
genes = [1813 , 6531]
#local path to csv file of gene information scraped from the NIH database (see geneInfo section near end of notebook)
gene_info_output = geneinfo.get_local_gene_info('/Users/Torben/Code/nsaba/csv/',genes)
for info in gene_info_output:
print info,'\n'
N.estimate_ns_act(term1, thresh=-1, method='knn')
N.estimate_ns_act(term2, thresh=-1, method='knn')
methods estimate_aba_ge and estimate_ns_act can be used to estimate gene expression and term association, respectively.
As a default, each will return the values for a given gene or term at the locations where there is data in the ABA.
However, custom coordinates can be used as well. To do this, create an array with tuples of MNI coordinates and set it to the keyword coords in the estimation method as done below. Coordinates can be saved in the Nsaba object by setting store_coords to True and will be saved under self.ge[entrez_id]['store_coords'].
Most genes in the ABA are sensed by multiple probes. The ID's and expression values of each probe are saved as keys in self.ge[entrez_id]. These can be accessed with self.ge[entrez_id][probe_id]['GE']
The mean of each probes at every coordinate be found under self.ge[entrez_id]['mean']
We use radius neighbors regression to estimate gene expression at cordinates where there is no data. This algorithm takes the average of every value found within a given radius. The default radius is 5. The default weighting of the neighboring values is uniform in the current iteration of NSABA. Distance based estimation is in development
If there are no nearby points with data, estimation methods return nan's
In [4]:
radius = 10
rnn_args = dict();
rnn_args['radius']= radius
custom_coords = [(-10,20,1),(-4,21,10),(23,11,-6),(10,-20,-3),(-11,8,-20)]
N.estimate_aba_ge(entrez_ids = genes,coords=custom_coords,store_coords=True,z_score=True,rnn_args=rnn_args)
mean_expression_at_coords = N.ge[1813]['mean']['GE']
print 'Estimation of DRD2 gene expression at custom coordinates: '+ str(mean_expression_at_coords)
N.estimate_ns_act(term='reward',coords=custom_coords)
reward_estimation = N.term['reward']['act']
print 'Estimation of reward at custom coordinates: '+ str(reward_estimation)
visualize_ge(gene): plots all of the points where there is genetic expression where color corresponds to genetic expression.
visualize_ns(term,no_ids): plots the coordinates most associated with the given term where color corresponds to term frequency.
visualize_ns_ns(term1,term2): plots the activation of two terms against eachother. Returns linear regression coefficients and r^2.
visualize_ge_ge([gene1,gene2]): plots the expression of two genes against eachother. Returns linear regression coefficients and r^2.
visualize_ge_ns(term, gene): plots the correlation between term activation and gene expression. Returns linear regression coefficients and r^2.
In the plots, the anterior of the brain is facing away and to the right.
In [12]:
from nsaba.nsaba.visualizer import NsabaVisualizer
N.estimate_aba_ge(entrez_ids = genes)
N.estimate_ns_act(term='reward')
N.estimate_ns_act(term='motivation')
#different estimation methods can be used and visualized.
V = NsabaVisualizer(N)
V.visualize_ge([genes[0]])
V.visualize_ge([genes[1]])
In [15]:
V.visualize_ns('reward',no_ids = 20)
V.visualize_ns('motivation',no_ids = 20)
In [14]:
V.lstsq_ge_ge(genes[0],genes[1])
Out[14]:
Interesting note: We're plotting the gene expression of the D2 receptor and the dopamine transporter. The dopamine transporter isn't specific to D2. That's why we see a subset of the data correlating but not all of it.
In [16]:
V.lstsq_ns_ns(term1,term2,logx=True,logy=True)
Out[16]:
In [17]:
V.lstsq_ns_ge('reward',[genes[0]],logy=True,verbose=True)
Out[17]:
We realized that linear correlations and regressions weren't capturing much of what was going on in the data so we built an analysis class to do more appropriate statistical analyses
t_test(term, gene, quant, log, graphops): performs a t-test on gene expression of given gene between papers that use that word and papers that don't. A significant t-test indicates that gene expression levels are significantly different. How much a word is used can be modulated with quant (0-1). if log is set to True, the t-test is performed on logged data. graphops can be set to 'density', 'violin', or 'box' to make different helpful plots.
t_test_multi(term, quant, sample_num, genes_of_interest): performs t-tests on all (subsample sample_num) gene expression of all genes (that have been initialized) between papers that use a term and papers that don't (as defined by quant).
p_val_distr(ttest_metrics): uses the output from t_test_multi and plots the distribution of p_values for every for every term/non-term gene expression comparison
effect_size_distr(ttest_metrics): uses the output from t_test_multi and plots the distribution of effect_sizes (cohen's d) for every term/non-term gene expression comparison. We believe the effect size of the t-test will be more informative about significance than p value.
fetch_gene_discriptions(ttest_metrics, nih_fetch_num, alpha): querries the NIH database of gene information and returns information about the top n genes. Prints new significance value after Bonferroni Correction using alpha
In [5]:
from nsaba.nsaba.analysis import NsabaAnalysis
A = NsabaAnalysis(N)
A.term_ge_ttest('reward',genes[0],graphops='density',quant=90)
In [11]:
A.term_ge_ttest('reward',genes[0],graphops='violin',quant = 90)
In [12]:
A.term_ge_ttest('reward',genes[0],graphops='box')
In [6]:
ttest_metrics = A.term_ge_ttest_multi('reward',quant=90)
In [7]:
A.cohen_d_distr(ttest_metrics,genes_of_interest=genes)
In [7]:
top_genes = A.fetch_gene_descriptions(ttest_metrics,csv_path='/Users/Torben/Code/nsaba/csv/')
In [25]:
from nsaba.nsaba import geneinfo
#get information about given genes from the NIH gene database
gene_information = geneinfo.scrape_nih_gene_info(str(genes[0]))
print gene_information
#The GeneCards database can also be searched to find the entrez id of a given gene using the following method:
#genecard_entrez_id = geneinfo.fetch_entrez_ids('DRD2',0)
#print genecard_entrez_id
In [24]:
from nsaba.nsaba import builder
NB = builder.NsabaBuilder()
NB.get_aba_ge_all()
NB.pickle_ge(pkl_file="Nsaba_ABA_ge.pkl",output_dir='/Users/Torben/Documents/')
Out[24]: