In order to do this analysis you have to be in the tests directory of GEOparse.
In the paper Systematic identification of microRNA functions by combining target prediction and expression profiling Wang and Wang provided a series of microarrays from 7 time-points after miR-124a transfection. The series can be found in GEO under the GSE6207 accession. We use this series to demonstrate general principles of GEOparse. Mind that this tutorial is not abut how to properly calculate log fold changes - the approach undertaken here is simplistic.
We start with the imports:
In [88]:
%matplotlib inline
import GEOparse
import pandas as pd
import pylab as pl
import seaborn as sns
pl.rcParams['figure.figsize'] = (14, 10)
pl.rcParams['ytick.labelsize'] = 12
pl.rcParams['xtick.labelsize'] = 11
pl.rcParams['axes.labelsize'] = 23
pl.rcParams['legend.fontsize'] = 20
sns.set_style('ticks')
c1, c2, c3, c4 = sns.color_palette("Set1", 4)
We also prepared a simple tabulated file with the description of each GSM. It will be usefull to calculate LFC.
In [91]:
experiments = pd.read_table("GSE6207_experiments.tab")
We can look in to this file:
In [90]:
experiments
Out[90]:
Now we select the GSMs that are controls.
In [92]:
controls = experiments[experiments.Type == 'control'].Experiment.tolist()
Using GEOparse we can download experiments and look into the data:
In [93]:
gse = GEOparse.get_GEO("GSE6207")
The GPL we are interested:
In [94]:
gse.gpls['GPL570'].columns
Out[94]:
And the columns that are available for exemplary GSM:
In [6]:
gse.gsms["GSM143385"].columns
Out[6]:
We take the opportunity and check if everything is OK with the control samples. For this we just use simple histogram. To obtain table with each GSM as column, ID_REF as index and VALUE in each cell we use pivot_samples method from GSE object (we restrict the columns to the controls):
In [97]:
pivoted_control_samples = gse.pivot_samples('VALUE')[controls]
pivoted_control_samples.head()
Out[97]:
And we plot:
In [98]:
pivoted_control_samples.hist()
sns.despine(offset=10, trim=True)
Next we would like to filter out probes that are not expressed. The gene is expressed (in definition here) when its average log2 intensity in control samples is above 0.25 quantile. I.e. we filter out worst 25% genes.
In [99]:
pivoted_control_samples_average = pivoted_control_samples.median(axis=1)
print "Number of probes before filtering: ", len(pivoted_control_samples_average)
In [10]:
expression_threshold = pivoted_control_samples_average.quantile(0.25)
In [100]:
expressed_probes = pivoted_control_samples_average[pivoted_control_samples_average >= expression_threshold].index.tolist()
print "Number of probes above threshold: ", len(expressed_probes)
We can see that the filtering succeeded. Now we can pivot all the samples and filter out probes that are not expressed:
In [102]:
samples = gse.pivot_samples("VALUE").ix[expressed_probes]
The most important thing is to calculate log fold changes. What we have to do is for each time-point identify control and transfected sample and subtract the VALUES (they are provided as log2 transformed already, we subtract transfection from the control). In the end we create new DataFrame with LFCs:
In [103]:
lfc_results = {}
sequence = ['4 hours',
'8 hours',
'16 hours',
'24 hours',
'32 hours',
'72 hours',
'120 hours']
for time, group in experiments.groupby("Time"):
print time
control_name = group[group.Type == "control"].Experiment.iloc[0]
transfection_name = group[group.Type == "transfection"].Experiment.iloc[0]
lfc_results[time] = (samples[transfection_name] - samples[control_name]).to_dict()
lfc_results = pd.DataFrame(lfc_results)[sequence]
Let's look at the data sorted by 24-hours time-point:
In [104]:
lfc_results.sort("24 hours").head()
Out[104]:
We are interested in the gene expression changes upon transfection. Thus, we have to annotate each probe with ENTREZ gene ID, remove probes without ENTREZ or with multiple assignments. Although this strategy might not be optimal, after this we average the LFC for each gene over probes.
In [105]:
# annotate with GPL
lfc_result_annotated = lfc_results.reset_index().merge(gse.gpls['GPL570'].table[["ID", "ENTREZ_GENE_ID"]],
left_on='index', right_on="ID").set_index('index')
del lfc_result_annotated["ID"]
# remove probes without ENTREZ
lfc_result_annotated = lfc_result_annotated.dropna(subset=["ENTREZ_GENE_ID"])
# remove probes with more than one gene assigned
lfc_result_annotated = lfc_result_annotated[~lfc_result_annotated.ENTREZ_GENE_ID.str.contains("///")]
# for each gene average LFC over probes
lfc_result_annotated = lfc_result_annotated.groupby("ENTREZ_GENE_ID").median()
We can now look at the data:
In [16]:
lfc_result_annotated.sort("24 hours").head()
Out[16]:
At that point our job is basicaly done. However, we might want to check if the experiments worked out at all. To do this we will use hsa-miR-124a-3p targets predicted by MIRZA-G algorithm. The targets should be downregulated. First we read MIRZA-G results:
In [108]:
header = ["GeneID", "miRNA", "Total score without conservation", "Total score with conservation"]
miR124_targets = pd.read_table("seed-mirza-g_all_mirnas_per_gene_scores_miR_124a.tab", names=header)
miR124_targets.head()
Out[108]:
We shall extract targets as a simple list of strings:
In [109]:
miR124_targets_list = map(str, miR124_targets.GeneID.tolist())
print "Number of targets:", len(miR124_targets_list)
As can be seen there is a lot of targets (genes that posses a seed match in their 3'UTRs). We will use all of them. As first stem we will annotate genes if they are targets or not and add this information as a column to DataFrame:
In [85]:
lfc_result_annotated["Is miR-124a target"] = [i in miR124_targets_list for i in lfc_result_annotated.index]
In [86]:
cols_to_plot = [i for i in lfc_result_annotated.columns if "hour" in i]
In the end we can plot the results:
In [87]:
a = sns.pointplot(data=lfc_result_annotated[lfc_result_annotated["Is miR-124a target"]][cols_to_plot],
color=c2,
label="miR-124a target")
b = sns.pointplot(data=lfc_result_annotated[~lfc_result_annotated["Is miR-124a target"]][cols_to_plot],
color=c1,
label="No miR-124a target")
sns.despine()
pl.legend([pl.mpl.patches.Patch(color=c2), pl.mpl.patches.Patch(color=c1)],
["miR-124a target", "No miR-124a target"], frameon=True, loc='lower left')
pl.xlabel("Time after transfection")
pl.ylabel("Median log2 fold change")
Out[87]:
As can be seen the targets of hsa-miR-124a-3p behaves in the expected way. With each time-point their downregulation is stronger up the 72 hours. After 120 hours the transfection is probably lost. This means that the experiments worked out.