In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import os
import random
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
import scipy
import seaborn as sns
from Bio import SeqIO
import tools
These files describe microarray results for samples from two conditions, control and treatment:
data/control_unix_endings_flags.csv
data/treatment_unix_endings.csv
In [2]:
# Input array data filepaths
controlarrayfile = os.path.join('..', 'data', 'control_unix_endings_flags.csv') # control experiment array data (preprocessed)
treatmentarrayfile = os.path.join('..', 'data', 'treatment_unix_endings.csv') # treatment experiment array data (preprocessed)
In [3]:
control = pd.read_csv(controlarrayfile, sep=',', skiprows=4, index_col=0)
treatment = pd.read_csv(treatmentarrayfile, sep=',', skiprows=4, index_col=0)
# Uncomment the lines below to inspect the first few rows of each dataframe
#control.head()
#treatment.head()
In [4]:
len(control)
Out[4]:
In both control and treatment datasets, the mapping of experimental samples (input and output) across the three replicates is:
Raw $\rightarrow$ input.1
Raw.1 $\rightarrow$ output.1
Raw.2 $\rightarrow$ input.2
Raw.3 $\rightarrow$ output.2
Raw.4 $\rightarrow$ input.3
Raw.5 $\rightarrow$ output.3
In [5]:
colnames_in = ['Raw', 'Raw.1', 'Raw.2', 'Raw.3', 'Raw.4', 'Raw.5'] # raw data columns
colnames_out = ['input.1', 'output.1', 'input.2', 'output.2', 'input.3', 'output.3'] # renamed raw data columns
# Reduce control and treatment arrays to raw data columns only
control = control[colnames_in]
control.columns = colnames_out
treatment = treatment[colnames_in]
treatment.columns = colnames_out
We expect that there is good agreement between input and output raw intensities for each replicate control or treatment experiment. We also expect that there should be good agreement across replicates within the controls, and within the treatment. We inspect this agreement visually with a matrix of scatterplots, below.
The plot_correlation() function can be found in the accompanying tools.py module.
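For readers without the accompanying module, the sketch below shows one way such a scatterplot matrix could be built with seaborn. It is illustrative only: the function name is hypothetical, and the real tools.plot_correlation() may differ in layout and styling.

from scipy import stats

def plot_correlation_sketch(df):
    """Scatterplot matrix with Spearman's R annotated on each panel (sketch)."""
    def annotate_spearman(x, y, **kwargs):
        # Spearman's rank correlation between the two plotted columns
        r, _ = stats.spearmanr(x, y)
        plt.gca().annotate("R = %.2f" % r, xy=(0.05, 0.9),
                           xycoords='axes fraction')
    g = sns.PairGrid(df)
    g.map_diag(plt.hist, bins=50)               # per-column intensity histograms
    g.map_offdiag(plt.scatter, s=4, alpha=0.2)  # pairwise scatterplots
    g.map_offdiag(annotate_spearman)
    return g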
In [6]:
# Plot correlations for control data
tools.plot_correlation(control);
There is good visual correlation between the intensities for the control arrays, and the Spearman's R values also indicate good correlation.
In [7]:
# Plot correlations for treatment data
tools.plot_correlation(treatment);
There is mostly good visual correlation between the intensities for the treatment arrays, and the Spearman's R values also indicate good correlation. However, there appear to be three problematic probes in replicate 3 that we may need to deal with in the data cleanup.
In [8]:
# Select outlying treatment input.3 values
treatment.loc[treatment['input.3'] > 4e4]
Out[8]:
In [9]:
# Define problem probes:
problem_probes = list(treatment.loc[treatment['input.3'] > 4e4].index)
We replace the three clear outlying values for the three problematic probes in input.3 of the treatment array with interpolated values. We assume that input.1 and input.2 are typical of the input intensities for these three probes, and take the average of their values to substitute for input.3 for each.
In [10]:
# Interpolate values
treatment.loc[problem_probes, 'input.3'] = \
    treatment.loc[problem_probes, ['input.1', 'input.2']].mean(axis=1)
treatment.loc[problem_probes]
Out[10]:
We can visualise the change in correlation for the treatment dataframe that results:
In [11]:
# Plot correlations for treatment data
tools.plot_correlation(treatment);
We expect the array intensity distribution to vary according to whether the sample was from the input (strong) or output (weak) set, and whether the sample came from the control or treatment pools. We therefore divide the dataset into four independently-normalised components:
control_input
control_output
treatment_input
treatment_output
In [12]:
input_cols = ['input.1', 'input.2', 'input.3'] # input columns
output_cols = ['output.1', 'output.2', 'output.3'] # output columns
# Normalise inputs and outputs for control and treatment separately
control_input = tools.quantile_norm(control, columns=input_cols)
control_output = tools.quantile_norm(control, columns=output_cols)
treatment_input = tools.quantile_norm(treatment, columns=input_cols)
treatment_output = tools.quantile_norm(treatment, columns=output_cols)
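For reference, quantile normalisation itself takes only a few lines of pandas. The sketch below is a minimal illustrative version, assuming tools.quantile_norm returns the named columns normalised to share a common intensity distribution; the actual implementation may differ.

def quantile_norm_sketch(df, columns):
    """Quantile-normalise the named columns of df (sketch).

    Values are ranked within each column, then each value is replaced by
    the mean of the values at that rank across all columns, so that every
    column ends up with the same distribution.
    """
    sub = df[columns]
    rank_means = sub.stack().groupby(
        sub.rank(method='first').stack().astype(int)).mean()
    return sub.rank(method='min').stack().astype(int).map(rank_means).unstack()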
We visualise the resulting distributions, in violin plots:
In [13]:
# Make violinplots of normalised data
tools.plot_normalised(control_input, control_output,
treatment_input, treatment_output)
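As a rough guide to what the helper does, one violin plot panel per dataset on a log intensity scale could be drawn as below; this is an assumption-laden sketch, not the tools.py implementation.

def plot_normalised_sketch(ctl_in, ctl_out, trt_in, trt_out):
    """Violin plots of log10 intensity for the four normalised sets (sketch)."""
    titles = ['control input', 'control output',
              'treatment input', 'treatment output']
    fig, axes = plt.subplots(1, 4, figsize=(16, 4), sharey=True)
    for ax, df, title in zip(axes, (ctl_in, ctl_out, trt_in, trt_out), titles):
        sns.violinplot(data=np.log10(df), ax=ax)  # assumes positive intensities
        ax.set_title(title)
    axes[0].set_ylabel('log10(intensity)')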
We have four dataframes containing normalised data:
control_input
control_output
treatment_input
treatment_output
Each dataframe is indexed by the array probe systematic name, with three columns that correspond to replicates 1, 2, and 3 for either a control or a treatment run. For downstream analysis we want to organise this data as the following columns:
index: unique ID
probe: probe name (these apply across treatment/control and input/output)
input: normalised input intensity value (for a particular probe and replicate)
output: normalised output intensity value (for a particular probe and replicate)
treatment: 0/1 indicating whether the measurement was made for the control or treatment sample
replicate: 1, 2, 3 indicating which replicate the measurement was made from
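The reshaping can be sketched with ordinary pandas operations, assuming the column layout described above; tools.wide_to_long may differ in detail.

def wide_to_long_sketch(input_df, output_df, treatment_flag):
    """Stack one (input, output) pair of wide dataframes into long form (sketch)."""
    frames = []
    for rep in (1, 2, 3):
        frames.append(pd.DataFrame({
            'probe': input_df.index,
            'input': input_df['input.%d' % rep].values,
            'output': output_df['output.%d' % rep].values,
            'treatment': treatment_flag,
            'replicate': rep}))
    return pd.concat(frames, ignore_index=True)

# Usage: combine control (treatment=0) and treatment (treatment=1) runs, e.g.
# pd.concat([wide_to_long_sketch(control_input, control_output, 0),
#            wide_to_long_sketch(treatment_input, treatment_output, 1)],
#           ignore_index=True)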
In [14]:
# Convert data from wide to long form
data = tools.wide_to_long(control_input, control_output,
treatment_input, treatment_output)
data.head()
Out[14]:
Long-form data has some advantages for melting into new arrangements for visualisation, analysis, and incorporation of new data. For instance, we can visualise the distributions of input and output log intensities against each other, as below:
In [15]:
# Visualise input v output distributions
tools.plot_input_output_violin(data)
The probe sequences were compared with BLASTN against the annotated coding sequences from the Sakai and DH10B genomes, requiring 100% identity:
$ blastn -query Array/probe_seqlist.fas -subject Sakai/GCF_000008865.1_ASM886v1_cds_from_genomic.fna -outfmt 6 -out probes_blastn_sakai.tab -perc_identity 100
$ blastn -query Array/probe_seqlist.fas -subject DH10B/GCF_000019425.1_ASM1942v1_cds_from_genomic.fna -outfmt 6 -out probes_blastn_dh10b.tab -perc_identity 100
We first identify the probes that match uniquely at 100% identity to a single E. coli gene product from either Sakai or DH10B.
In [16]:
# BLASTN results files
sakai_blastfile = os.path.join('..', 'data', 'probes_blastn_sakai.tab')
dh10b_blastfile = os.path.join('..', 'data', 'probes_blastn_dh10b.tab')
# Obtain a dataframe of unique probes and their BLASTN matches
unique_probe_hits = tools.unique_probe_matches((sakai_blastfile, dh10b_blastfile))
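To illustrate the underlying logic (the tools.py version may apply extra filters, e.g. on alignment length), unique matches could be extracted from the BLASTN tabular output like this:

# Standard BLAST -outfmt 6 column names, with qseqid/sseqid renamed
blast_cols = ['probe', 'match', 'pident', 'length', 'mismatch', 'gapopen',
              'qstart', 'qend', 'sstart', 'send', 'evalue', 'bitscore']

def unique_probe_matches_sketch(blastfiles):
    """Return hits for probes with exactly one match across all files (sketch)."""
    hits = pd.concat([pd.read_csv(fname, sep='\t', names=blast_cols)
                      for fname in blastfiles], ignore_index=True)
    counts = hits['probe'].value_counts()
    return hits[hits['probe'].isin(counts[counts == 1].index)]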
We then add parent gene annotations to the unique probes:
In [17]:
# Sequence data files
sakai_seqfile = os.path.join('..', 'data', 'Sakai', 'GCF_000008865.1_ASM886v1_cds_from_genomic.fna')
dh10b_seqfile = os.path.join('..', 'data', 'DH10B', 'GCF_000019425.1_ASM1942v1_cds_from_genomic.fna')
# Add locus tag information to each unique probe
unique_probe_hits = tools.annotate_seqdata(unique_probe_hits, (sakai_seqfile, dh10b_seqfile))
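The annotation step can be sketched with Biopython: the NCBI *_cds_from_genomic.fna headers carry a [locus_tag=...] field that can be mapped onto each match ID. This is illustrative; tools.annotate_seqdata may differ.

import re

def annotate_seqdata_sketch(df, seqfiles):
    """Add a locus_tag column by parsing FASTA description lines (sketch)."""
    locus_tags = {}
    for fname in seqfiles:
        for record in SeqIO.parse(fname, 'fasta'):
            m = re.search(r'\[locus_tag=([^\]]+)\]', record.description)
            if m:
                locus_tags[record.id] = m.group(1)
    df = df.copy()
    df['locus_tag'] = df['match'].map(locus_tags)
    return df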
We exclude non-uniquely matching probes by performing an inner join between the data and unique_probe_hits dataframes.
In [18]:
censored_data = pd.merge(data, unique_probe_hits[['probe', 'match', 'locus_tag']],
how='inner', on='probe')
censored_data.head()
Out[18]:
As can be seen in the violin plot below, censoring the data in this way removes a large number of low-intensity probes from all datasets.
In [19]:
# Visually inspect the effect of censoring on distribution
tools.plot_input_output_violin(censored_data)
In [20]:
# Create output directory
outdir = 'datasets'
os.makedirs(outdir, exist_ok=True)
# Output files
full_dataset = os.path.join(outdir, "normalised_array_data.tab") # all censored data
reduced_probe_dataset = os.path.join(outdir, "reduced_probe_data.tab") # subset of data grouped by probe
reduced_locus_dataset = os.path.join(outdir, "reduced_locus_data.tab") # subset of data grouped by locus tag
For modelling with Stan, we assign indexes for common probe ID, locus tag, and array (combination of replicate and treatment) to each probe, before writing out the complete dataset.
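The indexing helper can be sketched as below, assuming index_column adds a new integer column named after the original (1-based, since Stan indexes from 1) and that the repXtrt column combining replicate and treatment already exists in the dataframe. The tools.py implementation may differ.

def index_column_sketch(df, colname):
    """Add a 1-based integer index over the unique values of colname (sketch)."""
    value_index = {value: idx for idx, value in
                   enumerate(sorted(df[colname].unique()), start=1)}
    df = df.copy()
    df['%s_index' % colname] = df[colname].map(value_index)
    return df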
In [21]:
# Index on probes
indexed_data = tools.index_column(censored_data, 'probe')
# Index on locus tags
indexed_data = tools.index_column(indexed_data, 'locus_tag')
# Index on array (replicate X treatment)
indexed_data = tools.index_column(indexed_data, 'repXtrt')
# Uncomment the line below to inspect the data
#indexed_data.head(20)
In [22]:
# Write the full dataset to file
indexed_data.to_csv(full_dataset, sep="\t", index=False)
For testing, we want to create two data subsets, one containing a reduced number of probes, and one with a reduced number of genes/locus tags.
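One plausible sketch of the reduction, assuming reduce_dataset keeps every row for a random subsample of the values in the named column (the sample size and seed here are illustrative guesses):

def reduce_dataset_sketch(df, colname, n=10, seed=42):
    """Keep all rows for a reproducible random subsample of colname values (sketch)."""
    random.seed(seed)  # fixed seed so the subset is reproducible
    chosen = random.sample(sorted(df[colname].unique()), n)
    return df[df[colname].isin(chosen)]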
In [23]:
# Reduced probe set
reduced_probes = tools.reduce_dataset(indexed_data, 'probe')
reduced_probes.to_csv(reduced_probe_dataset, sep="\t", index=False)
In [24]:
# Reduced locus tag set
reduced_lts = tools.reduce_dataset(indexed_data, 'locus_tag')
reduced_lts.to_csv(reduced_locus_dataset, sep="\t", index=False)