Supplementary Information: Holmes et al. 2019

1. Data cleaning, normalisation and quality assurance

This notebook describes raw data import, cleaning, and QA, followed by writing the processed data to the `datasets/` subdirectory for use in model fitting.

Table of Contents

  1. Microarray data
    1. Import array data
  2. Data QA
    1. Problematic probes
    2. Interpolation for problematic probes
  3. Normalisation
  4. Wide to long form
  5. Probe matches to Sakai and DH10B
  6. Write data

Python imports


In [1]:
%pylab inline

import os
import random
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import scipy
import seaborn as sns

from Bio import SeqIO

import tools


Populating the interactive namespace from numpy and matplotlib

Microarray data

Raw array data was previously converted to plain-text comma-separated values (CSV) format from two `Excel` files:
  • The file `AH alldata 12082013.xlsx` was converted to `data/treatment_unix_endings.csv`
  • The file `AH alldata expt1 flagged 05092013.xlsx` was converted to `data/control_unix_endings_flags.csv`

These describe microarray results for samples that underwent two treatments:

  • in vitro growth only - i.e. control: `data/control_unix_endings_flags.csv`
  • in vitro growth and plant passage - i.e. treatment: `data/treatment_unix_endings.csv`

In [2]:
# Input array data filepaths
controlarrayfile = os.path.join('..', 'data', 'control_unix_endings_flags.csv')  # control experiment array data (preprocessed)
treatmentarrayfile = os.path.join('..', 'data', 'treatment_unix_endings.csv')     # treatment experiment array data (preprocessed)

Import array data


In [3]:
control = pd.read_csv(controlarrayfile, sep=',', skiprows=4, index_col=0)
treatment = pd.read_csv(treatmentarrayfile, sep=',', skiprows=4, index_col=0)

# Uncomment the lines below to inspect the first few rows of each dataframe
#control.head()
#treatment.head()

In [4]:
len(control)


Out[4]:
10751
We reduce the full dataset to only the raw intensity values. We also rename the columns in each of the `control` and `treatment` dataframes.

In both control and treatment datasets, the mapping of experimental samples (input and output) across the three replicates is:

  • replicate 1 input: Raw $\rightarrow$ input.1
  • replicate 1 output: Raw.1 $\rightarrow$ output.1
  • replicate 2 input: Raw.2 $\rightarrow$ input.2
  • replicate 2 output: Raw.3 $\rightarrow$ output.2
  • replicate 3 input: Raw.4 $\rightarrow$ input.3
  • replicate 3 output: Raw.5 $\rightarrow$ output.3

In [5]:
colnames_in = ['Raw', 'Raw.1', 'Raw.2', 'Raw.3', 'Raw.4', 'Raw.5']  # raw data columns
colnames_out = ['input.1', 'output.1', 'input.2', 'output.2', 'input.3', 'output.3']  # renamed raw data columns

# Reduce control and treatment arrays to raw data columns only
control = control[colnames_in]
control.columns = colnames_out
treatment = treatment[colnames_in]
treatment.columns = colnames_out

Data QA

We expect good agreement between input and output raw intensities for each replicate control or treatment experiment, and good agreement across replicates within the controls and within the treatments. We inspect this agreement visually with a matrix of scatterplots, below.

The `plot_correlation()` function can be found in the accompanying `tools.py` module.
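
`tools.py` is distributed alongside this notebook and is not reproduced here. As a rough sketch of the kind of plot it produces - a scatterplot matrix with Spearman's R annotated on each panel - something like the following would work; this is an illustrative stand-in assuming a seaborn PairGrid, not the actual implementation:

import matplotlib.pyplot as plt
import scipy.stats
import seaborn as sns

def plot_correlation_sketch(df):
    """Scatterplot matrix of all column pairs, annotated with Spearman's R."""
    def annotate_r(x, y, **kwargs):
        r, _ = scipy.stats.spearmanr(x, y)  # rank correlation, robust to scale
        plt.gca().annotate("R = %.2f" % r, xy=(0.05, 0.9),
                           xycoords="axes fraction")
    grid = sns.PairGrid(df)
    grid.map_diag(plt.hist)             # intensity distribution for each array
    grid.map_offdiag(plt.scatter, s=2)  # pairwise intensity scatter
    grid.map_offdiag(annotate_r)
    return grid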


In [6]:
# Plot correlations for control data
tools.plot_correlation(control);


There is good visual correlation between the intensities for the control arrays, and the Spearman's R values also indicate good correlation.


In [7]:
# Plot correlations for treatment data
tools.plot_correlation(treatment);


There is - mostly - good visual correlation between the intensities for the treatment arrays, and the Spearman's R values also indicate good correlation. There appear to be three problematic probes in replicate 3 that we may need to deal with in the data cleanup.

Taken together, these plots indicate:
  • the intensities of the control arrays are systematically larger than intensities for the treatment arrays, suggesting that the effects of noise may be proportionally greater for the treatment arrays. This might be a concern for reliably inferring enrichment or depletion in the treatment.
  • the control arrays are good candidates for quantile normalisation (QN; $r > 0.95$, with similar density distributions)
  • the treatment array `input.3` dataset is potentially problematic, due to three treatment probe datapoints with intensities greater than 40,000 units having large leverage.

Problematic probes

We can readily identify problematic probes in treatment replicate 3, as they are the only probes with intensity greater than 40,000. The problematic probes are:
  • A_07_P000070
  • A_07_P061472
  • A_07_P052489

In [8]:
# Select outlying treatment input.3 values
treatment.loc[treatment['input.3'] > 4e4]


Out[8]:
                 input.1     output.1     input.2     output.2    input.3     output.3
Systematic
A_07_P000070  4251.70600   3054.75400  3997.707000  3013.079000   44857.48  3283.400000
A_07_P061472    10.65405      4.49928     1.687277     3.855374  145625.60     8.290274
A_07_P052489   929.70990   1056.72400   987.524200   957.422400  110306.40  1327.634000

In [9]:
# Define problem probes:
problem_probes = list(treatment.loc[treatment['input.3'] > 4e4].index)

Interpolation for problematic probes

We replace the outlying `input.3` values for the three problematic probes with interpolated values. Assuming that `input.1` and `input.2` are typical of the input intensities for these probes, we substitute the mean of those two values for `input.3` in each case.


In [10]:
# Interpolate values: replace each outlying input.3 value with the mean
# of the corresponding input.1 and input.2 values
# (.loc assignment; DataFrame.set_value is deprecated and has been removed)
treatment.loc[problem_probes, 'input.3'] = treatment.loc[problem_probes,
                                                         ['input.1', 'input.2']].mean(axis=1)
treatment.loc[problem_probes]


Out[10]:
                 input.1     output.1     input.2     output.2      input.3     output.3
Systematic
A_07_P000070  4251.70600   3054.75400  3997.707000  3013.079000  4124.706500  3283.400000
A_07_P061472    10.65405      4.49928     1.687277     3.855374     6.170663     8.290274
A_07_P052489   929.70990   1056.72400   987.524200   957.422400   958.617050  1327.634000

We can visualise the change in correlation for the treatment dataframe that results:


In [11]:
# Plot correlations for treatment data
tools.plot_correlation(treatment);


Normalisation

We expect the array intensity distribution to vary according to whether the sample was from the input (strong) or output (weak) set, and whether the sample came from the control or treatment pools. We therefore divide the dataset into four independently-normalised components:

  • control_input
  • control_output
  • treatment_input
  • treatment_output


We have established that the input and output arrays in both control and treatment conditions correlate strongly across all intensities and have similar intensity distributions, so we are justified in using quantile (mean) normalisation.
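
For reference, quantile (mean) normalisation replaces every value with the mean, taken across the columns being normalised, of the values sharing its rank. A minimal sketch of this standard scheme follows; the actual `tools.quantile_norm` implementation may differ in detail (e.g. in how ties are handled):

import numpy as np
import pandas as pd

def quantile_norm_sketch(df, columns):
    """Quantile-normalise the named columns of df; other columns untouched."""
    sub = df[columns]
    # Mean of the i-th smallest value, taken across all named columns
    rank_means = pd.Series(np.sort(sub.values, axis=0).mean(axis=1),
                           index=np.arange(1, len(sub) + 1))
    # Replace each value by the mean for its rank ('first' breaks ties)
    normed = sub.rank(method='first').astype(int).apply(
        lambda col: col.map(rank_means))
    result = df.copy()
    result[columns] = normed
    return result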


In [12]:
input_cols = ['input.1', 'input.2', 'input.3']      # input columns
output_cols = ['output.1', 'output.2', 'output.3']  # output columns

# Normalise inputs and outputs for control and treatment separately
control_input = tools.quantile_norm(control, columns=input_cols)
control_output = tools.quantile_norm(control, columns=output_cols)
treatment_input = tools.quantile_norm(treatment, columns=input_cols)
treatment_output = tools.quantile_norm(treatment, columns=output_cols)

We visualise the resulting distributions, in violin plots:


In [13]:
# Make violinplots of normalised data  
tools.plot_normalised(control_input, control_output,
                      treatment_input, treatment_output)


These plots illustrate the relative reduction in measured array intensity from the control to the treatment arrays, for both the input and output sets.

Wide to long form

We have four dataframes containing normalised data:

  • control_input
  • control_output
  • treatment_input
  • treatment_output

Each dataframe is indexed by the array probe systematic name, with three columns that correspond to replicates 1, 2, and 3 for either a control or a treatment run. For downstream analysis we want to organise this data as the following columns:

  • index: unique ID
  • probe: probe name (these apply across treatment/control and input/output)
  • input: normalised input intensity value (for a particular probe and replicate)
  • output: normalised output intensity value (for a particular probe and replicate)
  • treatment: 0/1 indicating whether the measurement was made for the control or treatment sample
  • replicate: 1, 2, 3 indicating which replicate the measurement was made from


We will add other columns with relevant data later, and to enable this, we convert the `control` and `treatment` data frames from wide (e.g. `input.1`, `input.2`, `input.3` columns) to long (e.g. `probe`, `input`, `output`, `replicate`) form - once for the control data, and once for the treatment data. We match on a multi-index of probe and replicate.
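
A minimal sketch of the kind of reshaping `tools.wide_to_long` performs is below. This is illustrative only, not the actual implementation; the log columns use natural logs, consistent with the values shown in the output that follows:

import numpy as np
import pandas as pd

def wide_to_long_sketch(ctl_in, ctl_out, trt_in, trt_out):
    """Melt the four wide dataframes into one long-form dataframe,
    one row per (probe, replicate, treatment) combination."""
    frames = []
    for trt, (df_in, df_out) in enumerate([(ctl_in, ctl_out),
                                           (trt_in, trt_out)]):
        for rep in (1, 2, 3):
            frames.append(pd.DataFrame({
                'probe': df_in.index,
                'replicate': rep,
                'treatment': trt,                      # 0=control, 1=treatment
                'repXtrt': 'rep%dtrt%d' % (rep, trt),  # array identifier
                'input': df_in['input.%d' % rep].values,
                'output': df_out['output.%d' % rep].values}))
    data = pd.concat(frames, ignore_index=True)
    data['log_input'] = np.log(data['input'])    # natural log intensities
    data['log_output'] = np.log(data['output'])
    return data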


In [14]:
# Convert data from wide to long form
data = tools.wide_to_long(control_input, control_output,
                          treatment_input, treatment_output)
data.head()


Out[14]:
          probe  replicate  treatment   repXtrt      input     output  log_input  log_output
0  A_07_P010355          1          0  rep1trt0  10.049401   5.969886   2.307513    1.786728
1  A_07_P039196          1          0  rep1trt0   7.638637   7.230899   2.033219    1.978363
2  A_07_P039195          1          0  rep1trt0  14.394261  15.832703   2.666830    2.762078
3  A_07_P039186          1          0  rep1trt0   7.872936  18.637283   2.063431    2.925164
4  A_07_P039181          1          0  rep1trt0  10.100042   7.566543   2.312540    2.023736

Long form data has some advantages for melting into new arrangements for visualisation, analysis, and incorporation of new data. For instance, we can visualise the distributions of input and output log intensities against each other, as below:


In [15]:
# Visualise input v output distributions
tools.plot_input_output_violin(data)


This visualisation again shows that treatment intensities are generally lower than control intensities, but also suggests that the bulk of output intensities are lower than input intensities.

There is a population of low-intensity values for each set of arrays, however. These appear to have a slight increase in intensity in the output, compared to input arrays.
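
Plots of this kind are straightforward to build directly from the long-form dataframe. As an illustration, a sketch using seaborn (not the `tools.plot_input_output_violin` implementation; the melted column names here are arbitrary):

import seaborn as sns

# Gather log intensities into a single column, keyed by input/output stage
melted = data.melt(id_vars=['treatment', 'replicate'],
                   value_vars=['log_input', 'log_output'],
                   var_name='stage', value_name='log_intensity')
sns.violinplot(data=melted, x='treatment', y='log_intensity', hue='stage')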

Probe matches to Sakai and DH10B

Evidence for potential hybridisation of probes to DH10B or Sakai isolates was determined by default `BLASTN` query of each probe sequence against chromosome and plasmid feature nucleotide sequences from the NCBI records:
  • `GCF_000019425.1_ASM1942v1_cds_from_genomic.fna`
  • `GCF_000008865.1_ASM886v1_cds_from_genomic.fna`

$ blastn -query Array/probe_seqlist.fas -subject Sakai/GCF_000008865.1_ASM886v1_cds_from_genomic.fna -outfmt 6 -out probes_blastn_sakai.tab -perc_identity 100
$ blastn -query Array/probe_seqlist.fas -subject DH10B/GCF_000019425.1_ASM1942v1_cds_from_genomic.fna -outfmt 6 -out probes_blastn_dh10b.tab -perc_identity 100

We first identify the probes that match uniquely, at 100% identity, to a single *E. coli* gene product from either Sakai or DH10B.
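
A minimal sketch of the kind of filtering `tools.unique_probe_matches` performs is below. This is illustrative, not the actual implementation: here uniqueness is taken across both BLAST output files combined, and the column names follow the standard `-outfmt 6` fields:

import pandas as pd

# Standard BLAST tabular (-outfmt 6) column names
BLAST_COLS = ['qseqid', 'sseqid', 'pident', 'length', 'mismatch', 'gapopen',
              'qstart', 'qend', 'sstart', 'send', 'evalue', 'bitscore']

def unique_probe_matches_sketch(blastfiles):
    """Return rows for probes with exactly one hit across all files."""
    hits = pd.concat([pd.read_csv(fname, sep='\t', names=BLAST_COLS)
                      for fname in blastfiles], ignore_index=True)
    counts = hits['qseqid'].value_counts()
    unique = hits[hits['qseqid'].isin(counts[counts == 1].index)]
    return unique.rename(columns={'qseqid': 'probe', 'sseqid': 'match'})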


In [16]:
# BLASTN results files
sakai_blastfile = os.path.join('..', 'data', 'probes_blastn_sakai.tab')
dh10b_blastfile = os.path.join('..', 'data', 'probes_blastn_dh10b.tab')

# Obtain a dataframe of unique probes and their BLASTN matches
unique_probe_hits = tools.unique_probe_matches((sakai_blastfile, dh10b_blastfile))

We then add parent gene annotations to the unique probes:


In [17]:
# Sequence data files
sakai_seqfile = os.path.join('..', 'data', 'Sakai', 'GCF_000008865.1_ASM886v1_cds_from_genomic.fna')
dh10b_seqfile = os.path.join('..', 'data', 'DH10B', 'GCF_000019425.1_ASM1942v1_cds_from_genomic.fna')

# Add locus tag information to each unique probe
unique_probe_hits = tools.annotate_seqdata(unique_probe_hits, (sakai_seqfile, dh10b_seqfile))
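
The locus tag can be recovered from the `[locus_tag=...]` field of each record's description in the NCBI `*_cds_from_genomic.fna` files. A sketch of the kind of annotation `tools.annotate_seqdata` performs (illustrative, not the actual implementation):

from Bio import SeqIO

def annotate_seqdata_sketch(df, seqfiles):
    """Map each BLAST match ID to the locus_tag parsed from the CDS FASTA."""
    locus_tags = {}
    for seqfile in seqfiles:
        for record in SeqIO.parse(seqfile, 'fasta'):
            # Descriptions carry fields such as [locus_tag=ECs1176]
            tag = record.description.split('[locus_tag=')[-1].split(']')[0]
            locus_tags[record.id] = tag
    annotated = df.copy()
    annotated['locus_tag'] = annotated['match'].map(locus_tags)
    return annotated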

We are certainly interested in probes that hybridise unambiguously to Sakai or to DH10B. However, the [array was designed to report on several *E. coli* isolates](http://www.ebi.ac.uk/arrayexpress/arrays/A-GEOD-13359/?ref=E-GEOD-46455), and not all probes should be expected to hybridise, so we can consider the non-uniquely matching probes not to be of interest and censor them.

A strong reason to censor probes is that we will be estimating locus tag/gene-level treatment effects, on the basis of probe-level intensity measurements. Probes that may be reporting on multiple genes may mislead our model fit, and so are better excluded.

We exclude non-uniquely matching probes by performing an inner join between the `data` and `unique_probe_hits` dataframes.


In [18]:
censored_data = pd.merge(data, unique_probe_hits[['probe', 'match', 'locus_tag']],
                         how='inner', on='probe')
censored_data.head()


Out[18]:
          probe  replicate  treatment   repXtrt       input     output  log_input  log_output                                 match locus_tag
0  A_07_P052986          1          0  rep1trt0   13.354233  10.507812   2.591833    2.352119  lcl|NC_002695.1_cds_NP_309203.1_1134   ECs1176
1  A_07_P052986          2          0  rep2trt0    6.815142  11.374824   1.919147    2.431403  lcl|NC_002695.1_cds_NP_309203.1_1134   ECs1176
2  A_07_P052986          3          0  rep3trt0  146.893973   8.379672   4.989711    2.125809  lcl|NC_002695.1_cds_NP_309203.1_1134   ECs1176
3  A_07_P052986          1          1  rep1trt1    0.629964   2.408282  -0.462092    0.878914  lcl|NC_002695.1_cds_NP_309203.1_1134   ECs1176
4  A_07_P052986          2          1  rep2trt1    0.651612   6.384148  -0.428306    1.853818  lcl|NC_002695.1_cds_NP_309203.1_1134   ECs1176

This leaves us with a dataset comprising:
  • 49872 datapoints (rows)
  • 8312 unique probes
  • 6084 unique locus tags
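
These figures are internally consistent (8312 probes × 6 arrays = 49872 rows), and can be checked directly from the dataframe:

# Cross-check the summary figures quoted above
print(len(censored_data))                    # 49872 rows
print(censored_data['probe'].nunique())      # 8312 unique probes
print(censored_data['locus_tag'].nunique())  # 6084 unique locus tags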

As can be seen in the violin plot below, censoring the data in this way removes a large number of low-intensity probes from all datasets.


In [19]:
# Visually inspect the effect of censoring on distribution
tools.plot_input_output_violin(censored_data)


Write data

We write the censored, normalised, long-format data to the `datasets/` subdirectory.

In [20]:
# Create output directory
outdir = 'datasets'
os.makedirs(outdir, exist_ok=True)

# Output files
full_dataset = os.path.join(outdir, "normalised_array_data.tab")  # all censored data
reduced_probe_dataset = os.path.join(outdir, "reduced_probe_data.tab")  # subset of data grouped by probe
reduced_locus_dataset = os.path.join(outdir, "reduced_locus_data.tab")  # subset of data grouped by locus tag

For modelling with Stan, we assign integer indexes for probe ID, locus tag, and array (the combination of replicate and treatment) to each measurement, before writing out the complete dataset.
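
Stan indexes from 1, so each index runs over the unique values of the relevant column. A sketch of the kind of helper involved (the `<column>_index` naming is illustrative, not necessarily that of `tools.index_column`):

def index_column_sketch(df, colname):
    """Add a 1-based integer index over the unique values of colname."""
    mapping = {value: idx for idx, value
               in enumerate(sorted(df[colname].unique()), start=1)}
    indexed = df.copy()
    indexed['%s_index' % colname] = indexed[colname].map(mapping)
    return indexed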


In [21]:
# Index on probes
indexed_data = tools.index_column(censored_data, 'probe')

# Index on locus tags
indexed_data = tools.index_column(indexed_data, 'locus_tag')

# Index on array (replicate X treatment)
indexed_data = tools.index_column(indexed_data, 'repXtrt')

# Uncomment the line below to inspect the data
#indexed_data.head(20)

In [22]:
# Write the full dataset to file
indexed_data.to_csv(full_dataset, sep="\t", index=False)

For testing, we want to create two data subsets, one containing a reduced number of probes, and one with a reduced number of genes/locus tags.
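
A sketch of the kind of subsetting `tools.reduce_dataset` performs (illustrative only; the sample size of 50 here is an arbitrary assumption):

import random

def reduce_dataset_sketch(df, colname, n=50):
    """Keep only rows whose colname value is in a random sample of n values."""
    keep = random.sample(list(df[colname].unique()), n)
    return df[df[colname].isin(keep)]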


In [23]:
# Reduced probe set
reduced_probes = tools.reduce_dataset(indexed_data, 'probe')
reduced_probes.to_csv(reduced_probe_dataset, sep="\t", index=False)

In [24]:
# Reduced locus tag set
reduced_lts = tools.reduce_dataset(indexed_data, 'locus_tag')
reduced_lts.to_csv(reduced_locus_dataset, sep="\t", index=False)