Sequencing data have already been re-formatted following exploratory analysis and are stored in a number of formats based on analysis needs. Two main dataset representations are:
The aim of this part is to perform a manual curation of the allele data as processed until this point before embarking on the remainder of the analysis. First start with a reality check to make sure that the pre-processing worked as expected.
The re-formating so far was done using
2_process_patient_data.py
3_allele_dataframe.py
The outputs of this scripts are:
Start by setting up the environment.
In [1]:
#Python core packages
from collections import Counter
import string
import pickle
import datetime
import warnings
warnings.filterwarnings("ignore")
#Additional Python packages
import tqdm
#Scientific packages
from scipy import stats as ss
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
from sklearn import utils as sku
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
mpl.style.use('classic')
mpl.rcParams['font.family']='Arial'
mpl.rcParams['legend.numpoints'] = 1
%matplotlib inline
In [2]:
#FIGURE DETAILS
tick_size = 8
label_size = 10
title_size = 10
text_size = 8
single_column_figure = (3.35,3.35)
double_column_figure_r = (6.69,5)
double_column_figure_s = (6.69,6.69)
In [3]:
#LOAD FUNCTIONS
%cd '../src/data/'
import serial_functions as serf
%cd '../../notebooks'
#LOAD DATA
INTERIM_PATH = '../data/interim/'
OLD_PROJECT_PATH = '../../150306_Qingyun_serial/git/data/'
PATIENT_DATA = pickle.load(open('{}2_patient_data.pkl'.format(INTERIM_PATH),'rb'))
ANNOTATION_DICT = pickle.load(open('{}2_unfixed_annotation.pkl'.format(INTERIM_PATH),'rb'))
FIXED_ANNOTATION_DICT = pickle.load(open('{}2_fixed_annotation.pkl'.format(INTERIM_PATH),'rb'))
DATA_DICT = pickle.load(open('{}2_unfixed_mutations.pkl'.format(INTERIM_PATH),'rb'))
FIXED_ARRAY = pickle.load(open('{}2_fixed_mutations.pkl'.format(INTERIM_PATH),'rb'))
ALL = pd.read_csv('{}3_ALLELE_data.csv'.format(INTERIM_PATH), index_col=0)
ALL_old = pd.read_csv('{}supplementary_allele_data.csv'.format(OLD_PROJECT_PATH), index_col=0)
The main difference between the DataFrame used for initial analysis and the one generated by 3_allele_dataframe.py is the introduction of an additional filtering step removing all alleles that never occur at a frequency of more than 1.5%. This was done on the back of the analysis presented in notebooks 1.1 and 1.2.
To make sure that the code worked well I will compare the entries in the new DataFrame to the old one.
Generate allele identifiers as follows: PATIENT_TIME_LOCUS
In [3]:
ALL['ALLELE_ID'] = ['{}_{}_{}'.format(x,y,z) for (x,y,z) in zip(ALL.PATIENT_ID, ALL.TIME, ALL.LOCUS)]
ALL_old['ALLELE_ID'] = ['{}_{}_{}'.format(x,y,z) for (x,y,z) in zip(ALL_old.ID, ALL_old.TIME, ALL_old.LOCUS)]
In [4]:
set(ALL.ALLELE_ID)-set(ALL_old.ALLELE_ID)
Out[4]:
Ignore the later timepoint data which have been excluded in the earlier iteration of the analysis - perform this in the data polishing step.
There is a single allele that is only present in the new DataFrame. This arose because of my decision to take 1200_S1 as the representive sample at T=0 for Patient 12. Previously, I used the merged sample. Recall that T=0 in Patient 12 has 3 samples. The merged sample is exactly that, a re-analysis of the merged fastq files. The issue with this is the extraordinarily high coverage of this sample that may cause it to behave differently from other samples. As a result I chose S1 instead.
On a similar note, Patient 04 had two samples at time 4. I only used 0404_S2. The results were basically equivalent.
Next, let's plot the frequencies to make sure they fit.
In [9]:
plt.figure(figsize=(8,8))
for x in set(ALL.ALLELE_ID[(ALL.TIME<12)]):
try:
plt.plot(ALL.FREQUENCY[(ALL.ALLELE_ID==x)],
ALL_old.FREQ[(ALL_old.ALLELE_ID==x)],
'o', ms=10, mec='black', mfc='none')
except:
print(x)
plt.xlabel('New DataFrame', size=18)
plt.ylabel('Old DataFrame', size=18)
Out[9]:
Most are fine, some are off... print out the discrepancies
In [8]:
for x in set(ALL.ALLELE_ID[ALL.TIME<12]):
try:
#Check the frequency is consistent
_delta = float(ALL.FREQUENCY[ALL.ALLELE_ID==x])\
-float(ALL_old.FREQ[ALL_old.ALLELE_ID==x])
#Get rid of any floating point differences
_delta = np.round(_delta, decimals=4)
#Find problematic alleles
if _delta!=0:
print(x,_delta)
except:
print('Problematic allele:{}'.format(x))
OK, this is basically fine. The issues come from an old coding error in Patient 4 (I added up the frequencies of the two samples by mistake) and then variation between the merge and S1 samples of Patient 12. All the discrepancies therefore come from the way I chose to treat the samples differently.
In [4]:
plt.figure('SFS', figsize=(12,8))
plt.subplot(122)
plt.hist(np.array(ALL.FREQUENCY[(ALL.TIME<12)]), bins=np.arange(0,1.01,0.01),
label=r'$M.tuberculosis$ - TB patients',
color='black')
plt.xticks(np.arange(.2,1.2,.2),['20%','40%','60%','80%','100%'], size=12)
plt.yticks(size=12)
plt.xlabel('Allele frequency', size=18)
plt.vlines(0.05,0,140,colors='red', linestyles='dotted')
plt.vlines(0.1,0,140,colors='red', lw=2)
plt.title('FREQUENCY FILTERED', loc='left', size=18)
plt.subplot(121)
plt.hist(np.array(ALL_old.FREQ), bins=np.arange(0,1.01,0.01),
label=r'$M.tuberculosis$ - TB patients',
color='black')
plt.xticks(np.arange(.2,1.2,.2),['20%','40%','60%','80%','100%'], size=12)
plt.yticks(size=12)
plt.ylabel('Number of alleles', size=18)
plt.xlabel('Allele frequency', size=18)
plt.vlines(0.05,0,900,colors='red', linestyles='dotted')
plt.vlines(0.1,0,900,colors='red', lw=2)
plt.title('NON FREQUENCY FILTERED', loc='left', size=18)
#plt.savefig('../Figures/160113_Folded_SFS_TBonly.pdf')
Out[4]:
Clearly less extreme than before, however still very similar in outcome. Let's have a look at the distribution of alleles across the genome.
In [11]:
plt.figure(figsize=(12,8))
plt.subplot(211)
plt.title('NON FREQUENCY FILTERED', loc='left', size=18)
plt.hist(list(ALL_old.LOCUS), bins=np.arange(0,4.5e6,4000), color='black')
plt.ylabel('Coverage', size=18)
plt.ylim(0,20)
plt.subplot(212)
plt.title('FREQUENCY FILTERED', loc='left', size=18)
plt.hist(list(ALL.LOCUS[(ALL.TIME<12)]), bins=np.arange(0,4.5e6,4000), color='black')
plt.xlabel('Locus', size=18)
plt.ylabel('Coverage', size=18)
plt.ylim(0,20)
plt.hlines(5,0,4.5e6,colors='red',linestyles='dotted')
plt.subplots_adjust(hspace=0.3)
There are a couple of regions that still have several mutations. Let's identify those with 5+ loci being affected.
In [12]:
c,b = np.histogram(list(ALL.LOCUS[(ALL.TIME<12)]), bins=np.arange(0,4.5e6,4000))
b[c>5]
Out[12]:
In [58]:
ALL[(ALL.LOCUS[(ALL.TIME<12)]>776000)&(ALL.LOCUS[(ALL.TIME<12)]<780000)]
Out[58]:
Almost all mutations in Patient 12, all NSY in Rv0678. This is MmpR, which controls the expression of MmpS/L4 both involved in BDQ resistance as well as CFZ cross-resistance. This patient received CFZ treatment, and clearly had small subsets of CFZ resistant strains. Although none swept.
Large number of mutations comes from longitudinal data.
In [60]:
ALL[(ALL.LOCUS[(ALL.TIME<12)]>1540000)&(ALL.LOCUS[(ALL.TIME<12)]<1544000)]
Out[60]:
Stable minor variant in Rv1371 Patient 12.
In [67]:
ALL[(ALL.LOCUS[(ALL.TIME<12)]>2344000)&(ALL.LOCUS[(ALL.TIME<12)]<2348000)]
Out[67]:
This looks like a noisy locus, notice that it is always the same two loci affected in tandem in all patients. This is probably a technical issue. Manually remove.
In [62]:
ALL[(ALL.LOCUS[(ALL.TIME<12)]>2548000)&(ALL.LOCUS[(ALL.TIME<12)]<2552000)]
Out[62]:
Stable minor variant in Patient07 and a semi-stable minor variant in Patient 10. Not known what this protien does.
In [63]:
ALL[(ALL.LOCUS[(ALL.TIME<12)]>2824000)&(ALL.LOCUS[(ALL.TIME<12)]<2828000)]
Out[63]:
Stable minor variant in one patient.
In [64]:
ALL[(ALL.LOCUS[(ALL.TIME<12)]>3688000)&(ALL.LOCUS[(ALL.TIME<12)]<3692000)]
Out[64]:
Two stable minor variant in lpdA an upstream IGR variant that may have arisen from the expansion of an unrelated clone. These are upstream of glpD2 and corroborate a large number of glpK mutations. May represent XDR adaptation.
Final scrouple, let me re-do this with a narrower window (500bp) with a slightly lower cutoff.
In [71]:
c,b = np.histogram(list(ALL.LOCUS[(ALL.TIME<12)]), bins=np.arange(0,4.5e6,500))
b[c>4]
Out[71]:
In [81]:
ALL[(ALL.LOCUS[(ALL.TIME<12)]>3945000)&(ALL.LOCUS[(ALL.TIME<12)]<3945500)]
Out[81]:
Summary of the above loci:
Integrated all of this information into:
4_allele_data_polishing.py
The dataframe is then used to update the nd.array as well. Outputs are in data/processed:
In [31]:
PROCESSED_PATH = '../data/processed/'
ALL = pd.read_csv('{}ALLELE_data.csv'.format(PROCESSED_PATH), index_col=0)
PATIENT_DATA = pickle.load(open('{}ALLELE_info.pkl'.format(PROCESSED_PATH),'rb'))
In [4]:
plt.figure('SFS', figsize=(8,8))
plt.hist(np.array(ALL.FREQUENCY[(ALL.TIME<12)]), bins=np.arange(0,1.01,0.01),
label=r'$M.tuberculosis$ - TB patients',
histtype='step',
color='black', lw=2)
plt.xticks(np.arange(.2,1.2,.2),['20%','40%','60%','80%','100%'], size=12)
plt.yticks(np.arange(0,150,25),size=14)
plt.ylim(0,131.25)
plt.xlabel('Allele frequency', size=18)
plt.title('A', loc='left', size=24)
plt.ylabel('Number of alleles', size=18)
plt.twinx()
plt.plot(sorted(list(ALL.FREQUENCY[(ALL.TIME<12)])),
np.linspace(0,1,num=len(ALL.FREQUENCY[(ALL.TIME<12)])),
'-',
label='Cumulative distribution',
color='red',lw=2)
plt.ylabel('Proportion of alleles', size=18)
plt.yticks(np.arange(0.2,1.2,0.2),size=14)
plt.legend(frameon=False, fontsize=14, loc=9)
plt.vlines(0.2,0,0.9,colors='black', linestyles='dotted')
plt.hlines(0.8,0.1,1,colors='black', linestyles='dotted')
plt.ylim(0,1.05)
Out[4]:
In [5]:
plt.figure('SFS', figsize=(8,8))
plt.hist(np.array(ALL.FREQUENCY[(ALL.TIME<12)]), bins=np.arange(0,1.01,0.01),
label=r'$M.tuberculosis$ - TB patients',
color='black')
plt.xticks(np.arange(.2,1.2,.2),['20%','40%','60%','80%','100%'], size=12)
plt.yticks(size=14)
plt.xlabel('Allele frequency', size=18)
plt.ylabel('Number of alleles', size=18)
plt.title('A', loc='left', size=24)
plt.ylim(0,125)
#Add inset
a = plt.axes([.4, .4, .4, .4], aspect='equal')
plt.plot(sorted(list(ALL.FREQUENCY[(ALL.TIME<12)])),
np.linspace(0,1,num=len(ALL.FREQUENCY[(ALL.TIME<12)])),
'-', label='Cumulative\ndistribution',
color='red',lw=2)
plt.ylabel('Proportion of alleles', size=14)
plt.xlabel('Allele frequency', size=14)
plt.yticks(size=12)
plt.xticks(np.arange(.2,1.2,.2),['20%','40%','60%','80%','100%'], size=12)
plt.legend(frameon=False, fontsize=14, loc=4)
plt.vlines(0.2,0,1,colors='black', linestyles='dotted')
plt.hlines(0.8,0,1,colors='black', linestyles='dotted')
# Hide the right and top spines
a.spines['right'].set_visible(False)
a.spines['top'].set_visible(False)
# Only show ticks on the left and bottom spines
a.yaxis.set_ticks_position('left')
a.xaxis.set_ticks_position('bottom')
plt.tight_layout()
#plt.savefig('../reports/figures/3_SFS.pdf')
#plt.savefig('../reports/figures/3_SFS.png',dpi=300)
In [12]:
plt.figure('SFS', figsize=(8,8))
plt.hist(np.array(ALL.FREQUENCY[(ALL.TIME<12)]), bins=np.arange(0,1.01,0.01),
label=r'$M.tuberculosis$ - TB patients',
color='black')
plt.xticks(np.arange(.2,1.2,.2),['20%','40%','60%','80%','100%'], size=12)
plt.yticks(size=14)
plt.xlabel('Allele frequency', size=18)
plt.ylabel('Number of alleles', size=18)
plt.title('A', loc='left', size=24)
plt.ylim(0,125)
#Add inset
a = plt.axes([.4, .4, .4, .4], aspect='equal')
plt.plot(sorted(list(ALL.FREQUENCY[(ALL.TIME<12)])),
np.linspace(0,1,num=len(ALL.FREQUENCY[(ALL.TIME<12)])),
'-', label='All v-SNPs',
color='black',lw=3)
plt.plot(sorted(list(ALL.FREQUENCY[(ALL.TIME<12)&(ALL.RECURRENT==0)])),
np.linspace(0,1,num=len(ALL.FREQUENCY[(ALL.TIME<12)&(ALL.RECURRENT==0)])),
'-', label='Unstable v-SNPs',
color='gold',lw=3)
plt.plot(sorted(list(ALL.FREQUENCY[(ALL.TIME<12)&(ALL.RECURRENT==1)])),
np.linspace(0,1,num=len(ALL.FREQUENCY[(ALL.TIME<12)&(ALL.RECURRENT==1)])),
'-', label='Recurrent v-SNPs',
color='dodgerblue',lw=3)
plt.ylabel('Proportion of alleles', size=14)
plt.xlabel('Allele frequency', size=14)
plt.yticks(size=12)
plt.xticks(np.arange(.2,1.2,.2),['20%','40%','60%','80%','100%'], size=12)
plt.legend(frameon=False, fontsize=14, loc=4)
plt.vlines(0.2,0,1,colors='black', linestyles='dotted')
plt.hlines(0.8,0,1,colors='black', linestyles='dotted')
# Hide the right and top spines
a.spines['right'].set_visible(False)
a.spines['top'].set_visible(False)
# Only show ticks on the left and bottom spines
a.yaxis.set_ticks_position('left')
a.xaxis.set_ticks_position('bottom')
plt.tight_layout()
#plt.savefig('../reports/figures/3_SFS_Alt.pdf')
#plt.savefig('../reports/figures/3_SFS_Alt.png',dpi=300)
In [36]:
plt.figure('SFS', figsize=(5,5))
plt.hist(np.array(ALL.FREQUENCY[(ALL.TIME<12)]), bins=np.arange(0,1.01,0.01),
label=r'$M.tuberculosis$ - TB patients',
color='black')
plt.xticks(np.arange(.2,1.2,.2),['20%','40%','60%','80%','100%'], size=tick_size)
plt.yticks(size=tick_size)
plt.xlabel('Allele frequency', size=label_size)
plt.ylabel('Number of alleles', size=label_size)
plt.title('A', loc='left', size=14)
plt.ylim(0,125)
plt.xlim(0,0.99)
#Add inset
a = plt.axes([.4, .4, .4, .4], aspect='equal')
plt.plot(sorted(list(ALL.FREQUENCY[(ALL.TIME<12)])),
np.linspace(0,1,num=len(ALL.FREQUENCY[(ALL.TIME<12)])),
'-', label='All v-SNPs',
color='black',lw=2)
plt.plot(sorted(list(ALL.FREQUENCY[(ALL.TIME<12)&(ALL.RECURRENT==0)])),
np.linspace(0,1,num=len(ALL.FREQUENCY[(ALL.TIME<12)&(ALL.RECURRENT==0)])),
'-', label='Unstable v-SNPs',
color='gold',lw=2)
plt.plot(sorted(list(ALL.FREQUENCY[(ALL.TIME<12)&(ALL.RECURRENT==1)])),
np.linspace(0,1,num=len(ALL.FREQUENCY[(ALL.TIME<12)&(ALL.RECURRENT==1)])),
'-', label='Recurrent v-SNPs',
color='dodgerblue',lw=2)
plt.ylabel('Proportion of alleles', size=label_size)
plt.xlabel('Allele frequency', size=label_size)
plt.yticks(size=tick_size)
plt.xticks(np.arange(.2,1,.2),['20%','40%','60%','80%'], size=tick_size)
plt.xlim(0,0.99)
plt.legend(frameon=False, fontsize=text_size, loc=4)
plt.vlines(0.2,0,1,colors='black', linestyles='dotted')
plt.hlines(0.8,0,1,colors='black', linestyles='dotted')
# Hide the right and top spines
a.spines['right'].set_visible(False)
a.spines['top'].set_visible(False)
# Only show ticks on the left and bottom spines
a.yaxis.set_ticks_position('left')
a.xaxis.set_ticks_position('bottom')
plt.tight_layout()
plt.savefig('../reports/figures/3_SFS_Alt.pdf')
#plt.savefig('../reports/figures/3_SFS_Alt.png',dpi=600)
In [64]:
plt.figure('SFS', figsize=(8,8))
plt.subplot(221)
plt.title('A', size=18, loc='left')
plt.plot(sorted(list(ALL.FREQUENCY[(ALL.TIME<12)])),
np.linspace(0,1,num=len(ALL.FREQUENCY[(ALL.TIME<12)]), endpoint=False),
'-', label='All v-SNPs',
color='black',lw=3)
plt.plot(sorted(list(ALL.FREQUENCY[(ALL.TIME<12)&(ALL.RECURRENT==0)])),
np.linspace(0,1,num=len(ALL.FREQUENCY[(ALL.TIME<12)&(ALL.RECURRENT==0)]), endpoint=False),
'-', label='Unstable v-SNPs',
color='gold',lw=3)
plt.plot(sorted(list(ALL.FREQUENCY[(ALL.TIME<12)&(ALL.RECURRENT==1)])),
np.linspace(0,1,num=len(ALL.FREQUENCY[(ALL.TIME<12)&(ALL.RECURRENT==1)]), endpoint=False),
'-', label='Recurrent v-SNPs',
color='dodgerblue',lw=3)
plt.ylabel('Proportion of alleles', size=18)
#plt.xlabel('Allele frequency', size=14)
plt.yticks(size=12)
plt.xticks(np.arange(.2,1.2,.2),['20%','40%','60%','80%','100%'], size=12)
plt.legend(frameon=False, fontsize=12, loc=4)
plt.vlines(0.2,0,1,colors='black', linestyles='dotted')
plt.hlines(0.8,0,1,colors='black', linestyles='dotted')
plt.subplot(222)
plt.title('B', size=18, loc='left')
plt.plot(sorted(list(ALL.FREQUENCY[(ALL.TIME<12)])),
np.linspace(0,1,num=len(ALL.FREQUENCY[(ALL.TIME<12)])),
'-', label='All v-SNPs',
color='black',lw=3)
plt.plot(sorted(list(ALL.FREQUENCY[(ALL.TIME<12)&(ALL.SNP_TYPE=='NSY')])),
np.linspace(0,1,num=len(ALL.FREQUENCY[(ALL.TIME<12)&(ALL.SNP_TYPE=='NSY')])),
'-', label='NSY v-SNPs',
color='gold',lw=3)
plt.plot(sorted(list(ALL.FREQUENCY[(ALL.TIME<12)&(ALL.SNP_TYPE=='SYN')])),
np.linspace(0,1,num=len(ALL.FREQUENCY[(ALL.TIME<12)&(ALL.SNP_TYPE=='SYN')])),
'-', label='SYN v-SNPs',
color='dodgerblue',lw=3)
#plt.ylabel('Proportion of alleles', size=14)
#plt.xlabel('Allele frequency', size=14)
plt.yticks(size=12)
plt.xticks(np.arange(.2,1.2,.2),['20%','40%','60%','80%','100%'], size=12)
plt.legend(frameon=False, fontsize=12, loc=4)
plt.vlines(0.2,0,1,colors='black', linestyles='dotted')
plt.hlines(0.8,0,1,colors='black', linestyles='dotted')
plt.subplot(223)
plt.title('C', size=18, loc='left')
plt.plot(sorted(list(ALL.FREQUENCY[(ALL.TIME<12)])),
np.linspace(0,1,num=len(ALL.FREQUENCY[(ALL.TIME<12)])),
'-', label='All v-SNPs',
color='black',lw=3)
plt.plot(sorted(list(ALL.FREQUENCY[(ALL.TIME<12)&(ALL.NON_EFFICACIOUS==0)])),
np.linspace(0,1,num=len(ALL.FREQUENCY[(ALL.TIME<12)&(ALL.NON_EFFICACIOUS==0)])),
'-', label='4+ drugs',
color='gold',lw=3)
plt.plot(sorted(list(ALL.FREQUENCY[(ALL.TIME<12)&(ALL.NON_EFFICACIOUS==1)])),
np.linspace(0,1,num=len(ALL.FREQUENCY[(ALL.TIME<12)&(ALL.NON_EFFICACIOUS==1)])),
'-', label='<4 drugs',
color='dodgerblue',lw=3)
plt.ylabel('Proportion of alleles', size=18)
plt.xlabel('Allele frequency', size=18)
plt.yticks(size=12)
plt.xticks(np.arange(.2,1.2,.2),['20%','40%','60%','80%','100%'], size=12)
plt.legend(frameon=False, fontsize=12, loc=4)
plt.vlines(0.2,0,1,colors='black', linestyles='dotted')
plt.hlines(0.8,0,1,colors='black', linestyles='dotted')
plt.subplot(224)
plt.title('D', size=18, loc='left')
plt.plot(sorted(list(ALL.FREQUENCY[(ALL.TIME<12)])),
np.linspace(0,1,num=len(ALL.FREQUENCY[(ALL.TIME<12)])),
'-', label='All v-SNPs',
color='black',lw=3)
#plt.plot(sorted(list(ALL.FREQUENCY[(ALL.TIME==0)])),
# np.linspace(0,1,num=len(ALL.FREQUENCY[(ALL.TIME==0)])),
# '-', label='Pre-treatment',
# color='gold',lw=3)
#plt.plot(sorted(list(ALL.FREQUENCY[(ALL.TIME<12)&(ALL.TIME>0)])),
# np.linspace(0,1,num=len(ALL.FREQUENCY[(ALL.TIME<12)&(ALL.TIME>0)])),
# '-', label='Post-treatment',
# color='dodgerblue',lw=3)
#plt.plot(sorted(list(ALL.FREQUENCY[(ALL.TIME<12)])),
# np.linspace(0,1,num=len(ALL.FREQUENCY[(ALL.TIME<12)]), endpoint=False),
# '-', label='All v-SNPs',
# color='black',lw=3)
plt.plot(sorted(list(ALL.FREQUENCY[(ALL.TIME==0)])),
np.linspace(0,1,num=len(ALL.FREQUENCY[(ALL.TIME==0)]), endpoint=False),
'-', label='Pre-treatment',
color='gold',lw=3)
plt.plot(sorted(list(ALL.FREQUENCY[(ALL.TIME==2)])),
np.linspace(0,1,num=len(ALL.FREQUENCY[(ALL.TIME==2)]), endpoint=False),
'-', label='2 weeks',
color='dodgerblue',lw=3, alpha=1)
plt.plot(sorted(list(ALL.FREQUENCY[(ALL.TIME==4)])),
np.linspace(0,1,num=len(ALL.FREQUENCY[(ALL.TIME==4)]), endpoint=False),
'-', label='4 weeks',
color='dodgerblue',lw=3, alpha=0.75)
plt.plot(sorted(list(ALL.FREQUENCY[(ALL.TIME==6)])),
np.linspace(0,1,num=len(ALL.FREQUENCY[(ALL.TIME==6)]), endpoint=False),
'-', label='6 weeks',
color='dodgerblue',lw=3, alpha=0.5)
plt.plot(sorted(list(ALL.FREQUENCY[(ALL.TIME==8)])),
np.linspace(0,1,num=len(ALL.FREQUENCY[(ALL.TIME==8)]), endpoint=False),
'-', label='8 weeks',
color='dodgerblue',lw=3, alpha=0.25)
plt.arrow(0.18, 0.82, .07, -.1, head_width=0.02, head_length=0.05, fc='k', ec='k')
#plt.ylabel('Proportion of alleles', size=14)
plt.xlabel('Allele frequency', size=18)
plt.yticks(size=12)
plt.xticks(np.arange(.2,1.2,.2),['20%','40%','60%','80%','100%'], size=12)
plt.legend(frameon=False, fontsize=12, loc=4)
plt.vlines(0.2,0,1,colors='black', linestyles='dotted')
plt.hlines(0.8,0,1,colors='black', linestyles='dotted')
plt.tight_layout()
#plt.savefig('../reports/figures/3_Supplementary_SFS.pdf')
#plt.savefig('../reports/figures/3_Supplementray_SFS.png',dpi=300)
In [7]:
plt.figure('SFS', figsize=(8,8))
plt.plot(sorted(list(ALL.FREQUENCY[(ALL.TIME<12)])),
np.linspace(0,1,num=len(ALL.FREQUENCY[(ALL.TIME<12)]), endpoint=False),
'-', label='All v-SNPs',
color='black',lw=3)
plt.plot(sorted(list(ALL.FREQUENCY[(ALL.TIME==0)])),
np.linspace(0,1,num=len(ALL.FREQUENCY[(ALL.TIME==0)]), endpoint=False),
'-', label='t=0',
color='gold',lw=3)
plt.plot(sorted(list(ALL.FREQUENCY[(ALL.TIME==2)])),
np.linspace(0,1,num=len(ALL.FREQUENCY[(ALL.TIME==2)]), endpoint=False),
'-', label='t=2',
color='dodgerblue',lw=3)
plt.plot(sorted(list(ALL.FREQUENCY[(ALL.TIME==4)])),
np.linspace(0,1,num=len(ALL.FREQUENCY[(ALL.TIME==4)]), endpoint=False),
'-', label='t=4',
color='salmon',lw=3)
plt.plot(sorted(list(ALL.FREQUENCY[(ALL.TIME==6)])),
np.linspace(0,1,num=len(ALL.FREQUENCY[(ALL.TIME==6)]), endpoint=False),
'-', label='t=6',
color='lavender',lw=3)
plt.plot(sorted(list(ALL.FREQUENCY[(ALL.TIME==8)])),
np.linspace(0,1,num=len(ALL.FREQUENCY[(ALL.TIME==8)]), endpoint=False),
'-', label='t=8',
color='limegreen',lw=3)
plt.ylabel('Proportion of alleles', size=14)
#plt.xlabel('Allele frequency', size=14)
plt.yticks(size=12)
plt.xticks(np.arange(.2,1.2,.2),['20%','40%','60%','80%','100%'], size=12)
plt.legend(frameon=False, fontsize=12, loc=4)
plt.vlines(0.2,0,1,colors='black', linestyles='dotted')
plt.hlines(0.8,0,1,colors='black', linestyles='dotted')
Out[7]:
In [11]:
sum(ALL.FREQUENCY[(ALL.TIME<12)&(ALL.RECURRENT==0)]<0.2)
Out[11]:
Main takehome stays the same. I think the cumulative distribution visually corroborates the notion that most of the genetic heterogeneity is driven by minor clones as it shows that 80% of the vSNPs have a frequency of less than 20%.
Some of the detected vSNPs are stable across time points while other are not. Run a logistic regression to see how good a predictor of stability is allele frequency.
In [32]:
RECURRENT_vSNP = {}
for patient in PATIENT_DATA['PATIENTS']:
RECURRENT_vSNP[patient] = []
for k,v in Counter(ALL.LOCUS[(ALL.PATIENT_ID==patient)]).items():
if v>1: RECURRENT_vSNP[patient].append(k)
ALL['RECURRENT'] = [int(x in RECURRENT_vSNP[y]) for (x,y) in zip(ALL.LOCUS,ALL.PATIENT_ID)]
In [110]:
model = sm.GLM.from_formula('RECURRENT ~ FREQUENCY', ALL[ALL.TIME==0], family=sm.families.Binomial())
result = model.fit()
print(result.summary())
Looks like it does. Let's have a look at what this actually looks like.
In [127]:
I , B1 = result.params
#Calculate the log-odds of being RECURRENT at each FREQUENCY
LOGODDS = np.arange(-1,2,0.01)*B1+I
#Calculate ODDS
ODDS = np.exp(LOGODDS)
#Calculate Probability
P = ODDS/(1+ODDS)
plt.figure(figsize=(8,8))
plt.plot(np.arange(-1,2,0.01),P,'k-')
plt.plot(ALL.FREQUENCY[ALL.TIME==0], ALL.RECURRENT[ALL.TIME==0], 'wo')
plt.xlabel('Frequency', size=18)
plt.ylabel('Pr(RECURRENT)', size=18)
plt.legend(['Logisitic regression line', 'SNP Data'], loc=5, fontsize=14)
#plt.vlines(800,-0.05,1.05,colors='red',linestyles='dashed')
#plt.hlines(0.05,0,1800,colors='blue',linestyles='dashed')
plt.ylim(-0.05,1.05)
plt.title('FREQUENCY-dependence of RECURRENCE', size=18, loc='left')
plt.vlines(0,-0.05,1.05,linestyles='dotted',colors='red')
plt.vlines(0.25,-0.05,1.05,linestyles='dotted',colors='red')
plt.hlines(0.5,-1,2,linestyles='dotted',colors='red')
Out[127]:
I have performed the analysis in R using 'lme4' and 'glmer' to do a mixed effect logistic regression:
t0 <-ALL$TIME==0
result <- glmer(RECURRENT ~ FREQUENCY + (1|PATIENT_ID), ALL[t0,], family=binomial)
The calculated parameters were:
Intercept -2.608
FREQUENCY 10.136
In [126]:
I , B1 = -2.608, 10.136
#Calculate the log-odds of being RECURRENT at each FREQUENCY
LOGODDS = np.arange(-1,2,0.01)*B1+I
#Calculate ODDS
ODDS = np.exp(LOGODDS)
#Calculate Probability
P = ODDS/(1+ODDS)
plt.figure(figsize=(8,8))
plt.plot(np.arange(-1,2,0.01),P,'k-')
plt.plot(ALL.FREQUENCY[ALL.TIME==0], ALL.RECURRENT[ALL.TIME==0], 'wo')
plt.xlabel('Frequency', size=18)
plt.ylabel('Pr(RECURRENT)', size=18)
plt.legend(['Logisitic regression line', 'SNP Data'], loc=5, fontsize=14)
#plt.vlines(800,-0.05,1.05,colors='red',linestyles='dashed')
#plt.hlines(0.05,0,1800,colors='blue',linestyles='dashed')
plt.ylim(-0.05,1.05)
plt.title('FREQUENCY-dependence of RECURRENCE', size=18, loc='left')
plt.vlines(0,-0.05,1.05,linestyles='dotted',colors='red')
plt.vlines(0.25,-0.05,1.05,linestyles='dotted',colors='red')
plt.hlines(0.5,-1,2,linestyles='dotted',colors='red')
Out[126]:
Unsurprisingly the higher frequency alleles at the time of start of treatment are more likely to be stable in the population. On the other hand for frequencies between 0-25% the alleles are more likely to be transient than stable. This is also reflected in the overall distribution of alleles: out of 140 vSNPs detected at T=0, 31 were recurrent and 109 were not.
On a tangent (this is probably wrong, but I'm curious), can I use survival analysis statistics to look at this a bit better?
In [22]:
import lifelines as lf
a = ALL[['FREQUENCY','RECURRENT', 'TIME', 'NON_EFFICACIOUS', 'SNP_TYPE']][ALL.TIME<12].copy()
a = a.reset_index(drop=True)
a['E'] = [1 for x in a.index]
E = a['E']
T = a['FREQUENCY']
t = np.linspace(0,1,len(a))
recurrent = a['RECURRENT']==1
treatment = a['TIME']>0
efficacious = a['NON_EFFICACIOUS']==0
nsysnp = a['SNP_TYPE']=='NSY'
synsnp = a['SNP_TYPE']=='SYN'
kmf = lf.KaplanMeierFitter()
kmf.fit(T[recurrent], timeline=t, event_observed=E[recurrent], label='Stable SNP')
ax = kmf.plot(figsize=(8,8),fontsize=12, c='dodgerblue')
kmf2 = lf.KaplanMeierFitter()
kmf2.fit(T[~recurrent], timeline=t, event_observed=E[~recurrent], label='Unstable SNPs')
kmf2.plot(ax=ax, c="r")
plt.xlabel('Allele frequency',size=14)
lf.plotting.add_at_risk_counts(kmf, kmf2, ax=ax)
plt.ylim(0,1.05)
Out[22]:
In [13]:
from lifelines.statistics import logrank_test
results = logrank_test(T[recurrent], T[~recurrent], event_observed_A=E[recurrent], event_observed_B=E[~recurrent])
results.print_summary()
In [14]:
kmf = lf.KaplanMeierFitter()
kmf.fit(T[treatment], timeline=t, event_observed=E[treatment], label='During treatment')
ax = kmf.plot(figsize=(8,8),fontsize=12, c='dodgerblue')
kmf2 = lf.KaplanMeierFitter()
kmf2.fit(T[~treatment], timeline=t, event_observed=E[~treatment], label='Pre-treatment')
kmf2.plot(ax=ax, c="r")
plt.xlabel('Allele frequency',size=14)
lf.plotting.add_at_risk_counts(kmf, kmf2, ax=ax)
plt.ylim(0,1.05)
Out[14]:
In [15]:
results = logrank_test(T[treatment], T[~treatment], event_observed_A=E[treatment], event_observed_B=E[~treatment])
results.print_summary()
In [21]:
kmf = lf.KaplanMeierFitter()
kmf.fit(T[efficacious], timeline=t, event_observed=E[efficacious], label='4+ drugs')
ax = kmf.plot(figsize=(8,8),fontsize=12, c='dodgerblue')
kmf2 = lf.KaplanMeierFitter()
kmf2.fit(T[~efficacious], timeline=t, event_observed=E[~efficacious], label='<4 drugs')
kmf2.plot(ax=ax, c="r")
plt.xlabel('Allele frequency',size=14)
lf.plotting.add_at_risk_counts(kmf, kmf2, ax=ax)
plt.ylim(0,1.05)
Out[21]:
In [18]:
results = logrank_test(T[efficacious], T[~efficacious],
event_observed_A=E[efficacious], event_observed_B=E[~efficacious])
results.print_summary()
In [27]:
kmf = lf.KaplanMeierFitter()
kmf.fit(T[nsysnp], timeline=t, event_observed=E[nsysnp], label='NSY v-SNP')
ax = kmf.plot(figsize=(8,8),fontsize=12, c='dodgerblue')
kmf2 = lf.KaplanMeierFitter()
kmf2.fit(T[synsnp], timeline=t, event_observed=E[synsnp], label='SYN v-SNP')
kmf2.plot(ax=ax, c="r")
plt.xlabel('Allele frequency',size=14)
lf.plotting.add_at_risk_counts(kmf, kmf2, ax=ax)
plt.ylim(0,1.05)
Out[27]:
In [25]:
results = logrank_test(T[nsysnp], T[synsnp],
event_observed_A=E[nsysnp], event_observed_B=E[synsnp])
results.print_summary()
In [23]:
for time in [0,2,4,6,8]:
a = ALL[['FREQUENCY','RECURRENT']][ALL.TIME==time].copy()
a = a.reset_index(drop=True)
a['E'] = [1 for x in a.index]
E = a['E']
T = a['FREQUENCY']
t = np.linspace(0,1,len(a))
recurrent = a['RECURRENT']==1
results = logrank_test(T[recurrent], T[~recurrent], event_observed_A=E[recurrent], event_observed_B=E[~recurrent])
results.print_summary()
Assuming I can use survival analysis in this case, the inference is the following: stable SNPs have a higher overall frequency and this difference is significant. Re-running the above analysis for each timepoint, I see that this relationship is true for all but one tested timepoint. So what does this mean? Never mind, these analyses are probably not applicable to frequency data...
Taking a more naive approach again and looking specifically at the denisty of these allele frequency distributions:
In [24]:
plt.figure('Violin', figsize=(8,8))
plt.violinplot([ALL.LOG_FREQ[(ALL.RECURRENT==0)&(ALL.TIME==0)],
ALL.LOG_FREQ[(ALL.RECURRENT==1)&(ALL.TIME==0)]])
plt.xticks([1,2],['Transient', 'Stable'],size=14)
plt.yticks(size=14)
plt.ylabel('Log allele frequency', size=18)
plt.xlabel('Allele stability', size=18)
Out[24]:
The shape is clearly different, so chances are that the correlation is driven by these high frequency mutations, and for the rest of there is a stochastic mechanism determining their expansion.
Exploring this a bit furtner, patient 12 has 3 samples for t=0. Let's quantify the overlap.
In [4]:
P12_T0 = {'S1': [],
'S2': [],
'S3': []}
P12_T0_FREQ = {}
for ind,sample in enumerate(['S1', 'S2', 'S3']):
for line in open('../data/raw/ANNOTATED_vSNP/1200_{}.annotated'.format(sample)):
split = line.split('\t')
if int(split[0]) in list(ALL.LOCUS[(ALL.PATIENT_ID=='Patient12')&(ALL.TIME!=0)])\
or float(split[3][:-1])>1.5:
P12_T0[sample].append(int(split[0]))
if int(split[0]) in P12_T0_FREQ:
P12_T0_FREQ[int(split[0])][ind]+=float(split[3][:-1])
if int(split[0]) not in P12_T0_FREQ:
P12_T0_FREQ[int(split[0])]=np.zeros(3)
P12_T0_FREQ[int(split[0])][ind]+=float(split[3][:-1])
P12_T0_FREQ = pd.DataFrame(P12_T0_FREQ).T
P12_T0_FREQ['RECURRENT'] = [int(x in list(ALL.LOCUS[(ALL.PATIENT_ID=='Patient12')&
(ALL.TIME!=0)])) for x in P12_T0_FREQ.index]
P12_T0_FREQ[P12_T0_FREQ.RECURRENT==1]
Out[4]:
In [8]:
fig = plt.figure(figsize=(8,8))
ax = fig.add_axes([0.1, 0.1, 0.8, 0.8], polar=True)
theta = [0.25*np.pi, 0.25*np.pi, 1.7*np.pi]
radii = [8,10,8]
width = [21/20*np.pi, 8/20*np.pi, 15/20*np.pi]
bars = ax.bar(theta, radii, width=width, bottom=0.0)
colors = ['dodgerblue', 'gold', 'limegreen']
labels = ['S1','S2','S3']
for label,bar,color in zip(labels, bars, colors):
bar.set_facecolor(color)
bar.set_label(label)
bar.set_alpha(0.7)
#plt.grid(None)
ax.spines['polar'].set_visible(False)
plt.xticks(np.linspace(0,2*np.pi,9),[])
plt.yticks(np.arange(0,12,2),[])
plt.text(np.pi*.45,8.5,'4', fontsize=16)
plt.text(np.pi*.6,5,'5', fontsize=16)
plt.text(np.pi*.35,5,'4', fontsize=16)
plt.text(np.pi*0.05,5,'11', fontsize=16)
plt.text(np.pi*.9,5,'12', fontsize=16)
plt.legend(ncol=3, loc=8, title='Subsample', fontsize=14, frameon=False)
ax.get_legend().get_title().set_fontsize('16')
In [10]:
for subsample in [0,1,2]:
_a = [x>0 for x in P12_T0_FREQ[subsample]]
_b = list(P12_T0_FREQ.RECURRENT)
print(ss.spearmanr(_a,_b))
In [5]:
present = np.array(P12_T0_FREQ[[0,1,2]]>0,dtype=int)
recurrent = np.array(P12_T0_FREQ.RECURRENT)
plt.figure(figsize=(12,4))
plt.subplot(211)
plt.imshow( (present+(present.T*recurrent).T).T, interpolation='none', cmap='viridis')
plt.yticks([0,1,2],[1,2,3])
plt.ylabel('Subsample', size=14)
Out[5]:
All of the alleles that are present in all samples are also stable over time. Similarly, almost all shared alleles are also stable with the exception of 3347797 (marker 20 above). Clearly though the populations represented in each of the subsamples are re-sampled over the course of treatment. I wonder how many times each gets picked up over the course of treatment.
In [51]:
plt.figure(figsize=(8,8))
c=plt.subplot2grid((5,5), (0,0), rowspan=2, colspan=5)
plt.vlines(24.5,0,50,lw=2,colors='red')
plt.vlines(34.5,0,50,lw=2,colors='red')
plt.vlines(3.5,0,50,lw=2,colors='red')
plt.vlines(7.5,0,50,lw=2,colors='red')
plt.bar(np.arange(-0.5,35,1),
np.array(P12_T0_FREQ[[0,1,2]].sum(axis=1))/np.array(P12_T0_FREQ[[0,1,2]]>0).sum(axis=1),
bottom=-.7, width=1, color='white',lw=1.5)
plt.vlines(np.arange(0,35.5,1),
np.array(P12_T0_FREQ[[0,1,2]].sum(axis=1))/np.array(P12_T0_FREQ[[0,1,2]]>0).sum(axis=1)\
+np.std(P12_T0_FREQ[[0,1,2]], axis=1),
np.array(P12_T0_FREQ[[0,1,2]].sum(axis=1))/np.array(P12_T0_FREQ[[0,1,2]]>0).sum(axis=1)\
-np.std(P12_T0_FREQ[[0,1,2]], axis=1),
)
plt.ylabel('Mean allele\nfrequency', size=16)
plt.xlim(-.5,35.5)
plt.ylim(0,50)
plt.yticks(np.arange(0,60,10),
['0%', '10%', '20%', '30%', '40%', '50%'],
size=12)
plt.xticks(np.arange(0,36,5),[])
plt.text(5.5,40,'mmpR',horizontalalignment='center',
size=14,color='red')
plt.text(30,40,'glpK',horizontalalignment='center',
size=14,color='red')
c.spines['right'].set_visible(False)
c.spines['top'].set_visible(False)
# Only show ticks on the left and bottom spines
c.yaxis.set_ticks_position('left')
c.xaxis.set_ticks_position('bottom')
plt.xticks([0,1,2,5.5,9,10,12,13,21,24,30],
rotation=60,size=12)
plt.subplot2grid((5,5), (2,0), rowspan=1, colspan=5)
plt.imshow( (present+(present.T*recurrent).T).T, aspect='equal', interpolation='none', cmap='viridis')
plt.yticks([0,1,2],['S_1','S_2','S_3'],size=12)
plt.ylabel('Sputum\nSample', size=16)
plt.vlines(24.5,-.50,2.5,lw=3,colors='red')
plt.vlines(34.5,-.50,2.5,lw=3,colors='red')
plt.vlines(3.5,-.50,2.5,lw=3,colors='red')
plt.vlines(7.5,-.50,2.5,lw=3,colors='red')
plt.hlines(-.50,24.5,34.5,lw=3,colors='red')
plt.hlines(2.5,24.5,34.5,lw=3,colors='red')
plt.hlines(-.50,3.5,7.5,lw=3,colors='red')
plt.hlines(2.5,3.5,7.5,lw=3,colors='red')
plt.xlim(-0.5,35.5)
plt.xticks([0,1,2,5.5,9,10,12,13,21,24,30],
['dnaA','lprO','zmp1','mmpR','Rv1097c','Rv1371','ephB','mshC','Rv3195','lpdA','glpK'],
rotation=60,size=12)
d = plt.subplot2grid((5,5), (3,0), rowspan=2, colspan=5)
#count how many times each locus is present in the rest of the timepoints
COUNTINGS = Counter(list(ALL.LOCUS[(ALL.PATIENT_ID=='Patient12')&(ALL.TIME!=0)]))
PROPORTIONS = [COUNTINGS.get(x,0)*20 for x in P12_T0_FREQ.index]
plt.vlines(24.5,0,100,lw=2,colors='red',linestyles='dotted')
plt.vlines(34.5,0,100,lw=2,colors='red',linestyles='dotted')
plt.vlines(3.5,0,100,lw=2,colors='red',linestyles='dotted')
plt.vlines(7.5,0,100,lw=2,colors='red',linestyles='dotted')
plt.bar(np.arange(-0.5,35,1),
PROPORTIONS,
bottom=-.7, width=1, color='gold',lw=1.5)
plt.ylabel('Number of timepoints\nwhere allele detected', size=16)
plt.xlim(-.5,35.5)
plt.ylim(0,100)
plt.yticks(np.arange(20,100,20),
['2', '3', '4', '5'],
size=12)
plt.xticks(np.arange(0,36,5),[])
plt.text(5.5,90,'mmpR',horizontalalignment='center',
size=14,color='red')
plt.text(30,90,'glpK',horizontalalignment='center',
size=14,color='red')
d.spines['right'].set_visible(False)
d.spines['top'].set_visible(False)
# Only show ticks on the left and bottom spines
d.yaxis.set_ticks_position('left')
d.xaxis.set_ticks_position('bottom')
plt.xticks([0,1,2,5.5,9,10,12,13,21,24,30],
rotation=60,size=12)
plt.xlabel('v-SNPs', size=16)
plt.tight_layout()
#plt.savefig('../reports/figures/3_P12_resample.pdf')
#plt.savefig('../reports/figures/3_P12_resample',dpi=300)
In [7]:
#Does allele frequency determine weather an allele is shared across samples?
shared = present.sum(1)>1
shared_frequencies = [x for x in np.array(P12_T0_FREQ[[0,1,2]], dtype=float)[shared].flatten() if x>0]
unique_frequencies = [x for x in np.array(P12_T0_FREQ[[0,1,2]], dtype=float)[~shared].flatten() if x>0]
ss.mannwhitneyu(shared_frequencies, unique_frequencies, alternative='two-sided')
Out[7]:
In [21]:
#Does allele frequency determine weather an allele is stable over time?
ss.mannwhitneyu([x for x in np.array(P12_T0_FREQ[[0,1,2]][P12_T0_FREQ.RECURRENT==1],
dtype=float).flatten() if x>0],
[x for x in np.array(P12_T0_FREQ[[0,1,2]][P12_T0_FREQ.RECURRENT==0],
dtype=float).flatten() if x>0]
)
Out[21]:
Overall, we are probably consistently undersampling the true complexity of the bacterial population in the lung. The sputum seems to contain a proportion of stable alleles and a host of transient ones. This may reflect the dynamic nautre of the population based on biology or may also result from the fact that expectoration is effectively a stochastic spatial sampling. Nonetheless, the variation in the frequency of stable alleles suggests that much of the bacterial population that is regularly caughed up is not homogeneous. The question of where these bacteria come from remains open, do they actively compete, or are they simply the result of leeching from spatialy segregated clones?
This then leads to the question of: is there a biological signal in the population that would favour one scenario over another?
But before we go there... I'm just curious to see what the trajectory is of the stable alleles observed above.
In [137]:
ALL[(ALL.PATIENT_ID=='Patient12')&(ALL.LOCUS>4138202)&(ALL.LOCUS<4139755)]
Out[137]:
In [155]:
plt.figure(figsize=(8,8))
locus_id = ['mmpR', 'mmpR', 'mmpR', 'Rv1097c', 'Rv1371', 'Rv3195', 'lpdA']
for ind,locus in enumerate([779193,779215,779296,1225354,1543917,3564850,3689763]):
plt.subplot(3,3,ind+1)
plt.plot(ALL.TIME[(ALL.PATIENT_ID=='Patient12')&(ALL.LOCUS==locus)],
ALL.FREQUENCY[(ALL.PATIENT_ID=='Patient12')&(ALL.LOCUS==locus)],
'r-', lw=2)
plt.title(locus_id[ind],loc='left',size=14)
plt.ylim(0,.5)
plt.yticks(np.arange(0,.5,.1),size=12)
plt.xticks([],size=12)
if ind>3:
plt.xticks([0,2,4,6,8,16],size=12)
plt.xlabel('Time (weeks)', size=14)
if ind in [0,3,6]:
plt.ylabel('Allele frequency', size=14)
plt.tight_layout()
These are cleary minor but stable variants. Why are they not increasing in frequency? Does this argue that the bacteria in the sputum come from the constant leeching of bacteria in the different foci in the lung - as suggest by JoAnne Flynn? Or is this a minor but stable popultion within the microcosm of lung microbiota?
One of the dynamics that I should plot is that of gyrA mutations in Patient 10.
In [38]:
plt.figure('Patient10_gyrA', figsize=(3.35,3))
plt.plot([0]+list(ALL.TIME[(ALL.PATIENT_ID=='Patient10')&(ALL.LOCUS==7570)])+[16],
[0]+list(ALL.FREQUENCY[(ALL.PATIENT_ID=='Patient10')&(ALL.LOCUS==7570)])+[0.958],
label=r'$\mathregular{GyrA^{Ala90Val}}$',color='black',
marker='o',mfc='dodgerblue',mew=1.5, ms=8, mec='black',lw=1.5)
plt.plot([0,4]+list(ALL.TIME[(ALL.PATIENT_ID=='Patient10')&(ALL.LOCUS==7582)])+[16],
[0,0]+list(ALL.FREQUENCY[(ALL.PATIENT_ID=='Patient10')&(ALL.LOCUS==7582)])+[0.0491],
label=r'$\mathregular{GyrA^{Asp94Gly}}$',color='black',
marker='o',mfc='gold',mew=1.5, ms=8, mec='black',lw=1.5)
plt.xticks([0,4,6,8,16],[0,4,6,8,16],size=tick_size)
plt.yticks(size=tick_size)
plt.vlines([0,4,6,8,16],0,1,linestyles='dotted',lw=0.5)
plt.legend(loc=5, fontsize=text_size)
plt.xlabel('Time (weeks)', size=label_size)
plt.ylabel('Proportion of population', size=label_size)
#plt.title('Patient 10', size=18, loc='left')
plt.savefig('../reports/figures/3_P10_gyrA.pdf')
#plt.savefig('../reports/figures/3_P10_gyrA',dpi=300)
Might be worth to see if we can classify the trajectories of the v-SNPs as a way to describe population dynamics.
Let's first have a look at all the 58 recurrent v-SNPs:
In [34]:
#RECURRENTS = np.unique(ALL.LOCUS[ALL.RECURRENT==1])
RECURRENTS = []
for x in sorted(RECURRENT_vSNP.keys()):
RECURRENTS+=sorted(RECURRENT_vSNP[x])
#Order recurrent v-SNPs first by PATIENT_ID and then by LOCUS
plt.figure('v-SNP trajectories', figsize=(8,8))
for ind,snp in enumerate(RECURRENTS):
time = [0,2,4,6,8]
log_freqs = [-5.5,]*5 #2^-8 roughly our limit of detection
for t,f in zip(list(ALL.TIME[ALL.LOCUS==snp]), list(ALL.LOG_FREQ[ALL.LOCUS==snp])):
place = time.index(t)
log_freqs[place] = f
patient = np.unique(ALL.PATIENT_ID[ALL.LOCUS==snp])[0] #get PATIENT_ID
missing = PATIENT_DATA['MISSING'][PATIENT_DATA['PATIENTS'].index(patient)] #remove missing timepoints
if missing:
time.pop(missing-1)
log_freqs.pop(missing-1)
plt.subplot(8,8,ind+1)
plt.plot(time, log_freqs, 'r-', lw=2)
plt.xticks(np.arange(0,10,8),[])
plt.yticks(np.arange(-5.5,2,2),[])
plt.title('{}'.format(patient))
plt.tight_layout()
This is consistent with my expectations... Let's manually code the trajectory of these SNPs (0-ascending, 1-descending, 2-constant, 3-sporadic).
In [36]:
SNP_PATTERN = np.array([1,1,1,1,3,1,1,3,
3,1,0,0,1,0,1,2,
3,3,3,3,0,0,3,2,
0,1,3,2,3,1,1,1,
3,3,1,3,1,1,1,1,
1,1,3,1,1,2,2,2,
0,2,2,3,2,0,0,0,
2,0])
In [32]:
plt.figure(figsize=(10,4))
patient_of_interest = 'Patient10'
c=plt.subplot2grid((4,11), (0,0), rowspan=2, colspan=3)
pattern = 0
for snp in np.array(RECURRENTS)[SNP_PATTERN==pattern]:
time = [0,2,4,6,8]
log_freqs = [-5.5,]*5 #2^-8 roughly our limit of detection
for t,f in zip(list(ALL.TIME[ALL.LOCUS==snp]), list(ALL.LOG_FREQ[ALL.LOCUS==snp])):
place = time.index(t)
log_freqs[place] = f
patient = np.unique(ALL.PATIENT_ID[ALL.LOCUS==snp])[0] #get PATIENT_ID
missing = PATIENT_DATA['MISSING'][PATIENT_DATA['PATIENTS'].index(patient)] #remove missing timepoints
if missing:
time.pop(missing-1)
log_freqs.pop(missing-1)
plt.plot(time, log_freqs, '-', color='limegreen', lw=2, alpha=0.15)
if snp==7570:
plt.plot(time, log_freqs, '-', color='limegreen', lw=2)
plt.xticks(np.arange(0,10,2),[])
plt.yticks(np.log([0.01,0.1,0.5]),['1%', '10%', '50%'])
plt.title('Ascending', size=14, loc='left')
plt.hlines(np.log(0.5), 0, 8, linestyles='dotted')
plt.hlines(np.log(0.1), 0, 8, linestyles='dotted')
plt.hlines(np.log(0.01), 0, 8, linestyles='dotted')
plt.ylabel('Allele frequency',size=14)
c=plt.subplot2grid((4,11), (2,0), rowspan=2, colspan=3)
pattern = 1
for snp in np.array(RECURRENTS)[SNP_PATTERN==pattern]:
time = [0,2,4,6,8]
log_freqs = [-5.5,]*5 #e^-5.5 roughly our limit of detection
for t,f in zip(list(ALL.TIME[ALL.LOCUS==snp]), list(ALL.LOG_FREQ[ALL.LOCUS==snp])):
place = time.index(t)
log_freqs[place] = f
patient = np.unique(ALL.PATIENT_ID[ALL.LOCUS==snp])[0] #get PATIENT_ID
missing = PATIENT_DATA['MISSING'][PATIENT_DATA['PATIENTS'].index(patient)] #remove missing timepoints
if missing:
time.pop(missing-1)
log_freqs.pop(missing-1)
plt.plot(time, log_freqs, '-', color='orangered', lw=2, alpha=0.15)
if snp== 2892974:
plt.plot(time, log_freqs, '-', color='orangered', lw=2)
plt.xticks(np.arange(0,10,2),[0,2,4,6,8])
plt.yticks(np.log([0.01,0.1,0.5]),['1%', '10%', '50%'])
plt.title('Descending', size=14, loc='left')
plt.hlines(np.log(0.5), 0, 8, linestyles='dotted')
plt.hlines(np.log(0.1), 0, 8, linestyles='dotted')
plt.hlines(np.log(0.01), 0, 8, linestyles='dotted')
plt.ylabel('Allele frequency',size=14)
plt.xlabel('Time(weeks)',size=14)
c=plt.subplot2grid((4,11), (0,3), rowspan=2, colspan=3)
pattern = 2
for snp in np.array(RECURRENTS)[SNP_PATTERN==pattern]:
time = [0,2,4,6,8]
log_freqs = [-5.5,]*5 #2^-8 roughly our limit of detection
for t,f in zip(list(ALL.TIME[ALL.LOCUS==snp]), list(ALL.LOG_FREQ[ALL.LOCUS==snp])):
place = time.index(t)
log_freqs[place] = f
patient = np.unique(ALL.PATIENT_ID[ALL.LOCUS==snp])[0] #get PATIENT_ID
missing = PATIENT_DATA['MISSING'][PATIENT_DATA['PATIENTS'].index(patient)] #remove missing timepoints
if missing:
time.pop(missing-1)
log_freqs.pop(missing-1)
plt.plot(time, log_freqs, '-', color='dodgerblue', lw=2, alpha=0.15)
if snp== 1543917:
plt.plot(time, log_freqs, '-', color='dodgerblue', lw=2)
plt.xticks(np.arange(0,10,2),[])
plt.yticks(np.log([0.01,0.1,0.5]),['1%', '10%', '50%'])
plt.title('Constant', size=14, loc='left')
plt.hlines(np.log(0.5), 0, 8, linestyles='dotted')
plt.hlines(np.log(0.1), 0, 8, linestyles='dotted')
plt.hlines(np.log(0.01), 0, 8, linestyles='dotted')
c=plt.subplot2grid((4,11), (2,3), rowspan=2, colspan=3)
pattern = 3
for snp in np.array(RECURRENTS)[SNP_PATTERN==pattern]:
time = [0,2,4,6,8]
log_freqs = [-5.5,]*5 #2^-8 roughly our limit of detection
for t,f in zip(list(ALL.TIME[ALL.LOCUS==snp]), list(ALL.LOG_FREQ[ALL.LOCUS==snp])):
place = time.index(t)
log_freqs[place] = f
patient = np.unique(ALL.PATIENT_ID[ALL.LOCUS==snp])[0] #get PATIENT_ID
missing = PATIENT_DATA['MISSING'][PATIENT_DATA['PATIENTS'].index(patient)] #remove missing timepoints
if missing:
time.pop(missing-1)
log_freqs.pop(missing-1)
plt.plot(time, log_freqs, '-', color='orange', lw=2, alpha=0.15)
if snp== 4232644:
plt.plot(time, log_freqs, '-', color='orange', lw=2)
plt.xticks(np.arange(0,10,2),[0,2,4,6,8])
plt.yticks(np.log([0.01,0.1,0.5]),['1%', '10%', '50%'])
plt.title('Sporadic', size=14, loc='left')
plt.hlines(np.log(0.5), 0, 8, linestyles='dotted')
plt.hlines(np.log(0.1), 0, 8, linestyles='dotted')
plt.hlines(np.log(0.01), 0, 8, linestyles='dotted')
plt.xlabel('Time(weeks)',size=14)
c=plt.subplot2grid((4,11), (0,6), rowspan=4, colspan=5)
for (snp,pattern) in zip(np.array(RECURRENTS),SNP_PATTERN):
patcol = {0:'limegreen', 1:'orangered', 2:'dodgerblue', 3:'orange'}
time = [0,2,4,6,8]
log_freqs = [-5.5,]*5 #2^-8 roughly our limit of detection
for t,f in zip(list(ALL.TIME[ALL.LOCUS==snp]), list(ALL.LOG_FREQ[ALL.LOCUS==snp])):
place = time.index(t)
log_freqs[place] = f
_lf = log_freqs+[1]
patient = np.unique(ALL.PATIENT_ID[ALL.LOCUS==snp])[0] #get PATIENT_ID
missing = PATIENT_DATA['MISSING'][PATIENT_DATA['PATIENTS'].index(patient)] #remove missing timepoints
if missing:
time.pop(missing-1)
log_freqs.pop(missing-1)
if patient == patient_of_interest:
plt.plot(time, log_freqs, '-', color=patcol[pattern], lw=2, alpha=0.8)
plt.xticks(np.arange(0,10,2),[0,2,4,6,8])
plt.yticks(np.log([0.01,0.1,0.5]),['1%', '10%', '50%'])
plt.title('{0}nt {1}'.format(patient_of_interest.split('nt')[0], patient_of_interest.split('nt')[1]),
size=14, loc='left')
plt.hlines(np.log(0.5), 0, 8, linestyles='dotted')
plt.hlines(np.log(0.1), 0, 8, linestyles='dotted')
plt.hlines(np.log(0.01), 0, 8, linestyles='dotted')
plt.xlabel('Time(weeks)',size=14)
plt.tight_layout()
In [39]:
plt.figure(figsize=(8,6))
patient_of_interest = 'Patient10'
c=plt.subplot(2,2,1)
pattern = 0
for snp in np.array(RECURRENTS)[SNP_PATTERN==pattern]:
time = [0,2,4,6,8]
log_freqs = [-5.5,]*5 #2^-8 roughly our limit of detection
for t,f in zip(list(ALL.TIME[ALL.LOCUS==snp]), list(ALL.LOG_FREQ[ALL.LOCUS==snp])):
place = time.index(t)
log_freqs[place] = f
patient = np.unique(ALL.PATIENT_ID[ALL.LOCUS==snp])[0] #get PATIENT_ID
missing = PATIENT_DATA['MISSING'][PATIENT_DATA['PATIENTS'].index(patient)] #remove missing timepoints
if missing:
time.pop(missing-1)
log_freqs.pop(missing-1)
plt.plot(time, log_freqs, '-', color='limegreen', lw=2, alpha=0.15)
if snp==7570:
plt.plot(time, log_freqs, '-', color='limegreen', lw=2)
plt.xticks(np.arange(0,10,2),[], size=12)
plt.yticks(np.log([0.01,0.1,0.5]),['1%', '10%', '50%'], size=12)
plt.title('Ascending', size=16, loc='left')
plt.hlines(np.log(0.5), 0, 8, linestyles='dotted')
plt.hlines(np.log(0.1), 0, 8, linestyles='dotted')
plt.hlines(np.log(0.01), 0, 8, linestyles='dotted')
plt.ylabel('Allele frequency',size=16)
c=plt.subplot(2,2,3)
pattern = 1
for snp in np.array(RECURRENTS)[SNP_PATTERN==pattern]:
time = [0,2,4,6,8]
log_freqs = [-5.5,]*5 #e^-5.5 roughly our limit of detection
for t,f in zip(list(ALL.TIME[ALL.LOCUS==snp]), list(ALL.LOG_FREQ[ALL.LOCUS==snp])):
place = time.index(t)
log_freqs[place] = f
patient = np.unique(ALL.PATIENT_ID[ALL.LOCUS==snp])[0] #get PATIENT_ID
missing = PATIENT_DATA['MISSING'][PATIENT_DATA['PATIENTS'].index(patient)] #remove missing timepoints
if missing:
time.pop(missing-1)
log_freqs.pop(missing-1)
plt.plot(time, log_freqs, '-', color='orangered', lw=2, alpha=0.15)
if snp== 2892974:
plt.plot(time, log_freqs, '-', color='orangered', lw=2)
plt.xticks(np.arange(0,10,2),[0,2,4,6,8], size=12)
plt.yticks(np.log([0.01,0.1,0.5]),['1%', '10%', '50%'], size=12)
plt.title('Descending', size=16, loc='left')
plt.hlines(np.log(0.5), 0, 8, linestyles='dotted')
plt.hlines(np.log(0.1), 0, 8, linestyles='dotted')
plt.hlines(np.log(0.01), 0, 8, linestyles='dotted')
plt.ylabel('Allele frequency',size=16)
plt.xlabel('Time(weeks)',size=14)
c=plt.subplot(2,2,2)
pattern = 2
for snp in np.array(RECURRENTS)[SNP_PATTERN==pattern]:
time = [0,2,4,6,8]
log_freqs = [-5.5,]*5 #2^-8 roughly our limit of detection
for t,f in zip(list(ALL.TIME[ALL.LOCUS==snp]), list(ALL.LOG_FREQ[ALL.LOCUS==snp])):
place = time.index(t)
log_freqs[place] = f
patient = np.unique(ALL.PATIENT_ID[ALL.LOCUS==snp])[0] #get PATIENT_ID
missing = PATIENT_DATA['MISSING'][PATIENT_DATA['PATIENTS'].index(patient)] #remove missing timepoints
if missing:
time.pop(missing-1)
log_freqs.pop(missing-1)
plt.plot(time, log_freqs, '-', color='dodgerblue', lw=2, alpha=0.15)
if snp== 1543917:
plt.plot(time, log_freqs, '-', color='dodgerblue', lw=2)
plt.xticks(np.arange(0,10,2),[], size=12)
plt.yticks(np.log([0.01,0.1,0.5]),['1%', '10%', '50%'], size=12)
plt.title('Constant', size=16, loc='left')
plt.hlines(np.log(0.5), 0, 8, linestyles='dotted')
plt.hlines(np.log(0.1), 0, 8, linestyles='dotted')
plt.hlines(np.log(0.01), 0, 8, linestyles='dotted')
c=plt.subplot(2,2,4)
pattern = 3
for snp in np.array(RECURRENTS)[SNP_PATTERN==pattern]:
time = [0,2,4,6,8]
log_freqs = [-5.5,]*5 #2^-8 roughly our limit of detection
for t,f in zip(list(ALL.TIME[ALL.LOCUS==snp]), list(ALL.LOG_FREQ[ALL.LOCUS==snp])):
place = time.index(t)
log_freqs[place] = f
patient = np.unique(ALL.PATIENT_ID[ALL.LOCUS==snp])[0] #get PATIENT_ID
missing = PATIENT_DATA['MISSING'][PATIENT_DATA['PATIENTS'].index(patient)] #remove missing timepoints
if missing:
time.pop(missing-1)
log_freqs.pop(missing-1)
plt.plot(time, log_freqs, '-', color='orange', lw=2, alpha=0.15)
if snp== 4232644:
plt.plot(time, log_freqs, '-', color='orange', lw=2)
plt.xticks(np.arange(0,10,2),[0,2,4,6,8], size=12)
plt.yticks(np.log([0.01,0.1,0.5]),['1%', '10%', '50%'], size=12)
plt.title('Sporadic', size=16, loc='left')
plt.hlines(np.log(0.5), 0, 8, linestyles='dotted')
plt.hlines(np.log(0.1), 0, 8, linestyles='dotted')
plt.hlines(np.log(0.01), 0, 8, linestyles='dotted')
plt.xlabel('Time(weeks)',size=16)
plt.tight_layout()
plt.savefig('../reports/figures/3_recurrence_pattern.pdf')
plt.savefig('../reports/figures/3_recurrence_pattern',dpi=300)
In [52]:
plt.figure(figsize=(6.7,5))
c=plt.subplot2grid((20,28), (0,0), rowspan=8, colspan=20)
plt.vlines(24.5,0,50,lw=1.5,colors='red')
plt.vlines(34.5,0,50,lw=1.5,colors='red')
plt.vlines(3.5,0,50,lw=1.5,colors='red')
plt.vlines(7.5,0,50,lw=1.5,colors='red')
plt.bar(np.arange(-0.5,35,1),
np.array(P12_T0_FREQ[[0,1,2]].sum(axis=1))/np.array(P12_T0_FREQ[[0,1,2]]>0).sum(axis=1),
bottom=-.7, width=1, color='white',lw=1.5)
#plt.vlines(np.arange(0,35.5,1),
# np.array(P12_T0_FREQ[[0,1,2]].sum(axis=1))/np.array(P12_T0_FREQ[[0,1,2]]>0).sum(axis=1)\
# +np.std(P12_T0_FREQ[[0,1,2]], axis=1),
# np.array(P12_T0_FREQ[[0,1,2]].sum(axis=1))/np.array(P12_T0_FREQ[[0,1,2]]>0).sum(axis=1)\
# -np.std(P12_T0_FREQ[[0,1,2]], axis=1),
# )
plt.ylabel('Mean allele\nfrequency', size=label_size)
plt.xlim(-.5,35.5)
plt.ylim(0,50)
plt.yticks(np.arange(0,60,10),
['0%', '10%', '20%', '30%', '40%', '50%'],
size=tick_size)
plt.xticks(np.arange(0,36,5),[])
plt.text(5.5,40,'mmpR',horizontalalignment='center',
size=text_size,color='red')
plt.text(30,40,'glpK',horizontalalignment='center',
size=text_size,color='red')
c.spines['right'].set_visible(False)
c.spines['top'].set_visible(False)
# Only show ticks on the left and bottom spines
c.yaxis.set_ticks_position('left')
c.xaxis.set_ticks_position('bottom')
plt.xticks([0,1,2,5.5,9,10,12,13,21,24,30],
rotation=60,size=tick_size)
plt.subplot2grid((20,28), (8,0), rowspan=4, colspan=20)
plt.imshow( (present+(present.T*recurrent).T).T, interpolation='none', aspect='equal', cmap='viridis')
plt.yticks([0,1,2],['1','2','3'],size=tick_size)
plt.ylabel('Sputum\nSample', size=label_size)
plt.vlines(24.5,-.50,2.5,lw=1.5,colors='red')
plt.vlines(34.5,-.50,2.5,lw=1.5,colors='red')
plt.vlines(3.5,-.50,2.5,lw=1.5,colors='red')
plt.vlines(7.5,-.50,2.5,lw=1.5,colors='red')
plt.hlines(-.50,24.5,34.5,lw=1.5,colors='red')
plt.hlines(2.5,24.5,34.5,lw=1.5,colors='red')
plt.hlines(-.50,3.5,7.5,lw=1.5,colors='red')
plt.hlines(2.5,3.5,7.5,lw=1.5,colors='red')
plt.xlim(-.5,35.5)
plt.xticks([0,1,2,5.5,9,10,12,13,21,24,30],
['dnaA','lprO','zmp1','mmpR','Rv1097c','Rv1371','ephB','mshC','Rv3195','lpdA','glpK'],
rotation=60,size=tick_size)
d = plt.subplot2grid((20,28), (12,0), rowspan=8, colspan=20)
#count how many times each locus is present in the rest of the timepoints
COUNTINGS = Counter(list(ALL.LOCUS[(ALL.PATIENT_ID=='Patient12')&(ALL.TIME!=0)]))
PROPORTIONS = [COUNTINGS.get(x,0)*20 for x in P12_T0_FREQ.index]
plt.vlines(24.5,0,100,lw=1.5,colors='red',linestyles='dotted')
plt.vlines(34.5,0,100,lw=1.5,colors='red',linestyles='dotted')
plt.vlines(3.5,0,100,lw=1.5,colors='red',linestyles='dotted')
plt.vlines(7.5,0,100,lw=1.5,colors='red',linestyles='dotted')
plt.bar(np.arange(-0.5,35,1),
PROPORTIONS,
bottom=-.7, width=1, color='gold',lw=1.5)
plt.ylabel('Number of timepoints\nwhere allele detected', size=label_size)
plt.xlim(-.5,35.5)
plt.ylim(0,100)
plt.yticks(np.arange(20,100,20),
['2', '3', '4', '5'],
size=tick_size)
plt.xticks(np.arange(0,36,5),[])
plt.text(5.5,90,'mmpR',horizontalalignment='center',
size=text_size,color='red')
plt.text(30,90,'glpK',horizontalalignment='center',
size=text_size,color='red')
d.spines['right'].set_visible(False)
d.spines['top'].set_visible(False)
# Only show ticks on the left and bottom spines
d.yaxis.set_ticks_position('left')
d.xaxis.set_ticks_position('bottom')
plt.xticks([0,1,2,5.5,9,10,12,13,21,24,30],
rotation=60,size=tick_size)
plt.xlabel('v-SNPs', size=label_size)
c=plt.subplot2grid((20,28), (0,20), rowspan=5, colspan=8)
pattern = 0
for snp in np.array(RECURRENTS)[SNP_PATTERN==pattern]:
time = [0,2,4,6,8]
log_freqs = [-5.5,]*5 #2^-8 roughly our limit of detection
for t,f in zip(list(ALL.TIME[ALL.LOCUS==snp]), list(ALL.LOG_FREQ[ALL.LOCUS==snp])):
place = time.index(t)
log_freqs[place] = f
patient = np.unique(ALL.PATIENT_ID[ALL.LOCUS==snp])[0] #get PATIENT_ID
missing = PATIENT_DATA['MISSING'][PATIENT_DATA['PATIENTS'].index(patient)] #remove missing timepoints
if missing:
time.pop(missing-1)
log_freqs.pop(missing-1)
plt.plot(time, log_freqs, '-', color='limegreen', lw=1.5, alpha=0.15)
if snp==7570:
plt.plot(time, log_freqs, '-', color='limegreen', lw=1.5)
plt.xticks(np.arange(0,10,2),[], size=tick_size)
plt.yticks(np.log([0.01,0.1,0.5]),['1%', '10%', '50%'], size=tick_size)
plt.title('Ascending', size=title_size, loc='left')
plt.hlines(np.log(0.5), 0, 8, linestyles='dotted')
plt.hlines(np.log(0.1), 0, 8, linestyles='dotted')
plt.hlines(np.log(0.01), 0, 8, linestyles='dotted')
plt.ylabel('Allele frequency',size=label_size)
c=plt.subplot2grid((20,28), (5,20), rowspan=5, colspan=8)
pattern = 1
for snp in np.array(RECURRENTS)[SNP_PATTERN==pattern]:
time = [0,2,4,6,8]
log_freqs = [-5.5,]*5 #e^-5.5 roughly our limit of detection
for t,f in zip(list(ALL.TIME[ALL.LOCUS==snp]), list(ALL.LOG_FREQ[ALL.LOCUS==snp])):
place = time.index(t)
log_freqs[place] = f
patient = np.unique(ALL.PATIENT_ID[ALL.LOCUS==snp])[0] #get PATIENT_ID
missing = PATIENT_DATA['MISSING'][PATIENT_DATA['PATIENTS'].index(patient)] #remove missing timepoints
if missing:
time.pop(missing-1)
log_freqs.pop(missing-1)
plt.plot(time, log_freqs, '-', color='orangered', lw=1.5, alpha=0.15)
if snp== 2892974:
plt.plot(time, log_freqs, '-', color='orangered', lw=1.5)
plt.xticks(np.arange(0,10,2),[0,2,4,6,8], size=tick_size)
plt.yticks(np.log([0.01,0.1,0.5]),['1%', '10%', '50%'], size=tick_size)
plt.title('Descending', size=title_size, loc='left')
plt.hlines(np.log(0.5), 0, 8, linestyles='dotted')
plt.hlines(np.log(0.1), 0, 8, linestyles='dotted')
plt.hlines(np.log(0.01), 0, 8, linestyles='dotted')
plt.ylabel('Allele frequency',size=label_size)
c=plt.subplot2grid((20,28), (10,20), rowspan=5, colspan=8)
pattern = 2
for snp in np.array(RECURRENTS)[SNP_PATTERN==pattern]:
time = [0,2,4,6,8]
log_freqs = [-5.5,]*5 #2^-8 roughly our limit of detection
for t,f in zip(list(ALL.TIME[ALL.LOCUS==snp]), list(ALL.LOG_FREQ[ALL.LOCUS==snp])):
place = time.index(t)
log_freqs[place] = f
patient = np.unique(ALL.PATIENT_ID[ALL.LOCUS==snp])[0] #get PATIENT_ID
missing = PATIENT_DATA['MISSING'][PATIENT_DATA['PATIENTS'].index(patient)] #remove missing timepoints
if missing:
time.pop(missing-1)
log_freqs.pop(missing-1)
plt.plot(time, log_freqs, '-', color='dodgerblue', lw=1.5, alpha=0.15)
if snp== 1543917:
plt.plot(time, log_freqs, '-', color='dodgerblue', lw=1.5)
plt.xticks(np.arange(0,10,2),[], size=tick_size)
plt.yticks(np.log([0.01,0.1,0.5]),['1%', '10%', '50%'], size=tick_size)
plt.title('Constant', size=title_size, loc='left')
plt.hlines(np.log(0.5), 0, 8, linestyles='dotted')
plt.hlines(np.log(0.1), 0, 8, linestyles='dotted')
plt.hlines(np.log(0.01), 0, 8, linestyles='dotted')
plt.ylabel('Allele frequency',size=label_size)
c=plt.subplot2grid((20,28), (15,20), rowspan=5, colspan=8)
pattern = 3
for snp in np.array(RECURRENTS)[SNP_PATTERN==pattern]:
time = [0,2,4,6,8]
log_freqs = [-5.5,]*5 #2^-8 roughly our limit of detection
for t,f in zip(list(ALL.TIME[ALL.LOCUS==snp]), list(ALL.LOG_FREQ[ALL.LOCUS==snp])):
place = time.index(t)
log_freqs[place] = f
patient = np.unique(ALL.PATIENT_ID[ALL.LOCUS==snp])[0] #get PATIENT_ID
missing = PATIENT_DATA['MISSING'][PATIENT_DATA['PATIENTS'].index(patient)] #remove missing timepoints
if missing:
time.pop(missing-1)
log_freqs.pop(missing-1)
plt.plot(time, log_freqs, '-', color='orange', lw=1.5, alpha=0.15)
if snp== 4232644:
plt.plot(time, log_freqs, '-', color='orange', lw=1.5)
plt.xticks(np.arange(0,10,2),[0,2,4,6,8], size=tick_size)
plt.yticks(np.log([0.01,0.1,0.5]),['1%', '10%', '50%'], size=tick_size)
plt.title('Sporadic', size=title_size, loc='left')
plt.hlines(np.log(0.5), 0, 8, linestyles='dotted')
plt.hlines(np.log(0.1), 0, 8, linestyles='dotted')
plt.hlines(np.log(0.01), 0, 8, linestyles='dotted')
plt.xlabel('Time(weeks)',size=label_size)
plt.ylabel('Allele frequency',size=label_size)
plt.tight_layout()
#plt.savefig('../reports/figures/3_Combined.pdf')
#plt.savefig('../reports/figures/3_Combined',dpi=1200)
During the review process it came up that the above figure may not be sufficiently clear and that the lower left panel may be best represented as a presence/absence picture, like the middle panel. To do this I would have to first extract the relevant infomation from the PATIENT_DATA dictionary.
In [100]:
#Get patient index
PATIENT_INDEX = PATIENT_DATA['PATIENTS'].index('Patient12')
#Generate recipient array
P12_OTHER_TIMEPOINTS = np.zeros(shape=(36,4))
#Extract frequencies for a given allele at other times
for ind,x in enumerate(np.array(P12_T0_FREQ.index)):
place = np.where(PATIENT_DATA['UNFIXED_ARRAYS'][PATIENT_INDEX][:,0]==x)[0]
if place: P12_OTHER_TIMEPOINTS[ind]+=PATIENT_DATA['UNFIXED_ARRAYS'][PATIENT_INDEX][place[0],2:]
#Generate a presence absence matrix
GRID = np.array(P12_OTHER_TIMEPOINTS>0, dtype=int)
#Plot
plt.imshow(GRID.T+(GRID.T*recurrent), interpolation='none', cmap='viridis')
plt.yticks(range(4), ['2', '4', '6', '8'])
Out[100]:
In [115]:
plt.figure(figsize=(6.7,5))
c=plt.subplot2grid((20,28), (0,0), rowspan=8, colspan=20)
plt.vlines(24.5,0,50,lw=1.5,colors='red')
plt.vlines(34.5,0,50,lw=1.5,colors='red')
plt.vlines(3.5,0,50,lw=1.5,colors='red')
plt.vlines(7.5,0,50,lw=1.5,colors='red')
plt.bar(np.arange(-0.5,35,1),
np.array(P12_T0_FREQ[[0,1,2]].sum(axis=1))/np.array(P12_T0_FREQ[[0,1,2]]>0).sum(axis=1),
bottom=-.7, width=1, color='white',lw=1.5)
#plt.vlines(np.arange(0,35.5,1),
# np.array(P12_T0_FREQ[[0,1,2]].sum(axis=1))/np.array(P12_T0_FREQ[[0,1,2]]>0).sum(axis=1)\
# +np.std(P12_T0_FREQ[[0,1,2]], axis=1),
# np.array(P12_T0_FREQ[[0,1,2]].sum(axis=1))/np.array(P12_T0_FREQ[[0,1,2]]>0).sum(axis=1)\
# -np.std(P12_T0_FREQ[[0,1,2]], axis=1),
# )
plt.ylabel('Mean detected allele\n frequency at enrollment', size=label_size)
plt.xlim(-.5,35.5)
plt.ylim(0,50)
plt.yticks(np.arange(0,60,10),
['0%', '10%', '20%', '30%', '40%', '50%'],
size=tick_size)
plt.xticks(np.arange(0,36,5),[])
plt.text(5.5,40,'mmpR',horizontalalignment='center',
size=text_size,color='red')
plt.text(30,40,'glpK',horizontalalignment='center',
size=text_size,color='red')
c.spines['right'].set_visible(False)
c.spines['top'].set_visible(False)
# Only show ticks on the left and bottom spines
c.yaxis.set_ticks_position('left')
c.xaxis.set_ticks_position('bottom')
plt.xticks([0,1,2,5.5,9,10,12,13,21,24,30],
rotation=60,size=tick_size)
plt.subplot2grid((20,28), (8,0), rowspan=4, colspan=20)
plt.imshow( (present+(present.T*recurrent).T).T, interpolation='none', aspect='equal', cmap='viridis')
plt.yticks([0,1,2],['1','2','3'],size=tick_size)
plt.ylabel('Sputum\nSample', size=label_size)
plt.vlines(24.5,-.50,2.5,lw=1.5,colors='red')
plt.vlines(34.5,-.50,2.5,lw=1.5,colors='red')
plt.vlines(3.5,-.50,2.5,lw=1.5,colors='red')
plt.vlines(7.5,-.50,2.5,lw=1.5,colors='red')
plt.hlines(-.50,24.5,34.5,lw=3,colors='red')
plt.hlines(2.5,24.5,34.5,lw=3,colors='red')
plt.hlines(-.50,3.5,7.5,lw=3,colors='red')
plt.hlines(2.5,3.5,7.5,lw=3,colors='red')
plt.xlim(-.5,35.5)
plt.xticks([0,1,2,5.5,9,10,12,13,21,24,30],
['dnaA','lprO','zmp1','mmpR','Rv1097c','Rv1371','ephB','mshC','Rv3195','lpdA','glpK'],
rotation=60,size=tick_size)
d = plt.subplot2grid((20,28), (12,0), rowspan=8, colspan=20)
#count how many times each locus is present in the rest of the timepoints
plt.imshow(GRID.T+(GRID.T*recurrent), interpolation='none', aspect='auto', cmap='viridis')
plt.vlines(24.5,-0.5,3.5,lw=1.5, colors='red')
plt.vlines(34.5,-0.5,3.5,lw=1.5, colors='red')
plt.vlines(3.5,-0.5,3.5,lw=1.5, colors='red')
plt.vlines(7.5,-0.5,3.5,lw=1.5, colors='red')
plt.hlines(.50,-.5,35.5,lw=1,colors='red', linestyles='dotted')
plt.hlines(1.50,-.5,35.5,lw=1,colors='red', linestyles='dotted')
plt.hlines(2.5,-.5,35.5,lw=1,colors='red', linestyles='dotted')
plt.ylabel('Sputum Samples\n(Weeks post-enrollment)', size=label_size)
plt.xlim(-.5,35.5)
plt.yticks(range(4), ['2', '4', '6', '8'], size=tick_size)
plt.xticks(np.arange(0,36,5),[])
plt.text(5.5,90,'mmpR',horizontalalignment='center',
size=text_size,color='red')
plt.text(30,90,'glpK',horizontalalignment='center',
size=text_size,color='red')
d.spines['right'].set_visible(False)
d.spines['top'].set_visible(False)
# Only show ticks on the left and bottom spines
d.yaxis.set_ticks_position('left')
d.xaxis.set_ticks_position('bottom')
plt.xticks([0,1,2,5.5,9,10,12,13,21,24,30],
rotation=60,size=tick_size)
plt.xlabel('v-SNPs', size=label_size)
c=plt.subplot2grid((20,28), (0,20), rowspan=5, colspan=8)
pattern = 0
for snp in np.array(RECURRENTS)[SNP_PATTERN==pattern]:
time = [0,2,4,6,8]
log_freqs = [-5.5,]*5 #2^-8 roughly our limit of detection
for t,f in zip(list(ALL.TIME[ALL.LOCUS==snp]), list(ALL.LOG_FREQ[ALL.LOCUS==snp])):
place = time.index(t)
log_freqs[place] = f
patient = np.unique(ALL.PATIENT_ID[ALL.LOCUS==snp])[0] #get PATIENT_ID
missing = PATIENT_DATA['MISSING'][PATIENT_DATA['PATIENTS'].index(patient)] #remove missing timepoints
if missing:
time.pop(missing-1)
log_freqs.pop(missing-1)
plt.plot(time, log_freqs, '-', color='limegreen', lw=1.5, alpha=0.15)
if snp==7570:
plt.plot(time, log_freqs, '-', color='limegreen', lw=1.5)
plt.xticks(np.arange(0,10,2),[], size=tick_size)
plt.yticks(np.log([0.01,0.1,0.5]),['1%', '10%', '50%'], size=tick_size)
plt.title('Ascending', size=title_size, loc='left')
plt.hlines(np.log(0.5), 0, 8, linestyles='dotted')
plt.hlines(np.log(0.1), 0, 8, linestyles='dotted')
plt.hlines(np.log(0.01), 0, 8, linestyles='dotted')
plt.ylabel('Allele frequency',size=label_size)
c=plt.subplot2grid((20,28), (5,20), rowspan=5, colspan=8)
pattern = 1
for snp in np.array(RECURRENTS)[SNP_PATTERN==pattern]:
time = [0,2,4,6,8]
log_freqs = [-5.5,]*5 #e^-5.5 roughly our limit of detection
for t,f in zip(list(ALL.TIME[ALL.LOCUS==snp]), list(ALL.LOG_FREQ[ALL.LOCUS==snp])):
place = time.index(t)
log_freqs[place] = f
patient = np.unique(ALL.PATIENT_ID[ALL.LOCUS==snp])[0] #get PATIENT_ID
missing = PATIENT_DATA['MISSING'][PATIENT_DATA['PATIENTS'].index(patient)] #remove missing timepoints
if missing:
time.pop(missing-1)
log_freqs.pop(missing-1)
plt.plot(time, log_freqs, '-', color='orangered', lw=1.5, alpha=0.15)
if snp== 2892974:
plt.plot(time, log_freqs, '-', color='orangered', lw=1.5)
plt.xticks(np.arange(0,10,2),[0,2,4,6,8], size=tick_size)
plt.yticks(np.log([0.01,0.1,0.5]),['1%', '10%', '50%'], size=tick_size)
plt.title('Descending', size=title_size, loc='left')
plt.hlines(np.log(0.5), 0, 8, linestyles='dotted')
plt.hlines(np.log(0.1), 0, 8, linestyles='dotted')
plt.hlines(np.log(0.01), 0, 8, linestyles='dotted')
plt.ylabel('Allele frequency',size=label_size)
c=plt.subplot2grid((20,28), (10,20), rowspan=5, colspan=8)
pattern = 2
for snp in np.array(RECURRENTS)[SNP_PATTERN==pattern]:
time = [0,2,4,6,8]
log_freqs = [-5.5,]*5 #2^-8 roughly our limit of detection
for t,f in zip(list(ALL.TIME[ALL.LOCUS==snp]), list(ALL.LOG_FREQ[ALL.LOCUS==snp])):
place = time.index(t)
log_freqs[place] = f
patient = np.unique(ALL.PATIENT_ID[ALL.LOCUS==snp])[0] #get PATIENT_ID
missing = PATIENT_DATA['MISSING'][PATIENT_DATA['PATIENTS'].index(patient)] #remove missing timepoints
if missing:
time.pop(missing-1)
log_freqs.pop(missing-1)
plt.plot(time, log_freqs, '-', color='dodgerblue', lw=1.5, alpha=0.15)
if snp== 1543917:
plt.plot(time, log_freqs, '-', color='dodgerblue', lw=1.5)
plt.xticks(np.arange(0,10,2),[], size=tick_size)
plt.yticks(np.log([0.01,0.1,0.5]),['1%', '10%', '50%'], size=tick_size)
plt.title('Constant', size=title_size, loc='left')
plt.hlines(np.log(0.5), 0, 8, linestyles='dotted')
plt.hlines(np.log(0.1), 0, 8, linestyles='dotted')
plt.hlines(np.log(0.01), 0, 8, linestyles='dotted')
plt.ylabel('Allele frequency',size=label_size)
c=plt.subplot2grid((20,28), (15,20), rowspan=5, colspan=8)
pattern = 3
for snp in np.array(RECURRENTS)[SNP_PATTERN==pattern]:
time = [0,2,4,6,8]
log_freqs = [-5.5,]*5 #2^-8 roughly our limit of detection
for t,f in zip(list(ALL.TIME[ALL.LOCUS==snp]), list(ALL.LOG_FREQ[ALL.LOCUS==snp])):
place = time.index(t)
log_freqs[place] = f
patient = np.unique(ALL.PATIENT_ID[ALL.LOCUS==snp])[0] #get PATIENT_ID
missing = PATIENT_DATA['MISSING'][PATIENT_DATA['PATIENTS'].index(patient)] #remove missing timepoints
if missing:
time.pop(missing-1)
log_freqs.pop(missing-1)
plt.plot(time, log_freqs, '-', color='orange', lw=1.5, alpha=0.15)
if snp== 4232644:
plt.plot(time, log_freqs, '-', color='orange', lw=1.5)
plt.xticks(np.arange(0,10,2),[0,2,4,6,8], size=tick_size)
plt.yticks(np.log([0.01,0.1,0.5]),['1%', '10%', '50%'], size=tick_size)
plt.title('Sporadic', size=title_size, loc='left')
plt.hlines(np.log(0.5), 0, 8, linestyles='dotted')
plt.hlines(np.log(0.1), 0, 8, linestyles='dotted')
plt.hlines(np.log(0.01), 0, 8, linestyles='dotted')
plt.xlabel('Time(weeks)',size=label_size)
plt.ylabel('Allele frequency',size=label_size)
plt.tight_layout()
plt.savefig('../reports/figures/3a_Combined.pdf')
plt.savefig('../reports/figures/3a_Combined',dpi=1200)
In [33]:
plt.figure(figsize=(8,8))
for patient_of_interest in PATIENT_DATA['PATIENTS']:
ind = int(patient_of_interest[-2:])
plt.subplot(4, 3, ind)
for (snp,pattern) in zip(np.array(RECURRENTS),SNP_PATTERN):
patcol = {0:'limegreen', 1:'orangered', 2:'dodgerblue', 3:'orange'}
time = [0,2,4,6,8]
log_freqs = [-5.5,]*5 #2^-8 roughly our limit of detection
for t,f in zip(list(ALL.TIME[ALL.LOCUS==snp]), list(ALL.LOG_FREQ[ALL.LOCUS==snp])):
place = time.index(t)
log_freqs[place] = f
patient = np.unique(ALL.PATIENT_ID[ALL.LOCUS==snp])[0] #get PATIENT_ID
missing = PATIENT_DATA['MISSING'][PATIENT_DATA['PATIENTS'].index(patient)] #remove missing timepoints
if missing:
time.pop(missing-1)
log_freqs.pop(missing-1)
if patient == patient_of_interest and patient != 'Patient02':
plt.plot(time, log_freqs, '-', color=patcol[pattern], lw=2, alpha=0.9)
if patient == patient_of_interest and patient == 'Patient02':
plt.plot(time[:-1], log_freqs[:-1], '-', color=patcol[pattern], lw=2, alpha=0.9)
plt.xticks(np.arange(0,10,2),[0,2,4,6,8])
plt.yticks(np.log([0.01,0.1,0.5,1.]),['1%', '10%', '50%',''])
plt.ylim(np.log(0.004),np.log(1.004))
plt.title('{0}nt {1}'.format(patient_of_interest.split('nt')[0], patient_of_interest.split('nt')[1]),
size=14, loc='left')
plt.hlines(np.log(0.5), 0, 8, linestyles='dotted')
plt.hlines(np.log(0.1), 0, 8, linestyles='dotted')
plt.hlines(np.log(0.01), 0, 8, linestyles='dotted')
#plt.xlabel('Time(weeks)',size=14)
plt.tight_layout()
#plt.savefig('../reports/figures/3_supplement_trajectories.pdf')
#plt.savefig('../reports/figures/3_supplement_trajectories',dpi=300)
In [32]:
DR_set = ['Rv0006', 'Rv0026', 'Rv0402c', 'Rv0404', 'Rv0425c',
'Rv0658c', 'Rv1129c', 'Rv1181', 'Rv1319c', 'Rv1486c',
'Rv1895', 'Rv2764c', 'Rv2931', 'Rv2946c', 'Rv3795']
plt.figure(figsize=(8,8))
for patient_of_interest in PATIENT_DATA['PATIENTS']:
ind = int(patient_of_interest[-2:])
plt.subplot(4, 3, ind)
for snp, snp_type, snp_gene in zip(ALL.LOCUS[(ALL.PATIENT_ID==patient_of_interest)&(ALL.RECURRENT==1)],
ALL.SNP_TYPE[(ALL.PATIENT_ID==patient_of_interest)&(ALL.RECURRENT==1)],
ALL.GENE[(ALL.PATIENT_ID==patient_of_interest)&(ALL.RECURRENT==1)]):
pat = {'NSY':'-', 'SYN':'--', 'IGR':':', '-':'-.'}
col = ['grey','red'][int(snp_gene in DR_set)]
time = [0,2,4,6,8]
log_freqs = [-5.5,]*5 #2^-8 roughly our limit of detection
for t,f in zip(list(ALL.TIME[ALL.LOCUS==snp]), list(ALL.LOG_FREQ[ALL.LOCUS==snp])):
place = time.index(t)
log_freqs[place] = f
missing = PATIENT_DATA['MISSING'][PATIENT_DATA['PATIENTS'].index(patient_of_interest)] #remove missing timepoints
if missing:
time.pop(missing-1)
log_freqs.pop(missing-1)
if patient_of_interest != 'Patient02':
plt.plot(time, log_freqs,
'{}'.format(pat[snp_type]),
color=col, lw=2, alpha=0.9)
if patient_of_interest == 'Patient02':
plt.plot(time[:-1], log_freqs[:-1],
'{}'.format(pat[snp_type]),
color=col, lw=2, alpha=0.9)
plt.title('{0}nt {1}'.format(patient_of_interest.split('nt')[0], patient_of_interest.split('nt')[1]),
size=14, loc='left')
if patient_of_interest!='Patient08':
plt.hlines(np.log(0.5), 0, 8, linestyles='dotted')
plt.hlines(np.log(0.1), 0, 8, linestyles='dotted')
plt.hlines(np.log(0.01), 0, 8, linestyles='dotted')
plt.xticks(np.arange(0,10,2),[0,2,4,6,8])
plt.yticks(np.log([0.01,0.1,0.5,1.]),['1%', '10%', '50%',''])
plt.ylim(np.log(0.004),np.log(1.004))
if patient_of_interest=='Patient08':
plt.ylim(0,1)
plt.yticks([0,1],[])
plt.xlim(0,8)
plt.xticks(np.arange(0,10,2),[0,2,4,6,8])
plt.hlines(0.85,0.5,1.8,'red', lw=2)
plt.text(2, 0.85, 'DR-associated', size=12, verticalalignment='center', horizontalalignment='left')
plt.hlines(0.7,0.5,1.8,'grey', lw=2)
plt.text(2, 0.7, 'Not DR-associated', size=12, verticalalignment='center', horizontalalignment='left')
plt.hlines(0.4,0.5,1.8, lw=2)
plt.text(2, 0.4, 'Nonsynonymous', size=12, verticalalignment='center', horizontalalignment='left')
plt.hlines(0.25,0.5,1.8, linestyles='dashed', lw=2)
plt.text(2, 0.25, 'Synonymous', size=12, verticalalignment='center', horizontalalignment='left')
plt.hlines(0.10,0.5,1.8, linestyles='dotted', lw=2)
plt.text(2, 0.10, 'IGR', size=12, verticalalignment='center', horizontalalignment='left')
#plt.xlabel('Time(weeks)',size=14)
plt.tight_layout()
#plt.savefig('../reports/figures/3_supplement_trajectories_DR.pdf')
#plt.savefig('../reports/figures/3_supplement_trajectories_DR',dpi=300)
In [13]:
nohs = {}
click = 0
for patient in np.unique(ALL.PATIENT_ID):
_c = Counter(ALL.TIME[(ALL.PATIENT_ID==patient)])
_r = Counter(ALL.TIME[(ALL.PATIENT_ID==patient)&(ALL.RECURRENT==1)])
for k,v in _c.items():
nohs[click] = {'PATIENT_ID': patient,
'TIME': k, 'NOH': v,
'NON_EFFICACIOUS': int(int(patient[-2:])>8),
'RECURRENT': _r[k]}
click+=1
click+=1;nohs[click] = {'PATIENT_ID': 'Patient01', 'TIME': 8, 'NOH': 0, 'NON_EFFICACIOUS': 0, 'RECURRENT': 0}
click+=1;nohs[click] = {'PATIENT_ID': 'Patient03', 'TIME': 0, 'NOH': 0, 'NON_EFFICACIOUS': 0, 'RECURRENT': 0}
click+=1;nohs[click] = {'PATIENT_ID': 'Patient03', 'TIME': 8, 'NOH': 0, 'NON_EFFICACIOUS': 0, 'RECURRENT': 0}
click+=1;nohs[click] = {'PATIENT_ID': 'Patient08', 'TIME': 0, 'NOH': 0, 'NON_EFFICACIOUS': 0, 'RECURRENT': 0}
click+=1;nohs[click] = {'PATIENT_ID': 'Patient08', 'TIME': 2, 'NOH': 0, 'NON_EFFICACIOUS': 0, 'RECURRENT': 0}
COUNTED_vSNPS = pd.DataFrame(nohs).T
COUNTED_vSNPS.sort('TIME', inplace=True)
COUNTED_vSNPS['NON_EFFICACIOUS'] = pd.to_numeric(COUNTED_vSNPS['NON_EFFICACIOUS'])
COUNTED_vSNPS['NOH'] = pd.to_numeric(COUNTED_vSNPS['NOH'])
In [65]:
plt.figure(figsize=(10,8))
for ind,patient in enumerate(np.unique(ALL.PATIENT_ID)):
plt.subplot(3,4,ind+1)
plt.plot(COUNTED_vSNPS.TIME[COUNTED_vSNPS.PATIENT_ID==patient],
COUNTED_vSNPS.NOH[COUNTED_vSNPS.PATIENT_ID==patient],
'-', color='dodgerblue', marker='o', ms=12, mew=1.5,
mfc='none', mec='dodgerblue', lw=2)
plt.ylim(0,60)
plt.xticks([0,2,4,6,8],size=12)
plt.yticks(np.arange(0,60,25),size=12)
plt.title(patient, size=14, loc='left')
if ind==4:
plt.ylabel('Number of vSNPs', size=18)
plt.tight_layout()
plt.subplot(3,4,10)
plt.xlabel('Time (weeks post treatment)', size=18, horizontalalignment='left')
Out[65]:
In [57]:
model = sm.Logit.from_formula('NON_EFFICACIOUS ~ NOH', data=COUNTED_vSNPS)
result = model.fit()
print(result.summary())
In [60]:
plt.figure(figsize=(8,8))
plt.boxplot([COUNTED_vSNPS.NOH[COUNTED_vSNPS.NON_EFFICACIOUS==0],
COUNTED_vSNPS.NOH[COUNTED_vSNPS.NON_EFFICACIOUS==1]])
plt.xticks([1,2],['4+ DRUGS', '<4 DRUGS'], size=18)
plt.yticks(size=12)
plt.ylim(-1,30)
plt.ylabel('Number of vSNPs', size=18)
Out[60]:
In [62]:
print(ss.levene(COUNTED_vSNPS.NOH[COUNTED_vSNPS.NON_EFFICACIOUS==0],
COUNTED_vSNPS.NOH[COUNTED_vSNPS.NON_EFFICACIOUS==1]))
print(ss.ttest_ind(COUNTED_vSNPS.NOH[COUNTED_vSNPS.NON_EFFICACIOUS==0],
COUNTED_vSNPS.NOH[COUNTED_vSNPS.NON_EFFICACIOUS==1],
equal_var=False))
In [15]:
plt.figure(figsize=(12,8))
for (patient,loc,col) in zip(np.unique(ALL.PATIENT_ID),
[(0,0),(0,4),(0,8),(0,12),(2,0),(2,4),(2,8),(2,12),(4,0),(4,4),(4,8),(4,12)],
['red',]*8+['grey',]*4):
plt.subplot2grid((6,23),loc,colspan=4,rowspan=2)
plt.plot(COUNTED_vSNPS.TIME[COUNTED_vSNPS.PATIENT_ID==patient],
COUNTED_vSNPS.NOH[COUNTED_vSNPS.PATIENT_ID==patient],
'-', color=col, marker='o', ms=10, mew=1.5,
mfc='none', mec=col, lw=2)
plt.ylim(0,60)
plt.xticks([0,2,4,6,8],size=12)
plt.yticks(np.arange(0,60,25),size=12)
plt.title(patient, size=14, loc='left')
if patient=='Patient05':
plt.ylabel('Number of v-SNPs', size=18)
if loc[0]==4:
plt.xlabel('Time (weeks)', size=18)
plt.subplot2grid((6,23),(0,17),colspan=6,rowspan=6)
bp = plt.boxplot([COUNTED_vSNPS.NOH[COUNTED_vSNPS.NON_EFFICACIOUS==0],
COUNTED_vSNPS.NOH[COUNTED_vSNPS.NON_EFFICACIOUS==1]],
patch_artist=True, widths=.5)
for box in bp['boxes']:
box.set(color='black',lw=2)
for whisker in bp['whiskers']:
whisker.set(color='black', linestyle='solid',lw=2)
for median in bp['medians']:
median.set(color='black', lw=2)
for cap in bp['caps']:
cap.set(color='black',lw=2)
for flier in bp['fliers']:
flier.set(marker='+', color='black', mew=1.5, mfc='white', ms=10)
bp['boxes'][0].set(facecolor = 'red')
bp['boxes'][1].set(facecolor = 'grey')
plt.xticks([1,2],['4+ drugs', '<4 drugs'], size=14)
plt.ylabel('Number of v-SNPs', size=18)
plt.xlabel('Patient group', size=18)
plt.yticks(size=12)
plt.ylim(-1,55)
plt.hlines(26,1,2,lw=2)
plt.text(1.5, 27,'p = 0.0012', size=14,horizontalalignment='center')
plt.tight_layout()
#plt.savefig('../reports/figures/3_supplement_noh.pdf')
#plt.savefig('../reports/figures/3_supplement_noh',dpi=300)
In [117]:
COUNTED_vSNPS.NOH[(COUNTED_vSNPS.NON_EFFICACIOUS==0)].describe()
Out[117]:
In [118]:
COUNTED_vSNPS.NOH[(COUNTED_vSNPS.NON_EFFICACIOUS==1)].describe()
Out[118]:
In [113]:
plt.figure()
COUNTED_vSNPS['RECRATIO'] = [np.divide(x,y) for (x,y) in zip(COUNTED_vSNPS.RECURRENT,COUNTED_vSNPS.NOH)]
plt.boxplot([list(COUNTED_vSNPS.RECRATIO[(COUNTED_vSNPS.NON_EFFICACIOUS==0)&(COUNTED_vSNPS.RECRATIO.notnull())]),
list(COUNTED_vSNPS.RECRATIO[(COUNTED_vSNPS.NON_EFFICACIOUS==1)&(COUNTED_vSNPS.RECRATIO.notnull())])])
print(ss.levene(list(COUNTED_vSNPS.RECRATIO[(COUNTED_vSNPS.NON_EFFICACIOUS==0)&(COUNTED_vSNPS.RECRATIO.notnull())]),
list(COUNTED_vSNPS.RECRATIO[(COUNTED_vSNPS.NON_EFFICACIOUS==1)&(COUNTED_vSNPS.RECRATIO.notnull())])))
print(ss.mannwhitneyu(list(COUNTED_vSNPS.RECRATIO[(COUNTED_vSNPS.NON_EFFICACIOUS==0)&(COUNTED_vSNPS.RECRATIO.notnull())]),
list(COUNTED_vSNPS.RECRATIO[(COUNTED_vSNPS.NON_EFFICACIOUS==1)&(COUNTED_vSNPS.RECRATIO.notnull())])))
In [115]:
COUNTED_vSNPS.RECRATIO[(COUNTED_vSNPS.NON_EFFICACIOUS==0)&(COUNTED_vSNPS.RECRATIO.notnull())].describe()
Out[115]:
In [116]:
COUNTED_vSNPS.RECRATIO[(COUNTED_vSNPS.NON_EFFICACIOUS==1)&(COUNTED_vSNPS.RECRATIO.notnull())].describe()
Out[116]:
In [7]:
CCDC5079_genome = 4414325
TIMEPOINT_DATA = {'PATIENT_ID': [],
'TIME': [],
'MEAN_FREQUENCY': [],
'LOG_MEAN_FREQUENCY': [],
'MEDIAN_FREQUENCY': [],
'VARIANCE_FREQUENCY': [],
'vSNP_COUNT': [],
'HETEROZYGOSITY': [],
'DEPTH': [],
'MEAN_SYN_FREQUENCY': [],
'MEAN_NSY_FREQUENCY': [],
'pNS_fSNP': [],
'NEUTRAL_pNS_fSNP': [],
'pNS_vSNP': [],
'NEUTRAL_pNS_vSNP': [],
'NON_EFFICACIOUS': []
}
for _patient in PATIENT_DATA['PATIENTS']:
_timepoints = list(set(ALL.TIME[ALL.PATIENT_ID==_patient]))
for _timepoint in _timepoints:
_frequencies = list(ALL.FREQUENCY[(ALL.TIME==_timepoint)&(ALL.PATIENT_ID==_patient)])
_mean_frequency = np.mean(_frequencies)
_median_frequency = np.median(_frequencies)
_variance = np.var(_frequencies)
_heterozygosity = serf.heterozygosity(np.array(_frequencies),
CCDC5079_genome)
_effective = int(_patient in ['Patient09', 'Patient10', 'Patient11', 'Patient12'])
_NSY_mean = np.mean(ALL.FREQUENCY[(ALL.TIME==_timepoint)&(ALL.PATIENT_ID==_patient)&(ALL.SNP_TYPE=='NSY')])
_SYN_mean = np.mean(ALL.FREQUENCY[(ALL.TIME==_timepoint)&(ALL.PATIENT_ID==_patient)&(ALL.SNP_TYPE=='SYN')])
TIMEPOINT_DATA['PATIENT_ID'].append(_patient)
TIMEPOINT_DATA['TIME'].append(_timepoint)
TIMEPOINT_DATA['MEAN_FREQUENCY'].append(_mean_frequency)
TIMEPOINT_DATA['LOG_MEAN_FREQUENCY'].append(np.log(_mean_frequency))
TIMEPOINT_DATA['MEDIAN_FREQUENCY'].append(_median_frequency)
TIMEPOINT_DATA['VARIANCE_FREQUENCY'].append(_variance)
TIMEPOINT_DATA['MEAN_SYN_FREQUENCY'].append(_SYN_mean)
TIMEPOINT_DATA['MEAN_NSY_FREQUENCY'].append(_NSY_mean)
TIMEPOINT_DATA['vSNP_COUNT'].append(len(_frequencies))
TIMEPOINT_DATA['HETEROZYGOSITY'].append(_heterozygosity)
TIMEPOINT_DATA['DEPTH'].append(PATIENT_DATA['DEPTH'][_patient][_timepoint])
TIMEPOINT_DATA['NON_EFFICACIOUS'].append(_effective)
TIMEPOINT_DATA['pNS_fSNP'].append(_effective)
TIMEPOINT_DATA['NEUTRAL_pNS_fSNP'].append(_effective)
TIMEPOINT_DATA['pNS_vSNP'].append(_effective)
TIMEPOINT_DATA['NEUTRAL_pNS_vSNP'].append(_effective)
TIMEPOINT_DF = pd.DataFrame(TIMEPOINT_DATA)
In [4]:
TIMEPOINT_DF = pd.read_csv('../data/processed/CLINICAL_info.csv', index_col=0)
In [5]:
model = sm.MixedLM.from_formula('TIME_TO_POSITIVITY_MGIT ~ TIME * NON_EFFICACIOUS',
data=TIMEPOINT_DF,
groups=TIMEPOINT_DF.PATIENT_ID)
result = model.fit()
print(result.summary())
In [7]:
i = result.params[0]
ni = result.params[0]+result.params[2]
s = result.params[1]
sn = result.params[1]+result.params[3]
plt.figure('timetopos',figsize=(10,6))
plt.subplot(121)
plt.plot(TIMEPOINT_DF.TIME[TIMEPOINT_DF.NON_EFFICACIOUS==0],
TIMEPOINT_DF.TIME_TO_POSITIVITY_MGIT[TIMEPOINT_DF.NON_EFFICACIOUS==0],
'o', mec='black', mfc='white', mew=1., ms=8)
plt.xlim(-1,9)
plt.ylim(0,29)
plt.yticks(size=12)
plt.xticks(size=12)
plt.plot(np.arange(0,9), np.arange(0,9)*s+i, 'k-')
plt.title('4+ drugs', size=18, loc='left')
plt.xlabel('Time post-treatment (weeks)', size=18)
plt.ylabel('Time to culture positivity (days)', size=18)
plt.subplot(122)
plt.plot(TIMEPOINT_DF.TIME[TIMEPOINT_DF.NON_EFFICACIOUS==1],
TIMEPOINT_DF.TIME_TO_POSITIVITY_MGIT[TIMEPOINT_DF.NON_EFFICACIOUS==1],
'o', mec='red', mfc='white', mew=1., ms=8)
plt.xlim(-1,9)
plt.ylim(0,29)
plt.yticks(size=12)
plt.xticks(size=12)
plt.plot(np.arange(0,9), np.arange(0,9)*sn+ni, 'r-')
plt.title('<4 drugs', size=18, loc='left')
plt.xlabel('Time post-treatment (weeks)', size=18)
#plt.savefig('../reports/figures/3_supplement_ttp.pdf')
#plt.savefig('../reports/figures/3_supplement_ttp',dpi=300)
In [14]:
TIMEPOINT_DF['OUTLIER'] = [int(x==50) for x in TIMEPOINT_DF.vSNP_COUNT]
In [18]:
plt.figure(figsize=(8,8))
plt.plot(TIMEPOINT_DF.DEPTH[TIMEPOINT_DF.OUTLIER==0],
TIMEPOINT_DF.vSNP_COUNT[TIMEPOINT_DF.OUTLIER==0],
'o', mec='black', mfc='none', ms=10)
plt.plot(TIMEPOINT_DF.DEPTH[TIMEPOINT_DF.OUTLIER==1],
TIMEPOINT_DF.vSNP_COUNT[TIMEPOINT_DF.OUTLIER==1],
'o', mec='red', mfc='none', ms=10)
plt.ylim(-2,58)
plt.xlim(0,2500)
Out[18]:
In [19]:
model = sm.MixedLM.from_formula('vSNP_COUNT ~ DEPTH',
data=TIMEPOINT_DF,
groups=TIMEPOINT_DF['PATIENT_ID'])
result = model.fit()
print(result.summary())
model = sm.MixedLM.from_formula('vSNP_COUNT ~ DEPTH',
data=TIMEPOINT_DF[TIMEPOINT_DF.OUTLIER!=1],
groups=TIMEPOINT_DF['PATIENT_ID'][TIMEPOINT_DF.OUTLIER!=1])
result_noutlier = model.fit()
print(result_noutlier.summary())
In [40]:
model = sm.MixedLM.from_formula('vSNP_COUNT ~ TIME_TO_POSITIVITY_MGIT',
data=TIMEPOINT_DF,
groups=TIMEPOINT_DF['PATIENT_ID'])
ttp_result = model.fit()
print(ttp_result.summary())
model = sm.MixedLM.from_formula('vSNP_COUNT ~ TIME_TO_POSITIVITY_MGIT',
data=TIMEPOINT_DF[TIMEPOINT_DF.OUTLIER!=1],
groups=TIMEPOINT_DF['PATIENT_ID'][TIMEPOINT_DF.OUTLIER!=1])
ttp_result_noutlier = model.fit()
print(ttp_result_noutlier.summary())
In [46]:
plt.figure(figsize=(12,6))
plt.subplot(121)
plt.title('A', size=18, loc='left')
plt.plot(np.arange(2500),
np.arange(2500)*result.params[1]+result.params[0],
'r:',
label='All data, slope = {0:0.3f}, p = {1:0.3f}'.format(result.params[1],
result.pvalues[1]))
plt.plot(np.arange(2500),
np.arange(2500)*result_noutlier.params[1]+result_noutlier.params[0],
'k:',
label='No outlier, slope = {0:0.3f}, p = {1:0.3f}'.format(result_noutlier.params[1],
result_noutlier.pvalues[1]))
plt.legend(fontsize=14, loc=2)
plt.plot(TIMEPOINT_DF.DEPTH[TIMEPOINT_DF.OUTLIER==0],
TIMEPOINT_DF.vSNP_COUNT[TIMEPOINT_DF.OUTLIER==0],
'o', mec='black', mfc='none', ms=10)
plt.plot(TIMEPOINT_DF.DEPTH[TIMEPOINT_DF.OUTLIER==1],
TIMEPOINT_DF.vSNP_COUNT[TIMEPOINT_DF.OUTLIER==1],
'o', mec='red', mfc='none', ms=10, mew=1.)
plt.ylim(-2,68)
plt.yticks(size=14)
plt.ylabel('Number of v-SNP', size=18)
plt.xlim(0,2500)
plt.xticks(size=14)
plt.xlabel('Coverage depth', size=18)
plt.subplot(122)
plt.title('B', size=18, loc='left')
plt.plot(np.arange(29),
np.arange(29)*ttp_result.params[1]+ttp_result.params[0],
'r:',
label='All data, slope = {0:0.3f}, p = {1:0.3f}'.format(ttp_result.params[1],
ttp_result.pvalues[1]))
plt.plot(np.arange(29),
np.arange(29)*ttp_result_noutlier.params[1]+ttp_result_noutlier.params[0],
'k:',
label='No outlier, slope = {0:0.3f}, p = {1:0.3f}'.format(ttp_result_noutlier.params[1],
ttp_result_noutlier.pvalues[1]))
plt.legend(fontsize=14, loc=2)
plt.plot(TIMEPOINT_DF.TIME_TO_POSITIVITY_MGIT[TIMEPOINT_DF.OUTLIER==0],
TIMEPOINT_DF.vSNP_COUNT[TIMEPOINT_DF.OUTLIER==0],
'o', mec='black', mfc='none', ms=10)
plt.plot(TIMEPOINT_DF.TIME_TO_POSITIVITY_MGIT[TIMEPOINT_DF.OUTLIER==1],
TIMEPOINT_DF.vSNP_COUNT[TIMEPOINT_DF.OUTLIER==1],
'o', mec='red', mfc='none', ms=10, mew=1.)
plt.ylim(-2,68)
plt.yticks(size=14)
plt.xlim(0,29)
plt.xticks(size=14)
plt.xlabel('Time to MGIT positivity (days)', size=18)
#plt.savefig('../reports/figures/3_supplement_depthnoh.pdf')
#plt.savefig('../reports/figures/3_supplement_depthnoh',dpi=300)
Looks like there is a slight correlation between depth and number of vSNPs detected. The slope suggests that on average we observe 1 additional SNP per 100-fold coverage increase. Nonetheless, coverage explains only 15% of the variance in the sample. Therefore, coverage does slightly increase the number of detected vSNPs but this effect is minor.
In [17]:
plt.figure(figsize=(8,8))
TIMEPOINT_DF.boxplot('LOG_MEAN_FREQUENCY', by=['NON_EFFICACIOUS', 'TIME'], figsize=(8,8))
Out[17]:
In [16]:
TIMEPOINT_DF.boxplot('HETEROZYGOSITY', by=['NON_EFFICACIOUS', 'TIME'], figsize=(8,8))
Out[16]:
In [5]:
plt.figure('SFS', figsize=(8,8))
plt.hist(np.array(ALL.FREQUENCY[(ALL.TIME<12)]), bins=np.arange(0,1.01,0.01),
label=r'$M.tuberculosis$ - TB patients',
color='black')
plt.xticks(np.arange(.2,1.2,.2),['20%','40%','60%','80%','100%'], size=12)
plt.yticks(size=14)
plt.xlabel('Allele frequency', size=18)
plt.ylabel('Number of alleles', size=18)
#plt.title('A', loc='left', size=24)
plt.ylim(0,125)
#plt.savefig('../reports/figures/JT_SFS_simple.pdf')
#plt.savefig('../reports/figures/JT_SFS_simple.png',dpi=300)
In [7]:
TL_unfixed = [] #Calculated frequency between 3 and 97% exclusive.
TL_fixed = [] #Calculated frequency greater than 97%.
DONE = False
_tl = []
for line in open('../data/external/150414_Lieberman_SFS.txt'):
split = line.strip().split()
if line[0]=='>':
if DONE:
_tl = np.array(_tl,dtype=float)
TL_unfixed.append(_tl[_tl<0.97])
TL_fixed.append(_tl[_tl>=0.97])
_tl = []
DONE = True
else: _tl.append(split[0])
In [9]:
plt.figure('SFS', figsize=(8,8))
plt.hist(np.array(ALL.FREQUENCY[(ALL.TIME<12)]), bins=np.arange(0,1.01,0.01),
label=r'$M.tuberculosis$ - TB patients',normed=True,
color='black')
plt.hist(np.concatenate(TL_unfixed[:-1]), bins=np.arange(0,1.01,0.01),
label=r'$B.dolosa$ - CF Patients',color='black', normed=True, histtype='step')
plt.legend(fontsize=14)
plt.xticks(size=12)
plt.yticks(size=12)
plt.xlabel('Allele frequency', size=18)
plt.ylabel('Normalised allele weight', size=18)
#plt.savefig('../reports/figures/JT_SFS_TBCF.pdf')
#plt.savefig('../reports/figures/JT_SFS_TBCF.png',dpi=300)
In [ ]: