Introduction

In this notebook, we use PaSDqc to explore the properties of individual chromosomes within a sample and use those properties to classify chromosomes as either copy neutral and well amplified, possible copy gain, possible copy loss, or poorly amplified. We use micronucleated single-cells sequenced by Zhang et al, 2015.

We also demonstrate the convenience functions that PaSDqc includes to perform amplicon distribution estimation. These functions abstract away the technicalities demonstrated in 02_example-basic_PSD.


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.colors
# import matplotlib.gridspec as gridspec
import seaborn as sns
import pathlib

import PaSDqc

%matplotlib inline

In [2]:
sns.set_context('poster')
sns.set_style("ticks", {'ytick.minor.size': 0.0, 'xtick.minor.size': 0.0})

Load the data

(same data used in 01_example_report)


In [3]:
p = pathlib.Path("../01_example_report/psd/")
f_psd_list = sorted(p.glob("*.spec"))

psd_list = [PaSDqc.PSDTools.SamplePSD.load_from_file(str(f), name=f.stem) for f in f_psd_list]
med_list = [PaSDqc.PSDTools.normalize_psd(psd.avg_PSD()) for psd in psd_list]

In [4]:
freq = psd_list[0].freq
period = 1 / freq

Calculate chromosome properties and classify

1. fit erf curves to chrom and sample PSDs


In [5]:
chroms = PaSDqc.extra_tools.chroms_from_build('grch37')
[psd.fit_chrom_curves(chroms) for psd in psd_list]
[psd.fit_sample_curves() for psd in psd_list];

2. Calculate chromosomal properties


In [6]:
[psd.calc_chrom_props(chroms) for psd in psd_list];

In [7]:
psd_list[0].chrom_props


Out[7]:
KL sub supra min_pos median mean lower upper classif
1 0.008277 1.158352 12.567109 1395 9866 15760 2008 48548 Pass
2 0.006165 1.396898 12.350987 1502 9809 15641 2019 47845 Pass
3 0.009098 1.524423 12.977493 1367 10553 16959 2119 52678 Pass
4 0.013150 0.944059 12.037820 0 11147 18855 2070 60529 Pass
5 0.009743 1.198319 11.983481 0 10725 17047 2183 52678 Pass
6 0.009160 1.349496 12.910527 959 10142 15892 2129 48410 Pass
7 0.008828 1.344177 12.859760 0 10103 16623 1944 52086 Pass
8 0.010840 1.388128 13.579427 1111 11871 20570 2113 65782 Pass
9 0.013699 1.019437 11.846513 1288 9602 15394 1966 47338 Pass
10 0.013026 1.155810 13.097158 932 9468 15411 1878 48068 Pass
11 0.016160 1.606248 13.383810 0 10307 17106 1966 53729 Pass
12 0.064690 2.472840 15.123587 0 8497 13672 1711 41826 Possible gain
13 0.020349 0.965291 11.817984 0 11695 20167 2093 65971 Pass
14 0.015783 1.211322 12.232754 0 10182 15660 2203 46533 Pass
15 0.017805 1.240945 13.061316 381 9781 17462 1673 57687 Pass
16 0.018946 1.375675 12.537084 1553 9577 15586 1909 47948 Pass
17 0.021480 1.099367 11.681915 0 9832 14782 2229 43253 Pass
18 0.020370 1.272434 12.683880 1598 12262 19835 2434 61375 Pass
19 0.036121 1.208922 12.435185 509 6245 8561 1689 23030 Fail
20 0.024819 1.304876 12.503999 511 10853 18246 2023 58780 Pass
21 0.088736 1.898078 14.892119 767 9969 15053 2239 44582 Fail
22 0.040483 1.201764 12.406457 1128 9844 16318 1900 51601 Fail

3. Summarize chromosome classifications by sample


In [8]:
idx = [psd.name.split('.')[0] for psd in psd_list]
df_stat = PaSDqc.extra_tools.summarize_chrom_classif_by_sample(psd_list, idx)

In [9]:
df_stat.iloc[0:5, 0:12]


Out[9]:
1 2 3 4 5 6 7 8 9 10 11 12
MN1_a Pass Pass Pass Pass Pass Pass Pass Pass Pass Pass Pass Possible gain
MN1_b Possible loss Pass Pass Pass Pass Pass Pass Pass Pass Pass Pass Possible gain
MN2_a Pass Pass Pass Pass Pass Pass Pass Pass Pass Pass Pass Possible gain
MN2_b Pass Possible loss Pass Pass Pass Pass Pass Pass Pass Pass Pass Possible gain
MN3_a Pass Possible gain Pass Pass Pass Pass Pass Pass Pass Pass Pass Possible gain

4. Calculate chromosome amplicon distribution curves


In [10]:
[psd.infer_chrom_amplicon_dist(chroms) for psd in psd_list];

In [11]:
psd_list[0].chrom_dist.keys()


Out[11]:
dict_keys(['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22'])

Calculate Sample properties and distributions


In [12]:
[psd.calc_sample_props() for psd in psd_list];
[psd.infer_sample_amplicon_dist() for psd in psd_list];

In [13]:
PaSDqc.extra_tools.summarize_sample_props(psd_list, idx)


Out[13]:
amplicon median mean size lower size upper size
MN1_a 10203 16593 2012 51998
MN1_b 10549 16356 2270 48877
MN2_a 9663 16336 1799 52539
MN2_b 8186 15376 1308 52245
MN3_a 10523 17585 1996 55422
MN3_b 9923 16380 1903 51797
MN4_a 8598 14168 1643 44887
MN4_b 9207 15463 1722 49254
MN5_a 9900 16293 1910 51125
MN5_b 10154 15931 2169 48395
MN6_a 9952 17400 1758 56463
MN6_b 9612 17392 1592 57970
MN8_a 9122 16141 1569 53261
MN8_b 9669 16558 1773 53155
MN9_a 9967 17057 1824 55031
MN9_b 9448 17036 1562 56171
MN9_c 10222 17453 1883 55764
MN9_d 9395 16909 1599 56220
MN9_e 9928 17519 1722 56914

Figure 5

1. Load CNV data


In [14]:
df_cnvs = pd.read_table("CNVs_Zhang_2015_MN.txt", index_col=0)
columns = [c for c in df_cnvs.columns if c.startswith('MN')]
cols2 = [c.split('.')[0] for c in columns]
df_cnv2 = df_cnvs.loc[:, columns].iloc[:-2, :] - df_cnvs[columns].iloc[:-2, :].median() # removing sex chromosomes
df_cnv2.index = [c.replace('chr', '') for c in df_cnv2.index]
df_cnv2.columns = cols2

# Mask MN9_d since CNV caller failed
mask = (df_cnv2 == 0)
mask.loc[:, 'MN9_d'] = True

2. Function for heatmap plotting


In [15]:
def plot_cnv_heatmap(df_cnv, ax=None, add_cbar=True, cbar_ax=None):

    if not ax:
        f = plt.figure()
        ax = f.add_subplot(111)

    nd = df_cnv.T.as_matrix()
    np_mask = np.ma.array(nd, mask=mask.T)

    dcp = sns.diverging_palette(255, 133, l=60, n=7)
    dcp_list = [dcp[0], dcp[1], dcp[2], dcp[3], dcp[3], dcp[3], dcp[3], dcp[4], dcp[5], dcp[6]]
    cmap = matplotlib.colors.LinearSegmentedColormap.from_list('mycmap', dcp_list)
    
    cax = ax.imshow(np_mask, aspect='equal', cmap=cmap, interpolation=None, vmax=1, vmin=-1)
    ax.set_yticks(np.arange(0, df_cnv2.shape[1]))
    ax.set_xticks(np.arange(0, df_cnv2.shape[0]))
    ax.set_xticks(np.arange(0.5, df_cnv2.shape[0]+0.5), minor=True)
    ax.set_xlabel('chromosome')

    for y in np.arange(0.5, df_cnv2.shape[1], 1):
        ax.axhline(y, linestyle='--', color='black', linewidth=1)

    ax.set_yticklabels([])
    ax.set_yticklabels([s.replace('_', '') for s in df_cnv2.columns])

    ax.set_xticklabels(df_cnv2.index);
    ax.grid(which='minor', color='black', linewidth=1)

    #colorbar
    if add_cbar:
        cbar = ax.figure.colorbar(cax, ticks=[-1, 0, 1], cax=cbar_ax, orientation='horizontal')
        cbar.ax.set_xticklabels(['Loss', 'Neutral', 'Gain'])
        cbar.ax.xaxis.tick_top()
        cbar.ax.tick_params(axis='x', which='major', pad=0)

3. Extract MN2b data


In [16]:
MN2b = psd_list[3]
MN2b_kl = MN2b.KL_div_by_chrom()
freq = MN2b.ampl.freq['erf']

# Sample curves
MN2b_fit = MN2b.sample_curves['avg']
MN2b_fit_lower = MN2b.sample_curves['lower']
MN2b_fit_upper = MN2b.sample_curves['upper']
MN2b_dist = MN2b.sample_dist

# Chrom curves
MN2b_chr2_fit = MN2b.chrom_curves['2']
MN2b_chr2_dist = MN2b.chrom_dist['2']

MN2b_chr12_fit = MN2b.chrom_curves['12']
MN2b_chr12_dist = MN2b.chrom_dist['12']

MN2b_chr21_fit = MN2b.chrom_curves['21']
MN2b_chr21_dist = MN2b.chrom_dist['21']

In [17]:
# MAKE A GIANT FIGURE
f = plt.figure(figsize=(12, 14))
ax0 = plt.subplot2grid((7, 6), (0, 0), colspan=6, rowspan=2)

ax1 = plt.subplot2grid((7, 6), (2, 0), colspan=3, rowspan=2)
ax2 = plt.subplot2grid((7, 6), (2, 3), colspan=3, rowspan=2)

ax3 = plt.subplot2grid((7, 6), (4, 0), colspan=3, rowspan=3)
ax3_cbar = f.add_axes([0.11, 0.39, 0.375, 0.01])
ax4 = plt.subplot2grid((7, 6), (4, 3), colspan=3, rowspan=3)
ax4_cbar = f.add_axes([0.11+0.48, 0.39, 0.375, 0.01])
cp = sns.color_palette()

## KL divergence plot
PaSDqc.extra_tools.plot_KL_div_by_chrom(MN2b_kl, ax=ax0)

##  Smooth spectrograms
ax1.plot(freq, MN2b_fit, label='Sample\n  Avg')
ax1.fill_between(freq, MN2b_fit_lower, MN2b_fit_upper, color=cp[0], alpha=0.25)
ax1.plot(freq, MN2b_chr2_fit, label='MN2b Chr2 \n   (Loss)', color=cp[5])
ax1.plot(freq, MN2b_chr12_fit, label='MN2b Chr12 \n   (Gain)', color=cp[1])
ax1.plot(freq, MN2b_chr21_fit, label='MN2b Chr21\n (Aberrant)', color=cp[2])
ax1.set_xscale('log')
ax1.set_xticklabels(["0", "100 bp", "1 kb", "10 kb", "100 kb", "1 mb"])
ax1.set_xlabel('Genomic scale')
ax1.set_ylabel('PSD (dB)')
ax1.legend(bbox_to_anchor=(0, 1, 2.25, .102), loc=(0, 0), ncol=5, mode="expand", borderaxespad=0.)

## Distribution Plots
ax2.plot(MN2b_dist['freq'], MN2b_dist['dist'], label='MN3 Sample Avg', color=cp[0])
ax2.plot(MN2b_chr2_dist['freq'], MN2b_chr2_dist['dist'], label='MN3 chr2 (loss)', color=cp[5])
ax2.plot(MN2b_chr12_dist['freq'], MN2b_chr12_dist['dist'], label='MN3 chr12 (gain)', color=cp[1])
ax2.plot(MN2b_chr21_dist['freq'], MN2b_chr21_dist['dist'], label='MN3 chr21 (aberrant)', color=cp[2])

ax2.set_xscale('log')
ax2.set_xticklabels(["0", "100 bp", "1 kb", "10 kb", "100 kb", "1 mb"])
ax2.set_xlabel('Genomic scale')
ax2.set_ylabel('Density')


PaSDqc.extra_tools.plot_chrom_classification(df_stat, ax=ax3, cbar_ax=ax3_cbar)
ax3.tick_params(axis='x', which='major', labelsize=14, pad=5)

plot_cnv_heatmap(df_cnv2, ax=ax4, cbar_ax=ax4_cbar)
ax4.tick_params(axis='x', which='major', labelsize=14, pad=5)

plt.tight_layout(h_pad=2.5)
sns.despine(ax=ax0)
sns.despine(ax=ax1)
sns.despine(ax=ax2)

f.text(0.02, 0.97, "A", weight="bold", horizontalalignment='left', verticalalignment='center', fontsize=20)
f.text(0.02, 0.70, "B", weight="bold", horizontalalignment='left', verticalalignment='center', fontsize=20)
#f.text(0.49, 0.66, "C", weight="bold", horizontalalignment='left', verticalalignment='center', fontsize=20)
f.text(0.02, 0.43, "C", weight="bold", horizontalalignment='left', verticalalignment='center', fontsize=20)
f.text(0.51, 0.43, "D", weight="bold", horizontalalignment='left', verticalalignment='center', fontsize=20)


/home/mas138/anaconda3/envs/test1/lib/python3.6/site-packages/matplotlib/figure.py:1743: UserWarning:

This figure includes Axes that are not compatible with tight_layout, so its results might be incorrect.

Out[17]:
<matplotlib.text.Text at 0x2aae46b48630>

In [ ]: