TEST pyBioPlot package

This notebook contains tests for all function contained in pyBioPlot package



In [1]:

    
# For ipython inline ploting ploting
%matplotlib inline

# Larger display 
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:90% !important; }</style>"))

# Import of required packages
import pandas as pd

# import all function from pyBioPlot
from pyBioPlot import *

RNASeq plots

volcano_plot



In [2]:

    
help(volcano_plot)









    



Help on function volcano_plot in module pyBioPlot:

volcano_plot(df, X, Y, FDR=1, X_cutoff=1, sig_pos_color='0.5', sig_neg_color='0.7', non_sig_color='0.9', highlight_list=[], highlight_palette='Set1', highlight_FDR=None, highlight_min_targets=0, **kwargs)
    Plot a volcano plot with matplot lib from a dataframe in “stacked” or “record” format
    *  df
        Panda dataframe containing the results. Each line corresponds to a single gene/transcript value. Gene/transcript are
        identified by a target_id column. The other covariate columns need to contain the values for X and Y plotting  
    *  X
        Name of the column for X plotting (usually log2FC)
    *  Y
        Name of the column for Y plotting (usually pvalue)
    *  FDR
        false discovery rate cut-off for the Y axis (on the raw value before log transformation for plotting [DEFAULT: 1]
    *  X_cutoff
        value for significance cut-off for the X axis [DEFAULT: 1]
    *  sig_pos_color
        Color of the significant points on the positive side of the X axis [DEFAULT: "0.5"]
    *  sig_neg_color
        Color of the significant points on the negative side of the Y axis [DEFAULT: "0.7"]
    *  non_sig_color
        Color of the non-significant points [DEFAULT: "0.9"]
    *  highlight_list
        List of dictionaries for values to highlight. Each entry contains:
                [mandatory]     "target_id": List or pandas series of target_id matching target_id in the main df
            OR  [mandatory]     "df": A dataframe containing values with X and Y columns can be provided instead of the target_id list
                [facultative]   "color": A valid matplotlib color, else a random color will be attributed
                [facultative]   "label": A str label [DEFAULT: # of the series]
                [facultative]   "alpha": Alpha parameter [DEFAULT: 1]
                [facultative]   "marker": A matplotlib symbol [DEFAULT: "o"]
                example: highlight_list= [
                    {"target_id":["id1","id3"], "color":"red", "label":"s1"}
                    {"target_id":["id4","id7","id9"], "color":"green", "label":"s2", "marker":">", "alpha":0.5}]
    *  highlight_palette
        palette to be used to automatically assign colors to each element of the highlight_list [DEFAULT: "Set1"]
    *  highlight_FDR
        if a value if given the highlight list will be thresholded by the given FDR [DEFAULT: None]
    *  highlight_min_targets
        Mininal number of targets in a series of the highlight list. If below the series will not be plotted  
        [DEFAULT: 0]
    *  kwargs
        Additional parameters for plot appearance derived from pylab basic plot arguments such as:
        figsize, xlim, ylim, title, xlabel, ylabel, bg_color, grid_color, marker, alpha...



In [3]:

    
df = pd.read_table("./test_data/localizationnucleus_wald_test.tsv")
volcano_plot(df, "b", "qval", FDR=0.01, X_cutoff=2, figsize=[10,5], xlim=[-10,10], ylim=[-1, 30], sig_pos_color="red", sig_neg_color="blue",
             non_sig_color="green", xlabel="Beta factor Nucleus localization")



In [4]:

    
df = pd.read_table("./test_data/localizationnucleus_wald_test.tsv")
highlight_list = [
    {"label":"mock", "alpha":0.5},
    {"target_id":df.target_id[(df.tech_var >= 0.2)], "label":"variable samples", "alpha":0.5},
    {"target_id":df.target_id[(df.tech_var >= 0.4)]},
    {"target_id":df.target_id[(df.tech_var >= 0.7)], "color":"black", "label":"very highly variable samples", "marker":"<"}]

volcano_plot(df, "b", "qval", figsize=[10,5], xlim=[-10,10], ylim=[-1, 30], highlight_list= highlight_list,
             alpha=0.25, highlight_FDR=0.05, FDR=0.05, fontsize=5)









    



Target_id list of dataframe required for series #0. Skipping to the next one

MA_plot



In [3]:

    
help(MA_plot)









    



Help on function MA_plot in module pyBioPlot:

MA_plot(df, X, Y, FDR=1, FDR_col='pval', sig_pos_color='0.5', sig_neg_color='0.7', non_sig_color='0.9', highlight_list=[], highlight_palette='Set1', highlight_FDR=None, highlight_min_targets=0, **kwargs)
    Plot a MA plot with matplotlib from a dataframe in “stacked” or “record” format
    *  df
        Panda dataframe containing the results. Each line corresponds to a single gene/transcript value. Gene/transcript are
        identified by a target_id column. The other covariate columns need to contain the values for X and Y plotting  
    *  X
        Name of the column for X plotting (usually Mean expression)
    *  Y
        Name of the column for Y plotting (usually log2FC)
    *  FDR
        False discovery rate cut-off for the Y axis (on the raw value before log transformation for plotting [DEFAULT: 1]
    *  FDR_col
        Name of the column to use to determine the significance cut-off (usually pvalue)
    *  sig_pos_color
        Color of the significant points on the positive side of the X axis [DEFAULT: "0.5"]
    *  sig_neg_color
        Color of the significant points on the negative side of the Y axis [DEFAULT: "0.7"]
    *  non_sig_color
        Color of the non-significant points [DEFAULT: "0.9"]
    *  highlight_list
        List of dictionaries for values to highlight. Each entry contains:
                [mandatory]     "target_id": List or pandas series of target_id matching target_id in the main df
            OR  [mandatory]     "df": A dataframe containing values with X and Y columns can be provided instead of the target_id list
                [facultative]   "color": A valid matplotlib color, else a random color will be attributed
                [facultative]   "label": A str label [DEFAULT: # of the series]
                [facultative]   "alpha": Alpha parameter [DEFAULT: 1]
                [facultative]   "marker": A matplotlib symbol [DEFAULT: "o"]
                example: highlight_list= [
                    {"target_id":["id1","id3"], "color":"red", "label":"s1"}
                    {"target_id":["id4","id7","id9"], "color":"green", "label":"s2", "marker":">", "alpha":0.5}]
    *  highlight_palette
        palette to be used to automatically assign colors to each element of the highlight_list [DEFAULT: "Set1"]
    *  highlight_FDR
        if a value if given the highlight list will be thresholded by the given FDR [DEFAULT: None]
    *  highlight_min_targets
        Mininal number of targets in a series of the highlight list. If below the series will not be plotted  
        [DEFAULT: 0]
    *  kwargs
        Additional parameters for plot appearance derived from pylab basic plot arguments such as:
        figsize, xlim, ylim, title, xlabel, ylabel, bg_color, grid_color...



In [6]:

    
df = pd.read_table("./test_data/localizationnucleus_wald_test.tsv")
MA_plot(df, "mean_obs", "b", FDR=0.01, FDR_col="qval", figsize=[10,5], sig_pos_color="red", sig_neg_color="green", non_sig_color="0.8",
        xlabel="Mean expression",
        ylabel="Beta factor Nucleus localization")



In [7]:

    
df = pd.read_table("./test_data/localizationnucleus_wald_test.tsv")

hl = [
    {"df":df[(df.b>2)], "label":"High nuclear", "color":"red","alpha":0.2},
    {"label":"mock", "alpha":0.5},
    {"target_id":df.target_id[(df.tech_var >= 0.2)]},
    {"target_id":df.target_id[(df.tech_var >= 1)], "color":"black", "label":"very highly variable samples", "alpha":1, "marker":"<"}]

MA_plot(
        df = df,
        X= "mean_obs",
        Y= "b",
        FDR=0.01,
        FDR_col="qval",
        highlight_list=hl,
        figsize=[10,5],
        xlabel="Mean expression",
        ylabel="Beta factor Nucleus localization",
        alpha=0.5,
        highlight_FDR=0.05)









    



Target_id list of dataframe required for series #1. Skipping to the next one

density_plot



In [4]:

    
help(density_plot)









    



Help on function density_plot in module pyBioPlot:

density_plot(df, X, FDR=None, FDR_col='pval', cumulative=False, cut=3, highlight_list=[], highlight_palette='Set1', highlight_FDR=None, highlight_min_targets=0, **kwargs)
    Plot a density with matplotlib from a dataframe in “stacked” or “record” format
    *  df
        Panda dataframe containing the results. Each line corresponds to a single gene/transcript value. Gene/transcript are
        identified by a target_id column. The other covariate columns need to contain the values for X and Y plotting  
    *  X
        Name of the column to calculate density (usually Mean expression)
    *  FDR
        false discovery rate cut-off for the Y axis (on the raw value before log transformation for plotting [DEFAULT: None]
    *  FDR_col
        Name of the column to use to determine the significance cut-off (usually pvalue)
    *  cumulative
        If true, will plot a cumulative distribution [DEFAULT: 1]
    *  highlight_list
        List of dictionaries for values to highlight. Each entry contains:
                [mandatory]     "target_id": List or pandas series of target_id matching target_id in the main df
            OR  [mandatory]     "df": A dataframe containing values with X and Y columns can be provided instead of the target_id list
                [facultative]   "color": A valid matplotlib color, else a random color will be attributed
                [facultative]   "label": A str label [DEFAULT: # of the series]
                [facultative]   "alpha": Alpha parameter [DEFAULT: 1]
                [facultative]   "linestyle":A matplotlib linestyle [DEFAULT: "-"]
                [facultative]   "linestyle": A matplotlib linestyle [DEFAULT: "-"]
                [facultative]   "linewidth": Width of the line [DEFAULT: 2]
                example: highlight_list= [
                    {"target_id":["id1","id3"], "color":"red", "label":"s1"}
                    {"target_id":["id4","id7","id9"], "color":"green", "label":"s2", "marker":">", "alpha":0.5}]
    *  highlight_palette
        palette to be used to automatically assign colors to each element of the highlight_list [DEFAULT: "Set1"]
    *  highlight_FDR
        if a value if given the highlight list will be thresholded by the given FDR [DEFAULT: None]
    *  highlight_min_targets
        Mininal number of targets in a series of the highlight list. If below the series will not be plotted  
        [DEFAULT: 0]
    *  kwargs
        Additional parameters for plot appearance derived from pylab basic plot arguments such as:
        figsize, xlim, ylim, title, xlabel, ylabel, bg_color, grid_color...



In [9]:

    
df = pd.read_table("./test_data/localizationnucleus_wald_test.tsv")
density_plot(df, "b", figsize=[10,5], ylabel="Cumulative Beta value")









    



/usr/local/lib/python3.5/dist-packages/statsmodels/nonparametric/kdetools.py:20: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
  y = X[:m/2+1] + np.r_[0,X[m/2+1:],0]*1j



In [10]:

    
df = pd.read_table("./test_data/localizationnucleus_wald_test.tsv")

df2 = df[(df.mean_obs>10)]

hl = [
    {"df":df2, "label":"High nuclear", "color":"blue","linestyle":':', "linewidth":4},
    {"label":"mock", "alpha":0.5},
    {"target_id":df.target_id[(df.tech_var >= 1.3)], "linestyle":'--'},
    {"target_id":df.target_id[(df.tech_var >= 1.5)], "color":"red", "label":"very highly variable samples", "alpha":0.5, "marker":"<"}]


density_plot(df, "b", figsize=[10,5], ylabel="Cumulative Beta value", highlight_list= hl, cumulative = True, cut=5)









    



/usr/local/lib/python3.5/dist-packages/statsmodels/nonparametric/kdetools.py:20: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
  y = X[:m/2+1] + np.r_[0,X[m/2+1:],0]*1j






    



Target_id list of dataframe required for series #1. Skipping to the next one

PCA_var_plot



In [5]:

    
help(PCA_var_plot)









    



Help on function PCA_var_plot in module pyBioPlot:

PCA_var_plot(df, variable_col, sample_col, value_col, plot_style='ggplot', **kwargs)
    Plot a the frequence of contribution to the variance of the principal componants from a dataframe in “stacked” or “record” format
    *  df
        A pandas dataframe with a least the 3 following columns:
        - variable_col = column containing variable identifiers (such as gene identifiers)
        - sample_col = column containing sample identifiers (such as experimental condition)
        - value_col = column containing quantitative numeric values
    *  variable_col
        Name or index of the column containing variable identifiers
    *  sample_col
        Name or index of the column containing sample identifiers
    *  value_col
        Name or index of the column containing values
    *  plot_style
        Default plot style for pyplot ('grayscale'|'bmh'|'ggplot'|'dark_background'|'classic'|'fivethirtyeight'...)[ DEFAULT: "ggplot" ]
    *  kwargs
        Additional parameters for plot appearance derived from pylab basic plot arguments such as: title, xlabel, ylabel, color, alpha, fontsize, fontname



In [10]:

    
df = pd.read_table("./test_data/obs_normalised_filtered.tsv")
display(df.head())
PCA_var_plot(df,  variable_col='target_id', sample_col="sample", value_col="tpm", plot_style="ggplot", color="cadetblue", alpha=0.8, fontsize=15)









    






  
    
      
      target_id
      sample
      est_counts
      tpm
      eff_len
      len
    
  
  
    
      1
      ENST00000222396.9
      HeLaS3_cytosol_1
      2.919048
      0.701134
      349.145653
      623
    
    
      2
      ENST00000222396.9
      HeLaS3_cytosol_2
      0.000000
      0.000000
      313.104587
      623
    
    
      3
      ENST00000222396.9
      HeLaS3_nucleus_1
      24.388913
      6.409712
      481.440779
      623
    
    
      4
      ENST00000222396.9
      HeLaS3_nucleus_2
      16.205747
      3.694313
      466.236991
      623
    
    
      5
      ENST00000229465.10
      HeLaS3_cytosol_1
      1440.607599
      85.971743
      1405.258868
      1655
    
  








    Out[10]:






  
    
      
      comp
      X
      Y
    
  
  
    
      0
      PC1
      1
      78.331
    
    
      1
      PC2
      2
      21.1269
    
    
      2
      PC3
      3
      0.385066
    
    
      3
      PC4
      4
      0.157017

PCA



In [6]:

    
help(PCA)









    



Help on function PCA in module pyBioPlot:

PCA(df, variable_col, sample_col, value_col, pcx=1, pcy=2, point_label=False, plot_style='ggplot', **kwargs)
    Plot a the frequence of contribution to the variance of the principal components from a dataframe in “stacked” or “record” format.
    Use matplotlib PCA implementation
    *  df
        A pandas dataframe with a least the 3 following columns:
        - variable_col = column containing variable identifiers (such as gene identifiers)
        - sample_col = column containing sample identifiers (such as experimental condition)
        - value_col = column containing quantitative numeric values
    *  variable_col
        Name or index of the column containing variable identifiers
    *  sample_col
        Name or index of the column containing sample identifiers
    *  value_col
        Name or index of the column containing values
    *  pcx
        Number of the component to plot on the x axis [ DEFAULT: 1 ]
    *  pcy
        Number of the component to plot on the y axis [ DEFAULT: 2 ]
    * point_label
        If True the points will be labelled with their names
    *  plot_style
        Default plot style for pyplot ('grayscale'|'bmh'|'ggplot'|'dark_background'|'classic'|'fivethirtyeight'...)[ DEFAULT: "ggplot" ]
    *  kwargs
        Additional parameters for plot appearance derived from pylab basic plot arguments such as: title, color, alpha, fontsize, fontname, linewidths



In [13]:

    
df = pd.read_table("./test_data/obs_normalised_filtered.tsv")
a= PCA(df, variable_col='target_id', sample_col="sample", value_col="tpm", pcx=1, pcy=2, point_label=True, plot_style="ggplot", fontsize=15, color= "cadetblue", alpha=0.8, linewidths=5)
a









    Out[13]:






  
    
      
      1
      2
      3
      4
    
    
      sample
      
      
      
      
    
  
  
    
      HeLaS3_cytosol_1
      0.507107
      0.501701
      0.458975
      0.529605
    
    
      HeLaS3_cytosol_2
      0.471982
      0.491550
      -0.631010
      -0.370726
    
    
      HeLaS3_nucleus_1
      -0.548392
      0.660616
      0.333303
      -0.389567
    
    
      HeLaS3_nucleus_2
      -0.468340
      0.265068
      -0.529223
      0.655987

PCA2



In [10]:

    
help(PCA)









    



Help on function PCA in module pyBioPlot:

PCA(df, variable_col, sample_col, value_col, pcx=1, pcy=2, point_label=False, plot_style='ggplot', **kwargs)
    Plot a the frequence of contribution to the variance of the principal components from a dataframe in “stacked” or “record” format.
    Use matplotlib PCA implementation
    *  df
        A pandas dataframe with a least the 3 following columns:
        - variable_col = column containing variable identifiers (such as gene identifiers)
        - sample_col = column containing sample identifiers (such as experimental condition)
        - value_col = column containing quantitative numeric values
    *  variable_col
        Name or index of the column containing variable identifiers
    *  sample_col
        Name or index of the column containing sample identifiers
    *  value_col
        Name or index of the column containing values
    *  pcx
        Number of the component to plot on the x axis [ DEFAULT: 1 ]
    *  pcy
        Number of the component to plot on the y axis [ DEFAULT: 2 ]
    * point_label
        If True the points will be labelled with their names
    *  plot_style
        Default plot style for pyplot ('grayscale'|'bmh'|'ggplot'|'dark_background'|'classic'|'fivethirtyeight'...)[ DEFAULT: "ggplot" ]
    *  kwargs
        Additional parameters for plot appearance derived from pylab basic plot arguments such as: title, color, alpha, fontsize, fontname, linewidths



In [14]:

    
df = pd.read_table("./test_data/obs_normalised_filtered2.tsv")
a= PCA2(df, variable_col='target_id', sample_col="sample", value_col="tpm", pcx=1, pcy=2, point_label=True, plot_style="ggplot", fontsize=15, color= "cadetblue", alpha=0.8, linewidths=5)
a









    



6






    Out[14]:






  
    
      
      1
      2
      3
      4
      5
      6
    
    
      sample
      
      
      
      
      
      
    
  
  
    
      3T3_PUS1_KD_T0_1
      149677.221161
      -19408.854816
      -6327.215333
      -8028.444481
      8025.114974
      6.903200e-12
    
    
      3T3_PUS1_KD_T0_2
      -241127.416041
      -6339.392362
      22260.500173
      -3894.063162
      -211.793906
      1.311460e-10
    
    
      3T3_PUS1_KD_T0_3
      -227510.147119
      -21102.752705
      -18784.762410
      5295.513714
      -1250.862741
      -1.608302e-11
    
    
      3T3_scramble_T0_1
      250013.390790
      -21068.735975
      6637.841144
      3205.349768
      -7918.310202
      2.696737e-11
    
    
      3T3_scramble_T0_2
      68587.240043
      26652.092582
      5609.842060
      9745.066537
      6483.683090
      -1.151832e-10
    
    
      3T3_scramble_T0_3
      359.711165
      41267.643275
      -9396.205633
      -6323.422375
      -5127.831215
      -1.455314e-11

Generic utilities

get_color_list



In [7]:

    
help(get_color_list)









    



Help on function get_color_list in module pyBioPlot:

get_color_list(n, gamma=1, colormap='brg')
    Return a list of l length with gradient colors from a given matplot lib colormap palette
    Before usage the palette can be tried with the *try_color_list* function
    *  n
        Number of color scalar in the list
    *  gamma
        Move the distribution toward the left (gamma<1) or the right (gamma>1)
    *  colormap
        colormap color palette from matplotlib package see http://matplotlib.org/examples/color/colormaps_reference.html
                        example : inferno magma hot blues cool spring winter brg ocean hsv jet ... [DEFAULT: brg]



In [17]:

    
a = get_color_list(n=3, gamma=2, colormap="brg")

for i in range(6):
    print(next(a))









    



(3.0757400999615529e-05, 0.0, 0.99996924259900033, 1.0)
(0.50392925797770083, 0.0, 0.49607074202229917, 1.0)
(0.0, 1.0, 0.0, 1.0)
(3.0757400999615529e-05, 0.0, 0.99996924259900033, 1.0)
(0.50392925797770083, 0.0, 0.49607074202229917, 1.0)
(0.0, 1.0, 0.0, 1.0)

try_color_list



In [8]:

    
help(try_color_list)









    



Help on function try_color_list in module pyBioPlot:

try_color_list(n_color, n_values, gamma=1, colormap='brg')
    Test a palette generated by get_color_list
    * n_color
        Number of colour in the palette
    * n_values
        Number of values required (could be more that the number of colour but will cycle from the start)
    * gamma
        Move the distribution toward the left (gamma<1) or the right (gamma>1)
    * colormap
        Color palette from matplotlib package see http://matplotlib.org/examples/color/colormaps_reference.html
        Examples : inferno magma hot blues cool spring winter brg ocean hsv jet ... [DEFAULT: brg]



In [19]:

    
try_color_list(n_color=3, n_values=6, colormap="brg")



In [20]:

    
try_color_list(n_color=10, n_values=20, gamma=0.1, colormap="jet")



In [21]:

    
try_color_list(n_color=10, n_values=20, gamma=1, colormap="jet")



In [22]:

    
try_color_list(n_color=10, n_values=20, gamma=4, colormap="jet")

plot_text



In [9]:

    
help(plot_text)









    



Help on function plot_text in module pyBioPlot:

plot_text(text, plot_len=20, align='center', **kwargs)
    Plot a text alone as graph. Useful to separate series of data plots in interactive session.
    * text
        Test message to plot
    * len_plot
        Length of the plotting area [DEFAULT: 20]
    * align
        Alignment of the text ['left' | 'right' | 'center' ] [DEFAULT: 'center']
    * kwargs
        Additional parameters from matplotlib.text.Text class see http://matplotlib.org/users/text_intro.html
        Examples = color, family, fontname, position...



In [24]:

    
plot_text("This is my awesome text")



In [25]:

    
plot_text("This is my awesome text", align="left", color="red", fontsize=20, fontweight="bold")

	target_id	sample	est_counts	tpm	eff_len	len
1	ENST00000222396.9	HeLaS3_cytosol_1	2.919048	0.701134	349.145653	623
2	ENST00000222396.9	HeLaS3_cytosol_2	0.000000	0.000000	313.104587	623
3	ENST00000222396.9	HeLaS3_nucleus_1	24.388913	6.409712	481.440779	623
4	ENST00000222396.9	HeLaS3_nucleus_2	16.205747	3.694313	466.236991	623
5	ENST00000229465.10	HeLaS3_cytosol_1	1440.607599	85.971743	1405.258868	1655

	1	2	3	4
sample
HeLaS3_cytosol_1	0.507107	0.501701	0.458975	0.529605
HeLaS3_cytosol_2	0.471982	0.491550	-0.631010	-0.370726
HeLaS3_nucleus_1	-0.548392	0.660616	0.333303	-0.389567
HeLaS3_nucleus_2	-0.468340	0.265068	-0.529223	0.655987

	1	2	3	4	5	6
sample
3T3_PUS1_KD_T0_1	149677.221161	-19408.854816	-6327.215333	-8028.444481	8025.114974	6.903200e-12
3T3_PUS1_KD_T0_2	-241127.416041	-6339.392362	22260.500173	-3894.063162	-211.793906	1.311460e-10
3T3_PUS1_KD_T0_3	-227510.147119	-21102.752705	-18784.762410	5295.513714	-1250.862741	-1.608302e-11
3T3_scramble_T0_1	250013.390790	-21068.735975	6637.841144	3205.349768	-7918.310202	2.696737e-11
3T3_scramble_T0_2	68587.240043	26652.092582	5609.842060	9745.066537	6483.683090	-1.151832e-10
3T3_scramble_T0_3	359.711165	41267.643275	-9396.205633	-6323.422375	-5127.831215	-1.455314e-11

TEST pyBioPlot package

RNASeq plots

volcano_plot

MA_plot

density_plot

PCA_var_plot

PCA

PCA2

Generic utilities

get_color_list

try_color_list

plot_text

TESTS AND DEV