author: lukethompson@gmail.com
date: 28 Nov 2016
language: Python 3.5
conda environment: emp-py3
license: unlicensed

nestedness_otu_subsets.ipynb

Generate lists of OTUs that are found in X% or less of samples.

Parent BIOM table: emp_deblur_90bp.subset_2k.rare_5000.biom

The most prevalent OTUs in this table are found in only ~1/3 of samples, so there are no ubiquitous 'contaminants'. Still, if we remove the most prevalent OTUs, here is the number of OTUs we are left with, where '% samp' means the OTU is found in this percentage of samples or less:

    % samp  No. OTUs
    100     155002
    10.0    154876
    5.0     154361
    2.5     152689
    1.0     147182
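
The "~1/3 of samples" figure is the maximum OTU prevalence, i.e. the largest `num_samples` value divided by the total number of samples. A minimal sketch of that calculation follows; the toy table and the names `df_toy` and `tot_samples_toy` are made up for illustration and are not part of the original analysis. Only the `num_samples` and `sequence` column names follow the OTU summary table used in this notebook.

```python
import pandas as pd

# Toy stand-in for the OTU summary table (values are illustrative only)
df_toy = pd.DataFrame({
    'sequence': ['AAAA', 'CCCC', 'GGGG', 'TTTT'],
    'num_samples': [650, 120, 15, 3],
})
tot_samples_toy = 2000  # assumed total, matching the 2k subset

# Fraction of samples in which each OTU is found
prevalence = df_toy['num_samples'] / tot_samples_toy

# The most prevalent OTU sets the upper bound on prevalence
max_prevalence = prevalence.max()
print('max prevalence: %.3f' % max_prevalence)
```

On the real table the same expression would be applied to `df_otus['num_samples']` with `tot_samples`.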



In [1]:

    
import pandas as pd
import numpy as np
import biom



In [2]:

    
path_otu_summary = '/Users/luke.thompson/emp/analyses-otus/otu_summary.emp_deblur_90bp.subset_2k.rare_5000.tsv'
path_output = '/Users/luke.thompson/emp/analyses-nestedness/otu_subset.emp_deblur_90bp.subset_2k.rare_5000'



In [3]:

    
df_otus = pd.read_csv(path_otu_summary, sep='\t', index_col=0)



In [4]:

    
tot_samples = 2000  # total number of samples in the parent BIOM table (2k subset)



In [5]:

    
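# Count and export OTUs found in less than or equal to fraction i of samples
print('% samp\tNo. OTUs')
for i in [1, 0.1, 0.05, 0.025, 0.01]:
    # keep OTUs observed in at most tot_samples * i samples
    df_sub = df_otus[df_otus['num_samples'] <= tot_samples * i]
    print('%s\t%s' % (i * 100, df_sub.shape[0]))
    # write one sequence per line; explicit header=False keeps the column name out of the list
    df_sub['sequence'].to_csv('%s.lt_%s_pc_samp.txt' % (path_output, i * 100), index=False, header=False)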

% samp	No. OTUs
100	155002
10.0	154876
5.0	154361
2.5	152689
1.0	147182
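
To show the thresholding step in isolation, here is a minimal sketch on a made-up table. The helper name `count_otus_by_prevalence` and the toy values are hypothetical and not part of the original analysis; only the `num_samples` column name and the `<=` comparison follow the loop above.

```python
import pandas as pd

def count_otus_by_prevalence(df, tot_samples, fractions):
    """Return (percent, n_otus) pairs for OTUs found in <= each fraction of samples."""
    results = []
    for frac in fractions:
        # same filter as in the notebook: at most tot_samples * frac samples
        df_sub = df[df['num_samples'] <= tot_samples * frac]
        results.append((frac * 100, df_sub.shape[0]))
    return results

# Toy stand-in for the OTU summary table (values are illustrative only)
df_toy = pd.DataFrame({
    'sequence': ['AAAA', 'CCCC', 'GGGG', 'TTTT', 'ACGT'],
    'num_samples': [650, 120, 15, 3, 40],
})

counts = count_otus_by_prevalence(df_toy, 2000, [1, 0.1, 0.05, 0.025, 0.01])
for pct, n in counts:
    print('%s\t%s' % (pct, n))
```

Because the comparison is inclusive, each count can only stay the same or shrink as the threshold is lowered, which is the pattern seen in the output above.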


