author: lukethompson@gmail.com
date: 28 Nov 2016
language: Python 3.5
conda environment: emp-py3
license: unlicensed

nestedness_otu_subsets.ipynb

Generate lists of OTUs that are found in X% or less of samples.

Parent BIOM table: emp_deblur_90bp.subset_2k.rare_5000.biom

The most prevalent OTUs in this table are found in only ~1/3 of samples, so there are no ubiquitous 'contaminants'. Still, if we remove the most prevalent OTUs, the numbers of OTUs that remain are listed below, where '% samp' means the OTU is found in this percentage of samples or fewer:

    % samp  No. OTUs
    100     155002
    10.0    154876
    5.0     154361
    2.5     152689
    1.0     147182
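
As a minimal sketch of the filter behind this table, the example below applies the same "found in X% of samples or fewer" rule to a toy OTU summary. The toy sequences, counts, the total of 10 samples, and the helper name `otus_at_or_below` are all made up for illustration; only the column names `sequence` and `num_samples` come from this notebook.

```python
import pandas as pd

# Toy OTU summary; sequences and counts are invented for illustration.
toy_otus = pd.DataFrame(
    {
        'sequence': ['AAAA', 'CCCC', 'GGGG', 'TTTT'],
        'num_samples': [1, 2, 5, 10],
    }
)
toy_tot_samples = 10  # hypothetical total number of samples


def otus_at_or_below(df, tot_samples, fraction):
    """Return rows for OTUs observed in at most `fraction` of all samples."""
    return df[df['num_samples'] <= tot_samples * fraction]


# Count the OTUs that survive each prevalence cutoff.
toy_counts = {
    fraction: otus_at_or_below(toy_otus, toy_tot_samples, fraction).shape[0]
    for fraction in [1, 0.5, 0.2, 0.1]
}
print(toy_counts)
```

With these toy values the cutoffs of 100%, 50%, 20%, and 10% keep 4, 3, 2, and 1 OTUs respectively, mirroring how the counts in the table shrink as the cutoff tightens.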

In [1]:
import pandas as pd
import numpy as np
import biom

In [2]:
path_otu_summary = '/Users/luke.thompson/emp/analyses-otus/otu_summary.emp_deblur_90bp.subset_2k.rare_5000.tsv'
path_output = '/Users/luke.thompson/emp/analyses-nestedness/otu_subset.emp_deblur_90bp.subset_2k.rare_5000'

In [3]:
df_otus = pd.read_csv(path_otu_summary, sep='\t', index_col=0)

In [4]:
# Total number of samples (assumed to match the 2k-sample subset named in the file paths)
tot_samples = 2000
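
Because `tot_samples` is hard-coded, it may be worth checking it against the OTU summary before filtering. The sketch below is one possible sanity check, not part of the original analysis: the helper name `max_prevalence_fraction` and the toy data are hypothetical. It reports the largest fraction of samples in which any single OTU occurs, which is the quantity behind the "~1/3 of samples" statement above, and raises an error if any OTU is reported in more samples than the assumed total.

```python
import pandas as pd


def max_prevalence_fraction(df, tot_samples):
    """Return the largest fraction of samples in which any single OTU occurs."""
    max_count = df['num_samples'].max()
    if max_count > tot_samples:
        raise ValueError('num_samples exceeds tot_samples; check the sample total')
    return max_count / tot_samples


# Toy data for illustration only (not values from the EMP table).
toy_summary = pd.DataFrame({'num_samples': [3, 40, 650]})
toy_fraction = max_prevalence_fraction(toy_summary, 2000)
print(round(toy_fraction, 3))
```

In the notebook itself, the same call could be made as `max_prevalence_fraction(df_otus, tot_samples)` once `df_otus` has been loaded.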

In [5]:
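# For each prevalence cutoff, keep OTUs found in <= that fraction of samples,
# print the count, and write the OTU sequences to a text file (one per line).
print('% samp\tNo. OTUs')
for i in [1, 0.1, 0.05, 0.025, 0.01]:
    df_sub = df_otus[df_otus['num_samples'] <= tot_samples * i]
    print('%s\t%s' % (i * 100, df_sub.shape[0]))
    # header=False keeps the file as bare sequences in newer pandas versions
    df_sub['sequence'].to_csv('%s.lt_%s_pc_samp.txt' % (path_output, i * 100), index=False, header=False)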


% samp	No. OTUs
100	155002
10.0	154876
5.0	154361
2.5	152689
1.0	147182
