Generate lists of OTUs that are found in X% or less of samples.
Parent BIOM table: emp_deblur_90bp.subset_2k.rare_5000.biom
The most prevalent OTUs in this table are found in only ~1/3 of samples, so there are no ubiquitous 'contaminants'. Still, if we remove the most prevalent OTUs, this is the number of OTUs we are left with, where '% samp' means the OTU is found in this percent of samples or less:
% samp No. OTUs
100 155002
10.0 154876
5.0 154361
2.5 152689
1.0 147182
In [1]:
import pandas as pd
import numpy as np
import biom
In [2]:
path_otu_summary = '/Users/luke.thompson/emp/analyses-otus/otu_summary.emp_deblur_90bp.subset_2k.rare_5000.tsv'
path_output = '/Users/luke.thompson/emp/analyses-nestedness/otu_subset.emp_deblur_90bp.subset_2k.rare_5000'
In [3]:
df_otus = pd.read_csv(path_otu_summary, sep='\t', index_col=0)
In [4]:
tot_samples = 2000
In [5]:
# OTUs found in less than or equal to X fraction of samples
print('% samp\tNo. OTUs')
for i in [1, 0.1, 0.05, 0.025, 0.01]:
    # keep OTUs whose sample count is at or below the threshold
    df_sub = df_otus[df_otus['num_samples'] <= tot_samples * i]
    print('%s\t%s' % (i*100, df_sub.shape[0]))
    # write the sequences of the retained OTUs, one file per threshold
    df_sub['sequence'].to_csv('%s.lt_%s_pc_samp.txt' % (path_output, i * 100), index=False)
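The prevalence filter in the loop above can be checked on a small made-up table before running it on the full OTU summary. The sketch below is only an illustration: the toy data, `df_toy`, `tot_samples_toy`, and the helper `otus_at_or_below` are hypothetical names not used elsewhere in this notebook, and it assumes the summary has `sequence` and `num_samples` columns as in the cell above.

```python
import pandas as pd

# Hypothetical toy summary: one row per OTU; 'num_samples' is the number
# of samples in which that OTU was observed.
df_toy = pd.DataFrame(
    {
        'sequence': ['AAAA', 'CCCC', 'GGGG', 'TTTT'],
        'num_samples': [1, 2, 5, 10],
    }
)
tot_samples_toy = 10  # assumed total number of samples in the toy table


def otus_at_or_below(df, tot_samples, fraction):
    """Return rows for OTUs found in at most `fraction` of all samples."""
    return df[df['num_samples'] <= tot_samples * fraction]


# Count the OTUs retained at each prevalence threshold.
counts = {}
for fraction in [1, 0.5, 0.2, 0.1]:
    df_kept = otus_at_or_below(df_toy, tot_samples_toy, fraction)
    counts[fraction] = df_kept.shape[0]
    print('%s\t%s' % (fraction * 100, counts[fraction]))
```

Because the comparison is `<=`, an OTU found in exactly the threshold number of samples is kept, which matches the "this percent of samples or less" wording of the table.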