author: lukethompson@gmail.com
date: 9 Oct 2017
language: Python 3.5
license: BSD3

otu_entropy.ipynb

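For each sample type (EMPO level 3), find the low-entropy OTUs that meet a minimum relative abundance in that sample type and a minimum prevalence (measured here as total observations). The thresholds used below are relative abundance >= 25%, entropy < 1, and total observations >= 1000.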
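The entropy values used in this notebook are read from a file produced by a separate notebook, so the exact formula is not shown here. As a rough illustration of what "low entropy" means, the sketch below assumes Shannon entropy (base 2) of an OTU's distribution across sample types. The function name `shannon_entropy` and the toy proportions are hypothetical, and the upstream notebook may use a different base or normalization.

```python
import math

def shannon_entropy(proportions, base=2):
    """Shannon entropy of a distribution; zero entries contribute nothing."""
    positive = [p for p in proportions if p > 0]
    total = sum(positive)
    # renormalize so the proportions sum to 1 before taking logs
    return -sum((p / total) * math.log(p / total, base) for p in positive)

# hypothetical OTU distributions across four sample types
specialist = [0.97, 0.01, 0.01, 0.01]   # concentrated in one sample type
generalist = [0.25, 0.25, 0.25, 0.25]   # spread evenly

print('specialist:', shannon_entropy(specialist))
print('generalist:', shannon_entropy(generalist))
```

Under this assumption, an OTU concentrated in one sample type scores well below 1, while an OTU spread evenly over four types scores 2.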



In [1]:

    
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline



In [2]:

    
path_otus = '../../data/sequence-lookup/otu_summary.emp_deblur_90bp.subset_2k.rare_5000.tsv' # gunzip first
df_otus = pd.read_csv(path_otus, sep='\t', index_col=0)



In [3]:

    
path_entropy = '../../data/entropy/otu_entropy_empo.csv' # output of 09-specificity-entropy/entropy_environment_by_taxon.ipynb
df_otu_entropy = pd.read_csv(path_entropy, index_col=0)



In [4]:

    
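# EMPO level 3 sample-type columns: every column after the first
# (the first column is skipped here, presumably the 'entropy' column used below)
empo3 = df_otu_entropy.columns[1:]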



In [5]:

    
df_merged = pd.merge(df_otus, df_otu_entropy, left_on='sequence', right_index=True)
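Because `pd.merge` defaults to an inner join, any OTU in `df_otus` whose sequence has no row in the entropy table is dropped without a message. The sketch below uses small toy frames (hypothetical sequences and values, not the real data) to show one way to count how many rows are lost in such a merge.

```python
import pandas as pd

# toy stand-ins for df_otus and df_otu_entropy (hypothetical values)
toy_otus = pd.DataFrame({'sequence': ['AAA', 'CCC', 'GGG'],
                         'total_obs': [1500, 200, 3000]})
toy_entropy = pd.DataFrame({'entropy': [0.4, 2.1]}, index=['AAA', 'GGG'])

# same call pattern as above: column on the left, index on the right
toy_merged = pd.merge(toy_otus, toy_entropy, left_on='sequence', right_index=True)

# rows present before the merge but absent afterwards
n_dropped = len(toy_otus) - len(toy_merged)
print(toy_merged)
print('OTUs without an entropy value:', n_dropped)
```

The same subtraction, `len(df_otus) - len(df_merged)`, could be run on the real tables as a quick sanity check.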



In [6]:

    
# count OTUs per empo_3 sample type with relative abundance in that type >= 25%,
# entropy < 1, and total obs >= 1000 (15 sample types have at least one such OTU)
for empo in empo3:
    print(empo, '\t',
          df_merged[(df_merged[empo] >= 0.25) & 
                    (df_merged['entropy'] < 1) &
                    (df_merged['total_obs'] >= 1000)].shape)









    



Animal surface 	 (6, 30)
Animal corpus 	 (18, 30)
Animal secretion 	 (1, 30)
Animal proximal gut 	 (51, 30)
Animal distal gut 	 (6, 30)
Plant surface 	 (22, 30)
Plant corpus 	 (0, 30)
Plant rhizosphere 	 (19, 30)
Soil (non-saline) 	 (1, 30)
Sediment (non-saline) 	 (18, 30)
Sediment (saline) 	 (16, 30)
Surface (non-saline) 	 (27, 30)
Surface (saline) 	 (22, 30)
Aerosol (non-saline) 	 (7, 30)
Water (non-saline) 	 (7, 30)
Water (saline) 	 (13, 30)
Intertidal (saline) 	 (0, 30)
Hypersaline (saline) 	 (0, 30)
Sterile water blank 	 (0, 30)
Mock community 	 (0, 30)
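The three thresholds are repeated verbatim in the next cell, so it may be convenient to keep them in one place. The following is a minimal sketch on toy data, not the notebook's own code: the helper name `count_specialists`, its parameter names, and the toy table are all hypothetical, and the defaults simply mirror the values used above.

```python
import pandas as pd

def count_specialists(df, sample_types, min_rel_abund=0.25,
                      max_entropy=1.0, min_total_obs=1000):
    """Count OTUs per sample type that pass all three thresholds."""
    # conditions that do not depend on the sample type are computed once
    base = (df['entropy'] < max_entropy) & (df['total_obs'] >= min_total_obs)
    return pd.Series({s: int((base & (df[s] >= min_rel_abund)).sum())
                      for s in sample_types})

# toy table with two sample-type columns (hypothetical values)
toy = pd.DataFrame({
    'Soil': [0.9, 0.1, 0.3],
    'Water': [0.1, 0.9, 0.3],
    'entropy': [0.5, 0.4, 1.8],
    'total_obs': [2000, 500, 5000],
})
counts = count_specialists(toy, ['Soil', 'Water'])
print(counts)
```

In the toy table only the first OTU passes the entropy and observation thresholds, and it is abundant only in `Soil`, so `Soil` counts one OTU and `Water` counts none.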



In [7]:

    
# now get the most abundant OTU that meets the criteria above for each sample type
top_rows = []
list_empo = []
for empo in empo3:
    df_empo = df_merged[(df_merged[empo] >= 0.25) &
                        (df_merged['entropy'] < 1) &
                        (df_merged['total_obs'] >= 1000)]
    # sort into a new object instead of in place; an in-place sort on a
    # filtered slice triggers pandas' SettingWithCopyWarning
    df_empo = df_empo.sort_values('total_obs', ascending=False)
    if df_empo.shape[0] > 0:
        top_rows.append(df_empo.iloc[0, :])
        list_empo.append(empo)
# build the result once from the collected rows, one row per sample type
df_top_entropy = pd.DataFrame(top_rows)
df_top_entropy.index = list_empo
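Cell 7 sorts each filtered table only to take its first row. One alternative, sketched here on toy data with hypothetical OTU labels and values, is to look up the row with the largest `total_obs` directly using `idxmax`. Note that `idxmax` returns the first occurrence when several rows tie, which may differ from the row a sort would put first.

```python
import pandas as pd

# toy table indexed by hypothetical OTU labels
toy = pd.DataFrame({
    'Soil': [0.9, 0.6, 0.0],
    'Water': [0.0, 0.1, 0.0],
    'entropy': [0.5, 0.4, 0.2],
    'total_obs': [2000, 7000, 9000],
}, index=['otu_a', 'otu_b', 'otu_c'])

top = {}
for sample_type in ['Soil', 'Water']:
    hits = toy[(toy[sample_type] >= 0.25) &
               (toy['entropy'] < 1) &
               (toy['total_obs'] >= 1000)]
    if not hits.empty:
        # label of the row with the highest total_obs among the hits
        top[sample_type] = hits.loc[hits['total_obs'].idxmax()]

# one row per sample type that had at least one qualifying OTU
df_top = pd.DataFrame(top).T
print(df_top)
```

Sample types with no qualifying OTU (here `Water`) are simply absent from the result, matching the behavior of the loop in cell 7.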



In [8]:

    
# write out the "most abundant, sample type-specific (>=25%), low-entropy OTU (<1)" for each sample type
df_top_entropy.to_csv('../../data/sequence-lookup/top_specialized_otu_per_empo.csv')


