Goal:

Write functions to subdivide an m/z : rt space into rt bins. See how this affects classification performance



In [1]:

    
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.colors as colors
import numpy as np
%matplotlib inline

Start with MTBLS315, the malaria vs fever dataset. Could get ~0.85 AUC for whole dataset.



In [2]:

    
# Get the data
### Subdivide the data into a feature table
local_path = '/home/irockafe/Dropbox (MIT)/Alm_Lab/projects/'
data_path = local_path + '/revo_healthcare/data/processed/MTBLS315/'\
'uhplc_pos/xcms_camera_results.csv'

## Import the data and remove extraneous columns
df = pd.read_csv(data_path, index_col=0)
df.shape
df.head()

not_samples=['']

Almost everything is below 30sec rt-window



In [3]:

    
# Show me a scatterplot of m/z rt dots
# distribution along mass-axis and rt axist

plt.scatter(df['mz'], df['rt'], s=1,)
plt.xlabel('mz')
plt.ylabel('rt')
plt.title('mz vs. rt')
plt.show()



In [4]:

    
# Check out the distribution of rt-windows to choose your rt bin
rt_dist = df['rtmax'] - df['rtmin']
rt_dist.hist(bins=100)
# Choose to use 40 second bins...?









    Out[4]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f421fff28d0>



In [ ]:



In [ ]:

    
# Divide m/z and rt into windows of certain width

def subdivide_mz_rt(df, rt_bins, mz_bins, samples, not_samples):
    '''
    GOAL - Subdivide mz/rt space into a grid of defined widths. Sum entrie sof the grid
    IPUT - df - dataframe from xcms (rt, mz, patient names as columns)
    OUTPUT - Matrix containing summed intensities
    NOTE - Do adducts play a large roll here? Uncertain. Could choose your rt-widths based on
           rt-distribution
    '''
    # Define the bins based on separation criteria
    # todo this is hacky and will always round down
    rt_window = df['rt'].max() / rt_bins
    rt_bounds = [(i*rt_window, i*rt_window+rt_window) 
                 for i in range(0, rt_bins)]

    mz_window = df['mz'].max() / mz_bins
    mz_bounds = [(i*mz_window, i*mz_window+mz_window) 
                for i in range(0, mz_bins)]
        
    # Tidy up by converting patient labels to column using melt
    tidy = pd.melt(df, id_vars=not_samples,
           value_vars=samples, var_name='Samples', 
           value_name='Intensity')
    
    # Go through each sample and sum intensities onto grid
    grid_dict = {}
    for sample in samples:
        # Initiate empty dict with shape mz_windo x rt_window
        grid_vals = np.zeros([mz_bins, rt_bins])
        
        df_sample = tidy[tidy['Samples'] == sample]
        # use floor, b/c zero is the start of indexes, not 1
        # TODO Deal with 
        y_vals = np.floor(df_sample['rt'].div(
                    df_sample['rt'].max()+1e-9)*rt_bins
                        ).astype(int)
        x_vals = np.floor(df_sample['mz'].div(
                        df_sample['mz'].max()+1e-9)*mz_bins
                        ).astype(int)
        df_sample['rt_bin'] = x_vals
        df_sample['mz_bin'] = y_vals
        # Go through each row and add intensity to the correct 
        # grid-entry
        for idx, row in df_sample.iterrows():
            grid_vals[row['rt_bin'], row['mz_bin']] += row['Intensity']
                    
        # Add to grid dict
        grid_dict[sample] = grid_vals
        ax = sns.heatmap(grid_vals)
        ax.set_xlabel('mz')
        ax.set_ylabel('rt')
        ax.invert_yaxis()
        plt.title(sample)
        plt.show()
    
    # Create matrix to add intensities to
    
    # Get the number of bins by finding hte max values of m/z and 
    return grid_dict

not_samples = ['mz', 'mzmin', 'mzmax', 'rt', 'rtmin', 'rtmax',
              'adduct', 'isotopes', 'npeaks', 'pcgroup', 'uhplc_pos']
samples = df.columns.difference(not_samples)
print 'max:', df['mz'].max()
rt_divisor = 50 # base this on distribution of rt-widths..?
mz_bins = 5
rt_bins = 5
print mz_bins
print rt_bins
grid_dict = subdivide_mz_rt(df, mz_bins, rt_bins, samples, not_samples)









    



max: 994.676397959
5
5






    



/home/irockafe/miniconda2/envs/isaac_revo_healthcare/lib/python2.7/site-packages/ipykernel/__main__.py:41: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
/home/irockafe/miniconda2/envs/isaac_revo_healthcare/lib/python2.7/site-packages/ipykernel/__main__.py:42: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy



In [92]:

    
# Make shit tidy
test_df = pd.DataFrame({'mz': [10,20,30,100], 'rt':[100,200,300,1000], 
                        'A': [1,2,3,4], 'B': [10,20,30,40],
                       'C': [5,15,25,35]})
print 'Original: \n', test_df
tidy_df = pd.melt(test_df, id_vars=['mz', 'rt'], value_vars=['A', 'B', 'C'], 
       var_name='Subject', value_name='Intensity')
print '\n\n Tidy:\n', tidy_df

feature_table_df = pd.pivot_table(tidy_df, index=['mz', 'rt'], values='Intensity',
                           columns='Subject')

print '\n\n Unpivoted, step 1:\n',feature_table_df

feature_table_df.reset_index(inplace=True)
print feature_table_df


abc = tidy_df[tidy_df['Subject'] == 'A']









    



Original: 
   A   B   C   mz    rt
0  1  10   5   10   100
1  2  20  15   20   200
2  3  30  25   30   300
3  4  40  35  100  1000


 Tidy:
     mz    rt Subject  Intensity
0    10   100       A          1
1    20   200       A          2
2    30   300       A          3
3   100  1000       A          4
4    10   100       B         10
5    20   200       B         20
6    30   300       B         30
7   100  1000       B         40
8    10   100       C          5
9    20   200       C         15
10   30   300       C         25
11  100  1000       C         35


 Unpivoted, step 1:
Subject   A   B   C
mz  rt             
10  100   1  10   5
20  200   2  20  15
30  300   3  30  25
100 1000  4  40  35
Subject   mz    rt  A   B   C
0         10   100  1  10   5
1         20   200  2  20  15
2         30   300  3  30  25
3        100  1000  4  40  35



In [120]:

    
a = np.zeros([2,6])
a[0.0,2.0] += 3
a[0,2]+=7
a









    



---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-120-07d12ca3195f> in <module>()
      1 a = np.zeros([2,6])
----> 2 a[0.0,2.0] += 3
      3 a[0,2]+=7
      4 a

IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices



In [84]:

    
# divide feature table into slices of retention time
def get_rt_slice(df, rt_bounds):
    '''
    PURPOSE: 
        Given a tidy feature table with 'mz' and 'rt' column headers, 
        retain only the features whose rt is between rt_left 
        and rt_right
    INPUT: 
        df - a tidy pandas dataframe with 'mz' and 'rt' column 
            headers
        rt_left, rt_right: the boundaries of your rt_slice, in seconds
    '''
    out_df = df.loc[ (df['rt'] > rt_bounds[0]) & 
                    (df['rt'] < rt_bounds[1])]
    return out_df

def sliding_window_rt(df, rt_width, step=rt_width*0.25):
    pass
    # get range of values [(0, rt_width)
    #get_rt_slice(df, )
    rt_min = np.min(df['rt'])
    rt_max = np.max(df['rt'])
    # define the ranges 
    left_bound = np.arange(rt_min, rt_max, step)
    right_bound = left_bound + rt_width
    rt_bounds = zip(left_bound, right_bound)
    for rt_slice in rt_bounds:
        rt_window = get_rt_slice(df, rt_slice)
        #print rt_window.head()
        print 'shape', rt_window.shape
        raise hee
        # TODO Send to ml pipeline here? Or separate function?

print type(np.float64(3.5137499999999999))
a = get_rt_slice(df, (750, 1050))
print 'Original dataframe shape: ', df.shape
print '\n Shape:', a.shape, '\n\n\n\n'

sliding_window_rt(df, 100)









    



---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-84-6b138b998900> in <module>()
     15     return out_df
     16 
---> 17 def sliding_window_rt(df, rt_width, step=rt_width*0.25):
     18     pass
     19     # get range of values [(0, rt_width)

NameError: name 'rt_width' is not defined



In [16]:

    
# Convert selected slice to feature table, X, get labels y









    Out[16]:







  
    
      
      mz
      mzmin
      mzmax
      rt
      rtmin
      rtmax
      npeaks
      uhplc_pos
      X1001_P
      X1002_P
      ...
      X1061_P
      X1062_P
      X1064_P
      X1065_P
      X1066_P
      X1067_P
      X1068_P
      isotopes
      adduct
      pcgroup
    
  
  
    
      37
      104.106489
      104.106064
      104.106715
      769.177526
      768.010462
      778.193106
      67
      55
      329070.128055
      396227.894115
      ...
      473247.278829
      588871.216421
      426270.960560
      572473.584864
      569940.613228
      315394.476526
      944484.900497
      NaN
      NaN
      442
    
    
      38
      104.106481
      104.105924
      104.106715
      797.843875
      796.921036
      799.168518
      53
      51
      65338.750793
      79133.729755
      ...
      185880.697595
      128276.011237
      205495.778470
      231927.971052
      210661.425303
      164378.532642
      274995.481829
      NaN
      NaN
      33
    
    
      44
      112.627395
      112.626903
      112.627604
      768.882724
      768.164720
      769.627506
      34
      34
      0.000000
      2512.603634
      ...
      17939.896362
      19085.291535
      24771.295474
      34058.985162
      59346.107589
      43265.770580
      90047.112008
      NaN
      NaN
      11
    
    
      87
      117.988387
      117.987919
      117.988599
      912.162783
      903.569648
      919.857546
      58
      34
      316999.738260
      557646.282622
      ...
      413286.737731
      392384.888350
      969722.728147
      241978.478925
      401799.805418
      402536.825821
      387827.492260
      NaN
      NaN
      968
    
    
      92
      117.988391
      117.987949
      117.988588
      895.832264
      889.220909
      902.494494
      49
      32
      443674.518761
      360580.907258
      ...
      278985.527029
      296287.274749
      274575.084114
      255087.820263
      248857.829020
      274409.550916
      243281.723691
      NaN
      NaN
      57
    
  

5 rows × 72 columns

Show me the distribution of features from alzheimers dataset



In [88]:

    
### Subdivide the data into a feature table
local_path = '/home/irockafe/Dropbox (MIT)/Alm_Lab/'\
'projects'
data_path = local_path + '/revo_healthcare/data/processed/MTBLS72/positive_mode/'\
'mtbls_no_retcor_bw2.csv'
## Import the data and remove extraneous columns
df = pd.read_csv(data_path, index_col=0)
df.shape
df.head()
# Make a new index of mz:rt
mz = df.loc[:,"mz"].astype('str')
rt = df.loc[:,"rt"].astype('str')
idx = mz+':'+rt
df.index = idx
df.head()

a = get_rt_slice(df, (0,100))
print df.shape
print a.shape









    



(1417, 262)
(517, 262)

	mz	mzmin	mzmax	rt	rtmin	rtmax	npeaks	uhplc_pos	X1001_P	X1002_P	...	X1061_P	X1062_P	X1064_P	X1065_P	X1066_P	X1067_P	X1068_P	isotopes	adduct	pcgroup
37	104.106489	104.106064	104.106715	769.177526	768.010462	778.193106	67	55	329070.128055	396227.894115	...	473247.278829	588871.216421	426270.960560	572473.584864	569940.613228	315394.476526	944484.900497	NaN	NaN	442
38	104.106481	104.105924	104.106715	797.843875	796.921036	799.168518	53	51	65338.750793	79133.729755	...	185880.697595	128276.011237	205495.778470	231927.971052	210661.425303	164378.532642	274995.481829	NaN	NaN	33
44	112.627395	112.626903	112.627604	768.882724	768.164720	769.627506	34	34	0.000000	2512.603634	...	17939.896362	19085.291535	24771.295474	34058.985162	59346.107589	43265.770580	90047.112008	NaN	NaN	11
87	117.988387	117.987919	117.988599	912.162783	903.569648	919.857546	58	34	316999.738260	557646.282622	...	413286.737731	392384.888350	969722.728147	241978.478925	401799.805418	402536.825821	387827.492260	NaN	NaN	968
92	117.988391	117.987949	117.988588	895.832264	889.220909	902.494494	49	32	443674.518761	360580.907258	...	278985.527029	296287.274749	274575.084114	255087.820263	248857.829020	274409.550916	243281.723691	NaN	NaN	57