In [1]:
# set up our environment
import os
import sys
%matplotlib inline
import pandas as pd
import statsmodels.api as sm
In [2]:
# read the data into a pandas dataframe
df = pd.read_csv("read_depth.strains.tsv", header=0, sep="\t")
print("Shape: {}".format(df.shape))
df.head()
Out[2]:
In [3]:
dfa = df[(df["A_read_depth"] > 0) & (df["A_strains"] > 0)]
dfb = df[(df["B_read_depth"] > 0) & (df["B_strains"] > 0)]
dfc = df[(df["C_read_depth"] > 0) & (df["C_strains"] > 0)]
print("Shape: {}".format(dfa.shape))
dfa.head()
Out[3]:
Note that we have reduced our data from 11,054 rows (including all the zero entries) to only 1,397 rows!
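The same nonzero filter is applied to each sample above; it can also be written once and looped over the sample prefixes. A minimal sketch, using a small hypothetical stand-in frame since the real `.tsv` isn't reproduced here:

```python
import pandas as pd

# Hypothetical miniature of the read-depth table (the real file has 11,054 rows)
df = pd.DataFrame({
    "A_read_depth": [0, 5, 12, 0],
    "A_strains":    [0, 2, 3, 0],
    "B_read_depth": [3, 0, 7, 0],
    "B_strains":    [1, 0, 2, 0],
})

# Apply the same nonzero filter to every sample in one comprehension
filtered = {
    s: df[(df[f"{s}_read_depth"] > 0) & (df[f"{s}_strains"] > 0)]
    for s in ("A", "B")
}
print({s: sub.shape for s, sub in filtered.items()})
```

This keeps the filtering logic in one place, so adding a sample C only means extending the tuple of prefixes.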
In [4]:
ax = dfa.plot('A_read_depth', 'A_strains', kind='scatter')
ax.set(ylabel="# strains", xlabel="read depth")
Out[4]:
Note that this plot is skewed by a few outliers. Let's limit it to rows where read_depth < 1000 and redraw the plot.
In [5]:
dfas = dfa[dfa['A_read_depth'] < 1000]
print("Shape: {}".format(dfas.shape))
ax = dfas.plot('A_read_depth', 'A_strains', kind='scatter')
ax.set(ylabel="# strains", xlabel="read depth")
Out[5]:
When we zoom in, the correlation doesn't look strong. Note that this subset still contains almost all of the data: the set excluding (0,0) had 1,397 entries, and we now have 1,386 entries, so we only removed 11 points!
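The "how many points did the cutoff remove" check can be computed directly by comparing lengths. A sketch with a hypothetical five-row stand-in for `dfa`:

```python
import pandas as pd

# Hypothetical miniature of dfa (the real frame has 1,397 rows)
dfa = pd.DataFrame({
    "A_read_depth": [10, 500, 950, 1500, 2500],
    "A_strains":    [1, 2, 3, 4, 5],
})

# Same cutoff as above: drop the high-read-depth outliers
dfas = dfa[dfa["A_read_depth"] < 1000]

# Count how many points the cutoff removed
removed = len(dfa) - len(dfas)
print(removed)  # 2 in this toy frame
```

On the full data this difference is 1,397 - 1,386 = 11, confirming that only a handful of extreme points are being discarded.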
In [6]:
model = sm.OLS(dfa['A_strains'], dfa['A_read_depth']).fit()
predictions = model.predict(dfa['A_read_depth'])
model.summary()
Out[6]:
In [7]:
model = sm.OLS(dfas['A_strains'], dfas['A_read_depth']).fit()
predictions = model.predict(dfas['A_read_depth'])
model.summary()
Out[7]:
Notice here that the R² drops to 0.30. Those few outliers were strongly influencing the apparent correlation in the data!