In [ ]:
# Working with data 2017. Class 5
## Contact
Javier Garcia-Bernardo
garcia@uva.nl

## 0. Structure
## 1. Assumptions of t-test and regression
- Normality 
- Independent and identically distributed (i.i.d.)
- Equal variance
- (for linear regression) Uncorrelated residuals


## 2. Comparing groups
- Compare one group vs one value
- Compare two groups 
- Compare two paired groups
    
## 3. Multiple groups
- ANOVA
- Multiple comparison (Tukey correction)
    
## 4. Regressions
- Linear regression
- Logistic regression
- Machine learning

In [1]:
#Install something we'll need
!pip install --user scikits.bootstrap

#Requires restarting the kernel


Requirement already satisfied: scikits.bootstrap in /root/.local/lib/python3.5/site-packages
Requirement already satisfied: scipy in /opt/anaconda/anaconda3/lib/python3.5/site-packages (from scikits.bootstrap)
Requirement already satisfied: numpy in /opt/anaconda/anaconda3/lib/python3.5/site-packages (from scikits.bootstrap)

In [91]:
##Some code to run at the beginning of the file, to be able to show images in the notebook
##Don't worry about this cell

#Print the plots in this screen
%matplotlib inline 

#Be able to plot images saved in the hard drive
from IPython.display import Image 

#Make the notebook wider
from IPython.core.display import display, HTML 
display(HTML("<style>.container { width:90% !important; }</style>"))

import seaborn as sns
import pylab as plt
import pandas as pd
import numpy as np
import scipy
import scipy.stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd
import scikits.bootstrap as bootstrap


1. Assumptions of most tests

1.1 Normality


In [4]:
Image(url="https://kanbanize.com/blog/wp-content/uploads/2014/07/Standard_deviation_diagram.png", width=500)


Out[4]:

How to test it

  • QQ plot: compare our sample quantiles against the theoretical quantiles of a normal distribution.
  • Most tests are robust to small deviations from normality.
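A formal test can complement the visual check. As a sketch (using scipy's Shapiro-Wilk test; the variable names are made up):

```python
import numpy as np
import scipy.stats

np.random.seed(0)
normal_data = np.random.randn(100)           #a sample that really is normal
skewed_data = np.random.exponential(2, 100)  #a clearly non-normal sample

#Shapiro-Wilk: a small p-value is evidence AGAINST normality
_, p_normal = scipy.stats.shapiro(normal_data)
_, p_skewed = scipy.stats.shapiro(skewed_data)
print(p_normal, p_skewed)
```

Note that with very large samples even tiny deviations become "significant", so the QQ plot stays the more informative check.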

In [ ]:


In [14]:
def qq_plot(x):
    #Plot the observed quantiles of x against theoretical normal quantiles
    (osm, osr), (slope, intercept, r) = scipy.stats.probplot(x, dist='norm', plot=None)
    plt.plot(osm, osr, '.', osm, slope*osm + intercept)
    plt.xlabel('Theoretical quantiles',fontsize=14)
    plt.ylabel('Observed quantiles',fontsize=14)

In [55]:
x = np.random.randn(100)
plt.subplot(2,2,1)
qq_plot(x)

x = np.random.exponential(2,100)
plt.subplot(2,2,2)
qq_plot(x)


plt.tight_layout()
plt.show()


1.2 Equal variance (if two or more groups)


In [33]:
Image(url="http://goodsciencebadscience.nl/wp-content/uploads/2012/09/variances1.gif",width=500)


Out[33]:

How to test it

  • Levene test (it doesn't assume normality)

In [12]:
#random normal data (mean 0 std 1)
x = np.random.randn(100)
#random normal data (mean 0 std 2)
y = np.random.randn(100)*2

#Test
print(scipy.stats.levene(x,y))


LeveneResult(statistic=40.098137331112042, pvalue=1.5876312204292832e-09)

1.3 i.i.d. (independent and identically distributed)

This comes from the data collection.

  • Each observation is independent
  • There are no subpopulations in the population

1.4 Uncorrelated residuals (for linear regression)


In [34]:
print("These are the residuals")
Image(url="https://upload.wikimedia.org/wikipedia/commons/e/ed/Residuals_for_Linear_Regression_Fit.png",width=500)


These are the residuals
Out[34]:

In [8]:
Image(url="https://i.stack.imgur.com/RU17l.png")


Out[8]:
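To check this assumption yourself, fit a line and plot the residuals against the predictor; a minimal sketch with numpy and a made-up dataset:

```python
import numpy as np
import pylab as plt

np.random.seed(0)
x = np.linspace(0, 10, 100)
y = 2*x + 1 + np.random.randn(100)  #a linear trend plus normal noise

#Fit a straight line (ordinary least squares) and compute the residuals
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope*x + intercept)

#The residuals should scatter evenly around zero, with no trend or funnel shape
plt.plot(x, residuals, '.')
plt.axhline(0, color='k')
plt.xlabel('x', fontsize=14)
plt.ylabel('Residuals', fontsize=14)
plt.show()
```

If the residuals curve or fan out as x grows, the linear model is misspecified.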

1.5 No or little multicollinearity (for linear regression)

  • Your independent variables (predictors) are not measuring the same thing
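A quick first check is the correlation matrix of your predictors; a sketch with made-up variables (statsmodels also offers `variance_inflation_factor` for a more formal check):

```python
import numpy as np

np.random.seed(0)
x1 = np.random.randn(200)           #first predictor
x2 = x1 + 0.1*np.random.randn(200)  #nearly a copy of x1 -> collinear
x3 = np.random.randn(200)           #unrelated predictor

#Pairwise correlations between the predictors
corr = np.corrcoef([x1, x2, x3])
print(np.round(corr, 2))
```

A correlation near 1 (here between x1 and x2) flags that two predictors carry the same information.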

2. t-test

2.1 Sample against value

  • Imagine you measure increase in productivity, and you want to see if the increase is significant

In [15]:
from scipy.stats import norm
#Our sample, normally distributed with mean 2 and std 1
x = np.random.randn(100)+2

#histogram
plt.subplot(2,2,1)
sns.distplot(x,kde=False,fit=norm)


plt.subplot(2,2,2)
qq_plot(x)
plt.show()


print("-"*20)
print("Test for mean of group different than zero")
scipy.stats.ttest_1samp(x,popmean=0)


--------------------
Test for mean of group different than zero
Out[15]:
Ttest_1sampResult(statistic=19.561758393832346, pvalue=8.7108452542495534e-36)

2.2 Two independent samples

  • Imagine you measure productivity for a control group and an experimental group
  • Imagine it's normally distributed
  • You want to see if the productivity of the groups is different

In [16]:
from scipy.stats import norm
#Our sample, normally distributed with mean 0.1 and std 1
x = np.random.randn(100)+0.1
#Our other sample, normally distributed with mean 0.5 and std 1
y = np.random.randn(100)+0.5


#Equal variance?
#We are testing if we can do the test
#If significant it means that the variances are different
print("Test for equal variance")
print(scipy.stats.levene(x,y))


Test for equal variance
LeveneResult(statistic=0.18372231239245079, pvalue=0.66865960819279313)

In [17]:
#histogram
plt.subplot(2,2,1)
sns.distplot(x,kde=True)
sns.distplot(y,kde=True)


plt.subplot(2,2,2)
qq_plot(x)
qq_plot(y)

plt.tight_layout()
plt.show()

print("-"*20)
print("Test for different means")
scipy.stats.ttest_ind(x,y)


/opt/anaconda/anaconda3/lib/python3.5/site-packages/statsmodels/nonparametric/kdetools.py:20: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
  y = X[:m/2+1] + np.r_[0,X[m/2+1:],0]*1j
--------------------
Test for different means
Out[17]:
Ttest_indResult(statistic=-2.9602185045213134, pvalue=0.0034495814845479505)

2.3 Two paired samples

  • Imagine you measure productivity for a group before starting a new program and after the program (each individual has two observations)
  • Imagine it's normally distributed
  • You can test if the productivity has increased

In [18]:
from scipy.stats import norm
#Our sample, normally distributed with mean 0.1 and std 1
x = np.random.randn(100)+0.1
#Our other sample, similar to x but adding 0.05 and some random noise
y = x+np.random.randn(100)/10 + 0.05


#Equal variance?
print("Test for equal variance")
print(scipy.stats.levene(x,y))


Test for equal variance
LeveneResult(statistic=1.2464290960049443e-07, pvalue=0.9997186638758232)

In [19]:
#histogram
plt.subplot(2,2,1)
sns.distplot(x,kde=True)
sns.distplot(y,kde=True)


plt.subplot(2,2,2)
qq_plot(x)
qq_plot(y)

plt.tight_layout()
plt.show()


/opt/anaconda/anaconda3/lib/python3.5/site-packages/statsmodels/nonparametric/kdetools.py:20: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
  y = X[:m/2+1] + np.r_[0,X[m/2+1:],0]*1j

In [20]:
print("WRONG: What would happen if we don't use a paired t-test")
print(scipy.stats.ttest_ind(x,y))

print("-"*20)
print("Test for different means")
scipy.stats.ttest_rel(x,y)


WRONG: What would happen if we don't use a paired t-test
Ttest_indResult(statistic=-0.36128847200154984, pvalue=0.71826902336070653)
--------------------
Test for different means
Out[20]:
Ttest_relResult(statistic=-5.1739506148333767, pvalue=1.2001340429502704e-06)

2.4 What happens if they don't have equal variance (but are normally distributed)? Use Welch's t-test


In [94]:
scipy.stats.ttest_rel?

In [23]:
#Our sample, normally distributed with mean 0.1 and std 1
x = np.random.randn(100)+0.1
#Our other sample, normally distributed with mean 0.2 and std 2
y = np.random.randn(50)*2+0.2


#Equal variance?
print("Test for equal variance")
print(scipy.stats.levene(x,y))


Test for equal variance
LeveneResult(statistic=33.947129051276335, pvalue=3.405985034562977e-08)

In [31]:
#histogram
plt.subplot(2,2,1)
sns.distplot(x,kde=True)
sns.distplot(y,kde=True)


plt.subplot(2,2,2)
qq_plot(x)
qq_plot(y)

plt.tight_layout()
plt.show()

print("WRONG: What happens if we try the test assuming equal variance")
print(scipy.stats.ttest_ind(x,y))

print("-"*20)
print("Test for different means with different variance")
scipy.stats.ttest_ind(x,y,equal_var=False)


/opt/anaconda/anaconda3/lib/python3.5/site-packages/statsmodels/nonparametric/kdetools.py:20: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
  y = X[:m/2+1] + np.r_[0,X[m/2+1:],0]*1j
WRONG: What happens if we try the test assuming equal variance
Ttest_indResult(statistic=-1.6473476178397546, pvalue=0.10160817641110562)
--------------------
Test for different means with different variance
Out[31]:
Ttest_indResult(statistic=-1.3503948008453872, pvalue=0.18178335336590215)

2.5 What happens if they are not normally distributed (but equal variance)

  • Use only when the number of observations in each sample is > 20 and you have 2 independent samples
  • Mann-Whitney U test (compares ranks)

In [39]:
#Our sample, exponentially distributed 
x = np.random.exponential(2,100)
#Our other sample, exponentially distributed + 0.1
y = np.random.exponential(2,100)+0.1


#Equal variance?
print("Test for equal variance")
print(scipy.stats.levene(x,y))


Test for equal variance
LeveneResult(statistic=0.31029262175336686, pvalue=0.5781297914261414)

In [40]:
#histogram
plt.subplot(2,2,1)
sns.distplot(x,kde=True)
sns.distplot(y,kde=True)


plt.subplot(2,2,2)
qq_plot(x)
qq_plot(y)

plt.show()


/opt/anaconda/anaconda3/lib/python3.5/site-packages/statsmodels/nonparametric/kdetools.py:20: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
  y = X[:m/2+1] + np.r_[0,X[m/2+1:],0]*1j

In [42]:
print("WRONG: What happens if we try t-test")
print(scipy.stats.ttest_ind(x,y))

print("-"*20)
print("Test for different means with no normal distributions, equal variance")
scipy.stats.mannwhitneyu(x,y)


WRONG: What happens if we try t-test
Ttest_indResult(statistic=-0.47338561343706537, pvalue=0.63645958719886819)
--------------------
Test for different means with no normal distributions, equal variance
Out[42]:
MannwhitneyuResult(statistic=4916.0, pvalue=0.41916777366941338)

2.6 What happens if they are not normally distributed (but equal variance) and paired samples?

  • Use only when the number of observations in each sample is > 20 and you have 2 paired samples
  • Wilcoxon signed-rank test

In [43]:
#Our sample, uniformly distributed between 0 and 1
x = np.random.random(100)
#Our other sample, x plus 0.05 and some random noise
y = x+np.random.randn(100)/10 + 0.05

#Equal variance?
print("Test for equal variance")
print(scipy.stats.levene(x,y))


Test for equal variance
LeveneResult(statistic=0.23002853591535549, pvalue=0.63203201208956483)

In [44]:
#histogram
plt.subplot(2,2,1)
sns.distplot(x,kde=True)
sns.distplot(y,kde=True)


plt.subplot(2,2,2)
qq_plot(x)
qq_plot(y)

plt.show()


/opt/anaconda/anaconda3/lib/python3.5/site-packages/statsmodels/nonparametric/kdetools.py:20: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
  y = X[:m/2+1] + np.r_[0,X[m/2+1:],0]*1j

In [45]:
print("WRONG: What happens if we try t-test")
print(scipy.stats.ttest_rel(x,y))

print("-"*20)
print("Test for different means with no normal distributions and equal variance")
print(scipy.stats.wilcoxon(x,y))


WRONG: What happens if we try t-test
Ttest_relResult(statistic=-7.0915815375897777, pvalue=1.9977883955834856e-10)
--------------------
Test for different means with no normal distributions and equal variance
WilcoxonResult(statistic=824.0, pvalue=4.9575571539880461e-09)

2.7 What happens if they are not normally distributed and don't have equal variance?

  • You can only test if the distributions are different (booo)
  • scipy.stats.mannwhitneyu(x,y)
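As a sketch, with two made-up exponential samples whose scales (and hence variances) differ:

```python
import numpy as np
import scipy.stats

np.random.seed(0)
x = np.random.exponential(1, 100)  #skewed, smaller variance
y = np.random.exponential(3, 100)  #skewed, larger variance

#Variances differ, so a t-test is off the table
print("Test for equal variance")
print(scipy.stats.levene(x, y))

#We can still test whether the two distributions differ
print("Test for different distributions")
print(scipy.stats.mannwhitneyu(x, y))
```

Keep in mind the conclusion is "the distributions differ", not specifically "the means differ".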

3. More than two samples

ANOVA (Parametric) or Kruskal-Wallis (Non parametric)

Analysis of variance (ANOVA) is a collection of statistical models used to analyze the differences between group means and their associated procedures (such as "variation" among and between groups).

  • ~ t-test
  • If normality

The Kruskal–Wallis one-way analysis of variance by ranks (named after William Kruskal and W. Allen Wallis) is a non-parametric method for testing whether samples originate from the same distribution.

  • ~ Mann-Whitney-U
  • If no-normality but equal variance

3.1 Compare if the groups are different


In [46]:
#Our sample, normally distributed with mean 0.1 and std 1
x = np.random.randn(100)+0.1
#Our other sample, normally distributed with mean 0.2 and std 1
y = np.random.randn(100)+0.2
#Our other sample, normally distributed with mean 0.3 and std 1
z = np.random.randn(100)+0.3


#Equal variance?
print("Test for equal variance")
print(scipy.stats.levene(x,y,z))


Test for equal variance
LeveneResult(statistic=2.0480092714597573, pvalue=0.13080921080653679)

In [47]:
#histogram
plt.subplot(2,2,1)
sns.distplot(x,kde=True)
sns.distplot(y,kde=True)
sns.distplot(z,kde=True)


plt.subplot(2,2,2)
qq_plot(x)
qq_plot(y)
qq_plot(z)

plt.show()

print("-"*20)
print("ONE-WAY ANOVA")
print(scipy.stats.f_oneway(x, y, z)    )
print("-"*20)
print("ONE-WAY KRUSKAL WALLIS") #use if not-normal and equal variance
print(scipy.stats.mstats.kruskalwallis(x,y,z))


/opt/anaconda/anaconda3/lib/python3.5/site-packages/statsmodels/nonparametric/kdetools.py:20: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
  y = X[:m/2+1] + np.r_[0,X[m/2+1:],0]*1j
--------------------
ONE-WAY ANOVA
F_onewayResult(statistic=1.9446821552695626, pvalue=0.1448496537094488)
--------------------
ONE-WAY KRUSKAL WALLIS
KruskalResult(statistic=3.3152000000000044, pvalue=0.19059586169236964)

3.2 Problem with many groups (the look-elsewhere effect)


In [64]:
#Let's compare two identically distributed samples and see how often we get false positives (expect ~5 of 100 at p < 0.05)
for i in range(100):
    x = np.random.randn(100)
    y = np.random.randn(100)
    stat, pvalue = scipy.stats.ttest_ind(x,y)
    if pvalue < 0.05: 
        print(i,pvalue)

3.3 Compare which groups are different

3.3.1 Bonferroni correction

  • Corrects for the number of comparisons that you WILL make (set the threshold before the experiment. Don't cheat, please)
  • The new threshold to consider something significant is $\frac{0.05}{\#\ of\ comparisons}$

In [65]:
#Our sample, normally distributed with mean 0.1 and std 1
x = np.random.randn(100)+0.1
#Our other sample, normally distributed with mean 0.2 and std 1
y = np.random.randn(100)+0.2
#Our other sample, normally distributed with mean 0.3 and std 1
z = np.random.randn(100)+0.3
#Our other sample, normally distributed with mean 0.5 and std 1
w = np.random.randn(100)+0.5

print("New threshold (instead of 0.05) = ",0.05/6)

#Compare x and y
print("x and y: ", scipy.stats.ttest_ind(x,y))
#Compare x and z
print("x and z: ", scipy.stats.ttest_ind(x,z))
#Compare x and w
print("x and w: ", scipy.stats.ttest_ind(x,w))
#Compare y and z
print("y and z: ", scipy.stats.ttest_ind(y,z))
#Compare y and w
print("y and w: ", scipy.stats.ttest_ind(y,w))
#Compare z and w
print("z and w: ", scipy.stats.ttest_ind(z,w))


New threshold (instead of 0.05) =  0.008333333333333333
x and y:  Ttest_indResult(statistic=-1.5176204931996204, pvalue=0.13070511158476028)
x and z:  Ttest_indResult(statistic=-1.2996294462939917, pvalue=0.19523907106939523)
x and w:  Ttest_indResult(statistic=-1.8995627011104539, pvalue=0.058944076616144633)
y and z:  Ttest_indResult(statistic=0.1720575490687388, pvalue=0.86356801701764385)
y and w:  Ttest_indResult(statistic=-0.43104499058947698, pvalue=0.66690455874746379)
z and w:  Ttest_indResult(statistic=-0.58344029803134156, pvalue=0.56026092381780823)
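The same correction can be applied automatically; a sketch using statsmodels' `multipletests` with the six p-values above (rounded for illustration):

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

#The six pairwise p-values from above, rounded
pvals = np.array([0.13, 0.20, 0.059, 0.86, 0.67, 0.56])

#Bonferroni: multiply each p-value by the number of comparisons (capped at 1)
reject, corrected, _, _ = multipletests(pvals, alpha=0.05, method='bonferroni')
print(reject)     #which comparisons survive the correction
print(corrected)  #the corrected p-values
```

Here none of the comparisons survive, consistent with the manual thresholding above.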

3.3.2 Tukey's HSD (honest significant difference) test. Recommended

  • Assumes normality
  • Much more power (fewer false negatives) than Bonferroni, which becomes very conservative as the number of comparisons grows

In [144]:
from statsmodels.stats.multicomp import pairwise_tukeyhsd

In [68]:
df = pd.read_csv("data/tukey_example.csv")
df.sample(10)


Out[68]:
productivity group
178 -0.913089 b
64 -0.014540 a
395 2.065359 d
53 1.224755 a
31 0.239282 a
122 0.982057 b
60 0.081560 a
174 0.972888 b
308 1.297314 d
132 0.643592 b

In [69]:
res2 = pairwise_tukeyhsd(df["productivity"],df["group"])
print(res2)


Multiple Comparison of Means - Tukey HSD,FWER=0.05
============================================
group1 group2 meandiff  lower  upper  reject
--------------------------------------------
  a      b     0.0542  -0.2883 0.3967 False 
  a      c     0.1966  -0.1459 0.5392 False 
  a      d     0.5083   0.1658 0.8509  True 
  b      c     0.1424  -0.2001 0.485  False 
  b      d     0.4541   0.1116 0.7967  True 
  c      d     0.3117  -0.0308 0.6542 False 
--------------------------------------------

In [70]:
res2.plot_simultaneous(comparison_name=None,xlabel='diffs',ylabel='Group')
plt.show()


4. Fitting distributions

Why do we care about distributions?

  • They tell us a lot about the mechanism generating our values

  • Normal distribution:

    • Nothing weird going on (just normal)
  • Lognormal distribution:
    • Multiplicative process:
      • The value depends on the product of many random factors.
      • For example, each unit of time you grow by some random factor.
    • Many financial indicators are lognormally distributed
    • Looks like a normal distribution if you take the logarithm of your values (np.log(x))
  • Exponential distribution:
    • Time between events (generated by a Poisson process)
    • If the time between events follows an exponential distribution, the events are independent
    • It's a straight line if the y axis is in log scale
  • Power-law distribution:

    • Rich-get-richer mechanism: Your growth depends on your value -> Inequality is unavoidable
    • The tails of some human and financial phenomena follow power-laws.
    • The tail is a straight line in log-log scale (both y and x axis)
  • Binomial distribution: Number of successes out of a number of trials. Each event independent

  • Poisson distribution: Number of events in a fixed amount of time. Each event independent

In [73]:
from scipy.stats import lognorm,norm,expon,powerlaw
x = np.random.lognormal(1,0.8,3000) 
plt.subplot(2,2,1)
sns.distplot(x,fit=norm,kde=False)
plt.title("norm")

plt.subplot(2,2,2)
sns.distplot(x,fit=lognorm,kde=False)
plt.title("lognorm")

plt.subplot(2,2,3)
sns.distplot(x,fit=expon,kde=False)
plt.title("expon")

plt.subplot(2,2,4)
sns.distplot(x,fit=powerlaw,kde=False)
plt.title("powerlaw")

plt.tight_layout()
plt.show()


5. Calculate Confidence Intervals with bootstrap methods

  • How does it work?
  • You have a sample (imagine 100 values)
  • Do 10000 times:
    • Take 100 values from the sample (with replacement)
    • Calculate the mean (or std)
  • Take the percentiles 2.5 and 97.5 of those 10000 values -> those are (approximately) your 95% confidence interval
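The recipe above can be written directly in numpy; a minimal sketch of the percentile bootstrap:

```python
import numpy as np

np.random.seed(0)
x = np.random.randn(100)  #our sample

#Resample with replacement many times, collecting the statistic each time
boot_means = []
for _ in range(10000):
    resample = np.random.choice(x, size=len(x), replace=True)
    boot_means.append(resample.mean())

#The 2.5 and 97.5 percentiles bracket ~95% of the bootstrap means
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print(lower, upper)
```

Swapping `resample.mean()` for `resample.std()` (or any other statistic) gives a CI for that statistic instead, which is what scikits.bootstrap does for us below.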

In [71]:
import scikits.bootstrap as bootstrap

In [73]:
#Bootstrap of mean and std
x = np.random.randn(100)*3 #this could be one of our columns

CIs = bootstrap.ci(x, statfunction=np.mean,n_samples=100000)  
print('CI for mean with bootstrapping = ', CIs)

CIs = bootstrap.ci(x, statfunction=np.std,n_samples=100000)  
print('CI for std with bootstrapping = ', CIs)


CI for mean with bootstrapping =  [-1.13814925  0.09261081]
CI for std with bootstrapping =  [ 2.77119667  3.62362368]

In [75]:
x = np.random.randn(100)*3
np.mean(x),np.std(x)


Out[75]:
(0.33822769269829528, 3.1644168130907633)

In [76]:
CIs = bootstrap.ci(x, statfunction=np.mean,n_samples=100000)  
print('CI for mean with bootstrapping = ', CIs)


CI for mean with bootstrapping =  [-0.26813271  0.9691153 ]

In [78]:
scipy.stats.ttest_1samp(x,popmean=0)


Out[78]:
Ttest_1sampResult(statistic=1.0634891829804902, pvalue=0.29014734833454026)

In [79]:
#Bootstrap of the p-value
def return_pvalue(x):
    stat,pvalue = scipy.stats.ttest_1samp(x,0)
    return pvalue

CIs = bootstrap.ci(x, statfunction=return_pvalue, n_samples=10000)  
print('CI for p-value with bootstrapping = ', CIs)


CI for p-value with bootstrapping =  [ 0.00407042  0.96082875]

In-class exercise


In [109]:
df = pd.read_csv("data/big3_position.csv",sep="\t")
df.head()


Out[109]:
Company_name Company_ID Big3Share Position Revenue Assets Employees MarketCap Exchange TypeEnt
0 INVESCO LTD BM40671R 17.85 1 7500000.0 NaN 7500.0 13123024.0 New York Stock Exchange (NYSE) Bank
1 ROYAL CARIBBEAN CRUISES LTD LR30002MX 14.32 3 7500000.0 NaN 7500.0 16739323.0 New York Stock Exchange (NYSE) Industrial company
2 GENCO SHIPPING & TRADING LTD MH30004AQ 0.14 31 350000.0 NaN 1500.0 43392.0 New York Stock Exchange (NYSE) Industrial company
3 EAGLE BULK SHIPPING INC MH40003AQ 2.85 9 350000.0 NaN 750.0 26674.0 NASDAQ National Market Industrial company
4 POWERSHARES DB US DOLLAR INDEX BEARISH US*S00100679 0.00 101 NaN NaN NaN 44120.0 NYSE ARCA Industrial company

Part A. Compare the number of employees of industrial vs financial companies


In [81]:
industrial = df.loc[df["TypeEnt"]=="Industrial company"]
financial = df.loc[df["TypeEnt"]=="Financial company"]

In [84]:
#Employees (like most size variables) is lognormally distributed -> taking the log converts it to a normal distribution so we can run our tests happily
i = np.log(industrial["Employees"].dropna())
f = np.log(financial["Employees"].dropna())

from scipy.stats import lognorm,norm
plt.figure(figsize=(10,3))
plt.subplot(1,2,1)
sns.distplot(df["Employees"].dropna(),fit=lognorm,kde=False)

plt.subplot(1,2,2)
sns.distplot(np.log(df["Employees"].dropna()),fit=norm,kde=False)


Out[84]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f60542e7400>

In [89]:
i.head()


Out[89]:
1     8.922658
2     7.313220
3     6.620073
23    3.761200
24    4.672829
Name: Employees, dtype: float64

In [90]:
f.head()


Out[90]:
31    7.725330
54    4.189655
56    3.218876
87    6.236370
91    3.891820
Name: Employees, dtype: float64

In [92]:
#Check if our distributions have the same variance
scipy.stats.levene(i,f)


Out[92]:
LeveneResult(statistic=3.3037088094994504, pvalue=0.069226865802670651)

In [97]:
sns.distplot(i)
sns.distplot(f)


/opt/anaconda/anaconda3/lib/python3.5/site-packages/statsmodels/nonparametric/kdetools.py:20: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
  y = X[:m/2+1] + np.r_[0,X[m/2+1:],0]*1j
Out[97]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f6053f657b8>

In [95]:
#Plot histograms and qq_plots for i and f
qq_plot(i)
qq_plot(f)



In [98]:
#Run the correct statistical test to know if the 
#number of employees of financial and industrial companies is different
scipy.stats.mannwhitneyu(i,f)


Out[98]:
MannwhitneyuResult(statistic=233817.5, pvalue=3.205266258082931e-22)

Part B. Compare the market capitalization for different types of companies


In [123]:
df = pd.read_csv("data/big3_position.csv",sep="\t")
df.head()


Out[123]:
Company_name Company_ID Big3Share Position Revenue Assets Employees MarketCap Exchange TypeEnt
0 INVESCO LTD BM40671R 17.85 1 7500000.0 NaN 7500.0 13123024.0 New York Stock Exchange (NYSE) Bank
1 ROYAL CARIBBEAN CRUISES LTD LR30002MX 14.32 3 7500000.0 NaN 7500.0 16739323.0 New York Stock Exchange (NYSE) Industrial company
2 GENCO SHIPPING & TRADING LTD MH30004AQ 0.14 31 350000.0 NaN 1500.0 43392.0 New York Stock Exchange (NYSE) Industrial company
3 EAGLE BULK SHIPPING INC MH40003AQ 2.85 9 350000.0 NaN 750.0 26674.0 NASDAQ National Market Industrial company
4 POWERSHARES DB US DOLLAR INDEX BEARISH US*S00100679 0.00 101 NaN NaN NaN 44120.0 NYSE ARCA Industrial company

In [124]:
#Create a new column that is the logarithm of "MarketCap", call it "logMarketCap"
df["logMarketCap"] = np.log(df["MarketCap"])
df.head()


/opt/anaconda/anaconda3/lib/python3.5/site-packages/ipykernel/__main__.py:2: RuntimeWarning: divide by zero encountered in log
  from ipykernel import kernelapp as app
Out[124]:
Company_name Company_ID Big3Share Position Revenue Assets Employees MarketCap Exchange TypeEnt logMarketCap
0 INVESCO LTD BM40671R 17.85 1 7500000.0 NaN 7500.0 13123024.0 New York Stock Exchange (NYSE) Bank 16.389879
1 ROYAL CARIBBEAN CRUISES LTD LR30002MX 14.32 3 7500000.0 NaN 7500.0 16739323.0 New York Stock Exchange (NYSE) Industrial company 16.633271
2 GENCO SHIPPING & TRADING LTD MH30004AQ 0.14 31 350000.0 NaN 1500.0 43392.0 New York Stock Exchange (NYSE) Industrial company 10.678030
3 EAGLE BULK SHIPPING INC MH40003AQ 2.85 9 350000.0 NaN 750.0 26674.0 NASDAQ National Market Industrial company 10.191445
4 POWERSHARES DB US DOLLAR INDEX BEARISH US*S00100679 0.00 101 NaN NaN NaN 44120.0 NYSE ARCA Industrial company 10.694668

In [112]:
df.describe()


Out[112]:
Big3Share Position Revenue Assets Employees MarketCap logMarketCap
count 3882.000000 3882.000000 3.837000e+03 3.695000e+03 3.300000e+03 3.882000e+03 3882.000000
mean 10.323029 16.122875 3.467816e+06 9.155279e+06 1.066365e+04 5.706551e+06 -inf
std 7.951140 34.876491 1.490323e+07 6.451795e+07 5.129945e+04 2.346929e+07 NaN
min 0.000000 1.000000 -3.757200e+04 2.190000e+02 1.000000e+00 0.000000e+00 -inf
25% 2.420000 1.000000 6.544700e+04 1.588285e+05 2.640000e+02 1.372448e+05 11.829521
50% 10.365000 2.000000 3.867370e+05 8.707830e+05 1.327500e+03 6.494275e+05 13.383845
75% 17.197500 8.000000 1.830800e+06 3.674560e+06 6.243750e+03 2.733080e+06 14.820940
max 48.190000 165.000000 4.821300e+08 2.144316e+09 2.300000e+06 5.496596e+08 20.124810

In [125]:
#Replace infinite values (log(0) = -inf) with nan
df = df.replace([np.inf, -np.inf], np.nan)

#Drop rows where "logMarketCap" is na
df = df.dropna(subset=["logMarketCap"])

In [114]:
df.describe()


Out[114]:
Big3Share Position Revenue Assets Employees MarketCap logMarketCap
count 3878.000000 3878.000000 3.833000e+03 3.692000e+03 3.296000e+03 3.878000e+03 3878.000000
mean 10.333677 16.036101 3.471048e+06 9.162528e+06 1.067561e+04 5.712437e+06 13.347133
std 7.948320 34.789581 1.491066e+07 6.454366e+07 5.132941e+04 2.348068e+07 2.181133
min 0.000000 1.000000 -3.757200e+04 2.190000e+02 1.000000e+00 1.000000e+00 0.000000
25% 2.420000 1.000000 6.612000e+04 1.590505e+05 2.647500e+02 1.376925e+05 11.832778
50% 10.375000 2.000000 3.867670e+05 8.743075e+05 1.329500e+03 6.529460e+05 13.389249
75% 17.200000 8.000000 1.837287e+06 3.679478e+06 6.279750e+03 2.737552e+06 14.822574
max 48.190000 165.000000 4.821300e+08 2.144316e+09 2.300000e+06 5.496596e+08 20.124810

In [115]:
#Find the distribution of "MarketCap" and "logMarketCap", fitting the right distribution
from scipy.stats import norm,lognorm,expon
sns.distplot(df["MarketCap"],fit=lognorm,kde=False)


Out[115]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f6053f80860>

In [116]:
#Find the distribution of "MarketCap" and "logMarketCap", fitting the right distribution
from scipy.stats import norm,lognorm,expon
sns.distplot(df["logMarketCap"],fit=norm,kde=False)


Out[116]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f6053de8e10>

In [126]:
#Keep only three types of entities
df = df.loc[df["TypeEnt"].isin(["Financial company","Industrial company","Bank"])]

In [118]:
#Plot distributions for each type
plt.figure(figsize=(6,4))
for typeent,group in df.groupby("TypeEnt"):
    sns.distplot(group["logMarketCap"],kde=False,norm_hist=True,label=typeent)

    plt.legend()



In [127]:
#Run ANOVA
bank = df.loc[df["TypeEnt"] == "Bank"]
bank_values = bank["logMarketCap"]

ind = df.loc[df["TypeEnt"] == "Industrial company"]
ind_values = ind["logMarketCap"]

fin = df.loc[df["TypeEnt"] == "Financial company"]
fin_values = fin["logMarketCap"]

scipy.stats.f_oneway(bank_values,ind_values,fin_values)


Out[127]:
F_onewayResult(statistic=9.6548583490128426, pvalue=6.5712383714430791e-05)

In [128]:
#Run Tukey test
res2 = pairwise_tukeyhsd(df["logMarketCap"],df["TypeEnt"])
print(res2)


         Multiple Comparison of Means - Tukey HSD,FWER=0.05         
====================================================================
      group1            group2       meandiff  lower   upper  reject
--------------------------------------------------------------------
       Bank       Financial company   0.6829   0.3149  1.051   True 
       Bank       Industrial company  0.238   -0.0201  0.4962 False 
Financial company Industrial company -0.4449  -0.7385 -0.1513  True 
--------------------------------------------------------------------

In [129]:
#Plot tukey test
res2.plot_simultaneous(comparison_name=None,xlabel='diffs',ylabel='Group')
plt.show()