Statistical analysis

In this notebook we use pandas and the stats module from scipy for some basic statistical analysis.



In [10]:

    
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

from scipy import stats 

import pandas as pd

We use pandas to load the 'adult' data set from the UC Irvine Machine Learning Repository in our dataframe.



In [11]:

    
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data', names=["Age", "Workclass", "fnlwgt", "Education", "Education-Num", "Martial Status",
        "Occupation", "Relationship", "Race", "Sex", "Capital Gain", "Capital Loss",
        "Hours per week", "Country", "Target"])
df.head()









    Out[11]:






  
    
      
      Age
      Workclass
      fnlwgt
      Education
      Education-Num
      Martial Status
      Occupation
      Relationship
      Race
      Sex
      Capital Gain
      Capital Loss
      Hours per week
      Country
      Target
    
  
  
    
      0
      39
      State-gov
      77516
      Bachelors
      13
      Never-married
      Adm-clerical
      Not-in-family
      White
      Male
      2174
      0
      40
      United-States
      <=50K
    
    
      1
      50
      Self-emp-not-inc
      83311
      Bachelors
      13
      Married-civ-spouse
      Exec-managerial
      Husband
      White
      Male
      0
      0
      13
      United-States
      <=50K
    
    
      2
      38
      Private
      215646
      HS-grad
      9
      Divorced
      Handlers-cleaners
      Not-in-family
      White
      Male
      0
      0
      40
      United-States
      <=50K
    
    
      3
      53
      Private
      234721
      11th
      7
      Married-civ-spouse
      Handlers-cleaners
      Husband
      Black
      Male
      0
      0
      40
      United-States
      <=50K
    
    
      4
      28
      Private
      338409
      Bachelors
      13
      Married-civ-spouse
      Prof-specialty
      Wife
      Black
      Female
      0
      0
      40
      Cuba
      <=50K

Descriptive statistics

Let's have a first look at the shape of our dataframe.



In [12]:

    
df.shape









    Out[12]:





(32561, 15)

We can calculate the mean, median, standard error of the mean (sem), variance, standard deviation (std) and the quantiles for every column in the dataframe



In [13]:

    
df.mean()









    Out[13]:





Age                   38.581647
fnlwgt            189778.366512
Education-Num         10.080679
Capital Gain        1077.648844
Capital Loss          87.303830
Hours per week        40.437456
dtype: float64



In [14]:

    
df.median()









    Out[14]:





Age                   37.0
fnlwgt            178356.0
Education-Num         10.0
Capital Gain           0.0
Capital Loss           0.0
Hours per week        40.0
dtype: float64



In [15]:

    
df.sem()









    Out[15]:





Age                 0.075593
fnlwgt            584.937250
Education-Num       0.014258
Capital Gain       40.927838
Capital Loss        2.233126
Hours per week      0.068427
dtype: float64



In [16]:

    
df.var()









    Out[16]:





Age               1.860614e+02
fnlwgt            1.114080e+10
Education-Num     6.618890e+00
Capital Gain      5.454254e+07
Capital Loss      1.623769e+05
Hours per week    1.524590e+02
dtype: float64



In [17]:

    
df.std()









    Out[17]:





Age                   13.640433
fnlwgt            105549.977697
Education-Num          2.572720
Capital Gain        7385.292085
Capital Loss         402.960219
Hours per week        12.347429
dtype: float64



In [18]:

    
df.quantile(q=0.5)









    Out[18]:





Age                   37.0
fnlwgt            178356.0
Education-Num         10.0
Capital Gain           0.0
Capital Loss           0.0
Hours per week        40.0
Name: 0.5, dtype: float64



In [19]:

    
df.quantile(q=[0.05, 0.95])









    Out[19]:






  
    
      
      Age
      fnlwgt
      Education-Num
      Capital Gain
      Capital Loss
      Hours per week
    
  
  
    
      0.05
      19.0
      39460.0
      5.0
      0.0
      0.0
      18.0
    
    
      0.95
      63.0
      379682.0
      14.0
      5013.0
      0.0
      60.0



In [20]:

    
df.Age.std()









    Out[20]:





13.640432553581341



In [21]:

    
df['Age'].count()









    Out[21]:





32561



In [22]:

    
df['Age'].mean()









    Out[22]:





38.581646755320783

In the next sample we replace a value with None so that we can show how to hanlde missing values in a dataframe.



In [224]:

    
df_copy = df.copy()
df_copy.set_value(0, 'Age', None)
df_copy.ix[0:0]









    Out[224]:






  
    
      
      Age
      Workclass
      fnlwgt
      Education
      Education-Num
      Martial Status
      Occupation
      Relationship
      Race
      Sex
      Capital Gain
      Capital Loss
      Hours per week
      Country
      Target
    
  
  
    
      0
      NaN
      State-gov
      77516
      Bachelors
      13
      Never-married
      Adm-clerical
      Not-in-family
      White
      Male
      2174
      0
      40
      United-States
      <=50K



In [225]:

    
df.isnull().values.any()









    Out[225]:





False



In [226]:

    
df_copy.isnull().values.any()









    Out[226]:





True



In [227]:

    
df_copy.isnull().sum()









    Out[227]:





Age               1
Workclass         0
fnlwgt            0
Education         0
Education-Num     0
Martial Status    0
Occupation        0
Relationship      0
Race              0
Sex               0
Capital Gain      0
Capital Loss      0
Hours per week    0
Country           0
Target            0
dtype: int64



In [228]:

    
df_copy = df_copy.dropna()



In [229]:

    
df_copy.isnull().values.any()









    Out[229]:





False



In [230]:

    
df_copy = df.copy()

df_copy.set_value(0, 'Age', None)

df_copy['Age'] = df_copy['Age'].fillna(df_copy['Age'].median())

df_copy.ix[0:0]









    Out[230]:






  
    
      
      Age
      Workclass
      fnlwgt
      Education
      Education-Num
      Martial Status
      Occupation
      Relationship
      Race
      Sex
      Capital Gain
      Capital Loss
      Hours per week
      Country
      Target
    
  
  
    
      0
      37.0
      State-gov
      77516
      Bachelors
      13
      Never-married
      Adm-clerical
      Not-in-family
      White
      Male
      2174
      0
      40
      United-States
      <=50K

Inferential statistics



In [253]:

    
male = df[df['Sex'].str.contains('Male')]
female = df[df['Sex'].str.contains('Female')]

print(female.shape)
print(male.shape)









    



(10771, 15)
(21790, 15)



In [255]:

    
t, p = stats.ttest_ind(female['Age'], male['Age'])
print(t)
print(p)









    



-16.0925170119
4.82399306878e-58



In [259]:

    
sns.distplot(female.Age)









    Out[259]:





<matplotlib.axes._subplots.AxesSubplot at 0x111a09c18>



In [260]:

    
sns.distplot(male.Age)









    Out[260]:





<matplotlib.axes._subplots.AxesSubplot at 0x112015a20>



In [ ]:

	Age	Workclass	fnlwgt	Education	Education-Num	Martial Status	Occupation	Relationship	Race	Sex	Capital Gain	Hours per week	Country	Target
0	39	State-gov	77516	Bachelors	13	Never-married	Adm-clerical	Not-in-family	White	Male	2174	40	United-States	<=50K
1	50	Self-emp-not-inc	83311	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	13	United-States	<=50K
2	38	Private	215646	HS-grad	9	Divorced	Handlers-cleaners	Not-in-family	White	Male	0	40	United-States	<=50K
3	53	Private	234721	11th	7	Married-civ-spouse	Handlers-cleaners	Husband	Black	Male	0	40	United-States	<=50K
4	28	Private	338409	Bachelors	13	Married-civ-spouse	Prof-specialty	Wife	Black	Female	0	40	Cuba	<=50K

	Age	fnlwgt	Education-Num	Capital Gain	Capital Loss	Hours per week
0.05	19.0	39460.0	5.0	0.0	0.0	18.0
0.95	63.0	379682.0	14.0	5013.0	0.0	60.0