In this notebook we use pandas and the stats module from scipy for some basic statistical analysis.
In [10]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import pandas as pd
We use pandas to load the 'adult' data set from the UC Irvine Machine Learning Repository into a dataframe.
In [11]:
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data',
                 names=["Age", "Workclass", "fnlwgt", "Education", "Education-Num", "Marital Status",
                        "Occupation", "Relationship", "Race", "Sex", "Capital Gain", "Capital Loss",
                        "Hours per week", "Country", "Target"])
df.head()
Out[11]:
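The raw file separates fields with a comma followed by a space, so string columns keep a leading space when read with the default settings. The following is a minimal sketch on two made-up rows (not real records from the data set) that assumes the same layout and compares the default behaviour with the `skipinitialspace=True` option of `read_csv`. The column subset and the names `raw` and `clean` are chosen for illustration only.

```python
import io

import pandas as pd

# Two made-up rows in the same "comma + space" layout as adult.data.
text = "39, State-gov, Male\n28, Private, Female\n"
columns = ["Age", "Workclass", "Sex"]

# Default parsing keeps the space that follows each comma.
raw = pd.read_csv(io.StringIO(text), names=columns)

# skipinitialspace=True drops whitespace directly after the delimiter.
clean = pd.read_csv(io.StringIO(text), names=columns, skipinitialspace=True)

print(repr(raw.loc[0, "Sex"]))
print(repr(clean.loc[0, "Sex"]))
```

Whether to strip the spaces at load time or later with `.str.strip()` is a matter of taste; stripping at load time keeps later comparisons such as `df['Sex'] == 'Male'` simple.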
Let's have a first look at the shape of our dataframe.
In [12]:
df.shape
Out[12]:
We can calculate the mean, median, standard error of the mean (sem), variance, standard deviation (std), and quantiles for every numeric column in the dataframe.
In [13]:
df.mean(numeric_only=True)
Out[13]:
In [14]:
df.median(numeric_only=True)
Out[14]:
In [15]:
df.sem(numeric_only=True)
Out[15]:
In [16]:
df.var(numeric_only=True)
Out[16]:
In [17]:
df.std(numeric_only=True)
Out[17]:
In [18]:
df.quantile(q=0.5, numeric_only=True)
Out[18]:
In [19]:
df.quantile(q=[0.05, 0.95], numeric_only=True)
Out[19]:
In [20]:
df.Age.std()
Out[20]:
In [21]:
df['Age'].count()
Out[21]:
In [22]:
df['Age'].mean()
Out[22]:
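These statistics are closely related: the variance is the square of the standard deviation, the standard error of the mean is the standard deviation divided by the square root of the number of observations, and the 0.5 quantile is the median. The sketch below checks these relations on a small made-up series (the values are hypothetical, not taken from the adult data set), relying on the pandas default of `ddof=1` for `std`, `var` and `sem`.

```python
import math

import pandas as pd

# Hypothetical toy values, used only to illustrate the relations.
ages = pd.Series([25, 32, 47, 51, 38], name="Age")

n = ages.count()   # number of non-missing observations
std = ages.std()   # sample standard deviation (ddof=1)
var = ages.var()   # sample variance (ddof=1)
sem = ages.sem()   # standard error of the mean (ddof=1)

# Each check prints True if the relation holds.
print(math.isclose(var, std ** 2))
print(math.isclose(sem, std / math.sqrt(n)))
print(ages.quantile(0.5) == ages.median())
```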
In the next example we replace a value with None so that we can show how to handle missing values in a dataframe.
In [224]:
df_copy = df.copy()
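df_copy['Age'] = df_copy['Age'].astype(float)  # a missing value is stored as NaN, which needs a float column
df_copy.loc[0, 'Age'] = None
df_copy.loc[0:0]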
Out[224]:
In [225]:
df.isnull().values.any()
Out[225]:
In [226]:
df_copy.isnull().values.any()
Out[226]:
In [227]:
df_copy.isnull().sum()
Out[227]:
In [228]:
df_copy = df_copy.dropna()
In [229]:
df_copy.isnull().values.any()
Out[229]:
In [230]:
df_copy = df.copy()
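df_copy['Age'] = df_copy['Age'].astype(float)  # a missing value is stored as NaN, which needs a float column
df_copy.loc[0, 'Age'] = None
df_copy['Age'] = df_copy['Age'].fillna(df_copy['Age'].median())
df_copy.loc[0:0]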
Out[230]:
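The cells above show two common ways to deal with missing values: dropping the affected rows with `dropna()` or imputing them with `fillna()`, here using the column median. A minimal side-by-side sketch on a made-up four-row dataframe (the name `toy` and its values are hypothetical) makes the difference visible:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing age.
toy = pd.DataFrame({
    "Age": [39.0, np.nan, 28.0, 45.0],
    "Sex": ["Male", "Female", "Female", "Male"],
})

# Option 1: drop every row that contains a missing value.
dropped = toy.dropna()

# Option 2: keep all rows and replace the missing age with the median
# of the observed ages (median() skips NaN by default).
filled = toy.assign(Age=toy["Age"].fillna(toy["Age"].median()))

print(dropped.shape)
print(filled.shape)
print(filled.loc[1, "Age"])
```

Dropping rows loses the other columns of the affected records, while imputing keeps them at the cost of introducing an artificial value, so the better choice depends on how many values are missing and how the column is used later.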
In [253]:
male = df[df['Sex'].str.strip() == 'Male']      # exact match after removing surrounding whitespace
female = df[df['Sex'].str.strip() == 'Female']
print(female.shape)
print(male.shape)
In [255]:
t, p = stats.ttest_ind(female['Age'], male['Age'])
print(t)
print(p)
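By default `stats.ttest_ind` performs Student's t-test, which assumes that both groups have the same variance. When that assumption is doubtful, passing `equal_var=False` runs Welch's t-test instead. The sketch below uses two small made-up samples (hypothetical values, not drawn from the adult data set) and recomputes the Welch statistic by hand to show what the function calculates.

```python
import numpy as np
from scipy import stats

# Hypothetical toy samples of unequal size.
group_a = np.array([23.0, 31.0, 35.0, 42.0, 29.0, 38.0])
group_b = np.array([30.0, 44.0, 39.0, 51.0, 47.0, 36.0, 55.0, 41.0])

# Student's t-test (default, equal_var=True) and Welch's t-test.
student = stats.ttest_ind(group_a, group_b)
welch = stats.ttest_ind(group_a, group_b, equal_var=False)

# Welch statistic by hand: difference of means divided by the
# unpooled standard error.
se = np.sqrt(group_a.var(ddof=1) / len(group_a) + group_b.var(ddof=1) / len(group_b))
t_manual = (group_a.mean() - group_b.mean()) / se

print(student.statistic, student.pvalue)
print(welch.statistic, welch.pvalue)
print(np.isclose(welch.statistic, t_manual))
```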
In [259]:
sns.histplot(female['Age'], kde=True)  # histogram with a kernel density estimate
Out[259]:
In [260]:
sns.histplot(male['Age'], kde=True)  # histogram with a kernel density estimate
Out[260]: