Example of a notebook that performs an analysis of data

This notebook describes, step by step, the analysis of the grades.

Set some options for the notebook. We can ignore these options for now.


In [1]:
%matplotlib inline

Import the tools needed

Import the tool called "pandas". "pandas" reads the grades into the notebook and performs the analysis on them.


In [2]:
import pandas

Set the maximum number of rows of a "pandas" object to be printed to the screen.


In [3]:
pandas.set_option( 'display.max_rows' , 10 )

Set the maximum number of decimals to print.


In [12]:
pandas.set_option( 'display.precision' , 2 )

Import the tool called "numpy", which we use to draw from a normal distribution.


In [5]:
import numpy

Import the tool called "matplotlib", which we use to plot data.


In [6]:
import matplotlib
import matplotlib.pyplot as plt

Set the style of the plots in the notebook.


In [7]:
matplotlib.style.use('ggplot')

Read the grades into the notebook

To do this we use the "pandas" tool that we imported in the previous section.
pandas.read_table is a function within the "pandas" tool that reads the grades from the grades.txt file into the notebook.
You can read the documentation of the pandas.read_table function at this web page: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_table.html#pandas.read_table


In [8]:
grades = pandas.read_table(
    filepath_or_buffer='grades.txt',
    sep=r'\s+',        # split columns on whitespace (replaces the deprecated delim_whitespace=True)
    header=None,       # the file has no header line
    index_col=False,
    names=['group 1', 'group 2'],
)
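To try the same call without the grades.txt file, we can read from a text held in memory instead. The sketch below is only an illustration: the three rows in `text` and the name `sample` are made up, and `io.StringIO` stands in for the file.

```python
import io
import pandas

# Hypothetical file contents: two whitespace-separated columns, no header line.
text = "1 12\n1 12\n2 14\n"

sample = pandas.read_table(
    io.StringIO(text),             # a file-like object used in place of a file path
    sep=r'\s+',                    # split columns on runs of whitespace
    header=None,                   # the data has no header line
    names=['group 1', 'group 2'],  # column names assigned by us
)
print(sample)
```

Each line of the text becomes one row, and the names passed in `names` become the column labels.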

The "grades" object in the notebook now contains the grades.
We can print the grades to the screen in the notebook, as we do below.


In [9]:
grades


Out[9]:
group 1 group 2
0 1 12
1 1 12
2 1 14
3 2 14
4 3 14
... ... ...
21 16 19
22 16 19
23 16 20
24 20 20
25 20 20

26 rows × 2 columns

In the printout above of the grades we see three columns of data.

  • The first on the left is the index of the grades.
  • The second from the left contains the actual grades for the first group.
  • The third from the left contains the actual grades for the second group.

The dots (...) in the middle of the printout stand for skipped rows: only 10 of the 26 rows are shown, because of the display.max_rows option set earlier.
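If we want to see the skipped rows, one possible approach is to lift the row limit for a single printout, or to look at only the first or last rows. The sketch below uses a made-up table named `frame` that has the same shape as the grades (26 rows, 2 columns) but not the same values.

```python
import pandas

# Hypothetical table with 26 rows; the values are not the real grades.
frame = pandas.DataFrame({'group 1': range(26), 'group 2': range(26)})

# With a limit of 10 rows, the middle rows are replaced by dots.
with pandas.option_context('display.max_rows', 10):
    truncated = repr(frame)

# With the limit set to None, every row is printed.
with pandas.option_context('display.max_rows', None):
    full = repr(frame)

first_rows = frame.head(3)   # the first three rows
last_rows = frame.tail(3)    # the last three rows
print(full)
```

`option_context` changes the option only inside the `with` block, so the setting made at the top of the notebook stays in place afterwards.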

Analysis of grades

We can now analyze the data just read into the notebook.

Mean of grades

We store the mean of each group in the mu object.
We then print the mean, which is displayed with two decimals.


In [13]:
mu = grades.mean()
mu


Out[13]:
group 1     9.85
group 2    16.77
dtype: float64
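To see what mean() computes, we can compare it with the sum of each column divided by the number of rows. This is a small sketch on made-up grades; `toy`, `by_hand` and `built_in` are names chosen only for this illustration.

```python
import pandas

# Hypothetical grades, not the contents of grades.txt.
toy = pandas.DataFrame({'group 1': [1, 2, 3, 6],
                        'group 2': [12, 14, 14, 20]})

# Mean of each column: sum of the column divided by the number of rows.
by_hand = toy.sum() / len(toy)

# The same quantity computed by pandas.
built_in = toy.mean()
print(built_in)
```

Both objects hold one value per column, which is why the printout of mu above has one line per group.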

Standard deviation of grades

We store the standard deviation of each group in the sd object.
We then print the standard deviation, which is displayed with two decimals.


In [14]:
sd = grades.std()
sd


Out[14]:
group 1    5.66
group 2    2.32
dtype: float64
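Note that std() in pandas divides by n - 1 by default (the sample standard deviation), while numpy.std divides by n by default. The sketch below shows the difference on made-up values; the names `toy`, `sample_sd`, `population_sd` and `numpy_sd` are only for illustration.

```python
import numpy
import pandas

# Hypothetical values with mean 5, not the contents of grades.txt.
toy = pandas.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

sample_sd = toy.std()                  # pandas default: ddof=1, divides by n - 1
population_sd = toy.std(ddof=0)        # divides by n
numpy_sd = numpy.std(toy.to_numpy())   # NumPy default: ddof=0, divides by n

print(sample_sd, population_sd, numpy_sd)
```

For a small table such as the grades the two conventions give visibly different numbers, so it helps to know which one is being reported.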

Histogram of grades

We plot the histogram of the grades.


In [17]:
grades.hist( bins = 5 )
plt.xlabel('Grade')
plt.ylabel('Frequency');


We can see from the histogram above that the grades do not appear to be normally distributed.
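One way to get a feeling for what a normal distribution would look like is to draw a reference sample with "numpy". The sketch below assumes the mean and standard deviation printed for group 1 (9.85 and 5.66); the sample size, the seed and the names `rng`, `reference`, `counts`, `edges` and `reference_skew` are our own choices for illustration.

```python
import numpy
import pandas

rng = numpy.random.default_rng(0)   # fixed seed so the draw is repeatable

# Normal sample with the mean and standard deviation printed for group 1.
reference = pandas.Series(rng.normal(loc=9.85, scale=5.66, size=10_000))

# Count the values in 5 bins, the same number of bins as the histogram of the grades.
counts, edges = numpy.histogram(reference, bins=5)

# Skewness is close to 0 for a symmetric, bell-shaped sample.
reference_skew = reference.skew()
print(counts, reference_skew)
```

Calling reference.hist( bins = 5 ) would draw this reference sample in the same way as the grades, which makes the two shapes easy to compare by eye.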

Boxplot of grades

We plot the boxplot of the grades.


In [19]:
grades.plot.box()
plt.ylabel('Grade')
plt.ylim(0, 21);


From the boxplot above, we see that there is asymmetry in the distribution.
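The asymmetry that a boxplot shows can also be read from the quartiles: the box runs from the first to the third quartile and the line inside it is the median. The sketch below uses made-up, right-skewed grades; `toy`, `lower_half` and `upper_half` are hypothetical names.

```python
import pandas

# Hypothetical right-skewed grades, not the contents of grades.txt.
toy = pandas.Series([1, 1, 2, 3, 8, 12, 20])

# First quartile, median and third quartile: the three lines of the box.
q1, median, q3 = toy.quantile([0.25, 0.5, 0.75])

lower_half = median - q1   # distance from the first quartile to the median
upper_half = q3 - median   # distance from the median to the third quartile

print(q1, median, q3)
print(lower_half, upper_half, toy.mean())
```

When the two distances differ, or when the mean lies away from the median, the distribution is not symmetric.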