Lab: Basic Statistics with Numpy


Objectives:

  • Experiment with some of the functions Numpy offers out of the box.
  • Compute summary statistics to get a first look at our data.

  • Average Vs Median -- 100xp, status : earned
  • Explore the Baseball data -- 100xp, status: earned
  • Blend it all together -- 100xp, status : earned

1. Average Vs Median

Preface: The baseball data is available as a 2D Numpy array with 3 columns (height, weight, age) and 1015 rows. The name of this Numpy array is np_baseball.

After restructuring the data, however, you notice that some height values are abnormally high.

Instructions:

  • Create a Numpy array np_height that is equal to the first column of np_baseball.
  • Print out the mean of np_height.
  • Print out the median of np_height.

In [35]:
"""Since the data is unavailable from DataCamp, let's create
some of our own"""

# Import numpy
import numpy as np

# np_baseball is unavailable, so let's generate some random data instead!

height = np.round( np.random.normal( 5.50, 5.0, 1015 ), 2 )

weight = np.round( np.random.normal( 70.50, 5.0, 1015 ), 2 )

age = np.round( np.random.normal( 31, 2, 1015 ), 2 )

# let's assign these values to np_baseball
np_baseball = np.column_stack( ( height, weight, age) )

# Create np_height from np_baseball

"""
Since the height column is at index 0 and all rows
are to be included, we slice with ":"
"""
   
np_height = np.array( np_baseball[ :, 0 ] )

# Print out the mean of np_height
print( np.mean( np_height ) )

# Print out the median of np_height
print( np.median( np_height ) )


5.36614778325
5.54

In [ ]:
"""
Inference:

+ An average height of about 5.37 inches sounds right, since the
  data was drawn from a normal distribution centered at 5.50.

+ Further, by descriptive statistics, the median is generally much
  less affected by outliers than the mean.
  However, there seem to be no outliers here.

+ Hence a median of 5.54 inches makes sense.

As a side note: always check both the median and the mean.

Why? Together they give insight into the overall distribution of the
entire dataset.

"""
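
To make the point about outliers concrete, here is a minimal sketch using a handful of made-up heights (these values are hypothetical and are not taken from np_baseball). One deliberately wrong entry is appended to show how differently the mean and the median react.

```python
import numpy as np

# Hypothetical heights, invented for illustration only.
clean = np.array([5.4, 5.5, 5.5, 5.6, 5.7])

# Append one deliberately abnormal value to act as the outlier.
with_outlier = np.append(clean, 55.0)

print("clean:        mean =", np.mean(clean), " median =", np.median(clean))
print("with outlier: mean =", np.mean(with_outlier), " median =", np.median(with_outlier))
```

With these numbers the mean jumps from about 5.54 to about 13.78, while the median only moves from 5.5 to 5.55. A large gap between the two is therefore a useful hint that something is off in the data.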

2. Explore the Baseball data:

Preface:

Because the mean and median are so far apart, you decide to complain to the MLB. They find the error and send the corrected data over to you. It's again available as a 2D Numpy array np_baseball, with three columns.

Instructions:

  • The code to print out the mean height is already included. Complete the code for the median height. Replace None with the correct code.
  • Use np.std() on the first column of np_baseball to calculate stddev.

    • Replace None with the correct code.
  • Do big players tend to be heavier?

    • Use np.corrcoef() to store the correlation between the first and second column of np_baseball in corr.

    • Replace the None with the correct code.
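
As a side note on np.std(): by default it computes the population standard deviation (ddof=0). If the sample standard deviation is wanted instead, ddof=1 can be passed. The sketch below uses a small made-up array (hypothetical values, not from np_baseball) to show the difference.

```python
import numpy as np

# Hypothetical values, chosen so the population standard deviation is a round number.
data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

population_std = np.std(data)        # default: divides by N (ddof=0)
sample_std = np.std(data, ddof=1)    # divides by N - 1

print("Population std:", population_std)
print("Sample std:", sample_std)
```

For 1015 rows the two values are nearly identical, so the default is fine for this lab.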


In [38]:
# np_baseball is available

# Import numpy
import numpy as np

# Print mean height (first column)
avg = np.mean(np_baseball[:,0])
print("Average: " + str(avg))

# Print median height. Replace 'None'
med = np.median( np_baseball[:, 0])
print("\nMedian: " + str(med))

# Print out the standard deviation on height. Replace 'None'
stddev = np.std( np_baseball[:, 0])
print("\nStandard Deviation: " + str(stddev))

# Print out correlation between first and second column. Replace 'None'
corr = np.corrcoef( np_baseball[:,0], np_baseball[:,1])
print("\nCorrelation: " + str(corr))


Average: 5.36614778325

Median: 5.54

Standard Deviation: 4.95899575065

Correlation: [[ 1.         -0.04160593]
 [-0.04160593  1.        ]]
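
A note on reading this output: np.corrcoef() returns a 2x2 correlation matrix, not a single number. The diagonal entries are always 1 (each column correlated with itself) and the off-diagonal entry is the coefficient we care about. Here it is about -0.04, close to zero, which is what one would expect since height and weight were generated independently of each other. A minimal sketch with made-up arrays (x and y are hypothetical) shows how to pull out the single value:

```python
import numpy as np

# Hypothetical values, chosen so that y rises exactly linearly with x.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

corr_matrix = np.corrcoef(x, y)   # 2x2 symmetric matrix
r = corr_matrix[0, 1]             # correlation between x and y

print(corr_matrix.shape)
print(r)
```

Because y is an exact increasing linear function of x, r comes out as 1 up to floating-point rounding.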

3. Blend it all together:


Preface:

You've contacted FIFA for some data and they handed you two lists. The lists are the following:

    positions = [ 'GK', 'M', 'A', 'D', ... ]
    heights = [ 191, 184, 185, 180, ... ]

Each element in the lists corresponds to a player.

  • The first list, positions, contains strings representing each player's position.

  • The possible positions are: 'GK' (goalkeeper), 'M' (midfield), 'A' (attack) and 'D' (defense).

  • The second list, heights, contains integers representing the height of the player in cm.

  • The first player in the lists is a goalkeeper and is pretty tall (191 cm).


Presumption:

You're fairly confident that the median height of goalkeepers is higher than that of other players on the soccer field.

Some of your friends don't believe you, so you are determined to show them using the data you received from FIFA and your newly acquired Python skills.


Instructions:

  • Convert heights and positions, which are regular lists, to numpy arrays. Call them np_heights and np_positions.

  • Extract all the heights of the goalkeepers. You can use a little trick here:

    • np_positions == 'GK' as an index for np_heights. Assign the result to gk_heights.
  • Extract all the heights of all the other players.

    • This time use np_positions != 'GK' as an index for np_heights.

    • Assign the result to other_heights.

  • Print out the median height of the goalkeepers using np.median().

    • Replace None with the correct code.
  • Do the same for the other players. Print out their median height. Replace None with the correct code.
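
The steps above can be sketched as follows. This is only an illustration: it uses just the first four entries shown in the preface, because the rest of the FIFA lists is elided, so the medians it prints say nothing about the full dataset. The name is_gk is a helper introduced here for readability.

```python
import numpy as np

# Only the first four entries shown in the preface; the full lists are not available.
positions = ['GK', 'M', 'A', 'D']
heights = [191, 184, 185, 180]

# Convert the regular lists to numpy arrays
np_positions = np.array(positions)
np_heights = np.array(heights)

# Comparing an array with a string gives a boolean array, one entry per player
is_gk = np_positions == 'GK'

# A boolean array used as an index keeps only the elements where it is True
gk_heights = np_heights[is_gk]
other_heights = np_heights[np_positions != 'GK']

print("Median height of goalkeepers: " + str(np.median(gk_heights)))
print("Median height of other players: " + str(np.median(other_heights)))
```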


In [47]:
# heights and positions are unavailable from DataCamp,
# so let's create some of our own.
import numpy as np

# let's represent "positions" with numbers
# 1:'GK', 2:'D', 3:'M', 4:'A'

# Note: these draws are continuous (rounded to 2 decimals), not the integer
# codes above, so very few values (possibly none) will equal exactly 1.
positions = np.round( np.random.normal( 3, 10, 1000), 2) 
heights = np.round( np.random.normal( 5.8, 4.0, 1000), 2) # in feet

# let's stack these as two columns
np_football = np.column_stack( (positions, heights) )
# Convert positions and heights to numpy arrays: np_positions, np_heights
np_positions = np.array( np_football[:, 0])
np_heights = np.array( np_football[:, 1])

# Heights of the goalkeepers: gk_heights
gk_heights = np_heights[ np_positions == 1 ]

# Heights of the other players: other_heights
other_heights = np_heights[ np_positions != 1 ]

# Print out the mean position in np_football
print("\nMean positions at which players play: " + str( np.mean( np_positions ) ) )

# Print out the median position in np_football
print("\nMedian positions at which player play: " + str( np.median( np_positions ) ) )

# Print out the median height of goalkeepers. Replace 'None'
print("\nMedian height of goalkeepers: " + str( np.median( gk_heights ) ) )

# Print out the median height of other players. Replace 'None'
print("\nMedian height of other players: " + str( np.median( other_heights ) ) )


Mean positions at which players play: 2.66724

Median positions at which player play: 3.005

Median height of goalkeepers: 6.18

Median height of other players: 5.62
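
One caveat about the cell above: the positions are drawn from a continuous normal distribution, so the test np_positions == 1 matches only the rare values that happen to round to exactly 1.0, and the goalkeeper median rests on very few players. A possible alternative, sketched below, draws integer position codes directly. The seed, the use of np.random.default_rng, and the height parameters are assumptions made for illustration and are not part of the original exercise.

```python
import numpy as np

# Fixed seed: an arbitrary choice so the sketch is repeatable
rng = np.random.default_rng(0)

# Integer position codes 1:'GK', 2:'D', 3:'M', 4:'A' (the upper bound 5 is exclusive)
np_positions = rng.integers(1, 5, size=1000)

# Synthetic heights; the mean and spread are illustrative assumptions
np_heights = np.round(rng.normal(5.8, 0.3, size=1000), 2)

# Boolean indexing now selects a sizeable group for each side
gk_heights = np_heights[np_positions == 1]
other_heights = np_heights[np_positions != 1]

print("Goalkeepers: " + str(gk_heights.size) + ", others: " + str(other_heights.size))
print("Median height of goalkeepers: " + str(np.median(gk_heights)))
print("Median height of other players: " + str(np.median(other_heights)))
```

Since both groups are drawn from the same distribution here, their medians should be close; any real difference would have to come from the actual FIFA data.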