This notebook runs some statistical calculations as a cheatsheet for students.

Using Numpy arrays

  • In Python, "array" usually refers to a numpy array (numpy.ndarray)
    • It can also refer to the standard library's array.array type, which is a thin wrapper over C arrays
  • Lists and tuples are the typical "pure Python" position-based data collections; sets are unordered and cannot be indexed by position
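
A minimal sketch of the collection types named above, to make the difference concrete. The variable names and sample values are illustrative assumptions and are not used elsewhere in this notebook.

```python
from array import array

import numpy as np

# Hypothetical sample scores used only for this illustration
raw = [63, 81, 73, 90]

as_list = list(raw)            # pure-Python list: can hold mixed types
as_c_array = array("i", raw)   # array.array: typed, thin wrapper over a C array
as_ndarray = np.array(raw)     # numpy ndarray: typed, supports vectorized math

# Of the three, only the ndarray supports element-wise arithmetic directly
curved = as_ndarray + 5

print(type(as_list).__name__, type(as_c_array).__name__, type(as_ndarray).__name__)
print(curved.tolist())
```

Adding 5 to the list or to the array.array raises a TypeError, which is one practical reason numpy arrays are preferred for calculations.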

Example

The example computes statistics on a person's scores in a class, since this is a standard and easily understood case.


In [29]:
import matplotlib.pyplot as plt
%matplotlib inline

In [13]:
import numpy as np

### Start by creating a numpy array of scores for calculations
# np.random.random(50) if floats wanted or np.random.random((50,3)) if a 50 x 3 matrix wanted
# We'll assume the person was passing and use a 60-100 spread, first over 50 and then 51 elements (to show both median cases)

# np.random.random_integers is deprecated; randint excludes the upper bound, hence 101 to allow a score of 100
scores = np.random.randint(60, 101, 51)
print(scores)
print(type(scores)) # This will show this is not a List but an ndarray


[ 63  81  73  63  89  90  80  90  64  81  64  65  91  79  70  96  76  64
  89  96  88  82  68  92  67  88  81  75  80  61  66  92  63  81  85  89
  85  87  71  90  61 100  84  96  75  95  95 100  62  82]
<class 'numpy.ndarray'>
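
Because the scores are random, every run prints different numbers. For repeatable results, a seeded generator can be used instead. This is a sketch only: the seed value 42 and the names `rng` and `demo_scores` are arbitrary choices for illustration.

```python
import numpy as np

# Seed chosen arbitrarily so the same scores appear on every run
rng = np.random.default_rng(42)

# integers() excludes the upper bound by default, so 101 allows a score of 100
demo_scores = rng.integers(60, 101, size=51)

print(demo_scores.shape)
print(demo_scores.min() >= 60, demo_scores.max() <= 100)
```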

In [14]:
### Now a few simple calculations: mean, highest score, and lowest score
# We'll need these values again so instead of having to recalculate them all the time we can use a dict to store
s_attribs = {"mean": np.mean(scores), "max": np.max(scores), "min": np.min(scores)}

print("Your average grade was {}, highest score was {}, lowest was {}".format(s_attribs['mean'], s_attribs['max'], s_attribs['min']))


Your average grade was 80.1, highest score was 100, lowest was 61

In [18]:
### Now to find the median value - first the normal way then the numpy way
# Note one can use scores.shape, a tuple of the array's dimensions ((n,) for a 1-D array), or len() to count the elements
# This will use shape just to show that method as len() is used in many other examples

elements = scores.shape[0]
scores.sort() # This sorts the array in place - use np.sort(scores) for a sorted copy that leaves the original untouched

if elements % 2 == 0:
    # Even count: there is no single middle element, so average the two middle values
    print("The median was {}".format((scores[(elements // 2) - 1] + scores[elements // 2]) / 2))
else:
    # Odd count: the middle element is the median
    print("The median was {}".format(scores[elements // 2]))
    
#or we could just do this without even needing to sort
# and add this to the dictionary
s_attribs["median"] = np.median(scores)
print("The median was {}".format(s_attribs["median"]))


The median was 81
The median was 81.0
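
To check the even and odd cases side by side, the manual calculation can be wrapped in a small function and compared with `np.median`. This is a sketch under assumptions: `manual_median` and the two sample arrays are hypothetical names and data introduced only for this comparison.

```python
import numpy as np

def manual_median(values):
    """Return the median of a 1-D array without modifying it."""
    ordered = np.sort(values)   # np.sort returns a sorted copy
    n = ordered.shape[0]
    mid = n // 2
    if n % 2 == 0:
        # Even count: average the two middle values
        return (ordered[mid - 1] + ordered[mid]) / 2
    # Odd count: the middle value is the median
    return ordered[mid]

odd_sample = np.array([70, 90, 80])          # hypothetical data, 3 elements
even_sample = np.array([70, 90, 80, 100])    # hypothetical data, 4 elements

print(manual_median(odd_sample), np.median(odd_sample))
print(manual_median(even_sample), np.median(even_sample))
```

For the odd sample the manual result is the integer 80 while `np.median` returns the float 80.0, which mirrors the two lines of output above.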

In [17]:
from collections import Counter # for the next section

In [28]:
### Now to get some counts
# We'll be building a counter object, so first let's try the quick way to find the mode (or most frequent value)

## s_attribs["mode"] = np.mode(scores) <- raises an AttributeError
## print("Most frequent grade was {}".format(s_attribs["mode"]))

### Well that didn't work: numpy doesn't have a mode function (scipy.stats does, FYI)
## Then again, we'll need some frequency counts for other reasons anyway, so it's time to learn to count

score_count = Counter(scores)

print(score_count)
print("The mode is {}".format(score_count.most_common(1)[0][0])) 
# most_common(1) returns a list holding one (value, count) tuple, hence the [0][0]
# Inherent problem with this method: when several values share the top count, only one of them is reported


Counter({81: 4, 64: 3, 89: 3, 90: 3, 96: 3, 63: 3, 75: 2, 80: 2, 82: 2, 85: 2, 88: 2, 92: 2, 95: 2, 100: 2, 61: 2, 65: 1, 66: 1, 67: 1, 68: 1, 70: 1, 71: 1, 73: 1, 76: 1, 79: 1, 84: 1, 87: 1, 91: 1, 62: 1})
The mode is 81
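
The tie problem with `most_common(1)` can be seen with a small made-up data set. The sketch below collects every value that reaches the highest count; `tied_scores` and the other names are hypothetical and exist only for this illustration.

```python
from collections import Counter

# Hypothetical scores where 81 and 90 both appear three times
tied_scores = [81, 90, 81, 90, 75, 81, 90]

tied_counts = Counter(tied_scores)
top_count = max(tied_counts.values())

# Collect every value that reaches the highest count, not just one of them
all_modes = sorted(value for value, count in tied_counts.items() if count == top_count)

print(tied_counts.most_common(1))   # reports only one of the tied values
print(all_modes)
```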

In [27]:
### Okay just because it annoys me - I really just want the counts - a better way to get the mode is
from scipy.stats import mode

print(mode(scores)) #but we'll get to that in a later lesson


ModeResult(mode=array([81]), count=array([4]))
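
If scipy is not available, the mode can also be found with numpy alone. This is a sketch using `np.unique` with `return_counts=True`; the `sample` array is hypothetical data, and on a tie this approach reports the smallest of the tied values because `np.unique` returns its values sorted.

```python
import numpy as np

# Hypothetical scores; 81 is the most frequent value
sample = np.array([81, 64, 81, 90, 64, 81])

# Sorted unique values and how often each one occurs
values, freqs = np.unique(sample, return_counts=True)

# argmax returns the position of the first maximum count
numpy_mode = values[np.argmax(freqs)]

print(values.tolist(), freqs.tolist())
print(numpy_mode)
```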

In [41]:
### Well, we have numbers and frequencies, so maybe we should do a quick histogram

# First create lists of our labels and indexes
lbls, counts = zip(*score_count.items())
idx = np.arange(len(lbls))

### This will take a bit of explaining: the splat (*) operator unpacks the (label, count) pairs so zip can regroup them into labels and counts
##  The length of labels (or score_count) is needed because it provides the bar positions (indices)

plt.figure(figsize=(12,12)) # Change the graph size (play with this as you want)
plt.bar(idx, counts, 1, align='center') # 1 = width; align='center' centers each bar on its index in any matplotlib version
plt.xticks(idx, lbls) # Tick labels (hence why we called it lbls), placed at the bar centers


Out[41]:
([<28 matplotlib.axis.XTick objects>], <a list of 28 Text xticklabel objects>)
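
The `zip(*...)` unpacking used for the labels and counts can be tried on its own with a tiny Counter. The sketch below also sorts the pairs first so the labels come out in ascending order, since a Counter keeps the order in which values were first seen. `demo_count`, `sorted_lbls`, and `sorted_counts` are hypothetical names used only here.

```python
from collections import Counter

# Hypothetical counts standing in for score_count
demo_count = Counter([90, 81, 81, 64, 90, 81])

# sorted() orders the (score, count) pairs by score;
# zip(*pairs) then regroups them into one tuple of scores and one tuple of counts
sorted_lbls, sorted_counts = zip(*sorted(demo_count.items()))

print(sorted_lbls)    # (64, 81, 90)
print(sorted_counts)  # (1, 3, 2)
```

Passing tuples built this way to the bar chart would put the scores in ascending order along the x axis.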