Summary Statistics


A typical first step in analyzing your data is getting to know it:

  1. Get to know your data.

    • For small NumPy arrays, this is straightforward: we can simply look at the values.
  2. However, data analysts and data scientists crunch millions or billions of numbers.

    • If we have little data, we can simply look at it, interpret it, and infer insights from it.

    • For big data, this quickly becomes tedious.

  3. Instead, we can generate summary statistics.

  4. NumPy, being built around an efficient data structure, is good at number crunching.


Basic NumPy Statistical Functions

  • NumPy offers various statistical functions, such as:

    • The mean() function, invoked as np.mean( < numpy_array > ). It calculates the "average" of a population, or of a sample of a population.

    • The median() function, invoked as np.median( < numpy_array > ). It calculates the "middle value" of a population, or of a sample of a population.

    • The corrcoef() function, invoked as np.corrcoef( < numpy_array1 >, < numpy_array2 > ). It calculates the "correlation" between two NumPy arrays.

    • The std() function for the standard deviation, invoked as np.std( < numpy_array > ). It calculates the spread of the data around the mean. By default it computes the population standard deviation; pass ddof=1 for the sample standard deviation.
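The functions listed above can be tried on a small array. The following is a minimal sketch; the height and weight values are hypothetical and chosen only for illustration. Note that the correlation function is spelled np.corrcoef and returns a matrix rather than a single number.

```python
import numpy as np

# Hypothetical sample data: heights (m) and weights (kg) of five people.
heights = np.array([1.60, 1.70, 1.75, 1.80, 1.90])
weights = np.array([55.0, 65.0, 70.0, 80.0, 90.0])

mean_height = np.mean(heights)      # arithmetic average
median_height = np.median(heights)  # middle value of the sorted data

# np.std uses ddof=0 by default (population standard deviation);
# ddof=1 gives the sample standard deviation instead.
std_height = np.std(heights)
sample_std_height = np.std(heights, ddof=1)

# np.corrcoef returns a 2x2 correlation matrix; entry [0, 1] is the
# correlation between the two input arrays.
corr = np.corrcoef(heights, weights)[0, 1]

print(mean_height, median_height, std_height, sample_std_height, corr)
```

Since the sample version divides by n - 1 instead of n, it is always slightly larger than the population version for the same data.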


Purpose of Summary Statistics

One purpose of summary statistics is to perform a "sanity check" of the data.

  • e.g. If we found the average weight of people to be 1000 kg, we could infer that our measurements are most likely incorrect.
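A sanity check of this kind might look like the sketch below. The weights and the plausibility limit are assumptions made up for this illustration, not values from any real data set.

```python
import numpy as np

# Hypothetical weights in kg; the last value looks like a data-entry
# error and drags the average up.
weights_kg = np.array([62.0, 75.5, 81.0, 58.5, 4723.0])

mean_weight = np.mean(weights_kg)
median_weight = np.median(weights_kg)

# Assumed upper limit for a plausible average weight, chosen only for
# this example.
MAX_PLAUSIBLE_MEAN_KG = 200.0
looks_suspicious = mean_weight > MAX_PLAUSIBLE_MEAN_KG

print(mean_weight, median_weight, looks_suspicious)
```

Here the mean is far above the limit while the median stays in a normal range, which hints that a few extreme values, rather than the whole data set, are at fault.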

More basic functions

Just as in the basic Python distribution, the following functions are also available for NumPy arrays:

  • The sum() function.

  • The sort() function.

However, there is a key difference: since NumPy enforces a single data type within an array, it can significantly speed up the calculations.

In short, use np.sum() and np.sort().
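As a small illustration of np.sum() and np.sort() next to their built-in counterparts, consider the sketch below. The values are arbitrary, and it does not measure speed.

```python
import numpy as np

values = np.array([5, 2, 9, 1, 7])

total = np.sum(values)            # sum of all elements
sorted_values = np.sort(values)   # returns a sorted copy; values is unchanged

# The built-in equivalents on a plain Python list, for comparison.
values_list = [5, 2, 9, 1, 7]
total_builtin = sum(values_list)
sorted_builtin = sorted(values_list)

print(total, sorted_values)
print(total_builtin, sorted_builtin)
```

Both approaches give the same results here; the difference lies in how the work is carried out on large arrays.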


Simulating data with NumPy functions

For generating random numbers and distributions, we can use:

  • np.round( np.random.normal( < distribution mean >, < distribution standard deviation >, < number of samples > ), < number of decimals > )

    • This creates a single distribution.

    • We can do this to create two or more distributions and store them in variables.

Then combine them as "columns" of a single 2D array by using:

  • np.column_stack( ( < distribution 1 >, < distribution 2 >, < distribution n > ) )
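Put together, the simulation might look like the sketch below. The means, standard deviations, sample count, and seed are hypothetical values picked for this example. Note that np.column_stack expects the arrays wrapped in a single tuple (or list), not passed as separate arguments.

```python
import numpy as np

# Seed the global generator so the run is reproducible (arbitrary seed).
np.random.seed(0)

# Hypothetical parameters: 1000 heights (m) and 1000 weights (kg),
# each rounded to 2 decimals.
heights = np.round(np.random.normal(1.75, 0.20, 1000), 2)
weights = np.round(np.random.normal(60.32, 15, 1000), 2)

# Pair the two 1D arrays as the columns of one 2D array.
people = np.column_stack((heights, weights))

print(people.shape)  # (1000, 2)
```

Each row of the resulting array then describes one simulated person, with the height in column 0 and the weight in column 1.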

Exercise: Basic Statistics with NumPy


RQ1: Which of the following statements about basic statistics with NumPy is correct?

Assume the NumPy package is imported as np.

Ans: NumPy offers many functions to calculate basic statistics, such as np.mean(), np.median() and np.std().


RQ2: You are writing code to measure your travel time to work and the weather conditions each day.

  • The data is recorded in a NumPy array where each row specifies the measurements for a single day.
  • The first column specifies the temperature in Fahrenheit. The second column specifies the travel time in minutes.

Following is a sample code:

import numpy as np
x = np.array([[28, 18], [34, 14], [32, 16], ... [26, 23], [23, 17]])

Which Python command do you use to calculate the average travel time?

Ans: Steps to solve this problem:

  • Since the first column specifies the "temperature", its index is 0.
  • Consequently, the second column's index is 1.
  • The question doesn't explicitly say to take every row of the second column, but it gives a hint: we want the average travel time, which means calculating the mean over the entire second column.

Hence, in pseudocode form, the code is:

np.mean( x[ < for all the rows >, < in 2nd column > ] )

i.e. np.mean( x[:, 1] )
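To see the answer in action, the sketch below applies it to the five rows that are visible in the question. The rows hidden behind "..." are not reconstructed, so the result holds only for this reduced array.

```python
import numpy as np

# Only the five rows shown in the question; the elided rows are left out.
x = np.array([[28, 18],
              [34, 14],
              [32, 16],
              [26, 23],
              [23, 17]])

travel_times = x[:, 1]                   # all rows, second column
avg_travel_time = np.mean(travel_times)

print(avg_travel_time)  # 17.6 for these five rows
```

The slice x[:, 0] would select the temperatures in the same way.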


RQ3: As a wrap-up, have a look at the statements below about NumPy in general. Select the three statements that hold.

Ans:

  • NumPy is a great alternative to the regular Python list if you want to do Data Science in Python.

  • NumPy arrays can only hold elements of the same basic type.

  • Next to an efficient data structure, NumPy also offers tools to calculate summary statistics and to simulate statistical distributions.