A typical first step in analyzing your data is getting to know it.
For small NumPy arrays this is easy: with only a few values, we can simply look at the data, interpret it, and infer insights from it.
However, data analysts and data scientists crunch millions or even billions of numbers, and for data of that size inspecting values by hand quickly becomes tedious.
Instead, we can generate summary statistics.
Because the NumPy array is an efficient data structure, NumPy is well suited to this kind of number crunching.
Basic NumPy Statistical Functions
NumPy offers various statistical functions, such as:
np.mean( <numpy_array> ) calculates the "average" of a population, or of a sample of a population.
np.median( <numpy_array> ) calculates the "middle value" of a population, or of a sample of a population.
np.corrcoef( <numpy_array1>, <numpy_array2> ) calculates the correlation between two NumPy arrays and returns it as a correlation matrix.
np.std( <numpy_array> ) calculates the standard deviation, i.e. the spread of the data around the mean. By default it returns the population standard deviation; pass ddof=1 to get the sample standard deviation.
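As a minimal sketch of the four functions listed above, the snippet below applies them to two small arrays. The `height` and `weight` values are made up purely for illustration and do not come from any real dataset.

```python
import numpy as np

# Hypothetical sample data: heights in metres and weights in kilograms.
height = np.array([1.60, 1.70, 1.80, 1.90, 2.00])
weight = np.array([55.0, 65.0, 72.0, 85.0, 98.0])

mean_height = np.mean(height)        # arithmetic average
median_height = np.median(height)    # middle value of the sorted data

# np.corrcoef returns a 2x2 correlation matrix; the off-diagonal
# entry is the correlation between height and weight.
corr_matrix = np.corrcoef(height, weight)
corr_hw = corr_matrix[0, 1]

std_population = np.std(height)          # default ddof=0: population std
std_sample = np.std(height, ddof=1)      # ddof=1: sample std

print(mean_height, median_height, corr_hw, std_population, std_sample)
```

Note that `np.corrcoef` returns a matrix rather than a single number, so the value of interest has to be picked out by index.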
Purpose of Summary Statistics
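The purpose of summary statistics is to perform a "sanity check" of the data. For example, an average that lies far outside the plausible range hints at a measurement or data-entry error.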
And more basic functions
Just as in basic Python, summing and sorting are also available for NumPy arrays, through:
np.sum()
np.sort()
However, there is a key difference: because NumPy enforces a single data type within an array, these calculations can run significantly faster than their counterparts on regular Python lists.
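The following sketch puts the plain-Python and NumPy versions side by side. The `values` list is a made-up example, and `sorted()` is used here as the plain-Python counterpart of sorting.

```python
import numpy as np

values = [3, 1, 4, 1, 5, 9, 2, 6]   # hypothetical example data
arr = np.array(values)

# Plain Python, operating on a list
list_total = sum(values)
list_sorted = sorted(values)

# NumPy, operating on an array
array_total = np.sum(arr)
array_sorted = np.sort(arr)   # returns a sorted copy; arr itself is unchanged

print(list_total, array_total)
print(list_sorted, array_sorted)
```

Both versions give the same results on this input; the speed advantage of NumPy only becomes noticeable on much larger arrays.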
Simulating data with NumPy functions
To generate random numbers drawn from a normal distribution, we can use:
np.round( np.random.normal( <distribution mean>, <distribution standard deviation>, <number of samples> ), <number of decimals> )
This creates a single distribution.
We can repeat this to create two or more distributions and store them in variables.
Then combine them as "columns" of a single 2D array by using:
np.column_stack( ( <distribution 1>, <distribution 2>, <distribution n> ) )
Note that np.column_stack() expects the arrays to be passed together as one tuple.
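A minimal sketch of the simulation steps above might look like this. The means, standard deviations, sample count, variable names, and the fixed seed are all assumptions chosen for illustration.

```python
import numpy as np

np.random.seed(123)  # assumption: fixed seed so the run is repeatable

# Hypothetical parameters: 5000 heights (metres) and 5000 weights (kg),
# each drawn from a normal distribution and rounded to 2 decimals.
height = np.round(np.random.normal(1.75, 0.20, 5000), 2)
weight = np.round(np.random.normal(60.32, 15, 5000), 2)

# Stack the two 1D arrays as columns of one 2D array.
# The arrays are passed as a single tuple.
np_city = np.column_stack((height, weight))

print(np_city.shape)  # one row per sample, one column per distribution
```

With the data in this shape, a summary statistic for one variable is a matter of selecting its column, for example `np.mean(np_city[:, 0])` for the heights.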
RQ1: Which of the following statements about basic statistics with NumPy is correct?
Assume the NumPy package is imported as np.
Ans: NumPy offers many functions to calculate basic statistics, such as np.mean(), np.median() and np.std().
RQ2: You are writing code to record your travel time to work and the weather conditions each day.
The following is a sample of the code:
import numpy as np
x = np.array([[28, 18],
[34, 14],
[32, 16],
...
[26, 23],
[23, 17]])
Which Python command do you use to calculate the average travel time?
Ans: The answer takes the travel times from the second column of x, so select every row of that column and pass the result to np.mean().
Hence the code, in pseudo form, is:
np.mean( x[ <all the rows>, <2nd column> ] )
i.e. np.mean( x[:, 1] )
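To see the indexing at work, here is a small sketch that uses only the five rows visible in the sample. The elided rows are unknown, so this shortened array is a hypothetical stand-in and its result will differ from that of the full data.

```python
import numpy as np

# Hypothetical stand-in: only the rows shown in the sample above.
x = np.array([[28, 18],
              [34, 14],
              [32, 16],
              [26, 23],
              [23, 17]])

travel_time = x[:, 1]            # all rows, second column (index 1)
average_travel_time = np.mean(travel_time)

print(travel_time)
print(average_travel_time)
```

The `:` selects every row, and the `1` selects the second column because NumPy indices start at 0.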
RQ3: As a wrap-up, have a look at the statements below about NumPy in general. Select the three statements that hold.
Ans:
NumPy is a great alternative to the regular Python list if you want to do data science in Python.
NumPy arrays can only hold elements of the same basic type.
Besides an efficient data structure, NumPy also offers tools to calculate summary statistics and to simulate statistical distributions.
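The statement about a single basic type can be checked with a short sketch. The arrays below are made-up examples; when the inputs are mixed, NumPy converts them to one common type.

```python
import numpy as np

# Integers mixed with a float: everything becomes a float.
ints_and_floats = np.array([1, 2.5, 3])

# Booleans mixed with a float: True and False become 1.0 and 0.0.
with_bool = np.array([1.5, True, False])

# Numbers mixed with text: everything becomes a string.
with_text = np.array([1, "two", 3.0])

print(ints_and_floats.dtype, with_bool.dtype, with_text.dtype)
```

This silent conversion is worth keeping in mind, since numbers that end up stored as strings can no longer be used with functions such as `np.mean()`.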