Vectors

Create a vector:


In [13]:
import numpy as np

vector = np.array([1, 2, 3, 4, 5])
vector


Out[13]:
array([1, 2, 3, 4, 5])

Convert it into a column vector:


In [14]:
vector.reshape((5, 1))


Out[14]:
array([[1],
       [2],
       [3],
       [4],
       [5]])

Convert it into a row vector:


In [15]:
vector.reshape((1, 5))


Out[15]:
array([[1, 2, 3, 4, 5]])
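As a side note (not in the original cells), `reshape` accepts `-1` for one dimension and lets NumPy infer it from the array's size, which avoids hard-coding the length. A minimal sketch:

```python
import numpy as np

vector = np.array([1, 2, 3, 4, 5])

# -1 asks NumPy to infer that dimension from the total number of elements
col = vector.reshape(-1, 1)  # shape (5, 1): column vector
row = vector.reshape(1, -1)  # shape (1, 5): row vector
```

This idiom is common when preparing a single feature for scikit-learn, which expects 2-D input.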

A vector can also be treated as a single-feature matrix, with one observation per row and a single feature column:


In [16]:
vector.reshape((5, 1))


Out[16]:
array([[1],
       [2],
       [3],
       [4],
       [5]])

Multiple Feature Matrices

Create a multiple feature matrix:


In [17]:
multiple_feature_matrix = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11, 12, 13, 14, 15]])
multiple_feature_matrix


Out[17]:
array([[ 1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10],
       [11, 12, 13, 14, 15]])
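In such a matrix, the convention is one observation per row and one feature per column. A quick sketch (restating the matrix defined above) of how to inspect it:

```python
import numpy as np

multiple_feature_matrix = np.array([[1, 2, 3, 4, 5],
                                    [6, 7, 8, 9, 10],
                                    [11, 12, 13, 14, 15]])

shape = multiple_feature_matrix.shape          # (3, 5): 3 observations, 5 features
first_feature = multiple_feature_matrix[:, 0]  # column 0 across all observations
```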

Load datasets for demonstration


In [18]:
from sklearn.datasets import fetch_california_housing
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2;
# this cell requires an older version (fetch_openml offers one alternative source).
from sklearn.datasets import load_boston

boston = load_boston()
california = fetch_california_housing()

Load modules for demonstration


In [19]:
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.mlab as mlab
import matplotlib as mpl
%matplotlib inline

In [24]:
dataset = pd.DataFrame(boston.data, columns=boston.feature_names)
dataset['target'] = boston.target

The Probability Density Function (PDF)


In [21]:
from scipy.stats import norm  # mlab.normpdf was removed from matplotlib

x = np.linspace(-4, 4, 100)
for mean, std in [(0, 0.7), (0, 1), (1, 1.5), (-2, 0.5)]:
    plt.plot(x, norm.pdf(x, mean, std))


The Mean


In [25]:
dataset.head()


Out[25]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT target
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33 36.2

In [27]:
mean_expected_value = dataset['target'].mean()
mean_expected_value


Out[27]:
22.532806324110698

The mean can also be calculated using a NumPy function:


In [28]:
np.mean(dataset['target'])


Out[28]:
22.532806324110698
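As a sanity check, the mean is simply the sum of the values divided by their count. A sketch using the first five target values from the table above (illustrative, not the full dataset):

```python
import numpy as np

values = np.array([24.0, 21.6, 34.7, 33.4, 36.2])  # first five target values
manual_mean = values.sum() / len(values)           # sum divided by count
```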

Sum of Squared Errors (SSE)


In [29]:
Squared_errors = pd.Series(mean_expected_value - dataset['target'])**2
SSE = np.sum(Squared_errors)
print('Sum of Squared Errors (SSE): {0}'.format(SSE))


Sum of Squared Errors (SSE): 42716.29541501979
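A useful identity worth noting: the SSE about the mean equals the number of observations times the population variance. A small sketch with illustrative values:

```python
import numpy as np

y = np.array([24.0, 21.6, 34.7, 33.4, 36.2])
sse = np.sum((y - y.mean()) ** 2)

# SSE about the mean equals n times the (population) variance
check = len(y) * np.var(y)
```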

In [30]:
density_plot = Squared_errors.plot(kind='hist')


Most of the squared errors in the dataset are close to zero.

Correlation

Correlation is a measure of how two variables relate to each other (how much and in what direction?).

Z-score standardization involves subtracting the mean from each score and then dividing by the standard deviation. The resulting variable has a mean of 0 and a standard deviation of 1.

The standardization formula is:
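In symbols, with mean $\mu$ and standard deviation $\sigma$:

$$z = \frac{x - \mu}{\sigma}$$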


In [37]:
def standardize(x):
    return (x - np.mean(x)) / np.std(x)

In [44]:
standardize(dataset['target']).head()


Out[44]:
0    0.159686
1   -0.101524
2    1.324247
3    1.182758
4    1.487503
Name: target, dtype: float64
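A quick check, reusing the `standardize` definition on illustrative values, that a standardized variable really has mean 0 and standard deviation 1:

```python
import numpy as np

def standardize(x):
    return (x - np.mean(x)) / np.std(x)

z = standardize(np.array([24.0, 21.6, 34.7, 33.4, 36.2]))
# z.mean() is (numerically) 0 and z.std() is 1
```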

Calculate the correlation by summing the products of the standardized values and dividing by the number of observations. This yields a number between -1 and 1, with the sign indicating the direction of the relationship (positive: both variables grow together; negative: one grows while the other shrinks).

Covariance:
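In population form (dividing by $n$, as the code below does with `bias=0`), for variables $X$ and $Y$ with means $\mu_x$ and $\mu_y$:

$$\mathrm{cov}(X, Y) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu_x)(y_i - \mu_y)$$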

Pearson's correlation:
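Pearson's $r$ is the covariance divided by the product of the standard deviations, which is equivalent to the covariance of the standardized variables (the form the code below uses):

$$r = \frac{\mathrm{cov}(X, Y)}{\sigma_x \, \sigma_y}$$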


In [47]:
def covariance(var1, var2, bias=0):
    observations = float(len(var1))
    return np.sum((var1 - np.mean(var1)) * (var2 - np.mean(var2))) / (observations - min(bias, 1))

def standardize(var):
    return (var - np.mean(var)) / np.std(var)

def correlation(var1, var2, bias=0):
    return covariance(standardize(var1), standardize(var2), bias)

print('Our correlation estimate: {0}'.format(correlation(dataset['RM'], dataset['target'])))


Our correlation estimate: 0.6953599470715393

In [48]:
from scipy.stats import pearsonr

print('Correlation from Scipy pearsonr: {0}'.format(pearsonr(dataset['RM'], dataset['target'])[0]))


Correlation from Scipy pearsonr: 0.6953599470715393

Visually, the stronger the correlation, the more closely the points of a scatterplot approach a straight line. Correlation is a measure of linear association, i.e., how close your data is to a straight line.


In [49]:
x_range = [dataset['RM'].min(), dataset['RM'].max()]
y_range = [dataset['target'].min(), dataset['target'].max()]
scatterplot = dataset.plot(kind='scatter', x='RM', y='target', xlim=x_range, ylim=y_range)
mean_y = scatterplot.plot(x_range, [dataset['target'].mean(), dataset['target'].mean()], '--', color='red', linewidth=1)
mean_x = scatterplot.plot([dataset['RM'].mean(), dataset['RM'].mean()], y_range, '--', color='red', linewidth=1)
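To make that linear association visible, one option (a sketch using `np.polyfit` on hypothetical values standing in for RM and target, not the actual Boston columns) is to compute a least-squares line to overlay on the scatterplot:

```python
import numpy as np

# Hypothetical x/y values standing in for RM and target
x = np.array([5.0, 6.0, 6.5, 7.0, 8.0])
y = np.array([15.0, 20.0, 23.0, 27.0, 35.0])

slope, intercept = np.polyfit(x, y, 1)  # degree-1 least-squares fit
fitted = slope * x + intercept          # points on the fitted line
```

On the real data, the line could then be drawn with `scatterplot.plot(x, fitted, ...)` on top of the scatter. Note that a least-squares line always passes through the point of means $(\bar{x}, \bar{y})$.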


