Vectors

Create a vector:


In [13]:
import numpy as np

vector = np.array([1, 2, 3, 4, 5])
vector


Out[13]:
array([1, 2, 3, 4, 5])

Convert it into a column vector:


In [14]:
vector.reshape((5, 1))


Out[14]:
array([[1],
       [2],
       [3],
       [4],
       [5]])

Convert it into a row vector:


In [15]:
vector.reshape((1, 5))


Out[15]:
array([[1, 2, 3, 4, 5]])
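As a side note (not in the original cells), `reshape` accepts `-1` for one dimension and lets NumPy infer it from the array's size, which avoids hard-coding the length. A minimal sketch:

```python
import numpy as np

vector = np.array([1, 2, 3, 4, 5])

# -1 asks NumPy to infer that dimension from the total number of elements
col = vector.reshape(-1, 1)  # shape (5, 1): column vector
row = vector.reshape(1, -1)  # shape (1, 5): row vector
```

This idiom is common when preparing a single feature for scikit-learn, which expects 2-D input.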

A vector can also be treated as a single-feature matrix, with one observation per row and a single feature column:


In [16]:
vector.reshape((5, 1))


Out[16]:
array([[1],
       [2],
       [3],
       [4],
       [5]])

Multiple Feature Matrices

Create a multiple feature matrix:


In [17]:
multiple_feature_matrix = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11, 12, 13, 14, 15]])
multiple_feature_matrix


Out[17]:
array([[ 1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10],
       [11, 12, 13, 14, 15]])
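In such a matrix, the convention is one observation per row and one feature per column. A quick sketch (restating the matrix defined above) of how to inspect it:

```python
import numpy as np

multiple_feature_matrix = np.array([[1, 2, 3, 4, 5],
                                    [6, 7, 8, 9, 10],
                                    [11, 12, 13, 14, 15]])

shape = multiple_feature_matrix.shape          # (3, 5): 3 observations, 5 features
first_feature = multiple_feature_matrix[:, 0]  # column 0 across all observations
```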

Load datasets for demonstration


In [18]:
from sklearn.datasets import fetch_california_housing
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2;
# this cell requires an older version (fetch_openml offers one alternative source).
from sklearn.datasets import load_boston

boston = load_boston()
california = fetch_california_housing()

Load modules for demonstration


In [19]:
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.mlab as mlab
import matplotlib as mpl
%matplotlib inline

In [24]:
dataset = pd.DataFrame(boston.data, columns=boston.feature_names)
dataset['target'] = boston.target

The Probability Density Function (PDF)


In [21]:
from scipy.stats import norm  # mlab.normpdf was removed from matplotlib

x = np.linspace(-4, 4, 100)
for mean, std in [(0, 0.7), (0, 1), (1, 1.5), (-2, 0.5)]:
    plt.plot(x, norm.pdf(x, mean, std))


The Mean


In [25]:
dataset.head()


Out[25]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT target
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33 36.2

In [27]:
mean_expected_value = dataset['target'].mean()
mean_expected_value


Out[27]:
22.532806324110698

The mean can also be calculated using a NumPy function:


In [28]:
np.mean(dataset['target'])


Out[28]:
22.532806324110698
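As a sanity check, the mean is simply the sum of the values divided by their count. A sketch using the first five target values from the table above (illustrative, not the full dataset):

```python
import numpy as np

values = np.array([24.0, 21.6, 34.7, 33.4, 36.2])  # first five target values
manual_mean = values.sum() / len(values)           # sum divided by count
```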

Sum of Squared Errors (SSE)


In [29]:
Squared_errors = pd.Series(mean_expected_value - dataset['target'])**2
SSE = np.sum(Squared_errors)
print('Sum of Squared Errors (SSE): {0}'.format(SSE))


Sum of Squared Errors (SSE): 42716.29541501979
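A useful identity worth noting: the SSE about the mean equals the number of observations times the population variance. A small sketch with illustrative values:

```python
import numpy as np

y = np.array([24.0, 21.6, 34.7, 33.4, 36.2])
sse = np.sum((y - y.mean()) ** 2)

# SSE about the mean equals n times the (population) variance
check = len(y) * np.var(y)
```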

In [30]:
density_plot = Squared_errors.plot(kind='hist')


Most of the squared errors in the dataset are close to zero.

Correlation

Correlation is a measure of how two variables relate to each other (how much and in what direction?).

Z-score standardization involves subtracting the mean from each score and then dividing by the standard deviation. The resulting variable has a mean of 0 and a standard deviation of 1.

The standardization formula is:
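In symbols, with mean $\mu$ and standard deviation $\sigma$:

$$z = \frac{x - \mu}{\sigma}$$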


In [37]:
def standardize(x):
    return (x - np.mean(x)) / np.std(x)

In [44]:
standardize(dataset['target']).head()


Out[44]:
0    0.159686
1   -0.101524
2    1.324247
3    1.182758
4    1.487503
Name: target, dtype: float64
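A quick check, reusing the `standardize` definition on illustrative values, that a standardized variable really has mean 0 and standard deviation 1:

```python
import numpy as np

def standardize(x):
    return (x - np.mean(x)) / np.std(x)

z = standardize(np.array([24.0, 21.6, 34.7, 33.4, 36.2]))
# z.mean() is (numerically) 0 and z.std() is 1
```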

Calculate the correlation by summing the products of the standardized values and dividing by the number of observations. This yields a number between -1 and 1, with the sign indicating the direction of the relationship (positive: both variables grow together; negative: one grows while the other shrinks).

Covariance:
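In population form (dividing by $n$, as the code below does with `bias=0`), for variables $X$ and $Y$ with means $\mu_x$ and $\mu_y$:

$$\mathrm{cov}(X, Y) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu_x)(y_i - \mu_y)$$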

Pearson's correlation:
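Pearson's $r$ is the covariance divided by the product of the standard deviations, which is equivalent to the covariance of the standardized variables (the form the code below uses):

$$r = \frac{\mathrm{cov}(X, Y)}{\sigma_x \, \sigma_y}$$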


In [47]:
def covariance(var1, var2, bias=0):
    observations = float(len(var1))
    return np.sum((var1 - np.mean(var1)) * (var2 - np.mean(var2))) / (observations - min(bias, 1))

def standardize(var):
    return (var - np.mean(var)) / np.std(var)

def correlation(var1, var2, bias=0):
    return covariance(standardize(var1), standardize(var2), bias)

print('Our correlation estimate: {0}'.format(correlation(dataset['RM'], dataset['target'])))


Our correlation estimate: 0.6953599470715393

In [48]:
from scipy.stats import pearsonr

print('Correlation from Scipy pearsonr: {0}'.format(pearsonr(dataset['RM'], dataset['target'])[0]))


Correlation from Scipy pearsonr: 0.6953599470715393

Visually, the stronger the correlation, the more closely the points of a scatterplot approach a straight line. Correlation is a measure of linear association, i.e., how close your data is to a straight line.


In [49]:
x_range = [dataset['RM'].min(), dataset['RM'].max()]
y_range = [dataset['target'].min(), dataset['target'].max()]
scatterplot = dataset.plot(kind='scatter', x='RM', y='target', xlim=x_range, ylim=y_range)
mean_y = scatterplot.plot(x_range, [dataset['target'].mean(), dataset['target'].mean()], '--', color='red', linewidth=1)
mean_x = scatterplot.plot([dataset['RM'].mean(), dataset['RM'].mean()], y_range, '--', color='red', linewidth=1)
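To make that linear association visible, one option (a sketch using `np.polyfit` on hypothetical values standing in for RM and target, not the actual Boston columns) is to compute a least-squares line to overlay on the scatterplot:

```python
import numpy as np

# Hypothetical x/y values standing in for RM and target
x = np.array([5.0, 6.0, 6.5, 7.0, 8.0])
y = np.array([15.0, 20.0, 23.0, 27.0, 35.0])

slope, intercept = np.polyfit(x, y, 1)  # degree-1 least-squares fit
fitted = slope * x + intercept          # points on the fitted line
```

On the real data, the line could then be drawn with `scatterplot.plot(x, fitted, ...)` on top of the scatter. Note that a least-squares line always passes through the point of means $(\bar{x}, \bar{y})$.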


