Create a vector:
In [13]:
import numpy as np
vector = np.array([1, 2, 3, 4, 5])
vector
Out[13]:
array([1, 2, 3, 4, 5])
Convert it into a column vector:
In [14]:
vector.reshape((5, 1))
Out[14]:
array([[1],
       [2],
       [3],
       [4],
       [5]])
Convert it into a row vector:
In [15]:
vector.reshape((1, 5))
Out[15]:
array([[1, 2, 3, 4, 5]])
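As a side note, the same reshaping can be done with np.newaxis indexing. A minimal sketch (assuming only NumPy and the vector defined above; the variable names are illustrative):

import numpy as np

vector = np.array([1, 2, 3, 4, 5])
# np.newaxis inserts an axis of length 1, mirroring the reshape calls above
column_vector = vector[:, np.newaxis]  # shape (5, 1)
row_vector = vector[np.newaxis, :]     # shape (1, 5)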
A vector can be considered a single-feature matrix:
In [16]:
vector.reshape((1, 5))
Out[16]:
array([[1, 2, 3, 4, 5]])
Create a multiple-feature matrix:
In [17]:
multiple_feature_matrix = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11, 12, 13, 14, 15]])
multiple_feature_matrix
Out[17]:
array([[ 1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10],
       [11, 12, 13, 14, 15]])
In [18]:
from sklearn.datasets import fetch_california_housing
from sklearn.datasets import load_boston  # removed in scikit-learn 1.2; see the note below
boston = load_boston()
california = fetch_california_housing()
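Note that load_boston was removed in scikit-learn 1.2. If you are on a recent version, here is a minimal sketch of the workaround suggested in scikit-learn's deprecation notice (it assumes the CMU StatLib mirror at lib.stat.cmu.edu is reachable; each observation spans two physical rows in the raw file):

import numpy as np
import pandas as pd

data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
# stitch the two half-rows of each observation back together
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]
feature_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE',
                 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']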
In [19]:
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
from scipy.stats import norm  # matplotlib.mlab.normpdf was removed in Matplotlib 3.1
%matplotlib inline
In [24]:
dataset = pd.DataFrame(boston.data, columns=boston.feature_names)
dataset['target'] = boston.target
In [21]:
x = np.linspace(-4, 4, 100)
# the third argument of norm.pdf is the standard deviation, not the variance
for mean, std_dev in [(0, 0.7), (0, 1), (1, 1.5), (-2, 0.5)]:
    plt.plot(x, norm.pdf(x, mean, std_dev))
In [25]:
dataset.head()
Out[25]:
In [27]:
mean_expected_value = dataset['target'].mean()
mean_expected_value
Out[27]:
22.532806324110677
The mean can also be calculated using a NumPy function:
In [28]:
np.mean(dataset['target'])
Out[28]:
22.532806324110677
In [29]:
Squared_errors = pd.Series(mean_expected_value - dataset['target'])**2
SSE = np.sum(Squared_errors)
print('Sum of Squared Errors (SSE): {0}'.format(SSE))
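As a quick sanity check, the SSE around the mean equals the number of observations times the population variance of the target. A sketch assuming the dataset and SSE computed above:

n = len(dataset)
# var(ddof=0) is the population variance (divides by n rather than n - 1)
assert np.isclose(SSE, n * dataset['target'].var(ddof=0))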
In [30]:
density_plot = Squared_errors.plot(kind='hist')
The histogram shows that most of the squared errors are close to zero.
Correlation is a measure of how two variables relate to each other: how strongly, and in which direction.
Z-score standardization involves subtracting the mean from each score and then dividing by the standard deviation. The standardized variable has a mean of 0 and a standard deviation of 1.
The standardization formula is:

$z = \frac{x - \mu}{\sigma}$
In [37]:
def standardize(x):
    return (x - np.mean(x)) / np.std(x)
In [44]:
standardize(dataset['target']).head()
Out[44]:
Calculate the correlation by standardizing both variables, multiplying them pairwise, summing the products, and dividing by the number of observations. This yields a number between -1 and 1, whose sign indicates the direction of the association (positive: both grow together; negative: as one grows, the other shrinks).
Covariance:

$cov(x, y) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu_x)(y_i - \mu_y)$

Pearson's correlation:

$r = \frac{cov(x, y)}{\sigma_x \sigma_y}$
In [47]:
def covariance(var1, var2, bias=0):
    observations = float(len(var1))
    # with the default bias=0 this divides by n; any bias >= 1 divides by n - 1
    return np.sum((var1 - np.mean(var1)) * (var2 - np.mean(var2))) / (observations - min(bias, 1))

def standardize(var):
    return (var - np.mean(var)) / np.std(var)

def correlation(var1, var2, bias=0):
    # the covariance of two standardized variables is Pearson's correlation
    return covariance(standardize(var1), standardize(var2), bias)
print('Our correlation estimate: {0}'.format(correlation(dataset['RM'], dataset['target'])))
In [48]:
from scipy.stats import pearsonr  # scipy.stats.stats is a deprecated import path
print('Correlation from Scipy pearsonr: {0}'.format(pearsonr(dataset['RM'], dataset['target'])[0]))
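As another cross-check (a sketch assuming NumPy and the dataset above), np.corrcoef returns the full correlation matrix; its off-diagonal entry is the same Pearson's r:

np.corrcoef(dataset['RM'], dataset['target'])[0, 1]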
Visually, the stronger the correlation, the closer a scatterplot of the two variables comes to a straight line. Correlation is a measure of linear association, i.e., of how close your data lie to a straight line.
In [49]:
x_range = [dataset['RM'].min(), dataset['RM'].max()]
y_range = [dataset['target'].min(), dataset['target'].max()]
scatterplot = dataset.plot(kind='scatter', x='RM', y='target', xlim=x_range, ylim=y_range)
mean_y = scatterplot.plot(x_range, [dataset['target'].mean(), dataset['target'].mean()], '--', color='red', linewidth=1)
mean_x = scatterplot.plot([dataset['RM'].mean(), dataset['RM'].mean()], y_range, '--', color='red', linewidth=1)
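To make the linear association explicit, here is a minimal sketch (assuming the scatterplot and x_range from the previous cell; np.polyfit with deg=1 fits a least-squares line) that overlays the fitted line:

slope, intercept = np.polyfit(dataset['RM'], dataset['target'], deg=1)
fit_x = np.array(x_range)
# overlay the least-squares line on the existing scatterplot axes
fit_line = scatterplot.plot(fit_x, slope * fit_x + intercept, color='green', linewidth=1)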