That course gives basic-level coverage of most of the components used by general Data Scientists, and it is approachable without prior courses. Following this, you could review:
I highly recommend the Python for Data Structures, Algorithms, and Interviews! course by Jose Portilla on Udemy. It is great for preparing for CS-based interviews and is an intense review of the practical parts of 1st- and 2nd-year CS in general.
r-squared: 0 = bad, 1 = perfect (all of the variance is captured by the model)
In [6]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(10)
dossageEffectiveness = abs(np.random.normal(5.0, 1.5, 1000))
repurchaseRate = (dossageEffectiveness + np.random.normal(0, 0.1, 1000)) * 3
repurchaseRate/=np.max(repurchaseRate)
plt.scatter(dossageEffectiveness, repurchaseRate)
plt.show()
In [7]:
from scipy import stats
slope, intercept, r_value, p_value, std_err = stats.linregress(dossageEffectiveness, repurchaseRate)
r_value ** 2
Out[7]:
In [8]:
def predict(x):
    return slope * x + intercept
fitLine = predict(dossageEffectiveness)
plt.scatter(dossageEffectiveness, repurchaseRate)
plt.plot(dossageEffectiveness, fitLine, c='r')
plt.show()
In [9]:
repurchaseRate = np.random.normal(1, 0.1, 1000)*dossageEffectiveness**2
poly = np.poly1d(np.polyfit(dossageEffectiveness, repurchaseRate, 4))
xPoly = np.linspace(0, 7, 100)
plt.scatter(dossageEffectiveness, repurchaseRate)
plt.plot(xPoly, poly(xPoly), c='r')
plt.show()
Make the distribution more complicated to see if a polynomial fit can still capture it.
In [10]:
dossageEffectiveness = np.sort(dossageEffectiveness)
repurchaseRate = (dossageEffectiveness + np.random.normal(0, 1, 1000)) * 3
repurchaseRate/=np.max(repurchaseRate)
angles = np.sort(np.random.uniform(0,np.pi,1000))
cs = np.sin(angles)
repurchaseRateComplicated = repurchaseRate+(cs*100)
repurchaseRateComplicated/=np.max(repurchaseRateComplicated)
poly = np.poly1d(np.polyfit(dossageEffectiveness, repurchaseRateComplicated, 9))
xPoly = np.linspace(0, 7, 100)
plt.scatter(dossageEffectiveness, repurchaseRateComplicated)
plt.plot(xPoly, poly(xPoly), c='r')
plt.show()
With a high-degree polynomial, the fit is unlikely to hold up to future testing; it only fits the training data well (overfitting). A quick check is sketched below.
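One way to check (a minimal sketch, assuming the arrays from the cell above are still in scope and scikit-learn is available): hold out part of the data and compare r-squared on the training and held-out portions.
In [ ]:
# Hedged overfitting check, reusing dossageEffectiveness and
# repurchaseRateComplicated from the cell above (assumed still in scope).
from sklearn.metrics import r2_score
shuffled = np.random.permutation(len(dossageEffectiveness))
trainIdx, testIdx = shuffled[:800], shuffled[800:]
trainX, testX = dossageEffectiveness[trainIdx], dossageEffectiveness[testIdx]
trainY, testY = repurchaseRateComplicated[trainIdx], repurchaseRateComplicated[testIdx]
highDegree = np.poly1d(np.polyfit(trainX, trainY, 9))
print(r2_score(trainY, highDegree(trainX)))  # usually looks good on the training data
print(r2_score(testY, highDegree(testX)))    # often noticeably worse on held-out data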
Multiple regression: just the regression above with more than one predictor variable being fit; a small sketch follows below.
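A hypothetical sketch of what that looks like with scikit-learn (the fabricated dosage/price data and the LinearRegression choice are assumptions, not from the course):
In [ ]:
# Hypothetical multiple-regression sketch (fabricated data):
# two predictors are fit at once instead of one.
import numpy as np
from sklearn.linear_model import LinearRegression
np.random.seed(10)
dosage = np.random.normal(5.0, 1.5, 1000)
price = np.random.normal(20.0, 5.0, 1000)
repurchase = 3 * dosage - 0.5 * price + np.random.normal(0, 1, 1000)
X = np.column_stack([dosage, price])   # one column per predictor
model = LinearRegression().fit(X, repurchase)
print(model.coef_, model.intercept_)   # one coefficient per predictor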
Multi-Level Models attempt to model these interdependencies.
Commonly applied in healthcare.
Not covered in more detail beyond the general discussion in lecture 22; a book is recommended instead for further reading.
One good real-world application is a spam filter. Naive Bayes can be used to develop a model that discriminates normal (ham) emails from garbage (spam). There are lots of ways to improve it, but it works fairly well in a basic form.
Supervised learning.
In [ ]:
## for more code and details, see NaiveBayes.ipynb
import os
import io
import numpy
import pandas as pd
from pandas import DataFrame
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
## *** Read in the emails and their classification ***
def readFiles(path):
    # Minimal stand-in for the elided code (see NaiveBayes.ipynb for the full version):
    # walk the directory and yield (full path, message body) for each file.
    for root, dirnames, filenames in os.walk(path):
        for filename in filenames:
            fullPath = os.path.join(root, filename)
            with io.open(fullPath, 'r', encoding='latin1') as f:
                yield fullPath, f.read()
def dataFrameFromDirectory(path, classification):
    rows = []
    index = []
    for filename, message in readFiles(path):
        rows.append({'message': message, 'class': classification})
        index.append(filename)
    return DataFrame(rows, index=index)
data = pd.concat([
    dataFrameFromDirectory('spamdir', 'spam'),  # not a real dir, just an example
    dataFrameFromDirectory('hamdir', 'ham'),    # not a real dir, just an example
])
## *** Done reading in data ***
# vectorize email contents to numbers
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(data['message'].values)
# make multinomial Naive Bayes object/func
classifier = MultinomialNB()
targets = data['class'].values
# fit vectorized emails
classifier.fit(counts, targets)
# Check it worked with obvious test cases
examples = ['Free Viagra now!!!', "Hi Bob, how about a game of golf tomorrow?"]
example_counts = vectorizer.transform(examples)
predictions = classifier.predict(example_counts)
predictions
Attempts to split data into K groups that are closest to K centroids.
(1) Centroids are adjusted to the center of the points that were closest to them.
(2) Points are then reassigned to whichever centroid they are now closest to.
Repeat 1 and 2 until the error, or the distance the centroids move, converges.
Choosing K: try increasing K until the within-cluster error stops improving much.
use different randomly chosen initial centroids to avoid local minima
You still need to determine labels for the clusters found; a minimal K-Means sketch follows below.
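A minimal scikit-learn sketch of the above (the fabricated blob data and K=3 are assumptions):
In [ ]:
# Minimal K-Means sketch; the blob data and K=3 are for illustration only.
import numpy as np
from sklearn.cluster import KMeans
np.random.seed(0)
data = np.concatenate([np.random.normal(center, 0.5, size=(100, 2))
                       for center in (0, 5, 10)])
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(data)
print(model.cluster_centers_)  # final centroid positions
print(model.inertia_)          # within-cluster error; compare across K when choosing K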
In [ ]:
!pip install --upgrade graphviz
In [ ]:
import numpy as np
import pandas as pd
from sklearn import tree
input_file = "PastHires.csv"
df = pd.read_csv(input_file, header = 0)
d = {'Y': 1, 'N': 0}
df['Hired'] = df['Hired'].map(d)
d = {'BS': 0, 'MS': 1, 'PhD': 2}
df['Level of Education'] = df['Level of Education'].map(d)
In [ ]:
featureNames = list(df.columns[:6])
# NOTE: the remaining Y/N columns also need the 0/1 mapping shown above before fitting
features = df[featureNames]
decisions = df['Hired']
clf = tree.DecisionTreeClassifier()
clf = clf.fit(features, decisions)
In [ ]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=10)
clf = clf.fit(features,decisions)
# Predict employment of an employed 10-year veteran
print(clf.predict([[10, 1, 4, 0, 0, 0]]))
# ...and an unemployed 10-year veteran
print(clf.predict([[10, 0, 4, 0, 0, 0]]))
Multiple models work together to make a prediction (the random forest above is one example); a hedged sketch of another ensemble follows below.
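One hedged illustration of the idea (not from the course notebooks): scikit-learn's VotingClassifier lets several different models vote on each prediction.
In [ ]:
# Hypothetical ensemble sketch: three different models vote on each iris prediction.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
iris = load_iris()
ensemble = VotingClassifier(estimators=[
    ('tree', DecisionTreeClassifier()),
    ('forest', RandomForestClassifier(n_estimators=10)),
    ('nb', GaussianNB()),
])
ensemble.fit(iris.data, iris.target)
print(ensemble.predict(iris.data[:5]))  # majority vote of the three models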
Ex. Identify types of iris flower by length and width of sepal.
With a simple linear kernel.
In [ ]:
from sklearn import svm, datasets
C = 1.0 #error penalty. 1 is default.
# features/classifications are the inputs and labels loaded earlier (not shown in this excerpt)
svc = svm.SVC(kernel='linear', C=C).fit(features, classifications)
# Check the prediction for another set of feature values
svc.predict([[200000, 40]])  # output will be the classification for those features
This resolves some of the problems, mentioned above, that arise from using people's actions to make recommendations.
In [ ]:
import pandas as pd
## see 'SimilarMovies.ipynb' for basics of finding similar movies
## see 'ItemBasedCF.ipynb' for improved filtering and results
ratings = pd.read_csv('ratingsData') # not a real file, just ex
movies = pd.read_csv('items')  # not a real file, just ex
userRatings = ratings.pivot_table(index=['user_id'],columns=['title'],values='rating')
In [ ]:
# Calculate item correlations
corrMatrix = userRatings.corr()
# ex of simple cleaning
corrMatrix = userRatings.corr(method='pearson', min_periods=100)
In [ ]:
# group results and return top matches
simCandidates = simCandidates.groupby(simCandidates.index).sum()
simCandidates.sort_values(inplace = True, ascending = False)
simCandidates.head(10)
# filter out those current user has seen or bought
filteredSims = simCandidates.drop(myRatings.index)
filteredSims.head(10)
## further filtering ideas for this example at bottom of ItemBasedCF.ipynb
Supervised learning
In [ ]:
import numpy as np
import pandas as pd
### for more real code, see KNN.ipynb
# bring in the data
ratings = pd.read_csv("data")# not a real file, just ex
# group by features of interest
movieProperties = ratings.groupby('movie_id').agg({'rating': [np.size, np.mean]})
# normalize features of interest for classification
movieNumRatings = pd.DataFrame(movieProperties['rating']['size'])
movieNormalizedNumRatings = movieNumRatings.apply(lambda x: (x - np.min(x)) / (np.max(x) - np.min(x)))
In [ ]:
from scipy import spatial
def ComputeDistance(a, b):
    """Compute the distance between two items (genre similarity + popularity)."""
    genresA = a[1]
    genresB = b[1]
    genreDistance = spatial.distance.cosine(genresA, genresB)
    popularityA = a[2]
    popularityB = b[2]
    popularityDistance = abs(popularityA - popularityB)
    return genreDistance + popularityDistance
In [ ]:
import operator
def getNeighbors(movieID, K):
    """Get the K nearest neighbors and return them sorted by distance."""
    distances = []
    for movie in movieDict:
        if movie != movieID:
            dist = ComputeDistance(movieDict[movieID], movieDict[movie])
            distances.append((movie, dist))
    distances.sort(key=operator.itemgetter(1))
    neighbors = []
    for x in range(K):
        neighbors.append(distances[x][0])
    return neighbors
## again, see KNN.ipynb to see how the results can be
## displayed to see how it went
When data has too many dimensions, extract a smaller set of basis vectors (principal components) that can be combined to reproduce the high-dimensional data sufficiently well. Put another way: find a way to represent the data with minimal dimensions that sufficiently preserves its variance.
Ex. Identify types of iris flower by length and width of sepal. Data comes with scikit-learn.
In [ ]:
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
import pylab as pl
from itertools import cycle
# load data
iris = load_iris()
#numSamples, numFeatures = iris.data.shape
# apply PCA
X = iris.data
pca = PCA(n_components=2, whiten=True).fit(X)
X_pca = pca.transform(X)
print(pca.components_)
# check remaining variance
print(pca.explained_variance_ratio_)
print(sum(pca.explained_variance_ratio_))  # 1.0 would imply 100% of the variance was kept
## see the rest of PCA.ipynb to see how to plot the results
The more 'traditional' approach.
transformed data is loaded into warehouse
BUT step 2 (the transform) can be a big problem with "big data".
Push the intensive transformation step to the end, where it can be better optimized. This (ELT) approach is much more scalable than ETL; a toy sketch follows below.
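A toy illustration of the ELT ordering (a hypothetical sketch using SQLite as a stand-in "warehouse"; the table and column names are made up):
In [ ]:
# Toy ELT sketch: load the raw rows first, transform later inside the "warehouse".
# SQLite and the table/column names here are stand-ins, not from the course.
import sqlite3
import pandas as pd
raw = pd.DataFrame({'user_id': [1, 1, 2], 'amount': [10.0, 5.0, 7.5]})
conn = sqlite3.connect(':memory:')
raw.to_sql('raw_orders', conn, index=False)           # (E)xtract + (L)oad as-is
totals = pd.read_sql('SELECT user_id, SUM(amount) AS total '
                     'FROM raw_orders GROUP BY user_id', conn)  # (T)ransform last
print(totals)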
One example is Pac-Man.
Implementation of reinforcement learning.
Have: a set of environmental states, a set of possible actions in each state, and a value Q for each state/action pair.
Start all Q's at 0
Use Bayes' theorem to include intelligent randomness in exploration and increase learning efficiency. Thus, a Markov Decision Process (MDP).
Use this in tandem with Q-learning to build up a table of all possible states and the reward values (Q values) for every available action in each state. In some cases or terms, this can be considered a form of dynamic programming or memoization; a minimal sketch follows below.
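A minimal sketch of that table-building loop (a toy, hand-rolled example; the corridor environment, rewards, and the alpha/gamma/epsilon values are assumptions, not from the course):
In [ ]:
# Toy tabular Q-learning on a 1-D corridor: move left/right to reach the goal.
# The environment, rewards, and parameter values are assumptions for illustration.
import numpy as np
nStates, nActions = 5, 2              # states 0..4; actions: 0 = left, 1 = right
Q = np.zeros((nStates, nActions))     # start all Q's at 0
alpha, gamma, epsilon = 0.1, 0.9, 0.3
rng = np.random.default_rng(0)
for episode in range(200):
    state = 0
    while state != nStates - 1:                  # state 4 is the goal
        if rng.random() < epsilon:               # explore: random action
            action = int(rng.integers(nActions))
        else:                                    # exploit: best known action
            action = int(np.argmax(Q[state]))
        nextState = min(nStates - 1, max(0, state + (1 if action == 1 else -1)))
        reward = 1.0 if nextState == nStates - 1 else 0.0
        # standard Q-learning update toward reward + discounted future value
        Q[state, action] += alpha * (reward + gamma * Q[nextState].max() - Q[state, action])
        state = nextState
print(Q)  # "right" should end up with the higher Q value in every state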
Lectures 48-53 are based on issues that come up when applying the course fundamentals to real-world data.
Using MLlib to essentially do the things reviewed in pure Python before (K-Means Clustering, Decision Trees, ...), but in a way that can be run locally OR on a Hadoop cluster with Amazon Web Services (AWS).
In [ ]: