In [5]:
from __future__ import division
# plotting
%matplotlib inline
from matplotlib import pyplot as plt
import seaborn as sns
import pylab as pl
from matplotlib.pylab import cm
import pandas as pd
# scientific
import numpy as np
# ipython
from IPython.display import Image
This lecture introduces some basics of information theory, which provide important background for Probabilistic Graphical Models, a large topic we will cover over the next several lectures. We define core quantities such as information content, entropy, cross entropy, and relative entropy, and see how entropy relates to compression theory. As applications, we show how information theory can help us select features and find the most frequent collocations in a novel.
Information theory is concerned with quantifying information: how much is conveyed by an observation, and how compactly messages can be encoded and transmitted.
In machine learning, information-theoretic quantities are useful for selecting features, choosing splits in decision trees, and comparing probability distributions.
The information content of an event $E$ with probability $p$ is defined as $$ I(E) = I(p) = - \log_2 p = \log_2 \frac{1}{p} \geq 0 $$
For example, if a coin always lands heads, observing heads carries no information: $I(\text{Head}) = -\log_2 1 = 0$. On the contrary, if we observe a tail $$ I(\text{Tail}) = - \log_2 0 = + \infty $$
Information is a measure of how surprised we are by an outcome.
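As a quick sketch (not part of the original notes), the definition maps directly onto NumPy:

```python
import numpy as np

def information_content(p):
    """Surprisal of an event with probability p, in bits: I(p) = -log2(p)."""
    return -np.log2(p)

# A certain event carries no information; rare events are highly surprising.
information_content(1.0)      # 0.0 bits
information_content(0.5)      # 1.0 bit  (a fair coin flip)
information_content(1/1024)   # 10.0 bits
```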
Entropy is highest when $X$ is close to uniform.
The farther from uniform $X$ is, the smaller the entropy.
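A small numerical check of this claim (the distributions below are illustrative, not from the lecture):

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

uniform = [0.25, 0.25, 0.25, 0.25]  # H = 2 bits: the maximum for 4 outcomes
skewed  = [0.70, 0.10, 0.10, 0.10]  # farther from uniform, so lower entropy
point   = [1.00, 0.00, 0.00, 0.00]  # H = 0 bits: no uncertainty at all
```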
We should choose $p$ to have maximum entropy $H[p]$ among all distributions satisfying our constraints.
Constraints | Maximum Entropy Distribution |
---|---|
Min $a$, Max $b$ | Uniform $U[a,b]$ |
Mean $\mu$, Support $(0,+\infty)$ | Exponential $Exp(\mu)$ |
Mean $\mu$, Variance $\sigma^2$ | Gaussian $\mathcal{N}(\mu, \sigma^2)$ |
Suppose we draw messages from a distribution $p$.
An efficient encoding minimizes the average code length,
Example: Morse Code
It is impossible to encode messages drawn from a distribution $p$ with fewer than $H[p]$ bits, on average.
$H[p]$ measures the optimal code length, in bits, for messages drawn from $p$
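To illustrate (with an assumed toy distribution): Shannon code lengths $\lceil \log_2(1/p) \rceil$ achieve an average length within one bit of $H[p]$, and match it exactly when every probability is a power of two:

```python
import math

# A toy message distribution (assumed for illustration).
p = {'a': 0.5, 'b': 0.25, 'c': 0.125, 'd': 0.125}

H = -sum(q * math.log2(q) for q in p.values())                 # entropy = 1.75 bits
lengths = {s: math.ceil(-math.log2(q)) for s, q in p.items()}  # ceil(log2(1/p))
avg_len = sum(p[s] * lengths[s] for s in p)                    # average code length

# Source coding theorem: avg_len can never fall below H; here avg_len == H
# exactly because every probability is a power of two.
```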
Consider different distributions $p$ and $q$. The cross entropy $$ H(p,q) = - \sum_{x \in X} p(x) \log_2 q(x) $$ is the average code length when messages drawn from $p$ are encoded with a code optimal for $q$.
For example, suppose our encoding scheme is optimal for German text, but the messages we actually send are English: each message then costs more bits than necessary, on average.
Relative entropy is the difference $H(p,q) - H(p)$.
Relative entropy, aka Kullback-Leibler divergence, of $q$ from $p$ is $$ \begin{align} D_{KL}(p \| q) &= H(p,q) - H(p) \\ &= \sum_{x \in X} p(x) \log \frac{p(x)}{q(x)} \\ \end{align} $$
Measures the number of extra bits needed to encode messages from $p$ if we use a code optimal for $q$.
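A short sketch of both quantities (the distributions are assumed toy values):

```python
import numpy as np

def cross_entropy(p, q):
    """H(p, q): average bits to encode draws from p using a code optimal for q."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return -np.sum(p * np.log2(q))

def kl_divergence(p, q):
    """D_KL(p || q) = H(p, q) - H(p): the extra bits from using the wrong code."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(p * np.log2(p / q))

p = [0.5, 0.25, 0.25]
q = [1/3, 1/3, 1/3]
extra_bits = kl_divergence(p, q)  # > 0 since q != p; zero only when q == p
```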
Mutual information between discrete variables $X$ and $Y$ is $$ \begin{align} I(X; Y) &= \sum_{y\in Y} \sum_{x \in X} p(x,y) \log\frac{p(x,y)}{p(x)p(y)} \\ &= D_{KL}( p(x,y) \| p(x)p(y) ) \end{align} $$
See [MLAPP] §3.5.4 for more information
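The definition is easy to check on a toy joint distribution (values assumed for illustration):

```python
import numpy as np

# Joint distribution p(x, y): rows index x, columns index y.
pxy = np.array([[0.4, 0.1],
                [0.1, 0.4]])
px = pxy.sum(axis=1, keepdims=True)  # marginal p(x), shape (2, 1)
py = pxy.sum(axis=0, keepdims=True)  # marginal p(y), shape (1, 2)

# I(X;Y) = D_KL( p(x,y) || p(x)p(y) ), computed by broadcasting.
mi = np.sum(pxy * np.log2(pxy / (px * py)))
# mi > 0 here because X and Y are correlated; independence would give 0.
```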
Collocations are word combinations that occur together more often than chance, so substituting a synonym sounds unnatural: for example, "strong tea" is idiomatic, but "powerful tea" sounds odd.
How can we find collocations in a corpus of text?
The pointwise mutual information (PMI) between words $x$ and $y$ is $$ \mathrm{pmi}(x;y) = \log \frac{p(x,y)}{p(x)p(y)} $$
Idea: Rank word pairs by $\mathrm{pmi}(x;y)$ to find collocations!
Example: Let's try it on the novel Crime and Punishment!
(Data: `collocations/data` folder)
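A minimal sketch of the PMI ranking on a toy word sequence (the tokens below are invented; the real experiment would tokenize the novel's text from the data folder). Note that raw PMI favors rare pairs, so in practice pairs below a minimum count are usually filtered out:

```python
import math
from collections import Counter

tokens = "the old woman took the axe and the old woman screamed".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
N = len(tokens)

def pmi(x, y):
    """Pointwise mutual information of the adjacent pair (x, y), in bits."""
    p_xy = bigrams[(x, y)] / (N - 1)
    p_x, p_y = unigrams[x] / N, unigrams[y] / N
    return math.log2(p_xy / (p_x * p_y))

# Rank adjacent word pairs: pairs that co-occur more than chance score highest.
ranked = sorted(bigrams, key=lambda b: pmi(*b), reverse=True)
```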
In [1]:
### Requirements: PyDotPlus, Matplotlib, Scikit-Learn, Pandas, Numpy, IPython (and possibly GraphViz)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import Imputer
from sklearn import preprocessing
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import BaggingRegressor
from sklearn.ensemble import BaggingClassifier
from sklearn import tree
import sklearn
import sklearn.metrics as skm
from scipy import misc
from sklearn.externals.six import StringIO
import pydotplus
from IPython.display import Image, YouTubeVideo
def visualize_tree(tree, feature_names, class_names):
    dot_data = StringIO()
    sklearn.tree.export_graphviz(tree, out_file=dot_data,
                                 filled=True, rounded=True,
                                 feature_names=feature_names,
                                 class_names=class_names,
                                 special_characters=True)
    graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
    return graph.create_png()
In this lecture we will:
(Example: http://en.akinator.com)
In [2]:
# Decision Tree Example: Poisonous/Edible Mushroom Classification
# (from the UCI Repository)
data = pd.read_csv('agaricus-lepiota.data', header=None)
# Preprocessing (Note: Label Encoder is a very useful tool to convert categorical data!)
le = preprocessing.LabelEncoder()
# Change columns from labels to integer categories
#(See agaricus-lepiota.data for the initial labels and
# agaricus-lepiota.names for a full description of the columns)
data = data.apply(le.fit_transform)
# Use a Decision Tree with maximum depth = 5
dt_classifier = DecisionTreeClassifier(max_depth=5)
dt = dt_classifier.fit(data.iloc[:, 1:], data.iloc[:, 0])
In [3]:
# Visualize the decision tree with class names and feature names (See the first cell for the function)
Image(visualize_tree(dt,
                     feature_names=open('agaricus-lepiota.feature-names').readlines(),
                     class_names=['edible', 'poisonous']))
Out[3]:
Recall Information Gain = Entropy(Parent) - Weighted Sum of Entropy(Children)
$$IG(P,a) = H(P) - H(P \mid a)$$ Remember, the entropy is $H(p) = -\sum \limits_{i = 1}^{k_m} p_i \log_2 p_i$, where the query concerns some feature $x_k$ with $k_m$ possible responses.
Entropy of the dataset, say $D$, with 29 examples of one class and 35 of the other: $H(D) = -\left(\frac{29}{64} \log_2 \frac{29}{64} + \frac{35}{64} \log_2 \frac{35}{64}\right) \approx 0.9937$. So:
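Putting these formulas together (the child counts below are hypothetical, chosen only to illustrate the computation):

```python
import numpy as np

def H(counts):
    """Entropy, in bits, of a node with the given class counts."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

parent = [29, 35]               # the class counts used above
left, right = [21, 5], [8, 30]  # hypothetical counts after a binary query

n = float(sum(parent))
ig = H(parent) - (sum(left) / n * H(left) + sum(right) / n * H(right))
# A positive information gain means the query reduces uncertainty.
```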
Misclassification error is another impurity measure: $I(N) = 1 - \max \limits_{i} P(i \mid N)$ for a query node $N$.
In [4]:
%matplotlib inline
def gini(p):
    return (p) * (1 - p) + (1 - p) * (1 - (1 - p))

def entropy(p):
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def error(p):
    return 1 - np.max([p, 1 - p])

def plot_impurity_metrics():
    x = np.arange(0.0, 1.0, 0.01)
    entropy_t = [entropy(p) if p != 0 else None for p in x]
    entropy_scaled_t = [0.5 * i if i is not None else None for i in entropy_t]
    gini_t = [gini(p) for p in x]
    error_t = [error(p) for p in x]
    fig = plt.figure()
    ax = plt.subplot(111)
    for i, lab, ls, c in zip([entropy_t, entropy_scaled_t, gini_t, error_t],
                             ['Entropy', 'Entropy (Scaled)', 'Gini', 'Misclassification Error'],
                             ['-', '-', '--', '-.'],
                             ['black', 'lightgray', 'red', 'lightgreen']):
        line = ax.plot(x, i, label=lab, linestyle=ls, color=c, lw=2)
    ax.legend(loc='upper center', ncol=3, bbox_to_anchor=(0.5, 1.15), fancybox=True, shadow=False)
    ax.axhline(y=1, linewidth=1, color='k', linestyle='--')
    ax.axhline(y=0.5, linewidth=1, color='k', linestyle='--')
    plt.ylim([0, 1.1])
    plt.xlabel('p(Attribute = True)')
    plt.ylabel('Impurity Index')
In [5]:
# The Metrics Illustrated in a Graph For a 2-Class Problem
plot_impurity_metrics()
# Note: The Gini impurity index lies between the misclassification
# error curve and the scaled entropy curve!
All three measures are similar, and often construct identical trees. However, misclassification error is less sensitive to changes in the class probabilities than entropy or the Gini index, so the latter two are usually preferred when growing a tree.
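One way to see the similarity is to compare the split threshold each criterion picks on a tiny one-dimensional dataset (a sketch with invented data, not part of the original notebook):

```python
import numpy as np

def gini_impurity(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy_impurity(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_split(x, y, impurity):
    """Threshold t minimizing the weighted impurity of {x <= t} and {x > t}."""
    best_t, best_score = None, np.inf
    for t in np.unique(x)[:-1]:
        left, right = y[x <= t], y[x > t]
        score = (len(left) * impurity(left) + len(right) * impurity(right)) / len(y)
        if score < best_score:
            best_t, best_score = t, score
    return best_t

x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
# On cleanly separable data, both criteria select the same threshold (t = 4).
```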
In [6]:
### Decision Tree Active Learning Example: Animal Game
class node:
    "Node objects have a question, and left and right pointers to other nodes"
    def __init__(self, question, left=None, right=None):
        self.question = question
        self.left = left
        self.right = right

def yes(ques):
    "Force the user to answer 'yes' or 'no' or something similar. Yes returns true"
    while True:
        ans = raw_input(ques)
        ans = ans[0:1].lower()
        if ans == 'y': return 1
        elif ans == 'n': return 0

knowledge = node("bird")

def active_learning_example(suppress=True):
    "Guess the animal. Add a new node for a wrong guess."
    first = True
    while not suppress and (first or yes("Continue? (y/n)")):
        if first:
            first = False
        print
        if not yes("Are you thinking of an animal? (y/n)"): break
        p = knowledge
        while p.left is not None:
            if yes(p.question + "? "): p = p.right
            else: p = p.left
        if yes("Is it a " + p.question + "? "): continue
        animal = raw_input("What is the animal's name? ")
        question = raw_input("What question would distinguish a %s from a %s? "
                             % (animal, p.question))
        p.left = node(p.question)
        p.right = node(animal)
        p.question = question
        if not yes("If the animal were %s the answer would be? (y/n)" % animal):
            (p.right, p.left) = (p.left, p.right)
In [7]:
# Interactive Active Learning Example
# Change suppress to False and run this cell to see demonstration
active_learning_example(suppress=True)
What is the policy by which a particular decision tree algorithm generalizes from observed training examples to classify unseen instances?
Definition: The set of assumptions that, together with the training data, deductively justify the classifications assigned by the learner to future instances.
We can also think of this bias as an algorithm's preference for certain hypotheses over others.
Decision Trees in general perform well with lots of data, are robust to violations of assumptions, and, perhaps most strikingly, are easy to understand and interpret. However: