First, pimp your notebook!


In [1]:
from IPython.core.display import HTML
import os
def css_styling():
    """Load default custom.css file from ipython profile"""
    base = os.getcwd()
    styles = "<style>\n%s\n</style>" % (open(os.path.join(base,'files/custom.css'),'r').read())
    return HTML(styles)
css_styling()


Out[1]:

Introduction to Machine Learning

In machine learning, computers apply statistical learning techniques to automatically recognize patterns in data.

These techniques can be used to predict, classify, fit models, discover patterns, and reduce dimensionality.

For this we will use the Scikit-learn library:


In [1]:
import numpy as np
import sklearn as sk
import matplotlib.pyplot as plt
import sklearn.datasets as datasets
%matplotlib inline

The big picture!

First, let's use intuition!

How would we classify these two groups?

What comes to mind?

The Classification Problem

In ML, categorizing data points is a classification problem.

By convention, X holds the data and Y the values to predict.

(Watch out):

X is a multidimensional array: it holds both the x and the y coordinates!


In [2]:
X, Y = datasets.make_blobs(centers=2)  # 100 2-D points drawn around two centers

print("Information about X:")
print(X.shape)
print(X)
print("Information about Y:")
print(Y.shape)
print(Y)


Information about X:
(100, 2)
[[ 11.0131191   -5.61002922]
 [ -0.33893747  -4.35353469]
 [  8.26522797  -5.85492695]
 ...
 [  9.57377599  -7.18060005]
 [ 10.39853134  -5.77189482]
 [ 10.54268663  -6.24049211]]
Information about Y:
(100,)
[0 1 0 0 0 0 0 1 0 0 1 1 1 0 1 1 1 0 1 0 1 1 1 0 1 0 1 0 0 0 1 0 1 1 0 1 0
 0 1 1 0 1 0 0 0 0 1 1 0 1 1 1 0 1 1 0 1 0 0 0 1 0 1 1 1 1 1 1 0 1 1 0 1 1
 0 1 0 0 1 1 0 0 0 0 1 0 1 1 0 1 0 0 1 0 0 1 1 0 0 0]

Let's visualize!

Using Y, the value we want to predict, we can assign colors:


In [3]:
plt.scatter(X[:,0], X[:,1], c=Y);


k-means to the rescue!

K-means is a simple algorithm we can use: it finds the most "likely" centers, and a point belongs to a category if it lies closest to that category's center.


In [6]:
from sklearn.cluster import KMeans

kmeans = KMeans(4)              # ask k-means for 4 clusters
Y_pred = kmeans.fit(X).labels_  # cluster index assigned to each point
print(Y_pred)


[3 1 0 3 0 3 3 1 0 0 1 1 1 3 1 1 1 0 1 3 2 2 2 0 1 3 1 0 0 3 2 3 1 1 3 1 0
 0 1 2 0 2 3 0 3 3 2 1 3 2 2 1 0 1 1 3 2 0 0 3 2 0 2 1 1 2 2 1 3 1 1 3 1 2
 3 2 0 3 2 1 3 3 0 3 2 0 1 1 0 2 3 0 2 0 0 2 2 0 3 3]
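
To make the nearest-center rule concrete, here is a minimal sketch (assuming the kmeans object fitted above; manual_labels is a name introduced here, not from the original) that reproduces the assignment by hand:

In [ ]:
# Each point gets the index of its nearest cluster center (Euclidean).
centers = kmeans.cluster_centers_                     # shape (4, 2)
dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
manual_labels = dists.argmin(axis=1)                  # shape (100,)
print(np.array_equal(manual_labels, kmeans.labels_))  # expected: True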

Prettier!


In [7]:
plt.scatter(X[:,0], X[:,1], c=Y_pred);


We can quantify the error to see how we did:


In [8]:
error = kmeans.score(X)  # KMeans.score ignores any y argument, so pass only X
print("The error is: %f" % error)


The error is: -105.797277
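
For context, this score is scikit-learn's negative inertia: the negated sum of squared distances from each point to its assigned center. A minimal sketch recomputing it from the fitted model:

In [ ]:
# Recompute the score by hand from the fitted centers and labels.
centers = kmeans.cluster_centers_
inertia = ((X - centers[kmeans.labels_]) ** 2).sum()
print(-inertia)           # should match kmeans.score(X)
print(-kmeans.inertia_)   # same quantity, as stored by the model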

And visualize the centers:


In [9]:
plt.scatter(X[:,0], X[:,1], c=Y_pred, alpha=0.4)
mu = kmeans.cluster_centers_   # one row per cluster center
plt.scatter(mu[:,0], mu[:,1], s=100, c=np.unique(Y_pred))
print(mu)


[[ 9.02128567 -6.66750181]
 [-0.6537593  -2.99965305]
 [-2.34105974 -2.26137148]
 [ 9.92570331 -5.21614594]]

Finally, let's visualize the error as a function of the number of clusters k:


In [14]:
ks = [2, 5, 8, 10, 20, 40, 60, 80, 100]
error = []
for k in ks:
    kmeans = KMeans(k)
    kmeans.fit(X)
    error.append(kmeans.score(X))  # score ignores y; closer to 0 is better

In [15]:
plt.plot(ks,error,'-o')
plt.show()


Did we get it right?

What could we change?
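
One way to check: the blobs were generated with centers=2, so a minimal sketch (kmeans2 and accuracy are names introduced here) refits with k=2 and compares against Y, allowing for the arbitrary numbering of the clusters:

In [ ]:
# Refit with the true number of centers and compare with the labels.
# Cluster ids are arbitrary, so also check the flipped assignment.
kmeans2 = KMeans(2)
pred = kmeans2.fit(X).labels_
accuracy = max(np.mean(pred == Y), np.mean(pred != Y))
print("accuracy with k=2: %f" % accuracy)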

Activity!

  • Generate a random integer and store it in a variable; remember np.random.randint()
  • DON'T LOOK AT IT!
  • Create Gaussian blobs with that many centers
  • Use K-means to try to recover the number of clusters (a starter sketch follows below)
  • What happens when the clusters are close together, i.e., when they overlap?
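
A hypothetical starter sketching the loop described above (n_secret, X_act, Y_act, and ks_act are names introduced here, not from the original):

In [ ]:
# Hide the true number of clusters, then try to recover it by
# sweeping k and looking for an elbow in the score curve.
n_secret = np.random.randint(2, 8)   # don't print it!
X_act, Y_act = datasets.make_blobs(centers=n_secret)
ks_act = range(1, 10)
scores = [KMeans(k).fit(X_act).score(X_act) for k in ks_act]
plt.plot(list(ks_act), scores, '-o');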

Extra dimensions, more info, subtleties

  • The dimensions of a dataset are called features or variables.

  • The essence of statistical learning is identifying boundaries in the data using mathematics.

  • Building a model is also described as "training a model".
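
For instance, in the blobs above each column of X is one feature; a minimal check:

In [ ]:
# Each column of X is a feature: here, the two blob coordinates.
print(X.shape)    # (n_samples, n_features)
print(X[:3, 0])   # feature 0: x coordinate of the first 3 points
print(X[:3, 1])   # feature 1: y coordinate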


In [ ]:
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits

digits = load_digits()  # scikit-learn's 8x8 handwritten digits dataset (not MNIST)
X = digits.data         # shape (1797, 64): each row is a flattened 8x8 image
Y = digits.target       # digit labels, 0-9

Let's see what we just loaded:


In [ ]:
plt.rc("image", cmap="binary")      # black/white palette for plotting
for i in range(10):                 # range, not Python 2's xrange
    plt.subplot(2, 5, i+1)
    plt.imshow(X[i].reshape(8, 8))  # digits images are 8x8, not 28x28
    plt.xticks(())
    plt.yticks(())
plt.tight_layout()

We run k-means:


In [ ]:
kmeans = KMeans(20)  # 20 clusters, more than the 10 digit classes
mu_digits = kmeans.fit(X).cluster_centers_

And visualize:


In [ ]:
plt.figure(figsize=(16,6))
for i in range(2 * (mu_digits.shape[0] // 2)):  # loop over all means; // for integer division
    plt.subplot(2, mu_digits.shape[0] // 2, i+1)
    plt.imshow(mu_digits[i].reshape(8, 8))      # 64-d centers reshape to 8x8 images
    plt.xticks(())
    plt.yticks(())
plt.tight_layout()
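
As a follow-up sketch (this majority-vote inspection is an addition here, not from the original), you could check which digit each cluster mostly contains:

In [ ]:
# For each cluster, report the most common true digit among its members.
labels = kmeans.labels_
for c in range(kmeans.n_clusters):
    counts = np.bincount(Y[labels == c], minlength=10)
    print("cluster %2d -> digit %d (%d points)" % (c, counts.argmax(), counts.max()))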
