notes

  • Now I know I should think in terms of column vectors, and TensorFlow is very picky about data shapes. But in numpy, a plain 1D ndarray already plays the role of a column vector. If I reshape $\mathbb{R}^n$ into $\mathbb{R}^{n\times1}$, it is no longer a column vector in the numpy sense: it's a matrix with one column, and that shape got me into trouble with the scipy optimizer (quick demo after these notes).
  • So I'll treat TensorFlow's layout as the special case and keep using the numpy convention everywhere else.
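
A quick illustration of the shape distinction (my own example, not part of the exercise code):

import numpy as np

v = np.zeros(3)       # shape (3,): a plain 1D ndarray, numpy's "column vector"
m = np.zeros((3, 1))  # shape (3, 1): a matrix with one column

print(v.shape, m.shape)  # (3,) (3, 1)
print((v + v).shape)     # (3,)
print((v + m).shape)     # (3, 3), a broadcasting surprise; this is the kind of trouble I mean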

In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('fivethirtyeight')

import sys
sys.path.append('..')

from helper import logistic_regression as lr  # my own module
from helper import general

from sklearn.metrics import classification_report

In [2]:
# prepare data
data = pd.read_csv('ex2data1.txt', names=['exam1', 'exam2', 'admitted'])
data.head()


Out[2]:
       exam1      exam2  admitted
0  34.623660  78.024693         0
1  30.286711  43.894998         0
2  35.847409  72.902198         0
3  60.182599  86.308552         1
4  79.032736  75.344376         1

In [3]:
X = general.get_X(data)
print(X.shape)

y = general.get_y(data)
print(y.shape)


(100, 3)
(100,)
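
For context, get_X and get_y come from my helper module. Judging from the shapes printed above, get_X prepends a column of ones (the intercept term) to the two exam-score features and get_y flattens the label column; a sketch of what they presumably do:

def get_X(df):
    """Feature matrix with a leading intercept column of ones -- shape (m, n)."""
    features = df.iloc[:, :-1].values  # drop the label column
    return np.column_stack([np.ones(len(df)), features])

def get_y(df):
    """Label column as a flat 1D vector -- shape (m,), per the notes above."""
    return df.iloc[:, -1].values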

sigmoid function
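
This is the standard logistic function $g(z) = \frac{1}{1 + e^{-z}}$; lr.sigmoid is presumably just that, applied elementwise. A minimal sketch:

def sigmoid(z):
    """Elementwise logistic function g(z) = 1 / (1 + exp(-z))."""
    return 1 / (1 + np.exp(-z))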


In [4]:
z = np.arange(-10, 10, step=0.01)

fig, ax = plt.subplots(figsize=(8, 6))
ax.plot(z, lr.sigmoid(z))
ax.set_ylim((-0.1,1.1))
ax.set_xlabel('z', fontsize=18)
ax.set_ylabel('g(z)', fontsize=18)
ax.set_title('sigmoid function', fontsize=18)


Out[4]:
<matplotlib.text.Text at 0x112c23278>

cost function

  • maximizing the log-likelihood is the same as minimizing its negation: $\max(\ell(\theta)) = \min(-\ell(\theta))$
  • so choose $-\ell(\theta)$, averaged over the training set, as the cost function (written out below)
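
With the hypothesis $h_\theta(x) = g(\theta^T x)$, the cost is

$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log h_\theta(x^{(i)}) + \left(1-y^{(i)}\right)\log\left(1-h_\theta(x^{(i)})\right)\right]$

A minimal sketch of what lr.cost presumably computes (the value $\ln 2 \approx 0.6931$ at $\theta = 0$ below is consistent with this form):

def cost(theta, X, y):
    """Average negative log-likelihood over the m training examples."""
    h = sigmoid(X @ theta)  # predicted probabilities, shape (m,)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))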


In [5]:
theta = np.zeros(3)  # X is (m, n), so theta gets shape (n,)
theta


Out[5]:
array([ 0.,  0.,  0.])

In [6]:
lr.cost(theta, X, y)


Out[6]:
0.69314718055994529

looking good; just be careful with the data shapes

gradient

  • this is the batch gradient
  • in vectorized form: $\frac{1}{m} X^T \left( \mathrm{sigmoid}(X\theta) - y \right)$ (sketch below)
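
A minimal sketch of what lr.gradient presumably implements, consistent with the Out[7] values at $\theta = 0$:

def gradient(theta, X, y):
    """Batch gradient of the cost: (1/m) * X^T (sigmoid(X @ theta) - y), shape (n,)."""
    m = len(y)
    return X.T @ (sigmoid(X @ theta) - y) / m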


In [7]:
lr.gradient(theta, X, y)


Out[7]:
array([ -0.1       , -12.00921659, -11.26284221])

fit the parameter

  • here I'm using scipy.optimize.minimize to find the parameters
  • the jac argument expects the Jacobian of fun; for a scalar cost function that is just the gradient vector, so lr.gradient is exactly what Newton-CG needs

In [8]:
import scipy.optimize as opt

In [9]:
res = opt.minimize(fun=lr.cost, x0=theta, args=(X, y), method='Newton-CG', jac=lr.gradient)

In [10]:
print(res)


     fun: 0.20349771447597362
     jac: array([  9.71219458e-06,   6.96602898e-04,   8.53722702e-04])
 message: 'Optimization terminated successfully.'
    nfev: 73
    nhev: 0
     nit: 29
    njev: 253
  status: 0
 success: True
       x: array([-25.17064404,   0.20630619,   0.20154693])

predict and validate from training set

we are evaluating the model on the training set here; that's not best practice, but the course has only just begun, and I expect Andrew will cover proper model validation later


In [11]:
final_theta = res.x
y_pred = lr.predict(X, final_theta)

print(classification_report(y, y_pred))


             precision    recall  f1-score   support

          0       0.87      0.85      0.86        40
          1       0.90      0.92      0.91        60

avg / total       0.89      0.89      0.89       100
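
lr.predict presumably just thresholds the predicted probability at 0.5; a minimal sketch:

def predict(X, theta):
    """Predict class 1 when the estimated probability g(X @ theta) is at least 0.5."""
    return (sigmoid(X @ theta) >= 0.5).astype(int)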

find the decision boundary

http://stats.stackexchange.com/questions/93569/why-is-logistic-regression-a-linear-classifier

$X\theta = 0$ is the boundary line: $g(z) = 0.5$ exactly when $z = 0$, so the two classes split along $\theta^T x = 0$


In [12]:
print(res.x) # this is final theta


[-25.17064404   0.20630619   0.20154693]

In [13]:
coef = -(res.x / res.x[2])  # solve theta0 + theta1*x1 + theta2*x2 = 0 for x2
print(coef)

x = np.arange(130, step=0.1)
y = coef[0] + coef[1]*x  # the boundary line: x2 = coef[0] + coef[1] * x1


[ 124.88725955   -1.02361365   -1.        ]
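
That is, solving $\theta_0 + \theta_1 x_1 + \theta_2 x_2 = 0$ for $x_2$ gives

$x_2 = -\frac{\theta_0}{\theta_2} - \frac{\theta_1}{\theta_2}\,x_1 \approx 124.887 - 1.024\,x_1$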

In [14]:
data.describe()  # find the range of x and y


Out[14]:
            exam1       exam2    admitted
count  100.000000  100.000000  100.000000
mean    65.644274   66.221998    0.600000
std     19.458222   18.582783    0.492366
min     30.058822   30.603263    0.000000
25%     50.919511   48.179205    0.000000
50%     67.032988   67.682381    1.000000
75%     80.212529   79.360605    1.000000
max     99.827858   98.869436    1.000000

the exam scores top out near 100, and the boundary line crosses the two axes at about 122 and 125, so limits of 0 to 130 on both axes will show everything


In [15]:
sns.set(context="notebook", style="ticks", font_scale=1.5)

sns.lmplot('exam1', 'exam2', hue='admitted', data=data, 
           size=6, 
           fit_reg=False, 
           scatter_kws={"s": 25}
          )

plt.plot(x, y, 'grey')
plt.xlim(0, 130)
plt.ylim(0, 130)
plt.title('Decision Boundary')


Out[15]:
<matplotlib.text.Text at 0x1150bbb70>
