Linear Regression using a TensorFlow dataflow graph

NOTE: this has nothing to do with neural networks.

This notebook demonstrates how to use TensorFlow's dataflow graph together with its gradient descent optimizer for linear regression.

Dataset and task overview:

Courtesy: Aerial Intelligence.

The dataset used is from this data science challenge: https://github.com/aerialintel/data-science-challenge

The task is to predict wheat yield.
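
TensorFlow 1.x separates building the dataflow graph from executing it: the cells below first declare placeholders, variables, and ops, and only a tf.Session actually runs them. A minimal sketch of that build-then-run pattern (using the same TF 1.x API as the rest of the notebook):

import tensorflow as tf

a = tf.placeholder("float")              # declare graph inputs; nothing is computed yet
b = tf.placeholder("float")
total = a + b                            # this only adds a node to the dataflow graph

with tf.Session() as sess:               # the graph runs only inside a session
    print(sess.run(total, feed_dict={a: 2.0, b: 3.0}))   # prints 5.0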


In [1]:
%%bash
# Download the dataset
mkdir -p dataset
# use || so the cell exits cleanly when the file is already present
[ -f dataset/wheat-2013-supervised.csv ] || wget https://aerialintel.blob.core.windows.net/recruiting/datasets/wheat-2013-supervised.csv -O dataset/wheat-2013-supervised.csv
# [ -f dataset/wheat-2014-supervised.csv ] || wget https://aerialintel.blob.core.windows.net/recruiting/datasets/wheat-2014-supervised.csv -O dataset/wheat-2014-supervised.csv

In [2]:
%matplotlib inline
import tensorflow as tf
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import display, HTML
import sys
import os
import time
current_milli_time = lambda: int(round(time.time() * 1000))



In [11]:
path = "dataset/wheat-2013-supervised.csv"
df = pd.read_csv(path, sep=',')
df = df[:10000]   # use only the first 10,000 rows for quick experimentation
display(df[0:5])

train_X = df.iloc[:, range(5,15)] # columns 5-14: basic features
train_Y = df.iloc[:, [25]]        # column 25: Yield (the label)


# Standardize each feature column to zero mean and unit variance
means = train_X.mean()
stds = train_X.std()
train_X = (train_X - means) / stds


CountyName State Latitude Longitude Date apparentTemperatureMax apparentTemperatureMin cloudCover dewPoint humidity ... precipTypeIsOther pressure temperatureMax temperatureMin visibility windBearing windSpeed NDVI DayInSeason Yield
0 Adams Washington 46.811686 -118.695237 11/30/2013 0:00 35.70 20.85 0.00 29.53 0.91 ... 0 1027.13 35.70 27.48 2.46 214 1.18 134.110657 0 35.7
1 Adams Washington 46.929839 -118.352109 11/30/2013 0:00 35.10 26.92 0.00 29.77 0.93 ... 0 1026.87 35.10 26.92 2.83 166 1.01 131.506592 0 35.7
2 Adams Washington 47.006888 -118.510160 11/30/2013 0:00 33.38 26.95 0.00 29.36 0.94 ... 0 1026.88 33.38 26.95 2.95 158 1.03 131.472946 0 35.7
3 Adams Washington 47.162342 -118.699677 11/30/2013 0:00 28.05 25.93 0.91 29.47 0.94 ... 0 1026.37 33.19 27.17 2.89 153 1.84 131.288300 0 35.7
4 Adams Washington 47.157512 -118.434056 11/30/2013 0:00 28.83 25.98 0.91 29.86 0.94 ... 0 1026.19 33.85 27.07 2.97 156 1.85 131.288300 0 35.7

5 rows × 26 columns
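
The features are standardized column-wise to zero mean and unit standard deviation; the raw columns live on very different scales (pressure around 1000 vs. cloudCover between 0 and 1), and gradient descent converges much more reliably on standardized inputs. If the model were later applied to held-out data (for example the 2014 file), the training means and stds would have to be reused. A minimal sketch, where new_df is a hypothetical held-out frame with the same column layout:

new_X = new_df.iloc[:, range(5, 15)]   # same feature columns as above (new_df is hypothetical)
new_X = (new_X - means) / stds         # reuse the statistics computed on the training set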


In [12]:
# Convert the pandas frames to plain numpy arrays
train_X = train_X.values
train_Y = train_Y.values

In [13]:
train_X.shape, train_Y.shape


Out[13]:
((10000, 10), (10000, 1))
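
With 10,000 records and 10 attributes, the model fitted below is a single affine map: Y_hat = X·W + b, where W is a (10, 1) column of weights and b is a scalar bias broadcast across rows. A quick numpy sketch of the same shape arithmetic (W0 and b0 are hypothetical values, not the learned parameters):

W0 = np.random.randn(train_X.shape[1], 1)   # one weight per attribute -> shape (10, 1)
b0 = 0.0                                    # scalar bias, broadcast over all rows
Y_hat = train_X.dot(W0) + b0                # (10000, 10) . (10, 1) -> (10000, 1)
print(Y_hat.shape)                          # (10000, 1)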

In [16]:
n_recs, n_attrs = train_X.shape

## TensorFlow graph for linear regression
X = tf.placeholder("float", [None, n_attrs])
Y = tf.placeholder("float", [None, 1])

W = tf.Variable(tf.zeros([n_attrs, 1]))
b = tf.Variable(tf.zeros([1]))

Y_ = tf.matmul(X, W) + b

learning_rate = 0.001

# Square root of the *sum* of squared errors (not a true RMSE: the mean is
# not taken, but the minimizer is the same)
cost = tf.sqrt(tf.reduce_sum(tf.pow(Y_ - Y, 2)))

# Gradient descent
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)

num_epochs = 20

progress_delay = 10 * 1000   # report progress at most every 10 seconds (milliseconds)
last_time = current_milli_time()
conv_tol = 0.1               # stop once the cost changes by less than this between reports

training_cost = float('inf') 
with tf.Session() as sess:
    sess.run(tf.initialize_all_variables())
    for e in range(num_epochs):
        for x, y in zip(train_X, train_Y):
            sess.run(optimizer, feed_dict={X: np.array([x]), Y: np.array([y])})
        if (current_milli_time() - last_time > progress_delay):
            last_time = current_milli_time()
            new_cost = sess.run(cost, feed_dict={X: train_X, Y: train_Y})
            print "Epoch =", e," Training cost=", new_cost, " W=", sess.run(W).flatten(), " b=", sess.run(b)
            if abs(new_cost - training_cost) < conv_tol:
                print("Converged...")
                break
            training_cost = new_cost
            
    print ("Optimization finished.")
    training_cost = sess.run(cost, feed_dict={X: train_X, Y: train_Y})
    print "Training cost=", training_cost, " W=", sess.run(W).flatten(), " b=", sess.run(b)
    
    # predictions
    preds = sess.run(Y_, feed_dict={X: train_X, Y: train_Y})
    
    # Graphic display: the features are 10-dimensional, so a single fitted line
    # over X cannot be drawn; plot actual vs. predicted yield by record index instead
    plt.plot(train_Y[0:1000], 'ro', label='Actual yield')
    plt.plot(preds[0:1000], label='Predicted yield')
    plt.legend()
    plt.show()


Epoch = 3  Training cost= 1662.13  W= [-1.73535001 -1.56158304  0.64364094 -1.29775858  1.20920122  0.9089275
 -0.01618723  0.64039057 -0.28188893  2.25604653]  b= [ 24.02242661]
Epoch = 7  Training cost= 1471.98  W= [-1.62460577 -1.13254797 -0.87408513 -1.01908696  2.39189243  1.72627711
 -0.36256787  0.77363473 -0.4994058   2.78240705]  b= [ 29.18358994]
Epoch = 11  Training cost= 1450.04  W= [-1.71049213 -0.57524407 -1.26913095 -0.89953929  2.89991808  2.20356965
 -0.66515845  0.73559988 -0.8414036   2.69298577]  b= [ 30.40691757]
Epoch = 15  Training cost= 1446.82  W= [-1.91022635 -0.19468179 -1.3106643  -1.00933588  3.00771952  2.44481564
 -0.83346182  0.69910026 -0.97608459  2.52923226]  b= [ 30.65278244]
Epoch = 19  Training cost= 1445.74  W= [-2.08038926  0.11862925 -1.34683442 -1.13047743  3.06448889  2.67980576
 -0.93724048  0.7088002  -1.07999873  2.44138741]  b= [ 30.73073959]
Optimization finished.
Training cost= 1445.74  W= [-2.08038926  0.11862925 -1.34683442 -1.13047743  3.06448889  2.67980576
 -0.93724048  0.7088002  -1.07999873  2.44138741]  b= [ 30.73073959]
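
Because the cost above is (up to the square root) an ordinary least-squares objective, the gradient-descent result can be sanity-checked against the closed-form least-squares fit. A minimal sketch using plain numpy, assuming train_X and train_Y are still the arrays prepared earlier (the appended column of ones absorbs the bias term):

X_aug = np.hstack([train_X, np.ones((train_X.shape[0], 1))])  # append a bias column
theta = np.linalg.lstsq(X_aug, train_Y)[0]                    # closed-form least-squares solution
W_ls, b_ls = theta[:-1], theta[-1]
print(W_ls.flatten())
print(b_ls)
rss = np.sum((X_aug.dot(theta) - train_Y) ** 2)
print(np.sqrt(rss))   # comparable to the "Training cost" printed above

If the two solutions roughly agree, the graph, the feed_dict plumbing, and the optimizer are doing what we expect; a large gap usually means the learning rate or the number of epochs needs tuning.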
