Establishing a Baseline for the Problem

Using a variety of non-linear regression algorithms



In [1]:

    
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from math import sqrt

import pprint
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import cross_val_score
from sklearn import metrics

from sklearn.svm import SVR

%matplotlib inline



In [2]:

    
# importing the dataset we prepared and saved using Baseline 1 Notebook
ricep = pd.read_csv("/Users/macbook/Documents/BTP/Notebook/BTP/ricep.csv")
ricep.head()









    Out[2]:

   Unnamed: 0  State_Name      ind_district  Crop_Year  Season  Crop  Area     Production  phosphorus  X1        X2        X3       X4
0  15          Andhra Pradesh  anantapur     1999       kharif  Rice  37991.0  105082.0    0.0         96800.0   75400.0   643.720  881.473
1  16          Andhra Pradesh  anantapur     2000       kharif  Rice  39905.0  117680.0    0.0         105082.0  96800.0   767.351  643.720
2  17          Andhra Pradesh  anantapur     2001       kharif  Rice  32878.0  95609.0     0.0         117680.0  105082.0  579.338  767.351
3  18          Andhra Pradesh  anantapur     2002       kharif  Rice  29066.0  66329.0     0.0         95609.0   117680.0  540.070  579.338
4  21          Andhra Pradesh  anantapur     2005       kharif  Rice  25008.0  69972.0     0.0         85051.0   44891.0   819.700  564.500



In [3]:

    
ricep = ricep.drop(["Unnamed: 0"],axis=1)
ricep["phosphorus"] = ricep["phosphorus"]*10



In [4]:

    
# Yield: production per unit area
ricep["value"] = ricep["Production"]/ricep["Area"]



In [5]:

    
X = ricep[["X1","X2","X3","X4","phosphorus"]]
y = ricep[["value"]]*1000



In [6]:

    
# Z-score normalization (alternatively, try scikit-learn's built-in scaling:
# the normalize flag in older releases, StandardScaler in current ones)

# Work on an explicit copy so pandas does not raise SettingWithCopyWarning
X = X.copy()
cols = list(X.columns)
for col in cols:
    col_zscore = col + '_zscore'
    X[col_zscore] = (X[col] - X[col].mean())/X[col].std(ddof=0)
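The normalization step can also be packaged as a small helper that always works on an explicit copy, which is one way to keep pandas from raising SettingWithCopyWarning. This is only a sketch: the helper name `add_zscore_columns` and the small demo frame (built from the first four X1 and X3 values in the ricep preview) are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

def add_zscore_columns(frame, columns):
    """Return a copy of `frame` with one '<col>_zscore' column per input column."""
    out = frame.copy()  # explicit copy, so the caller's frame is never modified
    for col in columns:
        # Population standard deviation (ddof=0), as in the notebook's loop
        out[col + "_zscore"] = (out[col] - out[col].mean()) / out[col].std(ddof=0)
    return out

# Hypothetical demo frame: first four X1 and X3 values from the ricep preview
demo = pd.DataFrame({
    "X1": [96800.0, 105082.0, 117680.0, 95609.0],
    "X3": [643.720, 767.351, 579.338, 540.070],
})
scaled = add_zscore_columns(demo, ["X1", "X3"])
print(scaled.round(3))
```

Each `_zscore` column then has mean 0 and population standard deviation 1, while `demo` itself is left untouched.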



In [7]:

    
X_ = X[["X1_zscore", "X2_zscore", "X3_zscore", "X4_zscore", "phosphorus_zscore"]]
X_.head()









    Out[7]:

   X1_zscore  X2_zscore  X3_zscore  X4_zscore  phosphorus_zscore
0  -0.285176  -0.374714  -0.457800   0.021735          -0.837691
1  -0.247120  -0.276111  -0.198113  -0.496827          -0.837691
2  -0.189232  -0.237950  -0.593035  -0.227176          -0.837691
3  -0.290648  -0.179903  -0.675518  -0.637250          -0.837691
4  -0.339162  -0.515288  -0.088153  -0.669613          -0.837691



In [8]:

    
X_train, X_test, y_train, y_test = train_test_split(X_, y, test_size=0.2, random_state=1)
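The hold-out split above is created but the rest of the notebook scores models with cross-validation only. As a rough sketch of how such a split could be used, the block below fits an SVR on the training part and reports RMSE on the test part. The data here is synthetic (`make_regression`), standing in for `X_` and `y`, and the hyperparameters are simply borrowed from a later cell; none of the numbers it prints relate to the rice data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Synthetic stand-in for (X_, y); the real notebook data is not reproduced here
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)

# Fit on the 80% training part only
model = SVR(kernel="rbf", C=1000.0, epsilon=0.1, gamma=0.027)
model.fit(X_tr, y_tr)

# RMSE on the 20% that the model has never seen
holdout_rmse = np.sqrt(mean_squared_error(y_te, model.predict(X_te)))
print("Hold-out RMSE:", holdout_rmse)
```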

First, check the average RMSE for Linear Regression with 5-fold cross-validation



In [9]:

    
clf = LinearRegression()
scores = cross_val_score(clf, X_, y, cv=5, scoring='neg_mean_squared_error')
for i in range(0,5):
    scores[i] = sqrt(-1*scores[i])
    
print(scores)
avg_rmse = scores.mean()
print("\n\nAvg RMSE is ",scores.mean())









    



[ 1030.92314374  1109.37929379   972.36266895  1487.52744177   491.48595541]


Avg RMSE is  1018.33570073
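The per-fold conversion from negative MSE to RMSE can be written without the explicit loop. The sketch below, on synthetic data rather than the rice data, shows the vectorized form and, assuming scikit-learn 0.22 or newer, the `neg_root_mean_squared_error` scorer that returns the same values directly. It also illustrates that averaging per-fold RMSE (what the notebook reports) is not the same as taking the square root of the average MSE.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for (X_, y)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Same scorer as the notebook, converted in one vectorized step
neg_mse = cross_val_score(LinearRegression(), X_demo, y_demo, cv=5,
                          scoring="neg_mean_squared_error")
rmse_from_mse = np.sqrt(-neg_mse)

# Scorer that already returns (negated) RMSE per fold
neg_rmse = cross_val_score(LinearRegression(), X_demo, y_demo, cv=5,
                           scoring="neg_root_mean_squared_error")
rmse_direct = -neg_rmse

mean_of_fold_rmse = rmse_from_mse.mean()        # what the notebook calls "Avg RMSE"
root_of_mean_mse = np.sqrt(np.mean(-neg_mse))   # never smaller than the line above

print(rmse_from_mse)
print(mean_of_fold_rmse, root_of_mean_mse)
```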






    



/usr/local/lib/python3.6/site-packages/scipy/linalg/basic.py:1018: RuntimeWarning: internal gelsd driver lwork query error, required iwork dimension not returned. This is likely the result of LAPACK bug 0038, fixed in LAPACK 3.2.2 (released July 21, 2010). Falling back to 'gelss' driver.
  warnings.warn(mesg, RuntimeWarning)




Epsilon-Support Vector Regression (SVR)

RBF Kernel



In [10]:

    
# 5 Fold CV, to calculate avg RMSE
clf = SVR(C=500000.0, epsilon=0.1, kernel='rbf', gamma=0.0008)
scores = cross_val_score(clf, X_, y.values.ravel(), cv=5, scoring='neg_mean_squared_error')
for i in range(0,5):
    scores[i] = sqrt(-1*scores[i])



In [11]:

    
print(scores)
avg_rmse = scores.mean()
print("\n\nAvg RMSE is ",scores.mean())









    



[  904.09013921   940.99998887   981.97853142  1616.00179024   568.93419484]


Avg RMSE is  1002.40092892
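The C and gamma values above are fixed by hand. One possible way to pick them more systematically is a small grid search under the same 5-fold scheme. The sketch below uses synthetic data and an arbitrary, hypothetical grid; it is not the procedure used to obtain the values in this notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Synthetic stand-in for (X_, y)
X_demo, y_demo = make_regression(n_samples=150, n_features=5, noise=10.0, random_state=0)

# Hypothetical grid; the ranges are illustrative only
param_grid = {"C": [10.0, 1000.0], "gamma": [0.001, 0.027, 0.1]}

search = GridSearchCV(SVR(kernel="rbf", epsilon=0.1), param_grid,
                      cv=5, scoring="neg_mean_squared_error")
search.fit(X_demo, y_demo)

# best_score_ is the mean negative MSE over folds, so this is the
# root of the mean MSE, not the mean of per-fold RMSE values
best_rmse = np.sqrt(-search.best_score_)
print(search.best_params_, best_rmse)
```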



In [12]:

    
# Just the 4 original features (no soil data)
X_old = X[["X1_zscore", "X2_zscore", "X3_zscore", "X4_zscore"]]



In [13]:

    
# 5 Fold CV, to calculate avg RMSE
clf = SVR(C=1000.0, epsilon=0.1, kernel='rbf', gamma=0.027)
scores = cross_val_score(clf, X_old, y.values.ravel(), cv=5, scoring='neg_mean_squared_error')
for i in range(0,5):
    scores[i] = sqrt(-1*scores[i])

print(scores)
avg_rmse = scores.mean()
print("\n\nAvg RMSE is ",scores.mean())









    



[  903.93008696   753.88394413   765.69751566  1574.251674     636.95214188]


Avg RMSE is  926.943072526

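SVR (RBF kernel, 4 features): average RMSE about 927

LR (5 features): average RMSE about 1018

SVR with an RBF kernel works better than Linear Regression.

Also, the soil feature (phosphorus content) does more harm than good for now: with it included, the SVR average RMSE was about 1002, versus about 927 without it. Note that the two SVR runs also use different C and gamma values, so the comparison is not fully controlled.

Let's check the importance of the rain data.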



In [14]:

    
# Just 2 features (no rain data)
X_nr = X[["X1_zscore", "X2_zscore"]]



In [15]:

    
# 5 Fold CV, to calculate avg RMSE
clf = SVR(C=1000.0, epsilon=0.1, kernel='rbf', gamma=0.027)
scores = cross_val_score(clf, X_nr, y.values.ravel(), cv=5, scoring='neg_mean_squared_error')
for i in range(0,5):
    scores[i] = sqrt(-1*scores[i])

print(scores)
avg_rmse = scores.mean()
print("\n\nAvg RMSE is ",scores.mean())









    



[ 1039.57563055   863.77364865   944.40471     1492.31174906   672.96822263]


Avg RMSE is  1002.60679218

The rain data does help: without X3 and X4, the average RMSE rises from about 927 to about 1003.
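Feature-subset comparisons like this one are easiest to read when every subset is scored with the same estimator settings and the same folds. A minimal sketch of such a comparison follows. The helper `cv_rmse`, the synthetic frame, and the subset labels are assumptions for illustration; the synthetic target is constructed so that X3 matters, which is why dropping it hurts here.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

def cv_rmse(estimator, features, target, folds=5):
    """Mean of per-fold RMSE, computed the same way as in the notebook."""
    neg_mse = cross_val_score(estimator, features, target, cv=folds,
                              scoring="neg_mean_squared_error")
    return np.sqrt(-neg_mse).mean()

# Synthetic stand-in with the notebook's column names
rng = np.random.default_rng(0)
names = ["X1_zscore", "X2_zscore", "X3_zscore", "X4_zscore"]
demo = pd.DataFrame(rng.normal(size=(200, 4)), columns=names)
target = 3.0 * demo["X1_zscore"] + 2.0 * demo["X3_zscore"] + rng.normal(scale=0.1, size=200)

# Hypothetical subset labels mirroring the notebook's experiments
subsets = {
    "with rain": names,
    "no rain": ["X1_zscore", "X2_zscore"],
}

# One estimator configuration for every subset (cross_val_score clones it per fold)
svr = SVR(kernel="rbf", C=1000.0, epsilon=0.1, gamma=0.027)
results = {label: cv_rmse(svr, demo[cols], target) for label, cols in subsets.items()}
print(results)
```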

Let's try SVR with other kernels.

Degree 3 Polynomial



In [16]:

    
# 5 Fold CV, to calculate avg RMSE
clf = SVR(kernel='poly', gamma='auto', degree=3, coef0=2)
scores = cross_val_score(clf, X_old, y.values.ravel(), cv=5, scoring='neg_mean_squared_error')
for i in range(0,5):
    scores[i] = sqrt(-1*scores[i])
    
print(scores)
avg_rmse = scores.mean()
print("\n\nAvg RMSE is ",scores.mean())









    



[  906.20976415   837.77643762  1049.76326739  1568.88777167   504.49443066]


Avg RMSE is  973.426334297

The polynomial kernel (degree 3) also does better than Linear Regression: an average RMSE of about 973 versus about 1018.
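For reference, scikit-learn's polynomial kernel is K(x, x') = (gamma * <x, x'> + coef0) ** degree, and `gamma='auto'` resolves to 1 / n_features. The sketch below checks that formula on random data against `sklearn.metrics.pairwise.polynomial_kernel`, using the same `degree=3` and `coef0=2` as the cell above; the matrix `A` is an arbitrary stand-in for four scaled features.

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))      # 6 samples, 4 features (stand-in for X_old)

degree, coef0 = 3, 2
gamma = 1.0 / A.shape[1]         # value used when gamma='auto': 1 / n_features

# Kernel matrix written out by hand
manual = (gamma * (A @ A.T) + coef0) ** degree

# Same matrix from scikit-learn's pairwise helper
library = polynomial_kernel(A, A, degree=degree, gamma=gamma, coef0=coef0)

print(np.allclose(manual, library))
```

Raising `degree` from 3 to 4 only changes the exponent in this expression, which is the single difference between the two polynomial runs.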

Degree 4 Polynomial



In [17]:

    
# 5 Fold CV, to calculate avg RMSE
clf = SVR(kernel='poly', gamma='auto', degree=4, coef0=2)
scores = cross_val_score(clf, X_old, y.values.ravel(), cv=5, scoring='neg_mean_squared_error')
for i in range(0,5):
    scores[i] = sqrt(-1*scores[i])
    
print(scores)
avg_rmse = scores.mean()
print("\n\nAvg RMSE is ",scores.mean())









    



[  907.10874357   787.20784909   848.64917648  1570.06140194   557.83575489]


Avg RMSE is  934.172585194
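To compare all runs at a glance, the per-fold RMSE values printed in the cells above can be collected and ranked. The fold scores below are copied from the notebook outputs; only the dictionary labels are added here for readability.

```python
# Per-fold RMSE values copied from the outputs above
fold_rmse = {
    "Linear Regression, 5 features": [1030.92314374, 1109.37929379, 972.36266895, 1487.52744177, 491.48595541],
    "SVR RBF, 5 features": [904.09013921, 940.99998887, 981.97853142, 1616.00179024, 568.93419484],
    "SVR RBF, 4 features": [903.93008696, 753.88394413, 765.69751566, 1574.251674, 636.95214188],
    "SVR RBF, 2 features (no rain)": [1039.57563055, 863.77364865, 944.40471, 1492.31174906, 672.96822263],
    "SVR poly degree 3, 4 features": [906.20976415, 837.77643762, 1049.76326739, 1568.88777167, 504.49443066],
    "SVR poly degree 4, 4 features": [907.10874357, 787.20784909, 848.64917648, 1570.06140194, 557.83575489],
}

# Average RMSE per model, then rank from best (lowest) to worst
averages = {name: sum(scores) / len(scores) for name, scores in fold_rmse.items()}
ranking = sorted(averages, key=averages.get)

for name in ranking:
    print(f"{name:32s} {averages[name]:8.1f}")
```

In every run the fourth fold has by far the largest error (roughly 1490 to 1620), so the averages are dominated by that fold regardless of the model.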


