TxDOT Project



In [ ]:

    
tree1 = tree.DecisionTreeClassifier( criterion ='entropy',random_state = 0) 
fittedtree=tree1.fit( X_train, y_train) 

#print metrics.confusion_matrix(y_train, fittedtree.predict(X_train))
#print metrics.accuracy_score(y_train, fittedtree.predict(X_train))
cross_validation.cross_val_score(tree1, X_train, y_train, cv = 10).mean()

#tree.export_graphviz(fittedtree,  out_file =' tree.dot',  feature_names =['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width'])



In [1]:

    
import os
import math
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt

import csv
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn import feature_selection, linear_model

Load in the Data



In [2]:

    
df = pd.read_csv('C:/Users/Collin/Documents/Python_Projects/TxDOT/data/Bid_Data.csv')
len(df)









    



C:\Users\Collin\Anaconda2\lib\site-packages\IPython\core\interactiveshell.py:2902: DtypeWarning: Columns (27,32) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)






    Out[2]:





32447

Formatting & Identifying the Data



In [3]:

    
df = df[df['Rank'] == 1]
df = df[df['Type'] == 'Construction']

df = df.drop(' From', 1)
df = df.drop('To', 1)
df = df.drop('Contract Description', 1)
df = df.drop('Contractor', 1)
df = df.drop('Contract Category', 1)
df = df.drop('LNG MON', 1)
df = df.drop('MONTH', 1)

df['Award Amount'] = df['Award Amount'].str.lstrip('$')
df['Engineers Estimate'] = df['Engineers Estimate'].str.lstrip('$')
df['Award Amount'] = df['Award Amount'].str.replace(',','').astype(float)
df['Engineers Estimate'] = df['Engineers Estimate'].str.replace(',','').astype(float)

#Renaming Variables
df['EngEst'] = df['Engineers Estimate']
df['NBidders'] = df['Number of Bidders']
df['Date'] = pd.to_datetime(df['Letting Date'])
df.set_index('Date' , inplace=True)
df['Year'] = df.index.year
df['Month'] = df.index.month
df['WinBid'] = df['Award Amount']

# Creating New Varialbes
df['Diff'] = df['EngEst'] - df['WinBid']
df['lnWinBid'] = np.log(df['WinBid'])
df['lnEngEst'] = np.log(df['EngEst'])
df['DiffLn'] = df['lnWinBid'] - df['lnEngEst']
df['Within10Percent'] = 1 
df['PercentOff'] = df['Diff'] / df['EngEst']
df['MoreOrLessThan10'] = 0
df['LessThan10'] = 0
df['MoreThan10'] = 0

df.loc[(df.PercentOff > .10) , 'Within10Percent'] = 0
df.loc[(df.PercentOff < -.10) , 'Within10Percent'] = 0
df.loc[(df.PercentOff > .10) , 'MoreOrLessThan10'] = 1
df.loc[(df.PercentOff < -.10) , 'MoreOrLessThan10'] = 2
df.loc[(df.PercentOff > .10) , 'MoreThan10'] = 1
df.loc[(df.PercentOff < -.10) , 'LessThan10'] = 1

print len(df)

Exploring the Data

We see that Bids are Log Normally Distributed



In [4]:

    
sns.jointplot(x="EngEst", y="WinBid", data=df, kind="reg"); sns.jointplot(x="lnEngEst", y="lnWinBid", data=df, kind="reg");

Because the TxDot Estimates are so highly correlated with the winning bid, we can use it as a baseline for the model predictions

My idea was to classify the TxDot Estimates according to their scoring method of estimating within 10% of the winning bid and then having 3 separate regressive models for the 3 classes (within 10%, more than 10%, less than 10%).

TxDot's goal for their estimates is to have 55% of their estimates within 10% of the winning bid.

TxDot considers bids over or under the estimate by more than 10% as bad

We can then use this score to judge our model predictions



In [6]:

    
#Using ALL the Data

Percent = float(df.Within10Percent.sum()) / len(df)
print  (Percent)*100 , '% of All the TxDOT estimates were within 10% of actual bid'

Percent_April_2016 = float(df[(df.Year == 2016) & (df.Month == 4)].Within10Percent.sum()) / len(df_test)
print  (Percent_April_2016)*100 , '% of the April 2016 TxDOT estimates were within 10% of actual bid'









    



50.3380336102 % of All the TxDOT estimates were within 10% of actual bid
46.2686567164 % of the April 2016 TxDOT estimates were within 10% of actual bid

We now need to build and train a model that can classify a bid as over or under the TxDOT estimate

Here We can see how this classification might look



In [7]:

    
cmap = {'0': 'g', '1': 'r', '2': 'b' }
df['cMoreOrLessThan10'] = df.MoreOrLessThan10.apply(lambda x: cmap[str(x)])
print df.plot('EngEst', 'WinBid', kind='scatter', c=df.cMoreOrLessThan10)









    



Axes(0.125,0.125;0.775x0.775)



In [8]:

    
cmap = {'0': 'g', '1': 'r', '2': 'b' }
df['cMoreOrLessThan10'] = df.MoreOrLessThan10.apply(lambda x: cmap[str(x)])
print df.plot('lnEngEst', 'lnWinBid', kind='scatter', c=df.cMoreOrLessThan10)









    



Axes(0.125,0.125;0.775x0.775)

1) Splitting the Data into Training and Testing Sets



In [5]:

    
df_test = df[(df.Year == 2016) & (df.Month == 4)]
print len(df_test) , 'projects in April 2016'

df_train = df[(df.Year != 2016) | (df.Month != 4)]
print len(df_train) ,'projects from Jan 2010 to April 2016'

#df_train[['Year','Month']].tail()









    



67 projects in April 2016
5110 projects from Jan 2010 to April 2016



In [242]:

    
#df_test.columns

This will split the training and testing sets into a more useful format for loading into models



In [9]:

    
#names_x = ['Length','Year','Month','lnEngEst','Time', 'Highway', 'District', 'County' ]
names_x = ['Length','Year','Month','lnEngEst','Time', 'NBidders']
names_y = ['MoreOrLessThan10']

df_Train_x = df_train[ names_x ]
df_Train_y = df_train[ names_y ]
df_Test_x = df_test[ names_x ]
df_Test_y = df_test[ names_y ]

df_Train_x.head()



In [260]:

    
test_X.head()

Classifying using Decision Tree



In [10]:

    
from sklearn import tree
clf = tree.DecisionTreeClassifier()
clf = clf.fit(df_Train_x, df_Train_y)



In [12]:

    
clf.predict(df_Test_x)









    Out[12]:





array([0, 2, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 2, 1, 2, 2, 2,
       2, 1, 2, 1, 2, 0, 0, 0, 0, 2, 1, 2, 1, 0, 1, 1, 0, 1, 1, 2, 2, 1, 2,
       1, 0, 1, 1, 2, 2, 2, 1, 1, 0, 0, 2, 1, 1, 0, 0, 0, 1, 1, 0, 1], dtype=int64)



In [13]:

    
clf.score(df_Test_x, df_Test_y)









    Out[13]:





0.41791044776119401

Model given Wining Bid is within 10% of TxDot Estimate



In [274]:

    
#subsets the training data to just those who were within 10% of the TxDot Estimate
df_train_within = df[df.Within10Percent == 1]

model_3 = smf.ols(formula = 'lnWinBid ~ lnEngEst+Year+Month+Year*Month+NBidders+NBidders*Year+Time', data = df_train_within).fit()
model_3.summary()









    Out[274]:





OLS Regression Results

  Dep. Variable:         lnWinBid        R-squared:             0.999 


  Model:                    OLS          Adj. R-squared:        0.999 


  Method:              Least Squares     F-statistic:        2.592e+05


  Date:              Wed, 01 Jun 2016    Prob (F-statistic):     0.00  


  Time:                  17:52:08        Log-Likelihood:       3907.2 


  No. Observations:         2606         AIC:                  -7798. 


  Df Residuals:             2598         BIC:                  -7752. 


  Df Model:                    7                                      


  Covariance Type:       nonrobust                                    




                   coef      std err       t       P>|t|  [95.0% Conf. Int.] 


  Intercept         -3.2112      3.597     -0.893   0.372    -10.265     3.843


  lnEngEst           0.9956      0.001    956.110   0.000      0.994     0.998


  Year               0.0016      0.002      0.909   0.363     -0.002     0.005


  Month              0.2336      0.364      0.641   0.521     -0.481     0.948


  Year:Month        -0.0001      0.000     -0.641   0.522     -0.000     0.000


  NBidders          -0.2604      0.545     -0.478   0.633     -1.328     0.808


  NBidders:Year      0.0001      0.000      0.471   0.638     -0.000     0.001


  Time            4.125e-05   8.94e-06      4.613   0.000   2.37e-05  5.88e-05




  Omnibus:        576.150    Durbin-Watson:         1.925


  Prob(Omnibus):   0.000     Jarque-Bera (JB):    121.613


  Skew:            0.183     Prob(JB):           3.91e-27


  Kurtosis:        2.007     Cond. No.           5.79e+07

Predicting



In [275]:

    
df_test.loc[:,'Lnprediction_within'] = model_3.predict(test_X)
df_test.Lnprediction_within.head()









    Out[275]:





Date
2016-04-05    14.972721
2016-04-06    14.706426
2016-04-06    16.814502
2016-04-06    14.883813
2016-04-06    13.832213
Name: Lnprediction_within, dtype: float64

Model given Wining Bid is More Than 10% of TxDot Estimate



In [276]:

    
#subsets the training data to just those who were more than 10% of the TxDot Estimate
df_train_more = df[df.MoreThan10 == 1]

model_4 = smf.ols(formula = 'lnWinBid ~ lnEngEst+Year+Month+Year*Month+NBidders+NBidders*Year+Time', data = df_train_more).fit()
model_4.summary()









    Out[276]:





OLS Regression Results

  Dep. Variable:         lnWinBid        R-squared:             0.992 


  Model:                    OLS          Adj. R-squared:        0.992 


  Method:              Least Squares     F-statistic:        2.733e+04


  Date:              Wed, 01 Jun 2016    Prob (F-statistic):     0.00  


  Time:                  17:52:22        Log-Likelihood:       1033.0 


  No. Observations:         1488         AIC:                  -2050. 


  Df Residuals:             1480         BIC:                  -2008. 


  Df Model:                    7                                      


  Covariance Type:       nonrobust                                    




                   coef      std err       t       P>|t|  [95.0% Conf. Int.] 


  Intercept        -13.2256     10.209     -1.296   0.195    -33.250     6.799


  lnEngEst           1.0004      0.003    332.010   0.000      0.995     1.006


  Year               0.0065      0.005      1.273   0.203     -0.003     0.016


  Month             -0.5925      0.938     -0.632   0.528     -2.432     1.247


  Year:Month         0.0003      0.000      0.633   0.527     -0.001     0.001


  NBidders           3.0432      1.479      2.057   0.040      0.141     5.945


  NBidders:Year     -0.0015      0.001     -2.060   0.040     -0.003 -7.25e-05


  Time             8.62e-05   2.87e-05      3.000   0.003   2.98e-05     0.000




  Omnibus:        744.819    Durbin-Watson:         1.877


  Prob(Omnibus):   0.000     Jarque-Bera (JB):   5533.375


  Skew:           -2.224     Prob(JB):               0.00


  Kurtosis:       11.335     Cond. No.           5.77e+07

Predicting



In [277]:

    
df_test.loc[:,'Lnprediction_more'] = model_4.predict(test_X)

Model given Wining Bid is Less Than 10% of TxDot Estimate



In [278]:

    
#subsets the training data to just those who were less than 10% of the TxDot Estimate
df_train_less = df[df.LessThan10 == 1]

model_5 = smf.ols(formula = 'lnWinBid ~ lnEngEst+Year+Month+Year*Month+NBidders+NBidders*Year+Time', data = df_train_less).fit()
model_5.summary()









    Out[278]:





OLS Regression Results

  Dep. Variable:         lnWinBid        R-squared:             0.995 


  Model:                    OLS          Adj. R-squared:        0.995 


  Method:              Least Squares     F-statistic:        2.895e+04


  Date:              Wed, 01 Jun 2016    Prob (F-statistic):     0.00  


  Time:                  17:52:27        Log-Likelihood:       1000.8 


  No. Observations:         1083         AIC:                  -1986. 


  Df Residuals:             1075         BIC:                  -1946. 


  Df Model:                    7                                      


  Covariance Type:       nonrobust                                    




                   coef      std err       t       P>|t|  [95.0% Conf. Int.] 


  Intercept        -33.7856     10.879     -3.106   0.002    -55.131   -12.440


  lnEngEst           0.9719      0.003    310.603   0.000      0.966     0.978


  Year               0.0171      0.005      3.161   0.002      0.006     0.028


  Month              0.3353      1.180      0.284   0.776     -1.981     2.651


  Year:Month        -0.0002      0.001     -0.283   0.777     -0.001     0.001


  NBidders           6.5230      1.825      3.574   0.000      2.942    10.104


  NBidders:Year     -0.0032      0.001     -3.581   0.000     -0.005    -0.001


  Time               0.0001      3e-05      3.943   0.000   5.95e-05     0.000




  Omnibus:        473.873    Durbin-Watson:         1.994


  Prob(Omnibus):   0.000     Jarque-Bera (JB):   3314.246


  Skew:            1.872     Prob(JB):               0.00


  Kurtosis:       10.709     Cond. No.           5.76e+07

Predicting



In [279]:

    
df_test.loc[:,'Lnprediction_less'] = model_5.predict(test_X)

Our Model Prediction of Winning Bid will be a Weighted average of Bid predictions with weights being the probability of being classified as such class found from Logistic Regression



In [280]:

    
#df_test.columns

df_test['Hyp_More'] = 1 df_test['Hyp_Less'] = 1 df_test['Hyp_Within'] = 1



In [283]:

    
df_test[['p_more', 'p_less', 'p_within', 'Lnprediction_within', 'Lnprediction_more', 'Lnprediction_less']].head()









    Out[283]:






  
    
      
      p_more
      p_less
      p_within
      Lnprediction_within
      Lnprediction_more
      Lnprediction_less
    
    
      Date
      
      
      
      
      
      
    
  
  
    
      2016-04-05
      0.260327
      0.244116
      0.495556
      14.972721
      14.779110
      15.183437
    
    
      2016-04-06
      0.384470
      0.113125
      0.502405
      14.706426
      14.487806
      14.866475
    
    
      2016-04-06
      0.411735
      0.085400
      0.502865
      16.814502
      16.613899
      16.943218
    
    
      2016-04-06
      0.296980
      0.199164
      0.503856
      14.883813
      14.681865
      15.077643
    
    
      2016-04-06
      0.261602
      0.274409
      0.463989
      13.832213
      13.625224
      14.051156

E[Bid] = P(within10)xE[within10] + P(above10)xE[above10] + P(below10)xE[below10]



In [284]:

    
df_test.loc[:,'lnpred'] = df_test.p_within*df_test.Lnprediction_within + df_test.p_more*df_test.Lnprediction_more + df_test.p_less*df_test.Lnprediction_less



In [285]:

    
df_test.loc[:,'BidPrediction'] = np.exp(df_test.loc[:,'lnpred'])



In [286]:

    
df_test.loc[:,'PredDiff'] = df_test.loc[:,'BidPrediction'] - df_test.loc[:,'WinBid']
df_test.loc[:,'PredPercentOff'] = df_test.loc[:,'PredDiff'] / df_test.loc[:,'BidPrediction']

df_test.loc[:,'PredWithin10Percent'] = 1
df_test.loc[(df_test.PredPercentOff > .10) , 'PredWithin10Percent'] = 0
df_test.loc[(df_test.PredPercentOff < -.10) , 'PredWithin10Percent'] = 0



In [287]:

    
ModelPercent = float(df_test.PredWithin10Percent.sum()) / len(df_test)
PercentIncrease = (ModelPercent)*100 - (Percent_April_2016)*100
NumberCorrectIncrease = (PercentIncrease/100)*len(df_test)
print  (Percent_April_2016)*100 , '% of the TxDOT estimates were within 10% of actual bid'
print  (ModelPercent)*100 , '% of the Model predictions were within 10% of actual bid'
print 
print 'this is a increase of :', PercentIncrease, '%'
print 'or', NumberCorrectIncrease, 'more estimates within the 10% threshhold'









    



46.2686567164 % of the TxDOT estimates were within 10% of actual bid
53.7313432836 % of the Model predictions were within 10% of actual bid

this is a increase of : 7.46268656716 %
or 5.0 more estimates within the 10% threshhold



In [288]:

    
print 'In April 2016 TxDOT under estimated bids by: ' , df_test.Diff.sum()
print
print 'In April 2016 the Model under estimated bids by: ' ,df_test.PredDiff.sum()
print
print 'In April 2016 the model was ' , df_test.Diff.sum() - df_test.PredDiff.sum() , 'closer to the winning bids than TxDOT' 
print
print 'The model predicted a sum of' ,df_test.BidPrediction.sum() ,'for all the projects in April 2016'
print
print 'TxDOT predicted a sum of' ,df_test.EngEst.sum() ,'for all the projects in April 2016'









    



In April 2016 TxDOT under estimated bids by:  -3410280.24

In April 2016 the Model under estimated bids by:  -19260901.731

In April 2016 the model was  15850621.491 closer to the winning bids than TxDOT

The model predicted a sum of 209008961.949 for all the projects in April 2016

TxDOT predicted a sum of 224859583.44 for all the projects in April 2016



In [289]:

    
df_test[['Diff','PredDiff']].std()









    Out[289]:





Diff        659308.564868
PredDiff    857344.883547
dtype: float64



In [290]:

    
df_test[['Diff','PredDiff']].describe()









    Out[290]:






  
    
      
      Diff
      PredDiff
    
  
  
    
      count
      6.700000e+01
      6.700000e+01
    
    
      mean
      -5.089971e+04
      -2.874761e+05
    
    
      std
      6.593086e+05
      8.573449e+05
    
    
      min
      -3.178934e+06
      -5.011753e+06
    
    
      25%
      -1.510351e+05
      -3.104509e+05
    
    
      50%
      1.591124e+04
      -7.131061e+04
    
    
      75%
      1.269111e+05
      4.398933e+04
    
    
      max
      1.390280e+06
      9.164722e+05

Bid Model Comparison by Graph

Red is Bad (beyond the 10% threshhold)

Green is Good (within 10% of actual)

TxDOT Estimates



In [291]:

    
cmap = {'0': 'r', '1': 'g' }
df_test.loc[:,'cWithin10Percent'] = df_test.Within10Percent.apply(lambda x: cmap[str(x)])
print df_test.plot('lnEngEst', 'lnWinBid', kind='scatter', c=df_test.cWithin10Percent)









    



Axes(0.125,0.125;0.775x0.775)

Model Predictions



In [292]:

    
predcmap = {'0': 'r', '1': 'g' }
df_test.loc[:,'cPredWithin10Percent'] = df_test.PredWithin10Percent.apply(lambda x: predcmap[str(x)])
print df_test.plot('lnpred', 'lnWinBid', kind='scatter', c=df_test.cPredWithin10Percent)









    



Axes(0.125,0.125;0.775x0.775)



In [ ]:

	Length	Year	Month	lnEngEst	Time	NBidders
Date
2010-10-21	0.200	2010	10	11.728165	40.0	1
2010-08-11	9.050	2010	8	15.161901	140.0	6
2010-04-07	1.250	2010	4	12.492707	28.0	1
2010-08-11	4.316	2010	8	14.277695	40.0	2
2010-05-12	0.100	2010	5	12.464106	40.0	4

	Length	NBidders	Year	Month	lnEngEst	Time
Date
2016-04-05	15.055	2	2016	4	14.979827	0.0
2016-04-06	0.016	5	2016	4	14.722602	0.0
2016-04-06	2.024	4	2016	4	16.836501	0.0
2016-04-06	6.294	3	2016	4	14.893941	0.0
2016-04-06	0.455	3	2016	4	13.837735	0.0

Dep. Variable:	lnWinBid	R-squared:	0.999
Model:	OLS	Adj. R-squared:	0.999
Method:	Least Squares	F-statistic:	2.592e+05
Date:	Wed, 01 Jun 2016	Prob (F-statistic):	0.00
Time:	17:52:08	Log-Likelihood:	3907.2
No. Observations:	2606	AIC:	-7798.
Df Residuals:	2598	BIC:	-7752.
Df Model:	7
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[95.0% Conf. Int.]
Intercept	-3.2112	3.597	-0.893	0.372	-10.265 3.843
lnEngEst	0.9956	0.001	956.110	0.000	0.994 0.998
Year	0.0016	0.002	0.909	0.363	-0.002 0.005
Month	0.2336	0.364	0.641	0.521	-0.481 0.948
Year:Month	-0.0001	0.000	-0.641	0.522	-0.000 0.000
NBidders	-0.2604	0.545	-0.478	0.633	-1.328 0.808
NBidders:Year	0.0001	0.000	0.471	0.638	-0.000 0.001
Time	4.125e-05	8.94e-06	4.613	0.000	2.37e-05 5.88e-05

Omnibus:	576.150	Durbin-Watson:	1.925
Prob(Omnibus):	0.000	Jarque-Bera (JB):	121.613
Skew:	0.183	Prob(JB):	3.91e-27
Kurtosis:	2.007	Cond. No.	5.79e+07

Omnibus:	744.819	Durbin-Watson:	1.877
Prob(Omnibus):	0.000	Jarque-Bera (JB):	5533.375
Skew:	-2.224	Prob(JB):	0.00
Kurtosis:	11.335	Cond. No.	5.77e+07

Omnibus:	473.873	Durbin-Watson:	1.994
Prob(Omnibus):	0.000	Jarque-Bera (JB):	3314.246
Skew:	1.872	Prob(JB):	0.00
Kurtosis:	10.709	Cond. No.	5.76e+07

	p_more	p_less	p_within	Lnprediction_within	Lnprediction_more	Lnprediction_less
Date
2016-04-05	0.260327	0.244116	0.495556	14.972721	14.779110	15.183437
2016-04-06	0.384470	0.113125	0.502405	14.706426	14.487806	14.866475
2016-04-06	0.411735	0.085400	0.502865	16.814502	16.613899	16.943218
2016-04-06	0.296980	0.199164	0.503856	14.883813	14.681865	15.077643
2016-04-06	0.261602	0.274409	0.463989	13.832213	13.625224	14.051156

	Diff	PredDiff
count	6.700000e+01	6.700000e+01
mean	-5.089971e+04	-2.874761e+05
std	6.593086e+05	8.573449e+05
min	-3.178934e+06	-5.011753e+06
25%	-1.510351e+05	-3.104509e+05
50%	1.591124e+04	-7.131061e+04
75%	1.269111e+05	4.398933e+04
max	1.390280e+06	9.164722e+05