In this Jupyter notebook, we will train linear models to predict the TCS stock price based on 2016 trading data.
Get the historical data for the TCS scrip from the NSE.
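If you don't have the CSV handy, one way to pull the same data programmatically is via the third-party nsepy package. This is a sketch under that assumption; column names in nsepy's output may differ from the NSE CSV dump used below.
from datetime import date
from nsepy import get_history  # pip install nsepy

# fetch 2016 daily data for the TCS equity scrip and save it locally
tcs = get_history(symbol='TCS', start=date(2016, 1, 1), end=date(2016, 12, 30))
tcs.to_csv('01-01-2016-TO-30-12-2016TCSEQN.csv')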
In [1]:
# load ammo
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.model_selection import train_test_split, cross_val_score
In [2]:
# load the data: 01-01-2016-TO-30-12-2016TCSEQN.csv (TCS equity bhavdata for 2016)
# column names are present in the data
df = pd.read_csv('01-01-2016-TO-30-12-2016TCSEQN.csv')
print(df.shape)
df.dtypes
Out[2]:
In [3]:
# let's see what we got here
df.describe()
Out[3]:
TCS, being the largest company by market capitalization, maintains very high liquidity, so we have data for all the trading days of 2016.
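A quick sanity check on that claim, assuming the CSV carries a 'Date' column (header names in NSE dumps vary):
dates = pd.to_datetime(df['Date'])  # 'Date' is an assumption about the CSV header
print(len(df), 'rows')              # expect ~247 trading days in 2016
print(dates.min(), '->', dates.max())  # should span the full year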
In [4]:
df.head(2)
Out[4]:
In [5]:
df.tail(2)
Out[5]:
In [6]:
df['Average Price'].plot(color='black', linewidth=0.7)
df['High Price'].plot(color='green', linewidth=0.5)
df['Low Price'].plot(color='red', linewidth=0.5)
plt.plot([df['Average Price'].mean()] * len(df), 'b--')  # yearly mean as a reference line
plt.show()
We see a lot of swing in the price; the stock started and ended the year in the same range.
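To put a number on that swing:
avg = df['Average Price']
print('min = {:.2f}, max = {:.2f}, mean = {:.2f}'.format(avg.min(), avg.max(), avg.mean()))
print('swing = {:.1f}% of the mean'.format(100 * (avg.max() - avg.min()) / avg.mean()))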
In [7]:
df['Average Price'].plot(color='black', linewidth=0.7)
df['Open Price'].plot(color='green', linewidth=0.5)
df['Close Price'].plot(color='red', linewidth=0.5)
plt.plot([df['Average Price'].mean()] * len(df), 'b--')  # yearly mean as a reference line
plt.show()
In [8]:
# feature and label selection
features = ['Open Price','High Price','Low Price','Close Price','Average Price']
label = 'Average Price'
# prepare a dataset with only the required columns
# the label is the next day's average price ('Average Price' shifted up by one row)
dataset = df[features].assign(label=df[label].shift(-1))
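In case the shift is confusing, here is what shift(-1) does on a toy series: every row receives the next row's value, and the last row becomes NaN.
s = pd.Series([10.0, 20.0, 30.0])
print(s.shift(-1))  # 20.0, 30.0, NaN -- each row now holds the next row's value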
In [9]:
# let's check the head of the dataset; the label is now the next day's avg. price
dataset.head(5)
Out[9]:
In [10]:
# the last label is NaN because we shifted the label up by one row
dataset.tail(2)
Out[10]:
In [11]:
# drop the last data point as it lacks a label
print(dataset.shape)
dataset.dropna(inplace=True)
print(dataset.shape)
In [12]:
# separate data and target label
data = dataset[features].values
target = dataset['label'].values
# standardize the features (zero mean, unit variance) to ease model fitting
data_scaled = preprocessing.scale(data)
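preprocessing.scale standardizes each column to zero mean and unit variance; a manual equivalent (using the population std, ddof=0, as sklearn does) gives the same result:
manual = (data - data.mean(axis=0)) / data.std(axis=0)
print(np.allclose(manual, data_scaled))  # True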
In [13]:
test_size = 0.3 # hold out 30% for prediction
random_state = 42 # the answer to the universe is as good a choice as any :P
X_train, X_test, y_train, y_test = train_test_split(data_scaled,target,test_size=test_size,random_state=random_state)
X_train.shape, y_train.shape
Out[13]:
In [14]:
X_test.shape, y_test.shape
Out[14]:
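Note that train_test_split shuffles by default, which mixes up the time order. A chronological hold-out is arguably more realistic for stock data; here is a sketch of that alternative (not used below):
split = int(len(data_scaled) * (1 - test_size))  # first 70% train, last 30% test
X_train_ts, X_test_ts = data_scaled[:split], data_scaled[split:]
y_train_ts, y_test_ts = target[:split], target[split:]
print(X_train_ts.shape, X_test_ts.shape)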
In [15]:
# let's use multiple models and determine their merit with k-fold cross validation
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression
models = []
models.append(('Linear Regression',LinearRegression()))
models.append(('SVR Linear',SVR(kernel='linear')))
models.append(('SVR Poly',SVR(kernel='poly')))
models.append(('SVR RBF',SVR(kernel='rbf')))
Our model evaluation metric for this problem will be Mean Squared Error (MSE).
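For reference, MSE is just the mean of the squared residuals; sklearn's 'neg_mean_squared_error' returns its negative so that higher is always better for CV scoring:
def mse(y_true, y_pred):
    # mean((y_true - y_pred)^2)
    return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)

print(mse([1, 2], [1, 4]))  # 2.0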
In [16]:
# let's evaluate the models
results = []
names = []
scoring = 'neg_mean_squared_error'
# k-fold with shuffling
from sklearn.model_selection import KFold
num_splits = 10
kf = KFold(n_splits=num_splits, shuffle=True, random_state=random_state)
# k-fold without shuffling
cv_fold = 10
# if you wish to try CV with shuffling, pass kf as cv instead of cv_fold
for name, model in models:
    result = cross_val_score(model, X_train, y_train, cv=cv_fold, scoring=scoring)
    names.append(name)
    results.append(result)
    msg = "{:20} : mean = {:.5f} std = {:.5f}".format(name, result.mean(), result.std())
    print(msg)
In [17]:
# Compare Algorithms
fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()
As we can see in the comparison, Linear Regression and SVR with a linear kernel have the highest accuracy (lowest MSE) and very low deviation. SVR Poly and SVR RBF are a bad idea if you are going to put your money on this :P
In [18]:
# keep only the models that cleared the cross-validation bar
accuracy_threshold = -1000  # scores are negative MSE, so higher is better
cv_models = []
cv_results = []
cv_names = []
for model, result in zip(models, results):
    if result.mean() > accuracy_threshold:
        cv_models.append(model)
        cv_results.append(result)
        cv_names.append(model[0])
        print('adding model... {}'.format(model[0]))
In [19]:
# let's have a closer look at the 'relatively' good models
fig = plt.figure()
fig.suptitle('Final Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(cv_results)
ax.set_xticklabels(cv_names)
plt.show()
There are some outliers, but the interquartile range is short (compared to the cross-validation run with shuffled folds).
In [20]:
# make predictions on the test set
from sklearn.metrics import mean_squared_error
predictions = []
mseResult = []
for name, model in cv_models:
    model.fit(X_train, y_train)
    prediction = model.predict(X_test)
    predictions.append(prediction)
    mse = mean_squared_error(y_test, prediction)
    mseResult.append(mse)
    msg = "{:20} : mse = {:.5f}".format(name, mse)
    print(msg)
The prediction MSE is in line with what we saw in cross validation.
In [21]:
# the two models that survived the threshold, in insertion order
df_predictions = pd.DataFrame(np.transpose(predictions), columns=['Linear Reg', 'SVR Linear'])
df_predictions = df_predictions.assign(y_test=y_test)
df_predictions.head()
Out[21]:
In [22]:
df_predictions['Linear Reg'].plot(color='green', linewidth=0.5)
df_predictions['SVR Linear'].plot(color='red', linewidth=0.5)
df_predictions['y_test'].plot(color='black', linewidth=0.5)
plt.show()
We can see that Linear Regression traces the actual value more conservatively, while SVR overshoots most of the time.
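A rough way to check that overshoot is the mean signed error per model (positive means the prediction sits above the actual value on average):
for col in ['Linear Reg', 'SVR Linear']:
    bias = (df_predictions[col] - df_predictions['y_test']).mean()
    print('{:12} mean signed error = {:.3f}'.format(col, bias))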