A late-night, one-hour hack on the freshly released train timetable dataset from IRCTC. Source: https://data.gov.in/catalog/indian-railways-train-time-table-0#web_catalog_tabs_block_10



In [16]:

    
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns



In [20]:

    
# Load the data into a dataframe
df = pd.read_csv("data/isl_wise_train_detail_03082015_v1.csv")



In [21]:

    
sns.set_context("poster")
# Show some rows
df.head()









    Out[21]:






  
    
      
   Train No.       train Name  islno  station Code   Station Name  Arrival time  Departure time  Distance  Source Station Code  source Station Name  Destination station Code  Destination Station Name
0    '00851'  BNC SUVIDHA SPL      1           BBS    BHUBANESWAR    '00:00:00'      '22:50:00'         0                  BBS          BHUBANESWAR                       BNC            BANGALORE CANT
1    '00851'  BNC SUVIDHA SPL      2           BAM      BRAHMAPUR    '01:10:00'      '01:12:00'       166                  BBS          BHUBANESWAR                       BNC            BANGALORE CANT
2    '00851'  BNC SUVIDHA SPL      3          VSKP  VISAKHAPATNAM    '05:10:00'      '05:30:00'       443                  BBS          BHUBANESWAR                       BNC            BANGALORE CANT
3    '00851'  BNC SUVIDHA SPL      4           BZA  VIJAYAWADA JN    '11:10:00'      '11:20:00'       793                  BBS          BHUBANESWAR                       BNC            BANGALORE CANT
4    '00851'  BNC SUVIDHA SPL      5            RU   RENIGUNTA JN    '16:42:00'      '16:52:00'      1169                  BBS          BHUBANESWAR                       BNC            BANGALORE CANT



In [4]:

    
df.columns









    Out[4]:





Index([u'Train No.', u'train Name', u'islno', u'station Code', u'Station Name',
       u'Arrival time', u'Departure time', u'Distance', u'Source Station Code',
       u'source Station Name', u'Destination station Code',
       u'Destination Station Name'],
      dtype='object')



In [22]:

    
# Convert time columns to datetime objects
df[u'Arrival time'] = pd.to_datetime(df[u'Arrival time'])
df[u'Departure time'] = pd.to_datetime(df[u'Departure time'])



In [23]:

    
df.head()









    Out[23]:






  
    
      
   Train No.       train Name  islno  station Code   Station Name         Arrival time       Departure time  Distance  Source Station Code  source Station Name  Destination station Code  Destination Station Name
0    '00851'  BNC SUVIDHA SPL      1           BBS    BHUBANESWAR  2015-08-10 00:00:00  2015-08-10 22:50:00         0                  BBS          BHUBANESWAR                       BNC            BANGALORE CANT
1    '00851'  BNC SUVIDHA SPL      2           BAM      BRAHMAPUR  2015-08-10 01:10:00  2015-08-10 01:12:00       166                  BBS          BHUBANESWAR                       BNC            BANGALORE CANT
2    '00851'  BNC SUVIDHA SPL      3          VSKP  VISAKHAPATNAM  2015-08-10 05:10:00  2015-08-10 05:30:00       443                  BBS          BHUBANESWAR                       BNC            BANGALORE CANT
3    '00851'  BNC SUVIDHA SPL      4           BZA  VIJAYAWADA JN  2015-08-10 11:10:00  2015-08-10 11:20:00       793                  BBS          BHUBANESWAR                       BNC            BANGALORE CANT
4    '00851'  BNC SUVIDHA SPL      5            RU   RENIGUNTA JN  2015-08-10 16:42:00  2015-08-10 16:52:00      1169                  BBS          BHUBANESWAR                       BNC            BANGALORE CANT

Distribution of Arrival and Departure Times

Let's analyze the arrival and departure time distributions. As we can see from the plots below, both times follow a similar distribution. What is interesting is that a majority of the trains arrive during the night (which is good, as Indians love to travel at night).
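The hour-of-day counts behind such a histogram can also be tabulated directly. The sketch below assumes the time columns are already parsed to datetimes and uses a few arrival times from the preview above; `arrivals`, `hours` and `per_hour` are names introduced here for illustration only.

```python
import pandas as pd

# A few arrival times taken from the preview rows above (the date part is arbitrary).
arrivals = pd.to_datetime(pd.Series([
    "2015-08-10 01:10:00",
    "2015-08-10 05:10:00",
    "2015-08-10 11:10:00",
    "2015-08-10 16:42:00",
]))

# .dt.hour is the vectorised equivalent of .map(lambda x: x.hour).
hours = arrivals.dt.hour

# Count stops per hour and make sure all 24 hours appear, even the empty ones.
per_hour = hours.value_counts().reindex(range(24), fill_value=0)
print(per_hour[per_hour > 0])
```

Reindexing over all 24 hours keeps empty hours visible, which a plain `value_counts()` would silently drop.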



In [28]:

    
fig, ax = plt.subplots(1,2, sharey=True)
df[u'Arrival time'].map(lambda x: x.hour).hist(ax=ax[0], bins=24)
df[u'Departure time'].map(lambda x: x.hour).hist(ax=ax[1], bins=24)
ax[0].set_xlabel("Arrival Time")
ax[1].set_xlabel("Departure Time")









    Out[28]:





<matplotlib.text.Text at 0x21690c88>

It would also be interesting to find out the distribution of the stoppage time at a station. $Stoppage\_time = Departure\_time - Arrival\_time$



In [25]:

    
# Stoppage time in minutes (total_seconds() avoids the unit cast that newer pandas versions reject)
df["Stoppage"] = (df[u'Departure time'] - df[u'Arrival time']).dt.total_seconds() / 60
# Plot distribution of stoppage time
df["Stoppage"].hist()
plt.xlabel("Stoppage Time")









    Out[25]:





<matplotlib.text.Text at 0x1c2d4390>

This looks weird. Stoppage time cannot be negative or more than 500 minutes (~8 hours). Let us remove these outliers and plot our distributions again.
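One likely cause of these values (an assumption, since it depends on how the file encodes times) is that only the time of day is stored: a stop that straddles midnight then gives a negative difference, and the origin row, whose arrival is recorded as 00:00:00, gives a very large one. A minimal sketch of an alternative to dropping such rows is to take the difference modulo 24 hours and blank out the origin stop. Only the first toy row mirrors the preview above; the other rows and the helper `to_minutes` are made up for illustration.

```python
import pandas as pd

# Toy rows: an origin stop (arrival recorded as 00:00:00), a normal stop,
# and a stop that straddles midnight. Only the first row mirrors the preview.
toy = pd.DataFrame({
    "islno": [1, 2, 3],
    "Arrival time": ["'00:00:00'", "'01:10:00'", "'23:58:00'"],
    "Departure time": ["'22:50:00'", "'01:12:00'", "'00:03:00'"],
})

def to_minutes(col):
    # Strip the literal quotes, then parse each value as a duration since midnight.
    return pd.to_timedelta(col.str.strip("'")).dt.total_seconds() / 60

arr = to_minutes(toy["Arrival time"])
dep = to_minutes(toy["Departure time"])

# Wrap differences into [0, 1440) so a stop across midnight stays positive.
toy["Stoppage"] = (dep - arr) % (24 * 60)

# The origin row has no real arrival, so its stoppage is left undefined.
toy.loc[toy["islno"] == 1, "Stoppage"] = float("nan")
print(toy[["islno", "Stoppage"]])
```

This treats `islno == 1` as the origin stop, which is a guess from the preview rather than something the dataset documents.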



In [26]:

    
df["Stoppage"][(df["Stoppage"]> 0) & (df["Stoppage"] < 61)].hist() # Let us take that max stoppage time can be an hour. 
plt.xlabel("Stoppage Time")









    Out[26]:





<matplotlib.text.Text at 0x2098f470>

This is better, but most stoppage times still appear to be under 30 minutes. So let us plot again in that range.



In [27]:

    
df["Stoppage"][(df["Stoppage"]> 0) & (df["Stoppage"] < 31)].hist(bins=30) # Now cap the stoppage time at 30 minutes.
plt.xlabel("Stoppage Time")









    Out[27]:





<matplotlib.text.Text at 0x1c6e9dd8>

This is more informative. We see that most stoppage times are either 1 or 2 minutes or a multiple of 5 minutes. Makes a lot of sense. Now let us filter the data so that it only contains stoppage times in this range.
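The "1, 2 or a multiple of 5" pattern can also be checked numerically rather than by eye. A small sketch, using made-up durations purely to show the check (the real column would be `df["Stoppage"]`); `is_round` and `share` are names introduced here.

```python
import pandas as pd

# Made-up stoppage durations in minutes, only to illustrate the check.
stoppage = pd.Series([1, 2, 2, 5, 10, 3, 15, 1, 20, 7])

# A duration counts as "round" if it is 1 or 2 minutes or a multiple of 5.
is_round = stoppage.isin([1, 2]) | (stoppage % 5 == 0)

# The mean of a boolean Series is the fraction of True values.
share = is_round.mean()
print(f"{share:.0%} of stops are 1, 2 or a multiple of 5 minutes")
```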



In [29]:

    
df_stoppage_30 = df[(df["Stoppage"]> 0) & (df["Stoppage"] < 31)] # Filter data between nice stoppage times
# Plot data for this stoppage time range.
fig, ax = plt.subplots(1,2, sharey=True)
df_stoppage_30[u'Arrival time'].map(lambda x: x.hour).hist(ax=ax[0], bins=24)
df_stoppage_30[u'Departure time'].map(lambda x: x.hour).hist(ax=ax[1], bins=24)
ax[0].set_xlabel("Arrival Time")
ax[1].set_xlabel("Departure Time")









    Out[29]:





<matplotlib.text.Text at 0x208de2e8>

Aah, it looks like fewer trains arrive and depart during lunch hours, around 1200-1500 hours. This looks weird, but it could also point to the fact that many trains run at night and travel short distances. This makes me think that we should look closely at the total distance per train.

Distance analysis

Let's now analyze the total distance travelled by a train. This can easily be found by taking the last distance value for each train.
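Taking the last value per train only gives the total distance if each train's rows are in route order. A minimal sketch that makes this explicit by sorting on `islno` first, assuming `islno` is the stop sequence number (a guess from the preview). The distances for '00851' mirror the preview; train '99999' and the output names `n_stops` and `total_km` are invented.

```python
import pandas as pd

# Rows deliberately shuffled; '00851' values mirror the preview, '99999' is invented.
stops = pd.DataFrame({
    "Train No.": ["'00851'", "'99999'", "'00851'", "'00851'", "'99999'"],
    "islno":     [3, 2, 1, 2, 1],
    "Distance":  [443, 120, 0, 166, 0],
})

# Sort by train and stop sequence so that "last" really is the final stop.
ordered = stops.sort_values(["Train No.", "islno"])

# Named aggregation (pandas >= 0.25): one output column per (input column, function) pair.
summary = ordered.groupby("Train No.").agg(
    n_stops=("islno", "count"),
    total_km=("Distance", "last"),
)
print(summary)
```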



In [34]:

    
# Per train: number of stations, last arrival time, first departure time,
# last (total) distance, first source station and last destination station.
cols = [u'Train No.', u'station Code', u'Arrival time', u'Departure time',
        u'Distance', u'Source Station Code', u'Destination station Code']
df_train_dist = df[cols].groupby(u'Train No.').agg({
    u'station Code': "count",
    u'Arrival time': "last",
    u'Departure time': "first",
    u'Distance': "last",
    u'Source Station Code': "first",
    u'Destination station Code': "last",
})



In [48]:

    
df_train_dist.head()









    Out[48]:






  
    
      
           Distance  station Code  Source Station Code  Destination station Code         Arrival time       Departure time
Train No.
'00851'        1511             7                  BBS                       BNC  2015-08-10 22:40:00  2015-08-10 22:50:00
'00852'        1511             7                  BNC                       BBS  2015-08-10 01:45:00  2015-08-10 01:00:00
'01081'         436             8                   DR                       BSL  2015-08-10 05:00:00  2015-08-10 21:45:00
'01082'         436             8                  BSL                        DR  2015-08-10 16:20:00  2015-08-10 08:35:00
'01149'         117             4                  BSL                       CSN  2015-08-10 21:30:00  2015-08-10 18:35:00



In [40]:

    
# Plot the distributions of stops and distances per train, as well as arrival and departure hours
fig, ax = plt.subplots(2,2)
df_train_dist[u'station Code'].hist(ax=ax[0][0], bins=range(df_train_dist[u'station Code'].max() + 1))
df_train_dist[u'Distance'].hist(ax=ax[0][1], bins=50)
# Label the top row (the original labelled the bottom row here, and the labels were then overwritten)
ax[0][0].set_xlabel("Total Stations stopped")
ax[0][1].set_xlabel("Total Distance covered")

df_train_dist[u'Arrival time'].map(lambda x: x.hour).hist(ax=ax[1][0], bins=range(24))
df_train_dist[u'Departure time'].map(lambda x: x.hour).hist(ax=ax[1][1], bins=range(24))
ax[1][0].set_xlabel("Arrival Time")
ax[1][1].set_xlabel("Departure Time")









    Out[40]:





<matplotlib.text.Text at 0x26b26f98>

Train specific analysis

OK, this is interesting.

We observe that the majority of trains cover 15-25 stations.
We also see that many trains are short-distance trains travelling only 500-700 km.
For many trains, arrival at the last stop falls between 0500 in the morning and 1300 in the afternoon, with another cluster around midnight.
Departure from the first stop is, for a majority of the trains, during the night.

Now the question is: do trains with more stops run longer distances on average? Let us try to answer this question.



In [41]:

    
sns.lmplot(x=u'station Code', y=u'Distance', data=df_train_dist, x_estimator=np.mean)









    Out[41]:





<seaborn.axisgrid.FacetGrid at 0x23a7d860>

The regression plot shows that we cannot draw a firm conclusion about the relation between the number of stops and distance. We do see that few stops mean short distances, but for longer distances this relation doesn't hold. This can be attributed to the availability of both express and passenger trains for longer distances.
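One rough way to tell express-like trains from passenger-like ones is the average distance between consecutive stops. The sketch below uses four (distance, stops) pairs copied from the top-10 tables in this notebook; the column name `km_per_leg` is introduced here, and treating n stops as n - 1 legs is a simplifying assumption.

```python
import pandas as pd

# (Distance in km, number of stops) copied from the top-10 tables in this notebook.
trains = pd.DataFrame(
    {"Distance": [1545, 878, 4273, 3597], "stops": [128, 106, 62, 26]},
    index=["'59386'", "'58112'", "'15906'", "'12483'"],
)

# Average distance between consecutive stops: n stops means n - 1 legs.
trains["km_per_leg"] = trains["Distance"] / (trains["stops"] - 1)
print(trains.sort_values("km_per_leg").round(1))
```

Train '12483' averages well over 100 km between stops while '58112' averages under 10 km, which fits the express versus passenger explanation.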



In [49]:

    
# Let us look at some general statistics of the distances and the number of stops.
df_train_dist.describe()









    Out[49]:






  
    
      
          Distance  station Code
count  2810.000000   2810.000000
mean   1073.904270     24.557295
std     770.358422     16.903673
min      14.000000      2.000000
25%     463.500000     13.000000
50%     810.000000     20.000000
75%    1585.000000     31.000000
max    4273.000000    128.000000

We observe that 50% of the trains travel at most 810 km and have at most 20 stops. The maximum distance travelled by a train is 4273 km and the maximum number of stops is 128, both of which are very high numbers.

Analysis of Stations

Let us look at which stations are popular.



In [56]:

    
df[[u'Train No.', u'Station Name']].groupby(u'Station Name').count().sort_values(u'Train No.', ascending=False).head(20)









    Out[56]:






  
    
      
                 Train No.
Station Name
VIJAYAWADA JN          313
VADODARA JN            298
KANPUR CENTRAL         283
SURAT                  267
ITARSI JN              262
AHMEDABAD JN           255
KALYAN JN              254
BHUSAVAL JN            248
NAGPUR                 243
LUCKNOW NR             239
NEW DELHI              239
MUGHAL SARAI JN        233
BHOPAL  JN             232
HOWRAH JN              230
JHANSI JN              217
BHUBANESWAR            210
PUNE JN                208
AMBALA CANT JN         206
VARANASI JN            201
VISAKHAPATNAM          201

Looks like Vijayawada is the station where the most trains stop. I am upset not to see my place, Allahabad, in the top 20 list. Nevertheless, let us plot the distribution of these stoppages.
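Note that the count above is a count of rows, which equals the number of trains only if no train lists the same station twice; whether that holds for this dataset is not checked here. A small sketch on invented rows showing the difference between counting rows and counting distinct trains:

```python
import pandas as pd

# Invented rows: train 'A' lists KALYAN JN twice, e.g. because of a duplicated entry.
stops = pd.DataFrame({
    "Train No.":    ["A", "A", "A", "B", "B", "C"],
    "Station Name": ["KALYAN JN", "PUNE JN", "KALYAN JN", "KALYAN JN", "SURAT", "SURAT"],
})

# Row counts per station: this is what groupby(...).count() measures.
rows_per_station = stops["Station Name"].value_counts()

# Distinct trains per station: robust to repeated rows for the same train.
trains_per_station = (
    stops.groupby("Station Name")["Train No."].nunique().sort_values(ascending=False)
)
print(rows_per_station)
print(trains_per_station)
```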



In [66]:

    
df[[u'Train No.', u'Station Name']].groupby(u'Station Name').count().hist(bins=range(1,320,2), log=True)
plt.xlabel("Number of trains stopping")
plt.ylabel("Number of stations")









    Out[66]:





<matplotlib.text.Text at 0x2c7f0a58>

Looks like very few stations have a high volume of trains stopping. Most stations see close to 5 trains. Let us now look at some train statistics like:

Trains with the maximum number of stops; I would personally avoid these trains.
Trains which travel the maximum distance; if they make fewer stops, I would prefer these.
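Both rankings boil down to picking the top rows by one column, for which pandas also offers `nlargest` as a shortcut to sorting and taking the head. A minimal sketch on a handful of (distance, stops) pairs copied from the tables in this notebook; `sample`, `most_stops` and `longest` are illustrative names.

```python
import pandas as pd

# A handful of (Distance, stops) pairs copied from the tables in this notebook.
sample = pd.DataFrame(
    {"Distance": [4273, 3597, 1545, 1532, 878],
     "station Code": [62, 26, 128, 124, 106]},
    index=pd.Index(["'15906'", "'12483'", "'59386'", "'13131'", "'58112'"],
                   name="Train No."),
)

most_stops = sample.nlargest(2, "station Code")  # trains with the most stops
longest = sample.nlargest(2, "Distance")         # trains covering the most distance
print(most_stops)
print(longest)
```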



In [67]:

    
df_train_dist.sort_values(u'station Code', ascending=False).head(10) # Top 10 trains with maximum number of stops









    Out[67]:






  
    
      
           Distance  station Code  Source Station Code  Destination station Code         Arrival time       Departure time
Train No.
'59386'        1545           128                  CWA                      INDB  2015-08-10 08:10:00  2015-08-10 01:20:00
'13131'        1532           124                 KOAA                      ANVT  2015-08-10 11:40:00  2015-08-10 19:50:00
'11039'        1346           122                  KOP                         G  2015-08-10 20:15:00  2015-08-10 13:35:00
'13008'        1978           121                 SGNR                       HWH  2015-08-10 19:30:00  2015-08-10 23:50:00
'13007'        1978           112                  HWH                      SGNR  2015-08-10 07:00:00  2015-08-10 09:35:00
'13049'        1922           111                  HWH                       ASR  2015-08-10 10:15:00  2015-08-10 13:50:00
'13352'        2536           108                 ALLP                       DHN  2015-08-10 13:15:00  2015-08-10 06:00:00
'15018'        1713           106                  GKP                       LTT  2015-08-10 18:05:00  2015-08-10 17:50:00
'58112'         878           106                  ITR                      TATA  2015-08-10 00:05:00  2015-08-10 21:30:00
'13050'        1922           106                  ASR                       HWH  2015-08-10 15:45:00  2015-08-10 18:10:00



In [68]:

    
df_train_dist.sort_values(u'Distance', ascending=False).head(10) # Top 10 trains with maximum distance









    Out[68]:






  
    
      
           Distance  station Code  Source Station Code  Destination station Code         Arrival time       Departure time
Train No.
'15906'        4273            62                 DBRG                      CAPE  2015-08-10 09:50:00  2015-08-10 23:45:00
'15905'        4273            62                 CAPE                      DBRG  2015-08-10 07:15:00  2015-08-10 23:00:00
'16318'        3715            84                  JAT                      CAPE  2015-08-10 21:30:00  2015-08-10 09:05:00
'16317'        3714            72                 CAPE                       JAT  2015-08-10 13:10:00  2015-08-10 14:10:00
'06336'        3650            58                 KCVL                       GHY  2015-08-10 08:15:00  2015-08-10 12:00:00
'06335'        3650            58                  GHY                      KCVL  2015-08-10 22:30:00  2015-08-10 23:25:00
'16688'        3609            79                  JAT                       MAQ  2015-08-10 22:45:00  2015-08-10 13:40:00
'16687'        3607            63                  MAQ                       JAT  2015-08-10 13:10:00  2015-08-10 16:50:00
'12483'        3597            26                 KCVL                       ASR  2015-08-10 22:20:00  2015-08-10 09:20:00
'12484'        3597            26                  ASR                      KCVL  2015-08-10 17:45:00  2015-08-10 05:55:00



In [73]:

    
fig, ax = plt.subplots(1,2)
sns.regplot(x=df_train_dist[u'Arrival time'].map(lambda x: x.hour), y=df_train_dist[u'Distance'], x_estimator=np.mean, ax=ax[0])
sns.regplot(x=df_train_dist[u'Departure time'].map(lambda x: x.hour), y=df_train_dist[u'Distance'], x_estimator=np.mean, ax=ax[1])









    Out[73]:





<matplotlib.axes._subplots.AxesSubplot at 0x2dbe7438>

We see that a lot of long-distance trains depart and arrive at night, around 0000 hours. Many long-route trains arrive in the late afternoon around 1500 hours, and many leave in the morning around 1000 hours as well. Most medium-distance trains arrive during the day.
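Statements like these are easier to check after grouping hours into broader parts of the day. A minimal sketch using the last-stop arrival hours of the first five trains in the per-train table above; the bucket edges and labels are arbitrary assumptions, not something the dataset defines.

```python
import pandas as pd

# Arrival hours at the last stop for the first five trains in the per-train table.
hours = pd.Series(
    [22, 1, 5, 16, 21],
    index=["'00851'", "'00852'", "'01081'", "'01082'", "'01149'"],
)

# Hypothetical bucket edges; right=False makes each bucket [start, end).
edges = [0, 5, 12, 17, 21, 24]
labels = ["night", "morning", "afternoon", "evening", "late evening"]
part_of_day = pd.cut(hours, bins=edges, labels=labels, right=False)
print(part_of_day.value_counts())
```

Grouping by `part_of_day` and averaging the distance would then give one number per bucket instead of 24 noisy ones.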


