A late-night, one-hour hack on the freshly released train timetable dataset from IRCTC. Source: https://data.gov.in/catalog/indian-railways-train-time-table-0#web_catalog_tabs_block_10
In [16]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
In [20]:
# Load the data into a dataframe
df = pd.read_csv("data/isl_wise_train_detail_03082015_v1.csv")
In [21]:
sns.set_context("poster")
# Show some rows
df.head()
Out[21]:
In [4]:
df.columns
Out[4]:
In [22]:
# Convert time columns to datetime objects
df[u'Arrival time'] = pd.to_datetime(df[u'Arrival time'])
df[u'Departure time'] = pd.to_datetime(df[u'Departure time'])
In [23]:
df.head()
Out[23]:
Let's analyze the arrival and departure time distributions. As the plots below show, both times follow a similar distribution. Interestingly, a majority of the trains arrive during the night (which is good, as Indians love to travel at night).
In [28]:
fig, ax = plt.subplots(1,2, sharey=True)
df[u'Arrival time'].map(lambda x: x.hour).hist(ax=ax[0], bins=24)
df[u'Departure time'].map(lambda x: x.hour).hist(ax=ax[1], bins=24)
ax[0].set_xlabel("Arrival Time")
ax[1].set_xlabel("Departure Time")
Out[28]:
It would also be interesting to find out the distribution of the stoppage time at a station. $Stoppage\_time = Departure\_time - Arrival\_time$
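As a minimal sketch of this formula on toy data: the three rows below are made up, and treating a negative result as a halt that crosses midnight is an assumption for illustration, not something the dataset confirms.

```python
import pandas as pd

# Toy timetable rows (hypothetical values, not from the IRCTC file)
toy = pd.DataFrame({
    "Arrival time": ["10:00:00", "23:55:00", "14:30:00"],
    "Departure time": ["10:05:00", "00:05:00", "14:32:00"],
})

arrival = pd.to_datetime(toy["Arrival time"], format="%H:%M:%S")
departure = pd.to_datetime(toy["Departure time"], format="%H:%M:%S")

# Stoppage_time = Departure_time - Arrival_time, expressed in minutes
stoppage = (departure - arrival).dt.total_seconds() / 60

# Assumption: a negative value means the halt crossed midnight, so add one day
wrapped = stoppage.where(stoppage >= 0, stoppage + 24 * 60)

print(stoppage.tolist())
print(wrapped.tolist())
```

Because the time strings carry no date, a train that arrives at 23:55 and leaves at 00:05 shows up as a large negative stoppage unless it is wrapped this way.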
In [25]:
df["Stoppage"] = (df[u'Departure time'] - df[u'Arrival time']).dt.total_seconds() / 60 # Find stoppage time in minutes (newer pandas rejects astype('timedelta64[m]'))
# Plot distribution of stoppage time
df["Stoppage"].hist()
plt.xlabel("Stoppage Time")
Out[25]:
This looks weird. Stoppage time cannot be negative or more than 500 minutes (~8 hours). Let us remove these outliers and plot our distributions again.
In [26]:
df["Stoppage"][(df["Stoppage"]> 0) & (df["Stoppage"] < 61)].hist() # Let us take that max stoppage time can be an hour.
plt.xlabel("Stoppage Time")
Out[26]:
This is better, but it still appears that most stoppage times are less than 30 minutes. So let us plot again in that range.
In [27]:
df["Stoppage"][(df["Stoppage"]> 0) & (df["Stoppage"] < 31)].hist(bins=30) # Now restrict the stoppage time to at most 30 minutes.
plt.xlabel("Stoppage Time")
Out[27]:
This is more informative. We see that most stoppage times are either 1 or 2 minutes or a multiple of 5 minutes, which makes a lot of sense. Now let us filter the data so that it contains only rows with a stoppage time in this range.
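One way to check the "1, 2, or a multiple of 5" observation numerically rather than by eye is sketched below. The `stoppage` values are hypothetical stand-ins for the real column, and the names `is_round` and `share_round` are made up for this example.

```python
import pandas as pd

# Hypothetical stoppage times in minutes (not taken from the dataset)
stoppage = pd.Series([1, 2, 2, 5, 10, 3, 15, 7, 1, 20])

# A halt counts as "round" if it lasts 1 or 2 minutes, or a multiple of 5 minutes
is_round = stoppage.isin([1, 2]) | (stoppage % 5 == 0)

# Fraction of halts with a round duration
share_round = is_round.mean()

# How often each duration occurs, ordered by duration
counts = stoppage.value_counts().sort_index()

print(share_round)
print(counts)
```

On the real data, the same mask could be applied to the filtered stoppage column to see how dominant these durations are.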
In [29]:
df_stoppage_30 = df[(df["Stoppage"]> 0) & (df["Stoppage"] < 31)] # Filter data between nice stoppage times
# Plot data for this stoppage time range.
fig, ax = plt.subplots(1,2, sharey=True)
df_stoppage_30[u'Arrival time'].map(lambda x: x.hour).hist(ax=ax[0], bins=24)
df_stoppage_30[u'Departure time'].map(lambda x: x.hour).hist(ax=ax[1], bins=24)
ax[0].set_xlabel("Arrival Time")
ax[1].set_xlabel("Departure Time")
Out[29]:
Aah, it looks like fewer trains arrive and depart during lunch hours, around 1200-1500 hours. This looks weird, but it could also point to the fact that many trains run at night and travel short distances. This makes me think that we should look closely at the total distance per train.
Let's now analyze the total distance travelled by a train. This can be easily found by taking the last distance value recorded for each train.
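The idea of taking the last value per train can be sketched on a toy frame. This assumes that the distance column is cumulative from the source station and that rows appear in route order; the station codes and numbers below are made up.

```python
import pandas as pd

# Toy rows, ordered by stop within each train (hypothetical values)
toy = pd.DataFrame({
    "Train No.": [101, 101, 101, 202, 202],
    "station Code": ["AAA", "BBB", "CCC", "DDD", "EEE"],
    "Distance": [0, 120, 300, 0, 45],
})

per_train = toy.groupby("Train No.").agg(
    stops=("station Code", "count"),      # number of timetable rows per train
    total_distance=("Distance", "last"),  # cumulative distance at the final stop
)

print(per_train)
```

Note that `"last"` depends on row order. If the rows were not guaranteed to be in route order, taking the maximum of a cumulative distance would be a safer choice.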
In [34]:
# Total Number of stations of the train, last arrival time, first departure time, last distance, first station and last station.
df_train_dist = df[[u'Train No.', u'station Code', u'Arrival time', u'Departure time',
u'Distance', u'Source Station Code', u'Destination station Code']]\
.groupby(u'Train No.').agg({u'station Code': "count", u'Arrival time': "last",
u'Departure time': "first", u'Distance': "last",
u'Source Station Code': "first", u'Destination station Code': "last"})
In [48]:
df_train_dist.head()
Out[48]:
In [40]:
# Let us plot the distributions of the number of stations and the total distance, along with the arrival and departure times
fig, ax = plt.subplots(2,2)
df_train_dist[u'station Code'].hist(ax=ax[0][0], bins=range(df_train_dist[u'station Code'].max() + 1))
df_train_dist[u'Distance'].hist(ax=ax[0][1], bins=50)
# Label the top-row plots (the bottom row is labelled further down)
ax[0][0].set_xlabel("Total Stations stopped")
ax[0][1].set_xlabel("Total Distance covered")
# 25 bin edges give one bin per hour (0-23)
df_train_dist[u'Arrival time'].map(lambda x: x.hour).hist(ax=ax[1][0], bins=range(25))
df_train_dist[u'Departure time'].map(lambda x: x.hour).hist(ax=ax[1][1], bins=range(25))
ax[1][0].set_xlabel("Arrival Time")
ax[1][1].set_xlabel("Departure Time")
Out[40]:
OK, this is interesting.
Now the question is: do trains with more stops run longer distances, on average? Let us try to answer this question.
In [41]:
sns.lmplot(x=u'station Code', y=u'Distance', data=df_train_dist, x_estimator=np.mean)
Out[41]:
The regression plot shows that we cannot draw a firm conclusion about the relation between the number of stops and the distance. Trains with few stops do cover small distances, but for larger distances the relation no longer holds. This can be attributed to the availability of both express and passenger trains on longer routes.
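To put a number on how strongly stops and distance move together, one option is a correlation coefficient. The sketch below uses a made-up per-train summary (the column names `stops` and `distance` are hypothetical) and computes the Spearman coefficient as the Pearson correlation of ranks, which avoids extra dependencies.

```python
import pandas as pd

# Hypothetical per-train summary: number of stops and total distance in km
toy = pd.DataFrame({
    "stops": [5, 8, 12, 20, 40, 10, 60],
    "distance": [60, 110, 150, 800, 900, 2500, 700],
})

# Pearson correlation measures the strength of a linear relationship
pearson = toy["stops"].corr(toy["distance"])

# Spearman correlation is the Pearson correlation of the ranks,
# so it only asks whether the relationship is monotonic
spearman = toy["stops"].rank().corr(toy["distance"].rank())

print(round(pearson, 3), round(spearman, 3))
```

A coefficient well below 1 on the real data would be consistent with the mixed picture in the regression plot, where a long-distance express train with few stops sits next to slow trains with many.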
In [49]:
# Let us look at some general statistics of the distances and the number of stops.
df_train_dist.describe()
Out[49]:
We observe that half of the trains travel less than 810 km, and half have fewer than 20 stops. The maximum distance travelled by a train is 4273 km and the maximum number of stoppages is 128, both of which are very high numbers.
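For readers less familiar with `describe()`, the "50%" row is the median of each column, computed independently. A small sketch follows, with toy rows chosen only so that the median and maximum match the figures quoted above (the rows themselves are hypothetical):

```python
import pandas as pd

# Hypothetical per-train summary; values picked to mirror the quoted statistics
toy = pd.DataFrame({
    "stops": [4, 10, 20, 35, 128],
    "distance": [50, 300, 810, 1500, 4273],
})

summary = toy.describe()

# The "50%" row holds the median; the "max" row holds the largest value
median_distance = summary.loc["50%", "distance"]
median_stops = summary.loc["50%", "stops"]
max_stops = summary.loc["max", "stops"]

print(median_distance, median_stops, max_stops)
```

Since each column's median is computed separately, the trains below the median distance are not necessarily the same trains that are below the median stop count.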
In [56]:
df[[u'Train No.', u'Station Name']].groupby(u'Station Name').count().sort_values(u'Train No.', ascending=False).head(20) # Top 20 stations by number of stopping trains
Out[56]:
Looks like Vijayawada is the station where the most trains stop. I am upset not to see my place, Allahabad, in the top 20 list. Nevertheless, let us plot the distribution of these stoppages.
In [66]:
df[[u'Train No.', u'Station Name']].groupby(u'Station Name').count().hist(bins=range(1,320,2), log=True)
plt.xlabel("Number of trains stopping")
plt.ylabel("Number of stations")
Out[66]:
Looks like very few stations have a high volume of trains stopping. Most stations see close to 5 trains. Let us now look at some train statistics, such as the trains with the most stops and the trains covering the longest distances.
In [67]:
df_train_dist.sort_values(u'station Code', ascending=False).head(10) # Top 10 trains with maximum number of stops
Out[67]:
In [68]:
df_train_dist.sort_values(u'Distance', ascending=False).head(10) # Top 10 trains with maximum distance
Out[68]:
In [73]:
fig, ax = plt.subplots(1,2)
sns.regplot(x=df_train_dist[u'Arrival time'].map(lambda x: x.hour), y=df_train_dist[u'Distance'], x_estimator=np.mean, ax=ax[0])
sns.regplot(x=df_train_dist[u'Departure time'].map(lambda x: x.hour), y=df_train_dist[u'Distance'], x_estimator=np.mean, ax=ax[1])
Out[73]:
We see that many long-distance trains both depart and arrive at night, around 0000 hours. Many long-route trains also arrive in the late afternoon, around 1500 hours, and many leave in the morning, around 1000 hours. Most medium-distance trains arrive during the day.