This document explores historical traffic data from an automatic traffic measurement point at Hanasaari.
"The TMS point consists of ta data collecting unit and two induction loops on each traffic lane. The device registers vehicles passing the TMS point, recording data such as time, direction, lane, speed, vehicle length, time elapsed between vehicles and the vehicle class." From the documentation: http://www.liikennevirasto.fi/web/en/open-data/materials/tms-data#.WeilKBOCwp8
The point is located at: https://www.google.fi/maps/place/60%C2%B009'53.9%22N+24%C2%B050'55.5%22E/@60.1649798,24.846555,17z/data=!3m1!4b1!4m5!3m4!1s0x0:0x0!8m2!3d60.1649771!4d24.8487437?hl=en
Metadata for LAM point (101): https://tie.digitraffic.fi/api/v1/metadata/tms-stations/tms-number/101
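For reference, the station metadata can be fetched directly; a minimal sketch using only the Python standard library (this snippet is not part of the original notebook):
import json
import urllib.request
# Fetch the metadata for LAM point 101 from the Digitraffic API
url = "https://tie.digitraffic.fi/api/v1/metadata/tms-stations/tms-number/101"
with urllib.request.urlopen(url) as response:
    metadata = json.load(response)
print(list(metadata.keys()))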
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import urllib.error
%matplotlib inline
The first examination will be of all data from 2016.
https://ava.liikennevirasto.fi/lam/rawdata/[year]/[ELY]/lamraw_[lam_id]_[yearshort]_[day_number].csv
NOTE! In the near future the address will change to aineistot.liikennevirasto.fi/
Description of the result file format: the result file is a CSV file separated by semicolons (;). The time is the current time in Finland, EET, or EEST during summer time. The CSV files include the following fields (units in parentheses):
1. TMS point id
2. year
3. päivän järjestysnumero (running number of the day within the year)
4. hour
5. minute
6. second
7. 1/100 second
8. length (m)
9. lane
10. direction
11. vehicle class
12. speed (km/h)
13. faulty (0 = valid record, 1 = faulty record)
14. total time (technical)
15. time interval (technical)
16. queue start (technical)
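To make the template concrete, the raw-data URL can be built from its parameters; a small illustrative helper (the function name is made up here, and ELY code 01 is assumed for this point, as in the requests below):
def rawdata_url(year, ely, lam_id, day_number):
    # [yearshort] in the template is the two-digit year, e.g. 16 for 2016
    return ("https://ava.liikennevirasto.fi/lam/rawdata/"
            "{year}/{ely}/lamraw_{lam_id}_{yearshort}_{day}.csv"
            .format(year=year, ely=ely, lam_id=lam_id,
                    yearshort=str(year)[-2:], day=day_number))
print(rawdata_url(2016, '01', 101, 1))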
First a test with one day:
In [38]:
columns = [
    'TMS id',
    'year',
    'päivän järjestysnumero',
    'hour',
    'minute',
    'second',
    '1/100 second',
    'length (m)',
    'lane',
    'direction',
    'vehicle class',
    'speed(km/h)',
    'faulty (0 = validi record, 1=faulty record)',
    'total time (technical)',
    'time interval (technical)',
    'queue start (technical)']
firstday = pd.read_csv("https://ava.liikennevirasto.fi/lam/rawdata/2016/01/lamraw_101_16_1.csv",sep=';',names=columns)
firstday
Out[38]:
In [3]:
firstday['faulty (0 = validi record, 1=faulty record)'].value_counts()
Out[3]:
In [4]:
valid = firstday['faulty (0 = validi record, 1=faulty record)'] == 0
firstday = firstday[valid]
firstday.info()
In [36]:
types = {
    'päivän järjestysnumero': np.int64,
    'hour': np.int64,
    'minute': np.int64,
    'second': np.int64,
}
usecols = [
    'year',
    'päivän järjestysnumero',
    'hour',
    'minute',
    'second',
    '1/100 second',
    'length (m)',
    'lane',
    'direction',
    'vehicle class',
    'speed(km/h)',
    'faulty (0 = validi record, 1=faulty record)',
]
In [35]:
def get_day(year, day):
    df = pd.read_csv("https://ava.liikennevirasto.fi/lam/rawdata/{}/01/lamraw_101_{}_{}.csv"
                     .format(str(year), str(year)[-2:], str(day)),
                     sep=';', names=columns, dtype=types, usecols=usecols)
    df['päivän järjestysnumero'] = df['päivän järjestysnumero'].apply(lambda x: pd.to_timedelta(x, unit='D'))
    df['hour'] = df['hour'].apply(lambda x: pd.to_timedelta(x, unit='h'))
    df['minute'] = df['minute'].apply(lambda x: pd.to_timedelta(x, unit='m'))
    df['second'] = df['second'].apply(lambda x: pd.to_timedelta(x, unit='s'))
    # centiseconds are not a supported timedelta unit, so convert to milliseconds (1 cs = 10 ms)
    df['1/100 second'] = df['1/100 second'].apply(lambda x: pd.to_timedelta(x * 10, unit='ms'))
    df = df.rename(columns={'1/100 second': '1/1000 second'})
    return df
In [7]:
def get_year(year):
    days = []
    for day in range(1, 367):
        try:
            df = get_day(year, day)
            valid = df['faulty (0 = validi record, 1=faulty record)'] == 0
            days.append(df[valid])
        except urllib.error.HTTPError:
            print("Unable to retrieve data for day {}".format(day))
    return days
In [9]:
days_16 = get_year(2016)
Apparently three days are missing from the dataset: days 54, 69 and 70.
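As a sanity check, the missing day numbers can be collected explicitly, for example with a loop like the following sketch (it re-requests every day, so in practice this bookkeeping would be folded into get_year):
missing = []
for day in range(1, 367):
    try:
        get_day(2016, day)
    except urllib.error.HTTPError:
        missing.append(day)
print(missing)  # expected: [54, 69, 70]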
Aggregate all days to one dataframe.
In [10]:
y16 = pd.concat(days_16)
y16
Out[10]:
In [11]:
y16.info()
Now some of the unnecessary data can be dropped.
In [12]:
#Free up some memory
del days_16
In [13]:
unnecessary = ['faulty (0 = validi record, 1=faulty record)']
y16 = y16.drop(unnecessary,axis=1)
y16.head()
Out[13]:
In [18]:
#Save to a file so the analysis can be resumed from here if a crash happens (optional)
y16.to_csv('data/hanasaari_lam_2016.csv')
In [19]:
df = pd.read_csv('data/hanasaari_lam_2016.csv')
In [20]:
df.head()
Out[20]:
In [27]:
df = df.drop(['Unnamed: 0'],axis=1)
df.head()
Out[27]:
In [28]:
y16['date-time'] = pd.Timestamp('2016-01-01')
y16.head()
Out[28]:
In [31]:
# 'päivän järjestysnumero' starts at 1, so subtract one day before adding it to January 1st
datetime = (y16['date-time'] + (y16['päivän järjestysnumero'] - pd.Timedelta('1 days'))
            + y16['hour'] + y16['minute'] + y16['second'] + y16['1/1000 second'])
datetime.head()
Out[31]:
In [32]:
datetime.tail()
Out[32]:
In [33]:
y16['date-time'] = datetime
y16.head()
Out[33]:
After creating the new date-time index, the source columns used to build it can be removed.
In [34]:
y16 = y16.set_index('date-time')
y16.head()
Out[34]:
In [35]:
unnecessary = ['year','päivän järjestysnumero', 'hour','minute','second','1/1000 second','length (m)']
y16 = y16.drop(unnecessary,axis=1)
In [39]:
y16.head()
Out[39]:
In [40]:
y16.to_csv('data/hanasaari_lam_2016.csv')
In [2]:
dtypes = {
    'lane': np.int64,
    'direction': np.int64,
    'vehicle class': np.int64,
    'speed(km/h)': np.int64
}
df = pd.read_csv('data/hanasaari_lam_2016.csv', index_col = 0, parse_dates=True,infer_datetime_format=True,dtype=dtypes)
df.head()
Out[2]:
In [3]:
y16 = df
In [4]:
monthly = y16.drop(['lane', 'direction','vehicle class','speed(km/h)'],axis=1)
monthly['vehicles in total'] = 1
monthly.head()
Out[4]:
In [5]:
monthly = monthly.groupby(monthly.index.month).count()
In [6]:
monthly.index.rename("Month",inplace=True);
monthly
Out[6]:
In [7]:
months = {1:"January",2:"February",3:"March",4:"April",5:"May",6:"June",7:"July",8:"August",9:"September",10:"October",11:"November",12:"December"}
In [8]:
monthly = monthly.rename(months)
In [9]:
monthly.plot(kind='bar', figsize = (15,10));
Next let's find the month that is closest to the median with respect to total traffic.
In [10]:
monthly["diff"] = monthly["vehicles in total"].apply(lambda x: abs(x - monthly.median()))
monthly
Out[10]:
In [11]:
monthly[monthly["diff"] == monthly["diff"].min()]
Out[11]:
So it appears that June and December are the median months with respect to total traffic. On the other hand, both contain holidays. Let's look at weeks instead, and let's now also consider vehicle classes and average speed.
In [12]:
base = y16.drop(['lane'],axis=1)
print(base.head())
print(base["vehicle class"].unique())
There are seven vehicle classes.
Let's define light traffic as classes {1, 6, 7}, heavy traffic as {2, 4, 5}, and buses as {3}.
In [13]:
light = {1,6,7}
heavy = {2,4,5}
bus = {3}
In [14]:
base['light traffic'] = base['vehicle class'].apply(lambda x: 1 if x in light else 0)
In [15]:
base['heavy traffic'] = base['vehicle class'].apply(lambda x: 1 if x in heavy else 0)
In [16]:
base['buses'] = base['vehicle class'].apply(lambda x: 1 if x in bus else 0)
In [17]:
base.head()
Out[17]:
In [18]:
weekly = base.drop(['vehicle class','direction','speed(km/h)'],axis=1).groupby(base.index.week).sum()
weekly.head()
Out[18]:
In [19]:
weekly.index.rename("Week (2016)",inplace=True);
weekly.plot(kind='bar', stacked=True, figsize = (15,10));
Apparently there is huge weekly variation in the amount of traffic. What we are interested in is the traffic of a typical working week without any special holidays or occasions. Weeks 33-43 seem to represent that well, so let's plot the distributions in that interval.
In [20]:
weofin = base[(base.index.week >=33) & (base.index.week <=43)]
weofin.info()
In [21]:
weofin_traffic_1 = weofin[weofin['direction'] == 1][['light traffic','heavy traffic','buses']].resample('5T', label='right').sum()
weofin_traffic_2 = weofin[weofin['direction'] == 2][['light traffic','heavy traffic','buses']].resample('5T', label='right').sum()
print(weofin_traffic_2.head())
print(weofin_traffic_1.head())
In [22]:
weofin_avg_speed_1 = weofin[weofin['direction'] == 1][['speed(km/h)']].resample('5T', label='right').mean()
weofin_avg_speed_2 = weofin[weofin['direction'] == 2][['speed(km/h)']].resample('5T', label='right').mean()
print(weofin_avg_speed_1.head())
print(weofin_avg_speed_2.head())
Let's next combine the results into two dataframes: one for the traffic towards Helsinki and one for the traffic away from Helsinki.
In [23]:
weofin_1 = weofin_traffic_1.join(weofin_avg_speed_1, how='inner')
weofin_2 = weofin_traffic_2.join(weofin_avg_speed_2, how='inner')
In [24]:
weofin_1.plot( figsize = (20,10),title ='Traffic towards Espoo, 5 min sampling interval');
weofin_2.plot( figsize = (20,10),title ='Traffic towards Helsinki, 5 min sampling interval');
Based on visual inspection, the weekly distributions seem to show small variance, except for the average speed.
Let's next inspect the distribution of one random week, say week 33:
In [25]:
weofin_1[weofin_1.index.week == 33].plot(figsize = (20,10),title ='Traffic towards Espoo, 5 min sampling interval');
weofin_2[weofin_2.index.week == 33].plot(figsize = (20,10),title ='Traffic towards Helsinki, 5 min sampling interval');
There seems to be a great resemblance between the days, so let's build an average working-day distribution out of all working days in the week interval 33-43.
In [26]:
weofin_1['weekday']= weofin_1.index.weekday
weofin_2['weekday']= weofin_2.index.weekday
weofin_1.head()
Out[26]:
In [27]:
weofin_1['time']= weofin_1.index.time
weofin_2['time']= weofin_2.index.time
In [28]:
weofin_1.head()
Out[28]:
In [29]:
avg_distr_weofin_1 = weofin_1[weofin_1['weekday'] < 5].drop(['weekday'],axis=1).groupby(weofin_1.time).mean()
avg_distr_weofin_2 = weofin_2[weofin_2['weekday'] < 5].drop(['weekday'],axis=1).groupby(weofin_2.time).mean()
In [67]:
avg_distr_weofin_1.index
Out[67]:
In [63]:
xinterval = pd.date_range('1/1/2011', periods=24, freq='H').time.astype(str)
In [64]:
avg_distr_weofin_1.plot(xticks = xinterval,rot=45, figsize = (20,10),grid=True,title ='Traffic towards Espoo. Average distribution with 5 min sampling interval');
avg_distr_weofin_2.plot(xticks = xinterval,rot=45, figsize = (20,10),grid=True,title ='Traffic towards Helsinki. Average distribution with 5 min sampling interval');
Some manual measurements were also made on 25.10.2017:
| | Amount | Light traffic | Heavy traffic |
|---|---|---|---|
| Western junction 8.34-8.44 | | | |
| To Lauttasaari from Espoo | 74 | 73 | 1 |
| From Lauttasaari to Espoo | 43 | 41 | 2 |
| Eastern junction 8.57-9.07 | | | |
| To Lauttasaari from Espoo | 104 | 103 | 1 |
| From Helsinki to Lauttasaari | 13 | 13 | 0 |
| From Lauttasaari to Espoo | 72 | 72 | 0 |
| From Lauttasaari to Helsinki | 8 | 8 | 0 |
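For later comparison with the LAM data, these manual counts can be captured in a small DataFrame; an illustrative sketch (the junction and column labels are shortened from the table above):
manual = pd.DataFrame(
    [['West', 'To Lauttasaari from Espoo', 74, 73, 1],
     ['West', 'From Lauttasaari to Espoo', 43, 41, 2],
     ['East', 'To Lauttasaari from Espoo', 104, 103, 1],
     ['East', 'From Helsinki to Lauttasaari', 13, 13, 0],
     ['East', 'From Lauttasaari to Espoo', 72, 72, 0],
     ['East', 'From Lauttasaari to Helsinki', 8, 8, 0]],
    columns=['junction', 'route', 'amount', 'light traffic', 'heavy traffic'])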
Let's next get the actual data for that day.
In [32]:
pd.to_datetime('25.10.2017')
Out[32]:
In [33]:
pd.to_datetime('25.10.2017') - pd.to_datetime('1.1.2017')
Out[33]:
In [39]:
# The difference is 297 days, so October 25th is the 298th day of the year
sample_day = get_day(2017,298)
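As a quick cross-check, pandas can give the ordinal day directly:
pd.Timestamp('2017-10-25').dayofyear  # 298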
In [40]:
sample_day.head()
Out[40]:
In [41]:
valid = sample_day['faulty (0 = validi record, 1=faulty record)'] == 0
valid.value_counts()
Out[41]:
In [42]:
sample_day[valid].info()
In [43]:
sample_day = sample_day[valid]
sample_day = sample_day.drop(['faulty (0 = validi record, 1=faulty record)'],axis=1)
In [45]:
unnecessary = ['year','päivän järjestysnumero', 'hour','minute','second','1/1000 second','length (m)']
sample_day['date-time'] = pd.Timestamp('2017-01-01')
sample_day['date-time'] = (sample_day['date-time'] + (sample_day['päivän järjestysnumero'] - pd.Timedelta('1 days'))
                           + sample_day['hour'] + sample_day['minute'] + sample_day['second'] + sample_day['1/1000 second'])
sample_day = sample_day.drop(unnecessary,axis=1)
In [46]:
sample_day = sample_day.set_index('date-time')
sample_day.head()
Out[46]:
In [47]:
sample_day['heavy traffic'] = sample_day['vehicle class'].apply(lambda x: 1 if x in heavy else 0)
sample_day['light traffic'] = sample_day['vehicle class'].apply(lambda x: 1 if x in light else 0)
sample_day = sample_day.drop(['vehicle class'], axis = 1)
sample_day.head()
Out[47]:
In [48]:
interval_1 = sample_day[(sample_day.index <= pd.to_datetime('2017-10-25 08:44:00.000')) & (sample_day.index >= pd.to_datetime('2017-10-25 08:34:00.000'))]
interval_1
Out[48]:
In [49]:
interval_2 = sample_day[(sample_day.index <= pd.to_datetime('2017-10-25 09:07:00.000')) & (sample_day.index >= pd.to_datetime('2017-10-25 08:57:00.000'))]
interval_2
Out[49]:
In [57]:
#8.34-8.44
#Towards Helsinki
i1_tohel = interval_1[interval_1['direction'] == 2].sum()
print("Light traffic towards Helsinki (8.34-8.44): ")
print(i1_tohel['light traffic'])
print("Heavy traffic towards Helsinki (8.34-8.44): ")
print(i1_tohel['heavy traffic'])
In [61]:
#Western junction 8.34-8.44
#Towards Espoo
i1_toesp = interval_1[interval_1['direction'] == 1].sum()
print('Light traffic towards Espoo (8.34-8.44):')
print((i1_toesp['light traffic']))
print('Heavy traffic towards Espoo (8.34-8.44):')
print((i1_toesp['heavy traffic']))
In [62]:
#8.57-9.07
i2_tohel = interval_2[interval_2['direction'] == 2].sum()
i2_toesp = interval_2[interval_2['direction'] == 1].sum()
print('Light traffic towards Espoo (8.57-9.07):')
print((i2_toesp['light traffic']))
print('Heavy traffic towards Espoo (8.57-9.07):')
print((i2_toesp['heavy traffic']))
print('Light traffic towards Helsinki (8.57-9.07):')
print((i2_tohel['light traffic']))
print('Heavy traffic towards Helsinki (8.57-9.07):')
print((i2_tohel['heavy traffic']))
With the help of the observations made at the junctions and the information about the total traffic in both directions, probabilities/proportions for which junctions are taken can be calculated. Some proportions are exact within the measured interval, while others have to be approximated based on the two time intervals. From those figures, extrapolations about traffic behaviour can be made for any 5 min interval in the average distribution calculated above. This calculation is made in a separate .tex document.
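To illustrate the idea (a hypothetical sketch only, not the actual calculation from the .tex document): one such proportion is the share of vehicles heading towards Helsinki that turn off to Lauttasaari at the western junction, estimated from the manual count and the LAM totals for the same 8.34-8.44 window. This assumes direction 2 corresponds to traffic arriving from Espoo, as in the plots above.
# Hypothetical: manual count of 74 vehicles to Lauttasaari from Espoo (western junction)
# divided by the LAM total towards Helsinki in the same interval
lam_towards_helsinki = i1_tohel['light traffic'] + i1_tohel['heavy traffic']
p_to_lauttasaari = 74 / lam_towards_helsinki
print("Estimated share exiting to Lauttasaari: {:.2f}".format(p_to_lauttasaari))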
In [169]:
avg_distr_weofin_1.to_csv('data/dist_to_espoo.csv')
avg_distr_weofin_2.to_csv('data/dist_to_helsinki.csv')