Bike Availability Preprocessing

Data Dictionary

The raw data contains the following fields per station per reading:

  • Id - String - API Resource Id
  • Name - String - The common name of the station
  • PlaceType - String ?
  • TerminalName - String - ?
  • NbBikes - Integer - The number of available bikes
  • NbDocks - Integer - The total number of docking spaces
  • NbEmptyDocks - Integer - The number of available empty docking spaces
  • Timestamp - DateTime - The moment this reading was captured
  • InstallDate - DateTime - Date when the station was installed
  • RemovalDate - DateTime - Date when the station was removed
  • Installed - Boolean - If the station is installed or not
  • Locked - Boolean - ?
  • Temporary - Boolean - If the station is temporary or not (TfL adds temporary stations to cope with demand.)
  • Latitude - Float - Latitude Coordinate
  • Longitude - Float - Longitude Coordinate

The following variables will be derived from the raw data.

  • NbUnusableDocks - Integer - The number of non-working docking spaces. Computed with NbUnusableDocks = NbDocks - (NbBikes + NbEmptyDocks)
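
As a quick sanity check, the derived column can be computed on a small, made-up frame (the values below are illustrative, not real readings):

```python
import pandas as pd

# hypothetical readings: in the second row, one dock holds no bike yet is
# not reported empty, so it must be out of service
readings = pd.DataFrame({
    'NbBikes': [11, 5],
    'NbDocks': [19, 20],
    'NbEmptyDocks': [8, 14],
})

readings['NbUnusableDocks'] = readings['NbDocks'] - (readings['NbBikes'] + readings['NbEmptyDocks'])
```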

Set up

Imports


In [1]:
%matplotlib inline

import logging
import itertools
import json
import os
import pickle
import folium
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from mpl_toolkits.basemap import Basemap
from datetime import datetime
from os import listdir
from os.path import isfile, join
from IPython.display import Image
from datetime import date

from src.data.parse_dataset import parse_dir, parse_json_files, get_file_list
from src.data.string_format import format_name, to_short_name
from src.data.visualization import lon_min_longitude, lon_min_latitude, lon_max_longitude, lon_max_latitude, lon_center_latitude, lon_center_longitude, create_london_map

logger = logging.getLogger()
logger.setLevel(logging.INFO)

Parse Raw Data

Define the Parsing Functions


In [2]:
def parse_cycles(json_obj):
    """Parses TfL's BikePoint JSON response"""

    return [parse_station(element) for element in json_obj]

def parse_station(element):
    """Parses a JSON bicycle station object to a dictionary"""

    obj = {
        'Id': element['id'],
        'Name': element['commonName'],
        'Latitude': element['lat'],
        'Longitude': element['lon'],
        'PlaceType': element['placeType'],
    }

    for p in element['additionalProperties']:
        obj[p['key']] = p['value']

        if 'Timestamp' not in obj:
            obj['Timestamp'] = p['modified']
        elif obj['Timestamp'] != p['modified']:
            raise ValueError('The properties\' timestamps for station %s do not match: %s != %s' % (
                obj['Id'], obj['Timestamp'], p['modified']))

    return obj

In [3]:
def bike_file_date_fn(file_name):
    """Gets the file's date"""

    return datetime.strptime(os.path.basename(file_name), 'BIKE-%Y-%m-%d:%H:%M:%S.json')

def create_between_dates_filter(file_date_fn, date_start, date_end):
    def filter_fn(file_name):
        file_date = file_date_fn(file_name)
        return date_start <= file_date <= date_end
    
    return filter_fn
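
The two helpers compose as follows; both are restated so the example runs standalone, and the file names are made up:

```python
import os
from datetime import datetime

def bike_file_date_fn(file_name):
    """Extracts the capture datetime encoded in the file name."""
    return datetime.strptime(os.path.basename(file_name), 'BIKE-%Y-%m-%d:%H:%M:%S.json')

def create_between_dates_filter(file_date_fn, date_start, date_end):
    """Builds a predicate that keeps files captured within [date_start, date_end]."""
    def filter_fn(file_name):
        return date_start <= file_date_fn(file_name) <= date_end
    return filter_fn

# keep only files captured on the morning of 16 May 2016
morning_filter = create_between_dates_filter(bike_file_date_fn,
                                             datetime(2016, 5, 16, 7, 0, 0),
                                             datetime(2016, 5, 16, 12, 0, 0))
```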

Quick Data View

Load Single Day Data


In [4]:
filter_fn = create_between_dates_filter(bike_file_date_fn, 
                                       datetime(2016, 5, 16, 7, 0, 0),
                                       datetime(2016, 5, 16, 23, 59, 59))

records = parse_dir('/home/jfconavarrete/Documents/Work/Dissertation/spts-uoe/data/raw/cycles', 
                    parse_cycles, sort_fn=bike_file_date_fn, filter_fn=filter_fn)

# records is a list of lists of dicts
df = pd.DataFrame(list(itertools.chain.from_iterable(records)))
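
Since records is a list of lists of dicts, chain.from_iterable flattens it one level before the DataFrame is built. A toy example:

```python
import itertools

# each parsed file yields a list of station dicts; flatten one level
records = [[{'Id': 'A'}, {'Id': 'B'}], [{'Id': 'C'}]]
flat = list(itertools.chain.from_iterable(records))
```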

All Station View


In [5]:
df.head()


Out[5]:
Id InstallDate Installed Latitude Locked Longitude Name NbBikes NbDocks NbEmptyDocks PlaceType RemovalDate Temporary TerminalName Timestamp
0 BikePoints_1 1278947280000 true 51.529163 false -0.109970 River Street , Clerkenwell 11 19 7 BikePoint false 001023 2016-05-16T06:26:24.037
1 BikePoints_2 1278585780000 true 51.499606 false -0.197574 Phillimore Gardens, Kensington 12 37 25 BikePoint false 001018 2016-05-16T06:26:24.037
2 BikePoints_3 1278240360000 true 51.521283 false -0.084605 Christopher Street, Liverpool Street 6 32 26 BikePoint false 001012 2016-05-16T06:51:27.5
3 BikePoints_4 1278241080000 true 51.530059 false -0.120973 St. Chad's Street, King's Cross 14 23 9 BikePoint false 001013 2016-05-16T06:51:27.5
4 BikePoints_5 1278241440000 true 51.493130 false -0.156876 Sedding Street, Sloane Square 27 27 0 BikePoint false 003420 2016-05-16T06:46:27.237

Single Station View


In [6]:
df[df['Id'] == 'BikePoints_1'].head()


Out[6]:
Id InstallDate Installed Latitude Locked Longitude Name NbBikes NbDocks NbEmptyDocks PlaceType RemovalDate Temporary TerminalName Timestamp
0 BikePoints_1 1278947280000 true 51.529163 false -0.10997 River Street , Clerkenwell 11 19 7 BikePoint false 001023 2016-05-16T06:26:24.037
762 BikePoints_1 1278947280000 true 51.529163 false -0.10997 River Street , Clerkenwell 11 19 7 BikePoint false 001023 2016-05-16T06:26:24.037
1524 BikePoints_1 1278947280000 true 51.529163 false -0.10997 River Street , Clerkenwell 10 19 8 BikePoint false 001023 2016-05-16T07:01:29.163
2286 BikePoints_1 1278947280000 true 51.529163 false -0.10997 River Street , Clerkenwell 8 19 10 BikePoint false 001023 2016-05-16T07:11:30.433
3048 BikePoints_1 1278947280000 true 51.529163 false -0.10997 River Street , Clerkenwell 8 19 10 BikePoint false 001023 2016-05-16T07:11:30.433

Observations

  • There are some duplicate rows <- remove duplicates
  • RemovalDate may contain a lot of nulls <- remove if not helpful
  • Locked and Installed might be constant <- remove if not helpful

Build Dataset

Work with Chunks

Due to memory constraints, we'll parse the data in chunks. In each chunk we'll remove the redundant candidate keys as well as any duplicate rows.


In [7]:
def chunker(seq, size):
    return (seq[pos:pos + size] for pos in range(0, len(seq), size))
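
A quick check of the batching behaviour (the function is restated so the example runs standalone):

```python
def chunker(seq, size):
    """Yields consecutive slices of at most `size` elements."""
    return (seq[pos:pos + size] for pos in range(0, len(seq), size))

# seven items in batches of three: two full batches and one remainder
batches = list(chunker(list(range(7)), 3))
```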

Tables

We will build two tables: one for the stations and one for the availability readings.


In [8]:
def split_data(parsed_data):
    master_df = pd.DataFrame(list(itertools.chain.from_iterable(parsed_data)))
    
    readings_df = pd.DataFrame(master_df, columns=['Id', 'Timestamp', 'NbBikes', 'NbDocks', 'NbEmptyDocks'])
    stations_df = pd.DataFrame(master_df, columns=['Id', 'Name', 'TerminalName' , 'PlaceType', 'Latitude', 
                                                   'Longitude', 'Installed', 'Temporary', 'Locked',
                                                   'RemovalDate', 'InstallDate'])
    
    return (readings_df, stations_df)

Build the Dataset


In [ ]:
# get the files to parse
five_weekdays_filter = create_between_dates_filter(bike_file_date_fn, 
                                                   datetime(2016, 6, 19, 0, 0, 0), 
                                                   datetime(2016, 6, 27, 23, 59, 59))

files = get_file_list('data/raw/cycles', filter_fn=None, sort_fn=bike_file_date_fn)

# process the files in chunks
files_batches = chunker(files, 500)

In [ ]:
# start with an empty dataset
readings_dataset = pd.DataFrame()
stations_dataset = pd.DataFrame()

# append each chunk to the datasets while removing duplicates
for batch in files_batches:
    parsed_data = parse_json_files(batch, parse_cycles)
    
    # split the data into station data and readings data
    readings_df, stations_df = split_data(parsed_data)
    
    # append the datasets
    readings_dataset = pd.concat([readings_dataset, readings_df])
    stations_dataset = pd.concat([stations_dataset, stations_df])
    
    # remove duplicated rows
    readings_dataset.drop_duplicates(inplace=True)
    stations_dataset.drop_duplicates(inplace=True)

In [ ]:
# put the parsed data in pickle files
pickle.dump(readings_dataset, open("data/parsed/readings_dataset_raw.p", "wb"))
pickle.dump(stations_dataset, open("data/parsed/stations_dataset_raw.p", "wb"))

Read the Parsed Data


In [9]:
stations_dataset = pickle.load(open('data/parsed/stations_dataset_raw.p', 'rb'))
readings_dataset = pickle.load(open('data/parsed/readings_dataset_raw.p', 'rb'))

Technically Correct Data

The data is said to be technically correct if it:

  1. can be directly recognized as belonging to a certain variable, and
  2. is stored in a data type that represents the value domain of the real-world variable.

In [10]:
# convert columns to their appropriate datatypes
stations_dataset['InstallDate'] = pd.to_numeric(stations_dataset['InstallDate'], errors='raise')
stations_dataset['RemovalDate'] = pd.to_numeric(stations_dataset['RemovalDate'], errors='raise')

stations_dataset['Installed'].replace({'true': True, 'false': False}, inplace=True)
stations_dataset['Temporary'].replace({'true': True, 'false': False}, inplace=True)
stations_dataset['Locked'].replace({'true': True, 'false': False}, inplace=True)

readings_dataset['NbBikes'] = readings_dataset['NbBikes'].astype('uint16')
readings_dataset['NbDocks'] = readings_dataset['NbDocks'].astype('uint16')
readings_dataset['NbEmptyDocks'] = readings_dataset['NbEmptyDocks'].astype('uint16')
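
This downcasting matters at this scale: dock counts fit comfortably in 16 bits, and with roughly 1.5 million readings each uint16 column takes a quarter of the space of an int64 one. A minimal illustration on made-up counts:

```python
import pandas as pd

# dock counts are small non-negative integers, so 16 bits is ample
counts = pd.Series([0, 11, 64], dtype='int64')
compact = counts.astype('uint16')
```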

In [11]:
# format station name
stations_dataset['Name'] = stations_dataset['Name'].apply(format_name)

In [12]:
# convert string timestamp to datetime
stations_dataset['InstallDate'] = pd.to_datetime(stations_dataset['InstallDate'], unit='ms', errors='raise')
stations_dataset['RemovalDate'] = pd.to_datetime(stations_dataset['RemovalDate'], unit='ms', errors='raise')

readings_dataset['Timestamp'] = pd.to_datetime(readings_dataset['Timestamp'], format='%Y-%m-%dT%H:%M:%S.%f', errors='raise').dt.tz_localize('UTC')
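
The raw timestamps carry no timezone information, so tz_localize pins them to UTC rather than converting them. The step can be checked on a single sample value:

```python
import pandas as pd

# a naive timestamp string in the API's format, localized to UTC
raw = pd.Series(['2016-05-16T06:26:24.037'])
ts = pd.to_datetime(raw, format='%Y-%m-%dT%H:%M:%S.%f').dt.tz_localize('UTC')
```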

In [13]:
# sort the datasets
stations_dataset.sort_values(by=['Id'], ascending=True, inplace=True)

readings_dataset.sort_values(by=['Timestamp'], ascending=True, inplace=True)

Derive Data


In [14]:
stations_dataset['ShortName'] = stations_dataset['Name'].apply(to_short_name)

readings_dataset['NbUnusableDocks'] = readings_dataset['NbDocks'] - (readings_dataset['NbBikes'] + readings_dataset['NbEmptyDocks'])

Add Station Priority Column

Priorities downloaded from https://www.whatdotheyknow.com/request/tfl_boris_bike_statistics?unfold=1


In [15]:
stations_priorities = pd.read_csv('data/raw/priorities/station_priorities.csv', encoding='latin-1')
stations_priorities['Site'] = stations_priorities['Site'].apply(format_name)

In [16]:
stations_dataset = pd.merge(stations_dataset, stations_priorities, how='left', left_on='ShortName', right_on='Site')
stations_dataset['Priority'].replace({'One': '1', 'Two': '2', 'Long Term Suspended': np.NaN, 'Long term suspension': np.NaN}, inplace=True)
stations_dataset.drop(['Site'], axis=1, inplace=True)
stations_dataset.drop(['Borough'], axis=1, inplace=True)

In [17]:
stations_dataset


Out[17]:
Id Name TerminalName PlaceType Latitude Longitude Installed Temporary Locked RemovalDate InstallDate ShortName Priority
0 BikePoints_1 River Street, Clerkenwell 001023 BikePoint 51.529163 -0.109970 True False False NaT 2010-07-12 15:08:00 River Street 2
1 BikePoints_10 Park Street, Bankside 001024 BikePoint 51.505974 -0.092754 True False False NaT 2010-07-04 11:21:00 Park Street 2
2 BikePoints_100 Albert Embankment, Vauxhall 001059 BikePoint 51.490435 -0.122806 True False False NaT 2010-07-14 09:31:00 Albert Embankment 2
3 BikePoints_101 Queen Street 1, Bank 000999 BikePoint 51.511553 -0.092940 True False False NaT 2010-07-14 10:18:00 Queen Street 1
4 BikePoints_102 Jewry Street, Aldgate 001045 BikePoint 51.513406 -0.076793 True False False NaT 2010-07-14 10:21:00 Jewry Street 2
5 BikePoints_103 Vicarage Gate, Kensington 003441 BikePoint 51.504723 -0.192538 True False False NaT 2010-07-14 10:32:00 Vicarage Gate 2
6 BikePoints_104 Crosswall, Tower 000991 BikePoint 51.511594 -0.077121 True False False NaT 2010-07-14 10:36:00 Crosswall 1
7 BikePoints_105 Westbourne Grove, Bayswater 001041 BikePoint 51.515529 -0.190240 True False False NaT 2010-07-14 11:02:00 Westbourne Grove 2
8 BikePoints_106 Woodstock Street, Mayfair 001042 BikePoint 51.514105 -0.147301 True False False NaT 2010-07-14 11:28:00 Woodstock Street 2
9 BikePoints_107 Finsbury Leisure Centre, St. Lukes 001049 BikePoint 51.526008 -0.096317 True False False NaT 2010-07-14 11:38:00 Finsbury Leisure Centre 2
10 BikePoints_108 Abbey Orchard Street, Westminster 003429 BikePoint 51.498125 -0.132102 True False False NaT 2010-07-14 11:42:00 Abbey Orchard Street 1
11 BikePoints_109 Soho Square, Soho 001052 BikePoint 51.515631 -0.132328 True False False NaT 2010-07-14 11:52:00 Soho Square 1
12 BikePoints_11 Brunswick Square, Bloomsbury 001022 BikePoint 51.523951 -0.122502 True False False NaT 2010-07-05 14:34:00 Brunswick Square 2
13 BikePoints_110 Wellington Road, St. Johns Wood 001055 BikePoint 51.533043 -0.172528 True False False NaT 2010-07-14 12:02:00 Wellington Road 2
14 BikePoints_111 Park Lane, Hyde Park 001037 BikePoint 51.510017 -0.157275 True False False NaT 2010-07-14 12:06:00 Park Lane 2
15 BikePoints_112 Stonecutter Street, Holborn 001061 BikePoint 51.515809 -0.105270 True False False NaT 2010-07-14 13:51:00 Stonecutter Street 2
16 BikePoints_113 Gloucester Road (Central), South Kensington 003435 BikePoint 51.496462 -0.183289 True False False NaT 2010-07-14 14:10:00 Gloucester Road (Central) 2
17 BikePoints_114 Park Road (Baker Street), The Regents Park 001050 BikePoint 51.524517 -0.158963 True False False NaT 2010-07-14 15:05:00 Park Road (Baker Street) 2
18 BikePoints_115 Braham Street, Aldgate 001062 BikePoint 51.514233 -0.073537 True False False NaT 2010-07-14 15:18:00 Braham Street 2
19 BikePoints_116 Little Argyll Street, West End 000995 BikePoint 51.514499 -0.141423 True False False NaT 2010-07-14 15:46:00 Little Argyll Street 1
20 BikePoints_117 Lollard Street, Vauxhall 000998 BikePoint 51.492880 -0.114934 True False False NaT 2010-07-14 16:17:00 Lollard Street 2
21 BikePoints_118 Rochester Row, Westminster 003457 BikePoint 51.495827 -0.135478 True False False NaT 2010-07-14 16:22:00 Rochester Row 2
22 BikePoints_119 Bath Street, St. Lukes 000964 BikePoint 51.525893 -0.090847 True False False NaT 2010-07-14 16:26:00 Bath Street 2
23 BikePoints_12 Malet Street, Bloomsbury 000980 BikePoint 51.521680 -0.130431 True False False NaT 2010-07-05 14:37:00 Malet Street 1
24 BikePoints_120 The Guildhall, Guildhall 001044 BikePoint 51.515735 -0.093080 True False False NaT 2010-07-15 09:44:00 The Guildhall 2
25 BikePoints_121 Baker Street, Marylebone 001086 BikePoint 51.518913 -0.156166 True False False NaT 2010-07-15 10:20:00 Baker Street 2
26 BikePoints_122 Norton Folgate, Liverpool Street 001068 BikePoint 51.521113 -0.078869 True False False NaT 2010-07-15 10:34:00 Norton Folgate 2
27 BikePoints_123 St. John Street, Finsbury 000992 BikePoint 51.528360 -0.104724 True False False NaT 2010-07-15 10:55:00 St. John Street 2
28 BikePoints_124 Eaton Square, Belgravia 001069 BikePoint 51.496544 -0.150905 True False False NaT 2010-07-15 10:59:00 Eaton Square 2
29 BikePoints_125 Borough High Street, The Borough 000996 BikePoint 51.500694 -0.094524 True False False NaT 2010-07-15 11:10:00 Borough High Street 2
... ... ... ... ... ... ... ... ... ... ... ... ... ...
759 BikePoints_808 Stockwell Roundabout, Stockwell 300207 BikePoint 51.473486 -0.122555 True False False NaT NaT Stockwell Roundabout NaN
760 BikePoints_809 Lincolns Inn Fields, Holborn 300240 BikePoint 51.516277 -0.118272 True False False NaT NaT Lincolns Inn Fields NaN
761 BikePoints_81 Great Titchfield Street, Fitzrovia 003450 BikePoint 51.520253 -0.141327 True False False NaT 2010-07-13 09:23:00 Great Titchfield Street 2
762 BikePoints_810 Tate Modern, Bankside 300237 BikePoint 51.506725 -0.098807 True False False NaT 2016-06-03 08:40:00 Tate Modern NaN
763 BikePoints_811 Westferry Circus, Canary Wharf 300228 BikePoint 51.505703 -0.027772 True False False NaT NaT Westferry Circus NaN
764 BikePoints_814 Clapham Road, Lingham Street, Stockwell 300245 BikePoint 51.471433 -0.123670 True False False NaT 2016-06-02 12:21:00 Clapham Road NaN
765 BikePoints_814 Clapham Road, Lingham Street, Stockwell 300245 BikePoint 51.471433 -0.123670 True False False NaT NaT Clapham Road NaN
766 BikePoints_815 Lambeth Palace Road, Waterloo 300231 BikePoint 51.500089 -0.116628 True False False NaT 2016-05-04 10:28:00 Lambeth Palace Road NaN
767 BikePoints_817 Riverlight South, Nine Elms 300232 BikePoint 51.481335 -0.138212 True False False NaT 2016-06-03 10:19:00 Riverlight South NaN
768 BikePoints_818 One Tower Bridge, Bermondsey 300249 BikePoint 51.503127 -0.078655 True False False NaT NaT One Tower Bridge NaN
769 BikePoints_818 One Tower Bridge, Southwark 300249 BikePoint 51.503127 -0.078655 True False False NaT NaT One Tower Bridge NaN
770 BikePoints_82 Chancery Lane, Holborn 003453 BikePoint 51.514274 -0.111257 True False False NaT 2010-07-13 10:08:00 Chancery Lane 2
771 BikePoints_83 Panton Street, West End 003452 BikePoint 51.509639 -0.131510 True False False NaT 2010-07-13 10:10:00 Panton Street 2
772 BikePoints_84 Breams Buildings, Holborn 003449 BikePoint 51.515937 -0.111778 True False False NaT 2010-07-13 11:24:00 Breams Buildings 2
773 BikePoints_85 Tanner Street, Bermondsey 000994 BikePoint 51.500647 -0.078600 True False False NaT 2010-07-13 13:01:00 Tanner Street 2
774 BikePoints_86 Sancroft Street, Vauxhall 003434 BikePoint 51.489479 -0.115156 True False False NaT 2010-07-13 13:19:00 Sancroft Street 2
775 BikePoints_87 Devonshire Square, Liverpool Street 003438 BikePoint 51.516468 -0.079684 True False False NaT 2010-07-13 13:28:00 Devonshire Square 2
776 BikePoints_88 Bayley Street, Bloomsbury 001006 BikePoint 51.518587 -0.132053 True False False NaT 2010-07-13 13:38:00 Bayley Street 2
777 BikePoints_89 Tavistock Place, Bloomsbury 003439 BikePoint 51.526250 -0.123509 True False False NaT 2010-07-13 13:56:00 Tavistock Place 2
778 BikePoints_9 New Globe Walk, Bankside 001015 BikePoint 51.507385 -0.096440 True False False NaT 2010-07-04 11:19:00 New Globe Walk 2
779 BikePoints_90 Harrington Square 1, Camden Town 001038 BikePoint 51.533019 -0.139174 True False False NaT 2010-07-13 14:07:00 Harrington Square 2
780 BikePoints_91 Walnut Tree Walk, Vauxhall 001076 BikePoint 51.493686 -0.111014 True False False NaT 2010-07-13 15:59:00 Walnut Tree Walk 2
781 BikePoints_92 Borough Road, Elephant and Castle 001082 BikePoint 51.498898 -0.100440 True False False NaT 2010-07-13 16:04:00 Borough Road 2
782 BikePoints_93 Cloudesley Road, Angel 002586 BikePoint 51.534408 -0.109025 True False False NaT 2010-07-13 16:16:00 Cloudesley Road 2
783 BikePoints_94 Bricklayers Arms, Borough 001070 BikePoint 51.495061 -0.085814 True False False NaT 2010-07-13 16:27:00 Bricklayers Arms 2
784 BikePoints_95 Aldersgate Street, Barbican 001065 BikePoint 51.520841 -0.097340 True False False NaT 2010-07-14 08:36:00 Aldersgate Street 1
785 BikePoints_96 Falkirk Street, Hoxton 001047 BikePoint 51.530950 -0.078505 True False False NaT 2010-07-14 08:43:00 Falkirk Street 2
786 BikePoints_97 Gloucester Road (North), Kensington 003447 BikePoint 51.497924 -0.183834 True False False NaT 2010-07-14 08:53:00 Gloucester Road (North) 2
787 BikePoints_98 Hampstead Road, Euston 000972 BikePoint 51.525542 -0.138231 True False False NaT 2010-07-14 09:18:00 Hampstead Road 2
788 BikePoints_99 Old Quebec Street, Marylebone 001085 BikePoint 51.514577 -0.158264 True False False NaT 2010-07-14 09:28:00 Old Quebec Street 2

789 rows × 13 columns

Consistent Data

Stations Analysis

Overview


In [18]:
stations_dataset.shape


Out[18]:
(789, 13)

In [19]:
stations_dataset.info(memory_usage='deep')


<class 'pandas.core.frame.DataFrame'>
Int64Index: 789 entries, 0 to 788
Data columns (total 13 columns):
Id              789 non-null object
Name            789 non-null object
TerminalName    789 non-null object
PlaceType       789 non-null object
Latitude        789 non-null float64
Longitude       789 non-null float64
Installed       789 non-null bool
Temporary       789 non-null bool
Locked          789 non-null bool
RemovalDate     3 non-null datetime64[ns]
InstallDate     690 non-null datetime64[ns]
ShortName       789 non-null object
Priority        734 non-null object
dtypes: bool(3), datetime64[ns](2), float64(2), object(6)
memory usage: 517.0 KB

In [20]:
stations_dataset.head()


Out[20]:
Id Name TerminalName PlaceType Latitude Longitude Installed Temporary Locked RemovalDate InstallDate ShortName Priority
0 BikePoints_1 River Street, Clerkenwell 001023 BikePoint 51.529163 -0.109970 True False False NaT 2010-07-12 15:08:00 River Street 2
1 BikePoints_10 Park Street, Bankside 001024 BikePoint 51.505974 -0.092754 True False False NaT 2010-07-04 11:21:00 Park Street 2
2 BikePoints_100 Albert Embankment, Vauxhall 001059 BikePoint 51.490435 -0.122806 True False False NaT 2010-07-14 09:31:00 Albert Embankment 2
3 BikePoints_101 Queen Street 1, Bank 000999 BikePoint 51.511553 -0.092940 True False False NaT 2010-07-14 10:18:00 Queen Street 1
4 BikePoints_102 Jewry Street, Aldgate 001045 BikePoint 51.513406 -0.076793 True False False NaT 2010-07-14 10:21:00 Jewry Street 2

In [21]:
stations_dataset.describe()


Out[21]:
Latitude Longitude
count 789.000000 789.000000
mean 51.440649 -0.128648
std 1.833769 0.056337
min 0.000000 -0.236769
25% 51.493184 -0.173029
50% 51.509301 -0.131961
75% 51.520858 -0.092762
max 51.549369 0.122299

In [22]:
stations_dataset.apply(lambda x:x.nunique())


Out[22]:
Id              780
Name            782
TerminalName    780
PlaceType         1
Latitude        778
Longitude       778
Installed         2
Temporary         1
Locked            2
RemovalDate       3
InstallDate     686
ShortName       770
Priority          2
dtype: int64

In [23]:
stations_dataset.isnull().sum()


Out[23]:
Id                0
Name              0
TerminalName      0
PlaceType         0
Latitude          0
Longitude         0
Installed         0
Temporary         0
Locked            0
RemovalDate     786
InstallDate      99
ShortName         0
Priority         55
dtype: int64

Observations:

  • Id, Name and TerminalName seem to be candidate keys
  • The minimum latitude and the maximum longitude are 0
  • Some stations share a latitude or longitude value
  • Id, TerminalName and Name have different numbers of unique values
  • PlaceType, Installed, Temporary and Locked appear to be constant
  • Some stations do not have an install date
  • Some stations have a removal date (very sparse)

Remove Duplicate Stations


In [24]:
def find_duplicate_ids(df):
    """Find Ids that have more than one value in the given columns"""
    
    df = df.drop_duplicates()
    value_counts_grouped_by_id = df.groupby('Id').count()    
    is_duplicate_id = value_counts_grouped_by_id.applymap(lambda x: x > 1).any(axis=1)
    duplicate_ids = value_counts_grouped_by_id[is_duplicate_id].index.values
    return df[df['Id'].isin(duplicate_ids)]

duplicate_ids = find_duplicate_ids(stations_dataset)
duplicate_ids


Out[24]:
Id Name TerminalName PlaceType Latitude Longitude Installed Temporary Locked RemovalDate InstallDate ShortName Priority
150 BikePoints_237 Dock Street, Wapping 003467 BikePoint 51.509786 -0.068161 True False False NaT 2010-07-22 11:44:00 Dock Street NaN
151 BikePoints_237 Dock Street, Wapping 003467 BikePoint 51.509786 -0.068161 True False False NaT 2010-07-20 11:44:00 Dock Street NaN
417 BikePoints_497 Merchant Street, Bow 200242 BikePoint 51.526535 -0.028619 True False False NaT 2012-01-24 09:47:00 Merchant Street NaN
418 BikePoints_497 Merchant Street, Bow 200242 BikePoint 51.526177 -0.027467 True False False NaT 2012-01-24 09:47:00 Merchant Street NaN
726 BikePoints_780 Imperial Wharf Station 300070 BikePoint 51.474665 -0.183165 True False False NaT 2015-08-13 08:40:00 Imperial Wharf Station 2
727 BikePoints_780 Imperial Wharf Station, Sands End 300070 BikePoint 51.474665 -0.183165 True False False NaT 2015-08-13 08:40:00 Imperial Wharf Station 2
743 BikePoints_796 Coram Street, Bloomsbury 300201 BikePoint 51.524000 -0.126409 True False True NaT 2016-02-29 11:47:00 Coram Street NaN
744 BikePoints_796 Coram Street, Bloomsbury 300201 BikePoint 51.524000 -0.126409 True False False NaT 2016-02-29 11:47:00 Coram Street NaN
745 BikePoints_798 Birkenhead Street, Kings Cross 300212 BikePoint 51.530199 0.122299 True False False NaT NaT Birkenhead Street NaN
746 BikePoints_798 Birkenhead Street, Kings Cross 300212 BikePoint 51.530199 -0.122299 True False False NaT NaT Birkenhead Street NaN
747 BikePoints_799 Kings Gate House, Westminster 300202 BikePoint 51.497698 -0.137598 True False False NaT NaT Kings Gate House NaN
748 BikePoints_799 Kings Gate House, Westminster 300202 BikePoint 51.497698 -0.137598 True False False NaT 2016-06-02 14:08:00 Kings Gate House NaN
753 BikePoints_802 Albert Square, Stockwell 300209 BikePoint 51.476590 -0.118256 True False False NaT 2016-06-02 11:05:00 Albert Square NaN
754 BikePoints_802 Albert Square, Stockwell 300209 BikePoint 51.476590 -0.118256 True False False NaT NaT Albert Square NaN
764 BikePoints_814 Clapham Road, Lingham Street, Stockwell 300245 BikePoint 51.471433 -0.123670 True False False NaT 2016-06-02 12:21:00 Clapham Road NaN
765 BikePoints_814 Clapham Road, Lingham Street, Stockwell 300245 BikePoint 51.471433 -0.123670 True False False NaT NaT Clapham Road NaN
768 BikePoints_818 One Tower Bridge, Bermondsey 300249 BikePoint 51.503127 -0.078655 True False False NaT NaT One Tower Bridge NaN
769 BikePoints_818 One Tower Bridge, Southwark 300249 BikePoint 51.503127 -0.078655 True False False NaT NaT One Tower Bridge NaN

Given that these records share an Id but differ in Name, coordinates, or other attributes, we'll assume the station's details changed over time and remove the outdated entries.


In [25]:
# remove the one not in merchant street
stations_dataset.drop(417, inplace=True)

# remove the one with the shortest name
stations_dataset.drop(726, inplace=True)

# remove the one that is not in kings cross (as the name of the station implies)
stations_dataset.drop(745, inplace=True)

# remove the duplicated entries 
stations_dataset.drop([747, 743, 151, 754, 765, 768],  inplace=True)

In [26]:
# make sure there are no repeated ids 
assert len(find_duplicate_ids(stations_dataset)) == 0

Check Locations

Let's have a closer look at the station locations. All of them should be in Greater London.


In [27]:
def find_locations_outside_box(locations, min_longitude, min_latitude, max_longitude, max_latitude):
    latitude_check = ~((locations['Latitude'] >= min_latitude) & (locations['Latitude'] <= max_latitude))
    longitude_check = ~((locations['Longitude'] >= min_longitude) & (locations['Longitude'] <= max_longitude))
    return locations[latitude_check | longitude_check]

outlier_locations_df = find_locations_outside_box(stations_dataset, lon_min_longitude, lon_min_latitude, 
                                                  lon_max_longitude, lon_max_latitude)
outlier_locations_df


Out[27]:
Id Name TerminalName PlaceType Latitude Longitude Installed Temporary Locked RemovalDate InstallDate ShortName Priority
738 BikePoints_791 Test Desktop 666666 BikePoint 0.0 0.0 False False False NaT 2016-01-15 12:39:00 Test Desktop NaN

This station looks like a test station, so we'll remove it.


In [28]:
outlier_locations_idx = outlier_locations_df.index.values

stations_dataset.drop(outlier_locations_idx, inplace=True)

In [29]:
# make sure there are no stations outside London
assert len(find_locations_outside_box(stations_dataset, lon_min_longitude, lon_min_latitude, 
                                      lon_max_longitude, lon_max_latitude)) == 0

Next, let's investigate the stations that share a latitude or longitude value.


In [30]:
# find stations with duplicate longitude
id_counts_groupedby_longitude = stations_dataset.groupby('Longitude')['Id'].count()
nonunique_longitudes = id_counts_groupedby_longitude[id_counts_groupedby_longitude != 1].index.values
nonunique_longitude_stations = stations_dataset[stations_dataset['Longitude'].isin(nonunique_longitudes)].sort_values(by=['Longitude'])

id_counts_groupedby_latitude = stations_dataset.groupby('Latitude')['Id'].count()
nonunique_latitudes = id_counts_groupedby_latitude[id_counts_groupedby_latitude != 1].index.values
nonunique_latitudes_stations = stations_dataset[stations_dataset['Latitude'].isin(nonunique_latitudes)].sort_values(by=['Latitude'])

nonunique_coordinates_stations = pd.concat([nonunique_longitude_stations, nonunique_latitudes_stations])
nonunique_coordinates_stations


Out[30]:
Id Name TerminalName PlaceType Latitude Longitude Installed Temporary Locked RemovalDate InstallDate ShortName Priority
127 BikePoints_216 Old Brompton Road, South Kensington 003479 BikePoint 51.490945 -0.181190 True False False NaT 2010-07-19 11:12:00 Old Brompton Road 2
500 BikePoints_573 Limerston Street, West Chelsea 200001 BikePoint 51.485587 -0.181190 True False False NaT 2012-03-15 07:21:00 Limerston Street 2
120 BikePoints_21 Hampstead Road (Cartmel), Euston 003426 BikePoint 51.530078 -0.138846 True False False NaT 2010-07-06 14:49:00 Hampstead Road (Cartmel) 2
237 BikePoints_318 Sackville Street, Mayfair 001197 BikePoint 51.510048 -0.138846 True False False NaT 2010-07-23 11:42:00 Sackville Street 2
10 BikePoints_108 Abbey Orchard Street, Westminster 003429 BikePoint 51.498125 -0.132102 True False False NaT 2010-07-14 11:42:00 Abbey Orchard Street 1
554 BikePoints_624 Courland Grove, Wandsworth Road 200173 BikePoint 51.472918 -0.132102 True False False NaT 2013-10-08 09:24:00 Courland Grove 2
3 BikePoints_101 Queen Street 1, Bank 000999 BikePoint 51.511553 -0.092940 True False False NaT 2010-07-14 10:18:00 Queen Street 1
345 BikePoints_427 Cheapside, Bank 022180 BikePoint 51.513970 -0.092940 True False False NaT 2011-07-15 10:28:00 Cheapside 1
10 BikePoints_108 Abbey Orchard Street, Westminster 003429 BikePoint 51.498125 -0.132102 True False False NaT 2010-07-14 11:42:00 Abbey Orchard Street 1
393 BikePoints_474 Castalia Square, Cubitt Town 200155 BikePoint 51.498125 -0.011457 True False False NaT 2012-01-17 17:56:00 Castalia Square 2
386 BikePoints_468 Cantrell Road, Bow 200150 BikePoint 51.521564 -0.022694 True False False NaT 2012-01-12 13:42:00 Cantrell Road 2
476 BikePoints_550 Harford Street, Mile End 200102 BikePoint 51.521564 -0.039264 True False False NaT NaT Harford Street 2
247 BikePoints_327 New North Road 1, Hoxton 001128 BikePoint 51.530950 -0.085603 True False False NaT 2010-07-26 08:37:00 New North Road 2
785 BikePoints_96 Falkirk Street, Hoxton 001047 BikePoint 51.530950 -0.078505 True False False NaT 2010-07-14 08:43:00 Falkirk Street 2

In [31]:
def draw_stations_map(stations_df):    
    stations_map = create_london_map()

    for index, station in stations_df.iterrows():        
        folium.Marker([station['Latitude'],station['Longitude']], popup=station['Name']).add_to(stations_map)
    
    return stations_map

In [32]:
draw_stations_map(nonunique_coordinates_stations)


Out[32]:

We can observe that the stations are indeed distinct; sharing a latitude or longitude value is just a coincidence.
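
The coincidence can also be confirmed programmatically: checking the full (Latitude, Longitude) pair, rather than each axis alone, flags only stations at the exact same point. A sketch on made-up coordinates:

```python
import pandas as pd

# toy stations: two share a longitude, but none share the full coordinate pair
stations = pd.DataFrame({
    'Id': ['A', 'B', 'C'],
    'Latitude': [51.4909, 51.4856, 51.5300],
    'Longitude': [-0.18119, -0.18119, -0.13885],
})

# duplicated() over both columns marks only exact co-located stations
same_point = stations.duplicated(subset=['Latitude', 'Longitude'], keep=False)
```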

Let's plot the stations on a map (only the first MAX_RECORDS, to keep the map responsive) to see how they are distributed.


In [33]:
london_longitude = -0.127722
london_latitude = 51.507981

MAX_RECORDS = 100

stations_map = create_london_map()

for index, station in stations_dataset[0:MAX_RECORDS].iterrows():
    folium.Marker([station['Latitude'],station['Longitude']], popup=station['Name']).add_to(stations_map)
    
stations_map

#folium.Map.save(stations_map, 'reports/maps/stations_map.html')


Out[33]:

Readings Analysis

Overview


In [34]:
readings_dataset.shape


Out[34]:
(1529937, 6)

In [35]:
readings_dataset.info(memory_usage='deep')


<class 'pandas.core.frame.DataFrame'>
Int64Index: 1529937 entries, 750 to 292193
Data columns (total 6 columns):
Id                 1529937 non-null object
Timestamp          1529937 non-null datetime64[ns, UTC]
NbBikes            1529937 non-null uint16
NbDocks            1529937 non-null uint16
NbEmptyDocks       1529937 non-null uint16
NbUnusableDocks    1529937 non-null uint16
dtypes: datetime64[ns, UTC](1), object(1), uint16(4)
memory usage: 203.4 MB
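Most of that 203.4 MB comes from the object-typed `Id` column, since each row stores a full Python string. With only 780 distinct ids, converting the column to pandas' `category` dtype would replace the strings with small integer codes. A hedged sketch on a stand-in frame (not the notebook's actual data):

```python
import pandas as pd

# Stand-in frame: many rows, few distinct Id values, as in readings_dataset
df = pd.DataFrame({
    'Id': ['BikePoints_1'] * 500 + ['BikePoints_2'] * 500,
    'NbBikes': pd.Series([5] * 1000, dtype='uint16'),
})

before = df.memory_usage(deep=True).sum()
df['Id'] = df['Id'].astype('category')  # repeated strings become integer codes
after = df.memory_usage(deep=True).sum()
print(before, after)  # the category version is far smaller
```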

In [36]:
readings_dataset.head()


Out[36]:
Id Timestamp NbBikes NbDocks NbEmptyDocks NbUnusableDocks
750 BikePoints_791 2016-05-10 15:34:07.137000+00:00 0 0 0 0
608 BikePoints_646 2016-05-14 20:36:22.417000+00:00 0 0 0 0
570 BikePoints_608 2016-05-14 23:18:18.467000+00:00 14 29 15 0
666 BikePoints_704 2016-05-15 00:50:38.140000+00:00 9 18 9 0
634 BikePoints_672 2016-05-15 04:11:04.447000+00:00 28 28 0 0

In [37]:
readings_dataset.describe()


Out[37]:
NbBikes NbDocks NbEmptyDocks NbUnusableDocks
count 1.529937e+06 1.529937e+06 1.529937e+06 1.529937e+06
mean 1.239987e+01 2.701473e+01 1.404461e+01 5.702444e-01
std 9.076213e+00 9.518489e+00 9.577277e+00 9.018525e-01
min 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
25% 5.000000e+00 2.000000e+01 7.000000e+00 0.000000e+00
50% 1.100000e+01 2.500000e+01 1.300000e+01 0.000000e+00
75% 1.800000e+01 3.300000e+01 1.900000e+01 1.000000e+00
max 6.400000e+01 6.400000e+01 6.400000e+01 3.800000e+01

In [38]:
readings_dataset.apply(lambda x:x.nunique())


Out[38]:
Id                   780
Timestamp          13236
NbBikes               65
NbDocks               58
NbEmptyDocks          65
NbUnusableDocks       20
dtype: int64

In [39]:
readings_dataset.isnull().sum()


Out[39]:
Id                 0
Timestamp          0
NbBikes            0
NbDocks            0
NbEmptyDocks       0
NbUnusableDocks    0
dtype: int64

In [40]:
timestamps = readings_dataset['Timestamp']
ax = timestamps.groupby([timestamps.dt.year, timestamps.dt.month, timestamps.dt.day]).count().plot(kind="bar")
ax.set_xlabel('Date')
ax.set_title('Readings per Day')


Out[40]:
<matplotlib.text.Text at 0x7fb428d77f90>

Observations:

  • The number of readings in each day varies widely

Discard Out of Range Data


In [41]:
start_date = date(2016, 5, 15)
end_date = date(2016, 6, 27)

# the expected reading days: [start_date, end_date)
days = set(pd.date_range(start=start_date, end=end_date, closed='left'))

# keep only readings inside the same [start_date, end_date) window
readings_dataset = readings_dataset[(timestamps >= start_date) & (timestamps < end_date)]
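Note that comparing the tz-aware `Timestamp` column against plain `date` objects relies on implicit coercion that recent pandas versions reject with a `TypeError`. A safer sketch uses explicit tz-aware bounds (toy timestamps, but the same half-open window as above):

```python
import pandas as pd

# toy tz-aware timestamps: one before the window, one inside, one after
ts = pd.Series(
    pd.to_datetime(['2016-05-14 23:00', '2016-05-15 00:50', '2016-06-27 01:00']).tz_localize('UTC')
)

start = pd.Timestamp('2016-05-15', tz='UTC')
end = pd.Timestamp('2016-06-27', tz='UTC')

# keep [start, end), matching the closed='left' date range above
in_range = ts[(ts >= start) & (ts < end)]
print(len(in_range))  # 1
```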

Readings Consistency Through Days

Let's get some insight into which stations have no readings for an entire day


In [42]:
# get a subview of the readings dataset
id_timestamp_view = readings_dataset.loc[:, ['Id', 'Timestamp']]

# drop the time component of the timestamp, keeping only the date
id_timestamp_view['Timestamp'] = id_timestamp_view['Timestamp'].dt.normalize()

# compute the set of days with readings per station, and the days missing from each
days_readings = id_timestamp_view.groupby('Id').aggregate(lambda x: set(x))
days_readings['MissingDays'] = days_readings['Timestamp'].apply(lambda x: list(days - x))
days_readings['MissingDaysCount'] = days_readings['MissingDays'].apply(len)

In [43]:
pickle.dump(days_readings.query('MissingDaysCount > 0'), open("data/parsed/missing_days.p", "wb"))

In [44]:
def expand_datetime(df, datetime_col):
    """Adds a Weekday column (0 = Monday .. 6 = Sunday) derived from datetime_col."""
    df['Weekday'] = df[datetime_col].apply(lambda x: x.weekday())
    return df

In [45]:
# get the stations with missing readings only
missing_days_readings = days_readings[days_readings['MissingDaysCount'] != 0]
missing_days_readings = missing_days_readings['MissingDays'].apply(lambda x: pd.Series(x)).unstack().dropna()
missing_days_readings.index = missing_days_readings.index.droplevel()

# sort and format in their own DF
missing_days_readings = pd.DataFrame(missing_days_readings, columns=['MissingDay'], index=None).reset_index().sort_values(by=['Id', 'MissingDay'])

# expand the missing day date
expand_datetime(missing_days_readings, 'MissingDay')


Out[45]:
Id MissingDay Weekday
0 BikePoints_109 2016-06-25 5
1 BikePoints_112 2016-05-25 2
53 BikePoints_112 2016-05-26 3
54 BikePoints_120 2016-06-10 4
2 BikePoints_120 2016-06-11 5
3 BikePoints_129 2016-06-25 5
55 BikePoints_133 2016-06-24 4
4 BikePoints_133 2016-06-25 5
91 BikePoints_133 2016-06-26 6
5 BikePoints_153 2016-06-17 4
258 BikePoints_153 2016-06-18 5
209 BikePoints_153 2016-06-19 6
155 BikePoints_153 2016-06-20 0
92 BikePoints_153 2016-06-21 1
281 BikePoints_153 2016-06-22 2
234 BikePoints_153 2016-06-23 3
183 BikePoints_153 2016-06-24 4
124 BikePoints_153 2016-06-25 5
56 BikePoints_153 2016-06-26 6
6 BikePoints_184 2016-06-25 5
7 BikePoints_192 2016-06-25 5
8 BikePoints_218 2016-06-04 5
9 BikePoints_226 2016-05-15 6
57 BikePoints_226 2016-05-16 0
490 BikePoints_237 2016-05-15 6
514 BikePoints_237 2016-05-16 0
465 BikePoints_237 2016-05-17 1
391 BikePoints_237 2016-05-18 2
259 BikePoints_237 2016-05-19 3
10 BikePoints_237 2016-05-20 4
... ... ... ...
437 BikePoints_817 2016-06-17 4
595 BikePoints_817 2016-06-18 5
181 BikePoints_817 2016-06-19 6
120 BikePoints_817 2016-06-20 0
322 BikePoints_817 2016-06-21 1
521 BikePoints_817 2016-06-22 2
577 BikePoints_817 2016-06-23 3
497 BikePoints_817 2016-06-24 4
87 BikePoints_817 2016-06-25 5
301 BikePoints_817 2016-06-26 6
50 BikePoints_818 2016-05-15 6
88 BikePoints_818 2016-05-17 1
121 BikePoints_818 2016-05-18 2
182 BikePoints_86 2016-05-28 5
280 BikePoints_86 2016-05-29 6
342 BikePoints_86 2016-05-30 0
122 BikePoints_86 2016-05-31 1
257 BikePoints_86 2016-06-01 2
208 BikePoints_86 2016-06-02 3
89 BikePoints_86 2016-06-03 4
359 BikePoints_86 2016-06-04 5
302 BikePoints_86 2016-06-05 6
233 BikePoints_86 2016-06-06 0
153 BikePoints_86 2016-06-07 1
51 BikePoints_86 2016-06-08 2
323 BikePoints_86 2016-06-09 3
52 BikePoints_9 2016-06-09 3
123 BikePoints_9 2016-06-10 4
90 BikePoints_9 2016-06-11 5
154 BikePoints_9 2016-06-12 6

607 rows × 3 columns



In [47]:
missing_days_readings['Id'].nunique()


Out[47]:
53

In [48]:
# plot the days with missing readings (renamed to avoid shadowing the `days` set above)
missing_days = missing_days_readings['MissingDay']
missing_days_counts = missing_days.groupby([missing_days.dt.year, missing_days.dt.month, missing_days.dt.day]).count()
ax = missing_days_counts.plot(kind="bar")
ax.set_xlabel('Date')
ax.set_ylabel('Number of Stations')


Out[48]:
<matplotlib.text.Text at 0x7fb41edea710>

Stations with no readings in at least one day


In [49]:
missing_days_readings_stations = stations_dataset[stations_dataset['Id'].isin(missing_days_readings['Id'].unique())]
draw_stations_map(missing_days_readings_stations)


Out[49]:

Stations with no readings in at least one day during the weekend


In [50]:
weekend_readings = missing_days_readings[missing_days_readings['Weekday'] > 4]
missing_dayreadings_stn = stations_dataset[stations_dataset['Id'].isin(weekend_readings['Id'].unique())]
draw_stations_map(missing_dayreadings_stn)


Out[50]:

Stations with no readings in at least one day during weekdays


In [51]:
weekday_readings = missing_days_readings[missing_days_readings['Weekday'] < 5]
missing_dayreadings_stn = stations_dataset[stations_dataset['Id'].isin(weekday_readings['Id'].unique())]
draw_stations_map(missing_dayreadings_stn)


Out[51]:

Observations:

  • There are 53 stations with no readings on at least one day
  • More stations lacked readings in May than in June
  • Other than that, there is no visible pattern
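The counts above can be cross-checked without building sets of dates, by comparing each station's number of distinct reading days against the expected range. A minimal sketch on hypothetical data (station ids `A`/`B` and the two-day window are illustrative):

```python
import pandas as pd

# toy readings: station A covers both days, station B only one
readings = pd.DataFrame({
    'Id': ['A', 'A', 'B'],
    'Timestamp': pd.to_datetime(['2016-05-15 01:00', '2016-05-16 02:00', '2016-05-15 03:00']),
})

expected_days = pd.date_range('2016-05-15', '2016-05-16')  # two expected days

# distinct reading days per station
days_per_station = readings.groupby('Id')['Timestamp'].apply(lambda s: s.dt.normalize().nunique())

# stations that miss at least one expected day
incomplete = days_per_station[days_per_station < len(expected_days)]
print(incomplete.index.tolist())  # ['B']
```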

Build Datasets

Readings


In [59]:
# reading Ids that have no matching station in the stations dataset
stations_to_remove = set(readings_dataset.Id) - set(stations_dataset.Id)

In [60]:
readings_dataset = readings_dataset[~readings_dataset.Id.isin(stations_to_remove)]
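The same referential-integrity filter can be cross-checked with a left `merge` using `indicator=True`, which labels readings whose `Id` has no matching station. A sketch on toy frames (the ids are illustrative, not real BikePoint ids):

```python
import pandas as pd

readings = pd.DataFrame({'Id': ['A', 'B', 'C'], 'NbBikes': [1, 2, 3]})
stations = pd.DataFrame({'Id': ['A', 'B']})

# '_merge' is 'left_only' for readings without a matching station
merged = readings.merge(stations, on='Id', how='left', indicator=True)
orphans = merged[merged['_merge'] == 'left_only']
print(orphans['Id'].tolist())  # ['C']
```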

In [62]:
readings_dataset.reset_index(inplace=True, drop=True)

In [63]:
readings_dataset.head()


Out[63]:
Id Timestamp NbBikes NbDocks NbEmptyDocks NbUnusableDocks
0 BikePoints_704 2016-05-15 00:50:38.140000+00:00 9 18 9 0
1 BikePoints_672 2016-05-15 04:11:04.447000+00:00 28 28 0 0
2 BikePoints_555 2016-05-15 08:21:32.870000+00:00 16 56 40 0
3 BikePoints_759 2016-05-15 09:51:44.977000+00:00 0 18 18 0
4 BikePoints_8 2016-05-15 10:11:48.467000+00:00 15 18 3 0

In [65]:
readings_dataset.describe()


Out[65]:
NbBikes NbDocks NbEmptyDocks NbUnusableDocks
count 1.500921e+06 1.500921e+06 1.500921e+06 1.500921e+06
mean 1.240701e+01 2.701745e+01 1.403816e+01 5.722786e-01
std 9.087195e+00 9.518367e+00 9.583962e+00 9.030175e-01
min 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
25% 5.000000e+00 2.000000e+01 7.000000e+00 0.000000e+00
50% 1.100000e+01 2.500000e+01 1.300000e+01 0.000000e+00
75% 1.800000e+01 3.300000e+01 1.900000e+01 1.000000e+00
max 6.400000e+01 6.400000e+01 6.400000e+01 3.800000e+01

In [66]:
readings_dataset.info(memory_usage='deep')


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1500921 entries, 0 to 1500920
Data columns (total 6 columns):
Id                 1500921 non-null object
Timestamp          1500921 non-null datetime64[ns, UTC]
NbBikes            1500921 non-null uint16
NbDocks            1500921 non-null uint16
NbEmptyDocks       1500921 non-null uint16
NbUnusableDocks    1500921 non-null uint16
dtypes: datetime64[ns, UTC](1), object(1), uint16(4)
memory usage: 188.0 MB

In [67]:
pickle.dump(readings_dataset, open("data/parsed/readings_dataset_utc.p", "wb"))

Stations


In [68]:
stations_dataset.reset_index(inplace=True, drop=True)

In [69]:
stations_dataset.head()


Out[69]:
Id Name TerminalName PlaceType Latitude Longitude Installed Temporary Locked RemovalDate InstallDate ShortName Priority
0 BikePoints_1 River Street, Clerkenwell 001023 BikePoint 51.529163 -0.109970 True False False NaT 2010-07-12 15:08:00 River Street 2
1 BikePoints_10 Park Street, Bankside 001024 BikePoint 51.505974 -0.092754 True False False NaT 2010-07-04 11:21:00 Park Street 2
2 BikePoints_100 Albert Embankment, Vauxhall 001059 BikePoint 51.490435 -0.122806 True False False NaT 2010-07-14 09:31:00 Albert Embankment 2
3 BikePoints_101 Queen Street 1, Bank 000999 BikePoint 51.511553 -0.092940 True False False NaT 2010-07-14 10:18:00 Queen Street 1
4 BikePoints_102 Jewry Street, Aldgate 001045 BikePoint 51.513406 -0.076793 True False False NaT 2010-07-14 10:21:00 Jewry Street 2

In [70]:
stations_dataset.describe()


Out[70]:
Latitude Longitude
count 779.000000 779.000000
mean 51.505980 -0.129346
std 0.019976 0.055562
min 51.454752 -0.236769
25% 51.493235 -0.173685
50% 51.509303 -0.132102
75% 51.520849 -0.092940
max 51.549369 -0.002275

In [71]:
stations_dataset.info(memory_usage='deep')


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 779 entries, 0 to 778
Data columns (total 13 columns):
Id              779 non-null object
Name            779 non-null object
TerminalName    779 non-null object
PlaceType       779 non-null object
Latitude        779 non-null float64
Longitude       779 non-null float64
Installed       779 non-null bool
Temporary       779 non-null bool
Locked          779 non-null bool
RemovalDate     3 non-null datetime64[ns]
InstallDate     685 non-null datetime64[ns]
ShortName       779 non-null object
Priority        733 non-null object
dtypes: bool(3), datetime64[ns](2), float64(2), object(6)
memory usage: 504.7 KB

In [72]:
pickle.dump(stations_dataset, open("data/parsed/stations_dataset_final.p", "wb"))