Data Cleaning

Set up

Imports


In [4]:
import itertools
import pandas as pd

from scripts.preprocess.parse_json import parse_dir

Parse Raw Data


In [5]:
def parse_station(element):
    """Parses a JSON bicycle station object to a dictionary"""

    obj = {
        'Id': element['id'],
        'Name': element['commonName'],
        'Latitude': element['lat'],
        'Longitude': element['lon'],
        'PlaceType': element['placeType']
    }

    for p in element['additionalProperties']:
        obj[p['key']] = p['value']

        # every property of a station is expected to carry the same modification time
        if 'Timestamp' not in obj:
            obj['Timestamp'] = p['modified']
        elif obj['Timestamp'] != p['modified']:
            raise ValueError('The properties\' timestamps for station %s do not match: %s != %s' % (
                obj['Id'], obj['Timestamp'], p['modified']))

    return obj

def parse_cycles(json_obj):
    """Parses TfL's BikePoint JSON response"""

    return [parse_station(element) for element in json_obj]
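As a quick check of what `parse_station` produces, the sketch below runs it on a single made-up station element. The sample values are hypothetical and only mimic the fields the function reads. The function is repeated here, with the `'Timestamp'` and `'Id'` keys spelled consistently, so that the block runs on its own.

```python
def parse_station(element):
    """Parses a JSON bicycle station object to a dictionary"""
    obj = {
        'Id': element['id'],
        'Name': element['commonName'],
        'Latitude': element['lat'],
        'Longitude': element['lon'],
        'PlaceType': element['placeType'],
    }
    for p in element['additionalProperties']:
        obj[p['key']] = p['value']
        # all properties of one station should share a modification time
        if 'Timestamp' not in obj:
            obj['Timestamp'] = p['modified']
        elif obj['Timestamp'] != p['modified']:
            raise ValueError('timestamps for station %s do not match' % obj['Id'])
    return obj


# hypothetical element, made up for illustration (not real TfL data)
sample = {
    'id': 'BikePoints_0',
    'commonName': 'Example Street, Example Area',
    'lat': 51.5,
    'lon': -0.1,
    'placeType': 'BikePoint',
    'additionalProperties': [
        {'key': 'NbBikes', 'value': '3', 'modified': '2016-05-16T08:00:00.000'},
        {'key': 'NbDocks', 'value': '10', 'modified': '2016-05-16T08:00:00.000'},
    ],
}

station = parse_station(sample)
print(station)
```

In this sample the property values are strings, which is an assumption about the raw feed; if the real data looks the same, that is what makes the later type conversion necessary.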

In [6]:
records = parse_dir('/home/jfconavarrete/Documents/Work/Dissertation/spts-uoe/data/dev', parse_cycles)
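`parse_dir` comes from the project's own `scripts.preprocess.parse_json` module and is not shown in this notebook. The sketch below is only a guess at its shape, assuming it decodes every JSON file in a directory and applies the callback to each decoded document. The name `parse_dir_sketch`, the `*.json` pattern and the sorted file order are all assumptions.

```python
import json
import tempfile
from pathlib import Path


def parse_dir_sketch(directory, parse_fn):
    """Hypothetical stand-in for parse_dir: one parsed result per JSON file."""
    results = []
    for path in sorted(Path(directory).glob('*.json')):
        with path.open() as f:
            results.append(parse_fn(json.load(f)))
    return results


# demo on a throwaway directory holding two tiny files
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, 'a.json').write_text(json.dumps([{'id': 1}, {'id': 2}]))
    Path(tmp, 'b.json').write_text(json.dumps([{'id': 3}]))
    demo_records = parse_dir_sketch(tmp, lambda doc: [e['id'] for e in doc])

print(demo_records)
```

Under this assumption the result is a list with one entry per file, each entry being whatever the callback returned for that file.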

Import into Pandas


In [7]:
dataset = pd.DataFrame(list(itertools.chain.from_iterable(records)))

dataset.shape


Out[7]:
(520906, 16)
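The cell above flattens `records` with `itertools.chain.from_iterable` before building the frame, which suggests that `records` holds one list of station dictionaries per parsed file. A minimal sketch of that step on made-up data (the `records_demo` values are hypothetical):

```python
import itertools

import pandas as pd

# two "files", each contributing a list of station dictionaries
records_demo = [
    [{'Id': 'A', 'NbBikes': '3'}, {'Id': 'B', 'NbBikes': '0'}],
    [{'Id': 'A', 'NbBikes': '4', 'LastUpdated': 'x'}],
]

# chain.from_iterable concatenates the inner lists into one flat sequence
rows = list(itertools.chain.from_iterable(records_demo))
frame = pd.DataFrame(rows)

print(frame.shape)
print(frame['LastUpdated'].isnull().sum())
```

A key that appears in only some dictionaries still becomes a column, and the rows that lack it are filled with missing values.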

Technically Correct Data

Data is said to be technically correct if each value:

  1. can be directly recognized as belonging to a certain variable, and
  2. is stored in a data type that represents the value domain of the real-world variable.
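One way to check the second criterion is to list the columns whose data type does not yet match what they represent. The sketch below does this on a small made-up frame; the column names mirror the dataset, but the values and the helper variable names are hypothetical.

```python
import pandas as pd

# made-up frame: bike counts arrive as text, coordinates as floats
demo = pd.DataFrame({'NbBikes': ['3', '0'], 'Latitude': [51.5, 51.4]})

# columns that should be numeric but are not stored as numbers yet
expected_numeric = ['NbBikes', 'Latitude']
to_convert = [c for c in expected_numeric
              if not pd.api.types.is_numeric_dtype(demo[c])]
print(to_convert)

# parse the text, then store it in a compact unsigned integer type
for column in to_convert:
    demo[column] = pd.to_numeric(demo[column], errors='raise').astype('uint16')

print(demo.dtypes)
```

With `errors='raise'`, a value that cannot be parsed stops the conversion instead of silently turning into a missing value.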

Convert to Appropriate DataTypes


In [1]:
# convert columns to their appropriate datatypes
dataset['InstallDate'] = pd.to_numeric(dataset['InstallDate'], errors='raise')
dataset['Installed'] = dataset['Installed'].astype('bool_')
dataset['Temporary'] = dataset['Temporary'].astype('bool_')
dataset['Locked'] = dataset['Locked'].astype('bool_')
dataset['NbBikes'] = dataset['NbBikes'].astype('uint16')
dataset['NbDocks'] = dataset['NbDocks'].astype('uint16')
dataset['NbEmptyDocks'] = dataset['NbEmptyDocks'].astype('uint16')

# convert string timestamp to datetime
dataset['Timestamp'] = pd.to_datetime(dataset['Timestamp'], format='%Y-%m-%dT%H:%M:%S.%f')
dataset['InstallDate'] = pd.to_datetime(dataset['InstallDate'], unit='ms')
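One caveat about the boolean casts: a boolean cast of an `object` column tests each value for truthiness, so every non-empty string becomes `True`, including the text `'false'`. Whether this affects the dataset depends on how the raw values are encoded. The sketch below assumes the flags arrive as the strings `'true'` and `'false'`, which is not confirmed by this notebook, and shows an explicit mapping as one possible alternative.

```python
import pandas as pd

# assumption: boolean flags arrive as the strings 'true' / 'false'
flags = pd.Series(['true', 'false', 'true'], dtype=object)

# truthiness cast: any non-empty string counts as True
naive = flags.astype(bool)
print(naive.tolist())

# explicit mapping keeps the intended meaning of each string
parsed = flags.map({'true': True, 'false': False})
print(parsed.tolist())
```

The sample rows printed further down show `Installed`, `Locked` and `Temporary` as `True` in every row, which is consistent with this effect but does not prove it. Comparing `value_counts()` of the raw column before and after conversion would settle the question.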



Data Description


In [16]:
dataset.info(memory_usage='deep')


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 520906 entries, 0 to 520905
Data columns (total 16 columns):
Id              520906 non-null object
InstallDate     461559 non-null datetime64[ns]
Installed       520906 non-null bool
LastUpdated     1 non-null object
Latitude        520906 non-null float64
Locked          520906 non-null bool
Longitude       520906 non-null float64
Name            520906 non-null object
NbBikes         520906 non-null uint16
NbDocks         520906 non-null uint16
NbEmptyDocks    520906 non-null uint16
PlaceType       520906 non-null object
RemovalDate     520906 non-null object
Temporary       520906 non-null bool
TerminalName    520906 non-null object
Timestamp       520906 non-null datetime64[ns]
dtypes: bool(3), datetime64[ns](2), float64(2), object(6), uint16(3)
memory usage: 296.5 MB
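The summary reports non-null counts, so the number of missing values has to be read as the difference from the row count (for example, `LastUpdated` is populated in a single row). A short sketch of counting missing values directly, on a made-up frame with hypothetical values:

```python
import pandas as pd

demo = pd.DataFrame({
    'Id': ['A', 'B', 'C'],
    'InstallDate': pd.to_datetime(['2010-07-12', None, '2010-07-08']),
    'RemovalDate': ['', '', ''],
})

# missing values per column (NaN / NaT / None)
missing = demo.isnull().sum()
print(missing)

# empty strings are not counted as missing by isnull()
empty_removal = int((demo['RemovalDate'] == '').sum())
print(empty_removal)
```

The second check matters for `object` columns such as `RemovalDate`, which the summary lists as fully non-null. Whether that column actually holds empty strings is an assumption to verify on the real data.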

In [14]:
dataset.head()


Out[14]:
|   | Id | InstallDate | Installed | LastUpdated | Latitude | Locked | Longitude | Name | NbBikes | NbDocks | NbEmptyDocks | PlaceType | RemovalDate | Temporary | TerminalName | Timestamp |
| 0 | BikePoints_1 | 2010-07-12 15:08:00 | True | NaN | 51.529163 | True | -0.109970 | River Street , Clerkenwell | 0 | 19 | 18 | BikePoint |  | True | 001023 | 2016-05-16 08:01:37.947 |
| 1 | BikePoints_2 | 2010-07-08 10:43:00 | True | NaN | 51.499606 | True | -0.197574 | Phillimore Gardens, Kensington | 19 | 37 | 18 | BikePoint |  | True | 001018 | 2016-05-16 08:06:37.670 |
| 2 | BikePoints_3 | 2010-07-04 10:46:00 | True | NaN | 51.521283 | True | -0.084605 | Christopher Street, Liverpool Street | 31 | 32 | 1 | BikePoint |  | True | 001012 | 2016-05-16 08:06:37.670 |
| 3 | BikePoints_4 | 2010-07-04 10:58:00 | True | NaN | 51.530059 | True | -0.120973 | St. Chad's Street, King's Cross | 0 | 23 | 23 | BikePoint |  | True | 001013 | 2016-05-16 07:51:35.910 |
| 4 | BikePoints_5 | 2010-07-04 11:04:00 | True | NaN | 51.493130 | True | -0.156876 | Sedding Street, Sloane Square | 24 | 27 | 3 | BikePoint |  | True | 003420 | 2016-05-16 08:06:37.670 |

Consistent Correct Data

Missing Values

Outliers

Errors

Consistency

Exploratory Data Analysis

Visual Representation

Examine Variable Relationships

Analyze Variable Over Time

Conclusions

