Since this dataset comes from the real world, it may contain a significant amount of noise in its attributes. To figure out how to clean it, we can work with just one month, such as August, the month with the fewest trips: only 12,597,109. After that, the process for the whole year will be automated in the next notebook.
In [1]:
%matplotlib inline
%config InlineBackend.figure_format='retina'
In [2]:
import os
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
sns.set(font='sans')
In [3]:
# This auxiliary function applies another one to every row in a DataFrame for creating new columns.
def iterate_and_apply(dataframe, function, necesary_columns):
    perform = True
    step = 100000
    start = 0
    to = step
    while perform:
        new_columns = dataframe[start:to][necesary_columns].apply(function, axis=1)
        if len(new_columns) == 0:
            perform = False
        else:
            dataframe.update(new_columns)
            new_columns = None
            start += step
            to += step
    return dataframe
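As a toy usage sketch (with hypothetical columns 'a' and 'b', not from the dataset), the helper fills a precreated column chunk by chunk:

```python
# Hypothetical example: create a new column as the sum of two others.
toy = pd.DataFrame({'a': [1.0, 2.0], 'b': [3.0, 4.0]})
toy['a_plus_b'] = np.nan  # the new column must exist before update() can fill it
toy = iterate_and_apply(toy, lambda row: pd.Series({'a_plus_b': row.a + row.b}), ['a', 'b'])
print(toy)
```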
In [4]:
data = pd.read_csv('../data/faredata/trip_fare_8.csv', skipinitialspace=True)
Let's take a look at the data. Each row of the fare CSV file represents a trip, and the columns are attributes of these trips. In the next table we can check the values of these attributes for the first five trips of the file.
In [5]:
data.head()
Out[5]:
Of all of them, the first three columns may sound a bit strange, so let's check them. Knowing what they contain, we can skip checking the first, second, and fourth attributes. So, let's see how the values of vendor_id are distributed across the trips.
In [6]:
ax = data.groupby(['vendor_id']).size().plot(kind='bar')
ax.set_xlabel('vendor_id', fontsize=18)
ax.set_ylabel('Number of trips', fontsize=18)
ax.tick_params(labelsize=12)
It looks like we don't need to treat these values; they are pretty well distributed.
Let's check payment_type.
In [7]:
ax = data.groupby(['payment_type']).size().plot(kind='bar')
ax.set_xlabel('payment_type', fontsize=18)
ax.set_ylabel('Number of trips', fontsize=18)
ax.tick_params(labelsize=12)
Here we can see that card and cash are the taxis' main payment types. The rows with the other, rarer values can be deleted. By doing this (as we will also do with the other attributes) we are adding a bit of bias to the predictions, but those values are so unusual that they will hardly affect prediction performance.
Since the rest of the attributes are numeric, a good way to get an overview is to obtain a few summary statistics for them.
In [8]:
data.describe()
Out[8]:
For fare_amount, an attribute that is difficult to visualize, we can take a look at the previous table. There we can see that there are negative values! So, thinking of a range of ordinary values for this attribute, something between \$3.00 and \$200.00 comes to mind.
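Before filtering, a quick count of the offending rows gives a feel for how much data that range would discard (a minimal check):

```python
# How many trips fall outside the proposed $3.00-$200.00 range?
print((data.fare_amount < 3.0).sum())
print((data.fare_amount > 200.0).sum())
```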
Let's obtain a plot of the surcharge values.
In [9]:
ax = data.groupby(['surcharge']).size().plot(kind='bar', figsize=(15, 5))
ax.set_xlabel('surcharge', fontsize=18)
ax.set_ylabel('Number of trips', fontsize=18)
ax.tick_params(labelsize=12)
What are we going to do here? Of course, keep only the trips with \$0.00, \$0.50 and \$1.00 values.
For mta_tax we obtain the next plot.
In [10]:
ax = data.groupby(['mta_tax']).size().plot(kind='bar')
ax.set_xlabel('mta_tax', fontsize=18)
ax.set_ylabel('Number of trips', fontsize=18)
ax.tick_params(labelsize=12)
So, something like the previous case, but here we are going to keep only the trips with $0.50.
We are going to skip tip_amount for the moment.
Obtaining a useful representation of tolls_amount is very difficult because of the huge range of its values. Maybe the cause is that they are entered manually. Also, these values can change over the course of the year, so they probably aren't the same in all the months.
A solution might be to look only at the values that are repeated, for example, more than a thousand times.
In [11]:
tolls = data.groupby(['tolls_amount']).size()
print(tolls[tolls >= 1000])
tolls = None
We can say that a good range for this value may be something like \$0.00 to \$30.00.
We aren't going to take the total_amount attribute into account. This attribute is the sum of the previous ones, so any error here is unlikely to happen. Also, we aren't going to use it for the learning algorithm, because the tip is implicitly included in it.
Let's clean all of these attributes!
In [12]:
payment_type = ((data.payment_type == 'CRD') | (data.payment_type == 'CSH'))
fare_amount = ((data.fare_amount >= 3.0) & (data.fare_amount <= 200.0))
surcharge = ((data.surcharge == 0.0) | (data.surcharge == 0.5) | (data.surcharge == 1.0))
mta_tax = (data.mta_tax == 0.5)
tip_amount = ((data.tip_amount >= 0.0) & (data.tip_amount <= 100.0))
tolls_amount = ((data.tolls_amount >= 0.0) & (data.tolls_amount <= 30.0))
# Let's save it in another variable, I have a hunch.
data_aux = data[payment_type & fare_amount & surcharge & mta_tax & tip_amount & tolls_amount]
payment_type = None
fare_amount = None
surcharge = None
mta_tax = None
tip_amount = None
tolls_amount = None
A way of settling the hunch I mentioned in the previous piece of code is to plot payment_type again.
In [13]:
ax = data_aux.groupby(['payment_type']).size().plot(kind='bar')
ax.set_xlabel('payment_type', fontsize=18)
ax.set_ylabel('Number of trips', fontsize=18)
ax.tick_params(labelsize=12)
And once more, this time showing only the trips with a tip greater than $0.00.
In [14]:
ax = data_aux[(data_aux.tip_amount > 0)].groupby(['payment_type']).size().plot(kind='bar')
ax.set_xlabel('payment_type', fontsize=18)
ax.set_ylabel('Number of trips', fontsize=18)
ax.tick_params(labelsize=12)
What is happening here? We all know :( (cash tips simply aren't recorded, so they show up as \$0.00).
So, to remove this annoying noise, let's delete the cash payment type. Also, we can delete the payment_type attribute, because it will then have only one distinct value.
In [15]:
data_aux = None
payment_type = (data.payment_type == 'CRD')
fare_amount = ((data.fare_amount >= 3.0) & (data.fare_amount <= 200.0))
surcharge = ((data.surcharge == 0.0) | (data.surcharge == 0.5) | (data.surcharge == 1.0))
mta_tax = (data.mta_tax == 0.5)
tip_amount = ((data.tip_amount >= 0.0) & (data.tip_amount <= 100.0))
tolls_amount = ((data.tolls_amount >= 0.0) & (data.tolls_amount <= 30.0))
data = data[payment_type & fare_amount & surcharge & mta_tax & tip_amount & tolls_amount]
payment_type = None
fare_amount = None
surcharge = None
mta_tax = None
tip_amount = None
tolls_amount = None
data.drop(['payment_type'], axis=1, inplace=True)
To obtain a better value to predict, we can compute a normalized version of the tip, the tip percentage. We can use the following formula:
$$ tip\_perc\:=\:\frac{tip\_amount}{fare\_amount\:+\:surcharge\:+\:mta\_tax}\:\cdot\:100 $$
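For example (with made-up values), a \$2.20 tip on a \$10.00 fare with a \$0.50 surcharge and the \$0.50 MTA tax gives $\frac{2.20}{10.00 + 0.50 + 0.50} \cdot 100 = 20\%$.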
In [16]:
tip_perc_necesary_columns = ['fare_amount', 'surcharge', 'mta_tax', 'tip_amount']
tip_perc_column_name = 'tip_perc'
def calculate_tip_perc(row):
    subtotal = row.fare_amount + row.surcharge + row.mta_tax
    tip = row.tip_amount / subtotal
    tip_perc = tip * 100
    return pd.Series({tip_perc_column_name: tip_perc})
In [17]:
data[tip_perc_column_name] = np.nan
data = iterate_and_apply(data, calculate_tip_perc, tip_perc_necesary_columns)
# Now that the 'mta_tax' attribute has been used it can be deleted because it only has one value.
data.drop(['mta_tax'], axis=1, inplace=True)
Let's take a look at the summary statistics of the new attribute.
In [18]:
data.describe()
Out[18]:
A 2,000% tip! This percentage should be restricted to a more ordinary range, something like 0% to 50%.
In [19]:
tip_perc = (data.tip_perc <= 50.0)
data = data[tip_perc]
tip_perc = None
The file that contains the physical information of each trip cannot be read whole due to errors in one of its columns. Let's take a look at all of its columns, explaining those that need it: pickup_datetime and trip_time_in_secs. So, we are going to read only the valid and new columns.
In [20]:
tripdata = pd.read_csv('../data/tripdata/trip_data_8.csv', skipinitialspace=True, usecols=[3, 7, 8, 9, 10, 11, 12, 13])
To join this new DataFrame with the fare one, we are going to use their indices as... an index! We can do it that way because the trips have the same order in both files.
In [21]:
for column in tripdata.columns:
    data[column] = np.nan
'''
Careful! The next line uses a lot of memory, at least more than 8GB.
It's shown here because it's simpler than the other way I know to update the 'fare' DataFrame:
reading the 'trip' file with NumPy to create a lot of little DataFrames and updating them
into the data variable in a loop.
'''
data.update(tripdata)
tripdata = None
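By the way, the lower-memory alternative mentioned in the comment above can also be approximated with pandas' own chunked reader instead of NumPy; here is a minimal sketch (not what this notebook runs), assuming both files share the default integer index:

```python
# Read the trip file in chunks and fold each chunk into 'data'.
# update() aligns on the row labels, which continue across chunks.
def update_in_chunks(data, path, columns, chunksize=500000):
    reader = pd.read_csv(path, skipinitialspace=True,
                         usecols=columns, chunksize=chunksize)
    for chunk in reader:
        data.update(chunk)
    return data
```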
So, let's start checking the new attributes. The first of them is rate_code.
In [22]:
ax = data.groupby('rate_code').size().plot(kind='bar')
ax.set_xlabel('rate_code', fontsize=18)
ax.set_ylabel('Number of trips', fontsize=18)
ax.tick_params(labelsize=12)
We should keep only the rows with 1 as the value for this attribute. After that, we can delete the attribute, because it will have only one distinct value.
The next attribute to check is passenger_count.
In [23]:
ax = data.groupby('passenger_count').size().plot(kind='bar')
ax.set_xlabel('passenger_count', fontsize=18)
ax.set_ylabel('Number of trips', fontsize=18)
ax.tick_params(labelsize=12)
A usual trip has 1 to 6 passengers, so we can discard the others.
A couple of attributes that are closely related are trip_time_in_secs and trip_distance. Let's check out the next table with the statistical values of these attributes.
In [24]:
data.describe()
Out[24]:
Negative values, trips lasting more than 49 days while traveling millions of miles: a crazy thing.
To fix that, we can use Google Maps and look up a long but usual trip, like this one: around 50 minutes to travel 21.1 miles. So, we can use a maximum of 1 hour (3,600 seconds) and 25 miles.
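As a quick sanity check, that reference trip averages $21.1 / (50/60) \approx 25.3$ mph, a plausible speed for a route with highway stretches, so these limits shouldn't cut off many legitimate trips.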
Continuing with the attributes, it's the turn of the coordinates: longitude and latitude for pickups and dropoffs. By observing the previous table we can notice coordinates that don't even exist! To fix that, we will use only the coordinates satisfying the conditions in the following table.

| | min | max |
|---|---|---|
| latitude | 40.459518 | 41.175342 |
| longitude | −74.361107 | −71.903083 |

By filtering to those coordinates, we will use the area shown in the next image, extracted from Google Maps.
In [25]:
rate_code = (data.rate_code == 1.0)
passenger_count = ((data.passenger_count >= 1.0) & (data.passenger_count <= 6.0))
trip_time_in_secs = ((data.trip_time_in_secs > 0.0) & (data.trip_time_in_secs <= 3600.0))
trip_distance = ((data.trip_distance > 0.0) & (data.trip_distance <= 25.0))
pickup_latitude = ((data.pickup_latitude >= 40.459518) & (data.pickup_latitude <= 41.175342))
pickup_longitude = ((data.pickup_longitude >= -74.361107) & (data.pickup_longitude <= -71.903083))
dropoff_latitude = ((data.dropoff_latitude >= 40.459518) & (data.dropoff_latitude <= 41.175342))
dropoff_longitude = ((data.dropoff_longitude >= -74.361107) & (data.dropoff_longitude <= -71.903083))
data = data[rate_code & passenger_count & trip_time_in_secs & trip_distance & pickup_latitude & pickup_longitude
& dropoff_latitude & dropoff_longitude]
rate_code = None
passenger_count = None
trip_time_in_secs = None
trip_distance = None
pickup_latitude = None
pickup_longitude = None
dropoff_latitude = None
dropoff_longitude = None
data.drop(['rate_code'], axis=1, inplace=True)
But what about the possible coordinates that point into the water? Let's plot the southwest area of the city.
In [26]:
num_samples = 1250000
indices = np.random.choice(data.index, num_samples)
pickup_x = data.pickup_longitude[indices].values
pickup_y = data.pickup_latitude[indices].values
dropoff_x = data.dropoff_longitude[indices].values
dropoff_y = data.dropoff_latitude[indices].values
In [27]:
sns.set_style('white')
fig, ax = plt.subplots(figsize=(11, 12))
ax.scatter(pickup_x, pickup_y, s=5, color='blue', alpha=0.5)
ax.scatter(dropoff_x, dropoff_y, s=5, color='red', alpha=0.5)
ax.set_xlim([-74.05, -74.00])
ax.set_ylim([40.70, 40.75])
ax.set_title('coordinates')
In [28]:
sns.set_style('darkgrid')
indices = None
pickup_x = None
pickup_y = None
dropoff_x = None
dropoff_y = None
This is a bit strange, because this dataset doesn't include water taxi information. Just kidding.
Let's fix this issue in the next section.
A solution to the coordinates-in-the-water problem could be to divide the whole area into smaller square areas with relatively short sides. Let's use approximately 270 meters (885 feet) per side.
After creating the smaller areas, we can delete the trips whose pickup or dropoff happened in an area with fewer pickups or dropoffs than a specific number, for example, 20.
So, let's create these smaller areas and plot the coordinates in them.
In [29]:
# Coordinates variables.
min_x = -74.361107
min_y = 40.459518
max_x = -71.903083
max_y = 41.175342
step_x = 0.003205 # 270 meters (885 feet) approximately.
step_y = 0.002429 # 270 meters (885 feet) approximately.
# Number of maximum areas per axis in the 'area matrix'.
max_area_x = 766 # int((max_x - min_x) / step_x)
max_area_y = 294 # int((max_y - min_y) / step_y)
# Number of columns of the 'area matrix'.
num_columns = 767 # max_area_x + 1
# The area at the top right corner in the 'area matrix'.
max_area = 226264 # (max_area_y * num_columns) + max_area_x
# Minimum number of pickups or dropoffs per area.
min_area_number = 20
# Columns for creating the pickup and dropoff area attributes.
area_column_names = ['pickup_area', 'dropoff_area']
area_necesary_columns = ['pickup_longitude', 'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude']
# Functions for calculating the pickup and dropoff areas in the 'area matrix'.
def get_1d_area(coor, min_coor, step_coor):
    return int((coor - min_coor) / step_coor)

def get_area(area_x, area_y, columns):
    return (area_y * columns) + area_x

def calculate_area(row):
    pickup_x = get_1d_area(row.pickup_longitude, min_x, step_x)
    pickup_y = get_1d_area(row.pickup_latitude, min_y, step_y)
    pickup_area = get_area(pickup_x, pickup_y, num_columns)
    dropoff_x = get_1d_area(row.dropoff_longitude, min_x, step_x)
    dropoff_y = get_1d_area(row.dropoff_latitude, min_y, step_y)
    dropoff_area = get_area(dropoff_x, dropoff_y, num_columns)
    return pd.Series({
        area_column_names[0]: pickup_area,
        area_column_names[1]: dropoff_area
    })
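As a quick sanity check of the mapping (using made-up coordinates, roughly around Times Square, not taken from the dataset):

```python
# Hypothetical point; only meant to illustrate the coordinate-to-cell mapping.
lon, lat = -73.9857, 40.7580
cell_x = get_1d_area(lon, min_x, step_x)  # column in the 'area matrix'
cell_y = get_1d_area(lat, min_y, step_y)  # row in the 'area matrix'
print(get_area(cell_x, cell_y, num_columns))  # flat cell id
```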
In [30]:
for column in area_column_names:
    data[column] = np.nan
data = iterate_and_apply(data, calculate_area, area_necesary_columns)
In [31]:
color_list = ['red', 'cyan']
num_samples = 1250000
index_samples = np.random.choice(data.index, num_samples)
samples = data.loc[index_samples]
pickup = samples.groupby(['pickup_area']).size().index
dropoff = samples.groupby(['dropoff_area']).size().index
samples = None
areas = np.concatenate([pickup, dropoff])
areas = np.unique(areas)
In [32]:
sns.set_style('white')
fig, ax = plt.subplots(figsize=(22, 24))
for area in areas.tolist():
    if area % 2 == 0:
        color = color_list[0]
    else:
        color = color_list[1]
    data_pickup = data[(data.pickup_area == area)]
    data_dropoff = data[(data.dropoff_area == area)]
    pickup_x = data_pickup.pickup_longitude.values
    pickup_y = data_pickup.pickup_latitude.values
    dropoff_x = data_dropoff.dropoff_longitude.values
    dropoff_y = data_dropoff.dropoff_latitude.values
    ax.scatter(pickup_x, pickup_y, s=0.1, color=color, alpha=0.2)
    ax.scatter(dropoff_x, dropoff_y, s=0.1, color=color, alpha=0.2)
ax.set_xlim([-74.1, -73.7])
ax.set_ylim([40.6, 40.9])
ax.set_title('coordinates')
In [33]:
sns.set_style('darkgrid')
index_samples = None
pickup = None
dropoff = None
areas = None
data_pickup = None
data_dropoff = None
pickup_x = None
pickup_y = None
dropoff_x = None
dropoff_y = None
Pretty cool plot!
So, it's time to remove the trips that happened in unusual areas. Also, we don't need these area attributes anymore.
In [34]:
# Filtering the amount of pickups in the same area.
aux_pickup = data.groupby(['pickup_area']).size()
aux_pickup = aux_pickup[aux_pickup >= min_area_number]
aux_pickup = data['pickup_area'].isin(aux_pickup.index)
# Filtering the amount of dropoffs in the same area.
aux_dropoff = data.groupby(['dropoff_area']).size()
aux_dropoff = aux_dropoff[aux_dropoff >= min_area_number]
aux_dropoff = data['dropoff_area'].isin(aux_dropoff.index)
# Using the filters.
data = data[aux_pickup & aux_dropoff]
aux_pickup = None
aux_dropoff = None
data.drop(['pickup_area', 'dropoff_area'], axis=1, inplace=True)
Now we just need to save this cleaned month to a file. Also, it's a good idea to order the DataFrame's columns.
In [35]:
column_order = ['medallion', 'hack_license', 'vendor_id', 'pickup_datetime', 'fare_amount', 'surcharge', 'tip_amount',
'tip_perc', 'tolls_amount', 'total_amount', 'passenger_count', 'trip_time_in_secs', 'trip_distance',
'pickup_longitude', 'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude']
data = data.reindex(columns=column_order)
In [36]:
if not os.path.exists('../data/cleaned/'):
    os.makedirs('../data/cleaned/')
data.to_csv('../data/cleaned/cleaned_8.csv', index=True)
In the next notebook, this entire process will be applied to the other months. After that, our initial dataset of more than 170 million trips in 48.6GB will be reduced to 88,156,805 trips in 17.3GB.