4. Obtaining a sample and improving the data

The size of the dataset makes it impossible to perform the learning in a reasonably short time. A solution is to select a sample from the dataset and learn from it. Let's select a random sample of 1,000,000 trips, distributed proportionally across the months of the year.

4.1. Preparing the notebook


In [1]:
%matplotlib inline
%config InlineBackend.figure_format='retina'

In [2]:
from __future__ import division

import datetime
import os

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

sns.set(font='sans')

In [3]:
read_path = '../data/cleaned/cleaned_{0}.csv'
save_path = '../data/dataset/dataset.csv'

In [4]:
# This auxiliary function applies another function to every row of a DataFrame, in chunks, to create new columns.
def iterate_and_apply(dataframe, function, necessary_columns):
    perform = True
    step = 100000
    start = 0
    to = step
    
    while perform:
        new_columns = dataframe[start:to][necessary_columns].apply(function, axis=1)
        if len(new_columns) == 0:
            perform = False
        else:
            dataframe.update(new_columns)
        new_columns = None
        start += step
        to += step
    
    return dataframe
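To see what the chunked apply produces, here is the same pattern on a toy DataFrame (made-up column names, and a small step size to force several chunks):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'a': [1, 2, 3, 4, 5]})
toy['doubled'] = np.nan

# Same chunked-apply pattern as iterate_and_apply, with a step of 2.
step = 2
start = 0
while True:
    chunk = toy[start:start + step][['a']].apply(lambda row: pd.Series({'doubled': row.a * 2}), axis=1)
    if len(chunk) == 0:
        break
    # update() aligns on the index, so each chunk fills its own rows.
    toy.update(chunk)
    start += step

print(toy.doubled.tolist())  # [2.0, 4.0, 6.0, 8.0, 10.0]
```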

4.2. Obtaining the subdataset


In [5]:
complete_length = 88156805
final_length = 1000000
current_length = 0

first = True
data = None

In [6]:
months = range(1, 13)
for month in months:
    
    data_aux = pd.read_csv(read_path.format(month), index_col=0)
    
    if month != 12:
        this_length = int(final_length * (data_aux.shape[0] / complete_length))
    else:
        this_length = final_length - current_length
    current_length += this_length
    
    data_aux = data_aux.loc[np.random.choice(data_aux.index, this_length, replace=False)].copy()
    data_aux = data_aux.reset_index(drop=True)
    
    if first:
        data = data_aux.copy()
        first = False
    else:
        data = pd.concat([data, data_aux], ignore_index=True)
    data_aux = None
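
The quota arithmetic deserves a note: each month contributes in proportion to its size, and December absorbs the rounding remainder so that the quotas sum exactly to the target. A minimal sketch, with made-up per-month counts:

```python
complete_length = 88156805   # total trips in the cleaned dataset
final_length = 1000000       # desired sample size

# Made-up per-month counts that sum to complete_length (for illustration only).
month_counts = [complete_length // 12] * 11
month_counts.append(complete_length - sum(month_counts))

current_length = 0
quotas = []
for i, count in enumerate(month_counts):
    if i < 11:
        # Proportional quota, truncated by int().
        quota = int(final_length * (count / complete_length))
    else:
        # The last month absorbs the rounding remainder.
        quota = final_length - current_length
    current_length += quota
    quotas.append(quota)

assert sum(quotas) == final_length
```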

Now that we have obtained the sample from the dataset, let's create new attributes. This task would be impossible on the entire dataset; now is the perfect time, since the data is clean and small.

4.3. Getting datetime attributes

The date and time attribute in its current form can't be fed into scikit-learn's algorithms, so we need to decompose it. Here is a proposed list of attributes extracted from pickup_datetime, with explanations where needed:

  • pickup_month
  • pickup_weekday
  • pickup_day
  • pickup_time_in_mins: The pickup time expressed in minutes since midnight. For example, 16:30 would be 990.
  • pickup_non_working_today: A boolean value indicating whether the pickup day was a non-working day (a weekend day or a holiday).
  • pickup_non_working_tomorrow: A boolean value indicating whether the following day was a non-working day (a weekend day or a holiday).

To create the last two attributes, the nyc_2013_holidays.csv dataset (mentioned in the first notebook) has to be used.


In [7]:
# A mini-dataset with the 2013 holidays in NYC.
annual_holidays = pd.read_csv('../data/nyc_2013_holidays.csv')

# Columns needed.
datetime_necesary_columns = ['pickup_datetime']
datetime_column_names = ['pickup_month', 'pickup_weekday', 'pickup_day', 'pickup_time_in_mins', 'pickup_non_working_today',
                         'pickup_non_working_tomorrow']

# Returns a truthy value (the match count) if the given day is a holiday in NYC.
def is_in_annual_holidays(the_day):
    return annual_holidays[(annual_holidays.month == the_day.month) & (annual_holidays.day == the_day.day)].shape[0]

# Computes the new columns derived from 'pickup_datetime'.
def calculate_datetime_extra(row):
    dt = datetime.datetime.strptime(row.pickup_datetime, '%Y-%m-%d %H:%M:%S')
    pickup_month = dt.month
    pickup_weekday = dt.weekday()
    pickup_day = dt.day
    pickup_time_in_mins = (dt.hour * 60) + dt.minute
    pickup_non_working_today = int((pickup_weekday == 5) or (pickup_weekday == 6) or is_in_annual_holidays(dt))
    pickup_non_working_tomorrow = int((pickup_weekday == 4) or (pickup_weekday == 5) or
                                      is_in_annual_holidays(dt + datetime.timedelta(days=1)))
    
    return pd.Series({
        datetime_column_names[0]: pickup_month,
        datetime_column_names[1]: pickup_weekday,
        datetime_column_names[2]: pickup_day,
        datetime_column_names[3]: pickup_time_in_mins,
        datetime_column_names[4]: pickup_non_working_today,
        datetime_column_names[5]: pickup_non_working_tomorrow
    })

In [8]:
for column in datetime_column_names:
    data[column] = np.nan

data = iterate_and_apply(data, calculate_datetime_extra, datetime_necesary_columns)
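For reference, most of these columns can also be derived without a per-row apply, using pandas' vectorized datetime accessor. A sketch on a toy frame (the two holiday flags would still need a lookup against the holidays table):

```python
import pandas as pd

df = pd.DataFrame({'pickup_datetime': ['2013-01-01 16:30:00', '2013-07-04 09:15:00']})
dt = pd.to_datetime(df.pickup_datetime, format='%Y-%m-%d %H:%M:%S')

# Each .dt accessor call works on the whole column at once.
df['pickup_month'] = dt.dt.month
df['pickup_weekday'] = dt.dt.weekday
df['pickup_day'] = dt.dt.day
df['pickup_time_in_mins'] = dt.dt.hour * 60 + dt.dt.minute

print(df.pickup_time_in_mins.tolist())  # [990, 555]
```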

4.4. Getting a label to predict

Predicting the tip percentage of a trip is a regression problem. Given the human nature of the data, and the noise that we couldn't clean, a regression could give us disastrous results. Instead, we can recast the problem as classification. Let's create a new attribute to predict: a total of six labels, denoting ranges of tip percentages:

$$ [0,\:10),\:[10,\:15),\:[15,\:20),\:[20,\:25),\:[25,\:30)\:\text{and}\:[30,\:+\infty) $$

In [9]:
tip_label_column_name = 'tip_label'

tip_labels = ['[0-10)', '[10-15)', '[15-20)', '[20-25)', '[25-30)', '[30-inf)']
tip_ranges_by_label = [[0.0, 10.0], [10.0, 15.0], [15.0, 20.0], [20.0, 25.0], [25.0, 30.0], [30.0, 51.0]]

In [10]:
data[tip_label_column_name] = ''

for i, tip_label in enumerate(tip_labels):
    tip_mask = ((data.tip_perc >= tip_ranges_by_label[i][0]) & (data.tip_perc < tip_ranges_by_label[i][1]))
    data.loc[tip_mask, tip_label_column_name] = tip_label
    
    tip_mask = None
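The same labelling can be expressed in one step with pandas' pd.cut. A sketch on made-up percentages; right=False makes each bin left-closed and right-open, matching the ranges above:

```python
import pandas as pd

tip_perc = pd.Series([0.0, 9.9, 10.0, 22.5, 50.0])
tip_labels = ['[0-10)', '[10-15)', '[15-20)', '[20-25)', '[25-30)', '[30-inf)']
edges = [0.0, 10.0, 15.0, 20.0, 25.0, 30.0, float('inf')]

# right=False gives [a, b) bins; labels name the resulting categories.
labels = pd.cut(tip_perc, bins=edges, labels=tip_labels, right=False)
print(labels.tolist())  # ['[0-10)', '[0-10)', '[10-15)', '[20-25)', '[30-inf)']
```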

In [11]:
ax = data.groupby('tip_label').size().plot(kind='bar')

ax.set_xlabel('tip_label', fontsize=18)
ax.set_ylabel('Number of trips', fontsize=18)
ax.tick_params(labelsize=12)


These are the classes that we are going to predict!

4.5. Saving the file

Finally, we are going to save this dataset with a particular column order, and use it in the next notebook.


In [12]:
tip_label_order = ['medallion', 'hack_license', 'vendor_id', 'pickup_datetime', 'pickup_month', 'pickup_weekday',
                   'pickup_day', 'pickup_time_in_mins', 'pickup_non_working_today', 'pickup_non_working_tomorrow',
                   'fare_amount', 'surcharge', 'tip_amount', 'tip_perc', 'tip_label', 'tolls_amount', 'total_amount',
                   'passenger_count', 'trip_time_in_secs', 'trip_distance', 'pickup_longitude', 'pickup_latitude',
                   'dropoff_longitude', 'dropoff_latitude']

data = data.reindex(columns=tip_label_order)

In [13]:
if not os.path.exists('../data/dataset/'):
    os.makedirs('../data/dataset/')

data.to_csv(save_path, index=False)
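
A cheap sanity check after saving is to read only the header back and confirm the column order survived. A sketch with a temporary file and made-up columns:

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'b': [1], 'a': [2]})
order = ['a', 'b']
df = df.reindex(columns=order)

path = os.path.join(tempfile.mkdtemp(), 'dataset.csv')
df.to_csv(path, index=False)

# nrows=0 reads just the header row.
header = pd.read_csv(path, nrows=0).columns.tolist()
print(header)  # ['a', 'b']
```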