Since this dataset comes from the real world, it may contain a significant amount of noise in its attributes. To figure out how to clean it, we can work with just one month, such as August, the month with the fewest trips: only 12,597,109. After that, the process for the whole year will be automated in the next notebook.
In [1]:
%matplotlib inline
%config InlineBackend.figure_format='retina'
In [2]:
import os
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
sns.set(font='sans')
In [3]:
# This auxiliary function applies another one to every row in a DataFrame for creating new columns.
def iterate_and_apply(dataframe, function, necesary_columns):
    perform = True
    step = 100000
    start = 0
    to = step
    while perform:
        new_columns = dataframe[start:to][necesary_columns].apply(function, axis=1)
        if len(new_columns) == 0:
            perform = False
        else:
            dataframe.update(new_columns)
            new_columns = None
            start += step
            to += step
    return dataframe
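As a toy usage sketch (with hypothetical columns 'a' and 'b', not from the dataset), the helper fills a precreated column chunk by chunk:

```python
# Hypothetical example: create a new column as the sum of two others.
toy = pd.DataFrame({'a': [1.0, 2.0], 'b': [3.0, 4.0]})
toy['a_plus_b'] = np.nan  # the new column must exist before update() can fill it
toy = iterate_and_apply(toy, lambda row: pd.Series({'a_plus_b': row.a + row.b}), ['a', 'b'])
print(toy)
```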
In [4]:
data = pd.read_csv('../data/faredata/trip_fare_8.csv', skipinitialspace=True)
Let's take a look at the data. Each row of the fare CSV file represents a trip, and the columns are attributes of these trips. In the next table we can check the values of these attributes for the first five trips of the file.
In [5]:
data.head()
Out[5]:
Of all of them, the first three columns may sound a bit strange, so let's check them. Knowing what they contain, we can skip checking the first, second, and fourth attributes. So, let's see how the values of vendor_id are distributed across the trips.
In [6]:
ax = data.groupby(['vendor_id']).size().plot(kind='bar')
ax.set_xlabel('vendor_id', fontsize=18)
ax.set_ylabel('Number of trips', fontsize=18)
ax.tick_params(labelsize=12)
It looks like we don't need to treat these values; they are pretty well distributed.
Let's check payment_type.
In [7]:
ax = data.groupby(['payment_type']).size().plot(kind='bar')
ax.set_xlabel('payment_type', fontsize=18)
ax.set_ylabel('Number of trips', fontsize=18)
ax.tick_params(labelsize=12)
Here we can see that card and cash are the taxis' main payment types. The rows with the other, rarer values can be deleted. By doing this (as we will also do with the other attributes) we are adding a bit of bias to the predictions, but those values are so unusual that they will hardly affect prediction performance.
Since the rest of the attributes are numeric, a good way to get an overview is to obtain a few summary statistics for them.
In [8]:
data.describe()
Out[8]:
For fare_amount, an attribute that is difficult to visualize, we can take a look at the previous table. There we can see that there are negative values! So, thinking of a range of ordinary values for this attribute, something between \$3.00 and \$200.00 comes to mind.
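Before filtering, a quick count of the offending rows gives a feel for how much data that range would discard (a minimal check):

```python
# How many trips fall outside the proposed $3.00-$200.00 range?
print((data.fare_amount < 3.0).sum())
print((data.fare_amount > 200.0).sum())
```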
Let's obtain a plot of the surcharge values.
In [9]:
ax = data.groupby(['surcharge']).size().plot(kind='bar', figsize=(15, 5))
ax.set_xlabel('surcharge', fontsize=18)
ax.set_ylabel('Number of trips', fontsize=18)
ax.tick_params(labelsize=12)
What are we going to do here? Of course, keep only the trips with \$0.00, \$0.50 and \$1.00 values.
For mta_tax we obtain the next plot.
In [10]:
ax = data.groupby(['mta_tax']).size().plot(kind='bar')
ax.set_xlabel('mta_tax', fontsize=18)
ax.set_ylabel('Number of trips', fontsize=18)
ax.tick_params(labelsize=12)
So, something like the previous case, but here we are going to keep only the trips with $0.50.
We are going to skip tip_amount for the moment.
Obtaining a useful representation of tolls_amount is very difficult because of the huge range of its values. Maybe the cause is that they are entered manually. Also, these values can change over the course of the year, so they probably aren't the same in all the months.
A solution might be to look only at the values that are repeated, for example, more than a thousand times.
In [11]:
tolls = data.groupby(['tolls_amount']).size()
print(tolls[tolls >= 1000])
tolls = None
We can say that a good range for this value may be something like \$0.00 to \$30.00.
We aren't going to take the total_amount attribute into account. This attribute is the sum of the previous ones, so any error here is unlikely to happen. Also, we aren't going to use it for the learning algorithm, because the tip is implicitly included in it.
Let's clean all of these attributes!
In [12]:
payment_type = ((data.payment_type == 'CRD') | (data.payment_type == 'CSH'))
fare_amount = ((data.fare_amount >= 3.0) & (data.fare_amount <= 200.0))
surcharge = ((data.surcharge == 0.0) | (data.surcharge == 0.5) | (data.surcharge == 1.0))
mta_tax = (data.mta_tax == 0.5)
tip_amount = ((data.tip_amount >= 0.0) & (data.tip_amount <= 100.0))
tolls_amount = ((data.tolls_amount >= 0.0) & (data.tolls_amount <= 30.0))
# Let's save it in another variable, I have a hunch.
data_aux = data[payment_type & fare_amount & surcharge & mta_tax & tip_amount & tolls_amount]
payment_type = None
fare_amount = None
surcharge = None
mta_tax = None
tip_amount = None
tolls_amount = None
A way of settling the hunch I mentioned in the previous piece of code is to plot payment_type again.
In [13]:
ax = data_aux.groupby(['payment_type']).size().plot(kind='bar')
ax.set_xlabel('payment_type', fontsize=18)
ax.set_ylabel('Number of trips', fontsize=18)
ax.tick_params(labelsize=12)
And once more, this time showing only the trips with a tip greater than $0.00.
In [14]:
ax = data_aux[(data_aux.tip_amount > 0)].groupby(['payment_type']).size().plot(kind='bar')
ax.set_xlabel('payment_type', fontsize=18)
ax.set_ylabel('Number of trips', fontsize=18)
ax.tick_params(labelsize=12)
What is happening here? We all know :( (cash tips simply aren't recorded, so they show up as \$0.00).
So, to remove this annoying noise, let's delete the cash payment type. Also, we can delete the payment_type attribute, because it will then have only one distinct value.
In [15]:
data_aux = None
payment_type = (data.payment_type == 'CRD')
fare_amount = ((data.fare_amount >= 3.0) & (data.fare_amount <= 200.0))
surcharge = ((data.surcharge == 0.0) | (data.surcharge == 0.5) | (data.surcharge == 1.0))
mta_tax = (data.mta_tax == 0.5)
tip_amount = ((data.tip_amount >= 0.0) & (data.tip_amount <= 100.0))
tolls_amount = ((data.tolls_amount >= 0.0) & (data.tolls_amount <= 30.0))
data = data[payment_type & fare_amount & surcharge & mta_tax & tip_amount & tolls_amount]
payment_type = None
fare_amount = None
surcharge = None
mta_tax = None
tip_amount = None
tolls_amount = None
data.drop(['payment_type'], axis=1, inplace=True)
To obtain a better value to predict, we can compute a normalized version of the tip, the tip percentage. We can use the following formula:
$$ tip\_perc\:=\:\frac{tip\_amount}{fare\_amount\:+\:surcharge\:+\:mta\_tax}\:\cdot\:100 $$
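For example (with made-up values), a \$2.20 tip on a \$10.00 fare with a \$0.50 surcharge and the \$0.50 MTA tax gives $\frac{2.20}{10.00 + 0.50 + 0.50} \cdot 100 = 20\%$.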
In [16]:
tip_perc_necesary_columns = ['fare_amount', 'surcharge', 'mta_tax', 'tip_amount']
tip_perc_column_name = 'tip_perc'
def calculate_tip_perc(row):
    subtotal = row.fare_amount + row.surcharge + row.mta_tax
    tip = row.tip_amount / subtotal
    tip_perc = tip * 100
    return pd.Series({tip_perc_column_name: tip_perc})
In [17]:
data[tip_perc_column_name] = np.nan
data = iterate_and_apply(data, calculate_tip_perc, tip_perc_necesary_columns)
# Now that the 'mta_tax' attribute has been used it can be deleted because it only has one value.
data.drop(['mta_tax'], axis=1, inplace=True)
Let's take a look at the summary statistics of the new attribute.
In [18]:
data.describe()
Out[18]:
A 2,000% tip! This percentage should be restricted to a more ordinary range, something like 0% to 50%.
In [19]:
tip_perc = (data.tip_perc <= 50.0)
data = data[tip_perc]
tip_perc = None
The file that contains the physical information of each trip cannot be read whole due to errors in one of its columns. Let's take a look at all of its columns, explaining those that need it: pickup_datetime and trip_time_in_secs. So, we are going to read only the valid and new columns.
In [20]:
tripdata = pd.read_csv('../data/tripdata/trip_data_8.csv', skipinitialspace=True, usecols=[3, 7, 8, 9, 10, 11, 12, 13])
To join this new DataFrame with the fare one, we are going to use their indices as... an index! We can do it that way because the trips have the same order in both files.
In [21]:
for column in tripdata.columns:
    data[column] = np.nan
'''
Careful! The next line uses a lot of memory, at least more than 8GB.
It's shown here because it's simpler than the other way I know to update the 'fare' DataFrame:
reading the 'trip' file with NumPy to create a lot of little DataFrames and updating them
into the data variable in a loop.
'''
data.update(tripdata)
tripdata = None
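By the way, the lower-memory alternative mentioned in the comment above can also be approximated with pandas' own chunked reader instead of NumPy; here is a minimal sketch (not what this notebook runs), assuming both files share the default integer index:

```python
# Read the trip file in chunks and fold each chunk into 'data'.
# update() aligns on the row labels, which continue across chunks.
def update_in_chunks(data, path, columns, chunksize=500000):
    reader = pd.read_csv(path, skipinitialspace=True,
                         usecols=columns, chunksize=chunksize)
    for chunk in reader:
        data.update(chunk)
    return data
```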
So, let's start checking the new attributes. The first of them is rate_code.
In [22]:
ax = data.groupby('rate_code').size().plot(kind='bar')
ax.set_xlabel('rate_code', fontsize=18)
ax.set_ylabel('Number of trips', fontsize=18)
ax.tick_params(labelsize=12)
We should keep only the rows with 1 as the value for this attribute. After that, we can delete the attribute, because it will have only one distinct value.
The next attribute to check is passenger_count.
In [23]:
ax = data.groupby('passenger_count').size().plot(kind='bar')
ax.set_xlabel('passenger_count', fontsize=18)
ax.set_ylabel('Number of trips', fontsize=18)
ax.tick_params(labelsize=12)
A usual trip has 1 to 6 passengers, so we can discard the others.
A couple of attributes that are closely related are trip_time_in_secs and trip_distance. Let's check out the next table with the statistical values of these attributes.
In [24]:
data.describe()
Out[24]:
Negative values, trips lasting more than 49 days while traveling millions of miles: a crazy thing.
To fix that, we can use Google Maps and look up a long but usual trip, like this one: around 50 minutes to travel 21.1 miles. So, we can use a maximum of 1 hour (3,600 seconds) and 25 miles.
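As a quick sanity check, that reference trip averages $21.1 / (50/60) \approx 25.3$ mph, a plausible speed for a route with highway stretches, so these limits shouldn't cut off many legitimate trips.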
Continuing with the attributes, it's the turn of the coordinates: longitude and latitude for pickups and dropoffs. By observing the previous table we can notice coordinates that don't even exist! To fix that, we will use only the coordinates satisfying the conditions in the following table.

| | min | max |
|---|---|---|
| latitude | 40.459518 | 41.175342 |
| longitude | −74.361107 | −71.903083 |

By filtering to those coordinates, we will use the area shown in the next image, extracted from Google Maps.
In [25]:
rate_code = (data.rate_code == 1.0)
passenger_count = ((data.passenger_count >= 1.0) & (data.passenger_count <= 6.0))
trip_time_in_secs = ((data.trip_time_in_secs > 0.0) & (data.trip_time_in_secs <= 3600.0))
trip_distance = ((data.trip_distance > 0.0) & (data.trip_distance <= 25.0))
pickup_latitude = ((data.pickup_latitude >= 40.459518) & (data.pickup_latitude <= 41.175342))
pickup_longitude = ((data.pickup_longitude >= -74.361107) & (data.pickup_longitude <= -71.903083))
dropoff_latitude = ((data.dropoff_latitude >= 40.459518) & (data.dropoff_latitude <= 41.175342))
dropoff_longitude = ((data.dropoff_longitude >= -74.361107) & (data.dropoff_longitude <= -71.903083))
data = data[rate_code & passenger_count & trip_time_in_secs & trip_distance & pickup_latitude & pickup_longitude
& dropoff_latitude & dropoff_longitude]
rate_code = None
passenger_count = None
trip_time_in_secs = None
trip_distance = None
pickup_latitude = None
pickup_longitude = None
dropoff_latitude = None
dropoff_longitude = None
data.drop(['rate_code'], axis=1, inplace=True)
But what about the possible coordinates that point into the water? Let's plot the southwest area of the city.
In [26]:
num_samples = 1250000
indices = np.random.choice(data.index, num_samples)
pickup_x = data.pickup_longitude[indices].values
pickup_y = data.pickup_latitude[indices].values
dropoff_x = data.dropoff_longitude[indices].values
dropoff_y = data.dropoff_latitude[indices].values
In [27]:
sns.set_style('white')
fig, ax = plt.subplots(figsize=(11, 12))
ax.scatter(pickup_x, pickup_y, s=5, color='blue', alpha=0.5)
ax.scatter(dropoff_x, dropoff_y, s=5, color='red', alpha=0.5)
ax.set_xlim([-74.05, -74.00])
ax.set_ylim([40.70, 40.75])
ax.set_title('coordinates')
In [28]:
sns.set_style('darkgrid')
indices = None
pickup_x = None
pickup_y = None
dropoff_x = None
dropoff_y = None
This is a bit strange, because this dataset doesn't include water taxi information. Just kidding.
Let's fix this issue in the next section.
A solution to the coordinates-in-the-water problem could be to divide the whole area into smaller square areas with relatively short sides. Let's use approximately 270 meters (885 feet) per side.
After creating the smaller areas, we can delete the trips whose pickup or dropoff happened in an area with fewer pickups or dropoffs than a specific number, for example, 20.
So, let's create these smaller areas and plot the coordinates in them.
In [29]:
# Coordinates variables.
min_x = -74.361107
min_y = 40.459518
max_x = -71.903083
max_y = 41.175342
step_x = 0.003205 # 270 meters (885 feet) approximately.
step_y = 0.002429 # 270 meters (885 feet) approximately.
# Number of maximum areas per axis in the 'area matrix'.
max_area_x = 766 # int((max_x - min_x) / step_x)
max_area_y = 294 # int((max_y - min_y) / step_y)
# Number of columns of the 'area matrix'.
num_columns = 767 # max_area_x + 1
# The area at the top right corner in the 'area matrix'.
max_area = 226264 # (max_area_y * num_columns) + max_area_x
# Minimum number of pickups or dropoffs per area.
min_area_number = 20
# Columns for creating the pickup and dropoff area attributes.
area_column_names = ['pickup_area', 'dropoff_area']
area_necesary_columns = ['pickup_longitude', 'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude']
# Functions for calculating the pickup and dropoff areas in the 'area matrix'.
def get_1d_area(coor, min_coor, step_coor):
    return int((coor - min_coor) / step_coor)

def get_area(area_x, area_y, columns):
    return (area_y * columns) + area_x

def calculate_area(row):
    pickup_x = get_1d_area(row.pickup_longitude, min_x, step_x)
    pickup_y = get_1d_area(row.pickup_latitude, min_y, step_y)
    pickup_area = get_area(pickup_x, pickup_y, num_columns)
    dropoff_x = get_1d_area(row.dropoff_longitude, min_x, step_x)
    dropoff_y = get_1d_area(row.dropoff_latitude, min_y, step_y)
    dropoff_area = get_area(dropoff_x, dropoff_y, num_columns)
    return pd.Series({
        area_column_names[0]: pickup_area,
        area_column_names[1]: dropoff_area
    })
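As a quick sanity check of the mapping (using made-up coordinates, roughly around Times Square, not taken from the dataset):

```python
# Hypothetical point; only meant to illustrate the coordinate-to-cell mapping.
lon, lat = -73.9857, 40.7580
cell_x = get_1d_area(lon, min_x, step_x)  # column in the 'area matrix'
cell_y = get_1d_area(lat, min_y, step_y)  # row in the 'area matrix'
print(get_area(cell_x, cell_y, num_columns))  # flat cell id
```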
In [30]:
for column in area_column_names:
    data[column] = np.nan
data = iterate_and_apply(data, calculate_area, area_necesary_columns)
In [31]:
color_list = ['red', 'cyan']
num_samples = 1250000
index_samples = np.random.choice(data.index, num_samples)
samples = data.loc[index_samples]
pickup = samples.groupby(['pickup_area']).size().index
dropoff = samples.groupby(['dropoff_area']).size().index
samples = None
areas = np.concatenate([pickup, dropoff])
areas = np.unique(areas)
In [32]:
sns.set_style('white')
fig, ax = plt.subplots(figsize=(22, 24))
for area in areas.tolist():
    if area % 2 == 0:
        color = color_list[0]
    else:
        color = color_list[1]
    data_pickup = data[(data.pickup_area == area)]
    data_dropoff = data[(data.dropoff_area == area)]
    pickup_x = data_pickup.pickup_longitude.values
    pickup_y = data_pickup.pickup_latitude.values
    dropoff_x = data_dropoff.dropoff_longitude.values
    dropoff_y = data_dropoff.dropoff_latitude.values
    ax.scatter(pickup_x, pickup_y, s=0.1, color=color, alpha=0.2)
    ax.scatter(dropoff_x, dropoff_y, s=0.1, color=color, alpha=0.2)
ax.set_xlim([-74.1, -73.7])
ax.set_ylim([40.6, 40.9])
ax.set_title('coordinates')
In [33]:
sns.set_style('darkgrid')
index_samples = None
pickup = None
dropoff = None
areas = None
data_pickup = None
data_dropoff = None
pickup_x = None
pickup_y = None
dropoff_x = None
dropoff_y = None
Pretty cool plot!
So, it's time to remove the trips that happened in unusual areas. Also, we don't need these area attributes anymore.
In [34]:
# Filtering the amount of pickups in the same area.
aux_pickup = data.groupby(['pickup_area']).size()
aux_pickup = aux_pickup[aux_pickup >= min_area_number]
aux_pickup = data['pickup_area'].isin(aux_pickup.index)
# Filtering the amount of dropoffs in the same area.
aux_dropoff = data.groupby(['dropoff_area']).size()
aux_dropoff = aux_dropoff[aux_dropoff >= min_area_number]
aux_dropoff = data['dropoff_area'].isin(aux_dropoff.index)
# Using the filters.
data = data[aux_pickup & aux_dropoff]
aux_pickup = None
aux_dropoff = None
data.drop(['pickup_area', 'dropoff_area'], axis=1, inplace=True)
Now we just need to save this cleaned month to a file. Also, it's a good idea to order the DataFrame's columns.
In [35]:
column_order = ['medallion', 'hack_license', 'vendor_id', 'pickup_datetime', 'fare_amount', 'surcharge', 'tip_amount',
'tip_perc', 'tolls_amount', 'total_amount', 'passenger_count', 'trip_time_in_secs', 'trip_distance',
'pickup_longitude', 'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude']
data = data.reindex(columns=column_order)
In [36]:
if not os.path.exists('../data/cleaned/'):
    os.makedirs('../data/cleaned/')
data.to_csv('../data/cleaned/cleaned_8.csv', index=True)
In the next notebook, this entire process will be applied to the other months. After that, our initial dataset of more than 170 million trips in 48.6GB will be reduced to 88,156,805 trips in 17.3GB.