In [1]:
import tip_predictor
import pandas as pd
import os


=========Introduction=========

Use this code to predict the percentage tip expected after a trip in an NYC green taxi.
The code is a predictive model built and trained on top of the Gradient Boosting Classifier and a Random Forest model, both provided in scikit-learn.

The input: 
A pandas.DataFrame with the same columns and format as the raw file downloaded from the website.

The data frame goes through the following pipeline:
	1. Cleaning
	2. Creation of derived variables
	3. Making predictions

The output:
	A pandas.Series of predictions. Two files are also saved on disk: submission.csv and cleaned_data.csv.

To make predictions, run 'tip_predictor.make_predictions(data)', where data is any raw 2015 dataframe loaded from http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml
Run tip_predictor.read_me() for further instructions.
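
The derived target that the pipeline predicts, percentage tip, can be sketched as below. This is only an illustration under stated assumptions: it assumes the raw columns are named `Tip_amount` and `Total_amount`, defines the percentage as 100 * tip / total, and uses a hypothetical helper `add_tip_percentage`. The definition actually used inside `tip_predictor` is not shown in this notebook and may differ.

```python
import pandas as pd


def add_tip_percentage(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with a 'Tip_percentage' column (hypothetical helper).

    Assumes columns 'Tip_amount' and 'Total_amount' exist. Rows with a
    non-positive total get 0.0 instead of a division-by-zero result.
    """
    out = df.copy()
    total = out["Total_amount"]
    out["Tip_percentage"] = (100 * out["Tip_amount"] / total).where(total > 0, 0.0)
    return out


# Tiny made-up sample, not real trip data
sample = pd.DataFrame(
    {"Tip_amount": [2.0, 0.0, 1.5], "Total_amount": [12.0, 8.0, 0.0]}
)
result = add_tip_percentage(sample)
print(result)
```

Working on a copy keeps the raw frame unchanged, so the same `data` object can still be passed to `tip_predictor.make_predictions`.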


In [2]:
# Download/load the September 2015 dataset
local_path = 'data_september_2015.csv'
if os.path.exists(local_path):  # Load the dataset from local disk if it is already there
    data = pd.read_csv(local_path)
else:  # Otherwise download it and cache it under the same name that is checked above
    url = "https://s3.amazonaws.com/nyc-tlc/trip+data/green_tripdata_2015-09.csv"
    data = pd.read_csv(url)
    data.to_csv(local_path, index=False)  # index=False avoids an extra index column on reload

In [8]:
# make predictions
# uncomment the next line for a quick run on the last 1000 rows only
#tip_predictor.make_predictions(data.tail(1000))

# run on the entire dataset
tip_predictor.make_predictions(data)


cleaning ...
creating features ...
predicting ...
submissions and cleaned data saved as submission.csv and cleaned_data.csv respectively
run evaluate_predictions() to compare them

In [7]:
# compare predictions to real percentage tips
tip_predictor.evaluate_predictions()


mean squared error: 10.0315724521
r2 score: 0.888563908067
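
For reference, the two metrics reported above can be computed as in the following minimal sketch. It is not the code behind `tip_predictor.evaluate_predictions()`, which is not shown here. The function names and the toy values are made up for illustration, and the formulas are the standard definitions that scikit-learn's `mean_squared_error` and `r2_score` also implement.

```python
def mean_squared_error_sketch(y_true, y_pred):
    """Average of squared differences between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score_sketch(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy percentage tips (made up, not taken from the dataset)
y_true = [0.0, 15.0, 20.0, 18.0]
y_pred = [1.0, 14.0, 21.0, 17.0]

mse = mean_squared_error_sketch(y_true, y_pred)
r2 = r2_score_sketch(y_true, y_pred)
print("mean squared error:", mse)
print("r2 score:", r2)
```

The mean squared error is in squared percentage points, so a value near 10 corresponds to a typical error of roughly 3 percentage points, while the r2 score is unitless.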
