In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import osmnx as ox
For this notebook, we will focus on exploring a 10,000-point sample of the 1.1 billion-trip dataset.
Read in the first 10 rows of the sample dataset to get an idea of what the raw data looks like. The data dictionary for the variables is here.
In [2]:
data = pd.read_csv("sample_data/yellow_tripdata_2016-02_sample.csv", nrows=10)
data
Out[2]:
Adjust what is read to only include data we might use and to parse dates when the data is being ingested. Read the entire dataset to check for parsing errors and to have the data available for further exploration.
In [3]:
data = pd.read_csv("sample_data/yellow_tripdata_2016-02_sample.csv",
                   usecols=["tpep_pickup_datetime",
                            "tpep_dropoff_datetime",
                            "trip_distance",
                            "pickup_longitude",
                            "pickup_latitude",
                            "dropoff_longitude",
                            "dropoff_latitude"],
                   parse_dates=["tpep_pickup_datetime",
                                "tpep_dropoff_datetime"],
                   infer_datetime_format=True)
data.head(10)
Out[3]:
In [4]:
data["trip_time"] = data.tpep_dropoff_datetime - data.tpep_pickup_datetime
data["trip_time_in_hours"] = data.trip_time/np.timedelta64(1, 'h')
# Distribution of trip times in minutes
(data[data.trip_time_in_hours <= 1].trip_time_in_hours*60).hist(bins=50)
plt.title("Distribution of trip times\nthat are less than an hour")
plt.xlabel("Minutes");
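As a quick sanity check of the timedelta conversion above: dividing a timedelta column by `np.timedelta64(1, 'h')` yields fractional hours as a float. A minimal sketch on synthetic timestamps (the column names mirror the taxi data, but the values are made up):

```python
import numpy as np
import pandas as pd

# Two synthetic trips: 30 minutes and 90 minutes long
df = pd.DataFrame({
    "tpep_pickup_datetime":  pd.to_datetime(["2016-02-01 08:00", "2016-02-01 09:00"]),
    "tpep_dropoff_datetime": pd.to_datetime(["2016-02-01 08:30", "2016-02-01 10:30"]),
})

df["trip_time"] = df.tpep_dropoff_datetime - df.tpep_pickup_datetime
# Dividing by a one-hour timedelta64 converts the column to float hours
df["trip_time_in_hours"] = df.trip_time / np.timedelta64(1, "h")
print(df.trip_time_in_hours.tolist())  # → [0.5, 1.5]
```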
Let's see some basic stats on the data.
In [5]:
data.describe()
Out[5]:
We can see there are several outliers. We will clean these up and check out the new distribution of the data.
In [6]:
data = data[( (data.trip_distance > 0)
            & (data.pickup_longitude < 0)
            & (data.pickup_latitude > 40.5)
            & (data.dropoff_longitude < 0)
            & (data.dropoff_longitude > -74.3)
            & (data.dropoff_latitude > 40.5)
            & (data.trip_time_in_hours < 10)
            & (data.trip_time_in_hours > 0.01))]
data.describe()
Out[6]:
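The cleanup above is a chained bounding-box/range filter: each boolean mask drops rows with an implausible coordinate or duration. A toy sketch of the same pattern (made-up coordinates, not the taxi data):

```python
import pandas as pd

toy = pd.DataFrame({
    "pickup_longitude": [-73.98, 0.0, -73.95],   # 0.0 is a bad GPS fix
    "pickup_latitude":  [40.75, 40.75, 39.0],    # 39.0 is well south of the city
    "trip_distance":    [1.2, 3.4, 2.0],
})

# Combining masks with & keeps only rows passing every condition
clean = toy[(toy.trip_distance > 0)
            & (toy.pickup_longitude < 0)
            & (toy.pickup_latitude > 40.5)]
print(len(clean))  # → 1
```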
Looks pretty good. Let's add an average mph feature.
In [7]:
data["average_mph"] = data.trip_distance/data.trip_time_in_hours
data = data[data.average_mph < 75]
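As a quick check of the speed feature: a 2.5-mile trip taking 15 minutes (0.25 hours) should come out to 10 mph, while anything at or above 75 mph is treated as implausible for city driving and dropped. A minimal sketch on synthetic rows:

```python
import pandas as pd

df = pd.DataFrame({"trip_distance": [2.5, 10.0],
                   "trip_time_in_hours": [0.25, 0.1]})
df["average_mph"] = df.trip_distance / df.trip_time_in_hours
# The second row works out to 100 mph and is filtered as an outlier
df = df[df.average_mph < 75]
print(df.average_mph.tolist())  # → [10.0]
```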
Before we decide on the base dataset to use for the analysis going forward, let's explore the data a bit.
In [8]:
# Make some histograms
var_and_max = [("trip_distance", 10),
               ("average_mph", 100)]
for var, max_val in var_and_max:
    data[data[var] < max_val][var].hist(bins=100)
    plt.title("Distribution of trips based on %s" % var)
    plt.show()
Both charts look reasonable.
Let's take a look at a plot of the pickup locations.
In [9]:
data.plot(kind="scatter",
          x="pickup_latitude",
          xlim=(40.7, 40.825),
          y="pickup_longitude",
          ylim=(-74.02, -73.94),
          c="average_mph",
          linewidths=0,
          s=2,
          colormap="viridis",
          figsize=(10, 4))
plt.axis("off")
plt.title("Taxi pickup locations\n");
Very neat. We can see the road network, see Central Park (based on the missing points), see clusters of pickups, and see that the center of Manhattan has slower trips on average.
In [10]:
import osmnx as ox
In [ ]: