This data set records arrival and departure times for every commercial flight in the USA from January 2008 to April 2008. It contains extensive information about almost 2 million flights, including the delay (in minutes) in reaching the destination.
In [1]:
import pods
import pylab as plt
%matplotlib inline
In [2]:
data = pods.datasets.airline_delay()
The data dictionary contains the standard keys 'X' and 'Y', which contain 700,000 randomly sub-sampled training points.
In [3]:
# A notebook cell displays only its last expression, so show both shapes as one tuple.
data['X'].shape, data['Y'].shape
Out[3]:
Additionally, the keys 'Xtest' and 'Ytest' provide test data. The number of points treated as training data is controlled by the num_train argument, which defaults to 700,000. This number was chosen to match that used in the Gaussian Processes for Big Data paper.
In [4]:
# Show both test shapes as one tuple so neither is silently dropped.
data['Xtest'].shape, data['Ytest'].shape
Out[4]:
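The random sub-sampling into training and test points described above can be illustrated with a small sketch. This is not the code pods uses internally; the function name subsample_split, its num_test and seed parameters, and the synthetic arrays are assumptions made purely for illustration. It shuffles the row indices once and takes disjoint slices for the two sets.

```python
import numpy as np

def subsample_split(X, Y, num_train, num_test, seed=0):
    """Randomly split the rows of X and Y into disjoint training and test sets.

    Hypothetical helper for illustration; not part of pods.
    """
    if num_train + num_test > X.shape[0]:
        raise ValueError("num_train + num_test exceeds the number of rows")
    rng = np.random.default_rng(seed)
    perm = rng.permutation(X.shape[0])       # shuffled row indices
    train_idx = perm[:num_train]             # first num_train shuffled rows
    test_idx = perm[num_train:num_train + num_test]  # next num_test rows
    return {
        'X': X[train_idx], 'Y': Y[train_idx],
        'Xtest': X[test_idx], 'Ytest': Y[test_idx],
    }

# Synthetic stand-in for the flight table: 1,000 rows, 8 features, 1 target.
rng = np.random.default_rng(42)
X_all = rng.normal(size=(1000, 8))
Y_all = rng.normal(size=(1000, 1))

split = subsample_split(X_all, Y_all, num_train=700, num_test=100)
print(split['X'].shape, split['Xtest'].shape)
```

Because both index sets come from a single permutation, no row can appear in both the training and the test data.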
The citation information for the data is included under the key 'citation'.
In [5]:
print(data['citation'])
Extra information about the data is included, as standard, under the keys 'info' and 'details'.
In [6]:
print(data['info'])
print()
print(data['details'])
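To get a quick overview of everything a dictionary of this kind holds, one option is to loop over its keys and report array shapes and short text previews. The sketch below assumes only the key names mentioned in this notebook; the summarize helper and the toy dictionary are hypothetical stand-ins, and the placeholder values are not the real data or metadata.

```python
import numpy as np

def summarize(data):
    """Return one line per key: the shape for arrays, a short preview for text.

    Hypothetical helper for illustration; not part of pods.
    """
    lines = []
    for key, value in data.items():
        if isinstance(value, np.ndarray):
            lines.append(f"{key}: array with shape {value.shape}")
        else:
            text = str(value)
            # Truncate long text so each entry fits on one line.
            preview = text if len(text) <= 60 else text[:57] + "..."
            lines.append(f"{key}: {preview}")
    return lines

# Toy dictionary with placeholder values, mimicking the keys used above.
toy = {
    'X': np.zeros((5, 8)),
    'Y': np.zeros((5, 1)),
    'Xtest': np.zeros((2, 8)),
    'Ytest': np.zeros((2, 1)),
    'info': "placeholder description",
    'details': "placeholder details " * 10,
    'citation': "placeholder citation",
}

for line in summarize(toy):
    print(line)
```

The same call could be applied to the dictionary returned in this notebook, for example summarize(data), assuming its array entries are NumPy arrays.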