Datasets: Downloading Data on Airline Delays

Open Data Science Initiative

4th October 2015 Neil D. Lawrence

Flight arrival and departure times for every commercial flight in the USA from January 2008 to April 2008. This dataset contains extensive information about almost 2 million flights, including the delay (in minutes) in reaching the destination.


In [1]:
import pods
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
data = pods.datasets.airline_delay()


Acquiring resource: airline_delay

Details of data: 
Flight arrival and departure times for every commercial flight in the USA from January 2008 to April 2008. This dataset contains extensive information about almost 2 million flights, including the delay (in minutes) in reaching the desitnation.

Please cite:
Gaussian Processes for Big Data (In UAI'13). J. Hensman, N. Fusi and N. D. Lawrence

After downloading the data will take up 180905165 bytes of space.

Data will be stored in /Users/neil/ods_data_cache/airline_delay.

Do you wish to proceed with the download? [yes/no]
yes
Downloading  http://staffwww.dcs.shef.ac.uk/people/N.Lawrence/dataset_mirror/airline_delay/filtered_data.pickle -> /Users/neil/ods_data_cache/airline_delay/filtered_data.pickle
[==============================] 172.525/172.525MB

The data dictionary contains the standard keys 'X' and 'Y', which hold 700,000 randomly sub-sampled training points.


In [3]:
data['X'].shape
data['Y'].shape


Out[3]:
(700000, 1)
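
A notebook cell echoes only its final expression, so the output above is the shape of data['Y'] alone. One way to see every shape at once is a small helper along the following lines. This is a minimal sketch: the name array_shapes and the toy dictionary are hypothetical and are not part of pods.

```python
import numpy as np


def array_shapes(data):
    """Return {key: shape} for every NumPy array stored in a dictionary."""
    return {key: value.shape
            for key, value in data.items()
            if isinstance(value, np.ndarray)}


# Toy stand-in for a dataset dictionary (hypothetical sizes, not the airline data).
toy = {
    'X': np.zeros((5, 3)),
    'Y': np.zeros((5, 1)),
    'citation': 'not an array, so it is skipped',
}

for key, shape in array_shapes(toy).items():
    print(key, shape)
```

Calling array_shapes(data) on the dictionary returned by pods should list the training and test arrays in the same way, assuming they are stored as NumPy arrays.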

Additionally, the keys Xtest and Ytest provide test data. The number of training points is controlled by the num_train argument, which defaults to 700,000. This default matches the number used in the Gaussian Processes for Big Data paper.


In [4]:
data['Xtest'].shape
data['Ytest'].shape


Out[4]:
(100000, 1)
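
The random sub-sampling into training and test sets can be sketched with NumPy. This is an illustration under stated assumptions, not the code that pods runs: the function split_train_test, its num_test and seed parameters, and the toy arrays are all hypothetical. Only num_train is named in the text.

```python
import numpy as np


def split_train_test(X, Y, num_train, num_test, seed=0):
    """Randomly sub-sample disjoint training and test sets from X and Y."""
    if num_train + num_test > X.shape[0]:
        raise ValueError("num_train + num_test exceeds the number of rows")
    rng = np.random.default_rng(seed)
    # Shuffle the row indices once so training and test rows never overlap.
    perm = rng.permutation(X.shape[0])
    train = perm[:num_train]
    test = perm[num_train:num_train + num_test]
    return {'X': X[train], 'Y': Y[train],
            'Xtest': X[test], 'Ytest': Y[test]}


# Toy data: 10 rows, 2 input columns, 1 target column.
X = np.arange(20, dtype=float).reshape(10, 2)
Y = np.arange(10, dtype=float).reshape(10, 1)
split = split_train_test(X, Y, num_train=7, num_test=3)
print(split['X'].shape, split['Ytest'].shape)
```

Drawing both sets from a single permutation guarantees that no flight appears in both the training and the test data, and fixing the seed makes the split reproducible.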

The citation information for the data is stored under the key citation.


In [5]:
print(data['citation'])


Gaussian Processes for Big Data (In UAI'13). J. Hensman, N. Fusi and N. D. Lawrence

Extra information about the data is included, as standard, under the keys info and details.


In [6]:
print(data['info'])
print()
print(data['details'])


Airline delay data used for demonstrating Gaussian processes for big data.

Flight arrival and departure times for every commercial flight in the USA from January 2008 to April 2008. This dataset contains extensive information about almost 2 million flights, including the delay (in minutes) in reaching the desitnation.
