This data set collection is from the Mauna Loa observatory, which records atmospheric carbon dioxide levels. The data was used by Rasmussen and Williams (2006) to demonstrate hyperparameter setting in Gaussian processes. When first called, or when called with refresh_data=True, the latest version of the data set is downloaded; otherwise the cached version of the data set is loaded from disk.
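pods' own implementation isn't shown here, but the download-or-cache behaviour described above can be sketched minimally as follows, with a hypothetical fetch_latest function standing in for the real network download:

```python
import json
import os
import tempfile

def fetch_latest():
    """Stand-in for the real download step (illustrative values only)."""
    return {"X": [1958.2], "Y": [315.71]}

def load_dataset(cache_path, refresh_data=False):
    """Fetch the data when a refresh is requested or no cached copy
    exists; otherwise read the copy already on disk."""
    if refresh_data or not os.path.exists(cache_path):
        data = fetch_latest()
        with open(cache_path, "w") as f:
            json.dump(data, f)
        return data
    with open(cache_path) as f:
        return json.load(f)

cache = os.path.join(tempfile.mkdtemp(), "mauna_loa.json")
first = load_dataset(cache)                       # downloads, writes cache
second = load_dataset(cache)                      # served from disk
forced = load_dataset(cache, refresh_data=True)   # re-downloads
```

All three calls return the same dictionary here; the difference is only whether the data came from the (simulated) download or from the cache file.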
In [1]:
import pods
import matplotlib.pyplot as plt
%matplotlib inline
In [2]:
data = pods.datasets.mauna_loa()
Here, because I've downloaded the data before, I have a cached version. To download a fresh copy of the data I can set refresh_data=True.
In [3]:
data = pods.datasets.mauna_loa(refresh_data=True)
The data dictionary contains the standard keys 'X' and 'Y', which give a one-dimensional regression problem.
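As a rough sketch of that convention (illustrative values only, not the real observations): inputs and targets are column arrays of matching length, so each row of 'X' pairs with the corresponding row of 'Y'.

```python
import numpy as np

# A minimal dictionary in the same shape as the one returned above:
# column vectors of equal length, one row per observation.
data = {
    "X": np.array([[1958.0], [1959.0], [1960.0]]),  # year
    "Y": np.array([[315.4], [315.9], [316.9]]),     # CO2 concentration, ppm
}
assert data["X"].shape == data["Y"].shape == (3, 1)
```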
In [4]:
plt.plot(data['X'], data['Y'], 'rx')
plt.xlabel('year')
plt.ylabel('CO$_2$ concentration in ppm')
Out[4]:
Additionally there are keys Xtest and Ytest which provide test data. The number of points treated as training data is controlled by the num_train argument, which defaults to 545; this number is chosen to match that used in the Gaussian Processes for Machine Learning book. Below we plot the test and training data.
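pods performs this split internally; conceptually, a num_train split of this kind could look like the sketch below, shown on synthetic stand-in arrays rather than the real data:

```python
import numpy as np

def split_train_test(X, Y, num_train=545):
    """Conceptual sketch: the first num_train points become training
    data, the remainder become test data."""
    return X[:num_train], Y[:num_train], X[num_train:], Y[num_train:]

# Synthetic stand-in for the real arrays (600 observations).
X = np.linspace(1958.0, 2000.0, 600).reshape(-1, 1)
Y = np.sin(X)

Xtrain, Ytrain, Xtest, Ytest = split_train_test(X, Y)
```

With 600 points and the default num_train=545, this leaves 55 points of test data.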
In [5]:
plt.plot(data['X'], data['Y'], 'rx')
plt.plot(data['Xtest'], data['Ytest'], 'go')
plt.xlabel('year')
plt.ylabel('CO$_2$ concentration in ppm')
Out[5]:
Of course we have included the citation information for the data.
In [6]:
print(data['citation'])
And extra information about the data is included, as standard, under the keys info and details.
In [7]:
print(data['info'])
print()
print(data['details'])
And, importantly, for reference you can also check the license for the data:
In [8]:
print(data['license'])