Datasets: The Spellman Yeast Data

Open Data Science Initiative

29th May 2014 Neil D. Lawrence

This data set is from a classic early microarray paper on the yeast cell cycle, Spellman et al. (1998).


In [1]:
import pods
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
data = pods.datasets.spellman_yeast()


Acquiring resource: spellman_yeast

Details of data: 
Two colour spotted cDNA array data set of a series of experiments to identify which genes in Yeast are cell cycle regulated.

Please cite:
Paul T. Spellman, Gavin Sherlock, Michael Q. Zhang, Vishwanath R. Iyer, Kirk Anders, Michael B. Eisen, Patrick O. Brown, David Botstein, and Bruce Futcher 'Comprehensive Identification of Cell Cycle-regulated Genes of the Yeast Saccharomyces cerevisiae by Microarray Hybridization.'  Molecular Biology of the Cell 9, 3273-3297

After downloading the data will take up 2510955 bytes of space.

Data will be stored in /Users/neil/ods_data_cache/spellman_yeast.

Do you wish to proceed with the download? [yes/no]
yes
Downloading  http://genome-www.stanford.edu/cellcycle/data/rawdata/combined.txt -> /Users/neil/ods_data_cache/spellman_yeast/combined.txt
[==============================]   2.395/2.395MB

The data is from two-colour spotted cDNA arrays and has been widely studied in computational biology. It contains four different time series as well as induction experiments. The data is returned as a pandas data frame under the key Y, which can be summarised as follows.


In [3]:
data['Y'].describe()


Out[3]:
cln3-1 cln3-2 clb clb2-2 clb2-1 alpha alpha0 alpha7 alpha14 alpha21 ... elu120 elu150 elu180 elu210 elu240 elu270 elu300 elu330 elu360 elu390
count 5985.000000 5813.000000 0 5724.000000 6036.000000 0 6013.000000 5653.000000 5987.000000 5866.000000 ... 6075.000000 6059.000000 6067.000000 6060.000000 6047.000000 6068.000000 6066.000000 6066.000000 6022.000000 6064.000000
mean 0.011270 -0.284361 NaN -0.214780 -0.027646 NaN 0.004695 -0.049756 0.025881 -0.006715 ... 0.038517 -0.010840 -0.003516 0.056287 0.037281 0.016523 0.007295 0.026142 -0.069389 0.037169
std 0.552941 0.620545 NaN 0.420586 0.432394 NaN 0.413762 0.446178 0.360738 0.329861 ... 0.346560 0.326094 0.276335 0.236677 0.243618 0.263399 0.249453 0.300645 0.352169 0.236035
min -2.940000 -3.320000 NaN -2.940000 -2.940000 NaN -2.420000 -2.710000 -1.960000 -2.290000 ... -1.950000 -1.700000 -2.040000 -1.890000 -1.520000 -1.860000 -1.940000 -1.420000 -1.470000 -1.020000
25% -0.300000 -0.640000 NaN -0.430000 -0.270000 NaN -0.160000 -0.280000 -0.150000 -0.180000 ... -0.170000 -0.200000 -0.160000 -0.090000 -0.120000 -0.150000 -0.130000 -0.150000 -0.280000 -0.110000
50% 0.000000 -0.320000 NaN -0.230000 -0.040000 NaN 0.010000 -0.060000 0.030000 -0.020000 ... 0.040000 -0.010000 0.000000 0.050000 0.030000 0.010000 0.020000 0.030000 -0.070000 0.030000
75% 0.300000 0.030000 NaN -0.040000 0.190000 NaN 0.170000 0.150000 0.210000 0.150000 ... 0.240000 0.180000 0.160000 0.190000 0.190000 0.180000 0.160000 0.210000 0.130000 0.160000
max 5.550000 4.120000 NaN 4.510000 5.840000 NaN 4.950000 4.230000 2.380000 2.110000 ... 1.490000 1.400000 1.260000 1.290000 1.260000 1.310000 2.070000 2.160000 2.140000 2.250000

8 rows × 82 columns

The first five columns are the cln3 and clb2 induction experiments. The columns that follow are the alpha, cdc15, cdc28 and elutriation time course experiments. Some columns, such as clb and alpha, have a count of 0 in the summary above, so they contain no measurements. The index gives the gene names. The columns are named according to the experiment.


In [5]:
print(data['Y'].columns)


Index(['cln3-1', 'cln3-2', 'clb', 'clb2-2', 'clb2-1', 'alpha', 'alpha0',
       'alpha7', 'alpha14', 'alpha21', 'alpha28', 'alpha35', 'alpha42',
       'alpha49', 'alpha56', 'alpha63', 'alpha70', 'alpha77', 'alpha84',
       'alpha91', 'alpha98', 'alpha105', 'alpha112', 'alpha119', 'cdc15',
       'cdc15_10', 'cdc15_30', 'cdc15_50', 'cdc15_70', 'cdc15_80', 'cdc15_90',
       'cdc15_100', 'cdc15_110', 'cdc15_120', 'cdc15_130', 'cdc15_140',
       'cdc15_150', 'cdc15_160', 'cdc15_170', 'cdc15_180', 'cdc15_190',
       'cdc15_200', 'cdc15_210', 'cdc15_220', 'cdc15_230', 'cdc15_240',
       'cdc15_250', 'cdc15_270', 'cdc15_290', 'cdc28', 'cdc28_0', 'cdc28_10',
       'cdc28_20', 'cdc28_30', 'cdc28_40', 'cdc28_50', 'cdc28_60', 'cdc28_70',
       'cdc28_80', 'cdc28_90', 'cdc28_100', 'cdc28_110', 'cdc28_120',
       'cdc28_130', 'cdc28_140', 'cdc28_150', 'cdc28_160', 'elu', 'elu0',
       'elu30', 'elu60', 'elu90', 'elu120', 'elu150', 'elu180', 'elu210',
       'elu240', 'elu270', 'elu300', 'elu330', 'elu360', 'elu390'],
      dtype='object')
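The column names printed above share a prefix per experiment, so columns can be grouped by name rather than by position. The following is a minimal sketch and not part of pods: the prefix list and the helper experiment_of are assumptions introduced for illustration, and only a subset of the column names is reproduced.

```python
# Subset of the column names printed above (the full frame has 82 columns).
columns = ['cln3-1', 'cln3-2', 'clb', 'clb2-2', 'clb2-1',
           'alpha', 'alpha0', 'alpha7',
           'cdc15', 'cdc15_10', 'cdc15_30',
           'cdc28', 'cdc28_0', 'cdc28_10',
           'elu', 'elu0', 'elu30']

# Assumed experiment prefixes, inferred from the names above.
PREFIXES = ('cln3', 'clb', 'alpha', 'cdc15', 'cdc28', 'elu')


def experiment_of(column):
    """Return the experiment prefix that a column name starts with."""
    for prefix in PREFIXES:
        if column.startswith(prefix):
            return prefix
    raise ValueError('unrecognised column name: ' + column)


# Map each experiment to its columns, preserving the original column order.
groups = {}
for column in columns:
    groups.setdefault(experiment_of(column), []).append(column)

for experiment, members in groups.items():
    print(experiment, members)
```

With the real data loaded and columns taken from data['Y'].columns, an expression such as data['Y'][groups['cdc15']] would then select the columns of a single experiment.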

The index is given by the gene name. There are 6178 genes in total.


In [6]:
print(data['Y'].index)


Index(['YAL001C', 'YAL002W', 'YAL003W', 'YAL004W', 'YAL005C', 'YAL007C',
       'YAL008W', 'YAL009W', 'YAL010C', 'YAL011W', 
       ...
       'YPR195C', 'YPR196W', 'YPR197C', 'YPR198W', 'YPR199C', 'YPR200C',
       'YPR201W', 'YPR202W', 'YPR203W', 'YPR204W'],
      dtype='object', length=6178)
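The names in the index follow the systematic naming scheme for S. cerevisiae open reading frames: Y for yeast, a chromosome letter (A to P for chromosomes 1 to 16), the chromosome arm (L or R), a three-digit number, and the strand (W or C). The sketch below splits a name into these parts. The helper parse_orf is hypothetical, and it is an assumption that every entry in the index matches this simple pattern; names with an extra suffix would be rejected.

```python
import re

# 'Y', chromosome letter (A-P), arm (L or R), three-digit number, strand (W or C).
ORF_PATTERN = re.compile(r'^Y([A-P])([LR])(\d{3})([WC])$')


def parse_orf(name):
    """Split a systematic ORF name into chromosome, arm, number and strand."""
    match = ORF_PATTERN.match(name)
    if match is None:
        raise ValueError('not a plain systematic ORF name: ' + name)
    letter, arm, number, strand = match.groups()
    return {
        'chromosome': ord(letter) - ord('A') + 1,  # A -> 1, ..., P -> 16
        'arm': arm,
        'number': int(number),
        'strand': strand,
    }


# The first and last names shown in the index above.
print(parse_orf('YAL001C'))
print(parse_orf('YPR204W'))
```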

We also provide a variant of the data for just the cdc15 time course.


In [7]:
data = pods.datasets.spellman_yeast_cdc15()

This variant also provides the associated time points, under the key t.


In [8]:
plt.plot(data['t'], data['Y']['YAR015W'], 'rx')
plt.title('Gene YAR015W from Spellman et al. for the cdc15 Time Course')
plt.xlabel('time')
# Raw string so that \l in \log_2 is not treated as an escape sequence.
plt.ylabel(r'$\log_2$ expression ratio')


Out[8]:
<matplotlib.text.Text at 0x1055a44e0>
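The time axis in the plot above comes from data['t']. When only the full data frame is available, comparable values can be read off the cdc15 column names, under the assumption that the numeric suffix is the sampling time (presumably in minutes). The sketch below is illustrative: the helper time_from_column is hypothetical, and whether its output matches data['t'] exactly has not been checked here.

```python
# First few cdc15 column names, as printed for the full data frame.
cdc15_columns = ['cdc15_10', 'cdc15_30', 'cdc15_50', 'cdc15_70', 'cdc15_80']


def time_from_column(name, prefix='cdc15_'):
    """Return the integer suffix of a column name such as 'cdc15_10'."""
    if not name.startswith(prefix):
        raise ValueError('unexpected column name: ' + name)
    return int(name[len(prefix):])


times = [time_from_column(name) for name in cdc15_columns]

# Gaps between consecutive samples; these are not all equal.
intervals = [later - earlier for earlier, later in zip(times, times[1:])]

print(times)
print(intervals)
```

The unequal intervals are worth keeping in mind for any model that assumes evenly spaced samples.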

As usual, the citation information is included with the data.


In [9]:
print(data['citation'])


Paul T. Spellman, Gavin Sherlock, Michael Q. Zhang, Vishwanath R. Iyer, Kirk Anders, Michael B. Eisen, Patrick O. Brown, David Botstein, and Bruce Futcher 'Comprehensive Identification of Cell Cycle-regulated Genes of the Yeast Saccharomyces cerevisiae by Microarray Hybridization.'  Molecular Biology of the Cell 9, 3273-3297

Extra information about the data is included, as standard, under the keys info and details.


In [10]:
print(data['info'])
print()
print(data['details'])


Time series of synchronized yeast cells from the CDC-15 experiment of Spellman et al (1998).

Two colour spotted cDNA array data set of a series of experiments to identify which genes in Yeast are cell cycle regulated.
