Datasets: The Spellman Yeast Data

Open Data Science Initiative

29th May 2014 Neil D. Lawrence

This data set is from a classic early microarray paper on the yeast cell cycle, Spellman et al (1998).



In [1]:

    
import pods
import matplotlib.pyplot as plt
%matplotlib inline



In [2]:

    
data = pods.datasets.spellman_yeast()









    



Acquiring resource: spellman_yeast

Details of data: 
Two colour spotted cDNA array data set of a series of experiments to identify which genes in Yeast are cell cycle regulated.

Please cite:
Paul T. Spellman, Gavin Sherlock, Michael Q. Zhang, Vishwanath R. Iyer, Kirk Anders, Michael B. Eisen, Patrick O. Brown, David Botstein, and Bruce Futcher 'Comprehensive Identification of Cell Cycle-regulated Genes of the Yeast Saccharomyces cerevisiae by Microarray Hybridization.'  Molecular Biology of the Cell 9, 3273-3297

After downloading the data will take up 2510955 bytes of space.

Data will be stored in /Users/neil/ods_data_cache/spellman_yeast.

Do you wish to proceed with the download? [yes/no]
yes
Downloading  http://genome-www.stanford.edu/cellcycle/data/rawdata/combined.txt -> /Users/neil/ods_data_cache/spellman_yeast/combined.txt
[==============================]   2.395/2.395MB

The data is from two colour spotted cDNA arrays and has been widely studied in computational biology. It contains four different time series as well as induction experiments. The expression values are returned as a pandas data frame under the key Y, which can be summarised as follows.



In [3]:

    
data['Y'].describe()









    Out[3]:






  
    
      
            cln3-1       cln3-2          clb       clb2-2       clb2-1        alpha       alpha0       alpha7      alpha14      alpha21  ...       elu120       elu150       elu180       elu210       elu240       elu270       elu300       elu330       elu360       elu390
count  5985.000000  5813.000000            0  5724.000000  6036.000000            0  6013.000000  5653.000000  5987.000000  5866.000000  ...  6075.000000  6059.000000  6067.000000  6060.000000  6047.000000  6068.000000  6066.000000  6066.000000  6022.000000  6064.000000
mean      0.011270    -0.284361          NaN    -0.214780    -0.027646          NaN     0.004695    -0.049756     0.025881    -0.006715  ...     0.038517    -0.010840    -0.003516     0.056287     0.037281     0.016523     0.007295     0.026142    -0.069389     0.037169
std       0.552941     0.620545          NaN     0.420586     0.432394          NaN     0.413762     0.446178     0.360738     0.329861  ...     0.346560     0.326094     0.276335     0.236677     0.243618     0.263399     0.249453     0.300645     0.352169     0.236035
min      -2.940000    -3.320000          NaN    -2.940000    -2.940000          NaN    -2.420000    -2.710000    -1.960000    -2.290000  ...    -1.950000    -1.700000    -2.040000    -1.890000    -1.520000    -1.860000    -1.940000    -1.420000    -1.470000    -1.020000
25%      -0.300000    -0.640000          NaN    -0.430000    -0.270000          NaN    -0.160000    -0.280000    -0.150000    -0.180000  ...    -0.170000    -0.200000    -0.160000    -0.090000    -0.120000    -0.150000    -0.130000    -0.150000    -0.280000    -0.110000
50%       0.000000    -0.320000          NaN    -0.230000    -0.040000          NaN     0.010000    -0.060000     0.030000    -0.020000  ...     0.040000    -0.010000     0.000000     0.050000     0.030000     0.010000     0.020000     0.030000    -0.070000     0.030000
75%       0.300000     0.030000          NaN    -0.040000     0.190000          NaN     0.170000     0.150000     0.210000     0.150000  ...     0.240000     0.180000     0.160000     0.190000     0.190000     0.180000     0.160000     0.210000     0.130000     0.160000
max       5.550000     4.120000          NaN     4.510000     5.840000          NaN     4.950000     4.230000     2.380000     2.110000  ...     1.490000     1.400000     1.260000     1.290000     1.260000     1.310000     2.070000     2.160000     2.140000     2.250000
    
  

8 rows × 82 columns
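
The count row in the summary differs from column to column and is 0 for clb and alpha, which suggests that the frame contains missing values and some columns with no values at all. The following is a minimal sketch of how such gaps might be inspected with pandas. It uses a small made-up frame as a stand-in for data['Y'] so that it runs without downloading anything; the values are invented and only the layout (genes as rows, arrays as columns) mirrors the real data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for data['Y']: rows are genes, columns are arrays.
# The numbers are invented for illustration only.
Y = pd.DataFrame(
    {
        "cln3-1": [0.1, np.nan, -0.3],
        "clb": [np.nan, np.nan, np.nan],  # a column with no values at all
        "alpha0": [0.0, 0.2, np.nan],
    },
    index=["YAL001C", "YAL002W", "YAL003W"],
)

# Number of missing measurements in each column.
missing_per_column = Y.isna().sum()

# Drop the columns that contain no measurements at all.
Y_clean = Y.dropna(axis=1, how="all")

print(missing_per_column)
print(list(Y_clean.columns))
```

Whether empty columns should be dropped or kept as separators depends on the analysis; dropping them is just one option.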

The first five columns are the clb2 and cln3 induction experiments. The columns that follow are the alpha, cdc15, cdc28 and elutriation time course experiments. Each column is named after its experiment, and the index gives the gene names. Note that some columns, such as clb and alpha, contain no values: their count in the summary is 0.



In [5]:

    
print(data['Y'].columns)









    



Index(['cln3-1', 'cln3-2', 'clb', 'clb2-2', 'clb2-1', 'alpha', 'alpha0',
       'alpha7', 'alpha14', 'alpha21', 'alpha28', 'alpha35', 'alpha42',
       'alpha49', 'alpha56', 'alpha63', 'alpha70', 'alpha77', 'alpha84',
       'alpha91', 'alpha98', 'alpha105', 'alpha112', 'alpha119', 'cdc15',
       'cdc15_10', 'cdc15_30', 'cdc15_50', 'cdc15_70', 'cdc15_80', 'cdc15_90',
       'cdc15_100', 'cdc15_110', 'cdc15_120', 'cdc15_130', 'cdc15_140',
       'cdc15_150', 'cdc15_160', 'cdc15_170', 'cdc15_180', 'cdc15_190',
       'cdc15_200', 'cdc15_210', 'cdc15_220', 'cdc15_230', 'cdc15_240',
       'cdc15_250', 'cdc15_270', 'cdc15_290', 'cdc28', 'cdc28_0', 'cdc28_10',
       'cdc28_20', 'cdc28_30', 'cdc28_40', 'cdc28_50', 'cdc28_60', 'cdc28_70',
       'cdc28_80', 'cdc28_90', 'cdc28_100', 'cdc28_110', 'cdc28_120',
       'cdc28_130', 'cdc28_140', 'cdc28_150', 'cdc28_160', 'elu', 'elu0',
       'elu30', 'elu60', 'elu90', 'elu120', 'elu150', 'elu180', 'elu210',
       'elu240', 'elu270', 'elu300', 'elu330', 'elu360', 'elu390'],
      dtype='object')
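
Because each column name combines an experiment prefix with a numeric suffix, the columns of one time course can be picked out by pattern matching. The sketch below shows one way this might be done. The helper `time_course_columns` is hypothetical (it is not part of pods), and reading the numeric suffix as a time point is an assumption based on the naming pattern, not something the listing states.

```python
import re

import pandas as pd

# A few column names copied from the listing above.
columns = pd.Index(
    ["cln3-1", "alpha", "alpha0", "alpha7",
     "cdc15", "cdc15_10", "cdc15_30", "elu", "elu0", "elu30"]
)


def time_course_columns(columns, prefix):
    """Return (column name, numeric suffix) pairs for one experiment prefix.

    Hypothetical helper: matches names such as 'alpha7' or 'cdc15_10',
    where the suffix may or may not be separated by an underscore.
    """
    pattern = re.compile(r"^" + re.escape(prefix) + r"_?(\d+)$")
    pairs = []
    for name in columns:
        match = pattern.match(name)
        if match:
            pairs.append((name, int(match.group(1))))
    return pairs


cdc15 = time_course_columns(columns, "cdc15")
alpha = time_course_columns(columns, "alpha")
print(cdc15)
print(alpha)
```

The bare prefix columns (for example 'cdc15' or 'alpha') do not match the pattern, so they are left out of the result.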

The index is given by the gene name. There are 6178 genes in total.



In [6]:

    
print(data['Y'].index)









    



Index(['YAL001C', 'YAL002W', 'YAL003W', 'YAL004W', 'YAL005C', 'YAL007C',
       'YAL008W', 'YAL009W', 'YAL010C', 'YAL011W', 
       ...
       'YPR195C', 'YPR196W', 'YPR197C', 'YPR198W', 'YPR199C', 'YPR200C',
       'YPR201W', 'YPR202W', 'YPR203W', 'YPR204W'],
      dtype='object', length=6178)
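
Since the index holds the gene names, individual genes can be looked up by label. The following is a minimal sketch using a small made-up frame in place of data['Y']. The values are invented, and 'YXX999X' is a deliberately fictitious name used to show what happens when a gene is absent.

```python
import numpy as np
import pandas as pd

# Toy stand-in for data['Y'] with invented values.
Y = pd.DataFrame(
    {"cdc15_10": [0.1, -0.2, np.nan], "cdc15_30": [0.3, 0.0, 0.5]},
    index=["YAL001C", "YAL002W", "YAL003W"],
)

# One gene: a Series indexed by column (array) name.
profile = Y.loc["YAL002W"]

# Several genes: reindex keeps unknown names as rows filled with NaN,
# whereas Y.loc[wanted] would raise a KeyError for the missing label.
wanted = ["YAL003W", "YXX999X"]
subset = Y.reindex(wanted)

# Names that are actually present in the index.
present = [gene for gene in wanted if gene in Y.index]

print(profile)
print(subset)
print(present)
```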

We also provide a variant of the data for just the cdc15 time course.



In [7]:

    
data = pods.datasets.spellman_yeast_cdc15()

This variant also provides the associated time points, under the key t.



In [8]:

    
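# Plot the expression of one gene against the cdc15 time points as red crosses.
plt.plot(data['t'], data['Y']['YAR015W'], 'rx')
plt.title('Gene YAR015W from Spellman et al for the cdc15 Time Course')
plt.xlabel('time')
# Raw string so that the backslash in \log_2 is not treated as an escape sequence.
plt.ylabel(r'$\log_2$ expression ratio')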









    Out[8]:





<matplotlib.text.Text at 0x1055a44e0>

As usual, the citation information for the data is included.



In [9]:

    
print(data['citation'])









    



Paul T. Spellman, Gavin Sherlock, Michael Q. Zhang, Vishwanath R. Iyer, Kirk Anders, Michael B. Eisen, Patrick O. Brown, David Botstein, and Bruce Futcher 'Comprehensive Identification of Cell Cycle-regulated Genes of the Yeast Saccharomyces cerevisiae by Microarray Hybridization.'  Molecular Biology of the Cell 9, 3273-3297

Extra information about the data is included, as standard, under the keys info and details.



In [10]:

    
print(data['info'])
print()
print(data['details'])









    



Time series of synchronized yeast cells from the CDC-15 experiment of Spellman et al (1998).

Two colour spotted cDNA array data set of a series of experiments to identify which genes in Yeast are cell cycle regulated.


