Datasets: The Lee Yeast Connectivity Data

Open Data Science Initiative

29th May 2014 Neil D. Lawrence

This data set comes from an early publication that used chromatin immunoprecipitation (ChIP) experiments to determine which transcription factors bind to which genes in yeast (Lee et al., 2002).


In [1]:
import pods
import pylab as plt
%matplotlib inline

In [2]:
data = pods.datasets.lee_yeast_ChIP()


Acquiring resource: lee_yeast_ChIP

Details of data: 
Binding location analysis for 106 regulators in yeast. The data consists of p-values for binding of regulators to genes derived from ChIP-chip experiments.

Please cite:
Tong Ihn Lee, Nicola J. Rinaldi, Francois Robert, Duncan T. Odom, Ziv Bar-Joseph, Georg K. Gerber, Nancy M. Hannett, Christopher T. Harbison, Craig M. Thompson, Itamar Simon, Julia Zeitlinger, Ezra G. Jennings, Heather L. Murray, D. Benjamin Gordon, Bing Ren, John J. Wyrick, Jean-Bosco Tagne, Thomas L. Volkert, Ernest Fraenkel, David K. Gifford, Richard A. Young 'Transcriptional Regulatory Networks in Saccharomyces cerevisiae' Science 298 (5594) pg 799--804. DOI: 10.1126/science.1075090

After downloading the data will take up 1674161 bytes of space.

Data will be stored in /Users/neil/ods_data_cache/lee_yeast_ChIP.

Do you wish to proceed with the download? [yes/no]
yes
Downloading  http://jura.wi.mit.edu/young_public/regulatory_network/binding_by_gene.tsv -> /Users/neil/ods_data_cache/lee_yeast_ChIP/binding_by_gene.tsv
[==============================]   9.047/9.047MB

The data consists of $p$-values for the hypothesized binding relationships between the transcription factors and the genes. The 113 transcription factors are listed in data['transcription_factors'].


In [3]:
print(data['transcription_factors'])


['ABF1', 'ACE2', 'ADR1', 'ARG80', 'ARG81', 'ARO80', 'ASH1', 'AZF1', 'BAS1', 'CAD1', 'CBF1', 'CHA4', 'CIN5', 'CRZ1', 'CUP9', 'DAL81', 'DAL82', 'DIG1', 'DOT6', 'ECM22', 'FHL1', 'FKH1', 'FKH2', 'FZF1', 'GAL4', 'GAT1', 'GAT3', 'GCN4', 'GCR1', 'GCR2', 'GLN3', 'GRF10(Pho2)', 'GTS1', 'HAA1', 'HAL9', 'HAP2', 'HAP3', 'HAP4', 'HAP5', 'HIR1', 'HIR2', 'HMS1', 'HSF1', 'IME4', 'INO2', 'INO4', 'IXR1', 'LEU3', 'MAC1', 'MAL13', 'MAL33', 'MATa1', 'MBP1', 'MCM1', 'MET31', 'MET4', 'MIG1', 'MOT3', 'MSN1', 'MSN2', 'MSN4', 'MSS11', 'MTH1', 'NDD1', 'NRG1', 'PDR1', 'PHD1', 'PHO4', 'PUT3', 'RAP1', 'RCS1', 'REB1', 'RFX1', 'RGM1', 'RGT1', 'RIM101', 'RLM1', 'RME1', 'ROX1', 'RPH1', 'RTG1', 'RTG3', 'RTS2', 'SFL1', 'SFP1', 'SIG1', 'SIP4', 'SKN7', 'SKO1', 'SMP1', 'SOK2', 'SRD1', 'STB1', 'STE12', 'STP1', 'STP2', 'SUM1', 'SWI4', 'SWI5', 'SWI6', 'THI2', 'UGA3', 'USV1', 'YAP1', 'YAP3', 'YAP5', 'YAP6', 'YAP7', 'YBR267W', 'YFL044C', 'YJL206C', 'ZAP1', 'ZMS1']

The 6270 gene names and their annotations are given in data['annotations'].

A pandas data frame containing the $p$-values for binding between each gene and each transcription factor is available in data['Y'].
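The counts quoted above (6270 genes, 113 transcription factors) can be checked against the data frame itself. The following is a minimal sketch, not part of pods: the helper name summarize_p_values is hypothetical, and it assumes that the columns of data['Y'] are the transcription factors in the same order as data['transcription_factors'] and that the rows are genes. A small made-up frame stands in for the real data so the block runs without a download.

```python
import pandas as pd


def summarize_p_values(Y, transcription_factors):
    """Report basic consistency facts about a genes-by-factors p-value frame.

    Hypothetical helper for illustration; it is not part of the pods library.
    """
    values = Y.to_numpy(dtype=float)
    return {
        "n_genes": Y.shape[0],
        "n_factors": Y.shape[1],
        # Assumption: columns are expected to follow the factor list order.
        "columns_match": list(Y.columns) == list(transcription_factors),
        # p-values should lie in [0, 1]; missing values would make this False.
        "in_unit_interval": bool(((values >= 0) & (values <= 1)).all()),
    }


# Made-up stand-in for data['Y'] and data['transcription_factors'].
toy_factors = ["ABF1", "ACE2"]
toy_Y = pd.DataFrame(
    {"ABF1": [0.000037, 0.57, 1.0], "ACE2": [0.49, 4.6e-11, 0.73]},
    index=["gene_a", "gene_b", "gene_c"],
)
summary = summarize_p_values(toy_Y, toy_factors)
print(summary)
```

With the real data loaded, the equivalent call would be summarize_p_values(data['Y'], data['transcription_factors']).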


In [4]:
data['Y'].describe()


Out[4]:
ABF1 ACE2 ADR1 ARG80 ARG81 ARO80 ASH1 AZF1 BAS1 CAD1 ... YAP1 YAP3 YAP5 YAP6 YAP7 YBR267W YFL044C YJL206C ZAP1 ZMS1
count 6270.000000 6.270000e+03 6.270000e+03 6.270000e+03 6.270000e+03 6.270000e+03 6270.00000 6270.000000 6.270000e+03 6.270000e+03 ... 6.270000e+03 6270.000000 6.270000e+03 6.270000e+03 6270.000000 6270.000000 6270.000000 6.270000e+03 6.270000e+03 6270.000000
mean 0.520917 4.953694e-01 4.679623e-01 5.647799e-01 5.676383e-01 5.386131e-01 0.46732 0.524318 5.387386e-01 5.160878e-01 ... 5.186252e-01 0.529313 5.599866e-01 5.512690e-01 0.526091 0.556157 0.539022 5.175713e-01 5.671431e-01 0.560589
std 0.270981 2.823651e-01 2.791996e-01 2.773574e-01 2.715823e-01 2.787993e-01 0.29646 0.293448 2.743980e-01 2.852743e-01 ... 2.882458e-01 0.305721 2.683394e-01 2.570737e-01 0.298690 0.308842 0.289868 3.058519e-01 2.819108e-01 0.314172
min 0.000037 4.600000e-11 7.700000e-07 8.100000e-09 1.700000e-15 6.400000e-15 0.00033 0.000240 1.100000e-16 1.200000e-10 ... 4.700000e-08 0.001300 1.100000e-07 4.100000e-15 0.002300 0.012000 0.000080 4.300000e-13 8.900000e-16 0.000210
25% 0.380000 2.600000e-01 2.400000e-01 3.600000e-01 3.700000e-01 3.300000e-01 0.18000 0.280000 3.200000e-01 3.000000e-01 ... 2.800000e-01 0.250000 3.800000e-01 3.900000e-01 0.260000 0.270000 0.310000 2.600000e-01 3.500000e-01 0.300000
50% 0.570000 4.900000e-01 4.400000e-01 5.800000e-01 5.800000e-01 5.400000e-01 0.47000 0.530000 5.500000e-01 4.700000e-01 ... 5.300000e-01 0.520000 5.500000e-01 5.900000e-01 0.510000 0.560000 0.540000 5.000000e-01 5.700000e-01 0.550000
75% 0.720000 7.300000e-01 6.800000e-01 8.100000e-01 8.000000e-01 7.800000e-01 0.70000 0.780000 7.700000e-01 7.600000e-01 ... 7.600000e-01 0.800000 7.700000e-01 7.500000e-01 0.790000 0.840000 0.797500 7.800000e-01 8.100000e-01 0.860000
max 1.000000 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.00000 1.000000 1.000000e+00 1.000000e+00 ... 1.000000e+00 1.000000 1.000000e+00 1.000000e+00 1.000000 1.000000 1.000000 1.000000e+00 1.000000e+00 1.000000

8 rows × 113 columns
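
Since the entries of data['Y'] are $p$-values, one way to obtain a connectivity matrix is to threshold them, treating a small $p$-value as evidence that a transcription factor binds a gene. The sketch below illustrates this under stated assumptions: the function name binding_matrix is hypothetical, the threshold of 0.001 is chosen for illustration and should be set according to the analysis at hand, and a small made-up frame stands in for data['Y'].

```python
import pandas as pd


def binding_matrix(p_values, threshold=1e-3):
    """Return a 0/1 frame marking entries whose p-value is below threshold.

    Hypothetical helper for illustration; the default threshold is an
    assumption, not a value taken from the pods library.
    """
    return (p_values < threshold).astype(int)


# Made-up stand-in for data['Y']: rows are genes, columns are factors.
toy_Y = pd.DataFrame(
    {"ABF1": [0.00002, 0.57, 0.9], "ACE2": [0.4, 4.6e-11, 0.0009]},
    index=["gene_a", "gene_b", "gene_c"],
)

connectivity = binding_matrix(toy_Y)

# Number of genes each factor is called as binding (column sums).
targets_per_factor = connectivity.sum(axis=0)

# Number of factors called as binding each gene (row sums).
regulators_per_gene = connectivity.sum(axis=1)

print(connectivity)
print(targets_per_factor)
```

With the real data, binding_matrix(data['Y']) would give a 6270 by 113 matrix of zeros and ones. A stricter threshold gives a sparser matrix with fewer false positives at the cost of missing weaker binding events.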