Module 15: Pipeline for Analysis of Cell Screening data

In today's assignment, you'll construct a rudimentary pipeline to analyze cell screening data generated at high throughput. These are real data (courtesy of Sara Cherry and David Schultz), and there are many more things to analyze!

In particular, these are results from a high-throughput screen of all FDA-approved small molecules. Cells in each well of a 384-well plate have been treated with a unique small molecule (one drug per well) and assayed to determine a phenotype/response to the drug exposure.

You will find several files in your assignment folder:

plate*.csv: the results files for the cell screening assay that was performed.
sampleids.txt: a list of all samples that were assayed.

For this assignment, we will work entirely in R, so treat the commands in this markdown as you would if you were working in the R environment.

The samples are arrayed in the following way, where maximum and minimum controls are located in the plate:

Remember: the term "maximum" (control) indicates fluorescing cells, whereas "minimum" (control) reflects background fluorescence in the absence of cells (i.e., no response).

0. The assay here is a cell viability experiment where live cells fluoresce when exposed to a laser, whereas dead cells will not fluoresce on laser-light exposure. Imagine we want to screen for drugs that kill cancer cells. If we find a drug that kills the cells, what type of control (maximum, minimum) would we expect our well to look most like? Why?



In [0]:

1. First, let's prepare your input files

create a variable called myplatefile that contains the name of our first plate file, i.e. "plate1.csv".
create a variable called myplatename that contains the name of the plate, i.e. "plate1".
load the data from the file named in myplatefile into a variable called "myplate". The file has a header and is separated by commas.
create an array called "myrownames" whose elements are the letters A, B, …, P. (hint: use c())
display the first 5 rows of the myplate variable.
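The steps above can be sketched as follows. This is only an illustration under stated assumptions: the layout of the real plate1.csv is not shown here, so the sketch writes a small made-up file to a temporary directory (the `toy` table, its column names, and the temporary path are all hypothetical) and then reads it back.

```r
# Hypothetical stand-in for plate1.csv, written to a temp directory so the
# sketch runs anywhere. The real file's layout and values will differ.
myplatename <- "plate1"
myplatefile <- file.path(tempdir(), paste(myplatename, "csv", sep = "."))
toy <- data.frame(X1 = c(100, 98, 97), X2 = c(55, 60, 12), X3 = c(5, 4, 6))
write.csv(toy, myplatefile, row.names = FALSE)

# Read a comma-separated file whose first line is a header
myplate <- read.table(myplatefile, header = TRUE, sep = ",")

# Row labels A..P, built explicitly with c()
myrownames <- c("A", "B", "C", "D", "E", "F", "G", "H",
                "I", "J", "K", "L", "M", "N", "O", "P")

# Show up to the first 5 rows (the toy table only has 3)
head(myplate, 5)
```

In the assignment itself, myplatefile would simply hold "plate1.csv" from your working directory.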



In [0]:

2a. How many maximum controls have been assayed?



In [0]:

2b. Calculate the average of all the maximum controls and of all the minimum controls, and store them in variables called "mean_max" and "mean_min", respectively.



In [0]:

2c. Calculate the standard deviation of all the maximum controls and of all the minimum controls, and store them in variables called "sd_max" and "sd_min", respectively.
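As a rough illustration of the summary statistics in 2a to 2c, the sketch below uses two short made-up vectors, `max_ctrl` and `min_ctrl` (hypothetical names and values). On a real plate you would first extract the control wells from myplate according to the plate map.

```r
# Hypothetical control readings; on the real plate these would be pulled
# from the control wells of myplate.
max_ctrl <- c(100, 104, 96, 98, 102)
min_ctrl <- c(4, 6, 5, 5, 5)

length(max_ctrl)            # number of maximum controls in this toy example

mean_max <- mean(max_ctrl)  # average of the maximum controls
mean_min <- mean(min_ctrl)  # average of the minimum controls

sd_max <- sd(max_ctrl)      # sample standard deviation (n - 1 denominator)
sd_min <- sd(min_ctrl)
```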



In [0]:

3a. Using these variables, calculate the following quantities:

the Z-prime factor:
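The formula itself is not reproduced in this text. A minimal sketch, assuming the commonly used definition Z' = 1 - 3 * (sd_max + sd_min) / |mean_max - mean_min| and made-up summary statistics (check the definition against your course notes):

```r
# Toy summary statistics (hypothetical, not taken from any plate)
mean_max <- 100; sd_max <- 3
mean_min <- 5;   sd_min <- 1

# ASSUMED definition: Z' = 1 - 3 * (sd_max + sd_min) / |mean_max - mean_min|
zprime <- 1 - 3 * (sd_max + sd_min) / abs(mean_max - mean_min)
zprime
```

With this definition, tight controls that are far apart push the value toward 1, while noisy or overlapping controls pull it down.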



In [0]:

3b. What is the Z-prime factor used for in a screening assay? For this plate, how would you interpret the result from your Z-prime factor score?



In [0]:

3c. Calculate the signal-to-noise ratio:
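The formula is not reproduced in this text, and several definitions of signal-to-noise are in use. The sketch below assumes one common screening convention, (mean_max - mean_min) / sd_min, with made-up numbers; if your course materials define it differently, use that definition instead.

```r
# Toy summary statistics (hypothetical)
mean_max <- 100
mean_min <- 5
sd_min   <- 1

# ASSUMED definition: difference between the control means, divided by the
# standard deviation of the background (minimum) control
signal_to_noise <- (mean_max - mean_min) / sd_min
signal_to_noise
```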



In [0]:

3d. What is the signal-to-noise ratio used to determine in a screening assay?



In [0]:

3e. Calculate the coefficient of variation for both the maximum and minimum controls:
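A minimal sketch, assuming the usual definition of the coefficient of variation (standard deviation divided by the mean, expressed here as a percentage) and made-up summary statistics. The names `cv_max` and `cv_min` are hypothetical.

```r
# Toy summary statistics (hypothetical)
mean_max <- 100; sd_max <- 3
mean_min <- 5;   sd_min <- 1

# Coefficient of variation as a percentage: 100 * sd / mean
cv_max <- 100 * sd_max / mean_max
cv_min <- 100 * sd_min / mean_min
c(cv_max = cv_max, cv_min = cv_min)
```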



In [0]:

3f. What is the coefficient of variation - measured on our controls - used for in the context of a screening assay?



In [0]:

4. For each sample on the plate, calculate its value as a percentage of the mean of the maximum control (the value divided by mean_max, multiplied by 100), and store the result in a new variable called "myplate_poc".
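Because arithmetic on a data frame is applied to every cell, the percent-of-control calculation can be written in one line. A small sketch with a made-up 2 x 2 table (`toy_plate` and the value of mean_max are hypothetical):

```r
# Toy plate readings and a toy maximum-control mean (hypothetical values)
toy_plate <- data.frame(c1 = c(100, 50), c2 = c(25, 110))
mean_max <- 200

# Each well as a percentage of the maximum-control mean
myplate_poc <- toy_plate / mean_max * 100
myplate_poc
```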



In [0]:

5. Next, normalize the data as a Z-score relative to the maximum control: subtract the mean of the maximum control from each value, then divide by the standard deviation of the maximum control. Store the result in a variable called "myplate_zscores".
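A minimal sketch of this normalization, assuming it means (value - mean_max) / sd_max, applied to a made-up 2 x 2 table (`toy_plate` and both control statistics are hypothetical):

```r
# Toy plate readings and toy maximum-control statistics (hypothetical)
toy_plate <- data.frame(c1 = c(100, 50), c2 = c(75, 110))
mean_max <- 100
sd_max <- 5

# ASSUMED normalization: distance from the maximum-control mean, in units
# of the maximum-control standard deviation
myplate_zscores <- (toy_plate - mean_max) / sd_max
myplate_zscores
```

Wells that behave like the maximum control land near 0, and wells with much lower signal get large negative scores.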



In [0]:

Next, we'll use the paste() command to create a new string name "on the fly", so that you don't overwrite your file name.

This is useful if we want to re-run this code on multiple plates (wink, wink! See Homework below). That way, we only have to change a variable in one place (at the very beginning of the pipeline!) rather than in multiple places throughout the code.

Here's an example of how paste() works:

> string1 <- "foo"
> string2 <- "bar"
> mypaste <- paste(string1, string2, sep="_")
> mypaste
[1] "foo_bar"

6a. store the string "zscores.csv" in a new variable called outfix



In [0]:

6b. use paste() to combine myplatename and outfix into a new variable called "zscores_outfile".

Then, print zscores_outfile to the screen (the value should be "plate1_zscores.csv" if you've done this right)



In [0]:

6c. Write the data contained in myplate_zscores to the file named in the variable 'zscores_outfile':

without column names, row names, or quotes.
separated by commas.
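One way to meet these requirements is write.table() with its header, row-name, and quoting options switched off. The sketch below uses a made-up table and writes to a temporary directory (both are assumptions for illustration; in the assignment, zscores_outfile would just be "plate1_zscores.csv").

```r
# Build the output file name; tempdir() is used only so the sketch does not
# write into your working directory.
myplatename <- "plate1"
outfix <- "zscores.csv"
zscores_outfile <- file.path(tempdir(), paste(myplatename, outfix, sep = "_"))

# Toy Z-scores (hypothetical values)
myplate_zscores <- data.frame(c1 = c(0, -10), c2 = c(-5, 2))

# Comma-separated, with no column names, row names, or quotes
write.table(myplate_zscores, file = zscores_outfile, sep = ",",
            row.names = FALSE, col.names = FALSE, quote = FALSE)

# Read the raw lines back to check what was written
readLines(zscores_outfile)
```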



In [0]:

Finally, let's create a heatmap which indicates all of the samples on the plate where the normalized Z score was lower than -5. To do that, we need to do a couple of things.

7a. First, create a variable called "threshold", set equal to -5



In [0]:

Next, create a matrix that is "1" for all cells in the table that are lower than -5, and "0" otherwise. To accomplish this, you can combine the comparison operator "<" with a conversion to a number and as.matrix():

table_forheatmap <- as.matrix( 1 * (mytable < threshold) )

the multiplication-by-one step is a trick that converts the result of the comparison to a number (i.e. it converts TRUE/FALSE to 1/0).
as.matrix() is a function which will convert the table to a matrix object in R (needed for the heatmap function).
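To see what each piece does, here is the same expression applied to a small made-up table (`toy_z` is hypothetical). Note that a value exactly equal to the threshold is not counted, because the comparison is strictly "less than".

```r
threshold <- -5

# Toy Z-scores (hypothetical values)
toy_z <- data.frame(c1 = c(0, -10), c2 = c(-5, -7.5))

# The comparison yields TRUE/FALSE per cell; multiplying by 1 turns that
# into 1/0; as.matrix() ensures the result is a plain numeric matrix.
table_forheatmap <- as.matrix(1 * (toy_z < threshold))
table_forheatmap

# Because hits are coded as 1, sum() counts them
sum(table_forheatmap)
```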

7b. Using this expression, make a table for your data.



In [0]:

7c. Use the heatmap() function to make a plot for data in table_forheatmap, which includes the following arguments:

these arguments are for clustering; we don't want to cluster our assay data.

Rowv=NA Colv=NA

we want to name the rows we see in the resulting heat map, per what we defined above

labRow=myrownames

do not attempt to rescale the data, plot the actual values

scale="none"

use the following color scheme: dark blue for "greater than threshold", gold for "less than threshold"

col=c("dark blue", "gold")

print the table in "reverse" order, e.g. put A at the top, P at the bottom

revC=T
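Putting the arguments together, a call might look like the sketch below. It uses a made-up 3 x 4 matrix of 0/1 values and three row labels (`toy_hits` and `toy_rownames` are hypothetical); for the assignment you would pass table_forheatmap and myrownames instead.

```r
# Toy 0/1 hit matrix: 3 rows (A..C) by 4 columns (hypothetical values)
toy_hits <- matrix(c(0, 1, 0, 0,
                     0, 0, 1, 0,
                     0, 0, 0, 0), nrow = 3, byrow = TRUE)
toy_rownames <- c("A", "B", "C")

hm <- heatmap(toy_hits,
              Rowv = NA, Colv = NA,         # no clustering or reordering
              labRow = toy_rownames,        # label the rows
              scale = "none",               # plot the actual 0/1 values
              col = c("darkblue", "gold"),  # 0 -> dark blue, 1 -> gold
              revC = TRUE)                  # first row (A) at the top
```

heatmap() draws the plot and invisibly returns the row and column order it used, which is stored in `hm` here only so the ordering can be inspected.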



In [0]:

7d. Did your controls work? Why or why not?

7e. Excluding the controls, how many cells returned values less than -5 on this plate?

HOMEWORK ASSIGNMENT

In the above assignment, you will note that you could have tabulated virtually all of the above in Excel. However, in a true high-throughput screening assay, you will have hundreds of plates to process. That is too much for any one person to do in Excel without making mistakes! You may have noticed that, by doing this assignment, you have written a 'generic' pipeline to process a single plate.

8a. In order to process a different plate, called "plate2", what would you change in the pipeline you created above?

8b. We have included data from 6 additional plates. Process each and report here (excluding controls):

the Z prime factor for each plate
the number of cells that gave lower than a -5 normalized score, excluding controls, per plate. Note that rather than counting the results on the heatmap, you could sum() within the appropriate part of the heatmap table (excluding the controls, of course)
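The sum() idea can be sketched on a toy 0/1 table. The control positions below are an assumption made purely for illustration (`control_cols` and `n_hits` are hypothetical names); use the real plate map to decide which columns to drop.

```r
# Toy 0/1 hit table: 2 rows x 4 columns, filled column by column
table_forheatmap <- matrix(c(1, 1,  0, 1,  0, 0,  1, 1), nrow = 2)

# ASSUMPTION: in this toy layout, columns 1 and 4 hold the controls
control_cols <- c(1, 4)

# Negative indexing drops the control columns before counting the hits
n_hits <- sum(table_forheatmap[, -control_cols])
n_hits
```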

To do this, you could re-run the markdown and record the results in the table below. After you have done this, make sure your markdown provides the analysis for plate1.

Z-factor for plate 2 is:
Z-factor for plate 3 is:
Z-factor for plate 4 is:
Z-factor for plate 5 is:
Z-factor for plate 6 is:
Z-factor for plate 7 is:

Number of cells less than -5 for plate 2 is:
Number of cells less than -5 for plate 3 is:
Number of cells less than -5 for plate 4 is:
Number of cells less than -5 for plate 5 is:
Number of cells less than -5 for plate 6 is:
Number of cells less than -5 for plate 7 is:

You will also notice that you do not actually have any sample names attached to your data, e.g. what you actually screened.

Imagine that you were provided a file that looked like the following:

sampleid,row,col,plate
87234,C,3,1
7134,C,4,1
...
81672,P,22,7

i.e., a file with 2240 rows (+1 header) where each sampleid is mapped to a corresponding row, column, and plate. Note that the positive and negative control columns are excluded.

8c. Imagine that you now wanted to obtain the sampleids (i.e., the gene code id!) for a set of cells that I was interested in, the 'hits'.

Let's imagine that I had another file which listed all of the cells that had a normalized Z-score less than -5, e.g., it looked like this:

row,col,plate
C,3,1
D,5,2
P,6,7

Describe in words the steps that would allow a computer to print out the sampleids ONLY for the entries listed in this new file. To help you, we have provided the first two steps in the process. You complete the rest!

Be specific in the details of what you would check for during your look-up.
Hint: Pretend you had two sheets of paper, each with one of your lists, and you had to do this 'by hand'. What would you do, step-by-step?

Step one: Tell the computer to load the sampleid file into memory in R
Step two: Tell the computer to load the 'hit' file into memory in R
Step three: ...