This notebook will walk through some of the basic functionality in the cmapR package, which is largely centered around working with matrix data in GCT and GCTX format. We'll be using data from one of CMap's GEO series, GSE70138, but the steps below are applicable to any GCT or GCTX file.
Gene Expression Omnibus, or GEO, is a NCBI-maintained public repository of array and sequence genomics data. As of April 2017, we have provided 1.7 million profiles to GEO at the following accessions:
Please note that this does not represent the only data available in GCT and GCTX format; we simply mention it as a public resource for freely available data in these formats, and will use it as an example in this tutorial. Data can be downloaded from GEO using either using FTP or HTTP; see accession pages above for details.
cmapR source code can be obtained from github. To install the package, follow the install instructions in the README file within the cmapR repository.
In [4]:
# load the library
library(cmapR)
In [40]:
# access the GCT class help
?`GCT-class`
If the file you are working with fits entirely within your computer's memory, you can read the entire file in one shot. Using the level 5 matrix from GSE70138, that would look like the line below. We've commented it out here because the file is large enough that reading the whole thing is impractical.
In [3]:
# create a variable to store the path to the level 5 GCTX file
ds_path <- "GSE70138_Broad_LINCS_Level5_COMPZ_n118050x12328_2017-03-06.gctx"
my_ds <- parse.gctx(ds_path)
When working with large GCTX files, such as those in the GEO repositories listed above, it is usually not possible to read the entire file into memory all at once. In these cases, it's helpful to identify a subset of the data that is of particular interest and read only those rows or columns. In this example, we'll use the level 5 signature metadata to find all the signatures (columns) pertaining to the compound vorinostat.
In [4]:
# read the column annotations as a data.frame
col_meta_path <- "GSE70138_Broad_LINCS_sig_info_2017-03-06.txt"
col_meta <- read.delim(col_meta_path, sep="\t", stringsAsFactors=F)
# figure out which signatures correspond to vorinostat by searching the 'pert_iname' column
idx <- which(col_meta$pert_iname=="vorinostat")
# and get the corresponding sig_ids
sig_ids <- col_meta$sig_id[idx]
# read only those columns from the GCTX file by using the 'cid' parameter
vorinostat_ds <- parse.gctx(ds_path, cid=sig_ids)
# simply typing the variable name of a GCT object will display the object structure, similar to calling 'str'
vorinostat_ds
In [14]:
# initialize a matrix object
# note that you either must assign values to the rownames and colnames of the matrix, or pass them
# as the 'rid' and 'cid' arguments to GCT
mat <- matrix(stats::rnorm(100), ncol=10)
rownames(mat) <- letters[1:10]
colnames(mat) <- LETTERS[1:10]
(my_ds <- new("GCT", mat=mat))
# we can also include the row/column annotations as data.frames
# note these are just arbitrary annotations used to illustrate the function call
rdesc <- data.frame(id=letters[1:10], type=rep(c(1, 2), each=5))
cdesc <- data.frame(id=LETTERS[1:10], type=rep(c(3, 4), each=5))
(my_ds <- new("GCT", mat=mat, rdesc=rdesc, cdesc=cdesc))
When working with GCT
objects, it's often convenient to have the row and column annotations embedded as the rdesc
and cdesc
slots, respectively. If these metadata are stored separatly from the GCTX file itself, you can read them in separately and then apply them to the GCT
object using the annotate.gct
function.
In [14]:
# apply the col_meta data.frame as the column anntations to vorinostat_ds
# note that col_meta must have a colum that corresponds to the 'cid' slot of my_ds, aka the 'keyfield'
# note also how the 'cdesc' slot changes after annotating
(vorinostat_ds_annotated <- annotate.gct(vorinostat_ds, col_meta, dim="col", keyfield="sig_id"))
In the example above, the signature metadata is stored separately from the GCTX file, and so we could read it in as a data.frame
using read.delim
or some equivalent. However, in some cases the metadata may be stored in the GCTX file itself. We can extract the metadata from a GCTX file using the read.gctx.meta
function and then use this metadata to idenfity the relevant subsets of the matrix, as shown above.
In [6]:
# extract the column metadata, again assuming the GCTX file is in our working directory
col_meta_from_gctx <- read.gctx.meta(ds_path, dim="col")
# we can also extract the row metadata, if desired, by simply changing the 'dim' parameter
row_meta_from_gctx <- read.gctx.meta(ds_path, dim="row")
In [27]:
# using my_ds_annotated, slice out the columns corresponding to the A375 cell line
# we first need to determine which columns correspond to A375, but we can
# do this using the column annotations
idx <- which(vorinostat_ds_annotated@cdesc$cell_id == "A375")
vorinostat_ds_A375 <- subset.gct(vorinostat_ds_annotated, cid=idx)
# we can check the dimensionality of the any GCT object using the 'dim' function
message("vorinostat_ds_A375:")
dim(vorinostat_ds_A375)
# similarly, we can slice out the first 25 rows of this GCT object using the 'rid' parameter
vorinostat_ds_A375_first_25 <- subset.gct(vorinostat_ds_A375, rid=1:25)
# we can check the dimensionality of the any GCT object using the 'dim' function
message("vorinostat_ds_A375_first_25:")
dim(vorinostat_ds_A375_first_25)
It's often useful to have data stored in long form data.frame
objects, especially for compatibility with plotting libraries like ggplot2
. It's possible to convert a GCT
object into this long form by using the melt.gct
function, which relies on the melt
function in the data.table
package.
In [22]:
# convert my_ds_A375_first_25 to long form so we can plot the z-scores
# of these 25 genes grouped by dose
(vorinostat_ds_A375_first_25_melted <- melt.gct(vorinostat_ds_A375_first_25))
library(ggplot2)
ggplot(vorinostat_ds_A375_first_25_melted) + geom_boxplot(aes(x=pert_idose, y=value)) + ylab("z-score")
In [31]:
# extract the vorinostat signatures in PC3 as another separate dataset
idx <- which(vorinostat_ds_annotated@cdesc$cell_id == "PC3")
vorinostat_ds_PC3 <- subset.gct(vorinostat_ds_annotated, cid=idx)
dim(vorinostat_ds_PC3)
# merge the A375 and PC3 GCT objects together by columns, because
# the two objects share common rows. this is equivalent to a 'cbind' operation
# on matrices
vorinostat_ds_A375_PC3 <- merge.gct(vorinostat_ds_A375, vorinostat_ds_PC3, dim="column")
table(vorinostat_ds_A375_PC3@cdesc$cell_id)
In [34]:
# compute the row and column means
row_means <- rowMeans(vorinostat_ds_annotated@mat)
col_means <- colMeans(vorinostat_ds_annotated@mat)
message("means:")
head(row_means)
head(col_means)
# using 'apply', compute the max of each row and column
row_max <- apply(vorinostat_ds_annotated@mat, 1, max)
col_max <- apply(vorinostat_ds_annotated@mat, 2, max)
message("maxes:")
head(row_max)
head(col_max)
In [38]:
# transposing a GCT object - also swaps row and column annotations
(vorinostat_ds_transpose <- transpose.gct(vorinostat_ds_annotated))
# converting a GCT object's matrix to ranks
# the 'dim' option controls the direction along which the ranks are calculated
vorinostat_ds_rank_by_column <- rank.gct(vorinostat_ds_annotated, dim="col")
# plot z-score vs rank for the first 25 genes (rows)
plot(vorinostat_ds_rank_by_column@mat[1:25, ], vorinostat_ds_annotated@mat[1:25, ],
xlab="rank", ylab="z-score", main="z-score vs. rank")
In [39]:
# write the vorinostat_ds_A375_first_25 dataset in both GCT and GCTX format
write.gct(vorinostat_ds_A375_first_25, "vorinostat_ds_A375_first_25")
write.gctx(vorinostat_ds_A375_first_25, "vorinostat_ds_A375_first_25")
# write.gctx can also compress the dataset upon write,
# which can be controlled using the 'compression_level' option.
# the higher the value, the greater the compression, but the
# longer the read/write time
write.gctx(vorinostat_ds_A375_first_25, "vorinostat_ds_A375_first_25_compressed",
compression_level=9)