Special numbers Inf and NaN: 1/0 = Inf and 1/Inf = 0; 0/0=NaN
attributes() to get list of attributes of an object
In [1]:
x <- 0:6
print(x)
print(as.logical(x))
print(as.complex(x))
In [2]:
m <- matrix(nrow=2, ncol=3)
print(m)
attributes(m)
Out[2]:
In [3]:
# Method1: create a matrix by matrix()
m1 <- matrix(1:6, nrow=2, ncol=3)
print(m1)
# Method2: change the dimension of a vector
m2 <- 1:6
dim(m2) = c(2,3)
print(m2)
In [4]:
x <- 1:4
y <- 6:9
cbind(x,y)
rbind(x,y)
Out[4]:
Out[4]:
In [5]:
x <- list(23, "abc", 1+5i, TRUE)
print(x)
Factors are a special data type to represent categorical variables, such as gender ('male', 'female'), patterm classes ('1', '2') ..
Factor levels are ordered by default in alphabetic order. In order to directly specify the order: factor(vec, levels=c("male", "female"))
In [6]:
## create a vector of factors
xf <- factor(c("male", "female", "female", "female", "male", "female", "female", "male", "female"))
str(xf)
## print levels of xf
levels(xf)
unclass(xf)
Out[6]:
Out[6]:
In [7]:
table(xf)
Out[7]:
In [8]:
x <- c(1,2,3, NA, 6, NaN, 8)
is.na(x)
is.nan(x)
Out[8]:
Out[8]:
To store tabular data. One of the most useful things in R.
Data frames can store different object classes in different columns, unlike matrices that have to have the same elements class.
In [9]:
df <- data.frame(foo=11:14, bar=c(T, T, F, T))
df
nrow(df)
ncol(df)
Out[9]:
Out[9]:
Out[9]:
In [10]:
x <- 1:3
names(x) <- c("foo", "bar", "norf")
x
names(x)
Out[10]:
Out[10]:
To extract subsets of data (objects).
In [11]:
## subsetting vectors: numeric or logical index
x <- c("a", "b", "c", "d", "h")
x[1:3]
x[x > "c"]
Out[11]:
Out[11]:
For matrices, you can acess rows and columns by [row index, column index]
By default, subsetting drops the column and row index, to avoid that use m[1,2, drop=F] This way, the output would still be a matrix.
In [12]:
## subsetting matrices
m1[1,2]
m2[,2]
m2[,2, drop=F]
Out[12]:
Out[12]:
Out[12]:
In [13]:
xl <- list(foo=1:5, bar=0.7, norf="ABcDE")
xl[[1]]
xl[["bar"]]
xl["bar"]
xl$norf
Out[13]:
Out[13]:
Out[13]:
Out[13]:
In [14]:
xl[c(1,3)]
Out[14]:
In [15]:
xnest <- list(a=list(10, 12, 13, 14), b=c(4:7))
# to extract 13:
xnest[["a"]][[3]]
Out[15]:
In [16]:
xp <- list(abc=c(1:8), fgh=c(T, F, F, T))
xp$a
xp[["f"]]
xp[["f", exact=F]]
Out[16]:
Out[16]:
Out[16]:
In [17]:
x <- c(1,2,NA, 5, NaN, 8, NaN, 9)
x[!is.na(x)]
Out[17]:
In [18]:
y <- c(3, 5, 6, NA, NA, 8, 9, 10)
complete.cases(x,y)
Out[18]:
For efficiently reading in large datasets:
comment.char=""
You can automaucally specify column classes by reaading the first 10 rows of the file.
example:
init10 <- read.table("large_dataset.txt", nrows=10)
classes <- sapply(init10, class)
datAll <- read.table("large_dataset.txt", colClasses=classes)
Textual format has the advantage of readablity and storing element class information.
## example for dupt:
y <- data.frame(a=c(1,2), b=c("HH", "WW"))
dput(y, file="data_y.R")
new.y <- dget("data_t.R")
## example for dput: multiple objects
x <- "foo"
dump(c("x", "y"), file="data_xy.R"))
rm(x,y)
source("data_xy.R")
In [19]:
str(file)
In [20]:
con <- gzfile("data/hw1.csv.gz", "r")
xgz <- readLines(con, 10) ## read the first 1 lines
xgz[1]
Out[20]:
In [21]:
## read from a web-page
con <- url("http://www.jhsph.edu", "r")
x <- readLines(con)
head(x)
Out[21]:
In [22]:
x <- 5
y <- if(x>3) {
10
}else{
0
}
print(y)
In [23]:
## three ways to iterate through a vector:
x <- c("a", "b", "c")
for (i in 1:length(x)) {
print(x[i])
}
for (i in seq_along(x)) print(x[i])
for (letter in x) print(letter)
Functions are first class objects.
We can have nested functions, functions inside other functions.
Function arguments can have default values. Function arguments can be mapped positionally, or by name.
The arguments that don't have default values, are required when function is called.
Note: Split your functions so that each function does a single task.
In [24]:
args(lm)
Out[24]:
In [25]:
myplot <- function(x, y, mytype="l", ...) {
plot(x, y, type=mytype)
}
... is also useful when the number of input arguments is not known in advance:
In [26]:
args(paste)
args(cat)
Out[26]:
Out[26]:
In [27]:
paste("a", "b", sep=":")
# partial matching doesn't work after ...
paste("a", "b", se=":")
Out[27]:
Out[27]:
R binds a value to every symbol.
If you define a new objects with a name that is used previously in R:
R uses lexical scoping
The global enfironemnt is the workspace.
When a free variable is found, R searches through the search space, from the workspace (global environment) until the symbol is found or it hits the emoty environment.
In [28]:
search()
Out[28]:
In [29]:
make.Gaussian <- function(mu, sigma) {
gauss <- function(x) {
exp(-((x-mu)/sigma)^2)
}
gauss
}
mygauss_0 <- make.Gaussian(mu=0, sigma=1)
mygauss_1 <- make.Gaussian(mu=1, sigma=1)
mygauss_0
Out[29]:
In [30]:
library(IRdisplay)
x <- seq(-3, 5, by=0.1)
png("figs/mygauss.png")
par(mar=c(4.7, 4.7, 0.6, 0.6))
plot(x, mygauss_0(x), type="l", col=rgb(0.2, 0.4, 0.7, 0.5), lwd=6, xlab="x", ylab="Gaussian Distn.", cex.axis=1.5, cex.lab=1.5)
lines(x, mygauss_1(x), col=rgb(0.7, 0.3, 0.2, 0.5), lwd=6)
dev.off()
display_png("figs/mygauss.png")
Out[30]:
internal reference time is 1970-01-01
weekdays() what day of the week that time/date is
quarters() gives the quarter number
objects of the same class can be used in mathematical operations
In [31]:
t <- as.Date("2014-05-11")
print(t)
unclass(t) ## number of days since 1970-01-01
Out[31]:
In [32]:
x <- Sys.time()
print(x)
p <- as.POSIXlt(x)
names(unclass(p))
p$hour
Out[32]:
Out[32]:
In [33]:
dstring <- c("January 01 2012, 10:40", "March 05 2013, 11:20")
strptime(dstring, "%B %d %Y, %H:%M")
## the formats can be found in help page ?strptime
Out[33]:
Apply functions can be used to apply a function over elements of a vector, list or data frame. They make the codes shorter, and faster through vectorized operations.
In [34]:
## Example for lapply()
x <- list(a=1:20, b=rnorm(10, 2))
lapply(x, mean)
sapply(x, mean)
Out[34]:
Out[34]:
In [35]:
## Example for apply()
x <- matrix(rnorm(50, mean=1), nrow=10, ncol=5)
apply(x, 2, mean)
apply(x, 1, mean)
Out[35]:
Out[35]:
In [36]:
# Benchmark: Comparing apply() and rowMeans()
library(rbenchmark)
a <- matrix(rnorm(100000), nrow=1000)
benchmark(apply(a, 1, mean))
benchmark(rowMeans(a))
Out[36]:
Out[36]:
In [37]:
## Example for tapply()
x <- c(rnorm(10, mean=3), runif(10, min=1, max=3), rnorm(10, mean=5, sd=2))
f <- gl(3, 10)
tapply(x, f, mean)
Out[37]:
split() is not a loop function, but it can be used to split a vector based on a given factor vector. The output is a list.
In [38]:
split(x, f)
Out[38]:
In [39]:
library(datasets)
s <- split(airquality, airquality$Month)
sapply(s, function(x) colMeans(x[,c("Ozone", "Solar.R", "Wind", "Temp")], na.rm=T))
Out[39]:
In [40]:
x <- rnorm(n=10, mean=20, sd=5)
x
str(rpois)
Out[40]:
In [41]:
## adding Gaussian noise to sin(x)
x <- seq(0, 2*pi, by=0.02)
y <- sapply(sin(x), function(x) rnorm(n=1, mean=sin(x), sd=0.2))
## the same as y <- sin(x) + rnorm(n=length(x), mean=0, sd=0.2)
par(mar=c(4.7, 4.7, 0.7, 0.7))
plot(x, y, pch=19, col=rgb(0.2, 0.3, 0.7, 0.6), cex=1.3, cex.lab=1.6, cex.axis=1.6)
In [42]:
x <- rnorm(200)
log.mu <- 0.5 + 0.3 * x
y <- rpois(200, exp(log.mu))
plot(x, y, pch=19, col=rgb(0.2, 0.3, 0.7, 0.6), cex=1.5, cex.axis=1.6, cex.lab=1.6)
In [43]:
str(sample)
sample(letters, 5)
sample(1:10, size=6, replace=T)
Out[43]:
Out[43]:
In [44]:
system.time(rnorm(n=100000, mean=0, sd=1))
Out[44]:
In [45]:
system.time(readLines("http://www.jhsph.edu"))
Out[45]:
In [46]:
system.time({
n <- 1000
r <- numeric(n)
for (i in 1:n) {
x = rnorm(i)
r[i] = mean(x)
}
})
Out[46]:
In [47]:
# comparing time of vectorized operations vs. for loops
n=1000000
x <- rnorm(n, mean=0, sd=1)
system.time(sum(x))
system.time({
s = 0
for (i in 1:n) {
s = s + x[i]
}
})
Out[47]:
Out[47]:
In [48]:
Rprof(filename="data/rprof_test.dat")
for (i in 1:1000) {
x <- seq(0, 5, by=0.01)
y <- 1.5+x + rnorm(n=length(x), mean=0, sd=1)
lm(y ~ x)
}
Rprof(NULL)
summaryRprof(filename="data/rprof_test.dat")
Out[48]:
In [49]:
## define a constructor function:
make.NegLogLik <- function(data, fixed=c(FALSE, FALSE)) {
params <- fixed
function(p) {
params[!fixed] <- p
mu <- params[1]
sigma <- params[2]
a <- -0.5 * length(data) * log(2*pi*sigma^2)
b <- -0.5 * sum((data-mu)^2) / (sigma^2)
-(a + b)
}
}
# Generate a random dataset with normal distribution
d <- rnorm(n=1000, mean=2, sd=3)
nLL <- make.NegLogLik(d)
nLL
ls(environment(nLL))
Out[49]:
Out[49]:
In [50]:
## Minimze Neative of log-likelihood
optim(c(mu=0, sigma=1), nLL)$par
Out[50]:
In [51]:
## generate a range of possible values for mean
x <- seq(1, 3, len=100)
## when fixing sigma, use the true value
nLL.mean <- make.NegLogLik(d, fixed=c(FALSE, 3))
y <- sapply(x, nLL.mean)
#windows.options(width=40, height=10)
#layout(matrix(c(1,2), 1, 2, byrow = TRUE),
# widths=c(5,5), heights=c(1,1))
par(mar=c(4.7, 4.7, 0.7, 0.7), mfrow=c(1,2))
plot(x, -(y-min(y)), type="l", lwd=4, col="steelblue",
cex.axis=1.7, cex.lab=1.7, xlab=expression(mu), ylab="log-likelihood")
plot(x, exp(-(y-min(y))), type="l", lwd=4, col="steelblue",
cex.axis=1.7, cex.lab=1.7, xlab=expression(mu), ylab="likelihood")