1. R Data Types

top

Atomic classes in R:

charcater
numeric
integer (L)
complex
logical (TRUE/FALSE)

Special numbers Inf and NaN: 1/0 = Inf and 1/Inf = 0; 0/0=NaN

Object attributes:

names, dimnames
dimensions (for matrices and arrays)
class
length
Other user defined attributes

attributes() to get list of attributes of an object

Creating vectors

vector() : to create an empty vector
vector("numeric", length=10)
c('a', 'b') : to concatenate multiple objects into a single vector
Mixing objects from different classes: c(1.5, 'a') : coercion

Explicit coercion

as.character()
as.numeric()
as.logical()
as.complex()



In [1]:

    
x <- 0:6
print(x)
print(as.logical(x))
print(as.complex(x))









    



[1] 0 1 2 3 4 5 6
[1] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[1] 0+0i 1+0i 2+0i 3+0i 4+0i 5+0i 6+0i

Matrices

To create a vector:

create a matrox by matrix() function
change the dimension of a vector dim()
binding vectors and matruces: cbind() and rbind()



In [2]:

    
m <- matrix(nrow=2, ncol=3)
print(m)
attributes(m)









    



     [,1] [,2] [,3]
[1,]   NA   NA   NA
[2,]   NA   NA   NA






    Out[2]:





$dim
[1] 2 3



In [3]:

    
# Method1: create a matrix by matrix()
m1 <- matrix(1:6, nrow=2, ncol=3)
print(m1)
# Method2: change the dimension of a vector
m2 <- 1:6
dim(m2) = c(2,3)
print(m2)









    



     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6



In [4]:

    
x <- 1:4
y <- 6:9
cbind(x,y)
rbind(x,y)









    Out[4]:





     x y
[1,] 1 6
[2,] 2 7
[3,] 3 8
[4,] 4 9






    Out[4]:





  [,1] [,2] [,3] [,4]
x    1    2    3    4
y    6    7    8    9

Lists

Lists are special vectors that can store elements from different classes

Create a list by list()



In [5]:

    
x <- list(23, "abc", 1+5i, TRUE)
print(x)









    



[[1]]
[1] 23

[[2]]
[1] "abc"

[[3]]
[1] 1+5i

[[4]]
[1] TRUE

Factors

Factors are a special data type to represent categorical variables, such as gender ('male', 'female'), patterm classes ('1', '2') ..

Create factors by factor()
Access the labels: levels()
Make a table view of number of items in each factor: table()

Factor levels are ordered by default in alphabetic order. In order to directly specify the order: factor(vec, levels=c("male", "female"))



In [6]:

    
## create a vector of factors
xf <- factor(c("male", "female", "female", "female", "male", "female", "female", "male", "female"))
str(xf)
## print levels of xf
levels(xf)
unclass(xf)









    



 Factor w/ 2 levels "female","male": 2 1 1 1 2 1 1 2 1






    Out[6]:





[1] "female" "male"  






    Out[6]:





[1] 2 1 1 1 2 1 1 2 1
attr(,"levels")
[1] "female" "male"



In [7]:

    
table(xf)









    Out[7]:





xf
female   male 
     6      3

Missing values

Missing values are denoted by NA or NaN

NaN is for undefined mathematical operations such as 0/0. NaN is a sub-type of NA

is.na() and is.nan() to test whether vector elements are missing



In [8]:

    
x <- c(1,2,3, NA, 6, NaN, 8)
is.na(x)
is.nan(x)









    Out[8]:





[1] FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE






    Out[8]:





[1] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE

Data Frames

To store tabular data. One of the most useful things in R.

Data frames can store different object classes in different columns, unlike matrices that have to have the same elements class.

special attributes: row.names, col.names
Rading tabular data from files: read.table() and read.csv()
Convert data.frame to a matrix: data.matrox() (coercion may happen)



In [9]:

    
df <- data.frame(foo=11:14, bar=c(T, T, F, T))
df
nrow(df)
ncol(df)









    Out[9]:





  foo   bar
1  11  TRUE
2  12  TRUE
3  13 FALSE
4  14  TRUE






    Out[9]:





[1] 4






    Out[9]:





[1] 2

Names

R objects can have a name. Useful for self describing data, so each vector can be accesed by an index or by its object names

access and change the names by names()
names in list: x <- list(a=1, b=2, c=3)
names in matrices: dimnames()



In [10]:

    
x <- 1:3
names(x) <- c("foo", "bar", "norf")
x
names(x)









    Out[10]:





 foo  bar norf 
   1    2    3 






    Out[10]:





[1] "foo"  "bar"  "norf"

2. Subsetting R objects

Top

To extract subsets of data (objects).

subsetting with [ ] (given numeric index or logical index)
subsetting with double bracket for lists: [[ ]] (only to extract a single element of a list, meaning no vector inside double bracket)
subsetting with $ to extract elements of lists and data.frames by their names



In [11]:

    
## subsetting vectors: numeric or logical index
x <- c("a", "b", "c", "d", "h")
x[1:3]
x[x > "c"]









    Out[11]:





[1] "a" "b" "c"






    Out[11]:





[1] "d" "h"

For matrices, you can acess rows and columns by [row index, column index]

By default, subsetting drops the column and row index, to avoid that use m[1,2, drop=F] This way, the output would still be a matrix.



In [12]:

    
## subsetting matrices
m1[1,2]
m2[,2]

m2[,2, drop=F]









    Out[12]:





[1] 3






    Out[12]:





[1] 3 4






    Out[12]:





     [,1]
[1,]    3
[2,]    4

Subsetting lists

Using double bracket to subset lists. Only extract one element of a list at time. Therefore, no vector inside double bracket. (x[[1:3]] ==> error)



In [13]:

    
xl <- list(foo=1:5, bar=0.7, norf="ABcDE")
xl[[1]]
xl[["bar"]]
xl["bar"]
xl$norf









    Out[13]:





[1] 1 2 3 4 5






    Out[13]:





[1] 0.7






    Out[13]:





$bar
[1] 0.7







    Out[13]:





[1] "ABcDE"

Use single bracket [ ] To extract multiple elements of a list:



In [14]:

    
xl[c(1,3)]









    Out[14]:





$foo
[1] 1 2 3 4 5

$norf
[1] "ABcDE"

Nested list



In [15]:

    
xnest <- list(a=list(10, 12, 13, 14), b=c(4:7))
# to extract 13:
xnest[["a"]][[3]]









    Out[15]:





[1] 13

Partial Matching

Partial matching works for subsetting objects with [[ ]] and $



In [16]:

    
xp <- list(abc=c(1:8), fgh=c(T, F, F, T))
xp$a
xp[["f"]]
xp[["f", exact=F]]









    Out[16]:





[1] 1 2 3 4 5 6 7 8






    Out[16]:





NULL






    Out[16]:





[1]  TRUE FALSE FALSE  TRUE

Removing missing values

In order to subset non-missing values onlt, use !is.na()
Complete cases of a data.frame, a matrix, or multiple vectors complete.cases()



In [17]:

    
x <- c(1,2,NA, 5, NaN, 8, NaN, 9)
x[!is.na(x)]









    Out[17]:





[1] 1 2 5 8 9



In [18]:

    
y <- c(3, 5, 6, NA, NA, 8, 9, 10)
complete.cases(x,y)









    Out[18]:





[1]  TRUE  TRUE FALSE FALSE FALSE  TRUE FALSE  TRUE

3. Reading and Writing Data

Top

Reading data

To read in tabular data, use read.table() or read.csv()
readLines() to read lines of a text file
source() for reading R codes (inverse of dump())
dget() for reading R codes (inverse of dput())
load() for reading a saved workspace
unserialize() for reading single R objects in binary form

Attributes of read.table()

file: name of the file
header: logical (T/F) indicating if the file has a header or not
sep: a string that indicates how columns are separated
colClasses: a character vector indicating the class of each column in the data file
nrows: the number of rows in the dataset
comment.char: a character string indicating the comment character used in the file
skip: the number of lines to be skipped from the begining
stringAsFactors: logical, should character variables be coded as factors?

Writing data

write.table()
writeLines()
dump()
dput()
save()
serialize()

Reading large datasets

For efficiently reading in large datasets:

first identify column classes
ignore the comment character if you know there is no comment in the file: comment.char=""

You can automaucally specify column classes by reaading the first 10 rows of the file.

example:

init10 <- read.table("large_dataset.txt", nrows=10)
classes <- sapply(init10, class)
datAll <- read.table("large_dataset.txt", colClasses=classes)

If you know the number of rows, it is more memory efficient to set nrows as well

Reading and writing in textual format

Textual format has the advantage of readablity and storing element class information.

dput() for a single object
dunpy() for multiple objects

## example for dupt:
y <- data.frame(a=c(1,2), b=c("HH", "WW"))
dput(y, file="data_y.R")
new.y <- dget("data_t.R")

## example for dput: multiple objects
x <- "foo"
dump(c("x", "y"), file="data_xy.R"))
rm(x,y)
source("data_xy.R")

Connection interfaces

Data can be read from other sources by creating a connection.

file(): opens a connection to a file
gzfile(): opens a connection to a gz compressed file
bzfile(): opens a connection to a bz compressed file
url(): opens a connection to a web-page



In [19]:

    
str(file)









    



function (description = "", open = "", blocking = TRUE, encoding = getOption("encoding"), 
    raw = FALSE)



In [20]:

    
con <- gzfile("data/hw1.csv.gz", "r")
xgz <- readLines(con, 10) ## read the first 1 lines
xgz[1]









    



simpleWarning in readLines(con, 10): seek on a gzfile connection returned an internal error






    Out[20]:





[1] "RT,SERIALNO,DIVISION,PUMA,REGION,ST,ADJUST,WGTP,NP,TYPE,ACR,AGS,BDS,BLD,BUS,CONP,ELEP,FS,FULP,GASP,HFL,INSP,KIT,MHP,MRGI,MRGP,MRGT,MRGX,PLM,RMS,RNTM,RNTP,SMP,TEL,TEN,VACS,VAL,VEH,WATP,YBL,FES,FINCP,FPARC,GRNTP,GRPIP,HHL,HHT,HINCP,HUGCL,HUPAC,HUPAOC,HUPARC,LNGI,MV,NOC,NPF,NPP,NR,NRC,OCPIP,PARTNER,PSF,R18,R60,R65,RESMODE,SMOCP,SMX,SRNT,SVAL,TAXP,WIF,WKEXREL,WORKSTAT,FACRP,FAGSP,FBDSP,FBLDP,FBUSP,FCONP,FELEP,FFSP,FFULP,FGASP,FHFLP,FINSP,FKITP,FMHP,FMRGIP,FMRGP,FMRGTP,FMRGXP,FMVYP,FPLMP,FRMSP,FRNTMP,FRNTP,FSMP,FSMXHP,FSMXSP,FTAXP,FTELP,FTENP,FVACSP,FVALP,FVEHP,FWATP,FYBLP,wgtp1,wgtp2,wgtp3,wgtp4,wgtp5,wgtp6,wgtp7,wgtp8,wgtp9,wgtp10,wgtp11,wgtp12,wgtp13,wgtp14,wgtp15,wgtp16,wgtp17,wgtp18,wgtp19,wgtp20,wgtp21,wgtp22,wgtp23,wgtp24,wgtp25,wgtp26,wgtp27,wgtp28,wgtp29,wgtp30,wgtp31,wgtp32,wgtp33,wgtp34,wgtp35,wgtp36,wgtp37,wgtp38,wgtp39,wgtp40,wgtp41,wgtp42,wgtp43,wgtp44,wgtp45,wgtp46,wgtp47,wgtp48,wgtp49,wgtp50,wgtp51,wgtp52,wgtp53,wgtp54,wgtp55,wgtp56,wgtp57,wgtp58,wgtp59,wgtp60,wgtp61,wgtp62,wgtp63,wgtp64,wgtp65,wgtp66,wgtp67,wgtp68,wgtp69,wgtp70,wgtp71,wgtp72,wgtp73,wgtp74,wgtp75,wgtp76,wgtp77,wgtp78,wgtp79,wgtp80"



In [21]:

    
## read from a web-page
con <- url("http://www.jhsph.edu", "r")
x <- readLines(con)
head(x)









    Out[21]:





[1] "<!DOCTYPE html>"                                               
[2] "<html lang=\"en\">"                                            
[3] ""                                                              
[4] "<head>"                                                        
[5] "<meta charset=\"utf-8\" />"                                    
[6] "<title>Johns Hopkins Bloomberg School of Public Health</title>"

Top

4. Flow Control

if, else, else if
for
while
repeat an infinite loop, (to stop use break)
break break the execution of a loop
next skip one iteration
return exit a function



In [22]:

    
x <- 5
y <- if(x>3) {
   10
}else{
   0
}

print(y)



In [23]:

    
## three ways to iterate through a vector:
x <- c("a", "b", "c")

for (i in 1:length(x)) {
   print(x[i])
}

for (i in seq_along(x))  print(x[i])

for (letter in x)  print(letter)









    



[1] "a"
[1] "b"
[1] "c"
[1] "a"
[1] "b"
[1] "c"
[1] "a"
[1] "b"
[1] "c"

5. Functions

Top

Functions are first class objects.

We can have nested functions, functions inside other functions.

Function arguments can have default values. Function arguments can be mapped positionally, or by name.

The arguments that don't have default values, are required when function is called.

Note: Split your functions so that each function does a single task.



In [24]:

    
args(lm)









    Out[24]:





function (formula, data, subset, weights, na.action, method = "qr", 
    model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE, 
    contrasts = NULL, offset, ...) 
NULL

special argument "..."

Three dots indicates variable number of arguments.

It is usually used when extending another function:



In [25]:

    
myplot <- function(x, y, mytype="l", ...) {
   plot(x, y, type=mytype)
}

... is also useful when the number of input arguments is not known in advance:



In [26]:

    
args(paste)

args(cat)









    Out[26]:





function (..., sep = " ", collapse = NULL) 
NULL






    Out[26]:





function (..., file = "", sep = " ", fill = FALSE, labels = NULL, 
    append = FALSE) 
NULL



In [27]:

    
paste("a", "b", sep=":")
# partial matching doesn't work after ...
paste("a", "b", se=":")









    Out[27]:





[1] "a:b"






    Out[27]:





[1] "a b :"

6. Scoping Rules

Top

R binds a value to every symbol.

If you define a new objects with a name that is used previously in R:

R uses lexical scoping

R first searches for a free variable in the environment in which the function is defined
If no matched found, it searches in the parent environment

The global enfironemnt is the workspace.

When a free variable is found, R searches through the search space, from the workspace (global environment) until the symbol is found or it hits the emoty environment.



In [28]:

    
search()









    Out[28]:





[1] ".GlobalEnv"        "package:stats"     "package:graphics" 
[4] "package:grDevices" "package:utils"     "package:datasets" 
[7] "package:methods"   "Autoloads"         "package:base"



In [29]:

    
make.Gaussian <- function(mu, sigma) {
    gauss <- function(x) {
       exp(-((x-mu)/sigma)^2)
    }
    gauss
}

mygauss_0 <- make.Gaussian(mu=0, sigma=1)
mygauss_1 <- make.Gaussian(mu=1, sigma=1)
mygauss_0









    Out[29]:





function(x) {
       exp(-((x-mu)/sigma)^2)
    }
<environment: 0x1ff91d0>



In [30]:

    
library(IRdisplay)

x <- seq(-3, 5, by=0.1)

png("figs/mygauss.png")
par(mar=c(4.7, 4.7, 0.6, 0.6))
plot(x, mygauss_0(x), type="l", col=rgb(0.2, 0.4, 0.7, 0.5), lwd=6, xlab="x", ylab="Gaussian Distn.", cex.axis=1.5, cex.lab=1.5)
lines(x, mygauss_1(x), col=rgb(0.7, 0.3, 0.2, 0.5), lwd=6)
dev.off()
display_png("figs/mygauss.png")









    Out[30]:





pdf 
  2

7. Dates and Times

Top

dates are represented by Date class
times are represented by POSIXct or POSIXlt classes
- POSIXct is a very large integer; It uses a data frame to store time
- POSIXlt store other information, such as day of month, day of year, ...
internal reference time is 1970-01-01
weekdays() what day of the week that time/date is
months() gives the month name
quarters() gives the quarter number
objects of the same class can be used in mathematical operations



In [31]:

    
t <- as.Date("2014-05-11")
print(t)
unclass(t)  ## number of days since 1970-01-01









    



[1] "2014-05-11"






    Out[31]:





[1] 16201



In [32]:

    
x <- Sys.time()
print(x)
p <- as.POSIXlt(x)
names(unclass(p))
p$hour









    



[1] "2014-07-05 23:45:44 EDT"






    Out[32]:





 [1] "sec"    "min"    "hour"   "mday"   "mon"    "year"   "wday"   "yday"  
 [9] "isdst"  "zone"   "gmtoff"






    Out[32]:





[1] 23

Format Strings to dates

To convert a string in an arbitrary format to a Date format:



In [33]:

    
dstring <- c("January 01 2012, 10:40", "March 05 2013, 11:20")
strptime(dstring, "%B %d %Y, %H:%M")

## the formats can be found in help page ?strptime









    Out[33]:





[1] "2012-01-01 10:40:00 EST" "2013-03-05 11:20:00 EST"

8. Apply Functions

Top

Apply functions can be used to apply a function over elements of a vector, list or data frame. They make the codes shorter, and faster through vectorized operations.

lapply() loop over elements of a list, and evaluate a function on each element.
- if input is not a list, it will be corced to a list (if possible)
sapply() same as lapply, but also simplifies the result
apply() apply a function over margins of an array
tapply() apply a function over subsets of a vector
mapply() multivariate version of lapply



In [34]:

    
## Example for lapply()
x <- list(a=1:20, b=rnorm(10, 2))
lapply(x, mean)
sapply(x, mean)









    Out[34]:





$a
[1] 10.5

$b
[1] 2.083524







    Out[34]:





        a         b 
10.500000  2.083524



In [35]:

    
## Example for apply()
x <- matrix(rnorm(50, mean=1), nrow=10, ncol=5)
apply(x, 2, mean)
apply(x, 1, mean)









    Out[35]:





[1] 1.0639838 0.7583712 1.9225710 1.2163184 0.7788464






    Out[35]:





 [1] 0.8258462 1.6545185 0.8579511 1.2843574 0.1814005 2.0651896 1.2851799
 [8] 1.5149253 0.9494952 0.8613177

Shortcut functions

For means and sums of a matrix, these shortcut functions perform much faster on big matrices.

rowSums() equivalent to apply(x, 1, sum)
colSums() equivalent to apply(x, 2, sum)
rowMeans() equivalent to apply(x, 1, mean)
colMeans() equivalent to apply(x, 2, mean)



In [36]:

    
# Benchmark: Comparing apply() and rowMeans()
library(rbenchmark)
a <- matrix(rnorm(100000), nrow=1000)

benchmark(apply(a, 1, mean))
benchmark(rowMeans(a))









    Out[36]:





               test replications elapsed relative user.self sys.self user.child
1 apply(a, 1, mean)          100   2.064        1     2.044    0.012          0
  sys.child
1         0






    Out[36]:





         test replications elapsed relative user.self sys.self user.child
1 rowMeans(a)          100   0.032        1     0.032        0          0
  sys.child
1         0

tapply() and split()

tapplly() can be used to apply a function on each sub-group (determied by a factor vector).



In [37]:

    
## Example for tapply()
x <- c(rnorm(10, mean=3), runif(10, min=1, max=3), rnorm(10, mean=5, sd=2))
f <- gl(3, 10)
tapply(x, f, mean)









    Out[37]:





       1        2        3 
3.192245 1.742892 5.028784

split() is not a loop function, but it can be used to split a vector based on a given factor vector. The output is a list.



In [38]:

    
split(x, f)









    Out[38]:





$`1`
 [1] 3.024968 3.866025 2.810482 3.425582 4.168800 4.495183 2.009164 1.691082
 [9] 3.326658 3.104507

$`2`
 [1] 2.974586 2.862309 1.254037 1.466379 1.491590 1.245616 1.958429 1.645276
 [9] 1.370200 1.160498

$`3`
 [1]  7.230541  5.884052  3.269644 10.447268  6.557051  2.791258  4.381073
 [8]  2.738739  3.447649  3.540565



In [39]:

    
library(datasets)
s <- split(airquality, airquality$Month)

sapply(s, function(x) colMeans(x[,c("Ozone", "Solar.R", "Wind", "Temp")], na.rm=T))









    Out[39]:





                5         6          7          8         9
Ozone    23.61538  29.44444  59.115385  59.961538  31.44828
Solar.R 181.29630 190.16667 216.483871 171.857143 167.43333
Wind     11.62258  10.26667   8.941935   8.793548  10.18000
Temp     65.54839  79.10000  83.903226  83.967742  76.90000

9. Random Number Generation

Top

rnorm() generates normal random number variates with a given mean and SD
- rnorm(n, mean=0, sd=1)
dnorm() returns the probablity density function at a point x, with given mean and CD
pnorm() returns the cumulative distribution function at point x
qnorm() quantile
rpois() generates random numbers with Poisson distribution with a given rate (lambda)
rbinom() generates binomial random numbers with a given probability



In [40]:

    
x <- rnorm(n=10, mean=20, sd=5)
x
str(rpois)









    Out[40]:





 [1] 14.36841 27.17210 23.60247 17.18039 23.46058 19.76642 19.04366 31.96666
 [9] 22.28409 19.87019






    



function (n, lambda)

Gaussian Noise



In [41]:

    
## adding Gaussian noise to sin(x)
x <- seq(0, 2*pi, by=0.02)
y <- sapply(sin(x), function(x) rnorm(n=1, mean=sin(x), sd=0.2))
## the same as y <- sin(x) + rnorm(n=length(x), mean=0, sd=0.2)

par(mar=c(4.7, 4.7, 0.7, 0.7))
plot(x, y, pch=19, col=rgb(0.2, 0.3, 0.7, 0.6), cex=1.3, cex.lab=1.6, cex.axis=1.6)

Use set.seed() to make the random numbers reproducible

Simulate random numbers from a Generalized Linear Model

Assume $y \approx Poisson(\mu)$

and $log(\mu) = \beta_0 + \beta_1 x$



In [42]:

    
x <- rnorm(200)
log.mu <- 0.5 + 0.3 * x
y <- rpois(200, exp(log.mu))

plot(x, y, pch=19, col=rgb(0.2, 0.3, 0.7, 0.6), cex=1.5, cex.axis=1.6, cex.lab=1.6)

Random Sampling

sample() to draw a random sample from a set
replace=TRUE
replace=FALSE



In [43]:

    
str(sample)
sample(letters, 5)
sample(1:10, size=6, replace=T)









    



function (x, size, replace = FALSE, prob = NULL)  






    Out[43]:





[1] "f" "r" "q" "x" "t"






    Out[43]:





[1]  7  1  2  1  6 10

10. R Profiler

First design your code, unit-test it, then optimize it
Using system.time() to measure how much time it takes to perform a step
- returns an object of class proc_time
- user time --> CPU time
- elapsed time --> wallclock time



In [44]:

    
system.time(rnorm(n=100000, mean=0, sd=1))









    Out[44]:





   user  system elapsed 
  0.016   0.000   0.015



In [45]:

    
system.time(readLines("http://www.jhsph.edu"))









    Out[45]:





   user  system elapsed 
  0.008   0.000   0.361

Measureing multiple expressions:



In [46]:

    
system.time({
    n <- 1000
    r <- numeric(n)
    for (i in 1:n) {
        x = rnorm(i)
        r[i] = mean(x)
    }
})









    Out[46]:





   user  system elapsed 
  0.104   0.000   0.105



In [47]:

    
# comparing time of vectorized operations vs. for loops
n=1000000
x <- rnorm(n, mean=0, sd=1)

system.time(sum(x))
system.time({
    s = 0
    for (i in 1:n) {
        s = s + x[i]
    }
})









    Out[47]:





   user  system elapsed 
  0.000   0.000   0.002 






    Out[47]:





   user  system elapsed 
  1.452   0.004   1.462

Using Rprof()

summaryRprof() will tabulates the output



In [48]:

    
Rprof(filename="data/rprof_test.dat")

for (i in 1:1000) {
    x <- seq(0, 5, by=0.01)
    y <- 1.5+x + rnorm(n=length(x), mean=0, sd=1)
    lm(y ~ x)
}

Rprof(NULL)

summaryRprof(filename="data/rprof_test.dat")









    Out[48]:





$by.self
                        self.time self.pct total.time total.pct
"as.character"               0.32    12.21       0.32     12.21
"match"                      0.16     6.11       0.36     13.74
"model.frame.default"        0.14     5.34       0.90     34.35
"[.data.frame"               0.14     5.34       0.28     10.69
".External2"                 0.12     4.58       0.60     22.90
".External"                  0.12     4.58       0.12      4.58
"[["                         0.10     3.82       0.34     12.98
"[[.data.frame"              0.10     3.82       0.24      9.16
"na.omit.data.frame"         0.08     3.05       0.48     18.32
"deparse"                    0.08     3.05       0.18      6.87
"unique"                     0.08     3.05       0.12      4.58
"%in%"                       0.06     2.29       0.22      8.40
"as.name"                    0.06     2.29       0.06      2.29
"get"                        0.06     2.29       0.06      2.29
"unclass"                    0.06     2.29       0.06      2.29
"model.matrix.default"       0.04     1.53       0.46     17.56
"lapply"                     0.04     1.53       0.34     12.98
"lm.fit"                     0.04     1.53       0.14      5.34
"anyDuplicated.default"      0.04     1.53       0.04      1.53
"match.call"                 0.04     1.53       0.04      1.53
"mode"                       0.04     1.53       0.04      1.53
"names"                      0.04     1.53       0.04      1.53
"rep.int"                    0.04     1.53       0.04      1.53
"<Anonymous>"                0.02     0.76       2.62    100.00
"lm"                         0.02     0.76       2.44     93.13
"sapply"                     0.02     0.76       0.42     16.03
"model.response"             0.02     0.76       0.38     14.50
"FUN"                        0.02     0.76       0.22      8.40
"paste"                      0.02     0.76       0.20      7.63
".getXlevels"                0.02     0.76       0.18      6.87
"getExportedValue"           0.02     0.76       0.16      6.11
"simplify2array"             0.02     0.76       0.14      5.34
"as.list.data.frame"         0.02     0.76       0.08      3.05
"seq"                        0.02     0.76       0.08      3.05
"anyDuplicated"              0.02     0.76       0.06      2.29
"seq.default"                0.02     0.76       0.06      2.29
"terms.formula"              0.02     0.76       0.04      1.53
"$<-"                        0.02     0.76       0.02      0.76
"abs"                        0.02     0.76       0.02      0.76
"any"                        0.02     0.76       0.02      0.76
"as.list.default"            0.02     0.76       0.02      0.76
"attr"                       0.02     0.76       0.02      0.76
".Call"                      0.02     0.76       0.02      0.76
"dim"                        0.02     0.76       0.02      0.76
"isNamespace"                0.02     0.76       0.02      0.76
"length"                     0.02     0.76       0.02      0.76
"makepredictcall"            0.02     0.76       0.02      0.76
"NCOL"                       0.02     0.76       0.02      0.76
"pmatch"                     0.02     0.76       0.02      0.76
"pmin"                       0.02     0.76       0.02      0.76
"seq_len"                    0.02     0.76       0.02      0.76
"structure"                  0.02     0.76       0.02      0.76
"sum"                        0.02     0.76       0.02      0.76
"unique.default"             0.02     0.76       0.02      0.76

$by.total
                        total.time total.pct self.time self.pct
"<Anonymous>"                 2.62    100.00      0.02     0.76
"doTryCatch"                  2.62    100.00      0.00     0.00
"eval"                        2.62    100.00      0.00     0.00
"evaluate"                    2.62    100.00      0.00     0.00
"evaluate_call"               2.62    100.00      0.00     0.00
"handle"                      2.62    100.00      0.00     0.00
"handle_shell"                2.62    100.00      0.00     0.00
"try"                         2.62    100.00      0.00     0.00
"tryCatch"                    2.62    100.00      0.00     0.00
"tryCatchList"                2.62    100.00      0.00     0.00
"tryCatchOne"                 2.62    100.00      0.00     0.00
"withCallingHandlers"         2.62    100.00      0.00     0.00
"withVisible"                 2.62    100.00      0.00     0.00
"lm"                          2.44     93.13      0.02     0.76
"model.frame.default"         0.90     34.35      0.14     5.34
".External2"                  0.60     22.90      0.12     4.58
"na.omit.data.frame"          0.48     18.32      0.08     3.05
"na.omit"                     0.48     18.32      0.00     0.00
"model.matrix.default"        0.46     17.56      0.04     1.53
"model.matrix"                0.46     17.56      0.00     0.00
"sapply"                      0.42     16.03      0.02     0.76
"model.response"              0.38     14.50      0.02     0.76
"match"                       0.36     13.74      0.16     6.11
"[["                          0.34     12.98      0.10     3.82
"lapply"                      0.34     12.98      0.04     1.53
"as.character"                0.32     12.21      0.32    12.21
"[.data.frame"                0.28     10.69      0.14     5.34
"["                           0.28     10.69      0.00     0.00
"[[.data.frame"               0.24      9.16      0.10     3.82
"%in%"                        0.22      8.40      0.06     2.29
"FUN"                         0.22      8.40      0.02     0.76
"paste"                       0.20      7.63      0.02     0.76
"deparse"                     0.18      6.87      0.08     3.05
".getXlevels"                 0.18      6.87      0.02     0.76
"getExportedValue"            0.16      6.11      0.02     0.76
"::"                          0.16      6.11      0.00     0.00
"lm.fit"                      0.14      5.34      0.04     1.53
"simplify2array"              0.14      5.34      0.02     0.76
".External"                   0.12      4.58      0.12     4.58
"unique"                      0.12      4.58      0.08     3.05
"$"                           0.12      4.58      0.00     0.00
"as.vector"                   0.12      4.58      0.00     0.00
"$.data.frame"                0.12      4.58      0.00     0.00
"as.list"                     0.10      3.82      0.00     0.00
"rnorm"                       0.10      3.82      0.00     0.00
"as.list.data.frame"          0.08      3.05      0.02     0.76
"seq"                         0.08      3.05      0.02     0.76
"asNamespace"                 0.08      3.05      0.00     0.00
"getInternalExportName"       0.08      3.05      0.00     0.00
"model.offset"                0.08      3.05      0.00     0.00
"vapply"                      0.08      3.05      0.00     0.00
"as.name"                     0.06      2.29      0.06     2.29
"get"                         0.06      2.29      0.06     2.29
"unclass"                     0.06      2.29      0.06     2.29
"anyDuplicated"               0.06      2.29      0.02     0.76
"seq.default"                 0.06      2.29      0.02     0.76
".deparseOpts"                0.06      2.29      0.00     0.00
"getNamespace"                0.06      2.29      0.00     0.00
"anyDuplicated.default"       0.04      1.53      0.04     1.53
"match.call"                  0.04      1.53      0.04     1.53
"mode"                        0.04      1.53      0.04     1.53
"names"                       0.04      1.53      0.04     1.53
"rep.int"                     0.04      1.53      0.04     1.53
"terms.formula"               0.04      1.53      0.02     0.76
"getNamespaceInfo"            0.04      1.53      0.00     0.00
"model.weights"               0.04      1.53      0.00     0.00
"terms"                       0.04      1.53      0.00     0.00
"$<-"                         0.02      0.76      0.02     0.76
"abs"                         0.02      0.76      0.02     0.76
"any"                         0.02      0.76      0.02     0.76
"as.list.default"             0.02      0.76      0.02     0.76
"attr"                        0.02      0.76      0.02     0.76
".Call"                       0.02      0.76      0.02     0.76
"dim"                         0.02      0.76      0.02     0.76
"isNamespace"                 0.02      0.76      0.02     0.76
"length"                      0.02      0.76      0.02     0.76
"makepredictcall"             0.02      0.76      0.02     0.76
"NCOL"                        0.02      0.76      0.02     0.76
"pmatch"                      0.02      0.76      0.02     0.76
"pmin"                        0.02      0.76      0.02     0.76
"seq_len"                     0.02      0.76      0.02     0.76
"structure"                   0.02      0.76      0.02     0.76
"sum"                         0.02      0.76      0.02     0.76
"unique.default"              0.02      0.76      0.02     0.76
"is.empty.model"              0.02      0.76      0.00     0.00
"logical"                     0.02      0.76      0.00     0.00
"nrow"                        0.02      0.76      0.00     0.00
"unlist"                      0.02      0.76      0.00     0.00

$sample.interval
[1] 0.02

$sampling.time
[1] 2.62

11. Optimization in R

optim(), nlm(), optimize()
Pass them a function and a vector of parameters.
We can hold some parameters fixed, by fixed=c(F, T)

Example: maximize log-likelihood



In [49]:

    
## define a constructor function:
make.NegLogLik <- function(data, fixed=c(FALSE, FALSE)) {
    params <- fixed
    function(p) {
        params[!fixed] <- p
        mu <- params[1]
        sigma <- params[2]
        a <- -0.5 * length(data) * log(2*pi*sigma^2)
        b <- -0.5 * sum((data-mu)^2) / (sigma^2)
        -(a + b)
    }
}

# Generate a random dataset with normal distribution
d <- rnorm(n=1000, mean=2, sd=3)
nLL <- make.NegLogLik(d)
nLL
ls(environment(nLL))









    Out[49]:





function(p) {
        params[!fixed] <- p
        mu <- params[1]
        sigma <- params[2]
        a <- -0.5 * length(data) * log(2*pi*sigma^2)
        b <- -0.5 * sum((data-mu)^2) / (sigma^2)
        -(a + b)
    }
<environment: 0x1c7b758>






    Out[49]:





[1] "data"   "fixed"  "params"



In [50]:

    
## Minimze Neative of log-likelihood
optim(c(mu=0, sigma=1), nLL)$par









    Out[50]:





      mu    sigma 
2.059232 3.108933

Plot the log-likelihood function



In [51]:

    
## generate a range of possible values for mean
x <- seq(1, 3, len=100)
## when fixing sigma, use the true value
nLL.mean <- make.NegLogLik(d, fixed=c(FALSE, 3)) 
y <- sapply(x, nLL.mean)

#windows.options(width=40, height=10)
#layout(matrix(c(1,2), 1, 2, byrow = TRUE),
#   widths=c(5,5), heights=c(1,1))



par(mar=c(4.7, 4.7, 0.7, 0.7), mfrow=c(1,2))
plot(x, -(y-min(y)), type="l", lwd=4, col="steelblue", 
     cex.axis=1.7, cex.lab=1.7, xlab=expression(mu), ylab="log-likelihood")
plot(x, exp(-(y-min(y))), type="l", lwd=4, col="steelblue", 
     cex.axis=1.7, cex.lab=1.7, xlab=expression(mu), ylab="likelihood")

R Programming