Package installation

Loading basic graphics functions.

Installation of R packages.

Setting up the size for graphics.


In [36]:
source("https://raw.githubusercontent.com/eogasawara/mylibrary/master/myGraphics.R")
plot_size(4,3)

In [37]:
loadlibrary("TSPred")
loadlibrary("STMotif")

Basic R concepts

Variable assignment

Functions for data evaluation

Vector definition

Calculations

Printing values


In [38]:
x <- 2 # variable assignment

x # variable evaluation

is.numeric(x) # check whether the variable is numeric

weight <- c(60, 72, 57, 90, 95, 72) # vector with six observations

height <- c(1.75, 1.80, 1.65, 1.90, 1.74, 1.91)

bmi <- weight/height^2

print(bmi)

print(sprintf("%.2f +/- %.2f", mean(bmi), sd(bmi)))


2
TRUE
[1] 19.59184 22.22222 20.93664 24.93075 31.37799 19.73630
[1] "23.13 +/- 4.49"

Plotting graphics

Lines are drawn on the last active canvas, so $lines$ adds to the plot created by the preceding $plot$ call.


In [39]:
plot(height, weight)



In [40]:
plot(height, weight)
hh <- c(1.65, 1.70, 1.75, 1.80, 1.85, 1.90)
lines(hh, 22.5 * hh^2)


Statistical tests

A one-sample t-test checks whether the mean of the observations differs from a theoretical value.


In [41]:
t.test(bmi, mu=22.5)


	One Sample t-test

data:  bmi
t = 0.34488, df = 5, p-value = 0.7442
alternative hypothesis: true mean is not equal to 22.5
95 percent confidence interval:
 18.41734 27.84791
sample estimates:
mean of x 
 23.13262 
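
The statistic reported by the test can be reproduced by hand, which makes explicit what is being measured. This is a minimal sketch, assuming the same weight and height vectors defined earlier; the names mu, n, se, t_manual, and p_manual are introduced here only for illustration.

```r
# Same observations used earlier in this notebook
weight <- c(60, 72, 57, 90, 95, 72)
height <- c(1.75, 1.80, 1.65, 1.90, 1.74, 1.91)
bmi <- weight / height^2

mu <- 22.5                       # theoretical mean under the null hypothesis
n <- length(bmi)
se <- sd(bmi) / sqrt(n)          # standard error of the mean
t_manual <- (mean(bmi) - mu) / se
p_manual <- 2 * pt(-abs(t_manual), df = n - 1)  # two-sided p-value

print(sprintf("t = %.5f, p-value = %.4f", t_manual, p_manual))
```

The printed values should agree with the t and p-value lines of the t.test output.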

Properties of functions

Functions have default values.

It is possible to check parameters and see the documentation with examples.


In [43]:
plot(height, weight, pch=2)

args(plot.default)

#?graphics::plot


function (x, y = NULL, type = "p", xlim = NULL, ylim = NULL, 
    log = "", main = NULL, sub = NULL, xlab = NULL, ylab = NULL, 
    ann = par("ann"), axes = TRUE, frame.plot = axes, panel.first = NULL, 
    panel.last = NULL, asp = NA, xgap.axis = NA, ygap.axis = NA, 
    ...) 
NULL
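
Default values can also be inspected programmatically. The sketch below uses only base functions (sd, round, plot.default) and is meant as an illustration rather than a complete tour. Note that pch does not appear among the parameters of plot.default: as the argument list above shows, it is passed through the `...` parameter.

```r
args(sd)                      # function (x, na.rm = FALSE)
formals(sd)$na.rm             # default value of a single parameter

names(formals(plot.default))  # parameter names only, including "..."

round(3.14159)                # default used: digits = 0
round(3.14159, digits = 2)    # default overridden by name
```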

Handling NA observations

Operations with NA lead to NA.

Observations may have an associated label.


In [44]:
x <- c(A=1, B=NA, C=3)

mean(x)

mean(x, na.rm=TRUE)

names(x)

x["B"] <- 2

x["B"]*x


<NA>
2
  1. 'A'
  2. 'B'
  3. 'C'
A
2
B
4
C
6
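
Before removing or replacing missing values it usually helps to locate them. A minimal sketch, assuming a labeled vector like the one above; the names obs and obs_clean are chosen for this example only.

```r
obs <- c(A = 1, B = NA, C = 3)

is.na(obs)              # logical vector flagging missing observations
sum(is.na(obs))         # number of missing observations
names(obs)[is.na(obs)]  # labels of the missing observations

obs_clean <- obs[!is.na(obs)]  # keep only the observed values
mean(obs_clean)                # same result as mean(obs, na.rm = TRUE)
```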

Matrices

Matrices can be filled from vectors or data frames.

It is possible to associate names for rows and columns.

Operations such as transpose, scalar product, matrix product (%*%), and determinant are available.

Additional documentation can be found at https://www.statmethods.net/advstats/matrix.html.


In [45]:
m <- 1:9
dim(m) <- c(3,3)
m

mb <- matrix(1:9, nrow=3,byrow=TRUE)
rownames(mb) = LETTERS[1:3]
mb

t(m)

m*x

det(m)


A matrix: 3 × 3 of type int
1 4 7
2 5 8
3 6 9
A matrix: 3 × 3 of type int
A 1 2 3
B 4 5 6
C 7 8 9
A matrix: 3 × 3 of type int
1 2 3
4 5 6
7 8 9
A matrix: 3 × 3 of type dbl
1  4  7
4 10 16
9 18 27
0
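
The expression m*x in the cell above is an element-wise product, not a matrix product. The sketch below contrasts the two and adds a non-singular matrix, since det(m) is 0 for the matrix built from 1:9. The vector v and the matrix a are illustrative values, not part of the original notebook.

```r
m <- matrix(1:9, nrow = 3)   # filled column by column
v <- c(1, 2, 3)

m * v        # element-wise product: v is recycled down each column
m %*% v      # matrix product: 3 x 3 matrix times a length-3 vector
m %*% t(m)   # matrix product with the transpose

# A non-singular matrix with row and column names
a <- matrix(c(2, 1, 1, 3), nrow = 2,
            dimnames = list(c("r1", "r2"), c("c1", "c2")))
det(a)           # 2*3 - 1*1 = 5
solve(a) %*% a   # inverse times the matrix gives the identity
```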

Factors

Factors are used to handle categorical data.


In [46]:
pain = c(0,3,2,2,1)
fpain = factor(pain,levels=0:3)
levels(fpain) = c("none","mild","medium","severe")

fpain

as.numeric(fpain)

levels(fpain)


  1. none
  2. severe
  3. medium
  4. medium
  5. mild
Levels:
  1. 'none'
  2. 'mild'
  3. 'medium'
  4. 'severe'
  1. 1
  2. 4
  3. 3
  4. 3
  5. 2
  1. 'none'
  2. 'mild'
  3. 'medium'
  4. 'severe'
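
Two common follow-ups are counting observations per category and declaring that the categories have an order. A minimal sketch, assuming the same pain scores as above; the labels argument replaces the separate levels() assignment, and opain is a name introduced here.

```r
pain <- c(0, 3, 2, 2, 1)
fpain <- factor(pain, levels = 0:3,
                labels = c("none", "mild", "medium", "severe"))

table(fpain)   # frequency of each category, in level order

# Ordered factor: levels carry a ranking, so comparisons are allowed
opain <- factor(pain, levels = 0:3,
                labels = c("none", "mild", "medium", "severe"),
                ordered = TRUE)
opain >= "medium"   # TRUE for observations rated medium or severe
```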

Lists

Lists group named elements of possibly different types and lengths, and are used to work with "objects".


In [47]:
x = c(5260,5470,5640,6180,6390,
      6515,6805,7515,7515,8230,8770)
y = c(3910,4220,3885,5160,5645,
      4680,5265,5975,6790,6900,7335)

lst <- list(A=x, B=y)

lst

lst$A


$A
  1. 5260
  2. 5470
  3. 5640
  4. 6180
  5. 6390
  6. 6515
  7. 6805
  8. 7515
  9. 7515
  10. 8230
  11. 8770
$B
  1. 3910
  2. 4220
  3. 3885
  4. 5160
  5. 5645
  6. 4680
  7. 5265
  8. 5975
  9. 6790
  10. 6900
  11. 7335
  1. 5260
  2. 5470
  3. 5640
  4. 6180
  5. 6390
  6. 6515
  7. 6805
  8. 7515
  9. 7515
  10. 8230
  11. 8770
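
Unlike vectors, a list can mix types and lengths, and there are several ways to reach an element. The record below is a hypothetical example (rec and its fields are not part of the notebook) showing the difference between `$`, `[[ ]]`, and `[ ]`.

```r
# Hypothetical record mixing types and lengths
rec <- list(name = "sample", values = c(5260, 5470, 5640), valid = TRUE)

rec$values        # access by name
rec[["values"]]   # equivalent access; returns the element itself
rec["values"]     # single brackets return a list with one element

rec$n <- length(rec$values)   # add a new element
names(rec)
str(rec)          # compact view of the structure
```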

Data frames

Data frames (tables) provide support for structured data.


In [48]:
d <- data.frame(A=lst$A,B=lst$B)
d

df <- d[d$A > 7000 | d$A < 6000,]
df


A data.frame: 11 × 2
A      B
<dbl>  <dbl>
5260   3910
5470   4220
5640   3885
6180   5160
6390   5645
6515   4680
6805   5265
7515   5975
7515   6790
8230   6900
8770   7335
A data.frame: 7 × 2
    A      B
    <dbl>  <dbl>
1   5260   3910
2   5470   4220
3   5640   3885
8   7515   5975
9   7515   6790
10  8230   6900
11  8770   7335

Maps

The apply family of functions applies a function to all rows or columns.

The first character of the function name establishes the return type (s: simplified vector or matrix, l: list).


In [49]:
lapply(d, min, na.rm=TRUE)

sapply(d, min, na.rm=TRUE)

apply(d, 1, min)

apply(d, 2, min)


$A
5260
$B
3885
A
5260
B
3885
  1. 3910
  2. 4220
  3. 3885
  4. 5160
  5. 5645
  6. 4680
  7. 5265
  8. 5975
  9. 6790
  10. 6900
  11. 7335
A
5260
B
3885
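
The return types are easier to see when the applied function returns more than one value. A minimal sketch with a small illustrative data frame (tab holds the first three rows of the data above); vapply is a base R variant of sapply that checks the type and length of each result.

```r
tab <- data.frame(A = c(5260, 5470, 5640), B = c(3910, 4220, 3885))

lapply(tab, range)   # list with one element per column
sapply(tab, range)   # simplified to a 2 x 2 matrix (min and max rows)

# Anonymous function applied to each row (MARGIN = 1)
apply(tab, 1, function(r) r["A"] - r["B"])

# Like sapply, but each result must be a single numeric value
vapply(tab, mean, FUN.VALUE = numeric(1))
```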

Sort and Order


In [50]:
sort(d$B)
o <- order(d$B)
o
ds <- d[o,]
ds


  1. 3885
  2. 3910
  3. 4220
  4. 4680
  5. 5160
  6. 5265
  7. 5645
  8. 5975
  9. 6790
  10. 6900
  11. 7335
  1. 3
  2. 1
  3. 2
  4. 6
  5. 4
  6. 7
  7. 5
  8. 8
  9. 9
  10. 10
  11. 11
A data.frame: 11 × 2
    A      B
    <dbl>  <dbl>
3   5640   3885
1   5260   3910
2   5470   4220
6   6515   4680
4   6180   5160
7   6805   5265
5   6390   5645
8   7515   5975
9   7515   6790
10  8230   6900
11  8770   7335

Loading and Saving data

There are many functions for reading CSV, Excel, and RData formats.


In [51]:
# wine.data has no header row, so header = FALSE keeps the first observation as data
wine = read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data", header = FALSE, sep = ",")
head(wine)

save(wine, file="wine.RData")

rm(wine)

load("wine.RData")
# sep = "," makes the file comma-separated (the default separator is a space)
write.table(wine, file="wine.csv", row.names=FALSE, quote = FALSE, sep = ",")


A data.frame: 6 × 14
V1     V2     V3     V4     V5     V6     V7     V8     V9     V10    V11    V12    V13    V14
<int>  <dbl>  <dbl>  <dbl>  <dbl>  <int>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <int>
1      14.23  1.71   2.43   15.6   127    2.80   3.06   0.28   2.29   5.64   1.04   3.92   1065
1      13.20  1.78   2.14   11.2   100    2.65   2.76   0.26   1.28   4.38   1.05   3.40   1050
1      13.16  2.36   2.67   18.6   101    2.80   3.24   0.30   2.81   5.68   1.03   3.17   1185
1      14.37  1.95   2.50   16.8   113    3.85   3.49   0.24   2.18   7.80   0.86   3.45   1480
1      13.24  2.59   2.87   21.0   118    2.80   2.69   0.39   1.82   4.32   1.04   2.93   735
1      14.20  1.76   2.45   15.2   112    3.27   3.39   0.34   1.97   6.75   1.05   2.85   1450
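
A round trip through a temporary file is a quick way to check that data survives saving and loading. The sketch below is self-contained and does not touch the wine files; df_out and its values are made up for the example, and saveRDS/readRDS are shown as a base R alternative to save/load for a single object.

```r
# Small illustrative data frame (values are made up for the example)
df_out <- data.frame(id = 1:3, score = c(9.5, 7.25, 8))

csv_file <- tempfile(fileext = ".csv")
write.csv(df_out, file = csv_file, row.names = FALSE)  # comma-separated, with header
df_csv <- read.csv(csv_file)

rds_file <- tempfile(fileext = ".rds")
saveRDS(df_out, rds_file)      # stores a single object, types preserved
df_rds <- readRDS(rds_file)

all.equal(df_out, df_csv)      # compares values after the CSV round trip
identical(df_out, df_rds)      # RDS restores the same object

unlink(c(csv_file, rds_file))  # remove the temporary files
```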

Functions

Functions may have parameters and can return an object.


In [52]:
create_dataset <- function() {
  data <- read.table(text = "Year Months Flights Delays
                     2016 Jan-Mar 11 6
                     2016 Apr-Jun 12 5
                     2016 Jul-Sep 13 3
                     2016 Oct-Dec 12 5
                     2017 Jan-Mar 10 4
                     2017 Apr-Jun 9 3
                     2017 Jul-Sep 11 4
                     2017 Oct-Dec 25 15
                     2018 Jan-Mar 14 3
                     2018 Apr-Jun 12 5
                     2018 Jul-Sep 13 3
                     2018 Oct-Dec 15 4",
                     header = TRUE,sep = "")  
  data$OnTime <- data$Flights - data$Delays 
  data$Perc <- round(100 * data$Delays / data$Flights)
  return(data)
}

data <- create_dataset()
head(data)


A data.frame: 6 × 6
Year   Months   Flights  Delays  OnTime  Perc
<int>  <fct>    <int>    <int>   <int>   <dbl>
2016   Jan-Mar  11       6       5       55
2016   Apr-Jun  12       5       7       42
2016   Jul-Sep  13       3       10      23
2016   Oct-Dec  12       5       7       42
2017   Jan-Mar  10       4       6       40
2017   Apr-Jun  9        3       6       33
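
The function above takes no parameters. As a complement, here is a sketch of a function with parameters, one of which has a default value. The helper delay_perc is hypothetical (it is not part of the notebook) and mirrors the Perc computation inside create_dataset.

```r
# Hypothetical helper: percentage of delayed flights.
# 'digits' has a default value and can be omitted by the caller.
delay_perc <- function(delays, flights, digits = 0) {
  if (any(flights <= 0)) {
    stop("flights must be positive")
  }
  round(100 * delays / flights, digits)  # last expression is returned
}

delay_perc(6, 11)                                  # scalar input
delay_perc(c(6, 5, 3), c(11, 12, 13), digits = 1)  # vectorized input
```

Because arithmetic in R is vectorized, the same function works on single values and on whole columns without a loop.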

Pipelines

The operator %>% creates a pipeline.

The first parameter of the next invoked function receives the data from the pipeline.

Library $dplyr$ contains a set of functions that support relational algebra operations.


In [53]:
loadlibrary("dplyr")

data_sd <- create_dataset() %>% 
  select(variable=Months, value=Delays) %>% 
  group_by(variable) %>% 
  summarize(sd = sd(value), value = mean(value))

data_sd$variable <- factor(data_sd$variable,
    levels = c('Jan-Mar','Apr-Jun','Jul-Sep','Oct-Dec'))

head(data_sd)


A tibble: 4 × 3
variable  sd         value
<fct>     <dbl>      <dbl>
Apr-Jun   1.1547005  4.333333
Jan-Mar   1.5275252  4.333333
Jul-Sep   0.5773503  3.333333
Oct-Dec   6.0827625  8.000000
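
To see how a pipeline passes data along, it helps to compare it with the equivalent nested calls. The sketch below avoids any package dependency by using the native pipe |>, assuming R 4.1 or later; for a simple chain like this one it behaves like %>%. The vector delays and the names nested and piped are illustrative.

```r
delays <- c(6, 5, 3, 5, 4, 3)   # first six Delays values from the dataset

# Nested calls read inside-out
nested <- round(mean(sqrt(delays)), 2)

# The same computation as a pipeline reads left to right;
# each result becomes the first argument of the next call
piped <- delays |> sqrt() |> mean() |> round(2)

identical(nested, piped)
```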

Advanced graphics

The $ggplot2$ library provides advanced graphics.

The $myGraphics.ipynb$ notebook has some examples of creating graphics using $ggplot2$. Additional information can be found at https://nbviewer.jupyter.org/github/eogasawara/mylibrary/blob/master/myGraphics.ipynb.


In [54]:
loadlibrary("RColorBrewer")

col_set <- brewer.pal(11, 'Spectral')

grf <- plot.bar(data_sd, colors=col_set[2], alpha=0.5)
grf <- grf + geom_errorbar(
    aes(x=variable, ymin=value-sd, ymax=value+sd), 
    width=0.2, colour=col_set[2], alpha=0.9, size=1.1) 

plot(grf)


Melt library

The $melt$ function transforms column values into rows, keeping the columns listed in $id.vars$ as identifiers.

The column names are used to fill the $variable$ column created by $melt$.


In [55]:
loadlibrary("reshape")
data <- create_dataset()
head(data)
data <- melt(data[,c('Year', 'Months', 'Flights', 'Delays', 'OnTime', 'Perc')], 
               id.vars = c(1,2))
head(data)


A data.frame: 6 × 6
Year   Months   Flights  Delays  OnTime  Perc
<int>  <fct>    <int>    <int>   <int>   <dbl>
2016   Jan-Mar  11       6       5       55
2016   Apr-Jun  12       5       7       42
2016   Jul-Sep  13       3       10      23
2016   Oct-Dec  12       5       7       42
2017   Jan-Mar  10       4       6       40
2017   Apr-Jun  9        3       6       33
A data.frame: 6 × 4
Year   Months   variable  value
<int>  <fct>    <fct>     <dbl>
2016   Jan-Mar  Flights   11
2016   Apr-Jun  Flights   12
2016   Jul-Sep  Flights   13
2016   Oct-Dec  Flights   12
2017   Jan-Mar  Flights   10
2017   Apr-Jun  Flights   9
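
To make the wide-to-long transformation concrete, the sketch below reproduces its result with base R only. It is not how the reshape package implements melt; it is an illustration under the assumption of a small data frame named wide, with id_cols playing the role of id.vars.

```r
wide <- data.frame(Year = c(2016, 2016),
                   Months = c("Jan-Mar", "Apr-Jun"),
                   Flights = c(11, 12),
                   Delays = c(6, 5))
id_cols <- c("Year", "Months")
measure_cols <- setdiff(names(wide), id_cols)

# One block of rows per measured column, stacked on top of each other
long <- do.call(rbind, lapply(measure_cols, function(col) {
  data.frame(wide[id_cols], variable = col, value = wide[[col]])
}))
long$variable <- factor(long$variable, levels = measure_cols)
long
```

Each measured column contributes as many rows as the original table has, so two rows and two measured columns give four rows in the long format.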

In [56]:
data$x <- sprintf("%d-%s", data$Year, data$Months)
data$x <- factor(data$x,levels = data$x[1:12])

grf <- plot.series(data %>% filter(variable %in% c('Flights', 'Delays')),
                   colors=col_set[c(4,2)]) 
grf <- grf + theme(axis.text.x = element_text(angle=45, hjust=1))

plot(grf)


Merge

The function $merge$ can be used to join data frames. It can be used to produce inner, left, right, and outer joins.


In [57]:
stores <- data.frame(
    city = c("Rio de Janeiro", "Sao Paulo", "Paris", "New York", "Tokyo"),
    value = c(10, 12, 20, 25, 18))
head(stores)


divisions <- data.frame(
    city = c("Rio de Janeiro", "Sao Paulo", "Paris", "New York", "Tokyo"),
    country = c("Brazil", "Brazil", "France", "US", "Japan"))
head(divisions)

data <- merge(stores, divisions, by.x="city", by.y="city")
head(data)

result <- data %>% group_by(country) %>% summarize(count = n(), amount = sum(value))
head(result)


A data.frame: 5 × 2
city            value
<fct>           <dbl>
Rio de Janeiro  10
Sao Paulo       12
Paris           20
New York        25
Tokyo           18
A data.frame: 5 × 2
city            country
<fct>           <fct>
Rio de Janeiro  Brazil
Sao Paulo       Brazil
Paris           France
New York        US
Tokyo           Japan
A data.frame: 5 × 3
city            value  country
<fct>           <dbl>  <fct>
New York        25     US
Paris           20     France
Rio de Janeiro  10     Brazil
Sao Paulo       12     Brazil
Tokyo           18     Japan
A tibble: 4 × 3
country  count  amount
<fct>    <int>  <dbl>
Brazil   2      22
France   1      20
Japan    1      18
US       1      25
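
In the example above both data frames contain the same cities, so every join type gives the same result. The sketch below uses two small illustrative tables with partially overlapping keys (stores2 and divisions2 are made-up names and values) to show how the all.x, all.y, and all arguments of merge select the join type.

```r
stores2 <- data.frame(city = c("Paris", "Tokyo", "Lima"),
                      value = c(20, 18, 7))
divisions2 <- data.frame(city = c("Paris", "Tokyo", "Cairo"),
                         country = c("France", "Japan", "Egypt"))

inner <- merge(stores2, divisions2, by = "city")                # cities in both
left  <- merge(stores2, divisions2, by = "city", all.x = TRUE)  # all stores
right <- merge(stores2, divisions2, by = "city", all.y = TRUE)  # all divisions
outer <- merge(stores2, divisions2, by = "city", all = TRUE)    # all cities

left    # Lima has country NA
outer   # Cairo has value NA, Lima has country NA
```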

Loops


In [58]:
for (i in 1:nrow(result)) {
  value <- result$amount[i]
  if (result$count[i] > 1) {
      value <- 0.8*value
  }
  print(sprintf("%6s - %.1f", result$country[i], value))
}


[1] "Brazil - 17.6"
[1] "France - 20.0"
[1] " Japan - 18.0"
[1] "    US - 25.0"
