Package installation

Loading basic graphics functions.

Installation of R packages.

Setting up the size for graphics.



In [36]:

    
source("https://raw.githubusercontent.com/eogasawara/mylibrary/master/myGraphics.R")
plot_size(4,3)



In [37]:

    
loadlibrary("TSPred")
loadlibrary("STMotif")

Basic R concepts

Variable assignment

Functions for data evaluation

Vector definition

Calculations

Printing values



In [38]:

    
x <- 2 # variable assignment

x # variable evaluation

is.numeric(x) # check whether the variable is numeric

weight <- c(60, 72, 57, 90, 95, 72) # vector with six observations

height <- c(1.75, 1.80, 1.65, 1.90, 1.74, 1.91)

bmi <- weight/height^2

print(bmi)

print(sprintf("%.2f +/- %.2f", mean(bmi), sd(bmi)))









    




2






    




TRUE






    



[1] 19.59184 22.22222 20.93664 24.93075 31.37799 19.73630
[1] "23.13 +/- 4.49"
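
As a cross-check of the calculation above, the following sketch recomputes the mean and the sample standard deviation by hand. The names manual_mean and manual_sd are introduced here for illustration only; the data are the weight and height vectors from the cell above.

```r
weight <- c(60, 72, 57, 90, 95, 72)
height <- c(1.75, 1.80, 1.65, 1.90, 1.74, 1.91)

# Arithmetic on vectors is element-wise: bmi[i] = weight[i] / height[i]^2
bmi <- weight / height^2

n <- length(bmi)
manual_mean <- sum(bmi) / n
# sd() uses the sample formula with n - 1 in the denominator
manual_sd <- sqrt(sum((bmi - manual_mean)^2) / (n - 1))

print(all.equal(manual_mean, mean(bmi)))
print(all.equal(manual_sd, sd(bmi)))
```

Both comparisons should agree with the built-in functions up to floating-point tolerance.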

Plotting graphics

Lines are drawn on the last active canvas, that is, on top of the most recent plot.



In [39]:

    
plot(height, weight)



In [40]:

    
plot(height, weight)
hh <- c(1.65, 1.70, 1.75, 1.80, 1.85, 1.90)
lines(hh, 22.5 * hh^2)

Statistical tests

A one-sample t-test checks whether the mean of the observations differs from a theoretical value.



In [41]:

    
t.test(bmi, mu=22.5)









    





	One Sample t-test

data:  bmi
t = 0.34488, df = 5, p-value = 0.7442
alternative hypothesis: true mean is not equal to 22.5
95 percent confidence interval:
 18.41734 27.84791
sample estimates:
mean of x 
 23.13262
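
To make the numbers in this output less opaque, the sketch below recomputes the t statistic, the two-sided p-value, and the 95 percent confidence interval from their textbook formulas. The variable names (std_err, t_stat, p_value, ci) are chosen here for illustration.

```r
weight <- c(60, 72, 57, 90, 95, 72)
height <- c(1.75, 1.80, 1.65, 1.90, 1.74, 1.91)
bmi <- weight / height^2

mu <- 22.5                                   # theoretical value under test
n <- length(bmi)
std_err <- sd(bmi) / sqrt(n)                 # standard error of the mean

t_stat <- (mean(bmi) - mu) / std_err         # t statistic with n - 1 degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df = n - 1)  # two-sided p-value
ci <- mean(bmi) + c(-1, 1) * qt(0.975, df = n - 1) * std_err

print(c(t = t_stat, p = p_value))
print(ci)
```

Since the p-value reported above (0.7442) is far above the usual 0.05 threshold, the test gives no evidence that the mean differs from 22.5.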

Properties of functions

Function parameters can have default values.

The function $args$ shows the parameters of a function, and ? opens its documentation, which includes examples.



In [43]:

    
plot(height, weight, pch=2)

args(plot.default)

#?graphics::plot









    




function (x, y = NULL, type = "p", xlim = NULL, ylim = NULL, 
    log = "", main = NULL, sub = NULL, xlab = NULL, ylab = NULL, 
    ann = par("ann"), axes = TRUE, frame.plot = axes, panel.first = NULL, 
    panel.last = NULL, asp = NA, xgap.axis = NA, ygap.axis = NA, 
    ...) 
NULL
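
The same mechanism is available in user-defined functions. The sketch below uses a hypothetical helper, bmi_of, that is not part of the original notebook; its digits parameter has a default value, while weight and height must always be supplied.

```r
# Hypothetical helper: 'digits' has a default value, 'weight' and 'height' do not
bmi_of <- function(weight, height, digits = 2) {
  round(weight / height^2, digits)
}

print(bmi_of(72, 1.80))              # uses the default digits = 2
print(bmi_of(72, 1.80, digits = 0))  # overrides the default by name
print(args(bmi_of))                  # shows the signature, including defaults
```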

Handling NA observations

Operations with NA lead to NA, unless the missing values are explicitly removed (for example, with na.rm=TRUE).

Observations may have an associated label (name).



In [44]:

    
x <- c(A=1, B=NA, C=3)

mean(x)

mean(x, na.rm=TRUE)

names(x)

x["B"] <- 2

x["B"]*x
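
Before removing or replacing missing values it is often useful to locate them. The following sketch uses is.na for that purpose; the names n_missing, missing_labels, and x_filled are illustrative, and filling with the mean is only one possible choice.

```r
x <- c(A = 1, B = NA, C = 3)

print(is.na(x))                       # logical vector marking the missing entries
n_missing <- sum(is.na(x))            # number of missing observations
missing_labels <- names(x)[is.na(x)]  # labels of the missing observations

# Replace the missing values with the mean of the observed ones
x_filled <- x
x_filled[is.na(x_filled)] <- mean(x, na.rm = TRUE)
print(x_filled)
```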

Matrices

Matrices can be filled from vectors or data frames.

It is possible to associate names with rows and columns.

Operations such as transpose, scalar product, matrix product (%*%), and determinant are available.

Additional documentation can be found at https://www.statmethods.net/advstats/matrix.html.



In [45]:

    
m <- 1:9
dim(m) <- c(3,3)
m

mb <- matrix(1:9, nrow=3,byrow=TRUE)
rownames(mb) = LETTERS[1:3]
mb

t(m)

m*x

det(m)









    





A matrix: 3 × 3 of type int

	1 4 7
	2 5 8
	3 6 9









    





A matrix: 3 × 3 of type int

	A 1 2 3
	B 4 5 6
	C 7 8 9









    





A matrix: 3 × 3 of type int

	1 2 3
	4 5 6
	7 8 9









    





A matrix: 3 × 3 of type dbl

	1  4  7
	4 10 16
	9 18 27









    




0
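
The cell above uses *, which multiplies element by element. As a minimal sketch of the difference from the matrix product, the example below applies both operators to the same matrix and a plain vector v (introduced here for illustration).

```r
m <- matrix(1:9, nrow = 3)   # filled column by column, as in the cell above
v <- c(1, 2, 3)

elementwise <- m * v         # v is recycled down each column
product <- m %*% v           # 3 x 1 matrix: each row of m times v
inner <- sum(v * v)          # scalar (inner) product of v with itself

print(elementwise)
print(product)
print(inner)
```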

Factors

Factors are used to handle categorical data.



In [46]:

    
pain = c(0,3,2,2,1)
fpain = factor(pain,levels=0:3)
levels(fpain) = c("none","mild","medium","severe")

fpain

as.numeric(fpain)

levels(fpain)









    





	none
	severe
	medium
	medium
	mild



	
		Levels:
	
	
		'none'
		'mild'
		'medium'
		'severe'
	







    





	1
	4
	3
	3
	2








    





	'none'
	'mild'
	'medium'
	'severe'
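
A compact variant, sketched below, sets the level names directly through the labels parameter of factor and counts the observations per category with table. The names counts, codes, and label_text are illustrative.

```r
pain <- c(0, 3, 2, 2, 1)

# 'labels' names the levels in the same call that creates the factor
fpain <- factor(pain, levels = 0:3,
                labels = c("none", "mild", "medium", "severe"))

counts <- table(fpain)             # frequency of each category, in level order
codes <- as.numeric(fpain)         # internal codes 1..4, not the original 0..3
label_text <- as.character(fpain)  # level names as plain strings

print(counts)
print(codes)
```

Note that as.numeric returns the position of each level, which is why the first observation (pain 0) becomes 1.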

Lists

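Lists group components of possibly different types and lengths under a single name, which makes them the usual way to work with "objects".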



In [47]:

    
x = c(5260,5470,5640,6180,6390,
      6515,6805,7515,7515,8230,8770)
y = c(3910,4220,3885,5160,5645,
      4680,5265,5975,6790,6900,7335)

lst <- list(A=x, B=y)

lst

lst$A
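
Unlike vectors, lists can mix types. The sketch below uses a hypothetical record, person, to show the three common ways of accessing components and how a component is added after creation.

```r
# Hypothetical record mixing a string, a numeric vector, and a logical
person <- list(name = "Ana", scores = c(7.5, 9.0), passed = TRUE)

print(person$scores)       # component by name
print(person[["scores"]])  # same component, using double brackets
print(person["scores"])    # single brackets return a list of length 1

person$age <- 30           # components can be added after creation
print(names(person))
```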

Data frames

Data frames (tables) provide support for structured data.



In [48]:

    
d <- data.frame(A=lst$A,B=lst$B)
d

df <- d[d$A > 7000 | d$A < 6000,]
df









    





A data.frame: 11 × 2

	A B
	<dbl> <dbl>


	5260 3910
	5470 4220
	5640 3885
	6180 5160
	6390 5645
	6515 4680
	6805 5265
	7515 5975
	7515 6790
	8230 6900
	8770 7335









    





A data.frame: 7 × 2

	 A B
	 <dbl> <dbl>


	1 5260 3910
	2 5470 4220
	3 5640 3885
	8 7515 5975
	9 7515 6790
	10 8230 6900
	11 8770 7335
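
Row filters can be combined with other common operations. The sketch below rebuilds the same data frame, selects the rows excluded by the filter above, adds a computed column (ratio, a name chosen for illustration), and shows subset as an alternative to bracket indexing.

```r
d <- data.frame(
  A = c(5260, 5470, 5640, 6180, 6390, 6515, 6805, 7515, 7515, 8230, 8770),
  B = c(3910, 4220, 3885, 5160, 5645, 4680, 5265, 5975, 6790, 6900, 7335))

# Rows where A lies between 6000 and 7000 (the complement of the filter above)
middle <- d[d$A >= 6000 & d$A <= 7000, ]

# New column computed from existing ones
d$ratio <- d$B / d$A

# subset() is an alternative to bracket indexing
high <- subset(d, A > 7000, select = c(A, B))

print(middle)
print(high)
```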

Maps

Functions of the apply family call a given function on every element, row, or column of a data structure.

The first character of the function name indicates the return type (l: always a list; s: simplified to a vector or matrix when possible).



In [49]:

    
lapply(d, min, na.rm=TRUE)

sapply(d, min, na.rm=TRUE)

apply(d, 1, min)

apply(d, 2, min)
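
The difference between the return types is easier to see when the results are stored and inspected. The following sketch rebuilds the data frame from the earlier cells; the result names (as_list, as_vector, by_row, by_col) are illustrative.

```r
d <- data.frame(
  A = c(5260, 5470, 5640, 6180, 6390, 6515, 6805, 7515, 7515, 8230, 8770),
  B = c(3910, 4220, 3885, 5160, 5645, 4680, 5265, 5975, 6790, 6900, 7335))

as_list <- lapply(d, min)    # list with one element per column
as_vector <- sapply(d, min)  # simplified to a named numeric vector
by_row <- apply(d, 1, min)   # margin 1: one value per row
by_col <- apply(d, 2, min)   # margin 2: one value per column

print(class(as_list))
print(as_vector)
print(length(by_row))
```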

Sort and Order



In [50]:

    
sort(d$B)
o <- order(d$B)
o
ds <- d[o,]
ds









    





	3885
	3910
	4220
	4680
	5160
	5265
	5645
	5975
	6790
	6900
	7335








    





	3
	1
	2
	6
	4
	7
	5
	8
	9
	10
	11








    





A data.frame: 11 × 2

	 A B
	 <dbl> <dbl>


	3 5640 3885
	1 5260 3910
	2 5470 4220
	6 6515 4680
	4 6180 5160
	7 6805 5265
	5 6390 5645
	8 7515 5975
	9 7515 6790
	10 8230 6900
	11 8770 7335
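
The function sort returns the sorted values, while order returns the positions that would sort them, which is what makes it usable for reordering whole rows. As a small extension, the sketch below sorts in descending order and uses a second key to break ties; the names b_desc and ds are illustrative.

```r
d <- data.frame(
  A = c(5260, 5470, 5640, 6180, 6390, 6515, 6805, 7515, 7515, 8230, 8770),
  B = c(3910, 4220, 3885, 5160, 5645, 4680, 5265, 5975, 6790, 6900, 7335))

# Descending sort of a single vector
b_desc <- sort(d$B, decreasing = TRUE)

# Order rows by A descending; ties in A are broken by B ascending
o <- order(-d$A, d$B)
ds <- d[o, ]

print(b_desc)
print(head(ds, 4))
```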

Loading and Saving data

There are many functions for reading CSV, Excel, and RData formats.



In [51]:

    
wine = read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data", header = FALSE, sep = ",") # wine.data has no header row
head(wine)

save(wine, file="wine.RData")

rm(wine)

load("wine.RData")
write.table(wine, file="wine.csv", row.names=FALSE, quote = FALSE, sep = ",") # comma-separated, matching the .csv extension

A data.frame: 6 × 14

	V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14
	<int> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>


	1 14.23 1.71 2.43 15.6 127 2.80 3.06 0.28 2.29 5.64 1.04 3.92 1065
	1 13.20 1.78 2.14 11.2 100 2.65 2.76 0.26 1.28 4.38 1.05 3.40 1050
	1 13.16 2.36 2.67 18.6 101 2.80 3.24 0.30 2.81 5.68 1.03 3.17 1185
	1 14.37 1.95 2.50 16.8 113 3.85 3.49 0.24 2.18 7.80 0.86 3.45 1480
	1 13.24 2.59 2.87 21.0 118 2.80 2.69 0.39 1.82 4.32 1.04 2.93  735
	1 14.20 1.76 2.45 15.2 112 3.27 3.39 0.34 1.97 6.75 1.05 2.85 1450
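
A write followed by a read is a simple way to confirm that a format preserves the data. The sketch below performs such a round trip with a hypothetical small table (scores) and temporary files, so it runs without network access; saveRDS and readRDS are shown as a single-object alternative to save and load.

```r
# Hypothetical small table used only for this round trip
scores <- data.frame(id = 1:3, score = c(9.5, 7, 8.25))

csv_path <- tempfile(fileext = ".csv")
write.csv(scores, csv_path, row.names = FALSE)  # text format, comma-separated
from_csv <- read.csv(csv_path)

rds_path <- tempfile(fileext = ".rds")
saveRDS(scores, rds_path)                       # binary format, one object per file
from_rds <- readRDS(rds_path)

print(all.equal(scores, from_csv))
print(all.equal(scores, from_rds))
```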

Functions

Functions may have parameters and can return an object.



In [52]:

    
create_dataset <- function() {
  data <- read.table(text = "Year Months Flights Delays
                     2016 Jan-Mar 11 6
                     2016 Apr-Jun 12 5
                     2016 Jul-Sep 13 3
                     2016 Oct-Dec 12 5
                     2017 Jan-Mar 10 4
                     2017 Apr-Jun 9 3
                     2017 Jul-Sep 11 4
                     2017 Oct-Dec 25 15
                     2018 Jan-Mar 14 3
                     2018 Apr-Jun 12 5
                     2018 Jul-Sep 13 3
                     2018 Oct-Dec 15 4",
                     header = TRUE,sep = "")  
  data$OnTime <- data$Flights - data$Delays 
  data$Perc <- round(100 * data$Delays / data$Flights)
  return(data)
}

data <- create_dataset()
head(data)









    





A data.frame: 6 × 6

	Year Months Flights Delays OnTime Perc
	<int> <fct> <int> <int> <int> <dbl>


	2016 Jan-Mar 11 6  5 55
	2016 Apr-Jun 12 5  7 42
	2016 Jul-Sep 13 3 10 23
	2016 Oct-Dec 12 5  7 42
	2017 Jan-Mar 10 4  6 40
	2017 Apr-Jun  9 3  6 33

Pipelines

The operator %>% creates a pipeline.

The first parameter of the next invoked function receives the data from the pipeline.

Library $dplyr$ contains a set of functions that support relational algebra operations.



In [53]:

    
loadlibrary("dplyr")

data_sd <- create_dataset() %>% 
  select(variable=Months, value=Delays) %>% 
  group_by(variable) %>% 
  summarize(sd = sd(value), value = mean(value))

data_sd$variable <- factor(data_sd$variable,
    levels = c('Jan-Mar','Apr-Jun','Jul-Sep','Oct-Dec'))

head(data_sd)









    





A tibble: 4 × 3

	variable sd value
	<fct> <dbl> <dbl>


	Apr-Jun 1.1547005 4.333333
	Jan-Mar 1.5275252 4.333333
	Jul-Sep 0.5773503 3.333333
	Oct-Dec 6.0827625 8.000000
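
What the grouping and summarizing steps compute can be cross-checked in base R. The sketch below is not the pipeline itself: it swaps in tapply and retypes the Delays and Months columns of the dataset by hand, so that it runs without additional packages.

```r
# Delays and Months columns of the dataset, in row order
delays <- c(6, 5, 3, 5, 4, 3, 4, 15, 3, 5, 3, 4)
months <- rep(c("Jan-Mar", "Apr-Jun", "Jul-Sep", "Oct-Dec"), times = 3)

# tapply splits 'delays' by 'months' and applies the function to each group
group_mean <- tapply(delays, months, mean)
group_sd <- tapply(delays, months, sd)

print(round(cbind(sd = group_sd, value = group_mean), 4))
```

The groups appear in alphabetical order, matching the row order of the tibble above.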

Advanced graphics

Library $ggplot2$ provides advanced graphics.

The $myGraphics.ipynb$ notebook has some examples of creating nice graphics using $ggplot2$. Additional information can be found at https://nbviewer.jupyter.org/github/eogasawara/mylibrary/blob/master/myGraphics.ipynb.



In [54]:

    
loadlibrary("RColorBrewer")

col_set <- brewer.pal(11, 'Spectral')

grf <- plot.bar(data_sd, colors=col_set[2], alpha=0.5)
grf <- grf + geom_errorbar(
    aes(x=variable, ymin=value-sd, ymax=value+sd), 
    width=0.2, colour=col_set[2], alpha=0.9, size=1.1) 

plot(grf)

Melt library

The $melt$ function transforms column values into rows, keeping the columns listed in $id.vars$ as identifiers.

The names of the melted columns fill the $variable$ column created by $melt$, and their contents fill the $value$ column.



In [55]:

    
loadlibrary("reshape")
data <- create_dataset()
head(data)
data <- melt(data[,c('Year', 'Months', 'Flights', 'Delays', 'OnTime', 'Perc')], 
               id.vars = c(1,2))
head(data)









    





A data.frame: 6 × 6

	Year Months Flights Delays OnTime Perc
	<int> <fct> <int> <int> <int> <dbl>


	2016 Jan-Mar 11 6  5 55
	2016 Apr-Jun 12 5  7 42
	2016 Jul-Sep 13 3 10 23
	2016 Oct-Dec 12 5  7 42
	2017 Jan-Mar 10 4  6 40
	2017 Apr-Jun  9 3  6 33









    





A data.frame: 6 × 4

	Year Months variable value
	<int> <fct> <fct> <dbl>


	2016 Jan-Mar Flights 11
	2016 Apr-Jun Flights 12
	2016 Jul-Sep Flights 13
	2016 Oct-Dec Flights 12
	2017 Jan-Mar Flights 10
	2017 Apr-Jun Flights  9
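
To show what this wide-to-long transformation does, the sketch below reproduces it by hand in base R on the first two rows of the dataset. It is an illustration of the idea, not the implementation used by the reshape library; the names wide, measure, and long are chosen here.

```r
# First two rows of the dataset, restricted to two measured columns
wide <- data.frame(Year = c(2016, 2016),
                   Months = c("Jan-Mar", "Apr-Jun"),
                   Flights = c(11, 12),
                   Delays = c(6, 5))

measure <- c("Flights", "Delays")  # columns to be turned into rows

# One block of rows per measured column, stacked on top of each other
long <- do.call(rbind, lapply(measure, function(v) {
  data.frame(wide[, c("Year", "Months")], variable = v, value = wide[[v]])
}))

print(long)
```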



In [56]:

    
data$x <- sprintf("%d-%s", data$Year, data$Months)
data$x <- factor(data$x,levels = data$x[1:12])

grf <- plot.series(data %>% filter(variable %in% c('Flights', 'Delays')),
                   colors=col_set[c(4,2)]) 
grf <- grf + theme(axis.text.x = element_text(angle=45, hjust=1))

plot(grf)

Merge

The function $merge$ joins data frames. By default it produces an inner join; the parameters $all.x$, $all.y$, and $all$ select left, right, and outer joins, respectively.



In [57]:

    
stores <- data.frame(
    city = c("Rio de Janeiro", "Sao Paulo", "Paris", "New York", "Tokyo"),
    value = c(10, 12, 20, 25, 18))
head(stores)


divisions <- data.frame(
    city = c("Rio de Janeiro", "Sao Paulo", "Paris", "New York", "Tokyo"),
    country = c("Brazil", "Brazil", "France", "US", "Japan"))
head(divisions)

data <- merge(stores, divisions, by.x="city", by.y="city")
head(data)

result <- data %>% group_by(country) %>% summarize(count = n(), amount = sum(value))
head(result)









    





A data.frame: 5 × 2

	city value
	<fct> <dbl>


	Rio de Janeiro 10
	Sao Paulo     12
	Paris         20
	New York      25
	Tokyo         18









    





A data.frame: 5 × 2

	city country
	<fct> <fct>


	Rio de Janeiro Brazil
	Sao Paulo     Brazil
	Paris         France
	New York      US    
	Tokyo         Japan 









    





A data.frame: 5 × 3

	city value country
	<fct> <dbl> <fct>


	New York      25 US    
	Paris         20 France
	Rio de Janeiro 10 Brazil
	Sao Paulo     12 Brazil
	Tokyo         18 Japan 









    





A tibble: 4 × 3

	country count amount
	<fct> <int> <dbl>


	Brazil 2 22
	France 1 20
	Japan 1 18
	US    1 25
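
In the example above every city appears in both data frames, so all join types would give the same result. The sketch below uses two small hypothetical tables with one unmatched city on each side to show how the four join types differ.

```r
# Hypothetical tables: "Lima" has no division, "Berlin" has no store
stores <- data.frame(city = c("Paris", "Tokyo", "Lima"),
                     value = c(20, 18, 7))
divisions <- data.frame(city = c("Paris", "Tokyo", "Berlin"),
                        country = c("France", "Japan", "Germany"))

inner <- merge(stores, divisions, by = "city")                # matched cities only
left  <- merge(stores, divisions, by = "city", all.x = TRUE)  # keeps every store
right <- merge(stores, divisions, by = "city", all.y = TRUE)  # keeps every division
outer <- merge(stores, divisions, by = "city", all = TRUE)    # keeps everything

print(left)
print(outer)
```

Unmatched rows are kept with NA in the columns that come from the other table.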

Loops



In [58]:

    
for (i in seq_len(nrow(result))) { # seq_len() also behaves correctly when result has no rows
  value <- result$amount[i]
  if (result$count[i] > 1) {
      value <- 0.8*value
  }
  print(sprintf("%6s - %.1f", result$country[i], value))
}









    



[1] "Brazil - 17.6"
[1] "France - 20.0"
[1] " Japan - 18.0"
[1] "    US - 25.0"
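
Many loops of this kind can also be written without an explicit loop, because R functions such as ifelse and sprintf work on whole vectors. The sketch below retypes the result table by hand (an assumption made to keep the example self-contained) and produces the same four strings.

```r
# Same values as the 'result' table above, retyped to keep the example self-contained
result <- data.frame(country = c("Brazil", "France", "Japan", "US"),
                     count = c(2, 1, 1, 1),
                     amount = c(22, 20, 18, 25),
                     stringsAsFactors = FALSE)

# ifelse() applies the condition to every row at once
value <- ifelse(result$count > 1, 0.8 * result$amount, result$amount)

# sprintf() is vectorized too: one formatted string per row
report <- sprintf("%6s - %.1f", result$country, value)
print(report)
```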


