Lecture outline:
To use R in the notebook, you simply need to:
(1) have R installed on your computer
(2) have the Python module rpy2 installed
(3) invoke the following magic command (in recent rpy2 releases the extension is named rpy2.ipython instead of rmagic):
In [41]:
%load_ext rmagic
Then type in the magic command
%%R
at the beginning of the cell:
In [42]:
%%R
x = c(1,2,3,4)
y = c('a','b','c','d')
xy = data.frame(x,y)
colnames(xy) = c('X', 'Y')
print(xy)
As in Python, there are basic types in R representing the usual things: Booleans, numbers, and strings.
To retrieve the type of a variable x, use the function
typeof(x)
(This is the R equivalent of the type function in Python.)
To print a variable x, use the R command
print(x)
In [43]:
%%R
x = 3
y = 'Hello'
z = TRUE
print(typeof(x))
print(typeof(y))
print(typeof(z))
Variable assignments have the same syntax as in Python:
variable = value
R also provides the arrow syntax for the same purpose:
variable <- value
value -> variable
(We will use the Python-like syntax for variable assignments.)
In [44]:
%%R
#Python-like variable assignment syntax works in R
x = 'Hello'
#Arrow syntax
y <- 'Bonjour'
'Guten tag' -> z
print(x); print(y); print(z)
In R, the class representing the Boolean type is called logical.
As usual, it has only two possible values:
TRUE and FALSE
which can be abbreviated to
T and F
Difference with Python: In R, the Boolean values are written in all upper case (instead of True and False).
In [45]:
%%R
a = TRUE; b = FALSE
c = T; d = F
print(typeof(a)); print(typeof(b))
print(a); print(b)
print(c); print(typeof(d))
Here is the R-syntax for the usual operations on Booleans:
In [46]:
%%R
a = TRUE; b = FALSE
print(a & b) # & = Python's 'and'
print(a | b) # | = Python's 'or'
print(!a) # ! = Python's 'not'
print( a | (!b & a))
In [47]:
%%R
# Comparison operators use the same syntax as in Python
a = 2; b = 3
print(a == b)
print(a <= b)
print(a >= b)
print(a < b)
print(a > b)
As in Python, R has two types to represent numbers:
- integer to represent integers
- double to represent real numbers
In [48]:
%%R
a = 3
b = 4.5
print(typeof(a))
print(typeof(b))
Differences with Python:
- R interprets any number passed to the R interpreter as a real number (a double), even when it has no period.
- Python interprets a number as a real number only if it has a period.
To differentiate between these types, one should use the conversion operators:
as.integer(x)
as.double(x)
In [49]:
%%R
a = as.integer(2)
print(typeof(a))
b = as.double(a)
print(typeof(b))
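For comparison, here is a minimal Python sketch of the same distinction (the variable names are only illustrative). It shows that Python infers the numeric type from the way the literal is written, and that int() and float() play the role of as.integer and as.double:

```python
# Python decides between int and float from the literal itself.
a = 3      # no period: int
b = 3.0    # period: float
type_a = type(a).__name__
type_b = type(b).__name__

# Explicit conversions, the counterparts of as.double(x) and as.integer(x) in R.
c = float(a)
d = int(b)
print(type_a, type_b, type(c).__name__, type(d).__name__)
```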
In [50]:
%%R
a = 2; b = 3; c = 3.4
print(a+b)
print(a*b)
print(a^b) # The power syntax is different than in Python: a**b
print(a/b)
print(b%%a) # The modulo syntax is different from that of Python: b%a
print(c%/%b) # Integer division (Python: c//b)
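As a side-by-side reference, the Python spellings of the three operators that differ can be sketched as follows (same values as in the R cell; the result names are illustrative):

```python
a, b, c = 2, 3, 3.4

power = a ** b       # R: a^b
remainder = b % a    # R: b%%a
floor_div = c // b   # R: c%/%b

print(power, remainder, floor_div)
```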
As in Python, one creates strings using quotes (single or double):
In [51]:
%%R
a = 'hello'
b = "hello"
c = "hello! I'm good!"
print(a); print(b); print(c)
print(typeof(a))
Newlines and escape characters in general are written the same way as in Python.
Differences with Python:
The R function
print(x)
does not interpret the escape characters: it displays them as typed (for instance \n stays \n), much like Python's repr.
The R function
cat(x,y,z,etc.)
interprets the escape characters (for instance \n is output as an actual newline).
It behaves very much like the Python (version 2.7) print statement:
print x,y,z,etc.
In [52]:
%%R
x = 'One\nTwo\nThree\nFour\n'
print(x)
cat(x)
cat('I need', 3,'pairs of gloves for my', 4, 'hands')
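The same contrast can be sketched in Python, assuming Python 3 (where print is a function rather than the 2.7 statement). Here repr plays roughly the role of R's print, and print the role of cat:

```python
x = 'One\nTwo\n'

# repr leaves the escape sequences as typed, similar to R's print(x).
shown_by_repr = repr(x)
print(shown_by_repr)

# print interprets the escape sequences, similar to R's cat(x).
print(x, end='')

# Several values separated by single spaces, similar to cat('I need', 3, ...).
print('I need', 3, 'pairs of gloves for my', 4, 'hands')
```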
R does not provide the same level of syntactical conveniences as Python for string manipulations:
$\longrightarrow$ no special syntax for string formatting in R
$\longrightarrow$ no addition operator for string concatenation in R
$\longrightarrow$ no multiplication operator for string repetition in R
$\longrightarrow$ no bracket operator for substring access in R
In Python, one way to format a string is to use the placeholder syntax:
"...%d...%s...etc." % (digit, string, etc.)
In R, one uses the function
sprintf("...%d...%s...etc.", digit, string, etc.)
which returns the formatted string.
In [53]:
%%R
var = sprintf("%-10s\t%-10s\t%-10s\n", 'Name', 'Age', 'Weight')
obs1 = sprintf("%-10s\t%-10d\t%-10.2f\n", 'Benoit', 56, 300)
obs2 = sprintf("%-10s\t%-10d\t%-10.2f\n", 'Claude', 12, 400)
cat(var); cat(obs1); cat(obs2)
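For reference, a minimal Python sketch of the same table rows with the placeholder syntax (the names header and row are illustrative):

```python
# Same placeholders as in the R cell: left-aligned fields of width 10.
header = "%-10s\t%-10s\t%-10s" % ('Name', 'Age', 'Weight')
row = "%-10s\t%-10d\t%-10.2f" % ('Benoit', 56, 300)

print(header)
print(row)
```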
The R function
paste(x, y, etc., sep=s)
concatenates the strings x, y, etc., inserting the string s passed to the argument sep in between those strings.
In [54]:
%%R
x = paste('a', 'b', 'c', 'd', sep='::')
cat(x)
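In Python, the closest counterpart is the join method of the separator string. A minimal sketch with illustrative names:

```python
parts = ['a', 'b', 'c', 'd']

# R: paste('a', 'b', 'c', 'd', sep='::')
joined = '::'.join(parts)
print(joined)
```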
The R function
substr(x, start=i, stop=j)
returns the substring of x that starts at position i and ends at position j (both included).
In [55]:
%%R
x = '123456789'
y = substr(x, 2,6)
cat(y)
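To extract the same characters in Python, one would use a slice. In this small sketch, note that the indices are shifted compared to the R call:

```python
x = '123456789'

# Start at index 1 (the second character) and stop BEFORE index 6.
y = x[1:6]
print(y)
```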
Differences with Python:
(1) All ranges in R start at $1$ instead of $0$
(2) All ranges in R include the upper bound
This holds wherever ranges are involved.
Example: For the string
x = "abcde"
substr(x, 1, 3)
returns 'abc' (while x[1:3] returns 'bc' in Python).
Python is a general-purpose language.
Its basic types and basic data structures are very standard among such programming languages:
You have
- int, float, str, and bool, which represent scalar quantities (i.e. single elements): numbers, strings, and Booleans
- list, dict, and set, which represent vectorial quantities (i.e. collections) of the basic types.
In most general-purpose languages:
Basic Data Structures = structured collections of basic types
Depending on their structures, these collections can be:
$\longrightarrow$ homogeneous: Numpy arrays and Pandas Series
$\longrightarrow$ heterogeneous: Python lists, Python Dictionaries, and Pandas DataFrames
$\longrightarrow$ labelled: Python Dictionaries, Pandas Series, and Pandas DataFrames
$\longrightarrow$ vectorized: Numpy arrays, Pandas Series, and Pandas DataFrames
- R was created with STATISTICS IN MIND.
- The main object of statistics is the DATA TABLE.
- The BASIC DATA TYPES and DATA STRUCTURES in R reflect this purpose!
Actually, R doesn't really have SEPARATE basic data types and basic data structures.
R has only
2 BASIC VECTORIZED LABELLED DATA STRUCTURES
(1) VECTORS $\longrightarrow$ corresponding to data table COLUMNS (hence: HOMOGENEOUS)
(2) LISTS $\longrightarrow$ corresponding to data table ROWS (hence: HETEROGENEOUS)
... AND NO BASIC SCALAR DATA TYPES!!!!
The "basic scalar types" that we just saw are in reality ...
... VECTORS WITH ONLY ONE ELEMENT!!!!
There are 3 basic data types in R, corresponding to 3 MODES:
- numeric mode
- logical mode
- character mode
Remark:
R distinguishes between two types in numeric mode:
$\longrightarrow$ integer for integers
$\longrightarrow$ double for real numbers
In [56]:
%%R
print(typeof(3))
print(mode(3))
In [57]:
%%R
a = as.integer(3)
print(typeof(a))
print(mode(a))
R vectors $\simeq$ Pandas Series
From a data type perspective:
R $\simeq$ what we would obtain if we were allowed to program in Python only with
$\longrightarrow$ quantitative Pandas Series instead of numbers
$\longrightarrow$ logical Pandas Series instead of Booleans
$\longrightarrow$ categorical Pandas Series instead of strings
R vectors are created using the special concatenate function
c(a=x1, b=x2, c=x3, d=x4, etc.)
that returns an R vector with
- element values x1, x2, x3, x4, etc.
- labelled by the passed argument names: a, b, c, d, etc.
The element values in an R vector should all be of the same type (i.e. numbers, Booleans, or strings).
Remark: As for Pandas Series, the labels (i.e. argument names) may be omitted.
In [58]:
%%R
x = c(a=12, b=34, c=45, d=34)
y = c('Hello', 'Bonjour', 'Guten Tag')
z = c(T, F, T, F, F)
cat("x = \n");print(x); print(typeof(x)); cat('\n\n')
cat("y = \n");print(y); print(typeof(y)); cat('\n\n')
cat("z = \n");print(z); print(typeof(z)); cat('\n\n')
We may want to create an empty vector, which we will populate later on.
For that, we need to invoke the vector constructor explicitly:
x = vector(length=n, mode=m)
where
- n is the vector length
- m is the vector mode: 'numeric', 'logical', or 'character'
In [59]:
%%R
x = vector(length=3, mode='character')
cat('The mode of the vector x is', mode(x),'its length is', length(x),'\n')
x[1] = 'elephant'
x[2] = 'raccoon'
x[3] = 'monkey'
print(x)
Element indexing and element retrieval in R vectorized basic types (i.e. vectors and lists) are very similar to those of Python:
On a vector x, the BRACKET OPERATOR
x[range]
gives us access to the elements specified by the range, which can be, for instance, an integer between 1 and length(x) (retrieving the corresponding element).
Differences with Python:
- Indices always start at 1 (instead of 0)
- The slice notation n:m actually creates the integer vector $(n, n+1, \dots, m-1, m)$
In [60]:
%%R
scores = c(Mark=88, John=24, Lucie=54, Bob=100)
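a = scores['Mark'] # access by label
print(a)
b = scores[1] # access by position (the first element)
print(b)
c = scores[1:3] # slice: elements 1 to 3, both bounds included
print(c)
d = scores[-2] # negative index: every element EXCEPT the second one
print(d)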
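To make the contrast concrete, here is a minimal Python sketch of the same accesses on a plain list (labels are left out, and the variable names are illustrative). Note in particular that a negative index means something different in the two languages:

```python
scores = [88, 24, 54, 100]

first = scores[0]            # R: scores[1]
first_three = scores[0:3]    # R: scores[1:3] (Python excludes the upper bound)
second_to_last = scores[-2]  # Python counts from the end

# Python spelling of R's scores[-2], which drops the second element.
without_second = scores[:1] + scores[2:]

print(first, first_three, second_to_last, without_second)
```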
As in Python, we have for
loops.
The main difference is that
code blocks are indicated by curly brackets instead of special indentation
There are also other minor syntactical differences, as you will see below.
The fact that
x = n:m
creates an integer vector x
on which a for loop can iterate is very practical.
There is also the function
seq(from=a, to=b, by=c)
that creates regularly spaced numeric vectors, very useful to loop over, and a function
rep(x, n)
that returns the vector x
repeated n times.
In [61]:
%%R
DNA = rep(c('A','C','T'),4)
RANGE = 1:10
SEQ = seq(0,100,20)
print(DNA)
print(RANGE)
print(SEQ)
In [62]:
%%R
for(x in 1:10){
print(x^2)
}
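For readers coming from Python, the three constructions above correspond roughly to the following sketch (built on plain lists; the upper bounds are shifted because Python ranges exclude them):

```python
RANGE = list(range(1, 11))      # R: 1:10
SEQ = list(range(0, 101, 20))   # R: seq(0, 100, 20)
DNA = ['A', 'C', 'T'] * 4       # R: rep(c('A','C','T'), 4)

# Counterpart of the R loop printing x^2 for x in 1:10.
squares = [x ** 2 for x in RANGE]
print(squares)
```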
Since basic types are vectors, one can loop on any 'numeric', 'logical', or 'character' types, even with a single element:
In [63]:
%%R
# Try the loop with different values of a by uncommenting one of the lines below
a = 3
#a = c(3, 4, 5)
#a = 'Hello'
#a = c('Hello', 'Bonjour', 'Guten Tag')
for( x in a) print(x) # We don't need curly braces with just one command
One can also retrieve the vector element names using the function:
names(x)
Then we can iterate over these names.
In [64]:
%%R
scores = c(Mark=88, John=24, Lucie=54, Bob=100)
for (student in names(scores)) cat(student, 'got', scores[student], '\n')
When possible
Loops should be implemented the vectorized way!!!
since
All the operations for basic types are vectorized!!!
This works exactly the same way as for Numpy arrays:
In [65]:
%%R
## VECTORIZED OPERATION ON NUMERIC TYPES
x = c(1,3,2,4)
y = c(4,1,4,2)
print(x+y)
print(x*y)
print(x/y)
print(x^y)
print(x%%y)
print(x%/%y)
In [66]:
%%R
# OTHER BASIC MATH and STAT OPERATIONS ON THE NUMERICAL TYPE
x = c(2, 1, 54, 21, 56, 7, 1, 4)
mean(x)
median(x)
sd(x)
quantile(x,0.2)
sum(x)
prod(x)
cumsum(x)
cumprod(x)
sqrt(x)
cos(x)
sin(x)
For instance, to normalize a sequence of numbers:
In [67]:
%%R
numbers = c(2, 1, 54, 21, 56, 7, 1, 4)
one could use a for loop as follows:
In [68]:
%%R
m = mean(numbers)
s = sd(numbers)
normalized = vector(length=length(numbers), mode=mode(numbers))
for (i in 1:length(numbers)){
normalized[i] = (numbers[i] - m)/s
}
print(normalized)
The following vectorized version is much preferred:
In [69]:
%%R
normalized = (numbers-mean(numbers))/sd(numbers)
print(normalized)
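The same normalization can be sketched in pure Python with the standard statistics module, whose stdev is the sample standard deviation (dividing by n - 1), as R's sd is. Without Numpy, the element-wise operation is written as a list comprehension:

```python
import statistics

numbers = [2, 1, 54, 21, 56, 7, 1, 4]

m = statistics.mean(numbers)
s = statistics.stdev(numbers)  # sample standard deviation (n - 1 denominator)

# Element-wise (x - mean) / sd, the counterpart of the vectorized R expression.
normalized = [(v - m) / s for v in numbers]
print(normalized)
```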
The if/else statements function exactly as in Python, except for
- the curly braces that define the code blocks
- the round parentheses surrounding the Boolean condition
In [70]:
%%R
condition = F
if(condition){
print('If the boolean variable "condition" is True, this statement is executed.')
} else {
print('Otherwise, this statement here is executed')
}
In [71]:
%%R
# The else part may be omitted in case there is nothing to do when "condition" is False
condition = T
if(condition){
print('Great! Condition was True')
}
In [72]:
%%R
# try with number = 0, 1, 2, 3, 4
# the block of code corresponding to the first matching condition is executed;
# the remaining conditions are then skipped
number = 0.5
if (number < 1){
cat('number is smaller than', 1)
} else if (number < 2){
cat('number is smaller than', 2)
} else if (number < 3){
cat('number is smaller than', 3)
} else{
print('number is big!')
}
When possible:
Branching should be implemented the vectorized way!!!
since:
All the operations for basic types are vectorized!!!
This works exactly the same way as for Numpy arrays:
In [73]:
%%R
a = c(T, F, F, T, T, F)
b = c(F, T, T, F, F, T)
print(a & b) # & = Python's 'and'
print(a | b) # | = Python's 'or'
print(!a) # ! = Python's 'not'
print( a | (!b & a))
In [74]:
%%R
a = c(1, 2, 3, 4, 5)
b = c(9, 8, 7, 6, 5)
print(a == b)
print(a <= b)
print(a >= b)
print(a < b)
print(a > b)
As for Numpy arrays, one can retrieve elements from an R vector by logical indexing:
In [75]:
%%R
dat = c(1, 2, 3, 4, 5, 6)
ind = c(T, F, T, T, F, F)
print(dat[ind])
In [76]:
%%R
dat = c(1, 2, 3, 4, 5, 6)
ind = dat < 4
print(ind)
Putting everything together:
In [77]:
%%R
filtered_data = dat[dat < 4]
print(filtered_data)
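Without Numpy, logical indexing can be approximated in plain Python as in the sketch below. It assumes itertools.compress for selecting with an explicit mask and a list comprehension for filtering on a condition:

```python
from itertools import compress

dat = [1, 2, 3, 4, 5, 6]
ind = [True, False, True, True, False, False]

selected = list(compress(dat, ind))        # R: dat[ind]
mask = [v < 4 for v in dat]                # R: dat < 4
filtered_data = [v for v in dat if v < 4]  # R: dat[dat < 4]

print(selected, mask, filtered_data)
```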
Problem: extracting the outliers from a sequence of data points using
(1) conventional for loop and if statement
(2) vectorized logical indexing
In [78]:
%%R
# DATA POINTS
x = c(1,2, 89, 50, 44, 53, 60, 45, 62, 53, 37, 48, 70, 100, 55)
# FIRST AND THIRD QUARTILES
Q1 = quantile(x, 0.25)
Q3 = quantile(x, 0.75)
# OUTLIER LOWER AND UPPER CUTOFFS
L = Q1 - 1.5*(Q3 - Q1)
U = Q3 + 1.5*(Q3 - Q1)
(1) using conventional if and for
In [79]:
%%R
upper_outliers = c()
lower_outliers = c()
for(a in x){
if (a > U) upper_outliers = c(upper_outliers, a)
if (a < L) lower_outliers = c(lower_outliers, a)
}
cat('Upper Outliers:', upper_outliers, '\n')
cat('Lower Outliers:', lower_outliers, '\n')
(2) vectorized version
In [80]:
%%R
upper_outliers = x[x > U]
lower_outliers = x[x < L]
cat('Upper Outliers:', upper_outliers, '\n')
cat('Lower Outliers:', lower_outliers, '\n')
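As a cross-check, the same outlier rule can be sketched in Python with the standard statistics module. This assumes that method='inclusive' of statistics.quantiles uses the same interpolation as the default of R's quantile; the names lower_cut and upper_cut are illustrative:

```python
import statistics

x = [1, 2, 89, 50, 44, 53, 60, 45, 62, 53, 37, 48, 70, 100, 55]

# First and third quartiles (the middle value, the median, is not needed).
q1, _, q3 = statistics.quantiles(x, n=4, method='inclusive')

# Outlier lower and upper cutoffs.
iqr = q3 - q1
lower_cut = q1 - 1.5 * iqr
upper_cut = q3 + 1.5 * iqr

upper_outliers = [v for v in x if v > upper_cut]
lower_outliers = [v for v in x if v < lower_cut]
print('Upper Outliers:', upper_outliers)
print('Lower Outliers:', lower_outliers)
```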
Let x be a logical vector. The function
- any(x) returns TRUE if AT LEAST ONE of the elements in x is TRUE, and FALSE otherwise.
- all(x) returns TRUE if ALL the elements in x are TRUE, and FALSE otherwise.
In [81]:
%%R
x = c(T, F, T)
print(any(x))
print(all(x))
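Python has built-in functions with the same names and the same meaning. A minimal sketch on a list of Booleans:

```python
x = [True, False, True]

any_true = any(x)  # True: at least one element is True
all_true = all(x)  # False: not every element is True
print(any_true, all_true)
```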
The function
z = ifelse(cond, x, y)
builds the vector z from the logical vector cond and the vectors x and y as follows:
z[i] = x[i] if cond[i] == TRUE
z[i] = y[i] if cond[i] == FALSE
Suppose we have two series of observations, and we want to keep the highest value for each observation.
We can do that with a classical for/if statement, or use vectorization with ifelse:
In [82]:
%%R
obs1 = c(12, 34, 55, 21, 54, 22, 78 ,65, 34)
obs2 = c(24, 14, 85, 12, 99, 10, 1 ,9, 100)
In [83]:
%%R
# CLASSICAL FOR/IF
max_obs = c()
for (i in 1:length(obs1)){
if (obs1[i] >= obs2[i]) max_obs = c(max_obs, obs1[i])
else max_obs = c(max_obs, obs2[i])
}
print(max_obs)
In [84]:
%%R
# VECTORIZED VERSION
max_obs = ifelse(obs1 > obs2, obs1, obs2)
cat('\n','obs1:', obs1,'\n','obs2: ', obs2,'\n','maxo: ', max_obs)
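In plain Python, the element-wise choice made by ifelse can be sketched with a conditional expression inside a list comprehension (same data as above):

```python
obs1 = [12, 34, 55, 21, 54, 22, 78, 65, 34]
obs2 = [24, 14, 85, 12, 99, 10, 1, 9, 100]

# For each pair, keep a if the condition holds and b otherwise,
# which is what ifelse(obs1 >= obs2, obs1, obs2) does in R.
max_obs = [a if a >= b else b for a, b in zip(obs1, obs2)]
print(max_obs)
```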
The R way of defining functions closely resembles Python's inline function definitions:
In [85]:
f = lambda x, y : x + 2*y
f(3, 2)
Out[85]:
7
The Python keyword lambda creates a function whose
- input variables are the variables defined before the colon
- output is the evaluation of the statement after the colon
The function is then stored in the variable (here: f), which becomes the function name.
In R,
- lambda is replaced by the keyword function, followed by the input variables in parentheses
- the output is specified with return
Difference with Python:
If the keyword return is omitted in an R function, the function output coincides with the value of the last statement in the function body.
In R, returning nothing corresponds to returning the object NULL (corresponding to the object None in Python).
Other than that, it's very much the same business:
In [86]:
%%R
print_and_return_nothing = function(string='Hello!'){
print(string)
return(NULL)
}
a = print_and_return_nothing()
print(a)
In [87]:
%%R
# Same function as in the previous cell, written without return:
# the value of the last statement (here NULL, R's counterpart of None) is returned
print_and_return_nothing = function(string='Hello!'){
print(string)
NULL
}
a = print_and_return_nothing()
print(a)
In [88]:
%%R
dont_print_but_return_something = function(string='Hello!') string
xxx = dont_print_but_return_something()
print(xxx)
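For comparison, the Python side of these definitions can be sketched as follows (the name g is illustrative). Unlike R, a Python function without a return statement returns None rather than the value of its last statement:

```python
# Inline definition, as in the lambda example above.
f = lambda x, y: x + 2 * y

# Equivalent regular definition with an explicit return.
def g(x, y):
    return x + 2 * y

# No return statement: Python returns None.
def print_and_return_nothing(string='Hello!'):
    print(string)

result = print_and_return_nothing()
print(f(3, 2), g(3, 2), result)
```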