In [1]:
# Namaskara
In [2]:
4L + 2L
Out[2]:
In [3]:
"Hello World"
Out[3]:
In [4]:
4.2
Out[4]:
In [5]:
TRUE
Out[5]:
In [6]:
FALSE
Out[6]:
In [7]:
# Additions are done with `+`
40 + 2
Out[7]:
In [8]:
# Subtractions with `-`
44 - 2
Out[8]:
In [9]:
# Multiplication with `*`
21 * 2
Out[9]:
In [10]:
# Divisions with `/`
84 / 2
Out[10]:
In [11]:
# Exponentiation with `^`
7 ^ 8
Out[11]:
In [13]:
# Modulo with `%%`
71 %% 5
Out[13]:
In [14]:
TRUE & FALSE
Out[14]:
In [15]:
TRUE & TRUE
Out[15]:
In [16]:
TRUE | FALSE
Out[16]:
In [17]:
FALSE | FALSE
Out[17]:
In [18]:
answer_to_life <- 42
In [19]:
answer_to_life
Out[19]:
All arithmetic operations are supported on variables.
In [20]:
foo <- 21
bar <- 21
answer_to_life <- foo + bar
answer_to_life
Out[20]:
In [21]:
class(42L)
class(42.0)
class("Forty Two")
Out[21]:
Out[21]:
Out[21]:
In [22]:
# Answer here
class(TRUE)
Out[22]:
In [23]:
# Let's start off by checking where we are
getwd()
Out[23]:
In [24]:
# Let's then move to the _cars_ directory next to this directory
# Paths can be relative
setwd("../cars")
In [25]:
getwd()
Out[25]:
In [26]:
# Or paths can be absolute
setwd("/Users/amitkaps/Dropbox/github/intro-R-data-science/intro")
In [27]:
getwd()
Out[27]:
In [28]:
# Let us try an invalid path
setwd("C:/Users/Shrayasr/personal/code/intro-R-data-science/introoooooooo")
In [29]:
getwd()
Out[29]:
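What happens with an invalid path can be checked without relying on any particular machine's folders. The following is a minimal sketch, assuming a made-up directory name under `tempdir()` that does not exist: `setwd` signals an error, and the working directory stays where it was.

```r
# Remember where we started so we can compare afterwards
start_dir <- getwd()

# A path assumed not to exist (hypothetical name, for illustration only)
bad_path <- file.path(tempdir(), "no-such-directory-for-this-sketch")

# setwd() signals an error for a missing directory; tryCatch() captures it
result <- tryCatch(
  {
    setwd(bad_path)
    "changed"
  },
  error = function(e) "failed"
)

result
# The working directory is the same as before the failed call
identical(getwd(), start_dir)
```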
R is all about getting work done. It gives you nifty methods to quickly go out and pick data up so that you're up and running within R.
In [30]:
# Let's read in a bunch of cars
read.csv("small_cars.csv")
Out[30]:
As simple as that.
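If the file is not at hand, the same call can be tried on inline text. This is only a sketch: the column names and values below are invented for illustration and are not the contents of `small_cars.csv`.

```r
# Hypothetical stand-in for a small CSV file (columns are made up for this sketch)
csv_text <- "model,mpg,cyl
Mazda RX4,21.0,6
Datsun 710,22.8,4
Valiant,18.1,6"

# The text argument lets read.csv parse a string instead of a file
cars <- read.csv(text = csv_text)

class(cars)   # read.csv returns a data frame
nrow(cars)    # one row per line after the header
names(cars)   # column names come from the header line
```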
In [32]:
# Answer here
setwd("/Users/amitkaps")
read.csv("small_cars.csv")
Out[32]:
In [33]:
# Let us say we want to express the amount of Kilometers
# that we have run in the past 5 days.
# We can use a vector for this.
kms_run <- c(4.0, 5.2, 6.0, 5.2, 5.0)
kms_run
Out[33]:
In [34]:
# We can also use a vector to track on which days of the week we ran
did_run <- c(TRUE, TRUE, FALSE, TRUE, FALSE)
did_run
Out[34]:
In [35]:
kms_last_week <- c(4.0, 5.2, 6.0, 5.2, 5.0)
This week, we ran:
Day of week | Kilometers |
---|---|
Monday | 6 |
Tuesday | 6.2 |
Wednesday | 6 |
Thursday | 7.2 |
Friday | 7.5 |
Populate this in a vector kms_this_week
In [36]:
# Answer here
kms_this_week <- c(6.0, 6.2, 6.0, 7.2, 7.5)
When we're looking at a vector, it makes more sense if we can somehow name all the values, right?
Just looking at `kms_last_week` can become confusing. Let us use the `names` function to give each element the day of the week.
In [37]:
names(kms_last_week) <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
In [38]:
# Now, the data can stand independently and is much clearer
kms_last_week
Out[38]:
Note: The names we assign are themselves a vector. So instead of typing them out each time, we can store that vector in a variable and reuse it.
In [39]:
days_of_week <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
names(kms_last_week) <- days_of_week
kms_last_week
Out[39]:
Arithmetic can be performed on vectors. Let us calculate the total number of kilometers that we ran on each day over the past 2 weeks.
In [44]:
kms_last_week <- c(4.0, 5.2, 6.0, 5.2, 5.0)
kms_this_week <- c(6.0, 6.2, 6.0, 7.2, 7.5)
total_kms_past_2_weeks <- kms_last_week + kms_this_week
total_kms_past_2_weeks
Out[44]:
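Addition is not the only operation that works element by element. As a small sketch using the same two vectors, subtraction gives the day-by-day difference, and multiplying by a single number applies it to every element. The kilometre-to-mile factor is an assumption introduced here for illustration.

```r
kms_last_week <- c(4.0, 5.2, 6.0, 5.2, 5.0)
kms_this_week <- c(6.0, 6.2, 6.0, 7.2, 7.5)

# Element-wise difference: how much further we ran on each day this week
improvement <- kms_this_week - kms_last_week
improvement

# A single number is applied to every element
# (0.621371 miles per kilometre is the conversion factor assumed here)
miles_this_week <- kms_this_week * 0.621371
miles_this_week
```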
But how many kilometers did we run in total in each week? Sum each of the vectors using the `sum` function. Simple, no?
In [41]:
distance_last_week <- sum(kms_last_week)
distance_this_week <- sum(kms_this_week)
distance_last_week
distance_this_week
Out[41]:
Out[41]:
In [45]:
# Answer here
names(total_kms_past_2_weeks) <- days_of_week
total_kms_past_2_weeks
Use the `total_kms_past_2_weeks` vector to arrive at your answer.
In [46]:
# Answer here
sum(total_kms_past_2_weeks)
Out[46]:
Consider the `total_kms_past_2_weeks` vector. Let us say that we want to get the distance we ran across both weeks on Wednesday. We know that Wednesday is the 3rd day of the week, so we pick up the 3rd element from the vector like so:
In [47]:
total_kms_past_2_weeks[3]
Out[47]:
Note: A very important thing to note here is that R begins its indexing from 1 and not 0, unlike most other programming languages.
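A quick way to see the 1-based indexing in action is to try a few index values on a small vector. This sketch uses a throwaway vector named `kms` that is not part of the session above.

```r
kms <- c(4.0, 5.2, 6.0, 5.2, 5.0)

kms[1]             # the first element
kms[length(kms)]   # the last element
kms[0]             # index 0 selects nothing: an empty numeric vector
kms[-1]            # a negative index drops that element instead of selecting it
```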
What if we're interested in a section of results, say our performance as the week comes to an end (Wednesday, Thursday, Friday)?
We can provide a vector of the required indices like so:
In [48]:
total_kms_past_2_weeks[c(3,4,5)]
Out[48]:
But say we have 100 elements in the vector. It would soon become tedious if we want to select a range, say from 50-72 or from 44-62, right? To solve this problem, R provides us with the range operator, `:`, which takes a starting number and an ending number and returns a vector containing all the numbers between them, inclusive. We can then use this to fetch the required elements.
In [49]:
# Let us look at just the range operator
1:5
Out[49]:
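To connect this back to the 100-element case, here is a short sketch. The `readings` vector is hypothetical and exists only to give us something long enough to slice.

```r
# A hypothetical vector of 100 readings, just to have something long to slice
readings <- seq(from = 10, to = 1000, by = 10)

# The range operator builds the index vector for us
selected <- readings[50:72]

length(selected)   # 72 - 50 + 1 elements
selected[1]        # same value as readings[50]
```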
In [53]:
# Answer here
# Index 0 selects nothing, so 0:3 returns the same three elements as 1:3
total_kms_past_2_weeks[0:3]
Out[53]:
In [52]:
0:3
Out[52]:
Also, since we've given names to the vector elements, we can use those names to select elements instead of using indexes.
In [54]:
names(total_kms_past_2_weeks) <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
total_kms_past_2_weeks["Wednesday"]
total_kms_past_2_weeks[c("Monday", "Tuesday")]
Out[54]:
Out[54]:
We can also perform logical operations on vectors. Let us check on which days in the last week we ran more than 5 kilometers.
In [55]:
kms_last_week <- c(4.0, 5.2, 6.0, 5.2, 5.0)
names(kms_last_week) <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
days_more_than_5 <- kms_last_week > 5
days_more_than_5
Out[55]:
We can use logical operations in combination with the vector to select only those elements that match a condition.
Now that `days_more_than_5` contains a logical vector marking the days where we ran more than 5 kilometers, let us select just those items into another vector.
In [56]:
kms_last_week[days_more_than_5]
Out[56]:
In [59]:
kms_last_week[kms_last_week < 5]
Out[59]:
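The logical vector can also answer the "how many days" question directly. A minimal sketch, rebuilding the same named vector so it runs on its own (the name `over_5` is introduced here):

```r
kms_last_week <- c(4.0, 5.2, 6.0, 5.2, 5.0)
names(kms_last_week) <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")

over_5 <- kms_last_week > 5

sum(over_5)                    # TRUE counts as 1, so this is the number of days over 5 km
which(over_5)                  # positions (with names) of the matching days
names(kms_last_week)[over_5]   # just the day names
```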
Take a vector that records, for each of the last 10 days, whether we did an interval run (INT), a distance run (DIST), or rested (REST).
In [60]:
running_style <- c("INT", "INT", "DIST", "DIST", "DIST", "REST", "INT", "DIST", "DIST", "DIST")
names(running_style) <- 1:10
running_style
Out[60]:
As you see, we can divide our runs into categories. Factors are used to represent these categories. Let us use the `factor` function to create a factor variable out of this vector.
In [61]:
running_style_f <- factor(running_style)
Once we have this, we can use the `levels` function to extract the different levels that R interprets for us.
In [62]:
levels(running_style_f)
Out[62]:
Perfect, this tells us that we have 3 `level`s, i.e. we indeed have 3 running styles.
We can confirm that `running_style_f` is indeed a factor variable by checking its class with the `class` function.
In [63]:
class(running_style_f)
Out[63]:
Once we have our `level`s, we can modify them to suit us with the `levels` function (very similar to the `names` function).
In [64]:
levels(running_style_f)
levels(running_style_f) <- c("Endurance", "Speed", "Rest")
levels(running_style_f)
Out[64]:
Out[64]:
This also gives us access to a new function, `summary`, which gives us a summary of the data.
In [65]:
summary(running_style_f)
Out[65]:
This quickly tells us that out of the 10 days we ran, on 6 we did distance runs, 3 were interval runs and we took 1 day of rest.
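The counts line up with the renamed levels because new level names are matched to the existing levels by position, and by default `factor` sorts the levels alphabetically. The sketch below rebuilds the factor from scratch to show the default level order and the integer codes stored behind each value.

```r
running_style <- c("INT", "INT", "DIST", "DIST", "DIST", "REST", "INT", "DIST", "DIST", "DIST")
running_style_f <- factor(running_style)

levels(running_style_f)       # levels are sorted alphabetically by default
as.integer(running_style_f)   # each value is stored as the position of its level
table(running_style_f)        # counts per level, like summary() on a factor
```

This is why `c("Endurance", "Speed", "Rest")` had to be given in that order: it lines up with `DIST`, `INT`, `REST`.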
In [66]:
running_style_f <- factor(running_style)
running_style_f[1] > running_style_f[2]
Out[66]:
As you see, it yields `NA` along with a "`>` not meaningful for factors" warning.
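In current versions of R, comparing two values of an unordered factor gives a warning and evaluates to `NA`. A small sketch that captures the message rather than letting it print (the three-element factor here is a shortened stand-in for the one above):

```r
running_style_f <- factor(c("INT", "INT", "DIST"))

# Capture the warning instead of letting it print, so we can inspect its message
msg <- tryCatch(
  running_style_f[1] > running_style_f[2],
  warning = function(w) conditionMessage(w)
)
msg

# With the warning suppressed, the comparison itself evaluates to NA
suppressWarnings(running_style_f[1] > running_style_f[2])
```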
In [ ]:
kms_run <- c(4.0, 5.2, 6.0, 5.2, 5.0, 6.0, 6.2, 6.0, 7.2, 7.5)
In [68]:
# Let's classify this into Long, Medium and Short runs manually
distance_type <- c("S", "M", "M", "M", "M", "M", "M", "M", "L", "L")
Now we can create a factor from this, but we understand an order here: Short < Medium < Long. To introduce an order, we need to pass `ordered = TRUE` along with the `levels` in the order we require.
In [69]:
distance_type_f <- factor(distance_type, ordered = TRUE, levels = c("S", "M", "L"))
distance_type_f
Out[69]:
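The manual labelling above could also be derived from the distances. This is a sketch of a different technique, base R's `cut`, and not what the notebook itself does. The break points of 4.5 km and 7 km are assumptions picked so that the result mirrors the manual labels.

```r
kms_run <- c(4.0, 5.2, 6.0, 5.2, 5.0, 6.0, 6.2, 6.0, 7.2, 7.5)

# Break points (4.5 km and 7 km) are assumptions chosen to mirror the manual labels
distance_type_cut <- cut(
  kms_run,
  breaks = c(-Inf, 4.5, 7, Inf),
  labels = c("S", "M", "L"),
  ordered_result = TRUE
)
distance_type_cut
```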
Now that we have an order in place, we can use `<` and `>`.
In [71]:
distance_type_f[1]
distance_type_f[2]
distance_type_f[1] > distance_type_f[2]
Out[71]:
Out[71]:
Out[71]:
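Comparisons on an ordered factor also work across the whole vector, which makes them useful for filtering and counting. A brief sketch, rebuilding the factor so it runs on its own (the name `at_least_medium` is introduced here):

```r
distance_type <- c("S", "M", "M", "M", "M", "M", "M", "M", "L", "L")
distance_type_f <- factor(distance_type, ordered = TRUE, levels = c("S", "M", "L"))

# Compare every element against a level name
at_least_medium <- distance_type_f >= "M"
sum(at_least_medium)   # number of runs that were Medium or Long

# min() and max() also respect the level order
max(distance_type_f)
```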
The real reason why factors are important will be covered in forthcoming sessions. This just introduces the concept and the necessity for it.
The Data Frame is R's most iconic type. Soon, you'll find out that a Data Frame is great for expressing all kinds of data.
Think of a Data Frame as a 2 dimensional structure having rows and columns. Each column may be of a different type, and each row can be thought of as representing an observation.
To quickly get started with data frames, let us use an inbuilt data frame in R that contains some data on cars. From the help:
The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).
This data is stored in `mtcars`. Let us look at it.
In [73]:
?mtcars
Out[73]:
In [74]:
# Answer here
class(mtcars)
Out[74]:
As you can see, it contains data about cars, each row represents one particular car and its associated details.
One of the most important things when working with Data Frames, and with Data Science in general, is to spend time understanding the structure of the data. The structure of the data, however, is independent of the data itself. It is enough to get a glimpse of the data to get started.
For this purpose, R exposes 2 functions, `head` and `tail`, that allow us to peek at the start / end of the data frame.
In [75]:
# head
head(mtcars)
Out[75]:
In [76]:
# tail
tail(mtcars)
Out[76]:
In [77]:
# Another way to get a quick glimpse of the data is to use the `str` function.
str(mtcars)
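The `str` function, as you can see, shows us some nice details. It tells us the number of observations (32) and the number of variables (11), and lists each column with its type and first few values.
Another quick way to find out just the number of rows and columns is to use the `nrow` and `ncol` functions.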
In [78]:
# total number of rows
nrow(mtcars)
Out[78]:
In [79]:
# total number of columns
ncol(mtcars)
Out[79]:
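Both numbers are also available in one call. As a small sketch on the built-in `mtcars` data frame:

```r
dim(mtcars)     # rows first, then columns
names(mtcars)   # the column names
```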
In [80]:
distance <- c(4.0, 5.2, 6.0, 5.2, 5.0, 6.0, 6.2, 6.0, 7.2, 7.5)
time_taken <- c(20.5, 28.0, 40.2, 24.1, 26.0, 42.0, 43.2, 40.1, 50.2, 50.7)
run_type <- c("S", "S", "E", "S", "S", "E", "E", "E", "E", "E") # S is speed; E is endurance
workout_after <- c(TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE)
Now that we have 4 vectors, we can create a data frame from them using the `data.frame` function.
In [84]:
running_df <- data.frame(distance, time_taken, run_type, workout_after,
stringsAsFactors = FALSE)
In [85]:
# Let's print this to see what we get
running_df
Out[85]:
Quite similar to the data frame we saw earlier with cars. Let's work with this!
In [86]:
# Answer here
head(running_df)
tail(running_df)
str(running_df)
Out[86]:
Out[86]:
In [87]:
?data.frame
Out[87]:
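Note: When you run `str` on the data frame, notice that the `run_type` column is of type `chr` (character), because we passed `stringsAsFactors = FALSE`. Without that argument, versions of R before 4.0.0 would automatically interpret it as a `Factor` type.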
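The effect of `stringsAsFactors` can be seen by building the same data frame both ways. This sketch uses shortened copies of the vectors, and the names `as_text` and `as_factors` are introduced here for illustration.

```r
run_type <- c("S", "S", "E")
distance <- c(4.0, 5.2, 6.0)

# Same columns, differing only in how strings are treated
as_text    <- data.frame(distance, run_type, stringsAsFactors = FALSE)
as_factors <- data.frame(distance, run_type, stringsAsFactors = TRUE)

class(as_text$run_type)
class(as_factors$run_type)
```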
Rows and columns can be selected from the data frame by methods similar to those used for vectors, i.e. using `[` and `]`.
Within `[` and `]` there are 2 parts: the row part and the column part, separated by a comma (`,`).
In [89]:
# Let us pick up the value in the 4th column, 2nd row
running_df[2, 4]
running_df
Out[89]:
Out[89]:
In [90]:
# Answer here
running_df[5:9, 1:2]
Out[90]:
R also makes it possible to omit one of the 2 parts inside `[` and `]`. The separator is mandatory though. So now:
In [91]:
# Answer here
running_df[,1:2]
Out[91]:
In [92]:
# Answer here
running_df[5:9,]
Out[92]:
You can also use the name of the column to select instead of specifying the numbers
In [93]:
running_df[5:9, "distance"]
Out[93]:
There are times where we want to operate only on one column. We have, as of now, understood that there are 2 ways to do this:
In [94]:
# Using the index of the distance column
running_df[,1]
Out[94]:
In [95]:
# Using the column name
running_df[, "distance"]
Out[95]:
There's also a 3rd way, which you'll see used extensively throughout R, and that uses the `$` operator.
In [96]:
running_df$distance
Out[96]:
Note: When you work on an individual column selected this way, the data structure is a vector and not a `data.frame`.
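The difference is easy to check with `class`. The sketch below uses a cut-down version of the running data and also shows two ways of keeping a single column as a data frame, in case that is what you need.

```r
running_df <- data.frame(
  distance   = c(4.0, 5.2, 6.0),
  time_taken = c(20.5, 28.0, 40.2)
)

class(running_df$distance)                      # a plain numeric vector
class(running_df[, "distance"])                 # also a vector
class(running_df[, "distance", drop = FALSE])   # still a data frame with one column
class(running_df["distance"])                   # single bracket without a comma keeps the data frame
```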
In [97]:
running_df[c(TRUE,TRUE,TRUE,TRUE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE),]
Out[97]:
But this is tedious, so R gives us a `subset` function to do the same thing in a more readable fashion.
In [99]:
# Let's pick out the distances for all the days where we did endurance runs.
(subset(running_df, subset = (run_type == "E")))$distance
Out[99]:
In [100]:
# Answer here
subset(running_df, subset = (run_type == "E" & workout_after == TRUE))
Out[100]:
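For comparison, the same rows can be picked with a logical vector in the row position of `[` and `]`. A sketch on a four-row stand-in for the running data (the values are a shortened, illustrative copy):

```r
running_df <- data.frame(
  distance      = c(4.0, 5.2, 6.0, 6.2),
  run_type      = c("S", "S", "E", "E"),
  workout_after = c(TRUE, FALSE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# subset() evaluates the condition inside the data frame
with_subset <- subset(running_df, run_type == "E" & workout_after)

# The same rows using a logical vector in the row position of [ , ]
with_index <- running_df[running_df$run_type == "E" & running_df$workout_after, ]

all.equal(with_subset, with_index)
```

With `subset` the column names can be written bare, which is what makes it read more cleanly than repeating `running_df$` in every condition.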
Ordering helps us to understand our data better and helps with comparison.
The `order` function helps us to do that in R. It is quite smart as well. Consider a vector:
In [ ]:
some_alphabets <- c("h", "a", "q", "z", "n", "r")
some_alphabets
In [ ]:
# Let's call order on them and see what happens
order(some_alphabets)
It gives us a vector of positions: the indices that would put the original vector in sorted order. Now let us index the original vector with it.
In [ ]:
o <- order(some_alphabets)
some_alphabets[o]
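To make the relationship between positions and values explicit, this sketch puts `order` next to `sort`, which returns the sorted values directly.

```r
some_alphabets <- c("h", "a", "q", "z", "n", "r")

o <- order(some_alphabets)
o                      # positions in the original vector, smallest value first
some_alphabets[o]      # indexing with those positions gives the sorted values
sort(some_alphabets)   # sort() returns the sorted values directly
```

The positions are what make `order` useful beyond a single vector: the same index vector can be reused to rearrange something else in the same way.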
We can also sort in the opposite order by passing the `decreasing=TRUE` argument to `order`.
In [ ]:
# Answer here
We can also use `order` on a column in a data frame. Order the `distance` column within our `running_df` data frame.
In [ ]:
# Answer here
From the `running_df` data frame, create a new data frame (`running_df_ordered`) that is ordered by the `distance` column.
In [ ]:
# Answer here