Introduction to R

R is an interpreted programming language used mostly for scientific and statistical computing. It is popular among statisticians and data miners.

Basics

Comments


In [1]:
# Namaskara

Integers


In [2]:
4L + 2L


Out[2]:
6

Strings


In [3]:
"Hello World"


Out[3]:
'Hello World'

Floats


In [4]:
4.2


Out[4]:
4.2

Logical


In [5]:
TRUE


Out[5]:
TRUE

In [6]:
FALSE


Out[6]:
FALSE

Arithmetic Operations


In [7]:
# Additions are done with `+`
40 + 2


Out[7]:
42

In [8]:
# Subtractions with `-`
44 - 2


Out[8]:
42

In [9]:
# Multiplication with `*`
21 * 2


Out[9]:
42

In [10]:
# Divisions with `/`
84 / 2


Out[10]:
42

In [11]:
# Exponentiation with `^`
7 ^ 8


Out[11]:
5764801

In [13]:
# Modulo with `%%`
71 %% 5


Out[13]:
1
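
Modulo pairs naturally with integer division. The short sketch below is an addition to the cells above; it uses R's `%/%` operator, which this section does not otherwise cover, and the variable names are just for illustration.

```r
# Integer division (%/%) gives the whole-number quotient,
# modulo (%%) gives the remainder
quotient  <- 71 %/% 5
remainder <- 71 %% 5

# The two results recombine into the original number
recombined <- quotient * 5 + remainder
recombined
```

Together the two operators answer "how many times does 5 fit into 71, and what is left over?".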

Logical Operations


In [14]:
TRUE & FALSE


Out[14]:
FALSE

In [15]:
TRUE & TRUE


Out[15]:
TRUE

In [16]:
TRUE | FALSE


Out[16]:
TRUE

In [17]:
FALSE | FALSE


Out[17]:
FALSE

Variables & Assignment

Use the <- operator to assign values to variables.


In [18]:
answer_to_life <- 42

In [19]:
answer_to_life


Out[19]:
42

All arithmetic operations are supported on variables.


In [20]:
foo <- 21
bar <- 21

answer_to_life <- foo + bar
answer_to_life


Out[20]:
42

Types

The class function can be used to identify the underlying type of a variable or literal


In [21]:
class(42L)
class(42.0)
class("Forty Two")


Out[21]:
'integer'
Out[21]:
'numeric'
Out[21]:
'character'

[ Exercise ]

What is the type of TRUE / FALSE ?


In [22]:
# Answer here
class(TRUE)


Out[22]:
'logical'
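
Besides asking for the class, we can test for a specific type and convert between types. The following is a small sketch using base R's `is.*` and `as.*` functions; the values are made up for illustration.

```r
# class() reports the type; is.*() functions test for a specific one
is.numeric(42.0)
is.character("Forty Two")

# as.*() functions convert a value from one type to another
as.integer("42")
as.character(42L)
as.numeric(TRUE)
```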

Working with the environment

It is important to understand how to bring things in from outside the R environment: files, URLs, filesystem navigation and so on.

Working Directory

The directory you are currently working in is called the working directory.


In [23]:
# Let's start off by checking where we are
getwd()


Out[23]:
'/Users/amitkaps/Dropbox/github/intro-R-data-science/intro'

In [24]:
# Let's then move to the _cars_ directory, which sits next to this directory

# Paths can be relative
setwd("../cars")

In [25]:
getwd()


Out[25]:
'/Users/amitkaps/Dropbox/github/intro-R-data-science/cars'

In [26]:
# Or paths can be absolute
setwd("/Users/amitkaps/Dropbox/github/intro-R-data-science/intro")

In [27]:
getwd()


Out[27]:
'/Users/amitkaps/Dropbox/github/intro-R-data-science/intro'

In [28]:
# Let us try an invalid path
setwd("C:/Users/Shrayasr/personal/code/intro-R-data-science/introoooooooo")


Error in setwd("C:/Users/Shrayasr/personal/code/intro-R-data-science/introoooooooo"): cannot change working directory

In [29]:
getwd()


Out[29]:
'/Users/amitkaps/Dropbox/github/intro-R-data-science/intro'
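
Since setwd stops with an error on an invalid path, it can help to check the path first. The sketch below is one possible approach, not part of the original session: `safe_setwd` is a hypothetical helper name, and it uses `tempdir()` as a directory that is sure to exist.

```r
# Hypothetical helper: change directory only when the target exists
safe_setwd <- function(path) {
  if (!dir.exists(path)) {
    message("Directory not found, staying in: ", getwd())
    return(FALSE)
  }
  setwd(path)
  TRUE
}

original_dir <- getwd()

# tempdir() always exists, so this call succeeds
changed <- safe_setwd(tempdir())

# This directory does not exist, so we stay where we are
not_changed <- safe_setwd(file.path(tempdir(), "introoooooooo"))

# Go back to where we started
setwd(original_dir)
```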

Reading stuff

R is all about getting work done. It gives you nifty functions to quickly pick up data from outside so that you're up and running within R.


In [30]:
# Let's read in a bunch of cars
read.csv("small_cars.csv")


Out[30]:
namemodelurlpricetypeABSAcceleration..0.100.kmph.Air.ConditionerAudio.Controls.on.Streeing.WheelAudio.System..with.remote.ellip.hSpeakersTilt.FunctionTop.Speed..kmph.Traction.ControlTransmission.TypeTubeless.TyresTurning.Circle.Radius..metres.USB...Auxiliary.InputWheelbase..mm.brand
1Ashok Leyland StileAshok Leyland Stile LE 8-STR (Diesel)http://carzoom.in/car-specification/ashok-leyland-stile-le-8-str-diesel/749990MPV No18.7 Manual No No<8b> No Yes140 No 5 Speed Manual Yes5.2 No2725Ashok
2Ashok Leyland StileAshok Leyland Stile LS 8-STR (Diesel)http://carzoom.in/car-specification/ashok-leyland-stile-ls-8-str-diesel/799990MPV No18.7 Manual No No<8b> No Yes140 No 5 Speed Manual Yes5.2 No2725Ashok
3Ashok Leyland StileAshok Leyland Stile LX 8-STR (Diesel)http://carzoom.in/car-specification/ashok-leyland-stile-lx-8-str-diesel/829990MPV No18.7 Manual No No<8b> No Yes140 No 5 Speed Manual Yes5.2 No2725Ashok
4Ashok Leyland StileAshok Leyland Stile LS 7-STR (Diesel)http://carzoom.in/car-specification/ashok-leyland-stile-ls-7-str-diesel/849990MPV No18.7 Manual No No<8b> No Yes140 No 5 Speed Manual Yes5.2 No2725Ashok

As simple as that.
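
To try this without the small_cars.csv file, here is a self-contained sketch. It writes a tiny, made-up CSV to a temporary file and reads it back; the file contents and the names `csv_path` and `small` are assumptions for illustration only.

```r
# Write a tiny, made-up CSV to a temporary file so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(c("name,price,type",
             "Car A,749990,MPV",
             "Car B,799990,MPV"), csv_path)

# stringsAsFactors = FALSE keeps text columns as character vectors
small <- read.csv(csv_path, stringsAsFactors = FALSE)
small

# Assigning the result to a variable lets us keep working with it
nrow(small)
small$price
```

Note that read.csv returns the data rather than storing it anywhere, so assign it to a variable if you want to use it later.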


[ Exercise ]

  • Put a CSV file (anything, really) in some location on your computer (except the intro-R-data-science directory)
  • Use getwd and setwd to navigate to that location
  • Read the csv using read.csv

In [32]:
# Answer here
setwd("/Users/amitkaps")
read.csv("small_cars.csv")


Out[32]:
namemodelurlpricetypeABSAcceleration..0.100.kmph.Air.ConditionerAudio.Controls.on.Streeing.WheelAudio.System..with.remote.ellip.hSpeakersTilt.FunctionTop.Speed..kmph.Traction.ControlTransmission.TypeTubeless.TyresTurning.Circle.Radius..metres.USB...Auxiliary.InputWheelbase..mm.brand
1Ashok Leyland StileAshok Leyland Stile LE 8-STR (Diesel)http://carzoom.in/car-specification/ashok-leyland-stile-le-8-str-diesel/749990MPV No18.7 Manual No No<8b> No Yes140 No 5 Speed Manual Yes5.2 No2725Ashok
2Ashok Leyland StileAshok Leyland Stile LS 8-STR (Diesel)http://carzoom.in/car-specification/ashok-leyland-stile-ls-8-str-diesel/799990MPV No18.7 Manual No No<8b> No Yes140 No 5 Speed Manual Yes5.2 No2725Ashok
3Ashok Leyland StileAshok Leyland Stile LX 8-STR (Diesel)http://carzoom.in/car-specification/ashok-leyland-stile-lx-8-str-diesel/829990MPV No18.7 Manual No No<8b> No Yes140 No 5 Speed Manual Yes5.2 No2725Ashok
4Ashok Leyland StileAshok Leyland Stile LS 7-STR (Diesel)http://carzoom.in/car-specification/ashok-leyland-stile-ls-7-str-diesel/849990MPV No18.7 Manual No No<8b> No Yes140 No 5 Speed Manual Yes5.2 No2725Ashok

Vectors

Vectors are one-dimensional arrays that hold a single type of data. The c function creates a vector out of the values provided.


In [33]:
# Let us say we want to express the number of kilometers
# that we have run in the past 5 days.
# We can use a vector for this.

kms_run <- c(4.0, 5.2, 6.0, 5.2, 5.0)
kms_run


Out[33]:
  1. 4
  2. 5.2
  3. 6
  4. 5.2
  5. 5

In [34]:
# We can also use it to track on which days of the week we ran

did_run <- c(TRUE, TRUE, FALSE, TRUE, FALSE)
did_run


Out[34]:
  1. TRUE
  2. TRUE
  3. FALSE
  4. TRUE
  5. FALSE
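
What happens when the values passed to c are not all of one type? The sketch below, an addition with made-up values, shows that R does not complain but converts everything to one common type.

```r
# Mixing types in c() does not fail; R converts everything to one common type
mixed_numbers <- c(1L, 2.5)         # integer + numeric    -> numeric
mixed_logical <- c(TRUE, 2)         # logical + numeric    -> numeric
mixed_text    <- c(1, "two", TRUE)  # anything + character -> character

class(mixed_numbers)
class(mixed_logical)
class(mixed_text)
mixed_text
```

This is worth remembering when a numeric vector unexpectedly turns into text: a single stray string is enough.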

[ Exercise ]

For some analysis, let us put together the number of kilometers that we have run over the past 2 weeks. Last week, we ran:

Day of week   Kilometers
Monday        4
Tuesday       5.2
Wednesday     6
Thursday      5.2
Friday        5

This is expressed as the vector:


In [35]:
kms_last_week <- c(4.0, 5.2, 6.0, 5.2, 5.0)

This week, we ran:

Day of week   Kilometers
Monday        6
Tuesday       6.2
Wednesday     6
Thursday      7.2
Friday        7.5

Populate this in a vector kms_this_week


In [36]:
# Answer here
kms_this_week <- c(6.0, 6.2, 6.0, 7.2, 7.5)

When we're looking at a vector, it makes more sense if we can somehow name all the values, right?

Just looking at kms_last_week can become confusing. Let us use the names function to label each element with its day of the week.


In [37]:
names(kms_last_week) <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")

In [38]:
# Now the data can stand on its own and is much clearer
kms_last_week


Out[38]:
Monday
4
Tuesday
5.2
Wednesday
6
Thursday
5.2
Friday
5

Note: The names we assign are themselves a vector. So instead of typing them out each time, we can store them in a variable and reuse it.


In [39]:
days_of_week <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
names(kms_last_week) <- days_of_week
kms_last_week


Out[39]:
Monday
4
Tuesday
5.2
Wednesday
6
Thursday
5.2
Friday
5

Vector arithmetic

Arithmetic can be performed on vectors, element by element. Let us calculate the total number of kilometers that we ran on each day of the week across the past 2 weeks.


In [44]:
kms_last_week <- c(4.0, 5.2, 6.0, 5.2, 5.0)
kms_this_week <- c(6.0, 6.2, 6.0, 7.2, 7.5)

total_kms_past_2_weeks <- kms_last_week + kms_this_week
total_kms_past_2_weeks


Out[44]:
  1. 10
  2. 11.4
  3. 12
  4. 12.4
  5. 12.5

But how many kilometers did we run in total each week? Sum each vector using the sum function. Simple, no?


In [41]:
distance_last_week <- sum(kms_last_week)
distance_this_week <- sum(kms_this_week)

distance_last_week
distance_this_week


Out[41]:
25.4
Out[41]:
32.9
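
Arithmetic also works between a vector and a single number. The sketch below is an addition: the conversion factor is an approximate value chosen for illustration, and `miles_last_week` is a made-up name.

```r
kms_last_week <- c(4.0, 5.2, 6.0, 5.2, 5.0)

# A single number is applied to every element of the vector
miles_last_week <- kms_last_week * 0.621371   # approximate km-to-mile factor
miles_last_week

# mean() is a companion to sum(): the average distance per day
mean(kms_last_week)
```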

[ Exercise ]

  • Assign days of the week names to the total_kms_past_2_weeks vector using the names function.

In [45]:
# Answer here
names(total_kms_past_2_weeks) <- days_of_week

  • What is the total distance we ran across both weeks? Use the total_kms_past_2_weeks vector to arrive at your answer

In [46]:
# Answer here
sum(total_kms_past_2_weeks)


Out[46]:
58.3

Vector element selection

Consider the total_kms_past_2_weeks vector. Let us say that we want the distance we ran across both weeks on Wednesday. We know that Wednesday is the 3rd day of the week, so we pick the 3rd element from the vector like so:


In [47]:
total_kms_past_2_weeks[3]


Out[47]:
Wednesday: 12

Note: R begins its indexing from 1 and not 0, unlike most other programming languages.

What if we're interested in a section of results, say our performance as the week comes to an end (Wednesday, Thursday, Friday)?

We can provide a vector of the required indices like so:


In [48]:
total_kms_past_2_weeks[c(3,4,5)]


Out[48]:
Wednesday
12
Thursday
12.4
Friday
12.5

But say we have 100 elements in the vector. It would soon become tedious to select a range, say from 50-72 or from 44-62, right? To solve this problem, R provides the range operator (:), which takes a starting number and an ending number and returns a vector containing all the numbers from one to the other. We can then use this to fetch the required elements.


In [49]:
# Let us look at just the range operator
1:5


Out[49]:
  1. 1
  2. 2
  3. 3
  4. 4
  5. 5

[ Exercise ]

Use the Range operator (:) to fetch the Monday - Wednesday section in the total_kms_past_2_weeks vector


In [53]:
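# Answer here
# Monday to Wednesday are elements 1 to 3. Writing 0:3 returns the same
# result, because R silently ignores index 0.
total_kms_past_2_weeks[1:3]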


Out[53]:
Monday
10
Tuesday
11.4
Wednesday
12

In [52]:
0:3


Out[52]:
  1. 0
  2. 1
  3. 2
  4. 3
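
Because R starts counting at 1, index 0 behaves in a way that can surprise. The sketch below is an addition with the vector redefined so that it runs on its own; it shows what index 0 and negative indices do.

```r
total_kms_past_2_weeks <- c(10, 11.4, 12, 12.4, 12.5)

# Index 0 selects nothing, so 0:3 and 1:3 return the same elements
total_kms_past_2_weeks[0:3]
total_kms_past_2_weeks[1:3]

# On its own, index 0 gives an empty vector rather than an error
total_kms_past_2_weeks[0]

# Negative indices drop elements instead of selecting them
total_kms_past_2_weeks[-1]
```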

Also, since we've given names to the vector elements, we can use those names to select elements instead of using indexes.


In [54]:
names(total_kms_past_2_weeks) <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
total_kms_past_2_weeks["Wednesday"]
total_kms_past_2_weeks[c("Monday", "Tuesday")]


Out[54]:
Wednesday: 12
Out[54]:
Monday
10
Tuesday
11.4

We can also perform logical operations on vectors. Let us check on which days in the last week we ran more than 5 kilometers.


In [55]:
kms_last_week <- c(4.0, 5.2, 6.0, 5.2, 5.0)

names(kms_last_week) <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")

days_more_than_5 <- kms_last_week > 5
days_more_than_5


Out[55]:
Monday
FALSE
Tuesday
TRUE
Wednesday
TRUE
Thursday
TRUE
Friday
FALSE

We can use logical operations in combination with the vector to select only those elements that match a condition.

Now that days_more_than_5 holds a TRUE or FALSE for each day, marking the days where we ran more than 5 kilometers, let us select just those items from the vector


In [56]:
kms_last_week[days_more_than_5]


Out[56]:
Tuesday
5.2
Wednesday
6
Thursday
5.2

In [59]:
kms_last_week[kms_last_week<5]


Out[59]:
Monday: 4
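
A logical vector can also be counted and located. The sketch below is an addition using base R's `sum` and `which`, with the vector redefined so that it runs on its own.

```r
kms_last_week <- c(4.0, 5.2, 6.0, 5.2, 5.0)
names(kms_last_week) <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")

# TRUE counts as 1 and FALSE as 0, so sum() counts the matching days
sum(kms_last_week > 5)

# which() returns the positions (and names) of the TRUE elements
which(kms_last_week > 5)

# Conditions can be combined with & and |
kms_last_week[kms_last_week > 4 & kms_last_week < 6]
```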

Factors

Much of the data we work with is categorical, meaning that its values fall into a fixed set of categories.

Let's start with something simple. Training for runs happens in 2 forms:

  • Interval based training, where you focus on speed
  • Distance based training, where the focus is on endurance

Take a vector that represents what we did on each of the last 10 days: intervals, distance or rest.


In [60]:
running_style <- c("INT", "INT", "DIST", "DIST", "DIST", "REST", "INT", "DIST", "DIST", "DIST")
names(running_style) <- 1:10
running_style


Out[60]:
1
'INT'
2
'INT'
3
'DIST'
4
'DIST'
5
'DIST'
6
'REST'
7
'INT'
8
'DIST'
9
'DIST'
10
'DIST'

As you see, we can divide our runs into categories. Factors are used to represent these categories. Let us use the factor function to create a factor variable out of this vector


In [61]:
running_style_f <- factor(running_style)

Once we have this, we can use the levels function to extract the different levels that R interprets for us.


In [62]:
levels(running_style_f)


Out[62]:
  1. 'DIST'
  2. 'INT'
  3. 'REST'

Perfect, this tells us that we have 3 levels, i.e. we indeed have 3 running styles.

We can confirm that running_style_f is indeed a factor variable by checking its underlying type with the class function


In [63]:
class(running_style_f)


Out[63]:
'factor'

Once we have our levels, we can rename them to suit us by assigning to the levels function (very similar to the names function)


In [64]:
levels(running_style_f)
levels(running_style_f) <- c("Endurance", "Speed", "Rest")
levels(running_style_f)


Out[64]:
  1. 'DIST'
  2. 'INT'
  3. 'REST'
Out[64]:
  1. 'Endurance'
  2. 'Speed'
  3. 'Rest'

This also lets us use the summary function, which for a factor gives us a count of each level


In [65]:
summary(running_style_f)


Out[65]:
Endurance
6
Speed
3
Rest
1

This quickly tells us that out of the last 10 days, 6 were distance runs, 3 were interval runs and 1 was a rest day.
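
The same counts can be obtained with the table function, which also makes it easy to turn counts into shares. This sketch is an addition; `style_counts` is a made-up name and the vector is redefined so that the block runs on its own.

```r
running_style <- c("INT", "INT", "DIST", "DIST", "DIST", "REST", "INT", "DIST", "DIST", "DIST")
running_style_f <- factor(running_style)

# table() counts how often each level occurs, much like summary() on a factor
style_counts <- table(running_style_f)
style_counts

# Dividing by the total gives the share of each category
style_counts / length(running_style_f)
```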

Types of factor variables

As said, factors allow us to create categorical variables. These variables can be of 2 types:

  • Nominal
  • Ordinal

Nominal Variables

By default a factor is nominal, meaning that its categories are identified by name and have no assigned order. So trying a < or > comparison on them gives us NA and a warning rather than an answer


In [66]:
running_style_f <- factor(running_style)
running_style_f[1] > running_style_f[2]


Warning message:
In Ops.factor(running_style_f[1], running_style_f[2]): '>' not meaningful for factors
Out[66]:
[1] NA

As you see, R warns us that '>' is not meaningful for factors and returns NA.

Ordinal Variables

Passing ordered = TRUE to factor (often abbreviated to order = TRUE, which R accepts because argument names can be partially matched) makes the factor an ordinal variable, and < and > are meaningful here.

Consider the number of kilometers run in the past 10 days


In [ ]:
kms_run <- c(4.0, 5.2, 6.0, 5.2, 5.0, 6.0, 6.2, 6.0, 7.2, 7.5)

In [68]:
# Let's classify these into Long, Medium and Short runs manually
distance_type <- c("S", "M", "M", "M", "M", "M", "M", "M", "L", "L")

Now we could create a factor from this, but we understand an order here: Short < Medium < Long. To introduce that order, we need to pass order=TRUE (short for ordered = TRUE) along with the levels in the order we require.


In [69]:
distance_type_f <- factor(distance_type, order=TRUE, levels=c("S", "M", "L"))
distance_type_f


Out[69]:
  1. S
  2. M
  3. M
  4. M
  5. M
  6. M
  7. M
  8. M
  9. L
  10. L

Now that we have an order in place, we can use < and >


In [71]:
distance_type_f[1]
distance_type_f[2]
distance_type_f[1] > distance_type_f[2]


Out[71]:
S
Out[71]:
M
Out[71]:
FALSE
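
Classifying the runs by hand works for 10 values, but not for many more. As a sketch of one alternative, base R's `cut` function can derive the same ordered factor from the distances. The thresholds below (4.5 km and 7 km) are assumptions chosen to reproduce the manual labels, and `distance_type_cut` is a made-up name.

```r
kms_run <- c(4.0, 5.2, 6.0, 5.2, 5.0, 6.0, 6.2, 6.0, 7.2, 7.5)

# Assumed thresholds: up to 4.5 km is Short, up to 7 km is Medium, beyond is Long
distance_type_cut <- cut(kms_run,
                         breaks = c(0, 4.5, 7, Inf),
                         labels = c("S", "M", "L"),
                         ordered_result = TRUE)
distance_type_cut

# The result is an ordered factor, so comparisons work
distance_type_cut[1] < distance_type_cut[9]
```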

The real reason why factors are important will be covered in forthcoming sessions. This just introduces the concept and the necessity for it.

Data Frame

The Data Frame is R's most iconic type. Soon, you'll find out that a Data Frame is great for expressing all kinds of data.

Think of a Data Frame as a 2 dimensional structure having rows and columns. Each column may be of a different type, and each row can be thought of as representing an observation.

To quickly get started with data frames, let us use an inbuilt data frame in R that contains some data on cars. From the help:

The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).

This data is stored in mtcars. Let us look at it.


In [73]:
?mtcars


Out[73]:
mtcars {datasets}    R Documentation

Motor Trend Car Road Tests

Description

The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).

Usage

mtcars

Format

A data frame with 32 observations on 11 variables.

[, 1] mpg Miles/(US) gallon
[, 2] cyl Number of cylinders
[, 3] disp Displacement (cu.in.)
[, 4] hp Gross horsepower
[, 5] drat Rear axle ratio
[, 6] wt Weight (1000 lbs)
[, 7] qsec 1/4 mile time
[, 8] vs V/S
[, 9] am Transmission (0 = automatic, 1 = manual)
[,10] gear Number of forward gears
[,11] carb Number of carburetors

Source

Henderson and Velleman (1981), Building multiple regression models interactively. Biometrics, 37, 391–411.

Examples

require(graphics)
pairs(mtcars, main = "mtcars data")
coplot(mpg ~ disp | as.factor(cyl), data = mtcars,
       panel = panel.smooth, rows = 1)

[Package datasets version 3.2.4 ]

[ Exercise ]

How will you find out what type mtcars is?


In [74]:
# Answer here
class(mtcars)


Out[74]:
'data.frame'

As you can see, it contains data about cars, each row represents one particular car and its associated details.

One of the most important things when working with Data Frames, and with Data Science in general, is to spend time understanding the structure of the data. The structure, however, is independent of the data itself, so a glimpse of the data is enough to get started.

For this, R provides 2 functions, head and tail, that allow us to peek at the start and the end of the data frame


In [75]:
# head
head(mtcars)


Out[75]:
                   mpg   cyl  disp   hp   drat  wt     qsec   vs  am  gear  carb
Mazda RX4          21    6    160    110  3.9   2.62   16.46  0   1   4     4
Mazda RX4 Wag      21    6    160    110  3.9   2.875  17.02  0   1   4     4
Datsun 710         22.8  4    108    93   3.85  2.32   18.61  1   1   4     1
Hornet 4 Drive     21.4  6    258    110  3.08  3.215  19.44  1   0   3     1
Hornet Sportabout  18.7  8    360    175  3.15  3.44   17.02  0   0   3     2
Valiant            18.1  6    225    105  2.76  3.46   20.22  1   0   3     1

In [76]:
# tail
tail(mtcars)


Out[76]:
                mpg   cyl  disp   hp   drat  wt     qsec  vs  am  gear  carb
Porsche 914-2   26    4    120.3  91   4.43  2.14   16.7  0   1   5     2
Lotus Europa    30.4  4    95.1   113  3.77  1.513  16.9  1   1   5     2
Ford Pantera L  15.8  8    351    264  4.22  3.17   14.5  0   1   5     4
Ferrari Dino    19.7  6    145    175  3.62  2.77   15.5  0   1   5     6
Maserati Bora   15    8    301    335  3.54  3.57   14.6  0   1   5     8
Volvo 142E      21.4  4    121    109  4.11  2.78   18.6  1   1   4     2

In [77]:
# Another way to get a quick glimpse of the data is to use the `str` function.
str(mtcars)


'data.frame':	32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

The str function, as you can see, shows us some nice details. It tells us

  • The number of observations (rows) we have (32)
  • The number of variables (columns) in consideration (11)
  • Each column with its data type and first few entries

Another quick way to find out just the number of rows and columns is to use the nrow and ncol functions


In [78]:
# total number of rows
nrow(mtcars)


Out[78]:
32

In [79]:
# total number of columns
ncol(mtcars)


Out[79]:
11
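
A few related base R functions give the same kind of overview in one go. The sketch below is an addition and only uses the built-in mtcars data frame.

```r
# dim() returns the number of rows and columns together
dim(mtcars)

# names() lists the column names, rownames() the row labels (car models here)
names(mtcars)
head(rownames(mtcars), 3)
```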

Creating Data Frames

Let us create our own Data Frame to better understand their underlying concepts.

Let's put together a bunch of vectors representing the different variables (columns) in our data frame


In [80]:
distance <- c(4.0, 5.2, 6.0, 5.2, 5.0, 6.0, 6.2, 6.0, 7.2, 7.5)
time_taken <- c(20.5, 28.0, 40.2, 24.1, 26.0, 42.0, 43.2, 40.1, 50.2, 50.7)
run_type <- c("S", "S", "E", "S", "S", "E", "E", "E", "E", "E") # S is speed; E is endurance
workout_after <- c(TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE)

Now that we have 4 vectors, we can create a data frame from these 4 vectors using the data.frame function


In [84]:
running_df <- data.frame(distance, time_taken, run_type, workout_after, 
                        stringsAsFactors = FALSE)

In [85]:
# Let's print this to see what we get
running_df


Out[85]:
    distance  time_taken  run_type  workout_after
1   4         20.5        S         TRUE
2   5.2       28          S         FALSE
3   6         40.2        E         FALSE
4   5.2       24.1        S         FALSE
5   5         26          S         TRUE
6   6         42          E         TRUE
7   6.2       43.2        E         FALSE
8   6         40.1        E         FALSE
9   7.2       50.2        E         FALSE
10  7.5       50.7        E         FALSE

Quite similar to the data frame we saw earlier with cars. Let's work with this!


[ Exercise ]

Use the head, tail and str functions to inspect the data frame we just created (running_df)


In [86]:
# Answer here
head(running_df)
tail(running_df)
str(running_df)


Out[86]:
    distance  time_taken  run_type  workout_after
1   4         20.5        S         TRUE
2   5.2       28          S         FALSE
3   6         40.2        E         FALSE
4   5.2       24.1        S         FALSE
5   5         26          S         TRUE
6   6         42          E         TRUE
Out[86]:
    distance  time_taken  run_type  workout_after
5   5         26          S         TRUE
6   6         42          E         TRUE
7   6.2       43.2        E         FALSE
8   6         40.1        E         FALSE
9   7.2       50.2        E         FALSE
10  7.5       50.7        E         FALSE
'data.frame':	10 obs. of  4 variables:
 $ distance     : num  4 5.2 6 5.2 5 6 6.2 6 7.2 7.5
 $ time_taken   : num  20.5 28 40.2 24.1 26 42 43.2 40.1 50.2 50.7
 $ run_type     : chr  "S" "S" "E" "S" ...
 $ workout_after: logi  TRUE FALSE FALSE FALSE TRUE TRUE ...

In [87]:
?data.frame


Out[87]:
data.frame {base}    R Documentation

Data Frames

Description

This function creates data frames, tightly coupled collections of variables which share many of the properties of matrices and of lists, used as the fundamental data structure by most of R's modeling software.

Usage

data.frame(..., row.names = NULL, check.rows = FALSE,
           check.names = TRUE,
           stringsAsFactors = default.stringsAsFactors())

default.stringsAsFactors()

Arguments

...

these arguments are of either the form value or tag = value. Component names are created based on the tag (if present) or the deparsed argument itself.

row.names

NULL or a single integer or character string specifying a column to be used as row names, or a character or integer vector giving the row names for the data frame.

check.rows

if TRUE then the rows are checked for consistency of length and names.

check.names

logical. If TRUE then the names of the variables in the data frame are checked to ensure that they are syntactically valid variable names and are not duplicated. If necessary they are adjusted (by make.names) so that they are.

stringsAsFactors

logical: should character vectors be converted to factors? The ‘factory-fresh’ default is TRUE, but this can be changed by setting options(stringsAsFactors = FALSE).

Details

A data frame is a list of variables of the same number of rows with unique row names, given class "data.frame". If no variables are included, the row names determine the number of rows.

The column names should be non-empty, and attempts to use empty names will have unsupported results. Duplicate column names are allowed, but you need to use check.names = FALSE for data.frame to generate such a data frame. However, not all operations on data frames will preserve duplicated column names: for example matrix-like subsetting will force column names in the result to be unique.

data.frame converts each of its arguments to a data frame by calling as.data.frame(optional = TRUE). As that is a generic function, methods can be written to change the behaviour of arguments according to their classes: R comes with many such methods. Character variables passed to data.frame are converted to factor columns unless protected by I or argument stringsAsFactors is false. If a list or data frame or matrix is passed to data.frame it is as if each component or column had been passed as a separate argument (except for matrices of class "model.matrix" and those protected by I).

Objects passed to data.frame should have the same number of rows, but atomic vectors (see is.vector), factors and character vectors protected by I will be recycled a whole number of times if necessary (including as elements of list arguments).

If row names are not supplied in the call to data.frame, the row names are taken from the first component that has suitable names, for example a named vector or a matrix with rownames or a data frame. (If that component is subsequently recycled, the names are discarded with a warning.) If row.names was supplied as NULL or no suitable component was found the row names are the integer sequence starting at one (and such row names are considered to be ‘automatic’, and not preserved by as.matrix).

If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number).

Names are removed from vector inputs not protected by I.

default.stringsAsFactors is a utility that takes getOption("stringsAsFactors") and ensures the result is TRUE or FALSE (or throws an error if the value is not NULL).

Value

A data frame, a matrix-like structure whose columns may be of differing types (numeric, logical, factor and character and so on).

How the names of the data frame are created is complex, and the rest of this paragraph is only the basic story. If the arguments are all named and simple objects (not lists, matrices of data frames) then the argument names give the column names. For an unnamed simple argument, a deparsed version of the argument is used as the name (with an enclosing I(...) removed). For a named matrix/list/data frame argument with more than one named column, the names of the columns are the name of the argument followed by a dot and the column name inside the argument: if the argument is unnamed, the argument's column names are used. For a named or unnamed matrix/list/data frame argument that contains a single column, the column name in the result is the column name in the argument. Finally, the names are adjusted to be unique and syntactically valid unless check.names = FALSE.

Note

In versions of R prior to 2.4.0 row.names had to be character: to ensure compatibility with such versions of R, supply a character vector as the row.names argument.

References

Chambers, J. M. (1992) Data for models. Chapter 3 of Statistical Models in S eds J. M. Chambers and T. J. Hastie, Wadsworth & Brooks/Cole.

See Also

I, plot.data.frame, print.data.frame, row.names, names (for the column names), [.data.frame for subsetting methods, Math.data.frame etc, about Group methods for data.frames; read.table, make.names.

Examples

L3 <- LETTERS[1:3]
fac <- sample(L3, 10, replace = TRUE)
(d <- data.frame(x = 1, y = 1:10, fac = fac))
## The "same" with automatic column names:
data.frame(1, 1:10, sample(L3, 10, replace = TRUE))

is.data.frame(d)

## do not convert to factor, using I() :
(dd <- cbind(d, char = I(letters[1:10])))
rbind(class = sapply(dd, class), mode = sapply(dd, mode))

stopifnot(1:10 == row.names(d))  # {coercion}

(d0  <- d[, FALSE])   # data frame with 0 columns and 10 rows
(d.0 <- d[FALSE, ])   # <0 rows> data frame  (3 named cols)
(d00 <- d0[FALSE, ])  # data frame with 0 columns and 0 rows

[Package base version 3.2.4 ]

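Note: Because we passed stringsAsFactors = FALSE, str shows run_type as a character (chr) column. Without that argument, the R version used here (3.2.4) would have converted it to a Factor automatically; from R 4.0.0 onwards the default is FALSE, so character columns stay as they are.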
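
If a column should be categorical, it can be converted explicitly, whatever the default is. The following is a minimal sketch; `running_chr` is a made-up name and only two of the columns are rebuilt to keep it short.

```r
run_type <- c("S", "S", "E", "S", "S", "E", "E", "E", "E", "E")
distance <- c(4.0, 5.2, 6.0, 5.2, 5.0, 6.0, 6.2, 6.0, 7.2, 7.5)

# With stringsAsFactors = FALSE the text column stays a character vector
running_chr <- data.frame(distance, run_type, stringsAsFactors = FALSE)
class(running_chr$run_type)

# Convert just this column to a factor when categories are wanted
running_chr$run_type <- factor(running_chr$run_type)
class(running_chr$run_type)
levels(running_chr$run_type)
```

Converting column by column keeps the choice visible in the code instead of depending on the R version in use.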

Selecting elements

Rows and columns can be selected from a data frame in much the same way as elements from a vector, i.e. using [ and ]

Within [ and ] there are 2 parts, the row part and the column part, separated by a comma (,)


In [89]:
# Let us pick up the value in the 4th column, 2nd row
running_df[2, 4]
running_df


Out[89]:
FALSE
Out[89]:
    distance  time_taken  run_type  workout_after
1   4         20.5        S         TRUE
2   5.2       28          S         FALSE
3   6         40.2        E         FALSE
4   5.2       24.1        S         FALSE
5   5         26          S         TRUE
6   6         42          E         TRUE
7   6.2       43.2        E         FALSE
8   6         40.1        E         FALSE
9   7.2       50.2        E         FALSE
10  7.5       50.7        E         FALSE

[ Exercise ]

  • The row and column parts, much like the vector element selector, allow the use of the range operator (:). Select columns 1-2 for rows 5-9

In [90]:
# Answer here
running_df[5:9, 1:2]


Out[90]:
    distance  time_taken
5   5         26
6   6         42
7   6.2       43.2
8   6         40.1
9   7.2       50.2

R also lets us omit either of the 2 parts inside [ and ], which selects everything along that dimension. The comma is mandatory though. So now

  • Select columns 1-2 for all rows

In [91]:
# Answer here
running_df[,1:2]


Out[91]:
    distance  time_taken
1   4         20.5
2   5.2       28
3   6         40.2
4   5.2       24.1
5   5         26
6   6         42
7   6.2       43.2
8   6         40.1
9   7.2       50.2
10  7.5       50.7

  • Select all columns for rows 5-9

In [92]:
# Answer here 
running_df[5:9,]


Out[92]:
    distance  time_taken  run_type  workout_after
5   5         26          S         TRUE
6   6         42          E         TRUE
7   6.2       43.2        E         FALSE
8   6         40.1        E         FALSE
9   7.2       50.2        E         FALSE

You can also use the name of the column to select instead of specifying the numbers


In [93]:
running_df[5:9, "distance"]


Out[93]:
  1. 5
  2. 6
  3. 6.2
  4. 6
  5. 7.2

There are times where we want to operate only on one column. We have, as of now, understood that there are 2 ways to do this:


In [94]:
# Using the index of the distance column
running_df[,1]


Out[94]:
  1. 4
  2. 5.2
  3. 6
  4. 5.2
  5. 5
  6. 6
  7. 6.2
  8. 6
  9. 7.2
  10. 7.5

In [95]:
# Using the column name
running_df[, "distance"]


Out[95]:
  1. 4
  2. 5.2
  3. 6
  4. 5.2
  5. 5
  6. 6
  7. 6.2
  8. 6
  9. 7.2
  10. 7.5

There's also a 3rd way, which you'll see used extensively throughout R, and that uses the $ operator


In [96]:
running_df$distance


Out[96]:
  1. 4
  2. 5.2
  3. 6
  4. 5.2
  5. 5
  6. 6
  7. 6.2
  8. 6
  9. 7.2
  10. 7.5

Note: When you are working on an individual column, the data structure is a vector and not a data.frame
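
When a single column should stay a data frame, base R offers two options. The sketch below is an addition and rebuilds a shortened running_df (3 rows, 2 columns) so that it runs on its own.

```r
running_df <- data.frame(distance   = c(4.0, 5.2, 6.0),
                         time_taken = c(20.5, 28.0, 40.2))

# These return a plain vector
class(running_df$distance)
class(running_df[, "distance"])

# These keep the data frame structure
class(running_df["distance"])
class(running_df[, "distance", drop = FALSE])
```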

Working with subsets

Let us say that we want to select the first 4 rows of the data frame. We can do so by passing a logical vector like so


In [97]:
running_df[c(TRUE,TRUE,TRUE,TRUE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE),]


Out[97]:
    distance  time_taken  run_type  workout_after
1   4         20.5        S         TRUE
2   5.2       28          S         FALSE
3   6         40.2        E         FALSE
4   5.2       24.1        S         FALSE

But this is tedious, so R gives us the subset function to select rows by a condition in a more readable fashion


In [99]:
# Let's pick out the distances for all the days where we did endurance runs.
(subset(running_df, subset = (run_type == "E")))$distance


Out[99]:
  1. 6
  2. 6
  3. 6.2
  4. 6
  5. 7.2
  6. 7.5
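
The subset function is a convenience over the logical indexing shown earlier. As a sketch of how the two relate, the block below rebuilds two columns of running_df and selects the endurance rows both ways; `is_endurance`, `by_index` and `by_subset` are made-up names.

```r
running_df <- data.frame(
  distance = c(4.0, 5.2, 6.0, 5.2, 5.0, 6.0, 6.2, 6.0, 7.2, 7.5),
  run_type = c("S", "S", "E", "S", "S", "E", "E", "E", "E", "E"),
  stringsAsFactors = FALSE
)

# A comparison on a column produces one TRUE/FALSE per row
is_endurance <- running_df$run_type == "E"

# Using it in the row position selects the same rows as subset()
by_index  <- running_df[is_endurance, ]
by_subset <- subset(running_df, subset = run_type == "E")
all.equal(by_index, by_subset)

# subset() can also pick columns through its select argument
subset(running_df, subset = run_type == "E", select = distance)
```

Inside subset the column names can be used directly, which is what makes it more readable than repeating running_df$ in every condition.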

[ Exercise ]

Pick out all those rows where we did endurance runs and worked out after the run. Use the subset function


In [100]:
# Answer here
subset(running_df, subset = (run_type == "E" & workout_after == TRUE))


Out[100]:
    distance  time_taken  run_type  workout_after
6   6         42          E         TRUE

Ordering

Ordering helps us to understand our data better and helps with comparison.

The order function helps us to do that in R. It is quite smart as well. Consider a vector


In [ ]:
some_alphabets <- c("h", "a", "q", "z", "n", "r")
some_alphabets

In [ ]:
# Let's call order on them and see what happens
order(some_alphabets)

It gives us a vector of positions: the indexes that would put the original vector in order. Now let us index the original vector with this one to get the sorted values


In [ ]:
o <- order(some_alphabets)
some_alphabets[o]
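
To make clear what order returns, here is a small sketch on a different, made-up numeric vector (`finish_times` and `positions` are names chosen for illustration).

```r
finish_times <- c(28.0, 20.5, 40.2, 24.1)

# order() returns positions: the 2nd element is smallest, then the 4th, and so on
positions <- order(finish_times)
positions

# Indexing with those positions yields the sorted values
finish_times[positions]

# sort() does both steps in one call
sort(finish_times)
```

The positions are what make order useful later on: the same positions can rearrange other vectors, or the rows of a data frame, to match.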

We can also sort it in the opposite order using the decreasing=TRUE argument to order


[ Exercise ]

  • Sort the some_alphabets vector in descending order

In [ ]:
# Answer here

  • We can also use order on a column in a data frame. Order the distance column within our running_df data frame

In [ ]:
# Answer here

  • From the existing running_df data frame, create a new data frame (running_df_ordered) that is ordered by the distance column

In [ ]:
# Answer here