Introduction to R

R is an interpreted programming language used mainly for statistical computing and data analysis. It is widely used among statisticians and data miners.

Basics

Comments



In [1]:

    
# Namaskara (everything after a `#` is a comment and is ignored by R)

Integers



In [2]:

    
4L + 2L









    Out[2]:




6

Strings



In [3]:

    
"Hello World"









    Out[3]:




'Hello World'

Floats



In [4]:

    
4.2









    Out[4]:




4.2

Logical



In [5]:

    
TRUE









    Out[5]:




TRUE



In [6]:

    
FALSE









    Out[6]:




FALSE

Arithmetic Operations



In [7]:

    
# Additions are done with `+`
40 + 2









    Out[7]:




42



In [8]:

    
# Subtractions with `-`
44 - 2









    Out[8]:




42



In [9]:

    
# Multiplication with `*`
21 * 2









    Out[9]:




42



In [10]:

    
# Divisions with `/`
84 / 2









    Out[10]:




42



In [11]:

    
# Exponentiation with `^`
7 ^ 8









    Out[11]:




5764801



In [13]:

    
# Modulo with `%%`
71 %% 5









    Out[13]:




1

Logical Operations



In [14]:

    
TRUE & FALSE









    Out[14]:




FALSE



In [15]:

    
TRUE & TRUE









    Out[15]:




TRUE



In [16]:

    
TRUE | FALSE









    Out[16]:




TRUE



In [17]:

    
FALSE | FALSE









    Out[17]:




FALSE

Variables & Assignment

Use the <- operator to assign values to variables.



In [18]:

    
answer_to_life <- 42



In [19]:

    
answer_to_life









    Out[19]:




42

All arithmetic operations are supported on variables.



In [20]:

    
foo <- 21
bar <- 21

answer_to_life <- foo + bar
answer_to_life









    Out[20]:




42

Types

The class function can be used to identify the underlying type of a variable or literal



In [21]:

    
class(42L)
class(42.0)
class("Forty Two")









    Out[21]:




'integer'






    Out[21]:




'numeric'






    Out[21]:




'character'

[ Exercise ]

What is the type of TRUE / FALSE ?



In [22]:

    
# Answer here
class(TRUE)









    Out[22]:




'logical'

Working with the environment

It is important to understand how to bring in things from outside the R environment, such as files and URLs. That starts with navigating the filesystem.

Working Directory

The directory you are currently working in is called the working directory.



In [23]:

    
# Let's start off by checking where we are
getwd()









    Out[23]:




'/Users/amitkaps/Dropbox/github/intro-R-data-science/intro'



In [24]:

    
# Let's then move to the _cars_ directory, which sits next to this directory

# Paths can be relative
setwd("../cars")



In [25]:

    
getwd()









    Out[25]:




'/Users/amitkaps/Dropbox/github/intro-R-data-science/cars'



In [26]:

    
# Or paths can be absolute
setwd("/Users/amitkaps/Dropbox/github/intro-R-data-science/intro")



In [27]:

    
getwd()









    Out[27]:




'/Users/amitkaps/Dropbox/github/intro-R-data-science/intro'



In [28]:

    
# Let us try an invalid path
setwd("C:/Users/Shrayasr/personal/code/intro-R-data-science/introoooooooo")









    



Error in setwd("C:/Users/Shrayasr/personal/code/intro-R-data-science/introoooooooo"): cannot change working directory



In [29]:

    
getwd()









    Out[29]:




'/Users/amitkaps/Dropbox/github/intro-R-data-science/intro'
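The navigation steps above can be sketched in a form that runs on any machine. This is only an illustration: tempdir() stands in for a real project directory, and missing_dir is a made-up path that is assumed not to exist. Checking with dir.exists first avoids the "cannot change working directory" error.

```r
# Remember where we started so that we can come back later.
original_dir <- getwd()

# tempdir() is only a stand-in destination: it exists on every machine.
target_dir <- tempdir()

# A path that should not exist (hypothetical name), like the invalid one above.
missing_dir <- file.path(target_dir, "introoooooooo")

# dir.exists() lets us check a path first instead of hitting the
# "cannot change working directory" error.
for (path in c(target_dir, missing_dir)) {
  if (dir.exists(path)) {
    setwd(path)
    cat("Moved to:", getwd(), "\n")
  } else {
    cat("Skipping, no such directory:", path, "\n")
  }
}

# Return to the starting directory.
setwd(original_dir)
getwd()
```

Storing the starting directory in a variable makes it easy to undo the move, whichever path you visited in between.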

Reading stuff

R is all about getting work done. It gives you nifty functions to quickly pick data up from files so that you're up and running within R.



In [30]:

    
# Let's read in a bunch of cars
read.csv("small_cars.csv")









    Out[30]:





	name	model	url	price	type	ABS	Acceleration..0.100.kmph.	Air.Conditioner	Audio.Controls.on.Streeing.Wheel	Audio.System..with.remote.	…	Speakers	Tilt.Function	Top.Speed..kmph.	Traction.Control	Transmission.Type	Tubeless.Tyres	Turning.Circle.Radius..metres.	USB...Auxiliary.Input	Wheelbase..mm.	brand
1	Ashok Leyland Stile	Ashok Leyland Stile LE 8-STR (Diesel)	http://carzoom.in/car-specification/ashok-leyland-stile-le-8-str-diesel/	749990	MPV	No	18.7	Manual	No	No	…	No	Yes	140	No	5 Speed Manual	Yes	5.2	No	2725	Ashok
2	Ashok Leyland Stile	Ashok Leyland Stile LS 8-STR (Diesel)	http://carzoom.in/car-specification/ashok-leyland-stile-ls-8-str-diesel/	799990	MPV	No	18.7	Manual	No	No	…	No	Yes	140	No	5 Speed Manual	Yes	5.2	No	2725	Ashok
3	Ashok Leyland Stile	Ashok Leyland Stile LX 8-STR (Diesel)	http://carzoom.in/car-specification/ashok-leyland-stile-lx-8-str-diesel/	829990	MPV	No	18.7	Manual	No	No	…	No	Yes	140	No	5 Speed Manual	Yes	5.2	No	2725	Ashok
4	Ashok Leyland Stile	Ashok Leyland Stile LS 7-STR (Diesel)	http://carzoom.in/car-specification/ashok-leyland-stile-ls-7-str-diesel/	849990	MPV	No	18.7	Manual	No	No	…	No	Yes	140	No	5 Speed Manual	Yes	5.2	No	2725	Ashok

As simple as that.
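Since small_cars.csv may not be available on your machine, here is a self-contained sketch of the same idea. The data frame sample_cars and its values are made up for illustration; the example writes them to a temporary file and reads them back with read.csv.

```r
# Build a tiny data frame to act as sample data (hypothetical values).
sample_cars <- data.frame(
  name  = c("Car A", "Car B", "Car C"),
  price = c(749990, 799990, 829990)
)

# Write it to a temporary CSV file.
# row.names = FALSE avoids writing an extra column of row numbers.
csv_path <- tempfile(fileext = ".csv")
write.csv(sample_cars, csv_path, row.names = FALSE)

# read.csv returns a data frame; assign it so that it can be reused
# instead of only being printed.
cars_df <- read.csv(csv_path)
print(cars_df)
print(class(cars_df))
```

Assigning the result to a variable, rather than just printing it, is usually what you want when you plan to work with the data afterwards.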

[ Exercise ]

Put a CSV file (anything, really) in some location on your computer (except the intro-R-data-science directory)
Use getwd and setwd to navigate to that location
Read the csv using read.csv



In [32]:

    
# Answer here
setwd("/Users/amitkaps")
read.csv("small_cars.csv")









    Out[32]:





	name	model	url	price	type	ABS	Acceleration..0.100.kmph.	Air.Conditioner	Audio.Controls.on.Streeing.Wheel	Audio.System..with.remote.	…	Speakers	Tilt.Function	Top.Speed..kmph.	Traction.Control	Transmission.Type	Tubeless.Tyres	Turning.Circle.Radius..metres.	USB...Auxiliary.Input	Wheelbase..mm.	brand
1	Ashok Leyland Stile	Ashok Leyland Stile LE 8-STR (Diesel)	http://carzoom.in/car-specification/ashok-leyland-stile-le-8-str-diesel/	749990	MPV	No	18.7	Manual	No	No	…	No	Yes	140	No	5 Speed Manual	Yes	5.2	No	2725	Ashok
2	Ashok Leyland Stile	Ashok Leyland Stile LS 8-STR (Diesel)	http://carzoom.in/car-specification/ashok-leyland-stile-ls-8-str-diesel/	799990	MPV	No	18.7	Manual	No	No	…	No	Yes	140	No	5 Speed Manual	Yes	5.2	No	2725	Ashok
3	Ashok Leyland Stile	Ashok Leyland Stile LX 8-STR (Diesel)	http://carzoom.in/car-specification/ashok-leyland-stile-lx-8-str-diesel/	829990	MPV	No	18.7	Manual	No	No	…	No	Yes	140	No	5 Speed Manual	Yes	5.2	No	2725	Ashok
4	Ashok Leyland Stile	Ashok Leyland Stile LS 7-STR (Diesel)	http://carzoom.in/car-specification/ashok-leyland-stile-ls-7-str-diesel/	849990	MPV	No	18.7	Manual	No	No	…	No	Yes	140	No	5 Speed Manual	Yes	5.2	No	2725	Ashok

Vectors

Vectors are one dimensional arrays that can hold one type of data. The c function allows us to create a vector out of provided values



In [33]:

    
# Let us say we want to express the number of kilometers
# that we have run in the past 5 days.
# We can use a vector for this.

kms_run <- c(4.0, 5.2, 6.0, 5.2, 5.0)
kms_run









    Out[33]:





	4
	5.2
	6
	5.2
	5



In [34]:

    
# We can also use a vector to track on which days of the week we ran

did_run <- c(TRUE, TRUE, FALSE, TRUE, FALSE)
did_run









    Out[34]:





	TRUE
	TRUE
	FALSE
	TRUE
	FALSE
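The statement that a vector holds only one type of data can be seen in action when we mix types inside c. The following is a small sketch with made-up values: R quietly converts every element to a common type instead of raising an error.

```r
# Mixing numbers, strings and logicals: everything becomes a string.
mixed <- c(1, "two", TRUE)
print(mixed)
print(class(mixed))

# Mixing numbers and logicals: TRUE becomes 1 and FALSE becomes 0.
numbers_and_logicals <- c(1.5, TRUE, FALSE)
print(numbers_and_logicals)
print(class(numbers_and_logicals))
```

This silent conversion is worth keeping in mind: a single stray string in a vector of numbers turns the whole vector into strings.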

[ Exercise ]

For some analysis, let us put together the number of kilometers that we have run over the past 2 weeks. Last week, we ran:

Day of week	Kilometers
Monday	4
Tuesday	5.2
Wednesday	6
Thursday	5.2
Friday	5

This is expressed as the vector:



In [35]:

    
kms_last_week <- c(4.0, 5.2, 6.0, 5.2, 5.0)

This week, we ran:

Day of week	Kilometers
Monday	6
Tuesday	6.2
Wednesday	6
Thursday	7.2
Friday	7.5

Populate this in a vector kms_this_week



In [36]:

    
# Answer here
kms_this_week <- c(6.0, 6.2, 6.0, 7.2, 7.5)

When we're looking at a vector, it makes more sense if we can somehow name all the values, right?

Just looking at kms_last_week can become confusing. Let us use the names function to give each element the day of the week



In [37]:

    
names(kms_last_week) <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")



In [38]:

    
# Now the data can stand on its own and is much clearer
kms_last_week









    Out[38]:





	Monday
		4
	Tuesday
		5.2
	Wednesday
		6
	Thursday
		5.2
	Friday
		5

Note: The names we assign are themselves a vector. So instead of typing them out every time, we can store that vector in a variable and reuse it.



In [39]:

    
days_of_week <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
names(kms_last_week) <- days_of_week
kms_last_week









    Out[39]:





	Monday
		4
	Tuesday
		5.2
	Wednesday
		6
	Thursday
		5.2
	Friday
		5

Vector arithmetic

Arithmetic can be performed on vectors, element by element. Let us calculate the total number of kilometers that we ran on each day across the past 2 weeks.



In [44]:

    
kms_last_week <- c(4.0, 5.2, 6.0, 5.2, 5.0)
kms_this_week <- c(6.0, 6.2, 6.0, 7.2, 7.5)

total_kms_past_2_weeks <- kms_last_week + kms_this_week
total_kms_past_2_weeks









    Out[44]:





	10
	11.4
	12
	12.4
	12.5
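Element-wise arithmetic is not limited to addition. As a small sketch using the same two vectors, we can subtract one vector from another, or combine a vector with a single number, in which case that number is applied to every element. The variable names metres_last_week, improvement and average_last_week are introduced here only for illustration.

```r
kms_last_week <- c(4.0, 5.2, 6.0, 5.2, 5.0)
kms_this_week <- c(6.0, 6.2, 6.0, 7.2, 7.5)

# A single number is applied to every element of the vector.
metres_last_week <- kms_last_week * 1000
print(metres_last_week)

# Subtraction works element by element, just like addition.
improvement <- kms_this_week - kms_last_week
print(improvement)

# Functions such as mean() reduce a whole vector to one number.
average_last_week <- mean(kms_last_week)
print(average_last_week)
```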

But how many kilometers did we run in total in each week? Sum each of the vectors using the sum function.



In [41]:

    
distance_last_week <- sum(kms_last_week)
distance_this_week <- sum(kms_this_week)

distance_last_week
distance_this_week









    Out[41]:




25.4






    Out[41]:




32.9

[ Exercise ]

Assign days of the week names to the total_kms_past_2_weeks vector using the names function.



In [45]:

    
# Answer here
names(total_kms_past_2_weeks) <- days_of_week

What is the total distance we ran across both weeks? Use the total_kms_past_2_weeks vector to arrive at your answer



In [46]:

    
# Answer here
sum(total_kms_past_2_weeks)









    Out[46]:




58.3

Vector element selection

Consider the total_kms_past_2_weeks vector. Let us say that we want the distance we ran on Wednesday across both weeks. Wednesday is the 3rd day of the week, so we pick the 3rd element from the vector like so:



In [47]:

    
total_kms_past_2_weeks[3]









    Out[47]:




Wednesday: 12

Note: R begins its indexing from 1 and not 0, unlike many other programming languages.
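To make the indexing rule concrete, here is a small sketch on a made-up vector (letters_vec is a hypothetical name). It shows what index 1, index 0 and a negative index do.

```r
letters_vec <- c("a", "b", "c", "d")

# Index 1 is the first element.
first <- letters_vec[1]
print(first)

# Index 0 does not raise an error; it returns an empty vector.
zero_index <- letters_vec[0]
print(length(zero_index))

# A negative index drops that element instead of selecting it.
without_first <- letters_vec[-1]
print(without_first)

# The last element sits at position length(letters_vec).
last <- letters_vec[length(letters_vec)]
print(last)
```

Note that a negative index means something different here than in languages where it counts from the end.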

What if we're interested in a section of the results, say our performance as the week comes to an end (Wednesday, Thursday, Friday)?

We can provide a vector of required indices like so:



In [48]:

    
total_kms_past_2_weeks[c(3,4,5)]









    Out[48]:





	Wednesday
		12
	Thursday
		12.4
	Friday
		12.5

But say we have 100 elements in the vector. It would soon become tedious to select a range, say 50-72 or 44-62, by listing every index. To solve this problem, R provides the range operator (:), which takes a starting number and an ending number and returns a vector containing every integer between them, inclusive. We can then use this vector to fetch the required elements.



In [49]:

    
# Let us look at just the range operator
1:5
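The 100-element case mentioned above can be sketched as follows. The vector readings is made up for illustration (the values 10, 20, ..., 1000), so that the selected elements are easy to check by eye.

```r
# A hypothetical vector with 100 elements: 10, 20, ..., 1000.
readings <- seq(10, 1000, by = 10)
print(length(readings))

# Select elements 50 through 72 without typing out every index.
middle <- readings[50:72]
print(length(middle))
print(middle[1])

# The range operator also counts downwards.
countdown <- 5:1
print(countdown)
```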

[ Exercise ]

Use the Range operator (:) to fetch the Monday - Wednesday section in the total_kms_past_2_weeks vector



In [53]:

    
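# Answer here
# R indexes from 1, so the first three elements are 1:3
# (0:3 returns the same elements because index 0 is silently ignored)
total_kms_past_2_weeks[1:3]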









    Out[53]:





	Monday
		10
	Tuesday
		11.4
	Wednesday
		12



In [52]:

    
0:3









    Out[52]:





	0
	1
	2
	3

Also, since we've given names to the vector elements, we can use those names to select elements instead of using indices.



In [54]:

    
names(total_kms_past_2_weeks) <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
total_kms_past_2_weeks["Wednesday"]
total_kms_past_2_weeks[c("Monday", "Tuesday")]









    Out[54]:




Wednesday: 12






    Out[54]:





	Monday
		10
	Tuesday
		11.4

We can also perform logical (comparison) operations on vectors. Let us check on which days in the last week we ran more than 5 kilometers.



In [55]:

    
kms_last_week <- c(4.0, 5.2, 6.0, 5.2, 5.0)

names(kms_last_week) <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")

days_more_than_5 <- kms_last_week > 5
days_more_than_5









    Out[55]:





	Monday
		FALSE
	Tuesday
		TRUE
	Wednesday
		TRUE
	Thursday
		TRUE
	Friday
		FALSE
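The comparison above gives one TRUE or FALSE per day. To answer "on how many days", a common approach is to sum the logical vector, since TRUE counts as 1 and FALSE as 0. The following is a sketch using the same data; n_days and day_positions are names introduced only for illustration.

```r
kms_last_week <- c(Monday = 4.0, Tuesday = 5.2, Wednesday = 6.0,
                   Thursday = 5.2, Friday = 5.0)
days_more_than_5 <- kms_last_week > 5

# TRUE counts as 1 and FALSE as 0, so sum() counts the matching days.
n_days <- sum(days_more_than_5)
print(n_days)

# which() returns the positions of the TRUE values (with their names).
day_positions <- which(days_more_than_5)
print(day_positions)
```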

We can use logical operations in combination with the vector to select only those elements from a vector that match a condition.

Now that days_more_than_5 holds a logical vector marking the days where we ran more than 5 kilometers, let us use it to select just those elements



In [56]:

    
kms_last_week[days_more_than_5]









    Out[56]:





	Tuesday
		5.2
	Wednesday
		6
	Thursday
		5.2



In [59]:

    
kms_last_week[kms_last_week<5]









    Out[59]:




Monday: 4

Factors

A lot of data is categorical, meaning that each value falls into one of a limited set of categories.

Let's start with something simple. Training for runs happens in 2 forms:

Interval based training, where you focus on speed
Distance based training, where the focus is on endurance

Take a vector that represents all the days we did intervals / distance in the last 10 days. There are some days where we rest as well.



In [60]:

    
running_style <- c("INT", "INT", "DIST", "DIST", "DIST", "REST", "INT", "DIST", "DIST", "DIST")
names(running_style) <- 1:10
running_style









    Out[60]:





	1
		'INT'
	2
		'INT'
	3
		'DIST'
	4
		'DIST'
	5
		'DIST'
	6
		'REST'
	7
		'INT'
	8
		'DIST'
	9
		'DIST'
	10
		'DIST'

As you see, we can divide our runs into categories. Factors are used to represent these categories. Let us use the factor function to create a factor variable out of this vector



In [61]:

    
running_style_f <- factor(running_style)

Once we have this, we can use the levels function to extract the different levels that R interprets for us.



In [62]:

    
levels(running_style_f)









    Out[62]:





	'DIST'
	'INT'
	'REST'

Perfect, this tells us that we have 3 levels, i.e. we indeed have 3 running styles.

We can confirm that running_style_f is indeed a factor variable by checking its underlying type with the class function



In [63]:

    
class(running_style_f)









    Out[63]:




'factor'

Once we have our levels, we can rename them to suit our needs by assigning to the levels function (very similar to the names function)



In [64]:

    
levels(running_style_f)
levels(running_style_f) <- c("Endurance", "Speed", "Rest")
levels(running_style_f)









    Out[64]:





	'DIST'
	'INT'
	'REST'








    Out[64]:





	'Endurance'
	'Speed'
	'Rest'

Having a factor also lets us use the summary function to count how many values fall into each level



In [65]:

    
summary(running_style_f)









    Out[65]:





	Endurance
		6
	Speed
		3
	Rest
		1

This quickly tells us that out of the 10 days, 6 were distance runs, 3 were interval runs and 1 was a rest day.
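It can help to look at what a factor stores internally. The sketch below rebuilds the factor from the original labels (before renaming the levels) and shows that each value is kept as an integer code pointing into the levels. It also shows table, another way to get the same counts as summary.

```r
running_style <- c("INT", "INT", "DIST", "DIST", "DIST",
                   "REST", "INT", "DIST", "DIST", "DIST")
running_style_f <- factor(running_style)

# How many distinct levels are there?
print(nlevels(running_style_f))

# Each value is stored as an integer code: 1 = first level, 2 = second, ...
codes <- as.integer(running_style_f)
print(codes)

# table() counts how often each level occurs.
counts <- table(running_style_f)
print(counts)
```

Levels are sorted alphabetically by default, which is why DIST gets code 1 even though INT appears first in the data.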

Types of factor variables

As mentioned, factors allow us to create categorical variables. These variables can be of 2 types:

Nominal
Ordinal

Nominal Variables

By default a factor is nominal, meaning that its categories are identified by name and have no assigned order. So comparing values with < or > is not meaningful



In [66]:

    
running_style_f <- factor(running_style)
running_style_f[1] > running_style_f[2]









    



Warning message:
In Ops.factor(running_style_f[1], running_style_f[2]): '>' not meaningful for factors





    Out[66]:





[1] NA

As you see, R issues a "'>' not meaningful for factors" warning and returns NA

Ordinal Variables

Passing ordered=TRUE to factor (R also accepts the abbreviation order=TRUE through partial argument matching) turns the factor into an ordinal variable, for which < and > are meaningful.

Consider the number of kilometers run in the past 10 days



In [ ]:

    
kms_run <- c(4.0, 5.2, 6.0, 5.2, 5.0, 6.0, 6.2, 6.0, 7.2, 7.5)



In [68]:

    
# Let's classify this into Long, Medium and Short runs manually
distance_type <- c("S", "M", "M", "M", "M", "M", "M", "M", "L", "L")

Now we can create a factor from this, but there is a natural order here: Short < Medium < Long. To introduce an order, we pass ordered=TRUE (which may be abbreviated to order=TRUE) along with the levels argument, listing the levels in the order we require.



In [69]:

    
# `ordered` is the full argument name; `order` also works through partial matching
distance_type_f <- factor(distance_type, ordered = TRUE, levels = c("S", "M", "L"))
distance_type_f









    Out[69]:





	S
	M
	M
	M
	M
	M
	M
	M
	L
	L

Now that we have an order in place, we can use < and >



In [71]:

    
distance_type_f[1]
distance_type_f[2]
distance_type_f[1] > distance_type_f[2]









    Out[71]:




S






    Out[71]:




M






    Out[71]:




FALSE
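With an order in place, comparisons also work across the whole factor at once, and functions like min and max become meaningful. The following is a small sketch on the same data; at_least_medium, longest and shortest are names introduced only for illustration.

```r
distance_type <- c("S", "M", "M", "M", "M", "M", "M", "M", "L", "L")
distance_type_f <- factor(distance_type, ordered = TRUE,
                          levels = c("S", "M", "L"))

# Confirm that the factor is ordered.
print(is.ordered(distance_type_f))

# Compare every element against a level in one go.
at_least_medium <- distance_type_f >= "M"
print(sum(at_least_medium))

# min() and max() respect the level order, not the alphabet.
longest <- max(distance_type_f)
shortest <- min(distance_type_f)
print(longest)
print(shortest)
```

Alphabetically "L" would come before "S", so the result of max here depends entirely on the levels we supplied.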

The real reason why factors are important will be covered in forthcoming sessions. This just introduces the concept and the necessity for it.

Data Frame

The Data Frame is R's most iconic type. Soon, you'll find out that a Data Frame is great to express all kinds of data.

Think of a Data Frame as a 2 dimensional structure having rows and columns. Each column may be of a different type, and each row can be thought of as representing one observation.

To quickly get started with data frames, let us use a built-in data frame in R that contains some data on cars. From the help:

The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).

This data is stored in mtcars. Let us look at it.



In [73]:

    
?mtcars









    Out[73]:





mtcars {datasets} R Documentation

Motor Trend Car Road Tests

Description

The data was extracted from the 1974 Motor Trend US magazine,
and comprises fuel consumption and 10 aspects of
automobile design and performance for 32 automobiles (1973–74
models).



Usage

mtcars


Format

A data frame with 32 observations on 11 variables.

    [, 1]  mpg   Miles/(US) gallon
    [, 2]  cyl   Number of cylinders
    [, 3]  disp  Displacement (cu.in.)
    [, 4]  hp    Gross horsepower
    [, 5]  drat  Rear axle ratio
    [, 6]  wt    Weight (1000 lbs)
    [, 7]  qsec  1/4 mile time
    [, 8]  vs    V/S
    [, 9]  am    Transmission (0 = automatic, 1 = manual)
    [,10]  gear  Number of forward gears
    [,11]  carb  Number of carburetors
  






Source

Henderson and Velleman (1981),
Building multiple regression models interactively.
Biometrics, 37, 391–411.



Examples

require(graphics)
pairs(mtcars, main = "mtcars data")
coplot(mpg ~ disp | as.factor(cyl), data = mtcars,
       panel = panel.smooth, rows = 1)


[Package datasets version 3.2.4 ]

[ Exercise ]

How will you find out what type mtcars is?



In [74]:

    
# Answer here
class(mtcars)









    Out[74]:




'data.frame'

As you can see, it contains data about cars, each row represents one particular car and its associated details.

One of the most important things when working with data frames, and with data science in general, is to spend time understanding the structure of the data. The structure, however, is independent of the individual values, so a glimpse of the data is enough to get started.

For this purpose, R provides 2 functions, head and tail, that allow us to peek at the beginning and the end of a data frame



In [75]:

    
# head
head(mtcars)









    Out[75]:





mpg cyl disp hp drat wt qsec vs am gear carb

	Mazda RX4 21 6 160 110 3.9 2.62 16.46 0 1 4 4
	Mazda RX4 Wag 21 6 160 110 3.9 2.875 17.02 0 1 4 4
	Datsun 710 22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
	Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
	Hornet Sportabout 18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
	Valiant 18.1 6 225 105 2.76 3.46 20.22 1 0 3 1



In [76]:

    
# tail
tail(mtcars)









    Out[76]:





mpg cyl disp hp drat wt qsec vs am gear carb

	Porsche 914-2 26 4 120.3 91 4.43 2.14 16.7 0 1 5 2
	Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2
	Ford Pantera L 15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
	Ferrari Dino 19.7 6 145 175 3.62 2.77 15.5 0 1 5 6
	Maserati Bora 15 8 301 335 3.54 3.57 14.6 0 1 5 8
	Volvo 142E 21.4 4 121 109 4.11 2.78 18.6 1 1 4 2



In [77]:

    
# Another way to get a quick glimpse of the data is to use the `str` function.
str(mtcars)









    



'data.frame':	32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

The str function, as you can see, shows us some useful details. It tells us:

The number of observations (rows) we have (32)
The number of variables (columns) in consideration (11)
Each column, with its data type and its first few entries

Another quick way to find out just the number of rows and columns is to use the nrow and ncol functions



In [78]:

    
# total number of rows
nrow(mtcars)









    Out[78]:




32



In [79]:

    
# total number of columns
ncol(mtcars)









    Out[79]:




11
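Both numbers can also be fetched in one call. As a small sketch, dim returns the row and column counts together, and names lists the column names of a data frame.

```r
# dim() returns the number of rows and columns as one vector.
dims <- dim(mtcars)
print(dims)

# names() lists the column names of a data frame.
column_names <- names(mtcars)
print(column_names)
```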

Creating Data Frames

Let us create our own Data Frame to better understand their underlying concepts.

Let's put together a bunch of vectors representing the different variables (columns) in our data frame



In [80]:

    
distance <- c(4.0, 5.2, 6.0, 5.2, 5.0, 6.0, 6.2, 6.0, 7.2, 7.5)
time_taken <- c(20.5, 28.0, 40.2, 24.1, 26.0, 42.0, 43.2, 40.1, 50.2, 50.7)
run_type <- c("S", "S", "E", "S", "S", "E", "E", "E", "E", "E") # S is speed; E is endurance
workout_after <- c(TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE)

Now that we have 4 vectors, we can create a data frame from these 4 vectors using the data.frame function



In [84]:

    
running_df <- data.frame(distance, time_taken, run_type, workout_after, 
                        stringsAsFactors = FALSE)



In [85]:

    
# Let's print this to see what we get
running_df









    Out[85]:





distance time_taken run_type workout_after

	1 4 20.5 S TRUE
	2 5.2 28 S FALSE
	3 6 40.2 E FALSE
	4 5.2 24.1 S FALSE
	5 5 26 S TRUE
	6 6 42 E TRUE
	7 6.2 43.2 E FALSE
	8 6 40.1 E FALSE
	9 7.2 50.2 E FALSE
	10 7.5 50.7 E FALSE

Quite similar to the data frame we saw earlier with cars. Let's work with this!

[ Exercise ]

Use the head, tail and str functions to inspect the data frame we just created (running_df)



In [86]:

    
# Answer here
head(running_df)
tail(running_df)
str(running_df)









    Out[86]:





distance time_taken run_type workout_after

	1 4 20.5 S TRUE
	2 5.2 28 S FALSE
	3 6 40.2 E FALSE
	4 5.2 24.1 S FALSE
	5 5 26 S TRUE
	6 6 42 E TRUE









    Out[86]:





distance time_taken run_type workout_after

	5 5 26 S TRUE
	6 6 42 E TRUE
	7 6.2 43.2 E FALSE
	8 6 40.1 E FALSE
	9 7.2 50.2 E FALSE
	10 7.5 50.7 E FALSE









    



'data.frame':	10 obs. of  4 variables:
 $ distance     : num  4 5.2 6 5.2 5 6 6.2 6 7.2 7.5
 $ time_taken   : num  20.5 28 40.2 24.1 26 42 43.2 40.1 50.2 50.7
 $ run_type     : chr  "S" "S" "E" "S" ...
 $ workout_after: logi  TRUE FALSE FALSE FALSE TRUE TRUE ...



In [87]:

    
?data.frame









    Out[87]:





data.frame {base} R Documentation

Data Frames

Description

This function creates data frames, tightly coupled
collections of variables which share many of the properties of
matrices and of lists, used as the fundamental data structure by most
of R's modeling software.



Usage

data.frame(..., row.names = NULL, check.rows = FALSE,
           check.names = TRUE,
           stringsAsFactors = default.stringsAsFactors())

default.stringsAsFactors()



Arguments


...

these arguments are of either the form value or
tag = value.  Component names are created based on the tag (if
present) or the deparsed argument itself.

row.names

NULL or a single integer or character string
specifying a column to be used as row names, or a character or
integer vector giving the row names for the data frame.

check.rows

if TRUE then the rows are checked for
consistency of length and names.

check.names

logical.  If TRUE then the names of the
variables in the data frame are checked to ensure that they are
syntactically valid variable names and are not duplicated.
If necessary they are adjusted (by make.names)
so that they are.

stringsAsFactors

logical: should character vectors be converted
to factors?  The ‘factory-fresh’ default is TRUE, but
this can be changed by setting options(stringsAsFactors = FALSE).




Details

A data frame is a list of variables of the same number of rows with
unique row names, given class "data.frame".  If no variables
are included, the row names determine the number of rows.

The column names should be non-empty, and attempts to use empty names
will have unsupported results.  Duplicate column names are allowed,
but you need to use check.names = FALSE for data.frame
to generate such a data frame.  However, not all operations on data
frames will preserve duplicated column names: for example matrix-like
subsetting will force column names in the result to be unique.

data.frame converts each of its arguments to a data frame by
calling as.data.frame(optional = TRUE).  As that is a
generic function, methods can be written to change the behaviour of
arguments according to their classes: R comes with many such methods.
Character variables passed to data.frame are converted to
factor columns unless protected by I or argument
stringsAsFactors is false.  If a list or data
frame or matrix is passed to data.frame it is as if each
component or column had been passed as a separate argument (except for
matrices of class "model.matrix" and those protected by
I).

Objects passed to data.frame should have the same number of
rows, but atomic vectors (see is.vector), factors and
character vectors protected by I will be recycled a
whole number of times if necessary (including as elements of list
arguments).

If row names are not supplied in the call to data.frame, the
row names are taken from the first component that has suitable names,
for example a named vector or a matrix with rownames or a data frame.
(If that component is subsequently recycled, the names are discarded
with a warning.)  If row.names was supplied as NULL or no
suitable component was found the row names are the integer sequence
starting at one (and such row names are considered to be
‘automatic’, and not preserved by as.matrix).

If row names are supplied of length one and the data frame has a
single row, the row.names is taken to specify the row names and
not a column (by name or number).

Names are removed from vector inputs not protected by I.

default.stringsAsFactors is a utility that takes
getOption("stringsAsFactors") and ensures the result is
TRUE or FALSE (or throws an error if the value is not
NULL).



Value

A data frame, a matrix-like structure whose columns may be of
differing types (numeric, logical, factor and character and so on).

How the names of the data frame are created is complex, and the rest
of this paragraph is only the basic story.  If the arguments are all
named and simple objects (not lists, matrices of data frames) then the
argument names give the column names.  For an unnamed simple argument,
a deparsed version of the argument is used as the name (with an
enclosing I(...) removed).  For a named matrix/list/data frame
argument with more than one named column, the names of the columns are
the name of the argument followed by a dot and the column name inside
the argument: if the argument is unnamed, the argument's column names
are used.  For a named or unnamed matrix/list/data frame argument that
contains a single column, the column name in the result is the column
name in the argument.  Finally, the names are adjusted to be unique
and syntactically valid unless check.names = FALSE.



Note

In versions of R prior to 2.4.0 row.names had to be
character: to ensure compatibility with such versions of R, supply
a character vector as the row.names argument.



References

Chambers, J. M. (1992)
Data for models.
Chapter 3 of Statistical Models in S
eds J. M. Chambers and T. J. Hastie, Wadsworth & Brooks/Cole.



See Also

I,
plot.data.frame,
print.data.frame,
row.names, names (for the column names),
[.data.frame for subsetting methods,
Math.data.frame etc, about
Group methods for data.frames;
read.table,
make.names.



Examples

L3 <- LETTERS[1:3]
fac <- sample(L3, 10, replace = TRUE)
(d <- data.frame(x = 1, y = 1:10, fac = fac))
## The "same" with automatic column names:
data.frame(1, 1:10, sample(L3, 10, replace = TRUE))

is.data.frame(d)

## do not convert to factor, using I() :
(dd <- cbind(d, char = I(letters[1:10])))
rbind(class = sapply(dd, class), mode = sapply(dd, mode))

stopifnot(1:10 == row.names(d))  # {coercion}

(d0  <- d[, FALSE])   # data frame with 0 columns and 10 rows
(d.0 <- d[FALSE, ])   # <0 rows> data frame  (3 named cols)
(d00 <- d0[FALSE, ])  # data frame with 0 columns and 0 rows


[Package base version 3.2.4 ]

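Note: When you run str on the data frame, notice that the run_type column is shown as chr (character). That is because we passed stringsAsFactors = FALSE when creating it. With stringsAsFactors = TRUE, which the help page above lists as the default for this version of R (the default became FALSE in R 4.0.0), run_type would have been converted to a factor.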
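The effect of stringsAsFactors can be checked directly. This sketch builds two tiny data frames from the same made-up vector and passes the argument explicitly each time, so the result does not depend on the default of your R version.

```r
run_type <- c("S", "S", "E")

# Keep strings as plain character data.
kept_as_character <- data.frame(run_type, stringsAsFactors = FALSE)
print(class(kept_as_character$run_type))

# Convert strings to a factor.
converted_to_factor <- data.frame(run_type, stringsAsFactors = TRUE)
print(class(converted_to_factor$run_type))
print(levels(converted_to_factor$run_type))
```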

Selecting elements

Rows and columns can be selected from a data frame in much the same way as elements of a vector, i.e. using [ and ]

Within [ and ] there are 2 parts: the row part and the column part, separated by a comma (,)



In [89]:

    
# Let us pick up the value in the 2nd row, 4th column
running_df[2, 4]
running_df









    Out[89]:




FALSE






    Out[89]:





distance time_taken run_type workout_after

	1 4 20.5 S TRUE
	2 5.2 28 S FALSE
	3 6 40.2 E FALSE
	4 5.2 24.1 S FALSE
	5 5 26 S TRUE
	6 6 42 E TRUE
	7 6.2 43.2 E FALSE
	8 6 40.1 E FALSE
	9 7.2 50.2 E FALSE
	10 7.5 50.7 E FALSE

[ Exercise ]

The row and column parts, much like the vector element selector, allow the use of the range operator (:). Select columns 1-2 for rows 5-9



In [90]:

    
# Answer here
running_df[5:9, 1:2]









    Out[90]:





distance time_taken

	5 5 26
	6 6 42
	7 6.2 43.2
	8 6 40.1
	9 7.2 50.2

R also makes it possible to omit either of the 2 parts inside [ and ], which selects everything along that dimension. The separating comma is still mandatory, though. So now

Select columns 1-2 for all rows



In [91]:

    
# Answer here
running_df[,1:2]









    Out[91]:





distance time_taken

	1 4 20.5
	2 5.2 28
	3 6 40.2
	4 5.2 24.1
	5 5 26
	6 6 42
	7 6.2 43.2
	8 6 40.1
	9 7.2 50.2
	10 7.5 50.7

Select all columns for rows 5-9



In [92]:

    
# Answer here 
running_df[5:9,]









    Out[92]:





distance time_taken run_type workout_after

	5 5 26 S TRUE
	6 6 42 E TRUE
	7 6.2 43.2 E FALSE
	8 6 40.1 E FALSE
	9 7.2 50.2 E FALSE

You can also use the name of the column to select instead of specifying the numbers



In [93]:

    
running_df[5:9, "distance"]









    Out[93]:





	5
	6
	6.2
	6
	7.2

There are times where we want to operate only on one column. We have, as of now, understood that there are 2 ways to do this:



In [94]:

    
# Using the index of the distance column
running_df[,1]



In [95]:

    
# Using the column name
running_df[, "distance"]

There's also a 3rd way, which you'll see used extensively throughout R code, and that uses the $ operator



In [96]:

    
running_df$distance

Note: When you select an individual column in any of these ways, the result is a vector and not a data.frame
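If you do want a single column to stay a data frame, one option is the drop = FALSE argument of [. The sketch below uses a shortened, made-up version of running_df to compare the results.

```r
# A shortened stand-in for running_df (first three rows only).
running_df <- data.frame(distance   = c(4.0, 5.2, 6.0),
                         time_taken = c(20.5, 28.0, 40.2))

# These two forms return a plain numeric vector.
as_vector_dollar  <- running_df$distance
as_vector_bracket <- running_df[, "distance"]
print(class(as_vector_dollar))

# drop = FALSE keeps the result as a one-column data frame.
still_df <- running_df[, "distance", drop = FALSE]
print(class(still_df))

# Leaving out the comma also keeps a data frame.
also_df <- running_df["distance"]
print(class(also_df))
```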

Working with subsets

Let us say that we want to select the first 4 rows of the data frame. We can do so by passing a logical vector like so



In [97]:

    
running_df[c(TRUE,TRUE,TRUE,TRUE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE),]









    Out[97]:





distance time_taken run_type workout_after

	1 4 20.5 S TRUE
	2 5.2 28 S FALSE
	3 6 40.2 E FALSE
	4 5.2 24.1 S FALSE

But this is tedious, so R gives us the subset function, which selects rows by a condition in a more readable fashion



In [99]:

    
# Let's pick out the distances for all the days where we did endurance runs.
(subset(running_df, subset = (run_type == "E")))$distance









    Out[99]:





	6
	6
	6.2
	6
	7.2
	7.5
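The subset call above is closely related to the logical-vector selection shown before it. As a sketch, the comparison run_type == "E" produces the logical vector for us, and it can be used with [ and ] or with subset to reach the same rows. Only two columns of running_df are rebuilt here to keep the example short.

```r
running_df <- data.frame(
  distance = c(4.0, 5.2, 6.0, 5.2, 5.0, 6.0, 6.2, 6.0, 7.2, 7.5),
  run_type = c("S", "S", "E", "S", "S", "E", "E", "E", "E", "E"),
  stringsAsFactors = FALSE
)

# Build the logical vector with a comparison instead of typing it out.
is_endurance <- running_df$run_type == "E"

# Use it in the row part of [ and ].
via_brackets <- running_df[is_endurance, ]

# subset() evaluates the condition inside the data frame for us.
via_subset <- subset(running_df, subset = run_type == "E")
print(via_subset)

# The select argument of subset() picks columns at the same time.
only_distance <- subset(running_df, run_type == "E", select = distance)
print(only_distance)
```

Inside subset, column names can be written directly, without the running_df$ prefix, which is what makes it more readable.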

[ Exercise ]

Pick out all those rows where we did endurance runs and worked out after the run. Use the subset function



In [100]:

    
# Answer here
subset(running_df, subset = (run_type == "E" & workout_after == TRUE))









    Out[100]:





distance time_taken run_type workout_after

	6 6 42 E TRUE

Ordering

Ordering helps us to understand our data better and helps with comparison.

The order function helps us do that in R. Consider a vector:



In [ ]:

    
some_alphabets <- c("h", "a", "q", "z", "n", "r")
some_alphabets



In [ ]:

    
# Let's call order on them and see what happens
order(some_alphabets)

It gives us a vector of indices: the positions of the elements of the original vector, listed in sorted order. Now let us use it to index the original vector.



In [ ]:

    
o <- order(some_alphabets)
some_alphabets[o]

We can also sort in the opposite order by passing the decreasing = TRUE argument to order.
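To make the index idea concrete, here is a small sketch on a numeric vector (lap_times is a hypothetical name with made-up ordering, using values from the time_taken column). It shows that order returns positions, that indexing with those positions sorts the vector, and how decreasing = TRUE reverses the result.

```r
lap_times <- c(43.2, 20.5, 50.7, 28)

# Positions of the elements from smallest to largest
idx <- order(lap_times)
idx

# Indexing with those positions gives the sorted values
ascending <- lap_times[idx]
ascending

# Positions from largest to smallest
idx_desc <- order(lap_times, decreasing = TRUE)
descending <- lap_times[idx_desc]
descending
```

When only the sorted values are needed, sort(lap_times) gives the same result as lap_times[order(lap_times)]. The indices from order become useful when they are applied to something else, such as the rows of a data frame.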

[ Exercise ]

Sort the some_alphabets vector in descending order.



In [ ]:

    
# Answer here

We can also use order on a column in a data frame. Order the distance column within our running_df data frame.



In [ ]:

    
# Answer here

From the existing running_df data frame, create a new data frame (running_df_ordered) that is ordered by the distance column



In [ ]:

    
# Answer here
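As a hint for this kind of exercise, the general pattern is to compute the row order from one column and then use it in the row position of the brackets. The sketch below uses a separate toy data frame (laps, a hypothetical name with illustrative values), not running_df, so the exercise itself is left to you.

```r
# Toy data frame for illustration only
laps <- data.frame(
  lap     = c("a", "b", "c", "d"),
  seconds = c(43.2, 20.5, 50.7, 28),
  stringsAsFactors = FALSE
)

# Row positions that would sort the seconds column
row_order <- order(laps$seconds)

# Use them as the row index; the empty column index keeps all columns
laps_ordered <- laps[row_order, ]
laps_ordered
```

Note that the original row names are carried along, so they show where each row came from in the unsorted data frame.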
