Copyright (C) 2017 J. Patrick Hall, jphall@gwu.edu
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
In [3]:
x <- 'Hello World!'
print(x)
cat(x)
x
class(x) <- 'some.class' # '.' is just a character, it does not denote object membership
print(x)
cat(x)
In [4]:
# Evaluating an object by name, with no function or operator, also prints it to the console
x
R has thousands of libraries, often called packages, for many different purposes
Packages are:
Installed using the install.packages() function or a GUI command
Loaded using the library() function, after being installed
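As a minimal sketch of the install-then-load workflow described above, the helper below reports which packages are not yet installed, so that install.packages() is only needed for those. The name missing_pkgs is hypothetical and not part of the original notebook; the check relies on base R's requireNamespace().

```r
# Hypothetical helper: return the subset of `pkgs` that is not installed.
missing_pkgs <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_pkgs(c('dplyr', 'ggplot2'))
if (length(to_install) > 0) {
  message('Not installed: ', paste(to_install, collapse = ', '))
  # install.packages(to_install)  # uncomment to install from CRAN
}
```

Leaving the install.packages() call commented out keeps the check free of side effects; whether to install automatically is a judgment call for each project.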
In [5]:
library(dplyr) # popular package for data wrangling with consistent syntax
library(ggplot2) # popular package for plotting with consistent syntax
In [6]:
# '<-' is the preferred assignment operator in R
# '/' is the safest directory separator character to use
git_dir <- 'C:/path/to/GWU_data_mining/01_basic_data_prep/src/notebooks/r'
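A hedged alternative to typing separators by hand is base R's file.path(), which joins path components with '/' on every platform. The directory below is a hypothetical placeholder used only for illustration, not the notebook's real location.

```r
# Hypothetical directory, used only for illustration
example_dir <- file.path('path', 'to', 'project')
example_dir

# Append a file name to the directory
example_file <- file.path(example_dir, 'scratch.csv')
example_file
```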
In [7]:
setwd(git_dir)
getwd()
In [8]:
n_rows <- 1000
n_vars <- 2
In [9]:
key <- seq(n_rows)
In [10]:
key[1:5]
In [11]:
num_vars <- paste('numeric', seq_len(n_vars), sep = '')
num_vars
char_vars <- paste('char', seq_len(n_vars), sep = '')
char_vars
In [12]:
scratch_df <- data.frame(INDEX = key)
# head() displays the top of a data structure
head(scratch_df)
In [13]:
scratch_df[, num_vars] <- replicate(n_vars, runif(n_rows))
head(scratch_df)
sapply() applies a function to each element of a sequence of values
LETTERS is a built-in character vector containing the uppercase letters
replicate() builds n_vars columns, each holding n_rows elements sampled randomly, with replacement, from text_draw using the sample() function
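The behavior of these functions can be sketched on tiny inputs before running the full-size cells. The object names squares and draws are illustrative assumptions, not part of the original notebook.

```r
# sapply() applies the function to each element and simplifies the result
squares <- sapply(1:3, function(i) i^2)
squares

# replicate() repeats an expression; equal-length results become matrix columns
draws <- replicate(2, sample(c('AAAAAAAA', 'BBBBBBBB'), 5, replace = TRUE))
dim(draws)  # 5 rows, 2 columns
```

Because replicate() returns one matrix column per repetition, its result can be assigned to several data frame columns at once.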
In [14]:
text_draw <- sapply(LETTERS[1:7],
FUN = function(x) paste(rep(x, 8), collapse = ""))
text_draw
In [15]:
scratch_df[, char_vars] <- replicate(n_vars,
sample(text_draw, n_rows, replace = TRUE))
head(scratch_df)
dplyr is a popular, intuitive, and efficient package for manipulating data sets
In [16]:
scratch_tbl <- as_tibble(scratch_df) # tbl_df() is deprecated in recent dplyr versions
In [17]:
glimpse(scratch_tbl)
In [18]:
ggplot(scratch_tbl, aes(numeric1)) +
geom_bar(stat = "bin", fill = "blue", bins = 100) +
ggtitle('Histogram of Numeric1')
In [19]:
ggplot(scratch_tbl, aes(char1)) +
geom_bar(aes(fill=char1)) +
ggtitle('Histogram of Char1') +
coord_flip()
Subset a range of variables with similar names and numeric suffixes
In [20]:
num_vars <- select(scratch_tbl, num_range('numeric', 1:n_vars))
head(num_vars)
Subset all the variables whose names begin with 'char'
In [21]:
char_vars <- select(scratch_tbl, starts_with('char'))
head(char_vars)
Subset variables by their names
In [22]:
mixed_vars <- select(scratch_tbl, one_of('numeric1', 'char1'))
head(mixed_vars)
Subset/slice rows using their numeric indices
In [23]:
some_rows <- slice(scratch_tbl, 1:10)
some_rows
Subset top rows based on the value of a certain variable
In [24]:
sorted_top_rows <- top_n(scratch_tbl, 10, numeric1)
sorted_top_rows
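Note that top_n() selects the rows with the largest values but does not sort them. A minimal sketch on a hypothetical toy table, assuming dplyr 1.0.0 or later, where slice_max() is available and returns the selected rows in descending order:

```r
library(dplyr)

# Hypothetical toy data, not taken from the notebook
toy <- data.frame(id = 1:5, score = c(0.2, 0.9, 0.5, 0.7, 0.1))

# Keep the 3 rows with the largest score, sorted from largest to smallest
top3 <- slice_max(toy, score, n = 3)
top3
```

An equivalent that also works in older dplyr versions is to sort first with arrange(toy, desc(score)) and then take the leading rows with slice().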
Subset rows where a certain variable has a certain value
In [25]:
AAAAAAAA_rows <- filter(scratch_tbl, char1 == 'AAAAAAAA')
head(AAAAAAAA_rows)
base::transform enables the creation of new variables from existing variables; it is a base R function, not part of dplyr
In [26]:
scratch_tbl2 <- transform(scratch_tbl,
new_numeric = round(numeric1, 1))
head(scratch_tbl2)
dplyr::mutate enables the creation of new variables from existing variables and from variables computed earlier in the same call
In [27]:
scratch_tbl2 <- mutate(scratch_tbl,
new_numeric = round(numeric1, 1),
new_numeric2 = new_numeric * 10)
head(scratch_tbl2)
dplyr::transmute enables the creation of new variables from existing variables and computed variables, but keeps only the newly created variables
In [28]:
scratch_tbl2 <- transmute(scratch_tbl,
new_numeric = round(numeric1, 1),
new_numeric2 = new_numeric * 10)
head(scratch_tbl2)
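The difference between the three functions can be sketched on a hypothetical two-row table. The names toy, demo_rounded, and demo_scaled are illustrative assumptions; the transform() call is expected to fail only as long as no object called demo_rounded exists in the workspace.

```r
library(dplyr)

toy <- data.frame(numeric1 = c(0.123, 0.456))

# mutate() keeps existing columns and can reuse a column created in the same call
m <- mutate(toy,
            demo_rounded = round(numeric1, 1),
            demo_scaled = demo_rounded * 10)
names(m)

# transmute() computes the same columns but drops the original ones
tm <- transmute(toy,
                demo_rounded = round(numeric1, 1),
                demo_scaled = demo_rounded * 10)
names(tm)

# transform() (a base R function) evaluates all arguments against the original
# data, so a column created in the same call is not visible to later arguments
base_result <- tryCatch(
  transform(toy,
            demo_rounded = round(numeric1, 1),
            demo_scaled = demo_rounded * 10),
  error = function(e) 'error'
)
base_result
```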
In [29]:
# one sort var: char1
sorted <- arrange(char_vars, char1)
head(sorted)
In [30]:
# two sort vars: char1, char2
sorted2 <- arrange(char_vars, char1, char2)
head(sorted2)
In [31]:
bindr <- bind_rows(sorted, sorted2) # stack the two tables vertically
nrow(bindr) # nrow - number of rows
In [32]:
bindc <- bind_cols(sorted, sorted2)
ncol(bindc) # ncol - number of columns
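What the two binding functions do to a table's shape can be sketched on small hypothetical tables (a, b, and extra are illustrative names, not from the notebook):

```r
library(dplyr)

a <- data.frame(id = 1:2, x = c('a', 'b'), stringsAsFactors = FALSE)
b <- data.frame(id = 3:4, x = c('c', 'd'), stringsAsFactors = FALSE)
extra <- data.frame(y = c(10, 20))

# bind_rows() stacks tables vertically, matching columns by name
stacked <- bind_rows(a, b)
dim(stacked)  # 4 rows, 2 columns

# bind_cols() places tables side by side, matching rows by position only
side <- bind_cols(a, extra)
dim(side)  # 2 rows, 3 columns
```

Because bind_cols() matches rows purely by position, it is only safe when both tables are known to be in the same row order; otherwise a join on a key is the more cautious choice.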
In [33]:
sorted_left <- arrange(select(scratch_tbl, one_of('INDEX', 'char1')), char1)
right <- select(scratch_tbl, one_of('INDEX', 'numeric1'))
In [34]:
joined <- left_join(sorted_left, right, by = 'INDEX')
head(joined)
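A left join keeps every row of the left table, in its original order, and fills unmatched keys with NA. A minimal sketch with hypothetical toy tables (toy_left and toy_right are illustrative names):

```r
library(dplyr)

toy_left <- data.frame(INDEX = c(3, 1, 2), char1 = c('C', 'A', 'B'),
                       stringsAsFactors = FALSE)
toy_right <- data.frame(INDEX = c(1, 2), numeric1 = c(0.1, 0.2))

# INDEX 3 has no match in toy_right, so its numeric1 value becomes NA
toy_joined <- left_join(toy_left, toy_right, by = 'INDEX')
toy_joined
```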
In [35]:
test <- select(scratch_tbl, one_of('INDEX', 'numeric1', 'char1'))
In [36]:
print(all.equal(joined, test, ignore_row_order = FALSE))
In [37]:
print(all.equal(joined, test, ignore_col_order = FALSE))
In [38]:
print(all.equal(joined, test))
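The ignore_row_order and ignore_col_order arguments belong to dplyr's table-specific comparison and may not be available in every dplyr version. As a version-independent sketch, two tables can be put into a canonical row and column order and then compared with base R's all.equal(). The helper name normalize_tbl and the toy tables are hypothetical, and the sketch assumes a unique INDEX key like the one used in this notebook.

```r
# Hypothetical helper: sort rows by INDEX and columns by name
normalize_tbl <- function(df) {
  df <- as.data.frame(df, stringsAsFactors = FALSE)
  df <- df[order(df$INDEX), sort(names(df)), drop = FALSE]
  rownames(df) <- NULL
  df
}

tbl_a <- data.frame(INDEX = c(2, 1, 3), char1 = c('b', 'a', 'c'),
                    stringsAsFactors = FALSE)
tbl_b <- data.frame(char1 = c('a', 'b', 'c'), INDEX = c(1, 2, 3),
                    stringsAsFactors = FALSE)

# Same content, different row and column order
same_content <- isTRUE(all.equal(normalize_tbl(tbl_a), normalize_tbl(tbl_b)))
same_content
```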
In [39]:
ave <- summarise(num_vars, avg = mean(numeric1)) # avg is the name of the new variable
ave
In [40]:
all_aves <- summarise(num_vars, across(everything(), mean)) # across() applies mean to every column; summarise_each() and funs() are deprecated
all_aves
In [41]:
grouped <- group_by(joined, char1)
In [42]:
grouped <- summarise(grouped, avg = mean(numeric1)) # avg is the name of the new variable
grouped
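The group-then-summarise pattern is easiest to check on a table small enough to verify by hand. The toy data below is hypothetical:

```r
library(dplyr)

toy <- data.frame(char1 = c('A', 'A', 'B'), numeric1 = c(1, 3, 10),
                  stringsAsFactors = FALSE)

# One output row per value of char1; avg is the mean of numeric1 in that group
toy_means <- summarise(group_by(toy, char1), avg = mean(numeric1))
toy_means
```

Group A averages 1 and 3, and group B contains the single value 10, so the result has two rows.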
In [43]:
transposed <- t(scratch_tbl) # t() returns a matrix; mixed column types are coerced to character
glimpse(transposed)
Often, instead of simply transposing, a data set needs to be reshaped with a melt/stack, column split, and cast sequence, as described in Hadley Wickham's Tidy Data: https://www.jstatsoft.org/article/view/v059i10
See also tidyr::gather() and tidyr::spread(), which live in the tidyr package rather than dplyr and are superseded by tidyr::pivot_longer() and tidyr::pivot_wider()
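A minimal sketch of the melt and cast steps, assuming the tidyr package (version 1.0.0 or later) is installed and using a hypothetical toy table that mirrors the notebook's column names:

```r
library(tidyr)

toy <- data.frame(INDEX = 1:2,
                  numeric1 = c(0.1, 0.2),
                  numeric2 = c(0.3, 0.4))

# Melt/stack: one row per INDEX and variable combination
long <- pivot_longer(toy, cols = c(numeric1, numeric2),
                     names_to = 'variable', values_to = 'value')
long

# Cast: spread the variable names back out into columns
wide <- pivot_wider(long, names_from = variable, values_from = value)
wide
```

The round trip returns the original layout, which is a quick way to confirm that a reshape lost no information.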
In [44]:
# export
filename <- paste(git_dir, 'scratch.csv', sep = '/')
write.table(scratch_tbl, file = filename, quote = FALSE, sep = ',',
row.names = FALSE)
In [45]:
# import
import <- read.table(filename, header = TRUE, sep = ',')
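A quick way to confirm that an export and import pair preserves the data is a round trip through a temporary file. The sketch below uses a hypothetical toy table and tempfile(), so it does not touch git_dir.

```r
toy <- data.frame(INDEX = 1:3,
                  char1 = c('AAAAAAAA', 'BBBBBBBB', 'CCCCCCCC'),
                  stringsAsFactors = FALSE)

# Write to a temporary CSV file with the same options as the export cell
tmp <- tempfile(fileext = '.csv')
write.table(toy, file = tmp, quote = FALSE, sep = ',', row.names = FALSE)

# Read it back and compare with the original
back <- read.table(tmp, header = TRUE, sep = ',', stringsAsFactors = FALSE)
round_trip_ok <- isTRUE(all.equal(toy, back))
round_trip_ok
```

Note that quote = FALSE is only safe when no character value contains the separator; data with embedded commas would need quoting enabled.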