Copyright (C) 2017 J. Patrick Hall, jphall@gwu.edu
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data.table
is an efficient package for manipulating data sets
data.table
is dt[i, j, by]
i
is row index, indexed from 1 ...j
is col index, indexed from 1 ...
In [1]:
library(data.table)
In [2]:
# '<-' is the preferred assignment operator in R
# '/' is the safest directory separator character to use
git_dir <- 'C:/path/to/GWU_data_mining/01_basic_data_prep/src/notebooks/r'
In [3]:
setwd(git_dir)
getwd()
In [4]:
n_rows <- 1000
n_vars <- 2
In [5]:
key <- seq(n_rows)
In [6]:
key[1:5]
In [7]:
num_vars <- paste('numeric', seq_len(n_vars), sep = '')
num_vars
char_vars <- paste('char', seq_len(n_vars), sep = '')
char_vars
In [8]:
text_draw <- sapply(LETTERS[1:7],
FUN = function(x) paste(rep(x, 8), collapse = ""))
text_draw
In [9]:
scratch_dt <- data.table(key,
replicate(n_vars, runif(n_rows)),
replicate(n_vars, sample(text_draw, n_rows,
replace = TRUE)))
The data.table::set*
family of methods in data.table always updates items by reference for efficiency
In [10]:
setnames(scratch_dt, c('key', num_vars, char_vars))
Set option to print first 5 and last 5 rows of data.table
by default
In [11]:
options(datatable.print.topn=5)
print(scratch_dt)
In [12]:
scratch_dt[,plot(numeric1, numeric2)]
Selecting a single column results in a vector
In [13]:
class(scratch_dt[,char1])
length(scratch_dt[,char1])
Specifying multiple columns by a vector results in a concatenated vector
In [14]:
class(scratch_dt[,c(numeric1, char1)])
length(scratch_dt[,c(numeric1, char1)])
Specifying multiple columns by list results in a data.table
In [15]:
class(scratch_dt[,list(numeric1, char1)])
print(scratch_dt[,list(numeric1, char1)])
'.' is an alias for 'list'
In [16]:
class(scratch_dt[,.(numeric1, char1)] )
print(scratch_dt[,.(numeric1, char1)])
Compute a standalone vector and display first five elements
In [17]:
scratch_dt[1:5, round(numeric1, 1)] # compute standalone vector
Compute a new column with assigned name
In [18]:
print(scratch_dt[, .(new_numeric = round(numeric1, 1))]) # assign name
Use numeric indices (or 'slicing')
In [19]:
scratch_dt[3:5]
In [20]:
scratch_dt[3:5,] # comma is optional
Use variable values
In [21]:
print(scratch_dt[char1 == 'DDDDDDDD'])
In [22]:
print(scratch_dt[char1 %in% c('DDDDDDDD', 'EEEEEEEE')])
In [23]:
scratch_dt[.N]
In [24]:
scratch_dt[,.N]
data.table::setorder
reorders columns by reference
In [25]:
sorted <- setorder(scratch_dt, char1)
print(sorted)
When used in data.table
, order()
also reorders columns by reference
In [26]:
sorted <- scratch_dt[order(char1)]
print(sorted)
Sort orders can be specified by using order()
In [27]:
sorted2 <- scratch_dt[order(char1, -numeric1)]
print(sorted2)
In [28]:
sorted3 <- setkey(scratch_dt, key)
print(sorted3)
In [29]:
scratch_dt2 <- scratch_dt[key > 500, char1 := 'ZZZZZZZZ']
print(scratch_dt2)
In [30]:
head(scratch_dt2[, new_numeric := round(numeric1, 1)])
Use data.table::rbindlist
to stack data.tables
vertically
In [31]:
bindr <- rbindlist(list(sorted, sorted2))
nrow(bindr)
data.table::merge
joins tables side-by-side using a common key (or 'by') variable
In [32]:
joined1 <- merge(sorted, sorted2, by = c('key'))
print(joined1)
Add a key to the scratch_dt2
table
In [33]:
scratch_dt2 <- setkey(scratch_dt2[,.(key, char1, new_numeric)], key)
print(scratch_dt2)
Now sorted3
and scratch_dt2
can be joined without specifiying a key
In [34]:
joined2 <- merge(sorted3, scratch_dt2)
print(joined2)
In [35]:
scratch_dt2[, sum(new_numeric), by = char1]
scratch_dt2[1:500, sum(new_numeric), by = char1]
In [36]:
scratch_dt2[, .N, by = char1]
In [37]:
two_by_vars <- scratch_dt[, mean(new_numeric), by = .(char1, char2)]
two_by_vars[order(char1, char2)]
In [38]:
scratch_dt2[, lapply(.SD, base::sum), by = char1]
In [39]:
scratch_dt2[, .SD[c(1, .N)], by = char1]
In [40]:
# chaining - one line of code
scratch_dt2[, .(new_numeric2 = sum(new_numeric)), by = char1][new_numeric2 > 40]
In [41]:
# no chaining - two lines of code
scratch_dt3 <- scratch_dt2[, .(new_numeric2 = sum(new_numeric)), by = char1]
scratch_dt3[new_numeric2 > 40]
In [42]:
transposed = t(scratch_dt)
str(transposed)
Often, instead of simply transposing, a data set will need to be reformatted in a melt/stack - column split - cast action described in Hadley Wickham's Tidy Data: https://www.jstatsoft.org/article/view/v059i10
See also dcast.data.table and melt.data.table
data.table::fread
and data.table::fwrite
allow for optimized file i/o
fwrite
only availabe in data.table version > 1.9.7
In [42]:
# use fwrite to write a file
fwrite(scratch_dt, 'scratch.csv')
In [43]:
# use fread to read a file
scratch_dt <- fread('scratch.csv')
print(scratch_dt)