Data frames are basically the common tables you know from Excel or from anywhere on the internet. Usually a data.frame is the product of your long effort to preprocess and clean the data. To combine what we already know, data.frames are lists of vectors of the same size, with functionality to easily access rows of data across multiple vectors.
Data frame columns MUST have the same length - missing values are marked with NA (or NaN in numeric columns), while NULL cannot fill a gap, because it does not occupy a position in a vector. And similarly to the vector restriction, each column can hold only a single variable type.
In [32]:
set.seed(1)
age = sample(c(10:25), 25, replace = T)
gender = sample(c("male", "female"), 25, replace = T)
smoker = sample(c(T, F), 25, replace = T)
BMI = rnorm(25, 20, 2)
df = data.frame(age = age, gender = gender, smoker = smoker, BMI = BMI)
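The two rules above (equal column length, one type per column) can be checked with a small sketch. The vectors `ids` and `scores` and the data frame `ok_df` are hypothetical names made up for this illustration; they are not part of the data set built above.

```r
# Hypothetical example vectors, not taken from the data set above
ids = 1:3
scores = c(4.5, NA, 6.1)   # NA marks a missing value, the length stays 3

ok_df = data.frame(id = ids, score = scores)
nrow(ok_df)

# Columns of length 3 and 2 cannot be combined: data.frame() stops with an error,
# which tryCatch() turns into a short message here
result = tryCatch(
  data.frame(id = 1:3, score = c(4.5, 6.1)),
  error = function(e) "error: columns differ in length"
)
result

# NULL does not hold a place in a vector, so it cannot stand in for a missing value
length(c(4.5, NULL, 6.1))

# Mixing types in one vector coerces everything to a single type
mixed = c(1, "a", TRUE)
class(mixed)
```

If a value is unknown, NA is the safe placeholder, because it keeps the column the same length as the others.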
There are some simple functions to examine data.frames:
In [33]:
head(df)
In [34]:
summary(df)
In [35]:
nrow(df)
ncol(df)
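A few more base R functions are often handy for a first look at a data frame. The sketch below uses a small hypothetical data frame called `demo_df` so that it can be run on its own; the same calls should work on `df`.

```r
# Hypothetical three-row data frame, used only for illustration
demo_df = data.frame(age = c(17, 23, 25), smoker = c(TRUE, TRUE, FALSE))

str(demo_df)       # compact overview: type and first values of every column
dim(demo_df)       # number of rows and columns at once
names(demo_df)     # column names
tail(demo_df, 2)   # last two rows, the counterpart of head()
```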
Remember that each column is basically a vector. Therefore, if you select the vector, you can run any function on it. It is also important to know the different types of subsetting lists. Single [n] will select the n-th element of a list WITH the name of the list - therefore it doesn't return a vector per se, but a data.frame with one column. Double [[n]], on the other hand, returns the element itself, which here is the plain vector.
In [36]:
df[3]
df[[3]]
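The difference between the two bracket styles is easiest to see by asking R what class each result has. This is a minimal sketch on a hypothetical data frame `demo_df`, not on the `df` built earlier.

```r
# Hypothetical data frame, used only for illustration
demo_df = data.frame(age = c(17, 23, 25), smoker = c(TRUE, TRUE, FALSE))

one_col = demo_df[2]    # single brackets: still a data.frame with one column
vec = demo_df[[2]]      # double brackets: the column itself

class(one_col)   # "data.frame"
class(vec)       # "logical"

# Vector functions work directly on the extracted column
sum(vec)         # number of TRUE values, here 2
```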
Another way of selecting vectors is to follow the list way of selecting elements by name, using the $ operator. This selection is effectively the same as the selection with [[n]]. But remember that if you want to use the name of the column in brackets, you need to put a string there, as in [["smoker"]] (otherwise R will look for a variable called smoker).
In [37]:
df$smoker
df[["smoker"]]
df[["smoker"]] == df$smoker
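One practical consequence is that [[ ]] can take the column name from a variable, while $ always takes the name literally. The sketch below assumes a hypothetical data frame `demo_df` and a hypothetical variable `col_name`.

```r
# Hypothetical data frame, used only for illustration
demo_df = data.frame(age = c(17, 23, 25), smoker = c(TRUE, TRUE, FALSE))

col_name = "smoker"
demo_df[[col_name]]   # looks up the column whose name is stored in col_name
demo_df$col_name      # looks for a column literally named "col_name": NULL

# == compares element by element; identical() and all() give a single answer
identical(demo_df$smoker, demo_df[["smoker"]])
all(demo_df[["smoker"]] == demo_df$smoker)
```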
The data.frame's own way to select columns is the df[ROW, COLUMN] form. The column part accepts numbers as well as strings, and leaving the row part empty keeps all rows.
In [38]:
df[,3]
df[,"smoker"]
In [39]:
a = "BMI"
df[, a]
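The column part can also take several names at once. Note that selecting a single column this way returns a plain vector by default; the `drop = FALSE` argument of the data frame `[` method keeps the result a data frame. A short sketch on a hypothetical `demo_df`:

```r
# Hypothetical data frame, used only for illustration
demo_df = data.frame(age = c(17, 23, 25),
                     smoker = c(TRUE, TRUE, FALSE),
                     weight = c(65, 87, 74))

two_cols = demo_df[, c("age", "weight")]      # several columns: a data.frame
dropped = demo_df[, "smoker"]                 # one column: simplified to a vector
kept = demo_df[, "smoker", drop = FALSE]      # one column, kept as a data.frame

class(two_cols)
class(dropped)
class(kept)
```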
When we talk about subsetting data frames, we usually mean selecting rows while keeping all columns. But if you only want to keep some columns, use the techniques presented above.
There are many ways to subset a data frame. The first thing to realise is that a data frame is a list of vectors, therefore we can use functionality similar to what lists have. The df[ROW, COLUMN] form will also come in handy. If in doubt, go back to the variables lecture about lists.
Basically, we have two major ways of subsetting - using common indexing or using functions.
In [40]:
small_df = data.frame(age = c(17, 23, 25), smoker = c(T, T, F), weight = c(65, 87, 74))
With common indexing, you can select the second row in two ways: with a logical vector that is TRUE only at the second position, or with the numeric index 2.
In [41]:
small_df[c(F, T, F),]
small_df[2,]
In [42]:
age20smoker = which(df$age > 20 & df$smoker) # creating vector of indices
age20smoker
df[age20smoker,]
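One case where which() and a raw logical vector behave differently is missing data: which() silently drops NA, while indexing with a logical vector that contains NA produces a row full of NAs. The sketch below assumes a hypothetical data frame `demo_df` with one missing age.

```r
# Hypothetical data frame with a missing age, used only for illustration
demo_df = data.frame(age = c(17, NA, 25), smoker = c(TRUE, TRUE, TRUE))

cond = demo_df$age > 20 & demo_df$smoker
cond                  # FALSE NA TRUE

which(cond)           # 3, the NA position is skipped
by_index = demo_df[which(cond), ]   # one row
by_logical = demo_df[cond, ]        # two rows, one of them filled with NA

nrow(by_index)
nrow(by_logical)
```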
The logical vector style is much more common, but maybe a bit harder to wrap your head around. It keeps every position where the logical vector is TRUE and drops every position where it is FALSE.
In [43]:
numbers = 1:10
keep = rep(c(T,F), 5) # named keep rather than log, which would shadow the base function log()
numbers
keep
numbers[keep]
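In practice the logical vector is rarely typed by hand; it usually comes out of a comparison. A minimal sketch, where `is_odd` is a hypothetical name chosen for this example:

```r
numbers = 1:10

# The comparison returns one TRUE/FALSE per element
is_odd = numbers %% 2 == 1
is_odd

numbers[is_odd]   # keeps 1 3 5 7 9
sum(is_odd)       # TRUE counts as 1, so this is the number of matches: 5
```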
You can use a logical vector of the same length as the number of rows to select rows of a data frame in the same way.
In [44]:
age20smoker = age > 20 & smoker #creating logical vector
age20smoker
df[age20smoker,]
In [45]:
select_last = c(rep(F, 24), T)
select_last
df[select_last,]
In [46]:
df_smokers = df[smoker,]
df_smokers$BMI
mean(df_smokers$BMI)
In [47]:
zeny = gender == "female"
age22 = age > 22
zeny22 = zeny & age22
df[zeny22,]
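The examples above build the conditions from the standalone vectors `age`, `gender` and `smoker`, which works here because they were used unchanged to create `df`. If the data frame has since been reordered or filtered, it is probably safer to build the condition from the data frame's own columns. A sketch on a hypothetical `demo_df`:

```r
# Hypothetical data frame, used only for illustration
demo_df = data.frame(age = c(21, 24, 23),
                     gender = c("female", "female", "male"))

# Condition built from the columns of the data frame itself
older_women = demo_df[demo_df$gender == "female" & demo_df$age > 22, ]
older_women

# with() evaluates the expression inside the data frame and saves some typing
same_rows = demo_df[with(demo_df, gender == "female" & age > 22), ]
```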
In [48]:
# maximal BMI among male non-smokers younger than 24
males = gender == "male"
age24 = age < 24
nonsmoker = !smoker
male24nonsmoker = males & age24 & nonsmoker
df[male24nonsmoker,]$BMI
max(df[male24nonsmoker,]$BMI) # the maximum asked for in the comment above
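The function-based way of subsetting mentioned earlier can be sketched with base R's subset(), which lets you write the condition with bare column names. This is only one possible function for the job, and the data frame `demo_df` below is hypothetical.

```r
# Hypothetical data frame, used only for illustration
demo_df = data.frame(age = c(19, 23, 25, 22),
                     gender = c("male", "male", "male", "female"),
                     smoker = c(FALSE, FALSE, FALSE, TRUE),
                     BMI = c(21.3, 19.8, 24.0, 20.5))

# Column names are looked up inside demo_df, no demo_df$ prefix needed
picked = subset(demo_df, gender == "male" & age < 24 & !smoker)
picked

max(picked$BMI)   # 21.3
```

The indexing form and subset() should select the same rows when there are no missing values; subset() additionally drops rows where the condition is NA.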