In this exercise, we work with a data set from the World Values Survey, which provides extensive documentation on the questions in the survey.
We focus on question V105: "I'd like to ask you how much you trust people from various groups. Could you tell me for each whether you trust people from this group completely, somewhat, not very much or not at all? People you meet for the first time"
In [1]:
library(foreign)
In [2]:
dataset = read.dta("WV6_Data_Turkey_2012_Stata_v20180912.dta")
In [3]:
head( dataset )
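Before any recoding, it is worth inspecting the answer categories of V105. A minimal sketch (assuming read.dta has converted the labelled Stata variable into a factor, which is its default behaviour):
In [ ]:
## inspect the answer categories and their counts before recoding
levels( dataset$V105 )
summary( dataset$V105 )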
In [4]:
## here we dichotomize the variable to help in later analysis
## mark the missing-value codes as NA
dataset$V105[ as.integer( dataset$V105 ) > 4 ] <- NA
dataset$V105[ as.integer( dataset$V105 ) <= 2 ] <- NA ## drop the first two answer categories, leaving binary data
dataset$V105 <- droplevels( dataset$V105 )
dataset <- dataset[ complete.cases( dataset ), ]
summary( dataset$V105 )
To control the quality of our data analysis, we split the classified data into two groups. The first one, the training data, is used to develop and train the model. The second split is the testing data, which we use to explore how well we trained the model. We never use the testing data when training the model, so that we can evaluate the accuracy of any model by showing it unseen data.
In [5]:
create_train_test <- function(data, size = 0.8, train = TRUE) {
    n_row <- nrow(data)
    ## number of rows that go into the training split
    total_row <- round(size * n_row)
    ## note that rows are taken in their original order; shuffle first if the data is sorted
    train_sample <- 1:total_row
    if (train) {
        return(data[train_sample, ])
    } else {
        return(data[-train_sample, ])
    }
}
In [6]:
train <- create_train_test( dataset, train = TRUE )
test <- create_train_test( dataset, train = FALSE )
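As a quick sanity check (a small sketch, not part of the analysis itself), we can verify that the two splits follow the requested 80/20 proportions:
In [ ]:
## proportion of rows in each split
nrow( train ) / nrow( dataset )
nrow( test ) / nrow( dataset )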
Decision trees help in data classification by exploring, step by step, which variables best predict belonging to some category. The result can be drawn as a tree-like visualization. They work best with binary variables whose classes are roughly equal in size.
In [7]:
library(rpart)
library(rpart.plot)
In [8]:
model_rpart <- rpart( V105 ~ V10 + V20 + V30 + V40 + V50 + V60 + V70 + V80 + V90 + V100 + V242, data = train, method = "class")
rpart.plot( model_rpart )
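To see how well the tree generalizes, we can apply it to the testing data. This is a minimal sketch using base R's table to cross-tabulate predicted and observed answers; type = "class" makes predict return the predicted category:
In [ ]:
## predict classes for the unseen testing data and cross-tabulate
pred_rpart <- predict( model_rpart, test, type = "class" )
table( predicted = pred_rpart, observed = test$V105 )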
Support vector machines are similarly used to classify observations based on a set of variables. Note how you can explore the importance of individual variables using varImp.
You can also use more advanced techniques to improve the model's predictions by cross-validating during the analysis itself -- not only at the end, when comparing results from the training and testing data. This means that the model is fitted several times on different splits (folds) of the dataset, and these fits are used together to select the best model.
In [9]:
library(caret)
In [10]:
## here we set up the cross-validation scheme
tc <- trainControl(
method = "repeatedcv",
returnResamp = "all",
number = 2,
repeats = 2
)
In [12]:
model_svm <- train( V105 ~ V10 + V20 + V30 + V40 + V50 + V60 + V70 + V80 + V90 + V100 + V242, data=train,
method="svmLinear", trControl = tc)
In [13]:
varImp( model_svm, scale=TRUE )
In [14]:
varImp( model_svm, scale=FALSE )
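Because we asked trainControl to return all resamples (returnResamp = "all"), the per-fold performance is stored in the fitted object. A minimal sketch for inspecting it:
In [ ]:
## accuracy and kappa for each individual fold and repeat
model_svm$resample
## performance aggregated across all resamples
model_svm$results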
In [ ]:
model_rf <- train( V105 ~ V10 + V20 + V30 + V40 + V50 + V60 + V70 + V80 + V90 + V100 + V242, data=train, method="rf")
In [ ]:
plot( model_rf )
In [ ]:
p <- predict( model_rf, test )
confusionMatrix( p, test$V105 )
In [ ]:
hist( train$V242, xlab = "Age", main = "Age of respondents" )
In [ ]:
model_lasso <- train( V242 ~ V10 + V20 + V30 , data=train, method="lmStepAIC") ## note: despite the name, lmStepAIC fits a stepwise linear model selected by AIC, not a lasso
In [ ]:
summary( model_lasso )
In [ ]:
test_lasso <- predict( model_lasso, test )
In [ ]:
cor( test_lasso, test$V242 )
In [ ]:
plot( test_lasso, test$V242 )
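A perfect prediction would place every point on the diagonal, so adding a reference line makes the plot easier to read. A minimal sketch:
In [ ]:
## redraw the scatter plot with a 45-degree reference line
plot( test_lasso, test$V242, xlab = "Predicted age", ylab = "Observed age" )
abline( 0, 1, lty = 2 )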