Getting the data

In this exercise, we work with a data set from the World Values Survey, which provides extensive documentation on the questions in the survey.

We focus on question V105: "I'd like to ask you how much you trust people from various groups. Could you tell me for each whether you trust people from this group completely, somewhat, not very much or not at all? People you meet for the first time"


In [1]:
library(foreign)

In [2]:
dataset = read.dta("WV6_Data_Turkey_2012_Stata_v20180912.dta")


Warning message in read.dta("WV6_Data_Turkey_2012_Stata_v20180912.dta"):
“value labels (‘V243_AU’) for ‘V243_AU’ are missing”
Warning message in read.dta("WV6_Data_Turkey_2012_Stata_v20180912.dta"):
“value labels (‘V244_AU’) for ‘V244_AU’ are missing”
Warning message in read.dta("WV6_Data_Turkey_2012_Stata_v20180912.dta"):
“value labels (‘V258A’) for ‘V258A’ are missing”
Warning message in read.dta("WV6_Data_Turkey_2012_Stata_v20180912.dta"):
“value labels (‘SECVALWGT’) for ‘SECVALWGT’ are missing”

In [3]:
head( dataset )


A data.frame: 6 × 430 (20 of the 430 columns shown)

V1         V2      V2A     V3     V4                V5                V6                V7                    V8                    V9
<fct>      <fct>   <fct>   <int>  <fct>             <fct>             <fct>             <fct>                 <fct>                 <fct>
2010-2013  Turkey  Turkey  1      Very important    Rather important  Very important    Not at all important  Dont know             Very important
2010-2013  Turkey  Turkey  2      Very important    Rather important  Rather important  Not at all important  Very important        Very important
2010-2013  Turkey  Turkey  3      Very important    Very important    Very important    Very important        Very important        Very important
2010-2013  Turkey  Turkey  4      Very important    Very important    Rather important  Rather important      Not very important    Very important
2010-2013  Turkey  Turkey  5      Very important    Very important    Very important    Very important        Very important        Very important
2010-2013  Turkey  Turkey  6      Rather important  Rather important  Very important    Not at all important  Not at all important  Very important

I_ABORTLIB  I_DIVORLIB  CHOICE    WEIGHT3B  I_VOICE1  I_VOICE2  I_VOI2_00  VOICE  WEIGHT4B  COW
<dbl>       <dbl>       <dbl>     <dbl>     <dbl>     <dbl>     <dbl>      <dbl>  <dbl>     <fct>
0.222222    0.888889    0.370370  1         0.00      0.5       0.250      0.250  1         TUR Turkey
0.000000    0.000000    0.000000  1         0.66      0.5       0.580      0.580  1         TUR Turkey
0.555556    0.444444    0.444444  1         1.00      0.0       0.500      0.500  1         TUR Turkey
0.000000    0.222222    0.074074  1         0.66      0.0       0.330      0.330  1         TUR Turkey
0.000000    0.000000    0.000000  1         0.00      0.0       0.000      0.000  1         TUR Turkey
0.000000    0.444444    0.148148  1         0.33      0.5       0.415      0.415  1         TUR Turkey

In [4]:
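## here we dichotomize the variable to simplify the analysis
## factor levels above 4 are missing-value codes: mark them NA
dataset$V105[ as.integer( dataset$V105 ) > 4 ] <- NA
## also set levels 1 and 2 to NA so that only two categories remain (binary outcome)
dataset$V105[ as.integer( dataset$V105 ) <= 2 ] <- NA
## remove the factor levels that are no longer in use
dataset$V105 <- droplevels( dataset$V105 )

## keep only rows that have no missing value in any column
dataset <- dataset[ complete.cases( dataset ), ]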

summary( dataset$V105 )


Do not trust very much     Do not trust at all
                   746                     498
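
To see what the recoding in In [4] does, here is a minimal sketch on a small made-up factor. The level names and their order are assumptions for illustration only; the actual levels of V105 in the Stata file may differ, so it is worth checking levels( dataset$V105 ) before relying on integer positions.

```r
## hypothetical toy factor; the level order is an assumption, not taken from the survey file
lv <- c("Trust completely", "Trust somewhat",
        "Do not trust very much", "Do not trust at all", "No answer")
x <- factor(c("Trust somewhat", "Do not trust at all", "No answer",
              "Do not trust very much", "Trust completely",
              "Do not trust at all"), levels = lv)

## as.integer() on a factor returns the position of each value's level
codes <- as.integer(x)

## positions above 4 are treated as missing; positions 1 and 2 are dropped as well
x[codes > 4] <- NA
x[codes <= 2] <- NA

## remove the now-empty levels so that only two categories remain
x <- droplevels(x)
print(levels(x))
print(table(x, useNA = "ifany"))
```

Note that this kind of recoding keeps only the two "do not trust" categories, so the binary outcome distinguishes degrees of distrust rather than trust versus distrust.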

Splitting training and test data

To control the quality of the data analysis, we split the classified data into two groups. The first, the training data, is used to develop and train the model. The second, the testing data, is used to explore how well we trained the model. We never use the testing data when training the model, so that we can evaluate the accuracy of any model by showing it unseen data.


In [5]:
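## return the first `size` share of the rows (train = TRUE) or the remaining rows (train = FALSE)
create_train_test <- function(data, size = 0.8, train = TRUE) {
    n_row <- nrow(data)
    ## number of training rows, rounded down to a whole number
    total_row <- floor(size * n_row)
    train_sample <- seq_len(total_row)
    if (train) {
        return (data[train_sample, ])
    } else {
        return (data[-train_sample, ])
    }
}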

In [6]:
train <- create_train_test( dataset, train = TRUE )
test <- create_train_test( dataset, train = FALSE )
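
The function above takes the first 80% of the rows in file order. If the rows happen to be ordered in some way (for example by interview date or region), the two sets may differ systematically. A possible alternative is to shuffle the rows before splitting. The sketch below is only an illustration: the helper name create_random_split and the seed are hypothetical, and it is demonstrated on the built-in mtcars data so that it runs on its own.

```r
## hypothetical helper: shuffled train/test split (not part of the original notebook)
create_random_split <- function(data, size = 0.8, seed = 1) {
    set.seed(seed)                         # fixed seed so the split is reproducible
    n_train <- floor(size * nrow(data))    # number of training rows
    idx <- sample(nrow(data), n_train)     # random row indices, without replacement
    list(train = data[idx, , drop = FALSE],
         test  = data[-idx, , drop = FALSE])
}

## demonstrated on a built-in data set so the example runs on its own
parts <- create_random_split(mtcars)
print(nrow(parts$train))
print(nrow(parts$test))
```

With the survey data, the same call would be create_random_split( dataset ), with parts$train and parts$test taking the place of train and test.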

Decision trees

Decision trees classify data by finding, step by step, the splits on the predictor variables that best separate the categories. The result can be drawn as a tree-like diagram. They work best with a binary outcome whose two classes are of roughly equal size.


In [7]:
library(rpart)
library(rpart.plot)

In [8]:
model_rpart <- rpart( V105 ~ V10 + V20 + V30 + V40 + V50 + V60 + V70 + V80 + V90 + V100 + V242, data = train, method = "class")
rpart.plot( model_rpart )
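
Besides the plot, a fitted tree can be inspected as text and used for prediction. The following sketch uses the kyphosis data set that ships with rpart as a stand-in, since the survey file is not assumed to be available here; the variable names in the formula belong to that data set, not to the survey.

```r
library(rpart)

## kyphosis ships with rpart; it stands in for the survey data in this sketch
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")

## text version of the tree: one line per node with its split rule
print(fit)

## predicted class for each row
pred <- predict(fit, kyphosis, type = "class")

## share of rows classified correctly; computed on the training rows themselves,
## so it is an optimistic estimate
accuracy <- mean(pred == kyphosis$Kyphosis)
print(accuracy)
```

For the model above, predict( model_rpart, test, type = "class" ) would give the predictions for the unseen test rows in the same way.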


Support vector machines

Support vector machines are likewise used to classify observations based on a set of variables. Note how you can explore the importance of individual variables using varImp.

You can also use cross-validation to get a more reliable estimate of model performance while developing the model, not only at the end when comparing predictions against the test data. The training data is split several times into different parts (folds); each time the model is fitted on all folds but one and evaluated on the fold that was left out. The results across folds are then used to choose the best model settings.


In [9]:
library(caret)


Loading required package: lattice
Loading required package: ggplot2
Registered S3 methods overwritten by 'ggplot2':
  method         from 
  [.quosures     rlang
  c.quosures     rlang
  print.quosures rlang

In [10]:
## here we set up cross-validation: 2 folds, repeated 2 times
tc <- trainControl(
  method = "repeatedcv",
  returnResamp = "all",
  number = 2,
  repeats = 2
)
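
What repeated cross-validation does can be sketched by hand in base R. This is an illustration under stated assumptions, not what caret runs internally: the data are simulated, a logistic regression (glm) stands in for the support vector machine, and all variable names are hypothetical.

```r
set.seed(42)

## simulated binary outcome with two predictors (hypothetical data)
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- factor(ifelse(d$x1 + rnorm(n) > 0, "yes", "no"))

k <- 2        # folds per repeat, mirroring number = 2
repeats <- 2  # mirroring repeats = 2
acc <- c()

for (r in seq_len(repeats)) {
    ## assign each row to one of k folds at random
    fold <- sample(rep(seq_len(k), length.out = n))
    for (f in seq_len(k)) {
        ## fit on all folds except f, evaluate on fold f
        fit <- glm(y ~ x1 + x2, data = d[fold != f, ], family = binomial)
        prob <- predict(fit, d[fold == f, ], type = "response")
        pred <- ifelse(prob > 0.5, "yes", "no")
        acc <- c(acc, mean(pred == d$y[fold == f]))
    }
}

## one accuracy per held-out fold; their mean is the resampled estimate
print(acc)
print(mean(acc))
```

With only two folds, each model is fitted on half of the training data; a larger number of folds, such as 5 or 10, is a common choice when the data set is large enough.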

In [12]:
model_svm <- train( V105 ~ V10 + V20 + V30 + V40 + V50 + V60 + V70 + V80 + V90 + V100 + V242, data=train,
                    method="svmLinear", trControl = tc)


Warning message in .local(x, ...):
“Variable(s) `' constant. Cannot scale data.”
Warning message in .local(x, ...):
“Variable(s) `' constant. Cannot scale data.”
Warning message in .local(x, ...):
“Variable(s) `' constant. Cannot scale data.”
Warning message in .local(x, ...):
“Variable(s) `' constant. Cannot scale data.”
Warning message in .local(x, ...):
“Variable(s) `' constant. Cannot scale data.”

In [13]:
varImp( model_svm, scale=TRUE )


ROC curve variable importance

     Importance
V60     100.000
V242     91.875
V40      58.734
V80      53.489
V50      35.805
V70      29.691
V20      19.395
V90       9.127
V10       6.242
V100      5.162
V30       0.000

In [14]:
varImp( model_svm, scale=FALSE )


ROC curve variable importance

     Importance
V60      0.5419
V242     0.5388
V40      0.5262
V80      0.5242
V50      0.5175
V70      0.5152
V20      0.5113
V90      0.5074
V10      0.5064
V100     0.5059
V30      0.5040
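
The two outputs above appear to differ only by a min-max rescaling to the range 0 to 100. The sketch below checks this using the printed scale = FALSE values; small differences from the scale = TRUE output are expected because the printed values are rounded.

```r
## unscaled importances copied from the scale = FALSE output above
imp <- c(V60 = 0.5419, V242 = 0.5388, V40 = 0.5262, V80 = 0.5242,
         V50 = 0.5175, V70 = 0.5152, V20 = 0.5113, V90 = 0.5074,
         V10 = 0.5064, V100 = 0.5059, V30 = 0.5040)

## min-max rescaling to the 0-100 range
scaled <- 100 * (imp - min(imp)) / (max(imp) - min(imp))
print(round(scaled, 1))
```

If the unscaled values are areas under the ROC curve for each predictor, as the output header suggests, then all of them are close to 0.5, which would mean that no single predictor separates the two classes much better than chance. The scaled table hides this, because it always stretches the values to span 0 to 100.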

Random forest


In [ ]:
model_rf <- train( V105 ~ V10 + V20 + V30 + V40 + V50 + V60 + V70 + V80 + V90 + V100 + V242, data=train, method="rf")

In [ ]:
plot( model_rf )

Evaluating results

Now let's examine how well the models work with unseen test data.


In [ ]:
p <- predict( model_rf, test )
confusionMatrix( p, test$V105 )
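
For reference, the core of what confusionMatrix reports can be computed in base R. The predicted and observed labels below are made up for illustration; they are not results from the models above.

```r
## hypothetical predicted and observed labels for ten test rows
lv <- c("Do not trust very much", "Do not trust at all")
observed  <- factor(lv[c(1, 1, 1, 2, 2, 1, 2, 1, 2, 1)], levels = lv)
predicted <- factor(lv[c(1, 1, 2, 2, 1, 1, 2, 1, 1, 1)], levels = lv)

## rows: predicted class, columns: observed class
cm <- table(Predicted = predicted, Observed = observed)
print(cm)

## accuracy: share of rows on the diagonal
accuracy <- sum(diag(cm)) / sum(cm)
print(accuracy)

## share of the largest class: the accuracy of always guessing that class
baseline <- max(table(observed)) / length(observed)
print(baseline)
```

Comparing the accuracy with this baseline matters here because the two classes are not of equal size (746 versus 498 before splitting), so a model that always predicts the larger class already scores well above 50%.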

Continuous variable

Above we worked with an outcome that was nominal, that is, divided into classes. Let's now move on to an outcome that is continuous.


In [ ]:
hist( train$V242, xlab = "Age", main = "Age of responders" )

In [ ]:
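## note: despite the variable name, method "lmStepAIC" fits a linear regression
## with stepwise selection by AIC, not a lasso
model_lasso <- train( V242 ~ V10 + V20 + V30 , data=train, method="lmStepAIC")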

In [ ]:
summary( model_lasso )

In [ ]:
test_lasso <- predict( model_lasso, test )

In [ ]:
cor( test_lasso, test$V242 )

In [ ]:
plot( test_lasso, test$V242 )
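
The correlation alone does not tell how far off the predictions are in years. Two common complements are the root mean squared error (RMSE) and the mean absolute error (MAE), both expressed in the units of the outcome. The numbers below are made up for illustration and are not output from the model above.

```r
## hypothetical observed ages and model predictions
observed  <- c(23, 35, 41, 52, 67, 30)
predicted <- c(38, 40, 39, 42, 45, 37)

## correlation measures linear association, not the size of the errors
print(cor(predicted, observed))

## root mean squared error: penalizes large errors more heavily
rmse <- sqrt(mean((predicted - observed)^2))
print(rmse)

## mean absolute error: average size of the error in years
mae <- mean(abs(predicted - observed))
print(mae)
```

With the notebook's objects, the same formulas would be applied to test_lasso and test$V242.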
