Recoding variables

Sometimes you need to recompute varialbes, create new ones or simply recode strings to numers, strings to logicals or numbers to logical values.

For example look at this table:


In [1]:
plays = read.table("../../data//hejtmy-plays.csv", sep = ",", header = T)
head(plays)


play.IDgame.IDgame.nameuseriddatequantitylocationlengthincompletenowinstats...player.7.ratingplayer.7.winplayer.8.usernameplayer.8.nameplayer.8.startpositionplayer.8.colorplayer.8.scoreplayer.8.newplayer.8.ratingplayer.8.win
19912835 91536 Quarriors!NA 2013-08-05NA Roztoky 20 NA NA ... NA NA NA NA NA NA NA NA NA
29912984 40692 Small WorldNA 2013-08-05 NA Roztoky 35 NA NA ... NA NA NA NA NA NA NA NA NA
39913062 91536 Quarriors!NA 2013-08-05NA Roztoky 20 NA NA ... NA NA NA NA NA NA NA NA NA
49953882 96848 Mage Knight Board GameNA 2013-08-11 NA Nižbor 300 NA NA ... NA NA NA NA NA NA NA NA NA
59953895 95234 Cthulhu GloomNA 1970-08-09 NA Nižbor 200 NA NA ... NA NA NA NA NA NA NA NA NA
69953904 40692 Small WorldNA 2013-08-10 NA Nižbor 120 NA NA ... NA NA NA NA NA NA NA NA NA

In [2]:
length(names(plays))


75

Do you see the weird symbols in the location column? For some reason te table looks alright when we open it in text editor but gets screwed during upload. Well, this is the time to read up on the ENCODING. As it turns out, default encoding that is set by the read.table function is ANSI. But our file contains non ansi symbols and is encoded with utf-8. Therefore we need to fix it by recoding the column.

Unfortunatelly that is nore fdifficult when the table is already loaded and would need to be done for all columns, so let's just have a look at read.table parameters and fix it.


In [3]:
plays = read.table("../../data//hejtmy-plays.csv", sep = ",", header = T, encoding = "UTF-8")
head(plays)


play.IDgame.IDgame.nameuseriddatequantitylocationlengthincompletenowinstats...player.7.ratingplayer.7.winplayer.8.usernameplayer.8.nameplayer.8.startpositionplayer.8.colorplayer.8.scoreplayer.8.newplayer.8.ratingplayer.8.win
19912835 91536 Quarriors!NA 2013-08-05NA Roztoky 20 NA NA ... NA NA NA NA NA NA NA NA NA
29912984 40692 Small WorldNA 2013-08-05 NA Roztoky 35 NA NA ... NA NA NA NA NA NA NA NA NA
39913062 91536 Quarriors!NA 2013-08-05NA Roztoky 20 NA NA ... NA NA NA NA NA NA NA NA NA
49953882 96848 Mage Knight Board GameNA 2013-08-11 NA Nižbor 300 NA NA ... NA NA NA NA NA NA NA NA NA
59953895 95234 Cthulhu GloomNA 1970-08-09 NA Nižbor 200 NA NA ... NA NA NA NA NA NA NA NA NA
69953904 40692 Small WorldNA 2013-08-10 NA Nižbor 120 NA NA ... NA NA NA NA NA NA NA NA NA

Perfect. Now, looking at the number of columns, someting is out of order. we have amazing amount of NA values. Also, sometimes same players are in the first position, sometimes in the second, so it would make comparisons between players really obnoxious to do.

What would make more sense woud be to read the table into the so called long format. Therefore we would have multiple lines for the same play with different players as values in the player column. This format is quite used in SQL and with some changes it dos impede the analysis. Unfortuntely its not as easy to o a we would might like. There are some ingenious solutins that ill sho you, but tere are also packages o smooth the way

ingenious way

Because we are going to mess th table up, we need to save the ID of each play. Luckilly for us, its already in there. If we didn't have the ID, we would have needed to create one as simply as plays[,"id"] = 1:nrow(plays)

NEvertheless, let's continue.

Now we need to radically restructure the player part. Basically, we need each player to become one row in the table, herefore we need to split

player.X.username player.X.name player.X.startposition player.X.color player.X.score player.X.new player.X.rating player.X.win

as is often useful, let's start with a single row


In [4]:
row = plays[1, ]

We need to split players to individual rows and then "paste" the original play in front. So let's start with the splitting part. Luckily, the naming conventions are quite clear and column positions are very systematic and regular, so we can use a simple for loop to do it. Let's have a look at the situation we are at:


In [5]:
names(plays)


  1. "play.ID"
  2. "game.ID"
  3. "game.name"
  4. "userid"
  5. "date"
  6. "quantity"
  7. "location"
  8. "length"
  9. "incomplete"
  10. "nowinstats"
  11. "comments"
  12. "player.1.username"
  13. "player.1.name"
  14. "player.1.startposition"
  15. "player.1.color"
  16. "player.1.score"
  17. "player.1.new"
  18. "player.1.rating"
  19. "player.1.win"
  20. "player.2.username"
  21. "player.2.name"
  22. "player.2.startposition"
  23. "player.2.color"
  24. "player.2.score"
  25. "player.2.new"
  26. "player.2.rating"
  27. "player.2.win"
  28. "player.3.username"
  29. "player.3.name"
  30. "player.3.startposition"
  31. "player.3.color"
  32. "player.3.score"
  33. "player.3.new"
  34. "player.3.rating"
  35. "player.3.win"
  36. "player.4.username"
  37. "player.4.name"
  38. "player.4.startposition"
  39. "player.4.color"
  40. "player.4.score"
  41. "player.4.new"
  42. "player.4.rating"
  43. "player.4.win"
  44. "player.5.username"
  45. "player.5.name"
  46. "player.5.startposition"
  47. "player.5.color"
  48. "player.5.score"
  49. "player.5.new"
  50. "player.5.rating"
  51. "player.5.win"
  52. "player.6.username"
  53. "player.6.name"
  54. "player.6.startposition"
  55. "player.6.color"
  56. "player.6.score"
  57. "player.6.new"
  58. "player.6.rating"
  59. "player.6.win"
  60. "player.7.username"
  61. "player.7.name"
  62. "player.7.startposition"
  63. "player.7.color"
  64. "player.7.score"
  65. "player.7.new"
  66. "player.7.rating"
  67. "player.7.win"
  68. "player.8.username"
  69. "player.8.name"
  70. "player.8.startposition"
  71. "player.8.color"
  72. "player.8.score"
  73. "player.8.new"
  74. "player.8.rating"
  75. "player.8.win"

In [6]:
which(names(plays) == "player.1.username")
which(names(plays) == "player.2.username")


12
20

So we know there are maximum of 8 player and each player has 8 columns with information. The player information also starts at the 12th place. Let's try first step of the for loop in here


In [8]:
iPlayer_info = which(names(plays) == "player.1.username")
play_info = row[1:iPlayer_info-1]
nPlayerCol = 8
i = 1 #this will get incremented in the loop later
iStart = iPlayer_info*i
iEnd = iStart + nPlayerCol - 1
player_info = row[iStart:iEnd]
player_row = c(play_info, player_info)
player_row


$play.ID
9912835
$game.ID
91536
$game.name
Quarriors!
$userid
NA
$date
2013-08-05
$quantity
NA
$location
Roztoky
$length
20
$incomplete
NA
$nowinstats
NA
$comments
$player.1.username
Tatsukochi
$player.1.name
hejtmy
$player.1.startposition
NA
$player.1.color
$player.1.score
20
$player.1.new
NA
$player.1.rating
NA
$player.1.win
1

In [9]:
new_df = data.frame()
iPlayer_info = which(names(plays) == "player.1.username")
play_info = row[, (1:(iPlayer_info - 1))]
nPlayerCol = 8
for (i in 1:8){
    # now we want to extract information about the specific player
    iStart = iPlayer_info + (i-1) * nPlayerCol
    iEnd = iStart + nPlayerCol - 1
    player_info = row[, iStart:iEnd]
    colnames(player_info) = c("player.username" ,"player.name", "player.startposition", "player.color", "player.score",
                            "player.new", "player.rating", "player.win")
    player_row = cbind(play_info, player_info)
    new_df = rbind(new_df, player_row)
}
new_df


play.IDgame.IDgame.nameuseriddatequantitylocationlengthincompletenowinstatscommentsplayer.usernameplayer.nameplayer.startpositionplayer.colorplayer.scoreplayer.newplayer.ratingplayer.win
19912835 91536 Quarriors!NA 2013-08-05NA Roztoky 20 NA NA Tatsukochihejtmy NA 20 NA NA 1
29912835 91536 Quarriors!NA 2013-08-05NA Roztoky 20 NA NA Betka NA 2 NA NA NA
39912835 91536 Quarriors!NA 2013-08-05NA Roztoky 20 NA NA NA NA NA NA NA
49912835 91536 Quarriors!NA 2013-08-05NA Roztoky 20 NA NA NA NA NA NA NA
59912835 91536 Quarriors!NA 2013-08-05NA Roztoky 20 NA NA NA NA NA NA NA
69912835 91536 Quarriors!NA 2013-08-05NA Roztoky 20 NA NA NA NA NA NA NA
79912835 91536 Quarriors!NA 2013-08-05NA Roztoky 20 NA NA NA NA NA NA NA NA
89912835 91536 Quarriors!NA 2013-08-05NA Roztoky 20 NA NA NA NA NA NA NA NA NA

Looks good. Now we need to do it for every row. Because we already did it for one row, we only need to reassign the row and resave throughout.

Now this would lead to a lot of empty values. WE can remove afterwards, but we can also speed things up by breaking the forloop when empty player is located.


In [11]:
plays_recoded = data.frame()
for (i in 1:nrow(plays)){
  row = plays[i,]
  play_df = data.frame()
  iPlayer_info = which(names(plays) == "player.1.username")
  play_info = row[, (1:(iPlayer_info - 1))]
  nPlayerCol = 8
  for (i in 1:8){
    # now we want to extract information about the specific player
    iStart = iPlayer_info + (i-1) * nPlayerCol
    iEnd = iStart + nPlayerCol - 1
    player_info = row[, iStart:iEnd]
    colnames(player_info) = c("player.username" ,"player.name", "player.startposition", "player.color", "player.score",
                              "player.new", "player.rating", "player.win")
    if(player_info$player.name == ""){break} #CHANGE!!!!!!
    player_row = cbind(play_info, player_info)
    play_df = rbind(play_df, player_row)
  }
  plays_recoded = rbind(plays_recoded, play_df)
}

Now have a look at it


In [12]:
head(plays_recoded)


play.IDgame.IDgame.nameuseriddatequantitylocationlengthincompletenowinstatscommentsplayer.usernameplayer.nameplayer.startpositionplayer.colorplayer.scoreplayer.newplayer.ratingplayer.win
19912835 91536 Quarriors!NA 2013-08-05NA Roztoky 20 NA NA Tatsukochihejtmy NA 20 NA NA 1
29912835 91536 Quarriors!NA 2013-08-05NA Roztoky 20 NA NA Betka NA 2 NA NA NA
229912984 40692 Small WorldNA 2013-08-05 NA Roztoky 35 NA NA Tatsukochi hejtmy 1 106 NA NA 1
219912984 40692 Small WorldNA 2013-08-05 NA Roztoky 35 NA NA Betka NA 94 NA NA NA
39913062 91536 Quarriors!NA 2013-08-05NA Roztoky 20 NA NA Tatsukochihejtmy 2 20 NA NA NA
319913062 91536 Quarriors!NA 2013-08-05NA Roztoky 20 NA NA Betka 1 10 NA NA NA

Now it works but the function looks horrible. We can fix stuff using functions to make things clearer.

Now the stat tests will be slightly more complicated, because we need to do them play.ID wise, but we should manage. There are packages to easy our ways into that.

reshape way


In [ ]:
reshape(plays, direction = "long", varying = 11:75)

Time play

there is more complicated issue when dealing with time. Time can be written in many ways - as string, as posix format, datetime, datenum, time with zones or without, miliseconds since something etc. Depending on how you encode it, you can do simple or complicate stuff with it :)


In [ ]:


In [ ]: