Recoding variables

Sometimes you need to recompute variables, create new ones, or simply recode strings to numbers, strings to logicals, or numbers to logical values.

For example, look at this table:



In [1]:

    
plays = read.table("../../data//hejtmy-plays.csv", sep = ",", header = T)
head(plays)









    





play.ID game.ID game.name userid date quantity location length incomplete nowinstats ... player.7.rating player.7.win player.8.username player.8.name player.8.startposition player.8.color player.8.score player.8.new player.8.rating player.8.win

	1 9912835   91536     Quarriors! NA        2013-08-05 NA        Roztoky   20        NA        NA        ...       NA        NA        NA                  NA        NA        NA        NA        NA        NA        
	2 9912984    40692      Small World NA         2013-08-05 NA         Roztoky    35         NA         NA         ...        NA         NA         NA                    NA         NA         NA         NA         NA         NA         
	3 9913062   91536     Quarriors! NA        2013-08-05 NA        Roztoky   20        NA        NA        ...       NA        NA        NA                  NA        NA        NA        NA        NA        NA        
	4 9953882               96848                 Mage Knight Board Game NA                    2013-08-11            NA                    NiÅ¾bor               300                   NA                    NA                    ...                   NA                    NA                    NA                                          NA                    NA                    NA                    NA                    NA                    NA                    
	5 9953895      95234        Cthulhu Gloom NA           1970-08-09   NA           NiÅ¾bor      200          NA           NA           ...          NA           NA           NA                        NA           NA           NA           NA           NA           NA           
	6 9953904    40692      Small World NA         2013-08-10 NA         NiÅ¾bor    120        NA         NA         ...        NA         NA         NA                    NA         NA         NA         NA         NA         NA



In [2]:

    
length(names(plays))

Do you see the weird symbols in the location column? For some reason the table looks alright when we open it in a text editor but gets garbled on import. Well, this is the time to read up on encoding. As it turns out, read.table does not assume any particular encoding by default: it treats the text as being in the native encoding of your system, which on Windows is typically an ANSI code page. But our file contains non-ANSI symbols and is encoded in UTF-8. Therefore we need to fix it by recoding the column.

Unfortunately, that is more difficult once the table is already loaded, and it would need to be done for all columns, so let's just have a look at the read.table parameters and fix it there.
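For reference, an after-the-fact repair could look roughly like the sketch below. It assumes the strings already hold the right UTF-8 bytes and only lack the encoding mark. The helper name mark_utf8 and the toy data frame are made up for illustration; Encoding() itself is base R.

```r
# Hypothetical helper: mark every text column of a data frame as UTF-8.
mark_utf8 <- function(df) {
  for (col in names(df)) {
    if (is.character(df[[col]])) {
      Encoding(df[[col]]) <- "UTF-8"
    } else if (is.factor(df[[col]])) {
      # factors keep their text in the levels
      Encoding(levels(df[[col]])) <- "UTF-8"
    }
  }
  df
}

# Toy stand-in for a loaded table: the UTF-8 bytes of "Nižbor" without an encoding mark
raw_name <- rawToChar(as.raw(c(0x4e, 0x69, 0xc5, 0xbe, 0x62, 0x6f, 0x72)))
toy <- data.frame(location = c("Roztoky", raw_name), stringsAsFactors = FALSE)

fixed <- mark_utf8(toy)
Encoding(fixed$location)  # plain ASCII strings stay "unknown", the other one becomes "UTF-8"
```

This only changes how R interprets the bytes, so it is cheap, but it has to touch every text column, which is why fixing the import is the simpler route.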



In [3]:

    
plays = read.table("../../data//hejtmy-plays.csv", sep = ",", header = T, encoding = "UTF-8")
head(plays)









    





play.ID game.ID game.name userid date quantity location length incomplete nowinstats ... player.7.rating player.7.win player.8.username player.8.name player.8.startposition player.8.color player.8.score player.8.new player.8.rating player.8.win

	1 9912835   91536     Quarriors! NA        2013-08-05 NA        Roztoky   20        NA        NA        ...       NA        NA        NA                  NA        NA        NA        NA        NA        NA        
	2 9912984    40692      Small World NA         2013-08-05 NA         Roztoky    35         NA         NA         ...        NA         NA         NA                    NA         NA         NA         NA         NA         NA         
	3 9913062   91536     Quarriors! NA        2013-08-05 NA        Roztoky   20        NA        NA        ...       NA        NA        NA                  NA        NA        NA        NA        NA        NA        
	4 9953882               96848                 Mage Knight Board Game NA                    2013-08-11            NA                    Nižbor                300                   NA                    NA                    ...                   NA                    NA                    NA                                          NA                    NA                    NA                    NA                    NA                    NA                    
	5 9953895      95234        Cthulhu Gloom NA           1970-08-09   NA           Nižbor       200          NA           NA           ...          NA           NA           NA                        NA           NA           NA           NA           NA           NA           
	6 9953904    40692      Small World NA         2013-08-10 NA         Nižbor     120        NA         NA         ...        NA         NA         NA                    NA         NA         NA         NA         NA         NA

Perfect. Now, looking at the number of columns, something is off: we have an amazing number of NA values. Also, the same player is sometimes in the first position and sometimes in the second, which would make comparisons between players really obnoxious to do.

What would make more sense would be to get the table into the so-called long format. There we would have multiple rows for the same play, with the different players as values in the player columns. This format is widely used in SQL, and it does require some changes to how we run the analysis. Unfortunately, getting there is not as easy as we might like. There are some ingenious solutions that I'll show you, but there are also packages to smooth the way.
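To make the difference concrete, here is a minimal sketch of the same information in both layouts. The toy table keeps only two plays, two players, and two fields per player (values taken from the first plays shown above), and the long version is assembled by hand just to show the target shape.

```r
# Toy wide table: one row per play, one block of columns per player
wide <- data.frame(
  play.ID = c(9912835, 9912984),
  player.1.name = c("hejtmy", "hejtmy"), player.1.score = c(20, 106),
  player.2.name = c("Betka", "Betka"),   player.2.score = c(2, 94),
  stringsAsFactors = FALSE
)

# The same information in long format: one row per player per play
long <- rbind(
  data.frame(play.ID = wide$play.ID, player = 1,
             name = wide$player.1.name, score = wide$player.1.score,
             stringsAsFactors = FALSE),
  data.frame(play.ID = wide$play.ID, player = 2,
             name = wide$player.2.name, score = wide$player.2.score,
             stringsAsFactors = FALSE)
)
long <- long[order(long$play.ID, long$player), ]
long
```

In the long table a player is always found in the same column, whatever seat they had in a given play.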

ingenious way

Because we are going to mess the table up, we need to keep the ID of each play. Luckily for us, it's already in there. If we didn't have the ID, we would have needed to create one, which is as simple as plays[,"id"] = 1:nrow(plays)

Nevertheless, let's continue.
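As a small aside, a sketch of creating such an ID on a toy table. It uses seq_len() instead of 1:nrow(), which is a common precaution because 1:0 counts backwards when a table happens to be empty. The toy data frame is made up for illustration.

```r
toy <- data.frame(game.name = c("Quarriors!", "Small World", "Quarriors!"))
toy$id <- seq_len(nrow(toy))   # 1, 2, 3

# the same line also works on an empty table, where 1:nrow() would fail
empty <- toy[0, , drop = FALSE]
empty$id <- seq_len(nrow(empty))
```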

Now we need to radically restructure the player part. Basically, we need each player to become one row in the table, therefore we need to split the columns

player.X.username player.X.name player.X.startposition player.X.color player.X.score player.X.new player.X.rating player.X.win

As is often useful, let's start with a single row.



In [4]:

    
row = plays[1, ]

We need to split players to individual rows and then "paste" the original play in front. So let's start with the splitting part. Luckily, the naming conventions are quite clear and the column positions are very systematic and regular, so we can use a simple for loop to do it. Let's have a look at where we stand:



In [5]:

    
names(plays)









    





	"play.ID"
	"game.ID"
	"game.name"
	"userid"
	"date"
	"quantity"
	"location"
	"length"
	"incomplete"
	"nowinstats"
	"comments"
	"player.1.username"
	"player.1.name"
	"player.1.startposition"
	"player.1.color"
	"player.1.score"
	"player.1.new"
	"player.1.rating"
	"player.1.win"
	"player.2.username"
	"player.2.name"
	"player.2.startposition"
	"player.2.color"
	"player.2.score"
	"player.2.new"
	"player.2.rating"
	"player.2.win"
	"player.3.username"
	"player.3.name"
	"player.3.startposition"
	"player.3.color"
	"player.3.score"
	"player.3.new"
	"player.3.rating"
	"player.3.win"
	"player.4.username"
	"player.4.name"
	"player.4.startposition"
	"player.4.color"
	"player.4.score"
	"player.4.new"
	"player.4.rating"
	"player.4.win"
	"player.5.username"
	"player.5.name"
	"player.5.startposition"
	"player.5.color"
	"player.5.score"
	"player.5.new"
	"player.5.rating"
	"player.5.win"
	"player.6.username"
	"player.6.name"
	"player.6.startposition"
	"player.6.color"
	"player.6.score"
	"player.6.new"
	"player.6.rating"
	"player.6.win"
	"player.7.username"
	"player.7.name"
	"player.7.startposition"
	"player.7.color"
	"player.7.score"
	"player.7.new"
	"player.7.rating"
	"player.7.win"
	"player.8.username"
	"player.8.name"
	"player.8.startposition"
	"player.8.color"
	"player.8.score"
	"player.8.new"
	"player.8.rating"
	"player.8.win"



In [6]:

    
which(names(plays) == "player.1.username")
which(names(plays) == "player.2.username")

So we know there is a maximum of 8 players and each player has 8 columns of information. The player information starts at the 12th column. Let's try the first step of the for loop here:
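The column arithmetic can be checked up front. The sketch below assumes the layout described above (first player column at position 12, 8 columns per player, 8 players) and lists the first and last column of each player block. The variable names are chosen for this illustration only.

```r
first_player_col <- 12   # position of player.1.username
n_player_cols <- 8       # columns per player
n_players <- 8           # maximum number of players

starts <- first_player_col + (seq_len(n_players) - 1) * n_player_cols
ends <- starts + n_player_cols - 1
data.frame(player = seq_len(n_players), start = starts, end = ends)
```

The last block ends at column 75, which matches the total number of columns in the table.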



In [8]:

    
iPlayer_info = which(names(plays) == "player.1.username")
play_info = row[1:(iPlayer_info - 1)] #parentheses matter: 1:n-1 means (1:n)-1
nPlayerCol = 8
i = 1 #this will get incremented in the loop later
iStart = iPlayer_info + (i - 1) * nPlayerCol #first column of player i
iEnd = iStart + nPlayerCol - 1
player_info = row[iStart:iEnd]
player_row = c(play_info, player_info)
player_row









    





	$play.ID
		9912835
	$game.ID
		91536
	$game.name
		Quarriors!
	$userid
		NA
	$date
		2013-08-05
	$quantity
		NA
	$location
		Roztoky
	$length
		20
	$incomplete
		NA
	$nowinstats
		NA
	$comments
		
	$player.1.username
		Tatsukochi
	$player.1.name
		hejtmy
	$player.1.startposition
		NA
	$player.1.color
		
	$player.1.score
		20
	$player.1.new
		NA
	$player.1.rating
		NA
	$player.1.win
		1



In [9]:

    
new_df = data.frame()
iPlayer_info = which(names(plays) == "player.1.username")
play_info = row[, (1:(iPlayer_info - 1))]
nPlayerCol = 8
for (i in 1:8){
    # now we want to extract information about the specific player
    iStart = iPlayer_info + (i-1) * nPlayerCol
    iEnd = iStart + nPlayerCol - 1
    player_info = row[, iStart:iEnd]
    colnames(player_info) = c("player.username" ,"player.name", "player.startposition", "player.color", "player.score",
                            "player.new", "player.rating", "player.win")
    player_row = cbind(play_info, player_info)
    new_df = rbind(new_df, player_row)
}
new_df









    





play.ID game.ID game.name userid date quantity location length incomplete nowinstats comments player.username player.name player.startposition player.color player.score player.new player.rating player.win

	1 9912835   91536     Quarriors! NA        2013-08-05 NA        Roztoky   20        NA        NA                  Tatsukochi hejtmy    NA                  20        NA        NA        1         
	2 9912835   91536     Quarriors! NA        2013-08-05 NA        Roztoky   20        NA        NA                            Betka     NA                  2         NA        NA        NA        
	3 9912835   91536     Quarriors! NA        2013-08-05 NA        Roztoky   20        NA        NA                                      NA                  NA        NA        NA        NA        
	4 9912835   91536     Quarriors! NA        2013-08-05 NA        Roztoky   20        NA        NA                                      NA                  NA        NA        NA        NA        
	5 9912835   91536     Quarriors! NA        2013-08-05 NA        Roztoky   20        NA        NA                                      NA                  NA        NA        NA        NA        
	6 9912835   91536     Quarriors! NA        2013-08-05 NA        Roztoky   20        NA        NA                                      NA                  NA        NA        NA        NA        
	7 9912835   91536     Quarriors! NA        2013-08-05 NA        Roztoky   20        NA        NA                                      NA        NA        NA        NA        NA        NA        
	8 9912835   91536     Quarriors! NA        2013-08-05 NA        Roztoky   20        NA        NA                  NA                  NA        NA        NA        NA        NA        NA

Looks good. Now we need to do it for every row. Because we already did it for one row, we only need to reassign row on each pass and collect the results as we go.

This would lead to a lot of empty rows. We can remove them afterwards, but we can also speed things up by breaking out of the for loop as soon as an empty player is found.
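The "remove afterwards" option could look like the sketch below. It assumes that an empty player slot shows up as an empty or missing player.name. The toy table is made up to mimic one play split into four player rows.

```r
toy <- data.frame(
  play.ID = rep(9912835, 4),
  player.name = c("hejtmy", "Betka", "", NA),
  player.score = c(20, 2, NA, NA),
  stringsAsFactors = FALSE
)

# keep only rows that actually describe a player
keep <- !is.na(toy$player.name) & toy$player.name != ""
cleaned <- toy[keep, ]
cleaned
```

Filtering afterwards is simpler to read, while breaking out of the loop avoids building the empty rows in the first place.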



In [11]:

    
plays_recoded = data.frame()
# these do not change between plays, so compute them once
iPlayer_info = which(names(plays) == "player.1.username")
nPlayerCol = 8
for (iRow in 1:nrow(plays)){
  row = plays[iRow,]
  play_df = data.frame()
  play_info = row[, (1:(iPlayer_info - 1))]
  for (i in 1:8){
    # now we want to extract information about the specific player
    iStart = iPlayer_info + (i-1) * nPlayerCol
    iEnd = iStart + nPlayerCol - 1
    player_info = row[, iStart:iEnd]
    colnames(player_info) = c("player.username" ,"player.name", "player.startposition", "player.color", "player.score",
                              "player.new", "player.rating", "player.win")
    # stop at the first empty player slot (a missing name counts as empty too)
    if(is.na(player_info$player.name) || player_info$player.name == ""){break}
    player_row = cbind(play_info, player_info)
    play_df = rbind(play_df, player_row)
  }
  plays_recoded = rbind(plays_recoded, play_df)
}

Now let's have a look at it.



In [12]:

    
head(plays_recoded)









    





play.ID game.ID game.name userid date quantity location length incomplete nowinstats comments player.username player.name player.startposition player.color player.score player.new player.rating player.win

	1 9912835   91536     Quarriors! NA        2013-08-05 NA        Roztoky   20        NA        NA                  Tatsukochi hejtmy    NA                  20        NA        NA        1         
	2 9912835   91536     Quarriors! NA        2013-08-05 NA        Roztoky   20        NA        NA                            Betka     NA                  2         NA        NA        NA        
	22 9912984    40692      Small World NA         2013-08-05 NA         Roztoky    35         NA         NA                    Tatsukochi hejtmy     1                     106        NA         NA         1          
	21 9912984    40692      Small World NA         2013-08-05 NA         Roztoky    35         NA         NA                               Betka      NA                    94         NA         NA         NA         
	3 9913062   91536     Quarriors! NA        2013-08-05 NA        Roztoky   20        NA        NA                  Tatsukochi hejtmy    2                   20        NA        NA        NA        
	31 9913062   91536     Quarriors! NA        2013-08-05 NA        Roztoky   20        NA        NA                            Betka     1                   10        NA        NA        NA

Now it works, but the code looks horrible. We can tidy it up using functions to make things clearer.
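One possible way to split the work into functions is sketched below. The helper names split_players and to_long, their arguments, and the toy table are all made up for illustration. The second toy play has only one player so that the early exit is exercised.

```r
# Hypothetical helper: turn one wide row into one row per non-empty player.
# first_col is the position of the first player column,
# fields are the per-player column suffixes in their column order.
split_players <- function(row, first_col, fields, max_players) {
  play_info <- row[, seq_len(first_col - 1), drop = FALSE]
  pieces <- list()
  for (j in seq_len(max_players)) {
    start <- first_col + (j - 1) * length(fields)
    player_info <- row[, start:(start + length(fields) - 1), drop = FALSE]
    colnames(player_info) <- paste0("player.", fields)
    name <- player_info$player.name
    if (is.na(name) || name == "") break   # stop at the first empty slot
    pieces[[j]] <- cbind(play_info, player_info)
  }
  do.call(rbind, pieces)
}

# Hypothetical helper: apply split_players to every row and stack the results.
to_long <- function(wide, first_col, fields, max_players) {
  rows <- lapply(seq_len(nrow(wide)), function(i) {
    split_players(wide[i, , drop = FALSE], first_col, fields, max_players)
  })
  do.call(rbind, rows)
}

toy <- data.frame(
  play.ID = c(9912835, 9912984),
  player.1.name = c("hejtmy", "hejtmy"), player.1.score = c(20, 106),
  player.2.name = c("Betka", ""),        player.2.score = c(2, NA),
  stringsAsFactors = FALSE
)
toy_long <- to_long(toy, first_col = 2, fields = c("name", "score"), max_players = 2)
toy_long
```

Collecting the pieces in a list and binding them once also avoids growing a data frame inside the loop, which tends to get slow on larger tables.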

The statistical tests will now be slightly more complicated, because we need to do them per play.ID, but we should manage. There are packages to ease our way into that.

reshape way



In [ ]:

    
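fields = c("username", "name", "startposition", "color", "score", "new", "rating", "win")
# one group of 8 wide columns per field; column 11 (comments) is not a player column
varying = lapply(fields, function(f) paste0("player.", 1:8, ".", f))
plays_long = reshape(plays, direction = "long", varying = varying, v.names = paste0("player.", fields),
                     timevar = "player", times = 1:8, idvar = "play.ID") # assumes play.ID is unique per row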
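To see what a long reshape produces, here is a sketch on a toy table with two players and two fields (values taken from the first plays shown above). Names like player.1.name contain two dots, and as far as I can tell reshape's automatic name splitting cuts at the first one, so the sketch passes the column groups as an explicit list instead of relying on guessing.

```r
wide <- data.frame(
  play.ID = c(9912835, 9912984),
  player.1.name = c("hejtmy", "hejtmy"), player.1.score = c(20, 106),
  player.2.name = c("Betka", "Betka"),   player.2.score = c(2, 94),
  stringsAsFactors = FALSE
)

fields <- c("name", "score")
n_players <- 2
# one vector of wide column names per field, in the same order as v.names
varying <- lapply(fields, function(f) paste0("player.", seq_len(n_players), ".", f))

long <- reshape(wide, direction = "long",
                varying = varying, v.names = paste0("player.", fields),
                timevar = "player", times = seq_len(n_players), idvar = "play.ID")
long <- long[order(long$play.ID, long$player), ]
long
```

Unlike the loop version, this keeps a row for every player slot, so empty slots still have to be filtered out afterwards.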

Time play

There is a more complicated issue when dealing with time. Time can be written in many ways: as a string, in POSIX format, as datetime, datenum, with or without time zones, as milliseconds since some reference point, etc. Depending on how you encode it, you can do simple or complicated stuff with it :)
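A few of these representations in base R, as a minimal sketch. The two dates come from the table above, while the time of day is made up for illustration.

```r
# Dates stored as strings become something you can compute with
dates <- as.Date(c("2013-08-05", "2013-08-11"))
diff(dates)                          # difference between the two plays, in days
as.numeric(as.Date("1970-01-01"))    # Date objects count days from this origin

# Date-times carry a time zone and count seconds from the same origin
stamp <- as.POSIXct("2013-08-05 18:30:00", tz = "UTC")
format(stamp, "%Y-%m-%d %H:%M", tz = "UTC")
as.numeric(stamp)                    # seconds since 1970-01-01 00:00:00 UTC
```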


