DIC LAB 2 Problem 2 A: Extract Data : European Soccer Data

Define all the libraries which needs to be set for operations here


In [1]:
library("XML")
library("methods")
library("dplyr")
library("tidyr")


Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

Define the database name


In [2]:
dbName = toString("database.sqlite")

Connect to the database


In [3]:
my_db <- src_sqlite(dbName, create = FALSE)

Query the Player table


In [4]:
playerTable = tbl(my_db, sql("SELECT player_api_id,player_name FROM Player"))

Query the Player Attributes table


In [5]:
playerAttrTable = tbl(my_db, sql("SELECT player_api_id,overall_rating,gk_diving,gk_handling,gk_kicking,gk_positioning,gk_reflexes FROM Player_Attributes"))

Add the gk_rating field to the table using mutate and gk fields


In [6]:
playerAttrTable = mutate(playerAttrTable, gk_rating = (gk_diving + gk_handling + gk_kicking + gk_positioning + gk_reflexes))

Group all the players by their id and find the different max ratings


In [7]:
playerAttrTable = playerAttrTable %>% 
    group_by(player_api_id) %>%
    summarize(overall_rating = max(overall_rating), gk_rating = max(gk_rating))

Print the table content


In [8]:
playerAttrTable


Source:   query [?? x 3]
Database: sqlite 3.11.1 [database.sqlite]

   player_api_id overall_rating gk_rating
           <int>          <int>     <int>
1           2625             63       153
2           2752             72        53
3           2768             75       128
4           2770             74       148
5           2790             73       144
6           2796             69       340
7           2802             82       146
8           2805             76       141
9           2857             72       143
10          2862             76       146
# ... with more rows

Left Join the tables by player id


In [9]:
playerData = left_join(playerTable, playerAttrTable, by = "player_api_id")

Print the data after join


In [10]:
playerData


Source:   query [?? x 4]
Database: sqlite 3.11.1 [database.sqlite]

   player_api_id        player_name overall_rating gk_rating
           <int>              <chr>          <int>     <int>
1         505942 Aaron Appindangoye             67        43
2         155782    Aaron Cresswell             74       121
3         162549        Aaron Doran             71       125
4          30572      Aaron Galindo             75       111
5          23780       Aaron Hughes             78       129
6          27316         Aaron Hunt             79       132
7         564793         Aaron Kuhl             61        64
8          30895       Aaron Lennon             84       146
9         528212       Aaron Lennox             48       237
10        101042      Aaron Meijers             69        54
# ... with more rows

Query the Match table


In [11]:
matchTable = tbl(my_db, sql("SELECT league_id,home_team_api_id,away_team_api_id,home_team_goal,away_team_goal,home_player_1,home_player_2,home_player_3,home_player_4,home_player_5,home_player_6,home_player_7,home_player_8,home_player_9,home_player_10,home_player_11,away_player_1,away_player_2,away_player_3,away_player_4,away_player_5,away_player_6,away_player_7,away_player_8,away_player_9,away_player_10,away_player_11 FROM Match"))

Group the Match table based on the player id


In [12]:
playerMatchDataTable = gather(tbl_df(matchTable), players, player_api_id, home_player_1 : away_player_11)

Group the players based on their id and summarize the attributes


In [13]:
playerMatchesDF = playerMatchDataTable %>% 
    group_by(player_api_id) %>%
    summarize(appearances = n(),league_appearances = n_distinct(league_id))

Print the data


In [14]:
playerMatchesDF


player_api_idappearancesleague_appearances
2625 90 1
2752 81 1
2768 71 1
2770 16 2
2790 50 1
2796 1 1
2802144 2
2805131 1
2857 3 1
2862 17 1
2907 6 1
2973 16 1
2977 5 1
2983 71 1
2984 38 1
3140 42 1
3204 42 1
3263 19 1
3264 1 1
3316 6 1
3329 85 1
3377 1 1
3441 74 1
3512 4 1
3517 17 1
3520 70 1
4747 36 1
4863 25 1
4902 68 1
4939 97 1
703659 1 1
704523 4 1
705484 4 1
706985 2 1
710807 1 1
715642 1 1
716998 1 1
717248 4 1
717270 1 1
717557 4 1
719413 3 1
720738 5 1
721133 15 1
722766 13 1
723037 7 1
725718 1 1
726956 4 1
728125 2 1
728414 1 1
728478 1 1
730065 1 1
733787 2 1
742405 1 1
743720 1 1
744907 1 1
746419 2 1
748432 2 1
750435 1 1
750584 1 1
NA29257 11

Join the two tables now based on the player id


In [15]:
playersFinalData = left_join(tbl_df(playerData), playerMatchesDF, by = "player_api_id")

Print the data


In [16]:
head(playersFinalData,n=10)


player_api_idplayer_nameoverall_ratinggk_ratingappearancesleague_appearances
505942 Aaron Appindangoye67 43 8 1
155782 Aaron Cresswell 74 121 75 1
162549 Aaron Doran 71 125 104 1
30572 Aaron Galindo 75 111 10 1
23780 Aaron Hughes 78 129 162 1
27316 Aaron Hunt 79 132 158 1
564793 Aaron Kuhl 61 64 5 1
30895 Aaron Lennon 84 146 181 1
528212 Aaron Lennox 48 237 1 1
101042 Aaron Meijers 69 54 167 1

Query the Team table


In [17]:
teamTable = tbl(my_db, sql("SELECT team_api_id,team_long_name FROM Team"))

Query the Team attribute table


In [18]:
teamAttrTable = tbl(my_db, sql("SELECT team_api_id,buildUpPlaySpeed,buildUpPlayDribbling,buildUpPlayPassing,chanceCreationPassing,chanceCreationCrossing,chanceCreationShooting,defencePressure,defenceAggression,defenceTeamWidth FROM Team_Attributes"))

Add the extra fields using different tables


In [19]:
teamAttrTable = mutate(teamAttrTable, chance_creation = (chanceCreationPassing + chanceCreationCrossing + chanceCreationShooting), build_up_play =(buildUpPlaySpeed +buildUpPlayDribbling+buildUpPlayPassing), defense = (defencePressure+defenceAggression+defenceTeamWidth))

Group the team table based on the team id and max Values


In [20]:
teamAttrTable = teamAttrTable %>% 
    group_by(team_api_id) %>%
    summarize(chance_creation = max(chance_creation), build_up_play = max(build_up_play), defense = max(defense))

Print the data


In [21]:
teamAttrTable


Source:   query [?? x 4]
Database: sqlite 3.11.1 [database.sqlite]

   team_api_id chance_creation build_up_play defense
         <int>           <int>         <int>   <int>
1         1601             186           139     175
2         1773             157           153     137
3         1957             163           158     210
4         2033             165           147     155
5         2182             196           176     160
6         2183             180            NA     175
7         2186             168           125     205
8         4087             180           156     165
9         4170             165            NA     200
10        6269             172           179     151
# ... with more rows

Join the two tables into one


In [22]:
teamTableData = left_join(teamTable, teamAttrTable, by = "team_api_id")

Print the data


In [23]:
teamTableData


Source:   query [?? x 5]
Database: sqlite 3.11.1 [database.sqlite]

   team_api_id    team_long_name chance_creation build_up_play defense
         <int>             <chr>           <int>         <int>   <int>
1         9987          KRC Genk             182           148     205
2         9993      Beerschot AC             185            NA     210
3        10000  SV Zulte-Waregem             153           147     200
4         9994  Sporting Lokeren             154           159     200
5         9984 KSV Cercle Brugge             200           145     195
6         8635    RSC Anderlecht             182           149     190
7         9991          KAA Gent             184           142     166
8         9998         RAEC Mons             168           160     138
9         7947     FCV Dender EH              NA            NA      NA
10        9985 Standard de Liège             182           161     205
# ... with more rows

Group the teams based on the two given ids


In [24]:
teamMatchDataTable = gather(tbl_df(matchTable), team, team_api_id, home_team_api_id,away_team_api_id)

Find the gilas scored by the home team


In [25]:
homeTeamGoals = gather(tbl_df(matchTable), team, team_api_id, home_team_api_id) %>% 
    group_by(team_api_id) %>%
    summarize(homeGoals = sum(home_team_goal))

Print the data


In [26]:
head(homeTeamGoals,n =10)


team_api_idhomeGoals
1601148
1773 74
1957176
2033 70
2182215
2183 96
2186144
4049 22
4064 13
4087105

Group the teams based on the id's for away teams


In [27]:
awayTeamGoals = gather(tbl_df(matchTable), team, team_api_id, away_team_api_id) %>% 
    group_by(team_api_id) %>%
    summarize(awayGoals = sum(away_team_goal))

Join the teams based on the goals when playing as away and home team


In [28]:
teamGoalsData = full_join(awayTeamGoals,homeTeamGoals) %>%  mutate(goals = (awayGoals+homeGoals))


Joining, by = "team_api_id"

Print the data


In [29]:
teamGoalsData


team_api_idawayGoalshomeGoalsgoals
1601132 148 280
1773 52 74 126
1957111 176 287
2033 68 70 138
2182157 215 372
2183 88 96 184
2186 91 144 235
4049 13 22 35
4064 14 13 27
4087 75 105 180
4170 12 19 31
6269 15 19 34
6351 13 15 28
6367 8 19 27
6391 14 23 37
6403126 166 292
6413 80 124 204
6421 20 35 55
6433 33 41 74
6493 54 74 128
6547 12 17 29
6601 11 27 38
6631 9 15 24
7730 47 52 99
7788 52 67 119
7794 64 101 165
7819111 109 220
7841117 131 248
7842 71 96 167
7844144 156 300
10214145 187 332
10215104 136 240
10217162 204 366
10218 64 90 154
10219 61 92 153
10228203 267 470
10229213 279 492
10233143 249 392
10235198 291 489
10238103 135 238
10242 33 38 71
10243227 241 468
10249151 193 344
10251123 136 259
10252156 179 335
10260244 338 582
10261125 199 324
10264159 239 398
10265164 168 332
10267185 299 484
10268 33 32 65
10269196 219 415
10278 9 27 36
10281 76 94 170
108893 7 14 21
158085 45 56 101
177361 12 21 33
188163 19 15 34
208931 14 23 37
274581 30 41 71

Join the tables to make a full team table


In [30]:
teamFinalData = left_join(tbl_df(teamTableData),teamGoalsData,by = "team_api_id")

Print the data


In [31]:
head(teamFinalData,n=10)


team_api_idteam_long_namechance_creationbuild_up_playdefenseawayGoalshomeGoalsgoals
9987 KRC Genk 182 148 205 143 205 348
9993 Beerschot AC 185 NA 210 62 112 174
10000 SV Zulte-Waregem 153 147 200 130 176 306
9994 Sporting Lokeren 154 159 200 130 149 279
9984 KSV Cercle Brugge200 145 195 94 119 213
8635 RSC Anderlecht 182 149 190 180 247 427
9991 KAA Gent 184 142 166 177 213 390
9998 RAEC Mons 168 160 138 53 76 129
7947 FCV Dender EH NA NA NA 23 21 44
9985 Standard de Liège182 161 205 142 199 341

Write the player data to a csv file


In [32]:
write.csv(playersFinalData, file = "player.csv",row.names=FALSE)

Write the team data to a csv file


In [33]:
write.csv(teamFinalData, file = "team.csv",row.names=FALSE)

References