Regresssion with scikit-learn

using Soccer Dataset

We will again be using the open dataset from the popular site Kaggle that we used in Week 1 for our example.

Recall that this European Soccer Database has more than 25,000 matches and more than 10,000 players for European professional soccer seasons from 2008 to 2016.

Note: Please download the file database.sqlite if you don't yet have it in your Week-7-MachineLearning folder.

Import Libraries



In [1]:

    
import sqlite3
import pandas as pd 
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from math import sqrt

Read Data from the Database into pandas



In [3]:

    
# Create your connection.
cnx = sqlite3.connect('database.sqlite')
df = pd.read_sql_query("SELECT * FROM Player_Attributes", cnx)



In [4]:

    
df.head()









    Out[4]:







  
    
      
      id
      player_fifa_api_id
      player_api_id
      date
      overall_rating
      potential
      preferred_foot
      attacking_work_rate
      defensive_work_rate
      crossing
      ...
      vision
      penalties
      marking
      standing_tackle
      sliding_tackle
      gk_diving
      gk_handling
      gk_kicking
      gk_positioning
      gk_reflexes
    
  
  
    
      0
      1
      218353
      505942
      2016-02-18 00:00:00
      67.0
      71.0
      right
      medium
      medium
      49.0
      ...
      54.0
      48.0
      65.0
      69.0
      69.0
      6.0
      11.0
      10.0
      8.0
      8.0
    
    
      1
      2
      218353
      505942
      2015-11-19 00:00:00
      67.0
      71.0
      right
      medium
      medium
      49.0
      ...
      54.0
      48.0
      65.0
      69.0
      69.0
      6.0
      11.0
      10.0
      8.0
      8.0
    
    
      2
      3
      218353
      505942
      2015-09-21 00:00:00
      62.0
      66.0
      right
      medium
      medium
      49.0
      ...
      54.0
      48.0
      65.0
      66.0
      69.0
      6.0
      11.0
      10.0
      8.0
      8.0
    
    
      3
      4
      218353
      505942
      2015-03-20 00:00:00
      61.0
      65.0
      right
      medium
      medium
      48.0
      ...
      53.0
      47.0
      62.0
      63.0
      66.0
      5.0
      10.0
      9.0
      7.0
      7.0
    
    
      4
      5
      218353
      505942
      2007-02-22 00:00:00
      61.0
      65.0
      right
      medium
      medium
      48.0
      ...
      53.0
      47.0
      62.0
      63.0
      66.0
      5.0
      10.0
      9.0
      7.0
      7.0
    
  

5 rows × 42 columns



In [5]:

    
df.shape









    Out[5]:





(183978, 42)



In [6]:

    
df.columns









    Out[6]:





Index(['id', 'player_fifa_api_id', 'player_api_id', 'date', 'overall_rating',
       'potential', 'preferred_foot', 'attacking_work_rate',
       'defensive_work_rate', 'crossing', 'finishing', 'heading_accuracy',
       'short_passing', 'volleys', 'dribbling', 'curve', 'free_kick_accuracy',
       'long_passing', 'ball_control', 'acceleration', 'sprint_speed',
       'agility', 'reactions', 'balance', 'shot_power', 'jumping', 'stamina',
       'strength', 'long_shots', 'aggression', 'interceptions', 'positioning',
       'vision', 'penalties', 'marking', 'standing_tackle', 'sliding_tackle',
       'gk_diving', 'gk_handling', 'gk_kicking', 'gk_positioning',
       'gk_reflexes'],
      dtype='object')

Declare the Columns You Want to Use as Features



In [7]:

    
features = [
       'potential', 'crossing', 'finishing', 'heading_accuracy',
       'short_passing', 'volleys', 'dribbling', 'curve', 'free_kick_accuracy',
       'long_passing', 'ball_control', 'acceleration', 'sprint_speed',
       'agility', 'reactions', 'balance', 'shot_power', 'jumping', 'stamina',
       'strength', 'long_shots', 'aggression', 'interceptions', 'positioning',
       'vision', 'penalties', 'marking', 'standing_tackle', 'sliding_tackle',
       'gk_diving', 'gk_handling', 'gk_kicking', 'gk_positioning',
       'gk_reflexes']

Specify the Prediction Target



In [8]:

    
target = ['overall_rating']

Clean the Data



In [9]:

    
df = df.dropna()

Extract Features and Target ('overall_rating') Values into Separate Dataframes



In [10]:

    
X = df[features]



In [11]:

    
y = df[target]

Let us look at a typical row from our features:



In [12]:

    
X.iloc[2]









    Out[12]:





potential             66.0
crossing              49.0
finishing             44.0
heading_accuracy      71.0
short_passing         61.0
volleys               44.0
dribbling             51.0
curve                 45.0
free_kick_accuracy    39.0
long_passing          64.0
ball_control          49.0
acceleration          60.0
sprint_speed          64.0
agility               59.0
reactions             47.0
balance               65.0
shot_power            55.0
jumping               58.0
stamina               54.0
strength              76.0
long_shots            35.0
aggression            63.0
interceptions         41.0
positioning           45.0
vision                54.0
penalties             48.0
marking               65.0
standing_tackle       66.0
sliding_tackle        69.0
gk_diving              6.0
gk_handling           11.0
gk_kicking            10.0
gk_positioning         8.0
gk_reflexes            8.0
Name: 2, dtype: float64

Let us also display our target values:



In [13]:

    
y









    Out[13]:







  
    
      
      overall_rating
    
  
  
    
      0
      67.0
    
    
      1
      67.0
    
    
      2
      62.0
    
    
      3
      61.0
    
    
      4
      61.0
    
    
      5
      74.0
    
    
      6
      74.0
    
    
      7
      73.0
    
    
      8
      73.0
    
    
      9
      73.0
    
    
      10
      73.0
    
    
      11
      74.0
    
    
      12
      73.0
    
    
      13
      71.0
    
    
      14
      71.0
    
    
      15
      71.0
    
    
      16
      70.0
    
    
      17
      70.0
    
    
      18
      70.0
    
    
      19
      70.0
    
    
      20
      70.0
    
    
      21
      70.0
    
    
      22
      69.0
    
    
      23
      69.0
    
    
      24
      69.0
    
    
      25
      69.0
    
    
      26
      69.0
    
    
      27
      69.0
    
    
      28
      69.0
    
    
      29
      68.0
    
    
      ...
      ...
    
    
      183933
      76.0
    
    
      183934
      75.0
    
    
      183935
      77.0
    
    
      183936
      77.0
    
    
      183937
      63.0
    
    
      183938
      63.0
    
    
      183939
      63.0
    
    
      183940
      63.0
    
    
      183941
      63.0
    
    
      183942
      66.0
    
    
      183943
      66.0
    
    
      183944
      66.0
    
    
      183945
      66.0
    
    
      183946
      66.0
    
    
      183947
      68.0
    
    
      183948
      68.0
    
    
      183949
      68.0
    
    
      183950
      68.0
    
    
      183951
      67.0
    
    
      183952
      67.0
    
    
      183968
      78.0
    
    
      183969
      81.0
    
    
      183970
      81.0
    
    
      183971
      81.0
    
    
      183972
      83.0
    
    
      183973
      83.0
    
    
      183974
      78.0
    
    
      183975
      77.0
    
    
      183976
      78.0
    
    
      183977
      80.0
    
  

180354 rows × 1 columns

Split the Dataset into Training and Test Datasets



In [14]:

    
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=324)

(1) Linear Regression: Fit a model to the training set



In [15]:

    
regressor = LinearRegression()
regressor.fit(X_train, y_train)









    Out[15]:





LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

Perform Prediction using Linear Regression Model



In [16]:

    
y_prediction = regressor.predict(X_test)
y_prediction









    Out[16]:





array([[ 66.51284879],
       [ 79.77234615],
       [ 66.57371825],
       ..., 
       [ 69.23780133],
       [ 64.58351696],
       [ 73.6881185 ]])

What is the mean of the expected target value in test set ?



In [17]:

    
y_test.describe()









    Out[17]:







  
    
      
      overall_rating
    
  
  
    
      count
      59517.000000
    
    
      mean
      68.635818
    
    
      std
      7.041297
    
    
      min
      33.000000
    
    
      25%
      64.000000
    
    
      50%
      69.000000
    
    
      75%
      73.000000
    
    
      max
      94.000000

Evaluate Linear Regression Accuracy using Root Mean Square Error



In [18]:

    
RMSE = sqrt(mean_squared_error(y_true = y_test, y_pred = y_prediction))



In [19]:

    
print(RMSE)









    



2.805303046855223

(2) Decision Tree Regressor: Fit a new regression model to the training set



In [20]:

    
regressor = DecisionTreeRegressor(max_depth=20)
regressor.fit(X_train, y_train)









    Out[20]:





DecisionTreeRegressor(criterion='mse', max_depth=20, max_features=None,
           max_leaf_nodes=None, min_impurity_split=1e-07,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=None,
           splitter='best')

Perform Prediction using Decision Tree Regressor



In [21]:

    
y_prediction = regressor.predict(X_test)
y_prediction









    Out[21]:





array([ 62.        ,  84.        ,  62.38666667, ...,  71.        ,
        62.        ,  73.        ])

For comparision: What is the mean of the expected target value in test set ?



In [22]:

    
y_test.describe()









    Out[22]:







  
    
      
      overall_rating
    
  
  
    
      count
      59517.000000
    
    
      mean
      68.635818
    
    
      std
      7.041297
    
    
      min
      33.000000
    
    
      25%
      64.000000
    
    
      50%
      69.000000
    
    
      75%
      73.000000
    
    
      max
      94.000000

Evaluate Decision Tree Regression Accuracy using Root Mean Square Error



In [23]:

    
RMSE = sqrt(mean_squared_error(y_true = y_test, y_pred = y_prediction))



In [24]:

    
print(RMSE)









    



1.4564614438612797



In [ ]:

	id	player_fifa_api_id	player_api_id	date	overall_rating	potential	preferred_foot	attacking_work_rate	defensive_work_rate	crossing	...	vision	penalties	marking	standing_tackle	sliding_tackle	gk_diving	gk_handling	gk_kicking	gk_positioning	gk_reflexes
0	1	218353	505942	2016-02-18 00:00:00	67.0	71.0	right	medium	medium	49.0	...	54.0	48.0	65.0	69.0	69.0	6.0	11.0	10.0	8.0	8.0
1	2	218353	505942	2015-11-19 00:00:00	67.0	71.0	right	medium	medium	49.0	...	54.0	48.0	65.0	69.0	69.0	6.0	11.0	10.0	8.0	8.0
2	3	218353	505942	2015-09-21 00:00:00	62.0	66.0	right	medium	medium	49.0	...	54.0	48.0	65.0	66.0	69.0	6.0	11.0	10.0	8.0	8.0
3	4	218353	505942	2015-03-20 00:00:00	61.0	65.0	right	medium	medium	48.0	...	53.0	47.0	62.0	63.0	66.0	5.0	10.0	9.0	7.0	7.0
4	5	218353	505942	2007-02-22 00:00:00	61.0	65.0	right	medium	medium	48.0	...	53.0	47.0	62.0	63.0	66.0	5.0	10.0	9.0	7.0	7.0

	overall_rating
count	59517.000000
mean	68.635818
std	7.041297
min	33.000000
25%	64.000000
50%	69.000000
75%	73.000000
max	94.000000