In [1]:
import pandas as pd

Check the data, deal with NaNs


In [2]:
df = pd.read_csv("../data/ign.csv")

In [3]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18625 entries, 0 to 18624
Data columns (total 11 columns):
Unnamed: 0        18625 non-null int64
score_phrase      18625 non-null object
title             18625 non-null object
url               18625 non-null object
platform          18625 non-null object
score             18625 non-null float64
genre             18589 non-null object
editors_choice    18625 non-null object
release_year      18625 non-null int64
release_month     18625 non-null int64
release_day       18625 non-null int64
dtypes: float64(1), int64(4), object(6)
memory usage: 1.6+ MB

In [4]:
# Drop the free-text and index-like columns that will not be used as features
df = df.drop(['title', 'url', 'Unnamed: 0'], axis=1)

In [5]:
df = df.dropna()
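
According to the info() output above, genre is the only column with missing values (18589 of 18625 entries are non-null), so dropna() removes 36 rows. A minimal sketch of how one might check this before dropping, using a small hypothetical frame (toy and its values are made up for illustration, not taken from ign.csv):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame standing in for the IGN data
toy = pd.DataFrame({
    "score": [9.0, 8.5, 7.0, 6.5],
    "genre": ["Platformer", np.nan, "Sports", "Puzzle"],
})

# Count the missing values in each column before dropping anything
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Drop only the rows where 'genre' is missing
cleaned = toy.dropna(subset=["genre"])
print(len(toy) - len(cleaned), "row(s) dropped")
```

Restricting dropna() with subset makes the intent explicit and avoids silently losing rows if another column gains NaNs later.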

In [6]:
df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 18589 entries, 0 to 18624
Data columns (total 8 columns):
score_phrase      18589 non-null object
platform          18589 non-null object
score             18589 non-null float64
genre             18589 non-null object
editors_choice    18589 non-null object
release_year      18589 non-null int64
release_month     18589 non-null int64
release_day       18589 non-null int64
dtypes: float64(1), int64(3), object(4)
memory usage: 1.3+ MB

In [7]:
print(df.head())


  score_phrase          platform  score       genre editors_choice  \
0      Amazing  PlayStation Vita    9.0  Platformer              Y   
1      Amazing  PlayStation Vita    9.0  Platformer              Y   
2        Great              iPad    8.5      Puzzle              N   
3        Great          Xbox 360    8.5      Sports              N   
4        Great     PlayStation 3    8.5      Sports              N   

   release_year  release_month  release_day  
0          2012              9           12  
1          2012              9           12  
2          2012              9           12  
3          2012              9           11  
4          2012              9           11  

Encode parameters


In [8]:
from sklearn import preprocessing

In [9]:
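encoders = {}

for col in df.columns.values:
    # Encode only the categorical (object-dtype) columns
    if df[col].dtype == 'object':
        # Fit a fresh encoder per column and keep it so codes can be inverted later
        le = preprocessing.LabelEncoder()
        le.fit(df[col].values)
        print("Encoded classes are: {}\n".format(le.classes_))
        df[col] = le.transform(df[col])
        encoders[col] = le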


Encoded classes are: ['Amazing' 'Awful' 'Bad' 'Disaster' 'Good' 'Great' 'Masterpiece' 'Mediocre'
 'Okay' 'Painful' 'Unbearable']

Encoded classes are: ['Android' 'Arcade' 'Atari 2600' 'Atari 5200' 'Commodore 64/128'
 'DVD / HD Video Game' 'Dreamcast' 'Dreamcast VMU' 'Game Boy'
 'Game Boy Advance' 'Game Boy Color' 'Game.Com' 'GameCube' 'Genesis'
 'Linux' 'Lynx' 'Macintosh' 'Master System' 'N-Gage' 'NES' 'NeoGeo'
 'NeoGeo Pocket Color' 'New Nintendo 3DS' 'Nintendo 3DS' 'Nintendo 64'
 'Nintendo 64DD' 'Nintendo DS' 'Nintendo DSi' 'Ouya' 'PC' 'PlayStation'
 'PlayStation 2' 'PlayStation 3' 'PlayStation 4' 'PlayStation Portable'
 'PlayStation Vita' 'Pocket PC' 'Saturn' 'Sega 32X' 'Sega CD' 'SteamOS'
 'Super NES' 'TurboGrafx-16' 'TurboGrafx-CD' 'Vectrex' 'Web Games' 'Wii'
 'Wii U' 'Windows Phone' 'Windows Surface' 'Wireless' 'WonderSwan'
 'WonderSwan Color' 'Xbox' 'Xbox 360' 'Xbox One' 'iPad' 'iPhone' 'iPod']

Encoded classes are: ['Action' 'Action, Adventure' 'Action, Compilation' 'Action, Editor'
 'Action, Platformer' 'Action, Puzzle' 'Action, RPG' 'Action, Simulation'
 'Action, Strategy' 'Adult, Card' 'Adventure' 'Adventure, Adult'
 'Adventure, Adventure' 'Adventure, Compilation' 'Adventure, Episodic'
 'Adventure, Platformer' 'Adventure, RPG' 'Baseball' 'Battle' 'Board'
 'Board, Compilation' 'Card' 'Card, Battle' 'Card, Compilation' 'Card, RPG'
 'Casino' 'Compilation' 'Compilation, Compilation' 'Compilation, RPG'
 'Educational' 'Educational, Action' 'Educational, Adventure'
 'Educational, Card' 'Educational, Productivity' 'Educational, Puzzle'
 'Educational, Simulation' 'Educational, Trivia' 'Fighting'
 'Fighting, Action' 'Fighting, Adventure' 'Fighting, Compilation'
 'Fighting, RPG' 'Fighting, Simulation' 'Flight' 'Flight, Action'
 'Flight, Racing' 'Flight, Simulation' 'Hardware' 'Hunting'
 'Hunting, Action' 'Hunting, Simulation' 'Music' 'Music, Action'
 'Music, Adventure' 'Music, Compilation' 'Music, Editor' 'Music, RPG'
 'Other' 'Other, Action' 'Other, Adventure' 'Party' 'Pinball'
 'Pinball, Compilation' 'Platformer' 'Platformer, Action'
 'Platformer, Adventure' 'Productivity' 'Productivity, Action' 'Puzzle'
 'Puzzle, Action' 'Puzzle, Adventure' 'Puzzle, Compilation'
 'Puzzle, Platformer' 'Puzzle, RPG' 'Puzzle, Word Game' 'RPG' 'RPG, Action'
 'RPG, Compilation' 'RPG, Editor' 'RPG, Simulation' 'Racing'
 'Racing, Action' 'Racing, Compilation' 'Racing, Editor' 'Racing, Shooter'
 'Racing, Simulation' 'Shooter' 'Shooter, Adventure'
 'Shooter, First-Person' 'Shooter, Platformer' 'Shooter, RPG' 'Simulation'
 'Simulation, Adventure' 'Sports' 'Sports, Action' 'Sports, Baseball'
 'Sports, Compilation' 'Sports, Editor' 'Sports, Fighting' 'Sports, Golf'
 'Sports, Other' 'Sports, Party' 'Sports, Racing' 'Sports, Simulation'
 'Strategy' 'Strategy, Compilation' 'Strategy, RPG' 'Strategy, Simulation'
 'Trivia' 'Virtual Pet' 'Wrestling' 'Wrestling, Simulation']

Encoded classes are: ['N' 'Y']
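
Note that LabelEncoder assigns codes in alphabetical order, so 'Amazing' becomes 0 and 'Awful' becomes 1 even though they sit at opposite ends of the rating scale. If the order matters for the model you pick, one option is an explicit ordinal mapping. The sketch below assumes a worst-to-best ranking of the phrases (phrase_order is my assumption, not something stated in this notebook) and uses a small hypothetical series:

```python
import pandas as pd

# Assumed ranking from worst to best; adjust if the real scale differs
phrase_order = [
    "Disaster", "Unbearable", "Painful", "Awful", "Bad", "Mediocre",
    "Okay", "Good", "Great", "Amazing", "Masterpiece",
]

# Hypothetical sample of score phrases
phrases = pd.Series(["Amazing", "Great", "Okay", "Disaster"])

# An ordered categorical gives codes that follow phrase_order
ordered = pd.Categorical(phrases, categories=phrase_order, ordered=True)
ordinal_codes = pd.Series(ordered.codes, index=phrases.index)
print(ordinal_codes.tolist())
```

Tree-based models are largely indifferent to this ordering, while linear and distance-based models can be sensitive to it.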


In [10]:
print(df.head())


   score_phrase  platform  score  genre  editors_choice  release_year  \
0             0        35    9.0     63               1          2012   
1             0        35    9.0     63               1          2012   
2             5        56    8.5     68               0          2012   
3             5        54    8.5     93               0          2012   
4             5        32    8.5     93               0          2012   

   release_month  release_day  
0              9           12  
1              9           12  
2              9           12  
3              9           11  
4              9           11  

Tips and objectives

Keep in mind that although score_phrase would normally be the feature to predict (from the genre of the game, the release year, the score, and so on), the score may well be directly correlated with it. It might therefore be more interesting to use another feature as the label. Just use something that makes sense :)
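
One way to check how tightly score and score_phrase are linked is to look at the range of scores behind each phrase. This is only a sketch on a hypothetical frame (the values in toy are made up for illustration); on the real data you would run the same groupby on df before encoding:

```python
import pandas as pd

# Hypothetical sample; the real check would use the IGN DataFrame
toy = pd.DataFrame({
    "score_phrase": ["Amazing", "Amazing", "Great", "Great", "Good"],
    "score": [9.0, 9.5, 8.5, 8.0, 7.4],
})

# Lowest and highest score observed for each phrase
ranges = toy.groupby("score_phrase")["score"].agg(["min", "max"])
print(ranges)

# If consecutive ranges never overlap, the phrase is just a binned score
sorted_ranges = ranges.sort_values("min")
no_overlap = bool(
    (sorted_ranges["min"].iloc[1:].values
     > sorted_ranges["max"].iloc[:-1].values).all()
)
print("Ranges are disjoint:", no_overlap)
```

If the ranges turn out to be disjoint on the real data, keeping score as a feature while predicting score_phrase would leak the label.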

If needed, feel free to apply the knowledge you have already gathered to make changes to the dataset.

The goal of this exercise is to:

  • Choose at least three models and use simple cross-validation. Which of the models would you implement?
    • Hold out different percentages of data and see how that affects the results.
  • Using the same three models, use k-fold cross-validation. Which one has the best result?
    • Try different values of k. How does that affect the results? Try to justify.
    • Is it a good idea to use leave-one-out cross-validation on this dataset?
  • Use random splitting. How does this affect the results?
  • Implement any type of preprocessing in cross-validation using a pipeline. Think about how you would do this without a pipeline (you don't need to implement it).
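
As a starting point, the objectives above could be sketched roughly as follows. This is not a reference solution: the three models, the split sizes, the values of k and the synthetic data from make_classification are all my own assumptions, chosen so the block runs without ign.csv. Replace X and y with your encoded features and chosen label.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (KFold, ShuffleSplit, cross_val_score,
                                     train_test_split)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded features (X) and the chosen label (y)
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

# Putting the scaler in a pipeline means it is refit on each training split only
models = {
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)),
    "decision_tree": make_pipeline(
        StandardScaler(), DecisionTreeClassifier(random_state=0)),
    "k_neighbors": make_pipeline(
        StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
}

# 1) Simple hold-out validation with different held-out fractions
holdout_scores = {}
for test_size in (0.1, 0.3, 0.5):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=0)
    for name, model in models.items():
        model.fit(X_train, y_train)
        holdout_scores[(name, test_size)] = model.score(X_test, y_test)

# 2) k-fold cross-validation with different values of k
kfold_scores = {}
for k in (3, 5, 10):
    cv = KFold(n_splits=k, shuffle=True, random_state=0)
    for name, model in models.items():
        kfold_scores[(name, k)] = cross_val_score(model, X, y, cv=cv).mean()

# 3) Random splitting: 10 independent 80/20 shuffles
shuffle_cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
shuffle_scores = {
    name: cross_val_score(model, X, y, cv=shuffle_cv).mean()
    for name, model in models.items()
}

for (name, k), score in sorted(kfold_scores.items()):
    print("{:<20} k={:<2} mean accuracy={:.3f}".format(name, k, score))
```

On leave-one-out: it needs one fit per row, which for the 18589 rows left after dropna() means 18589 fits per model, so it is worth thinking about the cost before trying it.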

It's a good idea to use a random_state equal to some integer in order to replicate results.
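
A small sketch of what random_state buys you, using train_test_split on a hypothetical array (data and the seeds 42 and 7 are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: the integers 0..19
data = np.arange(20)

# The same integer seed gives an identical split on every call
_, test_a = train_test_split(data, test_size=0.25, random_state=42)
_, test_b = train_test_split(data, test_size=0.25, random_state=42)
same_split = np.array_equal(test_a, test_b)

# A different seed will generally produce a different split
_, test_c = train_test_split(data, test_size=0.25, random_state=7)

print("Same seed, same split:", same_split)
print("Seed 42 test set:", test_a)
print("Seed 7 test set: ", test_c)
```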

Remember, the goal is to get acquainted with these kinds of procedures. Don't stress too much about high scores. If you think of anything else you would like to try, feel free to implement it!

Implementations


In [11]:
# Now it's your turn

In [ ]: