In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
%matplotlib inline
matplotlib.style.use('seaborn')
In [2]:
users = pd.read_csv('rating.csv')
anime = pd.read_csv('anime.csv')
In [3]:
print (anime.rating)
anime.head()
Out[3]:
In [4]:
print (len(users))
users.head()
Out[4]:
So we have two pandas data frames. In the first, we have information regarding a particular anime. The information here speaks for itself. In the second data frame, we have the actual ratings for a given user/ anime pair. Ratings range from 1 to 10, with -1 signifying a completed but unrated movie.
We can see there are 12294 total anime and 7.8 million user/ rating pairs. The first thing we want to do is remove all ratings that are "-1" as they don't provide us useful information on ratings. There are also some anime with NaNs for scores. We'll want to remove those, but let's wait a second on that.
In [5]:
users.drop(users[users.rating == -1].index, inplace=True)
print (len(users))
users.head()
Out[5]:
Neat. We trimmed down about 1.5 million data points. But that's still too many. Let's, for the purposes of this analysis, focus only on TV shows. We're going to need a list of anime_ids corresponding to non-TV shows. Let's kill two birds with one stone and include anime that have null ratings for whatever reason.
In [6]:
bad_ids = anime.anime_id[(anime.type != 'TV') | (anime.rating.isnull())]
users.drop(users[users.anime_id.map(lambda x: x in bad_ids)].index, inplace=True)
anime.drop(anime[(anime.type != 'TV') | (anime.rating.isnull())].index, inplace=True)
print(len(users))
print(len(anime))
Neat. We cut down to 3.6 million user/anime pairs, with 3671 unique TV titles. Note that we cut down about 75% of anime entries, yet the number of user/anime pairs cut down by less than 50%. Those titles (movies, specials, OVAs) obviously had less interest, so our resulting data is denser. That's good for our recommender systems.
We'll want to wrap this code in a pre-processing function in a package down the line, but for this exploratory analysis, I'll leave it as such.
Let's visualize the data. Let's do some basic stuff first.
In [7]:
print ("min: %f; max: %f" % (anime.rating.min(), anime.rating.max()))
print ("mean: %f; median: %f" % (anime.rating.mean(axis=0), anime.rating.median(axis=0)))
nBins = int((anime.rating.max()-anime.rating.min())/0.1) #nBins such that each bin corresponds to a 0.1 range
anime.rating.hist(bins=nBins)
Out[7]:
The data is skewed left, but not significantly so. The mean and median are very close to each other, at about 6.9. We also see spikes at the bins corresponding to [6.7, 6.8) and [7.1, 7.2), for whatever reason.
Can we run the same analysis but on a genre-by-genre basis? First, we need to get a list of genres. This probably isn't the most efficient way to do it.
In [8]:
genres = anime.genre.apply(lambda x: str(x).split(","))
genres2 = genres.apply(pd.Series)
all_genres = []
for i in range(len(genres2.columns)):
genres2[i] = genres2[i].str.strip()
all_genres += map(lambda s: str(s).strip(), list(genres2[i].unique()))
all_genres = list(np.unique(all_genres))
all_genres.remove('nan')
In [1]:
all_genres
In [28]:
import plotly.plotly as py
import plotly.graph_objs as go
data = [];
for genre in all_genres:
ratings = anime.rating[(genres2[0:] == genre).any(axis=1)]
data.append( go.Histogram(
x=ratings,
histnorm='probability density',
opacity=0.40,
name=genre,
))
layout = go.Layout(barmode='overlay')
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='anime comparison')
Out[28]:
Man, plotly is awesome. (If you don't see a chart here, you're probably reading this on Github. Try opening it in nbviewer: https://nbviewer.jupyter.org/)
So this chart is a bit hard to understand at first, so it'll be more useful to instead toggle off most of these histograms. Double click on a genre of choice (say, action) in the legend to see only the plot for action. And then single click any other genres (e.g. romance) to see them overlaid. Note that with two categories, we have 3 colors, corresponding to each show as well as the places where they're combined.
If we do that, we can see that that romance looks fairly similarly distributed to action, but shifted right a little bit. So it seems either the anime community likes romances a little bit more, or anime romances in general are a little bit better (or both!).
We can actually compare distributions, parameterized by mean & standard deviation (those do look sort of Gaussian, after all), and more easily interpret the data.
In [9]:
stats = []
for genre in all_genres:
ratings = anime.rating[(genres2[0:] == genre).any(axis=1)]
stats.append([ratings.mean(), ratings.std()])
In [10]:
fig, ax = plt.subplots()
x = np.array(stats)[:,0]
y = np.array(stats)[:,1]
plt.scatter(x, y)
for i, txt in enumerate(all_genres):
ax.annotate(txt, (x[i],y[i]))
Something further to the right means a higher mean value (more acclaimed), and further upwards means a higher std value (more disagreement). So we find thriller and psychological shows are among the more acclaimed, whereas kids shows are not looked at highly. We also find music, parody, and comedy shows to have high levels of disagreement -- which makes sense, as these are subjective in nature.
It should be pretty clear that in a recommender system, we'll want to incorporate genre information as a feature (duh!).
Now, let's examine users. First, let's look at the distributions of how many anime they've watched.
In [11]:
users.user_id.value_counts().hist(bins=100)
Out[11]:
So it looks like a power law curve, but the plot is too scaled out. Let's drop out users who have seen more than 250 anime for the sake of visualization.
In [12]:
vc = users.user_id.value_counts()
vc = vc[vc.map(lambda x: x < 250)]
vc.hist(bins=50)
Out[12]:
The power law curve of how many anime the a user has seen is pretty clear from this chart. As noted before, recommender systems work best with relatively dense data, i.e. when users have rated several products. We might find it useful, then, to remove users that have seen fewer than a certain amount of anime. This will be a hyper-parameter when we fit recommender systems. Let's go ahead and remove users who have seen fewer than 10 anime, for now.
In [13]:
vc = users.user_id.value_counts()
low_ratings = vc[vc.map(lambda x: x < 10)].index
users.drop(users[users.user_id.map(lambda x: x in low_ratings)].index, inplace=True)
print(len(users))
users.head()
Out[13]:
We lose only about 80,000 user/anime pairs, but now our dataset is more robust. The number 10 was chosen arbitrarily, of course. When we build a recommender system, we can change it based on validation error or on computational limitations.
Finally, let's get a sense of how people (who've rated more than 10 TV shows) rate, by plotting two histograms, one of mean score and one of standard deviation.
In [32]:
users.groupby('user_id')['rating'].agg(pd.np.mean).hist(bins=100)
Out[32]:
In [33]:
users.groupby('user_id')['rating'].agg(pd.np.std).hist(bins=100)
Out[33]:
This might be my inner statistics geek showing, but I really love how Gaussian the means looks and how Chi-Square (albeit with a high degree of freedom) the standard deviations look. We also note there are some folks who just give a whole lot of 10s and nothing else, causing spikes in each of the graph.