W3 Lab Assignment

Submit the .ipynb file to Canvas with file name w03_lab_lastname_firstname.ipynb.

In this lab, we will introduce pandas, matplotlib, and seaborn and continue to use the imdb.csv file from the last lab.

There will be some exercises, and as usual, write your code in the empty cells to answer them.

Importing libraries

Some of you may have already used pandas. Pandas is a library for high-performance data analysis that makes the tedious jobs of reading, manipulating, and analyzing data easy. You can even plot directly from pandas. If you have used R before, you'll see many similarities between R's data frame and the pandas DataFrame.


In [1]:
import pandas as pd  
import numpy as np
import warnings
warnings.filterwarnings('ignore')

Matplotlib magic

Jupyter notebook provides several magic commands. These are commands that you can use only in the notebook (not in IDLE, for instance). One of the most useful magic commands is %matplotlib inline, which displays plots within the notebook instead of creating figure files.


In [2]:
%matplotlib inline

There are many ways to import matplotlib, but the most common way is:


In [3]:
import matplotlib.pyplot as plt

Q1: Revisiting W2 lab

Let's revisit last week's exercise with pandas. It's very easy to read CSV files with pandas, using the pandas.read_csv() function. This function has many options, and it is worth taking a look at them. Things you need to be careful about are:

  1. delimiter or sep: the data file may use ',', tab, or some other character to separate fields. You can't read the data properly if this option is incorrect.
  2. header: some data files have a "header" row that contains the names of the columns, and some do not. If you read a header row as data, or treat the first data row as the header, you'll have problems.
  3. na_values or na_filter: datasets are often incomplete and contain missing data (NA, NaN (not a number), etc.). It's very important to take care of them properly.
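
As a minimal sketch of how these three options work together, the snippet below parses a small in-memory sample instead of imdb.csv. The sample text, the column names passed to names, and the "unknown" missing-value marker are all made up for illustration.

```python
import io
import pandas as pd

# Hypothetical tab-separated sample: no header row, "unknown" marks a missing rating.
raw = "Movie A\t1994\t5.4\t5\nMovie B\t2006\tunknown\t61\n"

sample = pd.read_csv(
    io.StringIO(raw),
    sep="\t",                # fields are separated by tabs, not commas
    header=None,             # the first line is data, not column names
    names=["Title", "Year", "Rating", "Votes"],
    na_values=["unknown"],   # treat this marker as missing data
)
print(sample)
print(sample["Rating"].isna().sum())  # number of missing ratings
```

With the wrong sep, each whole line would end up in a single column, and without header=None the first movie would be consumed as column names.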

You don't need to create dictionaries and other data structures. Pandas just imports the whole table into a data structure called DataFrame. You can do all kinds of interesting manipulation with the DataFrame.


In [4]:
df = pd.read_csv('imdb.csv', delimiter='\t')

Let's look at the first few rows to get some sense of the data.


In [5]:
df.head()


Out[5]:
Title Year Rating Votes
0 !Next? 1994 5.4 5
1 #1 Single 2006 6.1 61
2 #7DaysLater 2013 7.1 14
3 #Bikerlive 2014 6.8 11
4 #ByMySide 2012 5.5 13

You can see more or fewer lines, of course


In [6]:
df.head(2)


Out[6]:
Title Year Rating Votes
0 !Next? 1994 5.4 5
1 #1 Single 2006 6.1 61

You can extract one column by using a dictionary-like expression


In [7]:
df['Year'].head(3)


Out[7]:
0    1994
1    2006
2    2013
Name: Year, dtype: int64

or select multiple columns


In [8]:
df[['Year','Rating']].head(3)


Out[8]:
Year Rating
0 1994 5.4
1 2006 6.1
2 2013 7.1
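
The two outputs above look different because the two expressions return different types. A small sketch on a made-up three-row frame (the toy data is an assumption, not part of imdb.csv):

```python
import pandas as pd

# Small made-up frame standing in for the IMDb table.
toy = pd.DataFrame({
    "Title": ["A", "B", "C"],
    "Year": [1994, 2006, 2013],
    "Rating": [5.4, 6.1, 7.1],
})

one_column = toy["Year"]               # single label -> Series
two_columns = toy[["Year", "Rating"]]  # list of labels -> DataFrame
still_a_frame = toy[["Year"]]          # one-element list -> one-column DataFrame

print(type(one_column).__name__)
print(type(two_columns).__name__)
print(still_a_frame.shape)
```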

To get the first 10 rows


In [9]:
df[:10]


Out[9]:
Title Year Rating Votes
0 !Next? 1994 5.4 5
1 #1 Single 2006 6.1 61
2 #7DaysLater 2013 7.1 14
3 #Bikerlive 2014 6.8 11
4 #ByMySide 2012 5.5 13
5 #LawstinWoods 2013 7.0 6
6 #lovemilla 2013 6.7 17
7 #nitTWITS 2011 7.1 9
8 $#*! My Dad Says 2010 6.3 4349
9 $1,000,000 Chance of a Lifetime 1986 6.4 16

We can also select both rows and columns. For example, to select the first 10 rows of the 'Year' and 'Rating' columns:


In [10]:
df[['Year','Rating']][:10]


Out[10]:
Year Rating
0 1994 5.4
1 2006 6.1
2 2013 7.1
3 2014 6.8
4 2012 5.5
5 2013 7.0
6 2013 6.7
7 2011 7.1
8 2010 6.3
9 1986 6.4

You can swap the order of the row and column selections.

But when you deal with large datasets, you may want to stick to this principle:

Try to reduce the size of the dataset you are handling as soon as possible, and as much as possible.

For instance, if you have a billion rows with three columns, taking the small row slice first (df[:10]) and working with it can be much better than taking the column slice first (df['Year']) and working with that slice, which still contains a billion items.


In [11]:
df[:10][['Year','Rating']]


Out[11]:
Year Rating
0 1994 5.4
1 2006 6.1
2 2013 7.1
3 2014 6.8
4 2012 5.5
5 2013 7.0
6 2013 6.7
7 2011 7.1
8 2010 6.3
9 1986 6.4
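
Pandas also lets you select rows and columns in a single step with .iloc (by position) and .loc (by label). The sketch below, on a made-up six-row frame, shows that these give the same result as the chained form; note that .loc includes the end label, so :2 covers three rows.

```python
import pandas as pd

# Made-up data with the default 0..5 integer index.
toy = pd.DataFrame({
    "Title": list("ABCDEF"),
    "Year": [1994, 2006, 2013, 2014, 2012, 2013],
    "Rating": [5.4, 6.1, 7.1, 6.8, 5.5, 7.0],
})

chained = toy[:3][["Year", "Rating"]]            # rows first, then columns
by_position = toy.iloc[:3][["Year", "Rating"]]   # first three rows by position
by_label = toy.loc[:2, ["Year", "Rating"]]       # labels 0..2, end label included

print(chained.equals(by_position), chained.equals(by_label))
```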

It is very easy to answer the question of the number of movies per year. The value_counts() function counts how many times each data value (year) appears.


In [12]:
print( min(df['Year']), df['Year'].min(), max(df['Year']), df['Year'].max() )
year_nummovies = df["Year"].value_counts()
year_nummovies.head()


1874 1874 2017 2017
Out[12]:
2011    13944
2012    13887
2013    13048
2010    12931
2009    12268
Name: Year, dtype: int64
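
Note that value_counts() orders its result by count, largest first, which is why 2011 comes before 2012 above. If you want the counts in chronological order instead, one option is sort_index(). A sketch on a made-up list of years:

```python
import pandas as pd

# Made-up years; 2011 appears three times, 2012 twice, 2010 once.
years = pd.Series([2011, 2012, 2011, 2010, 2011, 2012], name="Year")

by_count = years.value_counts()   # most frequent year first
by_year = by_count.sort_index()   # reorder chronologically

print(by_count.index[0])          # the most frequent year
print(by_year.to_dict())
```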

To calculate average ratings and votes


In [13]:
print( np.mean(df['Rating']), np.mean(df['Votes']) )


6.29619534138 1691.2317746

or you can even do


In [14]:
print( df['Rating'].mean() )


6.29619534138

To get the median ratings of movies in the 1990s, we first select only movies in that decade


In [15]:
geq = df['Year'] >= 1990
leq = df['Year'] <= 1999
movie_nineties = df[geq & leq]

In [16]:
movie_nineties.head()


Out[16]:
Title Year Rating Votes
0 !Next? 1994 5.4 5
23 'N Sync TV 1998 7.5 11
33 't Zal je gebeuren... 1998 6.0 7
34 't Zonnetje in huis 1993 6.1 148
42 .COM 1999 3.8 5

Then, we can do the calculation


In [17]:
print( movie_nineties['Rating'].median(), movie_nineties['Votes'].median() )


6.3 32.0
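
The two comparisons each produce a boolean Series, and & combines them element by element. If you write the mask on one line, each comparison needs its own parentheses because & binds more tightly than >= and <=. Series.between() is a shorter alternative that includes both endpoints by default. A sketch on made-up data:

```python
import pandas as pd

# Made-up years chosen to sit on and around the decade boundaries.
toy = pd.DataFrame({
    "Year": [1989, 1990, 1995, 1999, 2000],
    "Rating": [5.0, 6.0, 7.0, 8.0, 9.0],
})

mask_ops = (toy["Year"] >= 1990) & (toy["Year"] <= 1999)
mask_between = toy["Year"].between(1990, 1999)  # both ends included by default

nineties = toy[mask_between]
print(mask_ops.equals(mask_between))
print(nineties["Rating"].median())
```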

Finally, if we want to know the top 10 movies in the 1990s, we can use the sort_values() function (older pandas versions offered sort() for this, but it has since been removed):


In [18]:
sorted_by_rating = movie_nineties.sort_values('Rating', ascending=False)
sorted_by_rating[:10]


Out[18]:
Title Year Rating Votes
131241 Girls Loving Girls 1996 9.8 5
202778 Nicole's Revenge 1995 9.5 13
38899 The Beatles Anthology 1995 9.4 3822
39429 The Civil War 1990 9.4 4615
218444 Pink Floyd: P. U. L. S. E. Live at Earls Court 1994 9.3 3202
279320 The Shawshank Redemption 1994 9.3 1511933
72171 Bardot 1992 9.2 5
42590 The Sopranos 1999 9.2 163406
29419 Otvorena vrata 1994 9.1 2337
3955 Baseball 1994 9.1 2463
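
Several titles in this list have only a handful of votes. One way to get a more meaningful ranking is to require a minimum number of votes before sorting; nlargest() then picks the top rows directly. The sketch below uses made-up data, and the cutoff of 1000 votes is an arbitrary assumption for illustration.

```python
import pandas as pd

# Made-up titles: "A" has the best rating but almost no votes.
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Rating": [9.8, 9.3, 9.2, 8.0],
    "Votes": [5, 1500000, 160000, 4000],
})

min_votes = 1000  # hypothetical cutoff, chosen only for this illustration
popular = toy[toy["Votes"] >= min_votes]
top2 = popular.nlargest(2, "Rating")  # two highest-rated rows among the popular ones

print(top2["Title"].tolist())
```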

Exercise

Calculate the following basic characteristics of ratings of movies only in 1994: 10th percentile, median, mean, 90th percentile.

Write your code in the cell below


In [53]:
# implement here
df[df['Year'] == 1994]['Rating'].describe(percentiles=[0.1, 0.9])


Out[53]:
count    3415.000000
mean        6.208551
std         1.404644
min         1.000000
10%         4.200000
50%         6.400000
90%         7.960000
max         9.300000
Name: Rating, dtype: float64
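
If you only need the percentiles rather than the full describe() table, quantile() accepts a list of fractions. A sketch on made-up values (the numbers 1 to 11 stand in for ratings):

```python
import pandas as pd

# Made-up "ratings": the numbers 1.0 through 11.0.
ratings = pd.Series(range(1, 12), dtype=float)

# 10th percentile, median, and 90th percentile in one call.
summary = ratings.quantile([0.1, 0.5, 0.9])

print(summary)
print(ratings.mean())
```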

In [32]:
df['Rating'].median()


Out[32]:
6.5

Q2: Basic plotting with pandas

Pandas provides some easy ways to draw plots by using matplotlib. The DataFrame object has several plotting functions. For instance,


In [20]:
df['Year'].hist()


Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f199c28ee10>

Exercise

Can you plot the histogram of ratings of the movies between 2000 and 2014?


In [38]:
# implement here
# Note: the strict comparisons keep 2001-2013 only; use >= and <= to include 2000 and 2014
plt.hist(df[(df['Year'] > 2000) & (df['Year'] < 2014)]['Rating'], bins=10)


Out[38]:
(array([   455.,   1653.,   3960.,   6721.,  14027.,  24887.,  31904.,
         33155.,  18249.,    607.]),
 array([ 1.  ,  1.87,  2.74,  3.61,  4.48,  5.35,  6.22,  7.09,  7.96,
         8.83,  9.7 ]),
 <a list of 10 Patch objects>)
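
Since this section is about plotting through pandas, here is a sketch of the same idea using the Series.hist() method on a filtered column. The toy data is made up, and the matplotlib.use("Agg") line is only there so the snippet runs outside a notebook; inside the notebook you would leave it out and rely on %matplotlib inline.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs outside a notebook
import pandas as pd

# Made-up data with years on both sides of the 2000-2014 range.
toy = pd.DataFrame({
    "Year": [1999, 2000, 2005, 2010, 2014, 2015],
    "Rating": [3.0, 5.5, 6.5, 7.0, 8.0, 9.0],
})

in_range = toy["Year"].between(2000, 2014)       # includes 2000 and 2014
ax = toy.loc[in_range, "Rating"].hist(bins=10)   # returns the matplotlib Axes
ax.set_xlabel("Rating")
ax.set_ylabel("Number of movies")

print(int(in_range.sum()))  # how many rows fall in the range
```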

Q3: Basic plotting with matplotlib

Let's plot the histogram of ratings using the pyplot.hist() function.


In [22]:
plt.hist(df['Rating'], bins=10)


Out[22]:
(array([   824.,   3363.,   9505.,  21207.,  42500.,  69391.,  86470.,
         58059.,  21538.,    154.]),
 array([ 1.  ,  1.89,  2.78,  3.67,  4.56,  5.45,  6.34,  7.23,  8.12,
         9.01,  9.9 ]),
 <a list of 10 Patch objects>)
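
The output above is the return value of pyplot.hist(): the count in each bin, the bin edges (always one more edge than there are bins), and the drawn bar objects. A sketch on made-up values that unpacks the three parts; the "Agg" backend line is only needed outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs outside a notebook
import matplotlib.pyplot as plt

# Made-up values: four fall near the low end, two near the high end.
values = [1.0, 2.0, 2.5, 3.0, 9.0, 9.5]

counts, edges, patches = plt.hist(values, bins=4)

print(counts)        # number of values in each of the 4 bins
print(edges)         # 5 edges delimit 4 bins
print(len(patches))  # one drawn bar per bin
```

Assigning the result to variables (or ending the line with a semicolon) also keeps the notebook from printing the tuple under the plot.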

Exercise

Let's try to make some style changes to the plot:


In [44]:
# implement here
# The strict comparisons keep 2001-2013 only; use >= and <= to include 2000 and 2014
plt.hist(df[(df['Year'] > 2000) & (df['Year'] < 2014)]['Rating'], bins=20, facecolor='g')
plt.xlabel('Rating')
plt.ylabel('Number of movies')
plt.title('Histogram of Rating Distribution for years 2001-2013')
plt.grid(True)


Q4: Basic plotting with Seaborn

Seaborn sits on top of matplotlib and makes it easier to draw statistical plots. Most plots that you create with Seaborn can also be created with matplotlib; it just typically requires a lot more work.

Make sure seaborn is installed on your computer; otherwise, run

conda install seaborn


In [45]:
import seaborn as sns

Let's change nothing else and just run the histogram again


In [46]:
plt.hist(df['Rating'], bins=10)


Out[46]:
(array([   824.,   3363.,   9505.,  21207.,  42500.,  69391.,  86470.,
         58059.,  21538.,    154.]),
 array([ 1.  ,  1.89,  2.78,  3.67,  4.56,  5.45,  6.34,  7.23,  8.12,
         9.01,  9.9 ]),
 <a list of 10 Patch objects>)

We can use the distplot() function to plot the histogram.


In [47]:
sns.distplot(df['Rating'])


Out[47]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f199261b908>

Exercise

Read the documentation for the function and make the following changes: http://stanford.edu/~mwaskom/software/seaborn/generated/seaborn.distplot.html

  • change the number of bins to 10;
  • do not show the KDE.

In [54]:
# implement here
sns.distplot(df['Rating'],bins = 10,kde=False)
plt.xlabel('Rating')
plt.ylabel('Number of movies')
plt.title('Histogram of Rating Distribution')


Out[54]:
<matplotlib.text.Text at 0x7f1992538cc0>