W3 Lab Assignment

Submit the .ipynb file to Canvas with file name w03_lab_lastname_firstname.ipynb.

In this lab, we will introduce pandas, matplotlib, and seaborn and continue to use the imdb.csv file from the last lab.

There will be some exercises, and as usual, write your code in the empty cells to answer them.

Importing libraries

Some of you may have already used pandas. Pandas is a library for high-performance data analysis that makes tedious jobs such as reading, manipulating, and analyzing data easy. You can even plot directly from pandas. If you have used R before, you'll see many similarities between R's data frame and the pandas DataFrame.



In [1]:

    
import pandas as pd  
import numpy as np
import warnings
warnings.filterwarnings('ignore')

Matplotlib magic

Jupyter notebook provides several magic commands. These are commands that you can use only in the notebook (not in IDLE, for instance). One of the most useful magic commands is %matplotlib inline, which displays plots within the notebook instead of creating figure files.



In [2]:

    
%matplotlib inline

There are many ways to import matplotlib, but the most common way is:



In [3]:

    
import matplotlib.pyplot as plt

Q1: Revisiting W2 lab

Let's revisit last week's exercise with pandas. It's very easy to read CSV files with pandas, using the pandas.read_csv() function. This function has many options, and it is worthwhile to take a look at them. Things that you need to be careful about are:

delimiter or sep: the data file may use ',', a tab, or some other character to separate fields. You can't read the data properly if this option is incorrect.
header: some data files have a header row that contains the names of the columns. If you read the header as data, or treat the first data row as the header, you'll have problems.
na_values or na_filter: datasets are often incomplete and contain missing data (NA, NaN (not a number), etc.). It's very important to take care of them properly.

You don't need to create dictionaries and other data structures. Pandas just imports the whole table into a data structure called DataFrame. You can do all kinds of interesting manipulation with the DataFrame.
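As a minimal sketch of these options, the cell below reads a tiny in-memory table instead of a real file. The data, the "?" missing-value marker, and the variable names raw and toy are made up for illustration; they are not part of imdb.csv.

```python
import io
import pandas as pd

# Hypothetical tab-separated data: a header row, and "?" marking a missing rating.
raw = "Title\tYear\tRating\nAlpha\t1994\t5.4\nBeta\t2006\t?\n"

toy = pd.read_csv(
    io.StringIO(raw),   # a file path works here too
    sep="\t",           # field separator (delimiter= is an alias for sep=)
    header=0,           # the first row holds the column names
    na_values=["?"],    # extra strings to treat as missing values
)

print(toy)
print(toy["Rating"].isna().sum())  # number of missing ratings
```

If sep were left at its default (a comma), each line would likely be read as a single column, which is the kind of problem the list above warns about.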



In [4]:

    
df = pd.read_csv('imdb.csv', delimiter='\t')

Let's look at the first few rows to get some sense of the data.



In [5]:

    
df.head()

    Out[5]:

   Title        Year  Rating  Votes
0  !Next?       1994     5.4      5
1  #1 Single    2006     6.1     61
2  #7DaysLater  2013     7.1     14
3  #Bikerlive   2014     6.8     11
4  #ByMySide    2012     5.5     13

You can show more or fewer rows, of course:



In [6]:

    
df.head(2)

You can extract one column by using a dictionary-like expression:



In [7]:

    
df['Year'].head(3)









    Out[7]:





0    1994
1    2006
2    2013
Name: Year, dtype: int64

or select multiple columns



In [8]:

    
df[['Year','Rating']].head(3)

To get the first 10 rows



In [9]:

    
df[:10]

    Out[9]:

   Title                            Year  Rating  Votes
0  !Next?                           1994     5.4      5
1  #1 Single                        2006     6.1     61
2  #7DaysLater                      2013     7.1     14
3  #Bikerlive                       2014     6.8     11
4  #ByMySide                        2012     5.5     13
5  #LawstinWoods                    2013     7.0      6
6  #lovemilla                       2013     6.7     17
7  #nitTWITS                        2011     7.1      9
8  $#*! My Dad Says                 2010     6.3   4349
9  $1,000,000 Chance of a Lifetime  1986     6.4     16

We can also select both rows and columns. For example, to select the first 10 rows of the 'Year' and 'Rating' columns:



In [10]:

    
df[['Year','Rating']][:10]

You can swap the order of rows and columns.

But when you deal with large datasets, you may want to stick to this principle:

Try to reduce the size of the dataset you are handling as soon as possible, and as much as possible.

For instance, if you have a billion rows with three columns, taking the small row slice (df[:10]) first and working with that slice can be much better than taking the column slice (df['Year']) first, because the column slice still contains a billion items.



In [11]:

    
df[:10][['Year','Rating']]
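The same row-and-column selection can also be written with the .iloc and .loc indexers. The sketch below uses a small made-up DataFrame (toy) rather than the IMDb data, and assumes the default integer index that read_csv produces.

```python
import pandas as pd

# Hypothetical stand-in for the IMDb table.
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Year": [1994, 2006, 2013, 2014],
    "Rating": [5.4, 6.1, 7.1, 6.8],
})

# Rows first, then columns (the order recommended for large data).
chained = toy[:2][["Year", "Rating"]]

# Position-based row slice: the end position is excluded.
by_position = toy.iloc[:2][["Year", "Rating"]]

# Label-based selection of rows and columns in one step.
# Note that .loc includes the end label, so :1 means labels 0 and 1.
by_label = toy.loc[:1, ["Year", "Rating"]]

print(chained.equals(by_position), chained.equals(by_label))
```

All three expressions select the same two rows here; which one reads best is largely a matter of taste.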

It is very easy to answer the question of the number of movies per year. The value_counts() function counts how many times each data value (year) appears.



In [12]:

    
print( min(df['Year']), df['Year'].min(), max(df['Year']), df['Year'].max() )
year_nummovies = df["Year"].value_counts()
year_nummovies.head()









    



1874 1874 2017 2017






    Out[12]:





2011    13944
2012    13887
2013    13048
2010    12931
2009    12268
Name: Year, dtype: int64
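Note that value_counts() orders its result by count, not by year. A small sketch with made-up years shows how sort_index() puts the counts back in chronological order, which is usually what you want before plotting movies per year.

```python
import pandas as pd

# Hypothetical years, not taken from imdb.csv.
years = pd.Series([2011, 2012, 2011, 2010, 2011, 2012], name="Year")

counts = years.value_counts()   # most frequent year first
by_year = counts.sort_index()   # reorder by the year label instead

print(counts)
print(by_year)
```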

To calculate average ratings and votes



In [13]:

    
print( np.mean(df['Rating']), np.mean(df['Votes']) )









    



6.29619534138 1691.2317746

or you can even do



In [14]:

    
print( df['Rating'].mean() )









    



6.29619534138
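One practical difference shows up when data is missing. The sketch below uses a made-up Series with one NaN: the pandas mean() method skips missing values by default, while NumPy's mean on a plain array does not.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings with one missing value.
ratings = pd.Series([1.0, 2.0, 3.0, np.nan])

print(ratings.mean())                   # pandas skips NaN by default
print(np.mean(ratings.to_numpy()))      # on a plain array, NaN propagates
print(np.nanmean(ratings.to_numpy()))   # NumPy's NaN-aware variant
```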

To get the median rating of movies in the 1990s, we first select only the movies from that decade:



In [15]:

    
geq = df['Year'] >= 1990
leq = df['Year'] <= 1999
movie_nineties = df[geq & leq]
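Each comparison produces a boolean Series, and & combines them element by element. As a sketch on a made-up table (toy), the same mask can be written inline, in which case each comparison needs its own parentheses, or with Series.between(), which includes both endpoints by default.

```python
import pandas as pd

# Hypothetical data for illustration.
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D", "E"],
    "Year": [1989, 1990, 1995, 1999, 2000],
})

# Inline form: parentheses are required because & binds tighter than >= and <=.
mask_ops = (toy["Year"] >= 1990) & (toy["Year"] <= 1999)

# Equivalent form: between() is inclusive on both ends by default.
mask_between = toy["Year"].between(1990, 1999)

nineties = toy[mask_between]
print(nineties)
```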



In [16]:

    
movie_nineties.head()

    Out[16]:

    Title                  Year  Rating  Votes
0   !Next?                 1994     5.4      5
23  'N Sync TV             1998     7.5     11
33  't Zal je gebeuren...  1998     6.0      7
34  't Zonnetje in huis    1993     6.1    148
42  .COM                   1999     3.8      5

Then, we can do the calculation



In [17]:

    
print( movie_nineties['Rating'].median(), movie_nineties['Votes'].median() )

Finally, if we want to know the top 10 movies in the 1990s, we can use the sort_values() function (older pandas versions called it sort()):



In [18]:

    
sorted_by_rating = movie_nineties.sort_values('Rating', ascending=False)
sorted_by_rating[:10]

    Out[18]:

        Title                                           Year  Rating    Votes
131241  Girls Loving Girls                              1996     9.8        5
202778  Nicole's Revenge                                1995     9.5       13
38899   The Beatles Anthology                           1995     9.4     3822
39429   The Civil War                                   1990     9.4     4615
218444  Pink Floyd: P. U. L. S. E. Live at Earls Court  1994     9.3     3202
279320  The Shawshank Redemption                        1994     9.3  1511933
72171   Bardot                                          1992     9.2        5
42590   The Sopranos                                    1999     9.2   163406
29419   Otvorena vrata                                  1994     9.1     2337
3955    Baseball                                        1994     9.1     2463
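In current pandas releases, rows are ordered with DataFrame.sort_values(); the older DataFrame.sort() method has been removed. The sketch below, on a made-up table, also shows nlargest(), which picks the top rows without sorting the whole table.

```python
import pandas as pd

# Hypothetical ratings for illustration.
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Rating": [5.4, 9.1, 7.1, 8.8],
})

# Sort everything in descending order, then keep the first two rows.
top2_sorted = toy.sort_values("Rating", ascending=False).head(2)

# Ask directly for the two rows with the largest ratings.
top2_nlargest = toy.nlargest(2, "Rating")

print(top2_sorted)
print(top2_nlargest)
```

Note that both results keep the original row labels, which is why the index column in the output above is not 0 to 9.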

Exercise

Calculate the following basic characteristics of ratings of movies only in 1994: 10th percentile, median, mean, 90th percentile.

Write your code in the cell below



In [53]:

    
# implement here
df[(df['Year']==1994)]['Rating'].describe([.1,0.9])









    Out[53]:





count    3415.000000
mean        6.208551
std         1.404644
min         1.000000
10%         4.200000
50%         6.400000
90%         7.960000
max         9.300000
Name: Rating, dtype: float64
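Percentiles can also be requested directly with Series.quantile(), which takes fractions between 0 and 1. This is a sketch on made-up numbers, not the 1994 ratings.

```python
import pandas as pd

# Hypothetical ratings: the values 1.0 through 10.0.
ratings = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)

# 10th percentile, median, and 90th percentile in one call.
q = ratings.quantile([0.1, 0.5, 0.9])

print(q)
print(ratings.mean())
```

By default, quantile() interpolates linearly between neighboring values, so a percentile does not have to be a value that appears in the data (as with the 7.96 in the output above).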



In [32]:

    
df['Rating'].median()









    Out[32]:





6.5

Q2: Basic plotting with pandas

Pandas provides some easy ways to draw plots by using matplotlib. The DataFrame object has several plotting methods. For instance,



In [20]:

    
df['Year'].hist()









    Out[20]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f199c28ee10>
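The object printed above is the matplotlib Axes that pandas drew on, and it can be used to adjust the plot afterwards. A minimal sketch with made-up years (the Agg backend line is only there so the code also runs outside a notebook):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import pandas as pd

# Hypothetical years for illustration.
years = pd.Series([1994, 1998, 2006, 2011, 2011, 2012, 2013, 2013, 2014], name="Year")

# hist() returns the Axes it drew on; bins sets the number of bars.
ax = years.hist(bins=5)
ax.set_xlabel("Year")
ax.set_ylabel("Number of titles")
```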

Exercise

Can you plot the histogram of ratings of the movies between 2000 and 2014?



In [38]:

    
# implement here
plt.hist(df[(df['Year']>2000) & (df['Year']<2014)]['Rating'],bins = 10)









    Out[38]:





(array([   455.,   1653.,   3960.,   6721.,  14027.,  24887.,  31904.,
         33155.,  18249.,    607.]),
 array([ 1.  ,  1.87,  2.74,  3.61,  4.48,  5.35,  6.22,  7.09,  7.96,
         8.83,  9.7 ]),
 <a list of 10 Patch objects>)

Q3: Basic plotting with matplotlib

Let's plot the histogram of ratings using the pyplot.hist() function.



In [22]:

    
plt.hist(df['Rating'], bins=10)









    Out[22]:





(array([   824.,   3363.,   9505.,  21207.,  42500.,  69391.,  86470.,
         58059.,  21538.,    154.]),
 array([ 1.  ,  1.89,  2.78,  3.67,  4.56,  5.45,  6.34,  7.23,  8.12,
         9.01,  9.9 ]),
 <a list of 10 Patch objects>)
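The tuple printed above is the return value of plt.hist(): the count in each bin, the bin edges, and the drawn bar objects. A sketch on made-up ratings shows how to capture them; assigning the result (or ending the line with a semicolon) also keeps the notebook from printing the tuple.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical ratings for illustration.
ratings = np.array([1.0, 2.5, 2.5, 4.0, 5.5, 7.0, 7.0, 7.5, 9.0, 9.9])

# counts: values per bin, edges: bin boundaries, patches: the drawn bars.
counts, edges, patches = plt.hist(ratings, bins=3)

print(counts)
print(edges)
```

There is always one more edge than there are bins, which is why the second array in the output above has 11 entries for 10 bins.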

Exercise

Let's try to make some style changes to the plot:

change the color from blue to whatever you want (see the links below)
  - http://matplotlib.org/users/pyplot_tutorial.html#working-with-text
  - http://matplotlib.org/api/colors_api.html
add labels to the x and y axes
change the number of bins to 20



In [44]:

    
# implement here
plt.hist(df[(df['Year']>2000) & (df['Year']<2014)]['Rating'],bins = 20,facecolor='g')
plt.xlabel('Rating')
plt.ylabel('Number of movies')
plt.title('Histogram of Rating Distribution for years 2000-2014')
plt.grid(True)

Q4: Basic plotting with Seaborn

Seaborn sits on top of matplotlib and makes it easier to draw statistical plots. Most plots that you create with Seaborn can also be created with matplotlib; it just typically requires a lot more work.

Make sure seaborn is installed on your computer; if it is not, run

conda install seaborn



In [45]:

    
import seaborn as sns

Let's change nothing and just run the histogram again:



In [46]:

    
plt.hist(df['Rating'], bins=10)









    Out[46]:





(array([   824.,   3363.,   9505.,  21207.,  42500.,  69391.,  86470.,
         58059.,  21538.,    154.]),
 array([ 1.  ,  1.89,  2.78,  3.67,  4.56,  5.45,  6.34,  7.23,  8.12,
         9.01,  9.9 ]),
 <a list of 10 Patch objects>)

We can use the distplot() function to plot the histogram.



In [47]:

    
sns.distplot(df['Rating'])









    Out[47]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f199261b908>

Exercise

Read the documentation of the function and make the following changes: http://stanford.edu/~mwaskom/software/seaborn/generated/seaborn.distplot.html

change the number of bins to 10;
do not show the kde;



In [54]:

    
# implement here
sns.distplot(df['Rating'],bins = 10,kde=False)
plt.xlabel('Rating')
plt.ylabel('Number of movies')
plt.title('Histogram of Rating Distribution')









    Out[54]:





<matplotlib.text.Text at 0x7f1992538cc0>
