Submit the .ipynb file to Canvas with file name w03_lab_lastname_firstname.ipynb.
In this lab, we will introduce pandas, matplotlib, and seaborn and continue to use the imdb.csv file from the last lab.
There will be some exercises, and as usual, write your code in the empty cells to answer them.
Some of you may have already used pandas. Pandas is a library for high-performance data analysis that makes the tedious jobs of reading, manipulating, and analyzing data easy and pleasant. You can even plot directly from pandas. If you have used R before, you'll see a lot of similarity between R's dataframe and pandas's DataFrame.
In [1]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')
Jupyter notebook provides several magic commands. These are commands that you can use only in the notebook (not in IDLE, for instance). One of the most useful magic commands is %matplotlib inline, which displays plots within the notebook instead of creating separate figure files.
In [2]:
%matplotlib inline
There are many ways to import matplotlib, but the most common way is:
In [3]:
import matplotlib.pyplot as plt
Let's revisit last week's exercise with pandas. It's very easy to read CSV files with pandas, using the pandas.read_csv() function. This function has many options and it is worthwhile to take a look at what is available. Options you need to be careful with:
delimiter or sep: the data file may use ',', tab, or any weird character to separate fields. You can't read the data properly if this option is incorrect.
header: some data files have a "header" row that contains the names of the columns. If you read it as data, or use the first data row as the header, you'll have problems.
na_values or na_filter: often a dataset is incomplete and contains missing data (NA, NaN (not a number), etc.). It's very important to handle these properly.
You don't need to create dictionaries and other data structures yourself. Pandas imports the whole table into a data structure called a DataFrame, and you can do all kinds of interesting manipulation with the DataFrame.
In [4]:
df = pd.read_csv('imdb.csv', delimiter='\t')
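As a quick sketch of how the sep, header, and na_values options interact, here is a tiny in-memory CSV (a made-up stand-in for a file like imdb.csv, since the exact contents of your file may differ):

```python
import io
import pandas as pd

# A tiny tab-separated "file" with a header row and one missing value,
# standing in for a real file like imdb.csv.
raw = "Title\tYear\tRating\nAlpha\t1994\t8.1\nBeta\t1999\tNA\n"

demo = pd.read_csv(
    io.StringIO(raw),
    sep='\t',         # a wrong sep would read each row as one big field
    header=0,         # the first row holds the column names
    na_values=['NA'], # treat the string 'NA' as missing data
)
print(demo)
print(demo['Rating'].isna().sum())  # one missing rating
```

With the wrong sep (say, ','), the same call would produce a single garbled column, which is usually the first thing to check when a read looks broken.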
Let's look at the first few rows to get some sense of the data.
In [5]:
df.head()
Out[5]:
You can see more or fewer lines, of course.
In [6]:
df.head(2)
Out[6]:
You can extract one column by using a dictionary-like expression:
In [7]:
df['Year'].head(3)
Out[7]:
or select multiple columns
In [8]:
df[['Year','Rating']].head(3)
Out[8]:
To get the first 10 rows:
In [9]:
df[:10]
Out[9]:
We can also select both rows and columns. For example, to select the first 10 rows of the 'Year' and 'Rating' columns:
In [10]:
df[['Year','Rating']][:10]
Out[10]:
You can swap the order of row and column selection.
But when you deal with large datasets, you may want to stick to this principle:
Reduce the size of the dataset you are handling as soon as possible, and as much as possible.
For instance, if you have a billion rows with three columns, taking the small row slice first (df[:10]) and working with it can be much better than taking the column slice (df['Year']), which still contains a billion items, and working with that.
In [11]:
df[:10][['Year','Rating']]
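As a side note, a single .loc call can do the row and column selection in one step instead of creating an intermediate object. A minimal sketch with a hypothetical toy DataFrame (column names mirror imdb.csv); note that .loc slices by label, so with a default integer index the end point is inclusive:

```python
import pandas as pd

# Hypothetical stand-in for the imdb data
toy = pd.DataFrame({'Year': [1990, 1991, 1992],
                    'Rating': [7.0, 8.0, 6.5],
                    'Votes': [10, 20, 30]})

# Chained slicing creates an intermediate object...
chained = toy[:2][['Year', 'Rating']]

# ...while .loc selects rows and columns in one step.
# .loc is label-based: toy.loc[:1] includes row label 1.
direct = toy.loc[:1, ['Year', 'Rating']]

print(chained.equals(direct))  # True
```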
Out[11]:
It is very easy to answer the question of how many movies were released per year. The value_counts() function counts how many times each value (here, each year) appears.
In [12]:
print( min(df['Year']), df['Year'].min(), max(df['Year']), df['Year'].max() )
year_nummovies = df["Year"].value_counts()
year_nummovies.head()
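Note that value_counts() orders the result by count, not by year, so head() shows the most frequent years first. Chaining sort_index() gives a chronological view instead. A sketch with made-up years:

```python
import pandas as pd

years = pd.Series([1994, 1994, 1999, 1990, 1999, 1994])

counts = years.value_counts()   # ordered by frequency: 1994 first
by_year = counts.sort_index()   # ordered chronologically: 1990 first

print(counts.index[0])   # 1994, the most frequent year
print(by_year.index[0])  # 1990, the earliest year
```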
Out[12]:
To calculate average ratings and votes
In [13]:
print( np.mean(df['Rating']), np.mean(df['Votes']) )
or you can even do
In [14]:
print( df['Rating'].mean() )
To get the median ratings of movies in the 1990s, we first select only the movies from that decade.
In [15]:
geq = df['Year'] >= 1990
leq = df['Year'] <= 1999
movie_nineties = df[geq & leq]
In [16]:
movie_nineties.head()
Out[16]:
Then, we can do the calculation
In [17]:
print( movie_nineties['Rating'].median(), movie_nineties['Votes'].median() )
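The same decade selection can also be written more compactly with Series.between(), which is inclusive on both ends and is equivalent to the two comparisons above. A sketch with toy data:

```python
import pandas as pd

# Hypothetical stand-in rows for the imdb data
toy = pd.DataFrame({'Year': [1989, 1990, 1995, 1999, 2000],
                    'Rating': [6.0, 7.0, 8.0, 9.0, 5.0]})

# between(1990, 1999) is inclusive on both ends,
# i.e. the same as (Year >= 1990) & (Year <= 1999)
nineties = toy[toy['Year'].between(1990, 1999)]

print(len(nineties))                # 3
print(nineties['Rating'].median())  # 8.0
```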
Finally, if we want to know the top 10 movies of the 1990s, we can use the sort_values() function (older versions of pandas used sort(), which has since been removed):
In [18]:
sorted_by_rating = movie_nineties.sort_values('Rating', ascending=False)
sorted_by_rating[:10]
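When all you need is the top rows, DataFrame.nlargest() expresses the same idea without sorting and then slicing. A sketch with toy ratings showing the two forms agree:

```python
import pandas as pd

toy = pd.DataFrame({'Title': ['A', 'B', 'C', 'D'],
                    'Rating': [7.1, 9.3, 8.8, 6.0]})

# nlargest(2, 'Rating') returns the 2 highest-rated rows,
# the same rows as sorting descending and slicing.
top2 = toy.nlargest(2, 'Rating')
via_sort = toy.sort_values('Rating', ascending=False)[:2]

print(top2.equals(via_sort))  # True
```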
Out[18]:
Calculate the following basic characteristics of ratings of movies only in 1994: 10th percentile, median, mean, 90th percentile.
Write your code in the cell below
In [53]:
# implement here
df[(df['Year']==1994)]['Rating'].describe([.1,0.9])
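describe() with custom percentiles is one way to answer this; Series.quantile() returns just the requested cut points, which can be handy when you only need the numbers. A sketch on made-up ratings:

```python
import pandas as pd

ratings = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# 10th percentile, median, and 90th percentile in one call
q = ratings.quantile([0.1, 0.5, 0.9])

print(q[0.5])          # 3.0, same as ratings.median()
print(ratings.mean())  # 3.0
```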
Out[53]:
In [32]:
df['Rating'].median()
Out[32]:
In [20]:
df['Year'].hist()
Out[20]:
In [38]:
# implement here
plt.hist(df[(df['Year'] > 2000) & (df['Year'] < 2014)]['Rating'], bins=10)
Out[38]:
Let's plot the histogram of ratings using the pyplot.hist() function.
In [22]:
plt.hist(df['Rating'], bins=10)
Out[22]:
In [44]:
# implement here
plt.hist(df[(df['Year'] > 2000) & (df['Year'] < 2014)]['Rating'], bins=20, facecolor='g')
plt.xlabel('Rating')
plt.ylabel('# Ratings')
plt.title('Histogram of rating distribution for years 2001-2013')
plt.grid(True)
Seaborn sits on top of matplotlib and makes it easier to draw statistical plots. Most plots that you create with seaborn can also be created with matplotlib; it just typically requires a lot more work.
Be sure seaborn is installed on your computer; otherwise, run
conda install seaborn
In [45]:
import seaborn as sns
Let's do nothing and just run the histogram again.
In [46]:
plt.hist(df['Rating'], bins=10)
Out[46]:
We can use the distplot() function to plot the histogram.
In [47]:
sns.distplot(df['Rating'])
Out[47]:
Read the documentation for the function and make the following changes: http://stanford.edu/~mwaskom/software/seaborn/generated/seaborn.distplot.html
In [54]:
# implement here
sns.distplot(df['Rating'], bins=10, kde=False)
plt.xlabel('Rating')
plt.ylabel('# Ratings')
plt.title('Histogram of the rating distribution')
Out[54]: