W5 Lab Assignment

This lab covers some fundamental plots of 1-D data.


In [27]:
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import pandas as pd

sns.set_style('white')

%matplotlib inline 
import warnings
warnings.filterwarnings("ignore")

Q1 1-D Scatter Plot

Using fake data

Remember that you can experiment with visualization tools using fake data as well as real data. Fake data is a convenient way to experiment because you control every aspect of it. Let's create some random numbers.

The function np.random.rand(N) draws $N$ samples from the uniform distribution on $[0, 1)$, while np.random.randn(N) draws $N$ samples from the standard normal distribution.


In [28]:
print( np.random.rand(10) )


[ 0.92732254  0.44601266  0.59433485  0.39955111  0.54506167  0.95558355
  0.1375084   0.21573451  0.04771416  0.15855766]
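
np.random.rand() and np.random.randn() are easy to confuse, so here is a small sketch comparing them. The seed and the sample size are arbitrary choices made for this illustration and are not part of the lab.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed, only to make the run repeatable

uniform_sample = np.random.rand(10000)   # uniform on [0, 1)
normal_sample = np.random.randn(10000)   # standard normal: mean 0, std 1

# rand() never leaves [0, 1); randn() is centered on 0 and goes negative
print(uniform_sample.min(), uniform_sample.max())
print(normal_sample.mean(), normal_sample.std())
```

The uniform sample stays inside $[0, 1)$, while the normal sample should have a mean close to 0 and a standard deviation close to 1.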

The following small function generates $N$ normally distributed numbers:


In [29]:
def generate_many_numbers(N=10, mean=5, sigma=3):
    return mean + sigma * np.random.randn(N)

Generate 10 normally distributed numbers with mean 5 and sigma 3:


In [30]:
data = generate_many_numbers(N=10)
print(data)


[-1.70232553  6.50831593  6.00320728  9.22047422  7.34258577  4.30483447
  1.82233298  1.58048897  4.40789351  0.86419455]
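
With only 10 numbers it is hard to tell whether the mean and sigma arguments do what we expect. A rough check, assuming a much larger sample (the size 100000 and the seed are arbitrary), is to compare the sample mean and standard deviation with the requested values:

```python
import numpy as np

def generate_many_numbers(N=10, mean=5, sigma=3):
    # same definition as in the lab: shift and scale standard normal samples
    return mean + sigma * np.random.randn(N)

np.random.seed(1)  # arbitrary seed, only to make the run repeatable

big_sample = generate_many_numbers(N=100000)
print(big_sample.mean())  # should be close to 5
print(big_sample.std())   # should be close to 3
```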

The most immediate method to visualize 1-D data is just plotting it. Here we can use the scatter() function to draw a scatter plot. The most basic usage of this function is to provide x and y.


In [31]:
x = np.arange(1,11)
y = x + 5
print(x)
print(y)
plt.scatter(x, y)


[ 1  2  3  4  5  6  7  8  9 10]
[ 6  7  8  9 10 11 12 13 14 15]
Out[31]:
<matplotlib.collections.PathCollection at 0x7f9d5738fa90>

But here we only have x (the generated data). We can set the y values to 0. The np.zeros_like(data) function creates a numpy array of zeros that has the same shape and data type as its argument.


In [32]:
print(np.zeros_like(data))


[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
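
As a quick illustration of what "same shape and data type" means, here is a minimal sketch with two made-up arrays (the names ints and floats are only for this example):

```python
import numpy as np

ints = np.array([[1, 2, 3], [4, 5, 6]])  # 2-D integer array
floats = np.array([1.5, 2.5])            # 1-D float array

zeros_for_ints = np.zeros_like(ints)
zeros_for_floats = np.zeros_like(floats)

# the result copies the shape and dtype of the input, not its values
print(zeros_for_ints.shape, zeros_for_ints.dtype)
print(zeros_for_floats.shape, zeros_for_floats.dtype)
```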

Now let's plot the generated 1-D data.


In [33]:
plt.figure(figsize=(10,1)) # set figure size, width = 10, height = 1
plt.scatter(data, np.zeros_like(data), s=50) # set size of symbols to 50. Change it and see what happens. 
plt.gca().axes.get_yaxis().set_visible(False) # set y axis invisible


Ok, I think we can see all data points. But what if we have more numbers?


In [34]:
# TODO: generate 100 numbers and plot them in the same way. 
data = generate_many_numbers(N=100)  # normally distributed, like the 10-point example

plt.figure(figsize=(10,1))
plt.scatter(data, np.zeros_like(data), s = 50) 
plt.gca().axes.get_yaxis().set_visible(False)


Of course we can't see much at the center, where the points overlap. We can add "jitter", a random vertical offset for each point, using the np.random.rand() function.


In [35]:
data = generate_many_numbers(N=100)

# TODO: create a list of 100 random numbers using np.random.rand()
# zittered_ypos = ??

zittered_ypos = np.random.rand(100)

plt.figure(figsize=(10,1))
plt.scatter(data, zittered_ypos, s=50)
plt.gca().axes.get_yaxis().set_visible(False)
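
The jitter itself can be looked at without plotting. The sketch below wraps it in a hypothetical helper, make_jitter(), which is not part of the lab; its width parameter is an assumption added to show how the vertical spread could be controlled.

```python
import numpy as np

def make_jitter(n, width=1.0):
    # hypothetical helper: n vertical offsets drawn uniformly from [0, width)
    return width * np.random.rand(n)

np.random.seed(2)  # arbitrary seed, only to make the run repeatable

data = 5 + 3 * np.random.randn(100)  # same recipe as generate_many_numbers
ypos = make_jitter(len(data))

# one offset per data point; the x positions (the data) are left untouched
print(ypos.shape, ypos.min(), ypos.max())
```

The offsets carry no information. They only spread overlapping points apart vertically, which is why the y axis is hidden in the plot.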


Let's also make the symbols transparent. The documentation of scatter() helps here: the alpha parameter sets the opacity of the markers.


In [36]:
data = generate_many_numbers(N=200)

# TODO: implement this
# jittered y positions as in the last question; alpha makes the symbols transparent
zittered_ypos = np.random.rand(200)
plt.figure(figsize=(10,1))
plt.scatter(data, zittered_ypos, s = 50, alpha = 0.35)
plt.gca().axes.get_yaxis().set_visible(False)


We can use transparency as well as empty symbols.

  • Increase the number of points to 1,000
  • Make the symbols empty (no fill) and set the edge color to red

In [37]:
# TODO: implement this
# 1,000 normally distributed points with random vertical jitter
data = generate_many_numbers(N=1000)
zittered_ypos = np.random.rand(1000)

plt.figure(figsize=(10,1))
# facecolors='none' leaves the markers unfilled; edgecolors='r' draws red outlines
plt.scatter(data, zittered_ypos, s=50, facecolors='none', edgecolors='r')
plt.gca().axes.get_yaxis().set_visible(False)


Lots and lots of points

Let's use real data. Load the IMDb dataset that we used before.


In [38]:
movie_df = pd.read_csv('imdb.csv', delimiter='\t')
movie_df.head()


Out[38]:
Title Year Rating Votes
0 !Next? 1994 5.4 5
1 #1 Single 2006 6.1 61
2 #7DaysLater 2013 7.1 14
3 #Bikerlive 2014 6.8 11
4 #ByMySide 2012 5.5 13

Try to plot the 'Rating' column using a 1-D scatter plot. Does it work?


In [39]:
# TODO: plot 'rating'

rating = movie_df['Rating'].values
plt.figure(figsize=(10,1)) 
plt.scatter(rating, np.zeros_like(rating), s = 50) 
plt.gca().axes.get_yaxis().set_visible(False)


Q2 Histogram

There are too many data points! Let's try a histogram. pandas supports plotting through matplotlib, so you can visualize dataframes and series directly.


In [40]:
movie_df['Rating'].hist()


Out[40]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f9d5392f358>

Looks good! Can you increase or decrease the number of bins? See the bins parameter in the documentation of hist().


In [41]:
# TODO: try different number of bins
movie_df['Rating'].hist(bins = 30)


Out[41]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f9d538db5f8>

In [42]:
movie_df['Rating'].hist(bins = 20)


Out[42]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f9d53806a90>
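
The effect of the bins argument can also be inspected numerically with np.histogram(), which returns the counts and the bin edges that a histogram would draw. The sketch below uses synthetic ratings as a stand-in, since it does not assume imdb.csv is available; the distribution parameters are arbitrary.

```python
import numpy as np

np.random.seed(3)  # arbitrary seed, only to make the run repeatable

# synthetic stand-in for the 'Rating' column, clipped to the 1-10 scale
fake_ratings = np.clip(6 + 1.5 * np.random.randn(1000), 1, 10)

counts_10, edges_10 = np.histogram(fake_ratings, bins=10)
counts_30, edges_30 = np.histogram(fake_ratings, bins=30)

# n bins always means n counts and n + 1 edges
print(len(counts_10), len(edges_10))
print(len(counts_30), len(edges_30))

# more bins means narrower bins; every point is still counted exactly once
print(edges_10[1] - edges_10[0], edges_30[1] - edges_30[0])
print(counts_10.sum(), counts_30.sum())
```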

Q3 Boxplot

Now let's try a boxplot. We can use pandas' plotting functions; the pandas plotting documentation describes the options for boxplots.


In [43]:
movie_df['Rating'].plot(kind='box', vert=False)


Out[43]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f9d53875a20>
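
To read a boxplot it helps to know what it summarizes. The sketch below computes the usual ingredients by hand on synthetic data (a stand-in for the ratings, not the real column), assuming the common convention that whiskers reach to the most extreme points within 1.5 times the interquartile range of the box. This is matplotlib's default, but it can be changed.

```python
import numpy as np

np.random.seed(4)  # arbitrary seed, only to make the run repeatable

values = 6 + 1.5 * np.random.randn(1000)  # synthetic stand-in for ratings

# the box: first quartile, median, third quartile
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# fences at 1.5 * IQR beyond the box (assumed convention)
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# whiskers end at the most extreme points that are still inside the fences
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# everything outside the fences is drawn as an individual outlier point
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, median, q3)
print(whisker_low, whisker_high, len(outliers))
```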

Or try seaborn's boxplot() function:


In [44]:
sns.boxplot(x=movie_df['Rating'])  # pass the series as x explicitly for a horizontal box


Out[44]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f9d56b4f6a0>

We can also easily draw a series of boxplots grouped by categories. For example, let's do the boxplots of movie ratings for different decades.


In [45]:
df = movie_df.sort_values('Year')  # DataFrame.sort() was removed from pandas; use sort_values()
df.head()


Out[45]:
Title Year Rating Votes
215207 Passage de Venus 1874 6.5 174
234798 Sallie Gardner at a Gallop 1878 7.3 452
186796 Man Walking Around the Corner 1887 5.1 365
57131 Accordion Player 1888 5.7 433
232543 Roundhay Garden Scene 1888 7.7 3451

One easy way to map a particular year to its decade (e.g., 1874 -> 1870) is to integer-divide it by 10 and then multiply the result by 10 again.

In Python 3, the // operator is used for integer division.


In [46]:
print(1874//10)
print(1874//10*10)
decade = (df['Year']//10) * 10
decade.head()


187
1870
Out[46]:
215207    1870
234798    1870
186796    1880
57131     1880
232543    1880
Name: Year, dtype: int64
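
The choice of operator matters here. As a small sketch on a made-up series of years (not the real dataframe), compare floor division with true division:

```python
import pandas as pd

years = pd.Series([1874, 1888, 1994, 2006, 2013])  # made-up example years

# floor division drops the last digit, so multiplying back gives the decade
decades = (years // 10) * 10
print(decades.tolist())

# true division keeps the fractional part, so nothing is rounded away
print(1874 // 10)  # 187
print(1874 / 10)   # 187.4
```

With `/` the year would come back essentially unchanged after multiplying by 10, so the grouping into decades would be lost.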

In [47]:
ax = sns.boxplot(x=decade, y=df['Rating'])
ax.figure.set_size_inches(12, 8)


Can you draw boxplots of movie votes for different decades?


In [48]:
# TODO
ax = sns.boxplot(x=decade, y=df['Votes'])
ax.figure.set_size_inches(12, 8)


What do you see? Can you actually see the "box"? The number of votes spans a very wide range, from 1 to more than 1.4 million. One way to deal with this is to log-transform the votes, which can be done with the np.log() function (the natural logarithm).


In [49]:
log_votes = np.log(df['Votes'])
log_votes.head()


Out[49]:
215207    5.159055
234798    6.113682
186796    5.899897
57131     6.070738
232543    8.146419
Name: Votes, dtype: float64
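
To get a feel for how strongly the logarithm compresses this range, here is a minimal sketch on a few hand-picked vote counts (illustrative values spanning the range mentioned above, not taken from the dataset). It also shows np.log10(), which is sometimes easier to read because the result counts powers of ten.

```python
import numpy as np

votes = np.array([1, 10, 1000, 1400000])  # hand-picked illustrative values

natural = np.log(votes)    # natural logarithm, as used in the lab
base10 = np.log10(votes)   # base-10 alternative

# a range of more than six orders of magnitude shrinks to a span of about 14
print(natural)
print(base10)
```

The two transforms differ only by a constant factor, so the boxplots would have the same shape either way; only the axis labels would change.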

Can you draw boxplots of log-transformed movie votes for different decades?


In [50]:
# TODO
ax = sns.boxplot(x=decade, y = log_votes)
ax.figure.set_size_inches(12, 8)


