In-Class Coding Lab: Data Visualization

The goals of this lab are to help you understand:

  • The value of visualization: A picture is worth 1,000 words!
  • The various ways to visualize information
  • The basic requirements for any visualization
  • How to plot complex visualizations such as multi-series charts and maps
  • Visualization Tools:
    • Matplotlib
    • Plot.ly
    • Folium Maps

In [ ]:
# first, make sure we have the right modules installed
!pip install --upgrade chart-studio plotly

In [1]:
%matplotlib inline 

import matplotlib
import matplotlib.pyplot as plt
import chart_studio as plotly
import chart_studio.plotly as py
import plotly.express as px
import plotly.graph_objects as go
import pandas as pd
import folium
import warnings
#matplotlib.rcParams['figure.figsize'] = (20.0, 10.0) # larger figure size
warnings.filterwarnings('ignore')

Back to the movie goers data set

For this lab, we will once again use the movie goers dataset. As you may recall, this dataset is a demographic survey of people who go to the movies. Let's reload the data and set up our age_group feature again.


In [3]:
goers = pd.read_csv('CCL-moviegoers.csv')
goers['age_group'] = ''
# use .loc so the assignment changes goers itself, not a temporary copy
goers.loc[goers['age'] <= 18, 'age_group'] = 'Youth'
goers.loc[(goers['age'] >= 19) & (goers['age'] <= 55), 'age_group'] = 'Adult'
goers.loc[goers['age'] >= 56, 'age_group'] = 'Senior'

goers.sample(5)


Out[3]:
user_id age gender occupation zip_code age_group
835 836 44 M artist 10018 Adult
447 448 23 M entertainment 10021 Adult
314 315 31 M educator 18301 Adult
585 586 20 M student 79508 Adult
728 729 19 M student 56567 Adult
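
As an alternative to the three comparisons used for age_group, pandas can assign the groups in one step with pd.cut(). The sketch below uses a small made-up set of ages rather than the survey file, and the upper bound of 120 is an assumption chosen only to cover any realistic age.

```python
import pandas as pd

# Made-up ages for illustration; not taken from CCL-moviegoers.csv
sample = pd.DataFrame({'age': [7, 18, 19, 55, 56]})

# Bin edges are right-inclusive: (0, 18], (18, 55], (55, 120]
# include_lowest=True also places an age of exactly 0 in the first bin
sample['age_group'] = pd.cut(
    sample['age'],
    bins=[0, 18, 55, 120],
    labels=['Youth', 'Adult', 'Senior'],
    include_lowest=True,
)
sample
```

The resulting column is categorical rather than plain text, which also keeps Youth, Adult, and Senior in that order when you count or plot them.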

Visualizing Data

There are many ways you can visualize information. Which one is the most appropriate? It depends on the data, of course.

  • Counts of categorical data belong in charts like pie charts and bar charts.
  • Counts of numerical data are best suited for histograms.
  • Time series data and continuous data belong in line charts.
  • A comparison of two continuous values is best suited for a scatter plot.
  • Geographical data is best displayed on maps.
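
Line charts and scatter plots do not come up again in this lab, so here is a minimal sketch of both. The DataFrame and its column names (month, tickets_sold, avg_temp_f) are made up for illustration and have nothing to do with the movie goers survey.

```python
import pandas as pd

# Made-up data for illustration only
tickets = pd.DataFrame({
    'month': [1, 2, 3, 4, 5, 6],
    'tickets_sold': [120, 135, 150, 160, 190, 240],
    'avg_temp_f': [30, 34, 45, 58, 68, 77],
})

# Time series data: a line chart shows the trend from month to month
line_ax = tickets.plot.line(x='month', y='tickets_sold')

# Two continuous values: a scatter plot shows how one varies with the other
scatter_ax = tickets.plot.scatter(x='avg_temp_f', y='tickets_sold')
```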

Let's use this knowledge to plot some data in the goers DataFrame!

Males or Females?

The first thing we might want to visualize is a count of gender in the dataset. A pie chart is well suited for this task as it displays data as a portion of a whole. To create a pie chart we need the data to count and the labels for the counts.

Let's try it.

First we get the value counts as a series gender:


In [ ]:
gender = goers['gender'].value_counts()
gender

Then we make it into a dataframe:


In [ ]:
gender_df = pd.DataFrame( { 'Gender' : gender.index,  "Counts" : gender })
gender_df

Then we plot! The index has the labels, and the value at the index is what we want to plot:


In [ ]:
gender_df.plot.pie( y = 'Counts') # y are the values we are plotting

Now You Try it!

Create a pie chart based on age_group. First, create a series from value_counts(). Second, create a DataFrame with two columns, AgeGroup and Counts. Then plot with .plot.pie().

Follow the steps we did in the previous three cells, but combine them into one cell!


In [ ]:
#todo write code here

Too many pieces of the pie?

Pie charts are nice, but they are only useful when you have a small number of labels. More than 5-7 labels and the pie becomes messy. For example, take a look at this pie chart of occupation:


In [ ]:
occ = goers['occupation'].value_counts()
occ_df = pd.DataFrame( { 'occupation' : occ.index,  "counts" : occ })
occ_df.plot.pie(y = 'counts')

That's crazy... and difficult to comprehend. Also pie charts visualize data as part of the whole. We have no idea how many students there are. Sometimes we want to know actual counts. This is where the bar chart comes in handy!

Raising the bar!

Let's reproduce the same plot as a bar:


In [ ]:
occ_df.plot.bar()

Ahh, that's much better. So much easier to understand!

Now you try it!

Write a one-liner to plot gender_df as a bar chart!


In [ ]:
# todo write code here

When bar charts fail...

Bar charts have the same problem as pie charts. Too many categories overcomplicate the chart, or show the data in a meaningless way. For example, let's create a bar chart for ages:


In [ ]:
ages = goers['age'].value_counts()
ages_df = pd.DataFrame( { 'age' : ages.index,  "counts" : ages })
ages_df.plot.bar(y = 'counts')

Meaningless. For two key reasons:

  1. too many categories
  2. age is a continuous variable, not a categorical variable. In plain English, this means there's a relationship between one age and the next: 20 < 21 < 22. A bar chart of counts does not represent that ordering.

...Call in the Histogram!

What we want is a histogram, which takes a continuous variable and loads counts into "buckets". Notice how we don't have to lump the data together with value_counts() first. Histograms can do that automatically because the age variable is continuous. Let's try it:


In [ ]:
goers.hist(column ='age', color='gray', edgecolor='blue')

The default histogram has 10 bins. You can tweak the number of bins in your plot with a named argument. For example, here's 15 bins.


In [ ]:
goers.hist(column ='age', bins=15, color='pink', edgecolor='red')
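
To see the kind of bucketing a histogram does, here is a sketch that computes the counts without drawing anything. It uses numpy.histogram() on a handful of made-up ages, which is an assumption for illustration; the point is only to show equal-width bins and the count that lands in each one.

```python
import numpy as np

# Made-up ages for illustration only
ages = [10, 12, 20, 25, 30, 31, 40]

# Split the range 10..40 into 3 equal-width bins and count the values in each
counts, edges = np.histogram(ages, bins=3)

print(edges)   # bin boundaries: 10, 20, 30, 40
print(counts)  # values per bin: 2, 2, 3
```

Each bin includes its left edge and excludes its right edge, except the last bin, which includes both. That is why 20 is counted in the second bin and 40 in the third.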

Now you try it!

Make a histogram of ages with 7 bins, a bar color of cyan, and an edge color of black.


In [ ]:
# todo write code here

Plot.ly

Plot.ly is data visualization as a service. You give it data, it gives you back a web-based plot. Plot.ly is free and works with a variety of environments and programming languages, including Python.

For Python, it has bindings so that you can use it just like matplotlib! No need to manually invoke the web service call.

To get started with plot.ly you must sign up for an account and get a set of credentials:

  • Visit https://plot.ly/settings/api
  • Create an account or sign-in with Google or GitHub
  • Generate your API key and paste your username and key in the code below:

In [ ]:
# todo: setup the credentials replace ??? and ??? with your Plot.ly username and api_key

plotly.tools.set_credentials_file(username='???', api_key='???')

Plotly Express... easy as pie!

Using plot.ly is as easy as, or sometimes easier than, matplotlib. The Plotly Express module (imported here as px) allows for easy plotting of data frames. Here's an example of plotting the same pie chart from above:


In [ ]:
px.pie(gender_df, names='Gender', values='Counts') # names are the slice labels, values are the slice sizes

Notice that plot.ly is a bit more interactive. You can hover over the part of the pie chart and see counts!

Plotly Express... raising the bar!

Here's the same information in a bar chart using plotly express.


In [ ]:
px.bar(gender_df, x='Gender', y='Counts') # x are the categories, y are the bar heights

Many chart types to choose from.

If you dir(px) you can see all the different chart types supported by Plotly Express. To learn how to plot one, use help(px.bar) for example to learn how to plot a bar chart. You can also consult https://plot.ly/python/ for more information.


In [ ]:
# TODO: Type dir(px) to see the different plot types, then use help() to bring up help for a plot type.

Now You Try it!

Use Plotly Express's px to create a bar chart on the occ_df Data Frame:


In [ ]:
# todo: write code here

PoP (Plain old Plotly)

Plotly Express is great for Pandas dataframes, but Plotly can plot anything. Basically the setup is:

  1. make a figure object
  2. add traces (series) to the plot
  3. set the labels, if you want
  4. show the figure.

Here's the same gender plot, Plotly style:


In [ ]:
series = go.Bar(x=gender_df['Gender'], y=gender_df['Counts']) 

fig = go.Figure()
fig.add_trace(series)
fig.update_layout(title="Count of Movie Goers By Gender", xaxis_title = 'Gender', yaxis_title='Number of People')
fig.show()

Need more than one series?

When you use Plain old Plotly, it's simple to add multiple series to the plot. All you do is add additional graph objects using the add_trace() method. Let's make up some extra data and add it to the plot. Note how we add the name= argument to label each series. Also, the series do not all need to be the same plot type. You can mix Bar with Line, etc.


In [ ]:
projections = [800, 400]

fig2 = go.Figure()
fig2.add_trace(go.Bar(x=gender_df['Gender'], y=gender_df['Counts'], name="Actual")) # Bar Plot of actual
fig2.add_trace(go.Bar(x=gender_df['Gender'], y=projections, name="Projected"))        # Bar plot of Projected
fig2.update_layout(title="Count of Movie Goers By Gender", xaxis_title = 'Gender', yaxis_title='Number of People')
fig2.show()

Now You Try It!

The following code adds a column 'last_year' to the occ_df.

Create a two-series plotly plot of movie goers counts. Label counts as This year and last_year as Last Year, of course.


In [ ]:
import random
# note: randint() runs once, so the same offset is added to every row
occ_df['last_year'] = random.randint(-15,15) + occ_df['counts']
occ_df.head(3)

In [ ]:
# TODO: Make your 2 series plotly plot here

Folium with Leaflet.js

Folium is a Python wrapper for Leaflet.js, an open source JavaScript mapping library that draws OpenStreetMap map tiles by default. Unlike the Google Maps API, it's 100% free!

You can use Folium to render maps in Python and put data on the maps. Here's how easy it is to bring up a map:


In [ ]:
CENTER_US = (39.8333333,-98.585522)
london = (51.5074, -0.1278)
map = folium.Map(location=CENTER_US, zoom_start=4)
map

You can zoom right down to the street level and get an amazing amount of detail. There are different maps you can use, as was covered in this week's reading.

Mapping the student movie goers.

Let's take the largest category of movie goers and map their whereabouts. We will first need to import a data set to give us a lat/lng for the zip_code we have in the dataframe. We could look this up with Google's geolookup API, but that's too slow as we would be making hundreds of requests. It's better to have the coordinates stored already and merge them with goers!

Let's import the zipcode database into a Pandas DataFrame, then merge it with the goers DataFrame:


In [4]:
zipcodes = pd.read_csv('https://raw.githubusercontent.com/mafudge/datasets/master/zipcodes/free-zipcode-database-Primary.csv', dtype = {'Zipcode' :object})
data = goers.merge(zipcodes,  how ='inner', left_on='zip_code', right_on='Zipcode')
students = data[ data['occupation'] == 'student']
students.sample()


Out[4]:
user_id age gender occupation zip_code age_group Zipcode ZipCodeType City State LocationType Lat Long Location Decommisioned TaxReturnsFiled EstimatedPopulation TotalWages
693 742 35 M student 29210 Adult 29210 STANDARD COLUMBIA SC PRIMARY 34.0 -81.03 NA-US-SC-COLUMBIA False 16527.0 25830.0 440319512.0

Let's explain the code, as a Pandas refresher course:

  1. in the first line I added dtype = {'Zipcode' :object} to force the Zipcode column to be of type object. Without that, it imports as type int and cannot match the zip_code column in the goers DataFrame.
  2. the next line merges the two dataframes together where the zip_code in goers (left_on) matches Zipcode in zipcodes (right_on)
  3. the result, data, is a combined DataFrame, which we then filter to only student occupations, storing that in the students DataFrame
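
Here is a small sketch of why the dtype matters. The two-row CSV text and the people DataFrame are made up for illustration; the column names simply mirror the ones used in the merge.

```python
import io
import pandas as pd

# Hypothetical zip code file; note the leading zero in 02134
csv_text = "Zipcode,City\n02134,Boston\n13244,Syracuse\n"

# Without dtype, pandas reads the column as integers and drops the leading zero
as_numbers = pd.read_csv(io.StringIO(csv_text))

# With dtype=object, the zip codes stay as text exactly as written
as_text = pd.read_csv(io.StringIO(csv_text), dtype={'Zipcode': object})

# Made-up people whose zip codes are stored as text
people = pd.DataFrame({'user_id': [1, 2, 3],
                       'zip_code': ['02134', '13244', 'V3N4P']})

# An inner merge keeps only the rows whose zip code appears in both DataFrames
matched = people.merge(as_text, how='inner', left_on='zip_code', right_on='Zipcode')
matched
```

The third person has no match in the zip code file, so the inner merge drops that row.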

Slapping those students on a map!

We're ready to place the students on a map. It's easy:

  1. For each row in the students dataframe:
  2. get the coordinates (lat /lng )
  3. make a marker with the coordinates
  4. add the marker to the map with add_child()

Here we go!


In [ ]:
for row in students.to_records():
    pos = (row['Lat'],row['Long'])
    message = f"{row['age']} year old {row['gender']} {row['occupation']} from {row['City']}, {row['State']}"
    marker = folium.Marker(location=pos, 
                    popup=message
                          )
    map.add_child(marker)
map

Now you try it!

  1. use the data DataFrame to retrieve only the occupation programmer
  2. create another map, map2, and plot the programmers on that map!

In [ ]:
## todo write code here!

Metacognition

Please answer the following questions. This should be a personal narrative, in your own voice. Answer the questions by double clicking on the question and placing your answer next to the Answer: prompt.

Questions

  1. Record any questions you have about this lab that you would like to ask in recitation. It is expected you will have questions if you did not complete the code sections correctly. Learning how to articulate what you do not understand is an important skill of critical thinking.

Answer:

  2. What was the most difficult aspect of completing this lab? Least difficult?

Answer:

  3. What aspects of this lab do you find most valuable? Least valuable?

Answer:

  4. Rate your comfort level with this week's material so far.

1 ==> I can do this on my own and explain how to do it.
2 ==> I can do this on my own without any help.
3 ==> I can do this with help or guidance from others. If you choose this level please list those who helped you.
4 ==> I don't understand this at all yet and need extra help. If you choose this please try to articulate that which you do not understand.

Answer:


In [ ]:
# SAVE YOUR WORK FIRST! CTRL+S
# RUN THIS CODE CELL TO TURN IN YOUR WORK!
from ist256.submission import Submission
Submission().submit()