The goals of this lab are to help you understand:
In [ ]:
# first, make sure we have the right modules installed
!pip install --upgrade chart-studio plotly
In [1]:
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import chart_studio as plotly
import chart_studio.plotly as py
import plotly.express as px
import plotly.graph_objects as go
import pandas as pd
import folium
import warnings
#matplotlib.rcParams['figure.figsize'] = (20.0, 10.0) # larger figure size
warnings.filterwarnings('ignore')
In [3]:
goers = pd.read_csv('CCL-moviegoers.csv')
goers['age_group'] = ''
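# use .loc so the assignment updates goers itself (chained indexing may write to a copy)
goers.loc[goers['age'] <= 18, 'age_group'] = 'Youth'
goers.loc[(goers['age'] >= 19) & (goers['age'] <= 55), 'age_group'] = 'Adult'
goers.loc[goers['age'] >= 56, 'age_group'] = 'Senior'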
goers.sample(5)
Out[3]:
There are many ways you can visualize information. Which one is the most appropriate? It depends on the data, of course.
Let's use this knowledge to plot some data in the goers DataFrame!
The first thing we might want to visualize is a count of gender in the dataset. A pie chart is well suited for this task as it displays data as a portion of a whole. To create a pie chart we need the data to count and the labels for the counts.
Let's try it.
First we get the value counts as a series, gender:
In [ ]:
gender = goers['gender'].value_counts()
gender
Then we make it into a dataframe:
In [ ]:
gender_df = pd.DataFrame( { 'Gender' : gender.index, "Counts" : gender })
gender_df
Then we plot! The index has the labels, and the value at the index is what we want to plot:
In [ ]:
gender_df.plot.pie( y = 'Counts') # y are the values we are plotting
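Because a pie chart shows each count as a share of the whole, it can help to look at those shares as numbers too. The sketch below uses a small made-up Series (the values are illustrative, not taken from CCL-moviegoers.csv) and value_counts(normalize=True) to get proportions instead of raw counts.

```python
import pandas as pd

# Made-up sample for illustration only; not the real moviegoers data.
sample = pd.Series(['M', 'F', 'M', 'M', 'F', 'M', 'M', 'F'], name='gender')

counts = sample.value_counts()                 # raw counts per label
shares = sample.value_counts(normalize=True)   # same labels, as fractions of the whole

# Put both side by side; the Share column is what the pie slices represent.
summary = pd.DataFrame({'Counts': counts, 'Share': shares})
print(summary)
```

The Share column always sums to 1, which is why a pie chart only makes sense for parts of a single whole.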
In [ ]:
#todo write code here
In [ ]:
occ = goers['occupation'].value_counts()
occ_df = pd.DataFrame( { 'occupation' : occ.index, "counts" : occ })
occ_df.plot.pie(y = 'counts')
In [ ]:
occ_df.plot.bar()
In [ ]:
# todo write code here
In [ ]:
ages = goers['age'].value_counts()
ages_df = pd.DataFrame( { 'age' : ages.index, "counts" : ages })
ages_df.plot.bar(y = 'counts')
Meaningless. For two key reasons:
What we want is a histogram, which takes a continuous variable and loads counts into "buckets". Notice how we didn't have to lump data with value_counts(). Histograms can do that automatically because the age variable is continuous. Let's try it:
In [ ]:
goers.hist(column ='age', color='gray', edgecolor='blue')
The default histogram has 10 bins. You can tweak the number of bins in your plot with the bins named argument. For example, here are 15 bins.
In [ ]:
goers.hist(column ='age', bins=15, color='pink', edgecolor='red')
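To see what the histogram is doing behind the scenes, here is a minimal sketch that buckets a handful of made-up ages with pd.cut() and counts each bucket. The ages and the bin edges are assumptions chosen for illustration; goers.hist() picks equal-width bins for you.

```python
import pandas as pd

# Made-up ages for illustration only.
ages = pd.Series([7, 15, 18, 22, 25, 31, 38, 44, 52, 60, 67, 73])

# Hypothetical bin edges: (0, 20], (20, 40], (40, 60], (60, 80]
edges = [0, 20, 40, 60, 80]
buckets = pd.cut(ages, bins=edges)

# Count how many ages fall in each bucket, keeping the buckets in order.
bucket_counts = buckets.value_counts().sort_index()
print(bucket_counts)
```

Each bar in a histogram is one of these bucket counts, so changing the number of bins changes how the same ages are grouped.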
In [ ]:
# todo write code here
Plot.ly is data visualization as a service. You give it data, it gives you back a web-based plot. Plot.ly is free and works with a variety of environments and programming languages, including Python.
For Python it has bindings so that you can use it just like matplotlib! No need to manually invoke the web service call.
To get started with plot.ly you must sign up for an account and get a set of credentials:
In [ ]:
# todo: setup the credentials replace ??? and ??? with your Plot.ly username and api_key
plotly.tools.set_credentials_file(username='???', api_key='???')
In [ ]:
px.pie(gender_df, names='Gender', values='Counts')  # names are the slice labels, values are the slice sizes
In [ ]:
px.bar(gender_df, x='Gender', y='Counts')  # x is the category axis, y is the bar height
If you dir(px) you can see all the different chart types supported by Plotly Express. To learn how to plot one, use help(px.bar), for example, to learn how to plot a bar chart. You can also consult https://plot.ly/python/ for more information.
In [ ]:
# TODO: Type dir(px) to see the different plot types, then use help() to bring up help for a plot type.
In [ ]:
# todo: write code here
In [ ]:
series = go.Bar(x=gender_df['Gender'], y=gender_df['Counts'])
fig = go.Figure()
fig.add_trace(series)
fig.update_layout(title="Count of Movie Goers By Gender", xaxis_title = 'Gender', yaxis_title='Number of People')
fig.show()
When you use plain old Plotly, it's simple to add multiple series to the plot. All you do is add additional graph objects using the add_trace() method. Let's make up some extra data and add it to the plot. Note how we add the name= argument to label each series. Also, each series does not need to be the same plot type. You can mix Bar with Line, etc.
In [ ]:
projections = [800, 400]
fig2 = go.Figure()
fig2.add_trace(go.Bar(x=gender_df['Gender'], y=gender_df['Counts'], name="Actual")) # Bar Plot of actual
fig2.add_trace(go.Bar(x=gender_df['Gender'], y=projections, name="Projected")) # Bar plot of Projected
fig2.update_layout(title="Count of Movie Goers By Gender", xaxis_title = 'Gender', yaxis_title='Number of People')
fig2.show()
In [ ]:
import random
occ_df['last_year'] = random.randint(-15,15) +occ_df['counts']
occ_df.head(3)
In [ ]:
# TODO: Make your 2 series plotly plot here
Folium is a Python module wrapper for Leaflet.js, which uses OpenStreetMap. These are two popular open source mapping projects. Unlike the Google Maps API, it's 100% free!
You can use Folium to render maps in Python and put data on the maps. Here's how easy it is to bring up a map:
In [ ]:
CENTER_US = (39.8333333,-98.585522)
london = (51.5074, -0.1278)
map = folium.Map(location=CENTER_US, zoom_start=4)
map
You can zoom right down to the street level and get amazing detail. There are different maps you can use, as was covered in this week's reading.
Let's take the largest category of movie goers and map their whereabouts. We will first need to import a data set to give us a lat/lng for the zip_code we have in the DataFrame. We could look this up with Google's geolookup API, but that's too slow as we would be making hundreds of requests. It's better to have them stored already and merge them with goers!
Let's import the zipcode database into a Pandas DataFrame, then merge it with the goers DataFrame:
In [4]:
zipcodes = pd.read_csv('https://raw.githubusercontent.com/mafudge/datasets/master/zipcodes/free-zipcode-database-Primary.csv', dtype = {'Zipcode' :object})
data = goers.merge(zipcodes, how ='inner', left_on='zip_code', right_on='Zipcode')
students = data[ data['occupation'] == 'student']
students.sample()
Out[4]:
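The dtype argument in the read_csv() call above is easy to overlook. Here is a minimal sketch of why it matters, using a tiny made-up CSV held in memory (the rows and the people DataFrame are illustrative, not taken from the real files). Read as integers, a zip code such as 02134 loses its leading zero and no longer looks like the text values on the other side of the merge.

```python
import io
import pandas as pd

# Tiny made-up zipcode table for illustration only.
csv_text = "Zipcode,City\n02134,ALLSTON\n13244,SYRACUSE\n"

as_int = pd.read_csv(io.StringIO(csv_text))                               # Zipcode inferred as int
as_text = pd.read_csv(io.StringIO(csv_text), dtype={'Zipcode': object})   # Zipcode kept as text

print(as_int['Zipcode'].tolist())    # the leading zero is gone
print(as_text['Zipcode'].tolist())   # the original text is preserved

# A made-up stand-in for goers, with zip codes stored as text.
people = pd.DataFrame({'zip_code': ['02134', '13244']})
matched = people.merge(as_text, how='inner', left_on='zip_code', right_on='Zipcode')
print(len(matched))                  # number of rows that found a match
```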
Let's explain the code, as a Pandas refresher course:
1. The read_csv() call uses dtype = {'Zipcode' :object} to force the Zipcode column to be of type object. Without that, it imports as type int and cannot match with the goers DataFrame.
2. The merge() call joins the two DataFrames where zip_code in goers (left_on) matches Zipcode in zipcodes (right_on).
3. data is a combined DataFrame, which we then filter to only student occupations, storing that in the students DataFrame.
We're ready to place the students on a map. It's easy:
1. For each row in students, build the coordinates from Lat and Long.
2. Create a marker with the coordinates.
3. Add the marker to the map with add_child().
Here we go!
In [ ]:
for row in students.to_records():
    pos = (row['Lat'],row['Long'])
    message = f"{row['age']} year old {row['gender']} {row['occupation']} from {row['City']},{row['State']}"
    marker = folium.Marker(location=pos,
                           popup=message
                           )
    map.add_child(marker)
map
In [ ]:
## todo write code here!
Please answer the following questions. This should be a personal narrative, in your own voice. Answer the questions by double clicking on the question and placing your answer next to the Answer: prompt.
Answer:
Answer:
Answer:
1 ==> I can do this on my own and explain how to do it.
2 ==> I can do this on my own without any help.
3 ==> I can do this with help or guidance from others. If you choose this level please list those who helped you.
4 ==> I don't understand this at all yet and need extra help. If you choose this please try to articulate that which you do not understand.
Answer:
In [ ]:
# SAVE YOUR WORK FIRST! CTRL+S
# RUN THIS CODE CELL TO TURN IN YOUR WORK!
from ist256.submission import Submission
Submission().submit()