This session looks at some of the finer points of data manipulation in Pandas, some econometric methods available from StatsModels, and how to fine-tune your Matplotlib output. In Pandas we will focus on data IO and data types, merging, grouping, reshaping, and time series. In StatsModels we will look at some examples of OLS, discrete choice models, and plotting, but these will only be brief examples meant to introduce the library. Finally, in Matplotlib we will look at the matplotlibrc file, discuss axes objects and object-oriented plotting (including subplots), and LaTeX table output. As with the previous session, these topics only scratch the surface, but they should be enough to get you started. By the end you should at least know where to look to find statistical methods for most economics applications, how to deal with data sets, and be familiar with the finer points of plotting in Python.
In order to ground this session in something more practical, we will focus on a real-world econometric application. In particular, we will download data from the PSID, manipulate that data, and study a simple regression (ok, maybe not real world for you, but real world for an undergrad!). This session uses the PSIDPy package, which is a Python port of psidR by Florian Oswald.
In order to access data on the PSID website, you need a few things. First, you need a username and password. Second, you need Requests. And third, you need Beautiful Soup.
Anaconda should already contain both of these packages. I'll assume you can figure out the user account yourself!
To read in the data from the SAS files, you'll need PSIDPy. You can get it by running pip install psid_py
from the command line.
Requests is a package whose self-proclaimed subtitle is "HTTP for Humans". And it is just that! HTTP is very difficult if you aren't a web guru, and Requests makes it truly easy.
Let's look at how we can post a form to the PSID website using requests:
NOTE: If you get an error indicating you do not have one of the packages imported below, try running conda install PACKAGE_NAME
from the command line. If this doesn't work, try pip install PACKAGE_NAME
. In particular, use pip install psid_py
to install the read_sas
function. If this still doesn't work, google it and send me an email.
In [1]:
#Just to avoid problems later, import everything now
import zipfile
import tempfile
import os
import requests
import shutil
import getpass
#import seaborn.apionly
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy import stats
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
from psid_py import read_sas
from io import BytesIO
%matplotlib inline
In [2]:
#First, define a username and password
#NOTE: Please don't abuse my account... I don't want to be barred from the PSID!
#USERNAME = input("Please enter your PSID username: ")
#PASSWORD = getpass.getpass("Please enter your PSID"
# + " password: ")
USERNAME = "tyler.abbot@sciencespo.fr"
PASSWORD = "tyler.abbot"
#Create a requests session object, which will do all the HTTP for you
c = requests.Session()
#Define the URL of the login page
URL = 'http://simba.isr.umich.edu/u/login.aspx'
#Call the login page to "get" its raw HTML data
page = c.get(URL)
At this point, page is a Requests response object which contains the page data. We can retrieve the raw HTML using its content
attribute, then scrape out the form variables we need. But what variables DO we need?
Navigate to the PSID login page, right-click on the login window, and click "Inspect Element". This will open a side bar showing the html and some other stuff.
At the top, there will be several tabs; select the 'Network' tab.
Now, click the 'Clear' button, right next to the red dot at the top.
Now login. At this point you'll see a bunch of things show up in the table.
Find anything related to login. For the PSID website this is 'Login.aspx'. Click this and select the 'Headers' tab on the lower part of the side bar.
Now scroll down to the 'Form Data' section.
All of the variables listed here will be submitted to the form. We are going to need to scrape the html for the values and submit them with our login information.
Here's how we'll scrape the page using Beautiful Soup:
In [5]:
soup = BeautifulSoup(page.content)
viewstate = soup.findAll("input", {"type": "hidden",
"name": "__VIEWSTATE"})
radscript = soup.findAll("input", {"type": "hidden",
"name": "RadScriptManager1_TSM"})
eventtarget = soup.findAll("input", {"type": "hidden",
"name": "__EVENTTARGET"})
eventargument = soup.findAll("input", {"type": "hidden",
"name": " __EVENTARGUMENT"})
viewstategenerator = soup.findAll("input", {"type": "hidden",
"name": "__VIEWSTATEGENERATOR"})
eventvalidation = soup.findAll("input", {"type": "hidden",
"name": "__EVENTVALIDATION"})
In [6]:
print(viewstate)
print(eventtarget)
Notice that Beautiful Soup returns the entire HTML element associated with each variable. This makes life so much easier. So, now that you have your form variables, you want to pack them into a dictionary to pass to your Requests session.
In [7]:
#Gather form data into a single dictionary
params = {'RadScriptManager1_TSM': radscript[0]['value'],
'__EVENTTARGET': '',
' __EVENTARGUMENT': '',
'__VIEWSTATE': viewstate[0]['value'],
'__VIEWSTATEGENERATOR': viewstategenerator[0]['value'],
'__EVENTVALIDATION': eventvalidation[0]['value'],
'ctl00$ContentPlaceHolder1$Login1$UserName': USERNAME,
'ctl00$ContentPlaceHolder1$Login1$Password': PASSWORD,
'ctl00$ContentPlaceHolder1$Login1$LoginButton': 'Log In',
'ctl00_RadWindowManager1_ClientState': ''}
Now, we can post to the login page!
In [17]:
#Post the login form. NOTE: Response code 200 implies OK
c.post('http://simba.isr.umich.edu/U/Login.aspx?redir=http%3a%2f%2fsimba.isr.umich.edu%2fU%2fLogout.aspx', data=params,
headers={"Referer": "http://simba.isr.umich.edu/U/Login.aspx?redir=http://simba.isr.umich.edu/U/Logout.aspx"},
allow_redirects=True)
Out[17]:
Great, we're now logged into the PSID website. Next, we need to download some data. We are going to download family files for 1968, the first year the PSID was conducted, and for 1969. We'll also save these in a temporary directory, which we'll use until the end of this course and then delete, along with our wonderful data...
In [13]:
#File names in the psid are numbered
file = "1056"
url = 'http://simba.isr.umich.edu/Zips/GetFile.aspx?file='\
+ file + '&mainurl=Y'
referer = "http://simba.isr.umich.edu/Zips/ZipMain.aspx"
data1 = c.get(url, allow_redirects=False,
headers={"Referer": referer})
That's it! We've successfully downloaded a zipped data file from the PSID website. Since it would be nice to work with hierarchical indices, we are going to download data for another year, 1969. Then, we'll unzip the files and extract the SAS data into DataFrames using the psid_py
package:
In [16]:
#Download a second file
file = "1058"
url = 'http://simba.isr.umich.edu/Zips/GetFile.aspx?file='\
      + file + '&mainurl=Y'
data2 = c.get(url, allow_redirects=False,
              headers={"Referer": referer})
#Create a temporary directory to store unzipped files
temp_dir = tempfile.mkdtemp() + os.sep
x = pd.DataFrame()
y = pd.DataFrame()
frames = [x, y]
for i, data in enumerate([data1, data2]):
    #Extract the zipped files
    zipped = zipfile.ZipFile(BytesIO(data.content))
    files_to_unzip = (x for x in zipped.namelist() if
                      any(['.sas' in x, '.txt' in x]))
    for NAME in files_to_unzip:
        temp_name = zipped.extract(NAME, temp_dir)
        #Test if you have just found the dictionary
        if temp_name.find('.sas') >= 0:
            dict_file = str(temp_name)
        #If not, you have found the data
        else:
            data_file = str(temp_name)
    #Use psidPy to read the sas file in
    frames[i] = pd.concat([frames[i], read_sas.read_sas(data_file, dict_file)])
#Remove the temporary directory
shutil.rmtree(temp_dir)
This should return two DataFrame objects:
In [9]:
x = pd.DataFrame(frames[0])
y = pd.DataFrame(frames[1])
print(x.shape)
print(y.shape)
We've done it: we now have two data frames containing a bunch of variables. Now, we'll work on building a panel from this pair of DataFrames.
I personally find this to be the worst thing ever, but hopefully you can learn from my struggles (and maybe I'm just being a baby about it). So we have two years of data and we would like to pull out several variables. Here is a list of the variables we are going to pull out and their corresponding code in each year:
| Variable | 1968 | 1969 |
|---|---|---|
| Family Number | V3 | V442 |
| Total Food Consumption | V334 | V863 |
| Head Hourly Earn | V337 | V871 |
| Head Education | V313 | V794 |
| Wife's Education | V246 | NA |
So, to start we'll just drop all of the unnecessary columns. This can be done quickly and easily by simply passing a list of column names to the DataFrame.
In [10]:
vars68 = ['V3',
'V334',
'V337',
'V313',
'V246']
vars69 = ['V442',
'V863',
'V871',
'V794']
frame68 = x[vars68].copy()
frame69 = y[vars69].copy()
frame68.head()
Out[10]:
Notice here that I explicitly created a copy. If I hadn't, frameXX
would have been a view of the original DataFrame, and Pandas would have raised a warning later on when I edited the new frame.
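To see what that warning looks like, here is a minimal sketch of the copy-versus-view issue on a toy frame (made-up values and column names, not our PSID data):
import pandas as pd

# Toy frame; selecting columns without .copy() and then assigning
# typically triggers pandas' SettingWithCopyWarning.
df = pd.DataFrame({'V3': [1, 2, 3], 'V334': [10, 20, 30]})

maybe_view = df[['V3']]        # no explicit copy
maybe_view['year'] = 1968      # pandas may warn that the assignment
                               # might not behave as you expect

explicit = df[['V3']].copy()   # explicit copy
explicit['year'] = 1968        # modifies the copy only, no warning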
Now that we have our stripped down DataFrames, we can combine them into a single DataFrame. First, let's change the column names to be something a bit more recognizable.
In [11]:
frame68.columns = ['fam_id',
'foodc',
'head_hourly_wage',
'head_educ',
'wife_educ']
frame69.columns = ['fam_id',
'foodc',
'head_hourly_wage',
'head_educ']
In [12]:
frame68['year'] = 1968
frame69['year'] = 1969
In [13]:
frame69.head()
Out[13]:
Now we can carry out a join/merge operation.
Combining several DataFrames is a typical database operation that can quickly become a pain. The Pandas documentation offers a very thorough treatment of the topic, but we'll talk about some of the basics here.
First of all, there are two functions you can use to do this sort of merge: join and merge. They do essentially the same thing; join is less verbose and easier to type, but it offers less flexibility in the type of merge operation you can do.
Second, Pandas expects you to be specific about exactly what type of merge operation you would like to do, whether that be left, right, inner, or outer, on what key or keys you would like to merge, and along what axis. These are important things to keep in mind, and we'll do some examples that show what can happen to your data under each assumption.
Finally, you need to be aware of duplicates in your data. Duplicates can cause the merge operation to do strange things and give you unexpected results. If a merge operation begins to go awry, this should be one of the first things you check.
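To make that concrete, here is a small sketch with toy frames (made-up values, not the PSID data) showing how duplicate keys multiply rows in a merge:
import pandas as pd

# Toy frames sharing the key 'fam_id'; the left frame has a duplicate key.
left = pd.DataFrame({'fam_id': [1, 1, 2], 'foodc': [500, 700, 900]})
right = pd.DataFrame({'fam_id': [1, 2], 'head_educ': [12, 16]})

# One-sided duplicates are usually fine: 3 rows in, 3 rows out.
print(pd.merge(left, right, on='fam_id', how='left').shape)    # (3, 3)

# Duplicates on BOTH sides multiply rows (a many-to-many join).
right_dup = pd.DataFrame({'fam_id': [1, 1, 2], 'head_educ': [12, 12, 16]})
print(pd.merge(left, right_dup, on='fam_id').shape)            # (5, 3)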
The first type of operation you might be interested in is pooling the two data frames into one. This is called a 'concatenation', as you will either add more rows or more columns.
Here is how you might simply concatenate the rows:
In [14]:
concat_rows = pd.concat([frame68, frame69])
print(frame68.shape, frame69.shape, concat_rows.shape)
This operation simply tacks the two DataFrames together. However, you should be careful about missing values. Notice that we only have 'wife_educ' for one year. Pandas handles this by simply adding NaN
to the entries where there is missing data:
In [15]:
concat_rows
Out[15]:
Another, even simpler way to carry out the identical operation is with append:
In [16]:
concat_rows = frame68.append(frame69)
Additionally, you'll notice that the concatenation has carried over the indices from the original DataFrames. This means that the index now repeats. To fix this, we can simply add the ignore_index
option to the concat function:
In [17]:
concat_rows = pd.concat([frame68, frame69], ignore_index=True)
concat_rows
Out[17]:
On the other hand, what if we would like to add columns instead of rows? We can do this just as easily, but we need to be aware of the indexing. Currently, our DataFrames carry indices that have nothing to do with the data, so concatenating the columns would be equally arbitrary. However, we can set the index, in this case to fam_id
, and then concatenate:
In [18]:
temp68 = frame68.set_index('fam_id')
temp69 = frame69.set_index('fam_id')
concat_cols = pd.concat([temp68, temp69], axis=1)
concat_cols.head()
Out[18]:
Several things to notice here: we have duplicate column names, and the treatment of missing entries is the same as before. For fam_ids that don't appear in both frames, Pandas adds NaN
. We can deal with the duplicate column names by simply renaming, as we did before.
In [19]:
concat_cols.columns = ['foodc68',
'head_hourly_wage68',
'head_educ68',
'wife_educ68',
'year68',
'foodc69',
'head_hourly_wage69',
'head_educ69',
'year69']
concat_cols.tail()
Out[19]:
For more complex situations, Pandas offers join
and merge
. These are database-style operations similar to those used by SQL. The idea is to give the merge function two DataFrames and a set of one or more keys; the keys are used to match the rows or columns of the two. For those of you who are not familiar with merging data (I wasn't until recently), the main types of merge operation are left, right, inner, and outer, all of which are demonstrated below.
Methods:
The simplest method is to use join
. This will merge on the index, so it's important to set this from the beginning. Here's an example:
In [20]:
temp68 = frame68.set_index('fam_id')
temp69 = frame69.set_index('fam_id')
join1 = temp68.join(temp69)
You'll notice that the above code produces an error. This is because the column names are identical. To deal with this, join offers lsuffix
and rsuffix
options, which it appends to overlapping column names.
In [21]:
join1 = temp68.join(temp69, rsuffix='69')
join1.head()
Out[21]:
As you can see, this produces the same result as concatenating along the columns. However, the default method is a left join, so this has dropped any observations in frame69
not present in frame68
:
In [22]:
print(temp68.shape, join1.shape)
By specifying the how argument we can achieve different results:
In [23]:
#Right join drops entries in 68 not present in 69
join1 = temp68.join(temp69, how='right',rsuffix='69')
print(temp69.shape, join1.shape)
#Inner join drops entries not present in both
join1 = temp68.join(temp69, how='inner',rsuffix='69')
print(temp68.shape, temp69.shape, join1.shape)
#Outer join keeps all entries
join1 = temp68.join(temp69, how='outer',rsuffix='69')
print(temp68.shape, temp69.shape, join1.shape)
Join
is equivalent to the more verbose merge
, which lets you set more options but requires you to be more explicit about the operation. For instance, we can carry out the previous join operations using merge
:
In [24]:
#Right merge drops entries in 68 not present in 69
merge1 = pd.merge(temp68.reset_index(), temp69.reset_index(),
how='right', on='fam_id', suffixes=('', '69'))\
.set_index('fam_id')
print(temp69.shape, merge1.shape)
Join and merge are both quite powerful functions that can give unexpected output. Before completely panicking if your merge operation is not giving you what you expect, check for duplicate keys, make sure the how argument matches the operation you intend, and confirm you are merging on the right key or keys; a quick duplicate check is sketched below.
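For example, on the frames we built above:
# Count duplicated merge keys and compare key dtypes before merging.
print(frame68['fam_id'].duplicated().sum())
print(frame69['fam_id'].duplicated().sum())
print(frame68['fam_id'].dtype, frame69['fam_id'].dtype)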
In [25]:
data = pd.concat([frame68, frame69], ignore_index=True).set_index('fam_id')
data.head()
Out[25]:
I've chosen to use the row-concatenated data frame for two reasons: first, it provides an easy transition to hierarchical indexing, and second, it is more in line with how you might be presented with panel data in general.
In many cases, you may want to apply conditional operations or group operations to entries in your data set. Pandas makes this easy with the groupby
method. For example, let's say we are interested in the average hourly wage of the head of household by education group. Furthermore, imagine we would like to include this as an explanatory variable in an estimation.
To accomplish this, we will take two steps (there may be more efficient ways to do this, but this is instructive and clear): first, compute the group means using groupby
; second, merge the result back into the data
DataFrame.
But first, let's talk about groupby
. According to the Pandas documentation, the action "group by" refers to a three-step process: splitting the data into groups, applying a function to each group, and combining the results.
For this demonstration, we'll cover each step individually.
Splitting your data by groups is fairly easy using groupby:
In [26]:
grouped = data.groupby('head_educ')
grouped
Out[26]:
You'll notice that the method returns a GroupBy object. This is an iterable, which keeps memory free of the clutter of an entirely new DataFrame. These groupby objects have some mysterious methods that are not well documented, but think of one as an organized list, separating individuals by the key, here 'head_educ'.
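Since the groupby object is iterable, one quick way to peek inside is simply to loop over it; a minimal sketch using the grouped object from above:
# Iterating over the GroupBy object yields (key, sub-DataFrame) pairs.
for educ_level, group in grouped:
    print(educ_level, group.shape)
    break   # just peek at the first group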
Once the data is split (or grouped) you can apply some function. In our case, we are interested in applying a mean function to each group. We do this with the following syntax:
In [27]:
means = grouped.aggregate(np.mean)
means
Out[27]:
This function has returned group means for all education levels and for all variables. If you have many variables, you could also select a single column:
In [28]:
wage_means = grouped['head_hourly_wage'].aggregate(np.mean)
wage_means
Out[28]:
There are several other methods you could use for this "apply" step. Taking a mean is considered an "aggregation", while you could also carry out a "transformation" (such as scaling) or a "filtration" (such as keeping only those groups whose group average is greater than some amount). Here are a couple of examples:
In [29]:
#Transforming wage by taking the difference from the mean
meandif = lambda x: x - x.mean()
transformed = grouped['head_hourly_wage'].transform(meandif)
print(data['head_hourly_wage'].head(),transformed.head())
#Filtering out those with very low hourly wage
filtered = grouped['head_hourly_wage'].filter(lambda x: np.mean(x) > 10.)
print(filtered.head())
In [30]:
#First, rename the column in wage means
wage_means.name = 'avg_wage_educ'
data1 = pd.merge(data.reset_index(), wage_means.reset_index(),
how='left', on='head_educ').set_index('fam_id')
data1.head()
Out[30]:
In [31]:
print(data.shape, data1.shape)
print(wage_means)
The groupby
function is incredibly versatile, so you'll have to work with it a lot to understand all it can do, but as long as you keep in mind the split, apply, combine workflow, you can do essentially everything we just did in a single line (well, almost given the rename):
In [32]:
wage_means = data.groupby('head_educ')['head_hourly_wage']\
.aggregate(np.mean)
wage_means.name = 'avg_wage_educ'
data2 = pd.merge(data.reset_index(), wage_means.reset_index(),
how='left', on='head_educ').set_index('fam_id')
data2.head()
Out[32]:
Now that we have our DataFrame, we might make a pitstop to discuss locating data within the frame. There are several ways to achieve this depending on what you are interested in finding.
First, let's consider the simple case of selecting by column. We've already seen this; it simply requires referencing the title of the column:
In [33]:
# Drop wife_educ, because we are never going to use it
data = data1.copy().drop('wife_educ', axis=1)
data['foodc'].head()
Out[33]:
Another way to achieve the same effect is to use the column name as an attribute:
In [34]:
data.foodc.head()
Out[34]:
Second, there is selecting by label. This refers to the label specified in the index. For instance, you can select only the rows corresponding to fam_id == 1
by the following:
In [35]:
data.loc[1]
Out[35]:
You can also combine the two to retrieve only a specific column:
In [36]:
data.foodc.loc[1]
Out[36]:
Additionally, slicing works with loc
In [37]:
data.sort_index(inplace=True)
data.foodc.loc[1:5]
Out[37]:
Note here that I sorted the data frame first. This is because a slice requires the index to be monotonic. In fact, many Pandas methods require this, so you might think about sorting early in your analysis. However, if you will be modifying the data and indices again and your DataFrame is large, a sort could be a waste of time.
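If you're not sure whether a sort is needed, you can ask the index directly before paying for one; a quick check (is_monotonic_increasing is a standard Index attribute):
# Slicing with .loc needs a sorted (monotonic) index; check before sorting.
print(data.index.is_monotonic_increasing)
if not data.index.is_monotonic_increasing:
    data = data.sort_index()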
Another way to accomplish the above is to use the loc
method to reference entries the way a NumPy array might, referencing both the row and the column:
In [38]:
data.loc[1:5, 'foodc']
Out[38]:
There is a similar operation for integer-based indices, called iloc
, but it acts on positions, not labels:
In [39]:
data.iloc[1:5]
Out[39]:
Another fancy indexing method is boolean indexing. For instance, if you would like only the entries who spent more than 1000 on food, you could do the following:
NOTE: This method doesn't work, but it was a nice try using some different methods, so I'm leaving my failure here. For a group by solution, scroll down.
In [40]:
data[data.foodc > 1000][:10]
Out[40]:
But, you'll notice that we only dropped rows where consumption was low. What if we wanted to drop both entries for a family if they had even one year of low consumption? This is where hierarchical indexing can help.
Hierarchical indexing defines several layers of indices. In our case, the natural indices are fam_id
and year
. So, we can redefine the index as follows, specifying a list of index names instead of a single string:
In [41]:
data = data.reset_index().set_index(['fam_id', 'year'])
data.head()
Out[41]:
Notice now that the fam_id
index spans several columns. This is because each entry in fam_id
is associated to several entries in year
, which in turn are associated to individual rows.
Now, we can again try the boolean indexing we had before:
In [42]:
data[data.foodc > 1000][:10]
Out[42]:
But wait! That didn't fix our problem! What we need to do is retrieve the primary index of the boolean array that is returned. What does this mean? Let's break down the steps of a boolean heirarchical index operation.
First, what does the boolean operation return?
In [43]:
data.foodc > 1000
Out[43]:
In [44]:
high_food = data.foodc > 1000
high_food.index.get_level_values('fam_id')
Out[44]:
It returns a multi-indexed Series containing True
and False
values. We would like to use this to reference the fam_id
index. There are several ways we could achieve this using techniques we've already learned, for example grouping by fam_id and using any
to return a vector flagging families with any offending entries.
In [45]:
data.index.get_level_values('fam_id')
Out[45]:
In [46]:
idx = pd.IndexSlice
high_food = data.foodc > 1000
high_food[data.index.get_level_values('fam_id')]
#data.loc[idx[high_food,:], :].head()
#idx[high_food]
In [2]:
# Since hierarchical indexing didn't work, let's use groupby
data.groupby(level=0).\
    filter(lambda x: all([v > 1000 for v in x['foodc']])).head()
In [47]:
data.describe()
Out[47]:
This can be combined with any of the methods we've already seen to get column statistics, row statistics, or even group statistics. For example:
In [48]:
data.groupby('head_educ').describe()
Out[48]:
If we would like to do more in terms of description, we should look at some plotting functionality. For this we will use matplotlib and seaborn.
Most plotting functions can't deal with missing data, so we will go ahead and drop any missing values:
In [49]:
data = data.dropna()
Now we can use either the built in pandas plotting or matplotlib to generate a histogram:
In [50]:
data.foodc.plot(kind='hist')
Out[50]:
In [51]:
plt.hist(data.foodc.values)
Out[51]:
You'll notice that the above graphs look really good! This is because Seaborn has customized the styling. You can also customize your matplotlibrc
file to get your own style, but we'll talk about that in a minute.
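In the meantime, if you just want to switch among Seaborn's preset looks, one call does it; a small sketch using Seaborn's built-in style names:
# Seaborn ships a few preset styles; switch with a single call.
sns.set_style("whitegrid")   # others: "darkgrid", "white", "dark", "ticks"
data.foodc.plot(kind='hist')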
We can also specify the number of bins
In [52]:
data.foodc.plot(kind='hist', bins=20)
Out[52]:
Or we can change the styling, trying to plot multiple histograms simultaneously:
In [53]:
data.plot(kind='hist', stacked=True, bins=20)
Out[53]:
But clearly that won't always work...
Setting aside style for now, we can use seaborn to produce some very fancy plots to visualize our data. All seaborn does is give you a set of functions that produce the types of statistical plots that you might generally want. It takes the hard work out of defining legends, smoothing, bins, etc. Let's see some examples:
In [55]:
#Plotting a distribution over our histogram
sns.distplot(data.foodc)
Out[55]:
In [57]:
#Adding a 'rugplot' to the histogram
sns.distplot(data.foodc, kde=False, rug=True)
Out[57]:
In [59]:
#Kernel density estimation
sns.distplot(data.foodc, hist=False)
Out[59]:
In [61]:
#Using kdeplot to do the same thing, but changing the bandwidth
sns.kdeplot(data.foodc)
sns.kdeplot(data.foodc, bw=0.2)
sns.kdeplot(data.foodc, bw=2)
Out[61]:
In [67]:
#Using distplot to fit a parametric distribution
sns.distplot(data.foodc, kde=False, fit=stats.gamma)
Out[67]:
One of the coolest things (in my opinion) is how easy seaborn makes plotting bivariate distributions.
In [70]:
sns.jointplot(x = data.foodc, y = data.head_hourly_wage)
Out[70]:
Noticing the outliers in the wage distribution, we can easily drop these from our plot:
In [72]:
sns.jointplot(x = data.foodc[data.head_hourly_wage < 80],
y = data.head_hourly_wage[data.head_hourly_wage < 80])
Out[72]:
We can even exclude those with zero wage and zero food consumption by using multiple conditions:
In [73]:
sns.jointplot(x = data.foodc[(data.head_hourly_wage < 80) &
(data.head_hourly_wage > 0) &
(data.foodc > 0)],
y = data.head_hourly_wage[(data.head_hourly_wage < 80) &
(data.head_hourly_wage > 0) &
(data.foodc > 0)])
Out[73]:
Still not cool enough?! What if I told you that seaborn would do multi-dimensional kernel density estimation for you? Hmmmm? That good enough?
In [75]:
sns.jointplot(x = data.foodc[(data.head_hourly_wage < 80) &
(data.head_hourly_wage > 0) &
(data.foodc > 0)],
y = data.head_hourly_wage[(data.head_hourly_wage < 80) &
(data.head_hourly_wage > 0) &
(data.foodc > 0)],
kind = 'kde')
Out[75]:
And if you'd like to visualize all of the pairwise relationships in your data, all you gotta do is ask:
In [76]:
sns.pairplot(data)
Out[76]:
Which, given the discrete nature of two of our variables, looks pretty silly, but saves you a lot of time and headache.
What if you're interested in plotting based on subgroups? For instance, let's make our two dimensional scatter plot, but try to differentiate between education levels. Seaborn offers a class just for this, called FacetGrid.
In [86]:
#First, let's generate a subset of data we are interested in
#NOTE: Here I'm resetting the index so I can use it more easily,
#I'm completely ignoring the fact that subsetting the data in
#this way drops observations... don't worry about it, just enjoy
#the pretty pictures.
subset = pd.DataFrame(data[(data.head_hourly_wage < 20) &
(data.head_hourly_wage > 0) &
(data.foodc > 0)].reset_index())
g = sns.FacetGrid(subset, col='year', hue='head_educ')
g.map(plt.scatter, 'foodc', 'head_hourly_wage')
Out[86]:
In [89]:
#We can also make the dots more transparent using the `alpha`
#argument, which is available for almost all plotting using
#matplotlib
#Here we'll pool all observations together
g = sns.FacetGrid(subset, hue='head_educ')
g.map(plt.scatter, 'foodc', 'head_hourly_wage', alpha = 0.5)
Out[89]:
Using the map
method, you can make generally any kind of plot you'd like, even creating custom plot types; a sketch with a custom function follows.
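map passes the data columns as positional arrays, plus styling keywords like color, to whatever function you hand it. Here mean_line_scatter is a made-up helper (not from the original notebook), just to show the idea:
# A custom plot function: scatter the points, then mark the group mean of y.
def mean_line_scatter(x, y, **kwargs):
    plt.scatter(x, y, **kwargs)
    plt.axhline(y.mean(), linestyle='--', **kwargs)

g = sns.FacetGrid(subset, col='year', hue='head_educ')
g.map(mean_line_scatter, 'foodc', 'head_hourly_wage', alpha=0.5)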
Another type of plot you might find interesting is lmplot
, which is a linear regression plot. It is used for plotting regressions over subsets of data, combining regplot
and FacetGrid
.
In [93]:
sns.regplot(x = subset.foodc, y = subset.head_hourly_wage)
Out[93]:
In [97]:
sns.lmplot(x = 'foodc', y = 'head_hourly_wage', data = subset)
Out[97]:
In [99]:
sns.lmplot(x = 'foodc', y = 'head_hourly_wage', hue='head_educ',
data = subset)
Out[99]:
Which is kind of messy, but you get the drift! However, if you still want something clearer, you can plot each regression individually:
In [102]:
sns.lmplot(x = 'foodc', y = 'head_hourly_wage', col = 'head_educ',
hue='head_educ', data = subset)
Out[102]:
In [103]:
#... which is kind of small, so you could also do it as rows
sns.lmplot(x = 'foodc', y = 'head_hourly_wage', row = 'head_educ',
hue='head_educ', data = subset)
Out[103]:
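The line that regplot and lmplot draw is just an ordinary least squares fit, which is a natural bridge to the StatsModels part of the session. As a teaser, here is a hedged sketch of the same simple regression run through sm.OLS (statsmodels was already imported above); this is only illustrative, not the session's final specification:
# Simple OLS of hourly wage on food consumption, mirroring the regplot line.
X = sm.add_constant(subset['foodc'])   # add an intercept
y = subset['head_hourly_wage']
results = sm.OLS(y, X).fit()
print(results.summary())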