Introducing Pandas

From the docs:

A Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive.

We also use matplotlib:

A Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms.

Requirements:

(venv) $ pip install pandas matplotlib

We're going to see a sliver of the functionality provided by these packages.



In [1]:

    
import pandas as pd
pd.options.display.max_rows = 20

%matplotlib inline

Introducing `DataFrame`

From the docs:

DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table. It is generally the most commonly used pandas object.

There are many ways to get a DataFrame, but we'll start with a list of dictionaries.



In [2]:

    
df = pd.DataFrame([
    {'integer': 1, 'float': 1.0, 'string': 'one'},
    {'integer': 2, 'float': 2.0, 'string': 'two'},
    {'integer': 2, 'float': 2.0, 'string': 'two'},
    {'integer': 3, 'float': 3.0, 'string': 'three'},
])

# Print some details about the DataFrame
df.info()
df









    



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
float      4 non-null float64
integer    4 non-null int64
string     4 non-null object
dtypes: float64(1), int64(1), object(1)
memory usage: 176.0+ bytes






    Out[2]:






  
    
      
      float
      integer
      string
    
  
  
    
      0
      1.0
      1
      one
    
    
      1
      2.0
      2
      two
    
    
      2
      2.0
      2
      two
    
    
      3
      3.0
      3
      three

The Jupyter Notebook automatically renders DataFrame as HTML!

Note the first column; this is an Index, and is an essential component of DataFrame. Here, it was auto-generated, but we can also set it:



In [3]:

    
df_index = df.set_index('string')
df_index

The Index plays a key role in slicing the DataFrame:



In [4]:

    
# Slice by label
df_index.loc['two']



In [5]:

    
# Slice by position
df_index.iloc[-2:]

We can also get individual columns:



In [6]:

    
floats = df_index['float']
floats









    Out[6]:





string
one      1.0
two      2.0
two      2.0
three    3.0
Name: float, dtype: float64

This is a Series, which is essentially a one-dimensional DataFrame, with a defined data type. Put another way, a DataFrame is a collection of Series.

Note that the Series retained the Index of our DataFrame, so we can use similar slicing:



In [7]:

    
floats['two']









    Out[7]:





two    2.0
two    2.0
Name: float, dtype: float64

Series and DataFrame support element-wise operations:



In [8]:

    
df_index['float'] * df_index['integer']









    Out[8]:





string
one      1.0
two      4.0
two      4.0
three    9.0
dtype: float64



In [9]:

    
df_index * df_index



In [10]:

    
number_format = 'Number {}'.format
df_index['integer'].apply(number_format)









    Out[10]:





string
one      Number 1
two      Number 2
two      Number 2
three    Number 3
Name: integer, dtype: object



In [11]:

    
df_index.applymap(number_format)









    Out[11]:






  
    
      
      float
      integer
    
    
      string
      
      
    
  
  
    
      one
      Number 1.0
      Number 1
    
    
      two
      Number 2.0
      Number 2
    
    
      two
      Number 2.0
      Number 2
    
    
      three
      Number 3.0
      Number 3

Using `DataFrame` with Django

Django gives us a handy way to build a list of dictionaries:



In [12]:

    
gig_values = Gig.objects.past().published().values('date', 'venue__name', 'venue__city')
gig_values[:5]









    Out[12]:





[{'date': datetime.date(2017, 3, 11), 'venue__name': 'Secret Venue', 'venue__city': 'Somerville, MA'}, {'date': datetime.date(2016, 12, 16), 'venue__name': 'Great Scott', 'venue__city': 'Allston, MA'}, {'date': datetime.date(2016, 9, 25), 'venue__name': 'Brighton Music Hall', 'venue__city': 'Allston, MA'}, {'date': datetime.date(2016, 9, 10), 'venue__name': 'Pinebanks Field', 'venue__city': 'Jamaica Plain, MA'}, {'date': datetime.date(2016, 8, 7), 'venue__name': 'State Park', 'venue__city': 'Cambridge, MA'}]

DataFrame doesn't know what to do with a QuerySet; it wants something that looks more like a list.
We could use list(gig_values), but gig_values.iterator() is more efficient.



In [13]:

    
gigs = pd.DataFrame(gig_values.iterator())
gigs.info()
gigs









    



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 269 entries, 0 to 268
Data columns (total 3 columns):
date           269 non-null object
venue__city    269 non-null object
venue__name    269 non-null object
dtypes: object(3)
memory usage: 6.4+ KB






    Out[13]:






  
    
      
      date
      venue__city
      venue__name
    
  
  
    
      0
      2017-03-11
      Somerville, MA
      Secret Venue
    
    
      1
      2016-12-16
      Allston, MA
      Great Scott
    
    
      2
      2016-09-25
      Allston, MA
      Brighton Music Hall
    
    
      3
      2016-09-10
      Jamaica Plain, MA
      Pinebanks Field
    
    
      4
      2016-08-07
      Cambridge, MA
      State Park
    
    
      5
      2016-08-06
      Manchester, NH
      Fuzz Hut
    
    
      6
      2016-08-05
      Stockbridge, MA
      Chesterfest
    
    
      7
      2016-07-30
      Providence, RI
      Dusk
    
    
      8
      2016-07-29
      New York, NY
      Cake Shop
    
    
      9
      2016-07-28
      Baltimore, MD
      Club 603
    
    
      ...
      ...
      ...
      ...
    
    
      259
      2006-06-20
      Cambridge, MA
      The Middle East Upstairs
    
    
      260
      2006-05-25
      Cambridge, MA
      TT The Bear's Place
    
    
      261
      2006-03-25
      Boston, MA
      Bill's Bar
    
    
      262
      2006-03-07
      Cambridge, MA
      Lizard Lounge
    
    
      263
      2006-03-05
      New York, NY
      Sin-E
    
    
      264
      2006-02-16
      Allston, MA
      Great Scott
    
    
      265
      2006-02-01
      Cambridge, MA
      TT The Bear's Place
    
    
      266
      2006-01-07
      New York, NY
      Sin-E
    
    
      267
      2005-12-18
      Cambridge, MA
      TT The Bear's Place
    
    
      268
      2005-11-20
      Allston, MA
      Great Scott
    
  

269 rows × 3 columns

This is a good place to start, and we've already got the answer to "How many gigs have we played"?

However, there are a few ways we can make this easier to work with:

Shorter column names
Predictable column order
Indexed and sorted by date

For more control, we'll use a list of tuples to initialize the DataFrame.



In [14]:

    
gig_values = Gig.objects.past().published().values_list('date', 'venue__name', 'venue__city')
gig_values[:5]









    Out[14]:





[(datetime.date(2017, 3, 11), 'Secret Venue', 'Somerville, MA'), (datetime.date(2016, 12, 16), 'Great Scott', 'Allston, MA'), (datetime.date(2016, 9, 25), 'Brighton Music Hall', 'Allston, MA'), (datetime.date(2016, 9, 10), 'Pinebanks Field', 'Jamaica Plain, MA'), (datetime.date(2016, 8, 7), 'State Park', 'Cambridge, MA')]



In [15]:

    
gigs = pd.DataFrame(gig_values.iterator(), columns=['date', 'venue', 'city'])

gigs['date'] = pd.to_datetime(gigs['date'])
gigs = gigs.set_index('date').sort_index()

gigs.info()
gigs.head()









    



<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 269 entries, 2005-11-20 to 2017-03-11
Data columns (total 2 columns):
venue    269 non-null object
city     269 non-null object
dtypes: object(2)
memory usage: 6.3+ KB






    Out[15]:






  
    
      
      venue
      city
    
    
      date
      
      
    
  
  
    
      2005-11-20
      Great Scott
      Allston, MA
    
    
      2005-12-18
      TT The Bear's Place
      Cambridge, MA
    
    
      2006-01-07
      Sin-E
      New York, NY
    
    
      2006-02-01
      TT The Bear's Place
      Cambridge, MA
    
    
      2006-02-16
      Great Scott
      Allston, MA

The previous cell demonstrates a good practice: make all of your modifications to a variable in one cell. This will help prevent surprises when you execute cells out of order as you play with code. If you need to make modifications later in the notebook, assign the result to a new variable.

Answering questions

What gigs did we play last year?

Using the date as the Index allows for fast slicing:



In [16]:

    
gigs.loc['2016']









    Out[16]:






  
    
      
      venue
      city
    
    
      date
      
      
    
  
  
    
      2016-01-31
      Great Scott
      Allston, MA
    
    
      2016-03-12
      The Bowery Electric
      New York, NY
    
    
      2016-05-12
      Great Scott
      Allston, MA
    
    
      2016-05-14
      C'Mon Everybody
      Brooklyn, NY
    
    
      2016-07-08
      The Sinclair
      Cambridge, MA
    
    
      2016-07-09
      Grumpy's
      Falmouth, MA
    
    
      2016-07-27
      Black Cat
      Washington, DC
    
    
      2016-07-28
      Club 603
      Baltimore, MD
    
    
      2016-07-29
      Cake Shop
      New York, NY
    
    
      2016-07-30
      Dusk
      Providence, RI
    
    
      2016-08-05
      Chesterfest
      Stockbridge, MA
    
    
      2016-08-06
      Fuzz Hut
      Manchester, NH
    
    
      2016-08-07
      State Park
      Cambridge, MA
    
    
      2016-09-10
      Pinebanks Field
      Jamaica Plain, MA
    
    
      2016-09-25
      Brighton Music Hall
      Allston, MA
    
    
      2016-12-16
      Great Scott
      Allston, MA

How many gigs have we played each year?

The date Index also allows for fast aggregration:



In [17]:

    
# resample('A') creates year-end groups like '2005-12-31'
# to_period() turns that into '2005'
years = gigs.resample('A').size().to_period()
years









    Out[17]:





date
2005     2
2006    17
2007    41
2008    40
2009    50
2010    20
2011     8
2012    25
2013    15
2014    21
2015    13
2016    16
2017     1
Freq: A-DEC, dtype: int64



In [18]:

    
years.plot.bar()









    Out[18]:





<matplotlib.axes._subplots.AxesSubplot at 0x115383128>

What are our most active months?

Get the dates as a Series:



In [19]:

    
gig_dates = gigs.reset_index()['date']
gig_dates









    Out[19]:





0     2005-11-20
1     2005-12-18
2     2006-01-07
3     2006-02-01
4     2006-02-16
5     2006-03-05
6     2006-03-07
7     2006-03-25
8     2006-05-25
9     2006-06-20
         ...    
259   2016-07-28
260   2016-07-29
261   2016-07-30
262   2016-08-05
263   2016-08-06
264   2016-08-07
265   2016-09-10
266   2016-09-25
267   2016-12-16
268   2017-03-11
Name: date, dtype: datetime64[ns]

Convert those to sortable month names:



In [20]:

    
# Series.dt gives us access to date-related methods
gig_months = gig_dates.dt.strftime('%m %b')
gig_months









    Out[20]:





0      11 Nov
1      12 Dec
2      01 Jan
3      02 Feb
4      02 Feb
5      03 Mar
6      03 Mar
7      03 Mar
8      05 May
9      06 Jun
        ...  
259    07 Jul
260    07 Jul
261    07 Jul
262    08 Aug
263    08 Aug
264    08 Aug
265    09 Sep
266    09 Sep
267    12 Dec
268    03 Mar
Name: date, dtype: object

Count the unique values:



In [21]:

    
months = gig_months.value_counts()
months









    Out[21]:





03 Mar    41
10 Oct    37
06 Jun    32
11 Nov    31
07 Jul    29
05 May    23
09 Sep    22
02 Feb    15
08 Aug    12
12 Dec    10
04 Apr     9
01 Jan     8
Name: date, dtype: int64



In [22]:

    
# matplotlib has lots of options for customization
ax = months.sort_index().plot.bar(table=True, figsize=(10,5))
ax.get_xaxis().set_visible(False)

What cities have we played?



In [23]:

    
gig_cities = gigs['city']
gig_cities









    Out[23]:





date
2005-11-20          Allston, MA
2005-12-18        Cambridge, MA
2006-01-07         New York, NY
2006-02-01        Cambridge, MA
2006-02-16          Allston, MA
2006-03-05         New York, NY
2006-03-07        Cambridge, MA
2006-03-25           Boston, MA
2006-05-25        Cambridge, MA
2006-06-20        Cambridge, MA
                    ...        
2016-07-28        Baltimore, MD
2016-07-29         New York, NY
2016-07-30       Providence, RI
2016-08-05      Stockbridge, MA
2016-08-06       Manchester, NH
2016-08-07        Cambridge, MA
2016-09-10    Jamaica Plain, MA
2016-09-25          Allston, MA
2016-12-16          Allston, MA
2017-03-11       Somerville, MA
Name: city, dtype: object



In [24]:

    
cities = gig_cities.value_counts()
cities









    Out[24]:





Cambridge, MA        50
Allston, MA          30
New York, NY         20
Brooklyn, NY         12
Boston, MA            9
Somerville, MA        7
Chicago, IL           6
Jamaica Plain, MA     6
Pittsburgh, PA        5
Buffalo, NY           5
                     ..
Rochester, NY         1
Sacramento, CA        1
Seattle, WA           1
Norwood, MA           1
Ashland, OR           1
Manchester, NH        1
Charlotte, NC         1
Ames, IA              1
Dedham, MA            1
Lexington, KY         1
Name: city, dtype: int64



In [25]:

    
cities.describe()









    Out[25]:





count    86.000000
mean      3.127907
std       6.511161
min       1.000000
25%       1.000000
50%       1.000000
75%       2.750000
max      50.000000
Name: city, dtype: float64



In [26]:

    
# Adding the ; suppresses `<matplotlib.axes._subplots.AxesSubplot ...>`
cities[:10].sort_values().plot.barh();

When did we play in Pittsburgh?

We can use a Series of boolean values to slice our DataFrame:



In [27]:

    
# Series.str gives us access to string methods
in_pgh = gig_cities.str.contains('Pittsburgh')
in_pgh









    Out[27]:





date
2005-11-20    False
2005-12-18    False
2006-01-07    False
2006-02-01    False
2006-02-16    False
2006-03-05    False
2006-03-07    False
2006-03-25    False
2006-05-25    False
2006-06-20    False
              ...  
2016-07-28    False
2016-07-29    False
2016-07-30    False
2016-08-05    False
2016-08-06    False
2016-08-07    False
2016-09-10    False
2016-09-25    False
2016-12-16    False
2017-03-11    False
Name: city, dtype: bool



In [28]:

    
gigs[in_pgh]









    Out[28]:






  
    
      
      venue
      city
    
    
      date
      
      
    
  
  
    
      2007-06-30
      Bloomfield Bridge Tavern
      Pittsburgh, PA
    
    
      2008-03-02
      Garfield Art Works
      Pittsburgh, PA
    
    
      2009-10-19
      Thunderbird Cafe
      Pittsburgh, PA
    
    
      2010-10-17
      Garfield Art Works
      Pittsburgh, PA
    
    
      2012-06-22
      Garfield Art Works
      Pittsburgh, PA

What states have we played?

The Gig model doesn't have a state field, so we need to parse it out. In vanilla Python, we'd do:



In [29]:

    
'Boston, MA'.split(',')[1].strip()









    Out[29]:





'MA'

With pandas, we can do the same thing for every Series element:



In [30]:

    
states = gig_cities.str.split(',').str.get(1).str.strip().value_counts()
states









    Out[30]:





MA    117
NY     40
PA     11
OH     11
CA     10
IL      8
CT      7
VA      6
TX      6
KY      4
     ... 
ON      2
UT      1
RI      1
QC      1
MO      1
WA      1
LA      1
CO      1
VT      1
KS      1
Name: city, dtype: int64



In [31]:

    
states.describe()









    Out[31]:





count     34.000000
mean       7.911765
std       20.467813
min        1.000000
25%        1.250000
50%        3.000000
75%        5.500000
max      117.000000
Name: city, dtype: float64



In [32]:

    
# Don't rotate the x-axis labels
states[:5].plot.bar(rot=0);

What venues have we played?

DataFrame has powerful grouping and aggregration functionality:



In [33]:

    
venues = gigs.groupby(['venue', 'city']).size()
venues









    Out[33]:





venue                        city              
3rd Floor Studio             Fredericksburg, VA    1
501 Club                     Minneapolis, MN       1
Allandale Farm               Jamaica Plain, MA     1
Allston Village Street Fair  Allston, MA           1
Arlene's Grocery             New York, NY          1
ArtBeat                      Somerville, MA        1
Atwood's Tavern              Cambridge, MA         1
Bayside Bowl                 Portland, ME          1
Beat Kitchen                 Chicago, IL           1
Berbati's Pan                Portland, OR          1
                                                  ..
Union Hall                   Brooklyn, NY          2
Union Pool                   Brooklyn, NY          1
Ursinus College              Collegeville, PA      1
Valentine's                  Albany, NY            2
Vaudeville Mews              Des Moines, IA        1
Viper Room                   Los Angeles, CA       1
WMBR 88.1 FM Pipeline        Cambridge, MA         2
Wesleyan College             Middletown, CT        1
Zirzamin                     New York, NY          1
ZuZu                         Cambridge, MA         3
dtype: int64

This Series has a MultiIndex. Very useful, but beyond the scope of this presentation...



In [34]:

    
venues.describe()









    Out[34]:





count    171.000000
mean       1.573099
std        2.169083
min        1.000000
25%        1.000000
50%        1.000000
75%        1.000000
max       25.000000
dtype: float64



In [35]:

    
top_venues = venues.nlargest(10)
top_venues









    Out[35]:





venue                       city          
Great Scott                 Allston, MA       25
TT The Bear's Place         Cambridge, MA     11
The Middle East Upstairs    Cambridge, MA      8
The Middle East Downstairs  Cambridge, MA      7
The Mercury Lounge          New York, NY       5
The Sinclair                Cambridge, MA      5
Cake Shop                   New York, NY       4
Garfield Art Works          Pittsburgh, PA     3
Ralph's Diner               Worcester, MA      3
The Beachland Tavern        Cleveland, OH      3
dtype: int64



In [36]:

    
top_venues.sort_values().plot.barh();



In [ ]:

	float	integer
string
one	Number 1.0	Number 1
two	Number 2.0	Number 2
two	Number 2.0	Number 2
three	Number 3.0	Number 3

	date	venue__city	venue__name
0	2017-03-11	Somerville, MA	Secret Venue
1	2016-12-16	Allston, MA	Great Scott
2	2016-09-25	Allston, MA	Brighton Music Hall
3	2016-09-10	Jamaica Plain, MA	Pinebanks Field
4	2016-08-07	Cambridge, MA	State Park
5	2016-08-06	Manchester, NH	Fuzz Hut
6	2016-08-05	Stockbridge, MA	Chesterfest
7	2016-07-30	Providence, RI	Dusk
8	2016-07-29	New York, NY	Cake Shop
9	2016-07-28	Baltimore, MD	Club 603
...	...	...	...
259	2006-06-20	Cambridge, MA	The Middle East Upstairs
260	2006-05-25	Cambridge, MA	TT The Bear's Place
261	2006-03-25	Boston, MA	Bill's Bar
262	2006-03-07	Cambridge, MA	Lizard Lounge
263	2006-03-05	New York, NY	Sin-E
264	2006-02-16	Allston, MA	Great Scott
265	2006-02-01	Cambridge, MA	TT The Bear's Place
266	2006-01-07	New York, NY	Sin-E
267	2005-12-18	Cambridge, MA	TT The Bear's Place
268	2005-11-20	Allston, MA	Great Scott

	venue	city
date
2005-11-20	Great Scott	Allston, MA
2005-12-18	TT The Bear's Place	Cambridge, MA
2006-01-07	Sin-E	New York, NY
2006-02-01	TT The Bear's Place	Cambridge, MA
2006-02-16	Great Scott	Allston, MA

	venue	city
date
2016-01-31	Great Scott	Allston, MA
2016-03-12	The Bowery Electric	New York, NY
2016-05-12	Great Scott	Allston, MA
2016-05-14	C'Mon Everybody	Brooklyn, NY
2016-07-08	The Sinclair	Cambridge, MA
2016-07-09	Grumpy's	Falmouth, MA
2016-07-27	Black Cat	Washington, DC
2016-07-28	Club 603	Baltimore, MD
2016-07-29	Cake Shop	New York, NY
2016-07-30	Dusk	Providence, RI
2016-08-05	Chesterfest	Stockbridge, MA
2016-08-06	Fuzz Hut	Manchester, NH
2016-08-07	State Park	Cambridge, MA
2016-09-10	Pinebanks Field	Jamaica Plain, MA
2016-09-25	Brighton Music Hall	Allston, MA
2016-12-16	Great Scott	Allston, MA

	venue	city
date
2007-06-30	Bloomfield Bridge Tavern	Pittsburgh, PA
2008-03-02	Garfield Art Works	Pittsburgh, PA
2009-10-19	Thunderbird Cafe	Pittsburgh, PA
2010-10-17	Garfield Art Works	Pittsburgh, PA
2012-06-22	Garfield Art Works	Pittsburgh, PA