pandas

crunching data like bamboo since 2008

Image is Creative Commons Attribution licensed. The original image can be found on Flickr.


In [13]:
import pandas
%matplotlib inline

The DataFrame

  • In-memory data table
    • Filter
    • Join
    • Group
    • Aggregate
    • Sort
    • Index
    • Plot

DataFrames can easily be created from flat text files like CSVs


In [14]:
sce = pandas.read_csv(
    'fake_student_courseenrollment.csv',
    header=None,
    names=['id', 'user_id', 'course_id', 'created', 'is_active', 'mode'],
    parse_dates=['created']
)

Cool, what does this DataFrame look like?


In [15]:
sce.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 0 to 4
Data columns (total 6 columns):
id           5 non-null int64
user_id      5 non-null int64
course_id    5 non-null object
created      5 non-null datetime64[ns]
is_active    5 non-null int64
mode         5 non-null object
dtypes: datetime64[ns](1), int64(3), object(2)

In [16]:
sce.head()


Out[16]:
   id  user_id                            course_id             created  is_active   mode
0   0        1        ValleyOfPeace/KungFu101/First 2013-03-01 09:22:11          1  honor
1   0        2        ValleyOfPeace/KungFu101/First 2013-03-01 09:25:52          0  honor
2   0        3        ValleyOfPeace/KungFu101/First 2013-03-05 08:15:52          1  honor
3   0        1  ValleyOfPeace/AdvancedKungFu/Second 2013-03-05 03:10:02          1  honor
4   0        4        ValleyOfPeace/KungFu101/First 2013-03-10 12:43:10          1  honor

Filtering

pandas lets you easily slice up your data to drill in on areas of interest.


In [17]:
sce[sce['course_id'] == 'ValleyOfPeace/AdvancedKungFu/Second']


Out[17]:
   id  user_id                            course_id             created  is_active   mode
3   0        1  ValleyOfPeace/AdvancedKungFu/Second 2013-03-05 03:10:02          1  honor

In [18]:
sce[sce['course_id'] == 'ValleyOfPeace/KungFu101/First']


Out[18]:
   id  user_id                      course_id             created  is_active   mode
0   0        1  ValleyOfPeace/KungFu101/First 2013-03-01 09:22:11          1  honor
1   0        2  ValleyOfPeace/KungFu101/First 2013-03-01 09:25:52          0  honor
2   0        3  ValleyOfPeace/KungFu101/First 2013-03-05 08:15:52          1  honor
4   0        4  ValleyOfPeace/KungFu101/First 2013-03-10 12:43:10          1  honor

In [19]:
sce[sce['created'] > '2013-03-02']


Out[19]:
   id  user_id                            course_id             created  is_active   mode
2   0        3        ValleyOfPeace/KungFu101/First 2013-03-05 08:15:52          1  honor
3   0        1  ValleyOfPeace/AdvancedKungFu/Second 2013-03-05 03:10:02          1  honor
4   0        4        ValleyOfPeace/KungFu101/First 2013-03-10 12:43:10          1  honor
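Conditions can also be combined. A quick sketch using a tiny hypothetical frame that mirrors the enrollment data above: each condition produces a boolean mask, and masks combine with & (and) and | (or). Note each condition needs its own parentheses, because Python's bitwise operators bind more tightly than comparisons.

```python
import pandas

# A small hypothetical stand-in for the sce frame loaded above.
sce = pandas.DataFrame({
    'user_id': [1, 2, 3],
    'course_id': ['ValleyOfPeace/KungFu101/First'] * 3,
    'is_active': [1, 0, 1],
})

# Active enrollments in one course: two masks ANDed together.
active = sce[
    (sce['course_id'] == 'ValleyOfPeace/KungFu101/First')
    & (sce['is_active'] == 1)
]
print(len(active))  # 2 rows match
```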

Joining

You can perform common joins like inner joins and left outer joins.


In [20]:
users = pandas.read_csv('fake_user_names.csv', header=None, names=['user_id', 'full_name'])
users.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 3 entries, 0 to 2
Data columns (total 2 columns):
user_id      3 non-null int64
full_name    3 non-null object
dtypes: int64(1), object(1)

In [25]:
pandas.merge(sce, users, on='user_id', how='left')


Out[25]:
   id  user_id                            course_id             created  is_active   mode     full_name
0   0        1        ValleyOfPeace/KungFu101/First 2013-03-01 09:22:11          1  honor            Po
1   0        1  ValleyOfPeace/AdvancedKungFu/Second 2013-03-05 03:10:02          1  honor            Po
2   0        2        ValleyOfPeace/KungFu101/First 2013-03-01 09:25:52          0  honor  Master Shifu
3   0        3        ValleyOfPeace/KungFu101/First 2013-03-05 08:15:52          1  honor       Tigress
4   0        4        ValleyOfPeace/KungFu101/First 2013-03-10 12:43:10          1  honor           NaN
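The left join above keeps user 4 with a NaN name. With how='inner', unmatched rows are dropped instead. A sketch on hypothetical miniatures of the two frames:

```python
import pandas

# Hypothetical miniatures of the enrollment and name frames above.
sce = pandas.DataFrame({'user_id': [1, 2, 4]})
users = pandas.DataFrame({
    'user_id': [1, 2, 3],
    'full_name': ['Po', 'Master Shifu', 'Tigress'],
})

# how='inner' keeps only user_ids present in BOTH frames,
# so the enrollment for user 4 (no matching name) disappears.
inner = pandas.merge(sce, users, on='user_id', how='inner')
print(inner)
```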

Indexing

pandas contains a fairly sophisticated indexing system.

Let's take a look at a quick example of the type of thing you can do.


In [22]:
sce_idx = sce.set_index('created')
sce_idx.info()


<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 5 entries, 2013-03-01 09:22:11 to 2013-03-10 12:43:10
Data columns (total 5 columns):
id           5 non-null int64
user_id      5 non-null int64
course_id    5 non-null object
is_active    5 non-null int64
mode         5 non-null object
dtypes: int64(3), object(2)
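One payoff of a DatetimeIndex is partial string indexing: .loc accepts date strings that are less precise than the index itself. A sketch with a hypothetical frame shaped like sce_idx:

```python
import pandas

# A hypothetical frame indexed by timestamp, like sce_idx above.
idx = pandas.to_datetime([
    '2013-03-01 09:22:11',
    '2013-03-05 08:15:52',
    '2013-03-10 12:43:10',
])
df = pandas.DataFrame({'user_id': [1, 3, 4]}, index=idx)

# A bare date selects every row from that day...
march_5 = df.loc['2013-03-05']
# ...and date strings also work as slice endpoints.
early = df.loc['2013-03-01':'2013-03-06']
print(len(march_5), len(early))
```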

I wonder how many records are created each day?


In [23]:
records_per_day = sce_idx['id'].resample('D').count()
records_per_day.head()


Out[23]:
created
2013-03-01    2
2013-03-02    0
2013-03-03    0
2013-03-04    0
2013-03-05    2
Freq: D, Name: id, dtype: int64
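Resampling is essentially a time-based group-by. The general groupby machinery from the feature list up top works on any column; a sketch with a hypothetical mini enrollment frame:

```python
import pandas

# Hypothetical mini enrollment frame mirroring sce above.
sce = pandas.DataFrame({
    'course_id': ['KungFu101', 'KungFu101', 'AdvancedKungFu', 'KungFu101'],
    'is_active': [1, 0, 1, 1],
})

# Per course: total enrollments (count) and active enrollments (sum of flags).
per_course = sce.groupby('course_id')['is_active'].agg(['count', 'sum'])
print(per_course)
```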

Plotting


In [24]:
records_per_day.plot()


Out[24]:
<matplotlib.axes._subplots.AxesSubplot at 0x4100990>
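Sorting, the last item from the feature list we haven't shown, is just as direct. A hypothetical sketch:

```python
import pandas

# Hypothetical frame with out-of-order timestamps.
df = pandas.DataFrame({
    'user_id': [3, 1, 2],
    'created': pandas.to_datetime(['2013-03-05', '2013-03-01', '2013-03-03']),
})

# sort_values orders rows by one or more columns (oldest first here).
by_time = df.sort_values('created')
print(by_time['user_id'].tolist())  # [1, 2, 3]
```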