pandas

crunching data like bamboo since 2008

Image is Creative Comments Attribution licensed. The original image can be found on flickr.



In [34]:

    
import pandas
%matplotlib inline

The DataFrame

In-memory data table
- Filter
- Join
- Group
- Aggregate
- Sort
- Index
- Plot

They can be easily created from flat text files like CSVs



In [35]:

    
sce = pandas.read_csv(
    'fake_student_courseenrollment.csv',
    header=None,
    names=['id', 'user_id', 'course_id', 'created', 'is_active', 'mode'],
    parse_dates=['created']
)

Cool, what does this DataFrame look like?



In [36]:

    
sce.info()









    



<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 0 to 3
Data columns (total 6 columns):
id           4 non-null int64
user_id      4 non-null int64
course_id    4 non-null object
created      4 non-null datetime64[ns]
is_active    4 non-null int64
mode         4 non-null object
dtypes: datetime64[ns](1), int64(3), object(2)



In [37]:

    
sce.head()









    Out[37]:






  
    
      
      id
      user_id
      course_id
      created
      is_active
      mode
    
  
  
    
      0
       0
       1
         FooX/BarX/Baz
      2013-03-01 09:22:11
       1
       honor
    
    
      1
       0
       2
         FooX/BarX/Baz
      2013-03-01 09:25:52
       0
       honor
    
    
      2
       0
       3
         FooX/BarX/Baz
      2013-03-05 08:15:52
       1
       honor
    
    
      3
       0
       1
       FooX/OtherX/Baz
      2013-03-05 03:10:02
       1
       honor

Filtering

pandas lets you easily slice up your data to drill in on areas of interest.



In [38]:

    
sce[ sce['course_id'] == 'FooX/OtherX/Baz' ]









    Out[38]:






  
    
      
      id
      user_id
      course_id
      created
      is_active
      mode
    
  
  
    
      3
       0
       1
       FooX/OtherX/Baz
      2013-03-05 03:10:02
       1
       honor



In [39]:

    
sce[ sce['course_id'] == 'FooX/BarX/Baz' ]









    Out[39]:






  
    
      
      id
      user_id
      course_id
      created
      is_active
      mode
    
  
  
    
      0
       0
       1
       FooX/BarX/Baz
      2013-03-01 09:22:11
       1
       honor
    
    
      1
       0
       2
       FooX/BarX/Baz
      2013-03-01 09:25:52
       0
       honor
    
    
      2
       0
       3
       FooX/BarX/Baz
      2013-03-05 08:15:52
       1
       honor



In [40]:

    
sce[ sce['created'] > '20130302' ]









    Out[40]:






  
    
      
      id
      user_id
      course_id
      created
      is_active
      mode
    
  
  
    
      2
       0
       3
         FooX/BarX/Baz
      2013-03-05 08:15:52
       1
       honor
    
    
      3
       0
       1
       FooX/OtherX/Baz
      2013-03-05 03:10:02
       1
       honor

Joining

You can perform common joins like inner joins and left outer joins.



In [41]:

    
users = pandas.read_csv('fake_user_names.csv', header=None, names=['user_id', 'full_name'])
users.info()









    



<class 'pandas.core.frame.DataFrame'>
Int64Index: 3 entries, 0 to 2
Data columns (total 2 columns):
user_id      3 non-null int64
full_name    3 non-null object
dtypes: int64(1), object(1)



In [42]:

    
pandas.merge(sce, users, on='user_id')









    Out[42]:






  
    
      
      id
      user_id
      course_id
      created
      is_active
      mode
      full_name
    
  
  
    
      0
       0
       1
         FooX/BarX/Baz
      2013-03-01 09:22:11
       1
       honor
                 Po
    
    
      1
       0
       1
       FooX/OtherX/Baz
      2013-03-05 03:10:02
       1
       honor
                 Po
    
    
      2
       0
       2
         FooX/BarX/Baz
      2013-03-01 09:25:52
       0
       honor
       Master Shifu
    
    
      3
       0
       3
         FooX/BarX/Baz
      2013-03-05 08:15:52
       1
       honor
            Tigress

Indexing

pandas contains a fairly sophisticated indexing system.

Let's take a look at a quick example of the type of thing you can do.



In [43]:

    
sce_idx = sce.set_index('created')
sce_idx.info()









    



<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 4 entries, 2013-03-01 09:22:11 to 2013-03-05 03:10:02
Data columns (total 5 columns):
id           4 non-null int64
user_id      4 non-null int64
course_id    4 non-null object
is_active    4 non-null int64
mode         4 non-null object
dtypes: int64(3), object(2)

I wonder how many records are created each day?



In [44]:

    
records_per_day = sce_idx['id'].resample('D', how='count')
records_per_day.head()









    Out[44]:





created
2013-03-01    2
2013-03-02    0
2013-03-03    0
2013-03-04    0
2013-03-05    2
Freq: D, Name: id, dtype: int64

Plotting



In [45]:

    
records_per_day.plot()









    Out[45]:





<matplotlib.axes._subplots.AxesSubplot at 0x4467710>

	user_id	course_id	created	is_active	mode
0	1	FooX/BarX/Baz	2013-03-01 09:22:11	1	honor
1	2	FooX/BarX/Baz	2013-03-01 09:25:52	0	honor
2	3	FooX/BarX/Baz	2013-03-05 08:15:52	1	honor
3	1	FooX/OtherX/Baz	2013-03-05 03:10:02	1	honor