pandas

crunching data like bamboo since 2008

Image is Creative Comments Attribution licensed. The original image can be found on flickr.


In [34]:
import pandas
%matplotlib inline

The DataFrame

  • In-memory data table
    • Filter
    • Join
    • Group
    • Aggregate
    • Sort
    • Index
    • Plot

They can be easily created from flat text files like CSVs


In [35]:
sce = pandas.read_csv(
    'fake_student_courseenrollment.csv',
    header=None,
    names=['id', 'user_id', 'course_id', 'created', 'is_active', 'mode'],
    parse_dates=['created']
)

Cool, what does this DataFrame look like?


In [36]:
sce.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 0 to 3
Data columns (total 6 columns):
id           4 non-null int64
user_id      4 non-null int64
course_id    4 non-null object
created      4 non-null datetime64[ns]
is_active    4 non-null int64
mode         4 non-null object
dtypes: datetime64[ns](1), int64(3), object(2)

In [37]:
sce.head()


Out[37]:
id user_id course_id created is_active mode
0 0 1 FooX/BarX/Baz 2013-03-01 09:22:11 1 honor
1 0 2 FooX/BarX/Baz 2013-03-01 09:25:52 0 honor
2 0 3 FooX/BarX/Baz 2013-03-05 08:15:52 1 honor
3 0 1 FooX/OtherX/Baz 2013-03-05 03:10:02 1 honor

Filtering

pandas lets you easily slice up your data to drill in on areas of interest.


In [38]:
sce[ sce['course_id'] == 'FooX/OtherX/Baz' ]


Out[38]:
id user_id course_id created is_active mode
3 0 1 FooX/OtherX/Baz 2013-03-05 03:10:02 1 honor

In [39]:
sce[ sce['course_id'] == 'FooX/BarX/Baz' ]


Out[39]:
id user_id course_id created is_active mode
0 0 1 FooX/BarX/Baz 2013-03-01 09:22:11 1 honor
1 0 2 FooX/BarX/Baz 2013-03-01 09:25:52 0 honor
2 0 3 FooX/BarX/Baz 2013-03-05 08:15:52 1 honor

In [40]:
sce[ sce['created'] > '20130302' ]


Out[40]:
id user_id course_id created is_active mode
2 0 3 FooX/BarX/Baz 2013-03-05 08:15:52 1 honor
3 0 1 FooX/OtherX/Baz 2013-03-05 03:10:02 1 honor

Joining

You can perform common joins like inner joins and left outer joins.


In [41]:
users = pandas.read_csv('fake_user_names.csv', header=None, names=['user_id', 'full_name'])
users.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 3 entries, 0 to 2
Data columns (total 2 columns):
user_id      3 non-null int64
full_name    3 non-null object
dtypes: int64(1), object(1)

In [42]:
pandas.merge(sce, users, on='user_id')


Out[42]:
id user_id course_id created is_active mode full_name
0 0 1 FooX/BarX/Baz 2013-03-01 09:22:11 1 honor Po
1 0 1 FooX/OtherX/Baz 2013-03-05 03:10:02 1 honor Po
2 0 2 FooX/BarX/Baz 2013-03-01 09:25:52 0 honor Master Shifu
3 0 3 FooX/BarX/Baz 2013-03-05 08:15:52 1 honor Tigress

Indexing

pandas contains a fairly sophisticated indexing system.

Let's take a look at a quick example of the type of thing you can do.


In [43]:
sce_idx = sce.set_index('created')
sce_idx.info()


<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 4 entries, 2013-03-01 09:22:11 to 2013-03-05 03:10:02
Data columns (total 5 columns):
id           4 non-null int64
user_id      4 non-null int64
course_id    4 non-null object
is_active    4 non-null int64
mode         4 non-null object
dtypes: int64(3), object(2)

I wonder how many records are created each day?


In [44]:
records_per_day = sce_idx['id'].resample('D', how='count')
records_per_day.head()


Out[44]:
created
2013-03-01    2
2013-03-02    0
2013-03-03    0
2013-03-04    0
2013-03-05    2
Freq: D, Name: id, dtype: int64

Plotting


In [45]:
records_per_day.plot()


Out[45]:
<matplotlib.axes._subplots.AxesSubplot at 0x4467710>