pandas

crunching data like bamboo since 2008

Image is Creative Commons Attribution licensed. The original image can be found on Flickr.


In [13]:
import pandas
%matplotlib inline

The DataFrame

  • In-memory data table
    • Filter
    • Join
    • Group
    • Aggregate
    • Sort
    • Index
    • Plot

DataFrames can easily be created from flat text files like CSVs


In [14]:
sce = pandas.read_csv(
    'fake_student_courseenrollment.csv',
    header=None,
    names=['id', 'user_id', 'course_id', 'created', 'is_active', 'mode'],
    parse_dates=['created']
)

Cool, what does this DataFrame look like?


In [15]:
sce.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 0 to 4
Data columns (total 6 columns):
id           5 non-null int64
user_id      5 non-null int64
course_id    5 non-null object
created      5 non-null datetime64[ns]
is_active    5 non-null int64
mode         5 non-null object
dtypes: datetime64[ns](1), int64(3), object(2)

In [16]:
sce.head()


Out[16]:
   id  user_id                            course_id             created  is_active   mode
0   0        1        ValleyOfPeace/KungFu101/First 2013-03-01 09:22:11          1  honor
1   0        2        ValleyOfPeace/KungFu101/First 2013-03-01 09:25:52          0  honor
2   0        3        ValleyOfPeace/KungFu101/First 2013-03-05 08:15:52          1  honor
3   0        1  ValleyOfPeace/AdvancedKungFu/Second 2013-03-05 03:10:02          1  honor
4   0        4        ValleyOfPeace/KungFu101/First 2013-03-10 12:43:10          1  honor

Filtering

pandas lets you easily slice up your data to drill in on areas of interest.


In [17]:
sce[sce['course_id'] == 'ValleyOfPeace/AdvancedKungFu/Second']


Out[17]:
   id  user_id                            course_id             created  is_active   mode
3   0        1  ValleyOfPeace/AdvancedKungFu/Second 2013-03-05 03:10:02          1  honor

In [18]:
sce[sce['course_id'] == 'ValleyOfPeace/KungFu101/First']


Out[18]:
   id  user_id                      course_id             created  is_active   mode
0   0        1  ValleyOfPeace/KungFu101/First 2013-03-01 09:22:11          1  honor
1   0        2  ValleyOfPeace/KungFu101/First 2013-03-01 09:25:52          0  honor
2   0        3  ValleyOfPeace/KungFu101/First 2013-03-05 08:15:52          1  honor
4   0        4  ValleyOfPeace/KungFu101/First 2013-03-10 12:43:10          1  honor

In [19]:
sce[sce['created'] > '2013-03-02']


Out[19]:
   id  user_id                            course_id             created  is_active   mode
2   0        3        ValleyOfPeace/KungFu101/First 2013-03-05 08:15:52          1  honor
3   0        1  ValleyOfPeace/AdvancedKungFu/Second 2013-03-05 03:10:02          1  honor
4   0        4        ValleyOfPeace/KungFu101/First 2013-03-10 12:43:10          1  honor
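Conditions can also be combined. A quick sketch using a tiny hypothetical frame that mirrors the enrollment data above: each condition produces a boolean mask, and masks combine with & (and) and | (or). Note each condition needs its own parentheses, because Python's bitwise operators bind more tightly than comparisons.

```python
import pandas

# A small hypothetical stand-in for the sce frame loaded above.
sce = pandas.DataFrame({
    'user_id': [1, 2, 3],
    'course_id': ['ValleyOfPeace/KungFu101/First'] * 3,
    'is_active': [1, 0, 1],
})

# Active enrollments in one course: two masks ANDed together.
active = sce[
    (sce['course_id'] == 'ValleyOfPeace/KungFu101/First')
    & (sce['is_active'] == 1)
]
print(len(active))  # 2 rows match
```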

Joining

You can perform common joins like inner joins and left outer joins.


In [20]:
users = pandas.read_csv('fake_user_names.csv', header=None, names=['user_id', 'full_name'])
users.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 3 entries, 0 to 2
Data columns (total 2 columns):
user_id      3 non-null int64
full_name    3 non-null object
dtypes: int64(1), object(1)

In [25]:
pandas.merge(sce, users, on='user_id', how='left')


Out[25]:
   id  user_id                            course_id             created  is_active   mode     full_name
0   0        1        ValleyOfPeace/KungFu101/First 2013-03-01 09:22:11          1  honor            Po
1   0        1  ValleyOfPeace/AdvancedKungFu/Second 2013-03-05 03:10:02          1  honor            Po
2   0        2        ValleyOfPeace/KungFu101/First 2013-03-01 09:25:52          0  honor  Master Shifu
3   0        3        ValleyOfPeace/KungFu101/First 2013-03-05 08:15:52          1  honor       Tigress
4   0        4        ValleyOfPeace/KungFu101/First 2013-03-10 12:43:10          1  honor           NaN
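The left join above keeps user 4 with a NaN name. With how='inner', unmatched rows are dropped instead. A sketch on hypothetical miniatures of the two frames:

```python
import pandas

# Hypothetical miniatures of the enrollment and name frames above.
sce = pandas.DataFrame({'user_id': [1, 2, 4]})
users = pandas.DataFrame({
    'user_id': [1, 2, 3],
    'full_name': ['Po', 'Master Shifu', 'Tigress'],
})

# how='inner' keeps only user_ids present in BOTH frames,
# so the enrollment for user 4 (no matching name) disappears.
inner = pandas.merge(sce, users, on='user_id', how='inner')
print(inner)
```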

Indexing

pandas contains a fairly sophisticated indexing system.

Let's take a look at a quick example of the type of thing you can do.


In [22]:
sce_idx = sce.set_index('created')
sce_idx.info()


<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 5 entries, 2013-03-01 09:22:11 to 2013-03-10 12:43:10
Data columns (total 5 columns):
id           5 non-null int64
user_id      5 non-null int64
course_id    5 non-null object
is_active    5 non-null int64
mode         5 non-null object
dtypes: int64(3), object(2)
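One payoff of a DatetimeIndex is partial string indexing: .loc accepts date strings that are less precise than the index itself. A sketch with a hypothetical frame shaped like sce_idx:

```python
import pandas

# A hypothetical frame indexed by timestamp, like sce_idx above.
idx = pandas.to_datetime([
    '2013-03-01 09:22:11',
    '2013-03-05 08:15:52',
    '2013-03-10 12:43:10',
])
df = pandas.DataFrame({'user_id': [1, 3, 4]}, index=idx)

# A bare date selects every row from that day...
march_5 = df.loc['2013-03-05']
# ...and date strings also work as slice endpoints.
early = df.loc['2013-03-01':'2013-03-06']
print(len(march_5), len(early))
```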

I wonder how many records are created each day?


In [23]:
records_per_day = sce_idx['id'].resample('D').count()
records_per_day.head()


Out[23]:
created
2013-03-01    2
2013-03-02    0
2013-03-03    0
2013-03-04    0
2013-03-05    2
Freq: D, Name: id, dtype: int64
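Resampling is essentially a time-based group-by. The general groupby machinery from the feature list up top works on any column; a sketch with a hypothetical mini enrollment frame:

```python
import pandas

# Hypothetical mini enrollment frame mirroring sce above.
sce = pandas.DataFrame({
    'course_id': ['KungFu101', 'KungFu101', 'AdvancedKungFu', 'KungFu101'],
    'is_active': [1, 0, 1, 1],
})

# Per course: total enrollments (count) and active enrollments (sum of flags).
per_course = sce.groupby('course_id')['is_active'].agg(['count', 'sum'])
print(per_course)
```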

Plotting


In [24]:
records_per_day.plot()


Out[24]:
<matplotlib.axes._subplots.AxesSubplot at 0x4100990>
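Sorting, the last item from the feature list we haven't shown, is just as direct. A hypothetical sketch:

```python
import pandas

# Hypothetical frame with out-of-order timestamps.
df = pandas.DataFrame({
    'user_id': [3, 1, 2],
    'created': pandas.to_datetime(['2013-03-05', '2013-03-01', '2013-03-03']),
})

# sort_values orders rows by one or more columns (oldest first here).
by_time = df.sort_values('created')
print(by_time['user_id'].tolist())  # [1, 2, 3]
```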