4-7 pandas DataFrame Summary Statistics, Filtering, Dropping and adding Rows and Columns, Grouping Basics



In [1]:
%matplotlib inline
import sys
print(sys.version)
import numpy as np
print(np.__version__)
import pandas as pd
print(pd.__version__)
import matplotlib.pyplot as plt


3.3.2 (v3.3.2:d047928ae3f6, May 13 2013, 13:52:24) 
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)]
1.9.2
0.16.2

Alright in this section we're going to continue with the running data set but we're going to dive a bit deeper into ways of analyzing the data including filtering, dropping rows, doing some groupings and that sort of thing.

So what we'll do is read in our csv file.


In [7]:
pd.read_csv?

In [8]:
list(range(1,7))


Out[8]:
[1, 2, 3, 4, 5, 6]

In [13]:
df = pd.read_csv('../data/date_fixed_running_data_with_time.csv', parse_dates=['Date'], usecols=list(range(0,6)))

In [14]:
df.dtypes


Out[14]:
Date            datetime64[ns]
Miles                  float64
Time                    object
Seconds                float64
Minutes                float64
Min_per_mile           float64
dtype: object

In [15]:
df.sort(inplace=True)
df.head()


Out[15]:
Date Miles Time Seconds Minutes Min_per_mile
0 2014-08-02 1.70 NaN NaN NaN NaN
1 2014-08-03 1.40 NaN NaN NaN NaN
2 2014-08-04 2.10 NaN NaN NaN NaN
3 2014-08-05 0.00 NaN NaN NaN NaN
4 2014-08-06 2.38 NaN NaN NaN NaN

Now let's about getting some summary statistics. I would encourage you to try these on your own. We've learned pretty much everything we need to in order to be able to do these on without guidance and as always if you need clarification just ask on the side.

What was the longest run in miles and minutes that I ran?


In [16]:
df.Minutes.max()


Out[16]:
260.71666666666664

In [17]:
df.Miles.max()


Out[17]:
26.199999999999999

What about the shortest in miles and minutes that I ran?


In [18]:
df.Minutes.min()


Out[18]:
6.2666666666666675

In [19]:
df.Miles.min()


Out[19]:
0.0

We forgot to ignore our null values, so how would we do it by ignoring those?


In [20]:
df.Miles[df.Miles > 0].min()


Out[20]:
0.65000000000000002

What was the most common running distance I did excluding times when I didn't run at all.


In [21]:
df.Miles[df.Miles > 0].value_counts().index[0]


Out[21]:
1.3700000000000001

Plot a graph of the cumulative running distance in this dataset.


In [22]:
df.Miles.cumsum().plot()
plt.xlabel("Day Number")
plt.ylabel("Distance")


Out[22]:
<matplotlib.text.Text at 0x107a04d10>

Plot a graph of the cumulative running hours in this data set.


In [23]:
(df.Minutes.fillna(0).cumsum() / 60).plot()


Out[23]:
<matplotlib.axes._subplots.AxesSubplot at 0x107aa8890>

Another interesting question we could ask is what days of the week do I commonly go for runs. Am I faster on certain days or does my speed improve over time relative to the distance that I'm running.

So let's get our days of the week


In [24]:
df.Date[0].strftime("%A")


Out[24]:
'Saturday'

We will do that by mapping our date column to a the time format we need


In [25]:
df.Date.map(lambda x: x.strftime("%A")).head()


Out[25]:
0     Saturday
1       Sunday
2       Monday
3      Tuesday
4    Wednesday
Name: Date, dtype: object

then we just set that to a new column.


In [26]:
df['Day_of_week'] = df.Date.map(lambda x: x.strftime("%A"))

In [27]:
df.head(10)


Out[27]:
Date Miles Time Seconds Minutes Min_per_mile Day_of_week
0 2014-08-02 1.70 NaN NaN NaN NaN Saturday
1 2014-08-03 1.40 NaN NaN NaN NaN Sunday
2 2014-08-04 2.10 NaN NaN NaN NaN Monday
3 2014-08-05 0.00 NaN NaN NaN NaN Tuesday
4 2014-08-06 2.38 NaN NaN NaN NaN Wednesday
5 2014-08-07 2.30 NaN NaN NaN NaN Thursday
6 2014-08-08 3.40 NaN NaN NaN NaN Friday
7 2014-08-09 2.50 NaN NaN NaN NaN Saturday
8 2014-08-10 3.36 0:28:37 1717 28.616667 8.516865 Sunday
9 2014-08-11 1.67 0:13:07 787 13.116667 7.854291 Monday

and we can make a bar plot of it, but let's see if we can distinguish anything unique about certain days of the week.


In [28]:
df[df.Miles > 0].Day_of_week.value_counts().plot(kind='bar')


Out[28]:
<matplotlib.axes._subplots.AxesSubplot at 0x107c09f90>

We will do that by creating groups of data frames

We can see that in this sample I run a lot more on the Friday Saturday and Monday. Some interesting patterns. Why don't we try looking at the means and that sort of thing.

But before we get there, at this point, our data frame is getting pretty messy and I think it's worth explaining how to remove columns and add remove rows and columns.

First let's remove the Time column - seeing as we already have minutes and seconds


In [29]:
del(df['Time'])

del will delete it in place


In [30]:
df.head()


Out[30]:
Date Miles Seconds Minutes Min_per_mile Day_of_week
0 2014-08-02 1.70 NaN NaN NaN Saturday
1 2014-08-03 1.40 NaN NaN NaN Sunday
2 2014-08-04 2.10 NaN NaN NaN Monday
3 2014-08-05 0.00 NaN NaN NaN Tuesday
4 2014-08-06 2.38 NaN NaN NaN Wednesday

Finally we can use drop to drop a column. Now we have to specify the axis( we can also use this to drop rows), now this does not happen in place.


In [31]:
df.drop('Seconds',axis=1)


Out[31]:
Date Miles Minutes Min_per_mile Day_of_week
0 2014-08-02 1.70 NaN NaN Saturday
1 2014-08-03 1.40 NaN NaN Sunday
2 2014-08-04 2.10 NaN NaN Monday
3 2014-08-05 0.00 NaN NaN Tuesday
4 2014-08-06 2.38 NaN NaN Wednesday
5 2014-08-07 2.30 NaN NaN Thursday
6 2014-08-08 3.40 NaN NaN Friday
7 2014-08-09 2.50 NaN NaN Saturday
8 2014-08-10 3.36 28.616667 8.516865 Sunday
9 2014-08-11 1.67 13.116667 7.854291 Monday
10 2014-08-12 1.42 10.583333 7.453052 Tuesday
11 2014-08-13 2.35 17.416667 7.411348 Wednesday
12 2014-08-14 0.90 6.266667 6.962963 Thursday
13 2014-08-14 0.90 6.266667 6.962963 Thursday
14 2014-08-15 1.78 NaN NaN Friday
15 2014-08-16 0.65 NaN NaN Saturday
16 2014-08-17 1.34 NaN NaN Sunday
17 2014-08-18 0.65 NaN NaN Monday
18 2014-08-19 1.37 NaN NaN Tuesday
19 2014-08-19 1.37 NaN NaN Tuesday
20 2014-08-20 1.37 9.550000 6.970803 Wednesday
21 2014-08-21 1.37 9.400000 6.861314 Thursday
22 2014-08-22 1.37 9.083333 6.630170 Friday
23 2014-08-23 1.34 NaN NaN Saturday
24 2014-08-24 3.00 NaN NaN Sunday
25 2014-08-25 1.37 8.916667 6.508516 Monday
26 2014-08-25 1.34 NaN NaN Monday
27 2014-08-26 1.37 10.500000 7.664234 Tuesday
28 2014-08-27 2.50 25.000000 10.000000 Wednesday
29 2014-08-28 0.00 NaN NaN Thursday
... ... ... ... ... ...
136 2014-12-04 0.00 NaN NaN Thursday
137 2014-12-05 0.00 NaN NaN Friday
138 2014-12-06 0.00 NaN NaN Saturday
139 2014-12-07 0.00 NaN NaN Sunday
140 2014-12-08 0.00 NaN NaN Monday
141 2014-12-09 2.93 25.000000 8.532423 Tuesday
142 2014-12-10 0.00 NaN NaN Wednesday
143 2014-12-11 0.00 NaN NaN Thursday
144 2014-12-12 1.37 9.250000 6.751825 Friday
145 2014-12-13 5.30 43.600000 8.226415 Saturday
146 2014-12-13 2.21 18.983333 8.589744 Saturday
147 2014-12-14 0.00 NaN NaN Sunday
148 2014-12-15 6.50 58.716667 9.033333 Monday
149 2014-12-16 0.00 NaN NaN Tuesday
150 2014-12-17 12.00 99.000000 8.250000 Wednesday
151 2014-12-18 0.00 NaN NaN Thursday
152 2014-12-19 3.10 26.250000 8.467742 Friday
153 2014-12-20 14.80 135.000000 9.121622 Saturday
154 2014-12-21 0.00 NaN NaN Sunday
155 2014-12-22 3.10 28.000000 9.032258 Monday
156 2014-12-23 5.63 51.833333 9.206631 Tuesday
157 2014-12-24 3.05 25.133333 8.240437 Wednesday
158 2014-12-25 14.80 140.000000 9.459459 Thursday
159 2014-12-26 5.20 NaN NaN Friday
160 2014-12-27 0.00 NaN NaN Saturday
161 2014-12-28 0.00 NaN NaN Sunday
162 2014-12-29 0.00 NaN NaN Monday
163 2014-12-30 3.15 22.166667 7.037037 Tuesday
164 2014-12-31 0.00 NaN NaN Wednesday
165 2015-01-01 26.20 260.716667 9.951018 Thursday

166 rows × 5 columns

we can also use drop to drop a specific row by specifying the 0 axis


In [32]:
tempdf = pd.DataFrame(np.arange(4).reshape(2,2))
tempdf


Out[32]:
0 1
0 0 1
1 2 3

In [33]:
tempdf.drop(1,axis=0)


Out[33]:
0 1
0 0 1

we Already saw how to create a new column, we can also create a new row using the append method. This takes in a data frame or Series and appends it to the end of the data frame.


In [34]:
tempdf.append(pd.Series([4,5]), ignore_index=True)


Out[34]:
0 1
0 0 1
1 2 3
2 4 5

In [35]:
df.head()


Out[35]:
Date Miles Seconds Minutes Min_per_mile Day_of_week
0 2014-08-02 1.70 NaN NaN NaN Saturday
1 2014-08-03 1.40 NaN NaN NaN Sunday
2 2014-08-04 2.10 NaN NaN NaN Monday
3 2014-08-05 0.00 NaN NaN NaN Tuesday
4 2014-08-06 2.38 NaN NaN NaN Wednesday

We can also pop out a column which will remove it from a data frame and return the Series. You'll see that it happens in place.


In [36]:
df.pop('Seconds')


Out[36]:
0        NaN
1        NaN
2        NaN
3        NaN
4        NaN
5        NaN
6        NaN
7        NaN
8       1717
9        787
10       635
11      1045
12       376
13       376
14       NaN
15       NaN
16       NaN
17       NaN
18       NaN
19       NaN
20       573
21       564
22       545
23       NaN
24       NaN
25       535
26       NaN
27       630
28      1500
29       NaN
       ...  
136      NaN
137      NaN
138      NaN
139      NaN
140      NaN
141     1500
142      NaN
143      NaN
144      555
145     2616
146     1139
147      NaN
148     3523
149      NaN
150     5940
151      NaN
152     1575
153     8100
154      NaN
155     1680
156     3110
157     1508
158     8400
159      NaN
160      NaN
161      NaN
162      NaN
163     1330
164      NaN
165    15643
Name: Seconds, dtype: float64

In [37]:
df.head()


Out[37]:
Date Miles Minutes Min_per_mile Day_of_week
0 2014-08-02 1.70 NaN NaN Saturday
1 2014-08-03 1.40 NaN NaN Sunday
2 2014-08-04 2.10 NaN NaN Monday
3 2014-08-05 0.00 NaN NaN Tuesday
4 2014-08-06 2.38 NaN NaN Wednesday

Now we've made our dataset a bit more manageable. We've kind of just got the basics of what we need to perform some groupwise analysis.

Now at this point we're going to do some groupings. This is an extremely powerful part of pandas and one that you'll use all the time.

pandas follows the the Split-Apply-Combine style of data analysis.

Many data analysis problems involve the application of a split-apply-combine strategy, where you break up a big problem into manageable pieces, operate on each piece independently and then put all the pieces back together.

Hadley Wickhan from Rice University: http://www.jstatsoft.org/v40/i01/paper

Since we're going to want to check things in groups. What I'm going to do is try to analyze each day of the week to see if there are any differences in the types of running that I do on those days.

we'll start by grouping or data set on those weekdays. Basically creating a dictionary of the data where the key is the weekday and the value is the dataframe of all those values.

First let's do this the hard way....


In [39]:
for dow in df.Day_of_week.unique():
    print(dow)
    print(df[df.Day_of_week == dow])
    break


Saturday
          Date  Miles     Minutes  Min_per_mile Day_of_week
0   2014-08-02   1.70         NaN           NaN    Saturday
7   2014-08-09   2.50         NaN           NaN    Saturday
15  2014-08-16   0.65         NaN           NaN    Saturday
23  2014-08-23   1.34         NaN           NaN    Saturday
31  2014-08-30   0.00         NaN           NaN    Saturday
38  2014-09-06   0.00         NaN           NaN    Saturday
45  2014-09-13   0.00         NaN           NaN    Saturday
52  2014-09-20   0.00         NaN           NaN    Saturday
59  2014-09-27   0.00         NaN           NaN    Saturday
67  2014-10-04   4.63   40.000000      8.639309    Saturday
68  2014-10-04   1.87         NaN           NaN    Saturday
69  2014-10-04   4.08   33.000000      8.088235    Saturday
76  2014-10-11   0.00         NaN           NaN    Saturday
83  2014-10-18   1.50    9.666667      6.444444    Saturday
84  2014-10-18   1.50         NaN           NaN    Saturday
85  2014-10-18   1.37         NaN           NaN    Saturday
92  2014-10-25   2.00         NaN           NaN    Saturday
99  2014-11-01   0.00         NaN           NaN    Saturday
108 2014-11-08   0.00         NaN           NaN    Saturday
117 2014-11-15   0.00         NaN           NaN    Saturday
124 2014-11-22   0.00         NaN           NaN    Saturday
131 2014-11-29   0.00         NaN           NaN    Saturday
138 2014-12-06   0.00         NaN           NaN    Saturday
145 2014-12-13   5.30   43.600000      8.226415    Saturday
146 2014-12-13   2.21   18.983333      8.589744    Saturday
153 2014-12-20  14.80  135.000000      9.121622    Saturday
160 2014-12-27   0.00         NaN           NaN    Saturday

This is clearly an ugly way to do this and pandas provides a much more simple way of approaching this problem. by creating a groupby object.

But first I'm going to filter out our zero values because they'll throw off our analysis.


In [40]:
df['Miles'] = df.Miles[df.Miles > 0]

In [42]:
dows = df.groupby('Day_of_week')
print(dows)


<pandas.core.groupby.DataFrameGroupBy object at 0x107cc0b50>

We can get the size of each one by using the size command. This basically tells us how many items are in each category.


In [43]:
dows.size()


Out[43]:
Day_of_week
Friday       23
Monday       25
Saturday     27
Sunday       23
Thursday     23
Tuesday      23
Wednesday    22
dtype: int64

In [44]:
dows.count()


Out[44]:
Date Miles Minutes Min_per_mile
Day_of_week
Friday 23 15 8 8
Monday 25 14 10 10
Saturday 27 14 6 6
Sunday 23 11 8 8
Thursday 23 8 5 5
Tuesday 23 11 8 8
Wednesday 22 8 7 7

Now we have our groups and we can start doing groupwise analysis, now what does that mean?

It means we can start answering questions like what is the average speed per weekday or what is the total miles run per weekday?


In [45]:
dows.mean()


Out[45]:
Miles Minutes Min_per_mile
Day_of_week
Friday 2.786000 24.308333 7.747657
Monday 2.607143 22.243333 7.463291
Saturday 3.246429 46.708333 8.184961
Sunday 2.422727 19.762500 7.463840
Thursday 6.315000 84.530000 8.039543
Tuesday 2.428182 21.770833 7.659706
Wednesday 3.315000 28.021429 7.829348

In [46]:
dows.sum()


Out[46]:
Miles Minutes Min_per_mile
Day_of_week
Friday 41.79 194.466667 61.981259
Monday 36.50 222.433333 74.632908
Saturday 45.45 280.250000 49.109769
Sunday 26.65 158.100000 59.710723
Thursday 50.52 422.650000 40.197717
Tuesday 26.71 174.166667 61.277645
Wednesday 26.52 196.150000 54.805434

It might be interesting to see the total sum of the amount of runs to try and see any outliers simply because Thursday, Friday, Saturday are close in distances, relatively, but not so much in speed.

We also get access to a lot of summary statistics from here that we can get from the groups.


In [47]:
dows.describe()


Out[47]:
Miles Min_per_mile Minutes
Day_of_week
Friday count 15.000000 8.000000 8.000000
mean 2.786000 7.747657 24.308333
std 1.510718 0.844590 15.376198
min 1.200000 6.630170 9.083333
25% 1.370000 7.025547 9.625000
50% 3.000000 7.803279 23.800000
75% 3.250000 8.523306 30.550000
max 5.630000 8.718176 49.083333
Monday count 14.000000 10.000000 10.000000
mean 2.607143 7.463291 22.243333
std 1.688538 1.037698 16.318699
min 0.650000 6.000000 8.600000
25% 1.402500 6.677835 9.637500
50% 1.885000 7.520672 17.516667
75% 3.087500 7.815719 26.795833
max 6.500000 9.033333 58.716667
Saturday count 14.000000 6.000000 6.000000
mean 3.246429 8.184961 46.708333
std 3.592504 0.926076 45.119311
min 0.650000 6.444444 9.666667
25% 1.500000 8.122780 22.487500
50% 1.935000 8.408079 36.500000
75% 3.685000 8.626918 42.700000
max 14.800000 9.121622 135.000000
Sunday count 11.000000 8.000000 8.000000
mean 2.422727 7.463840 19.762500
std 1.000241 0.578829 8.282836
min 1.340000 6.362530 8.716667
25% 1.450000 7.408470 11.300000
50% 3.000000 7.474317 22.591667
75% 3.050000 7.533788 24.116667
max 4.030000 8.516865 30.366667
Thursday count 8.000000 5.000000 5.000000
mean 6.315000 8.039543 84.530000
std 9.300811 1.531027 114.032318
min 0.900000 6.861314 6.266667
25% 0.975000 6.962963 6.266667
50% 1.835000 6.962963 9.400000
75% 5.987500 9.459459 140.000000
max 26.200000 9.951018 260.716667
Tuesday count 11.000000 8.000000 8.000000
mean 2.428182 7.659706 21.770833
std 1.321990 0.830158 13.777549
min 1.370000 6.630170 9.083333
25% 1.370000 7.218276 10.562500
50% 2.000000 7.464231 22.183333
75% 3.050000 7.881281 23.350000
max 5.630000 9.206631 51.833333
Wednesday count 8.000000 7.000000 7.000000
mean 3.315000 7.829348 28.021429
std 3.562291 1.142713 32.039394
min 1.370000 6.532847 8.950000
25% 1.467500 7.185401 10.325000
50% 2.365000 7.411348 17.416667
75% 2.637500 8.245219 25.066667
max 12.000000 10.000000 99.000000

In [48]:
df.groupby('Day_of_week').mean()


Out[48]:
Miles Minutes Min_per_mile
Day_of_week
Friday 2.786000 24.308333 7.747657
Monday 2.607143 22.243333 7.463291
Saturday 3.246429 46.708333 8.184961
Sunday 2.422727 19.762500 7.463840
Thursday 6.315000 84.530000 8.039543
Tuesday 2.428182 21.770833 7.659706
Wednesday 3.315000 28.021429 7.829348

In [49]:
df.groupby('Day_of_week').std()


Out[49]:
Miles Minutes Min_per_mile
Day_of_week
Friday 1.510718 15.376198 0.844590
Monday 1.688538 16.318699 1.037698
Saturday 3.592504 45.119311 0.926076
Sunday 1.000241 8.282836 0.578829
Thursday 9.300811 114.032318 1.531027
Tuesday 1.321990 13.777549 0.830158
Wednesday 3.562291 32.039394 1.142713

iterating through the groups is also very straightforward


In [51]:
for name, group in dows:
    print(name)
    print(group)


Friday
          Date  Miles    Minutes  Min_per_mile Day_of_week
6   2014-08-08   3.40        NaN           NaN      Friday
14  2014-08-15   1.78        NaN           NaN      Friday
22  2014-08-22   1.37   9.083333      6.630170      Friday
30  2014-08-29   5.00  43.450000      8.690000      Friday
37  2014-09-05   3.05  23.833333      7.814208      Friday
44  2014-09-12   3.05  23.766667      7.792350      Friday
51  2014-09-19   1.37   9.750000      7.116788      Friday
58  2014-09-26    NaN        NaN           NaN      Friday
66  2014-10-03    NaN        NaN           NaN      Friday
75  2014-10-10    NaN        NaN           NaN      Friday
82  2014-10-17    NaN        NaN           NaN      Friday
91  2014-10-24    NaN        NaN           NaN      Friday
98  2014-10-31   2.01        NaN           NaN      Friday
105 2014-11-07   5.63  49.083333      8.718176      Friday
106 2014-11-07   1.26        NaN           NaN      Friday
107 2014-11-07   1.20        NaN           NaN      Friday
116 2014-11-14    NaN        NaN           NaN      Friday
123 2014-11-21    NaN        NaN           NaN      Friday
130 2014-11-28   3.00        NaN           NaN      Friday
137 2014-12-05    NaN        NaN           NaN      Friday
144 2014-12-12   1.37   9.250000      6.751825      Friday
152 2014-12-19   3.10  26.250000      8.467742      Friday
159 2014-12-26   5.20        NaN           NaN      Friday
Monday
          Date  Miles    Minutes  Min_per_mile Day_of_week
2   2014-08-04   2.10        NaN           NaN      Monday
9   2014-08-11   1.67  13.116667      7.854291      Monday
17  2014-08-18   0.65        NaN           NaN      Monday
25  2014-08-25   1.37   8.916667      6.508516      Monday
26  2014-08-25   1.34        NaN           NaN      Monday
33  2014-09-01    NaN        NaN           NaN      Monday
40  2014-09-08   3.05  23.183333      7.601093      Monday
47  2014-09-15    NaN        NaN           NaN      Monday
54  2014-09-22    NaN        NaN           NaN      Monday
62  2014-09-29   3.05  21.916667      7.185792      Monday
71  2014-10-06    NaN        NaN           NaN      Monday
78  2014-10-13    NaN        NaN           NaN      Monday
87  2014-10-20   1.37   8.600000      6.277372      Monday
94  2014-10-27    NaN        NaN           NaN      Monday
101 2014-11-03   5.30  39.433333      7.440252      Monday
110 2014-11-10   1.50   9.000000      6.000000      Monday
111 2014-11-10   1.50  11.550000      7.700000      Monday
112 2014-11-10   4.00        NaN           NaN      Monday
119 2014-11-17    NaN        NaN           NaN      Monday
126 2014-11-24    NaN        NaN           NaN      Monday
133 2014-12-01    NaN        NaN           NaN      Monday
140 2014-12-08    NaN        NaN           NaN      Monday
148 2014-12-15   6.50  58.716667      9.033333      Monday
155 2014-12-22   3.10  28.000000      9.032258      Monday
162 2014-12-29    NaN        NaN           NaN      Monday
Saturday
          Date  Miles     Minutes  Min_per_mile Day_of_week
0   2014-08-02   1.70         NaN           NaN    Saturday
7   2014-08-09   2.50         NaN           NaN    Saturday
15  2014-08-16   0.65         NaN           NaN    Saturday
23  2014-08-23   1.34         NaN           NaN    Saturday
31  2014-08-30    NaN         NaN           NaN    Saturday
38  2014-09-06    NaN         NaN           NaN    Saturday
45  2014-09-13    NaN         NaN           NaN    Saturday
52  2014-09-20    NaN         NaN           NaN    Saturday
59  2014-09-27    NaN         NaN           NaN    Saturday
67  2014-10-04   4.63   40.000000      8.639309    Saturday
68  2014-10-04   1.87         NaN           NaN    Saturday
69  2014-10-04   4.08   33.000000      8.088235    Saturday
76  2014-10-11    NaN         NaN           NaN    Saturday
83  2014-10-18   1.50    9.666667      6.444444    Saturday
84  2014-10-18   1.50         NaN           NaN    Saturday
85  2014-10-18   1.37         NaN           NaN    Saturday
92  2014-10-25   2.00         NaN           NaN    Saturday
99  2014-11-01    NaN         NaN           NaN    Saturday
108 2014-11-08    NaN         NaN           NaN    Saturday
117 2014-11-15    NaN         NaN           NaN    Saturday
124 2014-11-22    NaN         NaN           NaN    Saturday
131 2014-11-29    NaN         NaN           NaN    Saturday
138 2014-12-06    NaN         NaN           NaN    Saturday
145 2014-12-13   5.30   43.600000      8.226415    Saturday
146 2014-12-13   2.21   18.983333      8.589744    Saturday
153 2014-12-20  14.80  135.000000      9.121622    Saturday
160 2014-12-27    NaN         NaN           NaN    Saturday
Sunday
          Date  Miles    Minutes  Min_per_mile Day_of_week
1   2014-08-03   1.40        NaN           NaN      Sunday
8   2014-08-10   3.36  28.616667      8.516865      Sunday
16  2014-08-17   1.34        NaN           NaN      Sunday
24  2014-08-24   3.00        NaN           NaN      Sunday
32  2014-08-31    NaN        NaN           NaN      Sunday
39  2014-09-07    NaN        NaN           NaN      Sunday
46  2014-09-14    NaN        NaN           NaN      Sunday
53  2014-09-21   4.03  30.366667      7.535153      Sunday
60  2014-09-28   1.50  11.300000      7.533333      Sunday
61  2014-09-28   1.50  11.300000      7.533333      Sunday
70  2014-10-05    NaN        NaN           NaN      Sunday
77  2014-10-12   3.05  22.616667      7.415301      Sunday
86  2014-10-19    NaN        NaN           NaN      Sunday
93  2014-10-26    NaN        NaN           NaN      Sunday
100 2014-11-02   1.37   8.716667      6.362530      Sunday
109 2014-11-09   3.05  22.600000      7.409836      Sunday
118 2014-11-16    NaN        NaN           NaN      Sunday
125 2014-11-23   3.05  22.583333      7.404372      Sunday
132 2014-11-30    NaN        NaN           NaN      Sunday
139 2014-12-07    NaN        NaN           NaN      Sunday
147 2014-12-14    NaN        NaN           NaN      Sunday
154 2014-12-21    NaN        NaN           NaN      Sunday
161 2014-12-28    NaN        NaN           NaN      Sunday
Thursday
          Date  Miles     Minutes  Min_per_mile Day_of_week
5   2014-08-07   2.30         NaN           NaN    Thursday
12  2014-08-14   0.90    6.266667      6.962963    Thursday
13  2014-08-14   0.90    6.266667      6.962963    Thursday
21  2014-08-21   1.37    9.400000      6.861314    Thursday
29  2014-08-28    NaN         NaN           NaN    Thursday
36  2014-09-04   3.05         NaN           NaN    Thursday
43  2014-09-11    NaN         NaN           NaN    Thursday
50  2014-09-18    NaN         NaN           NaN    Thursday
57  2014-09-25    NaN         NaN           NaN    Thursday
65  2014-10-02    NaN         NaN           NaN    Thursday
74  2014-10-09    NaN         NaN           NaN    Thursday
81  2014-10-16    NaN         NaN           NaN    Thursday
90  2014-10-23    NaN         NaN           NaN    Thursday
97  2014-10-30    NaN         NaN           NaN    Thursday
104 2014-11-06    NaN         NaN           NaN    Thursday
115 2014-11-13    NaN         NaN           NaN    Thursday
122 2014-11-20    NaN         NaN           NaN    Thursday
129 2014-11-27   1.00         NaN           NaN    Thursday
136 2014-12-04    NaN         NaN           NaN    Thursday
143 2014-12-11    NaN         NaN           NaN    Thursday
151 2014-12-18    NaN         NaN           NaN    Thursday
158 2014-12-25  14.80  140.000000      9.459459    Thursday
165 2015-01-01  26.20  260.716667      9.951018    Thursday
Tuesday
          Date  Miles    Minutes  Min_per_mile Day_of_week
3   2014-08-05    NaN        NaN           NaN     Tuesday
10  2014-08-12   1.42  10.583333      7.453052     Tuesday
18  2014-08-19   1.37        NaN           NaN     Tuesday
19  2014-08-19   1.37        NaN           NaN     Tuesday
27  2014-08-26   1.37  10.500000      7.664234     Tuesday
34  2014-09-02   2.00        NaN           NaN     Tuesday
41  2014-09-09    NaN        NaN           NaN     Tuesday
48  2014-09-16   3.05  22.800000      7.475410     Tuesday
55  2014-09-23    NaN        NaN           NaN     Tuesday
63  2014-09-30    NaN        NaN           NaN     Tuesday
72  2014-10-07    NaN        NaN           NaN     Tuesday
79  2014-10-14   1.37   9.083333      6.630170     Tuesday
88  2014-10-21    NaN        NaN           NaN     Tuesday
95  2014-10-28    NaN        NaN           NaN     Tuesday
102 2014-11-04   3.05  22.200000      7.278689     Tuesday
113 2014-11-11    NaN        NaN           NaN     Tuesday
120 2014-11-18    NaN        NaN           NaN     Tuesday
127 2014-11-25    NaN        NaN           NaN     Tuesday
134 2014-12-02    NaN        NaN           NaN     Tuesday
141 2014-12-09   2.93  25.000000      8.532423     Tuesday
149 2014-12-16    NaN        NaN           NaN     Tuesday
156 2014-12-23   5.63  51.833333      9.206631     Tuesday
163 2014-12-30   3.15  22.166667      7.037037     Tuesday
Wednesday
          Date  Miles    Minutes  Min_per_mile Day_of_week
4   2014-08-06   2.38        NaN           NaN   Wednesday
11  2014-08-13   2.35  17.416667      7.411348   Wednesday
20  2014-08-20   1.37   9.550000      6.970803   Wednesday
28  2014-08-27   2.50  25.000000     10.000000   Wednesday
35  2014-09-03    NaN        NaN           NaN   Wednesday
42  2014-09-10   1.37   8.950000      6.532847   Wednesday
49  2014-09-17   1.50  11.100000      7.400000   Wednesday
56  2014-09-24    NaN        NaN           NaN   Wednesday
64  2014-10-01    NaN        NaN           NaN   Wednesday
73  2014-10-08    NaN        NaN           NaN   Wednesday
80  2014-10-15    NaN        NaN           NaN   Wednesday
89  2014-10-22    NaN        NaN           NaN   Wednesday
96  2014-10-29    NaN        NaN           NaN   Wednesday
103 2014-11-05    NaN        NaN           NaN   Wednesday
114 2014-11-12    NaN        NaN           NaN   Wednesday
121 2014-11-19    NaN        NaN           NaN   Wednesday
128 2014-11-26    NaN        NaN           NaN   Wednesday
135 2014-12-03    NaN        NaN           NaN   Wednesday
142 2014-12-10    NaN        NaN           NaN   Wednesday
150 2014-12-17  12.00  99.000000      8.250000   Wednesday
157 2014-12-24   3.05  25.133333      8.240437   Wednesday
164 2014-12-31    NaN        NaN           NaN   Wednesday

you can get specific groups by using the get_group method.


In [52]:
dows.get_group('Friday')


Out[52]:
Date Miles Min_per_mile Minutes
6 2014-08-08 3.40 NaN NaN
14 2014-08-15 1.78 NaN NaN
22 2014-08-22 1.37 6.630170 9.083333
30 2014-08-29 5.00 8.690000 43.450000
37 2014-09-05 3.05 7.814208 23.833333
44 2014-09-12 3.05 7.792350 23.766667
51 2014-09-19 1.37 7.116788 9.750000
58 2014-09-26 NaN NaN NaN
66 2014-10-03 NaN NaN NaN
75 2014-10-10 NaN NaN NaN
82 2014-10-17 NaN NaN NaN
91 2014-10-24 NaN NaN NaN
98 2014-10-31 2.01 NaN NaN
105 2014-11-07 5.63 8.718176 49.083333
106 2014-11-07 1.26 NaN NaN
107 2014-11-07 1.20 NaN NaN
116 2014-11-14 NaN NaN NaN
123 2014-11-21 NaN NaN NaN
130 2014-11-28 3.00 NaN NaN
137 2014-12-05 NaN NaN NaN
144 2014-12-12 1.37 6.751825 9.250000
152 2014-12-19 3.10 8.467742 26.250000
159 2014-12-26 5.20 NaN NaN

We can use an aggregation command to perform an operation to get all the counts for each data frame.


In [53]:
dows.agg(lambda x: len(x))['Miles']


Out[53]:
Day_of_week
Friday       23
Monday       25
Saturday     27
Sunday       23
Thursday     23
Tuesday      23
Wednesday    22
Name: Miles, dtype: float64

another way to do this would be to add a count column to our data frame, then sum up each column


In [54]:
df['Count'] = 1
df.head(10)


Out[54]:
Date Miles Minutes Min_per_mile Day_of_week Count
0 2014-08-02 1.70 NaN NaN Saturday 1
1 2014-08-03 1.40 NaN NaN Sunday 1
2 2014-08-04 2.10 NaN NaN Monday 1
3 2014-08-05 NaN NaN NaN Tuesday 1
4 2014-08-06 2.38 NaN NaN Wednesday 1
5 2014-08-07 2.30 NaN NaN Thursday 1
6 2014-08-08 3.40 NaN NaN Friday 1
7 2014-08-09 2.50 NaN NaN Saturday 1
8 2014-08-10 3.36 28.616667 8.516865 Sunday 1
9 2014-08-11 1.67 13.116667 7.854291 Monday 1

In [55]:
df.groupby('Day_of_week').sum()


Out[55]:
Miles Minutes Min_per_mile Count
Day_of_week
Friday 41.79 194.466667 61.981259 23
Monday 36.50 222.433333 74.632908 25
Saturday 45.45 280.250000 49.109769 27
Sunday 26.65 158.100000 59.710723 23
Thursday 50.52 422.650000 40.197717 23
Tuesday 26.71 174.166667 61.277645 23
Wednesday 26.52 196.150000 54.805434 22

That about covers everything that I wanted to regarding pandas basics for grouping, filtering and that sort of thing. This is really the basics and likely a lot of what you'll be using on a regular basis. In the next section we will dive a bit deeper into more intensive analysis or more detailed function usage.


In [ ]: