Software Engineering for Data Scientists

Sophisticated Data Manipulation

DATA 515 A

1. Python's Data Science Ecosystem

With this simple Python computation experience under our belt, we can now move to doing some more interesting analysis.

Python's Data Science Ecosystem

In addition to Python's built-in modules like the math module we explored above, there are also many often-used third-party modules that are core tools for doing data science with Python. Some of the most important ones are:

numpy: Numerical Python

Numpy is short for "Numerical Python", and contains tools for efficient manipulation of arrays of data. If you have used other computational tools like IDL or MatLab, Numpy should feel very familiar.

scipy: Scientific Python

Scipy is short for "Scientific Python", and contains a wide range of functionality for accomplishing common scientific tasks, such as optimization/minimization, numerical integration, interpolation, and much more. We will not look closely at Scipy today, but we will use its functionality later in the course.

pandas: Labeled Data Manipulation in Python

Pandas is short for "Panel Data", and contains tools for doing more advanced manipulation of labeled data in Python, in particular with a columnar data structure called a Data Frame. If you've used the R statistical language (and in particular the so-called "Hadley Stack"), much of the functionality in Pandas should feel very familiar.

matplotlib: Visualization in Python

Matplotlib started out as a Matlab plotting clone in Python, and has grown from there in the 15 years since its creation. It is the most popular data visualization tool currently in the Python data world (though other recent packages are starting to encroach on its monopoly).

2. Installation

Installing Pandas & friends

Because the above packages are not included in Python itself, you need to install them separately. While it is possible to install these from source (compiling the C and/or Fortran code that does the heavy lifting under the hood) it is much easier to use a package manager like conda. All it takes is to run

$ conda install numpy scipy pandas matplotlib

and (so long as your conda setup is working) the packages will be downloaded and installed on your system.

3. Arrays and slicing in Numpy


In [1]:
import numpy as np

Lists in native Python

Let's create a list, a native Python object that we've used earlier today.


In [2]:
my_list = [2, 5, 7, 8]
my_list


Out[2]:
[2, 5, 7, 8]

This list is one-dimensional, let's make it multidimensional!


In [5]:
multi_list = [[1, 2, 3], [4, 5, 6]]

How do we access the 6 element in the second row, third column for native Python list?


In [7]:
multi_list[1][2]


Out[7]:
6

In [8]:
second_list = [1, "NaN", 3.0]

In [27]:
len(multi_list[0])


Out[27]:
3

In [17]:
for element in second_list:
    print(type(element))


<class 'int'>
<class 'str'>
<class 'float'>

Converting to numpy Arrays


In [19]:
my_array = np.array(my_list, dtype="int8")
my_array


Out[19]:
array([2, 5, 7, 8], dtype=int8)

In [12]:
type(my_array)


Out[12]:
numpy.ndarray

In [22]:
my_array[0]


Out[22]:
1

In [20]:
my_array.dtype


Out[20]:
dtype('int8')

In [24]:
multi_array = np.array([[7, 8], [9, 10]], dtype="int8")

In [25]:
multi_array.shape


Out[25]:
(2, 2)

In [28]:
multi_array = np.array([[1, 2, 3],
                        [4, 5, 6]], np.int32)

How do we access the 6 element in the second row, third column for numpy array?


In [30]:
multi_array[1,2]


Out[30]:
6

How do we retrieve a slice of the array, array([[1, 2], [4, 5]])?


In [31]:
multi_array[:,:2]


Out[31]:
array([[1, 2],
       [4, 5]], dtype=int32)

How do we retrieve the second column of the array?


In [34]:
multi_array[:,1]


Out[34]:
array([2, 5], dtype=int32)

4. Introduction to Pandas DataFrames

What are the elements of a table?


In [35]:
# Pandas DataFrames as table elements
import pandas as pd

What operations do we perform on tables?


In [37]:
df = pd.DataFrame({'A': [1,2,3], 'B': [2, 4, 6], 'ccc': [1.0, 33, 4]})
df


Out[37]:
A B ccc
0 1 2 1.0
1 2 4 33.0
2 3 6 4.0

In [42]:
print(df["A"].dtype)
print(df["B"].dtype)
print(df["ccc"].dtype)


int64
int64
float64

In [44]:
df.dtypes


Out[44]:
A        int64
B        int64
ccc    float64
dtype: object

In [45]:
sub_df = df[['A', 'ccc']]
sub_df


Out[45]:
A ccc
0 1 1.0
1 2 33.0
2 3 4.0

In [47]:
type(df['A'] + 2*df['B'])


Out[47]:
pandas.core.series.Series

Operations on a Pandas DataFrame

5. Manipulating Data with DataFrames

Downloading the data

Shell commands can be run from the notebook by preceding them with an exclamation point:


In [51]:
!ls


numpy_pandas.ipynb  pronto.csv

uncomment this to download the data:


In [50]:
!curl -o pronto.csv https://data.seattle.gov/api/views/tw7j-dfaw/rows.csv?accessType=DOWNLOAD


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 42.7M    0 42.7M    0     0  3669k      0 --:--:--  0:00:11 --:--:-- 3169k

Loading Data into a DataFrame

Because we'll use it so much, we often import under a shortened name using the import ... as ... pattern:


In [2]:
import pandas as pd
df = pd.read_csv('pronto.csv')

In [53]:
df


Out[53]:
trip_id starttime stoptime bikeid tripduration from_station_name to_station_name from_station_id to_station_id usertype gender birthyear
0 431 10/13/2014 10:31:00 AM 10/13/2014 10:48:00 AM SEA00298 985.935 2nd Ave & Spring St Occidental Park / Occidental Ave S & S Washing... CBD-06 PS-04 Member Male 1960.0
1 432 10/13/2014 10:32:00 AM 10/13/2014 10:48:00 AM SEA00195 926.375 2nd Ave & Spring St Occidental Park / Occidental Ave S & S Washing... CBD-06 PS-04 Member Male 1970.0
2 433 10/13/2014 10:33:00 AM 10/13/2014 10:48:00 AM SEA00486 883.831 2nd Ave & Spring St Occidental Park / Occidental Ave S & S Washing... CBD-06 PS-04 Member Female 1988.0
3 434 10/13/2014 10:34:00 AM 10/13/2014 10:48:00 AM SEA00333 865.937 2nd Ave & Spring St Occidental Park / Occidental Ave S & S Washing... CBD-06 PS-04 Member Female 1977.0
4 435 10/13/2014 10:34:00 AM 10/13/2014 10:49:00 AM SEA00202 923.923 2nd Ave & Spring St Occidental Park / Occidental Ave S & S Washing... CBD-06 PS-04 Member Male 1971.0
... ... ... ... ... ... ... ... ... ... ... ... ...
275086 295463 03/31/2017 11:16:00 PM 04/01/2017 02:11:00 AM SEA00347 10487.475 NE 42nd St & University Way NE E Harrison St & Broadway Ave E UD-02 CH-02 Short-Term Pass Holder NaN NaN
275087 295464 03/31/2017 11:16:00 PM 04/01/2017 02:11:00 AM SEA00031 10487.366 NE 42nd St & University Way NE E Harrison St & Broadway Ave E UD-02 CH-02 Member Female 1982.0
275088 295465 03/31/2017 11:44:00 PM 04/01/2017 12:04:00 AM SEA00240 1205.656 Bellevue Ave & E Pine St Lake Union Park / Westlake Ave & Aloha St CH-12 SLU-23 Member Male 1990.0
275089 295466 03/31/2017 11:46:00 PM 04/01/2017 12:04:00 AM SEA00121 1115.658 Bellevue Ave & E Pine St Lake Union Park / Westlake Ave & Aloha St CH-12 SLU-23 Member Female 1987.0
275090 295467 03/31/2017 11:52:00 PM 04/01/2017 02:33:00 AM SEA00151 9667.679 Cal Anderson Park / 11th Ave & Pine St Broadway and E Denny Way CH-08 CH-16 Member Male 1988.0

275091 rows × 12 columns


In [54]:
type(df)


Out[54]:
pandas.core.frame.DataFrame

In [55]:
len(df)


Out[55]:
275091

Now we can use the read_csv command to read the comma-separated-value data:

Note: strings in Python can be defined either with double quotes or single quotes

Viewing Pandas Dataframes

The head() and tail() methods show us the first and last rows of the data


In [56]:
df.head()


Out[56]:
trip_id starttime stoptime bikeid tripduration from_station_name to_station_name from_station_id to_station_id usertype gender birthyear
0 431 10/13/2014 10:31:00 AM 10/13/2014 10:48:00 AM SEA00298 985.935 2nd Ave & Spring St Occidental Park / Occidental Ave S & S Washing... CBD-06 PS-04 Member Male 1960.0
1 432 10/13/2014 10:32:00 AM 10/13/2014 10:48:00 AM SEA00195 926.375 2nd Ave & Spring St Occidental Park / Occidental Ave S & S Washing... CBD-06 PS-04 Member Male 1970.0
2 433 10/13/2014 10:33:00 AM 10/13/2014 10:48:00 AM SEA00486 883.831 2nd Ave & Spring St Occidental Park / Occidental Ave S & S Washing... CBD-06 PS-04 Member Female 1988.0
3 434 10/13/2014 10:34:00 AM 10/13/2014 10:48:00 AM SEA00333 865.937 2nd Ave & Spring St Occidental Park / Occidental Ave S & S Washing... CBD-06 PS-04 Member Female 1977.0
4 435 10/13/2014 10:34:00 AM 10/13/2014 10:49:00 AM SEA00202 923.923 2nd Ave & Spring St Occidental Park / Occidental Ave S & S Washing... CBD-06 PS-04 Member Male 1971.0

In [57]:
df.columns


Out[57]:
Index(['trip_id', 'starttime', 'stoptime', 'bikeid', 'tripduration',
       'from_station_name', 'to_station_name', 'from_station_id',
       'to_station_id', 'usertype', 'gender', 'birthyear'],
      dtype='object')

In [58]:
df.index


Out[58]:
RangeIndex(start=0, stop=275091, step=1)

In [59]:
smaller_df = df.loc[[1,4,6,7,9,34],:]

In [60]:
smaller_df.index


Out[60]:
Int64Index([1, 4, 6, 7, 9, 34], dtype='int64')

The shape attribute shows us the number of elements:


In [61]:
df.shape


Out[61]:
(275091, 12)

The columns attribute gives us the column names

The index attribute gives us the index names

The dtypes attribute gives the data types of each column:


In [62]:
df.dtypes


Out[62]:
trip_id                int64
starttime             object
stoptime              object
bikeid                object
tripduration         float64
from_station_name     object
to_station_name       object
from_station_id       object
to_station_id         object
usertype              object
gender                object
birthyear            float64
dtype: object

Sophisticated Data Manipulation

Here we'll cover some key features of manipulating data with pandas

Access columns by name using square-bracket indexing:


In [63]:
df_small = df['stoptime']

In [64]:
df_small


Out[64]:
0         10/13/2014 10:48:00 AM
1         10/13/2014 10:48:00 AM
2         10/13/2014 10:48:00 AM
3         10/13/2014 10:48:00 AM
4         10/13/2014 10:49:00 AM
                   ...          
275086    04/01/2017 02:11:00 AM
275087    04/01/2017 02:11:00 AM
275088    04/01/2017 12:04:00 AM
275089    04/01/2017 12:04:00 AM
275090    04/01/2017 02:33:00 AM
Name: stoptime, Length: 275091, dtype: object

In [65]:
type(df_small)


Out[65]:
pandas.core.series.Series

In [67]:
df['tripduration']


Out[67]:
0           985.935
1           926.375
2           883.831
3           865.937
4           923.923
            ...    
275086    10487.475
275087    10487.366
275088     1205.656
275089     1115.658
275090     9667.679
Name: tripduration, Length: 275091, dtype: float64

Mathematical operations on columns happen element-wise:


In [70]:
trip_duration_hours = df['tripduration']/3600
trip_duration_hours[:5]


Out[70]:
0    0.273871
1    0.257326
2    0.245509
3    0.240538
4    0.256645
Name: tripduration, dtype: float64

In [69]:
trip_duration_hours.head()


Out[69]:
0    0.273871
1    0.257326
2    0.245509
3    0.240538
4    0.256645
Name: tripduration, dtype: float64

In [71]:
df['trip_duration_hours'] = df['tripduration']/3600

In [74]:
df.head()


Out[74]:
trip_id starttime stoptime bikeid tripduration from_station_name to_station_name from_station_id to_station_id usertype gender birthyear
0 431 10/13/2014 10:31:00 AM 10/13/2014 10:48:00 AM SEA00298 985.935 2nd Ave & Spring St Occidental Park / Occidental Ave S & S Washing... CBD-06 PS-04 Member Male 1960.0
1 432 10/13/2014 10:32:00 AM 10/13/2014 10:48:00 AM SEA00195 926.375 2nd Ave & Spring St Occidental Park / Occidental Ave S & S Washing... CBD-06 PS-04 Member Male 1970.0
2 433 10/13/2014 10:33:00 AM 10/13/2014 10:48:00 AM SEA00486 883.831 2nd Ave & Spring St Occidental Park / Occidental Ave S & S Washing... CBD-06 PS-04 Member Female 1988.0
3 434 10/13/2014 10:34:00 AM 10/13/2014 10:48:00 AM SEA00333 865.937 2nd Ave & Spring St Occidental Park / Occidental Ave S & S Washing... CBD-06 PS-04 Member Female 1977.0
4 435 10/13/2014 10:34:00 AM 10/13/2014 10:49:00 AM SEA00202 923.923 2nd Ave & Spring St Occidental Park / Occidental Ave S & S Washing... CBD-06 PS-04 Member Male 1971.0

In [75]:
del df['trip_duration_hours']


---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~/miniconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2645             try:
-> 2646                 return self._engine.get_loc(key)
   2647             except KeyError:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'trip_duration_hours'

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-75-da88136fa6f4> in <module>
----> 1 del df['trip_duration_hours']

~/miniconda3/lib/python3.7/site-packages/pandas/core/generic.py in __delitem__(self, key)
   3757             # there was no match, this call should raise the appropriate
   3758             # exception:
-> 3759             self._data.delete(key)
   3760 
   3761         # delete from the caches

~/miniconda3/lib/python3.7/site-packages/pandas/core/internals/managers.py in delete(self, item)
   1000         Delete selected item (items if non-unique) in-place.
   1001         """
-> 1002         indexer = self.items.get_loc(item)
   1003 
   1004         is_deleted = np.zeros(self.shape[0], dtype=np.bool_)

~/miniconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2646                 return self._engine.get_loc(key)
   2647             except KeyError:
-> 2648                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2649         indexer = self.get_indexer([key], method=method, tolerance=tolerance)
   2650         if indexer.ndim > 1 or indexer.size > 1:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'trip_duration_hours'

In [76]:
df.head()


Out[76]:
trip_id starttime stoptime bikeid tripduration from_station_name to_station_name from_station_id to_station_id usertype gender birthyear
0 431 10/13/2014 10:31:00 AM 10/13/2014 10:48:00 AM SEA00298 985.935 2nd Ave & Spring St Occidental Park / Occidental Ave S & S Washing... CBD-06 PS-04 Member Male 1960.0
1 432 10/13/2014 10:32:00 AM 10/13/2014 10:48:00 AM SEA00195 926.375 2nd Ave & Spring St Occidental Park / Occidental Ave S & S Washing... CBD-06 PS-04 Member Male 1970.0
2 433 10/13/2014 10:33:00 AM 10/13/2014 10:48:00 AM SEA00486 883.831 2nd Ave & Spring St Occidental Park / Occidental Ave S & S Washing... CBD-06 PS-04 Member Female 1988.0
3 434 10/13/2014 10:34:00 AM 10/13/2014 10:48:00 AM SEA00333 865.937 2nd Ave & Spring St Occidental Park / Occidental Ave S & S Washing... CBD-06 PS-04 Member Female 1977.0
4 435 10/13/2014 10:34:00 AM 10/13/2014 10:49:00 AM SEA00202 923.923 2nd Ave & Spring St Occidental Park / Occidental Ave S & S Washing... CBD-06 PS-04 Member Male 1971.0

In [77]:
df.loc[[0,1],:]


Out[77]:
trip_id starttime stoptime bikeid tripduration from_station_name to_station_name from_station_id to_station_id usertype gender birthyear
0 431 10/13/2014 10:31:00 AM 10/13/2014 10:48:00 AM SEA00298 985.935 2nd Ave & Spring St Occidental Park / Occidental Ave S & S Washing... CBD-06 PS-04 Member Male 1960.0
1 432 10/13/2014 10:32:00 AM 10/13/2014 10:48:00 AM SEA00195 926.375 2nd Ave & Spring St Occidental Park / Occidental Ave S & S Washing... CBD-06 PS-04 Member Male 1970.0

In [80]:
df_long_trips = df[df['tripduration'] >10000]

In [81]:
df_long_trips


Out[81]:
trip_id starttime stoptime bikeid tripduration from_station_name to_station_name from_station_id to_station_id usertype gender birthyear
68 506 10/13/2014 12:08:00 PM 10/13/2014 07:07:00 PM SEA00155 25119.114 2nd Ave & Spring St 2nd Ave & Vine St CBD-06 BT-03 Member Male 1988.0
193 649 10/13/2014 02:54:00 PM 10/13/2014 05:59:00 PM SEA00193 11144.159 NE 42nd St & University Way NE NE 42nd St & University Way NE UD-02 UD-02 Short-Term Pass Holder NaN NaN
408 908 10/13/2014 11:12:00 PM 10/14/2014 06:52:00 AM SEA00230 27583.945 7th Ave & Union St 2nd Ave & Pine St CBD-03 CBD-13 Short-Term Pass Holder NaN NaN
475 983 10/14/2014 09:12:00 AM 10/14/2014 01:20:00 PM SEA00181 14885.176 Pier 69 / Alaskan Way & Clay St Pier 69 / Alaskan Way & Clay St WF-01 WF-01 Short-Term Pass Holder NaN NaN
512 1025 10/14/2014 10:49:00 AM 10/14/2014 05:19:00 PM SEA00244 23346.911 E Pine St & 16th Ave PATH / 9th Ave & Westlake Ave CH-07 SLU-07 Member Male 1986.0
... ... ... ... ... ... ... ... ... ... ... ... ...
274051 294382 03/25/2017 02:28:00 PM 03/25/2017 07:05:00 PM SEA00494 16634.168 Cal Anderson Park / 11th Ave & Pine St Pier 69 / Alaskan Way & Clay St CH-08 WF-01 Short-Term Pass Holder NaN NaN
274052 294383 03/25/2017 02:28:00 PM 03/25/2017 07:05:00 PM SEA00182 16639.182 Cal Anderson Park / 11th Ave & Pine St Pier 69 / Alaskan Way & Clay St CH-08 WF-01 Short-Term Pass Holder NaN NaN
274266 294604 03/27/2017 11:56:00 AM 03/27/2017 03:29:00 PM SEA00291 12758.624 6th Ave S & S King St NE 42nd St & University Way NE ID-04 UD-02 Short-Term Pass Holder NaN NaN
275086 295463 03/31/2017 11:16:00 PM 04/01/2017 02:11:00 AM SEA00347 10487.475 NE 42nd St & University Way NE E Harrison St & Broadway Ave E UD-02 CH-02 Short-Term Pass Holder NaN NaN
275087 295464 03/31/2017 11:16:00 PM 04/01/2017 02:11:00 AM SEA00031 10487.366 NE 42nd St & University Way NE E Harrison St & Broadway Ave E UD-02 CH-02 Member Female 1982.0

3243 rows × 12 columns


In [84]:
sel = df['tripduration'] > 10000
df_long_trips = df[sel]

In [83]:
sel


Out[83]:
0         False
1         False
2         False
3         False
4         False
          ...  
275086     True
275087     True
275088    False
275089    False
275090    False
Name: tripduration, Length: 275091, dtype: bool

In [85]:
df_long_trips


Out[85]:
trip_id starttime stoptime bikeid tripduration from_station_name to_station_name from_station_id to_station_id usertype gender birthyear
68 506 10/13/2014 12:08:00 PM 10/13/2014 07:07:00 PM SEA00155 25119.114 2nd Ave & Spring St 2nd Ave & Vine St CBD-06 BT-03 Member Male 1988.0
193 649 10/13/2014 02:54:00 PM 10/13/2014 05:59:00 PM SEA00193 11144.159 NE 42nd St & University Way NE NE 42nd St & University Way NE UD-02 UD-02 Short-Term Pass Holder NaN NaN
408 908 10/13/2014 11:12:00 PM 10/14/2014 06:52:00 AM SEA00230 27583.945 7th Ave & Union St 2nd Ave & Pine St CBD-03 CBD-13 Short-Term Pass Holder NaN NaN
475 983 10/14/2014 09:12:00 AM 10/14/2014 01:20:00 PM SEA00181 14885.176 Pier 69 / Alaskan Way & Clay St Pier 69 / Alaskan Way & Clay St WF-01 WF-01 Short-Term Pass Holder NaN NaN
512 1025 10/14/2014 10:49:00 AM 10/14/2014 05:19:00 PM SEA00244 23346.911 E Pine St & 16th Ave PATH / 9th Ave & Westlake Ave CH-07 SLU-07 Member Male 1986.0
... ... ... ... ... ... ... ... ... ... ... ... ...
274051 294382 03/25/2017 02:28:00 PM 03/25/2017 07:05:00 PM SEA00494 16634.168 Cal Anderson Park / 11th Ave & Pine St Pier 69 / Alaskan Way & Clay St CH-08 WF-01 Short-Term Pass Holder NaN NaN
274052 294383 03/25/2017 02:28:00 PM 03/25/2017 07:05:00 PM SEA00182 16639.182 Cal Anderson Park / 11th Ave & Pine St Pier 69 / Alaskan Way & Clay St CH-08 WF-01 Short-Term Pass Holder NaN NaN
274266 294604 03/27/2017 11:56:00 AM 03/27/2017 03:29:00 PM SEA00291 12758.624 6th Ave S & S King St NE 42nd St & University Way NE ID-04 UD-02 Short-Term Pass Holder NaN NaN
275086 295463 03/31/2017 11:16:00 PM 04/01/2017 02:11:00 AM SEA00347 10487.475 NE 42nd St & University Way NE E Harrison St & Broadway Ave E UD-02 CH-02 Short-Term Pass Holder NaN NaN
275087 295464 03/31/2017 11:16:00 PM 04/01/2017 02:11:00 AM SEA00031 10487.366 NE 42nd St & University Way NE E Harrison St & Broadway Ave E UD-02 CH-02 Member Female 1982.0

3243 rows × 12 columns


In [86]:
df[sel].shape


Out[86]:
(3243, 12)

In [87]:
# Make a copy of a slice
df_subset = df[['starttime', 'stoptime']].copy()
df_subset['trip_hours'] = df['tripduration']/3600

Columns can be created (or overwritten) with the assignment operator. Let's create a tripminutes column with the number of minutes for each trip

More complicated mathematical operations can be done with tools in the numpy package:

Working with Times

One trick to know when working with columns of times is that Pandas DateTimeIndex provides a nice interface for working with columns of times.

For a dataset of this size, using pd.to_datetime and specifying the date format can make things much faster (from the strftime reference, we see that the pronto data has format "%m/%d/%Y %I:%M:%S %p"

(Note: you can also use infer_datetime_format=True in most cases to automatically infer the correct format, though due to a bug it doesn't work when AM/PM are present)

With it, we can extract, the hour of the day, the day of the week, the month, and a wide range of other views of the time:

Simple Grouping of Data

The real power of Pandas comes in its tools for grouping and aggregating data. Here we'll look at value counts and the basics of group-by operations.

Value Counts

Pandas includes an array of useful functionality for manipulating and analyzing tabular data. We'll take a look at two of these here.

The pandas.value_counts returns statistics on the unique values within each column.

We can use it, for example, to break down rides by gender:


In [90]:
pd.value_counts(df["gender"])


Out[90]:
Male      135171
Female     34997
Other       3647
Name: gender, dtype: int64

Or to break down rides by age:


In [91]:
pd.value_counts(2019 - df["birthyear"])


Out[91]:
32.0    15381
34.0    10491
38.0     9028
37.0     8950
35.0     8699
        ...  
83.0        7
20.0        6
77.0        2
75.0        1
88.0        1
Name: birthyear, Length: 61, dtype: int64

By default, the values rather than the index are sorted. Use sort=False to turn this behavior off:


In [92]:
pd.value_counts(df["birthyear"], sort=False)


Out[92]:
1931.0       1
1936.0       7
1939.0      47
1942.0       2
1943.0      11
          ... 
1995.0    1437
1996.0     547
1997.0     285
1998.0     124
1999.0       6
Name: birthyear, Length: 61, dtype: int64

We can explore other things as well: day of week, hour of day, etc.


In [ ]:

Group-by Operation

One of the killer features of the Pandas dataframe is the ability to do group-by operations. You can visualize the group-by like this (image borrowed from the Python Data Science Handbook)


In [93]:
df.head()


Out[93]:
trip_id starttime stoptime bikeid tripduration from_station_name to_station_name from_station_id to_station_id usertype gender birthyear
0 431 10/13/2014 10:31:00 AM 10/13/2014 10:48:00 AM SEA00298 985.935 2nd Ave & Spring St Occidental Park / Occidental Ave S & S Washing... CBD-06 PS-04 Member Male 1960.0
1 432 10/13/2014 10:32:00 AM 10/13/2014 10:48:00 AM SEA00195 926.375 2nd Ave & Spring St Occidental Park / Occidental Ave S & S Washing... CBD-06 PS-04 Member Male 1970.0
2 433 10/13/2014 10:33:00 AM 10/13/2014 10:48:00 AM SEA00486 883.831 2nd Ave & Spring St Occidental Park / Occidental Ave S & S Washing... CBD-06 PS-04 Member Female 1988.0
3 434 10/13/2014 10:34:00 AM 10/13/2014 10:48:00 AM SEA00333 865.937 2nd Ave & Spring St Occidental Park / Occidental Ave S & S Washing... CBD-06 PS-04 Member Female 1977.0
4 435 10/13/2014 10:34:00 AM 10/13/2014 10:49:00 AM SEA00202 923.923 2nd Ave & Spring St Occidental Park / Occidental Ave S & S Washing... CBD-06 PS-04 Member Male 1971.0

In [94]:
df_count = df.groupby(['from_station_id']).count()
df_count.head()


Out[94]:
trip_id starttime stoptime bikeid tripduration from_station_name to_station_name to_station_id usertype gender birthyear
from_station_id
BT-01 10463 10463 10463 10463 10463 10463 10463 10463 10463 4162 4162
BT-03 7334 7334 7334 7334 7334 7334 7334 7334 7334 4862 4862
BT-04 4666 4666 4666 4666 4666 4666 4666 4666 4666 3424 3424
BT-05 5699 5699 5699 5699 5699 5699 5699 5699 5699 2975 2975
BT-06 150 150 150 150 150 150 150 150 150 130 130

In [95]:
df_mean = df.groupby(['from_station_id']).mean()
df_mean.head()


Out[95]:
trip_id tripduration birthyear
from_station_id
BT-01 147831.009844 1375.031203 1980.131427
BT-03 139404.294655 1019.200684 1976.505142
BT-04 157992.809687 891.095897 1979.877044
BT-05 139283.572381 1199.949481 1975.937479
BT-06 291807.953333 659.770547 1975.830769

In [97]:
dfgroup = df.groupby(['from_station_id'])
dfgroup.groups


Out[97]:
{'BT-01': Int64Index([   217,    227,    228,    282,    283,    310,    326,    327,
                329,    331,
             ...
             274971, 274973, 274974, 274975, 274976, 274979, 275032, 275033,
             275075, 275076],
            dtype='int64', length=10463),
 'BT-03': Int64Index([    87,     88,    230,    261,    366,    407,    414,    439,
                453,    754,
             ...
             268122, 268181, 268307, 268318, 268319, 268391, 268392, 268467,
             268527, 268528],
            dtype='int64', length=7334),
 'BT-04': Int64Index([    66,     67,     94,    104,    108,    166,    233,    259,
                322,    333,
             ...
             274350, 274361, 274424, 274704, 274789, 274970, 275009, 275064,
             275065, 275083],
            dtype='int64', length=4666),
 'BT-05': Int64Index([   110,    413,    426,    513,    585,    618,    744,    753,
                795,   1003,
             ...
             274547, 274605, 274610, 274621, 274817, 274847, 274910, 274911,
             275029, 275034],
            dtype='int64', length=5699),
 'BT-06': Int64Index([268581, 268642, 268667, 268718, 268735, 268781, 268897, 268903,
             268914, 268961,
             ...
             274777, 274778, 274865, 274931, 274937, 274949, 274951, 274956,
             274962, 274965],
            dtype='int64', length=150),
 'CBD-03': Int64Index([   118,    119,    164,    229,    275,    285,    328,    339,
                356,    357,
             ...
             274574, 274594, 274599, 274622, 274672, 274743, 274774, 274810,
             274915, 275053],
            dtype='int64', length=4822),
 'CBD-04': Int64Index([105392, 105458, 105467, 105472, 105614, 105615, 105835, 105836,
             105855, 105858,
             ...
             274730, 274922, 274924, 274925, 274926, 274927, 274952, 274958,
             274983, 275080],
            dtype='int64', length=3440),
 'CBD-05': Int64Index([    54,     79,     95,    148,    149,    150,    151,    165,
                211,    219,
             ...
             274081, 274085, 274086, 274138, 274139, 274623, 274624, 274692,
             274814, 274906],
            dtype='int64', length=5068),
 'CBD-06': Int64Index([     0,      1,      2,      3,      4,      5,     63,     68,
                 70,     71,
             ...
             274534, 274544, 274570, 274671, 274744, 274765, 274797, 274836,
             274838, 274980],
            dtype='int64', length=4911),
 'CBD-07': Int64Index([    42,     69,     78,    141,    196,    269,    478,    510,
                522,    542,
             ...
             274317, 274413, 274439, 274467, 274579, 274673, 274688, 274752,
             274879, 274907],
            dtype='int64', length=3263),
 'CBD-13': Int64Index([    99,    139,    198,    249,    276,    334,    381,    388,
                424,    442,
             ...
             274444, 274535, 274555, 274613, 274726, 274780, 274903, 274916,
             274966, 274968],
            dtype='int64', length=9067),
 'CD-01': Int64Index([ 68531,  68532,  69169,  69170,  69954,  70367,  70529,  70546,
              70621,  70622,
             ...
             224878, 225305, 225422, 225427, 225492, 225493, 225587, 225666,
             226277, 226597],
            dtype='int64', length=958),
 'CH-01': Int64Index([   256,    355,    382,    416,    437,    444,    502,    503,
                600,    679,
             ...
             274585, 274632, 274641, 274656, 274670, 274827, 274829, 274871,
             274933, 274995],
            dtype='int64', length=6409),
 'CH-02': Int64Index([    55,     56,     58,     83,    113,    126,    127,    154,
                162,    163,
             ...
             274450, 274488, 274593, 274660, 274668, 274868, 274917, 275067,
             275081, 275082],
            dtype='int64', length=8546),
 'CH-03': Int64Index([   290,    417,    428,    435,    436,    452,    494,    516,
                640,    665,
             ...
             274625, 274650, 274658, 274693, 274822, 274844, 274872, 274893,
             274894, 274950],
            dtype='int64', length=6218),
 'CH-05': Int64Index([   134,    195,    205,    248,    250,    251,    315,    324,
                337,    390,
             ...
             274457, 274597, 274714, 274850, 274855, 274873, 274889, 274932,
             274936, 274993],
            dtype='int64', length=6948),
 'CH-06': Int64Index([   212,    253,    277,    278,    279,    403,    449,    504,
                684,    881,
             ...
             274679, 274745, 274746, 274750, 274751, 274781, 274788, 274834,
             274839, 274875],
            dtype='int64', length=3765),
 'CH-07': Int64Index([   146,    210,    299,    341,    374,    377,    401,    415,
                431,    466,
             ...
             274832, 274846, 274890, 274891, 274892, 274955, 275002, 275049,
             275060, 275077],
            dtype='int64', length=11568),
 'CH-08': Int64Index([   120,    136,    144,    158,    159,    242,    262,    294,
                311,    321,
             ...
             274824, 274848, 274853, 274904, 274905, 274935, 274943, 274982,
             275006, 275090],
            dtype='int64', length=8573),
 'CH-09': Int64Index([   101,    168,    222,    349,    380,    467,    567,    578,
                628,    647,
             ...
             274357, 274466, 274468, 274475, 274500, 274595, 274611, 274732,
             274895, 275066],
            dtype='int64', length=5246),
 'CH-12': Int64Index([   319,    384,    411,    441,    451,    462,    540,    554,
                558,    605,
             ...
             274577, 274603, 274609, 274615, 274631, 274680, 275050, 275056,
             275088, 275089],
            dtype='int64', length=5857),
 'CH-15': Int64Index([   109,    160,    244,    340,    402,    430,    459,    468,
                723,    724,
             ...
             274674, 274756, 274791, 274812, 274826, 274840, 274852, 274857,
             274921, 274969],
            dtype='int64', length=6550),
 'CH-16': Int64Index([175075, 175093, 175108, 175126, 175127, 175131, 175136, 175144,
             175145, 175210,
             ...
             274629, 274805, 274816, 274825, 274854, 274920, 275035, 275042,
             275051, 275072],
            dtype='int64', length=2089),
 'DPD-01': Int64Index([    59,     91,     93,    289,    568,    644,    667,    740,
                839,    974,
             ...
             274530, 274604, 274665, 274681, 274798, 274823, 274939, 274957,
             274961, 275026],
            dtype='int64', length=4822),
 'DPD-03': Int64Index([   131,    197,    345,    347,    727,    844,   1075,   1144,
               1347,   1430,
             ...
             273047, 273048, 274048, 274092, 274776, 274800, 275001, 275003,
             275004, 275005],
            dtype='int64', length=1423),
 'EL-01': Int64Index([   199,    400,    700,    702,    715,    716,    769,   1175,
               1350,   1351,
             ...
             274676, 274718, 274731, 274770, 274837, 274928, 274929, 274948,
             274959, 275038],
            dtype='int64', length=3604),
 'EL-03': Int64Index([   344,    358,    360,    425,    492,    583,    927,   1027,
               1071,   1110,
             ...
             274757, 274861, 274862, 274882, 274984, 274988, 274990, 274994,
             275031, 275052],
            dtype='int64', length=5788),
 'EL-05': Int64Index([   200,    201,    447,    456,    488,    615,    646,    694,
                763,    858,
             ...
             274019, 274157, 274162, 274253, 274368, 274477, 274584, 274725,
             274877, 274878],
            dtype='int64', length=3400),
 'FH-01': Int64Index([   100,    231,    325,    330,    373,    386,    455,    485,
                505,    521,
             ...
             173748, 173988, 174253, 174384, 174549, 174647, 174657, 174690,
             174986, 175005],
            dtype='int64', length=2349),
 'FH-04': Int64Index([   364,    371,    392,    396,    460,    482,    529,    950,
                970,    984,
             ...
             274428, 274519, 274600, 274640, 274646, 274648, 274849, 274851,
             274964, 275000],
            dtype='int64', length=4208),
 'ID-04': Int64Index([    89,    123,    155,    156,    169,    170,    214,    223,
                237,    309,
             ...
             274353, 274445, 274548, 274792, 274930, 275014, 275057, 275058,
             275084, 275085],
            dtype='int64', length=2474),
 'PS-04': Int64Index([     6,      7,      8,      9,     10,     11,     12,     13,
                 14,     15,
             ...
             274446, 274471, 274572, 274734, 274766, 274874, 274901, 274902,
             274944, 275068],
            dtype='int64', length=5409),
 'PS-05': Int64Index([    45,     49,     53,     57,     90,    130,    202,    218,
                246,    247,
             ...
             274633, 274634, 274635, 274666, 274820, 274828, 274978, 274987,
             275022, 275073],
            dtype='int64', length=3969),
 'SLU-01': Int64Index([   111,    142,    143,    147,    152,    153,    220,    308,
                312,    370,
             ...
             274686, 274722, 274768, 274769, 274883, 274884, 274997, 275041,
             275059, 275071],
            dtype='int64', length=7084),
 'SLU-02': Int64Index([   137,    181,    296,    397,    427,    458,    464,    500,
                530,    531,
             ...
             274678, 274684, 274687, 274747, 274753, 274833, 274835, 274885,
             274899, 274991],
            dtype='int64', length=7018),
 'SLU-04': Int64Index([   213,    245,    273,    288,    291,    295,    316,    432,
                589,    639,
             ...
             274185, 274309, 274311, 274415, 274493, 274711, 274887, 274941,
             275027, 275048],
            dtype='int64', length=5226),
 'SLU-07': Int64Index([   368,    454,    551,    552,    575,    577,    633,    648,
                735,    741,
             ...
             274607, 274608, 274647, 274715, 274716, 274771, 274845, 274898,
             274946, 275040],
            dtype='int64', length=6339),
 'SLU-15': Int64Index([   102,    178,    232,    243,    284,    287,    292,    313,
                318,    338,
             ...
             274808, 274863, 274897, 274923, 274953, 274967, 274985, 275036,
             275037, 275061],
            dtype='int64', length=9741),
 'SLU-16': Int64Index([   391,    406,    420,    448,    486,    487,    532,    536,
                537,    538,
             ...
             274099, 274103, 274296, 274442, 274602, 274720, 274763, 274764,
             274867, 274896],
            dtype='int64', length=5045),
 'SLU-18': Int64Index([   103,    320,    359,    446,    544,    556,    565,    566,
                591,    614,
             ...
             209477, 209600, 209625, 209663, 209671, 209907, 209917, 209918,
             209929, 210002],
            dtype='int64', length=3461),
 'SLU-19': Int64Index([   129,    280,    304,    350,    351,    353,    354,    457,
                493,    564,
             ...
             274407, 274512, 274590, 274651, 274701, 274702, 274703, 274841,
             274963, 275062],
            dtype='int64', length=7285),
 'SLU-20': Int64Index([ 79307,  79441,  79473,  79584,  79657,  79658,  79659,  79864,
              79868,  79994,
             ...
             273606, 274539, 274540, 274561, 274606, 274758, 274806, 274807,
             274918, 275039],
            dtype='int64', length=2452),
 'SLU-21': Int64Index([133364, 133365, 133388, 133620, 133621, 133744, 133745, 134178,
             134179, 135016,
             ...
             274136, 274193, 274497, 274612, 274698, 274801, 274960, 275030,
             275043, 275074],
            dtype='int64', length=1114),
 'SLU-22': Int64Index([210885, 210897, 210898, 210899, 210913, 210918, 211084, 211085,
             211264, 211318,
             ...
             274525, 274652, 274669, 274683, 274699, 274772, 274803, 274804,
             274842, 274866],
            dtype='int64', length=1748),
 'SLU-23': Int64Index([   192,    206,    224,    225,    226,    305,    306,    549,
                550,    635,
             ...
             275010, 275011, 275012, 275016, 275017, 275018, 275019, 275020,
             275021, 275023],
            dtype='int64', length=5739),
 'UD-01': Int64Index([    60,     61,     76,    177,    182,    208,    608,    942,
                943,   1054,
             ...
             274123, 274124, 274158, 274460, 274575, 274700, 274705, 274869,
             274881, 275025],
            dtype='int64', length=3889),
 'UD-02': Int64Index([    92,     97,    183,    193,    204,    240,    241,    543,
                654,    655,
             ...
             274286, 274287, 274369, 274418, 274502, 274815, 274919, 275024,
             275086, 275087],
            dtype='int64', length=1417),
 'UD-04': Int64Index([    96,    161,    184,    188,    260,    372,    499,    611,
                678,    891,
             ...
             274104, 274105, 274160, 274259, 274264, 274283, 274400, 274528,
             274659, 274870],
            dtype='int64', length=3534),
 'UD-07': Int64Index([   115,    116,    281,    469,    669,    696,    738,    904,
                963,   1040,
             ...
             273080, 273086, 273331, 273359, 273545, 273783, 274165, 274175,
             274281, 274759],
            dtype='int64', length=2429),
 'UW-01': Int64Index([   730,   1691,   1759,   2124,   2383,   2746,   3087,   3356,
               3404,   3510,
             ...
             142135, 142136, 142249, 142254, 142259, 143101, 144918, 145571,
             147714, 147773],
            dtype='int64', length=480),
 'UW-02': Int64Index([    72,     73,     74,     80,    421,    857,    964,   1026,
               1183,   1433,
             ...
             274078, 274079, 274088, 274089, 274300, 274301, 274537, 274586,
             274998, 275063],
            dtype='int64', length=2002),
 'UW-04': Int64Index([   187,    343,    375,    463,    477,    580,    673,    762,
                781,    833,
             ...
             274811, 274856, 274858, 274876, 274888, 274996, 275045, 275054,
             275069, 275070],
            dtype='int64', length=2688),
 'UW-06': Int64Index([   167,    272,    385,    631,    774,    951,   1011,   1048,
               1078,   1083,
             ...
             274454, 274499, 274562, 274617, 274742, 274773, 274786, 274802,
             274809, 274945],
            dtype='int64', length=2383),
 'UW-07': Int64Index([   121,    122,    215,    216,    365,    367,    404,    721,
               1090,   1141,
             ...
             273936, 273944, 273961, 274498, 274571, 274576, 274721, 274727,
             274947, 275046],
            dtype='int64', length=1905),
 'UW-10': Int64Index([   105,    124,    128,    314,    619,    896,    934,    935,
                999,   1006,
             ...
             238201, 238453, 238514, 238816, 238854, 239274, 239545, 240295,
             240446, 240775],
            dtype='int64', length=1175),
 'UW-11': Int64Index([150250, 150776, 151044, 151373, 151690, 152037, 152061, 153903,
             153941, 154905,
             ...
             273385, 273607, 273760, 273892, 273904, 274016, 274017, 274018,
             274473, 274675],
            dtype='int64', length=1237),
 'UW-12': Int64Index([241157, 241173, 241175, 241194, 241208, 241245, 241292, 241403,
             241435, 241447,
             ...
             274748, 274794, 274795, 274843, 274859, 274864, 274938, 274940,
             274999, 275055],
            dtype='int64', length=689),
 'WF-01': Int64Index([   133,    135,    297,    298,    300,    302,    307,    369,
                475,    514,
             ...
             274972, 274977, 274989, 275013, 275015, 275028, 275044, 275047,
             275078, 275079],
            dtype='int64', length=13038),
 'WF-03': Int64Index([226781, 226784, 226827, 227100, 227321, 227322, 227569, 227570,
             227768, 227769,
             ...
             274097, 274100, 274306, 274307, 274325, 274383, 274667, 274708,
             274909, 274934],
            dtype='int64', length=646),
 'WF-04': Int64Index([    64,     65,    132,    157,    203,    207,    236,    264,
                266,    267,
             ...
             274382, 274630, 274637, 274697, 274707, 274709, 274880, 274913,
             274981, 274986],
            dtype='int64', length=6271)}

In [100]:
df.loc[dfgroup.groups["CBD-03"],:]


Out[100]:
trip_id starttime stoptime bikeid tripduration from_station_name to_station_name from_station_id to_station_id usertype gender birthyear
118 562 10/13/2014 01:30:00 PM 10/13/2014 01:42:00 PM SEA00071 713.859 7th Ave & Union St Cal Anderson Park / 11th Ave & Pine St CBD-03 CH-08 Member Female 1946.0
119 563 10/13/2014 01:30:00 PM 10/13/2014 01:42:00 PM SEA00190 739.791 7th Ave & Union St Cal Anderson Park / 11th Ave & Pine St CBD-03 CH-08 Member Female 1949.0
164 617 10/13/2014 02:18:00 PM 10/13/2014 02:34:00 PM SEA00021 970.064 7th Ave & Union St Occidental Park / Occidental Ave S & S Washing... CBD-03 PS-04 Member Female 1951.0
229 694 10/13/2014 03:58:00 PM 10/13/2014 04:05:00 PM SEA00456 409.011 7th Ave & Union St Republican St & Westlake Ave N CBD-03 SLU-04 Member Male 1972.0
275 745 10/13/2014 04:58:00 PM 10/13/2014 05:10:00 PM SEA00475 738.729 7th Ave & Union St Lake Union Park / Westlake Ave & Aloha St CBD-03 SLU-23 Member Female 1963.0
... ... ... ... ... ... ... ... ... ... ... ... ...
274743 295099 03/30/2017 05:10:00 PM 03/30/2017 05:19:00 PM SEA00484 503.644 7th Ave & Union St Fred Hutchinson Cancer Research Center / Fairv... CBD-03 EL-01 Member Male 1981.0
274774 295130 03/30/2017 06:12:00 PM 03/30/2017 06:23:00 PM SEA00024 670.540 7th Ave & Union St 3rd Ave & Broad St CBD-03 BT-01 Member Male 1982.0
274810 295167 03/30/2017 10:53:00 PM 03/30/2017 10:54:00 PM SEA00374 105.278 7th Ave & Union St Union St & 4th Ave CBD-03 CBD-04 Member Male 1956.0
274915 295279 03/31/2017 01:26:00 PM 03/31/2017 01:32:00 PM SEA00150 391.883 7th Ave & Union St REI / Yale Ave N & John St CBD-03 SLU-01 Member Male 1969.0
275053 295428 03/31/2017 07:24:00 PM 03/31/2017 07:37:00 PM SEA00232 777.493 7th Ave & Union St E Blaine St & Fairview Ave E CBD-03 EL-03 Member Male 1985.0

4822 rows × 12 columns

The simplest version of a groupby looks like this, and you can use almost any aggregation function you wish (mean, median, sum, minimum, maximum, standard deviation, count, etc.)

<data object>.groupby(<grouping values>).<aggregate>()

for example, we can group by gender and find the average of all numerical columns:


In [ ]:
df.groupby(gender).mean()

It's also possible to index the grouped object like it is a dataframe:


In [ ]:

You can even group by multiple values: for example we can look at the trip duration by time of day and by gender:


In [102]:
df.groupby(["from_station_id","bikeid"]).count()


Out[102]:
trip_id starttime stoptime tripduration from_station_name to_station_name to_station_id usertype gender birthyear
from_station_id bikeid
BT-01 SEA00001 7 7 7 7 7 7 7 7 4 4
SEA00002 8 8 8 8 8 8 8 8 7 7
SEA00005 10 10 10 10 10 10 10 10 5 5
SEA00006 9 9 9 9 9 9 9 9 5 5
SEA00007 12 12 12 12 12 12 12 12 2 2
... ... ... ... ... ... ... ... ... ... ... ...
WF-04 SEA00496 3 3 3 3 3 3 3 3 0 0
SEA00497 17 17 17 17 17 17 17 17 5 5
SEA00498 12 12 12 12 12 12 12 12 1 1
SEA00499 8 8 8 8 8 8 8 8 2 2
SEA00500 19 19 19 19 19 19 19 19 1 1

27354 rows × 10 columns

The unstack() operation can help make sense of this type of multiply-grouped data. What this technically does is split a multiple-valued index into an index plus columns:


In [ ]:

Visualizing data with pandas

Of course, looking at tables of data is not very intuitive. Fortunately Pandas has many useful plotting functions built-in, all of which make use of the matplotlib library to generate plots.

Whenever you do plotting in the IPython notebook, you will want to first run this magic command which configures the notebook to work well with plots:


In [1]:
%matplotlib inline

Now we can simply call the plot() method of any series or dataframe to get a reasonable view of the data:


In [5]:
import matplotlib.pyplot as plt
df['tripduration'].hist()


Out[5]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f5456f9dd50>

Adjusting the Plot Style

Matplotlib has a number of plot styles you can use. For example, if you like R you might use the ggplot style:


In [4]:
plt.style.use("ggplot")

Other plot types

Pandas supports a range of other plotting types; you can find these by using the autocomplete on the plot method:


In [ ]:
plt.

For example, we can create a histogram of trip durations:


In [6]:
df["tripduration"].hist()


Out[6]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f54556c1850>

If you'd like to adjust the x and y limits of the plot, you can use the set_xlim() and set_ylim() method of the resulting object:


In [ ]:

Breakout: Exploring the Data

Make a plot of the total number of rides as a function of month of the year (You'll need to extract the month, use a groupby, and find the appropriate aggregation to count the number in each group).


In [ ]:

Split this plot by gender. Do you see any seasonal ridership patterns by gender?


In [ ]:

Split this plot by user type. Do you see any seasonal ridership patterns by usertype?


In [ ]:

Repeat the above three steps, counting the number of rides by time of day rather that by month.


In [ ]:

Are there any other interesting insights you can discover in the data using these tools?


In [ ]:

Using Files

  • Writing and running python modules
  • Using python modules in your Jupyter Notebook

In [ ]:
# A script for creating a dataframe with counts of the occurrence of a columns' values
df_count = df.groupby('from_station_id').count()
df_count1 = df_count[['trip_id']]
df_count2 = df_count1.rename(columns={'trip_id': 'count'})

In [ ]:
df_count2.head()

In [ ]:
def make_table_count(df_arg, groupby_column):
    df_count = df_arg.groupby(groupby_column).count()
    column_name = df.columns[0]
    df_count1 = df_count[[column_name]]
    df_count2 = df_count1.rename(columns={column_name: 'count'})
    return df_count2

In [ ]:
dff = make_table_count(df, 'from_station_id')
dff.head()

In [ ]: