Software Engineering for Data Scientists

Sophisticated Data Manipulation

DATA 515 A

1. Python's Data Science Ecosystem

With this simple Python computation experience under our belt, we can now move to doing some more interesting analysis.

Python's Data Science Ecosystem

In addition to Python's built-in modules like the math module we explored above, there are also many often-used third-party modules that are core tools for doing data science with Python. Some of the most important ones are:

`numpy`: Numerical Python

Numpy is short for "Numerical Python", and contains tools for efficient manipulation of arrays of data. If you have used other computational tools like IDL or MatLab, Numpy should feel very familiar.

`scipy`: Scientific Python

Scipy is short for "Scientific Python", and contains a wide range of functionality for accomplishing common scientific tasks, such as optimization/minimization, numerical integration, interpolation, and much more. We will not look closely at Scipy today, but we will use its functionality later in the course.

`pandas`: Labeled Data Manipulation in Python

Pandas is short for "Panel Data", and contains tools for doing more advanced manipulation of labeled data in Python, in particular with a columnar data structure called a Data Frame. If you've used the R statistical language (and in particular the so-called "Hadley Stack"), much of the functionality in Pandas should feel very familiar.

`matplotlib`: Visualization in Python

Matplotlib started out as a Matlab plotting clone in Python, and has grown from there in the 15 years since its creation. It is the most popular data visualization tool currently in the Python data world (though other recent packages are starting to encroach on its monopoly).

2. Installation

Installing Pandas & friends

Because the above packages are not included in Python itself, you need to install them separately. While it is possible to install these from source (compiling the C and/or Fortran code that does the heavy lifting under the hood) it is much easier to use a package manager like conda. All it takes is to run

$ conda install numpy scipy pandas matplotlib

and (so long as your conda setup is working) the packages will be downloaded and installed on your system.

3. Arrays and slicing in Numpy



In [1]:

    
import numpy as np

Lists in native Python

Let's create a list, a native Python object that we've used earlier today.



In [2]:

    
my_list = [2, 5, 7, 8]
my_list









    Out[2]:





[2, 5, 7, 8]

This list is one-dimensional, let's make it multidimensional!



In [5]:

    
multi_list = [[1, 2, 3], [4, 5, 6]]

How do we access the 6 element in the second row, third column for native Python list?



In [7]:

    
multi_list[1][2]









    Out[7]:





6



In [8]:

    
second_list = [1, "NaN", 3.0]



In [27]:

    
len(multi_list[0])









    Out[27]:





3



In [17]:

    
for element in second_list:
    print(type(element))









    



<class 'int'>
<class 'str'>
<class 'float'>

Converting to numpy Arrays



In [19]:

    
my_array = np.array(my_list, dtype="int8")
my_array









    Out[19]:





array([2, 5, 7, 8], dtype=int8)



In [12]:

    
type(my_array)









    Out[12]:





numpy.ndarray



In [22]:

    
my_array[0]









    Out[22]:





1



In [20]:

    
my_array.dtype









    Out[20]:





dtype('int8')



In [24]:

    
multi_array = np.array([[7, 8], [9, 10]], dtype="int8")



In [25]:

    
multi_array.shape









    Out[25]:





(2, 2)



In [28]:

    
multi_array = np.array([[1, 2, 3],
                        [4, 5, 6]], np.int32)

How do we access the 6 element in the second row, third column for numpy array?



In [30]:

    
multi_array[1,2]









    Out[30]:





6

How do we retrieve a slice of the array, array([[1, 2], [4, 5]])?



In [31]:

    
multi_array[:,:2]









    Out[31]:





array([[1, 2],
       [4, 5]], dtype=int32)

How do we retrieve the second column of the array?



In [34]:

    
multi_array[:,1]









    Out[34]:





array([2, 5], dtype=int32)

4. Introduction to Pandas DataFrames

What are the elements of a table?



In [35]:

    
# Pandas DataFrames as table elements
import pandas as pd

What operations do we perform on tables?



In [37]:

    
df = pd.DataFrame({'A': [1,2,3], 'B': [2, 4, 6], 'ccc': [1.0, 33, 4]})
df



In [42]:

    
print(df["A"].dtype)
print(df["B"].dtype)
print(df["ccc"].dtype)









    



int64
int64
float64



In [44]:

    
df.dtypes









    Out[44]:





A        int64
B        int64
ccc    float64
dtype: object



In [45]:

    
sub_df = df[['A', 'ccc']]
sub_df



In [47]:

    
type(df['A'] + 2*df['B'])









    Out[47]:





pandas.core.series.Series

Operations on a Pandas DataFrame

5. Manipulating Data with DataFrames

Downloading the data

Shell commands can be run from the notebook by preceding them with an exclamation point:



In [51]:

    
!ls









    



numpy_pandas.ipynb  pronto.csv

uncomment this to download the data:



In [50]:

    
!curl -o pronto.csv https://data.seattle.gov/api/views/tw7j-dfaw/rows.csv?accessType=DOWNLOAD









    



  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 42.7M    0 42.7M    0     0  3669k      0 --:--:--  0:00:11 --:--:-- 3169k

Loading Data into a DataFrame

Because we'll use it so much, we often import under a shortened name using the import ... as ... pattern:



In [2]:

    
import pandas as pd
df = pd.read_csv('pronto.csv')



In [53]:

    
df









    Out[53]:







  
    
      
      trip_id
      starttime
      stoptime
      bikeid
      tripduration
      from_station_name
      to_station_name
      from_station_id
      to_station_id
      usertype
      gender
      birthyear
    
  
  
    
      0
      431
      10/13/2014 10:31:00 AM
      10/13/2014 10:48:00 AM
      SEA00298
      985.935
      2nd Ave & Spring St
      Occidental Park / Occidental Ave S & S Washing...
      CBD-06
      PS-04
      Member
      Male
      1960.0
    
    
      1
      432
      10/13/2014 10:32:00 AM
      10/13/2014 10:48:00 AM
      SEA00195
      926.375
      2nd Ave & Spring St
      Occidental Park / Occidental Ave S & S Washing...
      CBD-06
      PS-04
      Member
      Male
      1970.0
    
    
      2
      433
      10/13/2014 10:33:00 AM
      10/13/2014 10:48:00 AM
      SEA00486
      883.831
      2nd Ave & Spring St
      Occidental Park / Occidental Ave S & S Washing...
      CBD-06
      PS-04
      Member
      Female
      1988.0
    
    
      3
      434
      10/13/2014 10:34:00 AM
      10/13/2014 10:48:00 AM
      SEA00333
      865.937
      2nd Ave & Spring St
      Occidental Park / Occidental Ave S & S Washing...
      CBD-06
      PS-04
      Member
      Female
      1977.0
    
    
      4
      435
      10/13/2014 10:34:00 AM
      10/13/2014 10:49:00 AM
      SEA00202
      923.923
      2nd Ave & Spring St
      Occidental Park / Occidental Ave S & S Washing...
      CBD-06
      PS-04
      Member
      Male
      1971.0
    
    
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
    
    
      275086
      295463
      03/31/2017 11:16:00 PM
      04/01/2017 02:11:00 AM
      SEA00347
      10487.475
      NE 42nd St & University Way NE
      E Harrison St & Broadway Ave E
      UD-02
      CH-02
      Short-Term Pass Holder
      NaN
      NaN
    
    
      275087
      295464
      03/31/2017 11:16:00 PM
      04/01/2017 02:11:00 AM
      SEA00031
      10487.366
      NE 42nd St & University Way NE
      E Harrison St & Broadway Ave E
      UD-02
      CH-02
      Member
      Female
      1982.0
    
    
      275088
      295465
      03/31/2017 11:44:00 PM
      04/01/2017 12:04:00 AM
      SEA00240
      1205.656
      Bellevue Ave & E Pine St
      Lake Union Park / Westlake Ave & Aloha St
      CH-12
      SLU-23
      Member
      Male
      1990.0
    
    
      275089
      295466
      03/31/2017 11:46:00 PM
      04/01/2017 12:04:00 AM
      SEA00121
      1115.658
      Bellevue Ave & E Pine St
      Lake Union Park / Westlake Ave & Aloha St
      CH-12
      SLU-23
      Member
      Female
      1987.0
    
    
      275090
      295467
      03/31/2017 11:52:00 PM
      04/01/2017 02:33:00 AM
      SEA00151
      9667.679
      Cal Anderson Park / 11th Ave & Pine St
      Broadway and E Denny Way
      CH-08
      CH-16
      Member
      Male
      1988.0
    
  

275091 rows × 12 columns



In [54]:

    
type(df)









    Out[54]:





pandas.core.frame.DataFrame



In [55]:

    
len(df)









    Out[55]:





275091

Now we can use the read_csv command to read the comma-separated-value data:

Note: strings in Python can be defined either with double quotes or single quotes

Viewing Pandas Dataframes

The head() and tail() methods show us the first and last rows of the data



In [56]:

    
df.head()









    Out[56]:







  
    
      
      trip_id
      starttime
      stoptime
      bikeid
      tripduration
      from_station_name
      to_station_name
      from_station_id
      to_station_id
      usertype
      gender
      birthyear
    
  
  
    
      0
      431
      10/13/2014 10:31:00 AM
      10/13/2014 10:48:00 AM
      SEA00298
      985.935
      2nd Ave & Spring St
      Occidental Park / Occidental Ave S & S Washing...
      CBD-06
      PS-04
      Member
      Male
      1960.0
    
    
      1
      432
      10/13/2014 10:32:00 AM
      10/13/2014 10:48:00 AM
      SEA00195
      926.375
      2nd Ave & Spring St
      Occidental Park / Occidental Ave S & S Washing...
      CBD-06
      PS-04
      Member
      Male
      1970.0
    
    
      2
      433
      10/13/2014 10:33:00 AM
      10/13/2014 10:48:00 AM
      SEA00486
      883.831
      2nd Ave & Spring St
      Occidental Park / Occidental Ave S & S Washing...
      CBD-06
      PS-04
      Member
      Female
      1988.0
    
    
      3
      434
      10/13/2014 10:34:00 AM
      10/13/2014 10:48:00 AM
      SEA00333
      865.937
      2nd Ave & Spring St
      Occidental Park / Occidental Ave S & S Washing...
      CBD-06
      PS-04
      Member
      Female
      1977.0
    
    
      4
      435
      10/13/2014 10:34:00 AM
      10/13/2014 10:49:00 AM
      SEA00202
      923.923
      2nd Ave & Spring St
      Occidental Park / Occidental Ave S & S Washing...
      CBD-06
      PS-04
      Member
      Male
      1971.0



In [57]:

    
df.columns









    Out[57]:





Index(['trip_id', 'starttime', 'stoptime', 'bikeid', 'tripduration',
       'from_station_name', 'to_station_name', 'from_station_id',
       'to_station_id', 'usertype', 'gender', 'birthyear'],
      dtype='object')



In [58]:

    
df.index









    Out[58]:





RangeIndex(start=0, stop=275091, step=1)



In [59]:

    
smaller_df = df.loc[[1,4,6,7,9,34],:]



In [60]:

    
smaller_df.index









    Out[60]:





Int64Index([1, 4, 6, 7, 9, 34], dtype='int64')

The shape attribute shows us the number of elements:



In [61]:

    
df.shape









    Out[61]:





(275091, 12)

The columns attribute gives us the column names

The index attribute gives us the index names

The dtypes attribute gives the data types of each column:



In [62]:

    
df.dtypes









    Out[62]:





trip_id                int64
starttime             object
stoptime              object
bikeid                object
tripduration         float64
from_station_name     object
to_station_name       object
from_station_id       object
to_station_id         object
usertype              object
gender                object
birthyear            float64
dtype: object

Sophisticated Data Manipulation

Here we'll cover some key features of manipulating data with pandas

Access columns by name using square-bracket indexing:



In [63]:

    
df_small = df['stoptime']



In [64]:

    
df_small









    Out[64]:





0         10/13/2014 10:48:00 AM
1         10/13/2014 10:48:00 AM
2         10/13/2014 10:48:00 AM
3         10/13/2014 10:48:00 AM
4         10/13/2014 10:49:00 AM
                   ...          
275086    04/01/2017 02:11:00 AM
275087    04/01/2017 02:11:00 AM
275088    04/01/2017 12:04:00 AM
275089    04/01/2017 12:04:00 AM
275090    04/01/2017 02:33:00 AM
Name: stoptime, Length: 275091, dtype: object



In [65]:

    
type(df_small)









    Out[65]:





pandas.core.series.Series



In [67]:

    
df['tripduration']









    Out[67]:





0           985.935
1           926.375
2           883.831
3           865.937
4           923.923
            ...    
275086    10487.475
275087    10487.366
275088     1205.656
275089     1115.658
275090     9667.679
Name: tripduration, Length: 275091, dtype: float64

Mathematical operations on columns happen element-wise:



In [70]:

    
trip_duration_hours = df['tripduration']/3600
trip_duration_hours[:5]









    Out[70]:





0    0.273871
1    0.257326
2    0.245509
3    0.240538
4    0.256645
Name: tripduration, dtype: float64



In [69]:

    
trip_duration_hours.head()









    Out[69]:





0    0.273871
1    0.257326
2    0.245509
3    0.240538
4    0.256645
Name: tripduration, dtype: float64



In [71]:

    
df['trip_duration_hours'] = df['tripduration']/3600



In [74]:

    
df.head()









    Out[74]:







  
    
      
      trip_id
      starttime
      stoptime
      bikeid
      tripduration
      from_station_name
      to_station_name
      from_station_id
      to_station_id
      usertype
      gender
      birthyear
    
  
  
    
      0
      431
      10/13/2014 10:31:00 AM
      10/13/2014 10:48:00 AM
      SEA00298
      985.935
      2nd Ave & Spring St
      Occidental Park / Occidental Ave S & S Washing...
      CBD-06
      PS-04
      Member
      Male
      1960.0
    
    
      1
      432
      10/13/2014 10:32:00 AM
      10/13/2014 10:48:00 AM
      SEA00195
      926.375
      2nd Ave & Spring St
      Occidental Park / Occidental Ave S & S Washing...
      CBD-06
      PS-04
      Member
      Male
      1970.0
    
    
      2
      433
      10/13/2014 10:33:00 AM
      10/13/2014 10:48:00 AM
      SEA00486
      883.831
      2nd Ave & Spring St
      Occidental Park / Occidental Ave S & S Washing...
      CBD-06
      PS-04
      Member
      Female
      1988.0
    
    
      3
      434
      10/13/2014 10:34:00 AM
      10/13/2014 10:48:00 AM
      SEA00333
      865.937
      2nd Ave & Spring St
      Occidental Park / Occidental Ave S & S Washing...
      CBD-06
      PS-04
      Member
      Female
      1977.0
    
    
      4
      435
      10/13/2014 10:34:00 AM
      10/13/2014 10:49:00 AM
      SEA00202
      923.923
      2nd Ave & Spring St
      Occidental Park / Occidental Ave S & S Washing...
      CBD-06
      PS-04
      Member
      Male
      1971.0



In [75]:

    
del df['trip_duration_hours']









    



---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~/miniconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2645             try:
-> 2646                 return self._engine.get_loc(key)
   2647             except KeyError:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'trip_duration_hours'

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-75-da88136fa6f4> in <module>
----> 1 del df['trip_duration_hours']

~/miniconda3/lib/python3.7/site-packages/pandas/core/generic.py in __delitem__(self, key)
   3757             # there was no match, this call should raise the appropriate
   3758             # exception:
-> 3759             self._data.delete(key)
   3760 
   3761         # delete from the caches

~/miniconda3/lib/python3.7/site-packages/pandas/core/internals/managers.py in delete(self, item)
   1000         Delete selected item (items if non-unique) in-place.
   1001         """
-> 1002         indexer = self.items.get_loc(item)
   1003 
   1004         is_deleted = np.zeros(self.shape[0], dtype=np.bool_)

~/miniconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2646                 return self._engine.get_loc(key)
   2647             except KeyError:
-> 2648                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2649         indexer = self.get_indexer([key], method=method, tolerance=tolerance)
   2650         if indexer.ndim > 1 or indexer.size > 1:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'trip_duration_hours'



In [76]:

    
df.head()









    Out[76]:







  
    
      
      trip_id
      starttime
      stoptime
      bikeid
      tripduration
      from_station_name
      to_station_name
      from_station_id
      to_station_id
      usertype
      gender
      birthyear
    
  
  
    
      0
      431
      10/13/2014 10:31:00 AM
      10/13/2014 10:48:00 AM
      SEA00298
      985.935
      2nd Ave & Spring St
      Occidental Park / Occidental Ave S & S Washing...
      CBD-06
      PS-04
      Member
      Male
      1960.0
    
    
      1
      432
      10/13/2014 10:32:00 AM
      10/13/2014 10:48:00 AM
      SEA00195
      926.375
      2nd Ave & Spring St
      Occidental Park / Occidental Ave S & S Washing...
      CBD-06
      PS-04
      Member
      Male
      1970.0
    
    
      2
      433
      10/13/2014 10:33:00 AM
      10/13/2014 10:48:00 AM
      SEA00486
      883.831
      2nd Ave & Spring St
      Occidental Park / Occidental Ave S & S Washing...
      CBD-06
      PS-04
      Member
      Female
      1988.0
    
    
      3
      434
      10/13/2014 10:34:00 AM
      10/13/2014 10:48:00 AM
      SEA00333
      865.937
      2nd Ave & Spring St
      Occidental Park / Occidental Ave S & S Washing...
      CBD-06
      PS-04
      Member
      Female
      1977.0
    
    
      4
      435
      10/13/2014 10:34:00 AM
      10/13/2014 10:49:00 AM
      SEA00202
      923.923
      2nd Ave & Spring St
      Occidental Park / Occidental Ave S & S Washing...
      CBD-06
      PS-04
      Member
      Male
      1971.0



In [77]:

    
df.loc[[0,1],:]









    Out[77]:







  
    
      
      trip_id
      starttime
      stoptime
      bikeid
      tripduration
      from_station_name
      to_station_name
      from_station_id
      to_station_id
      usertype
      gender
      birthyear
    
  
  
    
      0
      431
      10/13/2014 10:31:00 AM
      10/13/2014 10:48:00 AM
      SEA00298
      985.935
      2nd Ave & Spring St
      Occidental Park / Occidental Ave S & S Washing...
      CBD-06
      PS-04
      Member
      Male
      1960.0
    
    
      1
      432
      10/13/2014 10:32:00 AM
      10/13/2014 10:48:00 AM
      SEA00195
      926.375
      2nd Ave & Spring St
      Occidental Park / Occidental Ave S & S Washing...
      CBD-06
      PS-04
      Member
      Male
      1970.0



In [80]:

    
df_long_trips = df[df['tripduration'] >10000]



In [81]:

    
df_long_trips









    Out[81]:







  
    
      
      trip_id
      starttime
      stoptime
      bikeid
      tripduration
      from_station_name
      to_station_name
      from_station_id
      to_station_id
      usertype
      gender
      birthyear
    
  
  
    
      68
      506
      10/13/2014 12:08:00 PM
      10/13/2014 07:07:00 PM
      SEA00155
      25119.114
      2nd Ave & Spring St
      2nd Ave & Vine St
      CBD-06
      BT-03
      Member
      Male
      1988.0
    
    
      193
      649
      10/13/2014 02:54:00 PM
      10/13/2014 05:59:00 PM
      SEA00193
      11144.159
      NE 42nd St & University Way NE
      NE 42nd St & University Way NE
      UD-02
      UD-02
      Short-Term Pass Holder
      NaN
      NaN
    
    
      408
      908
      10/13/2014 11:12:00 PM
      10/14/2014 06:52:00 AM
      SEA00230
      27583.945
      7th Ave & Union St
      2nd Ave & Pine St
      CBD-03
      CBD-13
      Short-Term Pass Holder
      NaN
      NaN
    
    
      475
      983
      10/14/2014 09:12:00 AM
      10/14/2014 01:20:00 PM
      SEA00181
      14885.176
      Pier 69 / Alaskan Way & Clay St
      Pier 69 / Alaskan Way & Clay St
      WF-01
      WF-01
      Short-Term Pass Holder
      NaN
      NaN
    
    
      512
      1025
      10/14/2014 10:49:00 AM
      10/14/2014 05:19:00 PM
      SEA00244
      23346.911
      E Pine St & 16th Ave
      PATH / 9th Ave & Westlake Ave
      CH-07
      SLU-07
      Member
      Male
      1986.0
    
    
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
    
    
      274051
      294382
      03/25/2017 02:28:00 PM
      03/25/2017 07:05:00 PM
      SEA00494
      16634.168
      Cal Anderson Park / 11th Ave & Pine St
      Pier 69 / Alaskan Way & Clay St
      CH-08
      WF-01
      Short-Term Pass Holder
      NaN
      NaN
    
    
      274052
      294383
      03/25/2017 02:28:00 PM
      03/25/2017 07:05:00 PM
      SEA00182
      16639.182
      Cal Anderson Park / 11th Ave & Pine St
      Pier 69 / Alaskan Way & Clay St
      CH-08
      WF-01
      Short-Term Pass Holder
      NaN
      NaN
    
    
      274266
      294604
      03/27/2017 11:56:00 AM
      03/27/2017 03:29:00 PM
      SEA00291
      12758.624
      6th Ave S & S King St
      NE 42nd St & University Way NE
      ID-04
      UD-02
      Short-Term Pass Holder
      NaN
      NaN
    
    
      275086
      295463
      03/31/2017 11:16:00 PM
      04/01/2017 02:11:00 AM
      SEA00347
      10487.475
      NE 42nd St & University Way NE
      E Harrison St & Broadway Ave E
      UD-02
      CH-02
      Short-Term Pass Holder
      NaN
      NaN
    
    
      275087
      295464
      03/31/2017 11:16:00 PM
      04/01/2017 02:11:00 AM
      SEA00031
      10487.366
      NE 42nd St & University Way NE
      E Harrison St & Broadway Ave E
      UD-02
      CH-02
      Member
      Female
      1982.0
    
  

3243 rows × 12 columns



In [84]:

    
sel = df['tripduration'] > 10000
df_long_trips = df[sel]



In [83]:

    
sel









    Out[83]:





0         False
1         False
2         False
3         False
4         False
          ...  
275086     True
275087     True
275088    False
275089    False
275090    False
Name: tripduration, Length: 275091, dtype: bool



In [85]:

    
df_long_trips









    Out[85]:







  
    
      
      trip_id
      starttime
      stoptime
      bikeid
      tripduration
      from_station_name
      to_station_name
      from_station_id
      to_station_id
      usertype
      gender
      birthyear
    
  
  
    
      68
      506
      10/13/2014 12:08:00 PM
      10/13/2014 07:07:00 PM
      SEA00155
      25119.114
      2nd Ave & Spring St
      2nd Ave & Vine St
      CBD-06
      BT-03
      Member
      Male
      1988.0
    
    
      193
      649
      10/13/2014 02:54:00 PM
      10/13/2014 05:59:00 PM
      SEA00193
      11144.159
      NE 42nd St & University Way NE
      NE 42nd St & University Way NE
      UD-02
      UD-02
      Short-Term Pass Holder
      NaN
      NaN
    
    
      408
      908
      10/13/2014 11:12:00 PM
      10/14/2014 06:52:00 AM
      SEA00230
      27583.945
      7th Ave & Union St
      2nd Ave & Pine St
      CBD-03
      CBD-13
      Short-Term Pass Holder
      NaN
      NaN
    
    
      475
      983
      10/14/2014 09:12:00 AM
      10/14/2014 01:20:00 PM
      SEA00181
      14885.176
      Pier 69 / Alaskan Way & Clay St
      Pier 69 / Alaskan Way & Clay St
      WF-01
      WF-01
      Short-Term Pass Holder
      NaN
      NaN
    
    
      512
      1025
      10/14/2014 10:49:00 AM
      10/14/2014 05:19:00 PM
      SEA00244
      23346.911
      E Pine St & 16th Ave
      PATH / 9th Ave & Westlake Ave
      CH-07
      SLU-07
      Member
      Male
      1986.0
    
    
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
    
    
      274051
      294382
      03/25/2017 02:28:00 PM
      03/25/2017 07:05:00 PM
      SEA00494
      16634.168
      Cal Anderson Park / 11th Ave & Pine St
      Pier 69 / Alaskan Way & Clay St
      CH-08
      WF-01
      Short-Term Pass Holder
      NaN
      NaN
    
    
      274052
      294383
      03/25/2017 02:28:00 PM
      03/25/2017 07:05:00 PM
      SEA00182
      16639.182
      Cal Anderson Park / 11th Ave & Pine St
      Pier 69 / Alaskan Way & Clay St
      CH-08
      WF-01
      Short-Term Pass Holder
      NaN
      NaN
    
    
      274266
      294604
      03/27/2017 11:56:00 AM
      03/27/2017 03:29:00 PM
      SEA00291
      12758.624
      6th Ave S & S King St
      NE 42nd St & University Way NE
      ID-04
      UD-02
      Short-Term Pass Holder
      NaN
      NaN
    
    
      275086
      295463
      03/31/2017 11:16:00 PM
      04/01/2017 02:11:00 AM
      SEA00347
      10487.475
      NE 42nd St & University Way NE
      E Harrison St & Broadway Ave E
      UD-02
      CH-02
      Short-Term Pass Holder
      NaN
      NaN
    
    
      275087
      295464
      03/31/2017 11:16:00 PM
      04/01/2017 02:11:00 AM
      SEA00031
      10487.366
      NE 42nd St & University Way NE
      E Harrison St & Broadway Ave E
      UD-02
      CH-02
      Member
      Female
      1982.0
    
  

3243 rows × 12 columns



In [86]:

    
df[sel].shape









    Out[86]:





(3243, 12)



In [87]:

    
# Make a copy of a slice
df_subset = df[['starttime', 'stoptime']].copy()
df_subset['trip_hours'] = df['tripduration']/3600

Columns can be created (or overwritten) with the assignment operator. Let's create a tripminutes column with the number of minutes for each trip

More complicated mathematical operations can be done with tools in the numpy package:

Working with Times

One trick to know when working with columns of times is that Pandas DateTimeIndex provides a nice interface for working with columns of times.

For a dataset of this size, using pd.to_datetime and specifying the date format can make things much faster (from the strftime reference, we see that the pronto data has format "%m/%d/%Y %I:%M:%S %p"

(Note: you can also use infer_datetime_format=True in most cases to automatically infer the correct format, though due to a bug it doesn't work when AM/PM are present)

With it, we can extract, the hour of the day, the day of the week, the month, and a wide range of other views of the time:

Simple Grouping of Data

The real power of Pandas comes in its tools for grouping and aggregating data. Here we'll look at value counts and the basics of group-by operations.

Value Counts

Pandas includes an array of useful functionality for manipulating and analyzing tabular data. We'll take a look at two of these here.

The pandas.value_counts returns statistics on the unique values within each column.

We can use it, for example, to break down rides by gender:



In [90]:

    
pd.value_counts(df["gender"])









    Out[90]:





Male      135171
Female     34997
Other       3647
Name: gender, dtype: int64

Or to break down rides by age:



In [91]:

    
pd.value_counts(2019 - df["birthyear"])









    Out[91]:





32.0    15381
34.0    10491
38.0     9028
37.0     8950
35.0     8699
        ...  
83.0        7
20.0        6
77.0        2
75.0        1
88.0        1
Name: birthyear, Length: 61, dtype: int64

By default, the values rather than the index are sorted. Use sort=False to turn this behavior off:



In [92]:

    
pd.value_counts(df["birthyear"], sort=False)









    Out[92]:





1931.0       1
1936.0       7
1939.0      47
1942.0       2
1943.0      11
          ... 
1995.0    1437
1996.0     547
1997.0     285
1998.0     124
1999.0       6
Name: birthyear, Length: 61, dtype: int64

We can explore other things as well: day of week, hour of day, etc.



In [ ]:

Group-by Operation

One of the killer features of the Pandas dataframe is the ability to do group-by operations. You can visualize the group-by like this (image borrowed from the Python Data Science Handbook)



In [93]:

    
df.head()









    Out[93]:







  
    
      
      trip_id
      starttime
      stoptime
      bikeid
      tripduration
      from_station_name
      to_station_name
      from_station_id
      to_station_id
      usertype
      gender
      birthyear
    
  
  
    
      0
      431
      10/13/2014 10:31:00 AM
      10/13/2014 10:48:00 AM
      SEA00298
      985.935
      2nd Ave & Spring St
      Occidental Park / Occidental Ave S & S Washing...
      CBD-06
      PS-04
      Member
      Male
      1960.0
    
    
      1
      432
      10/13/2014 10:32:00 AM
      10/13/2014 10:48:00 AM
      SEA00195
      926.375
      2nd Ave & Spring St
      Occidental Park / Occidental Ave S & S Washing...
      CBD-06
      PS-04
      Member
      Male
      1970.0
    
    
      2
      433
      10/13/2014 10:33:00 AM
      10/13/2014 10:48:00 AM
      SEA00486
      883.831
      2nd Ave & Spring St
      Occidental Park / Occidental Ave S & S Washing...
      CBD-06
      PS-04
      Member
      Female
      1988.0
    
    
      3
      434
      10/13/2014 10:34:00 AM
      10/13/2014 10:48:00 AM
      SEA00333
      865.937
      2nd Ave & Spring St
      Occidental Park / Occidental Ave S & S Washing...
      CBD-06
      PS-04
      Member
      Female
      1977.0
    
    
      4
      435
      10/13/2014 10:34:00 AM
      10/13/2014 10:49:00 AM
      SEA00202
      923.923
      2nd Ave & Spring St
      Occidental Park / Occidental Ave S & S Washing...
      CBD-06
      PS-04
      Member
      Male
      1971.0



In [94]:

    
df_count = df.groupby(['from_station_id']).count()
df_count.head()









    Out[94]:







  
    
      
      trip_id
      starttime
      stoptime
      bikeid
      tripduration
      from_station_name
      to_station_name
      to_station_id
      usertype
      gender
      birthyear
    
    
      from_station_id
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      BT-01
      10463
      10463
      10463
      10463
      10463
      10463
      10463
      10463
      10463
      4162
      4162
    
    
      BT-03
      7334
      7334
      7334
      7334
      7334
      7334
      7334
      7334
      7334
      4862
      4862
    
    
      BT-04
      4666
      4666
      4666
      4666
      4666
      4666
      4666
      4666
      4666
      3424
      3424
    
    
      BT-05
      5699
      5699
      5699
      5699
      5699
      5699
      5699
      5699
      5699
      2975
      2975
    
    
      BT-06
      150
      150
      150
      150
      150
      150
      150
      150
      150
      130
      130



In [95]:

    
df_mean = df.groupby(['from_station_id']).mean()
df_mean.head()









    Out[95]:







  
    
      
      trip_id
      tripduration
      birthyear
    
    
      from_station_id
      
      
      
    
  
  
    
      BT-01
      147831.009844
      1375.031203
      1980.131427
    
    
      BT-03
      139404.294655
      1019.200684
      1976.505142
    
    
      BT-04
      157992.809687
      891.095897
      1979.877044
    
    
      BT-05
      139283.572381
      1199.949481
      1975.937479
    
    
      BT-06
      291807.953333
      659.770547
      1975.830769



In [97]:

    
dfgroup = df.groupby(['from_station_id'])
dfgroup.groups









    Out[97]:





{'BT-01': Int64Index([   217,    227,    228,    282,    283,    310,    326,    327,
                329,    331,
             ...
             274971, 274973, 274974, 274975, 274976, 274979, 275032, 275033,
             275075, 275076],
            dtype='int64', length=10463),
 'BT-03': Int64Index([    87,     88,    230,    261,    366,    407,    414,    439,
                453,    754,
             ...
             268122, 268181, 268307, 268318, 268319, 268391, 268392, 268467,
             268527, 268528],
            dtype='int64', length=7334),
 'BT-04': Int64Index([    66,     67,     94,    104,    108,    166,    233,    259,
                322,    333,
             ...
             274350, 274361, 274424, 274704, 274789, 274970, 275009, 275064,
             275065, 275083],
            dtype='int64', length=4666),
 'BT-05': Int64Index([   110,    413,    426,    513,    585,    618,    744,    753,
                795,   1003,
             ...
             274547, 274605, 274610, 274621, 274817, 274847, 274910, 274911,
             275029, 275034],
            dtype='int64', length=5699),
 'BT-06': Int64Index([268581, 268642, 268667, 268718, 268735, 268781, 268897, 268903,
             268914, 268961,
             ...
             274777, 274778, 274865, 274931, 274937, 274949, 274951, 274956,
             274962, 274965],
            dtype='int64', length=150),
 'CBD-03': Int64Index([   118,    119,    164,    229,    275,    285,    328,    339,
                356,    357,
             ...
             274574, 274594, 274599, 274622, 274672, 274743, 274774, 274810,
             274915, 275053],
            dtype='int64', length=4822),
 'CBD-04': Int64Index([105392, 105458, 105467, 105472, 105614, 105615, 105835, 105836,
             105855, 105858,
             ...
             274730, 274922, 274924, 274925, 274926, 274927, 274952, 274958,
             274983, 275080],
            dtype='int64', length=3440),
 'CBD-05': Int64Index([    54,     79,     95,    148,    149,    150,    151,    165,
                211,    219,
             ...
             274081, 274085, 274086, 274138, 274139, 274623, 274624, 274692,
             274814, 274906],
            dtype='int64', length=5068),
 'CBD-06': Int64Index([     0,      1,      2,      3,      4,      5,     63,     68,
                 70,     71,
             ...
             274534, 274544, 274570, 274671, 274744, 274765, 274797, 274836,
             274838, 274980],
            dtype='int64', length=4911),
 'CBD-07': Int64Index([    42,     69,     78,    141,    196,    269,    478,    510,
                522,    542,
             ...
             274317, 274413, 274439, 274467, 274579, 274673, 274688, 274752,
             274879, 274907],
            dtype='int64', length=3263),
 'CBD-13': Int64Index([    99,    139,    198,    249,    276,    334,    381,    388,
                424,    442,
             ...
             274444, 274535, 274555, 274613, 274726, 274780, 274903, 274916,
             274966, 274968],
            dtype='int64', length=9067),
 'CD-01': Int64Index([ 68531,  68532,  69169,  69170,  69954,  70367,  70529,  70546,
              70621,  70622,
             ...
             224878, 225305, 225422, 225427, 225492, 225493, 225587, 225666,
             226277, 226597],
            dtype='int64', length=958),
 'CH-01': Int64Index([   256,    355,    382,    416,    437,    444,    502,    503,
                600,    679,
             ...
             274585, 274632, 274641, 274656, 274670, 274827, 274829, 274871,
             274933, 274995],
            dtype='int64', length=6409),
 'CH-02': Int64Index([    55,     56,     58,     83,    113,    126,    127,    154,
                162,    163,
             ...
             274450, 274488, 274593, 274660, 274668, 274868, 274917, 275067,
             275081, 275082],
            dtype='int64', length=8546),
 'CH-03': Int64Index([   290,    417,    428,    435,    436,    452,    494,    516,
                640,    665,
             ...
             274625, 274650, 274658, 274693, 274822, 274844, 274872, 274893,
             274894, 274950],
            dtype='int64', length=6218),
 'CH-05': Int64Index([   134,    195,    205,    248,    250,    251,    315,    324,
                337,    390,
             ...
             274457, 274597, 274714, 274850, 274855, 274873, 274889, 274932,
             274936, 274993],
            dtype='int64', length=6948),
 'CH-06': Int64Index([   212,    253,    277,    278,    279,    403,    449,    504,
                684,    881,
             ...
             274679, 274745, 274746, 274750, 274751, 274781, 274788, 274834,
             274839, 274875],
            dtype='int64', length=3765),
 'CH-07': Int64Index([   146,    210,    299,    341,    374,    377,    401,    415,
                431,    466,
             ...
             274832, 274846, 274890, 274891, 274892, 274955, 275002, 275049,
             275060, 275077],
            dtype='int64', length=11568),
 'CH-08': Int64Index([   120,    136,    144,    158,    159,    242,    262,    294,
                311,    321,
             ...
             274824, 274848, 274853, 274904, 274905, 274935, 274943, 274982,
             275006, 275090],
            dtype='int64', length=8573),
 'CH-09': Int64Index([   101,    168,    222,    349,    380,    467,    567,    578,
                628,    647,
             ...
             274357, 274466, 274468, 274475, 274500, 274595, 274611, 274732,
             274895, 275066],
            dtype='int64', length=5246),
 'CH-12': Int64Index([   319,    384,    411,    441,    451,    462,    540,    554,
                558,    605,
             ...
             274577, 274603, 274609, 274615, 274631, 274680, 275050, 275056,
             275088, 275089],
            dtype='int64', length=5857),
 'CH-15': Int64Index([   109,    160,    244,    340,    402,    430,    459,    468,
                723,    724,
             ...
             274674, 274756, 274791, 274812, 274826, 274840, 274852, 274857,
             274921, 274969],
            dtype='int64', length=6550),
 'CH-16': Int64Index([175075, 175093, 175108, 175126, 175127, 175131, 175136, 175144,
             175145, 175210,
             ...
             274629, 274805, 274816, 274825, 274854, 274920, 275035, 275042,
             275051, 275072],
            dtype='int64', length=2089),
 'DPD-01': Int64Index([    59,     91,     93,    289,    568,    644,    667,    740,
                839,    974,
             ...
             274530, 274604, 274665, 274681, 274798, 274823, 274939, 274957,
             274961, 275026],
            dtype='int64', length=4822),
 'DPD-03': Int64Index([   131,    197,    345,    347,    727,    844,   1075,   1144,
               1347,   1430,
             ...
             273047, 273048, 274048, 274092, 274776, 274800, 275001, 275003,
             275004, 275005],
            dtype='int64', length=1423),
 'EL-01': Int64Index([   199,    400,    700,    702,    715,    716,    769,   1175,
               1350,   1351,
             ...
             274676, 274718, 274731, 274770, 274837, 274928, 274929, 274948,
             274959, 275038],
            dtype='int64', length=3604),
 'EL-03': Int64Index([   344,    358,    360,    425,    492,    583,    927,   1027,
               1071,   1110,
             ...
             274757, 274861, 274862, 274882, 274984, 274988, 274990, 274994,
             275031, 275052],
            dtype='int64', length=5788),
 'EL-05': Int64Index([   200,    201,    447,    456,    488,    615,    646,    694,
                763,    858,
             ...
             274019, 274157, 274162, 274253, 274368, 274477, 274584, 274725,
             274877, 274878],
            dtype='int64', length=3400),
 'FH-01': Int64Index([   100,    231,    325,    330,    373,    386,    455,    485,
                505,    521,
             ...
             173748, 173988, 174253, 174384, 174549, 174647, 174657, 174690,
             174986, 175005],
            dtype='int64', length=2349),
 'FH-04': Int64Index([   364,    371,    392,    396,    460,    482,    529,    950,
                970,    984,
             ...
             274428, 274519, 274600, 274640, 274646, 274648, 274849, 274851,
             274964, 275000],
            dtype='int64', length=4208),
 'ID-04': Int64Index([    89,    123,    155,    156,    169,    170,    214,    223,
                237,    309,
             ...
             274353, 274445, 274548, 274792, 274930, 275014, 275057, 275058,
             275084, 275085],
            dtype='int64', length=2474),
 'PS-04': Int64Index([     6,      7,      8,      9,     10,     11,     12,     13,
                 14,     15,
             ...
             274446, 274471, 274572, 274734, 274766, 274874, 274901, 274902,
             274944, 275068],
            dtype='int64', length=5409),
 'PS-05': Int64Index([    45,     49,     53,     57,     90,    130,    202,    218,
                246,    247,
             ...
             274633, 274634, 274635, 274666, 274820, 274828, 274978, 274987,
             275022, 275073],
            dtype='int64', length=3969),
 'SLU-01': Int64Index([   111,    142,    143,    147,    152,    153,    220,    308,
                312,    370,
             ...
             274686, 274722, 274768, 274769, 274883, 274884, 274997, 275041,
             275059, 275071],
            dtype='int64', length=7084),
 'SLU-02': Int64Index([   137,    181,    296,    397,    427,    458,    464,    500,
                530,    531,
             ...
             274678, 274684, 274687, 274747, 274753, 274833, 274835, 274885,
             274899, 274991],
            dtype='int64', length=7018),
 'SLU-04': Int64Index([   213,    245,    273,    288,    291,    295,    316,    432,
                589,    639,
             ...
             274185, 274309, 274311, 274415, 274493, 274711, 274887, 274941,
             275027, 275048],
            dtype='int64', length=5226),
 'SLU-07': Int64Index([   368,    454,    551,    552,    575,    577,    633,    648,
                735,    741,
             ...
             274607, 274608, 274647, 274715, 274716, 274771, 274845, 274898,
             274946, 275040],
            dtype='int64', length=6339),
 'SLU-15': Int64Index([   102,    178,    232,    243,    284,    287,    292,    313,
                318,    338,
             ...
             274808, 274863, 274897, 274923, 274953, 274967, 274985, 275036,
             275037, 275061],
            dtype='int64', length=9741),
 'SLU-16': Int64Index([   391,    406,    420,    448,    486,    487,    532,    536,
                537,    538,
             ...
             274099, 274103, 274296, 274442, 274602, 274720, 274763, 274764,
             274867, 274896],
            dtype='int64', length=5045),
 'SLU-18': Int64Index([   103,    320,    359,    446,    544,    556,    565,    566,
                591,    614,
             ...
             209477, 209600, 209625, 209663, 209671, 209907, 209917, 209918,
             209929, 210002],
            dtype='int64', length=3461),
 'SLU-19': Int64Index([   129,    280,    304,    350,    351,    353,    354,    457,
                493,    564,
             ...
             274407, 274512, 274590, 274651, 274701, 274702, 274703, 274841,
             274963, 275062],
            dtype='int64', length=7285),
 'SLU-20': Int64Index([ 79307,  79441,  79473,  79584,  79657,  79658,  79659,  79864,
              79868,  79994,
             ...
             273606, 274539, 274540, 274561, 274606, 274758, 274806, 274807,
             274918, 275039],
            dtype='int64', length=2452),
 'SLU-21': Int64Index([133364, 133365, 133388, 133620, 133621, 133744, 133745, 134178,
             134179, 135016,
             ...
             274136, 274193, 274497, 274612, 274698, 274801, 274960, 275030,
             275043, 275074],
            dtype='int64', length=1114),
 'SLU-22': Int64Index([210885, 210897, 210898, 210899, 210913, 210918, 211084, 211085,
             211264, 211318,
             ...
             274525, 274652, 274669, 274683, 274699, 274772, 274803, 274804,
             274842, 274866],
            dtype='int64', length=1748),
 'SLU-23': Int64Index([   192,    206,    224,    225,    226,    305,    306,    549,
                550,    635,
             ...
             275010, 275011, 275012, 275016, 275017, 275018, 275019, 275020,
             275021, 275023],
            dtype='int64', length=5739),
 'UD-01': Int64Index([    60,     61,     76,    177,    182,    208,    608,    942,
                943,   1054,
             ...
             274123, 274124, 274158, 274460, 274575, 274700, 274705, 274869,
             274881, 275025],
            dtype='int64', length=3889),
 'UD-02': Int64Index([    92,     97,    183,    193,    204,    240,    241,    543,
                654,    655,
             ...
             274286, 274287, 274369, 274418, 274502, 274815, 274919, 275024,
             275086, 275087],
            dtype='int64', length=1417),
 'UD-04': Int64Index([    96,    161,    184,    188,    260,    372,    499,    611,
                678,    891,
             ...
             274104, 274105, 274160, 274259, 274264, 274283, 274400, 274528,
             274659, 274870],
            dtype='int64', length=3534),
 'UD-07': Int64Index([   115,    116,    281,    469,    669,    696,    738,    904,
                963,   1040,
             ...
             273080, 273086, 273331, 273359, 273545, 273783, 274165, 274175,
             274281, 274759],
            dtype='int64', length=2429),
 'UW-01': Int64Index([   730,   1691,   1759,   2124,   2383,   2746,   3087,   3356,
               3404,   3510,
             ...
             142135, 142136, 142249, 142254, 142259, 143101, 144918, 145571,
             147714, 147773],
            dtype='int64', length=480),
 'UW-02': Int64Index([    72,     73,     74,     80,    421,    857,    964,   1026,
               1183,   1433,
             ...
             274078, 274079, 274088, 274089, 274300, 274301, 274537, 274586,
             274998, 275063],
            dtype='int64', length=2002),
 'UW-04': Int64Index([   187,    343,    375,    463,    477,    580,    673,    762,
                781,    833,
             ...
             274811, 274856, 274858, 274876, 274888, 274996, 275045, 275054,
             275069, 275070],
            dtype='int64', length=2688),
 'UW-06': Int64Index([   167,    272,    385,    631,    774,    951,   1011,   1048,
               1078,   1083,
             ...
             274454, 274499, 274562, 274617, 274742, 274773, 274786, 274802,
             274809, 274945],
            dtype='int64', length=2383),
 'UW-07': Int64Index([   121,    122,    215,    216,    365,    367,    404,    721,
               1090,   1141,
             ...
             273936, 273944, 273961, 274498, 274571, 274576, 274721, 274727,
             274947, 275046],
            dtype='int64', length=1905),
 'UW-10': Int64Index([   105,    124,    128,    314,    619,    896,    934,    935,
                999,   1006,
             ...
             238201, 238453, 238514, 238816, 238854, 239274, 239545, 240295,
             240446, 240775],
            dtype='int64', length=1175),
 'UW-11': Int64Index([150250, 150776, 151044, 151373, 151690, 152037, 152061, 153903,
             153941, 154905,
             ...
             273385, 273607, 273760, 273892, 273904, 274016, 274017, 274018,
             274473, 274675],
            dtype='int64', length=1237),
 'UW-12': Int64Index([241157, 241173, 241175, 241194, 241208, 241245, 241292, 241403,
             241435, 241447,
             ...
             274748, 274794, 274795, 274843, 274859, 274864, 274938, 274940,
             274999, 275055],
            dtype='int64', length=689),
 'WF-01': Int64Index([   133,    135,    297,    298,    300,    302,    307,    369,
                475,    514,
             ...
             274972, 274977, 274989, 275013, 275015, 275028, 275044, 275047,
             275078, 275079],
            dtype='int64', length=13038),
 'WF-03': Int64Index([226781, 226784, 226827, 227100, 227321, 227322, 227569, 227570,
             227768, 227769,
             ...
             274097, 274100, 274306, 274307, 274325, 274383, 274667, 274708,
             274909, 274934],
            dtype='int64', length=646),
 'WF-04': Int64Index([    64,     65,    132,    157,    203,    207,    236,    264,
                266,    267,
             ...
             274382, 274630, 274637, 274697, 274707, 274709, 274880, 274913,
             274981, 274986],
            dtype='int64', length=6271)}



In [100]:

    
df.loc[dfgroup.groups["CBD-03"],:]









    Out[100]:







  
    
      
      trip_id
      starttime
      stoptime
      bikeid
      tripduration
      from_station_name
      to_station_name
      from_station_id
      to_station_id
      usertype
      gender
      birthyear
    
  
  
    
      118
      562
      10/13/2014 01:30:00 PM
      10/13/2014 01:42:00 PM
      SEA00071
      713.859
      7th Ave & Union St
      Cal Anderson Park / 11th Ave & Pine St
      CBD-03
      CH-08
      Member
      Female
      1946.0
    
    
      119
      563
      10/13/2014 01:30:00 PM
      10/13/2014 01:42:00 PM
      SEA00190
      739.791
      7th Ave & Union St
      Cal Anderson Park / 11th Ave & Pine St
      CBD-03
      CH-08
      Member
      Female
      1949.0
    
    
      164
      617
      10/13/2014 02:18:00 PM
      10/13/2014 02:34:00 PM
      SEA00021
      970.064
      7th Ave & Union St
      Occidental Park / Occidental Ave S & S Washing...
      CBD-03
      PS-04
      Member
      Female
      1951.0
    
    
      229
      694
      10/13/2014 03:58:00 PM
      10/13/2014 04:05:00 PM
      SEA00456
      409.011
      7th Ave & Union St
      Republican St & Westlake Ave N
      CBD-03
      SLU-04
      Member
      Male
      1972.0
    
    
      275
      745
      10/13/2014 04:58:00 PM
      10/13/2014 05:10:00 PM
      SEA00475
      738.729
      7th Ave & Union St
      Lake Union Park / Westlake Ave & Aloha St
      CBD-03
      SLU-23
      Member
      Female
      1963.0
    
    
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
    
    
      274743
      295099
      03/30/2017 05:10:00 PM
      03/30/2017 05:19:00 PM
      SEA00484
      503.644
      7th Ave & Union St
      Fred Hutchinson Cancer Research Center / Fairv...
      CBD-03
      EL-01
      Member
      Male
      1981.0
    
    
      274774
      295130
      03/30/2017 06:12:00 PM
      03/30/2017 06:23:00 PM
      SEA00024
      670.540
      7th Ave & Union St
      3rd Ave & Broad St
      CBD-03
      BT-01
      Member
      Male
      1982.0
    
    
      274810
      295167
      03/30/2017 10:53:00 PM
      03/30/2017 10:54:00 PM
      SEA00374
      105.278
      7th Ave & Union St
      Union St & 4th Ave
      CBD-03
      CBD-04
      Member
      Male
      1956.0
    
    
      274915
      295279
      03/31/2017 01:26:00 PM
      03/31/2017 01:32:00 PM
      SEA00150
      391.883
      7th Ave & Union St
      REI / Yale Ave N & John St
      CBD-03
      SLU-01
      Member
      Male
      1969.0
    
    
      275053
      295428
      03/31/2017 07:24:00 PM
      03/31/2017 07:37:00 PM
      SEA00232
      777.493
      7th Ave & Union St
      E Blaine St & Fairview Ave E
      CBD-03
      EL-03
      Member
      Male
      1985.0
    
  

4822 rows × 12 columns

The simplest version of a groupby looks like this, and you can use almost any aggregation function you wish (mean, median, sum, minimum, maximum, standard deviation, count, etc.)

<data object>.groupby(<grouping values>).<aggregate>()

for example, we can group by gender and find the average of all numerical columns:



In [ ]:

    
df.groupby(gender).mean()

It's also possible to index the grouped object like it is a dataframe:



In [ ]:

You can even group by multiple values: for example we can look at the trip duration by time of day and by gender:



In [102]:

    
df.groupby(["from_station_id","bikeid"]).count()









    Out[102]:







  
    
      
      
      trip_id
      starttime
      stoptime
      tripduration
      from_station_name
      to_station_name
      to_station_id
      usertype
      gender
      birthyear
    
    
      from_station_id
      bikeid
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      BT-01
      SEA00001
      7
      7
      7
      7
      7
      7
      7
      7
      4
      4
    
    
      SEA00002
      8
      8
      8
      8
      8
      8
      8
      8
      7
      7
    
    
      SEA00005
      10
      10
      10
      10
      10
      10
      10
      10
      5
      5
    
    
      SEA00006
      9
      9
      9
      9
      9
      9
      9
      9
      5
      5
    
    
      SEA00007
      12
      12
      12
      12
      12
      12
      12
      12
      2
      2
    
    
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
    
    
      WF-04
      SEA00496
      3
      3
      3
      3
      3
      3
      3
      3
      0
      0
    
    
      SEA00497
      17
      17
      17
      17
      17
      17
      17
      17
      5
      5
    
    
      SEA00498
      12
      12
      12
      12
      12
      12
      12
      12
      1
      1
    
    
      SEA00499
      8
      8
      8
      8
      8
      8
      8
      8
      2
      2
    
    
      SEA00500
      19
      19
      19
      19
      19
      19
      19
      19
      1
      1
    
  

27354 rows × 10 columns

The unstack() operation can help make sense of this type of multiply-grouped data. What this technically does is split a multiple-valued index into an index plus columns:



In [ ]:

Visualizing data with `pandas`

Of course, looking at tables of data is not very intuitive. Fortunately Pandas has many useful plotting functions built-in, all of which make use of the matplotlib library to generate plots.

Whenever you do plotting in the IPython notebook, you will want to first run this magic command which configures the notebook to work well with plots:



In [1]:

    
%matplotlib inline

Now we can simply call the plot() method of any series or dataframe to get a reasonable view of the data:



In [5]:

    
import matplotlib.pyplot as plt
df['tripduration'].hist()









    Out[5]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f5456f9dd50>

Adjusting the Plot Style

Matplotlib has a number of plot styles you can use. For example, if you like R you might use the ggplot style:



In [4]:

    
plt.style.use("ggplot")

Other plot types

Pandas supports a range of other plotting types; you can find these by using the autocomplete on the plot method:



In [ ]:

    
plt.

For example, we can create a histogram of trip durations:



In [6]:

    
df["tripduration"].hist()









    Out[6]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f54556c1850>

If you'd like to adjust the x and y limits of the plot, you can use the set_xlim() and set_ylim() method of the resulting object:



In [ ]:

Breakout: Exploring the Data

Make a plot of the total number of rides as a function of month of the year (You'll need to extract the month, use a groupby, and find the appropriate aggregation to count the number in each group).



In [ ]:

Split this plot by gender. Do you see any seasonal ridership patterns by gender?



In [ ]:

Split this plot by user type. Do you see any seasonal ridership patterns by usertype?



In [ ]:

Repeat the above three steps, counting the number of rides by time of day rather that by month.



In [ ]:

Are there any other interesting insights you can discover in the data using these tools?



In [ ]:

Using Files

Writing and running python modules
Using python modules in your Jupyter Notebook



In [ ]:

    
# A script for creating a dataframe with counts of the occurrence of a columns' values
df_count = df.groupby('from_station_id').count()
df_count1 = df_count[['trip_id']]
df_count2 = df_count1.rename(columns={'trip_id': 'count'})



In [ ]:

    
df_count2.head()



In [ ]:

    
def make_table_count(df_arg, groupby_column):
    df_count = df_arg.groupby(groupby_column).count()
    column_name = df.columns[0]
    df_count1 = df_count[[column_name]]
    df_count2 = df_count1.rename(columns={column_name: 'count'})
    return df_count2



In [ ]:

    
dff = make_table_count(df, 'from_station_id')
dff.head()



In [ ]:

	trip_id	starttime	stoptime	bikeid	tripduration	from_station_name	to_station_name	from_station_id	to_station_id	usertype	gender	birthyear
0	431	10/13/2014 10:31:00 AM	10/13/2014 10:48:00 AM	SEA00298	985.935	2nd Ave & Spring St	Occidental Park / Occidental Ave S & S Washing...	CBD-06	PS-04	Member	Male	1960.0
1	432	10/13/2014 10:32:00 AM	10/13/2014 10:48:00 AM	SEA00195	926.375	2nd Ave & Spring St	Occidental Park / Occidental Ave S & S Washing...	CBD-06	PS-04	Member	Male	1970.0
2	433	10/13/2014 10:33:00 AM	10/13/2014 10:48:00 AM	SEA00486	883.831	2nd Ave & Spring St	Occidental Park / Occidental Ave S & S Washing...	CBD-06	PS-04	Member	Female	1988.0
3	434	10/13/2014 10:34:00 AM	10/13/2014 10:48:00 AM	SEA00333	865.937	2nd Ave & Spring St	Occidental Park / Occidental Ave S & S Washing...	CBD-06	PS-04	Member	Female	1977.0
4	435	10/13/2014 10:34:00 AM	10/13/2014 10:49:00 AM	SEA00202	923.923	2nd Ave & Spring St	Occidental Park / Occidental Ave S & S Washing...	CBD-06	PS-04	Member	Male	1971.0
...	...	...	...	...	...	...	...	...	...	...	...	...
275086	295463	03/31/2017 11:16:00 PM	04/01/2017 02:11:00 AM	SEA00347	10487.475	NE 42nd St & University Way NE	E Harrison St & Broadway Ave E	UD-02	CH-02	Short-Term Pass Holder	NaN	NaN
275087	295464	03/31/2017 11:16:00 PM	04/01/2017 02:11:00 AM	SEA00031	10487.366	NE 42nd St & University Way NE	E Harrison St & Broadway Ave E	UD-02	CH-02	Member	Female	1982.0
275088	295465	03/31/2017 11:44:00 PM	04/01/2017 12:04:00 AM	SEA00240	1205.656	Bellevue Ave & E Pine St	Lake Union Park / Westlake Ave & Aloha St	CH-12	SLU-23	Member	Male	1990.0
275089	295466	03/31/2017 11:46:00 PM	04/01/2017 12:04:00 AM	SEA00121	1115.658	Bellevue Ave & E Pine St	Lake Union Park / Westlake Ave & Aloha St	CH-12	SLU-23	Member	Female	1987.0
275090	295467	03/31/2017 11:52:00 PM	04/01/2017 02:33:00 AM	SEA00151	9667.679	Cal Anderson Park / 11th Ave & Pine St	Broadway and E Denny Way	CH-08	CH-16	Member	Male	1988.0

	trip_id	starttime	stoptime	bikeid	tripduration	from_station_name	to_station_name	from_station_id	to_station_id	usertype	gender	birthyear
68	506	10/13/2014 12:08:00 PM	10/13/2014 07:07:00 PM	SEA00155	25119.114	2nd Ave & Spring St	2nd Ave & Vine St	CBD-06	BT-03	Member	Male	1988.0
193	649	10/13/2014 02:54:00 PM	10/13/2014 05:59:00 PM	SEA00193	11144.159	NE 42nd St & University Way NE	NE 42nd St & University Way NE	UD-02	UD-02	Short-Term Pass Holder	NaN	NaN
408	908	10/13/2014 11:12:00 PM	10/14/2014 06:52:00 AM	SEA00230	27583.945	7th Ave & Union St	2nd Ave & Pine St	CBD-03	CBD-13	Short-Term Pass Holder	NaN	NaN
475	983	10/14/2014 09:12:00 AM	10/14/2014 01:20:00 PM	SEA00181	14885.176	Pier 69 / Alaskan Way & Clay St	Pier 69 / Alaskan Way & Clay St	WF-01	WF-01	Short-Term Pass Holder	NaN	NaN
512	1025	10/14/2014 10:49:00 AM	10/14/2014 05:19:00 PM	SEA00244	23346.911	E Pine St & 16th Ave	PATH / 9th Ave & Westlake Ave	CH-07	SLU-07	Member	Male	1986.0
...	...	...	...	...	...	...	...	...	...	...	...	...
274051	294382	03/25/2017 02:28:00 PM	03/25/2017 07:05:00 PM	SEA00494	16634.168	Cal Anderson Park / 11th Ave & Pine St	Pier 69 / Alaskan Way & Clay St	CH-08	WF-01	Short-Term Pass Holder	NaN	NaN
274052	294383	03/25/2017 02:28:00 PM	03/25/2017 07:05:00 PM	SEA00182	16639.182	Cal Anderson Park / 11th Ave & Pine St	Pier 69 / Alaskan Way & Clay St	CH-08	WF-01	Short-Term Pass Holder	NaN	NaN
274266	294604	03/27/2017 11:56:00 AM	03/27/2017 03:29:00 PM	SEA00291	12758.624	6th Ave S & S King St	NE 42nd St & University Way NE	ID-04	UD-02	Short-Term Pass Holder	NaN	NaN
275086	295463	03/31/2017 11:16:00 PM	04/01/2017 02:11:00 AM	SEA00347	10487.475	NE 42nd St & University Way NE	E Harrison St & Broadway Ave E	UD-02	CH-02	Short-Term Pass Holder	NaN	NaN
275087	295464	03/31/2017 11:16:00 PM	04/01/2017 02:11:00 AM	SEA00031	10487.366	NE 42nd St & University Way NE	E Harrison St & Broadway Ave E	UD-02	CH-02	Member	Female	1982.0

	trip_id	starttime	stoptime	bikeid	tripduration	from_station_name	to_station_name	to_station_id	usertype	gender	birthyear
from_station_id
BT-01	10463	10463	10463	10463	10463	10463	10463	10463	10463	4162	4162
BT-03	7334	7334	7334	7334	7334	7334	7334	7334	7334	4862	4862
BT-04	4666	4666	4666	4666	4666	4666	4666	4666	4666	3424	3424
BT-05	5699	5699	5699	5699	5699	5699	5699	5699	5699	2975	2975
BT-06	150	150	150	150	150	150	150	150	150	130	130

	trip_id	tripduration	birthyear
from_station_id
BT-01	147831.009844	1375.031203	1980.131427
BT-03	139404.294655	1019.200684	1976.505142
BT-04	157992.809687	891.095897	1979.877044
BT-05	139283.572381	1199.949481	1975.937479
BT-06	291807.953333	659.770547	1975.830769

	trip_id	starttime	stoptime	bikeid	tripduration	from_station_name	to_station_name	from_station_id	to_station_id	usertype	gender	birthyear
118	562	10/13/2014 01:30:00 PM	10/13/2014 01:42:00 PM	SEA00071	713.859	7th Ave & Union St	Cal Anderson Park / 11th Ave & Pine St	CBD-03	CH-08	Member	Female	1946.0
119	563	10/13/2014 01:30:00 PM	10/13/2014 01:42:00 PM	SEA00190	739.791	7th Ave & Union St	Cal Anderson Park / 11th Ave & Pine St	CBD-03	CH-08	Member	Female	1949.0
164	617	10/13/2014 02:18:00 PM	10/13/2014 02:34:00 PM	SEA00021	970.064	7th Ave & Union St	Occidental Park / Occidental Ave S & S Washing...	CBD-03	PS-04	Member	Female	1951.0
229	694	10/13/2014 03:58:00 PM	10/13/2014 04:05:00 PM	SEA00456	409.011	7th Ave & Union St	Republican St & Westlake Ave N	CBD-03	SLU-04	Member	Male	1972.0
275	745	10/13/2014 04:58:00 PM	10/13/2014 05:10:00 PM	SEA00475	738.729	7th Ave & Union St	Lake Union Park / Westlake Ave & Aloha St	CBD-03	SLU-23	Member	Female	1963.0
...	...	...	...	...	...	...	...	...	...	...	...	...
274743	295099	03/30/2017 05:10:00 PM	03/30/2017 05:19:00 PM	SEA00484	503.644	7th Ave & Union St	Fred Hutchinson Cancer Research Center / Fairv...	CBD-03	EL-01	Member	Male	1981.0
274774	295130	03/30/2017 06:12:00 PM	03/30/2017 06:23:00 PM	SEA00024	670.540	7th Ave & Union St	3rd Ave & Broad St	CBD-03	BT-01	Member	Male	1982.0
274810	295167	03/30/2017 10:53:00 PM	03/30/2017 10:54:00 PM	SEA00374	105.278	7th Ave & Union St	Union St & 4th Ave	CBD-03	CBD-04	Member	Male	1956.0
274915	295279	03/31/2017 01:26:00 PM	03/31/2017 01:32:00 PM	SEA00150	391.883	7th Ave & Union St	REI / Yale Ave N & John St	CBD-03	SLU-01	Member	Male	1969.0
275053	295428	03/31/2017 07:24:00 PM	03/31/2017 07:37:00 PM	SEA00232	777.493	7th Ave & Union St	E Blaine St & Fairview Ave E	CBD-03	EL-03	Member	Male	1985.0

		trip_id	starttime	stoptime	tripduration	from_station_name	to_station_name	to_station_id	usertype	gender	birthyear
from_station_id	bikeid
BT-01	SEA00001	7	7	7	7	7	7	7	7	4	4
	SEA00002	8	8	8	8	8	8	8	8	7	7
	SEA00005	10	10	10	10	10	10	10	10	5	5
	SEA00006	9	9	9	9	9	9	9	9	5	5
	SEA00007	12	12	12	12	12	12	12	12	2	2
...	...	...	...	...	...	...	...	...	...	...	...
WF-04	SEA00496	3	3	3	3	3	3	3	3	0	0
	SEA00497	17	17	17	17	17	17	17	17	5	5
	SEA00498	12	12	12	12	12	12	12	12	1	1
	SEA00499	8	8	8	8	8	8	8	8	2	2
	SEA00500	19	19	19	19	19	19	19	19	1	1

Software Engineering for Data Scientists

Sophisticated Data Manipulation

DATA 515 A

1. Python's Data Science Ecosystem

Python's Data Science Ecosystem

numpy: Numerical Python

scipy: Scientific Python

pandas: Labeled Data Manipulation in Python

matplotlib: Visualization in Python

2. Installation

Installing Pandas & friends

3. Arrays and slicing in Numpy

Lists in native Python

Converting to numpy Arrays

4. Introduction to Pandas DataFrames

Operations on a Pandas DataFrame

5. Manipulating Data with DataFrames

Downloading the data

Loading Data into a DataFrame

Viewing Pandas Dataframes

Sophisticated Data Manipulation

Working with Times

Simple Grouping of Data

Value Counts

Group-by Operation

Visualizing data with pandas

Adjusting the Plot Style

Other plot types

Breakout: Exploring the Data

Using Files

`numpy`: Numerical Python

`scipy`: Scientific Python

`pandas`: Labeled Data Manipulation in Python

`matplotlib`: Visualization in Python

Visualizing data with `pandas`