Software Engineering for Data Scientists

Manipulating Data with Python

CSE 599 B1

Today's Objectives

1. Opening & Navigating the IPython Notebook

2. Simple Math in the IPython Notebook

3. Loading data with `pandas`

4. Cleaning and Manipulating data with `pandas`

5. Visualizing data with `pandas`

1. Opening and Navigating the IPython Notebook

We will start today with the interactive environment that we will be using throughout the course: the IPython/Jupyter Notebook.

We will walk through the following steps together:

Download Miniconda (be sure to get the Python 3.5 version) and install it on your system (ideally you did this before coming to class).
Use the conda command-line tool to update your package listing and install the IPython notebook:

Update conda's listing of packages for your system:
```
$ conda update conda
```
Install IPython notebook and all its requirements
```
$ conda install ipython-notebook
```
Navigate to the directory containing the course material. For example:
```
$ cd ~/courses/CSE599/
```
You should see a number of files in the directory, including these:
```
$ ls
...
Breakout-Simple-Math.ipynb
CSE599_Lecture_2.ipynb
...
```
Type ipython notebook in the terminal to start the notebook
```
$ ipython notebook
```
If everything has worked correctly, this should automatically launch your default browser.
Click on CSE599_Lecture_2.ipynb to open the notebook containing the content for this lecture.

With that, you're set up to use the IPython notebook!

2. Simple Math in the IPython Notebook

Now that we have the IPython notebook up and running, we're going to do a short breakout exploring some of the mathematical functionality that Python offers.

Please open Breakout-Simple-Math.ipynb, find a partner, and make your way through that notebook, typing and executing code along the way.

3. Loading data with `pandas`

With this simple Python computation experience under our belt, we can now move to doing some more interesting analysis.

Python's Data Science Ecosystem

In addition to Python's built-in modules like the math module we explored above, there are also many often-used third-party modules that are core tools for doing data science with Python. Some of the most important ones are:

`numpy`: Numerical Python

NumPy is short for "Numerical Python", and contains tools for efficient manipulation of arrays of data. If you have used other computational tools like IDL or MATLAB, NumPy should feel very familiar.

`scipy`: Scientific Python

Scipy is short for "Scientific Python", and contains a wide range of functionality for accomplishing common scientific tasks, such as optimization/minimization, numerical integration, interpolation, and much more. We will not look closely at Scipy today, but we will use its functionality later in the course.

`pandas`: Labeled Data Manipulation in Python

Pandas is short for "Panel Data", and contains tools for doing more advanced manipulation of labeled data in Python, in particular with a columnar data structure called a Data Frame. If you've used the R statistical language (and in particular the so-called "Hadley Stack"), much of the functionality in Pandas should feel very familiar.

`matplotlib`: Visualization in Python

Matplotlib started out as a MATLAB plotting clone in Python, and has grown from there in the 15 years since its creation. It is the most popular data visualization tool currently in the Python data world (though other recent packages are starting to encroach on its monopoly).

Installing Pandas & friends

Because the above packages are not included in Python itself, you need to install them separately. While it is possible to install these from source (compiling the C and/or Fortran code that does the heavy lifting under the hood) it is much easier to use a package manager like conda. All it takes is to run

$ conda install numpy scipy pandas matplotlib

and (so long as your conda setup is working) the packages will be downloaded and installed on your system.
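A quick way to confirm the install worked is to ask Python whether each package can be found. The sketch below uses only the standard library; the helper name `check_packages` is made up for this example.

```python
import importlib.util


def check_packages(names):
    """Return a dict mapping each package name to True if Python can find it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


# The four packages installed above; a MISSING entry means the install
# did not reach the Python environment you are currently running.
status = check_packages(["numpy", "scipy", "pandas", "matplotlib"])
for name, found in status.items():
    print(name, "OK" if found else "MISSING")
```

If a package shows up as missing, a likely cause is that the notebook is running in a different conda environment from the one where you ran `conda install`.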

Loading Data with Pandas



In [1]:

    
import numpy
numpy.__path__









    Out[1]:





['/home/ubuntu/miniconda2/envs/python3/lib/python2.7/site-packages/numpy']



In [2]:

    
import pandas



In [5]:

    
df = pandas.DataFrame()

Because we'll use it so much, we often import under a shortened name using the import ... as ... pattern:



In [6]:

    
import pandas as pd



In [7]:

    
df = pd.DataFrame()

Now we can use the read_csv command to read the comma-separated-value data:



In [8]:

    
data = pd.read_csv('2015_trip_data.csv')

Note: strings in Python can be defined either with double quotes or single quotes
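If you do not have the trip file at hand, you can still try `read_csv` on a small piece of text. The sketch below uses two made-up rows laid out like a few columns of the trip data; `csv_text` and `sample` are names chosen for this example.

```python
import io

import pandas as pd

# A tiny made-up CSV with the same layout as a few columns of the trip data.
csv_text = """trip_id,starttime,tripduration,gender
431,10/13/2014 10:31,985.935,Male
433,10/13/2014 10:33,883.831,Female
"""

# read_csv accepts a file path or any file-like object, such as StringIO.
sample = pd.read_csv(io.StringIO(csv_text))
print(sample)
```

The first line of the text is used for the column names, and each following line becomes one row.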

Viewing Pandas Dataframes

The head() and tail() methods show us the first and last rows of the data



In [10]:

    
data.head()









    Out[10]:






  
    
      
|   | trip_id | starttime | stoptime | bikeid | tripduration | from_station_name | to_station_name | from_station_id | to_station_id | usertype | gender | birthyear |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 431 | 10/13/2014 10:31 | 10/13/2014 10:48 | SEA00298 | 985.935 | 2nd Ave & Spring St | Occidental Park / Occidental Ave S & S Washing... | CBD-06 | PS-04 | Annual Member | Male | 1960.0 |
| 1 | 432 | 10/13/2014 10:32 | 10/13/2014 10:48 | SEA00195 | 926.375 | 2nd Ave & Spring St | Occidental Park / Occidental Ave S & S Washing... | CBD-06 | PS-04 | Annual Member | Male | 1970.0 |
| 2 | 433 | 10/13/2014 10:33 | 10/13/2014 10:48 | SEA00486 | 883.831 | 2nd Ave & Spring St | Occidental Park / Occidental Ave S & S Washing... | CBD-06 | PS-04 | Annual Member | Female | 1988.0 |
| 3 | 434 | 10/13/2014 10:34 | 10/13/2014 10:48 | SEA00333 | 865.937 | 2nd Ave & Spring St | Occidental Park / Occidental Ave S & S Washing... | CBD-06 | PS-04 | Annual Member | Female | 1977.0 |
| 4 | 435 | 10/13/2014 10:34 | 10/13/2014 10:49 | SEA00202 | 923.923 | 2nd Ave & Spring St | Occidental Park / Occidental Ave S & S Washing... | CBD-06 | PS-04 | Annual Member | Male | 1971.0 |



In [11]:

    
data.tail()









    Out[11]:






  
    
      
|   | trip_id | starttime | stoptime | bikeid | tripduration | from_station_name | to_station_name | from_station_id | to_station_id | usertype | gender | birthyear |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 142841 | 156796 | 10/12/2015 20:41 | 10/12/2015 20:47 | SEA00358 | 377.183 | E Pine St & 16th Ave | Summit Ave & E Denny Way | CH-07 | CH-01 | Annual Member | Male | 1990.0 |
| 142842 | 156797 | 10/12/2015 20:43 | 10/12/2015 20:48 | SEA00399 | 303.330 | Bellevue Ave & E Pine St | Summit Ave E & E Republican St | CH-12 | CH-03 | Annual Member | Male | 1978.0 |
| 142843 | 156798 | 10/12/2015 21:03 | 10/12/2015 21:06 | SEA00204 | 165.597 | Harvard Ave & E Pine St | E Harrison St & Broadway Ave E | CH-09 | CH-02 | Annual Member | Male | 1989.0 |
| 142844 | 156799 | 10/12/2015 21:35 | 10/12/2015 21:41 | SEA00073 | 388.576 | Pine St & 9th Ave | 3rd Ave & Broad St | SLU-16 | BT-01 | Short-Term Pass Holder | NaN | NaN |
| 142845 | 156800 | 10/12/2015 22:45 | 10/12/2015 22:51 | SEA00033 | 391.885 | NE 42nd St & University Way NE | Eastlake Ave E & E Allison St | UD-02 | EL-05 | Annual Member | Male | 1985.0 |

The shape attribute shows us the number of rows and columns:



In [12]:

    
data.shape









    Out[12]:





(142846, 12)

The columns attribute gives us the column names



In [13]:

    
data.columns









    Out[13]:





Index([u'trip_id', u'starttime', u'stoptime', u'bikeid', u'tripduration',
       u'from_station_name', u'to_station_name', u'from_station_id',
       u'to_station_id', u'usertype', u'gender', u'birthyear'],
      dtype='object')

The index attribute gives us the row labels:



In [14]:

    
data.index









    Out[14]:





RangeIndex(start=0, stop=142846, step=1)

The dtypes attribute gives the data types of each column:



In [15]:

    
data.dtypes









    Out[15]:





trip_id                int64
starttime             object
stoptime              object
bikeid                object
tripduration         float64
from_station_name     object
to_station_name       object
from_station_id       object
to_station_id         object
usertype              object
gender                object
birthyear            float64
dtype: object
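The `float64` type for `birthyear` may look odd for a column of years. A likely explanation is the missing values (shown as `NaN` in the `tail()` output above): by default, `read_csv` stores a numeric column that contains missing entries as floating point. The sketch below shows this on made-up rows.

```python
import io

import pandas as pd

# Made-up rows: the second rider has no recorded birth year.
csv_text = """trip_id,birthyear
1,1985
2,
3,1990
"""

sample = pd.read_csv(io.StringIO(csv_text))

# trip_id has no gaps, so it is read as integers; birthyear has a missing
# value (NaN), which by default makes the whole column floating point.
print(sample.dtypes)

# Count how many birth years are missing.
print(sample["birthyear"].isnull().sum())
```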

4. Manipulating data with `pandas`

Here we'll cover some key features of manipulating data with pandas.

Access columns by name using square-bracket indexing:



In [17]:

    
data["trip_id"]









    Out[17]:





0            431
1            432
2            433
3            434
4            435
5            436
6            437
7            438
8            439
9            440
10           441
11           442
12           443
13           444
14           445
15           446
16           447
17           448
18           450
19           452
20           453
21           454
22           455
23           456
24           457
25           458
26           459
27           460
28           461
29           462
           ...  
142816    156770
142817    156771
142818    156772
142819    156773
142820    156774
142821    156775
142822    156776
142823    156778
142824    156779
142825    156780
142826    156781
142827    156782
142828    156783
142829    156784
142830    156785
142831    156786
142832    156787
142833    156788
142834    156789
142835    156790
142836    156791
142837    156792
142838    156793
142839    156794
142840    156795
142841    156796
142842    156797
142843    156798
142844    156799
142845    156800
Name: trip_id, dtype: int64

Mathematical operations on columns happen element-wise:



In [18]:

    
data['tripduration'] / 60









    Out[18]:





0         16.432250
1         15.439583
2         14.730517
3         14.432283
4         15.398717
5         13.480083
6          9.945250
7          9.868850
8          9.772450
9          9.793900
10         9.414983
11        10.335683
12        10.568117
13        10.238933
14        10.024383
15        10.313017
16        10.284750
17        10.000833
18         8.328900
19         9.588450
20         9.530117
21         9.396050
22         7.294817
23         8.052567
24         7.989700
25         8.892200
26         8.027150
27         7.679533
28         7.652750
29        11.340950
            ...    
142816     8.671450
142817    14.607983
142818     4.938550
142819    11.882017
142820    21.947550
142821    21.883067
142822    21.240667
142823    18.747133
142824    22.390117
142825    12.643100
142826     4.613267
142827     6.036600
142828     4.168183
142829     4.278083
142830     6.128367
142831     7.455983
142832     2.109933
142833     9.045133
142834     4.308700
142835    15.671783
142836    10.276817
142837    10.726967
142838     9.055200
142839     3.375017
142840     8.962267
142841     6.286383
142842     5.055500
142843     2.759950
142844     6.476267
142845     6.531417
Name: tripduration, dtype: float64

Columns can be created (or overwritten) with the assignment operator. Let's create a tripminutes column with the number of minutes for each trip



In [19]:

    
data['tripminutes'] = data['tripduration'] / 60



In [20]:

    
data.head()









    Out[20]:






  
    
      
|   | trip_id | starttime | stoptime | bikeid | tripduration | from_station_name | to_station_name | from_station_id | to_station_id | usertype | gender | birthyear | tripminutes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 431 | 10/13/2014 10:31 | 10/13/2014 10:48 | SEA00298 | 985.935 | 2nd Ave & Spring St | Occidental Park / Occidental Ave S & S Washing... | CBD-06 | PS-04 | Annual Member | Male | 1960.0 | 16.432250 |
| 1 | 432 | 10/13/2014 10:32 | 10/13/2014 10:48 | SEA00195 | 926.375 | 2nd Ave & Spring St | Occidental Park / Occidental Ave S & S Washing... | CBD-06 | PS-04 | Annual Member | Male | 1970.0 | 15.439583 |
| 2 | 433 | 10/13/2014 10:33 | 10/13/2014 10:48 | SEA00486 | 883.831 | 2nd Ave & Spring St | Occidental Park / Occidental Ave S & S Washing... | CBD-06 | PS-04 | Annual Member | Female | 1988.0 | 14.730517 |
| 3 | 434 | 10/13/2014 10:34 | 10/13/2014 10:48 | SEA00333 | 865.937 | 2nd Ave & Spring St | Occidental Park / Occidental Ave S & S Washing... | CBD-06 | PS-04 | Annual Member | Female | 1977.0 | 14.432283 |
| 4 | 435 | 10/13/2014 10:34 | 10/13/2014 10:49 | SEA00202 | 923.923 | 2nd Ave & Spring St | Occidental Park / Occidental Ave S & S Washing... | CBD-06 | PS-04 | Annual Member | Male | 1971.0 | 15.398717 |
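To see element-wise arithmetic and column assignment on numbers you can check by hand, here is a small sketch. The durations are made up, and `trips` and `is_long` are names chosen for this example.

```python
import pandas as pd

# Made-up trip durations in seconds.
trips = pd.DataFrame({"tripduration": [600.0, 90.0, 1800.0]})

# Dividing a column by a number applies the division to every row.
trips["tripminutes"] = trips["tripduration"] / 60

# Comparisons work element-wise too, giving a column of True/False values.
trips["is_long"] = trips["tripminutes"] > 15

print(trips)
```

Assigning to a name that already exists, such as `trips["tripminutes"]`, overwrites that column rather than adding a second one.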

Working with Times

A useful trick when working with columns of times is to convert them to a Pandas DatetimeIndex, which provides a convenient interface for time data:



In [21]:

    
times = pd.DatetimeIndex(data['starttime'])

With it, we can extract the hour of the day, the day of the week, the month, and a wide range of other views of the time:



In [23]:

    
times









    Out[23]:





DatetimeIndex(['2014-10-13 10:31:00', '2014-10-13 10:32:00',
               '2014-10-13 10:33:00', '2014-10-13 10:34:00',
               '2014-10-13 10:34:00', '2014-10-13 10:34:00',
               '2014-10-13 11:35:00', '2014-10-13 11:35:00',
               '2014-10-13 11:35:00', '2014-10-13 11:35:00',
               ...
               '2015-10-12 20:09:00', '2015-10-12 20:11:00',
               '2015-10-12 20:18:00', '2015-10-12 20:39:00',
               '2015-10-12 20:41:00', '2015-10-12 20:41:00',
               '2015-10-12 20:43:00', '2015-10-12 21:03:00',
               '2015-10-12 21:35:00', '2015-10-12 22:45:00'],
              dtype='datetime64[ns]', length=142846, freq=None)



In [24]:

    
times.dayofweek









    Out[24]:





array([0, 0, 0, ..., 0, 0, 0], dtype=int32)



In [25]:

    
times.month









    Out[25]:





array([10, 10, 10, ..., 10, 10, 10], dtype=int32)
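The sketch below repeats these steps on two made-up timestamps so the results are easy to verify against a calendar. Passing an explicit `format` to `pd.to_datetime` is an addition not used in the lecture; it assumes the month/day/year layout seen in the `starttime` column.

```python
import pandas as pd

# Two made-up timestamps in the same text format as the starttime column.
stamps = pd.Series(["10/13/2014 10:31", "10/17/2015 22:45"])

# Giving the format explicitly removes any ambiguity about month/day order.
sample_times = pd.DatetimeIndex(pd.to_datetime(stamps, format="%m/%d/%Y %H:%M"))

print(list(sample_times.hour))       # hour of the day, 0-23
print(list(sample_times.dayofweek))  # Monday is 0, Sunday is 6
print(list(sample_times.month))      # January is 1
```

October 13, 2014 was a Monday, which is why the first `dayofweek` values in the output above are 0.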

Note: mathematical functions from the NumPy package can be applied to whole columns. For example:



In [26]:

    
import numpy as np
np.exp(data['tripminutes'])









    Out[26]:





0         1.369101e+07
1         5.073712e+06
2         2.496791e+06
3         1.852939e+06
4         4.870546e+06
5         7.150325e+05
6         2.085294e+04
7         1.931911e+04
8         1.754370e+04
9         1.792407e+04
10        1.227087e+04
11        3.081273e+04
12        3.887539e+04
13        2.797127e+04
14        2.257015e+04
15        3.012217e+04
16        2.928264e+04
17        2.204483e+04
18        4.141859e+03
19        1.459523e+04
20        1.376820e+04
21        1.204073e+04
22        1.472647e+03
23        3.141849e+03
24        2.950412e+03
25        7.275007e+03
26        3.063000e+03
27        2.163610e+03
28        2.106430e+03
29        8.419998e+04
              ...     
142816    5.833952e+03
142817    2.208852e+06
142818    1.395677e+02
142819    1.446420e+05
142820    3.401730e+09
142821    3.189298e+09
142822    1.677661e+09
142823    1.386043e+08
142824    5.295465e+09
142825    3.096197e+05
142826    1.008129e+02
142827    4.184678e+02
142828    6.459799e+01
142829    7.210211e+01
142830    4.586864e+02
142831    1.730185e+03
142832    8.247691e+00
142833    8.477182e+03
142834    7.434378e+01
142835    6.399839e+06
142836    2.905125e+04
142837    4.556826e+04
142838    8.562950e+03
142839    2.922477e+01
142840    7.803024e+03
142841    5.372069e+02
142842    1.568830e+02
142843    1.579905e+01
142844    6.495415e+02
142845    6.863699e+02
Name: tripminutes, dtype: float64
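The exponential of a trip length is not a very meaningful quantity, so here is the same idea on made-up values where the answer is obvious. The index labels `a`, `b`, `c` are arbitrary.

```python
import numpy as np
import pandas as pd

# Made-up values with easy square roots.
minutes = pd.Series([1.0, 4.0, 9.0], index=["a", "b", "c"], name="tripminutes")

# NumPy functions operate on each element and return a new Series
# that keeps the original index labels and name.
roots = np.sqrt(minutes)
print(roots)
```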

Simple Grouping of Data

The real power of Pandas comes in its tools for grouping and aggregating data. Here we'll look at value counts and the basics of group-by operations.

Value Counts

Pandas includes an array of useful functionality for manipulating and analyzing tabular data. We'll take a look at two of these here.

The pandas.value_counts function returns the number of times each unique value occurs in a column.

We can use it, for example, to break down rides by gender:



In [27]:

    
pd.value_counts(data['gender'])









    Out[27]:





Male      67608
Female    18245
Other      1507
Name: gender, dtype: int64
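Note that these three counts add up to 87,360, well short of the 142,846 rows in the table. That is because missing values (`NaN`) are left out by default. The sketch below shows this on a made-up column, using the `value_counts()` method on the Series itself; as far as I know, newer pandas releases steer users toward this method form over the top-level `pd.value_counts` function.

```python
import numpy as np
import pandas as pd

# Made-up column with two missing entries.
gender = pd.Series(["Male", "Female", "Male", np.nan, "Male", np.nan])

# By default, rows with NaN are left out of the tally.
counts = gender.value_counts()

# With dropna=False, NaN gets its own row in the result.
counts_with_nan = gender.value_counts(dropna=False)

print(counts)
print(counts_with_nan)
```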



In [28]:

    
pd.value_counts(data['birthyear'])









    Out[28]:





1987.0    9320
1985.0    5370
1981.0    4779
1982.0    4629
1988.0    4188
1983.0    3965
1984.0    3815
1990.0    3605
1986.0    3492
1991.0    2912
1989.0    2755
1977.0    2465
1980.0    2236
1978.0    2063
1979.0    1976
1975.0    1969
1972.0    1921
1992.0    1798
1962.0    1769
1976.0    1577
1965.0    1510
1974.0    1497
1964.0    1374
1967.0    1354
1971.0    1279
1969.0    1185
1993.0    1126
1968.0    1085
1973.0    1076
1970.0    1056
1961.0     875
1963.0     828
1966.0     761
1959.0     696
1950.0     657
1994.0     549
1956.0     481
1960.0     429
1995.0     412
1955.0     397
1953.0     337
1951.0     251
1947.0     244
1957.0     224
1952.0     204
1958.0     160
1949.0     154
1954.0     152
1996.0     121
1945.0     115
1946.0      39
1948.0      34
1998.0      25
1939.0      23
1997.0      21
1943.0      11
1936.0       6
1999.0       5
1942.0       2
1944.0       1
Name: birthyear, dtype: int64

Or to break down rides by age:



In [29]:

    
pd.value_counts(data['birthyear']).sort_index()









    Out[29]:





1936.0       6
1939.0      23
1942.0       2
1943.0      11
1944.0       1
1945.0     115
1946.0      39
1947.0     244
1948.0      34
1949.0     154
1950.0     657
1951.0     251
1952.0     204
1953.0     337
1954.0     152
1955.0     397
1956.0     481
1957.0     224
1958.0     160
1959.0     696
1960.0     429
1961.0     875
1962.0    1769
1963.0     828
1964.0    1374
1965.0    1510
1966.0     761
1967.0    1354
1968.0    1085
1969.0    1185
1970.0    1056
1971.0    1279
1972.0    1921
1973.0    1076
1974.0    1497
1975.0    1969
1976.0    1577
1977.0    2465
1978.0    2063
1979.0    1976
1980.0    2236
1981.0    4779
1982.0    4629
1983.0    3965
1984.0    3815
1985.0    5370
1986.0    3492
1987.0    9320
1988.0    4188
1989.0    2755
1990.0    3605
1991.0    2912
1992.0    1798
1993.0    1126
1994.0     549
1995.0     412
1996.0     121
1997.0      21
1998.0      25
1999.0       5
Name: birthyear, dtype: int64



In [30]:

    
pd.value_counts(2015 - data['birthyear']).sort_index()









    Out[30]:





16.0       5
17.0      25
18.0      21
19.0     121
20.0     412
21.0     549
22.0    1126
23.0    1798
24.0    2912
25.0    3605
26.0    2755
27.0    4188
28.0    9320
29.0    3492
30.0    5370
31.0    3815
32.0    3965
33.0    4629
34.0    4779
35.0    2236
36.0    1976
37.0    2063
38.0    2465
39.0    1577
40.0    1969
41.0    1497
42.0    1076
43.0    1921
44.0    1279
45.0    1056
46.0    1185
47.0    1085
48.0    1354
49.0     761
50.0    1510
51.0    1374
52.0     828
53.0    1769
54.0     875
55.0     429
56.0     696
57.0     160
58.0     224
59.0     481
60.0     397
61.0     152
62.0     337
63.0     204
64.0     251
65.0     657
66.0     154
67.0      34
68.0     244
69.0      39
70.0     115
71.0       1
72.0      11
73.0       2
76.0      23
79.0       6
Name: birthyear, dtype: int64

What else might we break down rides by?



In [31]:

    
pd.value_counts(times.dayofweek)









    Out[31]:





3    21505
0    21266
4    21097
2    20748
1    20465
5    20358
6    17407
dtype: int64

By default the result is sorted by count. Passing sort=False turns that sorting off; to guarantee ordering by index, chain .sort_index() as we did above:



In [ ]:

    
pd.value_counts(times.dayofweek, sort=False)



In [ ]:

    
pd.value_counts(times.month)



In [ ]:

    
pd.value_counts(times.month, sort=False)
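The difference between ordering by count and ordering by index is easier to see on a handful of made-up values. This sketch uses the `value_counts()` method on a Series and `sort_index()`, both of which appear earlier in this section.

```python
import pandas as pd

# Made-up day-of-week values (Monday is 0).
days = pd.Series([4, 0, 4, 2, 4, 0])

# Default: most frequent value first.
by_count = days.value_counts()

# Chaining sort_index() orders the result by the value itself.
by_index = days.value_counts().sort_index()

print(by_count)
print(by_index)
```

For quantities with a natural order, such as day of week or month, the index-sorted view is usually the easier one to read.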

Group-by Operation

One of the killer features of the Pandas dataframe is the ability to do group-by operations. You can visualize the group-by like this (image borrowed from the Python Data Science Handbook):



In [32]:

    
from IPython.display import Image
Image('split_apply_combine.png')









    Out[32]:
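As the image file name suggests, a group-by can be thought of as three steps: split the rows into groups, apply a function to each group, and combine the results. The sketch below spells those steps out in plain Python on made-up rows. It illustrates the idea only and is not how Pandas implements it.

```python
from collections import defaultdict

# Made-up (key, value) rows, standing in for a table with two columns.
rows = [("A", 1), ("B", 2), ("C", 3), ("A", 4), ("B", 5), ("C", 6)]

# Split: collect the values belonging to each key.
groups = defaultdict(list)
for key, value in rows:
    groups[key].append(value)

# Apply: reduce each group to a single number (here, the sum).
# Combine: gather the per-group results into one mapping.
totals = {key: sum(values) for key, values in groups.items()}

print(totals)  # {'A': 5, 'B': 7, 'C': 9}
```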

Let's take this in smaller steps. First, let's look at the number of rides by hour of the day, across all days in the year.



In [33]:

    
pd.value_counts(times.hour)









    Out[33]:





17    14163
16    11629
8     10967
18    10382
15     9850
9      9751
13     9575
12     9571
14     9096
11     8864
10     7761
19     6939
7      6093
20     4792
21     3730
22     2484
6      1855
23     1749
0      1022
5       905
1       682
2       478
4       316
3       192
dtype: int64

groupby followed by count() gives, for each group (here, each hour of the day), the number of non-null values in every column:



In [34]:

    
data.groupby(times.hour).count()









    Out[34]:






  
    
      
|   | trip_id | starttime | stoptime | bikeid | tripduration | from_station_name | to_station_name | from_station_id | to_station_id | usertype | gender | birthyear | tripminutes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1022 | 1022 | 1022 | 1022 | 1022 | 1022 | 1022 | 1022 | 1022 | 1022 | 571 | 571 | 1022 |
| 1 | 682 | 682 | 682 | 682 | 682 | 682 | 682 | 682 | 682 | 682 | 341 | 341 | 682 |
| 2 | 478 | 478 | 478 | 478 | 478 | 478 | 478 | 478 | 478 | 478 | 178 | 178 | 478 |
| 3 | 192 | 192 | 192 | 192 | 192 | 192 | 192 | 192 | 192 | 192 | 89 | 89 | 192 |
| 4 | 316 | 316 | 316 | 316 | 316 | 316 | 316 | 316 | 316 | 316 | 241 | 241 | 316 |
| 5 | 905 | 905 | 905 | 905 | 905 | 905 | 905 | 905 | 905 | 905 | 731 | 731 | 905 |
| 6 | 1855 | 1855 | 1855 | 1855 | 1855 | 1855 | 1855 | 1855 | 1855 | 1855 | 1591 | 1591 | 1855 |
| 7 | 6093 | 6093 | 6093 | 6093 | 6093 | 6093 | 6093 | 6093 | 6093 | 6093 | 5428 | 5428 | 6093 |
| 8 | 10967 | 10967 | 10967 | 10967 | 10967 | 10967 | 10967 | 10967 | 10967 | 10967 | 9661 | 9661 | 10967 |
| 9 | 9751 | 9751 | 9751 | 9751 | 9751 | 9751 | 9751 | 9751 | 9751 | 9751 | 7708 | 7708 | 9751 |
| 10 | 7761 | 7761 | 7761 | 7761 | 7761 | 7761 | 7761 | 7761 | 7761 | 7761 | 4657 | 4657 | 7761 |
| 11 | 8864 | 8864 | 8864 | 8864 | 8864 | 8864 | 8864 | 8864 | 8864 | 8864 | 4735 | 4735 | 8864 |
| 12 | 9571 | 9571 | 9571 | 9571 | 9571 | 9571 | 9571 | 9571 | 9571 | 9571 | 4816 | 4816 | 9571 |
| 13 | 9575 | 9575 | 9575 | 9575 | 9575 | 9575 | 9575 | 9575 | 9575 | 9575 | 4349 | 4349 | 9575 |
| 14 | 9096 | 9096 | 9096 | 9096 | 9096 | 9096 | 9096 | 9096 | 9096 | 9096 | 3580 | 3580 | 9096 |
| 15 | 9850 | 9850 | 9850 | 9850 | 9850 | 9850 | 9850 | 9850 | 9850 | 9850 | 4348 | 4348 | 9850 |
| 16 | 11629 | 11629 | 11629 | 11629 | 11629 | 11629 | 11629 | 11629 | 11629 | 11629 | 6485 | 6485 | 11629 |
| 17 | 14163 | 14163 | 14163 | 14163 | 14163 | 14163 | 14163 | 14163 | 14163 | 14163 | 9588 | 9588 | 14163 |
| 18 | 10382 | 10382 | 10382 | 10382 | 10382 | 10382 | 10382 | 10382 | 10382 | 10382 | 6709 | 6709 | 10382 |
| 19 | 6939 | 6939 | 6939 | 6939 | 6939 | 6939 | 6939 | 6939 | 6939 | 6939 | 4198 | 4198 | 6939 |
| 20 | 4792 | 4792 | 4792 | 4792 | 4792 | 4792 | 4792 | 4792 | 4792 | 4792 | 2741 | 2741 | 4792 |
| 21 | 3730 | 3730 | 3730 | 3730 | 3730 | 3730 | 3730 | 3730 | 3730 | 3730 | 2229 | 2229 | 3730 |
| 22 | 2484 | 2484 | 2484 | 2484 | 2484 | 2484 | 2484 | 2484 | 2484 | 2484 | 1416 | 1416 | 2484 |
| 23 | 1749 | 1749 | 1749 | 1749 | 1749 | 1749 | 1749 | 1749 | 1749 | 1749 | 970 | 970 | 1749 |
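In the output above, the `gender` and `birthyear` counts are smaller than the others because `count()` tallies only non-null entries, and those two columns have missing values. If you want the number of rows in each group regardless of missing data, `size()` does that. The sketch below contrasts the two on made-up trips.

```python
import numpy as np
import pandas as pd

# Made-up trips: two start in hour 8, one in hour 17; one gender is missing.
trips = pd.DataFrame({
    "hour": [8, 8, 17],
    "gender": ["Male", np.nan, "Female"],
})

# count() skips NaN, so hour 8 reports only one gender value.
non_null = trips.groupby("hour")["gender"].count()

# size() counts every row in the group, missing values included.
n_rows = trips.groupby("hour").size()

print(non_null)
print(n_rows)
```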

Now, let's find the average length of a ride as a function of time of day:



In [35]:

    
data.groupby(times.hour)['tripminutes'].mean()









    Out[35]:





0     18.293162
1     16.812000
2     26.467510
3     22.643443
4     18.595762
5     13.565035
6     12.091993
7     12.378344
8     12.544350
9     15.175861
10    23.558911
11    25.645489
12    26.052903
13    27.878785
14    28.354453
15    26.164124
16    21.257375
17    17.388788
18    16.706635
19    16.886609
20    17.463562
21    15.227905
22    15.931296
23    16.006255
Name: tripminutes, dtype: float64

You can specify a groupby using the names of table columns and compute other functions, such as the mean.



In [36]:

    
data.groupby(['gender'])['tripminutes'].mean()









    Out[36]:





gender
Female    12.137525
Male       9.547313
Other     10.898911
Name: tripminutes, dtype: float64
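Grouping by a column name works with other aggregates as well. The sketch below uses made-up trips, and also shows `agg()`, which is not covered in this lecture but computes several aggregates in one call.

```python
import pandas as pd

# Made-up trips for two user types.
trips = pd.DataFrame({
    "usertype": ["Member", "Member", "Pass", "Pass"],
    "tripminutes": [10.0, 20.0, 30.0, 50.0],
})

# Select the column to aggregate once, then reuse the grouped object.
by_type = trips.groupby("usertype")["tripminutes"]

print(by_type.mean())
print(by_type.max())

# agg() computes several aggregates at once, one column per function.
summary = by_type.agg(["mean", "median", "count"])
print(summary)
```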

The simplest version of a groupby looks like this, and you can use almost any aggregation function you wish (mean, median, sum, minimum, maximum, standard deviation, count, etc.)

<data object>.groupby(<grouping values>).<aggregate>()

You can even group by multiple values: for example we can look at the trip duration by time of day and by gender:



In [ ]:

    
grouped = data.groupby([times.hour, 'gender'])['tripminutes'].mean()
grouped

The unstack() operation can help make sense of this type of multiply-grouped data. What this technically does is split a multiple-valued index into an index plus columns:



In [ ]:

    
grouped.unstack()
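Since the two cells above are shown without output, here is a small sketch on made-up trips showing what the grouped result and its unstacked form look like. The names `grouped_sample` and `table` are chosen for this example.

```python
import pandas as pd

# Made-up trips: one per (hour, gender) combination.
trips = pd.DataFrame({
    "hour": [8, 8, 17, 17],
    "gender": ["Female", "Male", "Female", "Male"],
    "tripminutes": [12.0, 10.0, 20.0, 16.0],
})

# Grouping by two keys gives a Series with a two-level index (hour, gender).
grouped_sample = trips.groupby(["hour", "gender"])["tripminutes"].mean()
print(grouped_sample)

# unstack() moves the inner level (gender) into the columns,
# leaving one row per hour and one column per gender.
table = grouped_sample.unstack()
print(table)
```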

5. Visualizing data with `pandas`

Of course, looking at tables of data is not very intuitive. Fortunately Pandas has many useful plotting functions built-in, all of which make use of the matplotlib library to generate plots.

Whenever you do plotting in the IPython notebook, you will want to first run this magic command which configures the notebook to work well with plots:



In [37]:

    
%matplotlib inline

Now we can simply call the plot() method of any series or dataframe to get a reasonable view of the data:



In [38]:

    
data.groupby([times.hour, 'usertype'])['tripminutes'].mean().unstack().plot()









    Out[38]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f54579d2710>

Adjusting the Plot Style

The default formatting is not very nice; I often make use of the Seaborn library for better plotting defaults.

First, install it from the shell:

$ conda install seaborn

Then run this in Python:

import seaborn
seaborn.set()
data.groupby([times.hour, 'usertype'])['tripminutes'].mean().unstack().plot()

Other plot types

Pandas supports a range of other plotting types; you can find these by using the autocomplete on the plot method:



In [39]:

    
data.plot.hist()









    Out[39]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f5457930b90>

For example, we can create a histogram of trip durations:



In [ ]:

    
data['tripminutes'].plot.hist(bins=100)

If you'd like to adjust the x and y limits of the plot, you can use the set_xlim() and set_ylim() methods of the resulting object:



In [ ]:

    
plot = data['tripminutes'].plot.hist(bins=500)
plot.set_xlim(0, 50)
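The object returned by `plot.hist()` is a matplotlib axes, so other axes methods such as `set_xlabel()` work on it too. The sketch below uses made-up trip lengths. Selecting the `Agg` backend is an assumption so that it runs as a plain script without a display; inside the notebook you would use `%matplotlib inline` instead.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: draw off-screen so no display is needed

import pandas as pd

# Made-up trip lengths in minutes, including one very long outlier.
minutes = pd.Series([5, 7, 8, 12, 15, 22, 480], name="tripminutes")

ax = minutes.plot.hist(bins=50)
ax.set_xlim(0, 50)        # zoom in on the short trips
ax.set_xlabel("minutes")  # label the x axis

print(ax.get_xlim())
```

Zooming in this way only changes the view. The outlier is still part of the data and still affects where the bin edges fall.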

Breakout: Exploring the Data

Make a plot of the total number of rides as a function of month of the year (You'll need to extract the month, use a groupby, and find the appropriate aggregation to count the number in each group).



In [ ]:

Split this plot by gender. Do you see any seasonal ridership patterns by gender?



In [ ]:

Split this plot by user type. Do you see any seasonal ridership patterns by usertype?



In [ ]:

Repeat the above three steps, counting the number of rides by time of day rather than by month.



In [ ]:

Are there any other interesting insights you can discover in the data using these tools?



In [ ]:



In [ ]:

Looking Forward to Homework

In the homework this week, you will have a chance to apply some of these patterns to a brand new (but closely related) dataset.
