Software Engineering for Data Scientists

Manipulating Data with Python

DATA 515A

Today's Objectives

1. Opening & Navigating the Jupyter Notebook

2. Simple Math in the Jupyter Notebook

3. Loading data with `pandas`

4. Cleaning and Manipulating data with `pandas`

5. Visualizing data with `pandas` & `matplotlib`

1. Opening and Navigating the Jupyter Notebook

We will start today with the interactive environment that we will be using often throughout the course: the Jupyter Notebook.

We will walk through the following steps together:

Download miniconda (be sure to get Version 3.6) and install it on your system (hopefully you have done this before coming to class)
Use the conda command-line tool to update your package listing and install the Jupyter notebook:

Update conda's listing of packages for your system:
```
$ conda update conda
```
Install the Jupyter notebook and all its requirements:
```
$ conda install jupyter notebook
```
Navigate to the directory containing the course material. For example:
```
$ cd ~/courses/CSE583/
```
You should see a number of files in the directory, including these:
```
$ ls
...
Breakout-Simple-Math.ipynb
CSE599_Lecture_2.ipynb
...
```
Type `jupyter notebook` in the terminal to start the notebook:
```
$ jupyter notebook
```
If everything has worked correctly, it should automatically launch your default browser.
Click on Lecture-Python-And-Data-Autumn-2017.ipynb to open the notebook containing the content for this lecture.

With that, you're set up to use the Jupyter notebook!

1.1 Some Theory

Components with the same capabilities are of the same type.
- For example, the numbers 2 and 200 are both integers.
A type can be defined recursively. Some examples:
- A list is a collection of objects that can be indexed by position.
- A list of integers contains an integer at each position.
A type has a set of supported operations. For example:
- Integers can be added
- Strings can be concatenated
- A table can find the names of its columns
  - What type is returned from the operation?
In Python, members (components and operations) are indicated by a '.'
- If a is a list, then a.append(1) adds 1 to the list.
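
As a minimal sketch of the points above (the variable names are illustrative, not part of the course material), the built-in `type` function reports the type of a value, and each type brings its own operations:

```python
# Values with the same capabilities share a type.
print(type(2), type(200))            # both report the int type

# Each type supports its own set of operations.
total = 2 + 200                      # integers can be added
phrase = "data" + " science"         # strings can be concatenated

# Members are reached with '.'; append is a member of every list.
numbers = [1, 2]
numbers.append(3)

# The type of a result depends on the operation that produced it.
print(type(total), type(phrase), type(numbers))
```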



In [1]:

    
a = 1



In [2]:

    
a_list = [1, 'a', [1,2]]



In [3]:

    
a_list.append(2)



In [4]:

    
a_list









    Out[4]:





[1, 'a', [1, 2], 2]



In [5]:

    
dir(a_list)









    Out[5]:





['__add__',
 '__class__',
 '__contains__',
 '__delattr__',
 '__delitem__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__iadd__',
 '__imul__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__mul__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__reversed__',
 '__rmul__',
 '__setattr__',
 '__setitem__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 'append',
 'clear',
 'copy',
 'count',
 'extend',
 'index',
 'insert',
 'pop',
 'remove',
 'reverse',
 'sort']



In [6]:

    
a_list.count(1)









    Out[6]:





1



In [7]:

    
a = 2

2. Simple Math in the Jupyter Notebook

Now that we have the Jupyter notebook up and running, we're going to do a short breakout exploring some of the mathematical functionality that Python offers.

Please open Breakout-Simple-Math.ipynb, find a partner, and make your way through that notebook, typing and executing code along the way.

3. Loading data with `pandas`

With this simple Python computation experience under our belt, we can now move to doing some more interesting analysis.

Python's Data Science Ecosystem

In addition to Python's built-in modules like the math module we explored above, there are also many often-used third-party modules that are core tools for doing data science with Python. Some of the most important ones are:

`numpy`: Numerical Python

Numpy is short for "Numerical Python", and contains tools for efficient manipulation of arrays of data. If you have used other computational tools like IDL or MATLAB, Numpy should feel very familiar.

`scipy`: Scientific Python

Scipy is short for "Scientific Python", and contains a wide range of functionality for accomplishing common scientific tasks, such as optimization/minimization, numerical integration, interpolation, and much more. We will not look closely at Scipy today, but we will use its functionality later in the course.

`pandas`: Labeled Data Manipulation in Python

Pandas is short for "Panel Data", and contains tools for doing more advanced manipulation of labeled data in Python, in particular with a columnar data structure called a DataFrame. If you've used the R statistical language (and in particular the so-called "Hadley Stack"), much of the functionality in Pandas should feel very familiar.

`matplotlib`: Visualization in Python

Matplotlib started out as a MATLAB plotting clone in Python, and has grown from there in the 15 years since its creation. It is the most popular data visualization tool currently in the Python data world (though other recent packages are starting to encroach on its monopoly).

Installing Pandas & friends

Because the above packages are not included in Python itself, you need to install them separately. While it is possible to install these from source (compiling the C and/or Fortran code that does the heavy lifting under the hood) it is much easier to use a package manager like conda. All it takes is to run

$ conda install numpy scipy pandas matplotlib

and (so long as your conda setup is working) the packages will be downloaded and installed on your system.

Downloading the data

Shell commands can be run from the notebook by preceding them with an exclamation point:



In [8]:

    
!ls









    



02-Python-and-Data.pdf
2015_trip_data.csv
Breakout-Simple-Math.ipynb
(Completed)Breakout-Simple-Math.ipynb
(Completed)Lecture-Python-And-Data.ipynb
Lecture-Python-And-Data-Autum-2016.ipynb
Lecture-Python-And-Data-Autumn-2017.ipynb
Lecture-Python-And-Data-CSE515A.ipynb
Lecture-Python-and-Data.ipynb
Lecture-Python-And-Data-Spring-2018.ipynb
Play With Notebooks.ipynb
pronto.csv
__pycache__
split_apply_combine.png
table_modifiers.py
Untitled.ipynb

Uncomment this to download the data:



In [40]:

    
#!curl -o pronto.csv https://data.seattle.gov/api/views/tw7j-dfaw/rows.csv?accessType=DOWNLOAD









    



  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 42.7M    0 42.7M    0     0  1901k      0 --:--:--  0:00:23 --:--:-- 2192k

Loading Data with Pandas

Because we'll use pandas so much, we often import it under a shortened name using the `import ... as ...` pattern:



In [10]:

    
import pandas as pd
df = pd.read_csv('pronto.csv')

Now we can use the read_csv function to read the comma-separated-value data:



In [ ]:

Note: strings in Python can be defined either with double quotes or single quotes
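
A small sketch of `read_csv` that runs without the downloaded file: it reads from an in-memory string instead of `pronto.csv`. The two data rows and the `sample` name are made up for illustration; only the call pattern matters.

```python
import io

import pandas as pd

# Hypothetical sample in the same comma-separated layout (made-up values).
csv_text = "trip_id,tripduration,usertype\n1,600.0,Member\n2,1200.0,Member\n"

# read_csv accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))
print(sample)

# Single and double quotes define the same string.
print('pronto.csv' == "pronto.csv")
```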

Viewing Pandas Dataframes

The head() and tail() methods show us the first and last rows of the data:



In [11]:

    
df.head()









    Out[11]:







  
    
      
      trip_id
      starttime
      stoptime
      bikeid
      tripduration
      from_station_name
      to_station_name
      from_station_id
      to_station_id
      usertype
      gender
      birthyear
    
  
  
    
      0
      431
      10/13/2014 10:31:00 AM
      10/13/2014 10:48:00 AM
      SEA00298
      985.935
      2nd Ave & Spring St
      Occidental Park / Occidental Ave S & S Washing...
      CBD-06
      PS-04
      Member
      Male
      1960.0
    
    
      1
      432
      10/13/2014 10:32:00 AM
      10/13/2014 10:48:00 AM
      SEA00195
      926.375
      2nd Ave & Spring St
      Occidental Park / Occidental Ave S & S Washing...
      CBD-06
      PS-04
      Member
      Male
      1970.0
    
    
      2
      433
      10/13/2014 10:33:00 AM
      10/13/2014 10:48:00 AM
      SEA00486
      883.831
      2nd Ave & Spring St
      Occidental Park / Occidental Ave S & S Washing...
      CBD-06
      PS-04
      Member
      Female
      1988.0
    
    
      3
      434
      10/13/2014 10:34:00 AM
      10/13/2014 10:48:00 AM
      SEA00333
      865.937
      2nd Ave & Spring St
      Occidental Park / Occidental Ave S & S Washing...
      CBD-06
      PS-04
      Member
      Female
      1977.0
    
    
      4
      435
      10/13/2014 10:34:00 AM
      10/13/2014 10:49:00 AM
      SEA00202
      923.923
      2nd Ave & Spring St
      Occidental Park / Occidental Ave S & S Washing...
      CBD-06
      PS-04
      Member
      Male
      1971.0



In [12]:

    
df.columns









    Out[12]:





Index(['trip_id', 'starttime', 'stoptime', 'bikeid', 'tripduration',
       'from_station_name', 'to_station_name', 'from_station_id',
       'to_station_id', 'usertype', 'gender', 'birthyear'],
      dtype='object')



In [ ]:

The shape attribute shows us the number of rows and columns:



In [13]:

    
df.shape









    Out[13]:





(275091, 12)

The columns attribute gives us the column names:



In [ ]:

The index attribute gives us the row labels:



In [ ]:
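
As a sketch, the same attributes can be tried on a small stand-in table. The `toy` DataFrame below is hypothetical and reuses the first three `trip_id` and `tripduration` values shown earlier.

```python
import pandas as pd

# Small stand-in for the full trip table.
toy = pd.DataFrame({
    "trip_id": [431, 432, 433],
    "tripduration": [985.935, 926.375, 883.831],
})

print(toy.columns)   # column labels
print(toy.index)     # row labels; a default integer range when none are given
print(toy.shape)     # (number of rows, number of columns)
```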

The dtypes attribute gives the data types of each column:



In [14]:

    
df.dtypes









    Out[14]:





trip_id                int64
starttime             object
stoptime              object
bikeid                object
tripduration         float64
from_station_name     object
to_station_name       object
from_station_id       object
to_station_id         object
usertype              object
gender                object
birthyear            float64
dtype: object

4. Manipulating data with `pandas`

Here we'll cover some key features of manipulating data with pandas.

Access columns by name using square-bracket indexing:



In [15]:

    
df_small = df['stoptime']



In [16]:

    
type(df_small)









    Out[16]:





pandas.core.series.Series

Mathematical operations on columns happen element-wise:



In [17]:

    
trip_duration_hours = df['tripduration']/3600
trip_duration_hours[:3]









    Out[17]:





0    0.273871
1    0.257326
2    0.245509
Name: tripduration, dtype: float64



In [18]:

    
df['trip_duration_hours'] = df['tripduration']/3600



In [19]:

    
del df['trip_duration_hours']



In [20]:

    
df.head()









    Out[20]:







  
    
      
      trip_id
      starttime
      stoptime
      bikeid
      tripduration
      from_station_name
      to_station_name
      from_station_id
      to_station_id
      usertype
      gender
      birthyear
    
  
  
    
      0
      431
      10/13/2014 10:31:00 AM
      10/13/2014 10:48:00 AM
      SEA00298
      985.935
      2nd Ave & Spring St
      Occidental Park / Occidental Ave S & S Washing...
      CBD-06
      PS-04
      Member
      Male
      1960.0
    
    
      1
      432
      10/13/2014 10:32:00 AM
      10/13/2014 10:48:00 AM
      SEA00195
      926.375
      2nd Ave & Spring St
      Occidental Park / Occidental Ave S & S Washing...
      CBD-06
      PS-04
      Member
      Male
      1970.0
    
    
      2
      433
      10/13/2014 10:33:00 AM
      10/13/2014 10:48:00 AM
      SEA00486
      883.831
      2nd Ave & Spring St
      Occidental Park / Occidental Ave S & S Washing...
      CBD-06
      PS-04
      Member
      Female
      1988.0
    
    
      3
      434
      10/13/2014 10:34:00 AM
      10/13/2014 10:48:00 AM
      SEA00333
      865.937
      2nd Ave & Spring St
      Occidental Park / Occidental Ave S & S Washing...
      CBD-06
      PS-04
      Member
      Female
      1977.0
    
    
      4
      435
      10/13/2014 10:34:00 AM
      10/13/2014 10:49:00 AM
      SEA00202
      923.923
      2nd Ave & Spring St
      Occidental Park / Occidental Ave S & S Washing...
      CBD-06
      PS-04
      Member
      Male
      1971.0



In [21]:

    
df.loc[[0,1],:]









    Out[21]:







  
    
      
      trip_id
      starttime
      stoptime
      bikeid
      tripduration
      from_station_name
      to_station_name
      from_station_id
      to_station_id
      usertype
      gender
      birthyear
    
  
  
    
      0
      431
      10/13/2014 10:31:00 AM
      10/13/2014 10:48:00 AM
      SEA00298
      985.935
      2nd Ave & Spring St
      Occidental Park / Occidental Ave S & S Washing...
      CBD-06
      PS-04
      Member
      Male
      1960.0
    
    
      1
      432
      10/13/2014 10:32:00 AM
      10/13/2014 10:48:00 AM
      SEA00195
      926.375
      2nd Ave & Spring St
      Occidental Park / Occidental Ave S & S Washing...
      CBD-06
      PS-04
      Member
      Male
      1970.0



In [22]:

    
df_long_trips = df[df['tripduration'] > 10000]



In [23]:

    
sel = df['tripduration'] > 10000
df_long_trips = df[sel]



In [24]:

    
len(df)









    Out[24]:





275091



In [25]:

    
# Make a copy of a slice
df_subset = df[['starttime', 'stoptime']].copy()
df_subset['trip_hours'] = df['tripduration']/3600

Columns can be created (or overwritten) with the assignment operator. Let's create a tripminutes column with the number of minutes for each trip:



In [ ]:
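
One possible way to do this, sketched on a hypothetical three-row table (`toy`) so that it runs without the full dataset. It assumes `tripduration` is in seconds, consistent with the division by 3600 used for hours above.

```python
import pandas as pd

# Stand-in table with the first three trip durations (in seconds).
toy = pd.DataFrame({"tripduration": [985.935, 926.375, 883.831]})

# Assigning to a new column name creates the column;
# assigning to an existing name would overwrite it.
toy["tripminutes"] = toy["tripduration"] / 60
print(toy)
```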

More complicated mathematical operations can be done with tools in the numpy package:



In [ ]:
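
For instance, numpy functions apply element-wise to a column and return a new column of the same length. This is a sketch on a hypothetical `durations` Series; the choice of `log10` and `sqrt` is illustrative only.

```python
import numpy as np
import pandas as pd

# Stand-in column with the first three trip durations (in seconds).
durations = pd.Series([985.935, 926.375, 883.831], name="tripduration")

# numpy functions operate on every element and keep the index.
log_durations = np.log10(durations)
root_durations = np.sqrt(durations)
print(log_durations)
print(root_durations)
```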

Working with Times

One trick to know when working with columns of times is that the Pandas DatetimeIndex provides a nice interface for them.

For a dataset of this size, using pd.to_datetime and specifying the date format can make things much faster (from the strftime reference, we see that the pronto data has format "%m/%d/%Y %I:%M:%S %p"):



In [ ]:
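
A minimal sketch of the explicit-format conversion, using two of the `starttime` strings shown earlier in place of the full column (the `start` and `parsed` names are illustrative):

```python
import pandas as pd

# Two starttime strings from the first rows of the table.
start = pd.Series(["10/13/2014 10:31:00 AM", "10/13/2014 10:32:00 AM"])

# Passing the format explicitly avoids per-value format guessing.
fmt = "%m/%d/%Y %I:%M:%S %p"
parsed = pd.to_datetime(start, format=fmt)
print(parsed)
```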

(Note: you can also use infer_datetime_format=True in most cases to automatically infer the correct format, though due to a bug it doesn't work when AM/PM are present)

With it, we can extract the hour of the day, the day of the week, the month, and a wide range of other views of the time:



In [ ]:



In [ ]:



In [ ]:
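
A sketch of these views on a two-element DatetimeIndex built from times shown earlier; `times` is an illustrative name, and on the real data the index would be built from the whole `starttime` column.

```python
import pandas as pd

# pd.to_datetime on a list returns a DatetimeIndex.
times = pd.to_datetime(
    ["10/13/2014 10:31:00 AM", "10/13/2014 10:48:00 AM"],
    format="%m/%d/%Y %I:%M:%S %p",
)

print(times.hour)        # hour of the day, 0 to 23
print(times.dayofweek)   # day of the week, with Monday as 0
print(times.month)       # month, 1 to 12
```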

Simple Grouping of Data

The real power of Pandas comes in its tools for grouping and aggregating data. Here we'll look at value counts and the basics of group-by operations.

Value Counts

Pandas includes an array of useful functionality for manipulating and analyzing tabular data. We'll take a look at two of these here.

The value_counts method returns the number of times each unique value occurs in a column.

We can use it, for example, to break down rides by gender:



In [ ]:
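
Sketched on the five `gender` values from the first rows shown earlier (on the real data the column would be `df['gender']`):

```python
import pandas as pd

# Gender values from the first five rows of the table.
gender = pd.Series(["Male", "Male", "Female", "Female", "Male"], name="gender")

# One row per unique value, most frequent first.
counts = gender.value_counts()
print(counts)
```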

Or to break down rides by age:



In [ ]:
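
One way this might look, as a sketch only: age is derived from `birthyear` using a reference year of 2015, which is an assumption rather than something stated in the data, and the birth years below are made up. Missing birth years are skipped, since `value_counts` drops missing values by default.

```python
import numpy as np
import pandas as pd

# Hypothetical birth years, including one missing value.
birthyear = pd.Series([1960.0, 1970.0, 1988.0, 1970.0, np.nan], name="birthyear")

reference_year = 2015            # assumed reference year for computing age
age = reference_year - birthyear

# Missing ages are excluded from the counts by default.
age_counts = age.value_counts()
print(age_counts)
```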

By default, the result is sorted by count rather than by index. Use sort=False to turn this behavior off:



In [ ]:
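
A sketch of the difference on a hypothetical Series of labels. `sort_index` is included as one way to order the result by label rather than by count.

```python
import pandas as pd

# Made-up labels for illustration.
labels = pd.Series(["b", "a", "a", "c", "a", "c"])

sorted_counts = labels.value_counts()              # most frequent first
unsorted_counts = labels.value_counts(sort=False)  # not ordered by count
by_label = labels.value_counts().sort_index()      # ordered by label

print(sorted_counts)
print(unsorted_counts)
print(by_label)
```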

We can explore other things as well: day of week, hour of day, etc.



In [ ]:

Group-by Operation

One of the killer features of the Pandas dataframe is the ability to do group-by operations. You can visualize the group-by as a split-apply-combine operation (image borrowed from the Python Data Science Handbook).
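
To make the idea concrete, here is a sketch that performs the split, apply, and combine steps by hand and then repeats them with a single `groupby` call. The `toy` table and its durations are made up; only the station ids come from the data shown earlier.

```python
import pandas as pd

# Hypothetical table: two stations, two trips each.
toy = pd.DataFrame({
    "from_station_id": ["CBD-06", "PS-04", "CBD-06", "PS-04"],
    "tripduration": [600.0, 300.0, 900.0, 500.0],
})

# By hand: split the rows by station, apply a mean, combine the results.
pieces = {}
for station in toy["from_station_id"].unique():
    subset = toy[toy["from_station_id"] == station]   # split
    pieces[station] = subset["tripduration"].mean()   # apply
manual = pd.Series(pieces)                            # combine

# The same computation as one group-by operation.
grouped = toy.groupby("from_station_id")["tripduration"].mean()
print(manual)
print(grouped)
```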



In [26]:

    
df.head()









    Out[26]:







  
    
      
      trip_id
      starttime
      stoptime
      bikeid
      tripduration
      from_station_name
      to_station_name
      from_station_id
      to_station_id
      usertype
      gender
      birthyear
    
  
  
    
      0
      431
      10/13/2014 10:31:00 AM
      10/13/2014 10:48:00 AM
      SEA00298
      985.935
      2nd Ave & Spring St
      Occidental Park / Occidental Ave S & S Washing...
      CBD-06
      PS-04
      Member
      Male
      1960.0
    
    
      1
      432
      10/13/2014 10:32:00 AM
      10/13/2014 10:48:00 AM
      SEA00195
      926.375
      2nd Ave & Spring St
      Occidental Park / Occidental Ave S & S Washing...
      CBD-06
      PS-04
      Member
      Male
      1970.0
    
    
      2
      433
      10/13/2014 10:33:00 AM
      10/13/2014 10:48:00 AM
      SEA00486
      883.831
      2nd Ave & Spring St
      Occidental Park / Occidental Ave S & S Washing...
      CBD-06
      PS-04
      Member
      Female
      1988.0
    
    
      3
      434
      10/13/2014 10:34:00 AM
      10/13/2014 10:48:00 AM
      SEA00333
      865.937
      2nd Ave & Spring St
      Occidental Park / Occidental Ave S & S Washing...
      CBD-06
      PS-04
      Member
      Female
      1977.0
    
    
      4
      435
      10/13/2014 10:34:00 AM
      10/13/2014 10:49:00 AM
      SEA00202
      923.923
      2nd Ave & Spring St
      Occidental Park / Occidental Ave S & S Washing...
      CBD-06
      PS-04
      Member
      Male
      1971.0



In [27]:

    
df_count = df.groupby(['from_station_id']).count()
df_count.head()









    Out[27]:







  
    
      
      trip_id
      starttime
      stoptime
      bikeid
      tripduration
      from_station_name
      to_station_name
      to_station_id
      usertype
      gender
      birthyear
    
    
      from_station_id
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      BT-01
      10463
      10463
      10463
      10463
      10463
      10463
      10463
      10463
      10463
      4162
      4162
    
    
      BT-03
      7334
      7334
      7334
      7334
      7334
      7334
      7334
      7334
      7334
      4862
      4862
    
    
      BT-04
      4666
      4666
      4666
      4666
      4666
      4666
      4666
      4666
      4666
      3424
      3424
    
    
      BT-05
      5699
      5699
      5699
      5699
      5699
      5699
      5699
      5699
      5699
      2975
      2975
    
    
      BT-06
      150
      150
      150
      150
      150
      150
      150
      150
      150
      130
      130



In [28]:

    
df_count1 = df_count[['trip_id']]
df_count2 = df_count1.rename(columns={'trip_id': 'count'})
df_count2['new'] = 1
df_count2.head()









    Out[28]:







  
    
      
      count
      new
    
    
      from_station_id
      
      
    
  
  
    
      BT-01
      10463
      1
    
    
      BT-03
      7334
      1
    
    
      BT-04
      4666
      1
    
    
      BT-05
      5699
      1
    
    
      BT-06
      150
      1



In [29]:

    
# Average only the numeric columns; string columns cannot be averaged
df_mean = df.groupby(['from_station_id']).mean(numeric_only=True)
df_mean.head()









    Out[29]:







  
    
      
      trip_id
      tripduration
      birthyear
    
    
      from_station_id
      
      
      
    
  
  
    
      BT-01
      147831.009844
      1375.031203
      1980.131427
    
    
      BT-03
      139404.294655
      1019.200684
      1976.505142
    
    
      BT-04
      157992.809687
      891.095897
      1979.877044
    
    
      BT-05
      139283.572381
      1199.949481
      1975.937479
    
    
      BT-06
      291807.953333
      659.770547
      1975.830769



In [30]:

    
dfgroup = df.groupby(['from_station_id'])
dfgroup.groups









    Out[30]:





{'BT-01': Int64Index([   217,    227,    228,    282,    283,    310,    326,    327,
                329,    331,
             ...
             274971, 274973, 274974, 274975, 274976, 274979, 275032, 275033,
             275075, 275076],
            dtype='int64', length=10463),
 'BT-03': Int64Index([    87,     88,    230,    261,    366,    407,    414,    439,
                453,    754,
             ...
             268122, 268181, 268307, 268318, 268319, 268391, 268392, 268467,
             268527, 268528],
            dtype='int64', length=7334),
 'BT-04': Int64Index([    66,     67,     94,    104,    108,    166,    233,    259,
                322,    333,
             ...
             274350, 274361, 274424, 274704, 274789, 274970, 275009, 275064,
             275065, 275083],
            dtype='int64', length=4666),
 'BT-05': Int64Index([   110,    413,    426,    513,    585,    618,    744,    753,
                795,   1003,
             ...
             274547, 274605, 274610, 274621, 274817, 274847, 274910, 274911,
             275029, 275034],
            dtype='int64', length=5699),
 'BT-06': Int64Index([268581, 268642, 268667, 268718, 268735, 268781, 268897, 268903,
             268914, 268961,
             ...
             274777, 274778, 274865, 274931, 274937, 274949, 274951, 274956,
             274962, 274965],
            dtype='int64', length=150),
 'CBD-03': Int64Index([   118,    119,    164,    229,    275,    285,    328,    339,
                356,    357,
             ...
             274574, 274594, 274599, 274622, 274672, 274743, 274774, 274810,
             274915, 275053],
            dtype='int64', length=4822),
 'CBD-04': Int64Index([105392, 105458, 105467, 105472, 105614, 105615, 105835, 105836,
             105855, 105858,
             ...
             274730, 274922, 274924, 274925, 274926, 274927, 274952, 274958,
             274983, 275080],
            dtype='int64', length=3440),
 'CBD-05': Int64Index([    54,     79,     95,    148,    149,    150,    151,    165,
                211,    219,
             ...
             274081, 274085, 274086, 274138, 274139, 274623, 274624, 274692,
             274814, 274906],
            dtype='int64', length=5068),
 'CBD-06': Int64Index([     0,      1,      2,      3,      4,      5,     63,     68,
                 70,     71,
             ...
             274534, 274544, 274570, 274671, 274744, 274765, 274797, 274836,
             274838, 274980],
            dtype='int64', length=4911),
 'CBD-07': Int64Index([    42,     69,     78,    141,    196,    269,    478,    510,
                522,    542,
             ...
             274317, 274413, 274439, 274467, 274579, 274673, 274688, 274752,
             274879, 274907],
            dtype='int64', length=3263),
 'CBD-13': Int64Index([    99,    139,    198,    249,    276,    334,    381,    388,
                424,    442,
             ...
             274444, 274535, 274555, 274613, 274726, 274780, 274903, 274916,
             274966, 274968],
            dtype='int64', length=9067),
 'CD-01': Int64Index([ 68531,  68532,  69169,  69170,  69954,  70367,  70529,  70546,
              70621,  70622,
             ...
             224878, 225305, 225422, 225427, 225492, 225493, 225587, 225666,
             226277, 226597],
            dtype='int64', length=958),
 'CH-01': Int64Index([   256,    355,    382,    416,    437,    444,    502,    503,
                600,    679,
             ...
             274585, 274632, 274641, 274656, 274670, 274827, 274829, 274871,
             274933, 274995],
            dtype='int64', length=6409),
 'CH-02': Int64Index([    55,     56,     58,     83,    113,    126,    127,    154,
                162,    163,
             ...
             274450, 274488, 274593, 274660, 274668, 274868, 274917, 275067,
             275081, 275082],
            dtype='int64', length=8546),
 'CH-03': Int64Index([   290,    417,    428,    435,    436,    452,    494,    516,
                640,    665,
             ...
             274625, 274650, 274658, 274693, 274822, 274844, 274872, 274893,
             274894, 274950],
            dtype='int64', length=6218),
 'CH-05': Int64Index([   134,    195,    205,    248,    250,    251,    315,    324,
                337,    390,
             ...
             274457, 274597, 274714, 274850, 274855, 274873, 274889, 274932,
             274936, 274993],
            dtype='int64', length=6948),
 'CH-06': Int64Index([   212,    253,    277,    278,    279,    403,    449,    504,
                684,    881,
             ...
             274679, 274745, 274746, 274750, 274751, 274781, 274788, 274834,
             274839, 274875],
            dtype='int64', length=3765),
 'CH-07': Int64Index([   146,    210,    299,    341,    374,    377,    401,    415,
                431,    466,
             ...
             274832, 274846, 274890, 274891, 274892, 274955, 275002, 275049,
             275060, 275077],
            dtype='int64', length=11568),
 'CH-08': Int64Index([   120,    136,    144,    158,    159,    242,    262,    294,
                311,    321,
             ...
             274824, 274848, 274853, 274904, 274905, 274935, 274943, 274982,
             275006, 275090],
            dtype='int64', length=8573),
 'CH-09': Int64Index([   101,    168,    222,    349,    380,    467,    567,    578,
                628,    647,
             ...
             274357, 274466, 274468, 274475, 274500, 274595, 274611, 274732,
             274895, 275066],
            dtype='int64', length=5246),
 'CH-12': Int64Index([   319,    384,    411,    441,    451,    462,    540,    554,
                558,    605,
             ...
             274577, 274603, 274609, 274615, 274631, 274680, 275050, 275056,
             275088, 275089],
            dtype='int64', length=5857),
 'CH-15': Int64Index([   109,    160,    244,    340,    402,    430,    459,    468,
                723,    724,
             ...
             274674, 274756, 274791, 274812, 274826, 274840, 274852, 274857,
             274921, 274969],
            dtype='int64', length=6550),
 'CH-16': Int64Index([175075, 175093, 175108, 175126, 175127, 175131, 175136, 175144,
             175145, 175210,
             ...
             274629, 274805, 274816, 274825, 274854, 274920, 275035, 275042,
             275051, 275072],
            dtype='int64', length=2089),
 'DPD-01': Int64Index([    59,     91,     93,    289,    568,    644,    667,    740,
                839,    974,
             ...
             274530, 274604, 274665, 274681, 274798, 274823, 274939, 274957,
             274961, 275026],
            dtype='int64', length=4822),
 'DPD-03': Int64Index([   131,    197,    345,    347,    727,    844,   1075,   1144,
               1347,   1430,
             ...
             273047, 273048, 274048, 274092, 274776, 274800, 275001, 275003,
             275004, 275005],
            dtype='int64', length=1423),
 'EL-01': Int64Index([   199,    400,    700,    702,    715,    716,    769,   1175,
               1350,   1351,
             ...
             274676, 274718, 274731, 274770, 274837, 274928, 274929, 274948,
             274959, 275038],
            dtype='int64', length=3604),
 'EL-03': Int64Index([   344,    358,    360,    425,    492,    583,    927,   1027,
               1071,   1110,
             ...
             274757, 274861, 274862, 274882, 274984, 274988, 274990, 274994,
             275031, 275052],
            dtype='int64', length=5788),
 'EL-05': Int64Index([   200,    201,    447,    456,    488,    615,    646,    694,
                763,    858,
             ...
             274019, 274157, 274162, 274253, 274368, 274477, 274584, 274725,
             274877, 274878],
            dtype='int64', length=3400),
 'FH-01': Int64Index([   100,    231,    325,    330,    373,    386,    455,    485,
                505,    521,
             ...
             173748, 173988, 174253, 174384, 174549, 174647, 174657, 174690,
             174986, 175005],
            dtype='int64', length=2349),
 'FH-04': Int64Index([   364,    371,    392,    396,    460,    482,    529,    950,
                970,    984,
             ...
             274428, 274519, 274600, 274640, 274646, 274648, 274849, 274851,
             274964, 275000],
            dtype='int64', length=4208),
 'ID-04': Int64Index([    89,    123,    155,    156,    169,    170,    214,    223,
                237,    309,
             ...
             274353, 274445, 274548, 274792, 274930, 275014, 275057, 275058,
             275084, 275085],
            dtype='int64', length=2474),
 'PS-04': Int64Index([     6,      7,      8,      9,     10,     11,     12,     13,
                 14,     15,
             ...
             274446, 274471, 274572, 274734, 274766, 274874, 274901, 274902,
             274944, 275068],
            dtype='int64', length=5409),
 'PS-05': Int64Index([    45,     49,     53,     57,     90,    130,    202,    218,
                246,    247,
             ...
             274633, 274634, 274635, 274666, 274820, 274828, 274978, 274987,
             275022, 275073],
            dtype='int64', length=3969),
 'SLU-01': Int64Index([   111,    142,    143,    147,    152,    153,    220,    308,
                312,    370,
             ...
             274686, 274722, 274768, 274769, 274883, 274884, 274997, 275041,
             275059, 275071],
            dtype='int64', length=7084),
 'SLU-02': Int64Index([   137,    181,    296,    397,    427,    458,    464,    500,
                530,    531,
             ...
             274678, 274684, 274687, 274747, 274753, 274833, 274835, 274885,
             274899, 274991],
            dtype='int64', length=7018),
 'SLU-04': Int64Index([   213,    245,    273,    288,    291,    295,    316,    432,
                589,    639,
             ...
             274185, 274309, 274311, 274415, 274493, 274711, 274887, 274941,
             275027, 275048],
            dtype='int64', length=5226),
 'SLU-07': Int64Index([   368,    454,    551,    552,    575,    577,    633,    648,
                735,    741,
             ...
             274607, 274608, 274647, 274715, 274716, 274771, 274845, 274898,
             274946, 275040],
            dtype='int64', length=6339),
 'SLU-15': Int64Index([   102,    178,    232,    243,    284,    287,    292,    313,
                318,    338,
             ...
             274808, 274863, 274897, 274923, 274953, 274967, 274985, 275036,
             275037, 275061],
            dtype='int64', length=9741),
 'SLU-16': Int64Index([   391,    406,    420,    448,    486,    487,    532,    536,
                537,    538,
             ...
             274099, 274103, 274296, 274442, 274602, 274720, 274763, 274764,
             274867, 274896],
            dtype='int64', length=5045),
 'SLU-18': Int64Index([   103,    320,    359,    446,    544,    556,    565,    566,
                591,    614,
             ...
             209477, 209600, 209625, 209663, 209671, 209907, 209917, 209918,
             209929, 210002],
            dtype='int64', length=3461),
 'SLU-19': Int64Index([   129,    280,    304,    350,    351,    353,    354,    457,
                493,    564,
             ...
             274407, 274512, 274590, 274651, 274701, 274702, 274703, 274841,
             274963, 275062],
            dtype='int64', length=7285),
 'SLU-20': Int64Index([ 79307,  79441,  79473,  79584,  79657,  79658,  79659,  79864,
              79868,  79994,
             ...
             273606, 274539, 274540, 274561, 274606, 274758, 274806, 274807,
             274918, 275039],
            dtype='int64', length=2452),
 'SLU-21': Int64Index([133364, 133365, 133388, 133620, 133621, 133744, 133745, 134178,
             134179, 135016,
             ...
             274136, 274193, 274497, 274612, 274698, 274801, 274960, 275030,
             275043, 275074],
            dtype='int64', length=1114),
 'SLU-22': Int64Index([210885, 210897, 210898, 210899, 210913, 210918, 211084, 211085,
             211264, 211318,
             ...
             274525, 274652, 274669, 274683, 274699, 274772, 274803, 274804,
             274842, 274866],
            dtype='int64', length=1748),
 'SLU-23': Int64Index([   192,    206,    224,    225,    226,    305,    306,    549,
                550,    635,
             ...
             275010, 275011, 275012, 275016, 275017, 275018, 275019, 275020,
             275021, 275023],
            dtype='int64', length=5739),
 'UD-01': Int64Index([    60,     61,     76,    177,    182,    208,    608,    942,
                943,   1054,
             ...
             274123, 274124, 274158, 274460, 274575, 274700, 274705, 274869,
             274881, 275025],
            dtype='int64', length=3889),
 'UD-02': Int64Index([    92,     97,    183,    193,    204,    240,    241,    543,
                654,    655,
             ...
             274286, 274287, 274369, 274418, 274502, 274815, 274919, 275024,
             275086, 275087],
            dtype='int64', length=1417),
 'UD-04': Int64Index([    96,    161,    184,    188,    260,    372,    499,    611,
                678,    891,
             ...
             274104, 274105, 274160, 274259, 274264, 274283, 274400, 274528,
             274659, 274870],
            dtype='int64', length=3534),
 'UD-07': Int64Index([   115,    116,    281,    469,    669,    696,    738,    904,
                963,   1040,
             ...
             273080, 273086, 273331, 273359, 273545, 273783, 274165, 274175,
             274281, 274759],
            dtype='int64', length=2429),
 'UW-01': Int64Index([   730,   1691,   1759,   2124,   2383,   2746,   3087,   3356,
               3404,   3510,
             ...
             142135, 142136, 142249, 142254, 142259, 143101, 144918, 145571,
             147714, 147773],
            dtype='int64', length=480),
 'UW-02': Int64Index([    72,     73,     74,     80,    421,    857,    964,   1026,
               1183,   1433,
             ...
             274078, 274079, 274088, 274089, 274300, 274301, 274537, 274586,
             274998, 275063],
            dtype='int64', length=2002),
 'UW-04': Int64Index([   187,    343,    375,    463,    477,    580,    673,    762,
                781,    833,
             ...
             274811, 274856, 274858, 274876, 274888, 274996, 275045, 275054,
             275069, 275070],
            dtype='int64', length=2688),
 'UW-06': Int64Index([   167,    272,    385,    631,    774,    951,   1011,   1048,
               1078,   1083,
             ...
             274454, 274499, 274562, 274617, 274742, 274773, 274786, 274802,
             274809, 274945],
            dtype='int64', length=2383),
 'UW-07': Int64Index([   121,    122,    215,    216,    365,    367,    404,    721,
               1090,   1141,
             ...
             273936, 273944, 273961, 274498, 274571, 274576, 274721, 274727,
             274947, 275046],
            dtype='int64', length=1905),
 'UW-10': Int64Index([   105,    124,    128,    314,    619,    896,    934,    935,
                999,   1006,
             ...
             238201, 238453, 238514, 238816, 238854, 239274, 239545, 240295,
             240446, 240775],
            dtype='int64', length=1175),
 'UW-11': Int64Index([150250, 150776, 151044, 151373, 151690, 152037, 152061, 153903,
             153941, 154905,
             ...
             273385, 273607, 273760, 273892, 273904, 274016, 274017, 274018,
             274473, 274675],
            dtype='int64', length=1237),
 'UW-12': Int64Index([241157, 241173, 241175, 241194, 241208, 241245, 241292, 241403,
             241435, 241447,
             ...
             274748, 274794, 274795, 274843, 274859, 274864, 274938, 274940,
             274999, 275055],
            dtype='int64', length=689),
 'WF-01': Int64Index([   133,    135,    297,    298,    300,    302,    307,    369,
                475,    514,
             ...
             274972, 274977, 274989, 275013, 275015, 275028, 275044, 275047,
             275078, 275079],
            dtype='int64', length=13038),
 'WF-03': Int64Index([226781, 226784, 226827, 227100, 227321, 227322, 227569, 227570,
             227768, 227769,
             ...
             274097, 274100, 274306, 274307, 274325, 274383, 274667, 274708,
             274909, 274934],
            dtype='int64', length=646),
 'WF-04': Int64Index([    64,     65,    132,    157,    203,    207,    236,    264,
                266,    267,
             ...
             274382, 274630, 274637, 274697, 274707, 274709, 274880, 274913,
             274981, 274986],
            dtype='int64', length=6271)}

The simplest version of a groupby looks like this; you can use almost any aggregation function you wish (mean, median, sum, minimum, maximum, standard deviation, count, etc.):

<data object>.groupby(<grouping values>).<aggregate>()

For example, we can group by gender and find the average of all numerical columns:
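
A minimal sketch of this pattern, assuming a small hypothetical stand-in table named `trips` (the real lecture data lives in `df`; the values here are illustrative only). The numeric columns are selected explicitly so that text columns are kept out of the average:

```python
import pandas as pd

# Hypothetical stand-in for the lecture's trip table `df` (illustrative values).
trips = pd.DataFrame({
    "gender": ["Male", "Female", "Male", "Female", "Male"],
    "tripduration": [985.0, 883.0, 926.0, 865.0, 923.0],
    "birthyear": [1960.0, 1988.0, 1970.0, 1977.0, 1971.0],
})

# Pattern: <data object>.groupby(<grouping values>).<aggregate>()
# Selecting the numeric columns first keeps text columns out of the mean.
mean_by_gender = trips.groupby("gender")[["tripduration", "birthyear"]].mean()
print(mean_by_gender)
```

The result has one row per gender and one column per averaged quantity.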



In [ ]:

It's also possible to index the grouped object as if it were a dataframe:
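
As a sketch of what this can look like, the example below indexes a grouped object by column name before aggregating. The table `trips` is a hypothetical stand-in for the lecture's `df`, with illustrative values:

```python
import pandas as pd

# Hypothetical stand-in for the lecture's trip table `df` (illustrative values).
trips = pd.DataFrame({
    "gender": ["Male", "Female", "Male", "Female", "Male"],
    "tripduration": [985.0, 883.0, 926.0, 865.0, 923.0],
    "birthyear": [1960.0, 1988.0, 1970.0, 1977.0, 1971.0],
})

grouped = trips.groupby("gender")

# Indexing the grouped object with a column name selects that column
# within every group; the aggregate then returns a Series indexed by gender.
duration_by_gender = grouped["tripduration"].mean()
print(duration_by_gender)
```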



In [ ]:

You can even group by multiple values: for example we can look at the trip duration by time of day and by gender:
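
A minimal sketch of grouping by two keys, assuming the table already has an `hour` column holding the hour of day of each trip. Both the column name and the values below are hypothetical:

```python
import pandas as pd

# Hypothetical stand-in table; `hour` is assumed to hold the hour of day.
trips = pd.DataFrame({
    "hour": [8, 8, 8, 17, 17, 17],
    "gender": ["Male", "Female", "Male", "Female", "Male", "Female"],
    "tripduration": [600.0, 700.0, 800.0, 900.0, 1000.0, 1100.0],
})

# Passing a list of column names groups by every combination of their values.
by_hour_gender = trips.groupby(["hour", "gender"])["tripduration"].mean()
print(by_hour_gender)
```

The result is a Series whose index has two levels, one for each grouping key.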



In [ ]:

The unstack() operation can help make sense of this type of multiply-grouped data. What this technically does is split a multiple-valued index into an index plus columns:
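
The sketch below illustrates this on a small hypothetical table (the `hour` column and all values are assumptions made for illustration). After `unstack()`, the inner index level (`gender`) becomes the columns:

```python
import pandas as pd

# Hypothetical stand-in table; `hour` is assumed to hold the hour of day.
trips = pd.DataFrame({
    "hour": [8, 8, 8, 17, 17, 17],
    "gender": ["Male", "Female", "Male", "Female", "Male", "Female"],
    "tripduration": [600.0, 700.0, 800.0, 900.0, 1000.0, 1100.0],
})

# Two-level index: (hour, gender)
by_hour_gender = trips.groupby(["hour", "gender"])["tripduration"].mean()

# unstack() moves the last index level (gender) into the columns,
# leaving hour as the row index.
table = by_hour_gender.unstack()
print(table)
```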



In [ ]:

5. Visualizing data with `pandas`

Of course, looking at tables of data is not very intuitive. Fortunately, pandas has many useful built-in plotting functions, all of which use the matplotlib library to generate plots.

Whenever you plot in the Jupyter notebook, first run this magic command, which configures the notebook to display plots inline:



In [31]:

    
%matplotlib inline

Now we can simply call a plotting method, such as plot() or hist(), on any series or dataframe to get a reasonable view of the data:



In [32]:

    
import matplotlib.pyplot as plt
df['tripduration'].hist()









    Out[32]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f7d60c68ef0>

Adjusting the Plot Style

Matplotlib has a number of plot styles you can use. For example, if you like R you might use the ggplot style:
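
One way to do this is with `plt.style.use()`, sketched below. The call changes the default look of every plot created afterwards in the same session:

```python
import matplotlib.pyplot as plt

# "ggplot" is one of the style sheets bundled with matplotlib.
plt.style.use("ggplot")

# plt.style.available lists every style name accepted by plt.style.use().
print(sorted(plt.style.available))
```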



In [ ]:

Other plot types

Pandas supports a range of other plotting types; you can find these by using the autocomplete on the plot method:



In [ ]:

For example, we can create a histogram of trip durations:
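
A minimal sketch using the `plot.hist()` accessor, assuming a short hypothetical series of durations in place of `df['tripduration']`:

```python
import pandas as pd

# Hypothetical trip durations standing in for df['tripduration'].
durations = pd.Series(
    [300.0, 450.0, 600.0, 620.0, 900.0, 1200.0, 2400.0],
    name="tripduration",
)

# plot.hist() bins the values and draws one bar per bin;
# it returns the matplotlib Axes object the bars were drawn on.
ax = durations.plot.hist(bins=5)
```

The other plot types follow the same shape, for example `durations.plot.box()` or `durations.plot.line()`.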



In [ ]:

If you'd like to adjust the x and y limits of the plot, you can use the set_xlim() and set_ylim() methods of the resulting object:
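
As a sketch, the object returned by a pandas plotting call is a matplotlib Axes, so the limits can be set on it directly. The durations and the limit values below are hypothetical:

```python
import pandas as pd

# Hypothetical trip durations standing in for df['tripduration'].
durations = pd.Series(
    [300.0, 450.0, 600.0, 620.0, 900.0, 1200.0, 2400.0],
    name="tripduration",
)

ax = durations.plot.hist(bins=5)

# Restrict the visible region of the plot; the data itself is unchanged.
ax.set_xlim(0, 3000)
ax.set_ylim(0, 10)
```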



In [ ]:

Breakout: Exploring the Data

Make a plot of the total number of rides as a function of month of the year (You'll need to extract the month, use a groupby, and find the appropriate aggregation to count the number in each group).



In [ ]:

Split this plot by gender. Do you see any seasonal ridership patterns by gender?



In [ ]:

Split this plot by user type. Do you see any seasonal ridership patterns by usertype?



In [ ]:

Repeat the above three steps, counting the number of rides by time of day rather than by month.



In [ ]:

Are there any other interesting insights you can discover in the data using these tools?



In [ ]:

Using Files

Writing and running python modules
Using python modules in your Jupyter Notebook



In [33]:

    
# A script for creating a dataframe with counts of the occurrence of a column's values
df_count = df.groupby('from_station_id').count()
df_count1 = df_count[['trip_id']]
df_count2 = df_count1.rename(columns={'trip_id': 'count'})



In [34]:

    
df_count2.head()









    Out[34]:







  
    
      
      count
    
    
      from_station_id
      
    
  
  
    
      BT-01
      10463
    
    
      BT-03
      7334
    
    
      BT-04
      4666
    
    
      BT-05
      5699
    
    
      BT-06
      150



In [35]:

    
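def make_table_count(df_arg, groupby_column):
    """Return a one-column dataframe counting the rows for each value of groupby_column."""
    df_count = df_arg.groupby(groupby_column).count()
    # Take the column name from the grouped result rather than the global df,
    # so the function depends only on its arguments.
    column_name = df_count.columns[0]
    df_count1 = df_count[[column_name]]
    df_count2 = df_count1.rename(columns={column_name: 'count'})
    return df_count2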



In [36]:

    
dff = make_table_count(df, 'from_station_id')
dff.head()









    Out[36]:







  
    
      
      count
    
    
      from_station_id
      
    
  
  
    
      BT-01
      10463
    
    
      BT-03
      7334
    
    
      BT-04
      4666
    
    
      BT-05
      5699
    
    
      BT-06
      150



In [37]:

    
import table_modifiers as tm



In [38]:

    
dir(tm)









    Out[38]:





['__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__spec__',
 'table_counter']
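
The file `table_modifiers.py` itself is not shown here. A minimal sketch of what such a module might contain follows. The function body is an assumption: it uses `size()` to count rows per group rather than the count-and-rename steps of `make_table_count`, and the sample table is hypothetical.

```python
# Hypothetical contents of table_modifiers.py (the actual file is not shown).
import pandas as pd


def table_counter(df_arg, groupby_column):
    """Count the rows of df_arg for each distinct value of groupby_column."""
    # size() counts rows per group, regardless of missing values in other columns.
    counts = df_arg.groupby(groupby_column).size()
    return counts.to_frame(name="count")


# Usage on a small hypothetical table (in the notebook this would be
# `import table_modifiers as tm` followed by `tm.table_counter(...)`).
sample = pd.DataFrame({
    "trip_id": [431, 432, 433],
    "from_station_id": ["BT-01", "BT-01", "BT-03"],
})
result = table_counter(sample, "from_station_id")
print(result)
```

Saving the function in a `.py` file next to the notebook is what makes `import table_modifiers as tm` work.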



In [39]:

    
tm.table_counter(df, 'from_station_id')









    Out[39]:







  
    
      
      count
    
    
      from_station_id
      
    
  
  
    
      BT-01
      10463
    
    
      BT-03
      7334
    
    
      BT-04
      4666
    
    
      BT-05
      5699
    
    
      BT-06
      150
    
    
      CBD-03
      4822
    
    
      CBD-04
      3440
    
    
      CBD-05
      5068
    
    
      CBD-06
      4911
    
    
      CBD-07
      3263
    
    
      CBD-13
      9067
    
    
      CD-01
      958
    
    
      CH-01
      6409
    
    
      CH-02
      8546
    
    
      CH-03
      6218
    
    
      CH-05
      6948
    
    
      CH-06
      3765
    
    
      CH-07
      11568
    
    
      CH-08
      8573
    
    
      CH-09
      5246
    
    
      CH-12
      5857
    
    
      CH-15
      6550
    
    
      CH-16
      2089
    
    
      DPD-01
      4822
    
    
      DPD-03
      1423
    
    
      EL-01
      3604
    
    
      EL-03
      5788
    
    
      EL-05
      3400
    
    
      FH-01
      2349
    
    
      FH-04
      4208
    
    
      ID-04
      2474
    
    
      PS-04
      5409
    
    
      PS-05
      3969
    
    
      SLU-01
      7084
    
    
      SLU-02
      7018
    
    
      SLU-04
      5226
    
    
      SLU-07
      6339
    
    
      SLU-15
      9741
    
    
      SLU-16
      5045
    
    
      SLU-18
      3461
    
    
      SLU-19
      7285
    
    
      SLU-20
      2452
    
    
      SLU-21
      1114
    
    
      SLU-22
      1748
    
    
      SLU-23
      5739
    
    
      UD-01
      3889
    
    
      UD-02
      1417
    
    
      UD-04
      3534
    
    
      UD-07
      2429
    
    
      UW-01
      480
    
    
      UW-02
      2002
    
    
      UW-04
      2688
    
    
      UW-06
      2383
    
    
      UW-07
      1905
    
    
      UW-10
      1175
    
    
      UW-11
      1237
    
    
      UW-12
      689
    
    
      WF-01
      13038
    
    
      WF-03
      646
    
    
      WF-04
      6271

	trip_id	starttime	stoptime	bikeid	tripduration	from_station_name	to_station_name	from_station_id	to_station_id	usertype	gender	birthyear
0	431	10/13/2014 10:31:00 AM	10/13/2014 10:48:00 AM	SEA00298	985.935	2nd Ave & Spring St	Occidental Park / Occidental Ave S & S Washing...	CBD-06	PS-04	Member	Male	1960.0
1	432	10/13/2014 10:32:00 AM	10/13/2014 10:48:00 AM	SEA00195	926.375	2nd Ave & Spring St	Occidental Park / Occidental Ave S & S Washing...	CBD-06	PS-04	Member	Male	1970.0
2	433	10/13/2014 10:33:00 AM	10/13/2014 10:48:00 AM	SEA00486	883.831	2nd Ave & Spring St	Occidental Park / Occidental Ave S & S Washing...	CBD-06	PS-04	Member	Female	1988.0
3	434	10/13/2014 10:34:00 AM	10/13/2014 10:48:00 AM	SEA00333	865.937	2nd Ave & Spring St	Occidental Park / Occidental Ave S & S Washing...	CBD-06	PS-04	Member	Female	1977.0
4	435	10/13/2014 10:34:00 AM	10/13/2014 10:49:00 AM	SEA00202	923.923	2nd Ave & Spring St	Occidental Park / Occidental Ave S & S Washing...	CBD-06	PS-04	Member	Male	1971.0

	trip_id	starttime	stoptime	bikeid	tripduration	from_station_name	to_station_name	to_station_id	usertype	gender	birthyear
from_station_id
BT-01	10463	10463	10463	10463	10463	10463	10463	10463	10463	4162	4162
BT-03	7334	7334	7334	7334	7334	7334	7334	7334	7334	4862	4862
BT-04	4666	4666	4666	4666	4666	4666	4666	4666	4666	3424	3424
BT-05	5699	5699	5699	5699	5699	5699	5699	5699	5699	2975	2975
BT-06	150	150	150	150	150	150	150	150	150	130	130

	trip_id	tripduration	birthyear
from_station_id
BT-01	147831.009844	1375.031203	1980.131427
BT-03	139404.294655	1019.200684	1976.505142
BT-04	157992.809687	891.095897	1979.877044
BT-05	139283.572381	1199.949481	1975.937479
BT-06	291807.953333	659.770547	1975.830769
