Software Engineering for Data Scientists

Manipulating Data with Python

Today's Objectives

1. Opening & Navigating the IPython Notebook

2. Simple Math in the IPython Notebook

3. Loading data with `pandas`

4. Cleaning and Manipulating data with `pandas`

5. Visualizing data with `pandas`

1. Opening and Navigating the IPython Notebook

We will start today with the interactive environment that we will be using often through the course: the IPython/Jupyter Notebook.

We will walk through the following steps together:

Download miniconda (be sure to get Version 3.5) and install it on your system (hopefully you have done this before coming to class)
Use the conda command-line tool to update your package listing and install the IPython notebook:

Update conda's listing of packages for your system:
```
$ conda update conda
```
Install IPython notebook and all its requirements
```
$ conda install ipython-notebook
```
Navigate to the directory containing the course material. For example:
```
$ cd ~/short-course/LectureNotes/ExploratoryAnalysis
```
You should see a number of files in the directory, including these:
```
$ ls
...
Breakout-Simple-Math.ipynb
ExploratoryAnalysis.ipynb
...
```
Type ipython notebook in the terminal to start the notebook
```
$ ipython notebook
```
If everything has worked correctly, it should automatically launch your default browser
Click on CSE599_Lecture_2.ipynb to open the notebook containing the content for this lecture.

With that, you're set up to use the IPython notebook!

2. Simple Math in the IPython Notebook

Now that we have the IPython notebook up and running, we're going to do a short breakout exploring some of the mathematical functionality that Python offers.

Please open Breakout-Simple-Math.ipynb, find a partner, and make your way through that notebook, typing and executing code along the way.

3. Loading data with `pandas`

With this simple Python computation experience under our belt, we can now move to doing some more interesting analysis.

Python's Data Science Ecosystem

In addition to Python's built-in modules like the math module we explored above, there are also many often-used third-party modules that are core tools for doing data science with Python. Some of the most important ones are:

`numpy`: Numerical Python

Numpy is short for "Numerical Python", and contains tools for efficient manipulation of arrays of data. If you have used other computational tools like IDL or MatLab, Numpy should feel very familiar.

`scipy`: Scientific Python

Scipy is short for "Scientific Python", and contains a wide range of functionality for accomplishing common scientific tasks, such as optimization/minimization, numerical integration, interpolation, and much more. We will not look closely at Scipy today, but we will use its functionality later in the course.

`pandas`: Labeled Data Manipulation in Python

Pandas is short for "Panel Data", and contains tools for doing more advanced manipulation of labeled data in Python, in particular with a columnar data structure called a Data Frame. If you've used the R statistical language (and in particular the so-called "Hadley Stack"), much of the functionality in Pandas should feel very familiar.

`matplotlib`: Visualization in Python

Matplotlib started out as a Matlab plotting clone in Python, and has grown from there in the 15 years since its creation. It is the most popular data visualization tool currently in the Python data world (though other recent packages are starting to encroach on its monopoly).

Installing Pandas & friends

Because the above packages are not included in Python itself, you need to install them separately. While it is possible to install these from source (compiling the C and/or Fortran code that does the heavy lifting under the hood) it is much easier to use a package manager like conda. All it takes is to run

$ conda install numpy scipy pandas matplotlib

and (so long as your conda setup is working) the packages will be downloaded and installed on your system.

Loading Data with Pandas



In [1]:

    
import numpy
numpy.__path__









    Out[1]:





['/Users/jakevdp/anaconda/envs/python3.5/lib/python3.5/site-packages/numpy']



In [2]:

    
import pandas

Because we'll use it so much, we often import under a shortened name using the import ... as ... pattern:



In [3]:

    
import pandas as pd

Now we can use the read_csv command to read the comma-separated-value data:



In [4]:

    
data = pd.read_csv('2015_trip_data.csv')

Note: strings in Python can be defined either with double quotes or single quotes

Viewing Pandas Dataframes

The head() and tail() methods show us the first and last rows of the data



In [5]:

    
data.head()









    Out[5]:






  
    
      
      trip_id
      starttime
      stoptime
      bikeid
      tripduration
      from_station_name
      to_station_name
      from_station_id
      to_station_id
      usertype
      gender
      birthyear
    
  
  
    
      0
      431
      10/13/2014 10:31
      10/13/2014 10:48
      SEA00298
      985.935
      2nd Ave & Spring St
      Occidental Park / Occidental Ave S & S Washing...
      CBD-06
      PS-04
      Annual Member
      Male
      1960
    
    
      1
      432
      10/13/2014 10:32
      10/13/2014 10:48
      SEA00195
      926.375
      2nd Ave & Spring St
      Occidental Park / Occidental Ave S & S Washing...
      CBD-06
      PS-04
      Annual Member
      Male
      1970
    
    
      2
      433
      10/13/2014 10:33
      10/13/2014 10:48
      SEA00486
      883.831
      2nd Ave & Spring St
      Occidental Park / Occidental Ave S & S Washing...
      CBD-06
      PS-04
      Annual Member
      Female
      1988
    
    
      3
      434
      10/13/2014 10:34
      10/13/2014 10:48
      SEA00333
      865.937
      2nd Ave & Spring St
      Occidental Park / Occidental Ave S & S Washing...
      CBD-06
      PS-04
      Annual Member
      Female
      1977
    
    
      4
      435
      10/13/2014 10:34
      10/13/2014 10:49
      SEA00202
      923.923
      2nd Ave & Spring St
      Occidental Park / Occidental Ave S & S Washing...
      CBD-06
      PS-04
      Annual Member
      Male
      1971



In [6]:

    
data.tail()









    Out[6]:






  
    
      
      trip_id
      starttime
      stoptime
      bikeid
      tripduration
      from_station_name
      to_station_name
      from_station_id
      to_station_id
      usertype
      gender
      birthyear
    
  
  
    
      142841
      156796
      10/12/2015 20:41
      10/12/2015 20:47
      SEA00358
      377.183
      E Pine St & 16th Ave
      Summit Ave & E Denny Way
      CH-07
      CH-01
      Annual Member
      Male
      1990
    
    
      142842
      156797
      10/12/2015 20:43
      10/12/2015 20:48
      SEA00399
      303.330
      Bellevue Ave & E Pine St
      Summit Ave E & E Republican St
      CH-12
      CH-03
      Annual Member
      Male
      1978
    
    
      142843
      156798
      10/12/2015 21:03
      10/12/2015 21:06
      SEA00204
      165.597
      Harvard Ave & E Pine St
      E Harrison St & Broadway Ave E
      CH-09
      CH-02
      Annual Member
      Male
      1989
    
    
      142844
      156799
      10/12/2015 21:35
      10/12/2015 21:41
      SEA00073
      388.576
      Pine St & 9th Ave
      3rd Ave & Broad St
      SLU-16
      BT-01
      Short-Term Pass Holder
      NaN
      NaN
    
    
      142845
      156800
      10/12/2015 22:45
      10/12/2015 22:51
      SEA00033
      391.885
      NE 42nd St & University Way NE
      Eastlake Ave E & E Allison St
      UD-02
      EL-05
      Annual Member
      Male
      1985

The shape attribute shows us the number of elements:



In [7]:

    
data.shape









    Out[7]:





(142846, 12)

The columns attribute gives us the column names



In [8]:

    
data.columns









    Out[8]:





Index(['trip_id', 'starttime', 'stoptime', 'bikeid', 'tripduration',
       'from_station_name', 'to_station_name', 'from_station_id',
       'to_station_id', 'usertype', 'gender', 'birthyear'],
      dtype='object')

The index attribute gives us the index names



In [9]:

    
data.index









    Out[9]:





Int64Index([     0,      1,      2,      3,      4,      5,      6,      7,
                 8,      9,
            ...
            142836, 142837, 142838, 142839, 142840, 142841, 142842, 142843,
            142844, 142845],
           dtype='int64', length=142846)

The dtypes attribute gives the data types of each column:



In [10]:

    
data.dtypes









    Out[10]:





trip_id                int64
starttime             object
stoptime              object
bikeid                object
tripduration         float64
from_station_name     object
to_station_name       object
from_station_id       object
to_station_id         object
usertype              object
gender                object
birthyear            float64
dtype: object

4. Manipulating data with `pandas`

Here we'll cover some key features of manipulating data with pandas

Access columns by name using square-bracket indexing:



In [11]:

    
data["usertype"]









    Out[11]:





0                  Annual Member
1                  Annual Member
2                  Annual Member
3                  Annual Member
4                  Annual Member
5                  Annual Member
6                  Annual Member
7                  Annual Member
8                  Annual Member
9                  Annual Member
10                 Annual Member
11                 Annual Member
12                 Annual Member
13                 Annual Member
14                 Annual Member
15                 Annual Member
16                 Annual Member
17                 Annual Member
18                 Annual Member
19                 Annual Member
20                 Annual Member
21                 Annual Member
22                 Annual Member
23                 Annual Member
24                 Annual Member
25                 Annual Member
26                 Annual Member
27                 Annual Member
28                 Annual Member
29                 Annual Member
                   ...          
142816             Annual Member
142817             Annual Member
142818             Annual Member
142819             Annual Member
142820    Short-Term Pass Holder
142821    Short-Term Pass Holder
142822    Short-Term Pass Holder
142823    Short-Term Pass Holder
142824             Annual Member
142825             Annual Member
142826             Annual Member
142827             Annual Member
142828             Annual Member
142829             Annual Member
142830             Annual Member
142831             Annual Member
142832             Annual Member
142833    Short-Term Pass Holder
142834             Annual Member
142835             Annual Member
142836             Annual Member
142837             Annual Member
142838             Annual Member
142839             Annual Member
142840             Annual Member
142841             Annual Member
142842             Annual Member
142843             Annual Member
142844    Short-Term Pass Holder
142845             Annual Member
Name: usertype, dtype: object

Mathematical operations on columns happen element-wise:



In [12]:

    
data['tripduration'] / 60









    Out[12]:





0         16.432250
1         15.439583
2         14.730517
3         14.432283
4         15.398717
5         13.480083
6          9.945250
7          9.868850
8          9.772450
9          9.793900
10         9.414983
11        10.335683
12        10.568117
13        10.238933
14        10.024383
15        10.313017
16        10.284750
17        10.000833
18         8.328900
19         9.588450
20         9.530117
21         9.396050
22         7.294817
23         8.052567
24         7.989700
25         8.892200
26         8.027150
27         7.679533
28         7.652750
29        11.340950
            ...    
142816     8.671450
142817    14.607983
142818     4.938550
142819    11.882017
142820    21.947550
142821    21.883067
142822    21.240667
142823    18.747133
142824    22.390117
142825    12.643100
142826     4.613267
142827     6.036600
142828     4.168183
142829     4.278083
142830     6.128367
142831     7.455983
142832     2.109933
142833     9.045133
142834     4.308700
142835    15.671783
142836    10.276817
142837    10.726967
142838     9.055200
142839     3.375017
142840     8.962267
142841     6.286383
142842     5.055500
142843     2.759950
142844     6.476267
142845     6.531417
Name: tripduration, dtype: float64

Columns can be created (or overwritten) with the assignment operator. Let's create a tripminutes column with the number of minutes for each trip



In [13]:

    
data['tripminutes'] = data['tripduration'] / 60



In [14]:

    
data.head()









    Out[14]:






  
    
      
      trip_id
      starttime
      stoptime
      bikeid
      tripduration
      from_station_name
      to_station_name
      from_station_id
      to_station_id
      usertype
      gender
      birthyear
      tripminutes
    
  
  
    
      0
      431
      10/13/2014 10:31
      10/13/2014 10:48
      SEA00298
      985.935
      2nd Ave & Spring St
      Occidental Park / Occidental Ave S & S Washing...
      CBD-06
      PS-04
      Annual Member
      Male
      1960
      16.432250
    
    
      1
      432
      10/13/2014 10:32
      10/13/2014 10:48
      SEA00195
      926.375
      2nd Ave & Spring St
      Occidental Park / Occidental Ave S & S Washing...
      CBD-06
      PS-04
      Annual Member
      Male
      1970
      15.439583
    
    
      2
      433
      10/13/2014 10:33
      10/13/2014 10:48
      SEA00486
      883.831
      2nd Ave & Spring St
      Occidental Park / Occidental Ave S & S Washing...
      CBD-06
      PS-04
      Annual Member
      Female
      1988
      14.730517
    
    
      3
      434
      10/13/2014 10:34
      10/13/2014 10:48
      SEA00333
      865.937
      2nd Ave & Spring St
      Occidental Park / Occidental Ave S & S Washing...
      CBD-06
      PS-04
      Annual Member
      Female
      1977
      14.432283
    
    
      4
      435
      10/13/2014 10:34
      10/13/2014 10:49
      SEA00202
      923.923
      2nd Ave & Spring St
      Occidental Park / Occidental Ave S & S Washing...
      CBD-06
      PS-04
      Annual Member
      Male
      1971
      15.398717

Working with Times

One trick to know when working with columns of times is that Pandas DateTimeIndex provides a nice interface for working with columns of times:



In [15]:

    
times = pd.DatetimeIndex(data['starttime'])

With it, we can extract, the hour of the day, the day of the week, the month, and a wide range of other views of the time:



In [16]:

    
times









    Out[16]:





DatetimeIndex(['2014-10-13 10:31:00', '2014-10-13 10:32:00',
               '2014-10-13 10:33:00', '2014-10-13 10:34:00',
               '2014-10-13 10:34:00', '2014-10-13 10:34:00',
               '2014-10-13 11:35:00', '2014-10-13 11:35:00',
               '2014-10-13 11:35:00', '2014-10-13 11:35:00',
               ...
               '2015-10-12 20:09:00', '2015-10-12 20:11:00',
               '2015-10-12 20:18:00', '2015-10-12 20:39:00',
               '2015-10-12 20:41:00', '2015-10-12 20:41:00',
               '2015-10-12 20:43:00', '2015-10-12 21:03:00',
               '2015-10-12 21:35:00', '2015-10-12 22:45:00'],
              dtype='datetime64[ns]', length=142846, freq=None)



In [17]:

    
times.dayofweek









    Out[17]:





array([0, 0, 0, ..., 0, 0, 0], dtype=int32)



In [18]:

    
times.month









    Out[18]:





array([10, 10, 10, ..., 10, 10, 10], dtype=int32)

Note: math functionality can be applied to columns using the NumPy package: for example:



In [19]:

    
import numpy as np
np.exp(data['tripminutes'])









    Out[19]:





0         1.369101e+07
1         5.073712e+06
2         2.496791e+06
3         1.852939e+06
4         4.870546e+06
5         7.150325e+05
6         2.085294e+04
7         1.931911e+04
8         1.754370e+04
9         1.792407e+04
10        1.227087e+04
11        3.081273e+04
12        3.887539e+04
13        2.797127e+04
14        2.257015e+04
15        3.012217e+04
16        2.928264e+04
17        2.204483e+04
18        4.141859e+03
19        1.459523e+04
20        1.376820e+04
21        1.204073e+04
22        1.472647e+03
23        3.141849e+03
24        2.950412e+03
25        7.275007e+03
26        3.063000e+03
27        2.163610e+03
28        2.106430e+03
29        8.419998e+04
              ...     
142816    5.833952e+03
142817    2.208852e+06
142818    1.395677e+02
142819    1.446420e+05
142820    3.401730e+09
142821    3.189298e+09
142822    1.677661e+09
142823    1.386043e+08
142824    5.295465e+09
142825    3.096197e+05
142826    1.008129e+02
142827    4.184678e+02
142828    6.459799e+01
142829    7.210211e+01
142830    4.586864e+02
142831    1.730185e+03
142832    8.247691e+00
142833    8.477182e+03
142834    7.434378e+01
142835    6.399839e+06
142836    2.905125e+04
142837    4.556826e+04
142838    8.562950e+03
142839    2.922477e+01
142840    7.803024e+03
142841    5.372069e+02
142842    1.568830e+02
142843    1.579905e+01
142844    6.495415e+02
142845    6.863699e+02
Name: tripminutes, dtype: float64

Simple Grouping of Data

The real power of Pandas comes in its tools for grouping and aggregating data. Here we'll look at value counts and the basics of group-by operations.

Value Counts

Pandas includes an array of useful functionality for manipulating and analyzing tabular data. We'll take a look at two of these here.

The pandas.value_counts returns statistics on the unique values within each column.

We can use it, for example, to break down rides by gender:



In [20]:

    
pd.value_counts(data['gender'])









    Out[20]:





Male      67608
Female    18245
Other      1507
Name: gender, dtype: int64

Or to break down rides by age:



In [21]:

    
pd.value_counts(data['birthyear']).sort_index()









    Out[21]:





1936       6
1939      23
1942       2
1943      11
1944       1
1945     115
1946      39
1947     244
1948      34
1949     154
1950     657
1951     251
1952     204
1953     337
1954     152
1955     397
1956     481
1957     224
1958     160
1959     696
1960     429
1961     875
1962    1769
1963     828
1964    1374
1965    1510
1966     761
1967    1354
1968    1085
1969    1185
1970    1056
1971    1279
1972    1921
1973    1076
1974    1497
1975    1969
1976    1577
1977    2465
1978    2063
1979    1976
1980    2236
1981    4779
1982    4629
1983    3965
1984    3815
1985    5370
1986    3492
1987    9320
1988    4188
1989    2755
1990    3605
1991    2912
1992    1798
1993    1126
1994     549
1995     412
1996     121
1997      21
1998      25
1999       5
Name: birthyear, dtype: int64

What else might we break down rides by?



In [22]:

    
pd.value_counts(times.dayofweek)









    Out[22]:





3    21505
0    21266
4    21097
2    20748
1    20465
5    20358
6    17407
dtype: int64

We can sort by the index rather than the counts if we wish:



In [23]:

    
pd.value_counts(times.dayofweek, sort=False)









    Out[23]:





0    21266
1    20465
2    20748
3    21505
4    21097
5    20358
6    17407
dtype: int64



In [24]:

    
pd.value_counts(times.month)









    Out[24]:





7     18808
8     17046
6     15999
5     15548
9     13134
4     12898
10    11081
3      9980
11     7823
1      7368
2      7330
12     5831
dtype: int64



In [25]:

    
pd.value_counts(times.month, sort=False)









    Out[25]:





1      7368
2      7330
3      9980
4     12898
5     15548
6     15999
7     18808
8     17046
9     13134
10    11081
11     7823
12     5831
dtype: int64

Group-by Operation

One of the killer features of the Pandas dataframe is the ability to do group-by operations. You can visualize the group-by like this (image borrowed from the Python Data Science Handbook)



In [26]:

    
from IPython.display import Image
Image('split_apply_combine.png')









    Out[26]:

So, for example, we can use this to find the average length of a ride as a function of time of day:



In [27]:

    
data.groupby(times.hour)['tripminutes'].mean()









    Out[27]:





0     18.293162
1     16.812000
2     26.467510
3     22.643443
4     18.595762
5     13.565035
6     12.091993
7     12.378344
8     12.544350
9     15.175861
10    23.558911
11    25.645489
12    26.052903
13    27.878785
14    28.354453
15    26.164124
16    21.257375
17    17.388788
18    16.706635
19    16.886609
20    17.463562
21    15.227905
22    15.931296
23    16.006255
Name: tripminutes, dtype: float64

The simplest version of a groupby looks like this, and you can use almost any aggregation function you wish (mean, median, sum, minimum, maximum, standard deviation, count, etc.)

<data object>.groupby(<grouping values>).<aggregate>()

You can even group by multiple values: for example we can look at the trip duration by time of day and by gender:



In [28]:

    
grouped = data.groupby([times.hour, 'gender'])['tripminutes'].mean()
grouped









    Out[28]:





    gender
0   Female     9.016608
    Male       8.750898
    Other      8.123788
1   Female     8.721844
    Male       7.865786
    Other     13.765683
2   Female     8.371583
    Male       8.880909
    Other      7.344022
3   Female    16.796675
    Male       8.087312
    Other      5.643350
4   Female     9.123506
    Male      12.134917
    Other      2.810767
5   Female    12.970906
    Male       8.261415
6   Female    10.954091
    Male       8.038184
    Other      9.752973
7   Female    12.392548
    Male       9.461130
    Other     10.525419
8   Female    11.309898
    Male       9.219013
    Other     10.129815
9   Female    10.881655
    Male       9.019535
    Other     11.167447
10  Female    12.152443
                ...    
14  Female    13.085510
    Male      10.097013
    Other     11.780128
15  Female    12.521131
    Male      10.586964
    Other     10.580798
16  Female    12.614844
    Male      10.534829
    Other     13.845402
17  Female    12.927452
    Male      10.507247
    Other     11.990486
18  Female    12.414073
    Male       9.432061
    Other      9.477819
19  Female    12.008509
    Male       9.068067
    Other      9.579921
20  Female    11.716320
    Male       9.253255
    Other      8.051081
21  Female    11.241531
    Male       8.859545
    Other      7.606792
22  Female    11.088358
    Male       8.733132
    Other      6.618971
23  Female    11.032336
    Male       8.085837
    Other      5.183287
Name: tripminutes, dtype: float64

The unstack() operation can help make sense of this type of multiply-grouped data. What this technically does is split a multiple-valued index into an index plus columns:



In [29]:

    
grouped.unstack()

5. Visualizing data with `pandas`

Of course, looking at tables of data is not very intuitive. Fortunately Pandas has many useful plotting functions built-in, all of which make use of the matplotlib library to generate plots.

Whenever you do plotting in the IPython notebook, you will want to first run this magic command which configures the notebook to work well with plots:



In [30]:

    
%matplotlib inline

Now we can simply call the plot() method of any series or dataframe to get a reasonable view of the data:



In [31]:

    
data.groupby([times.hour, 'usertype'])['tripminutes'].mean().unstack().plot()









    Out[31]:





<matplotlib.axes._subplots.AxesSubplot at 0x108900518>

Adjusting the Plot Style

The default formatting is not very nice; I often make use of the Seaborn library for better plotting defaults.

First you'll have to

$ conda install seaborn

and then you can do this:



In [32]:

    
import seaborn
seaborn.set()









    



/Users/jakevdp/anaconda/envs/python3.5/lib/python3.5/site-packages/matplotlib/__init__.py:872: UserWarning: axes.color_cycle is deprecated and replaced with axes.prop_cycle; please use the latter.
  warnings.warn(self.msg_depr % (key, alt_key))

And now re-run the plot from above:



In [33]:

    
data.groupby([times.hour, 'usertype'])['tripminutes'].mean().unstack().plot()









    Out[33]:





<matplotlib.axes._subplots.AxesSubplot at 0x10cdfa5f8>

Other plot types

Pandas supports a range of other plotting types; you can find these by using the autocomplete on the plot method:



In [34]:

    
data.plot.hist









    Out[34]:





<bound method FramePlotMethods.hist of <pandas.tools.plotting.FramePlotMethods object at 0x1071b5dd8>>

For example, we can create a histogram of trip durations:



In [35]:

    
data['tripminutes'].plot.hist(bins=100)









    Out[35]:





<matplotlib.axes._subplots.AxesSubplot at 0x10ce979b0>

If you'd like to adjust the x and y limits of the plot, you can use the set_xlim() and set_ylim() method of the resulting object:



In [36]:

    
plot = data['tripminutes'].plot.hist(bins=500)
plot.set_xlim(0, 50)









    Out[36]:





(0, 50)

Breakout: Exploring the Data

Make a plot of the total number of rides as a function of month of the year (You'll need to extract the month, use a groupby, and find the appropriate aggregation to count the number in each group).



In [ ]:

Split this plot by gender. Do you see any seasonal ridership patterns by gender?



In [ ]:

Split this plot by user type. Do you see any seasonal ridership patterns by usertype?



In [ ]:

Repeat the above three steps, counting the number of rides by time of day rather thatn by month.



In [ ]:

Are there any other interesting insights you can discover in the data using these tools?



In [ ]:



In [ ]:

Looking Forward to Homework

In the homework this week, you will have a chance to apply some of these patterns to a brand new (but closely related) dataset.

	trip_id	starttime	stoptime	bikeid	tripduration	from_station_name	to_station_name	from_station_id	to_station_id	usertype	gender	birthyear
0	431	10/13/2014 10:31	10/13/2014 10:48	SEA00298	985.935	2nd Ave & Spring St	Occidental Park / Occidental Ave S & S Washing...	CBD-06	PS-04	Annual Member	Male	1960
1	432	10/13/2014 10:32	10/13/2014 10:48	SEA00195	926.375	2nd Ave & Spring St	Occidental Park / Occidental Ave S & S Washing...	CBD-06	PS-04	Annual Member	Male	1970
2	433	10/13/2014 10:33	10/13/2014 10:48	SEA00486	883.831	2nd Ave & Spring St	Occidental Park / Occidental Ave S & S Washing...	CBD-06	PS-04	Annual Member	Female	1988
3	434	10/13/2014 10:34	10/13/2014 10:48	SEA00333	865.937	2nd Ave & Spring St	Occidental Park / Occidental Ave S & S Washing...	CBD-06	PS-04	Annual Member	Female	1977
4	435	10/13/2014 10:34	10/13/2014 10:49	SEA00202	923.923	2nd Ave & Spring St	Occidental Park / Occidental Ave S & S Washing...	CBD-06	PS-04	Annual Member	Male	1971

	trip_id	starttime	stoptime	bikeid	tripduration	from_station_name	to_station_name	from_station_id	to_station_id	usertype	gender	birthyear
142841	156796	10/12/2015 20:41	10/12/2015 20:47	SEA00358	377.183	E Pine St & 16th Ave	Summit Ave & E Denny Way	CH-07	CH-01	Annual Member	Male	1990
142842	156797	10/12/2015 20:43	10/12/2015 20:48	SEA00399	303.330	Bellevue Ave & E Pine St	Summit Ave E & E Republican St	CH-12	CH-03	Annual Member	Male	1978
142843	156798	10/12/2015 21:03	10/12/2015 21:06	SEA00204	165.597	Harvard Ave & E Pine St	E Harrison St & Broadway Ave E	CH-09	CH-02	Annual Member	Male	1989
142844	156799	10/12/2015 21:35	10/12/2015 21:41	SEA00073	388.576	Pine St & 9th Ave	3rd Ave & Broad St	SLU-16	BT-01	Short-Term Pass Holder	NaN	NaN
142845	156800	10/12/2015 22:45	10/12/2015 22:51	SEA00033	391.885	NE 42nd St & University Way NE	Eastlake Ave E & E Allison St	UD-02	EL-05	Annual Member	Male	1985

gender	Female	Male	Other
0	9.016608	8.750898	8.123788
1	8.721844	7.865786	13.765683
2	8.371583	8.880909	7.344022
3	16.796675	8.087312	5.643350
4	9.123506	12.134917	2.810767
5	12.970906	8.261415	NaN
6	10.954091	8.038184	9.752973
7	12.392548	9.461130	10.525419
8	11.309898	9.219013	10.129815
9	10.881655	9.019535	11.167447
10	12.152443	9.275972	11.615609
11	11.763075	9.066147	13.589810
12	12.436949	9.859569	10.503408
13	13.031992	9.675181	10.693493
14	13.085510	10.097013	11.780128
15	12.521131	10.586964	10.580798
16	12.614844	10.534829	13.845402
17	12.927452	10.507247	11.990486
18	12.414073	9.432061	9.477819
19	12.008509	9.068067	9.579921
20	11.716320	9.253255	8.051081
21	11.241531	8.859545	7.606792
22	11.088358	8.733132	6.618971
23	11.032336	8.085837	5.183287

Software Engineering for Data Scientists

Manipulating Data with Python

Today's Objectives

1. Opening & Navigating the IPython Notebook

2. Simple Math in the IPython Notebook

3. Loading data with pandas

4. Cleaning and Manipulating data with pandas

5. Visualizing data with pandas

1. Opening and Navigating the IPython Notebook

2. Simple Math in the IPython Notebook

3. Loading data with pandas

Python's Data Science Ecosystem

numpy: Numerical Python

scipy: Scientific Python

pandas: Labeled Data Manipulation in Python

matplotlib: Visualization in Python

Installing Pandas & friends

Loading Data with Pandas

Viewing Pandas Dataframes

4. Manipulating data with pandas

Working with Times

Simple Grouping of Data

Value Counts

Group-by Operation

5. Visualizing data with pandas

Adjusting the Plot Style

Other plot types

Breakout: Exploring the Data

Looking Forward to Homework

3. Loading data with `pandas`

4. Cleaning and Manipulating data with `pandas`

5. Visualizing data with `pandas`

3. Loading data with `pandas`

`numpy`: Numerical Python

`scipy`: Scientific Python

`pandas`: Labeled Data Manipulation in Python

`matplotlib`: Visualization in Python

4. Manipulating data with `pandas`

5. Visualizing data with `pandas`