We've learned the basics of loading files, and now it's time to reorganize the loaded data into commonly used data structures from NumPy and Pandas. To motivate the various data structures, we're going to feed them into matplotlib for visualization. This lecture-lab is then all about steps 2, 3, and 5 from our generic data science program template:
You'll learn more about step 4 in the courses on machine learning, timeseries analysis, and so on.
Let's get started by importing all of the packages we're going to need and setting a few parameters that make this Jupyter notebook look better:
In [1]:
import pandas
import numpy as np
import matplotlib.pyplot as plt
pandas.options.display.max_rows = 7 # Don't display too much data (Pandas)
np.set_printoptions(threshold=4) # Don't display too much data (NumPy)
Analytics programs tend to use lots of one- and two-dimensional arrays. 2D arrays are matrices and tables of data. 1D arrays are vectors, such as points in Euclidean space. A column or row of a table is also a 1D array. Python has lists and lists of lists that would suffice for 1D and 2D arrays, but Pandas and NumPy define similar but more capable data structures.
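The distinction is easy to see in plain NumPy. This is a quick sketch with made-up numbers, not the lecture's data:

```python
import numpy as np

vec = np.array([1.0, 2.0, 3.0])           # 1D array: a vector
mat = np.array([[1, 2], [3, 4], [5, 6]])  # 2D array: a 3x2 matrix

print(vec.shape)   # (3,)    one dimension
print(mat.shape)   # (3, 2)  rows x columns
print(mat[0])      # the first row, itself a 1D array
print(mat[:, 1])   # the second column, also a 1D array
```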
Let's start with Pandas data frames, which are powerful tables very much like Excel tables. I also think that Pandas' read_csv()
is the easiest way to load most kinds of data organized into rows and columns. Here's a sample data file with a list of prices over time, one data point per line and no header row:
In [61]:
! wc data/prices.txt
! head data/prices.txt
(The wc
and head
are bash commands that you might find useful in the future.)
Here's how to load that file using Pandas:
In [62]:
prices = pandas.read_csv('data/prices.txt', header=None)
prices # jupyter notebooks know how to display this nicely
Out[62]:
The numbers in the left column are just the index and are displayed by Pandas for your information; they are not stored in memory as part of the data structure. Let's look at the type and shape of this data structure:
In [63]:
print "type is", type(prices)
print "shape is", prices.shape
That output indicates that the data is stored in a DataFrame
object and there are 345 rows and one column.
While Pandas is great for loading the data, and a few other things we'll see below, I prefer working with NumPy arrays; the actual type is called ndarray
. Let's convert that list of prices from a data frame to a NumPy array:
In [64]:
m = prices.as_matrix() # Convert data frame to numpy array
print "type is", type(m)
print "shape is", m.shape
print m
We can access the 2D NumPy arrays using array index notation array[
row, column]
:
In [65]:
print m[0] # Access the first row
print m[0,0] # Access the first column of the first row
print m[1] # Access the 2nd row
print m[1,0] # Access the first column of the 2nd row
That is a little weird though. We think of that as a 1D array or just a list, not a 2D array with a single column (shape is 345 x 1). To get NumPy to treat that as 1D, we use the shape
attribute of the array:
In [66]:
m.shape = (345,) # len(m)==345
m
Out[66]:
Now, we can access the elements using a single index as we would expect:
In [67]:
print m[0]
print m[1]
print m[2]
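Assigning to shape mutates the array in place; reshape and ravel do the same job but hand back the reshaped array, which can be more convenient. A sketch with made-up numbers:

```python
import numpy as np

col = np.array([[1.1], [2.2], [3.3]])  # shape (3, 1): one column, like our prices
flat = col.reshape(3)                  # same data viewed as 1D, shape (3,)
also_flat = col.ravel()                # same idea; the length is inferred

print(flat.shape)       # (3,)
print(also_flat.shape)  # (3,)
print(flat[1])          # 2.2
```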
A shape tuple with only one dimension indicates a 1D array. That is also the shape we get when NumPy converts a regular Python list to an array:
In [68]:
sizes = [28, 32, 34, 36, 38, 39, 40, 41] # Plain old Python list
a = np.array(sizes) # Convert to NumPy array
print "shape is", a.shape
a
Out[68]:
While we're at it, here's how to convert a list of lists to a 2D NumPy array:
In [69]:
stuff = [
[ 18, 8, 307, 3504],
[ 15, 8, 350, 3693],
[ 18, 8, 318, 3436]
]
m = np.array(stuff)
print "shape is", m.shape
m
Out[69]:
Now multiple indices make sense. For example, to access the element containing value 3436, we'd use m[2,3]
(3rd row, 4th column).
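To double-check that indexing, here is the same small matrix with a couple of lookups:

```python
import numpy as np

stuff = [
    [18, 8, 307, 3504],
    [15, 8, 350, 3693],
    [18, 8, 318, 3436]
]
m = np.array(stuff)

print(m[2, 3])  # 3rd row, 4th column -> 3436
print(m[0, 2])  # 1st row, 3rd column -> 307
```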
NumPy functions and operators work on whole arrays at once, so a function written with scalars in mind often works on vectors for free:
In [70]:
import numpy as np
def f(x):
    "Scalar or vector math!"
    return np.cos(3 * np.pi * x) / x
print f(3.4)
X = np.array([1.2,3.0]) # a numpy array is more flexible than list of numbers
print f(X) # returns array due to vector math in f()!
print [f(x) for x in X] # manually apply f() to X
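It's worth convincing yourself that the broadcast f(X) really does match applying f() element by element; here is a small check:

```python
import numpy as np

def f(x):
    "Works on a scalar or a whole array, thanks to NumPy broadcasting"
    return np.cos(3 * np.pi * x) / x

X = np.array([1.2, 3.0])
broadcast = f(X)                      # implied map over the array
manual = np.array([f(x) for x in X])  # explicit map, one element at a time

assert np.allclose(broadcast, manual)
print(broadcast.shape)  # (2,)
```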
Our list of prices is representative of timeseries data, such as stock price, temperature, or population fluctuations. Matplotlib is a great library for visualizing data and in this section we're going to use it to display the prices as a timeseries using plot()
. That function takes the X and Y coordinates as separate arrays.
I find Matplotlib kind of mysterious, but I have learned patterns that I use over and over again, such as this timeseries plot.
In [71]:
m = prices.as_matrix() # Let's convert pandas data frame to numpy array
time = np.arange(0, len(m), 1) # Time axis goes from 0 to len(m) by 1
#plt.figure(figsize=(5, 2)) # Prepare a plot 5x2 inches
plt.plot(time, m) # Plot time vs the prices data
plt.xlabel("Time") # Always set the axes labels
plt.ylabel("Price (dollars)")
plt.show() # Show the actual plot
Plotting functions
Sometimes we have a smooth function such as a cosine that we'd like to plot. To do that, we need to sample the function at regular intervals to collect a list of Y coordinates (like prices from before). Let's start by defining the function that maps X coordinates to Y values and then get a sample of X values at regular intervals between 0.1 and 1.1, stepping by 0.01:
In [72]:
def f(x):
    return np.cos(3 * np.pi * x) / x
X = np.arange(.1, 1.1, 0.01) # from .1 to 1.1 by step 0.01
There are three ways to sample the function f()
at the coordinates contained in X. All three employ our map pattern:
In [73]:
# Get f(x) values for all x in three different ways
# Option 1: (non-Pythonic)
Y = []
for x in X:
    Y.append(f(x))
# Option 2: Pythonic way (cool kids do this)
Y = [f(x) for x in X]
# Option 3: Data science way (the most popular kids do this)
Y = f(X) # a so-called broadcast; implied map
print X
print Y
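All three options compute the same Y values, which we can verify directly:

```python
import numpy as np

def f(x):
    return np.cos(3 * np.pi * x) / x

X = np.arange(.1, 1.1, 0.01)

Y1 = []                  # option 1: explicit loop
for x in X:
    Y1.append(f(x))
Y2 = [f(x) for x in X]   # option 2: list comprehension
Y3 = f(X)                # option 3: broadcast (implied map)

assert np.allclose(Y1, Y2) and np.allclose(Y2, Y3)
```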
Given X and Y coordinates, we can plot the function:
In [74]:
plt.figure(figsize=(5, 2))
plt.plot(X, Y)
plt.xlabel("x")
plt.ylabel("cos(3 * pi * x) / x")
plt.show()
Next, let's look at a more interesting data set that has multiple attributes per record, describing various cars:
In [75]:
! head data/cars.csv
We can use Pandas again to load the data into a data frame and then convert to a NumPy 2D array (a matrix):
In [76]:
cars = pandas.read_csv('data/cars.csv')
print "shape is", cars.shape
cars
Out[76]:
In [77]:
m = cars.as_matrix()
print "shape is", m.shape
m
Out[77]:
Let's say we're interested in the relationship between the weight of the car and the fuel efficiency. We can examine that relationship visually by plotting weight against efficiency using a scatterplot.
This brings us to the question of how to extract columns from numpy arrays, where each column represents the data associated with one attribute of all cars. The idea is to fix the column number but use a wildcard (the colon character) to indicate we want all rows:
In [78]:
# can do this:
print cars.MPG
print cars['MPG']
# but I like as numpy matrix
mpg = m[:,0]
wgt = m[:,3]
print "shape is", mpg.shape
mpg
Out[78]:
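On a tiny made-up matrix, the wildcard pattern looks like this:

```python
import numpy as np

m = np.array([
    [18, 8, 307, 3504],
    [15, 8, 350, 3693],
    [18, 8, 318, 3436]
])

mpg = m[:, 0]  # all rows, column 0
wgt = m[:, 3]  # all rows, column 3

print(mpg)        # [18 15 18]
print(wgt.shape)  # (3,): a 1D array, one value per car
```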
Once we have the two columns, we can use matplotlib's scatter()
function:
In [79]:
plt.figure(figsize=(5, 2))
plt.scatter(wgt, mpg, alpha=0.5) # looks cooler with alpha (opacity) at 50%
plt.xlabel('Weight (pounds)')
plt.ylabel('Miles per gallon')
plt.show()
Great! This shows a clear relationship between weight and efficiency: the heavier the car, the lower the efficiency.
It would also be interesting to know how engine size is related to weight and efficiency. We could go to a three-dimensional graph or even multiple graphs, but it's better to add another attribute to a single graph in this case. We could change the color according to engine size, but a better visualization changes the size of the plotted point.
We can pull out the engine size with m[:,2]
like we pulled out the other columns. Conveniently, scatter()
accepts a whole array for its s
parameter, so every point gets its own size without any loop. The s
parameter is actually proportional to the area of the circle we want (see pyplot scatter plot marker size). To keep the circles readable, I scale engine size by .5. Here is the code to illustrate the relationship between three variables:
In [80]:
plt.figure(figsize=(5, 2))
hp = m[:,2]
plt.scatter(wgt, mpg, s=hp*.5, alpha=0.1)
plt.xlabel('Weight (pounds)')
plt.ylabel('Miles per gallon')
plt.show()
When exploring data, it's often useful to know the unique set of values. For example, if we'd like to know which cylinder counts appear in the data, we can use set(mylist):
In [81]:
m = cars.as_matrix()
cyl = m[:,1]
print set(cyl)
Interesting. I did not know there were cars with 3 cylinders.
Exercise: Convert cyl
to a set of integers (not floating-point values) using a map pattern. Hint: int(3.0)
gives 3.
Instead of just a unique set of attribute values, we can count how many of each unique value appear in a data set. Python's Counter
object, a kind of dict
, knows how to count the elements. For example, here is how we'd get a dictionary-like object mapping number of cylinders to number of cars with that many cylinders:
In [82]:
from collections import Counter
m = cars.as_matrix()
cyl = m[:,1]
Counter(cyl)
Out[82]:
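Counter works on any list. Here's the same idea on a handful of made-up cylinder values:

```python
from collections import Counter

cylinders = [8, 8, 4, 6, 4, 4, 8]  # made-up sample, not the real cars data
counts = Counter(cylinders)

print(counts[4])  # 3 cars with 4 cylinders
print(counts[6])  # 1 car with 6 cylinders
```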
That works great for categorical variables, which the number of cylinders kind of is, but Counter
is not as appropriate for numerical values. Besides, looking at the cylinder counts, it's hard to quickly understand the relative populations. Visualizing the data with a histogram is a much better idea and works for numerical values too. Here is the code to make a histogram of the cylinder attribute:
In [83]:
plt.figure(figsize=(5, 2))
plt.hist(cyl, alpha=0.5)
plt.xlabel('Number of cylinders')
plt.ylabel('Number of cars')
plt.show()
The same pattern gives us a histogram for numerical data as well:
In [84]:
m = cars.as_matrix()
hp = m[:,2]
plt.figure(figsize=(5, 2))
plt.hist(hp, alpha=0.5)
plt.xlabel('Horsepower')
plt.ylabel('Number of cars')
plt.show()
A histogram is really a chunky estimate of a variable's density function, so it's often useful to normalize the histogram so that the area integrates (sums) to 1. To get a normalized histogram, use the argument normed=True (renamed density=True in newer versions of Matplotlib):
In [85]:
plt.figure(figsize=(5, 2))
n, bins, patches = plt.hist(hp, normed=True, alpha=0.5)
plt.xlabel('Horsepower')
plt.ylabel('Probability')
plt.show()
Note that it is not the sum of the bar heights that equals 1; it is the sum of height * binwidth over all bins that equals 1.
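We can verify the height * binwidth claim numerically with np.histogram, which is what plt.hist uses under the hood (density=True is the modern name for normed=True). The horsepower values here are made up:

```python
import numpy as np

hp = np.array([130., 165., 150., 140., 198., 220., 215., 225., 190., 95.])
n, bins = np.histogram(hp, density=True)  # normalized bin heights and edges

area = np.sum(n * np.diff(bins))  # height * binwidth, summed over all bins
print(area)  # 1.0, up to floating-point error
```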
Next, let's load a small sales data set so we can explore data frame access in more detail:
In [86]:
sales = pandas.read_csv('data/sales-small.csv')
sales
Out[86]:
The nice thing about the data frames is that we can access the columns by name, using either table.
attribute or array indexing notation table[
attribute ]
:
In [87]:
sales.Date
Out[87]:
In [88]:
sales['Date']
Out[88]:
In [89]:
sales['Customer Name']
Out[89]:
Accessing rows via sales[0]
doesn't work, because Pandas reserves array indexing notation for getting columns. Instead, we have to use the slightly more awkward iloc notation:
In [90]:
sales.iloc[0] # get first row of data
Out[90]:
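Here is the row-versus-column rule on a tiny made-up frame standing in for the sales data:

```python
import pandas as pd

df = pd.DataFrame({'Date': ['10/13/10', '10/14/10'],
                   'Quantity': [6, 2]})  # made-up rows

first_row = df.iloc[0]        # rows need iloc
print(first_row['Quantity'])  # 6
print(df['Quantity'][0])      # column first, then row, also works: 6
```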
To get individual elements, we can use regular list-of-lists Python notation after the iloc:
In [91]:
print sales.iloc[0][0], sales.iloc[0][1], sales.iloc[0][2]
During construction and debugging of software, I often like the explicit printing of the column names as is the default shown above. On the other hand, if we need the elements as a plain old Python list, we can do that with list()
:
In [92]:
row = list(sales.iloc[0])
print row
Exercise: Convert all rows of sales
to a list of lists. Hint: use the map pattern and list()
.
The task of that exercise is common enough that Pandas provides a conversion mechanism directly:
In [93]:
m = sales.as_matrix()
print "Type is", type(m)
print m[0] # get first row
We can still get the columns individually using the wildcard notation we saw before:
In [94]:
m[:,0] # get first column
Out[94]:
In [95]:
m[:,4] # get fifth column
Out[95]:
For machine learning, we often want to separate out one of the columns as the dependent variable, keeping the others as a group of independent variables. The notation we typically use is X -> Y, meaning the observations in X predict or classify the results in Y.
For example, let's say we wanted to predict engine size given the efficiency, number of cylinders, and overall car weight. We need to separate out the engine size as Y and combine the other columns into X. Using Pandas, we can easily separate the variables and keep the variable names:
In [96]:
cars = pandas.read_csv('data/cars.csv')
Y = cars['ENG']
X = cars[['MPG','CYL','WGT']]
print X
print
print Y
Converting to a NumPy array strips away the column names but lets us treat the data as a matrix, which is handy in a lot of cases (e.g., matrix addition). Separating columns from NumPy arrays is a bit more cumbersome, however:
In [97]:
m = cars.as_matrix()
Y = m[:,2]
X = np.column_stack((m[:,0],m[:,1], m[:,3])) # note extra parens; it's a tuple of columns
print X
print
print Y
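The shapes after the split are worth checking; here's the same recipe on two made-up rows:

```python
import numpy as np

m = np.array([
    [18., 8., 307., 3504.],
    [15., 8., 350., 3693.]
])

Y = m[:, 2]                                       # the engine-size column
X = np.column_stack((m[:, 0], m[:, 1], m[:, 3]))  # the remaining columns

print(X.shape)  # (2, 3): two cars, three independent variables
print(Y.shape)  # (2,)
```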
While NumPy arrays are more cumbersome when pulling apart tables, accessing individual elements without iloc is usually more convenient:
In [98]:
print cars.iloc[0][1]
print m[0,1]
This section uses tips from Jeremy Howard on real-world data cleanup.
In [99]:
import pandas
df = pandas.read_csv("data/mixed.csv", parse_dates=['Date'])
df
Out[99]:
If you ever need to convert dates to elapsed time, you can convert the date timestamp to UNIX time, the number of seconds since 1970:
In [146]:
d = df['Date']
delta = d - pandas.datetime(1970,1,1)
delta
Out[146]:
In [147]:
df3 = df.copy()
df3['Date'] = delta.dt.total_seconds()
df3
Out[147]:
Or, you can convert the timestamp into the number of days since 1970:
In [150]:
delta.dt.days
Out[150]:
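The same conversions on a couple of made-up dates, using pd.Timestamp (the modern spelling of pandas.datetime):

```python
import pandas as pd

d = pd.to_datetime(pd.Series(['1970-01-02', '1971-01-01']))
delta = d - pd.Timestamp('1970-01-01')  # timedeltas since the UNIX epoch

print(list(delta.dt.days))               # [1, 365]
print(delta.dt.total_seconds().iloc[0])  # 86400.0
```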
Pandas can also treat a string column as a categorical variable, which gives us access to the category labels and their integer codes:
In [100]:
df['Topic'] = df['Topic'].astype('category')
df
Out[100]:
In [101]:
print df['Topic'].cat.categories # .cat field gives us access to categories stuff
In [102]:
print df['Topic'].cat.codes
In [103]:
print df['Topic'].cat.as_ordered()
In [114]:
# make sure you convert to categorical first
df['Topic'] = df['Topic'].astype('category')
df['Topic'] = df['Topic'].cat.codes+1 # add one so NA (-1) becomes 0
df
Out[114]:
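Here's the codes trick in isolation, on a made-up topic column:

```python
import pandas as pd
import numpy as np

topic = pd.Series(['ML', 'DB', np.nan, 'ML']).astype('category')

print(list(topic.cat.categories))  # ['DB', 'ML'], sorted alphabetically
print(list(topic.cat.codes))       # [1, 0, -1, 1]; missing values get -1
print(list(topic.cat.codes + 1))   # [2, 1, 0, 2]; now NA is 0
```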
Our data has a missing description, which we can ignore, but it also has a missing size (numeric) and topic (categorical) entry. The missing categorical value can be handled with dummy_na=True on get_dummies() (see the next section). Let's convert the missing numeric data first:
In [105]:
pandas.isnull(df['Size'])
Out[105]:
In [106]:
df['Size_na'] = pandas.isnull(df['Size'])
df
Out[106]:
In [107]:
szcol = df['Size']
df['Size'] = szcol.fillna(szcol.median())
df
Out[107]:
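The fill-with-median recipe in isolation, on a made-up column:

```python
import pandas as pd
import numpy as np

size = pd.Series([10.0, np.nan, 30.0])  # made-up sizes with one missing value
size_na = pd.isnull(size)               # remember which entries were missing
filled = size.fillna(size.median())     # median of [10, 30] is 20

print(list(size_na))  # [False, True, False]
print(list(filled))   # [10.0, 20.0, 30.0]
```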
In [108]:
pandas.get_dummies(df) # convert all categorical to dummies
Out[108]:
In [109]:
pandas.get_dummies(df['Topic']) # One column's dummies
Out[109]:
In [110]:
pandas.get_dummies(df['Topic'], dummy_na=True) # Add an "na" column
Out[110]:
We can manually pack these new columns into the old data frame and delete the old column:
In [119]:
df2 = pandas.concat([df,pandas.get_dummies(df['Topic'], dummy_na=True)], axis=1)
df2.drop('Topic', axis=1, inplace=True) # Considered better than del df2['Topic'] I think
df2
Out[119]:
Or, we can do it the easy way by just specifying the columns to convert:
In [112]:
pandas.get_dummies(df, columns=['Topic'], dummy_na=True) # The easy way
Out[112]:
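A sketch of that one-liner on made-up data:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Topic': ['ML', 'DB', np.nan], 'Size': [1, 2, 3]})
d = pd.get_dummies(df, columns=['Topic'], dummy_na=True)

print(list(d.columns))  # Size plus one dummy column per topic, including nan
print(d.shape)          # (3, 4)
```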
If you ever need to walk the columns in your data frame, you can do that with a for-each loop:
In [113]:
# walk columns; col is the actual series
for name,col in df.iteritems():
    print name
In [116]:
df.describe() # useful stats about columns
Out[116]:
In this lecture, you've learned the basics of loading and manipulating data:
loading files with read_csv()
converting a data frame to a NumPy array with as_matrix()
extracting columns with table.columnname or matrix[:,columnindex]
accessing elements with table.iloc[rowindex][columnindex] or matrix[rowindex,columnindex]
finding the unique set of values with set(mylist)
And, you've learned how to visualize: