We've learned the basics of loading files, and now it's time to reorganize the loaded data into commonly used data structures from NumPy and Pandas. To motivate the various data structures, we're going to feed them into matplotlib for visualization. This lecture-lab is then all about steps 2, 3, and 5 from our generic data science program template:
You'll learn more about step 4 in the courses on machine learning, timeseries analysis, and so on.
Let's get started by importing all of the packages we're going to need and setting a few parameters that make this Jupyter notebook look better:
In [1]:
import pandas
import numpy as np
import matplotlib.pyplot as plt
pandas.options.display.max_rows = 7 # Don't display too much data (Pandas)
np.set_printoptions(threshold=4) # Don't display too much data (NumPy)
Analytics programs tend to use lots of one- and two-dimensional arrays. 2D arrays are matrices and tables of data. 1D arrays are vectors, such as points in Euclidean space. A column or row of a table is also a 1D array. Python has lists and lists of lists that would suffice for 1D and 2D arrays, but Pandas and NumPy define similar but more capable data structures.
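The distinction is easy to see in plain NumPy. This is a quick sketch with made-up numbers, not the lecture's data:

```python
import numpy as np

vec = np.array([1.0, 2.0, 3.0])           # 1D array: a vector
mat = np.array([[1, 2], [3, 4], [5, 6]])  # 2D array: a 3x2 matrix

print(vec.shape)   # (3,)    one dimension
print(mat.shape)   # (3, 2)  rows x columns
print(mat[0])      # the first row, itself a 1D array
print(mat[:, 1])   # the second column, also a 1D array
```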
Let's start with Pandas data frames, which are powerful tables very much like Excel tables. I also think that Pandas' read_csv()
is the easiest way to load most kinds of data organized into rows and columns. Here's a sample data file with a list of prices over time, one data point per line and no header row:
In [61]:
! wc data/prices.txt
! head data/prices.txt
(The wc
and head
are bash commands that you might find useful in the future.)
Here's how to load that file using Pandas:
In [62]:
prices = pandas.read_csv('data/prices.txt', header=None)
prices # jupyter notebooks know how to display this nicely
Out[62]:
The numbers in the left column are just the index and are displayed by Pandas for your information; they are not stored in memory as part of the data structure. Let's look at the type and shape of this data structure:
In [63]:
print "type is", type(prices)
print "shape is", prices.shape
That output indicates that the data is stored in a DataFrame
object and there are 345 rows and one column.
While Pandas is great for loading the data, and a few other things we'll see below, I prefer working with NumPy arrays; the actual type is called ndarray
. Let's convert that list of prices from a data frame to a NumPy array:
In [64]:
m = prices.as_matrix() # Convert data frame to numpy array
print "type is", type(m)
print "shape is", m.shape
print m
We can access the 2D NumPy arrays using array index notation array[
row, column]
:
In [65]:
print m[0] # Access the first row
print m[0,0] # Access the first column of the first row
print m[1] # Access the 2nd row
print m[1,0] # Access the first column of the 2nd row
That is a little weird though. We think of that as a 1D array or just a list, not a 2D array with a single column (shape is 345 x 1). To get NumPy to treat that as 1D, we use the shape
attribute of the array:
In [66]:
m.shape = (345,) # len(m)==345
m
Out[66]:
Now, we can access the elements using a single index as we would expect:
In [67]:
print m[0]
print m[1]
print m[2]
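Assigning to shape mutates the array in place; reshape and ravel do the same job but hand back the reshaped array, which can be more convenient. A sketch with made-up numbers:

```python
import numpy as np

col = np.array([[1.1], [2.2], [3.3]])  # shape (3, 1): one column, like our prices
flat = col.reshape(3)                  # same data viewed as 1D, shape (3,)
also_flat = col.ravel()                # same idea; the length is inferred

print(flat.shape)       # (3,)
print(also_flat.shape)  # (3,)
print(flat[1])          # 2.2
```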
A shape tuple with only one dimension indicates a 1D array. That is also the shape we get when NumPy converts a regular Python list to an array:
In [68]:
sizes = [28, 32, 34, 36, 38, 39, 40, 41] # Plain old Python list
a = np.array(sizes) # Convert to NumPy array
print "shape is", a.shape
a
Out[68]:
While we're at it, here's how to convert a list of lists to a 2D NumPy array:
In [69]:
stuff = [
[ 18, 8, 307, 3504],
[ 15, 8, 350, 3693],
[ 18, 8, 318, 3436]
]
m = np.array(stuff)
print "shape is", m.shape
m
Out[69]:
Now multiple indices make sense. For example, to access the element containing value 3436, we'd use m[2,3]
(3rd row, 4th column).
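To double-check that indexing, here is the same small matrix with a couple of lookups:

```python
import numpy as np

stuff = [
    [18, 8, 307, 3504],
    [15, 8, 350, 3693],
    [18, 8, 318, 3436]
]
m = np.array(stuff)

print(m[2, 3])  # 3rd row, 4th column -> 3436
print(m[0, 2])  # 1st row, 3rd column -> 307
```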
NumPy functions and operators work on whole arrays at once, so a function written with scalars in mind often works on vectors for free:
In [70]:
import numpy as np
def f(x):
    "Scalar or vector math!"
    return np.cos(3 * np.pi * x) / x
print f(3.4)
X = np.array([1.2,3.0]) # a numpy array is more flexible than list of numbers
print f(X) # returns array due to vector math in f()!
print [f(x) for x in X] # manually apply f() to X
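It's worth convincing yourself that the broadcast f(X) really does match applying f() element by element; here is a small check:

```python
import numpy as np

def f(x):
    "Works on a scalar or a whole array, thanks to NumPy broadcasting"
    return np.cos(3 * np.pi * x) / x

X = np.array([1.2, 3.0])
broadcast = f(X)                      # implied map over the array
manual = np.array([f(x) for x in X])  # explicit map, one element at a time

assert np.allclose(broadcast, manual)
print(broadcast.shape)  # (2,)
```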
Our list of prices is representative of timeseries data, such as stock price, temperature, or population fluctuations. Matplotlib is a great library for visualizing data and in this section we're going to use it to display the prices as a timeseries using plot()
. That function takes the X and Y coordinates as separate arrays.
I find Matplotlib kind of mysterious, but I have learned patterns that I use over and over again, such as this timeseries plot.
In [71]:
m = prices.as_matrix() # Let's convert pandas data frame to numpy array
time = np.arange(0, len(m), 1) # Time axis goes from 0 to len(m) by 1
#plt.figure(figsize=(5, 2)) # Prepare a plot 5x2 inches
plt.plot(time, m) # Plot time vs the prices data
plt.xlabel("Time") # Always set the axes labels
plt.ylabel("Price (dollars)")
plt.show() # Show the actual plot
Plotting functions
Sometimes we have a smooth function such as a cosine that we'd like to plot. To do that, we need to sample the function at regular intervals to collect a list of Y coordinates (like prices from before). Let's start by defining the function that maps X coordinates to Y values and then get a sample of X values at regular intervals between 0.1 and 1.1, stepping by 0.01:
In [72]:
def f(x):
    return np.cos(3 * np.pi * x) / x
X = np.arange(.1, 1.1, 0.01) # from .1 to 1.1 by step 0.01
There are three ways to sample the function f()
at the coordinates contained in X. All three employ our map pattern:
In [73]:
# Get f(x) values for all x in three different ways
# Option 1: (non-Pythonic)
Y = []
for x in X:
    Y.append(f(x))
# Option 2: Pythonic way (cool kids do this)
Y = [f(x) for x in X]
# Option 3: Data science way (the most popular kids do this)
Y = f(X) # a so-called broadcast; implied map
print X
print Y
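All three options compute the same Y values, which we can verify directly:

```python
import numpy as np

def f(x):
    return np.cos(3 * np.pi * x) / x

X = np.arange(.1, 1.1, 0.01)

Y1 = []                  # option 1: explicit loop
for x in X:
    Y1.append(f(x))
Y2 = [f(x) for x in X]   # option 2: list comprehension
Y3 = f(X)                # option 3: broadcast (implied map)

assert np.allclose(Y1, Y2) and np.allclose(Y2, Y3)
```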
Given X and Y coordinates, we can plot the function:
In [74]:
plt.figure(figsize=(5, 2))
plt.plot(X, Y)
plt.xlabel("x")
plt.ylabel("cos(3 * pi * x) / x")
plt.show()
Next, let's look at a more interesting data set that has multiple attributes per record, describing various cars:
In [75]:
! head data/cars.csv
We can use Pandas again to load the data into a data frame and then convert to a NumPy 2D array (a matrix):
In [76]:
cars = pandas.read_csv('data/cars.csv')
print "shape is", cars.shape
cars
Out[76]:
In [77]:
m = cars.as_matrix()
print "shape is", m.shape
m
Out[77]:
Let's say we're interested in the relationship between the weight of the car and the fuel efficiency. We can examine that relationship visually by plotting weight against efficiency using a scatterplot.
This brings us to the question of how to extract columns from numpy arrays, where each column represents the data associated with one attribute of all cars. The idea is to fix the column number but use a wildcard (the colon character) to indicate we want all rows:
In [78]:
# can do this:
print cars.MPG
print cars['MPG']
# but I like as numpy matrix
mpg = m[:,0]
wgt = m[:,3]
print "shape is", mpg.shape
mpg
Out[78]:
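On a tiny made-up matrix, the wildcard pattern looks like this:

```python
import numpy as np

m = np.array([
    [18, 8, 307, 3504],
    [15, 8, 350, 3693],
    [18, 8, 318, 3436]
])

mpg = m[:, 0]  # all rows, column 0
wgt = m[:, 3]  # all rows, column 3

print(mpg)        # [18 15 18]
print(wgt.shape)  # (3,): a 1D array, one value per car
```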
Once we have the two columns, we can use matplotlib's scatter()
function:
In [79]:
plt.figure(figsize=(5, 2))
plt.scatter(wgt, mpg, alpha=0.5) # looks cooler with alpha (opacity) at 50%
plt.xlabel('Weight (pounds)')
plt.ylabel('Miles per gallon')
plt.show()
Great! This shows a clear relationship between weight and efficiency: the heavier the car, the lower the efficiency.
It would also be interesting to know how engine size is related to weight and efficiency. We could go to a three-dimensional graph or even multiple graphs, but it's better to add another attribute to a single graph in this case. We could change the color according to engine size, but a better visualization changes the size of the plotted point.
We can pull out the engine size with m[:,2]
like we pulled out the other columns. Conveniently, scatter()
accepts a whole array for its s
parameter, so every point gets its own size without any loop. The s
parameter is actually proportional to the area of the circle we want (see pyplot scatter plot marker size). To keep the circles readable, I scale engine size by .5. Here is the code to illustrate the relationship between three variables:
In [80]:
plt.figure(figsize=(5, 2))
hp = m[:,2]
plt.scatter(wgt, mpg, s=hp*.5, alpha=0.1)
plt.xlabel('Weight (pounds)')
plt.ylabel('Miles per gallon')
plt.show()
When exploring data, it's often useful to know the unique set of values. For example, if we'd like to know which cylinder counts appear in the data, we can use set(mylist):
In [81]:
m = cars.as_matrix()
cyl = m[:,1]
print set(cyl)
Interesting. I did not know there were cars with 3 cylinders.
Exercise: Convert cyl
to a set of integers (not floating-point values) using a map pattern. Hint: int(3.0)
gives 3.
Instead of just a unique set of attribute values, we can count how many of each unique value appear in a data set. Python's Counter
object, a kind of dict
, knows how to count the elements. For example, here is how we'd get a dictionary-like object mapping number of cylinders to number of cars with that many cylinders:
In [82]:
from collections import Counter
m = cars.as_matrix()
cyl = m[:,1]
Counter(cyl)
Out[82]:
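Counter works on any list. Here's the same idea on a handful of made-up cylinder values:

```python
from collections import Counter

cylinders = [8, 8, 4, 6, 4, 4, 8]  # made-up sample, not the real cars data
counts = Counter(cylinders)

print(counts[4])  # 3 cars with 4 cylinders
print(counts[6])  # 1 car with 6 cylinders
```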
That works great for categorical variables, which the number of cylinders kind of is, but Counter
is not as appropriate for numerical values. Besides, looking at the cylinder counts, it's hard to quickly understand the relative populations. Visualizing the data with a histogram is a much better idea and works for numerical values too. Here is the code to make a histogram of the cylinder attribute:
In [83]:
plt.figure(figsize=(5, 2))
plt.hist(cyl, alpha=0.5)
plt.xlabel('Number of cylinders')
plt.ylabel('Number of cars')
plt.show()
The same pattern gives us a histogram for numerical data as well:
In [84]:
m = cars.as_matrix()
hp = m[:,2]
plt.figure(figsize=(5, 2))
plt.hist(hp, alpha=0.5)
plt.xlabel('Horsepower')
plt.ylabel('Number of cars')
plt.show()
A histogram is really a chunky estimate of a variable's density function, so it's often useful to normalize the histogram so that the area integrates (sums) to 1. To get a normalized histogram, use the argument normed=True (renamed density=True in newer versions of Matplotlib):
In [85]:
plt.figure(figsize=(5, 2))
n, bins, patches = plt.hist(hp, normed=True, alpha=0.5)
plt.xlabel('Horsepower')
plt.ylabel('Probability')
plt.show()
Note that it is not the sum of the bar heights that equals 1; it is the sum of height * binwidth over all bins that equals 1.
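We can verify the height * binwidth claim numerically with np.histogram, which is what plt.hist uses under the hood (density=True is the modern name for normed=True). The horsepower values here are made up:

```python
import numpy as np

hp = np.array([130., 165., 150., 140., 198., 220., 215., 225., 190., 95.])
n, bins = np.histogram(hp, density=True)  # normalized bin heights and edges

area = np.sum(n * np.diff(bins))  # height * binwidth, summed over all bins
print(area)  # 1.0, up to floating-point error
```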
Next, let's load a small sales data set so we can explore data frame access in more detail:
In [86]:
sales = pandas.read_csv('data/sales-small.csv')
sales
Out[86]:
The nice thing about the data frames is that we can access the columns by name, using either table.
attribute or array indexing notation table[
attribute ]
:
In [87]:
sales.Date
Out[87]:
In [88]:
sales['Date']
Out[88]:
In [89]:
sales['Customer Name']
Out[89]:
Accessing rows via sales[0]
doesn't work, because Pandas reserves array indexing notation for getting columns. Instead, we have to use the slightly more awkward iloc notation:
In [90]:
sales.iloc[0] # get first row of data
Out[90]:
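Here is the row-versus-column rule on a tiny made-up frame standing in for the sales data:

```python
import pandas as pd

df = pd.DataFrame({'Date': ['10/13/10', '10/14/10'],
                   'Quantity': [6, 2]})  # made-up rows

first_row = df.iloc[0]        # rows need iloc
print(first_row['Quantity'])  # 6
print(df['Quantity'][0])      # column first, then row, also works: 6
```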
To get individual elements, we can use regular list-of-lists Python notation after the iloc:
In [91]:
print sales.iloc[0][0], sales.iloc[0][1], sales.iloc[0][2]
During construction and debugging of software, I often like the explicit printing of the column names as is the default shown above. On the other hand, if we need the elements as a plain old Python list, we can do that with list()
:
In [92]:
row = list(sales.iloc[0])
print row
Exercise: Convert all rows of sales
to a list of lists. Hint: use the map pattern and list()
.
The task of that exercise is common enough that Pandas provides a conversion mechanism directly:
In [93]:
m = sales.as_matrix()
print "Type is", type(m)
print m[0] # get first row
We can still get the columns individually using the wildcard notation we saw before:
In [94]:
m[:,0] # get first column
Out[94]:
In [95]:
m[:,4] # get fifth column
Out[95]:
For machine learning, we often want to separate out one of the columns as the dependent variable, keeping the others as a group of independent variables. The notation we typically use is X -> Y, meaning the observations in X predict or classify the results in Y.
For example, let's say we wanted to predict engine size given the efficiency, number of cylinders, and overall car weight. We need to separate out the engine size as Y and combine the other columns into X. Using Pandas, we can easily separate the variables and keep the variable names:
In [96]:
cars = pandas.read_csv('data/cars.csv')
Y = cars['ENG']
X = cars[['MPG','CYL','WGT']]
print X
print
print Y
Converting to a NumPy array strips away the column names but lets us treat the data as a matrix, which is handy in a lot of cases (e.g., matrix addition). Separating columns from NumPy arrays is a bit more cumbersome, however:
In [97]:
m = cars.as_matrix()
Y = m[:,2]
X = np.column_stack((m[:,0],m[:,1], m[:,3])) # note extra parens; it's a tuple of columns
print X
print
print Y
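The shapes after the split are worth checking; here's the same recipe on two made-up rows:

```python
import numpy as np

m = np.array([
    [18., 8., 307., 3504.],
    [15., 8., 350., 3693.]
])

Y = m[:, 2]                                       # the engine-size column
X = np.column_stack((m[:, 0], m[:, 1], m[:, 3]))  # the remaining columns

print(X.shape)  # (2, 3): two cars, three independent variables
print(Y.shape)  # (2,)
```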
While NumPy arrays are more cumbersome when pulling apart tables, accessing individual elements without iloc is usually more convenient:
In [98]:
print cars.iloc[0][1]
print m[0,1]
This section uses tips from Jeremy Howard on real-world data cleanup.
In [99]:
import pandas
df = pandas.read_csv("data/mixed.csv", parse_dates=['Date'])
df
Out[99]:
If you ever need to convert dates to elapsed time, you can convert the date timestamp to UNIX time, the number of seconds since 1970:
In [146]:
d = df['Date']
delta = d - pandas.datetime(1970,1,1)
delta
Out[146]:
In [147]:
df3 = df.copy()
df3['Date'] = delta.dt.total_seconds()
df3
Out[147]:
Or, you can convert the timestamp into the number of days since 1970:
In [150]:
delta.dt.days
Out[150]:
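The same conversions on a couple of made-up dates, using pd.Timestamp (the modern spelling of pandas.datetime):

```python
import pandas as pd

d = pd.to_datetime(pd.Series(['1970-01-02', '1971-01-01']))
delta = d - pd.Timestamp('1970-01-01')  # timedeltas since the UNIX epoch

print(list(delta.dt.days))               # [1, 365]
print(delta.dt.total_seconds().iloc[0])  # 86400.0
```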
Pandas can also treat a string column as a categorical variable, which gives us access to the category labels and their integer codes:
In [100]:
df['Topic'] = df['Topic'].astype('category')
df
Out[100]:
In [101]:
print df['Topic'].cat.categories # .cat field gives us access to categories stuff
In [102]:
print df['Topic'].cat.codes
In [103]:
print df['Topic'].cat.as_ordered()
In [114]:
# make sure you convert to categorical first
df['Topic'] = df['Topic'].astype('category')
df['Topic'] = df['Topic'].cat.codes+1 # add one so NA (-1) becomes 0
df
Out[114]:
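Here's the codes trick in isolation, on a made-up topic column:

```python
import pandas as pd
import numpy as np

topic = pd.Series(['ML', 'DB', np.nan, 'ML']).astype('category')

print(list(topic.cat.categories))  # ['DB', 'ML'], sorted alphabetically
print(list(topic.cat.codes))       # [1, 0, -1, 1]; missing values get -1
print(list(topic.cat.codes + 1))   # [2, 1, 0, 2]; now NA is 0
```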
Our data has a missing description, which we can ignore, but it also has a missing size (numeric) and topic (categorical) entry. The missing categorical value can be handled with dummy_na=True on get_dummies() (see the next section). Let's convert the missing numeric data first:
In [105]:
pandas.isnull(df['Size'])
Out[105]:
In [106]:
df['Size_na'] = pandas.isnull(df['Size'])
df
Out[106]:
In [107]:
szcol = df['Size']
df['Size'] = szcol.fillna(szcol.median())
df
Out[107]:
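The fill-with-median recipe in isolation, on a made-up column:

```python
import pandas as pd
import numpy as np

size = pd.Series([10.0, np.nan, 30.0])  # made-up sizes with one missing value
size_na = pd.isnull(size)               # remember which entries were missing
filled = size.fillna(size.median())     # median of [10, 30] is 20

print(list(size_na))  # [False, True, False]
print(list(filled))   # [10.0, 20.0, 30.0]
```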
In [108]:
pandas.get_dummies(df) # convert all categorical to dummies
Out[108]:
In [109]:
pandas.get_dummies(df['Topic']) # One column's dummies
Out[109]:
In [110]:
pandas.get_dummies(df['Topic'], dummy_na=True) # Add an "na" column
Out[110]:
We can manually pack these new columns into the old data frame and delete the old column:
In [119]:
df2 = pandas.concat([df,pandas.get_dummies(df['Topic'], dummy_na=True)], axis=1)
df2.drop('Topic', axis=1, inplace=True) # Considered better than del df2['Topic'] I think
df2
Out[119]:
Or, we can do it the easy way by just specifying the columns to convert:
In [112]:
pandas.get_dummies(df, columns=['Topic'], dummy_na=True) # The easy way
Out[112]:
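A sketch of that one-liner on made-up data:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Topic': ['ML', 'DB', np.nan], 'Size': [1, 2, 3]})
d = pd.get_dummies(df, columns=['Topic'], dummy_na=True)

print(list(d.columns))  # Size plus one dummy column per topic, including nan
print(d.shape)          # (3, 4)
```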
If you ever need to walk the columns in your data frame, you can do that with a for-each loop:
In [113]:
# walk columns; col is the actual series
for name,col in df.iteritems():
    print name
In [116]:
df.describe() # useful stats about columns
Out[116]:
In this lecture, you've learned the basics of loading and manipulating data:
loading files with read_csv()
converting a data frame to a NumPy array with as_matrix()
extracting columns with table.columnname or matrix[:,columnindex]
accessing elements with table.iloc[rowindex][columnindex] or matrix[rowindex,columnindex]
finding the unique set of values with set(mylist)
And, you've learned how to visualize: