In [1]:
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', 50)
A Series is a one-dimensional object similar to an array, list, or column in a table.
pandas will assign a labeled index to each item in the Series. By default, each item receives an index label from 0 to N, where N is the length of the Series minus one.
In [2]:
# create a Series with an arbitrary list
s = pd.Series([7, 'Heisenberg', 3.14, -1789710578, 'Happy Eating!'])
s
Out[2]:
Alternatively, you can specify an index to use when creating the Series.
In [3]:
s = pd.Series([7, 'Heisenberg', 3.14, -1789710578, 'Happy Eating!'],
              index=['A', 'Z', 'C', 'Y', 'E'])
s
Out[3]:
The Series constructor can convert a dictionary as well, using the keys of the dictionary as its index.
In [4]:
d = {'Chicago': 1000, 'New York': 1300, 'Portland': 900, 'San Francisco': 1100,
     'Austin': 450, 'Boston': None}
cities = pd.Series(d)
cities
Out[4]:
In [5]:
cities['Chicago']
Out[5]:
In [6]:
cities[['Chicago', 'Portland', 'San Francisco']]
Out[6]:
Or you can use boolean indexing for selection.
In [7]:
cities[cities < 1000]
Out[7]:
That last one might be a little weird, so let's make it more clear - cities < 1000 returns a Series of True/False values, which we then pass to our Series cities, returning the corresponding True items.
In [8]:
less_than_1000 = cities < 1000
print less_than_1000
print '\n'
print cities[less_than_1000]
In [9]:
# changing based on the index
print 'Old value:', cities['Chicago']
cities['Chicago'] = 1400
print 'New value:', cities['Chicago']
In [10]:
# changing values using boolean logic
print cities[cities < 1000]
print '\n'
cities[cities < 1000] = 750
print cities[cities < 1000]
In [11]:
# divide city values by 3
cities / 3
Out[11]:
In [12]:
# square city values
np.square(cities)
Out[12]:
You can add two Series together, which returns a union of the two Series with the addition occurring on the shared index values. Labels that exist in only one of the two Series produce a NULL/NaN (not a number) in the result.
In [13]:
print cities[['Chicago', 'New York', 'Portland']]
print '\n'
print cities[['Austin', 'New York']]
print '\n'
print cities[['Chicago', 'New York', 'Portland']] + cities[['Austin', 'New York']]
Notice that because Austin, Chicago, and Portland were not found in both Series, they were returned with NULL/NaN values.
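If the NaNs are unwanted, Series has an add method whose fill_value parameter substitutes a default for labels missing from one side - a minimal sketch:

# fill_value=0 treats a label missing from one Series as 0
print cities[['Chicago', 'New York', 'Portland']].add(
    cities[['Austin', 'New York']], fill_value=0)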
In [14]:
print 'Seattle' in cities
print 'San Francisco' in cities
NULL checking can be performed with isnull and notnull.
In [15]:
# returns a boolean series indicating which values aren't NULL
cities.notnull()
Out[15]:
In [16]:
# use boolean logic to grab the NULL cities
print cities.isnull()
print '\n'
print cities[cities.isnull()]
To create a DataFrame out of common Python data structures, we can pass a dictionary of lists to the DataFrame constructor.
Using the columns parameter allows us to tell the constructor how we'd like the columns ordered. By default, the DataFrame constructor will order the columns alphabetically (though this isn't the case when reading from a file - more on that next).
In [17]:
data = {'year': [2010, 2011, 2012, 2011, 2012, 2010, 2011, 2012],
        'team': ['Bears', 'Bears', 'Bears', 'Packers', 'Packers', 'Lions', 'Lions', 'Lions'],
        'wins': [11, 8, 10, 15, 11, 6, 10, 4],
        'losses': [5, 8, 6, 1, 5, 10, 6, 12]}
football = pd.DataFrame(data, columns=['year', 'team', 'wins', 'losses'])
print football
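To see the alphabetical default mentioned above, construct the frame without the columns parameter - a quick sketch (exact ordering behavior may vary by pandas version):

# without columns=..., the dict keys are sorted alphabetically
print pd.DataFrame(data).columns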
Much more often, you'll have a dataset you want to read into a DataFrame. Let's go through several common ways of doing so.
In [18]:
%cd ~/Dropbox/tutorials/pandas/
In [19]:
# Source: baseball-reference.com/players/r/riverma01.shtml
!head -n 5 data/mariano-rivera.csv
In [20]:
from_csv = pd.read_csv('data/mariano-rivera.csv')
from_csv.head()
Out[20]:
Our file had headers, which the function inferred upon reading in the file. Had we wanted to be more explicit, we could have passed header=None to the function along with a list of column names to use:
In [21]:
# command line : read head of file
# Source: pro-football-reference.com/players/M/MannPe00/touchdowns/passing/2012/
!head -n 5 data/peyton-passing-TDs-2012.csv
In [22]:
cols = ['num', 'game', 'date', 'team', 'home_away', 'opponent',
        'result', 'quarter', 'distance', 'receiver', 'score_before',
        'score_after']
no_headers = pd.read_csv('data/peyton-passing-TDs-2012.csv', sep=',', header=None,
                         names=cols)
no_headers.head()
Out[22]:
pandas' various reader functions have many parameters that allow you to do things like skip lines of the file, parse dates, or specify how to handle NA/NULL datapoints.
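As a sketch of a few of those parameters (the file and column names here are hypothetical):

# hypothetical file - skip a junk first line, parse a date column,
# and treat empty strings and 'N/A' as NaN
df = pd.read_csv('data/some-file.csv', skiprows=1,
                 parse_dates=['date'], na_values=['', 'N/A'])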
Writing to CSV
There's also a set of writer functions for writing to a variety of formats (CSVs, HTML tables, JSON). They function exactly as you'd expect and are typically called to_format:
my_dataframe.to_csv('path_to_file.csv')
Take a look at the IO documentation to familiarize yourself with file reading/writing functionality.
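For instance, a couple of the other writers applied to our football DataFrame (the output paths are made up, and to_json requires a reasonably recent pandas):

# write an HTML table and a JSON file
with open('data/football.html', 'w') as f:
    football.to_html(f)
football.to_json('data/football.json')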
Know who hates VBA? Me. I bet you do, too. Thankfully, pandas allows you to read and write Excel files, so you can easily read from Excel, write your code in Python, and then write back out to Excel - no need for VBA.
Reading Excel files requires the xlrd library. You can install it via pip (pip install xlrd).
Let's first write a DataFrame to Excel.
In [23]:
# this is the DataFrame we created from a dictionary earlier
print football.head()
In [24]:
# since our index on the football DataFrame is meaningless, let's not write it
football.to_excel('data/football.xlsx', index=False)
In [25]:
# command line : list .xlsx files
!ls -l data/*.xlsx
In [26]:
# delete the DataFrame
del football
In [27]:
# read from Excel
football = pd.read_excel('data/football.xlsx')
print football
pandas also has some support for reading/writing DataFrames directly from/to a database [docs]. You'll typically just need to pass a connection object to the read_frame or write_frame functions within the pandas.io.sql module.
Note that write_frame executes as a series of INSERT INTO statements and thus trades speed for simplicity. If you're writing a large DataFrame to a database, it might be quicker to write the DataFrame to a CSV and load that directly using the database's own file import tools.
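For reference, a minimal sketch of the write side, assuming a local SQLite database (the database file and table name here are made up, and the if_exists argument depends on your pandas version):

from pandas.io import sql
import sqlite3

# write the football DataFrame as a table; write_frame issues one
# INSERT statement per row under the hood
conn = sqlite3.connect('data/football.db')
sql.write_frame(football, name='football', con=conn, if_exists='replace')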
In [28]:
from pandas.io import sql
import sqlite3
conn = sqlite3.connect('/Users/greda/Dropbox/gregreda.com/_code/towed')
query = "SELECT * FROM towed WHERE make = 'FORD';"
results = sql.read_frame(query, con=conn)
print results.head()
While the results of a query can be read directly into a DataFrame, I prefer to read the results directly from the clipboard. I'm often tweaking queries in my SQL client (Sequel Pro), so I would rather see the results before reading them into pandas. Once I'm confident I have the data I want, I'll read it into a DataFrame.
This works just as well with any type of delimited data you've copied to your clipboard. The function does a good job of inferring the delimiter, but you can also use the sep parameter to be explicit.
In [29]:
hank = pd.read_clipboard()
hank.head()
Out[29]:
We can also use Python's StringIO library to read data directly from a URL. StringIO allows you to treat a string as a file-like object.
Let's use the best sandwiches data that I wrote about scraping a while back.
In [30]:
from urllib2 import urlopen
from StringIO import StringIO
# store the text from the URL response in our url variable
url = urlopen('https://raw.github.com/gjreda/best-sandwiches/master/data/best-sandwiches-geocode.tsv').read()
# treat the tab-separated text as a file with StringIO and read it into a DataFrame
from_url = pd.read_table(StringIO(url), sep='\t')
from_url.head(3)
Out[30]:
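If you're on Python 3, the same pattern works with the renamed standard library modules - a sketch of the equivalent cell:

# Python 3 equivalents: urllib2 -> urllib.request, StringIO -> io.StringIO
from urllib.request import urlopen
from io import StringIO

url = urlopen('https://raw.github.com/gjreda/best-sandwiches/master/data/best-sandwiches-geocode.tsv').read().decode('utf-8')
from_url = pd.read_table(StringIO(url), sep='\t')
from_url.head(3)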
Use the pandas.merge() function to merge/join datasets in a relational manner (see the docs). Like SQL's JOIN clause, pandas.merge() allows two DataFrames to be joined on one or more keys.
pandas.merge() takes a handful of important parameters:

- on, left_on, right_on, left_index, right_index: specify the columns or indexes on which to join.
- how: specify which keys are to be included in the resulting table, one of {"inner", "left", "right", "outer"}:

- "left" : use keys from left frame only
- "right" : use keys from right frame only
- "inner" (default) : use intersection of keys from both frames
- "outer" : use union of keys from both frames
There are several cases to consider, and they're very important to understand - they correspond to the different joins in SQL.
In [31]:
left = pd.DataFrame({'key': range(5),
                     'left_value': ['L0', 'L1', 'L2', 'L3', 'L4']})
right = pd.DataFrame({'key': range(2, 7),
                      'right_value': ['R0', 'R1', 'R2', 'R3', 'R4']})
print left, '\n'
print right
In [32]:
print pd.merge(left, right, on='key', how='inner')
If the key fields are named differently in each frame, use the left_on and right_on parameters to specify which fields to join from each frame: pd.merge(left, right, left_on='left_key', right_on='right_key'). To join on a frame's index rather than a column, set the left_index or right_index parameter to True. You can mix and match columns and indexes like so: pd.merge(left, right, left_on='key', right_index=True)
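A quick sketch of the differently-named-keys case, using a renamed copy of our right frame (right_key is just a name made up for this example):

# rename right's key column so the frames no longer share a column name
renamed_right = right.rename(columns={'key': 'right_key'})
print pd.merge(left, renamed_right, left_on='key', right_on='right_key')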
In [33]:
print pd.merge(left, right, on='key', how='left')
In [34]:
print pd.merge(left, right, on='key', how='right')
In [35]:
print pd.merge(left, right, on='key', how='outer')
Use the pandas.concat() function to combine Series/DataFrames into one unified object (see the docs). pandas.concat() takes a list of Series or DataFrames and returns a Series or DataFrame of the concatenated objects. Note that because the function takes a list, you can combine many objects at once.
Use the axis parameter to define along which axis to concatenate:

- axis=0 : concatenate vertically (default)
- axis=1 : concatenate side-by-side
In [36]:
pd.concat([left, right], axis=1)
Out[36]:
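For comparison, the default axis=0 stacks the frames vertically; columns missing from one frame are padded with NaN (ignore_index simply renumbers the rows):

# stack vertically; left_value/right_value are NaN where a frame lacks them
pd.concat([left, right], axis=0, ignore_index=True)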