A Quick Introduction to Pandas

Author: Paul M. Magwene


In [1]:
import numpy as np
import pandas as pd

Creating a data frame from scratch

First let's see how to create a DataFrame from scratch. In this first example, we'll use random numbers generated using functions defined in the np.random module.


In [2]:
np.random.seed(482010)  # seed the pseudo-random number generators
x = np.random.random(15)
y = np.random.binomial(10, x)

df = pd.DataFrame()
df['prob'] = x
df['count'] = y
df.head()


Out[2]:
prob count
0 0.471435 7
1 0.085978 0
2 0.314651 3
3 0.205287 3
4 0.861822 10

In [3]:
# alternate way to create same df by passing a dictionary
df2 = pd.DataFrame({'prob':x, 'count':y})
df2.head()


Out[3]:
count prob
0 7 0.471435
1 0 0.085978
2 3 0.314651
3 3 0.205287
4 10 0.861822

In [4]:
# column and row names
print("Column names: ", df.columns)
print("Row names (index): ", df.index)


Column names:  Index(['prob', 'count'], dtype='object')
Row names (index):  Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14], dtype='int64')

In [5]:
# add a new column to our existing df
# randomly give each observation a categorization
df['category'] = np.random.choice(['A','B','C'], size=15)
df


Out[5]:
prob count category
0 0.471435 7 C
1 0.085978 0 C
2 0.314651 3 A
3 0.205287 3 B
4 0.861822 10 C
5 0.505977 8 C
6 0.928496 10 B
7 0.079889 0 A
8 0.108795 1 A
9 0.372346 5 C
10 0.542092 5 B
11 0.311425 4 C
12 0.875124 8 B
13 0.738843 7 A
14 0.722465 7 A

In [6]:
# create a new column representing locations
# using None to represent missing valuess
df['location'] = np.random.choice(["Raleigh","Durham","Chapel Hill", None], size=15)
df


Out[6]:
prob count category location
0 0.471435 7 C None
1 0.085978 0 C None
2 0.314651 3 A Durham
3 0.205287 3 B None
4 0.861822 10 C Chapel Hill
5 0.505977 8 C None
6 0.928496 10 B Raleigh
7 0.079889 0 A Durham
8 0.108795 1 A None
9 0.372346 5 C Durham
10 0.542092 5 B None
11 0.311425 4 C Chapel Hill
12 0.875124 8 B Raleigh
13 0.738843 7 A Durham
14 0.722465 7 A None

In [7]:
# find out for which observations location is null
df["location"].isnull()  # notice None objects are consider null


Out[7]:
0      True
1      True
2     False
3      True
4     False
5      True
6     False
7     False
8      True
9     False
10     True
11    False
12    False
13    False
14     True
Name: location, dtype: bool

Basic data frame manipulation and arithmetic


In [8]:
# transpose flips rows and columns
df.T


Out[8]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
prob 0.471435 0.0859779 0.314651 0.205287 0.861822 0.505977 0.928496 0.079889 0.108795 0.372346 0.542092 0.311425 0.875124 0.738843 0.722465
count 7 0 3 3 10 8 10 0 1 5 5 4 8 7 7
category C C A B C C B A A C B C B A A
location None None Durham None Chapel Hill None Raleigh Durham None Durham None Chapel Hill Raleigh Durham None

In [9]:
# numerical operations can be applied to numerical columns of data frames
# just as if they were numpy arrays
df['count'] * 2


Out[9]:
0     14
1      0
2      6
3      6
4     20
5     16
6     20
7      0
8      2
9     10
10    10
11     8
12    16
13    14
14    14
Name: count, dtype: int64

In [10]:
df['count'] * df['prob']


Out[10]:
0     3.300044
1     0.000000
2     0.943953
3     0.615861
4     8.618224
5     4.047814
6     9.284956
7     0.000000
8     0.108795
9     1.861729
10    2.710458
11    1.245701
12    7.000989
13    5.171902
14    5.057254
dtype: float64

Indexing DataFrames

Pandas DataFrames support three different indexing mechanisms.

  1. .iloc -- indexes rows and columns by integers. Indexes go from 0 to $n-1$ for rows, and 0 to $p-1$ for columns, where $n$ and $p$ are the number of rows and columns respectively. Slicing is like normal python slices.

  2. .loc -- indexes rows and columns by labels. If you use slices with .loc, note that contrary to standard Python slices, both the start and the stop are included! Note that if the row names (df.index) are integers, than slice

  3. .ix -- allows you to mix integers and labels. However, it defaults to label based indexing like .loc, and only adopts integer based indexing if row or column labels don't have integers in them. For the most part, .ix works intuitively, but if you're not careful it can lead to seom unexpected behavior.

For a fuller explanation of the differences between these indexing mechanisms see:


In [11]:
df["prob"]  # labels by themselves index column names


Out[11]:
0     0.471435
1     0.085978
2     0.314651
3     0.205287
4     0.861822
5     0.505977
6     0.928496
7     0.079889
8     0.108795
9     0.372346
10    0.542092
11    0.311425
12    0.875124
13    0.738843
14    0.722465
Name: prob, dtype: float64

In [12]:
df.prob # if your column name is a valid python variable name you can acces it like an attribute


Out[12]:
0     0.471435
1     0.085978
2     0.314651
3     0.205287
4     0.861822
5     0.505977
6     0.928496
7     0.079889
8     0.108795
9     0.372346
10    0.542092
11    0.311425
12    0.875124
13    0.738843
14    0.722465
Name: prob, dtype: float64

In [13]:
# get specific columns using list of column names
df[["count","location"]].head()  # lookup the head method in the docs


Out[13]:
count location
0 7 None
1 0 None
2 3 Durham
3 3 None
4 10 Chapel Hill

In [14]:
# get specific columns using list of column names
df[["prob","category"]].tail()  # lookup the tail method in the docs


Out[14]:
prob category
10 0.542092 B
11 0.311425 C
12 0.875124 B
13 0.738843 A
14 0.722465 A

In [15]:
# get first three rows of df using slices
df[:3]


Out[15]:
prob count category location
0 0.471435 7 C None
1 0.085978 0 C None
2 0.314651 3 A Durham

In [16]:
# last three rows of df using slices
df[-3:]


Out[16]:
prob count category location
12 0.875124 8 B Raleigh
13 0.738843 7 A Durham
14 0.722465 7 A None

In [17]:
df.iloc[:3,:]  # integer indexing and slicing using iloc


Out[17]:
prob count category location
0 0.471435 7 C None
1 0.085978 0 C None
2 0.314651 3 A Durham

In [18]:
df.iloc[:4,(1,3)]  # integer indexing and slicing


Out[18]:
count location
0 7 None
1 0 None
2 3 Durham
3 3 None

In [19]:
df.loc[:4,("count","location")]  # label indexing and slicing, contrast to above


Out[19]:
count location
0 7 None
1 0 None
2 3 Durham
3 3 None
4 10 Chapel Hill

In [20]:
df.ix[:4,:3]  # indexes rows like .loc (because row names are integers)
              # indexes columns like .iloc (because no integers in column names)


Out[20]:
prob count category
0 0.471435 7 C
1 0.085978 0 C
2 0.314651 3 A
3 0.205287 3 B
4 0.861822 10 C

Building up data summaries across categorical variables


In [21]:
# 'category' is a categorical variable. Count the number in each category
df['category'].value_counts()


Out[21]:
C    6
A    5
B    4
Name: category, dtype: int64

In [22]:
df.category.value_counts() # same as above


Out[22]:
C    6
A    5
B    4
Name: category, dtype: int64

In [23]:
# similar counts for location, note that None is treated as missing values and not counted
df['location'].value_counts()


Out[23]:
Durham         4
Raleigh        2
Chapel Hill    2
Name: location, dtype: int64

In [24]:
# overall mean of the "prob" variable
df.prob.mean()


Out[24]:
0.4749748803845538

In [25]:
# mean of prob grouped by location
df.groupby("location").prob.mean()


Out[25]:
location
Chapel Hill    0.586624
Durham         0.376432
Raleigh        0.901810
Name: prob, dtype: float64

In [26]:
# mean of all the continuous variables, group by category
df.groupby("category").mean()


Out[26]:
prob count
category
A 0.392928 3.600000
B 0.637749 6.500000
C 0.434831 5.666667

In [27]:
# groupby multiple columns
df.groupby(["location","category"]).mean()


Out[27]:
prob count
location category
Chapel Hill C 0.586624 7.000000
Durham A 0.377794 3.333333
C 0.372346 5.000000
Raleigh B 0.901810 9.000000

In [ ]: