A Quick Introduction to Pandas

Author: Paul M. Magwene



In [1]:

    
import numpy as np
import pandas as pd

Creating a data frame from scratch

First let's see how to create a DataFrame from scratch. In this first example, we'll use random numbers generated using functions defined in the np.random module.



In [2]:

    
np.random.seed(482010)  # seed the pseudo-random number generators
x = np.random.random(15)
y = np.random.binomial(10, x)

df = pd.DataFrame()
df['prob'] = x
df['count'] = y
df.head()



In [3]:

    
# alternate way to create same df by passing a dictionary
df2 = pd.DataFrame({'prob':x, 'count':y})
df2.head()



In [4]:

    
# column and row names
print("Column names: ", df.columns)
print("Row names (index): ", df.index)









    



Column names:  Index(['prob', 'count'], dtype='object')
Row names (index):  Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14], dtype='int64')



In [5]:

    
# add a new column to our existing df
# randomly give each observation a categorization
df['category'] = np.random.choice(['A','B','C'], size=15)
df



In [6]:

    
# create a new column representing locations
# using None to represent missing valuess
df['location'] = np.random.choice(["Raleigh","Durham","Chapel Hill", None], size=15)
df









    Out[6]:






  
    
      
      prob
      count
      category
      location
    
  
  
    
      0
      0.471435
      7
      C
      None
    
    
      1
      0.085978
      0
      C
      None
    
    
      2
      0.314651
      3
      A
      Durham
    
    
      3
      0.205287
      3
      B
      None
    
    
      4
      0.861822
      10
      C
      Chapel Hill
    
    
      5
      0.505977
      8
      C
      None
    
    
      6
      0.928496
      10
      B
      Raleigh
    
    
      7
      0.079889
      0
      A
      Durham
    
    
      8
      0.108795
      1
      A
      None
    
    
      9
      0.372346
      5
      C
      Durham
    
    
      10
      0.542092
      5
      B
      None
    
    
      11
      0.311425
      4
      C
      Chapel Hill
    
    
      12
      0.875124
      8
      B
      Raleigh
    
    
      13
      0.738843
      7
      A
      Durham
    
    
      14
      0.722465
      7
      A
      None



In [7]:

    
# find out for which observations location is null
df["location"].isnull()  # notice None objects are consider null









    Out[7]:





0      True
1      True
2     False
3      True
4     False
5      True
6     False
7     False
8      True
9     False
10     True
11    False
12    False
13    False
14     True
Name: location, dtype: bool

Basic data frame manipulation and arithmetic



In [8]:

    
# transpose flips rows and columns
df.T









    Out[8]:






  
    
      
      0
      1
      2
      3
      4
      5
      6
      7
      8
      9
      10
      11
      12
      13
      14
    
  
  
    
      prob
      0.471435
      0.0859779
      0.314651
      0.205287
      0.861822
      0.505977
      0.928496
      0.079889
      0.108795
      0.372346
      0.542092
      0.311425
      0.875124
      0.738843
      0.722465
    
    
      count
      7
      0
      3
      3
      10
      8
      10
      0
      1
      5
      5
      4
      8
      7
      7
    
    
      category
      C
      C
      A
      B
      C
      C
      B
      A
      A
      C
      B
      C
      B
      A
      A
    
    
      location
      None
      None
      Durham
      None
      Chapel Hill
      None
      Raleigh
      Durham
      None
      Durham
      None
      Chapel Hill
      Raleigh
      Durham
      None



In [9]:

    
# numerical operations can be applied to numerical columns of data frames
# just as if they were numpy arrays
df['count'] * 2









    Out[9]:





0     14
1      0
2      6
3      6
4     20
5     16
6     20
7      0
8      2
9     10
10    10
11     8
12    16
13    14
14    14
Name: count, dtype: int64



In [10]:

    
df['count'] * df['prob']









    Out[10]:





0     3.300044
1     0.000000
2     0.943953
3     0.615861
4     8.618224
5     4.047814
6     9.284956
7     0.000000
8     0.108795
9     1.861729
10    2.710458
11    1.245701
12    7.000989
13    5.171902
14    5.057254
dtype: float64

Indexing DataFrames

Pandas DataFrames support three different indexing mechanisms.

.iloc -- indexes rows and columns by integers. Indexes go from 0 to $n-1$ for rows, and 0 to $p-1$ for columns, where $n$ and $p$ are the number of rows and columns respectively. Slicing is like normal python slices.
.loc -- indexes rows and columns by labels. If you use slices with .loc, note that contrary to standard Python slices, both the start and the stop are included! Note that if the row names (df.index) are integers, than slice
.ix -- allows you to mix integers and labels. However, it defaults to label based indexing like .loc, and only adopts integer based indexing if row or column labels don't have integers in them. For the most part, .ix works intuitively, but if you're not careful it can lead to seom unexpected behavior.

For a fuller explanation of the differences between these indexing mechanisms see:



In [11]:

    
df["prob"]  # labels by themselves index column names









    Out[11]:





0     0.471435
1     0.085978
2     0.314651
3     0.205287
4     0.861822
5     0.505977
6     0.928496
7     0.079889
8     0.108795
9     0.372346
10    0.542092
11    0.311425
12    0.875124
13    0.738843
14    0.722465
Name: prob, dtype: float64



In [12]:

    
df.prob # if your column name is a valid python variable name you can acces it like an attribute









    Out[12]:





0     0.471435
1     0.085978
2     0.314651
3     0.205287
4     0.861822
5     0.505977
6     0.928496
7     0.079889
8     0.108795
9     0.372346
10    0.542092
11    0.311425
12    0.875124
13    0.738843
14    0.722465
Name: prob, dtype: float64



In [13]:

    
# get specific columns using list of column names
df[["count","location"]].head()  # lookup the head method in the docs









    Out[13]:






  
    
      
      count
      location
    
  
  
    
      0
      7
      None
    
    
      1
      0
      None
    
    
      2
      3
      Durham
    
    
      3
      3
      None
    
    
      4
      10
      Chapel Hill



In [14]:

    
# get specific columns using list of column names
df[["prob","category"]].tail()  # lookup the tail method in the docs



In [15]:

    
# get first three rows of df using slices
df[:3]



In [16]:

    
# last three rows of df using slices
df[-3:]



In [17]:

    
df.iloc[:3,:]  # integer indexing and slicing using iloc



In [18]:

    
df.iloc[:4,(1,3)]  # integer indexing and slicing



In [19]:

    
df.loc[:4,("count","location")]  # label indexing and slicing, contrast to above









    Out[19]:






  
    
      
      count
      location
    
  
  
    
      0
      7
      None
    
    
      1
      0
      None
    
    
      2
      3
      Durham
    
    
      3
      3
      None
    
    
      4
      10
      Chapel Hill



In [20]:

    
df.ix[:4,:3]  # indexes rows like .loc (because row names are integers)
              # indexes columns like .iloc (because no integers in column names)

Building up data summaries across categorical variables



In [21]:

    
# 'category' is a categorical variable. Count the number in each category
df['category'].value_counts()









    Out[21]:





C    6
A    5
B    4
Name: category, dtype: int64



In [22]:

    
df.category.value_counts() # same as above









    Out[22]:





C    6
A    5
B    4
Name: category, dtype: int64



In [23]:

    
# similar counts for location, note that None is treated as missing values and not counted
df['location'].value_counts()









    Out[23]:





Durham         4
Raleigh        2
Chapel Hill    2
Name: location, dtype: int64



In [24]:

    
# overall mean of the "prob" variable
df.prob.mean()









    Out[24]:





0.4749748803845538



In [25]:

    
# mean of prob grouped by location
df.groupby("location").prob.mean()









    Out[25]:





location
Chapel Hill    0.586624
Durham         0.376432
Raleigh        0.901810
Name: prob, dtype: float64



In [26]:

    
# mean of all the continuous variables, group by category
df.groupby("category").mean()



In [27]:

    
# groupby multiple columns
df.groupby(["location","category"]).mean()









    Out[27]:






  
    
      
      
      prob
      count
    
    
      location
      category
      
      
    
  
  
    
      Chapel Hill
      C
      0.586624
      7.000000
    
    
      Durham
      A
      0.377794
      3.333333
    
    
      C
      0.372346
      5.000000
    
    
      Raleigh
      B
      0.901810
      9.000000



In [ ]:

	prob	count
category
A	0.392928	3.600000
B	0.637749	6.500000
C	0.434831	5.666667

	prob	count	category
0	0.471435	7	C
1	0.085978	0	C
2	0.314651	3	A
3	0.205287	3	B
4	0.861822	10	C
5	0.505977	8	C
6	0.928496	10	B
7	0.079889	0	A
8	0.108795	1	A
9	0.372346	5	C
10	0.542092	5	B
11	0.311425	4	C
12	0.875124	8	B
13	0.738843	7	A
14	0.722465	7	A

	prob	count	category	location
0	0.471435	7	C	None
1	0.085978	0	C	None
2	0.314651	3	A	Durham
3	0.205287	3	B	None
4	0.861822	10	C	Chapel Hill
5	0.505977	8	C	None
6	0.928496	10	B	Raleigh
7	0.079889	0	A	Durham
8	0.108795	1	A	None
9	0.372346	5	C	Durham
10	0.542092	5	B	None
11	0.311425	4	C	Chapel Hill
12	0.875124	8	B	Raleigh
13	0.738843	7	A	Durham
14	0.722465	7	A	None

	0	1	2	3	4	5	6	7	8	9	10	11	12	13	14
prob	0.471435	0.0859779	0.314651	0.205287	0.861822	0.505977	0.928496	0.079889	0.108795	0.372346	0.542092	0.311425	0.875124	0.738843	0.722465
count	7	0	3	3	10	8	10	0	1	5	5	4	8	7	7
category	C	C	A	B	C	C	B	A	A	C	B	C	B	A	A
location	None	None	Durham	None	Chapel Hill	None	Raleigh	Durham	None	Durham	None	Chapel Hill	Raleigh	Durham	None