Basic Pandas DataFrame operations

A few first steps with Pandas.

We'll create a small dataframe, access its elements, and enlarge it by adding new columns and rows.


In [ ]:
from __future__ import print_function 

# Import Pandas & NumPy
import pandas as pd
import numpy as np

In [ ]:
# Create a tiny dataset, as a list of tuples
name = ('Oslo','Copenhagen','Helsinki','Stockholm','Reykjavik')
pop = ( 647676, 583348, 626305, 917297, 121822 )
area = ( 480.76, 86.20, 715.49, 188.0,  273 )
data = [ (1000+i,n,p,s) for i, (n,p,s) in enumerate(zip(name,pop,area)) ]

1 Creating a DataFrame


In [ ]:
# Create the dataframe from the list of tuples. We need to add the names of the columns, plus
# the column(s) we want to be used as row index
df = pd.DataFrame.from_records( data=data, columns=('id','name','population','area'), index=['id'] )

Let's view the dataframe. We can print it:


In [ ]:
print(df)

In [ ]:
# See the options we've got for data formatting
pd.describe_option('display')
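
describe_option() only lists the settings; to actually change one we can use set_option(). A minimal sketch (the row limit chosen below is arbitrary, just for illustration):


In [ ]:
# Limit the number of rows shown when displaying a DataFrame
pd.set_option('display.max_rows', 10)
# The current value can be read back with get_option()
pd.get_option('display.max_rows')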

Or we can just show it, and it will be nicely formatted. Note the double header: the second header row is for the column(s) forming the DataFrame index.


In [ ]:
df

In [ ]:
# Check dataframe dimensions
print(df.shape)
# Check dataframe components
print(df.index)
print(df.columns)
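
We might also want to inspect the column data types; a couple of extra, optional checks:


In [ ]:
# Column data types
print(df.dtypes)
# Compact summary: index, columns, dtypes and memory usage
df.info()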

2 Fetching columns


In [ ]:
df['name']

In [ ]:
# Or, equivalently, using attribute access
df.name

We can also get more than one column. This kind of selection creates and returns a new DataFrame (whereas selecting a single column, as above, returns a Series).


In [ ]:
df[ ['name','population'] ]

Same selection, this time using the loc locator (see next section). Note that selecting with a list of columns also returns a new DataFrame, not a view of the original one.


In [ ]:
df.loc[:,['name','population']]
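
Column selection can also be done with the filter() method, either by an explicit list of names or by a pattern on the column names; a small optional sketch:


In [ ]:
# Pick columns by an explicit list of names (regex= and like= are also accepted)
df.filter(items=['name','population'])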

3 Accessing DataFrame contents

There are several ways of accessing the elements contained in a DataFrame.

3.1 By label

We can access rows and columns by their labels, i.e. the row and/or column index, using the loc locator.


In [ ]:
# One row, using the index. Note that in this case our row index is the 'id' column
df.loc[1000]

In [ ]:
# Two rows (note that label slices include both endpoints)
df.loc[1002:1003]

In [ ]:
# Two rows, but only selected columns
df.loc[1002:1003,'name':'population']

3.2 By position

And we can also select rows/columns by their position using the iloc locator.


In [ ]:
# Get the first row
df.iloc[0]

In [ ]:
# Get the last row
df.iloc[-1]
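
iloc also accepts a column position, and for single scalar values the at / iat accessors can be used instead; a short illustrative sketch:


In [ ]:
# Rows and columns by position: first two rows, first two columns
df.iloc[0:2, 0:2]

In [ ]:
# Single scalar values: at is label-based, iat is position-based
print( df.at[1000, 'name'] )
print( df.iat[0, 0] )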

3.3 By boolean selection

Another possibility is to use a logical expression to create a boolean mask and index with it, selecting the rows that satisfy the expression.


In [ ]:
df[df.area<200]

In [ ]:
df[ (df.area<200) & (df.population>600000) ]

In [ ]:
# This variant returns a dataframe with the same shape as the original, but keeps values
# only in the rows that satisfy the condition (the rest become NaN)
df.where( df.area<200 )
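
The same kind of boolean selection can also be written as an expression string with the query() method (an equivalent alternative, not used in the rest of this notebook):


In [ ]:
# query() evaluates a boolean expression over the columns
df.query('area < 200 and population > 600000')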

3.4 Random sample


In [ ]:
df.sample(n=3)
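
If we want the sample to be reproducible across runs, we can fix the random seed (the seed value below is arbitrary):


In [ ]:
# Passing random_state makes the sample reproducible
df.sample(n=3, random_state=42)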

4 Augmenting a DataFrame

Let's take the existing dataframe and enlarge it by adding new rows/columns.

4.1 Adding a column


In [ ]:
# We create a new column by combining data from other columns
df.loc[:,'density'] = df.loc[:,'population']/df.loc[:,'area']

In [ ]:
df.head()

Another way of doing it is to use the assign() method. It returns a new DataFrame with the additions, leaving the original untouched.


In [ ]:
df2 = df.assign( density2 = lambda x : x.population/x.area )
df2.head()
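
Yet another option is the insert() method, which adds a column in place at a chosen position. A small sketch on a throw-away copy (the names df3 and density3 are just for illustration):


In [ ]:
# insert() places the new column at a given position (0 = first column)
df3 = df.copy()
df3.insert(0, 'density3', df3.population/df3.area)
df3.head()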

4.2 Adding rows


In [ ]:
# Find the next id to insert (we avoid the name 'next', which would shadow the built-in)
next_id = df.tail(1).index.values[0] + 1

In [ ]:
# Define new rows. This time, for a change, we'll be using a dict of lists as input data
name = ('Tallinn', 'Riga', 'Vilnius')
pop = ( 439286, 641007, 542664 )
size = ( 159.2, 304, 401 )
data2 = { 'id' : range(next_id, next_id+len(name)),
          'name' : name, 
          'population' : pop, 
          'area' : size  }
#data2 = [ {'id':next_id+i, 'name':n, 'population': p, 'area': s }
#          for i, (n,p,s) in enumerate(zip(name,pop,size)) ]

In [ ]:
# Create a dataframe from the dict of lists
df2 = pd.DataFrame( data2 )
# Set the column(s) to be used as the row index in this new dataframe
df2.set_index( 'id', inplace=True )
# Alternatively: df2 = pd.DataFrame.from_dict( data2 ).set_index( 'id' )

In [ ]:
df2

In [ ]:
# Now append this set of rows to the original dataframe
# (DataFrame.append was removed in pandas 2.0, so we use pd.concat instead)
df = pd.concat( [df, df2], sort=False )
df

In [ ]:
# Find the rows having a missing density value. Obviously they will be the ones we just added
missing = df[ np.isnan(df.density) ].index

df.loc[missing]
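
The same selection can also be written with pandas' own missing-value test, isna(), which is equivalent here:


In [ ]:
# Equivalent selection using the isna() method instead of np.isnan
df[ df.density.isna() ]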

Now let's add the missing densities. First naive attempt:


In [ ]:
df.loc[missing].density = df.loc[missing].population/df.loc[missing].area

In [ ]:
df.loc[missing]

It didn't work. Why? Because we are selecting in two steps:

  • first we get the rows: df.loc[missing]
  • and then we get the column of those rows: df.loc[missing].density

This is chained indexing, and it fails when used for assignment: the second step operates on a temporary copy, so the original DataFrame is never modified.

So let's try again, using single-step indexing:


In [ ]:
df.loc[missing,'density'] = df.loc[missing,'population']/df.loc[missing,'area']

This time it works:


In [ ]:
df.loc[missing].density
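
As a closing note, pandas can be told to raise an error instead of just warning when it detects a chained assignment, which helps catch this kind of bug early (an optional setting; in recent pandas versions with copy-on-write enabled it is no longer needed):


In [ ]:
# Turn chained-assignment detection into a hard error
pd.set_option('mode.chained_assignment', 'raise')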