Basic Pandas DataFrame operations

A few first steps with Pandas.

We'll create a small dataframe, access its elements, and enlarge it by adding new columns and rows.


In [1]:
# Import Pandas & NumPy
import pandas as pd
import numpy as np

In [2]:
# Create a tiny dataset, as a list of tuples
name = ('Oslo','Copenhaguen','Helsinki','Stockholm','Reykjavik')
pop = ( 647676, 583348, 626305, 917297, 121822 )
area = ( 480.76, 86.20, 715.49, 188.0,  273 )
data = [ (1000+i,n,p,s) for i, (n,p,s) in enumerate(zip(name,pop,area)) ]

1 Creating a DataFrame


In [3]:
# Create the dataframe from the list of tuples. We need to supply the column names,
# plus the column(s) to be used as the row index
df = pd.DataFrame.from_records( data=data, columns=('id','name','population','area'), index=['id'] )

Let's view the dataframe. We can print it:


In [4]:
print(df)


             name  population    area
id                                   
1000         Oslo      647676  480.76
1001  Copenhaguen      583348   86.20
1002     Helsinki      626305  715.49
1003    Stockholm      917297  188.00
1004    Reykjavik      121822  273.00

In [5]:
# See the options we've got for data formatting
pd.describe_option('display')


display.chop_threshold : float or None
    if set to a float value, all float values smaller then the given threshold
    will be displayed as exactly 0 by repr and friends.
    [default: None] [currently: None]

display.colheader_justify : 'left'/'right'
    Controls the justification of column headers. used by DataFrameFormatter.
    [default: right] [currently: right]

display.column_space No description available.
    [default: 12] [currently: 12]

display.date_dayfirst : boolean
    When True, prints and parses dates with the day first, eg 20/01/2005
    [default: False] [currently: False]

display.date_yearfirst : boolean
    When True, prints and parses dates with the year first, eg 2005/01/20
    [default: False] [currently: False]

display.encoding : str/unicode
    Defaults to the detected encoding of the console.
    Specifies the encoding to be used for strings returned by to_string,
    these are generally strings meant to be displayed on the console.
    [default: UTF-8] [currently: UTF-8]

display.expand_frame_repr : boolean
    Whether to print out the full DataFrame repr for wide DataFrames across
    multiple lines, `max_columns` is still respected, but the output will
    wrap-around across multiple "pages" if its width exceeds `display.width`.
    [default: True] [currently: True]

display.float_format : callable
    The callable should accept a floating point number and return
    a string with the desired format of the number. This is used
    in some places like SeriesFormatter.
    See core.format.EngFormatter for an example.
    [default: None] [currently: None]

display.height : int
    Deprecated.
    [default: 60] [currently: 60]
    (Deprecated, use `display.max_rows` instead.)

display.large_repr : 'truncate'/'info'
    For DataFrames exceeding max_rows/max_cols, the repr (and HTML repr) can
    show a truncated table (the default from 0.13), or switch to the view from
    df.info() (the behaviour in earlier versions of pandas).
    [default: truncate] [currently: truncate]

display.line_width : int
    Deprecated.
    [default: 80] [currently: 80]
    (Deprecated, use `display.width` instead.)

display.max_categories : int
    This sets the maximum number of categories pandas should output when printing
    out a `Categorical` or a Series of dtype "category".
    [default: 8] [currently: 8]

display.max_columns : int
    If max_cols is exceeded, switch to truncate view. Depending on
    `large_repr`, objects are either centrally truncated or printed as
    a summary view. 'None' value means unlimited.

    In case python/IPython is running in a terminal and `large_repr`
    equals 'truncate' this can be set to 0 and pandas will auto-detect
    the width of the terminal and print a truncated object which fits
    the screen width. The IPython notebook, IPython qtconsole, or IDLE
    do not run in a terminal and hence it is not possible to do
    correct auto-detection.
    [default: 20] [currently: 20]

display.max_colwidth : int
    The maximum width in characters of a column in the repr of
    a pandas data structure. When the column overflows, a "..."
    placeholder is embedded in the output.
    [default: 50] [currently: 50]

display.max_info_columns : int
    max_info_columns is used in DataFrame.info method to decide if
    per column information will be printed.
    [default: 100] [currently: 100]

display.max_info_rows : int or None
    df.info() will usually show null-counts for each column.
    For large frames this can be quite slow. max_info_rows and max_info_cols
    limit this null check only to frames with smaller dimensions then specified.
    [default: 1690785] [currently: 1690785]

display.max_rows : int
    If max_rows is exceeded, switch to truncate view. Depending on
    `large_repr`, objects are either centrally truncated or printed as
    a summary view. 'None' value means unlimited.

    In case python/IPython is running in a terminal and `large_repr`
    equals 'truncate' this can be set to 0 and pandas will auto-detect
    the height of the terminal and print a truncated object which fits
    the screen height. The IPython notebook, IPython qtconsole, or
    IDLE do not run in a terminal and hence it is not possible to do
    correct auto-detection.
    [default: 60] [currently: 60]

display.max_seq_items : int or None
    when pretty-printing a long sequence, no more then `max_seq_items`
    will be printed. If items are omitted, they will be denoted by the
    addition of "..." to the resulting string.

    If set to None, the number of items to be printed is unlimited.
    [default: 100] [currently: 100]

display.memory_usage : bool, string or None
    This specifies if the memory usage of a DataFrame should be displayed when
    df.info() is called. Valid values True,False,'deep'
    [default: True] [currently: True]

display.mpl_style : bool
    Setting this to 'default' will modify the rcParams used by matplotlib
    to give plots a more pleasing visual style by default.
    Setting this to None/False restores the values to their initial value.
    [default: None] [currently: None]

display.multi_sparse : boolean
    "sparsify" MultiIndex display (don't display repeated
    elements in outer levels within groups)
    [default: True] [currently: True]

display.notebook_repr_html : boolean
    When True, IPython notebook will use html representation for
    pandas objects (if it is available).
    [default: True] [currently: True]

display.pprint_nest_depth : int
    Controls the number of nested levels to process when pretty-printing
    [default: 3] [currently: 3]

display.precision : int
    Floating point output precision (number of significant digits). This is
    only a suggestion
    [default: 6] [currently: 6]

display.show_dimensions : boolean or 'truncate'
    Whether to print out dimensions at the end of DataFrame repr.
    If 'truncate' is specified, only print out the dimensions if the
    frame is truncated (e.g. not display all rows and/or columns)
    [default: truncate] [currently: truncate]

display.unicode.ambiguous_as_wide : boolean
    Whether to use the Unicode East Asian Width to calculate the display text width
    Enabling this may affect to the performance (default: False)
    [default: False] [currently: False]

display.unicode.east_asian_width : boolean
    Whether to use the Unicode East Asian Width to calculate the display text width
    Enabling this may affect to the performance (default: False)
    [default: False] [currently: False]

display.width : int
    Width of the display in characters. In case python/IPython is running in
    a terminal this can be set to None and pandas will correctly auto-detect
    the width.
    Note that the IPython notebook, IPython qtconsole, or IDLE do not run in a
    terminal and hence it is not possible to correctly detect the width.
    [default: 80] [currently: 80]

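Any of these options can be changed at runtime with pd.set_option() and read back with pd.get_option(); the option names come from the listing above. A small sketch:

    # Show more rows before truncating, and print floats with two decimals
    pd.set_option('display.max_rows', 100)
    pd.set_option('display.float_format', lambda v: '{:.2f}'.format(v))
    print(pd.get_option('display.max_rows'))
    # Restore the defaults for the options we touched
    pd.reset_option('display.max_rows')
    pd.reset_option('display.float_format')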

Instead of printing, we can also just display the dataframe, and it will be nicely formatted. Note the double header: the second header row is for the column(s) forming the DataFrame index.


In [6]:
df


Out[6]:
             name  population    area
id
1000         Oslo      647676  480.76
1001  Copenhaguen      583348   86.20
1002     Helsinki      626305  715.49
1003    Stockholm      917297  188.00
1004    Reykjavik      121822  273.00

In [7]:
# Check dataframe dimensions
print(df.shape)
# Check dataframe components
print(df.index)
print(df.columns)


(5, 3)
Int64Index([1000, 1001, 1002, 1003, 1004], dtype='int64', name=u'id')
Index([u'name', u'population', u'area'], dtype='object')

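Beyond the shape and the indices, the per-column data types and a compact summary can also be inspected; a quick sketch:

    # Per-column data types
    print(df.dtypes)
    # Compact summary: index, columns, non-null counts and memory usage
    df.info()
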
2 Fetching columns


In [8]:
df['name']


Out[8]:
id
1000           Oslo
1001    Copenhaguen
1002       Helsinki
1003      Stockholm
1004      Reykjavik
Name: name, dtype: object

In [9]:
# Or also
df.name


Out[9]:
id
1000           Oslo
1001    Copenhaguen
1002       Helsinki
1003      Stockholm
1004      Reykjavik
Name: name, dtype: object

We can also get more than one column, by passing a list of column names. These operations create and return a new DataFrame.


In [10]:
df[ ['name','population'] ]


Out[10]:
             name  population
id
1000         Oslo      647676
1001  Copenhaguen      583348
1002     Helsinki      626305
1003    Stockholm      917297
1004    Reykjavik      121822

Same selection, but this time using the loc locator (see next section). Unlike the plain bracket selection above, loc indexing can also be used on the left-hand side of an assignment to modify the original DataFrame.


In [11]:
df.loc[:,['name','population']]


Out[11]:
             name  population
id
1000         Oslo      647676
1001  Copenhaguen      583348
1002     Helsinki      626305
1003    Stockholm      917297
1004    Reykjavik      121822

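Note that selecting with a single label returns a Series, while selecting with a list of labels (even a one-element list) returns a DataFrame; a quick sketch:

    print(type(df['name']))      # pandas Series
    print(type(df[['name']]))    # one-column DataFrame
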
3 Accessing DataFrame contents

There are several ways of accessing the elements contained in a DataFrame.

3.1 By label

We can access rows and columns by using labels, i.e. the index for the rows and/or columns, using the loc locator.


In [12]:
# One row, using the index. Note that in this case our row index is the 'id' column
df.loc[1000]


Out[12]:
name            Oslo
population    647676
area          480.76
Name: 1000, dtype: object

In [13]:
# Two rows. Note that, unlike ordinary Python slices, label slices include both endpoints
df.loc[1002:1003]


Out[13]:
           name  population    area
id
1002   Helsinki      626305  715.49
1003  Stockholm      917297  188.00

In [14]:
# Two rows, but only selected columns
df.loc[1002:1003,'name':'population']


Out[14]:
           name  population
id
1002   Helsinki      626305
1003  Stockholm      917297

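The loc locator also accepts lists of row labels and a single column name; a small sketch (the labels used here are just examples from the data above):

    # Two non-contiguous rows, a single column -> returns a Series
    df.loc[[1000, 1004], 'name']
    # The same rows, a list of columns -> returns a DataFrame
    df.loc[[1000, 1004], ['name', 'area']]
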
3.2 By position

And we can also select rows/columns by their position using the iloc locator.


In [15]:
# Get the first row
df.iloc[0]


Out[15]:
name            Oslo
population    647676
area          480.76
Name: 1000, dtype: object

In [16]:
# Get the last row
df.iloc[-1]


Out[16]:
name          Reykjavik
population       121822
area                273
Name: 1004, dtype: object

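iloc also accepts positional slices for both rows and columns; unlike loc label slices, these exclude the end position, as in ordinary Python slicing. A small sketch:

    # Rows at positions 1 and 2, columns at positions 0 and 1
    df.iloc[1:3, 0:2]
    # Every other row
    df.iloc[::2]
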
3.3 By boolean selection

Another possibility is to use a logical expression to create a boolean mask and index with it, selecting the rows that satisfy the expression.


In [17]:
df[df.area<200]


Out[17]:
             name  population   area
id
1001  Copenhaguen      583348   86.2
1003    Stockholm      917297  188.0

In [18]:
df[ (df.area<200) & (df.population>600000) ]


Out[18]:
           name  population  area
id
1003  Stockholm      917297   188

In [19]:
# This variant returns a result with the same shape as the original dataframe, keeping the rows
# that satisfy the condition and filling the rest with NaN
df.where( df.area<200 )


Out[19]:
             name population   area
id
1000          NaN        NaN    NaN
1001  Copenhaguen     583348   86.2
1002          NaN        NaN    NaN
1003    Stockholm     917297  188.0
1004          NaN        NaN    NaN

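The same kind of filter can also be written as a string with the query() method, which may read more naturally for combined conditions; a small sketch:

    # Equivalent to df[(df.area < 200) & (df.population > 600000)]
    df.query('area < 200 and population > 600000')
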
3.4 Random sample


In [20]:
# Take a random sample of 3 rows (without replacement by default)
df.sample(n=3)


Out[20]:
             name  population   area
id
1004    Reykjavik      121822  273.0
1003    Stockholm      917297  188.0
1001  Copenhaguen      583348   86.2

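sample() can also take a fraction of the rows instead of a count, and a seed for reproducibility; a quick sketch:

    # 40% of the rows, with a fixed seed so the result is repeatable
    df.sample(frac=0.4, random_state=42)
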
4 Augmenting a DataFrame

Let's take the existing dataframe and enlarge it by adding new rows/columns.

4.1 Adding a column


In [21]:
# We create a new column by combining data from other columns
df.loc[:,'density'] = df.loc[:,'population']/df.loc[:,'area']

In [22]:
df.head()


Out[22]:
             name  population    area      density
id
1000         Oslo      647676  480.76  1347.191946
1001  Copenhaguen      583348   86.20  6767.378190
1002     Helsinki      626305  715.49   875.351158
1003    Stockholm      917297  188.00  4879.239362
1004    Reykjavik      121822  273.00   446.234432

Another way of doing it is to use the assign() method. It returns a new DataFrame with the additions, leaving the original unmodified.


In [23]:
df2 = df.assign( density2 = lambda x : x.population/x.area )
df2.head()


Out[23]:
             name  population    area      density     density2
id
1000         Oslo      647676  480.76  1347.191946  1347.191946
1001  Copenhaguen      583348   86.20  6767.378190  6767.378190
1002     Helsinki      626305  715.49   875.351158   875.351158
1003    Stockholm      917297  188.00  4879.239362  4879.239362
1004    Reykjavik      121822  273.00   446.234432   446.234432

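assign() also accepts several new columns in one call; a small sketch, where pop_thousands is just an illustrative column name:

    # Two derived columns in one call; df itself is left untouched
    df.assign(density2=lambda x: x.population / x.area,
              pop_thousands=lambda x: x.population / 1000.0)
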
4.2 Adding rows


In [24]:
# Find the next id to use for the new rows (avoid shadowing the built-in next)
next_id = df.tail(1).index.values[0] + 1

In [25]:
# Define the new rows. This time, for a change, we'll be using a dict of lists as input data
name = ('Tallinn', 'Riga', 'Vilnius')
pop = ( 439286, 641007, 542664 )
size = ( 159.2, 304, 401 )
data2 = { 'id' : range(next_id, next_id+len(name)),
          'name' : name, 
          'population' : pop, 
          'area' : size  }
# An equivalent alternative would be a list of dicts, one per row:
#data2 = [ {'id': next_id+i, 'name': n, 'population': p, 'area': s}
#          for i, (n,p,s) in enumerate(zip(name,pop,size)) ]

In [26]:
# Create a dataframe from the dict of lists
df2 = pd.DataFrame( data2 )
# Set the column(s) to be used as the row index in this new dataframe
df2.set_index( 'id', inplace=True )

In [27]:
df2


Out[27]:
       area     name  population
id
1005  159.2  Tallinn      439286
1006  304.0     Riga      641007
1007  401.0  Vilnius      542664

In [28]:
# Now append this set of rows to the original one
df = df.append(df2)
df


Out[28]:
        area      density         name  population
id
1000  480.76  1347.191946         Oslo      647676
1001   86.20  6767.378190  Copenhaguen      583348
1002  715.49   875.351158     Helsinki      626305
1003  188.00  4879.239362    Stockholm      917297
1004  273.00   446.234432    Reykjavik      121822
1005  159.20          NaN      Tallinn      439286
1006  304.00          NaN         Riga      641007
1007  401.00          NaN      Vilnius      542664

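A note on compatibility: in recent pandas versions DataFrame.append() has been deprecated and later removed, and pd.concat() is the recommended way to do the same thing. A minimal sketch of the equivalent call (not meant to be run again here, since the rows have already been appended):

    # Equivalent to df.append(df2) in newer pandas versions
    df = pd.concat([df, df2])
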
In [29]:
# Find the rows having a missing density value. These will of course be the rows we just added
missing = df[ np.isnan(df.density) ].index

df.loc[missing]


Out[29]:
       area  density     name  population
id
1005  159.2      NaN  Tallinn      439286
1006  304.0      NaN     Riga      641007
1007  401.0      NaN  Vilnius      542664

Now let's add the missing densities. First naive attempt:


In [30]:
df.loc[missing].density = df.loc[missing].population/df.loc[missing].area

In [31]:
df.loc[missing]


Out[31]:
       area  density     name  population
id
1005  159.2      NaN  Tallinn      439286
1006  304.0      NaN     Riga      641007
1007  401.0      NaN  Vilnius      542664

It didn't work. Why? Because we are selecting in two steps:

  • first we get the rows: df.loc[missing]
  • and then we get the column of those rows: df.loc[missing].density

This is chained indexing: the assignment is applied to an intermediate copy, not to the original DataFrame, which is why it has no effect (pandas typically warns about this with a SettingWithCopyWarning).

So let's try again, using single-step indexing:


In [32]:
df.loc[missing,'density'] = df.loc[missing,'population']/df.loc[missing,'area']

This time it works:


In [33]:
df.loc[missing].density


Out[33]:
id
1005    2759.334171
1006    2108.575658
1007    1353.276808
Name: density, dtype: float64
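
As an aside, the same gap could have been filled without selecting the missing rows first: the column arithmetic is vectorized over the whole frame, and fillna() only touches the missing values. A small sketch:

    # Fill only the missing densities, leaving the existing values untouched
    df['density'] = df['density'].fillna(df['population'] / df['area'])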