Basic Pandas DataFrame operations

A few first steps with Pandas.

We'll create a small dataframe, access its elements, and enlarge it by adding new columns and rows.


In [1]:
# Import Pandas & NumPy
import pandas as pd
import numpy as np

In [2]:
# Create a tiny dataset, as a list of tuples
name = ('Oslo','Copenhaguen','Helsinki','Stockholm','Reykjavik')
pop = ( 647676, 583348, 626305, 917297, 121822 )
area = ( 480.76, 86.20, 715.49, 188.0,  273 )
data = [ (1000+i,n,p,s) for i, (n,p,s) in enumerate(zip(name,pop,area)) ]

1 Creating a DataFrame


In [3]:
# Create the dataframe from the list of tuples. We need to supply the column names,
# plus the column(s) to be used as the row index
df = pd.DataFrame.from_records( data=data, columns=('id','name','population','area'), index=['id'] )

Let's view the dataframe. We can print it:


In [4]:
print(df)


             name  population    area
id                                   
1000         Oslo      647676  480.76
1001  Copenhaguen      583348   86.20
1002     Helsinki      626305  715.49
1003    Stockholm      917297  188.00
1004    Reykjavik      121822  273.00

In [5]:
# See the options we've got for data formatting
pd.describe_option('display')


display.chop_threshold : float or None
    if set to a float value, all float values smaller then the given threshold
    will be displayed as exactly 0 by repr and friends.
    [default: None] [currently: None]

display.colheader_justify : 'left'/'right'
    Controls the justification of column headers. used by DataFrameFormatter.
    [default: right] [currently: right]

display.column_space No description available.
    [default: 12] [currently: 12]

display.date_dayfirst : boolean
    When True, prints and parses dates with the day first, eg 20/01/2005
    [default: False] [currently: False]

display.date_yearfirst : boolean
    When True, prints and parses dates with the year first, eg 2005/01/20
    [default: False] [currently: False]

display.encoding : str/unicode
    Defaults to the detected encoding of the console.
    Specifies the encoding to be used for strings returned by to_string,
    these are generally strings meant to be displayed on the console.
    [default: UTF-8] [currently: UTF-8]

display.expand_frame_repr : boolean
    Whether to print out the full DataFrame repr for wide DataFrames across
    multiple lines, `max_columns` is still respected, but the output will
    wrap-around across multiple "pages" if its width exceeds `display.width`.
    [default: True] [currently: True]

display.float_format : callable
    The callable should accept a floating point number and return
    a string with the desired format of the number. This is used
    in some places like SeriesFormatter.
    See core.format.EngFormatter for an example.
    [default: None] [currently: None]

display.height : int
    Deprecated.
    [default: 60] [currently: 60]
    (Deprecated, use `display.max_rows` instead.)

display.large_repr : 'truncate'/'info'
    For DataFrames exceeding max_rows/max_cols, the repr (and HTML repr) can
    show a truncated table (the default from 0.13), or switch to the view from
    df.info() (the behaviour in earlier versions of pandas).
    [default: truncate] [currently: truncate]

display.line_width : int
    Deprecated.
    [default: 80] [currently: 80]
    (Deprecated, use `display.width` instead.)

display.max_categories : int
    This sets the maximum number of categories pandas should output when printing
    out a `Categorical` or a Series of dtype "category".
    [default: 8] [currently: 8]

display.max_columns : int
    If max_cols is exceeded, switch to truncate view. Depending on
    `large_repr`, objects are either centrally truncated or printed as
    a summary view. 'None' value means unlimited.

    In case python/IPython is running in a terminal and `large_repr`
    equals 'truncate' this can be set to 0 and pandas will auto-detect
    the width of the terminal and print a truncated object which fits
    the screen width. The IPython notebook, IPython qtconsole, or IDLE
    do not run in a terminal and hence it is not possible to do
    correct auto-detection.
    [default: 20] [currently: 20]

display.max_colwidth : int
    The maximum width in characters of a column in the repr of
    a pandas data structure. When the column overflows, a "..."
    placeholder is embedded in the output.
    [default: 50] [currently: 50]

display.max_info_columns : int
    max_info_columns is used in DataFrame.info method to decide if
    per column information will be printed.
    [default: 100] [currently: 100]

display.max_info_rows : int or None
    df.info() will usually show null-counts for each column.
    For large frames this can be quite slow. max_info_rows and max_info_cols
    limit this null check only to frames with smaller dimensions then specified.
    [default: 1690785] [currently: 1690785]

display.max_rows : int
    If max_rows is exceeded, switch to truncate view. Depending on
    `large_repr`, objects are either centrally truncated or printed as
    a summary view. 'None' value means unlimited.

    In case python/IPython is running in a terminal and `large_repr`
    equals 'truncate' this can be set to 0 and pandas will auto-detect
    the height of the terminal and print a truncated object which fits
    the screen height. The IPython notebook, IPython qtconsole, or
    IDLE do not run in a terminal and hence it is not possible to do
    correct auto-detection.
    [default: 60] [currently: 60]

display.max_seq_items : int or None
    when pretty-printing a long sequence, no more then `max_seq_items`
    will be printed. If items are omitted, they will be denoted by the
    addition of "..." to the resulting string.

    If set to None, the number of items to be printed is unlimited.
    [default: 100] [currently: 100]

display.memory_usage : bool, string or None
    This specifies if the memory usage of a DataFrame should be displayed when
    df.info() is called. Valid values True,False,'deep'
    [default: True] [currently: True]

display.mpl_style : bool
    Setting this to 'default' will modify the rcParams used by matplotlib
    to give plots a more pleasing visual style by default.
    Setting this to None/False restores the values to their initial value.
    [default: None] [currently: None]

display.multi_sparse : boolean
    "sparsify" MultiIndex display (don't display repeated
    elements in outer levels within groups)
    [default: True] [currently: True]

display.notebook_repr_html : boolean
    When True, IPython notebook will use html representation for
    pandas objects (if it is available).
    [default: True] [currently: True]

display.pprint_nest_depth : int
    Controls the number of nested levels to process when pretty-printing
    [default: 3] [currently: 3]

display.precision : int
    Floating point output precision (number of significant digits). This is
    only a suggestion
    [default: 6] [currently: 6]

display.show_dimensions : boolean or 'truncate'
    Whether to print out dimensions at the end of DataFrame repr.
    If 'truncate' is specified, only print out the dimensions if the
    frame is truncated (e.g. not display all rows and/or columns)
    [default: truncate] [currently: truncate]

display.unicode.ambiguous_as_wide : boolean
    Whether to use the Unicode East Asian Width to calculate the display text width
    Enabling this may affect to the performance (default: False)
    [default: False] [currently: False]

display.unicode.east_asian_width : boolean
    Whether to use the Unicode East Asian Width to calculate the display text width
    Enabling this may affect to the performance (default: False)
    [default: False] [currently: False]

display.width : int
    Width of the display in characters. In case python/IPython is running in
    a terminal this can be set to None and pandas will correctly auto-detect
    the width.
    Note that the IPython notebook, IPython qtconsole, or IDLE do not run in a
    terminal and hence it is not possible to correctly detect the width.
    [default: 80] [currently: 80]

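Any of these options can be changed at runtime with pd.set_option() and read back with pd.get_option(); the option names come from the listing above. A small sketch:

    # Show more rows before truncating, and print floats with two decimals
    pd.set_option('display.max_rows', 100)
    pd.set_option('display.float_format', lambda v: '{:.2f}'.format(v))
    print(pd.get_option('display.max_rows'))
    # Restore the defaults for the options we touched
    pd.reset_option('display.max_rows')
    pd.reset_option('display.float_format')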

Instead of printing, we can also just display the dataframe, and it will be nicely formatted. Note the double header: the second header row is for the column(s) forming the DataFrame index.


In [6]:
df


Out[6]:
             name  population    area
id
1000         Oslo      647676  480.76
1001  Copenhaguen      583348   86.20
1002     Helsinki      626305  715.49
1003    Stockholm      917297  188.00
1004    Reykjavik      121822  273.00

In [7]:
# Check dataframe dimensions
print(df.shape)
# Check dataframe components
print(df.index)
print(df.columns)


(5, 3)
Int64Index([1000, 1001, 1002, 1003, 1004], dtype='int64', name=u'id')
Index([u'name', u'population', u'area'], dtype='object')

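Beyond the shape and the indices, the per-column data types and a compact summary can also be inspected; a quick sketch:

    # Per-column data types
    print(df.dtypes)
    # Compact summary: index, columns, non-null counts and memory usage
    df.info()
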
2 Fetching columns


In [8]:
df['name']


Out[8]:
id
1000           Oslo
1001    Copenhaguen
1002       Helsinki
1003      Stockholm
1004      Reykjavik
Name: name, dtype: object

In [9]:
# Or also
df.name


Out[9]:
id
1000           Oslo
1001    Copenhaguen
1002       Helsinki
1003      Stockholm
1004      Reykjavik
Name: name, dtype: object

We can also get more than one column, by passing a list of column names. These operations create and return a new DataFrame.


In [10]:
df[ ['name','population'] ]


Out[10]:
             name  population
id
1000         Oslo      647676
1001  Copenhaguen      583348
1002     Helsinki      626305
1003    Stockholm      917297
1004    Reykjavik      121822

Same selection, but this time using the loc locator (see next section). Unlike the plain bracket selection above, loc indexing can also be used on the left-hand side of an assignment to modify the original DataFrame.


In [11]:
df.loc[:,['name','population']]


Out[11]:
             name  population
id
1000         Oslo      647676
1001  Copenhaguen      583348
1002     Helsinki      626305
1003    Stockholm      917297
1004    Reykjavik      121822

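Note that selecting with a single label returns a Series, while selecting with a list of labels (even a one-element list) returns a DataFrame; a quick sketch:

    print(type(df['name']))      # pandas Series
    print(type(df[['name']]))    # one-column DataFrame
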
3 Accessing DataFrame contents

There are several ways of accessing the elements contained in a DataFrame.

3.1 By label

We can access rows and columns by using labels, i.e. the index for the rows and/or columns, using the loc locator.


In [12]:
# One row, using the index. Note that in this case our row index is the 'id' column
df.loc[1000]


Out[12]:
name            Oslo
population    647676
area          480.76
Name: 1000, dtype: object

In [13]:
# Two rows. Note that, unlike ordinary Python slices, label slices include both endpoints
df.loc[1002:1003]


Out[13]:
           name  population    area
id
1002   Helsinki      626305  715.49
1003  Stockholm      917297  188.00

In [14]:
# Two rows, but only selected columns
df.loc[1002:1003,'name':'population']


Out[14]:
           name  population
id
1002   Helsinki      626305
1003  Stockholm      917297

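The loc locator also accepts lists of row labels and a single column name; a small sketch (the labels used here are just examples from the data above):

    # Two non-contiguous rows, a single column -> returns a Series
    df.loc[[1000, 1004], 'name']
    # The same rows, a list of columns -> returns a DataFrame
    df.loc[[1000, 1004], ['name', 'area']]
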
3.2 By position

And we can also select rows/columns by their position using the iloc locator.


In [15]:
# Get the first row
df.iloc[0]


Out[15]:
name            Oslo
population    647676
area          480.76
Name: 1000, dtype: object

In [16]:
# Get the last row
df.iloc[-1]


Out[16]:
name          Reykjavik
population       121822
area                273
Name: 1004, dtype: object

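iloc also accepts positional slices for both rows and columns; unlike loc label slices, these exclude the end position, as in ordinary Python slicing. A small sketch:

    # Rows at positions 1 and 2, columns at positions 0 and 1
    df.iloc[1:3, 0:2]
    # Every other row
    df.iloc[::2]
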
3.3 By boolean selection

Another possibility is to use a logical expression to create a boolean mask and index with it, selecting the rows that satisfy the expression.


In [17]:
df[df.area<200]


Out[17]:
             name  population   area
id
1001  Copenhaguen      583348   86.2
1003    Stockholm      917297  188.0

In [18]:
df[ (df.area<200) & (df.population>600000) ]


Out[18]:
           name  population  area
id
1003  Stockholm      917297   188

In [19]:
# This variant returns a result with the same shape as the original dataframe, keeping the rows
# that satisfy the condition and filling the rest with NaN
df.where( df.area<200 )


Out[19]:
             name population   area
id
1000          NaN        NaN    NaN
1001  Copenhaguen     583348   86.2
1002          NaN        NaN    NaN
1003    Stockholm     917297  188.0
1004          NaN        NaN    NaN

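The same kind of filter can also be written as a string with the query() method, which may read more naturally for combined conditions; a small sketch:

    # Equivalent to df[(df.area < 200) & (df.population > 600000)]
    df.query('area < 200 and population > 600000')
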
3.4 Random sample


In [20]:
# Take a random sample of 3 rows (without replacement by default)
df.sample(n=3)


Out[20]:
             name  population   area
id
1004    Reykjavik      121822  273.0
1003    Stockholm      917297  188.0
1001  Copenhaguen      583348   86.2

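sample() can also take a fraction of the rows instead of a count, and a seed for reproducibility; a quick sketch:

    # 40% of the rows, with a fixed seed so the result is repeatable
    df.sample(frac=0.4, random_state=42)
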
4 Augmenting a DataFrame

Let's take the existing dataframe and enlarge it by adding new rows/columns.

4.1 Adding a column


In [21]:
# We create a new column by combining data from other columns
df.loc[:,'density'] = df.loc[:,'population']/df.loc[:,'area']

In [22]:
df.head()


Out[22]:
             name  population    area      density
id
1000         Oslo      647676  480.76  1347.191946
1001  Copenhaguen      583348   86.20  6767.378190
1002     Helsinki      626305  715.49   875.351158
1003    Stockholm      917297  188.00  4879.239362
1004    Reykjavik      121822  273.00   446.234432

Another way of doing it is to use the assign() method. It returns a new DataFrame with the additions, leaving the original unmodified.


In [23]:
df2 = df.assign( density2 = lambda x : x.population/x.area )
df2.head()


Out[23]:
             name  population    area      density     density2
id
1000         Oslo      647676  480.76  1347.191946  1347.191946
1001  Copenhaguen      583348   86.20  6767.378190  6767.378190
1002     Helsinki      626305  715.49   875.351158   875.351158
1003    Stockholm      917297  188.00  4879.239362  4879.239362
1004    Reykjavik      121822  273.00   446.234432   446.234432

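assign() also accepts several new columns in one call; a small sketch, where pop_thousands is just an illustrative column name:

    # Two derived columns in one call; df itself is left untouched
    df.assign(density2=lambda x: x.population / x.area,
              pop_thousands=lambda x: x.population / 1000.0)
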
4.2 Adding rows


In [24]:
# Find the next id to use for the new rows (avoid shadowing the built-in next)
next_id = df.tail(1).index.values[0] + 1

In [25]:
# Define the new rows. This time, for a change, we'll be using a dict of lists as input data
name = ('Tallinn', 'Riga', 'Vilnius')
pop = ( 439286, 641007, 542664 )
size = ( 159.2, 304, 401 )
data2 = { 'id' : range(next_id, next_id+len(name)),
          'name' : name, 
          'population' : pop, 
          'area' : size  }
# An equivalent alternative would be a list of dicts, one per row:
#data2 = [ {'id': next_id+i, 'name': n, 'population': p, 'area': s}
#          for i, (n,p,s) in enumerate(zip(name,pop,size)) ]

In [26]:
# Create a dataframe from the dict of lists
df2 = pd.DataFrame( data2 )
# Set the column(s) to be used as the row index in this new dataframe
df2.set_index( 'id', inplace=True )

In [27]:
df2


Out[27]:
       area     name  population
id
1005  159.2  Tallinn      439286
1006  304.0     Riga      641007
1007  401.0  Vilnius      542664

In [28]:
# Now append this set of rows to the original one
df = df.append(df2)
df


Out[28]:
        area      density         name  population
id
1000  480.76  1347.191946         Oslo      647676
1001   86.20  6767.378190  Copenhaguen      583348
1002  715.49   875.351158     Helsinki      626305
1003  188.00  4879.239362    Stockholm      917297
1004  273.00   446.234432    Reykjavik      121822
1005  159.20          NaN      Tallinn      439286
1006  304.00          NaN         Riga      641007
1007  401.00          NaN      Vilnius      542664

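A note on compatibility: in recent pandas versions DataFrame.append() has been deprecated and later removed, and pd.concat() is the recommended way to do the same thing. A minimal sketch of the equivalent call (not meant to be run again here, since the rows have already been appended):

    # Equivalent to df.append(df2) in newer pandas versions
    df = pd.concat([df, df2])
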
In [29]:
# Find the rows having a missing density value. These will of course be the rows we just added
missing = df[ np.isnan(df.density) ].index

df.loc[missing]


Out[29]:
       area  density     name  population
id
1005  159.2      NaN  Tallinn      439286
1006  304.0      NaN     Riga      641007
1007  401.0      NaN  Vilnius      542664

Now let's add the missing densities. First naive attempt:


In [30]:
df.loc[missing].density = df.loc[missing].population/df.loc[missing].area

In [31]:
df.loc[missing]


Out[31]:
       area  density     name  population
id
1005  159.2      NaN  Tallinn      439286
1006  304.0      NaN     Riga      641007
1007  401.0      NaN  Vilnius      542664

It didn't work. Why? Because we are selecting in two steps:

  • first we get the rows: df.loc[missing]
  • and then we get the column of those rows: df.loc[missing].density

This is chained indexing: the assignment is applied to an intermediate copy, not to the original DataFrame, which is why it has no effect (pandas typically warns about this with a SettingWithCopyWarning).

So let's try again, using single-step indexing:


In [32]:
df.loc[missing,'density'] = df.loc[missing,'population']/df.loc[missing,'area']

This time it works:


In [33]:
df.loc[missing].density


Out[33]:
id
1005    2759.334171
1006    2108.575658
1007    1353.276808
Name: density, dtype: float64
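
As an aside, the same gap could have been filled without selecting the missing rows first: the column arithmetic is vectorized over the whole frame, and fillna() only touches the missing values. A small sketch:

    # Fill only the missing densities, leaving the existing values untouched
    df['density'] = df['density'].fillna(df['population'] / df['area'])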