A few first steps with Pandas.
We'll create a small dataframe, access its elements, and enlarge it by adding new columns and rows.
In [1]:
# Import Pandas & NumPy
import pandas as pd
import numpy as np
In [2]:
# Create a tiny dataset, as a list of tuples
name = ('Oslo','Copenhagen','Helsinki','Stockholm','Reykjavik')
pop = ( 647676, 583348, 626305, 917297, 121822 )
area = ( 480.76, 86.20, 715.49, 188.0, 273 )
data = [ (1000+i,n,p,s) for i, (n,p,s) in enumerate(zip(name,pop,area)) ]
In [3]:
# Create the dataframe from the list of tuples. We need to add the names of the columns, plus
# the column(s) we want to be used as row index
df = pd.DataFrame.from_records( data=data, columns=('id','name','population','area'), index=['id'] )
Let's view the dataframe. We can print it:
In [4]:
print(df)
In [5]:
# See the options we've got for data formatting
pd.describe_option('display')
Or we can just show it, and it will be nicely formatted. Note the double header: the second header row is for the column(s) forming the DataFrame index.
In [6]:
df
Out[6]:
In [7]:
# Check dataframe dimensions
print(df.shape)
# Check dataframe components
print(df.index)
print(df.columns)
In [8]:
df['name']
Out[8]:
In [9]:
# Or also
df.name
Out[9]:
We can also get more than one column. These operations create and return a new DataFrame.
In [10]:
df[ ['name','population'] ]
Out[10]:
Same selection, this time using the loc locator (see next section). Note that selecting a list of columns with loc also returns a new DataFrame, not a reference to the original one.
In [11]:
df.loc[:,['name','population']]
Out[11]:
There are several ways of accessing the elements contained in a DataFrame.
We can access rows and columns by using labels, i.e. the index for the rows and/or columns, using the loc locator.
In [12]:
# One row, using the index. Note that in this case our row index is the 'id' column
df.loc[1000]
Out[12]:
In [13]:
# Two rows (label slices include both endpoints)
df.loc[1002:1003]
Out[13]:
In [14]:
# Two rows, but only selected columns
df.loc[1002:1003,'name':'population']
Out[14]:
And we can also select rows and columns by their position using the iloc locator.
In [15]:
# Get the first row
df.iloc[0]
Out[15]:
In [16]:
# Get the last row
df.iloc[-1]
Out[16]:
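The difference between label slices and position slices is easy to miss, so here is a minimal sketch of it. The `cities` dataframe is a hypothetical stand-in rebuilt here so the snippet runs on its own; it mirrors the columns and 'id' index used above.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's dataframe (same columns, same 'id' index)
cities = pd.DataFrame(
    {"name": ["Oslo", "Copenhagen", "Helsinki", "Stockholm", "Reykjavik"],
     "population": [647676, 583348, 626305, 917297, 121822],
     "area": [480.76, 86.20, 715.49, 188.0, 273.0]},
    index=pd.Index([1000, 1001, 1002, 1003, 1004], name="id"),
)

# Label slice with loc: both endpoints (1002 and 1003) are included
by_label = cities.loc[1002:1003]

# Position slice with iloc: the stop position (4) is excluded, as in plain Python
by_position = cities.iloc[2:4]

# Both selections contain the same two rows
print(by_label.equals(by_position))
```

So `loc[1002:1003]` and `iloc[2:4]` pick the same rows here, even though the slice bounds look different.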
In [17]:
df[df.area<200]
Out[17]:
In [18]:
df[ (df.area<200) & (df.population>600000) ]
Out[18]:
In [19]:
# This variant returns a dataframe of the same size as the original, but fills only the rows that satisfy the condition
df.where( df.area<200 )
Out[19]:
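To make the contrast between boolean selection and where() concrete, here is a small sketch. The `cities` dataframe is a hypothetical stand-in for the one built above, recreated so the snippet is self-contained.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's dataframe
cities = pd.DataFrame(
    {"name": ["Oslo", "Copenhagen", "Helsinki", "Stockholm", "Reykjavik"],
     "population": [647676, 583348, 626305, 917297, 121822],
     "area": [480.76, 86.20, 715.49, 188.0, 273.0]},
    index=pd.Index([1000, 1001, 1002, 1003, 1004], name="id"),
)

mask = cities["area"] < 200

# Boolean selection keeps only the rows where the mask is True
filtered = cities[mask]

# where() keeps every row, replacing the non-matching ones with NaN
masked = cities.where(mask)

# Dropping the all-NaN rows brings us back to the same set of row labels
recovered = masked.dropna(how="all")

print(filtered.shape, masked.shape, recovered.shape)
```

One side effect worth knowing: because NaN is a float, integer columns such as population become floats in the output of where().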
In [20]:
df.sample(n=3)
Out[20]:
In [21]:
# We create a new column by combining data from other columns
df.loc[:,'density'] = df.loc[:,'population']/df.loc[:,'area']
In [22]:
df.head()
Out[22]:
Another way of doing it is to use the assign() method. It returns a new DataFrame with the additions.
In [23]:
df2 = df.assign( density2 = lambda x : x.population/x.area )
df2.head()
Out[23]:
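As a quick check that assign() leaves its input untouched, consider this minimal sketch. The two-row `cities` dataframe is a hypothetical stand-in used only to keep the snippet self-contained.

```python
import pandas as pd

# Hypothetical two-row stand-in for the notebook's dataframe
cities = pd.DataFrame(
    {"population": [647676, 583348], "area": [480.76, 86.20]},
    index=pd.Index([1000, 1001], name="id"),
)

# assign() returns a new dataframe that carries the extra column
with_density = cities.assign(density=lambda x: x.population / x.area)

# The original dataframe has not been modified
print("density" in cities.columns)
print("density" in with_density.columns)
```

This makes assign() convenient for chaining several transformations without side effects on the original data.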
In [24]:
# Find the next id to insert
next = df.tail(1).index.values[0] + 1
In [25]:
# Define new rows. This time, for a change, we'll be using a dict of lists as input data
name = ('Tallinn', 'Riga', 'Vilnius')
pop = ( 439286, 641007, 542664 )
size = ( 159.2, 304, 401 )
data2 = { 'id' : range(next,next+len(name)),
'name' : name,
'population' : pop,
'area' : size }
#data = [ {'id':next+i, 'name':n, 'population':p, 'area':s }
# for i, (n,p,s) in enumerate(zip(name,pop,size)) ]
In [26]:
# Create a dataframe from the dict of lists
df2 = pd.DataFrame( data2 )
# Set the column(s) to be used as the row index in this new dataframe
df2.set_index( 'id', inplace=True )
#df2 = pd.DataFrame.from_dict( data )
#df.append( data, ignore_index=True)
In [27]:
df2
Out[27]:
In [28]:
# Now append this set of rows to the original one.
# (DataFrame.append was removed in pandas 2.0; pd.concat does the same job)
df = pd.concat([df, df2])
df
Out[28]:
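When stacking rows from two dataframes it can be useful to make sure the row labels do not collide. The sketch below uses `pd.concat` with `verify_integrity=True` for that; the `nordic` and `baltic` dataframes are hypothetical stand-ins, trimmed down so the snippet runs on its own.

```python
import pandas as pd

# Hypothetical stand-ins for the original rows and the new rows
nordic = pd.DataFrame(
    {"name": ["Oslo", "Helsinki"], "population": [647676, 626305]},
    index=pd.Index([1000, 1001], name="id"),
)
baltic = pd.DataFrame(
    {"name": ["Tallinn", "Riga"], "population": [439286, 641007]},
    index=pd.Index([1002, 1003], name="id"),
)

# Stack the rows; verify_integrity makes pandas check for duplicated labels
combined = pd.concat([nordic, baltic], verify_integrity=True)

# Concatenating overlapping ids is rejected with a ValueError
try:
    pd.concat([nordic, nordic], verify_integrity=True)
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True

print(combined)
print(duplicate_rejected)
```

Without `verify_integrity`, the concatenation would succeed silently and leave duplicated ids in the index.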
In [29]:
# Find the rows with a missing density value. Obviously they will be the ones we just added
missing = df[ np.isnan(df.density) ].index
df.loc[missing]
Out[29]:
Now let's add the missing densities. First naive attempt:
In [30]:
df.loc[missing].density = df.loc[missing].population/df.loc[missing].area
In [31]:
df.loc[missing]
Out[31]:
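It didn't work. Why? Because we are selecting in two steps:
df.loc[missing]
df.loc[missing].population
This is chained indexing, and it fails when used for assignment: the first step returns a new object, so the assignment modifies that temporary copy and not the original DataFrame.
So let's try again, using single-step indexing: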
In [32]:
df.loc[missing,'density'] = df.loc[missing,'population']/df.loc[missing,'area']
This time it works:
In [33]:
df.loc[missing].density
Out[33]:
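The whole chained-indexing pitfall can be reproduced in a few lines. This is a minimal sketch on a hypothetical two-row `cities` dataframe; the warnings filter is only there because pandas normally warns about the chained assignment, and the exact warning depends on the pandas version.

```python
import warnings

import numpy as np
import pandas as pd

# Hypothetical stand-in: one row with a density, one row without
cities = pd.DataFrame(
    {"population": [647676, 439286],
     "area": [480.76, 159.2],
     "density": [647676 / 480.76, np.nan]},
    index=pd.Index([1000, 1005], name="id"),
)
missing = cities.index[cities["density"].isna()]

# Chained indexing: cities.loc[missing] builds a temporary copy,
# and the assignment lands on that copy
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    cities.loc[missing]["density"] = 1.0
still_missing = cities.loc[missing, "density"].isna().all()

# Single-step indexing: rows and column are given to loc at once,
# so the assignment reaches the original dataframe
cities.loc[missing, "density"] = (
    cities.loc[missing, "population"] / cities.loc[missing, "area"]
)

print(still_missing)
print(cities["density"].isna().any())
```

The rule of thumb is to put the row selector and the column selector into the same `loc[...]` call whenever the expression appears on the left-hand side of an assignment.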