In this post I will summarize the data structures of Pandas library. Pandas is a library written for data manipulation and analysis. It is built on top of the Numpy library and it provides features not available in Numpy, so to be able to follow this tutorial you should have some basics on Python and NumPy library.

I am going to follow the official data structure documentation here

Data Structures in Pandas

There are 3 main data structures:

1. Series

The series are the 1D NumPy array under the hood. It consists of a NumPy array coupled with an array of labels.

Create Series:

import pandas as pd
ser = pd.Series(data, index=idx)
  • data can be ndarray, a Python dictionary, or a scalar value

In [6]:
# Using Scalar Values
import pandas as pd
ser = pd.Series([20, 21, 12], index=['London', 'New York','Helsinki'])
print(ser)


London      20
New York    21
Helsinki    12
dtype: int64

In [7]:
# Using Numpy ndarray
import numpy as np
np.random.seed(100)
ser=pd.Series(np.random.rand(7))
ser


Out[7]:
0    0.543405
1    0.278369
2    0.424518
3    0.844776
4    0.004719
5    0.121569
6    0.670749
dtype: float64

Using Python Dictionary:

If we want to use Python Dictionary, the usage is slighly different. Since dictionary data structure has the


In [10]:
currDict = {'US' : 'dollar', 'UK' : 'pound',
            'Germany': 'euro', 'Mexico':'peso',
            'Nigeria':'naira',
            'China':'yuan', 'Japan':'yen'}
currSeries=pd.Series(currDict)
currSeries


Out[10]:
China        yuan
Germany      euro
Japan         yen
Mexico       peso
Nigeria     naira
UK          pound
US         dollar
dtype: object

The index of a pandas Series structure is of type pandas.core.index.Index and can be viewed as an ordered multiset.

In the following case, we specify an index, but the index contains one entry that isn't a key in the corresponding dict. The result is that the value for the key is assigned as NaN, indicating that it is missing. We will deal with handling missing values in a later section.


In [9]:
stockPrices = {'GOOG':1180.97,'FB':62.57,
                'TWTR': 64.50, 'AMZN':358.69,
                'AAPL':500.6}
stockPriceSeries=pd.Series(stockPrices,
                            index=['GOOG','FB','YHOO',
                                    'TWTR','AMZN','AAPL'],
                            name='stockPrices')
stockPriceSeries


Out[9]:
GOOG    1180.97
FB        62.57
YHOO        NaN
TWTR      64.50
AMZN     358.69
AAPL     500.60
Name: stockPrices, dtype: float64

The behavior of Series is very similar to that of numpy arrays discussed in a previous section, with one caveat being that an operation such as slicing also slices the index.

  • Values can be set and accessed using the index label in a dictionary-like manner:

In [11]:
stockPriceSeries


Out[11]:
GOOG    1180.97
FB        62.57
YHOO        NaN
TWTR      64.50
AMZN     358.69
AAPL     500.60
Name: stockPrices, dtype: float64

In [13]:
stockPriceSeries['GOOG']=1200.0
stockPriceSeries


Out[13]:
GOOG    1200.00
FB        62.57
YHOO        NaN
TWTR      64.50
AMZN     358.69
AAPL     500.60
Name: stockPrices, dtype: float64

Just as in the case of dict, KeyError is raised if you try to retrieve a missing label,

We can avoid this error to be avoid by explicitly using get as follows:


In [15]:
stockPriceSeries.get('MSFT',np.NaN)


Out[15]:
nan

In this case, the default value of np.NaN is specified as the value to return when the key does not exist in the Series structure.

The slice operation behaves the same way as a NumPy array:


In [16]:
stockPriceSeries[:4]


Out[16]:
GOOG    1200.00
FB        62.57
YHOO        NaN
TWTR      64.50
Name: stockPrices, dtype: float64

Logical slicing also works as follows:


In [17]:
stockPriceSeries[stockPriceSeries > 100]


Out[17]:
GOOG    1200.00
AMZN     358.69
AAPL     500.60
Name: stockPrices, dtype: float64

Arithmetic and statistical operations can be applied, just as with a NumPy array:


In [18]:
np.mean(stockPriceSeries)


Out[18]:
437.27200000000005

In [19]:
np.std(stockPriceSeries)


Out[19]:
417.4446361087899

In [22]:
ser


Out[22]:
0    0.543405
1    0.278369
2    0.424518
3    0.844776
4    0.004719
5    0.121569
6    0.670749
dtype: float64

In [21]:
ser*ser


Out[21]:
0    0.295289
1    0.077490
2    0.180215
3    0.713647
4    0.000022
5    0.014779
6    0.449904
dtype: float64

In [20]:
np.sqrt(ser)


Out[20]:
0    0.737160
1    0.527607
2    0.651550
3    0.919117
4    0.068694
5    0.348668
6    0.818993
dtype: float64

In [23]:
ser[1:]


Out[23]:
1    0.278369
2    0.424518
3    0.844776
4    0.004719
5    0.121569
6    0.670749
dtype: float64

2. Data Frame

DataFrame is an 2-dimensional labeled array. Its column types can be heterogeneous: that is, of varying types. It is similar to structured arrays in NumPy with mutability added. It has the following properties:

  • Conceptually analogous to a table or spreadsheet of data.
  • Similar to a NumPy ndarray but not a subclass of np.ndarray.
  • Columns can be of heterogeneous types: float64, int, bool, and so on.
  • A DataFrame column is a Series structure.
  • It can be thought of as a dictionary of Series structures where both the columns and the rows are indexed, denoted as 'index' in the case of rows and 'columns' in the case of columns.
  • It is size mutable: columns can be inserted and deleted.

DataFrame is the most commonly used data structure in pandas. The constructor accepts many different types of arguments:

  • Dictionary of 1D ndarrays, lists, dictionaries, or Series structures
  • 2D NumPy array
  • Structured or record ndarray
  • Series structures
  • Another DataFrame structure

Using Dictionaries of Series:


In [24]:
# Dictionary created with pandas.Series
stockSummaries={
        'AMZN': pd.Series([346.15,0.59,459,0.52,589.8,158.88],
                            index=['Closing price','EPS','Shares Outstanding(M)',
                            'Beta', 'P/E','Market Cap(B)']),
        'GOOG': pd.Series([1133.43,36.05,335.83,0.87,31.44,380.64],
                            index=['Closing price','EPS','Shares Outstanding(M)',
                            'Beta','P/E','Market Cap(B)']),
        'FB': pd.Series([61.48,0.59,2450,104.93,150.92],
                            index=['Closing price','EPS','Shares Outstanding(M)',
                            'P/E', 'Market Cap(B)']),
        'YHOO': pd.Series([34.90,1.27,1010,27.48,0.66,35.36],
                            index=['Closing price','EPS','Shares Outstanding(M)',
                            'P/E','Beta', 'Market Cap(B)']),
        'TWTR':pd.Series([65.25,-0.3,555.2,36.23],
                            index=['Closing price','EPS','Shares Outstanding(M)',
                            'Market Cap(B)']),
        'AAPL':pd.Series([501.53,40.32,892.45,12.44,447.59,0.84],
                            index=['Closing price','EPS','Shares Outstanding(M)','P/E',
                            'Market Cap(B)','Beta'])}

stockDF=pd.DataFrame(stockSummaries)
stockDF


Out[24]:
AAPL AMZN FB GOOG TWTR YHOO
Beta 0.84 0.52 NaN 0.87 NaN 0.66
Closing price 501.53 346.15 61.48 1133.43 65.25 34.90
EPS 40.32 0.59 0.59 36.05 -0.30 1.27
Market Cap(B) 447.59 158.88 150.92 380.64 36.23 35.36
P/E 12.44 589.80 104.93 31.44 NaN 27.48
Shares Outstanding(M) 892.45 459.00 2450.00 335.83 555.20 1010.00

In [25]:
# We can also change the order...
stockDF=pd.DataFrame(stockSummaries,
                    index=['Closing price','EPS',
                    'Shares Outstanding(M)',
                    'P/E', 'Market Cap(B)','Beta'])
stockDF


Out[25]:
AAPL AMZN FB GOOG TWTR YHOO
Closing price 501.53 346.15 61.48 1133.43 65.25 34.90
EPS 40.32 0.59 0.59 36.05 -0.30 1.27
Shares Outstanding(M) 892.45 459.00 2450.00 335.83 555.20 1010.00
P/E 12.44 589.80 104.93 31.44 NaN 27.48
Market Cap(B) 447.59 158.88 150.92 380.64 36.23 35.36
Beta 0.84 0.52 NaN 0.87 NaN 0.66

The row index labels and column labels can be accessed via the index and column attributes:


In [26]:
stockDF.index


Out[26]:
Index([u'Closing price', u'EPS', u'Shares Outstanding(M)', u'P/E',
       u'Market Cap(B)', u'Beta'],
      dtype='object')

In [27]:
stockDF.columns


Out[27]:
Index([u'AAPL', u'AMZN', u'FB', u'GOOG', u'TWTR', u'YHOO'], dtype='object')

Using ndarrays/lists:

When we create a dataframe from lists, the keys become the column labels in the DataFrame structure and the data in the list becomes the column values. Note how the row label indexes are generated using np.range(n).


In [28]:
algos={'search':['DFS','BFS','Binary Search','Linear','ShortestPath (Djikstra)'],
       'sorting': ['Quicksort','Mergesort', 'Heapsort','Bubble Sort', 'Insertion Sort'],
       'machine learning':['RandomForest', 'K Nearest Neighbor', 'Logistic Regression', 'K-Means Clustering', 'Linear Regression']}
algoDF=pd.DataFrame(algos)
algoDF


Out[28]:
machine learning search sorting
0 RandomForest DFS Quicksort
1 K Nearest Neighbor BFS Mergesort
2 Logistic Regression Binary Search Heapsort
3 K-Means Clustering Linear Bubble Sort
4 Linear Regression ShortestPath (Djikstra) Insertion Sort

In [29]:
# Or we can index by specifying the index when creating data frame
pd.DataFrame(algos,index=['algo_1','algo_2','algo_3','algo_4','algo_5'])


Out[29]:
machine learning search sorting
algo_1 RandomForest DFS Quicksort
algo_2 K Nearest Neighbor BFS Mergesort
algo_3 Logistic Regression Binary Search Heapsort
algo_4 K-Means Clustering Linear Bubble Sort
algo_5 Linear Regression ShortestPath (Djikstra) Insertion Sort

Using Structured Array:

Structured arrayis an array of records or structs, for more info check structured arrays documentation:


In [30]:
memberData = np.zeros((4,), dtype=[('Name','a15'), ('Age','i4'), ('Weight','f4')])
memberData[:] = [('Sanjeev',37,162.4), ('Yingluck',45,137.8), ('Emeka',28,153.2), ('Amy',67,101.3)]
memberDF=pd.DataFrame(memberData)
memberDF


Out[30]:
Name Age Weight
0 Sanjeev 37 162.399994
1 Yingluck 45 137.800003
2 Emeka 28 153.199997
3 Amy 67 101.300003

In [31]:
pd.DataFrame(memberData, index=['a','b','c','d'])


Out[31]:
Name Age Weight
a Sanjeev 37 162.399994
b Yingluck 45 137.800003
c Emeka 28 153.199997
d Amy 67 101.300003

Using a Series Structure:


In [32]:
currSeries.name='currency'
pd.DataFrame(currSeries)


Out[32]:
currency
China yuan
Germany euro
Japan yen
Mexico peso
Nigeria naira
UK pound
US dollar

There are also alternative constructors for DataFrame; they can be summarized as follows:

  • DataFrame.from_dict: It takes a dictionary of dictionaries or sequences and returns DataFrame.
  • DataFrame.from_records: It takes a list of tuples or structured ndarray.
  • DataFrame.from_items: It takes a sequence of (key, value) pairs. The keys are the column or index names, and the values are the column or row values. If you wish the keys to be row index names, you must specify orient='index' as a parameter and specify the column names.
  • pandas.io.parsers.read_csv: This is a helper function that reads a CSV file into a pandas DataFrame structure.
  • pandas.io.parsers.read_table: This is a helper function that reads a delimited file into a pandas DataFrame structure.
  • pandas.io.parsers.read_fwf: This is a helper function that reads a table of fixed-width lines into a pandas DataFrame structure.

I will point out some of the operations most commonly used with data frames.


In [34]:
# Specific column can be reached as Series
memberDF['Name']


Out[34]:
0     Sanjeev
1    Yingluck
2       Emeka
3         Amy
Name: Name, dtype: object

In [39]:
# A new column can be added via assignment
memberDF['Height']=[60, 70,80, 90]
memberDF


Out[39]:
Name Age Weight Height
0 Sanjeev 37 162.399994 60
1 Yingluck 45 137.800003 70
2 Emeka 28 153.199997 80
3 Amy 67 101.300003 90

In [40]:
# column can be deleted using del
del memberDF['Height']
memberDF


Out[40]:
Name Age Weight
0 Sanjeev 37 162.399994
1 Yingluck 45 137.800003
2 Emeka 28 153.199997
3 Amy 67 101.300003

Basically, a DataFrame structure can be treated as if it were a dictionary of Series objects. Columns get inserted at the end; to insert a column at a specific location, you can use the insert function:


In [41]:
memberDF.insert(2,'isSenior',memberDF['Age']>60)
memberDF


Out[41]:
Name Age isSenior Weight
0 Sanjeev 37 False 162.399994
1 Yingluck 45 False 137.800003
2 Emeka 28 False 153.199997
3 Amy 67 True 101.300003

DataFrame objects align in a manner similar to Series objects, except that they align on both column and index labels. The resulting object is the union of the column and row labels:


In [42]:
ore1DF=pd.DataFrame(np.array([[20,35,25,20],[11,28,32,29]]),columns=['iron','magnesium','copper','silver'])
ore2DF=pd.DataFrame(np.array([[14,34,26,26],[33,19,25,23]]),columns=['iron','magnesium','gold','silver'])
ore1DF+ore2DF


Out[42]:
copper gold iron magnesium silver
0 NaN NaN 34 69 46
1 NaN NaN 44 47 52

In [43]:
ore1DF + pd.Series([25,25,25,25], index=['iron','magnesium', 'copper','silver'])


Out[43]:
iron magnesium copper silver
0 45 60 50 45
1 36 53 57 54

Mathematical operators can be applied element wise on DataFrame structures:


In [44]:
np.sqrt(ore1DF)


Out[44]:
iron magnesium copper silver
0 4.472136 5.916080 5.000000 4.472136
1 3.316625 5.291503 5.656854 5.385165

3. Panel

Panel is a 3D array. It is not as widely used as Series or DataFrame. It is not as easily displayed on screen or visualized as the other two because of its 3D nature.

The three axis names are as follows:

  • items: This is axis 0. Each each item corresponds to a DataFrame structure.
  • major_axis: This is axis 1. Each item corresponds to the rows of the DataFrame structure.
  • minor_axis: This is axis 2. Each item corresponds to the columns of each DataFrame structure.

Using 3D Numpy array with axis labels:


In [45]:
stockData=np.array([[[63.03,61.48,75],
                     [62.05,62.75,46],
                     [62.74,62.19,53]],
                    [[411.90, 404.38, 2.9],
                     [405.45, 405.91, 2.6],
                     [403.15, 404.42, 2.4]]])

stockData


Out[45]:
array([[[  63.03,   61.48,   75.  ],
        [  62.05,   62.75,   46.  ],
        [  62.74,   62.19,   53.  ]],

       [[ 411.9 ,  404.38,    2.9 ],
        [ 405.45,  405.91,    2.6 ],
        [ 403.15,  404.42,    2.4 ]]])

In [46]:
stockHistoricalPrices = pd.Panel(stockData, items=['FB', 'NFLX'], 
                                 major_axis=pd.date_range('2/3/2014', periods=3), 
                                 minor_axis=['open price', 'closing price', 'volume'])
stockHistoricalPrices


Out[46]:
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 3 (major_axis) x 3 (minor_axis)
Items axis: FB to NFLX
Major_axis axis: 2014-02-03 00:00:00 to 2014-02-05 00:00:00
Minor_axis axis: open price to volume

As I mentioned earlier visualizing the 3D data is not really easy so we see a small information instead.

Using Python Dictionary of Data Frame Objects:


In [48]:
USData=pd.DataFrame(np.array([[249.62 , 8900], [ 282.16,12680], [309.35,14940]]),
                    columns=['Population(M)','GDP($B)'],
                    index=[1990,2000,2010])
USData


Out[48]:
Population(M) GDP($B)
1990 249.62 8900.0
2000 282.16 12680.0
2010 309.35 14940.0

In [49]:
ChinaData=pd.DataFrame(np.array([[1133.68, 390.28], [ 1266.83,1198.48], [1339.72, 6988.47]]),
                       columns=['Population(M)','GDP($B)'],
                       index=[1990,2000,2010])
ChinaData


Out[49]:
Population(M) GDP($B)
1990 1133.68 390.28
2000 1266.83 1198.48
2010 1339.72 6988.47

In [50]:
US_ChinaData={'US' : USData,
              'China': ChinaData}

pd.Panel(US_ChinaData)


Out[50]:
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 3 (major_axis) x 2 (minor_axis)
Items axis: China to US
Major_axis axis: 1990 to 2010
Minor_axis axis: Population(M) to GDP($B)