In this post I will summarize the data structures of Pandas library. Pandas is a library written for data manipulation and analysis. It is built on top of the Numpy library and it provides features not available in Numpy, so to be able to follow this tutorial you should have some basics on Python and NumPy library.

I am going to follow the official data structure documentation here

Data Structures in Pandas

There are 3 main data structures:

1. Series

The series are the 1D NumPy array under the hood. It consists of a NumPy array coupled with an array of labels.

Create Series:

import pandas as pd
ser = pd.Series(data, index=idx)

data can be ndarray, a Python dictionary, or a scalar value



In [6]:

    
# Using Scalar Values
import pandas as pd
ser = pd.Series([20, 21, 12], index=['London', 'New York','Helsinki'])
print(ser)









    



London      20
New York    21
Helsinki    12
dtype: int64



In [7]:

    
# Using Numpy ndarray
import numpy as np
np.random.seed(100)
ser=pd.Series(np.random.rand(7))
ser









    Out[7]:





0    0.543405
1    0.278369
2    0.424518
3    0.844776
4    0.004719
5    0.121569
6    0.670749
dtype: float64

Using Python Dictionary:

If we want to use Python Dictionary, the usage is slighly different. Since dictionary data structure has the



In [10]:

    
currDict = {'US' : 'dollar', 'UK' : 'pound',
            'Germany': 'euro', 'Mexico':'peso',
            'Nigeria':'naira',
            'China':'yuan', 'Japan':'yen'}
currSeries=pd.Series(currDict)
currSeries









    Out[10]:





China        yuan
Germany      euro
Japan         yen
Mexico       peso
Nigeria     naira
UK          pound
US         dollar
dtype: object

The index of a pandas Series structure is of type pandas.core.index.Index and can be viewed as an ordered multiset.

In the following case, we specify an index, but the index contains one entry that isn't a key in the corresponding dict. The result is that the value for the key is assigned as NaN, indicating that it is missing. We will deal with handling missing values in a later section.



In [9]:

    
stockPrices = {'GOOG':1180.97,'FB':62.57,
                'TWTR': 64.50, 'AMZN':358.69,
                'AAPL':500.6}
stockPriceSeries=pd.Series(stockPrices,
                            index=['GOOG','FB','YHOO',
                                    'TWTR','AMZN','AAPL'],
                            name='stockPrices')
stockPriceSeries









    Out[9]:





GOOG    1180.97
FB        62.57
YHOO        NaN
TWTR      64.50
AMZN     358.69
AAPL     500.60
Name: stockPrices, dtype: float64

The behavior of Series is very similar to that of numpy arrays discussed in a previous section, with one caveat being that an operation such as slicing also slices the index.

Values can be set and accessed using the index label in a dictionary-like manner:



In [11]:

    
stockPriceSeries









    Out[11]:





GOOG    1180.97
FB        62.57
YHOO        NaN
TWTR      64.50
AMZN     358.69
AAPL     500.60
Name: stockPrices, dtype: float64



In [13]:

    
stockPriceSeries['GOOG']=1200.0
stockPriceSeries









    Out[13]:





GOOG    1200.00
FB        62.57
YHOO        NaN
TWTR      64.50
AMZN     358.69
AAPL     500.60
Name: stockPrices, dtype: float64

Just as in the case of dict, KeyError is raised if you try to retrieve a missing label,

We can avoid this error to be avoid by explicitly using get as follows:



In [15]:

    
stockPriceSeries.get('MSFT',np.NaN)









    Out[15]:





nan

In this case, the default value of np.NaN is specified as the value to return when the key does not exist in the Series structure.

The slice operation behaves the same way as a NumPy array:



In [16]:

    
stockPriceSeries[:4]









    Out[16]:





GOOG    1200.00
FB        62.57
YHOO        NaN
TWTR      64.50
Name: stockPrices, dtype: float64

Logical slicing also works as follows:



In [17]:

    
stockPriceSeries[stockPriceSeries > 100]









    Out[17]:





GOOG    1200.00
AMZN     358.69
AAPL     500.60
Name: stockPrices, dtype: float64

Arithmetic and statistical operations can be applied, just as with a NumPy array:



In [18]:

    
np.mean(stockPriceSeries)









    Out[18]:





437.27200000000005



In [19]:

    
np.std(stockPriceSeries)









    Out[19]:





417.4446361087899



In [22]:

    
ser









    Out[22]:





0    0.543405
1    0.278369
2    0.424518
3    0.844776
4    0.004719
5    0.121569
6    0.670749
dtype: float64



In [21]:

    
ser*ser









    Out[21]:





0    0.295289
1    0.077490
2    0.180215
3    0.713647
4    0.000022
5    0.014779
6    0.449904
dtype: float64



In [20]:

    
np.sqrt(ser)









    Out[20]:





0    0.737160
1    0.527607
2    0.651550
3    0.919117
4    0.068694
5    0.348668
6    0.818993
dtype: float64



In [23]:

    
ser[1:]









    Out[23]:





1    0.278369
2    0.424518
3    0.844776
4    0.004719
5    0.121569
6    0.670749
dtype: float64

2. Data Frame

DataFrame is an 2-dimensional labeled array. Its column types can be heterogeneous: that is, of varying types. It is similar to structured arrays in NumPy with mutability added. It has the following properties:

Conceptually analogous to a table or spreadsheet of data.
Similar to a NumPy ndarray but not a subclass of np.ndarray.
Columns can be of heterogeneous types: float64, int, bool, and so on.
A DataFrame column is a Series structure.
It can be thought of as a dictionary of Series structures where both the columns and the rows are indexed, denoted as 'index' in the case of rows and 'columns' in the case of columns.
It is size mutable: columns can be inserted and deleted.

DataFrame is the most commonly used data structure in pandas. The constructor accepts many different types of arguments:

Dictionary of 1D ndarrays, lists, dictionaries, or Series structures
2D NumPy array
Structured or record ndarray
Series structures
Another DataFrame structure

Using Dictionaries of Series:



In [24]:

    
# Dictionary created with pandas.Series
stockSummaries={
        'AMZN': pd.Series([346.15,0.59,459,0.52,589.8,158.88],
                            index=['Closing price','EPS','Shares Outstanding(M)',
                            'Beta', 'P/E','Market Cap(B)']),
        'GOOG': pd.Series([1133.43,36.05,335.83,0.87,31.44,380.64],
                            index=['Closing price','EPS','Shares Outstanding(M)',
                            'Beta','P/E','Market Cap(B)']),
        'FB': pd.Series([61.48,0.59,2450,104.93,150.92],
                            index=['Closing price','EPS','Shares Outstanding(M)',
                            'P/E', 'Market Cap(B)']),
        'YHOO': pd.Series([34.90,1.27,1010,27.48,0.66,35.36],
                            index=['Closing price','EPS','Shares Outstanding(M)',
                            'P/E','Beta', 'Market Cap(B)']),
        'TWTR':pd.Series([65.25,-0.3,555.2,36.23],
                            index=['Closing price','EPS','Shares Outstanding(M)',
                            'Market Cap(B)']),
        'AAPL':pd.Series([501.53,40.32,892.45,12.44,447.59,0.84],
                            index=['Closing price','EPS','Shares Outstanding(M)','P/E',
                            'Market Cap(B)','Beta'])}

stockDF=pd.DataFrame(stockSummaries)
stockDF









    Out[24]:






  
    
      
      AAPL
      AMZN
      FB
      GOOG
      TWTR
      YHOO
    
  
  
    
      Beta
      0.84
      0.52
      NaN
      0.87
      NaN
      0.66
    
    
      Closing price
      501.53
      346.15
      61.48
      1133.43
      65.25
      34.90
    
    
      EPS
      40.32
      0.59
      0.59
      36.05
      -0.30
      1.27
    
    
      Market Cap(B)
      447.59
      158.88
      150.92
      380.64
      36.23
      35.36
    
    
      P/E
      12.44
      589.80
      104.93
      31.44
      NaN
      27.48
    
    
      Shares Outstanding(M)
      892.45
      459.00
      2450.00
      335.83
      555.20
      1010.00



In [25]:

    
# We can also change the order...
stockDF=pd.DataFrame(stockSummaries,
                    index=['Closing price','EPS',
                    'Shares Outstanding(M)',
                    'P/E', 'Market Cap(B)','Beta'])
stockDF









    Out[25]:






  
    
      
      AAPL
      AMZN
      FB
      GOOG
      TWTR
      YHOO
    
  
  
    
      Closing price
      501.53
      346.15
      61.48
      1133.43
      65.25
      34.90
    
    
      EPS
      40.32
      0.59
      0.59
      36.05
      -0.30
      1.27
    
    
      Shares Outstanding(M)
      892.45
      459.00
      2450.00
      335.83
      555.20
      1010.00
    
    
      P/E
      12.44
      589.80
      104.93
      31.44
      NaN
      27.48
    
    
      Market Cap(B)
      447.59
      158.88
      150.92
      380.64
      36.23
      35.36
    
    
      Beta
      0.84
      0.52
      NaN
      0.87
      NaN
      0.66

The row index labels and column labels can be accessed via the index and column attributes:



In [26]:

    
stockDF.index









    Out[26]:





Index([u'Closing price', u'EPS', u'Shares Outstanding(M)', u'P/E',
       u'Market Cap(B)', u'Beta'],
      dtype='object')



In [27]:

    
stockDF.columns









    Out[27]:





Index([u'AAPL', u'AMZN', u'FB', u'GOOG', u'TWTR', u'YHOO'], dtype='object')

Using ndarrays/lists:

When we create a dataframe from lists, the keys become the column labels in the DataFrame structure and the data in the list becomes the column values. Note how the row label indexes are generated using np.range(n).



In [28]:

    
algos={'search':['DFS','BFS','Binary Search','Linear','ShortestPath (Djikstra)'],
       'sorting': ['Quicksort','Mergesort', 'Heapsort','Bubble Sort', 'Insertion Sort'],
       'machine learning':['RandomForest', 'K Nearest Neighbor', 'Logistic Regression', 'K-Means Clustering', 'Linear Regression']}
algoDF=pd.DataFrame(algos)
algoDF









    Out[28]:






  
    
      
      machine learning
      search
      sorting
    
  
  
    
      0
      RandomForest
      DFS
      Quicksort
    
    
      1
      K Nearest Neighbor
      BFS
      Mergesort
    
    
      2
      Logistic Regression
      Binary Search
      Heapsort
    
    
      3
      K-Means Clustering
      Linear
      Bubble Sort
    
    
      4
      Linear Regression
      ShortestPath (Djikstra)
      Insertion Sort



In [29]:

    
# Or we can index by specifying the index when creating data frame
pd.DataFrame(algos,index=['algo_1','algo_2','algo_3','algo_4','algo_5'])









    Out[29]:






  
    
      
      machine learning
      search
      sorting
    
  
  
    
      algo_1
      RandomForest
      DFS
      Quicksort
    
    
      algo_2
      K Nearest Neighbor
      BFS
      Mergesort
    
    
      algo_3
      Logistic Regression
      Binary Search
      Heapsort
    
    
      algo_4
      K-Means Clustering
      Linear
      Bubble Sort
    
    
      algo_5
      Linear Regression
      ShortestPath (Djikstra)
      Insertion Sort

Using Structured Array:

Structured arrayis an array of records or structs, for more info check structured arrays documentation:



In [30]:

    
memberData = np.zeros((4,), dtype=[('Name','a15'), ('Age','i4'), ('Weight','f4')])
memberData[:] = [('Sanjeev',37,162.4), ('Yingluck',45,137.8), ('Emeka',28,153.2), ('Amy',67,101.3)]
memberDF=pd.DataFrame(memberData)
memberDF



In [31]:

    
pd.DataFrame(memberData, index=['a','b','c','d'])

Using a Series Structure:



In [32]:

    
currSeries.name='currency'
pd.DataFrame(currSeries)









    Out[32]:






  
    
      
      currency
    
  
  
    
      China
      yuan
    
    
      Germany
      euro
    
    
      Japan
      yen
    
    
      Mexico
      peso
    
    
      Nigeria
      naira
    
    
      UK
      pound
    
    
      US
      dollar

There are also alternative constructors for DataFrame; they can be summarized as follows:

DataFrame.from_dict: It takes a dictionary of dictionaries or sequences and returns DataFrame.
DataFrame.from_records: It takes a list of tuples or structured ndarray.
DataFrame.from_items: It takes a sequence of (key, value) pairs. The keys are the column or index names, and the values are the column or row values. If you wish the keys to be row index names, you must specify orient='index' as a parameter and specify the column names.
pandas.io.parsers.read_csv: This is a helper function that reads a CSV file into a pandas DataFrame structure.
pandas.io.parsers.read_table: This is a helper function that reads a delimited file into a pandas DataFrame structure.
pandas.io.parsers.read_fwf: This is a helper function that reads a table of fixed-width lines into a pandas DataFrame structure.

I will point out some of the operations most commonly used with data frames.



In [34]:

    
# Specific column can be reached as Series
memberDF['Name']









    Out[34]:





0     Sanjeev
1    Yingluck
2       Emeka
3         Amy
Name: Name, dtype: object



In [39]:

    
# A new column can be added via assignment
memberDF['Height']=[60, 70,80, 90]
memberDF



In [40]:

    
# column can be deleted using del
del memberDF['Height']
memberDF

Basically, a DataFrame structure can be treated as if it were a dictionary of Series objects. Columns get inserted at the end; to insert a column at a specific location, you can use the insert function:



In [41]:

    
memberDF.insert(2,'isSenior',memberDF['Age']>60)
memberDF









    Out[41]:






  
    
      
      Name
      Age
      isSenior
      Weight
    
  
  
    
      0
      Sanjeev
      37
      False
      162.399994
    
    
      1
      Yingluck
      45
      False
      137.800003
    
    
      2
      Emeka
      28
      False
      153.199997
    
    
      3
      Amy
      67
      True
      101.300003

DataFrame objects align in a manner similar to Series objects, except that they align on both column and index labels. The resulting object is the union of the column and row labels:



In [42]:

    
ore1DF=pd.DataFrame(np.array([[20,35,25,20],[11,28,32,29]]),columns=['iron','magnesium','copper','silver'])
ore2DF=pd.DataFrame(np.array([[14,34,26,26],[33,19,25,23]]),columns=['iron','magnesium','gold','silver'])
ore1DF+ore2DF









    Out[42]:






  
    
      
      copper
      gold
      iron
      magnesium
      silver
    
  
  
    
      0
      NaN
      NaN
      34
      69
      46
    
    
      1
      NaN
      NaN
      44
      47
      52



In [43]:

    
ore1DF + pd.Series([25,25,25,25], index=['iron','magnesium', 'copper','silver'])

Mathematical operators can be applied element wise on DataFrame structures:



In [44]:

    
np.sqrt(ore1DF)

3. Panel

Panel is a 3D array. It is not as widely used as Series or DataFrame. It is not as easily displayed on screen or visualized as the other two because of its 3D nature.

The three axis names are as follows:

items: This is axis 0. Each each item corresponds to a DataFrame structure.
major_axis: This is axis 1. Each item corresponds to the rows of the DataFrame structure.
minor_axis: This is axis 2. Each item corresponds to the columns of each DataFrame structure.

Using 3D Numpy array with axis labels:



In [45]:

    
stockData=np.array([[[63.03,61.48,75],
                     [62.05,62.75,46],
                     [62.74,62.19,53]],
                    [[411.90, 404.38, 2.9],
                     [405.45, 405.91, 2.6],
                     [403.15, 404.42, 2.4]]])

stockData









    Out[45]:





array([[[  63.03,   61.48,   75.  ],
        [  62.05,   62.75,   46.  ],
        [  62.74,   62.19,   53.  ]],

       [[ 411.9 ,  404.38,    2.9 ],
        [ 405.45,  405.91,    2.6 ],
        [ 403.15,  404.42,    2.4 ]]])



In [46]:

    
stockHistoricalPrices = pd.Panel(stockData, items=['FB', 'NFLX'], 
                                 major_axis=pd.date_range('2/3/2014', periods=3), 
                                 minor_axis=['open price', 'closing price', 'volume'])
stockHistoricalPrices









    Out[46]:





<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 3 (major_axis) x 3 (minor_axis)
Items axis: FB to NFLX
Major_axis axis: 2014-02-03 00:00:00 to 2014-02-05 00:00:00
Minor_axis axis: open price to volume

As I mentioned earlier visualizing the 3D data is not really easy so we see a small information instead.

Using Python Dictionary of Data Frame Objects:



In [48]:

    
USData=pd.DataFrame(np.array([[249.62 , 8900], [ 282.16,12680], [309.35,14940]]),
                    columns=['Population(M)','GDP($B)'],
                    index=[1990,2000,2010])
USData









    Out[48]:






  
    
      
      Population(M)
      GDP($B)
    
  
  
    
      1990
      249.62
      8900.0
    
    
      2000
      282.16
      12680.0
    
    
      2010
      309.35
      14940.0



In [49]:

    
ChinaData=pd.DataFrame(np.array([[1133.68, 390.28], [ 1266.83,1198.48], [1339.72, 6988.47]]),
                       columns=['Population(M)','GDP($B)'],
                       index=[1990,2000,2010])
ChinaData









    Out[49]:






  
    
      
      Population(M)
      GDP($B)
    
  
  
    
      1990
      1133.68
      390.28
    
    
      2000
      1266.83
      1198.48
    
    
      2010
      1339.72
      6988.47



In [50]:

    
US_ChinaData={'US' : USData,
              'China': ChinaData}

pd.Panel(US_ChinaData)









    Out[50]:





<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 3 (major_axis) x 2 (minor_axis)
Items axis: China to US
Major_axis axis: 1990 to 2010
Minor_axis axis: Population(M) to GDP($B)

	iron	magnesium	copper	silver
0	4.472136	5.916080	5.000000	4.472136
1	3.316625	5.291503	5.656854	5.385165

	AAPL	AMZN	FB	GOOG	TWTR	YHOO
Beta	0.84	0.52	NaN	0.87	NaN	0.66
Closing price	501.53	346.15	61.48	1133.43	65.25	34.90
EPS	40.32	0.59	0.59	36.05	-0.30	1.27
Market Cap(B)	447.59	158.88	150.92	380.64	36.23	35.36
P/E	12.44	589.80	104.93	31.44	NaN	27.48
Shares Outstanding(M)	892.45	459.00	2450.00	335.83	555.20	1010.00

	machine learning	search	sorting
0	RandomForest	DFS	Quicksort
1	K Nearest Neighbor	BFS	Mergesort
2	Logistic Regression	Binary Search	Heapsort
3	K-Means Clustering	Linear	Bubble Sort
4	Linear Regression	ShortestPath (Djikstra)	Insertion Sort

	machine learning	search	sorting
algo_1	RandomForest	DFS	Quicksort
algo_2	K Nearest Neighbor	BFS	Mergesort
algo_3	Logistic Regression	Binary Search	Heapsort
algo_4	K-Means Clustering	Linear	Bubble Sort
algo_5	Linear Regression	ShortestPath (Djikstra)	Insertion Sort

	Name	Age	Weight
0	Sanjeev	37	162.399994
1	Yingluck	45	137.800003
2	Emeka	28	153.199997
3	Amy	67	101.300003

	currency
China	yuan
Germany	euro
Japan	yen
Mexico	peso
Nigeria	naira
UK	pound
US	dollar

	Name	Age	isSenior	Weight
0	Sanjeev	37	False	162.399994
1	Yingluck	45	False	137.800003
2	Emeka	28	False	153.199997
3	Amy	67	True	101.300003