In this post I will summarize the data structures of Pandas
library. Pandas
is a library written for data manipulation and analysis. It is built on top of the Numpy
library and it provides features not available in Numpy
, so to be able to follow this tutorial you should have some basics on Python and NumPy library.
I am going to follow the official data structure documentation here
There are 3 main data structures:
The series are the 1D NumPy
array under the hood. It consists of a NumPy
array coupled with an array of labels.
Create Series:
import pandas as pd
ser = pd.Series(data, index=idx)
data
can be ndarray
, a Python dictionary, or a scalar value
In [6]:
# Using Scalar Values
import pandas as pd
ser = pd.Series([20, 21, 12], index=['London', 'New York','Helsinki'])
print(ser)
In [7]:
# Using Numpy ndarray
import numpy as np
np.random.seed(100)
ser=pd.Series(np.random.rand(7))
ser
Out[7]:
Using Python Dictionary:
If we want to use Python Dictionary, the usage is slighly different. Since dictionary data structure has the
In [10]:
currDict = {'US' : 'dollar', 'UK' : 'pound',
'Germany': 'euro', 'Mexico':'peso',
'Nigeria':'naira',
'China':'yuan', 'Japan':'yen'}
currSeries=pd.Series(currDict)
currSeries
Out[10]:
The index of a pandas Series structure is of type pandas.core.index.Index
and can be viewed as an ordered multiset.
In the following case, we specify an index, but the index contains one entry that isn't a key in the corresponding dict
. The result is that the value for the key is assigned as NaN
, indicating that it is missing. We will deal with handling missing values in a later section.
In [9]:
stockPrices = {'GOOG':1180.97,'FB':62.57,
'TWTR': 64.50, 'AMZN':358.69,
'AAPL':500.6}
stockPriceSeries=pd.Series(stockPrices,
index=['GOOG','FB','YHOO',
'TWTR','AMZN','AAPL'],
name='stockPrices')
stockPriceSeries
Out[9]:
The behavior of Series is very similar to that of numpy
arrays discussed in a previous section, with one caveat being that an operation such as slicing also slices the index.
In [11]:
stockPriceSeries
Out[11]:
In [13]:
stockPriceSeries['GOOG']=1200.0
stockPriceSeries
Out[13]:
Just as in the case of dict
, KeyError
is raised if you try to retrieve a missing label,
We can avoid this error to be avoid by explicitly using get as follows:
In [15]:
stockPriceSeries.get('MSFT',np.NaN)
Out[15]:
In this case, the default value of np.NaN
is specified as the value to return when the key does not exist in the Series structure.
The slice operation behaves the same way as a NumPy
array:
In [16]:
stockPriceSeries[:4]
Out[16]:
Logical slicing also works as follows:
In [17]:
stockPriceSeries[stockPriceSeries > 100]
Out[17]:
Arithmetic and statistical operations can be applied, just as with a NumPy array:
In [18]:
np.mean(stockPriceSeries)
Out[18]:
In [19]:
np.std(stockPriceSeries)
Out[19]:
In [22]:
ser
Out[22]:
In [21]:
ser*ser
Out[21]:
In [20]:
np.sqrt(ser)
Out[20]:
In [23]:
ser[1:]
Out[23]:
DataFrame is an 2-dimensional labeled array. Its column types can be heterogeneous: that is, of varying types. It is similar to structured arrays in NumPy with mutability added. It has the following properties:
ndarray
but not a subclass of np.ndarray
.float64, int, bool,
and so on.DataFrame is the most commonly used data structure in pandas. The constructor accepts many different types of arguments:
1D ndarrays
, lists, dictionaries, or Series structures2D NumPy array
Using Dictionaries of Series:
In [24]:
# Dictionary created with pandas.Series
stockSummaries={
'AMZN': pd.Series([346.15,0.59,459,0.52,589.8,158.88],
index=['Closing price','EPS','Shares Outstanding(M)',
'Beta', 'P/E','Market Cap(B)']),
'GOOG': pd.Series([1133.43,36.05,335.83,0.87,31.44,380.64],
index=['Closing price','EPS','Shares Outstanding(M)',
'Beta','P/E','Market Cap(B)']),
'FB': pd.Series([61.48,0.59,2450,104.93,150.92],
index=['Closing price','EPS','Shares Outstanding(M)',
'P/E', 'Market Cap(B)']),
'YHOO': pd.Series([34.90,1.27,1010,27.48,0.66,35.36],
index=['Closing price','EPS','Shares Outstanding(M)',
'P/E','Beta', 'Market Cap(B)']),
'TWTR':pd.Series([65.25,-0.3,555.2,36.23],
index=['Closing price','EPS','Shares Outstanding(M)',
'Market Cap(B)']),
'AAPL':pd.Series([501.53,40.32,892.45,12.44,447.59,0.84],
index=['Closing price','EPS','Shares Outstanding(M)','P/E',
'Market Cap(B)','Beta'])}
stockDF=pd.DataFrame(stockSummaries)
stockDF
Out[24]:
In [25]:
# We can also change the order...
stockDF=pd.DataFrame(stockSummaries,
index=['Closing price','EPS',
'Shares Outstanding(M)',
'P/E', 'Market Cap(B)','Beta'])
stockDF
Out[25]:
The row index labels and column labels can be accessed via the index and column attributes:
In [26]:
stockDF.index
Out[26]:
In [27]:
stockDF.columns
Out[27]:
Using ndarrays/lists:
When we create a dataframe from lists, the keys become the column labels in the DataFrame structure and the data in the list becomes the column values. Note how the row label indexes are generated using np.range(n)
.
In [28]:
algos={'search':['DFS','BFS','Binary Search','Linear','ShortestPath (Djikstra)'],
'sorting': ['Quicksort','Mergesort', 'Heapsort','Bubble Sort', 'Insertion Sort'],
'machine learning':['RandomForest', 'K Nearest Neighbor', 'Logistic Regression', 'K-Means Clustering', 'Linear Regression']}
algoDF=pd.DataFrame(algos)
algoDF
Out[28]:
In [29]:
# Or we can index by specifying the index when creating data frame
pd.DataFrame(algos,index=['algo_1','algo_2','algo_3','algo_4','algo_5'])
Out[29]:
Using Structured Array:
Structured arrayis an array of records or structs, for more info check structured arrays documentation:
In [30]:
memberData = np.zeros((4,), dtype=[('Name','a15'), ('Age','i4'), ('Weight','f4')])
memberData[:] = [('Sanjeev',37,162.4), ('Yingluck',45,137.8), ('Emeka',28,153.2), ('Amy',67,101.3)]
memberDF=pd.DataFrame(memberData)
memberDF
Out[30]:
In [31]:
pd.DataFrame(memberData, index=['a','b','c','d'])
Out[31]:
Using a Series Structure:
In [32]:
currSeries.name='currency'
pd.DataFrame(currSeries)
Out[32]:
There are also alternative constructors for DataFrame
; they can be summarized as follows:
DataFrame.from_dict
: It takes a dictionary of dictionaries or sequences and returns DataFrame.DataFrame.from_records
: It takes a list of tuples or structured ndarray.DataFrame.from_items
: It takes a sequence of (key, value) pairs. The keys are the column or index names, and the values are the column or row values. If you wish the keys to be row index names, you must specify orient='index' as a parameter and specify the column names.pandas.io.parsers.read_csv
: This is a helper function that reads a CSV file into a pandas DataFrame structure.pandas.io.parsers.read_table
: This is a helper function that reads a delimited file into a pandas DataFrame structure.pandas.io.parsers.read_fwf
: This is a helper function that reads a table of fixed-width lines into a pandas DataFrame structure.I will point out some of the operations most commonly used with data frames.
In [34]:
# Specific column can be reached as Series
memberDF['Name']
Out[34]:
In [39]:
# A new column can be added via assignment
memberDF['Height']=[60, 70,80, 90]
memberDF
Out[39]:
In [40]:
# column can be deleted using del
del memberDF['Height']
memberDF
Out[40]:
Basically, a DataFrame structure can be treated as if it were a dictionary of Series objects. Columns get inserted at the end; to insert a column at a specific location, you can use the insert function:
In [41]:
memberDF.insert(2,'isSenior',memberDF['Age']>60)
memberDF
Out[41]:
DataFrame objects align in a manner similar to Series objects, except that they align on both column and index labels. The resulting object is the union of the column and row labels:
In [42]:
ore1DF=pd.DataFrame(np.array([[20,35,25,20],[11,28,32,29]]),columns=['iron','magnesium','copper','silver'])
ore2DF=pd.DataFrame(np.array([[14,34,26,26],[33,19,25,23]]),columns=['iron','magnesium','gold','silver'])
ore1DF+ore2DF
Out[42]:
In [43]:
ore1DF + pd.Series([25,25,25,25], index=['iron','magnesium', 'copper','silver'])
Out[43]:
Mathematical operators can be applied element wise on DataFrame structures:
In [44]:
np.sqrt(ore1DF)
Out[44]:
Panel is a 3D array. It is not as widely used as Series or DataFrame. It is not as easily displayed on screen or visualized as the other two because of its 3D nature.
The three axis names are as follows:
Using 3D Numpy array with axis labels:
In [45]:
stockData=np.array([[[63.03,61.48,75],
[62.05,62.75,46],
[62.74,62.19,53]],
[[411.90, 404.38, 2.9],
[405.45, 405.91, 2.6],
[403.15, 404.42, 2.4]]])
stockData
Out[45]:
In [46]:
stockHistoricalPrices = pd.Panel(stockData, items=['FB', 'NFLX'],
major_axis=pd.date_range('2/3/2014', periods=3),
minor_axis=['open price', 'closing price', 'volume'])
stockHistoricalPrices
Out[46]:
As I mentioned earlier visualizing the 3D data is not really easy so we see a small information instead.
Using Python Dictionary of Data Frame Objects:
In [48]:
USData=pd.DataFrame(np.array([[249.62 , 8900], [ 282.16,12680], [309.35,14940]]),
columns=['Population(M)','GDP($B)'],
index=[1990,2000,2010])
USData
Out[48]:
In [49]:
ChinaData=pd.DataFrame(np.array([[1133.68, 390.28], [ 1266.83,1198.48], [1339.72, 6988.47]]),
columns=['Population(M)','GDP($B)'],
index=[1990,2000,2010])
ChinaData
Out[49]:
In [50]:
US_ChinaData={'US' : USData,
'China': ChinaData}
pd.Panel(US_ChinaData)
Out[50]: