by Megat Harun Al Rashid bin Megat Ahmad
last updated: April 14, 2016
$Pandas$ is an open source library for data analysis in Python. It gives Python similar capabilities to R programming language and even though it is possible to run R in Jupyter Notebook, it would be more practical to do data analysis with a Python friendly syntax. Similar to other libraries, the first step to use $Pandas$ is to import the library and usually together with the $Numpy$ library.
In [1]:
import pandas as pd
import numpy as np
Data structures (similar to Sequence in Python) of $Pandas$ revolves around the Series and DataFrame structures. Both are fast as they are built on top of $Numpy$.
A Series is a one-dimensional object with a lot of similar properties similar to a list or dictionary in Python's Sequence. Each element or item in a Series will be assigned by default an index label from 0 to N-1 (where N is the length of the Series) and it can contains the various type of Python's data.
In [2]:
# Creating a series (with different type of data)
s1 = pd.Series([34, 'Material', 4*np.pi, 'Reactor', [100,250,500,750], 'kW'])
s1
Out[2]:
The index of a Series can be specified during its creation and giving it a similar function to a dictionary.
In [3]:
# Creating a series with specified index
lt = [34, 'Material', 4*np.pi, 'Reactor', [100,250,500,750], 'kW']
s2 = pd.Series(lt, index = ['b1', 'r1', 'solid angle', 18, 'reactor power', 'unit'])
s2
Out[3]:
Data can be extracted by specifying the element position or index (similar to list/dictionary).
In [4]:
s1[3], s2['solid angle']
Out[4]:
Series can also be constructed from a dictionary.
In [5]:
pop_cities = {'Kuala Lumpur':1588750, 'Seberang Perai':818197, 'Kajang':795522,
'Klang':744062, 'Subang Jaya':708296}
cities = pd.Series(pop_cities)
cities
Out[5]:
The elements can be sort using the $Series.order()$ function. This will not change the structure of the original variable.
In [6]:
cities.order(ascending=False)
Out[6]:
In [7]:
cities
Out[7]:
Another sorting function is the $sort()$ function but this will change the structure of the Series variable.
In [8]:
# Sorting with descending values
cities.sort(ascending=False)
cities
Out[8]:
In [9]:
cities
Out[9]:
Conditions can be applied to the elements.
In [10]:
# cities with population less than 800,000
cities[cities<800000]
Out[10]:
In [11]:
# cities with population between 750,000 and 800,000
cities[cities<800000][cities>750000]
Out[11]:
A DataFrame is a 2-dimensional data structure with named rows and columns. It is similar to R's data.frame object and function like a spreadsheet. DataFrame can be considered to be made of series of Series data according to the column names. DataFrame can be created by passing a 2-dimensional array of data and specifying the rows and columns names.
In [12]:
# Creating a DataFrame by passing a 2-D numpy array of random number
# Creating first the date-time index using date_range function
# and checking it.
dates = pd.date_range('20140801', periods = 8, freq = 'D')
dates
Out[12]:
In [13]:
# Creating the column names as list
Kedai = ['Kedai A', 'Kedai B', 'Kedai C', 'Kedai D', 'Kedai E']
# Creating the DataFrame with specified rows and columns
df = pd.DataFrame(np.random.randn(8,5),index=dates,columns=Kedai)
df
Out[13]:
Some of the useful functions that can be applied to a DataFrame include:
In [14]:
df.head() # Displaying the first five (default) rows
Out[14]:
In [15]:
df.head(3) # Displaying the first three (specified) rows
Out[15]:
In [16]:
df.tail(2) # Displaying the last two (specified) rows
Out[16]:
In [17]:
df.index # Showing the index of rows
Out[17]:
In [18]:
df.columns # Showing the fields of columns
Out[18]:
In [19]:
df.values # Showing the data only in its original 2-D array
Out[19]:
In [20]:
df.describe() # Simple statistical data for each column
Out[20]:
In [21]:
df.T # Transposing the DataFrame (index becomes column and vice versa)
Out[21]:
In [22]:
df.sort_index(axis=1,ascending=False) # Sorting with descending column
Out[22]:
In [23]:
df.sort(columns='Kedai D') # Sorting according to ascending specific column
Out[23]:
In [24]:
df['Kedai A'] # Extract specific column (using python list syntax)
Out[24]:
In [25]:
df['Kedai A'][2:4] # Slicing specific column (using python list syntax)
Out[25]:
In [26]:
df[2:4] # Slicing specific row data (using python list syntax)
Out[26]:
In [27]:
# Slicing specific index range
df['2014-08-03':'2014-08-05']
Out[27]:
In [28]:
# Slicing specific index range for a particular column
df['2014-08-03':'2014-08-05']['Kedai B']
Out[28]:
In [29]:
# Using the loc() function
# Slicing specific index and column ranges
df.loc['2014-08-03':'2014-08-05','Kedai B':'Kedai D']
Out[29]:
In [30]:
# Slicing specific index range with specific column names
df.loc['2014-08-03':'2014-08-05',['Kedai B','Kedai D']]
Out[30]:
In [31]:
# Possibly not yet to have something like this
df.loc[['2014-08-01','2014-08-03':'2014-08-05'],['Kedai B','Kedai D']]
In [32]:
# Using the iloc() function
df.iloc[3] # Specific row location
Out[32]:
In [33]:
df.iloc[:,3] # Specific column location (all rows)
Out[33]:
In [34]:
df.iloc[2:4,1:3] # Python like slicing for range
Out[34]:
In [35]:
df.iloc[[2,4],[1,3]] # Slicing with python like list
Out[35]:
In [36]:
# Conditionals on the data
df>0 # Array values > 0 OR
Out[36]:
In [37]:
df[df>0] # Directly getting the value
Out[37]:
NaN means empty, missing data or unavailable.
In [38]:
df[df['Kedai B']<0] # With reference to specific value in a column (e.g. Kedai B)
Out[38]:
In [39]:
df2 = df.copy() # Made a copy of a database
In [40]:
df2
Out[40]:
In [41]:
# Adding column
df2['Tambah'] = ['satu','satu','dua','tiga','empat','tiga','lima','enam']
In [42]:
df2
Out[42]:
In [43]:
# Adding row using append() function. The previous loc() is possibly deprecated.
# Assign a new name to the new row (with the same format)
new_row_name = pd.date_range('20140809', periods = 1, freq = 'D')
# Appending new row with new data
df2.append(list(np.random.randn(5))+['sembilan'])
# Renaming the new row (here actually is a reassignment)
df2 = df2.rename(index={10: new_row_name[0]})
df2
Out[43]:
In [44]:
# Assigning new data to a row
df2.loc['2014-08-05'] = list(np.random.randn(5))+['tujuh']
df2
Out[44]:
In [45]:
# Assigning new data to a specific element
df2.loc['2014-08-05','Tambah'] = 'lapan'
df2
Out[45]:
In [46]:
# Using the isin() function (returns boolean data frame)
df2.isin(['satu','tiga'])
Out[46]:
In [47]:
# Select specific row based on additonal column
df2[df2['Tambah'].isin(['satu','tiga'])]
Out[47]:
In [48]:
# Use previous command - select certain column based on selected additional column
df2[df2['Tambah'].isin(['satu','tiga'])].loc[:,'Kedai B':'Kedai D']
Out[48]:
In [49]:
# Select > 0 from previous cell...
(df2[df2['Tambah'].isin(['satu','tiga'])].loc[:,'Kedai B':'Kedai D']>0)
Out[49]:
We have seen few operations previously on Series and DataFrame and here this will be explored further.
In [50]:
df.mean() # Statistical mean (column) - same as df.mean(0), 0 means column
Out[50]:
In [51]:
df.mean(1) # Statistical mean (row) - 1 means row
Out[51]:
In [52]:
df.mean()['Kedai C':'Kedai E'] # Statistical mean (range of columns)
Out[52]:
In [53]:
df.max() # Statistical max (column)
Out[53]:
In [54]:
df.max()['Kedai C'] # Statistical max (specific column)
Out[54]:
In [55]:
df.max(1)['2014-08-04':'2014-08-07'] # Statistical max (specific row)
Out[55]:
In [56]:
df.max(1)[dates[3]] # Statistical max (specific row by variable)
Out[56]:
Other statistical functions can be checked by typing df.< TAB >.
The data in a DataFrame can be represented by a variable declared using the $lambda$ operator.
In [57]:
df.apply(lambda x: x.max() - x.min()) # Operating array values with function
Out[57]:
In [58]:
df.apply(lambda z: np.log(z)) # Operating array values with function
Out[58]:
Replacing, rearranging and operations of data between columns can be done much like spreadsheet.
In [59]:
df3 = df.copy()
In [60]:
df3[r'Kedai A^2/Kedai E'] = df3['Kedai A']**2/df3['Kedai E']
df3
Out[60]:
Tables can be split, rearranged and combined.
In [61]:
df4 = df.copy()
df4
Out[61]:
In [62]:
pieces = [df4[6:], df4[3:6], df4[:3]] # split row 2+3+3
pieces
Out[62]:
In [63]:
df5 = pd.concat(pieces) # concantenate (rearrange/combine)
df5
Out[63]:
In [64]:
df4+df5 # Operation between tables with original index sequence
Out[64]:
In [65]:
df0 = df.loc[:,'Kedai A':'Kedai C'] # Slicing and extracting columns
pd.concat([df4, df0], axis = 1) # Concatenating columns (axis = 1 -> refers to column)
Out[65]:
Let us look on some of the simple plotting function on $Pandas$ (requires $Matplotlib$ library).
In [66]:
df_add = df.copy()
In [67]:
# Simple auto plotting
%matplotlib inline
df_add.cumsum().plot()
Out[67]:
In [68]:
# Reposition the legend
import matplotlib.pyplot as plt
df_add.cumsum().plot()
plt.legend(bbox_to_anchor=[1.3, 1])
Out[68]:
In the above example, repositioning the legend requires the legend function in $Matplotlib$ library. Therefore, the $Matplotlib$ library must be explicitly imported.
In [69]:
df_add.cumsum().plot(kind='bar')
plt.legend(bbox_to_anchor=[1.3, 1])
Out[69]:
In [70]:
df_add.cumsum().plot(kind='barh', stacked=True)
Out[70]:
In [71]:
df_add.cumsum().plot(kind='hist', alpha=0.5)
Out[71]:
In [72]:
df_add.cumsum().plot(kind='area', alpha=0.4, stacked=False)
plt.legend(bbox_to_anchor=[1.3, 1])
Out[72]:
A 3-dimensional plot can be projected on a canvas but requires the $Axes3D$ library with slightly complicated settings.
In [73]:
# Plotting a 3D bar plot
from mpl_toolkits.mplot3d import Axes3D
import numpy as np
# Convert the time format into ordinary strings
time_series = pd.Series(df.index.format())
fig = plt.figure(figsize=(8,6))
ax = fig.add_subplot(111, projection='3d')
# Plotting the bar graph column by column
for c, z in zip(['r', 'g', 'b', 'y','m'], np.arange(len(df.columns))):
xs = df.index
ys = df.values[:,z]
ax.bar(xs, ys, zs=z, zdir='y', color=c, alpha=0.5)
ax.set_zlabel('Z')
ax.set_xticklabels(time_series, va = 'baseline', ha = 'right', rotation = 15)
ax.set_yticks(np.arange(len(df.columns)))
ax.set_yticklabels(df.columns, va = 'center', ha = 'left', rotation = -42)
ax.view_init(30, -30)
fig.tight_layout()
Data in DataFrame can be exported into csv (comma separated value) and Excel file. The users can also create a DataFrame from data in csv and Excel file, the data can then be processed.
In [74]:
# Export data to a csv file but separated with < TAB > rather than comma
# the default separation is with comma
df.to_csv('Tutorial8/Kedai.txt', sep='\t')
In [75]:
# Export to Excel file
df.to_excel('Tutorial8/Kedai.xlsx', sheet_name = 'Tarikh', index = True)
In [76]:
# Importing data from csv file (without header)
from_file = pd.read_csv('Tutorial8/Malaysian_Town.txt',sep='\t',header=None)
from_file.head()
Out[76]:
In [77]:
# Importing data from Excel file (with header (the first row) that became the column names)
from_excel = pd.read_excel('Tutorial8/Malaysian_Town.xlsx','Sheet1')
from_excel.head()
Out[77]:
Further $Pandas$ features can be found in http://pandas.pydata.org/.