BASIC PYTHON FOR RESEARCHERS

by Megat Harun Al Rashid bin Megat Ahmad
last updated: April 14, 2016


8. Database and Data Analysis


$Pandas$ is an open source library for data analysis in Python. It gives Python similar capabilities to R programming language and even though it is possible to run R in Jupyter Notebook, it would be more practical to do data analysis with a Python friendly syntax. Similar to other libraries, the first step to use $Pandas$ is to import the library and usually together with the $Numpy$ library.


In [1]:
import pandas as pd
import numpy as np

8.1 Data Structures

Data structures (similar to Sequence in Python) of $Pandas$ revolves around the Series and DataFrame structures. Both are fast as they are built on top of $Numpy$.

A Series is a one-dimensional object with a lot of similar properties similar to a list or dictionary in Python's Sequence. Each element or item in a Series will be assigned by default an index label from 0 to N-1 (where N is the length of the Series) and it can contains the various type of Python's data.


In [2]:
# Creating a series (with different type of data)

s1 = pd.Series([34, 'Material', 4*np.pi, 'Reactor', [100,250,500,750], 'kW'])
s1


Out[2]:
0                      34
1                Material
2                 12.5664
3                 Reactor
4    [100, 250, 500, 750]
5                      kW
dtype: object

The index of a Series can be specified during its creation and giving it a similar function to a dictionary.


In [3]:
# Creating a series with specified index

lt = [34, 'Material', 4*np.pi, 'Reactor', [100,250,500,750], 'kW']

s2 = pd.Series(lt, index = ['b1', 'r1', 'solid angle', 18, 'reactor power', 'unit'])
s2


Out[3]:
b1                                 34
r1                           Material
solid angle                   12.5664
18                            Reactor
reactor power    [100, 250, 500, 750]
unit                               kW
dtype: object

Data can be extracted by specifying the element position or index (similar to list/dictionary).


In [4]:
s1[3], s2['solid angle']


Out[4]:
('Reactor', 12.566370614359172)

Series can also be constructed from a dictionary.


In [5]:
pop_cities = {'Kuala Lumpur':1588750, 'Seberang Perai':818197, 'Kajang':795522,
              'Klang':744062, 'Subang Jaya':708296}
cities = pd.Series(pop_cities)
cities


Out[5]:
Kajang             795522
Klang              744062
Kuala Lumpur      1588750
Seberang Perai     818197
Subang Jaya        708296
dtype: int64

The elements can be sort using the $Series.order()$ function. This will not change the structure of the original variable.


In [6]:
cities.order(ascending=False)


/home/megatharun/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:1: FutureWarning: order is deprecated, use sort_values(...)
  if __name__ == '__main__':
Out[6]:
Kuala Lumpur      1588750
Seberang Perai     818197
Kajang             795522
Klang              744062
Subang Jaya        708296
dtype: int64

In [7]:
cities


Out[7]:
Kajang             795522
Klang              744062
Kuala Lumpur      1588750
Seberang Perai     818197
Subang Jaya        708296
dtype: int64

Another sorting function is the $sort()$ function but this will change the structure of the Series variable.


In [8]:
# Sorting with descending values

cities.sort(ascending=False)
cities


/home/megatharun/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:3: FutureWarning: sort is deprecated, use sort_values(inplace=True) for for INPLACE sorting
  app.launch_new_instance()
Out[8]:
Kuala Lumpur      1588750
Seberang Perai     818197
Kajang             795522
Klang              744062
Subang Jaya        708296
dtype: int64

In [9]:
cities


Out[9]:
Kuala Lumpur      1588750
Seberang Perai     818197
Kajang             795522
Klang              744062
Subang Jaya        708296
dtype: int64

Conditions can be applied to the elements.


In [10]:
# cities with population less than 800,000
cities[cities<800000]


Out[10]:
Kajang         795522
Klang          744062
Subang Jaya    708296
dtype: int64

In [11]:
# cities with population between 750,000 and 800,000
cities[cities<800000][cities>750000]


Out[11]:
Kajang    795522
dtype: int64

A DataFrame is a 2-dimensional data structure with named rows and columns. It is similar to R's data.frame object and function like a spreadsheet. DataFrame can be considered to be made of series of Series data according to the column names. DataFrame can be created by passing a 2-dimensional array of data and specifying the rows and columns names.


In [12]:
# Creating a DataFrame by passing a 2-D numpy array of random number 

# Creating first the date-time index using date_range function
# and checking it.
dates = pd.date_range('20140801', periods = 8, freq = 'D')
dates


Out[12]:
DatetimeIndex(['2014-08-01', '2014-08-02', '2014-08-03', '2014-08-04',
               '2014-08-05', '2014-08-06', '2014-08-07', '2014-08-08'],
              dtype='datetime64[ns]', freq='D')

In [13]:
# Creating the column names as list
Kedai = ['Kedai A', 'Kedai B', 'Kedai C', 'Kedai D', 'Kedai E']

# Creating the DataFrame with specified rows and columns
df = pd.DataFrame(np.random.randn(8,5),index=dates,columns=Kedai)
df


Out[13]:
Kedai A Kedai B Kedai C Kedai D Kedai E
2014-08-01 0.908207 -1.915919 0.097727 0.505241 0.288635
2014-08-02 0.254458 0.943705 0.395889 -0.486971 0.075281
2014-08-03 0.115361 -0.958970 -0.825462 -0.130196 0.236358
2014-08-04 0.634680 -0.117784 -0.291003 2.075217 -0.063509
2014-08-05 -0.189921 -0.902058 -0.677468 0.155232 -1.285344
2014-08-06 1.309475 0.753785 0.851052 -0.564646 0.361620
2014-08-07 0.644292 -1.103346 -0.183652 1.270935 -2.393798
2014-08-08 -0.512127 0.147982 -2.003851 1.740556 -0.770737

Some of the useful functions that can be applied to a DataFrame include:


In [14]:
df.head()             # Displaying the first five (default) rows


Out[14]:
Kedai A Kedai B Kedai C Kedai D Kedai E
2014-08-01 0.908207 -1.915919 0.097727 0.505241 0.288635
2014-08-02 0.254458 0.943705 0.395889 -0.486971 0.075281
2014-08-03 0.115361 -0.958970 -0.825462 -0.130196 0.236358
2014-08-04 0.634680 -0.117784 -0.291003 2.075217 -0.063509
2014-08-05 -0.189921 -0.902058 -0.677468 0.155232 -1.285344

In [15]:
df.head(3)             # Displaying the first three (specified) rows


Out[15]:
Kedai A Kedai B Kedai C Kedai D Kedai E
2014-08-01 0.908207 -1.915919 0.097727 0.505241 0.288635
2014-08-02 0.254458 0.943705 0.395889 -0.486971 0.075281
2014-08-03 0.115361 -0.958970 -0.825462 -0.130196 0.236358

In [16]:
df.tail(2)             # Displaying the last two (specified) rows


Out[16]:
Kedai A Kedai B Kedai C Kedai D Kedai E
2014-08-07 0.644292 -1.103346 -0.183652 1.270935 -2.393798
2014-08-08 -0.512127 0.147982 -2.003851 1.740556 -0.770737

In [17]:
df.index               # Showing the index of rows


Out[17]:
DatetimeIndex(['2014-08-01', '2014-08-02', '2014-08-03', '2014-08-04',
               '2014-08-05', '2014-08-06', '2014-08-07', '2014-08-08'],
              dtype='datetime64[ns]', freq='D')

In [18]:
df.columns             # Showing the fields of columns


Out[18]:
Index([u'Kedai A', u'Kedai B', u'Kedai C', u'Kedai D', u'Kedai E'], dtype='object')

In [19]:
df.values              # Showing the data only in its original 2-D array


Out[19]:
array([[ 0.90820724, -1.91591888,  0.0977269 ,  0.50524069,  0.28863545],
       [ 0.25445799,  0.94370516,  0.39588871, -0.48697109,  0.07528073],
       [ 0.11536149, -0.95897049, -0.82546162, -0.13019616,  0.23635838],
       [ 0.63468028, -0.11778374, -0.29100337,  2.07521734, -0.06350881],
       [-0.18992099, -0.90205808, -0.67746765,  0.15523248, -1.28534372],
       [ 1.30947528,  0.75378506,  0.85105209, -0.5646459 ,  0.36162018],
       [ 0.64429166, -1.10334612, -0.18365171,  1.27093484, -2.39379768],
       [-0.51212697,  0.14798184, -2.00385122,  1.74055598, -0.77073667]])

In [20]:
df.describe()          # Simple statistical data for each column


Out[20]:
Kedai A Kedai B Kedai C Kedai D Kedai E
count 8.000000 8.000000 8.000000 8.000000 8.000000
mean 0.395553 -0.394076 -0.329596 0.570671 -0.443937
std 0.595483 0.990932 0.869785 1.014065 0.976833
min -0.512127 -1.915919 -2.003851 -0.564646 -2.393798
25% 0.039041 -0.995064 -0.714466 -0.219390 -0.899388
50% 0.444569 -0.509921 -0.237328 0.330237 0.005886
75% 0.710271 0.299433 0.172267 1.388340 0.249428
max 1.309475 0.943705 0.851052 2.075217 0.361620

In [21]:
df.T                    # Transposing the DataFrame (index becomes column and vice versa)


Out[21]:
2014-08-01 00:00:00 2014-08-02 00:00:00 2014-08-03 00:00:00 2014-08-04 00:00:00 2014-08-05 00:00:00 2014-08-06 00:00:00 2014-08-07 00:00:00 2014-08-08 00:00:00
Kedai A 0.908207 0.254458 0.115361 0.634680 -0.189921 1.309475 0.644292 -0.512127
Kedai B -1.915919 0.943705 -0.958970 -0.117784 -0.902058 0.753785 -1.103346 0.147982
Kedai C 0.097727 0.395889 -0.825462 -0.291003 -0.677468 0.851052 -0.183652 -2.003851
Kedai D 0.505241 -0.486971 -0.130196 2.075217 0.155232 -0.564646 1.270935 1.740556
Kedai E 0.288635 0.075281 0.236358 -0.063509 -1.285344 0.361620 -2.393798 -0.770737

In [22]:
df.sort_index(axis=1,ascending=False)     # Sorting with descending column


Out[22]:
Kedai E Kedai D Kedai C Kedai B Kedai A
2014-08-01 0.288635 0.505241 0.097727 -1.915919 0.908207
2014-08-02 0.075281 -0.486971 0.395889 0.943705 0.254458
2014-08-03 0.236358 -0.130196 -0.825462 -0.958970 0.115361
2014-08-04 -0.063509 2.075217 -0.291003 -0.117784 0.634680
2014-08-05 -1.285344 0.155232 -0.677468 -0.902058 -0.189921
2014-08-06 0.361620 -0.564646 0.851052 0.753785 1.309475
2014-08-07 -2.393798 1.270935 -0.183652 -1.103346 0.644292
2014-08-08 -0.770737 1.740556 -2.003851 0.147982 -0.512127

In [23]:
df.sort(columns='Kedai D')               # Sorting according to ascending specific column


/home/megatharun/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:1: FutureWarning: sort(columns=....) is deprecated, use sort_values(by=.....)
  if __name__ == '__main__':
Out[23]:
Kedai A Kedai B Kedai C Kedai D Kedai E
2014-08-06 1.309475 0.753785 0.851052 -0.564646 0.361620
2014-08-02 0.254458 0.943705 0.395889 -0.486971 0.075281
2014-08-03 0.115361 -0.958970 -0.825462 -0.130196 0.236358
2014-08-05 -0.189921 -0.902058 -0.677468 0.155232 -1.285344
2014-08-01 0.908207 -1.915919 0.097727 0.505241 0.288635
2014-08-07 0.644292 -1.103346 -0.183652 1.270935 -2.393798
2014-08-08 -0.512127 0.147982 -2.003851 1.740556 -0.770737
2014-08-04 0.634680 -0.117784 -0.291003 2.075217 -0.063509

In [24]:
df['Kedai A']             # Extract specific column (using python list syntax)


Out[24]:
2014-08-01    0.908207
2014-08-02    0.254458
2014-08-03    0.115361
2014-08-04    0.634680
2014-08-05   -0.189921
2014-08-06    1.309475
2014-08-07    0.644292
2014-08-08   -0.512127
Freq: D, Name: Kedai A, dtype: float64

In [25]:
df['Kedai A'][2:4]        # Slicing specific column (using python list syntax)


Out[25]:
2014-08-03    0.115361
2014-08-04    0.634680
Freq: D, Name: Kedai A, dtype: float64

In [26]:
df[2:4]                   # Slicing specific row data  (using python list syntax)


Out[26]:
Kedai A Kedai B Kedai C Kedai D Kedai E
2014-08-03 0.115361 -0.958970 -0.825462 -0.130196 0.236358
2014-08-04 0.634680 -0.117784 -0.291003 2.075217 -0.063509

In [27]:
# Slicing specific index range
df['2014-08-03':'2014-08-05']


Out[27]:
Kedai A Kedai B Kedai C Kedai D Kedai E
2014-08-03 0.115361 -0.958970 -0.825462 -0.130196 0.236358
2014-08-04 0.634680 -0.117784 -0.291003 2.075217 -0.063509
2014-08-05 -0.189921 -0.902058 -0.677468 0.155232 -1.285344

In [28]:
# Slicing specific index range for a particular column
df['2014-08-03':'2014-08-05']['Kedai B']


Out[28]:
2014-08-03   -0.958970
2014-08-04   -0.117784
2014-08-05   -0.902058
Freq: D, Name: Kedai B, dtype: float64

In [29]:
# Using the loc() function

# Slicing specific index and column ranges
df.loc['2014-08-03':'2014-08-05','Kedai B':'Kedai D']


Out[29]:
Kedai B Kedai C Kedai D
2014-08-03 -0.958970 -0.825462 -0.130196
2014-08-04 -0.117784 -0.291003 2.075217
2014-08-05 -0.902058 -0.677468 0.155232

In [30]:
# Slicing specific index range with specific column names
df.loc['2014-08-03':'2014-08-05',['Kedai B','Kedai D']]


Out[30]:
Kedai B Kedai D
2014-08-03 -0.958970 -0.130196
2014-08-04 -0.117784 2.075217
2014-08-05 -0.902058 0.155232

In [31]:
# Possibly not yet to have something like this

df.loc[['2014-08-01','2014-08-03':'2014-08-05'],['Kedai B','Kedai D']]


  File "<ipython-input-31-2ec112cb4d27>", line 3
    df.loc[['2014-08-01','2014-08-03':'2014-08-05'],['Kedai B','Kedai D']]
                                     ^
SyntaxError: invalid syntax

In [32]:
# Using the iloc() function

df.iloc[3]                 # Specific row location


Out[32]:
Kedai A    0.634680
Kedai B   -0.117784
Kedai C   -0.291003
Kedai D    2.075217
Kedai E   -0.063509
Name: 2014-08-04 00:00:00, dtype: float64

In [33]:
df.iloc[:,3]               # Specific column location (all rows)


Out[33]:
2014-08-01    0.505241
2014-08-02   -0.486971
2014-08-03   -0.130196
2014-08-04    2.075217
2014-08-05    0.155232
2014-08-06   -0.564646
2014-08-07    1.270935
2014-08-08    1.740556
Freq: D, Name: Kedai D, dtype: float64

In [34]:
df.iloc[2:4,1:3]           # Python like slicing for range


Out[34]:
Kedai B Kedai C
2014-08-03 -0.958970 -0.825462
2014-08-04 -0.117784 -0.291003

In [35]:
df.iloc[[2,4],[1,3]]        # Slicing with python like list


Out[35]:
Kedai B Kedai D
2014-08-03 -0.958970 -0.130196
2014-08-05 -0.902058 0.155232

In [36]:
# Conditionals on the data

df>0                        # Array values > 0 OR


Out[36]:
Kedai A Kedai B Kedai C Kedai D Kedai E
2014-08-01 True False True True True
2014-08-02 True True True False True
2014-08-03 True False False False True
2014-08-04 True False False True False
2014-08-05 False False False True False
2014-08-06 True True True False True
2014-08-07 True False False True False
2014-08-08 False True False True False

In [37]:
df[df>0]                    # Directly getting the value


Out[37]:
Kedai A Kedai B Kedai C Kedai D Kedai E
2014-08-01 0.908207 NaN 0.097727 0.505241 0.288635
2014-08-02 0.254458 0.943705 0.395889 NaN 0.075281
2014-08-03 0.115361 NaN NaN NaN 0.236358
2014-08-04 0.634680 NaN NaN 2.075217 NaN
2014-08-05 NaN NaN NaN 0.155232 NaN
2014-08-06 1.309475 0.753785 0.851052 NaN 0.361620
2014-08-07 0.644292 NaN NaN 1.270935 NaN
2014-08-08 NaN 0.147982 NaN 1.740556 NaN

NaN means empty, missing data or unavailable.


In [38]:
df[df['Kedai B']<0]           # With reference to specific value in a column (e.g. Kedai B)


Out[38]:
Kedai A Kedai B Kedai C Kedai D Kedai E
2014-08-01 0.908207 -1.915919 0.097727 0.505241 0.288635
2014-08-03 0.115361 -0.958970 -0.825462 -0.130196 0.236358
2014-08-04 0.634680 -0.117784 -0.291003 2.075217 -0.063509
2014-08-05 -0.189921 -0.902058 -0.677468 0.155232 -1.285344
2014-08-07 0.644292 -1.103346 -0.183652 1.270935 -2.393798

In [39]:
df2 = df.copy()             # Made a copy of a database

In [40]:
df2


Out[40]:
Kedai A Kedai B Kedai C Kedai D Kedai E
2014-08-01 0.908207 -1.915919 0.097727 0.505241 0.288635
2014-08-02 0.254458 0.943705 0.395889 -0.486971 0.075281
2014-08-03 0.115361 -0.958970 -0.825462 -0.130196 0.236358
2014-08-04 0.634680 -0.117784 -0.291003 2.075217 -0.063509
2014-08-05 -0.189921 -0.902058 -0.677468 0.155232 -1.285344
2014-08-06 1.309475 0.753785 0.851052 -0.564646 0.361620
2014-08-07 0.644292 -1.103346 -0.183652 1.270935 -2.393798
2014-08-08 -0.512127 0.147982 -2.003851 1.740556 -0.770737

In [41]:
# Adding column

df2['Tambah'] = ['satu','satu','dua','tiga','empat','tiga','lima','enam']

In [42]:
df2


Out[42]:
Kedai A Kedai B Kedai C Kedai D Kedai E Tambah
2014-08-01 0.908207 -1.915919 0.097727 0.505241 0.288635 satu
2014-08-02 0.254458 0.943705 0.395889 -0.486971 0.075281 satu
2014-08-03 0.115361 -0.958970 -0.825462 -0.130196 0.236358 dua
2014-08-04 0.634680 -0.117784 -0.291003 2.075217 -0.063509 tiga
2014-08-05 -0.189921 -0.902058 -0.677468 0.155232 -1.285344 empat
2014-08-06 1.309475 0.753785 0.851052 -0.564646 0.361620 tiga
2014-08-07 0.644292 -1.103346 -0.183652 1.270935 -2.393798 lima
2014-08-08 -0.512127 0.147982 -2.003851 1.740556 -0.770737 enam

In [43]:
# Adding row using append() function. The previous loc() is possibly deprecated.

# Assign a new name to the new row (with the same format)
new_row_name = pd.date_range('20140809', periods = 1, freq = 'D')

# Appending new row with new data
df2.append(list(np.random.randn(5))+['sembilan'])

# Renaming the new row (here actually is a reassignment)
df2 = df2.rename(index={10: new_row_name[0]})
df2


Out[43]:
Kedai A Kedai B Kedai C Kedai D Kedai E Tambah
2014-08-01 0.908207 -1.915919 0.097727 0.505241 0.288635 satu
2014-08-02 0.254458 0.943705 0.395889 -0.486971 0.075281 satu
2014-08-03 0.115361 -0.958970 -0.825462 -0.130196 0.236358 dua
2014-08-04 0.634680 -0.117784 -0.291003 2.075217 -0.063509 tiga
2014-08-05 -0.189921 -0.902058 -0.677468 0.155232 -1.285344 empat
2014-08-06 1.309475 0.753785 0.851052 -0.564646 0.361620 tiga
2014-08-07 0.644292 -1.103346 -0.183652 1.270935 -2.393798 lima
2014-08-08 -0.512127 0.147982 -2.003851 1.740556 -0.770737 enam

In [44]:
# Assigning new data to a row
df2.loc['2014-08-05'] = list(np.random.randn(5))+['tujuh']
df2


Out[44]:
Kedai A Kedai B Kedai C Kedai D Kedai E Tambah
2014-08-01 0.908207 -1.915919 0.097727 0.505241 0.288635 satu
2014-08-02 0.254458 0.943705 0.395889 -0.486971 0.075281 satu
2014-08-03 0.115361 -0.958970 -0.825462 -0.130196 0.236358 dua
2014-08-04 0.634680 -0.117784 -0.291003 2.075217 -0.063509 tiga
2014-08-05 0.838633 0.757729 0.064402 -0.950469 0.899920 tujuh
2014-08-06 1.309475 0.753785 0.851052 -0.564646 0.361620 tiga
2014-08-07 0.644292 -1.103346 -0.183652 1.270935 -2.393798 lima
2014-08-08 -0.512127 0.147982 -2.003851 1.740556 -0.770737 enam

In [45]:
# Assigning new data to a specific element
df2.loc['2014-08-05','Tambah'] = 'lapan'
df2


Out[45]:
Kedai A Kedai B Kedai C Kedai D Kedai E Tambah
2014-08-01 0.908207 -1.915919 0.097727 0.505241 0.288635 satu
2014-08-02 0.254458 0.943705 0.395889 -0.486971 0.075281 satu
2014-08-03 0.115361 -0.958970 -0.825462 -0.130196 0.236358 dua
2014-08-04 0.634680 -0.117784 -0.291003 2.075217 -0.063509 tiga
2014-08-05 0.838633 0.757729 0.064402 -0.950469 0.899920 lapan
2014-08-06 1.309475 0.753785 0.851052 -0.564646 0.361620 tiga
2014-08-07 0.644292 -1.103346 -0.183652 1.270935 -2.393798 lima
2014-08-08 -0.512127 0.147982 -2.003851 1.740556 -0.770737 enam

In [46]:
# Using the isin() function (returns boolean data frame)
df2.isin(['satu','tiga'])


Out[46]:
Kedai A Kedai B Kedai C Kedai D Kedai E Tambah
2014-08-01 False False False False False True
2014-08-02 False False False False False True
2014-08-03 False False False False False False
2014-08-04 False False False False False True
2014-08-05 False False False False False False
2014-08-06 False False False False False True
2014-08-07 False False False False False False
2014-08-08 False False False False False False

In [47]:
# Select specific row based on additonal column
df2[df2['Tambah'].isin(['satu','tiga'])]


Out[47]:
Kedai A Kedai B Kedai C Kedai D Kedai E Tambah
2014-08-01 0.908207 -1.915919 0.097727 0.505241 0.288635 satu
2014-08-02 0.254458 0.943705 0.395889 -0.486971 0.075281 satu
2014-08-04 0.634680 -0.117784 -0.291003 2.075217 -0.063509 tiga
2014-08-06 1.309475 0.753785 0.851052 -0.564646 0.361620 tiga

In [48]:
# Use previous command - select certain column based on selected additional column
df2[df2['Tambah'].isin(['satu','tiga'])].loc[:,'Kedai B':'Kedai D']


Out[48]:
Kedai B Kedai C Kedai D
2014-08-01 -1.915919 0.097727 0.505241
2014-08-02 0.943705 0.395889 -0.486971
2014-08-04 -0.117784 -0.291003 2.075217
2014-08-06 0.753785 0.851052 -0.564646

In [49]:
# Select > 0 from previous cell...
(df2[df2['Tambah'].isin(['satu','tiga'])].loc[:,'Kedai B':'Kedai D']>0)


Out[49]:
Kedai B Kedai C Kedai D
2014-08-01 False True True
2014-08-02 True True False
2014-08-04 False False True
2014-08-06 True True False

8.2 Data Operations

We have seen few operations previously on Series and DataFrame and here this will be explored further.


In [50]:
df.mean()                    # Statistical mean (column) - same as df.mean(0), 0 means column


Out[50]:
Kedai A    0.395553
Kedai B   -0.394076
Kedai C   -0.329596
Kedai D    0.570671
Kedai E   -0.443937
dtype: float64

In [51]:
df.mean(1)                   # Statistical mean (row) - 1 means row


Out[51]:
2014-08-01   -0.023222
2014-08-02    0.236472
2014-08-03   -0.312582
2014-08-04    0.447520
2014-08-05   -0.579912
2014-08-06    0.542257
2014-08-07   -0.353114
2014-08-08   -0.279635
Freq: D, dtype: float64

In [52]:
df.mean()['Kedai C':'Kedai E']         # Statistical mean (range of columns)


Out[52]:
Kedai C   -0.329596
Kedai D    0.570671
Kedai E   -0.443937
dtype: float64

In [53]:
df.max()                               # Statistical max (column)


Out[53]:
Kedai A    1.309475
Kedai B    0.943705
Kedai C    0.851052
Kedai D    2.075217
Kedai E    0.361620
dtype: float64

In [54]:
df.max()['Kedai C']                    # Statistical max (specific column)


Out[54]:
0.85105209316874242

In [55]:
df.max(1)['2014-08-04':'2014-08-07']   # Statistical max (specific row)


Out[55]:
2014-08-04    2.075217
2014-08-05    0.155232
2014-08-06    1.309475
2014-08-07    1.270935
Freq: D, dtype: float64

In [56]:
df.max(1)[dates[3]]          # Statistical max (specific row by variable)


Out[56]:
2.0752173395827045

Other statistical functions can be checked by typing df.< TAB >.

The data in a DataFrame can be represented by a variable declared using the $lambda$ operator.


In [57]:
df.apply(lambda x: x.max() - x.min())       # Operating array values with function


Out[57]:
Kedai A    1.821602
Kedai B    2.859624
Kedai C    2.854903
Kedai D    2.639863
Kedai E    2.755418
dtype: float64

In [58]:
df.apply(lambda z: np.log(z))               # Operating array values with function


Out[58]:
Kedai A Kedai B Kedai C Kedai D Kedai E
2014-08-01 -0.096283 NaN -2.325578 -0.682720 -1.242591
2014-08-02 -1.368620 -0.057941 -0.926622 NaN -2.586531
2014-08-03 -2.159685 NaN NaN NaN -1.442406
2014-08-04 -0.454634 NaN NaN 0.730066 NaN
2014-08-05 NaN NaN NaN -1.862831 NaN
2014-08-06 0.269627 -0.282648 -0.161282 NaN -1.017161
2014-08-07 -0.439604 NaN NaN 0.239753 NaN
2014-08-08 NaN -1.910666 NaN 0.554205 NaN

Replacing, rearranging and operations of data between columns can be done much like spreadsheet.


In [59]:
df3 = df.copy()

In [60]:
df3[r'Kedai A^2/Kedai E'] = df3['Kedai A']**2/df3['Kedai E']
df3


Out[60]:
Kedai A Kedai B Kedai C Kedai D Kedai E Kedai A^2/Kedai E
2014-08-01 0.908207 -1.915919 0.097727 0.505241 0.288635 2.857724
2014-08-02 0.254458 0.943705 0.395889 -0.486971 0.075281 0.860099
2014-08-03 0.115361 -0.958970 -0.825462 -0.130196 0.236358 0.056305
2014-08-04 0.634680 -0.117784 -0.291003 2.075217 -0.063509 -6.342727
2014-08-05 -0.189921 -0.902058 -0.677468 0.155232 -1.285344 -0.028063
2014-08-06 1.309475 0.753785 0.851052 -0.564646 0.361620 4.741786
2014-08-07 0.644292 -1.103346 -0.183652 1.270935 -2.393798 -0.173411
2014-08-08 -0.512127 0.147982 -2.003851 1.740556 -0.770737 -0.340290

Tables can be split, rearranged and combined.


In [61]:
df4 = df.copy()
df4


Out[61]:
Kedai A Kedai B Kedai C Kedai D Kedai E
2014-08-01 0.908207 -1.915919 0.097727 0.505241 0.288635
2014-08-02 0.254458 0.943705 0.395889 -0.486971 0.075281
2014-08-03 0.115361 -0.958970 -0.825462 -0.130196 0.236358
2014-08-04 0.634680 -0.117784 -0.291003 2.075217 -0.063509
2014-08-05 -0.189921 -0.902058 -0.677468 0.155232 -1.285344
2014-08-06 1.309475 0.753785 0.851052 -0.564646 0.361620
2014-08-07 0.644292 -1.103346 -0.183652 1.270935 -2.393798
2014-08-08 -0.512127 0.147982 -2.003851 1.740556 -0.770737

In [62]:
pieces = [df4[6:], df4[3:6], df4[:3]]   # split row 2+3+3
pieces


Out[62]:
[             Kedai A   Kedai B   Kedai C   Kedai D   Kedai E
 2014-08-07  0.644292 -1.103346 -0.183652  1.270935 -2.393798
 2014-08-08 -0.512127  0.147982 -2.003851  1.740556 -0.770737,
              Kedai A   Kedai B   Kedai C   Kedai D   Kedai E
 2014-08-04  0.634680 -0.117784 -0.291003  2.075217 -0.063509
 2014-08-05 -0.189921 -0.902058 -0.677468  0.155232 -1.285344
 2014-08-06  1.309475  0.753785  0.851052 -0.564646  0.361620,
              Kedai A   Kedai B   Kedai C   Kedai D   Kedai E
 2014-08-01  0.908207 -1.915919  0.097727  0.505241  0.288635
 2014-08-02  0.254458  0.943705  0.395889 -0.486971  0.075281
 2014-08-03  0.115361 -0.958970 -0.825462 -0.130196  0.236358]

In [63]:
df5 = pd.concat(pieces)                 # concantenate (rearrange/combine)
df5


Out[63]:
Kedai A Kedai B Kedai C Kedai D Kedai E
2014-08-07 0.644292 -1.103346 -0.183652 1.270935 -2.393798
2014-08-08 -0.512127 0.147982 -2.003851 1.740556 -0.770737
2014-08-04 0.634680 -0.117784 -0.291003 2.075217 -0.063509
2014-08-05 -0.189921 -0.902058 -0.677468 0.155232 -1.285344
2014-08-06 1.309475 0.753785 0.851052 -0.564646 0.361620
2014-08-01 0.908207 -1.915919 0.097727 0.505241 0.288635
2014-08-02 0.254458 0.943705 0.395889 -0.486971 0.075281
2014-08-03 0.115361 -0.958970 -0.825462 -0.130196 0.236358

In [64]:
df4+df5             # Operation between tables with original index sequence


Out[64]:
Kedai A Kedai B Kedai C Kedai D Kedai E
2014-08-01 1.816414 -3.831838 0.195454 1.010481 0.577271
2014-08-02 0.508916 1.887410 0.791777 -0.973942 0.150561
2014-08-03 0.230723 -1.917941 -1.650923 -0.260392 0.472717
2014-08-04 1.269361 -0.235567 -0.582007 4.150435 -0.127018
2014-08-05 -0.379842 -1.804116 -1.354935 0.310465 -2.570687
2014-08-06 2.618951 1.507570 1.702104 -1.129292 0.723240
2014-08-07 1.288583 -2.206692 -0.367303 2.541870 -4.787595
2014-08-08 -1.024254 0.295964 -4.007702 3.481112 -1.541473

In [65]:
df0 = df.loc[:,'Kedai A':'Kedai C']     # Slicing and extracting columns
pd.concat([df4, df0], axis = 1)         # Concatenating columns (axis = 1 -> refers to column)


Out[65]:
Kedai A Kedai B Kedai C Kedai D Kedai E Kedai A Kedai B Kedai C
2014-08-01 0.908207 -1.915919 0.097727 0.505241 0.288635 0.908207 -1.915919 0.097727
2014-08-02 0.254458 0.943705 0.395889 -0.486971 0.075281 0.254458 0.943705 0.395889
2014-08-03 0.115361 -0.958970 -0.825462 -0.130196 0.236358 0.115361 -0.958970 -0.825462
2014-08-04 0.634680 -0.117784 -0.291003 2.075217 -0.063509 0.634680 -0.117784 -0.291003
2014-08-05 -0.189921 -0.902058 -0.677468 0.155232 -1.285344 -0.189921 -0.902058 -0.677468
2014-08-06 1.309475 0.753785 0.851052 -0.564646 0.361620 1.309475 0.753785 0.851052
2014-08-07 0.644292 -1.103346 -0.183652 1.270935 -2.393798 0.644292 -1.103346 -0.183652
2014-08-08 -0.512127 0.147982 -2.003851 1.740556 -0.770737 -0.512127 0.147982 -2.003851

8.3 Plotting Functions


Let us look on some of the simple plotting function on $Pandas$ (requires $Matplotlib$ library).


In [66]:
df_add = df.copy()

In [67]:
# Simple auto plotting
%matplotlib inline

df_add.cumsum().plot()


Out[67]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f37fd724a50>

In [68]:
# Reposition the legend

import matplotlib.pyplot as plt

df_add.cumsum().plot()
plt.legend(bbox_to_anchor=[1.3, 1])


Out[68]:
<matplotlib.legend.Legend at 0x7f37fd74cb10>

In the above example, repositioning the legend requires the legend function in $Matplotlib$ library. Therefore, the $Matplotlib$ library must be explicitly imported.


In [69]:
df_add.cumsum().plot(kind='bar')
plt.legend(bbox_to_anchor=[1.3, 1])


Out[69]:
<matplotlib.legend.Legend at 0x7f37f9d35250>

In [70]:
df_add.cumsum().plot(kind='barh', stacked=True)


Out[70]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f37f9b33bd0>

In [71]:
df_add.cumsum().plot(kind='hist', alpha=0.5)


Out[71]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f37f996bd50>

In [72]:
df_add.cumsum().plot(kind='area', alpha=0.4, stacked=False)
plt.legend(bbox_to_anchor=[1.3, 1])


Out[72]:
<matplotlib.legend.Legend at 0x7f37f996bc90>

A 3-dimensional plot can be projected on a canvas but requires the $Axes3D$ library with slightly complicated settings.


In [73]:
# Plotting a 3D bar plot

from mpl_toolkits.mplot3d import Axes3D
import numpy as np

# Convert the time format into ordinary strings
time_series = pd.Series(df.index.format())

fig = plt.figure(figsize=(8,6))
ax = fig.add_subplot(111, projection='3d')

# Plotting the bar graph column by column
for c, z in zip(['r', 'g', 'b', 'y','m'], np.arange(len(df.columns))):
    xs = df.index
    ys = df.values[:,z]
    ax.bar(xs, ys, zs=z, zdir='y', color=c, alpha=0.5)
    
ax.set_zlabel('Z')
ax.set_xticklabels(time_series, va = 'baseline', ha = 'right', rotation = 15)
ax.set_yticks(np.arange(len(df.columns)))
ax.set_yticklabels(df.columns, va = 'center', ha = 'left', rotation = -42)

ax.view_init(30, -30)

fig.tight_layout()



8.4 Reading And Writing Data To File

Data in DataFrame can be exported into csv (comma separated value) and Excel file. The users can also create a DataFrame from data in csv and Excel file, the data can then be processed.


In [74]:
# Export data to a csv file but separated with < TAB > rather than comma
# the default separation is with comma

df.to_csv('Tutorial8/Kedai.txt', sep='\t')

In [75]:
# Export to Excel file

df.to_excel('Tutorial8/Kedai.xlsx', sheet_name = 'Tarikh', index = True)

In [76]:
# Importing data from csv file (without header)

from_file = pd.read_csv('Tutorial8/Malaysian_Town.txt',sep='\t',header=None)
from_file.head()


Out[76]:
0 1 2 3 4
0 1 Kuala Lumpur 1,588,750 NaN City
1 2 Seberang Perai 818,197 Includes Butterworth, Bukit Mertajam Municipality
2 3 Kajang 795,522 NaN Municipality
3 4 Klang 744,062 NaN Municipality
4 5 Subang Jaya 708,296 NaN Municipality

In [77]:
# Importing data from Excel file (with header (the first row) that became the column names)

from_excel = pd.read_excel('Tutorial8/Malaysian_Town.xlsx','Sheet1')
from_excel.head()


Out[77]:
No. City Population Area Type
0 1 Kuala Lumpur 1588750 NaN City
1 2 Seberang Perai 818197 Includes Butterworth, Bukit Mertajam Municipality
2 3 Kajang 795522 NaN Municipality
3 4 Klang 744062 NaN Municipality
4 5 Subang Jaya 708296 NaN Municipality

Further $Pandas$ features can be found in http://pandas.pydata.org/.