BASIC PYTHON FOR RESEARCHERS

by Megat Harun Al Rashid bin Megat Ahmad
last updated: April 14, 2016

8. Database and Data Analysis

$Pandas$ is an open source library for data analysis in Python. It gives Python similar capabilities to R programming language and even though it is possible to run R in Jupyter Notebook, it would be more practical to do data analysis with a Python friendly syntax. Similar to other libraries, the first step to use $Pandas$ is to import the library and usually together with the $Numpy$ library.



In [1]:

    
import pandas as pd
import numpy as np

8.1 Data Structures

Data structures (similar to Sequence in Python) of $Pandas$ revolves around the Series and DataFrame structures. Both are fast as they are built on top of $Numpy$.

A Series is a one-dimensional object with a lot of similar properties similar to a list or dictionary in Python's Sequence. Each element or item in a Series will be assigned by default an index label from 0 to N-1 (where N is the length of the Series) and it can contains the various type of Python's data.



In [2]:

    
# Creating a series (with different type of data)

s1 = pd.Series([34, 'Material', 4*np.pi, 'Reactor', [100,250,500,750], 'kW'])
s1









    Out[2]:





0                      34
1                Material
2                 12.5664
3                 Reactor
4    [100, 250, 500, 750]
5                      kW
dtype: object

The index of a Series can be specified during its creation and giving it a similar function to a dictionary.



In [3]:

    
# Creating a series with specified index

lt = [34, 'Material', 4*np.pi, 'Reactor', [100,250,500,750], 'kW']

s2 = pd.Series(lt, index = ['b1', 'r1', 'solid angle', 18, 'reactor power', 'unit'])
s2









    Out[3]:





b1                                 34
r1                           Material
solid angle                   12.5664
18                            Reactor
reactor power    [100, 250, 500, 750]
unit                               kW
dtype: object

Data can be extracted by specifying the element position or index (similar to list/dictionary).



In [4]:

    
s1[3], s2['solid angle']









    Out[4]:





('Reactor', 12.566370614359172)

Series can also be constructed from a dictionary.



In [5]:

    
pop_cities = {'Kuala Lumpur':1588750, 'Seberang Perai':818197, 'Kajang':795522,
              'Klang':744062, 'Subang Jaya':708296}
cities = pd.Series(pop_cities)
cities









    Out[5]:





Kajang             795522
Klang              744062
Kuala Lumpur      1588750
Seberang Perai     818197
Subang Jaya        708296
dtype: int64

The elements can be sort using the $Series.order()$ function. This will not change the structure of the original variable.



In [6]:

    
cities.order(ascending=False)









    



/home/megatharun/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:1: FutureWarning: order is deprecated, use sort_values(...)
  if __name__ == '__main__':






    Out[6]:





Kuala Lumpur      1588750
Seberang Perai     818197
Kajang             795522
Klang              744062
Subang Jaya        708296
dtype: int64



In [7]:

    
cities









    Out[7]:





Kajang             795522
Klang              744062
Kuala Lumpur      1588750
Seberang Perai     818197
Subang Jaya        708296
dtype: int64

Another sorting function is the $sort()$ function but this will change the structure of the Series variable.



In [8]:

    
# Sorting with descending values

cities.sort(ascending=False)
cities









    



/home/megatharun/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:3: FutureWarning: sort is deprecated, use sort_values(inplace=True) for for INPLACE sorting
  app.launch_new_instance()






    Out[8]:





Kuala Lumpur      1588750
Seberang Perai     818197
Kajang             795522
Klang              744062
Subang Jaya        708296
dtype: int64



In [9]:

    
cities









    Out[9]:





Kuala Lumpur      1588750
Seberang Perai     818197
Kajang             795522
Klang              744062
Subang Jaya        708296
dtype: int64

Conditions can be applied to the elements.



In [10]:

    
# cities with population less than 800,000
cities[cities<800000]









    Out[10]:





Kajang         795522
Klang          744062
Subang Jaya    708296
dtype: int64



In [11]:

    
# cities with population between 750,000 and 800,000
cities[cities<800000][cities>750000]









    Out[11]:





Kajang    795522
dtype: int64

A DataFrame is a 2-dimensional data structure with named rows and columns. It is similar to R's data.frame object and function like a spreadsheet. DataFrame can be considered to be made of series of Series data according to the column names. DataFrame can be created by passing a 2-dimensional array of data and specifying the rows and columns names.



In [12]:

    
# Creating a DataFrame by passing a 2-D numpy array of random number 

# Creating first the date-time index using date_range function
# and checking it.
dates = pd.date_range('20140801', periods = 8, freq = 'D')
dates









    Out[12]:





DatetimeIndex(['2014-08-01', '2014-08-02', '2014-08-03', '2014-08-04',
               '2014-08-05', '2014-08-06', '2014-08-07', '2014-08-08'],
              dtype='datetime64[ns]', freq='D')



In [13]:

    
# Creating the column names as list
Kedai = ['Kedai A', 'Kedai B', 'Kedai C', 'Kedai D', 'Kedai E']

# Creating the DataFrame with specified rows and columns
df = pd.DataFrame(np.random.randn(8,5),index=dates,columns=Kedai)
df

Some of the useful functions that can be applied to a DataFrame include:



In [14]:

    
df.head()             # Displaying the first five (default) rows



In [15]:

    
df.head(3)             # Displaying the first three (specified) rows



In [16]:

    
df.tail(2)             # Displaying the last two (specified) rows



In [17]:

    
df.index               # Showing the index of rows









    Out[17]:





DatetimeIndex(['2014-08-01', '2014-08-02', '2014-08-03', '2014-08-04',
               '2014-08-05', '2014-08-06', '2014-08-07', '2014-08-08'],
              dtype='datetime64[ns]', freq='D')



In [18]:

    
df.columns             # Showing the fields of columns









    Out[18]:





Index([u'Kedai A', u'Kedai B', u'Kedai C', u'Kedai D', u'Kedai E'], dtype='object')



In [19]:

    
df.values              # Showing the data only in its original 2-D array









    Out[19]:





array([[ 0.90820724, -1.91591888,  0.0977269 ,  0.50524069,  0.28863545],
       [ 0.25445799,  0.94370516,  0.39588871, -0.48697109,  0.07528073],
       [ 0.11536149, -0.95897049, -0.82546162, -0.13019616,  0.23635838],
       [ 0.63468028, -0.11778374, -0.29100337,  2.07521734, -0.06350881],
       [-0.18992099, -0.90205808, -0.67746765,  0.15523248, -1.28534372],
       [ 1.30947528,  0.75378506,  0.85105209, -0.5646459 ,  0.36162018],
       [ 0.64429166, -1.10334612, -0.18365171,  1.27093484, -2.39379768],
       [-0.51212697,  0.14798184, -2.00385122,  1.74055598, -0.77073667]])



In [20]:

    
df.describe()          # Simple statistical data for each column



In [21]:

    
df.T                    # Transposing the DataFrame (index becomes column and vice versa)









    Out[21]:






  
    
      
      2014-08-01 00:00:00
      2014-08-02 00:00:00
      2014-08-03 00:00:00
      2014-08-04 00:00:00
      2014-08-05 00:00:00
      2014-08-06 00:00:00
      2014-08-07 00:00:00
      2014-08-08 00:00:00
    
  
  
    
      Kedai A
      0.908207
      0.254458
      0.115361
      0.634680
      -0.189921
      1.309475
      0.644292
      -0.512127
    
    
      Kedai B
      -1.915919
      0.943705
      -0.958970
      -0.117784
      -0.902058
      0.753785
      -1.103346
      0.147982
    
    
      Kedai C
      0.097727
      0.395889
      -0.825462
      -0.291003
      -0.677468
      0.851052
      -0.183652
      -2.003851
    
    
      Kedai D
      0.505241
      -0.486971
      -0.130196
      2.075217
      0.155232
      -0.564646
      1.270935
      1.740556
    
    
      Kedai E
      0.288635
      0.075281
      0.236358
      -0.063509
      -1.285344
      0.361620
      -2.393798
      -0.770737



In [22]:

    
df.sort_index(axis=1,ascending=False)     # Sorting with descending column



In [23]:

    
df.sort(columns='Kedai D')               # Sorting according to ascending specific column









    



/home/megatharun/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:1: FutureWarning: sort(columns=....) is deprecated, use sort_values(by=.....)
  if __name__ == '__main__':






    Out[23]:






  
    
      
      Kedai A
      Kedai B
      Kedai C
      Kedai D
      Kedai E
    
  
  
    
      2014-08-06
      1.309475
      0.753785
      0.851052
      -0.564646
      0.361620
    
    
      2014-08-02
      0.254458
      0.943705
      0.395889
      -0.486971
      0.075281
    
    
      2014-08-03
      0.115361
      -0.958970
      -0.825462
      -0.130196
      0.236358
    
    
      2014-08-05
      -0.189921
      -0.902058
      -0.677468
      0.155232
      -1.285344
    
    
      2014-08-01
      0.908207
      -1.915919
      0.097727
      0.505241
      0.288635
    
    
      2014-08-07
      0.644292
      -1.103346
      -0.183652
      1.270935
      -2.393798
    
    
      2014-08-08
      -0.512127
      0.147982
      -2.003851
      1.740556
      -0.770737
    
    
      2014-08-04
      0.634680
      -0.117784
      -0.291003
      2.075217
      -0.063509



In [24]:

    
df['Kedai A']             # Extract specific column (using python list syntax)









    Out[24]:





2014-08-01    0.908207
2014-08-02    0.254458
2014-08-03    0.115361
2014-08-04    0.634680
2014-08-05   -0.189921
2014-08-06    1.309475
2014-08-07    0.644292
2014-08-08   -0.512127
Freq: D, Name: Kedai A, dtype: float64



In [25]:

    
df['Kedai A'][2:4]        # Slicing specific column (using python list syntax)









    Out[25]:





2014-08-03    0.115361
2014-08-04    0.634680
Freq: D, Name: Kedai A, dtype: float64



In [26]:

    
df[2:4]                   # Slicing specific row data  (using python list syntax)



In [27]:

    
# Slicing specific index range
df['2014-08-03':'2014-08-05']



In [28]:

    
# Slicing specific index range for a particular column
df['2014-08-03':'2014-08-05']['Kedai B']









    Out[28]:





2014-08-03   -0.958970
2014-08-04   -0.117784
2014-08-05   -0.902058
Freq: D, Name: Kedai B, dtype: float64



In [29]:

    
# Using the loc() function

# Slicing specific index and column ranges
df.loc['2014-08-03':'2014-08-05','Kedai B':'Kedai D']



In [30]:

    
# Slicing specific index range with specific column names
df.loc['2014-08-03':'2014-08-05',['Kedai B','Kedai D']]



In [31]:

    
# Possibly not yet to have something like this

df.loc[['2014-08-01','2014-08-03':'2014-08-05'],['Kedai B','Kedai D']]









    



  File "<ipython-input-31-2ec112cb4d27>", line 3
    df.loc[['2014-08-01','2014-08-03':'2014-08-05'],['Kedai B','Kedai D']]
                                     ^
SyntaxError: invalid syntax



In [32]:

    
# Using the iloc() function

df.iloc[3]                 # Specific row location









    Out[32]:





Kedai A    0.634680
Kedai B   -0.117784
Kedai C   -0.291003
Kedai D    2.075217
Kedai E   -0.063509
Name: 2014-08-04 00:00:00, dtype: float64



In [33]:

    
df.iloc[:,3]               # Specific column location (all rows)









    Out[33]:





2014-08-01    0.505241
2014-08-02   -0.486971
2014-08-03   -0.130196
2014-08-04    2.075217
2014-08-05    0.155232
2014-08-06   -0.564646
2014-08-07    1.270935
2014-08-08    1.740556
Freq: D, Name: Kedai D, dtype: float64



In [34]:

    
df.iloc[2:4,1:3]           # Python like slicing for range



In [35]:

    
df.iloc[[2,4],[1,3]]        # Slicing with python like list



In [36]:

    
# Conditionals on the data

df>0                        # Array values > 0 OR









    Out[36]:






  
    
      
      Kedai A
      Kedai B
      Kedai C
      Kedai D
      Kedai E
    
  
  
    
      2014-08-01
      True
      False
      True
      True
      True
    
    
      2014-08-02
      True
      True
      True
      False
      True
    
    
      2014-08-03
      True
      False
      False
      False
      True
    
    
      2014-08-04
      True
      False
      False
      True
      False
    
    
      2014-08-05
      False
      False
      False
      True
      False
    
    
      2014-08-06
      True
      True
      True
      False
      True
    
    
      2014-08-07
      True
      False
      False
      True
      False
    
    
      2014-08-08
      False
      True
      False
      True
      False



In [37]:

    
df[df>0]                    # Directly getting the value

NaN means empty, missing data or unavailable.



In [38]:

    
df[df['Kedai B']<0]           # With reference to specific value in a column (e.g. Kedai B)



In [39]:

    
df2 = df.copy()             # Made a copy of a database



In [40]:

    
df2



In [41]:

    
# Adding column

df2['Tambah'] = ['satu','satu','dua','tiga','empat','tiga','lima','enam']



In [42]:

    
df2



In [43]:

    
# Adding row using append() function. The previous loc() is possibly deprecated.

# Assign a new name to the new row (with the same format)
new_row_name = pd.date_range('20140809', periods = 1, freq = 'D')

# Appending new row with new data
df2.append(list(np.random.randn(5))+['sembilan'])

# Renaming the new row (here actually is a reassignment)
df2 = df2.rename(index={10: new_row_name[0]})
df2



In [44]:

    
# Assigning new data to a row
df2.loc['2014-08-05'] = list(np.random.randn(5))+['tujuh']
df2



In [45]:

    
# Assigning new data to a specific element
df2.loc['2014-08-05','Tambah'] = 'lapan'
df2



In [46]:

    
# Using the isin() function (returns boolean data frame)
df2.isin(['satu','tiga'])









    Out[46]:






  
    
      
      Kedai A
      Kedai B
      Kedai C
      Kedai D
      Kedai E
      Tambah
    
  
  
    
      2014-08-01
      False
      False
      False
      False
      False
      True
    
    
      2014-08-02
      False
      False
      False
      False
      False
      True
    
    
      2014-08-03
      False
      False
      False
      False
      False
      False
    
    
      2014-08-04
      False
      False
      False
      False
      False
      True
    
    
      2014-08-05
      False
      False
      False
      False
      False
      False
    
    
      2014-08-06
      False
      False
      False
      False
      False
      True
    
    
      2014-08-07
      False
      False
      False
      False
      False
      False
    
    
      2014-08-08
      False
      False
      False
      False
      False
      False



In [47]:

    
# Select specific row based on additonal column
df2[df2['Tambah'].isin(['satu','tiga'])]



In [48]:

    
# Use previous command - select certain column based on selected additional column
df2[df2['Tambah'].isin(['satu','tiga'])].loc[:,'Kedai B':'Kedai D']



In [49]:

    
# Select > 0 from previous cell...
(df2[df2['Tambah'].isin(['satu','tiga'])].loc[:,'Kedai B':'Kedai D']>0)









    Out[49]:






  
    
      
      Kedai B
      Kedai C
      Kedai D
    
  
  
    
      2014-08-01
      False
      True
      True
    
    
      2014-08-02
      True
      True
      False
    
    
      2014-08-04
      False
      False
      True
    
    
      2014-08-06
      True
      True
      False

8.2 Data Operations

We have seen few operations previously on Series and DataFrame and here this will be explored further.



In [50]:

    
df.mean()                    # Statistical mean (column) - same as df.mean(0), 0 means column









    Out[50]:





Kedai A    0.395553
Kedai B   -0.394076
Kedai C   -0.329596
Kedai D    0.570671
Kedai E   -0.443937
dtype: float64



In [51]:

    
df.mean(1)                   # Statistical mean (row) - 1 means row









    Out[51]:





2014-08-01   -0.023222
2014-08-02    0.236472
2014-08-03   -0.312582
2014-08-04    0.447520
2014-08-05   -0.579912
2014-08-06    0.542257
2014-08-07   -0.353114
2014-08-08   -0.279635
Freq: D, dtype: float64



In [52]:

    
df.mean()['Kedai C':'Kedai E']         # Statistical mean (range of columns)









    Out[52]:





Kedai C   -0.329596
Kedai D    0.570671
Kedai E   -0.443937
dtype: float64



In [53]:

    
df.max()                               # Statistical max (column)









    Out[53]:





Kedai A    1.309475
Kedai B    0.943705
Kedai C    0.851052
Kedai D    2.075217
Kedai E    0.361620
dtype: float64



In [54]:

    
df.max()['Kedai C']                    # Statistical max (specific column)









    Out[54]:





0.85105209316874242



In [55]:

    
df.max(1)['2014-08-04':'2014-08-07']   # Statistical max (specific row)









    Out[55]:





2014-08-04    2.075217
2014-08-05    0.155232
2014-08-06    1.309475
2014-08-07    1.270935
Freq: D, dtype: float64



In [56]:

    
df.max(1)[dates[3]]          # Statistical max (specific row by variable)









    Out[56]:





2.0752173395827045

Other statistical functions can be checked by typing df.< TAB >.

The data in a DataFrame can be represented by a variable declared using the $lambda$ operator.



In [57]:

    
df.apply(lambda x: x.max() - x.min())       # Operating array values with function









    Out[57]:





Kedai A    1.821602
Kedai B    2.859624
Kedai C    2.854903
Kedai D    2.639863
Kedai E    2.755418
dtype: float64



In [58]:

    
df.apply(lambda z: np.log(z))               # Operating array values with function

Replacing, rearranging and operations of data between columns can be done much like spreadsheet.



In [59]:

    
df3 = df.copy()



In [60]:

    
df3[r'Kedai A^2/Kedai E'] = df3['Kedai A']**2/df3['Kedai E']
df3









    Out[60]:






  
    
      
      Kedai A
      Kedai B
      Kedai C
      Kedai D
      Kedai E
      Kedai A^2/Kedai E
    
  
  
    
      2014-08-01
      0.908207
      -1.915919
      0.097727
      0.505241
      0.288635
      2.857724
    
    
      2014-08-02
      0.254458
      0.943705
      0.395889
      -0.486971
      0.075281
      0.860099
    
    
      2014-08-03
      0.115361
      -0.958970
      -0.825462
      -0.130196
      0.236358
      0.056305
    
    
      2014-08-04
      0.634680
      -0.117784
      -0.291003
      2.075217
      -0.063509
      -6.342727
    
    
      2014-08-05
      -0.189921
      -0.902058
      -0.677468
      0.155232
      -1.285344
      -0.028063
    
    
      2014-08-06
      1.309475
      0.753785
      0.851052
      -0.564646
      0.361620
      4.741786
    
    
      2014-08-07
      0.644292
      -1.103346
      -0.183652
      1.270935
      -2.393798
      -0.173411
    
    
      2014-08-08
      -0.512127
      0.147982
      -2.003851
      1.740556
      -0.770737
      -0.340290

Tables can be split, rearranged and combined.



In [61]:

    
df4 = df.copy()
df4



In [62]:

    
pieces = [df4[6:], df4[3:6], df4[:3]]   # split row 2+3+3
pieces









    Out[62]:





[             Kedai A   Kedai B   Kedai C   Kedai D   Kedai E
 2014-08-07  0.644292 -1.103346 -0.183652  1.270935 -2.393798
 2014-08-08 -0.512127  0.147982 -2.003851  1.740556 -0.770737,
              Kedai A   Kedai B   Kedai C   Kedai D   Kedai E
 2014-08-04  0.634680 -0.117784 -0.291003  2.075217 -0.063509
 2014-08-05 -0.189921 -0.902058 -0.677468  0.155232 -1.285344
 2014-08-06  1.309475  0.753785  0.851052 -0.564646  0.361620,
              Kedai A   Kedai B   Kedai C   Kedai D   Kedai E
 2014-08-01  0.908207 -1.915919  0.097727  0.505241  0.288635
 2014-08-02  0.254458  0.943705  0.395889 -0.486971  0.075281
 2014-08-03  0.115361 -0.958970 -0.825462 -0.130196  0.236358]



In [63]:

    
df5 = pd.concat(pieces)                 # concantenate (rearrange/combine)
df5



In [64]:

    
df4+df5             # Operation between tables with original index sequence



In [65]:

    
df0 = df.loc[:,'Kedai A':'Kedai C']     # Slicing and extracting columns
pd.concat([df4, df0], axis = 1)         # Concatenating columns (axis = 1 -> refers to column)

8.3 Plotting Functions

Let us look on some of the simple plotting function on $Pandas$ (requires $Matplotlib$ library).



In [66]:

    
df_add = df.copy()



In [67]:

    
# Simple auto plotting
%matplotlib inline

df_add.cumsum().plot()









    Out[67]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f37fd724a50>



In [68]:

    
# Reposition the legend

import matplotlib.pyplot as plt

df_add.cumsum().plot()
plt.legend(bbox_to_anchor=[1.3, 1])









    Out[68]:





<matplotlib.legend.Legend at 0x7f37fd74cb10>

In the above example, repositioning the legend requires the legend function in $Matplotlib$ library. Therefore, the $Matplotlib$ library must be explicitly imported.



In [69]:

    
df_add.cumsum().plot(kind='bar')
plt.legend(bbox_to_anchor=[1.3, 1])









    Out[69]:





<matplotlib.legend.Legend at 0x7f37f9d35250>



In [70]:

    
df_add.cumsum().plot(kind='barh', stacked=True)









    Out[70]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f37f9b33bd0>



In [71]:

    
df_add.cumsum().plot(kind='hist', alpha=0.5)









    Out[71]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f37f996bd50>



In [72]:

    
df_add.cumsum().plot(kind='area', alpha=0.4, stacked=False)
plt.legend(bbox_to_anchor=[1.3, 1])









    Out[72]:





<matplotlib.legend.Legend at 0x7f37f996bc90>

A 3-dimensional plot can be projected on a canvas but requires the $Axes3D$ library with slightly complicated settings.



In [73]:

    
# Plotting a 3D bar plot

from mpl_toolkits.mplot3d import Axes3D
import numpy as np

# Convert the time format into ordinary strings
time_series = pd.Series(df.index.format())

fig = plt.figure(figsize=(8,6))
ax = fig.add_subplot(111, projection='3d')

# Plotting the bar graph column by column
for c, z in zip(['r', 'g', 'b', 'y','m'], np.arange(len(df.columns))):
    xs = df.index
    ys = df.values[:,z]
    ax.bar(xs, ys, zs=z, zdir='y', color=c, alpha=0.5)
    
ax.set_zlabel('Z')
ax.set_xticklabels(time_series, va = 'baseline', ha = 'right', rotation = 15)
ax.set_yticks(np.arange(len(df.columns)))
ax.set_yticklabels(df.columns, va = 'center', ha = 'left', rotation = -42)

ax.view_init(30, -30)

fig.tight_layout()

8.4 Reading And Writing Data To File

Data in DataFrame can be exported into csv (comma separated value) and Excel file. The users can also create a DataFrame from data in csv and Excel file, the data can then be processed.



In [74]:

    
# Export data to a csv file but separated with < TAB > rather than comma
# the default separation is with comma

df.to_csv('Tutorial8/Kedai.txt', sep='\t')



In [75]:

    
# Export to Excel file

df.to_excel('Tutorial8/Kedai.xlsx', sheet_name = 'Tarikh', index = True)



In [76]:

    
# Importing data from csv file (without header)

from_file = pd.read_csv('Tutorial8/Malaysian_Town.txt',sep='\t',header=None)
from_file.head()









    Out[76]:






  
    
      
      0
      1
      2
      3
      4
    
  
  
    
      0
      1
      Kuala Lumpur
      1,588,750
      NaN
      City
    
    
      1
      2
      Seberang Perai
      818,197
      Includes Butterworth, Bukit Mertajam
      Municipality
    
    
      2
      3
      Kajang
      795,522
      NaN
      Municipality
    
    
      3
      4
      Klang
      744,062
      NaN
      Municipality
    
    
      4
      5
      Subang Jaya
      708,296
      NaN
      Municipality



In [77]:

    
# Importing data from Excel file (with header (the first row) that became the column names)

from_excel = pd.read_excel('Tutorial8/Malaysian_Town.xlsx','Sheet1')
from_excel.head()









    Out[77]:






  
    
      
      No.
      City
      Population
      Area
      Type
    
  
  
    
      0
      1
      Kuala Lumpur
      1588750
      NaN
      City
    
    
      1
      2
      Seberang Perai
      818197
      Includes Butterworth, Bukit Mertajam
      Municipality
    
    
      2
      3
      Kajang
      795522
      NaN
      Municipality
    
    
      3
      4
      Klang
      744062
      NaN
      Municipality
    
    
      4
      5
      Subang Jaya
      708296
      NaN
      Municipality

Further $Pandas$ features can be found in http://pandas.pydata.org/.

	Kedai A	Kedai B	Kedai C	Kedai D	Kedai E
count	8.000000	8.000000	8.000000	8.000000	8.000000
mean	0.395553	-0.394076	-0.329596	0.570671	-0.443937
std	0.595483	0.990932	0.869785	1.014065	0.976833
min	-0.512127	-1.915919	-2.003851	-0.564646	-2.393798
25%	0.039041	-0.995064	-0.714466	-0.219390	-0.899388
50%	0.444569	-0.509921	-0.237328	0.330237	0.005886
75%	0.710271	0.299433	0.172267	1.388340	0.249428
max	1.309475	0.943705	0.851052	2.075217	0.361620

	Kedai A	Kedai B	Kedai C	Kedai D	Kedai E
2014-08-01	1.816414	-3.831838	0.195454	1.010481	0.577271
2014-08-02	0.508916	1.887410	0.791777	-0.973942	0.150561
2014-08-03	0.230723	-1.917941	-1.650923	-0.260392	0.472717
2014-08-04	1.269361	-0.235567	-0.582007	4.150435	-0.127018
2014-08-05	-0.379842	-1.804116	-1.354935	0.310465	-2.570687
2014-08-06	2.618951	1.507570	1.702104	-1.129292	0.723240
2014-08-07	1.288583	-2.206692	-0.367303	2.541870	-4.787595
2014-08-08	-1.024254	0.295964	-4.007702	3.481112	-1.541473

	Kedai A	Kedai B	Kedai C	Kedai D	Kedai E
2014-08-01	0.908207	-1.915919	0.097727	0.505241	0.288635
2014-08-02	0.254458	0.943705	0.395889	-0.486971	0.075281
2014-08-03	0.115361	-0.958970	-0.825462	-0.130196	0.236358
2014-08-04	0.634680	-0.117784	-0.291003	2.075217	-0.063509
2014-08-05	-0.189921	-0.902058	-0.677468	0.155232	-1.285344
2014-08-06	1.309475	0.753785	0.851052	-0.564646	0.361620
2014-08-07	0.644292	-1.103346	-0.183652	1.270935	-2.393798
2014-08-08	-0.512127	0.147982	-2.003851	1.740556	-0.770737

	Kedai A	Kedai B	Kedai C	Kedai D	Kedai E
2014-08-01	True	False	True	True	True
2014-08-02	True	True	True	False	True
2014-08-03	True	False	False	False	True
2014-08-04	True	False	False	True	False
2014-08-05	False	False	False	True	False
2014-08-06	True	True	True	False	True
2014-08-07	True	False	False	True	False
2014-08-08	False	True	False	True	False

	Kedai A	Kedai B	Kedai C	Kedai D	Kedai E
2014-08-01	-0.096283	NaN	-2.325578	-0.682720	-1.242591
2014-08-02	-1.368620	-0.057941	-0.926622	NaN	-2.586531
2014-08-03	-2.159685	NaN	NaN	NaN	-1.442406
2014-08-04	-0.454634	NaN	NaN	0.730066	NaN
2014-08-05	NaN	NaN	NaN	-1.862831	NaN
2014-08-06	0.269627	-0.282648	-0.161282	NaN	-1.017161
2014-08-07	-0.439604	NaN	NaN	0.239753	NaN
2014-08-08	NaN	-1.910666	NaN	0.554205	NaN

	0	1	2	3	4
0	1	Kuala Lumpur	1,588,750	NaN	City
1	2	Seberang Perai	818,197	Includes Butterworth, Bukit Mertajam	Municipality
2	3	Kajang	795,522	NaN	Municipality
3	4	Klang	744,062	NaN	Municipality
4	5	Subang Jaya	708,296	NaN	Municipality

	No.	City	Population	Area	Type
0	1	Kuala Lumpur	1588750	NaN	City
1	2	Seberang Perai	818197	Includes Butterworth, Bukit Mertajam	Municipality
2	3	Kajang	795522	NaN	Municipality
3	4	Klang	744062	NaN	Municipality
4	5	Subang Jaya	708296	NaN	Municipality