Pandas objects

So far, we have manipulated data stored in NumPy arrays. Let us now consider 2D data.



In [1]:

    
import numpy as np

ar = 0.5 * np.eye(3)
ar[2, 1] = 1
ar









    Out[1]:





array([[ 0.5,  0. ,  0. ],
       [ 0. ,  0.5,  0. ],
       [ 0. ,  1. ,  0.5]])

We could visualize it with Matplotlib.



In [2]:

    
import matplotlib.pyplot as plt
%matplotlib inline

plt.imshow(ar, cmap=plt.cm.gray)









    Out[2]:





<matplotlib.image.AxesImage at 0x7f7aefc3b550>

Raw data could look like this. Suppose that columns hold variables and rows hold observations (or records). We may want to label the data (that is, attach some metadata), and we may also want to handle non-numerical data. In that case, we can store the data in a DataFrame, a 2D labelled data structure whose columns may have different types.
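To make "columns of potentially different types" concrete, here is a minimal sketch that builds a small DataFrame from a list of records. The variable name obs and the column names intensity, label and saturated are made up for this illustration and are not used elsewhere in the notebook.

```python
import pandas as pd

# Each tuple is one observation (row); each position is one variable (column).
# The names 'intensity', 'label' and 'saturated' are illustrative only.
records = [
    (0.5, 'dark', False),
    (0.5, 'dark', False),
    (1.0, 'bright', True),
]
obs = pd.DataFrame(records, columns=['intensity', 'label', 'saturated'])

# One dtype per column: a float column, a text column (object or string
# dtype, depending on the pandas version) and a boolean column.
print(obs.dtypes)
```

A NumPy array would have to fall back to a single common type to hold these three columns together, whereas the DataFrame keeps one type per column.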

The DataFrame object



In [3]:

    
import pandas as pd

df = pd.DataFrame(ar)

df

The DataFrame object has attributes...



In [4]:

    
df.size

df.shape









    Out[4]:





(3, 3)

... and methods, as we shall see in the following. For now, let us label our data.
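Before moving on, note that a notebook cell only displays the value of its last expression, which is why Out[4] shows the shape but not the size. A minimal sketch that prints both attributes explicitly (the name frame is an assumption, standing in for df before it is relabelled):

```python
import numpy as np
import pandas as pd

# 'frame' is an illustrative stand-in for the unlabelled df.
frame = pd.DataFrame(0.5 * np.eye(3))

print(frame.size)   # total number of elements
print(frame.shape)  # (number of rows, number of columns)
```

Here size is simply the product of the two entries of shape.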



In [5]:

    
df.columns = ['red', 'green', 'blue']

Note that, alternatively, you could have done df.rename(columns={0: 'red', 1: 'green', 2: 'blue'}, inplace=True).
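The inplace=True argument matters here: without it, rename leaves the original object untouched and returns a relabelled copy. A minimal sketch on a separate frame, so that df itself is not modified (the names demo and renamed are illustrative):

```python
import numpy as np
import pandas as pd

# 'demo' plays the role of the unlabelled df.
demo = pd.DataFrame(0.5 * np.eye(3))

# Without inplace=True, rename returns a new DataFrame.
renamed = demo.rename(columns={0: 'red', 1: 'green', 2: 'blue'})

print(list(demo.columns))     # the original labels are unchanged
print(list(renamed.columns))  # the copy carries the new labels
```

Rebinding the result, as in demo = demo.rename(...), is a common alternative to inplace=True.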



In [6]:

    
df



In [7]:

    
df.plot()









    Out[7]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f7aebff7b70>

(This is a terrible visualization though... 3-cycle needed!)

Hands-on exercises

Create another DataFrame, df2, equal to df (with the same values for each column) by passing a dictionary to pd.DataFrame(). You can check your answer by running pd.testing.assert_frame_equal(df, df2, check_like=True).
What is the type of the object df[['green']]?
What is the type of the object df['green']?

The Series object

A Series is a 1D labelled data structure.



In [8]:

    
df['green']









    Out[8]:





0    0.0
1    0.5
2    1.0
Name: green, dtype: float64

It can hold any data type.



In [9]:

    
pd.Series(range(10))









    Out[9]:





0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
dtype: int64



In [10]:

    
s = pd.Series(['first', 'second', 'third'])

s









    Out[10]:





0     first
1    second
2     third
dtype: object



In [11]:

    
t = pd.Series([pd.Timestamp('2017-09-01'), pd.Timestamp('2017-09-02'), pd.Timestamp('2017-09-03')])

t









    Out[11]:





0   2017-09-01
1   2017-09-02
2   2017-09-03
dtype: datetime64[ns]



In [12]:

    
alpha = pd.Series(0.1 * np.arange(1, 4))

alpha.plot(kind='bar')









    Out[12]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f7aebf51908>



In [13]:

    
df['alpha'] = alpha

df

Hands-on exercises

Create a series equal (element-wise) to the product of the 'green' variable and the 'alpha' variable. (Hint: It works like NumPy arrays.)
Label this series as 'pre_multiplied_green'. (Hint: Use tab completion to explore the list of attributes and/or scroll up to see which attribute should be set.)

The Index object

The Index object stores axis labels for Series and DataFrames.
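As a minimal sketch of what axis labels are good for, the example below builds a Series with explicit labels and then looks a value up by label rather than by position. The name weights is made up for this illustration; the cells that follow work with alpha instead.

```python
import pandas as pd

# 'weights' is an illustrative Series with explicit labels.
weights = pd.Series([0.1, 0.2, 0.3], index=['first', 'second', 'third'])

print(weights.index)              # the Index object holding the labels
print(weights.loc['second'])      # label-based lookup
print('third' in weights.index)   # membership test on the labels
```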



In [14]:

    
alpha.index









    Out[14]:





RangeIndex(start=0, stop=3, step=1)



In [15]:

    
df.index









    Out[15]:





RangeIndex(start=0, stop=3, step=1)



In [16]:

    
alpha









    Out[16]:





0    0.1
1    0.2
2    0.3
dtype: float64



In [17]:

    
alpha.index = s



In [18]:

    
alpha









    Out[18]:





first     0.1
second    0.2
third     0.3
dtype: float64



In [19]:

    
alpha.index









    Out[19]:





Index(['first', 'second', 'third'], dtype='object')



In [20]:

    
df.set_index(s)



In [21]:

    
df.set_index(s, inplace=True)
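The difference between the last two cells is that set_index returns a new DataFrame by default, whereas inplace=True modifies the existing one and returns None. A minimal sketch on separate objects (the names frame, labels, relabelled and modified are illustrative and stand in for df and s):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(0.5 * np.eye(3), columns=['red', 'green', 'blue'])
labels = pd.Series(['first', 'second', 'third'])

# Default behaviour: 'frame' keeps its RangeIndex, the result is a new object.
relabelled = frame.set_index(labels)
print(frame.index)
print(relabelled.index)

# With inplace=True, the object itself is modified and nothing is returned.
modified = frame.copy()
result = modified.set_index(labels, inplace=True)
print(result)
print(modified.index)
```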

Hands-on exercises

Set the index of df to be t, not s.
What is the difference between the index of df and that of alpha?
What would you call the object df['green']?