So far, we have manipulated data stored in NumPy arrays. Let us now consider 2D data.
In [1]:
import numpy as np
ar = 0.5 * np.eye(3)
ar[2, 1] = 1
ar
Out[1]:
We could visualize it with Matplotlib.
In [2]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.imshow(ar, cmap=plt.cm.gray)
Out[2]:
Raw data could look like this. Say that columns hold variables and rows hold observations (or records). We may want to label the data (set some metadata). We may also want to handle non-numerical data. In that case, we can store our data in a DataFrame, a 2D labelled data structure with columns of potentially different types.
In [3]:
import pandas as pd
df = pd.DataFrame(ar)
df
Out[3]:
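To illustrate "columns of potentially different types", here is a minimal sketch. The frame `records`, its column names, and its values are made up for this example and are not part of the data used in this notebook.

```python
import pandas as pd

# Hypothetical records: names and values are illustrative only.
records = pd.DataFrame({
    'name': ['a', 'b', 'c'],         # text
    'quantity': [1, 2, 3],           # integers
    'ratio': [0.5, 0.25, 0.125],     # floats
})

# Each column keeps its own data type.
print(records.dtypes)
```

Each column is stored with its own dtype, which is what distinguishes a DataFrame from a plain 2D NumPy array with a single dtype.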
The DataFrame object has attributes...
In [4]:
df.size
df.shape
Out[4]:
... and methods, as we shall see in the following. For now, let us label our data.
In [5]:
df.columns = ['red', 'green', 'blue']
Note that, alternatively, you could have done df.rename(columns={0: 'red', 1: 'green', 2: 'blue'}, inplace=True).
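As a small sketch of how rename behaves when inplace=True is left out, the example below rebuilds a comparable frame under the hypothetical name `frame` (so that `df` is left alone) and shows that rename returns a new, relabelled DataFrame.

```python
import numpy as np
import pandas as pd

# A stand-in for the frame above, built without labels.
frame = pd.DataFrame(0.5 * np.eye(3))

# Without inplace=True, rename() returns a new DataFrame
# and leaves the original untouched.
renamed = frame.rename(columns={0: 'red', 1: 'green', 2: 'blue'})

print(list(frame.columns))    # labels of the original frame
print(list(renamed.columns))  # labels of the returned frame
```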
In [6]:
df
Out[6]:
In [7]:
df.plot()
Out[7]:
(This is a terrible visualization though... 3-cycle needed!)
Create a DataFrame df2, equal to df (with the same values for each column), by passing a dictionary to pd.DataFrame(). You can check your answer by running pd.testing.assert_frame_equal(df, df2, check_like=True).
What does df[['green']] return? And df['green']?
A Series is a 1D labelled data structure.
In [8]:
df['green']
Out[8]:
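The difference between selecting with a single label and selecting with a list of labels can be checked directly. This is a minimal sketch on a stand-in frame (the name `frame` is introduced here for illustration only).

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(0.5 * np.eye(3), columns=['red', 'green', 'blue'])

single = frame['green']      # selection with one label
double = frame[['green']]    # selection with a list of labels

# Print the type name and the shape of each selection.
print(type(single).__name__, single.shape)
print(type(double).__name__, double.shape)
```

A single label yields a 1D Series, while a list of labels yields a 2D DataFrame, even when the list holds only one label.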
It can hold any data type.
In [9]:
pd.Series(range(10))
Out[9]:
In [10]:
s = pd.Series(['first', 'second', 'third'])
s
Out[10]:
In [11]:
t = pd.Series([pd.Timestamp('2017-09-01'), pd.Timestamp('2017-09-02'), pd.Timestamp('2017-09-03')])
t
Out[11]:
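The data type held by a Series is exposed by its dtype attribute. The sketch below uses hypothetical names (`numbers`, `dates`) and builds the dates with pd.to_datetime, which is an alternative to listing pd.Timestamp objects one by one.

```python
import pandas as pd

numbers = pd.Series(range(10))
dates = pd.Series(pd.to_datetime(['2017-09-01', '2017-09-02', '2017-09-03']))

# Each Series reports the type of the values it holds.
print(numbers.dtype)
print(dates.dtype)
```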
In [12]:
alpha = pd.Series(0.1 * np.arange(1, 4))
alpha.plot(kind='bar')
Out[12]:
In [13]:
df['alpha'] = alpha
df
Out[13]:
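When a Series is assigned as a new column, pandas matches values by index label rather than by position. The sketch below makes this visible with a hypothetical Series `shifted` whose index only partly overlaps that of a stand-in frame.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(0.5 * np.eye(3), columns=['red', 'green', 'blue'])

# The frame's index is 0, 1, 2; this Series is indexed 1, 2, 3.
shifted = pd.Series([0.1, 0.2, 0.3], index=[1, 2, 3])

# Values are aligned on index labels: row 0 has no match and becomes NaN,
# and the value labelled 3 is dropped.
frame['alpha'] = shifted
print(frame)
```

In the cell above, alpha and df share the same default index, so the alignment is invisible there.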
The Index object stores axis labels for Series and DataFrames.
In [14]:
alpha.index
Out[14]:
In [15]:
df.index
Out[15]:
In [16]:
alpha
Out[16]:
In [17]:
alpha.index = s
In [18]:
alpha
Out[18]:
In [19]:
alpha.index
Out[19]:
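Once the index carries meaningful labels, values can be looked up by label as well as by position. This is a minimal sketch with hypothetical names (`labels`, `values`) mirroring s and alpha.

```python
import numpy as np
import pandas as pd

labels = pd.Series(['first', 'second', 'third'])
values = pd.Series(0.1 * np.arange(1, 4))
values.index = labels

print(values.loc['second'])  # lookup by label
print(values.iloc[1])        # lookup by position, same element
```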
In [20]:
df.set_index(s)
Out[20]:
In [21]:
df.set_index(s, inplace=True)
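The two calls above differ in where the result goes: without inplace=True, set_index returns a relabelled copy and leaves the original as it was. A small sketch on a stand-in frame (the names `frame`, `labels` and `relabelled` are hypothetical):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(0.5 * np.eye(3), columns=['red', 'green', 'blue'])
labels = pd.Series(['first', 'second', 'third'])

relabelled = frame.set_index(labels)  # returns a new DataFrame

print(list(frame.index))       # the original keeps its default index
print(list(relabelled.index))  # the copy carries the new labels

# Rows of the relabelled frame can now be selected by label.
print(relabelled.loc['second', 'green'])
```

Assigning the result to a new name, as done here, is an alternative to inplace=True that keeps the original frame available.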