General pandas Concepts


In [2]:
import sys
print(sys.version)
import numpy as np
print(np.__version__)
import pandas as pd
print(pd.__version__)


3.3.2 (v3.3.2:d047928ae3f6, May 13 2013, 13:52:24) 
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)]
1.9.2
0.16.2

Now we’ve covered numpy, the basis for pandas, along with some of the more advanced Python concepts like list comprehensions and lambda functions. Let’s jump back to our roadmap.

We’ve covered the general ecosystem and a lot of numpy. Now let’s get our hands dirty with some real data and actually use pandas. I hope you’ve watched the numpy videos that we covered earlier. They may seem academic, but they provide a solid foundation for what we’re going to learn now.

I’m going to breeze through a couple of subjects right now. Don’t feel the need to take notes or even try this code yourself. You can if you like, but it’s mainly to introduce you to the power of pandas, not for you to copy.

Pandas is made up of a few core types.

We’ve got the Index. The Index holds the labels that we use to query the data in a Series or DataFrame.


In [3]:
pd.Index


Out[3]:
pandas.core.index.Index
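To make the Index a little more concrete, here is a minimal sketch. The name `labels` and the three letters in it are made up for illustration; they are not part of the notebook above.

```python
import pandas as pd

# Build an Index directly from a list of labels (hypothetical example data).
labels = pd.Index(['a', 'b', 'c'])

# An Index can report where a label sits and whether a label is present.
position_of_b = labels.get_loc('b')   # integer position of the label 'b'
has_c = 'c' in labels                 # membership test against the labels

# An Index is immutable: assigning to one of its items raises a TypeError.
try:
    labels[0] = 'z'
    index_is_mutable = True
except TypeError:
    index_is_mutable = False

print(position_of_b, has_c, index_is_mutable)
```

Because the labels can't be changed in place, you replace an index as a whole rather than editing it item by item.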

We’ve got the Series. The Series is like a one-dimensional numpy array. It has some helper functions and an index that allows for querying the data in simple ways.

We can make a simple Series from a numpy array.


In [4]:
pd.Series


Out[4]:
pandas.core.series.Series

In [5]:
series_ex = pd.Series(np.arange(26))
series_ex


Out[5]:
0      0
1      1
2      2
3      3
4      4
5      5
6      6
7      7
8      8
9      9
10    10
11    11
12    12
13    13
14    14
15    15
16    16
17    17
18    18
19    19
20    20
21    21
22    22
23    23
24    24
25    25
dtype: int64

Now that we’ve created it, we can see it has an index, which we just talked about, as well as values. When we print these out, they look similar, just like numpy arrays. Now here is where the Series gets powerful.


In [6]:
series_ex.index


Out[6]:
Int64Index([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
            17, 18, 19, 20, 21, 22, 23, 24, 25],
           dtype='int64')
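Here is a small sketch that looks at both halves of a Series side by side. The name `demo` is just a hypothetical five-element Series, and the exact index class you see printed can differ between pandas versions.

```python
import numpy as np
import pandas as pd

# A small, hypothetical Series used only for this illustration.
demo = pd.Series(np.arange(5))

values = demo.values   # the underlying data, as a numpy array
index = demo.index     # the labels attached to each value

print(type(values).__name__)
print(values.tolist())
print(list(index))
```

With the default index, the labels are simply the positions, which is why the two look so similar here.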

We can replace the index with our own index. In this example I’ll use the lowercase ASCII letters.


In [9]:
import string
lcase = string.ascii_lowercase
ucase = string.ascii_uppercase
print(lcase, ucase)


abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ

In [10]:
lcase = list(lcase)
ucase = list(ucase)
print(lcase)
print(ucase)


['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z']

In [11]:
series_ex.index = lcase

In [12]:
series_ex.index


Out[12]:
Index(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n',
       'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z'],
      dtype='object')

In [13]:
series_ex


Out[13]:
a     0
b     1
c     2
d     3
e     4
f     5
g     6
h     7
i     8
j     9
k    10
l    11
m    12
n    13
o    14
p    15
q    16
r    17
s    18
t    19
u    20
v    21
w    22
x    23
y    24
z    25
dtype: int64
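As a side note, the labels can also be supplied when the Series is created, using the `index` argument of `pd.Series`. This is only a sketch; `by_letter` is a hypothetical name, not something used elsewhere in this notebook.

```python
import string
import numpy as np
import pandas as pd

# Pass the labels at construction time instead of assigning them afterwards.
by_letter = pd.Series(np.arange(26), index=list(string.ascii_lowercase))

# The number of labels has to match the number of values.
try:
    by_letter.index = ['only', 'two']
    length_mismatch_allowed = True
except ValueError:
    length_mismatch_allowed = False

print(by_letter.head(3))
print(length_mismatch_allowed)
```

Either route should leave you with the same labeled Series; which one you pick is mostly a matter of taste.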

Now we can query just like we would with an array. You can think of the Series as an extremely powerful array.

We can query either sections or specific values.


In [14]:
series_ex.loc['d':'k']


Out[14]:
d     3
e     4
f     5
g     6
h     7
i     8
j     9
k    10
dtype: int64

In [15]:
series_ex.loc['f']


Out[15]:
5
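One detail worth seeing early: slicing by label includes both endpoints, while slicing by position follows the usual Python rule and leaves the end out. The sketch below rebuilds a lettered Series under the hypothetical name `s` and compares `.loc` (labels) with `.iloc` (positions).

```python
import string
import numpy as np
import pandas as pd

# Rebuild a lettered Series so this block runs on its own.
s = pd.Series(np.arange(26), index=list(string.ascii_lowercase))

by_label = s.loc['d':'k']     # label slice: 'd' through 'k', both included
by_position = s.iloc[3:11]    # positional slice: positions 3 to 10, end excluded
single = s.loc['f']           # a single label returns a single value

print(by_label.equals(by_position))
print(single)
```

Being explicit about labels versus positions helps avoid surprises once the index itself contains integers.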

Now don’t worry about the functions that I’m using. We’re going to go over those in detail - I just wanted to introduce the concept.

We’ve got the DataFrame, which is like a matrix or a collection of Series. It also has an index (or multiple indexes).


In [16]:
pd.DataFrame


Out[16]:
pandas.core.frame.DataFrame

Let’s go ahead and create one. We’ll make it from the lowercase letters, the uppercase letters, and a number range.


In [19]:
letters = pd.DataFrame([lcase, ucase, list(range(26))])
letters


Out[19]:
   0  1  2  3  4  5  6  7  8  9  ...  16  17  18  19  20  21  22  23  24  25
0  a  b  c  d  e  f  g  h  i  j  ...   q   r   s   t   u   v   w   x   y   z
1  A  B  C  D  E  F  G  H  I  J  ...   Q   R   S   T   U   V   W   X   Y   Z
2  0  1  2  3  4  5  6  7  8  9  ...  16  17  18  19  20  21  22  23  24  25

3 rows × 26 columns

Just like a numpy array, we can transpose it.


In [20]:
letters = letters.transpose()
letters.head()


Out[20]:
   0  1  2
0  a  A  0
1  b  B  1
2  c  C  2
3  d  D  3
4  e  E  4
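As an alternative sketch, a DataFrame can also be built column by column from a dictionary, which skips the transpose and names the columns up front. The variable `letters_by_column` is hypothetical and separate from the `letters` frame in this notebook. A frame built from mixed rows and then transposed will typically hold its numbers in a generic object column, whereas here the numbers should come out as an integer column.

```python
import string
import pandas as pd

# Each dictionary key becomes a column name; each list becomes that column.
letters_by_column = pd.DataFrame({
    'lowercase': list(string.ascii_lowercase),
    'uppercase': list(string.ascii_uppercase),
    'number': list(range(26)),
})

print(letters_by_column.head())
print(letters_by_column.dtypes)
```

Thinking of the DataFrame as a set of named columns that share one index lines up with the idea of it being a collection of Series.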

In [21]:
letters.columns


Out[21]:
Int64Index([0, 1, 2], dtype='int64')

In [22]:
letters.index


Out[22]:
Int64Index([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
            17, 18, 19, 20, 21, 22, 23, 24, 25],
           dtype='int64')

But now that we have columns as well as an index, we can rename the columns to better describe and query the data.


In [23]:
letters.columns = ['lowercase','uppercase','number']

In [24]:
letters.lowercase


Out[24]:
0     a
1     b
2     c
3     d
4     e
5     f
6     g
7     h
8     i
9     j
10    k
11    l
12    m
13    n
14    o
15    p
16    q
17    r
18    s
19    t
20    u
21    v
22    w
23    x
24    y
25    z
Name: lowercase, dtype: object

In [25]:
letters['lowercase']


Out[25]:
0     a
1     b
2     c
3     d
4     e
5     f
6     g
7     h
8     i
9     j
10    k
11    l
12    m
13    n
14    o
15    p
16    q
17    r
18    s
19    t
20    u
21    v
22    w
23    x
24    y
25    z
Name: lowercase, dtype: object
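Both styles return the same column, but attribute access only works when the column name is a valid Python name that doesn't clash with something the DataFrame already defines. Here is a minimal sketch with a hypothetical frame `df` whose second column is deliberately named `count`, the same name as a DataFrame method.

```python
import pandas as pd

# Hypothetical frame: 'count' collides with the DataFrame.count method.
df = pd.DataFrame({'lowercase': ['a', 'b'], 'count': [1, 2]})

# For an ordinary name, attribute and bracket access give the same Series.
same_column = df.lowercase.equals(df['lowercase'])

# For a clashing name, attribute access finds the method, not the column.
attribute_is_method = callable(df.count)
count_column = df['count'].tolist()

print(same_column, attribute_is_method, count_column)
```

Bracket access works for any column name, so it is the safer habit when names come from real data.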

We can even set up a date range to associate each letter with a date. Now obviously this isn’t too helpful for the alphabet, but this allows you to do some amazing things once you are analyzing real data.


In [26]:
letters.index = pd.date_range('9/1/2012',periods=26)

In [27]:
letters


Out[27]:
           lowercase uppercase  number
2012-09-01         a         A       0
2012-09-02         b         B       1
2012-09-03         c         C       2
2012-09-04         d         D       3
2012-09-05         e         E       4
2012-09-06         f         F       5
2012-09-07         g         G       6
2012-09-08         h         H       7
2012-09-09         i         I       8
2012-09-10         j         J       9
2012-09-11         k         K      10
2012-09-12         l         L      11
2012-09-13         m         M      12
2012-09-14         n         N      13
2012-09-15         o         O      14
2012-09-16         p         P      15
2012-09-17         q         Q      16
2012-09-18         r         R      17
2012-09-19         s         S      18
2012-09-20         t         T      19
2012-09-21         u         U      20
2012-09-22         v         V      21
2012-09-23         w         W      22
2012-09-24         x         X      23
2012-09-25         y         Y      24
2012-09-26         z         Z      25

In [28]:
letters['9-10-2012':'9-15-2012']


Out[28]:
           lowercase uppercase  number
2012-09-10         j         J       9
2012-09-11         k         K      10
2012-09-12         l         L      11
2012-09-13         m         M      12
2012-09-14         n         N      13
2012-09-15         o         O      14
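The same kind of date query can be written with `.loc`, which makes it explicit that we are selecting rows by label. This is a sketch on a hypothetical frame called `dated`, with dates written in ISO format to avoid any day/month ambiguity. It also shows that a partial date string such as a year and month selects every row in that period.

```python
import pandas as pd

# Hypothetical frame with one row per day, starting 1 September 2012.
dated = pd.DataFrame(
    {'number': list(range(26))},
    index=pd.date_range('2012-09-01', periods=26),
)

# A slice of date strings includes both endpoints.
window = dated.loc['2012-09-10':'2012-09-15']

# A partial date string selects everything in that month.
september = dated.loc['2012-09']

print(window)
print(len(september))
```

With real data, this is the sort of thing that makes it easy to pull out a week, a month, or a year without writing any date arithmetic yourself.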

Now if you don’t have any experience with pandas, this is going to seem like a lot! Don’t worry, we’re going to cover everything in the coming videos. I just wanted to give you an introduction to the amazingly expressive power of pandas and Python. We’ve seen the building blocks: the Index, the Series, and the DataFrame.

Now let’s dive deeper into each one.