4-4 pandas DataFrame Basics



In [1]:
%matplotlib inline
import sys
print(sys.version)
import numpy as np
print(np.__version__)
import pandas as pd
print(pd.__version__)
import matplotlib.pyplot as plt


3.3.2 (v3.3.2:d047928ae3f6, May 13 2013, 13:52:24) 
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)]
1.9.2
0.16.2

Fundamentally, the DataFrame is just an abstraction, but it provides a ton of useful tools that you’re going to get to see. This video goes over the basic idea of the DataFrame as well as how to create one.


In [4]:
import string
upcase = [x for x in string.ascii_uppercase]
lcase = [x for x in string.ascii_lowercase]

In [5]:
print(upcase[:5], lcase[:5])


['A', 'B', 'C', 'D', 'E'] ['a', 'b', 'c', 'd', 'e']

You can create DataFrames by passing in np arrays, lists of series, or dictionaries.


In [6]:
pd.DataFrame([upcase, lcase])


Out[6]:
0 1 2 3 4 5 6 7 8 9 ... 16 17 18 19 20 21 22 23 24 25
0 A B C D E F G H I J ... Q R S T U V W X Y Z
1 a b c d e f g h i j ... q r s t u v w x y z

2 rows × 26 columns
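
The other inputs mentioned above work in much the same way. Here is a minimal sketch using small made-up values (the names `arr`, `s1`, and `s2` are illustrative and not part of the original notebook):

```python
import numpy as np
import pandas as pd

# From a 2-D NumPy array: each row of the array becomes a row of the frame.
arr = np.array([[1, 2, 3], [4, 5, 6]])
from_array = pd.DataFrame(arr)

# From a list of Series: each Series becomes a row, and the Series index
# supplies the column labels.
s1 = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
s2 = pd.Series([4, 5, 6], index=['x', 'y', 'z'])
from_series = pd.DataFrame([s1, s2])

# From a dictionary: each key becomes a column name.
from_dict = pd.DataFrame({'x': [1, 4], 'y': [2, 5], 'z': [3, 6]})

print(from_array.shape, from_series.shape, from_dict.shape)
```

All three frames end up with two rows and three columns; only the labels differ.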

We’ll be covering a lot of different aspects here, but as always we’re going to start with the simple stuff. As a simplification, a DataFrame is like an Excel table or a SQL table: you’ve got columns and rows.

In more specific pandas terms, it's a more powerful list of Series. Each column is a Series of data, and the columns share one index, so the values in a given row are related to each other.

You can see that if we just pass in a list of lists, it treats each inner list as a row. Of course, if that’s an issue, we can just transpose it and we’ll get them as columns.


In [7]:
pd.DataFrame([upcase, lcase]).T


Out[7]:
0 1
0 A a
1 B b
2 C c
3 D d
4 E e
5 F f
6 G g
7 H h
8 I i
9 J j
10 K k
11 L l
12 M m
13 N n
14 O o
15 P p
16 Q q
17 R r
18 S s
19 T t
20 U u
21 V v
22 W w
23 X x
24 Y y
25 Z z

This should be familiar because it’s the same way that we transpose ndarrays in numpy.
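
To make the comparison concrete, here is a small sketch that transposes the same data as an ndarray and as a DataFrame. The three-letter array is made up to keep the example short:

```python
import numpy as np
import pandas as pd

arr = np.array([['A', 'B', 'C'], ['a', 'b', 'c']])
df = pd.DataFrame(arr)

# .T swaps rows and columns on both objects.
print(arr.shape, arr.T.shape)
print(df.shape, df.T.shape)

# After transposing, the first row pairs the first item of each original row.
print(df.T.iloc[0].tolist())
```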

Of course, we can also specify them as explicit columns by passing in a dictionary where the keys are the column names and the values are the lists of items for each column (one item per row).


In [8]:
letters = pd.DataFrame({'lowercase':lcase, 'uppercase':upcase})
letters.head()


Out[8]:
lowercase uppercase
0 a A
1 b B
2 c C
3 d D
4 e E

Now you’ll see that if these lengths are not the same, we’ll get a ValueError, so it’s worth checking that your data is clean before importing it or using it to create a DataFrame.


In [9]:
pd.DataFrame({'lowercase':lcase + [0], 'uppercase':upcase})


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-9-7de5286cc816> in <module>()
----> 1 pd.DataFrame({'lowercase':lcase + [0], 'uppercase':upcase})

/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/site-packages/pandas/core/frame.py in __init__(self, data, index, columns, dtype, copy)
    212                                  dtype=dtype, copy=copy)
    213         elif isinstance(data, dict):
--> 214             mgr = self._init_dict(data, index, columns, dtype=dtype)
    215         elif isinstance(data, ma.MaskedArray):
    216             import numpy.ma.mrecords as mrecords

/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/site-packages/pandas/core/frame.py in _init_dict(self, data, index, columns, dtype)
    339 
    340         return _arrays_to_mgr(arrays, data_names, index, columns,
--> 341                               dtype=dtype)
    342 
    343     def _init_ndarray(self, values, index, columns, dtype=None,

/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/site-packages/pandas/core/frame.py in _arrays_to_mgr(arrays, arr_names, index, columns, dtype)
   4796     # figure out the index, if necessary
   4797     if index is None:
-> 4798         index = extract_index(arrays)
   4799     else:
   4800         index = _ensure_index(index)

/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/site-packages/pandas/core/frame.py in extract_index(data)
   4844             lengths = list(set(raw_lengths))
   4845             if len(lengths) > 1:
-> 4846                 raise ValueError('arrays must all be same length')
   4847 
   4848             if have_dicts:

ValueError: arrays must all be same length
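
One way to catch this before building the frame is to compare the lengths up front. This is only a sketch: `lengths_match` is a hypothetical helper written for this example, not something pandas provides, and the exact wording of the error message may differ between pandas versions.

```python
import pandas as pd

def lengths_match(columns):
    """Return True if every list in the dict has the same length."""
    return len({len(values) for values in columns.values()}) <= 1

good = {'lowercase': ['a', 'b'], 'uppercase': ['A', 'B']}
bad = {'lowercase': ['a', 'b', 0], 'uppercase': ['A', 'B']}

print(lengths_match(good))
print(lengths_match(bad))

# Only build the frame when the check passes.
if lengths_match(good):
    frame = pd.DataFrame(good)

# Constructing from mismatched lists raises ValueError.
try:
    pd.DataFrame(bad)
except ValueError as err:
    print('ValueError:', err)
```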

In [10]:
letters.head()


Out[10]:
lowercase uppercase
0 a A
1 b B
2 c C
3 d D
4 e E

We can rename the columns easily and even add a new one through a relatively simple dictionary-like assignment. I'll go over some more complex methods later on.


In [11]:
letters.columns = ['LowerCase','UpperCase']

In [12]:
np.random.seed(25)
letters['Number'] = np.random.random_integers(1,50,26)

In [13]:
letters


Out[13]:
LowerCase UpperCase Number
0 a A 5
1 b B 27
2 c C 16
3 d D 24
4 e E 45
5 f F 9
6 g G 29
7 h H 5
8 i I 26
9 j J 32
10 k K 6
11 l L 2
12 m M 40
13 n N 4
14 o O 25
15 p P 4
16 q Q 21
17 r R 46
18 s S 4
19 t T 2
20 u U 23
21 v V 32
22 w W 49
23 x X 48
24 y Y 10
25 z Z 17
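
As an alternative to overwriting `columns`, `rename` maps old names to new ones and returns a new DataFrame. The sketch below uses a small made-up frame. It also uses `np.random.randint` instead of `np.random.random_integers`, since the latter is deprecated in later NumPy releases; I'm not claiming the generated values match the output above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'lowercase': ['a', 'b', 'c'], 'uppercase': ['A', 'B', 'C']})

# rename() returns a new DataFrame and leaves df itself unchanged.
renamed = df.rename(columns={'lowercase': 'LowerCase', 'uppercase': 'UpperCase'})

# randint's upper bound is exclusive, so 51 here allows values 1 through 50.
np.random.seed(25)
renamed['Number'] = np.random.randint(1, 51, len(renamed))

print(renamed.columns.tolist())
print(df.columns.tolist())
```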

Now, just like Series, DataFrames have data types. We can get those from the dtypes attribute of the DataFrame, which lists the data type of each column.


In [14]:
letters.dtypes


Out[14]:
LowerCase    object
UpperCase    object
Number        int64
dtype: object
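
Since each column is a Series, each column also has a single dtype of its own, and it can be converted with `astype`. A short sketch on a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({'Letter': ['a', 'b', 'c'], 'Number': [5, 27, 16]})

# A single column is a Series, so it has exactly one dtype.
print(df['Number'].dtype)

# astype() returns a converted copy; here the integers become floats.
as_float = df['Number'].astype('float64')
print(as_float.dtype)
```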

In [15]:
letters.index = lcase
letters


Out[15]:
LowerCase UpperCase Number
a a A 5
b b B 27
c c C 16
d d D 24
e e E 45
f f F 9
g g G 29
h h H 5
i i I 26
j j J 32
k k K 6
l l L 2
m m M 40
n n N 4
o o O 25
p p P 4
q q Q 21
r r R 46
s s S 4
t t T 2
u u U 23
v v V 32
w w W 49
x x X 48
y y Y 10
z z Z 17

Of course, we can sort by a specific column or by the index (the default).


In [16]:
letters.sort('Number')


Out[16]:
LowerCase UpperCase Number
t t T 2
l l L 2
s s S 4
p p P 4
n n N 4
a a A 5
h h H 5
k k K 6
f f F 9
y y Y 10
c c C 16
z z Z 17
q q Q 21
u u U 23
d d D 24
o o O 25
i i I 26
b b B 27
g g G 29
v v V 32
j j J 32
m m M 40
e e E 45
r r R 46
x x X 48
w w W 49

In [17]:
letters.sort()


Out[17]:
LowerCase UpperCase Number
a a A 5
b b B 27
c c C 16
d d D 24
e e E 45
f f F 9
g g G 29
h h H 5
i i I 26
j j J 32
k k K 6
l l L 2
m m M 40
n n N 4
o o O 25
p p P 4
q q Q 21
r r R 46
s s S 4
t t T 2
u u U 23
v v V 32
w w W 49
x x X 48
y y Y 10
z z Z 17
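
The `sort` calls above match the pandas 0.16.2 used in this notebook. Later pandas releases replaced `DataFrame.sort` with `sort_values` (sort by column) and `sort_index` (sort by index), so on a newer install the equivalent would look roughly like this sketch on a small made-up frame:

```python
import pandas as pd

df = pd.DataFrame({'Number': [27, 5, 16]}, index=['b', 'a', 'c'])

# Sort rows by the values in a column.
by_number = df.sort_values('Number')

# Sort rows by the index labels.
by_index = df.sort_index()

print(by_number.index.tolist())
print(by_index.index.tolist())
```

Both methods return a new DataFrame rather than changing `df` in place.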

We've seen how to query for one column, and querying for multiple columns isn't much more difficult.

We can get the upper and lower case columns by passing in a list of column names.


In [18]:
letters[['LowerCase','UpperCase']].head()


Out[18]:
LowerCase UpperCase
a a A
b b B
c c C
d d D
e e E
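
The difference between the single and double brackets is worth seeing once. In this sketch (on a made-up two-row frame), a single column name gives back a Series, while a list of names gives back a DataFrame, even when the list has only one name in it:

```python
import pandas as pd

df = pd.DataFrame({
    'LowerCase': ['a', 'b'],
    'UpperCase': ['A', 'B'],
    'Number': [5, 27],
})

one = df['LowerCase']                   # a Series
two = df[['LowerCase', 'UpperCase']]    # a DataFrame with two columns
still_frame = df[['LowerCase']]         # a DataFrame with one column

print(type(one).__name__, type(two).__name__, type(still_frame).__name__)
```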

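We can also query by the index. We went over a lot of that in the Series section, and a lot of the same applies here.

We can query by index location or by index label (here, the letters). Note that slicing by location leaves out the end point, while slicing by label includes it.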


In [19]:
letters.iloc[5:10]


Out[19]:
LowerCase UpperCase Number
f f F 9
g g G 29
h h H 5
i i I 26
j j J 32

In [20]:
letters["f":"k"]


Out[20]:
LowerCase UpperCase Number
f f F 9
g g G 29
h h H 5
i i I 26
j j J 32
k k K 6
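
Here is the same pair of slices as a self-contained sketch, with `.loc` used as the explicit form of label-based slicing. The frame is made up, with a placeholder `Number` column:

```python
import string
import pandas as pd

lcase = list(string.ascii_lowercase)
df = pd.DataFrame({'Number': list(range(26))}, index=lcase)

# Position-based: the end position (10) is excluded.
by_position = df.iloc[5:10]

# Label-based: the end label ('k') is included.
by_label = df.loc['f':'k']

print(by_position.index.tolist())
print(by_label.index.tolist())
```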

That covers the basic concept of the pandas DataFrame.

We covered how indexes integrate with both Series and DataFrames. We've covered how numpy underlies a lot of the power we've got and, to be honest, we've really covered a lot of the fundamentals for doing data analysis with Python and pandas.

Although these videos have been using fabricated data, we have covered a lot of the methods that you’re going to be using on a regular basis when you analyze data.

Let's go ahead and dive into our first data set.