Python for Data Analysis Lightning Tutorials

Pandas Cookbook Series

Python for Data Analysis Lightning Tutorials is a series of tutorials in Data Analysis, Statistics, and Graphics using Python. The Pandas Cookbook series of tutorials provides recipes for common tasks and moves on to more advanced topics in statistics and time series analysis.

Created by Alfred Essa, Aug 8, 2013

Note: The IPython Notebook and data files can be found on my GitHub site: github/alfredessa

Chapter 1: Data Structures

1.1 Problem. How can I create a Series object in Pandas?

1.11 What is a Series?

The Series data structure in Pandas is a one-dimensional labeled array.

  • Data in the array can be of any type (integers, strings, floating point numbers, Python objects, etc.).
  • Data within a given Series is homogeneous: all elements share a single dtype, and mixed inputs are converted to a common type.

1.12 Series is ndarray-like and dict-like

Pandas Series objects are amphibian in character, exhibiting both ndarray-like and dict-like properties. See Discussion section below.

1.13 How can I Create a Series?

The basic method to create a Series:

- s = pd.Series(data, index=index)

Here, data can be different things, including:

- a list
- an array
- a dictionary
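
The list and dictionary cases are worked through in the examples that follow; the array case is not, so here is a minimal sketch of it. The names `arr` and `s_arr` and the labels are illustrative and do not come from the original notebook.

```python
import numpy as np
import pandas as pd

# hypothetical data: a NumPy array of four floats
arr = np.array([1.5, 2.5, 3.5, 4.5])

# the index must have the same length as the array
s_arr = pd.Series(arr, index=['a', 'b', 'c', 'd'])

print(s_arr)
print(s_arr.dtype)   # the Series keeps the array's dtype
```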

1.14 Preliminaries - import pandas library


In [118]:
import pandas as pd
import numpy as np

1.15 Example 1 : Create a Basic Series Object


In [119]:
# series constructor with data as a list of integers
s1 = pd.Series([33, 19, 15, 89, 11, -5, 9])

In [120]:
# the default index, if not specified in the Series constructor, is a range of integers starting at 0
s1


Out[120]:
0    33
1    19
2    15
3    89
4    11
5    -5
6     9
dtype: int64

In [121]:
# type of series is pandas series
type(s1)


Out[121]:
pandas.core.series.Series

In [122]:
# retrieve the values of the series
s1.values


Out[122]:
array([33, 19, 15, 89, 11, -5,  9])

In [123]:
# type of data values is NumPy ndarray
type(s1.values)


Out[123]:
numpy.ndarray

In [124]:
# retrieve the indices of the array
s1.index


Out[124]:
Int64Index([0, 1, 2, 3, 4, 5, 6], dtype=int64)

In [125]:
# think of a series as a mapping from index to values
s1


Out[125]:
0    33
1    19
2    15
3    89
4    11
5    -5
6     9
dtype: int64

1.16 Example 2: Creating a Series Object with Meaningful Labels


In [126]:
# define the data and index as lists
data1 = [33, 19, 15, 89, 11, -5, 9]
index1 = ['Mon','Tue','Wed','Thu','Fri','Sat','Sun']

In [127]:
# create series 
s2 = pd.Series(data1, index=index1)

In [128]:
s2


Out[128]:
Mon    33
Tue    19
Wed    15
Thu    89
Fri    11
Sat    -5
Sun     9
dtype: int64

In [129]:
# verify index 
s2.index


Out[129]:
Index([u'Mon', u'Tue', u'Wed', u'Thu', u'Fri', u'Sat', u'Sun'], dtype=object)


In [130]:
# we can also give meaningful labels to the series data and the index

s2.name='Daily Temperatures'
s2.index.name='Weekday'

In [131]:
s2


Out[131]:
Weekday
Mon        33
Tue        19
Wed        15
Thu        89
Fri        11
Sat        -5
Sun         9
Name: Daily Temperatures, dtype: int64
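
The series name and the index name can also be supplied when the Series is built, rather than assigned afterwards. A minimal sketch, reusing a few of the values above; the names `labels` and `s_named` are illustrative.

```python
import pandas as pd

# build the index first so that it can carry its own name
labels = pd.Index(['Mon', 'Tue', 'Wed'], name='Weekday')

# name= sets the name of the Series itself
s_named = pd.Series([33, 19, 15], index=labels, name='Daily Temperatures')

print(s_named.name)
print(s_named.index.name)
```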

1.17 Example 3: Data in a Series is homogeneous


In [ ]:
# the second data element in the list is a float
data2 = [33, 19.3, 15, 89, 11, -5, 9]

In [132]:
s3 = pd.Series(data2, index=index1)

In [133]:
# all the data elements are of type float
s3


Out[133]:
Mon    33.0
Tue    19.3
Wed    15.0
Thu    89.0
Fri    11.0
Sat    -5.0
Sun     9.0
dtype: float64
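
The same conversion to a common type can be checked directly through the `dtype` attribute. The sketch below uses made-up values to compare three cases: integers only, integers mixed with a float, and numbers mixed with a string.

```python
import pandas as pd

ints = pd.Series([1, 2, 3])                  # integers only
mixed_numeric = pd.Series([1, 2.5, 3])       # int and float: upcast to float
mixed_objects = pd.Series([1, 'two', 3.0])   # numbers and a string: generic object dtype

print(ints.dtype)
print(mixed_numeric.dtype)
print(mixed_objects.dtype)
```

When no common numeric type exists, Pandas falls back to the `object` dtype, which stores arbitrary Python objects.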

1.18 Example 4: Creating a Series from a Python Dict


In [ ]:
dict1 = {'Mon': 33, 'Tue': 19, 'Wed': 15, 'Thu': 89, 'Fri': 11, 'Sat': -5, 'Sun': 9}

In [134]:
s4 = pd.Series(dict1)

In [135]:
s4


Out[135]:
Fri    11
Mon    33
Sat    -5
Sun     9
Thu    89
Tue    19
Wed    15
dtype: int64
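
A dictionary can also be combined with an explicit index. In that case the index selects and orders the keys, and any label that is missing from the dictionary gets NaN. A small sketch with a shortened, illustrative dictionary (`temps` and `s_sel` are not names from the notebook):

```python
import pandas as pd

temps = {'Mon': 33, 'Tue': 19, 'Wed': 15}

# ask for the labels in a chosen order; 'Thu' is not a key in the dict
s_sel = pd.Series(temps, index=['Wed', 'Mon', 'Thu'])

print(s_sel)
```

Because NaN is a floating point value, the resulting Series has dtype float64 even though the dictionary holds integers.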

Discussion


The most general representation of a Series is as an ordered key-value store.

  • The order is represented by the offset.
  • The key-value is a mapping from index or label to the data array values.
  • Index as "offset" or "position" vs index as "label" or "key".
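
The distinction between position and label can be made explicit with the `.iloc` (position) and `.loc` (label) accessors. A minimal sketch on a small illustrative Series:

```python
import pandas as pd

s = pd.Series([33, 19, 15], index=['Mon', 'Tue', 'Wed'])

by_position = s.iloc[1]    # second element, counting from 0
by_label = s.loc['Tue']    # element whose label is 'Tue'

print(by_position, by_label)
```

Both lookups reach the same element here; the accessors only differ in how the element is addressed.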

1.19 Series is ndarray-like


In [137]:
# vectorized operations
s4 * 2


Out[137]:
Fri     22
Mon     66
Sat    -10
Sun     18
Thu    178
Tue     38
Wed     30
dtype: int64

In [138]:
np.log(s4)


Out[138]:
Fri    2.397895
Mon    3.496508
Sat         NaN
Sun    2.197225
Thu    4.488636
Tue    2.944439
Wed    2.708050
dtype: float64

Note: The logarithm of a negative number is undefined, so np.log returns NaN (not a number) for 'Sat'. NaN is also the standard missing data marker used in Pandas.
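
Once NaN values are present, they can be detected with `isnull()` and removed with `dropna()`. The sketch below repeats the logarithm on a shortened, illustrative Series; the `np.errstate` context manager is used only to silence NumPy's warning about the invalid value.

```python
import numpy as np
import pandas as pd

s = pd.Series([11, -5, 9], index=['Fri', 'Sat', 'Sun'])

# the log of a negative number is undefined; suppress the warning for this demo
with np.errstate(invalid='ignore'):
    logs = np.log(s)

print(logs.isnull())   # True only for 'Sat'
print(logs.dropna())   # the 'Sat' entry is removed
```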


In [139]:
# slice using index labels
s4['Thu':'Wed']


Out[139]:
Thu    89
Tue    19
Wed    15
dtype: int64

In [140]:
# slice using position
s4[1:3]


Out[140]:
Mon    33
Sat    -5
dtype: int64
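
The two kinds of slice treat their endpoints differently: a label slice includes the stop label, while a positional slice excludes the stop position, as in ordinary Python. A sketch using the explicit `.loc` and `.iloc` accessors on an illustrative Series:

```python
import pandas as pd

s = pd.Series([11, 33, -5, 9], index=['Fri', 'Mon', 'Sat', 'Sun'])

label_slice = s.loc['Mon':'Sat']   # both endpoints included
position_slice = s.iloc[1:3]       # stop position 3 excluded

print(label_slice)
print(position_slice)
```

Both slices return the 'Mon' and 'Sat' entries.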

In [141]:
# retrieve value using offset (position)
s4.iloc[1]


Out[141]:
33

In [142]:
# set value using offset (position)
s4.iloc[1] = 199

In [143]:
s4


Out[143]:
Fri     11
Mon    199
Sat     -5
Sun      9
Thu     89
Tue     19
Wed     15
dtype: int64

In [144]:
# being ndarray-like, a Series is a valid argument to most NumPy functions and also has its own reduction methods, e.g. median
s4


Out[144]:
Fri     11
Mon    199
Sat     -5
Sun      9
Thu     89
Tue     19
Wed     15
dtype: int64

In [145]:
s4.median()


Out[145]:
15.0

In [146]:
# maximum 
s4.max()


Out[146]:
199

In [147]:
# cumsum
s4.cumsum()


Out[147]:
Fri     11
Mon    210
Sat    205
Sun    214
Thu    303
Tue    322
Wed    337
dtype: int64

In [148]:
# looping over a collection and indices
for i, v in enumerate(s4):
    print(i, v)


0 11
1 199
2 -5
3 9
4 89
5 19
6 15

In [149]:
# list comprehension can be used to create a new list
new_list = [x**2 for x in s4]

In [150]:
new_list


Out[150]:
[121, 39601, 25, 81, 7921, 361, 225]
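
A list comprehension returns a plain Python list and drops the labels. When the labels should be kept, the vectorized form is usually the better fit. A short comparison on an illustrative Series:

```python
import pandas as pd

s = pd.Series([11, 199, -5], index=['Fri', 'Mon', 'Sat'])

squares_list = [x**2 for x in s]   # plain Python list, labels are lost
squares_series = s ** 2            # Series, labels are kept

print(squares_list)
print(squares_series)
```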

1.21 Series is dict-like


In [151]:
# is the key in the Series index?
'Sun' in s4


Out[151]:
True

In [152]:



Out[152]:
19

In [153]:
# retrieve value using key or index
s4['Tue']


Out[153]:
19

In [154]:
# assignment using key
s4['Tue']=200

In [155]:
s4


Out[155]:
Fri     11
Mon    199
Sat     -5
Sun      9
Thu     89
Tue    200
Wed     15
dtype: int64

In [157]:
# looping over dictionary keys and values
for k, v in s4.items():
    print(k, v)


Fri 11
Mon 199
Sat -5
Sun 9
Thu 89
Tue 200
Wed 15
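
A few more dictionary-style methods are available on a Series. The sketch below shows `get` with a default value, `keys`, and `to_dict` on a shortened, illustrative Series; the key 'Xyz' is deliberately absent.

```python
import pandas as pd

s = pd.Series([11, 199, -5], index=['Fri', 'Mon', 'Sat'])

print(s.get('Mon'))       # value for an existing key
print(s.get('Xyz', 0))    # missing key: the supplied default is returned
print(list(s.keys()))     # keys() returns the index
print(s.to_dict())        # convert back to a plain Python dict
```

Unlike `s['Xyz']`, which raises a KeyError, `get` returns the default for a missing key.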

Resources

Pandas Series Documentation


In [ ]:
from IPython.display import HTML

# embed the Series section of the Pandas documentation in the notebook
HTML('<iframe src="http://pandas.pydata.org/pandas-docs/dev/dsintro.html#series" width="800" height="350"></iframe>')
