Introduction to Python for Data Sciences

Franck Iutzeler
Fall 2018

In a previous chapter, we explored some features of NumPy and notably its arrays. Here we will take a look at the data structures provided by the Pandas library.

Pandas is a newer package built on top of NumPy which provides an efficient implementation of DataFrames. DataFrames are essentially multidimensional arrays with attached row and column labels, and often with heterogeneous types and/or missing data. As well as offering a convenient storage interface for labeled data, Pandas implements a number of powerful data operations.

Just as we generally import NumPy under the alias np, we will import Pandas under the alias pd.


In [1]:
import pandas as pd
import numpy as np

a) Pandas Series

A Pandas Series is a one-dimensional array of indexed data.


In [2]:
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data


Out[2]:
0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

The contents can be accessed in the same way as for NumPy arrays, except that selecting more than one value returns a Pandas Series rather than a plain array.


In [3]:
print(data[0],type(data[0]))


0.25 <class 'numpy.float64'>

In [4]:
print(data[2:],type(data[2:]))


2    0.75
3    1.00
dtype: float64 <class 'pandas.core.series.Series'>
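Other NumPy-style selections carry over as well. The following is a minimal sketch (the variable names large, picked and doubled are illustrative, not part of the original notebook) showing a Boolean mask, a selection by a list of labels, and element-wise arithmetic; in each case the result is again a Series that keeps its index.

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0])

# Boolean mask: keep the entries larger than 0.4 (their labels are preserved)
large = data[data > 0.4]

# Selection with a list of labels (here the default labels 0..3)
picked = data[[0, 3]]

# Element-wise arithmetic, as with NumPy arrays
doubled = 2 * data

print(large)
print(picked)
print(doubled)
```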

The type Series wraps both a sequence of values and a sequence of indices, which we can access with the values and index attributes.

  • values are the contents of the series as a NumPy array

In [5]:
print(data.values,type(data.values))


[ 0.25  0.5   0.75  1.  ] <class 'numpy.ndarray'>

  • index holds the indices (labels) of the series

In [6]:
print(data.index,type(data.index))


RangeIndex(start=0, stop=4, step=1) <class 'pandas.core.indexes.range.RangeIndex'>

Series Indices

The main difference between NumPy arrays and Pandas Series is the presence of this index field. By default, it is set (as in NumPy arrays) to 0, 1, ..., size_of_the_series - 1, but a Series index can be explicitly defined. The indices may be numbers or strings. The contents of the series are then accessed through these user-defined indices.


In [7]:
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
print(data)


a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [8]:
print(data['c'])


0.75

In [9]:
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=[1, 3, 4, 2])
print(data)


1    0.25
3    0.50
4    0.75
2    1.00
dtype: float64

In [10]:
print(data[2])


1.0
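With an integer index, data[2] looks up the label 2 and not the third position, which is easy to misread. As a small sketch, the loc and iloc attributes (presented further down for DataFrames, and also available on Series) make the intent explicit; the names by_label and by_position are only illustrative.

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=[1, 3, 4, 2])

by_label = data.loc[2]      # entry whose label is 2 (the last one here)
by_position = data.iloc[2]  # entry at position 2 (the third one)

print(by_label, by_position)
```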

Series and Python Dictionaries [*]

Pandas Series and Python dictionaries are semantically close: both map keys to values. However, a Series stores typed values in a NumPy array, which usually makes it more efficient than a dictionary in the context of data science. Naturally, Series can be constructed from dictionaries.


In [11]:
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)
print(population_dict,type(population_dict))
print(population,type(population))


{'Florida': 19552860, 'Illinois': 12882135, 'Texas': 26448193, 'New York': 19651127, 'California': 38332521} <class 'dict'>
California    38332521
Florida       19552860
Illinois      12882135
New York      19651127
Texas         26448193
dtype: int64 <class 'pandas.core.series.Series'>

In [12]:
population['California']


Out[12]:
38332521

In [13]:
population['California':'Illinois']


Out[13]:
California    38332521
Florida       19552860
Illinois      12882135
dtype: int64
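The dictionary analogy can be pushed a little further. The sketch below assumes a recent Pandas version, where a Series built from a dictionary keeps the insertion order of the keys rather than sorting them as in the output above; sort_index() is therefore called explicitly before slicing. It also contrasts label slices, which include their end point, with positional slices, which do not.

```python
import pandas as pd

population = pd.Series({'California': 38332521,
                        'Texas': 26448193,
                        'New York': 19651127,
                        'Florida': 19552860,
                        'Illinois': 12882135})

# Dictionary-like operations
has_texas = 'Texas' in population   # membership is tested on the index
labels = list(population.keys())    # keys() returns the index

# Sort by label so that the slices below follow alphabetical order
population = population.sort_index()

by_label = population['California':'Illinois']  # end label included
by_position = population.iloc[0:2]              # end position excluded

print(by_label)
print(by_position)
```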

b) Pandas DataFrames

The DataFrame is a fundamental object of Pandas that mimics what can be found in R, for instance. A DataFrame can be seen as a collection of Series sharing the same index: to each index (corresponding to an individual, for instance, or a line in a table), a DataFrame maps multiple values; these values correspond to the columns of the DataFrame, each of which has a name (usually a string).

In the following example, we will construct a DataFrame from two Series with common indices.


In [14]:
area = pd.Series( {'California': 423967, 'Texas': 695662, 'New York': 141297, 'Florida': 170312, 'Illinois': 149995})
population = pd.Series({'California': 38332521, 'Texas': 26448193, 'New York': 19651127, 'Florida': 19552860, 'Illinois': 12882135})

In [15]:
states = pd.DataFrame({'Population': population, 'Area': area})
print(states,type(states))


              Area  Population
California  423967    38332521
Florida     170312    19552860
Illinois    149995    12882135
New York    141297    19651127
Texas       695662    26448193 <class 'pandas.core.frame.DataFrame'>

In Jupyter notebooks, DataFrames are displayed in a fancier way when the name of the DataFrame is typed (instead of using print).


In [16]:
states


Out[16]:
              Area  Population
California  423967    38332521
Florida     170312    19552860
Illinois    149995    12882135
New York    141297    19651127
Texas       695662    26448193

DataFrames have three main attributes:

  • index, which holds the row indices, as in Series
  • columns, which holds the column names
  • values, which returns a (2D) NumPy array with the contents

In [17]:
print(states.index)
print(states.columns)
print(states.values,type(states.values),states.values.shape)


Index(['California', 'Florida', 'Illinois', 'New York', 'Texas'], dtype='object')
Index(['Area', 'Population'], dtype='object')
[[  423967 38332521]
 [  170312 19552860]
 [  149995 12882135]
 [  141297 19651127]
 [  695662 26448193]] <class 'numpy.ndarray'> (5, 2)
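Since a NumPy array has a single dtype, the array returned by values has to accommodate every column. The sketch below uses a small hypothetical table (the column names name, count and ratio are made up for illustration) to show that each column keeps its own type in dtypes, while values falls back to a common type.

```python
import pandas as pd

# Hypothetical table mixing text, integer and floating-point columns
df = pd.DataFrame({'name': ['a', 'b'],
                   'count': [1, 2],
                   'ratio': [0.5, 1.5]})

print(df.dtypes)  # one type per column

# All columns together: only a generic object array can hold them
mixed = df.values

# Numeric columns only: integers are upcast to floats
numeric = df[['count', 'ratio']].values

print(mixed.dtype, numeric.dtype)
```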
**Warning:** When accessing a DataFrame, dataframe_name[column_name] returns the corresponding column as a Series, whereas dataframe_name[index_name] raises a KeyError! We will see later how to access a specific index.

In [18]:
print(states['Area'],type(states['Area']))


California    423967
Florida       170312
Illinois      149995
New York      141297
Texas         695662
Name: Area, dtype: int64 <class 'pandas.core.series.Series'>

In [19]:
print(states['California'])


---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~/.local/lib/python3.5/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2441             try:
-> 2442                 return self._engine.get_loc(key)
   2443             except KeyError:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5280)()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5126)()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20523)()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20477)()

KeyError: 'California'

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-19-6c1ec5cbb61e> in <module>()
----> 1 print(states['California'])

~/.local/lib/python3.5/site-packages/pandas/core/frame.py in __getitem__(self, key)
   1962             return self._getitem_multilevel(key)
   1963         else:
-> 1964             return self._getitem_column(key)
   1965 
   1966     def _getitem_column(self, key):

~/.local/lib/python3.5/site-packages/pandas/core/frame.py in _getitem_column(self, key)
   1969         # get column
   1970         if self.columns.is_unique:
-> 1971             return self._get_item_cache(key)
   1972 
   1973         # duplicate columns & possible reduce dimensionality

~/.local/lib/python3.5/site-packages/pandas/core/generic.py in _get_item_cache(self, item)
   1643         res = cache.get(item)
   1644         if res is None:
-> 1645             values = self._data.get(item)
   1646             res = self._box_item_values(item, values)
   1647             cache[item] = res

~/.local/lib/python3.5/site-packages/pandas/core/internals.py in get(self, item, fastpath)
   3588 
   3589             if not isnull(item):
-> 3590                 loc = self.items.get_loc(item)
   3591             else:
   3592                 indexer = np.arange(len(self.items))[isnull(self.items)]

~/.local/lib/python3.5/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2442                 return self._engine.get_loc(key)
   2443             except KeyError:
-> 2444                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2445 
   2446         indexer = self.get_indexer([key], method=method, tolerance=tolerance)

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5280)()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5126)()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20523)()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20477)()

KeyError: 'California'
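The long traceback above boils down to one fact: square brackets look up column names, and 'California' is a row label. As a minimal sketch, the error can be caught explicitly, and the row can be reached with the loc attribute detailed later in this chapter; the names found and row are illustrative.

```python
import pandas as pd

area = pd.Series({'California': 423967, 'Texas': 695662, 'New York': 141297,
                  'Florida': 170312, 'Illinois': 149995})
population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127, 'Florida': 19552860,
                        'Illinois': 12882135})
states = pd.DataFrame({'Population': population, 'Area': area})

# states[...] searches the columns, so a row label raises KeyError
try:
    states['California']
    found = True
except KeyError:
    found = False

# Checking where a label lives before using it
is_column = 'California' in states.columns
is_row = 'California' in states.index

# Row access by label returns a Series indexed by the column names
row = states.loc['California']
print(found, is_column, is_row)
print(row)
```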

DataFrame creation

To create DataFrames, the main methods are:

  • from Series (as above)

In [20]:
print(population,type(population))
states = pd.DataFrame({'Population': population, 'Area': area})
states


California    38332521
Florida       19552860
Illinois      12882135
New York      19651127
Texas         26448193
dtype: int64 <class 'pandas.core.series.Series'>
Out[20]:
              Area  Population
California  423967    38332521
Florida     170312    19552860
Illinois    149995    12882135
New York    141297    19651127
Texas       695662    26448193

  • from NumPy arrays (row and column labels default to the integer positions 0, 1, ...)

In [21]:
A = np.random.randn(5,3)
print(A,type(A))
dfA = pd.DataFrame(A)
dfA


[[-0.09894406 -1.29485362 -0.59476304]
 [ 0.76951757 -0.06049627  0.29844475]
 [-1.96703017  0.79565206  0.31305243]
 [-0.96082799 -3.15184472 -1.00094927]
 [ 0.27694356  0.41651492 -0.29137939]] <class 'numpy.ndarray'>
Out[21]:
          0         1         2
0 -0.098944 -1.294854 -0.594763
1  0.769518 -0.060496  0.298445
2 -1.967030  0.795652  0.313052
3 -0.960828 -3.151845 -1.000949
4  0.276944  0.416515 -0.291379

  • from a list of dictionaries. Be careful: each element of the list is one row (an example, given an automatic index 0, 1, ...), while each key of the dictionary corresponds to a column.

In [22]:
data = [{'a': i, 'b': 2 * i} for i in range(3)]
print(data,type(data))
print(data[0],type(data[0]))


[{'a': 0, 'b': 0}, {'a': 1, 'b': 2}, {'a': 2, 'b': 4}] <class 'list'>
{'a': 0, 'b': 0} <class 'dict'>

In [23]:
df = pd.DataFrame(data)
df


Out[23]:
   a  b
0  0  0
1  1  2
2  2  4

  • from a file, typically a CSV file (for comma-separated values), optionally with the names of the columns as a first line.
col_1_name,col_2_name,col_3_name
col_1_v1,col_2_v1,col_3_v1
col_1_v2,col_2_v2,col_3_v2
...

For other file types (MS Excel, libSVM, any other separator), see the input/output section of the Pandas documentation.


In [25]:
!head -4 data/president_heights.csv # Jupyter bash command to see the first 4 lines of the file


order,name,height(cm)
1,George Washington,189
2,John Adams,170
3,Thomas Jefferson,189

In [26]:
data = pd.read_csv('data/president_heights.csv')
data


Out[26]:
    order                    name  height(cm)
0       1       George Washington         189
1       2              John Adams         170
2       3        Thomas Jefferson         189
3       4           James Madison         163
4       5            James Monroe         183
5       6       John Quincy Adams         171
6       7          Andrew Jackson         185
7       8        Martin Van Buren         168
8       9  William Henry Harrison         173
9      10              John Tyler         183
10     11           James K. Polk         173
11     12          Zachary Taylor         173
12     13        Millard Fillmore         175
13     14         Franklin Pierce         178
14     15          James Buchanan         183
15     16         Abraham Lincoln         193
16     17          Andrew Johnson         178
17     18        Ulysses S. Grant         173
18     19     Rutherford B. Hayes         174
19     20       James A. Garfield         183
20     21       Chester A. Arthur         183
21     23       Benjamin Harrison         168
22     25        William McKinley         170
23     26      Theodore Roosevelt         178
24     27     William Howard Taft         182
25     28          Woodrow Wilson         180
26     29       Warren G. Harding         183
27     30         Calvin Coolidge         178
28     31          Herbert Hoover         182
29     32   Franklin D. Roosevelt         188
30     33         Harry S. Truman         175
31     34    Dwight D. Eisenhower         179
32     35         John F. Kennedy         183
33     36       Lyndon B. Johnson         193
34     37           Richard Nixon         182
35     38             Gerald Ford         183
36     39            Jimmy Carter         177
37     40           Ronald Reagan         185
38     41       George H. W. Bush         188
39     42            Bill Clinton         188
40     43          George W. Bush         182
41     44            Barack Obama         185
42     45            Donald Trump         188
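To experiment with read_csv without the data file, the text can be supplied in memory. The sketch below assumes the three sample lines shown by the head command above and wraps them in io.StringIO; the second and third calls illustrate the index_col, sep, header and names parameters on made-up inputs (the column names id and letter are hypothetical).

```python
import io
import pandas as pd

csv_text = """order,name,height(cm)
1,George Washington,189
2,John Adams,170
3,Thomas Jefferson,189
"""

# Default behaviour: the first line gives the column names, rows are numbered 0, 1, ...
heights = pd.read_csv(io.StringIO(csv_text))

# Use one of the columns as the index instead of the automatic one
by_order = pd.read_csv(io.StringIO(csv_text), index_col='order')

# Another separator and no header line: the column names are given explicitly
no_header = pd.read_csv(io.StringIO("1;a\n2;b\n"), sep=';',
                        header=None, names=['id', 'letter'])

print(heights)
print(by_order)
print(no_header)
```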

Names and Values

Notice that DataFrames can contain missing values, displayed as NaN: below, the first row has no 'c' entry and the second has no 'a' entry.


In [25]:
pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])


Out[25]:
     a  b    c
0  1.0  2  NaN
1  NaN  3  4.0
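Missing entries can be located programmatically. The following sketch rebuilds the same small DataFrame and uses isna() to flag the missing cells and count them per column; it also shows why columns 'a' and 'c' are printed as floats: NaN is a floating-point value, so these columns are stored as float64.

```python
import pandas as pd

df = pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

missing = df.isna()        # Boolean DataFrame, True where a value is missing
n_missing = missing.sum()  # number of missing values in each column

print(missing)
print(n_missing)
print(df.dtypes)           # 'a' and 'c' are float64, 'b' stays int64
```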

You can set the indices and column names afterwards.


In [26]:
dfA.columns = ['a', 'b', 'c']                 # one name per column
dfA.index = [i**2 for i in range(1, 6)]       # one label per row: 1, 4, 9, 16, 25
dfA


Out[26]:
           a         b         c
1   2.408294 -0.035728  1.268004
4  -1.451913 -0.136165  0.526509
9   0.955310 -0.767303  1.212357
16  1.264427 -0.293195  0.728914
25 -0.385296 -1.073592 -0.439185
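The labels can also be given directly at creation time through the columns and index arguments of the DataFrame constructor. The sketch below replaces the random array with a deterministic one (np.arange, an assumption made only so that the values are reproducible) and additionally shows rename, which returns a relabelled copy instead of modifying the DataFrame in place.

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for the random 5x3 array used above
A = np.arange(15).reshape(5, 3)

# Labels supplied when the DataFrame is built
dfA = pd.DataFrame(A, columns=['a', 'b', 'c'],
                   index=[i**2 for i in range(1, 6)])

# rename changes selected labels and leaves dfA untouched
renamed = dfA.rename(columns={'a': 'alpha'})

print(dfA)
print(renamed)
```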

In [27]:
area = pd.Series( {'California': 423967, 'Texas': 695662, 'New York': 141297, 'Florida': 170312, 'Illinois': 149995})
population = pd.Series({'California': 38332521, 'Texas': 26448193, 'New York': 19651127, 'Florida': 19552860, 'Illinois': 12882135})
states = pd.DataFrame({'Population': population, 'Area': area})
states


Out[27]:
              Area  Population
California  423967    38332521
Florida     170312    19552860
Illinois    149995    12882135
New York    141297    19651127
Texas       695662    26448193

Columns can be accessed directly by name; indexing the resulting Series with a row index then returns a single value.


In [28]:
states['Area']


Out[28]:
California    423967
Florida       170312
Illinois      149995
New York      141297
Texas         695662
Name: Area, dtype: int64

In [29]:
states['Area']['Texas']


Out[29]:
695662

To ease access, Pandas offers two dedicated indexers:

  • iloc accesses subparts of the DataFrame by integer position, as if it were a NumPy array.

In [30]:
states.iloc[:2]


Out[30]:
              Area  Population
California  423967    38332521
Florida     170312    19552860

In [31]:
states.iloc[:2,0]


Out[31]:
California    423967
Florida       170312
Name: Area, dtype: int64

  • loc does the same but with the explicit names (in a slice, the last name is included)

In [32]:
states.loc[:'New York']


Out[32]:
              Area  Population
California  423967    38332521
Florida     170312    19552860
Illinois    149995    12882135
New York    141297    19651127

In [33]:
states.loc[:,'Population':]


Out[33]:
            Population
California    38332521
Florida       19552860
Illinois      12882135
New York      19651127
Texas         26448193
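The two indexers can be compared side by side. The sketch below assumes a recent Pandas version, in which rows and columns follow the insertion order of the dictionaries; the columns argument and sort_index() are therefore used to reproduce the alphabetical layout shown in the outputs above. It contrasts the excluded end of a positional slice with the included end of a label slice, and reads a single cell both ways.

```python
import pandas as pd

area = pd.Series({'California': 423967, 'Texas': 695662, 'New York': 141297,
                  'Florida': 170312, 'Illinois': 149995})
population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127, 'Florida': 19552860,
                        'Illinois': 12882135})

# Fix the column order explicitly and sort the rows by label
states = pd.DataFrame({'Population': population, 'Area': area},
                      columns=['Area', 'Population']).sort_index()

by_position = states.iloc[:1]        # position 1 excluded: one row
by_label = states.loc[:'Florida']    # label 'Florida' included: two rows

# A single cell, by labels and by positions (row 4 is Texas, column 0 is Area)
cell_label = states.loc['Texas', 'Area']
cell_position = states.iloc[4, 0]

print(by_position)
print(by_label)
print(cell_label, cell_position)
```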

Package Check and Styling


In [ ]:
import lib.notebook_setting as nbs

packageList = ['IPython', 'numpy', 'scipy', 'matplotlib', 'cvxopt', 'pandas', 'seaborn', 'sklearn', 'tensorflow']
nbs.packageCheck(packageList)

nbs.cssStyling()