Introduction to Python for Data Sciences

Franck Iutzeler
Fall 2018

`1. Pandas`

Package check and Styling

Outline

    a) Pandas Series
    b) Pandas DataFrames
    c) Indexing

In a previous chapter, we explored some features of NumPy and notably its arrays. Here we will take a look at the data structures provided by the Pandas library.

Pandas is a newer package built on top of NumPy which provides an efficient implementation of DataFrames. DataFrames are essentially multidimensional arrays with attached row and column labels, and often with heterogeneous types and/or missing data. As well as offering a convenient storage interface for labeled data, Pandas implements a number of powerful data operations.

Just as we generally import NumPy under the alias np, we will import Pandas under the alias pd.



In [1]:

    
import pandas as pd
import numpy as np

a) Pandas Series

A Pandas Series is a one-dimensional array of indexed data.



In [2]:

    
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data









    Out[2]:





0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

The contents can be accessed in the same way as for NumPy arrays, with the difference that when more than one value is selected, the result is still a Pandas Series.



In [3]:

    
print(data[0],type(data[0]))









    



0.25 <class 'numpy.float64'>



In [4]:

    
print(data[2:],type(data[2:]))









    



2    0.75
3    1.00
dtype: float64 <class 'pandas.core.series.Series'>
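
As a small illustration of this NumPy-style access, the sketch below (which assumes the same values as the `data` Series above) selects entries with a boolean mask and applies a vectorized operation. In both cases the result is again a Series that keeps the original index.

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0])

# Boolean mask, as with a NumPy array: keep the entries larger than 0.4
large = data[data > 0.4]
print(large)

# Arithmetic is applied element-wise and the index is preserved
doubled = 2 * data
print(doubled)
```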

The type Series wraps both a sequence of values and a sequence of indices, which we can access with the values and index attributes.

values holds the contents of the series as a NumPy array



In [5]:

    
print(data.values,type(data.values))









    



[ 0.25  0.5   0.75  1.  ] <class 'numpy.ndarray'>

index holds the indices of the series



In [6]:

    
print(data.index,type(data.index))









    



RangeIndex(start=0, stop=4, step=1) <class 'pandas.core.indexes.range.RangeIndex'>

Series Indices

The main difference between NumPy arrays and Pandas Series is the presence of this index field. By default, it is set (as in NumPy arrays) to 0, 1, ..., size_of_the_series - 1, but a Series index can be explicitly defined. The indices may be numbers but also strings. The contents of the series then have to be accessed using these defined indices.



In [7]:

    
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
print(data)









    



a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64



In [8]:

    
print(data['c'])



In [9]:

    
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=[1, 3, 4, 2])
print(data)









    



1    0.25
3    0.50
4    0.75
2    1.00
dtype: float64



In [10]:

    
print(data[2])

1.0
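
With an integer index that is not 0, 1, 2, ..., it is easy to confuse labels and positions. The following sketch, which assumes the same Series as above, uses the `loc` (label-based) and `iloc` (position-based) indexers to make the intent explicit.

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=[1, 3, 4, 2])

by_label = data.loc[2]      # entry whose index label is 2 (the last one)
by_position = data.iloc[2]  # third entry, whatever its label

print(by_label, by_position)
```

Here the label 2 points to the value 1.0, while the position 2 points to the value 0.75.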

Series and Python Dictionaries [*]

Pandas Series and Python dictionaries are semantically close: both map keys to values. However, the implementation of Pandas Series is usually more efficient than dictionaries in the context of data science. Naturally, Series can be constructed from dictionaries.



In [11]:

    
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)
print(population_dict,type(population_dict))
print(population,type(population))









    



{'Florida': 19552860, 'Illinois': 12882135, 'Texas': 26448193, 'New York': 19651127, 'California': 38332521} <class 'dict'>
California    38332521
Florida       19552860
Illinois      12882135
New York      19651127
Texas         26448193
dtype: int64 <class 'pandas.core.series.Series'>



In [12]:

    
population['California']









    Out[12]:





38332521



In [13]:

    
population['California':'Illinois']









    Out[13]:





California    38332521
Florida       19552860
Illinois      12882135
dtype: int64
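
Label slices include their last label and follow the order of the index. In the outputs above the index is sorted alphabetically; recent Pandas versions may instead keep the insertion order of the dictionary, in which case the same slice returns different rows. A minimal sketch that does not depend on this behavior sorts the index explicitly with `sort_index()` before slicing (the variable names `population_sorted` and `subset` are ours).

```python
import pandas as pd

population = pd.Series({'California': 38332521,
                        'Texas': 26448193,
                        'New York': 19651127,
                        'Florida': 19552860,
                        'Illinois': 12882135})

# Sort the labels so that the slice does not depend on the dictionary order
population_sorted = population.sort_index()

# Both end labels are included in a label-based slice
subset = population_sorted['California':'Illinois']
print(subset)
```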

b) Pandas DataFrames


The DataFrame is a fundamental object of Pandas that mimics what can be found in R, for instance. A DataFrame can be seen as an array of Series: to each index (corresponding to an individual, for instance, or to a line in a table), a DataFrame maps multiple values; these values correspond to the columns of the DataFrame, each of which has a name (as a string).

In the following example, we will construct a DataFrame from two Series with common indices.



In [14]:

    
area = pd.Series( {'California': 423967, 'Texas': 695662, 'New York': 141297, 'Florida': 170312, 'Illinois': 149995})
population = pd.Series({'California': 38332521, 'Texas': 26448193, 'New York': 19651127, 'Florida': 19552860, 'Illinois': 12882135})



In [15]:

    
states = pd.DataFrame({'Population': population, 'Area': area})
print(states,type(states))









    



              Area  Population
California  423967    38332521
Florida     170312    19552860
Illinois    149995    12882135
New York    141297    19651127
Texas       695662    26448193 <class 'pandas.core.frame.DataFrame'>
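
The Series above share the same indices. As a hedged illustration of what happens otherwise, the sketch below builds a DataFrame from two Series whose indices only partially overlap (a reduced, hypothetical subset of the data above): the rows are aligned on the index labels and the entries that have no counterpart are marked as missing (NaN).

```python
import pandas as pd

# Reduced versions of the Series above, with only 'California' in common
area = pd.Series({'California': 423967, 'Texas': 695662})
population = pd.Series({'California': 38332521, 'Florida': 19552860})

partial = pd.DataFrame({'Population': population, 'Area': area})
print(partial)

# True where a value is missing
print(partial.isna())
```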

In Jupyter notebooks, DataFrames are displayed in a fancier way when the name of the DataFrame is typed on the last line of a cell (instead of using print).



In [16]:

    
states

DataFrames have the following attributes:

index, the defined indices, as in Series
columns, the column names
values, which returns a (2D) NumPy array with the contents



In [17]:

    
print(states.index)
print(states.columns)
print(states.values,type(states.values),states.values.shape)









    



Index(['California', 'Florida', 'Illinois', 'New York', 'Texas'], dtype='object')
Index(['Area', 'Population'], dtype='object')
[[  423967 38332521]
 [  170312 19552860]
 [  149995 12882135]
 [  141297 19651127]
 [  695662 26448193]] <class 'numpy.ndarray'> (5, 2)

**Warning:** When accessing a DataFrame, dataframe_name[column_name] returns the corresponding column as a Series, whereas dataframe_name[index_name] raises a KeyError! We will see later how to access a specific index.



In [18]:

    
print(states['Area'],type(states['Area']))









    



California    423967
Florida       170312
Illinois      149995
New York      141297
Texas         695662
Name: Area, dtype: int64 <class 'pandas.core.series.Series'>



In [19]:

    
print(states['California'])









    



---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~/.local/lib/python3.5/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2441             try:
-> 2442                 return self._engine.get_loc(key)
   2443             except KeyError:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5280)()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5126)()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20523)()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20477)()

KeyError: 'California'

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-19-6c1ec5cbb61e> in <module>()
----> 1 print(states['California'])

~/.local/lib/python3.5/site-packages/pandas/core/frame.py in __getitem__(self, key)
   1962             return self._getitem_multilevel(key)
   1963         else:
-> 1964             return self._getitem_column(key)
   1965 
   1966     def _getitem_column(self, key):

~/.local/lib/python3.5/site-packages/pandas/core/frame.py in _getitem_column(self, key)
   1969         # get column
   1970         if self.columns.is_unique:
-> 1971             return self._get_item_cache(key)
   1972 
   1973         # duplicate columns & possible reduce dimensionality

~/.local/lib/python3.5/site-packages/pandas/core/generic.py in _get_item_cache(self, item)
   1643         res = cache.get(item)
   1644         if res is None:
-> 1645             values = self._data.get(item)
   1646             res = self._box_item_values(item, values)
   1647             cache[item] = res

~/.local/lib/python3.5/site-packages/pandas/core/internals.py in get(self, item, fastpath)
   3588 
   3589             if not isnull(item):
-> 3590                 loc = self.items.get_loc(item)
   3591             else:
   3592                 indexer = np.arange(len(self.items))[isnull(self.items)]

~/.local/lib/python3.5/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2442                 return self._engine.get_loc(key)
   2443             except KeyError:
-> 2444                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2445 
   2446         indexer = self.get_indexer([key], method=method, tolerance=tolerance)

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5280)()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5126)()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20523)()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20477)()

KeyError: 'California'
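
A minimal sketch of how one might guard against this error, assuming the same `states` DataFrame as above (the flag `caught` is only there for illustration). It checks where the label lives, catches the `KeyError`, and then retrieves the row through the `loc` indexer, which is covered in the Indexing section.

```python
import pandas as pd

area = pd.Series({'California': 423967, 'Texas': 695662, 'New York': 141297,
                  'Florida': 170312, 'Illinois': 149995})
population = pd.Series({'California': 38332521, 'Texas': 26448193, 'New York': 19651127,
                        'Florida': 19552860, 'Illinois': 12882135})
states = pd.DataFrame({'Population': population, 'Area': area})

# 'California' is a row label, not a column name
print('California' in states.columns, 'California' in states.index)

caught = False
try:
    states['California']          # looks for a column named 'California'
except KeyError as err:
    caught = True
    print('KeyError:', err)

# Rows are reached by label through loc
row = states.loc['California']
print(row)
```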

Dataframe creation

To create DataFrames, the main methods are:

from Series (as above)



In [20]:

    
print(population,type(population))
states = pd.DataFrame({'Population': population, 'Area': area})
states









    



California    38332521
Florida       19552860
Illinois      12882135
New York      19651127
Texas         26448193
dtype: int64 <class 'pandas.core.series.Series'>






    Out[20]:







  
    
      
              Area  Population
California  423967    38332521
Florida     170312    19552860
Illinois    149995    12882135
New York    141297    19651127
Texas       695662    26448193

from NumPy arrays (by default, rows and columns are labeled with their integer positions in the array)



In [21]:

    
A = np.random.randn(5,3)
print(A,type(A))
dfA = pd.DataFrame(A)
dfA









    



[[-0.09894406 -1.29485362 -0.59476304]
 [ 0.76951757 -0.06049627  0.29844475]
 [-1.96703017  0.79565206  0.31305243]
 [-0.96082799 -3.15184472 -1.00094927]
 [ 0.27694356  0.41651492 -0.29137939]] <class 'numpy.ndarray'>






    Out[21]:







  
    
      
          0         1         2
0 -0.098944 -1.294854 -0.594763
1  0.769518 -0.060496  0.298445
2 -1.967030  0.795652  0.313052
3 -0.960828 -3.151845 -1.000949
4  0.276944  0.416515 -0.291379

from a list of dictionaries. Be careful: each element of the list is one example, i.e. one row (with an automatic index 0,1,...), while each key of the dictionaries corresponds to a column.



In [22]:

    
data = [{'a': i, 'b': 2 * i} for i in range(3)]
print(data,type(data))
print(data[0],type(data[0]))









    



[{'a': 0, 'b': 0}, {'a': 1, 'b': 2}, {'a': 2, 'b': 4}] <class 'list'>
{'a': 0, 'b': 0} <class 'dict'>



In [23]:

    
df = pd.DataFrame(data)
df
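
To make the row/column convention concrete, the following sketch rebuilds the same DataFrame and inspects its labels: the dictionary keys become the columns, and the position of each dictionary in the list becomes its row index.

```python
import pandas as pd

data = [{'a': i, 'b': 2 * i} for i in range(3)]
df = pd.DataFrame(data)

print(list(df.columns))  # dictionary keys
print(list(df.index))    # automatic row index
print(df.loc[1, 'b'])    # value of key 'b' in the second dictionary
```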

from a file, typically a csv file (for comma-separated values), possibly with the names of the columns as a first line.

col_1_name,col_2_name,col_3_name
col_1_v1,col_2_v1,col_3_v1
col_1_v2,col_2_v2,col_3_v2
...

For other file types (MS Excel, libSVM, any other separator), see the Pandas documentation.
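
As a hedged sketch of the "other separator" case, the example below reads semicolon-separated text with `pd.read_csv` and its `sep` parameter. The text is a small hypothetical sample held in memory through `io.StringIO`, so that the example does not rely on any file on disk.

```python
import io
import pandas as pd

# Hypothetical semicolon-separated content, with the column names as first line
csv_text = (
    "order;name;height(cm)\n"
    "1;George Washington;189\n"
    "2;John Adams;170\n"
    "3;Thomas Jefferson;189\n"
)

# read_csv accepts a path or a file-like object; sep sets the separator
heights = pd.read_csv(io.StringIO(csv_text), sep=';')
print(heights)
```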



In [25]:

    
!head -4 data/president_heights.csv # Jupyter bash command to see the first 4 lines of the file









    



order,name,height(cm)
1,George Washington,189
2,John Adams,170
3,Thomas Jefferson,189



In [26]:

    
data = pd.read_csv('data/president_heights.csv')
data









    Out[26]:







  
    
      
    order  name                    height(cm)
0   1      George Washington       189
1   2      John Adams              170
2   3      Thomas Jefferson        189
3   4      James Madison           163
4   5      James Monroe            183
5   6      John Quincy Adams       171
6   7      Andrew Jackson          185
7   8      Martin Van Buren        168
8   9      William Henry Harrison  173
9   10     John Tyler              183
10  11     James K. Polk           173
11  12     Zachary Taylor          173
12  13     Millard Fillmore        175
13  14     Franklin Pierce         178
14  15     James Buchanan          183
15  16     Abraham Lincoln         193
16  17     Andrew Johnson          178
17  18     Ulysses S. Grant        173
18  19     Rutherford B. Hayes     174
19  20     James A. Garfield       183
20  21     Chester A. Arthur       183
21  23     Benjamin Harrison       168
22  25     William McKinley        170
23  26     Theodore Roosevelt      178
24  27     William Howard Taft     182
25  28     Woodrow Wilson          180
26  29     Warren G. Harding       183
27  30     Calvin Coolidge         178
28  31     Herbert Hoover          182
29  32     Franklin D. Roosevelt   188
30  33     Harry S. Truman         175
31  34     Dwight D. Eisenhower    179
32  35     John F. Kennedy         183
33  36     Lyndon B. Johnson       193
34  37     Richard Nixon           182
35  38     Gerald Ford             183
36  39     Jimmy Carter            177
37  40     Ronald Reagan           185
38  41     George H. W. Bush       188
39  42     Bill Clinton            188
40  43     George W. Bush          182
41  44     Barack Obama            185
42  45     Donald Trump            188

Names and Values

Notice that DataFrames can hold missing values: when a key is absent from one of the rows, the corresponding entry is filled with NaN.



In [25]:

    
pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])
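
Once missing values are present, they usually have to be located and handled. A minimal sketch, assuming the same two dictionaries as above, using the Pandas methods `isna`, `fillna` and `dropna` (the variable names are ours):

```python
import pandas as pd

df = pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

# Count the missing entries in each column
missing_per_column = df.isna().sum()
print(missing_per_column)

# Replace the missing entries by a default value
filled = df.fillna(0)
print(filled)

# Keep only the rows without any missing entry (none here)
complete_rows = df.dropna()
print(len(complete_rows))
```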

You can set the index and the column names a posteriori.



In [26]:

    
dfA.columns = ['a','b','c']
dfA.index = [i**2 for i in range(1,6)  ]
dfA
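
The labels can also be given at construction time. The sketch below assumes a freshly drawn random array (so its values differ from those of `dfA`) and passes the `index` and `columns` arguments to `pd.DataFrame`; it then uses `rename`, which returns a relabeled copy and leaves the original untouched. The names `dfB` and `renamed` are ours.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)      # seeded generator, for reproducibility
A = rng.standard_normal((5, 3))

# Row and column labels given directly to the constructor
dfB = pd.DataFrame(A, index=[i**2 for i in range(1, 6)], columns=['a', 'b', 'c'])
print(dfB)

# rename returns a new DataFrame; dfB keeps its column names
renamed = dfB.rename(columns={'a': 'alpha'})
print(renamed.columns)
```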

c) Indexing




In [27]:

    
area = pd.Series( {'California': 423967, 'Texas': 695662, 'New York': 141297, 'Florida': 170312, 'Illinois': 149995})
population = pd.Series({'California': 38332521, 'Texas': 26448193, 'New York': 19651127, 'Florida': 19552860, 'Illinois': 12882135})
states = pd.DataFrame({'Population': population, 'Area': area})
states

You may access a column directly by its name; an individual entry of that column can then be accessed with its index label.



In [28]:

    
states['Area']









    Out[28]:





California    423967
Florida       170312
Illinois      149995
New York      141297
Texas         695662
Name: Area, dtype: int64



In [29]:

    
states['Area']['Texas']









    Out[29]:





695662
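
The access above chains two lookups (first the column, then the label). As a sketch under the same data, a single entry can also be reached in one step with the `loc` indexer, or with `at` for scalar access; note that the row label comes first and the column name second.

```python
import pandas as pd

area = pd.Series({'California': 423967, 'Texas': 695662, 'New York': 141297,
                  'Florida': 170312, 'Illinois': 149995})
population = pd.Series({'California': 38332521, 'Texas': 26448193, 'New York': 19651127,
                        'Florida': 19552860, 'Illinois': 12882135})
states = pd.DataFrame({'Population': population, 'Area': area})

chained = states['Area']['Texas']     # column first, then row label
direct = states.loc['Texas', 'Area']  # row label first, then column name
scalar = states.at['Texas', 'Area']   # same order, restricted to a single value

print(chained, direct, scalar)
```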

To make access easier, Pandas offers dedicated indexers:

iloc gives access to subparts of the DataFrame by integer position, as if it were a NumPy array.



In [30]:

    
states.iloc[:2]









    Out[30]:







  
    
      
              Area  Population
California  423967    38332521
Florida     170312    19552860



In [31]:

    
states.iloc[:2,0]









    Out[31]:





California    423967
Florida       170312
Name: Area, dtype: int64
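
To illustrate the analogy with NumPy, the following sketch compares `iloc` with the same indexing applied to the underlying array. It assumes the data above, with the rows sorted by label so that the positions match the displayed tables.

```python
import numpy as np
import pandas as pd

area = pd.Series({'California': 423967, 'Texas': 695662, 'New York': 141297,
                  'Florida': 170312, 'Illinois': 149995})
population = pd.Series({'California': 38332521, 'Texas': 26448193, 'New York': 19651127,
                        'Florida': 19552860, 'Illinois': 12882135})
# Sort the rows by label so that positions are unambiguous
states = pd.DataFrame({'Area': area, 'Population': population}).sort_index()

first_two = states.iloc[:2, 0]   # first two rows, first column (a Series)
raw = states.values[:2, 0]       # same selection on the NumPy array

print(first_two)
print(raw)

# Negative positions work as in NumPy: the last row
last_row = states.iloc[-1]
print(last_row.name)
```

The `iloc` result carries the labels along, whereas the NumPy selection only contains the values.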

loc works like iloc but uses the explicit index and column labels; in a label slice, the last label is included.



In [32]:

    
states.loc[:'New York']



In [33]:

    
states.loc[:,'Population':]









    Out[33]:







  
    
      
            Population
California    38332521
Florida       19552860
Illinois      12882135
New York      19651127
Texas         26448193
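
The difference between the two slicing conventions can be sketched as follows, assuming the same data with the rows sorted by label: a `loc` slice includes its last label, an `iloc` slice excludes its last position, and `loc` also accepts explicit lists of labels.

```python
import pandas as pd

area = pd.Series({'California': 423967, 'Texas': 695662, 'New York': 141297,
                  'Florida': 170312, 'Illinois': 149995})
population = pd.Series({'California': 38332521, 'Texas': 26448193, 'New York': 19651127,
                        'Florida': 19552860, 'Illinois': 12882135})
states = pd.DataFrame({'Area': area, 'Population': population}).sort_index()

by_label = states.loc[:'Illinois']   # 'Illinois' is included: 3 rows
by_position = states.iloc[:2]        # position 2 is excluded: 2 rows
print(len(by_label), len(by_position))

# Lists of labels select rows and columns in the requested order
picked = states.loc[['Texas', 'Florida'], ['Area']]
print(picked)
```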

Package Check and Styling




In [ ]:

    
import lib.notebook_setting as nbs

packageList = ['IPython', 'numpy', 'scipy', 'matplotlib', 'cvxopt', 'pandas', 'seaborn', 'sklearn', 'tensorflow']
nbs.packageCheck(packageList)

nbs.cssStyling()

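Output of the missing-values example (Names and Values):

     a  b    c
0  1.0  2  NaN
1  NaN  3  4.0

Output of dfA after setting its columns and index (values from a different random draw than Out[21]):

           a         b         c
1   2.408294 -0.035728  1.268004
4  -1.451913 -0.136165  0.526509
9   0.955310 -0.767303  1.212357
16  1.264427 -0.293195  0.728914
25 -0.385296 -1.073592 -0.439185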

