Pandas

Importing Pandas and NumPy



In [1]:

    
import pandas as pd
import numpy as np

Series

A Series is a one-dimensional, array-like object that holds an array of data and an associated array of labels, called its index. Its full documentation is at: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html#pandas.Series

Creating a Series



In [2]:

    
""" From the values only """

obj = pd.Series([4, 7, -5, 3])
obj









    Out[2]:





0    4
1    7
2   -5
3    3
dtype: int64



In [3]:

    
obj.values









    Out[3]:





array([ 4,  7, -5,  3], dtype=int64)



In [4]:

    
obj.index









    Out[4]:





RangeIndex(start=0, stop=4, step=1)



In [5]:

    
""" From the values and the index labels """

obj2 = pd.Series([4, 7, -5, 3], index=['d','b','a','c'])
obj2









    Out[5]:





d    4
b    7
a   -5
c    3
dtype: int64



In [6]:

    
obj2.index









    Out[6]:





Index([u'd', u'b', u'a', u'c'], dtype='object')



In [7]:

    
""" From a dictionary """

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = pd.Series(sdata)
obj3









    Out[7]:





Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64



In [8]:

    
""" From a dictionary and an explicit index """

states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(sdata, index=states)
obj4









    Out[8]:





California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64
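
The NaN for California in the output above appears because 'California' is not a key of the dictionary, and the dtype becomes float64 so that the missing value can be stored. A minimal sketch of the same behaviour (the names populations, wanted and by_state are illustrative and not part of the original notebook):

```python
import pandas as pd

populations = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
wanted = ['California', 'Ohio', 'Oregon', 'Texas']

# The explicit index decides which labels appear in the result
by_state = pd.Series(populations, index=wanted)

# 'California' has no entry in the dict, so its value is missing (NaN);
# 'Utah' is in the dict but not in the requested index, so it is left out.
print(by_state)
print(by_state.dtype)
print('Utah' in by_state)
```

Labels requested in the index but absent from the dictionary are filled with NaN; keys present in the dictionary but absent from the index are dropped.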

Accessing elements of a Series



In [9]:

    
obj2['a']









    Out[9]:





-5



In [10]:

    
obj2['d'] = 6
obj2['d']









    Out[10]:





6



In [11]:

    
obj2[['c','a','d']]









    Out[11]:





c    3
a   -5
d    6
dtype: int64



In [12]:

    
obj2[obj2 > 0]









    Out[12]:





d    6
b    7
c    3
dtype: int64

Some operations supported by a Series



In [13]:

    
""" Multiplication by a scalar """

obj2 * 2









    Out[13]:





d    12
b    14
a   -10
c     6
dtype: int64



In [14]:

    
""" NumPy array operations """

import numpy as np

np.exp(obj2)









    Out[14]:





d     403.428793
b    1096.633158
a       0.006738
c      20.085537
dtype: float64



In [15]:

    
""" Dictionary-style membership tests (checked against the index labels) """

'b' in obj2









    Out[15]:





True



In [16]:

    
'e' in obj2









    Out[16]:





False
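
Note that the `in` operator on a Series looks at the index labels, not at the stored values, just as it looks at the keys of a dictionary. A short sketch that makes the difference visible (the Series s below is an illustrative stand-in, not an object from this notebook):

```python
import pandas as pd

s = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

label_found = 'b' in s            # 'b' is an index label
value_as_label = 7 in s           # 7 is a value, not a label
value_found = 7 in s.values       # membership test on the underlying array

print(label_found, value_as_label, value_found)
```

To test for a value rather than a label, check against `s.values` as in the last line.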



In [17]:

    
""" Functions for detecting missing data """

obj4.isnull()









    Out[17]:





California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool



In [18]:

    
obj4.notnull()









    Out[18]:





California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool
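
Once missing entries have been detected, they are usually either dropped or replaced. A hedged sketch using `dropna` and `fillna` on a Series shaped like the one above (the name s and the fill value 0 are choices made for this example only):

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 35000.0, 16000.0, 71000.0],
              index=['California', 'Ohio', 'Oregon', 'Texas'])

without_missing = s.dropna()   # new Series without the NaN entry
filled = s.fillna(0)           # new Series with NaN replaced by 0

print(without_missing)
print(filled)
```

Both methods return a new Series and leave s itself unchanged.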



In [19]:

    
""" Arithmetic operations with automatic index alignment """

obj3 + obj4









    Out[19]:





California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64
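
In the result above, any label that is missing on either side yields NaN. When that is not wanted, the `add` method accepts a `fill_value` that stands in for a label missing on one side only. A minimal sketch (the Series a and b are illustrative and smaller than the ones used in the notebook):

```python
import pandas as pd

a = pd.Series({'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000})
b = pd.Series({'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000})

plain = a + b                      # 'Utah' is missing from b, so the sum is NaN
filled = a.add(b, fill_value=0)    # 'Utah' is treated as 0 in b

print(plain)
print(filled)
```

If a label is missing on both sides, or is NaN on both, the result at that label stays NaN even with `fill_value`.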

DataFrame

A DataFrame represents a tabular data structure, similar to an Excel spreadsheet, containing an ordered collection of columns, each of which can hold a different value type. A DataFrame has both a row index and a column index, and can be thought of as a dict of Series that share the same index. Its full documentation is at: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html
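
The "dict of Series" view can be made concrete with a small sketch. The two Series below (year and state, with made-up labels) do not share exactly the same index, so the resulting row index is the union of both and the gaps are filled with NaN:

```python
import pandas as pd

year = pd.Series([2000, 2001, 2002], index=['one', 'two', 'three'])
state = pd.Series(['Ohio', 'Ohio', 'Nevada'], index=['one', 'two', 'four'])

# Each Series becomes a column; rows are aligned by label
df = pd.DataFrame({'year': year, 'state': state})
print(df)

# Selecting a column gives back a Series again
print(type(df['state']))
```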

Creating a DataFrame



In [20]:

    
""" From a dictionary of lists """

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}

frame = pd.DataFrame(data)
frame



In [21]:

    
""" From a dictionary, with the columns in a specific order """

pd.DataFrame(data, columns=['year', 'state', 'pop'])



In [22]:

    
""" From a dictionary, with explicit column labels and/or row labels """

frame2 = pd.DataFrame(data,
                      columns=['year', 'state', 'pop', 'debt'],
                      index=['one', 'two', 'three', 'four', 'five'])
frame2









    Out[22]:






  
    
      
       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN



In [23]:

    
""" From a dictionary of nested dictionaries """

pop = {'Nevada': {2001: 2.4, 2002: 2.9}, 'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

frame3 = pd.DataFrame(pop)
frame3

Note that these are not the only ways to build a DataFrame. For a more complete picture, see the following table of possible inputs to the DataFrame constructor:

Type	Notes
2D ndarray	A matrix of data, passing optional row and column labels
dict of arrays, lists, or tuples	Each sequence becomes a column in the DataFrame. All sequences must be the same length.
NumPy structured/record array	Treated as the “dict of arrays” case
dict of Series	Each value becomes a column. Indexes from each Series are unioned together to form the result’s row index if no explicit index is passed.
dict of dicts	Each inner dict becomes a column. Keys are unioned to form the row index as in the “dict of Series” case.
list of dicts or Series	Each item becomes a row in the DataFrame. Union of dict keys or Series indexes become the DataFrame’s column labels
List of lists or tuples	Treated as the “2D ndarray” case
Another DataFrame	The DataFrame’s indexes are used unless different ones are passed
NumPy MaskedArray	Like the “2D ndarray” case except masked values become NA/missing in the DataFrame result
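
Two rows of the table that the cells above do not exercise are the "2D ndarray" and the "list of dicts" cases. A minimal sketch of both follows; all names and values here are made up for illustration:

```python
import numpy as np
import pandas as pd

# 2D ndarray, with optional row and column labels
arr = np.arange(6).reshape(3, 2)
from_array = pd.DataFrame(arr, index=['a', 'b', 'c'], columns=['x', 'y'])
print(from_array)

# List of dicts: each dict becomes a row, and the union of the keys
# becomes the set of column labels (missing entries become NaN)
rows = [{'state': 'Ohio', 'pop': 1.5},
        {'state': 'Nevada', 'year': 2001}]
from_rows = pd.DataFrame(rows)
print(from_rows)
```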

Manipulating rows and columns of a DataFrame



In [24]:

    
""" Accessing a column as in a Series or dictionary """

frame2['state']









    Out[24]:





one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
Name: state, dtype: object



In [25]:

    
""" Accessing a column as an attribute """

frame2.year









    Out[25]:





one      2000
two      2001
three    2002
four     2001
five     2002
Name: year, dtype: int64



In [26]:

    
""" Accessing a row by its label (.ix was removed in pandas 1.0; use .loc) """

frame2.loc['three']









    Out[26]:





year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object



In [27]:

    
""" Accessing a row by its integer position (use .iloc instead of .ix) """

frame2.iloc[3]









    Out[27]:





year       2001
state    Nevada
pop         2.4
debt        NaN
Name: four, dtype: object
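
The two kinds of row access shown above, by label and by integer position, correspond to the `.loc` and `.iloc` indexers in current pandas; the older `.ix` indexer, which mixed both behaviours, was removed in pandas 1.0. A small sketch on an illustrative frame (df and its contents are assumptions for this example):

```python
import pandas as pd

df = pd.DataFrame({'year': [2000, 2001, 2002],
                   'state': ['Ohio', 'Ohio', 'Nevada']},
                  index=['one', 'two', 'three'])

by_label = df.loc['three']     # row whose index label is 'three'
by_position = df.iloc[2]       # third row, counting from 0

print(by_label)
print(by_label.equals(by_position))
```

Keeping the two indexers separate avoids ambiguity when the row labels are themselves integers.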



In [28]:

    
""" Setting a whole column to a single value """

frame2['debt'] = 16.5
frame2



In [29]:

    
""" Setting a column from an array """

frame2['debt'] = np.arange(5.)
frame2



In [30]:

    
""" Setting a column from a Series """

val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])

frame2['debt'] = val
frame2









    Out[30]:






  
    
      
       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7
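
The NaN entries above come from index alignment: when a Series is assigned to a column, its labels are matched against the row labels of the DataFrame. A hedged sketch of that rule on a smaller frame (df, partial and the label 'zzz' are made up for this example):

```python
import pandas as pd

df = pd.DataFrame({'pop': [1.5, 1.7, 3.6]}, index=['one', 'two', 'three'])
partial = pd.Series([-1.2, 9.9], index=['two', 'zzz'])

# Values are placed by label: 'two' matches a row, 'zzz' matches none
df['debt'] = partial
print(df)
```

Rows without a matching label receive NaN, and labels in the Series that do not exist in the DataFrame are ignored rather than added as new rows.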



In [31]:

    
""" Adding a column that does not exist yet """

frame2['eastern'] = frame2.state == 'Ohio'
frame2









    Out[31]:






  
    
      
       year   state  pop  debt  eastern
one    2000    Ohio  1.5   NaN     True
two    2001    Ohio  1.7  -1.2     True
three  2002    Ohio  3.6   NaN     True
four   2001  Nevada  2.4  -1.5    False
five   2002  Nevada  2.9  -1.7    False



In [32]:

    
""" Deleting a column """

del frame2['eastern']
frame2.columns









    Out[32]:





Index([u'year', u'state', u'pop', u'debt'], dtype='object')
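
The `del` statement removes the column from the DataFrame in place. When the original frame should stay intact, one alternative is `drop`, which returns a new DataFrame. A minimal sketch, assuming a pandas version that supports the `columns=` keyword of `drop` (df and trimmed are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({'year': [2000, 2001], 'eastern': [True, False]})

# drop returns a new DataFrame; df itself keeps the 'eastern' column
trimmed = df.drop(columns=['eastern'])

print(list(df.columns))
print(list(trimmed.columns))
```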

Out[20] (frame):

   pop   state  year
0  1.5    Ohio  2000
1  1.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002

Out[28] (frame2):

       year   state  pop  debt
one    2000    Ohio  1.5  16.5
two    2001    Ohio  1.7  16.5
three  2002    Ohio  3.6  16.5
four   2001  Nevada  2.4  16.5
five   2002  Nevada  2.9  16.5
