Analise de Dados em Python com PANDAS


Original: McKinney - NextDayVideo Conference, March 9 2012


In [114]:
import pandas
#from pandas import *
from pandas import Series, DataFrame

import numpy as np
#import import datetime

"""
 chamei o script com --pylab=inline entao nao preciso a linha abaix
 caso contrario descomente-a
"""
#mport matplotlib as plt 


def side_by_side(*objs, **kwds):
    from pandas.core.common import adjoin
    space = kwds.get('space', 4)
    reprs = [repr(obj).split('\n') for obj in objs]
    print(adjoin(space, *reprs))
    
plt.rc('figure', figsize=(10, 6))
#Obsoleto: pandas.set_printoptions(notebook_repr_html=False)
#Atual:  pandas.set_option('display.notebook_repr_html', True)
pandas.set_option('display.notebook_repr_html', False)

In [13]:
#Criando uma lista simples com strings
labels = ['a', 'b','c','d','e']
# randn(5) array de 5 dados aleatorios
#Segundo argumento diz que as labels serao os indices do array
s = Series(randn(5), index=labels)
s


Out[13]:
a    0.288637
b   -0.454567
c    0.826371
d    0.807914
e   -1.271622
dtype: float64

Os Objetos Series em um comportamento semelhande aos dicionarios, sendo formadas por um pars indice-valor


In [23]:
'b' in s


Out[23]:
True

In [26]:
s['c']


Out[26]:
0.82637145593922612

Conversoes


In [27]:
mapping = s.to_dict()
mapping


Out[27]:
{'e': -1.2716217775174052,
 'd': 0.80791350297457454,
 'a': 0.28863694845010518,
 'b': -0.45456719368292797,
 'c': 0.82637145593922612}

In [31]:
# Eh possivel passar um dicionario como argumento de um Objeto Series
s = Series(mapping)
s


Out[31]:
a    0.288637
b   -0.454567
c    0.826371
d    0.807914
e   -1.271622
dtype: float64

In [33]:
# Eh possivel alterar a ordem e  os indices
s = Series(mapping, index=['b','e','a','d','f'])
s


Out[33]:
b   -0.454567
e   -1.271622
a    0.288637
d    0.807914
f         NaN
dtype: float64

Observamos que o valor para o novo indice 'f' eh NaN, pois nao possuia valor no mapping original. Este eh o comportamento padrao para dados ausentes (missing data).

No modulo pandas existem duas funcoes que identificam esses valores, resultando em um valor logico: isnull e notnull.


In [34]:
isnull(s)


Out[34]:
b    False
e    False
a    False
d    False
f     True
dtype: bool

In [36]:
# Encontrando ela
s[isnull(s)]


Out[36]:
f   NaN
dtype: float64

In [37]:
notnull(s)


Out[37]:
b     True
e     True
a     True
d     True
f    False
dtype: bool

In [43]:
# Removendo dadso ausentes (NaN) - Lembre da imutabilidade
s.dropna()


Out[43]:
b   -0.454567
e   -1.271622
a    0.288637
d    0.807914
dtype: float64

In [42]:
s.index


Out[42]:
Index(['b', 'e', 'a', 'd', 'f'], dtype='object')

DataFrame: Colecao 2D de Series



In [45]:
df = DataFrame({
    'a': np.random.randn(6),
    'b': ['foo', 'bar'] * 3,
    'c': np.random.randn(6)
})

In [50]:
df.index


Out[50]:
Int64Index([0, 1, 2, 3, 4, 5], dtype='int64')

In [51]:
df.columns


Out[51]:
Index(['a', 'b', 'c'], dtype='object')

In [60]:
df


Out[60]:
          a    b         c
0 -0.380374  foo  0.561932
1  1.215852  bar  0.362633
2  0.219716  foo -0.142641
3 -0.757877  bar -0.698091
4  0.537636  foo -1.377559
5 -0.428384  bar  1.858543

In [100]:
# Criando e atribuindo valores a uma nova coluna
df['d'] = 5

In [62]:
df


Out[62]:
          a    b         c  d
0 -0.380374  foo  0.561932  5
1  1.215852  bar  0.362633  5
2  0.219716  foo -0.142641  5
3 -0.757877  bar -0.698091  5
4  0.537636  foo -1.377559  5
5 -0.428384  bar  1.858543  5

In [69]:
# Slicing ex: as ultimas duas linhas
df[-2:]


Out[69]:
          a    b         c  d
4  0.537636  foo -1.377559  5
5 -0.428384  bar  1.858543  5

In [73]:
# O metodo xs(n) retorna uma serie da linha n sendo os nomes
# das colunas seus indices
df.xs(0)


Out[73]:
a   -0.3803736
b          foo
c    0.5619322
d            5
Name: 0, dtype: object

In [85]:
timeit df.ix[0, 'b']


100000 loops, best of 3: 5.37 µs per loop

In [79]:
timeit df.get_value(2, 'b')


100000 loops, best of 3: 2.99 µs per loop

In [87]:
# Fatie a segunda linha ate a linha 4, nas colunas 'b' e 'c' 
df.ix[2:4, ['b','c'] ]


Out[87]:
     b         c
2  foo -0.142641
3  bar -0.698091
4  foo -1.377559

In [99]:
# Retorne as linhas em que apenas os valores da coluna c 
# sejam maiores do que zero
df[df.c > 0]


Out[99]:
          a    b         c  d
0 -0.380374  foo  0.561932  5
1  1.215852  bar  0.362633  5
5 -0.428384  bar  1.858543  5

In [97]:
df['c'] > 0


Out[97]:
0     True
1     True
2    False
3    False
4    False
5     True
Name: c, dtype: bool

Pandas pode gerar indices predefinidos, por exemplo datas


In [101]:
# Pandas pode gerar indices predefinidos, por exemplo datas

In [115]:
df = DataFrame({
    'a': np.random.randn(6),
    'b': ['foo', 'bar'] * 3,
    'c': np.random.randn(6)}, 
    index=date_range('1/1/2000', periods=6)
    )

In [118]:
df


Out[118]:
                   a    b         c
2000-01-01 -1.002696  foo -0.621513
2000-01-02  1.337229  bar -0.378102
2000-01-03  0.245945  foo  0.298781
2000-01-04  0.033842  bar -0.907996
2000-01-05  0.185411  foo  1.136669
2000-01-06 -0.904722  bar  0.728981

In [121]:
df = DataFrame({
    'a': np.random.randn(6),
    'b': ['foo', 'bar'] * 3,
    'c': np.random.randn(6)
}, columns=['a','b','c','d'])

In [124]:
df
#isnull(df)


Out[124]:
          a    b         c    d
0  1.263390  foo  1.350885  NaN
1  0.664581  bar -0.463260  NaN
2  0.413526  foo -1.756543  NaN
3  1.371956  bar  0.609647  NaN
4  1.385532  foo  0.926239  NaN
5 -0.233507  bar  0.967252  NaN

Criando a partir de dicionarios aninhados



In [152]:
data = {}
for col in ['foo','bar','baz']:
    for row in ['a', 'b', 'c', 'd']:
        data.setdefault(col, {})[row] = rand()
        
data


Out[152]:
{'bar': {'d': 0.0524687957944856,
  'a': 0.9562966579289016,
  'b': 0.5913622415423637,
  'c': 0.8251898086165323},
 'baz': {'d': 0.7021834919304516,
  'a': 0.9910727538132068,
  'b': 0.3274066961511257,
  'c': 0.8436968913186512},
 'foo': {'d': 0.2803303160608259,
  'a': 0.05018655537100347,
  'b': 0.5102109808663076,
  'c': 0.2819725832615575}}

In [153]:
# Deletando um valor
del data['foo']['c']

In [154]:
DataFrame(data)


Out[154]:
        bar       baz       foo
a  0.956297  0.991073  0.050187
b  0.591362  0.327407  0.510211
c  0.825190  0.843697       NaN
d  0.052469  0.702183  0.280330

Observe o valor faltante onde usamos o comando del data['foo']['c'] sendo substituido por NaN.

Alinhamento de dados



In [162]:
# Lendo dados de um arquivo .csv
close_px = read_csv('dados/stock_data.csv', index_col=0, parse_dates=True)
# Comandos de terminal podem ser usados iniciando a linha com !

#!head dados/stock_data.csv
# Via funcao do pandas close_px.head()

In [167]:
close_px


Out[167]:
               AA    AAPL     GE     IBM    JNJ   MSFT    PEP      SPX    XOM
2007-10-29  37.41  185.09  34.46  106.78  57.13  31.78  65.67  1540.98  85.51
2007-10-30  36.43  187.00  34.39  106.15  56.99  32.70  65.80  1531.02  83.25
2007-10-31  36.79  189.95  34.97  108.01  57.30  33.84  65.69  1549.38  84.03
2007-11-01  35.22  187.44  34.27  105.72  56.85  34.07  64.51  1508.44  80.84
2007-11-02  35.83  187.87  34.27  106.59  56.95  34.07  65.03  1509.65  80.32
2007-11-05  35.22  186.18  34.15  105.48  56.70  33.77  64.84  1502.17  80.07
2007-11-06  35.84  191.79  34.14  105.27  56.80  33.47  65.49  1520.27  82.56
2007-11-07  34.76  186.30  33.20  103.69  56.19  32.65  64.46  1475.62  79.96
2007-11-08  35.00  175.47  33.15   99.05  56.79  31.94  65.50  1474.77  82.00
2007-11-09  34.47  165.37  32.61   93.58  57.29  31.01  65.48  1453.70  79.64
2007-11-12  33.33  153.76  32.50   94.70  58.30  30.69  64.67  1439.18  77.52
2007-11-13  34.04  169.96  33.31   98.27  59.18  31.79  65.35  1481.05  79.67
2007-11-14  34.25  166.11  33.14   96.56  58.81  31.30  65.99  1470.58  79.15
2007-11-15  33.76  164.30  32.55   96.71  58.80  31.14  65.57  1451.15  77.48
2007-11-16  33.74  166.39  32.84   97.82  59.56  31.44  66.08  1458.74  78.04
2007-11-19  32.59  163.95  32.42   95.42  59.42  31.32  66.90  1433.27  77.13
2007-11-20  32.88  168.85  32.32   96.54  59.47  31.90  66.69  1439.70  80.53
2007-11-21  32.68  168.46  31.58   95.42  59.03  31.57  66.87  1416.77  79.82
2007-11-23  32.66  171.54  32.01   97.13  59.16  31.46  67.29  1440.70  80.96
2007-11-26  32.41  172.54  31.21   95.19  59.60  30.41  66.56  1407.22  78.57
2007-11-27  32.56  174.81  31.82   96.92  59.92  30.49  67.65  1428.23  79.21
2007-11-28  33.65  180.22  32.68  100.23  60.46  31.08  68.76  1469.02  80.62
2007-11-29  33.97  184.29  32.40  100.35  60.51  30.98  68.20  1469.72  81.24
2007-11-30  33.80  182.22  32.53   98.18  59.93  30.99  68.78  1481.14  81.76
2007-12-03  32.98  178.86  31.38   98.79  59.90  30.36  68.24  1472.42  81.48
2007-12-04  32.76  179.81  30.80   99.54  60.10  30.23  68.30  1462.79  80.81
2007-12-05  33.09  185.50  31.19  100.97  60.34  31.50  68.30  1485.01  82.46
2007-12-06  33.32  189.95  31.66  102.40  60.42  31.87  68.98  1507.34  83.85
2007-12-07  34.30  194.30  31.63  101.62  59.87  31.85  68.95  1504.66  83.91
2007-12-10  35.17  194.21  31.78  102.11  59.97  32.06  69.20  1515.96  84.39
...           ...     ...    ...     ...    ...    ...    ...      ...    ...
2011-09-02  12.04  374.05  15.61  166.98  64.07  25.80  63.30  1173.97  72.14
2011-09-06  11.77  379.74  15.11  165.11  64.64  25.51  62.45  1165.24  71.15
2011-09-07  12.25  383.93  15.65  167.31  65.43  26.00  61.62  1198.62  73.65
2011-09-08  12.03  384.14  15.44  165.25  64.95  26.22  61.34  1185.90  72.82
2011-09-09  11.58  377.48  14.95  161.37  63.64  25.74  59.99  1154.23  71.01
2011-09-12  11.55  379.94  14.87  162.42  63.59  25.89  60.14  1162.27  71.84
2011-09-13  11.63  384.62  15.26  163.43  63.61  26.04  60.54  1172.87  71.65
2011-09-14  11.73  389.30  15.64  167.24  63.73  26.50  61.58  1188.68  72.64
2011-09-15  11.98  392.96  16.08  170.09  64.40  26.99  63.22  1209.11  74.01
2011-09-16  11.97  400.50  16.33  172.99  64.59  27.12  62.05  1216.01  74.55
2011-09-19  11.58  411.63  16.18  173.13  64.14  27.21  60.56  1204.09  73.70
2011-09-20  11.25  413.45  16.04  174.72  64.22  26.98  60.39  1202.09  74.01
2011-09-21  10.84  412.14  15.38  173.02  63.13  25.99  60.79  1166.76  71.97
2011-09-22  10.11  401.82  15.04  168.62  61.92  25.06  60.92  1129.56  69.24
2011-09-23  10.07  404.30  15.21  169.34  61.59  25.06  60.34  1136.43  69.31
2011-09-26  10.45  403.17  15.57  174.51  62.69  25.44  61.89  1162.95  71.72
2011-09-27  10.48  399.26  15.76  177.71  63.82  25.67  62.43  1175.38  72.91
2011-09-28   9.97  397.01  15.45  177.55  63.25  25.58  61.97  1151.06  72.07
2011-09-29  10.06  390.57  15.86  179.17  63.90  25.45  62.58  1160.40  73.88
2011-09-30   9.57  381.32  15.22  174.87  63.69  24.89  61.90  1131.42  72.63
2011-10-03   8.90  374.60  14.69  173.29  62.08  24.53  60.29  1099.23  71.15
2011-10-04   9.12  372.50  14.86  174.74  62.17  25.34  60.45  1123.95  72.83
2011-10-05   9.37  378.25  15.27  176.85  62.35  25.89  60.29  1144.03  73.95
2011-10-06   9.88  377.37  15.53  181.69  62.81  26.34  60.57  1164.97  73.89
2011-10-07   9.71  369.80  15.50  182.39  63.13  26.25  61.02  1155.46  73.56
2011-10-10  10.09  388.81  16.14  186.62  64.43  26.94  61.87  1194.89  76.28
2011-10-11  10.30  400.29  16.14  185.00  63.96  27.00  60.95  1195.54  76.27
2011-10-12  10.05  402.19  16.40  186.12  64.33  26.96  62.70  1207.25  77.16
2011-10-13  10.10  408.43  16.22  186.82  64.23  27.18  62.36  1203.66  76.37
2011-10-14  10.26  422.00  16.60  190.53  64.72  27.27  62.24  1224.58  78.11

[1000 rows x 9 columns]

In [173]:
# Criando series com partes dos papeis da Apple
s1 = close_px['AAPL'][-20:]
s2 = close_px['AAPL'][-25:-10]

# Usando a funcao definida no inicio desse material para 
# apresenta-las lado a lado
side_by_side(s1, s2)


2011-09-19    411.63          2011-09-12    379.94      
2011-09-20    413.45          2011-09-13    384.62      
2011-09-21    412.14          2011-09-14    389.30      
2011-09-22    401.82          2011-09-15    392.96      
2011-09-23    404.30          2011-09-16    400.50      
2011-09-26    403.17          2011-09-19    411.63      
2011-09-27    399.26          2011-09-20    413.45      
2011-09-28    397.01          2011-09-21    412.14      
2011-09-29    390.57          2011-09-22    401.82      
2011-09-30    381.32          2011-09-23    404.30      
2011-10-03    374.60          2011-09-26    403.17      
2011-10-04    372.50          2011-09-27    399.26      
2011-10-05    378.25          2011-09-28    397.01      
2011-10-06    377.37          2011-09-29    390.57      
2011-10-07    369.80          2011-09-30    381.32      
2011-10-10    388.81          Name: AAPL, dtype: float64
2011-10-11    400.29                                    
2011-10-12    402.19                                    
2011-10-13    408.43                                    
2011-10-14    422.00                                    
Name: AAPL, dtype: float64                              

In [186]:
s1 + s2


Out[186]:
2011-09-12       NaN
2011-09-13       NaN
2011-09-14       NaN
2011-09-15       NaN
2011-09-16       NaN
2011-09-19    823.26
2011-09-20    826.90
2011-09-21    824.28
2011-09-22    803.64
2011-09-23    808.60
2011-09-26    806.34
2011-09-27    798.52
2011-09-28    794.02
2011-09-29    781.14
2011-09-30    762.64
2011-10-03       NaN
2011-10-04       NaN
2011-10-05       NaN
2011-10-06       NaN
2011-10-07       NaN
2011-10-10       NaN
2011-10-11       NaN
2011-10-12       NaN
2011-10-13       NaN
2011-10-14       NaN
Name: AAPL, dtype: float64

Observe que na soma, onde as series nao se sobrepoe os valores sao substituidos por valores ausentes NaN

  • Existe tambem a possibilidade de escolhermos qual valore serah substituido pelo valor ausente ao inves de NaN padrao. O valor ausente, zero por exemplo: s1.add(s2, fill_value=0)
    2011-09-12    379.94
    2011-09-13    384.62
    2011-09-14    389.30
    2011-09-15    392.96
    2011-09-16    400.50
    2011-09-19    823.26
    2011-09-20    826.90
    2011-09-21    824.28
    2011-09-22    803.64
    2011-09-23    808.60
    2011-09-26    806.34
    2011-09-27    798.52
    2011-09-28    794.02
    2011-09-29    781.14
    2011-09-30    762.64
    2011-10-03    374.60
    2011-10-04    372.50
    2011-10-05    378.25
    2011-10-06    377.37
    2011-10-07    369.80
    2011-10-10    388.81
    2011-10-11    400.29
    2011-10-12    402.19
    2011-10-13    408.43
    2011-10-14    422.00
    Name: AAPL, dtype: float64

In [177]:
# Selecinando apenas situacoes em que elas se sobrepoe e possuem valres
(s1 + s2).dropna()


Out[177]:
2011-09-19    823.26
2011-09-20    826.90
2011-09-21    824.28
2011-09-22    803.64
2011-09-23    808.60
2011-09-26    806.34
2011-09-27    798.52
2011-09-28    794.02
2011-09-29    781.14
2011-09-30    762.64
Name: AAPL, dtype: float64

In [184]:
# Fatiar apenasultimas 10 linhas e apenas as primeiras 3 colunas
df = close_px.ix[-10:,:3]
df


Out[184]:
               AA    AAPL     GE
2011-10-03   8.90  374.60  14.69
2011-10-04   9.12  372.50  14.86
2011-10-05   9.37  378.25  15.27
2011-10-06   9.88  377.37  15.53
2011-10-07   9.71  369.80  15.50
2011-10-10  10.09  388.81  16.14
2011-10-11  10.30  400.29  16.14
2011-10-12  10.05  402.19  16.40
2011-10-13  10.10  408.43  16.22
2011-10-14  10.26  422.00  16.60

In [188]:
# Mostrando lado a lado s1 e s2, em que a serie s1 foi reindexada pelo
# indice da s2
side_by_side(s1.ix[s2.index], s2)


2011-09-12       NaN          2011-09-12    379.94      
2011-09-13       NaN          2011-09-13    384.62      
2011-09-14       NaN          2011-09-14    389.30      
2011-09-15       NaN          2011-09-15    392.96      
2011-09-16       NaN          2011-09-16    400.50      
2011-09-19    411.63          2011-09-19    411.63      
2011-09-20    413.45          2011-09-20    413.45      
2011-09-21    412.14          2011-09-21    412.14      
2011-09-22    401.82          2011-09-22    401.82      
2011-09-23    404.30          2011-09-23    404.30      
2011-09-26    403.17          2011-09-26    403.17      
2011-09-27    399.26          2011-09-27    399.26      
2011-09-28    397.01          2011-09-28    397.01      
2011-09-29    390.57          2011-09-29    390.57      
2011-09-30    381.32          2011-09-30    381.32      
Name: AAPL, dtype: float64    Name: AAPL, dtype: float64

Metodos de alinhamento


In [195]:
# Intersection
b,c = s1.align(s2, join='inner')
side_by_side(b,c)


2011-09-19    411.63          2011-09-19    411.63      
2011-09-20    413.45          2011-09-20    413.45      
2011-09-21    412.14          2011-09-21    412.14      
2011-09-22    401.82          2011-09-22    401.82      
2011-09-23    404.30          2011-09-23    404.30      
2011-09-26    403.17          2011-09-26    403.17      
2011-09-27    399.26          2011-09-27    399.26      
2011-09-28    397.01          2011-09-28    397.01      
2011-09-29    390.57          2011-09-29    390.57      
2011-09-30    381.32          2011-09-30    381.32      
Name: AAPL, dtype: float64    Name: AAPL, dtype: float64

In [194]:
#
b,c = s1.align(s2, join='outer')
side_by_side(b,c)


2011-09-12       NaN          2011-09-12    379.94      
2011-09-13       NaN          2011-09-13    384.62      
2011-09-14       NaN          2011-09-14    389.30      
2011-09-15       NaN          2011-09-15    392.96      
2011-09-16       NaN          2011-09-16    400.50      
2011-09-19    411.63          2011-09-19    411.63      
2011-09-20    413.45          2011-09-20    413.45      
2011-09-21    412.14          2011-09-21    412.14      
2011-09-22    401.82          2011-09-22    401.82      
2011-09-23    404.30          2011-09-23    404.30      
2011-09-26    403.17          2011-09-26    403.17      
2011-09-27    399.26          2011-09-27    399.26      
2011-09-28    397.01          2011-09-28    397.01      
2011-09-29    390.57          2011-09-29    390.57      
2011-09-30    381.32          2011-09-30    381.32      
2011-10-03    374.60          2011-10-03       NaN      
2011-10-04    372.50          2011-10-04       NaN      
2011-10-05    378.25          2011-10-05       NaN      
2011-10-06    377.37          2011-10-06       NaN      
2011-10-07    369.80          2011-10-07       NaN      
2011-10-10    388.81          2011-10-10       NaN      
2011-10-11    400.29          2011-10-11       NaN      
2011-10-12    402.19          2011-10-12       NaN      
2011-10-13    408.43          2011-10-13       NaN      
2011-10-14    422.00          2011-10-14       NaN      
Name: AAPL, dtype: float64    Name: AAPL, dtype: float64

In [196]:
#
b,c = s1.align(s2, join='right')
side_by_side(b,c)


2011-09-12       NaN          2011-09-12    379.94      
2011-09-13       NaN          2011-09-13    384.62      
2011-09-14       NaN          2011-09-14    389.30      
2011-09-15       NaN          2011-09-15    392.96      
2011-09-16       NaN          2011-09-16    400.50      
2011-09-19    411.63          2011-09-19    411.63      
2011-09-20    413.45          2011-09-20    413.45      
2011-09-21    412.14          2011-09-21    412.14      
2011-09-22    401.82          2011-09-22    401.82      
2011-09-23    404.30          2011-09-23    404.30      
2011-09-26    403.17          2011-09-26    403.17      
2011-09-27    399.26          2011-09-27    399.26      
2011-09-28    397.01          2011-09-28    397.01      
2011-09-29    390.57          2011-09-29    390.57      
2011-09-30    381.32          2011-09-30    381.32      
Name: AAPL, dtype: float64    Name: AAPL, dtype: float64

In [201]:
df = close_px.ix[-10:, ['AAPL','IBM','MSFT']]
df


Out[201]:
              AAPL     IBM   MSFT
2011-10-03  374.60  173.29  24.53
2011-10-04  372.50  174.74  25.34
2011-10-05  378.25  176.85  25.89
2011-10-06  377.37  181.69  26.34
2011-10-07  369.80  182.39  26.25
2011-10-10  388.81  186.62  26.94
2011-10-11  400.29  185.00  27.00
2011-10-12  402.19  186.12  26.96
2011-10-13  408.43  186.82  27.18
2011-10-14  422.00  190.53  27.27

Eh possivel aplicar uma serie de funcoes nas linhas e colunas de um DataFrame usando o metodo

df.apply(funcs)


In [232]:
# Media sobre linhas axis=1 e sobre colunas axis=0
#df.apply(np.mean, axis=1)
# df.mean(1, skipna=False) ou True caso omitir valores ausentes
df.mean(1)


Out[232]:
2011-10-03    190.806667
2011-10-04    190.860000
2011-10-05    193.663333
2011-10-06    195.133333
2011-10-07    192.813333
2011-10-10    200.790000
2011-10-11    204.096667
2011-10-12    205.090000
2011-10-13    207.476667
2011-10-14    213.266667
dtype: float64

In [233]:
# Media sobre colunas
df.mean()


Out[233]:
AAPL    389.424
IBM     182.405
MSFT     26.370
dtype: float64

In [237]:
"""
 Criando uma funcao que encontra os indices dos valores maximos das 
series em um dataframe
"""
def peak_date(series):
    return series.idxmax()

close_px.apply(peak_date)


Out[237]:
AA     2008-05-19
AAPL   2011-10-14
GE     2007-10-31
IBM    2011-10-14
JNJ    2011-07-07
MSFT   2007-11-01
PEP    2008-01-10
SPX    2007-10-31
XOM    2008-05-20
dtype: datetime64[ns]

In [238]:
#
df.apply(lambda x: x.max() - x.min()) # np.ptp


Out[238]:
AAPL    52.20
IBM     17.24
MSFT     2.74
dtype: float64

In [252]:
#np.log(close_px)
close_px.head(5)


Out[252]:
               AA    AAPL     GE     IBM    JNJ   MSFT    PEP      SPX    XOM
2007-10-29  37.41  185.09  34.46  106.78  57.13  31.78  65.67  1540.98  85.51
2007-10-30  36.43  187.00  34.39  106.15  56.99  32.70  65.80  1531.02  83.25
2007-10-31  36.79  189.95  34.97  108.01  57.30  33.84  65.69  1549.38  84.03
2007-11-01  35.22  187.44  34.27  105.72  56.85  34.07  64.51  1508.44  80.84
2007-11-02  35.83  187.87  34.27  106.59  56.95  34.07  65.03  1509.65  80.32

In [253]:
close_px.tail(5)


Out[253]:
               AA    AAPL     GE     IBM    JNJ   MSFT    PEP      SPX    XOM
2011-10-10  10.09  388.81  16.14  186.62  64.43  26.94  61.87  1194.89  76.28
2011-10-11  10.30  400.29  16.14  185.00  63.96  27.00  60.95  1195.54  76.27
2011-10-12  10.05  402.19  16.40  186.12  64.33  26.96  62.70  1207.25  77.16
2011-10-13  10.10  408.43  16.22  186.82  64.23  27.18  62.36  1203.66  76.37
2011-10-14  10.26  422.00  16.60  190.53  64.72  27.27  62.24  1224.58  78.11

In [255]:
!head -5 dados/stock_data.csv; echo '\n';tail -5 dados/stock_data.csv;


,AA,AAPL,GE,IBM,JNJ,MSFT,PEP,SPX,XOM
2007-10-29 00:00:00,37.41,185.09,34.46,106.78,57.13,31.78,65.67,1540.98,85.51
2007-10-30 00:00:00,36.43,187.0,34.39,106.15,56.99,32.7,65.8,1531.02,83.25
2007-10-31 00:00:00,36.79,189.95,34.97,108.01,57.3,33.84,65.69,1549.38,84.03
2007-11-01 00:00:00,35.22,187.44,34.27,105.72,56.85,34.07,64.51,1508.44,80.84


2011-10-10 00:00:00,10.09,388.81,16.14,186.62,64.43,26.94,61.87,1194.89,76.28
2011-10-11 00:00:00,10.3,400.29,16.14,185.0,63.96,27.0,60.95,1195.54,76.27
2011-10-12 00:00:00,10.05,402.19,16.4,186.12,64.33,26.96,62.7,1207.25,77.16
2011-10-13 00:00:00,10.1,408.43,16.22,186.82,64.23,27.18,62.36,1203.66,76.37
2011-10-14 00:00:00,10.26,422.0,16.6,190.53,64.72,27.27,62.24,1224.58,78.11

Plotting

Algumas plogatens basicas com biblioteca matplotlib aplicadas as Series/ataframes


In [242]:
#%matplotlib inline
close_px[['AAPL','IBM','MSFT','XOM']].plot()


Out[242]:
<matplotlib.axes.AxesSubplot at 0x7fa0fca67278>

In [260]:
#Mostrando a ultima linha
close_px.ix[-1]


Out[260]:
AA        10.26
AAPL     422.00
GE        16.60
IBM      190.53
JNJ       64.72
MSFT      27.27
PEP       62.24
SPX     1224.58
XOM       78.11
Name: 2011-10-14 00:00:00, dtype: float64

In [258]:
# Fechamento no ultimo dia (ultima linha)
close_px.ix[-1].plot(kind='bar')
title('Preco em %s' % close_px.index[-1])
axhline(0)


Out[258]:
<matplotlib.lines.Line2D at 0x7fa0fd261780>

In [ ]: