Pandas


In [ ]:
#Install the pandas data-access libraries
!pip install pandas-datareader
#There is a bug in the latest version of html5lib, so pin an earlier version
!pip install --upgrade html5lib==1.0b8

#Restart the kernel after installing html5lib

Imports


In [2]:
import pandas as pd #pandas library
from pandas_datareader import data #data readers (google, html, etc.)
#The following line ensures that graphs are rendered in the notebook
%matplotlib inline 
import numpy as np
import matplotlib.pyplot as plt #Plotting library
import datetime as dt #datetime for timeseries support

The structure of a dataframe


In [3]:
pd.DataFrame([[1,2,3],[1,2,3]],columns=['A','B','C'])


Out[3]:
   A  B  C
0  1  2  3
1  1  2  3

Accessing columns and rows


In [40]:
df = pd.DataFrame([['r1','00','01','02'],['r2','10','11','12'],['r3','20','21','22']],columns=['row_label','A','B','C'])
print(id(df))
df.set_index('row_label',inplace=True) #inplace=True modifies df itself: same id before and after
print(id(df))
df


4562807608
4562807608
Out[40]:
           A   B   C
row_label            
r1         00  01  02
r2         10  11  12
r3         20  21  22

Building a dataframe from a dict of columns


In [39]:
data = {'nationality': ['UK', 'China', 'US', 'UK', 'Japan', 'China', 'UK', 'UK', 'Japan', 'US'],
        'age': [25, 30, 15, np.nan, 25, 22, np.nan,45 ,18, 33],
        'type': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'diabetes': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}

labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df=pd.DataFrame(data=data,index=labels)
#print(df[df['age'].between(20,30)])
#print(df.groupby('nationality').mean()['age'])
#print(df.sort_values(by=['age','type'],ascending=[False,True]))
df['nationality'] = df['nationality'].replace('US','United States')
print(df)


    age diabetes    nationality  type
a  25.0      yes             UK     1
b  30.0      yes          China     3
c  15.0       no  United States     2
d   NaN      yes             UK     3
e  25.0       no          Japan     2
f  22.0       no          China     3
g   NaN       no             UK     1
h  45.0      yes             UK     1
i  18.0       no          Japan     2
j  33.0       no  United States     1

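As an aside, the missing ages could be filled in, for example with the column mean. A small sketch using fillna (illustrative only; not run above):


In [ ]:
#Fill the missing ages with the mean of the observed ages
df['age'].fillna(df['age'].mean())
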
Back to the row_label dataframe


In [41]:
df.iloc[1] #.ix is deprecated; use .iloc for positional access (or .loc for labels)


Out[41]:
A    10
B    11
C    12
Name: r2, dtype: object

Getting column data


In [42]:
df['B']


Out[42]:
row_label
r1    01
r2    11
r3    21
Name: B, dtype: object

Getting row data


In [43]:
df.loc['r1']


Out[43]:
A    00
B    01
C    02
Name: r1, dtype: object

Getting a row by row number


In [44]:
df.iloc[0]


Out[44]:
A    00
B    01
C    02
Name: r1, dtype: object

Getting multiple columns


In [45]:
df[['B','A']] #Note that the column identifiers are in a list


Out[45]:
           B   A
row_label        
r1         01  00
r2         11  10
r3         21  20

Getting a specific cell


In [46]:
df.loc['r2','B']


Out[46]:
'11'

In [47]:
df.loc['r2']['A']


Out[47]:
'10'
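
For single-cell lookups, the .at accessor is a fast scalar alternative to .loc (a small aside):


In [ ]:
#.at is optimized for getting or setting a single scalar value
df.at['r2','B']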

Slicing


In [49]:
print(df)
print(df.loc['r1':'r2'])


            A   B   C
row_label            
r1         00  01  02
r2         10  11  12
r3         20  21  22
            A   B   C
row_label            
r1         00  01  02
r2         10  11  12

In [50]:
df.loc['r1':'r2','B':'C']


Out[50]:
           B   C
row_label        
r1         01  02
r2         11  12

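Note that label slices with .loc include both endpoints, unlike Python list slices; positional slices with .iloc exclude the stop position. A quick check:


In [ ]:
#Label slicing includes both endpoints...
print(df.loc['r1':'r2'])
#...while positional slicing excludes the stop index, so this selects the same two rows
print(df.iloc[0:2])
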
Pandas datareader

  • Access data from HTML tables on any web page
  • Get data from Google Finance
  • Get data from the Federal Reserve

    HTML Tables

  • pandas can read the tables in an HTML page into dataframes
  • The read_html function returns a list of dataframes, one for each HTML table on the page
  • Example: read the tables on a currency exchange-rate page

    
    
    In [60]:
    #df_list = pd.read_html('http://www.bloomberg.com/markets/currencies/major')
    df_list = pd.read_html('http://www.waihuipaijia.cn/', encoding='utf-8') #a Chinese FX-rate page
    print(len(df_list))
    
    
    
    
    2
    

    read_html found two tables on this page, so the list has two elements; let's look at the first one

    
    
    In [61]:
    df = df_list[0]
    print(df)
    
    
    
    
              0                                         1
    0  中行外汇牌价查询  中国银行外汇牌价更新:2017-08-30 06:37:40 外汇交易平台排行榜
    

    read_html can detect header rows automatically, but none was found here, so the columns are simply numbered 0 and 1; this first table is just the page banner ("Bank of China FX rate query; rates updated 2017-08-30 06:37:40")

    The next few cells were written for the Bloomberg currencies table (commented out above), whose columns include Currency, Value, and Change. If an index is necessary, we need to specify it explicitly

    
    
    In [ ]:
    df.set_index('Currency',inplace=True)
    print(df)
    

    Now we can use .loc to extract specific currency rates

    
    
    In [ ]:
    df.loc['EUR-CHF','Value']
    

    Working with views and copies

    Chained indexing (two successive [] lookups) may return a copy, so changes made through it are typically not reflected in the original dataframe; pandas flags such assignments with a SettingWithCopyWarning

    
    
    In [ ]:
    eur_usd = df.loc['EUR-USD']['Change'] #This is chained indexing: the scalar comes from a temporary object
    df.loc['EUR-USD']['Change'] = 1.0 #This assigns into a temporary copy, not the dataframe (pandas warns)
    print(eur_usd)
    print(df.loc['EUR-USD']['Change']) #Neither eur_usd nor the dataframe has changed
    
    
    
    In [ ]:
    eur_usd = df.loc['EUR-USD','Change'] #A single .loc lookup returns the scalar value (a copy, not a view)
    df.loc['EUR-USD','Change'] = 1.0 #A single .loc assignment updates the dataframe itself
    print(eur_usd) #eur_usd still holds the old value: the scalar was copied, not linked
    print(df.loc['EUR-USD','Change']) #The dataframe has been correctly updated
    
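    If an independent copy is actually wanted, ask for one explicitly with .copy(); changes to the copy never touch the original. A small sketch, continuing the assumed currencies frame:


    In [ ]:
    #An explicit copy is independent of the original dataframe
    change_col = df['Change'].copy()
    change_col['EUR-USD'] = 2.0 #modifies only the copy, not df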

    Getting historical stock prices from Google Finance

    Usage: DataReader(ticker, source, start_date, end_date)
    Unfortunately, the Yahoo Finance datareader has stopped working because of a change to Yahoo's website

    
    
    In [62]:
    from pandas_datareader import data
    import datetime as dt
    start=dt.datetime(2017, 1, 1)
    end=dt.datetime.today()
    
    
    print(start,end)
    
    
    df = data.DataReader('IBM', 'google', start, end)
    
    
    
    
    2017-01-01 00:00:00 2017-08-30 14:58:14.819387
    
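
    Note: the 'google' source has itself been deprecated in newer pandas-datareader releases, so the call above may fail too. A hedged sketch using the Stooq endpoint instead (assuming your pandas-datareader version supports it):


    In [ ]:
    #A sketch: fetch IBM daily prices from Stooq instead of Google Finance
    df = data.DataReader('IBM', 'stooq', start, end)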
    
    
    In [63]:
    df
    
    
    
    
    Out[63]:
    Open High Low Close Volume
    Date
    2017-01-03 167.00 167.87 166.01 167.19 2934299
    2017-01-04 167.77 169.87 167.36 169.26 3381432
    2017-01-05 169.25 169.39 167.26 168.70 2682301
    2017-01-06 168.69 169.92 167.52 169.53 2945536
    2017-01-09 169.47 169.80 167.62 167.65 3189891
    2017-01-10 167.98 168.09 165.34 165.52 4118694
    2017-01-11 166.05 167.76 165.60 167.75 3599464
    2017-01-12 167.77 168.01 165.56 167.95 2927973
    2017-01-13 167.97 168.48 166.88 167.34 2875433
    2017-01-17 166.69 168.18 166.12 167.89 3315655
    2017-01-18 167.45 168.59 166.69 166.80 4007779
    2017-01-19 166.96 167.45 165.80 166.81 6963386
    2017-01-20 167.81 170.64 166.00 170.55 12690029
    2017-01-23 170.08 171.25 170.01 171.03 5478224
    2017-01-24 171.36 176.00 171.16 175.90 6952369
    2017-01-25 176.26 179.25 176.13 178.29 5922462
    2017-01-26 178.06 178.88 177.51 178.66 3063924
    2017-01-27 178.47 179.20 177.30 177.30 3482311
    2017-01-30 176.98 177.07 174.58 175.80 4094755
    2017-01-31 175.05 175.58 173.61 174.52 4138962
    2017-02-01 175.00 175.70 172.89 174.29 2795745
    2017-02-02 174.23 174.97 173.55 174.58 2620526
    2017-02-03 175.00 176.34 174.89 175.82 3309288
    2017-02-06 175.31 175.98 174.38 175.86 2650785
    2017-02-07 176.00 178.62 175.90 178.46 3873455
    2017-02-08 177.50 177.50 175.88 176.17 2913126
    2017-02-09 176.17 177.80 175.91 177.21 3090710
    2017-02-10 177.37 178.87 176.76 178.68 2930710
    2017-02-13 179.24 179.90 178.84 179.36 3093372
    2017-02-14 178.57 180.13 178.35 180.13 2776528
    ... ... ... ... ... ...
    2017-07-19 150.02 150.25 146.71 147.53 14293610
    2017-07-20 147.53 148.83 147.03 147.66 7212233
    2017-07-21 147.59 147.87 146.51 147.08 6509459
    2017-07-24 147.00 147.04 145.80 145.99 3877593
    2017-07-25 146.57 147.49 146.03 146.19 4208895
    2017-07-26 146.27 146.46 144.47 145.36 3932361
    2017-07-27 145.00 145.40 143.64 145.07 6430841
    2017-07-28 144.81 145.04 143.84 144.29 3054834
    2017-07-31 NaN 144.93 NaN 144.67 4355718
    2017-08-01 145.00 145.67 144.72 145.30 3219477
    2017-08-02 145.12 145.29 144.22 144.45 3540949
    2017-08-03 144.43 145.34 144.43 144.94 3035393
    2017-08-04 145.00 145.39 144.40 145.16 2549400
    2017-08-07 145.00 145.09 142.75 143.47 4618284
    2017-08-08 142.00 142.76 142.00 142.11 3298718
    2017-08-09 141.75 142.03 141.19 141.77 2925770
    2017-08-10 141.64 142.42 141.23 141.84 4525378
    2017-08-11 142.45 142.59 141.52 141.84 3004360
    2017-08-14 142.00 142.76 141.94 142.32 2248607
    2017-08-15 142.55 143.34 140.58 142.07 4681545
    2017-08-16 141.79 142.91 141.61 142.50 3287821
    2017-08-17 142.52 142.89 140.70 140.70 4422087
    2017-08-18 140.87 141.15 139.58 139.70 4152714
    2017-08-21 139.59 140.42 139.13 140.33 3190691
    2017-08-22 140.64 141.15 140.38 141.01 2544446
    2017-08-23 140.70 143.05 140.68 142.14 4048317
    2017-08-24 142.50 143.68 142.25 142.94 3592183
    2017-08-25 143.30 144.19 143.25 143.74 2872893
    2017-08-28 144.11 144.56 142.39 142.51 5134463
    2017-08-29 142.10 143.42 141.90 143.14 3009520

    166 rows × 5 columns

    Datareader documentation

    http://pandas-datareader.readthedocs.io/en/latest/

    Working with a timeseries data frame

  • The data is organized by time, with the index serving as the timeline (see the sketch below)
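
    Because the index is a DatetimeIndex, rows can also be selected with partial date strings (a small aside, using the IBM frame from above):


    In [ ]:
    #Partial string indexing: select all trading days in February 2017
    df.loc['2017-02']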

    Creating new columns

  • Add a column to the dataframe
  • Base the elements of the new column on some combination of the existing columns
  • Example: the number of days that the stock closed higher than it opened
  • We'll create a new column with the header "UP"
  • And use np.where to decide what to put in each row
    
    In [64]:
    df['UP']=np.where(df['Close']>df['Open'],1,0)
    df
    
    
    
    
    Out[64]:
    Open High Low Close Volume UP
    Date
    2017-01-03 167.00 167.87 166.01 167.19 2934299 1
    2017-01-04 167.77 169.87 167.36 169.26 3381432 1
    2017-01-05 169.25 169.39 167.26 168.70 2682301 0
    2017-01-06 168.69 169.92 167.52 169.53 2945536 1
    2017-01-09 169.47 169.80 167.62 167.65 3189891 0
    2017-01-10 167.98 168.09 165.34 165.52 4118694 0
    2017-01-11 166.05 167.76 165.60 167.75 3599464 1
    2017-01-12 167.77 168.01 165.56 167.95 2927973 1
    2017-01-13 167.97 168.48 166.88 167.34 2875433 0
    2017-01-17 166.69 168.18 166.12 167.89 3315655 1
    2017-01-18 167.45 168.59 166.69 166.80 4007779 0
    2017-01-19 166.96 167.45 165.80 166.81 6963386 0
    2017-01-20 167.81 170.64 166.00 170.55 12690029 1
    2017-01-23 170.08 171.25 170.01 171.03 5478224 1
    2017-01-24 171.36 176.00 171.16 175.90 6952369 1
    2017-01-25 176.26 179.25 176.13 178.29 5922462 1
    2017-01-26 178.06 178.88 177.51 178.66 3063924 1
    2017-01-27 178.47 179.20 177.30 177.30 3482311 0
    2017-01-30 176.98 177.07 174.58 175.80 4094755 0
    2017-01-31 175.05 175.58 173.61 174.52 4138962 0
    2017-02-01 175.00 175.70 172.89 174.29 2795745 0
    2017-02-02 174.23 174.97 173.55 174.58 2620526 1
    2017-02-03 175.00 176.34 174.89 175.82 3309288 1
    2017-02-06 175.31 175.98 174.38 175.86 2650785 1
    2017-02-07 176.00 178.62 175.90 178.46 3873455 1
    2017-02-08 177.50 177.50 175.88 176.17 2913126 0
    2017-02-09 176.17 177.80 175.91 177.21 3090710 1
    2017-02-10 177.37 178.87 176.76 178.68 2930710 1
    2017-02-13 179.24 179.90 178.84 179.36 3093372 1
    2017-02-14 178.57 180.13 178.35 180.13 2776528 1
    ... ... ... ... ... ... ...
    2017-07-19 150.02 150.25 146.71 147.53 14293610 0
    2017-07-20 147.53 148.83 147.03 147.66 7212233 1
    2017-07-21 147.59 147.87 146.51 147.08 6509459 0
    2017-07-24 147.00 147.04 145.80 145.99 3877593 0
    2017-07-25 146.57 147.49 146.03 146.19 4208895 0
    2017-07-26 146.27 146.46 144.47 145.36 3932361 0
    2017-07-27 145.00 145.40 143.64 145.07 6430841 1
    2017-07-28 144.81 145.04 143.84 144.29 3054834 0
    2017-07-31 NaN 144.93 NaN 144.67 4355718 0
    2017-08-01 145.00 145.67 144.72 145.30 3219477 1
    2017-08-02 145.12 145.29 144.22 144.45 3540949 0
    2017-08-03 144.43 145.34 144.43 144.94 3035393 1
    2017-08-04 145.00 145.39 144.40 145.16 2549400 1
    2017-08-07 145.00 145.09 142.75 143.47 4618284 0
    2017-08-08 142.00 142.76 142.00 142.11 3298718 1
    2017-08-09 141.75 142.03 141.19 141.77 2925770 1
    2017-08-10 141.64 142.42 141.23 141.84 4525378 1
    2017-08-11 142.45 142.59 141.52 141.84 3004360 0
    2017-08-14 142.00 142.76 141.94 142.32 2248607 1
    2017-08-15 142.55 143.34 140.58 142.07 4681545 0
    2017-08-16 141.79 142.91 141.61 142.50 3287821 1
    2017-08-17 142.52 142.89 140.70 140.70 4422087 0
    2017-08-18 140.87 141.15 139.58 139.70 4152714 0
    2017-08-21 139.59 140.42 139.13 140.33 3190691 1
    2017-08-22 140.64 141.15 140.38 141.01 2544446 1
    2017-08-23 140.70 143.05 140.68 142.14 4048317 1
    2017-08-24 142.50 143.68 142.25 142.94 3592183 1
    2017-08-25 143.30 144.19 143.25 143.74 2872893 1
    2017-08-28 144.11 144.56 142.39 142.51 5134463 0
    2017-08-29 142.10 143.42 141.90 143.14 3009520 1

    166 rows × 6 columns

    Get summary statistics

  • The describe function returns a dataframe containing summary stats for all numerical columns
  • Columns containing non-numerical data are ignored
    
    In [65]:
    df.describe()
    
    
    
    
    Out[65]:
    Open High Low Close Volume UP
    count 165.000000 166.000000 165.000000 166.000000 1.660000e+02 166.000000
    mean 161.618667 162.346084 160.736848 161.486566 4.112198e+06 0.487952
    std 12.905701 13.028233 12.835382 12.995748 2.090827e+06 0.501367
    min 139.590000 140.420000 139.130000 139.700000 1.825048e+06 0.000000
    25% 152.030000 152.875000 151.600000 152.102500 3.057106e+06 0.000000
    50% 159.440000 159.470000 158.640000 158.945000 3.547950e+06 0.000000
    75% 174.300000 175.390000 173.460000 174.507500 4.406930e+06 1.000000
    max 182.000000 182.790000 180.920000 181.950000 1.928428e+07 1.000000

    Calculate the percentage of days that the stock has closed higher than its open

    
    
    In [66]:
    df['UP'].sum()/df['UP'].count()
    
    
    
    
    Out[66]:
    0.48795180722891568
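
    Since UP contains only 0s and 1s, the same fraction can be computed directly with mean (an equivalent one-liner):


    In [ ]:
    #The mean of a 0/1 indicator column is the fraction of 1s
    df['UP'].mean()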

    Calculate percent changes

  • The pct_change function computes the percent change between successive rows (time periods in timeseries data)
  • It defaults to a single time delta
  • With an argument, the time delta can be changed
    
    In [67]:
    df['Close'].pct_change() #One timeperiod percent change
    
    
    
    
    Out[67]:
    Date
    2017-01-03         NaN
    2017-01-04    0.012381
    2017-01-05   -0.003309
    2017-01-06    0.004920
    2017-01-09   -0.011089
    2017-01-10   -0.012705
    2017-01-11    0.013473
    2017-01-12    0.001192
    2017-01-13   -0.003632
    2017-01-17    0.003287
    2017-01-18   -0.006492
    2017-01-19    0.000060
    2017-01-20    0.022421
    2017-01-23    0.002814
    2017-01-24    0.028475
    2017-01-25    0.013587
    2017-01-26    0.002075
    2017-01-27   -0.007612
    2017-01-30   -0.008460
    2017-01-31   -0.007281
    2017-02-01   -0.001318
    2017-02-02    0.001664
    2017-02-03    0.007103
    2017-02-06    0.000228
    2017-02-07    0.014784
    2017-02-08   -0.012832
    2017-02-09    0.005903
    2017-02-10    0.008295
    2017-02-13    0.003806
    2017-02-14    0.004293
                    ...   
    2017-07-19   -0.042013
    2017-07-20    0.000881
    2017-07-21   -0.003928
    2017-07-24   -0.007411
    2017-07-25    0.001370
    2017-07-26   -0.005678
    2017-07-27   -0.001995
    2017-07-28   -0.005377
    2017-07-31    0.002634
    2017-08-01    0.004355
    2017-08-02   -0.005850
    2017-08-03    0.003392
    2017-08-04    0.001518
    2017-08-07   -0.011642
    2017-08-08   -0.009479
    2017-08-09   -0.002393
    2017-08-10    0.000494
    2017-08-11    0.000000
    2017-08-14    0.003384
    2017-08-15   -0.001757
    2017-08-16    0.003027
    2017-08-17   -0.012632
    2017-08-18   -0.007107
    2017-08-21    0.004510
    2017-08-22    0.004846
    2017-08-23    0.008014
    2017-08-24    0.005628
    2017-08-25    0.005597
    2017-08-28   -0.008557
    2017-08-29    0.004421
    Name: Close, dtype: float64
    
    
    In [71]:
    n=2
    df['Close'].pct_change(n) #n timeperiods percent change
    
    
    
    
    Out[71]:
    Date
    2017-01-03         NaN
    2017-01-04         NaN
    2017-01-05    0.009032
    2017-01-06    0.001595
    2017-01-09   -0.006224
    2017-01-10   -0.023654
    2017-01-11    0.000596
    2017-01-12    0.014681
    2017-01-13   -0.002444
    2017-01-17   -0.000357
    2017-01-18   -0.003227
    2017-01-19   -0.006433
    2017-01-20    0.022482
    2017-01-23    0.025298
    2017-01-24    0.031369
    2017-01-25    0.042449
    2017-01-26    0.015691
    2017-01-27   -0.005553
    2017-01-30   -0.016008
    2017-01-31   -0.015680
    2017-02-01   -0.008589
    2017-02-02    0.000344
    2017-02-03    0.008778
    2017-02-06    0.007332
    2017-02-07    0.015015
    2017-02-08    0.001763
    2017-02-09   -0.007004
    2017-02-10    0.014248
    2017-02-13    0.012132
    2017-02-14    0.008115
                    ...   
    2017-07-19   -0.035815
    2017-07-20   -0.041169
    2017-07-21   -0.003050
    2017-07-24   -0.011310
    2017-07-25   -0.006051
    2017-07-26   -0.004315
    2017-07-27   -0.007661
    2017-07-28   -0.007361
    2017-07-31   -0.002757
    2017-08-01    0.007000
    2017-08-02   -0.001521
    2017-08-03   -0.002478
    2017-08-04    0.004915
    2017-08-07   -0.010142
    2017-08-08   -0.021011
    2017-08-09   -0.011849
    2017-08-10   -0.001900
    2017-08-11    0.000494
    2017-08-14    0.003384
    2017-08-15    0.001622
    2017-08-16    0.001265
    2017-08-17   -0.009643
    2017-08-18   -0.019649
    2017-08-21   -0.002630
    2017-08-22    0.009377
    2017-08-23    0.012898
    2017-08-24    0.013687
    2017-08-25    0.011257
    2017-08-28   -0.003008
    2017-08-29   -0.004174
    Name: Close, dtype: float64

    NaN support

    Pandas reduction functions (mean, sum, etc.) skip NaNs by default

    
    
    In [72]:
    n=13
    df['Close'].pct_change(n).mean()
    
    
    
    
    Out[72]:
    -0.013587425292889614
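
    Most pandas reductions also take a skipna parameter (default True); setting it to False propagates NaNs instead. A quick illustration:


    In [ ]:
    #With skipna=False the NaNs produced by pct_change poison the result
    print(df['Close'].pct_change(n).mean())             #NaNs skipped (default)
    print(df['Close'].pct_change(n).mean(skipna=False)) #nan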

    Rolling windows

  • The rolling function extracts rolling windows
  • For example, the 21-period rolling window of the 13-period percent change
    
    In [73]:
    df['Close'].pct_change(n).rolling(21)
    
    
    
    
    Out[73]:
    Rolling [window=21,center=False,axis=0]

    Calculate something on the rolling windows

    Example: mean (the 21 day moving average of the 13 day percent change)

    
    
    In [74]:
    n=13
    df['Close'].pct_change(n).rolling(21).mean()
    
    
    
    
    Out[74]:
    Date
    2017-01-03         NaN
    2017-01-04         NaN
    2017-01-05         NaN
    2017-01-06         NaN
    2017-01-09         NaN
    2017-01-10         NaN
    2017-01-11         NaN
    2017-01-12         NaN
    2017-01-13         NaN
    2017-01-17         NaN
    2017-01-18         NaN
    2017-01-19         NaN
    2017-01-20         NaN
    2017-01-23         NaN
    2017-01-24         NaN
    2017-01-25         NaN
    2017-01-26         NaN
    2017-01-27         NaN
    2017-01-30         NaN
    2017-01-31         NaN
    2017-02-01         NaN
    2017-02-02         NaN
    2017-02-03         NaN
    2017-02-06         NaN
    2017-02-07         NaN
    2017-02-08         NaN
    2017-02-09         NaN
    2017-02-10         NaN
    2017-02-13         NaN
    2017-02-14         NaN
                    ...   
    2017-07-19   -0.000918
    2017-07-20   -0.003539
    2017-07-21   -0.006686
    2017-07-24   -0.009687
    2017-07-25   -0.012159
    2017-07-26   -0.015860
    2017-07-27   -0.019281
    2017-07-28   -0.022425
    2017-07-31   -0.024900
    2017-08-01   -0.027353
    2017-08-02   -0.030923
    2017-08-03   -0.033265
    2017-08-04   -0.035073
    2017-08-07   -0.035799
    2017-08-08   -0.037118
    2017-08-09   -0.038652
    2017-08-10   -0.039790
    2017-08-11   -0.041058
    2017-08-14   -0.041750
    2017-08-15   -0.042200
    2017-08-16   -0.042386
    2017-08-17   -0.041653
    2017-08-18   -0.041579
    2017-08-21   -0.040335
    2017-08-22   -0.039247
    2017-08-23   -0.038309
    2017-08-24   -0.036125
    2017-08-25   -0.032987
    2017-08-28   -0.029972
    2017-08-29   -0.026738
    Name: Close, dtype: float64

    Calculate several moving averages and graph them

    
    
    In [75]:
    ma_8 = df['Close'].pct_change(n).rolling(window=8).mean()
    ma_13= df['Close'].pct_change(n).rolling(window=13).mean()
    ma_21= df['Close'].pct_change(n).rolling(window=21).mean()
    ma_34= df['Close'].pct_change(n).rolling(window=34).mean()
    ma_55= df['Close'].pct_change(n).rolling(window=55).mean()
    
    
    
    In [76]:
    ma_8.plot()
    ma_34.plot()
    
    
    
    
    Out[76]:
    <matplotlib.axes._subplots.AxesSubplot at 0x113819898>

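    With several lines on one plot, labels and a legend make the chart readable. A small sketch:


    In [ ]:
    #Label each moving average so the legend identifies it
    ma_8.plot(label='8-period MA')
    ma_13.plot(label='13-period MA')
    ma_34.plot(label='34-period MA')
    plt.legend()
    plt.show()
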
    Linear regression with pandas

    Example: TAN is the ticker of a solar ETF. FSLR, RGSE, and SCTY are tickers of companies that build or lease solar panels; each has a different business model. We'll use pandas to study the risk-reward tradeoff among the four investments and see how correlated they are

    
    
    In [77]:
    import datetime
    import pandas_datareader as data
    start = datetime.datetime(2015,7,1)
    end = datetime.datetime(2016,6,1)
    solar_df = data.DataReader(['FSLR', 'TAN','RGSE','SCTY'],'google', start=start,end=end)['Close']
    
    
    
    In [78]:
    solar_df
    
    
    
    
    Out[78]:
    FSLR RGSE SCTY TAN
    Date
    2015-07-01 46.04 1128.00 52.40 38.84
    2015-07-02 45.17 1200.00 52.27 38.55
    2015-07-06 44.19 1008.00 51.75 36.37
    2015-07-07 45.12 984.00 53.21 36.10
    2015-07-08 43.27 852.00 51.48 33.67
    2015-07-09 43.65 876.00 51.98 35.23
    2015-07-10 44.03 942.00 53.00 36.49
    2015-07-13 46.01 978.00 53.39 37.37
    2015-07-14 45.81 906.00 54.26 37.86
    2015-07-15 44.49 888.00 52.54 37.06
    2015-07-16 44.76 894.00 52.80 37.51
    2015-07-17 43.95 900.00 52.15 37.48
    2015-07-20 43.28 852.00 55.77 37.66
    2015-07-21 43.48 894.00 56.20 37.06
    2015-07-22 42.50 924.00 55.36 36.62
    2015-07-23 42.74 900.00 53.67 35.88
    2015-07-24 42.97 942.00 54.62 35.47
    2015-07-27 42.27 966.00 54.69 34.58
    2015-07-28 43.38 990.00 56.03 35.27
    2015-07-29 45.54 990.00 58.05 36.24
    2015-07-30 45.31 996.00 59.58 36.05
    2015-07-31 44.30 1038.00 58.00 35.32
    2015-08-03 43.69 1116.00 57.92 34.22
    2015-08-04 44.50 1272.00 59.00 33.93
    2015-08-05 51.92 1218.00 60.17 35.59
    2015-08-06 51.20 1254.00 55.34 33.91
    2015-08-07 52.08 1188.00 52.25 33.29
    2015-08-10 53.48 1068.00 51.17 34.16
    2015-08-11 51.44 846.00 48.79 32.78
    2015-08-12 51.88 839.04 50.86 33.10
    ... ... ... ... ...
    2016-04-20 61.72 324.00 32.65 24.16
    2016-04-21 61.94 300.00 33.87 24.21
    2016-04-22 61.54 312.00 33.34 24.33
    2016-04-25 60.39 306.60 31.18 23.94
    2016-04-26 61.24 317.40 33.00 24.33
    2016-04-27 62.03 306.00 33.31 24.79
    2016-04-28 57.02 306.06 32.24 24.18
    2016-04-29 55.84 303.00 30.32 23.82
    2016-05-02 54.85 300.06 29.55 23.53
    2016-05-03 52.85 271.20 26.45 22.74
    2016-05-04 51.78 276.00 23.65 22.47
    2016-05-05 51.96 279.00 21.67 22.09
    2016-05-06 51.80 279.00 21.84 22.13
    2016-05-09 50.60 277.08 22.51 21.64
    2016-05-10 51.27 276.00 17.82 21.40
    2016-05-11 50.74 294.00 18.69 21.62
    2016-05-12 49.09 270.00 18.92 21.17
    2016-05-13 48.56 190.20 19.60 20.98
    2016-05-16 49.07 210.00 20.06 21.28
    2016-05-17 48.51 216.00 20.50 21.11
    2016-05-18 48.71 210.00 20.00 21.20
    2016-05-19 47.09 216.00 20.37 20.93
    2016-05-20 48.02 192.00 21.94 21.49
    2016-05-23 49.04 198.00 24.05 21.82
    2016-05-24 50.46 204.00 23.74 22.06
    2016-05-25 50.56 202.80 24.07 22.09
    2016-05-26 49.73 186.00 23.73 21.97
    2016-05-27 49.90 198.00 22.26 21.96
    2016-05-31 49.65 180.00 22.39 22.01
    2016-06-01 50.07 177.00 21.41 22.10

    232 rows × 4 columns

    Let's calculate returns (the 1-day percent change)

    
    
    In [79]:
    rets = solar_df.pct_change()
    print(rets)
    
    
    
    
                    FSLR      RGSE      SCTY       TAN
    Date                                              
    2015-07-01       NaN       NaN       NaN       NaN
    2015-07-02 -0.018897  0.063830 -0.002481 -0.007467
    2015-07-06 -0.021696 -0.160000 -0.009948 -0.056550
    2015-07-07  0.021045 -0.023810  0.028213 -0.007424
    2015-07-08 -0.041002 -0.134146 -0.032513 -0.067313
    2015-07-09  0.008782  0.028169  0.009713  0.046332
    2015-07-10  0.008706  0.075342  0.019623  0.035765
    2015-07-13  0.044969  0.038217  0.007358  0.024116
    2015-07-14 -0.004347 -0.073620  0.016295  0.013112
    2015-07-15 -0.028815 -0.019868 -0.031699 -0.021130
    2015-07-16  0.006069  0.006757  0.004949  0.012142
    2015-07-17 -0.018097  0.006711 -0.012311 -0.000800
    2015-07-20 -0.015245 -0.053333  0.069415  0.004803
    2015-07-21  0.004621  0.049296  0.007710 -0.015932
    2015-07-22 -0.022539  0.033557 -0.014947 -0.011873
    2015-07-23  0.005647 -0.025974 -0.030527 -0.020208
    2015-07-24  0.005381  0.046667  0.017701 -0.011427
    2015-07-27 -0.016290  0.025478  0.001282 -0.025092
    2015-07-28  0.026260  0.024845  0.024502  0.019954
    2015-07-29  0.049793  0.000000  0.036052  0.027502
    2015-07-30 -0.005051  0.006061  0.026357 -0.005243
    2015-07-31 -0.022291  0.042169 -0.026519 -0.020250
    2015-08-03 -0.013770  0.075145 -0.001379 -0.031144
    2015-08-04  0.018540  0.139785  0.018646 -0.008475
    2015-08-05  0.166742 -0.042453  0.019831  0.048924
    2015-08-06 -0.013867  0.029557 -0.080273 -0.047204
    2015-08-07  0.017187 -0.052632 -0.055837 -0.018284
    2015-08-10  0.026882 -0.101010 -0.020670  0.026134
    2015-08-11 -0.038145 -0.207865 -0.046512 -0.040398
    2015-08-12  0.008554 -0.008227  0.042427  0.009762
    ...              ...       ...       ...       ...
    2016-04-20  0.019828  0.038262  0.078269  0.018979
    2016-04-21  0.003564 -0.074074  0.037366  0.002070
    2016-04-22 -0.006458  0.040000 -0.015648  0.004957
    2016-04-25 -0.018687 -0.017308 -0.064787 -0.016030
    2016-04-26  0.014075  0.035225  0.058371  0.016291
    2016-04-27  0.012900 -0.035917  0.009394  0.018907
    2016-04-28 -0.080767  0.000196 -0.032122 -0.024607
    2016-04-29 -0.020694 -0.009998 -0.059553 -0.014888
    2016-05-02 -0.017729 -0.009703 -0.025396 -0.012175
    2016-05-03 -0.036463 -0.096181 -0.104907 -0.033574
    2016-05-04 -0.020246  0.017699 -0.105860 -0.011873
    2016-05-05  0.003476  0.010870 -0.083721 -0.016911
    2016-05-06 -0.003079  0.000000  0.007845  0.001811
    2016-05-09 -0.023166 -0.006882  0.030678 -0.022142
    2016-05-10  0.013241 -0.003898 -0.208352 -0.011091
    2016-05-11 -0.010337  0.065217  0.048822  0.010280
    2016-05-12 -0.032519 -0.081633  0.012306 -0.020814
    2016-05-13 -0.010796 -0.295556  0.035941 -0.008975
    2016-05-16  0.010502  0.104101  0.023469  0.014299
    2016-05-17 -0.011412  0.028571  0.021934 -0.007989
    2016-05-18  0.004123 -0.027778 -0.024390  0.004263
    2016-05-19 -0.033258  0.028571  0.018500 -0.012736
    2016-05-20  0.019749 -0.111111  0.077074  0.026756
    2016-05-23  0.021241  0.031250  0.096171  0.015356
    2016-05-24  0.028956  0.030303 -0.012890  0.010999
    2016-05-25  0.001982 -0.005882  0.013901  0.001360
    2016-05-26 -0.016416 -0.082840 -0.014125 -0.005432
    2016-05-27  0.003418  0.064516 -0.061947 -0.000455
    2016-05-31 -0.005010 -0.090909  0.005840  0.002277
    2016-06-01  0.008459 -0.016667 -0.043770  0.004089
    
    [232 rows x 4 columns]
    

    Let's visualize the relationship between each stock and the ETF

    
    
    In [80]:
    import matplotlib.pyplot as plt
    plt.scatter(rets.FSLR,rets.TAN)
    
    
    
    
    Out[80]:
    <matplotlib.collections.PathCollection at 0x113864240>
    
    
    In [81]:
    plt.scatter(rets.RGSE,rets.TAN)
    
    
    
    
    Out[81]:
    <matplotlib.collections.PathCollection at 0x112f29eb8>
    
    
    In [82]:
    plt.scatter(rets.SCTY,rets.TAN)
    
    
    
    
    Out[82]:
    <matplotlib.collections.PathCollection at 0x11317e518>

    The correlation matrix

    
    
    In [83]:
    solar_corr = rets.corr()
    print(solar_corr)
    
    
    
    
              FSLR      RGSE      SCTY       TAN
    FSLR  1.000000  0.249923  0.272612  0.670114
    RGSE  0.249923  1.000000  0.236604  0.389566
    SCTY  0.272612  0.236604  1.000000  0.559854
    TAN   0.670114  0.389566  0.559854  1.000000
    
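    A heatmap is an easy way to eyeball the correlation matrix. A minimal sketch with matplotlib:


    In [ ]:
    #Visualize the correlation matrix; brighter cells mean higher correlation
    plt.matshow(solar_corr)
    plt.xticks(range(len(solar_corr.columns)), solar_corr.columns)
    plt.yticks(range(len(solar_corr.columns)), solar_corr.columns)
    plt.colorbar()
    plt.show()
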

    Basic risk analysis

    We'll plot the mean and standard deviation of returns for each ticker to get a sense of the risk-return profile

    
    
    In [84]:
    plt.scatter(rets.mean(), rets.std())
    plt.xlabel('Expected returns')
    plt.ylabel('Standard deviations')
    for label, x, y in zip(rets.columns, rets.mean(), rets.std()):
        plt.annotate(
            label, 
            xy = (x, y), xytext = (20, -20),
            textcoords = 'offset points', ha = 'right', va = 'bottom',
            bbox = dict(boxstyle = 'round,pad=0.5', fc = 'yellow', alpha = 0.5),
            arrowprops = dict(arrowstyle = '->', connectionstyle = 'arc3,rad=0'))
    plt.show()
    
    
    
    

    Steps for regression

  • Construct y (the dependent variable series)
  • Construct a matrix (dataframe) X of independent variable series
  • Add an intercept (statsmodels' OLS does not add one automatically)
  • Fit the regression model
  • Get the results
  • The statsmodels library contains various regression packages; we'll use the OLS (Ordinary Least Squares) model

    
    
    In [85]:
    import numpy as np
    import statsmodels.api as sm
    X=solar_df[['FSLR','RGSE','SCTY']]
    X = sm.add_constant(X)
    y=solar_df['TAN']
    model = sm.OLS(y,X,missing='drop')
    result = model.fit()
    print(result.summary())
    
    
    
    
                                OLS Regression Results                            
    ==============================================================================
    Dep. Variable:                    TAN   R-squared:                       0.851
    Model:                            OLS   Adj. R-squared:                  0.849
    Method:                 Least Squares   F-statistic:                     435.1
    Date:                Thu, 31 Aug 2017   Prob (F-statistic):           4.89e-94
    Time:                        17:59:54   Log-Likelihood:                -464.33
    No. Observations:                 232   AIC:                             936.7
    Df Residuals:                     228   BIC:                             950.4
    Df Model:                           3                                         
    Covariance Type:            nonrobust                                         
    ==============================================================================
                     coef    std err          t      P>|t|      [95.0% Conf. Int.]
    ------------------------------------------------------------------------------
    const         15.2915      1.180     12.956      0.000        12.966    17.617
    FSLR           0.0087      0.017      0.521      0.603        -0.024     0.041
    RGSE           0.0073      0.001      9.684      0.000         0.006     0.009
    SCTY           0.2156      0.015     14.177      0.000         0.186     0.246
    ==============================================================================
    Omnibus:                        4.153   Durbin-Watson:                   0.109
    Prob(Omnibus):                  0.125   Jarque-Bera (JB):                3.754
    Skew:                           0.239   Prob(JB):                        0.153
    Kurtosis:                       2.600   Cond. No.                     5.88e+03
    ==============================================================================
    
    Warnings:
    [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
    [2] The condition number is large, 5.88e+03. This might indicate that there are
    strong multicollinearity or other numerical problems.
    

    Finally, plot the fitted values against the actual y values

    
    
    In [86]:
    fig, ax = plt.subplots(figsize=(8,6))
    ax.plot(y)
    ax.plot(result.fittedvalues)
    
    
    
    
    Out[86]:
    [<matplotlib.lines.Line2D at 0x115f92978>]
    
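    A quick way to inspect the fit is to plot the residuals over time (a hedged sketch; result.resid holds actual minus fitted values):


    In [ ]:
    #Plot the regression residuals over time
    result.resid.plot()
    plt.show()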
    