In [1]:

    
!pip install .

Processing /Users/Cecilia/Desktop/STA 663/project/STA-663-Final-Project/fastfsr
  Requirement already satisfied (use --upgrade to upgrade): fastfsr==0.1 from file:///Users/Cecilia/Desktop/STA%20663/project/STA-663-Final-Project/fastfsr in /Users/Cecilia/anaconda/lib/python3.5/site-packages



In [2]:

    
import fastfsr



In [3]:

    
%matplotlib inline
import seaborn as sns
import matplotlib.pyplot as plt

import pandas as pd
import numpy as np

In the original paper, the authors used the R package leaps for best subset selection. No corresponding package or function is available in Python, so we also implemented regression subset selection ourselves.

The function takes X and Y and returns the selected subsets as a tuple of four arrays.

The first element contains the column indices of the selected variables
The second element contains the corresponding variable names
The third element contains the RSS (residual sum of squares) values
The fourth element contains the corresponding p-values
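
To make the shape of the return value concrete, here is a minimal sketch of a greedy forward-selection path computed with NumPy. It is not the fastfsr implementation: the name `forward_select_sketch` is hypothetical, the choice of forward selection is an assumption based on the ordered output shown in Out[5], and p-values are omitted because they would need a reference distribution (for example from `scipy.stats`). The sketch also assumes the RSS sequence starts with the intercept-only model, which would explain why Out[5] lists one more RSS value than variables.

```python
import numpy as np

def forward_select_sketch(X, y, names=None):
    """Greedy forward selection by RSS (hypothetical helper, not fastfsr's code)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    if names is None:
        names = ["x%d" % (j + 1) for j in range(p)]

    def rss_of(cols):
        # Least-squares fit with an intercept plus the chosen columns.
        A = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
        coef = np.linalg.lstsq(A, y, rcond=None)[0]
        resid = y - A @ coef
        return float(resid @ resid)

    selected = []
    remaining = list(range(p))
    rss_path = [rss_of([])]  # assumed starting point: intercept-only model
    while remaining:
        # Add the variable whose inclusion gives the smallest RSS.
        best_rss, best_j = min((rss_of(selected + [j]), j) for j in remaining)
        selected.append(best_j)
        remaining.remove(best_j)
        rss_path.append(best_rss)

    order = np.array(selected)
    return order, np.array(names, dtype=object)[order], np.array(rss_path)

# Synthetic demo data: column 2 carries the strongest signal, column 0 the next.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = 3.0 * X_demo[:, 2] + 1.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=200)
order, picked, rss = forward_select_sketch(X_demo, y_demo)
print(order, picked, rss)
```

Because each model on the path contains the previous one, the RSS sequence can only decrease or stay flat, which is the pattern visible in the third array of Out[5].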



In [4]:

    
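# Read the whitespace-delimited NCAA data set used in the original paper
ncaa = pd.read_csv("http://www4.stat.ncsu.edu/~boos/var.select/ncaa.data2.txt",
                   sep=r"\s+")
# .ix was removed in pandas 1.0; .iloc selects by integer position
x = ncaa.iloc[:, :-1]  # every column except the last is a predictor
y = ncaa.iloc[:, -1]   # the last column is the response
x.head()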



In [5]:

    
fastfsr.reg_subset(x, y)

    Out[5]:





(array([ 1,  2,  4,  3,  6, 16, 14,  5,  8,  7, 11,  9, 12, 17, 10,  0, 13,
        18, 15]),
 array(['x2', 'x3', 'x5', 'x4', 'x7', 'x17', 'x15', 'x6', 'x9', 'x8', 'x12',
        'x10', 'x13', 'x18', 'x11', 'x1', 'x14', 'x19', 'x16'], dtype=object),
 array([ 24148.87276596,   7077.86138645,   5942.5387484 ,   5534.10114754,
          5068.71560461,   4565.33245127,   4354.85020305,   4167.86982143,
          4040.70513154,   3897.53072484,   3725.51704978,   3661.59293365,
          3610.27305639,   3564.96214367,   3510.07074574,   3489.07672866,
          3478.67153268,   3472.10439175,   3470.66621376,   3469.96781547]),
 array([  1.11022302e-16,   6.95109478e-05,   1.15845889e-02,
          5.29907450e-03,   2.48342482e-03,   4.33151925e-02,
          5.27341418e-02,   1.05631519e-01,   8.26270613e-02,
          5.36356789e-02,   2.34958569e-01,   2.86440303e-01,
          3.16318194e-01,   2.69726331e-01,   4.95325787e-01,
          6.32648734e-01,   7.05641795e-01,   8.60540370e-01,
          9.03197593e-01]))
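
The returned tuple is easier to read one step at a time. The sketch below pairs each added variable with the RSS after it enters and the resulting drop in RSS. The helper `summarize_path` is hypothetical, and the alignment is an assumption: `rss[0]` is taken to be the model before any variable is added, and `pvals[k]` is taken to belong to the k-th added variable. The numbers are the first three steps copied from Out[5], rounded.

```python
import numpy as np

def summarize_path(names, rss, pvals):
    """Hypothetical helper: one record per step of the selection path."""
    rss = np.asarray(rss, dtype=float)
    drops = rss[:-1] - rss[1:]  # RSS reduction contributed by each added variable
    return [
        {
            "step": k + 1,
            "added": str(names[k]),
            "rss": float(rss[k + 1]),
            "rss_drop": float(drops[k]),
            "p_value": float(pvals[k]),
        }
        for k in range(len(names))
    ]

# First three steps from Out[5] (rounded); rss has one more entry than names.
names = ["x2", "x3", "x5"]
rss = [24148.8728, 7077.8614, 5942.5387, 5534.1011]
pvals = [1.11e-16, 6.95e-05, 1.16e-02]

rows = summarize_path(names, rss, pvals)
for row in rows:
    print(row)
```

With the full output, the same call would be `summarize_path(*fastfsr.reg_subset(x, y)[1:])`, again assuming the alignment described above.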



First five rows of x (output of x.head()):

   x1  x2  x3  x4    x5  x6        x7     x8     x9    x10   x11  x12  x13   x14     x15  x16  x17  x18  x19
0  13  17   9  15  28.0   0  -1.14045  3.660  4.490   3409  65.8   18   81  42.2  660000   77  100   59    1
1  28  20  32  18  18.4  18  -0.13719  2.594  3.610   7258  66.3   17   82  40.5  150555   88   94   41   25
2  32  20  20  20  34.8  18   1.55358  2.060  4.930   6405  75.0   19   71  46.5  415400   94   81   25   36
3  32  21  24  21  14.5  20   2.05712  2.887  3.876  18294  66.0   16   84  42.2  211000   93   88   26   13
4  24  20  16  20  21.8  13  -0.77082  2.565  4.960   8259  63.5   16   91  41.2   44000   90   92   32   31