In [1]:
!pip install .


Processing /Users/Cecilia/Desktop/STA 663/project/STA-663-Final-Project/fastfsr
  Requirement already satisfied (use --upgrade to upgrade): fastfsr==0.1 from file:///Users/Cecilia/Desktop/STA%20663/project/STA-663-Final-Project/fastfsr in /Users/Cecilia/anaconda/lib/python3.5/site-packages

In [2]:
import fastfsr

In [3]:
%matplotlib inline
import seaborn as sns
import matplotlib.pyplot as plt

import pandas as pd
import numpy as np

In the original paper, the authors used the R package leaps for best subset selection. Since no corresponding package or function exists in Python, we also implemented a regression subset selection.

The function reg_subset takes X and Y and returns the best subset selection as a tuple of four arrays:

  • The first element holds the column indices of the returned variables
  • The second element holds the names of those variables
  • The third element holds the RSS values
  • The fourth element holds the corresponding p-values
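The text does not say how the selection is computed. The sketch below is one way to produce output of the same shape (a single ordering of variables plus one RSS per model size, starting from the intercept-only model), assuming a greedy forward-selection pass. The function name forward_select and the synthetic data are hypothetical and are not part of fastfsr; p-values are omitted to keep the example dependent on NumPy only.

```python
import numpy as np

def forward_select(X, y):
    """Hypothetical helper: greedy forward selection on a least-squares fit.

    Returns the order in which columns enter the model and the RSS of each
    nested model, starting with the intercept-only model.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    intercept = np.ones((n, 1))
    selected = []
    remaining = list(range(p))
    # RSS of the intercept-only model
    rss_path = [float(np.sum((y - y.mean()) ** 2))]
    while remaining:
        best_j, best_rss = None, np.inf
        for j in remaining:
            # Fit the current model plus candidate column j
            design = np.hstack([intercept, X[:, selected + [j]]])
            coef, *_ = np.linalg.lstsq(design, y, rcond=None)
            rss = float(np.sum((y - design @ coef) ** 2))
            if rss < best_rss:
                best_j, best_rss = j, rss
        # Keep the candidate that reduced the RSS the most
        selected.append(best_j)
        remaining.remove(best_j)
        rss_path.append(best_rss)
    return np.array(selected), np.array(rss_path)

# Synthetic data (an assumption for illustration): column 2 matters most, then column 0
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = 3.0 * X_demo[:, 2] + 1.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=200)
order, rss_path = forward_select(X_demo, y_demo)
```

With p predictors, rss_path has p + 1 entries because it includes the intercept-only model. This matches the output further down, where the RSS array is one entry longer than the arrays of indices and p-values.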

In [4]:
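# Load the NCAA data set; fields are separated by whitespace
ncaa = pd.read_csv("http://www4.stat.ncsu.edu/~boos/var.select/ncaa.data2.txt",
                   sep=r"\s+")
# All columns except the last are predictors; the last column is the response.
# .iloc replaces the .ix indexer, which was removed from pandas.
x = ncaa.iloc[:, :-1]
y = ncaa.iloc[:, -1]
x.head()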


Out[4]:
   x1  x2  x3  x4    x5  x6        x7     x8     x9    x10   x11  x12  x13   x14     x15  x16  x17  x18  x19
0  13  17   9  15  28.0   0  -1.14045  3.660  4.490   3409  65.8   18   81  42.2  660000   77  100   59    1
1  28  20  32  18  18.4  18  -0.13719  2.594  3.610   7258  66.3   17   82  40.5  150555   88   94   41   25
2  32  20  20  20  34.8  18   1.55358  2.060  4.930   6405  75.0   19   71  46.5  415400   94   81   25   36
3  32  21  24  21  14.5  20   2.05712  2.887  3.876  18294  66.0   16   84  42.2  211000   93   88   26   13
4  24  20  16  20  21.8  13  -0.77082  2.565  4.960   8259  63.5   16   91  41.2   44000   90   92   32   31

In [5]:
fastfsr.reg_subset(x, y)


Out[5]:
(array([ 1,  2,  4,  3,  6, 16, 14,  5,  8,  7, 11,  9, 12, 17, 10,  0, 13,
        18, 15]),
 array(['x2', 'x3', 'x5', 'x4', 'x7', 'x17', 'x15', 'x6', 'x9', 'x8', 'x12',
        'x10', 'x13', 'x18', 'x11', 'x1', 'x14', 'x19', 'x16'], dtype=object),
 array([ 24148.87276596,   7077.86138645,   5942.5387484 ,   5534.10114754,
          5068.71560461,   4565.33245127,   4354.85020305,   4167.86982143,
          4040.70513154,   3897.53072484,   3725.51704978,   3661.59293365,
          3610.27305639,   3564.96214367,   3510.07074574,   3489.07672866,
          3478.67153268,   3472.10439175,   3470.66621376,   3469.96781547]),
 array([  1.11022302e-16,   6.95109478e-05,   1.15845889e-02,
          5.29907450e-03,   2.48342482e-03,   4.33151925e-02,
          5.27341418e-02,   1.05631519e-01,   8.26270613e-02,
          5.36356789e-02,   2.34958569e-01,   2.86440303e-01,
          3.16318194e-01,   2.69726331e-01,   4.95325787e-01,
          6.32648734e-01,   7.05641795e-01,   8.60540370e-01,
          9.03197593e-01]))
