Exploring the files with Pandas

Many statistical Python packages can deal with NumPy arrays.

NumPy arrays, however, are not always easy to use.

Pandas is a package that provides a DataFrame interface, similar to the main data structure in R. Since Pandas has become so popular, many packages accept both pd.DataFrames and NumPy arrays.
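To see the difference in practice, here is a minimal sketch (with made-up data) of the same two-column dataset as a bare NumPy array and as a labeled DataFrame:

```python
import numpy as np
import pandas as pd

# A plain numpy array: rows and columns are addressed by position only.
arr = np.array([[1.5, 2.0], [3.5, 4.0]])

# The same data as a DataFrame: columns get names, rows get an index,
# and many packages accept either interchangeably.
df = pd.DataFrame(arr, columns=['height', 'weight'])

print(df['height'].mean())   # column access by name
print(df.values.shape)       # the underlying numpy array is still there
```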


In [5]:
import os
from dotenv import load_dotenv, find_dotenv

# find .env automagically by walking up directories until it's found
dotenv_path = find_dotenv()

# load up the entries as environment variables
load_dotenv(dotenv_path)


Out[5]:
True

First some environment variables

We now use the files that are stored in the RAW directory.

If we decide to change the data format by changing names, adding features, creating summary data frames, etc., we will save those files in the INTERIM directory.


In [6]:
PROJECT_DIR = os.path.dirname(dotenv_path)
RAW_DATA_DIR = PROJECT_DIR + os.environ.get("RAW_DATA_DIR")
INTERIM_DATA_DIR = PROJECT_DIR + os.environ.get("INTERIM_DATA_DIR")
files=os.environ.get("FILES").split()

print("Project directory is  : {0}".format(PROJECT_DIR))
print("Raw data directory is : {0}".format(RAW_DATA_DIR))
print("Interim directory is  : {0}".format(INTERIM_DATA_DIR))


Project directory is  : /home/gsentveld/lunch_and_learn
Raw data directory is : /home/gsentveld/lunch_and_learn/data/raw
Interim directory is  : /home/gsentveld/lunch_and_learn/data/interim
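Note that plain string concatenation only works here because the env entries happen to start with a separator. A slightly more robust sketch (with hypothetical stand-in values for the .env entries) uses os.path.join, which inserts separators as needed:

```python
import os

# Hypothetical values standing in for the .env entries used above.
project_dir = "/home/user/project"
raw_subdir = "data/raw"

# os.path.join inserts separators as needed, so the env entry does not
# have to begin with a slash.
raw_data_dir = os.path.join(project_dir, raw_subdir)
print(raw_data_dir)
```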

Importing pandas and matplotlib.pyplot


In [7]:
# The following jupyter notebook magic makes the plots appear in the notebook. 
# If you run in batch mode, you have to save your plots as images.
%matplotlib notebook

# matplotlib.pyplot is traditionally imported as plt
import matplotlib.pyplot as plt

# Pandas is traditionally imported as pd.
import pandas as pd
from pylab import rcParams

# some display options to size the figures. feel free to experiment
pd.set_option('display.max_columns', 25)
rcParams['figure.figsize'] = (17, 7)

Reading a file in Pandas

Reading a CSV file is really easy in Pandas, and CSV is only one of several formats Pandas can read and write:

Format Type  Data Description  Reader          Writer
text         CSV               read_csv        to_csv
text         JSON              read_json       to_json
text         HTML              read_html       to_html
text         Local clipboard   read_clipboard  to_clipboard
binary       MS Excel          read_excel      to_excel
binary       HDF5 Format       read_hdf        to_hdf
binary       Feather Format    read_feather    to_feather
binary       Msgpack           read_msgpack    to_msgpack
binary       Stata             read_stata      to_stata
binary       SAS               read_sas        (no writer)
binary       Python Pickle     read_pickle     to_pickle
SQL          SQL               read_sql        to_sql
SQL          Google Big Query  read_gbq        to_gbq

We will use pd.read_csv().

As you will see, the Jupyter notebook prints out a very nice rendition of the DataFrame object that is the result.


In [8]:
samchild=pd.read_csv(RAW_DATA_DIR+'/samchild.csv')

In [9]:
samchild.columns.values.tolist()


Out[9]:
['WTIA_SC',
 'WTFA_SC',
 'RECTYPE',
 'SRVY_YR',
 'HHX',
 'INTV_QRT',
 'INTV_MON',
 'FMX',
 'FPX',
 'REGION',
 'STRAT_P',
 'PSU_P',
 'SEX',
 'HISPAN_I',
 'RACERPI2',
 'MRACRPI2',
 'MRACBPI2',
 'AGE_P',
 'FDRN_FLG',
 'CUSUALPL',
 'CPLKIND',
 'CHCPLROU',
 'CHCPLKND',
 'CHCCHGYR',
 'CHCCHGHI',
 'CPRVTRYR',
 'CPRVTRFD',
 'CDRNANP',
 'CDRNAI',
 'CHCDLYR1',
 'CHCDLYR2',
 'CHCDLYR3',
 'CHCDLYR4',
 'CHCDLYR5',
 'CHCAFYR',
 'CHCAFYRN',
 'CHCAFYRF',
 'CHCAFYR1',
 'CHCAFYR2',
 'CHCAFYR3',
 'CHCAFYR4',
 'CHCAFYR5',
 'CHCAFYR6',
 'CDNLONGR',
 'CHCSYR11',
 'CHCSYR12',
 'CHCSYR13',
 'CHCSYR14',
 'CHCSYR1',
 'CHCSYR2',
 'CHCSYR3',
 'CHCSYR4',
 'CHCSYR5',
 'CHCSYR6',
 'CHCSYR7',
 'CHCSYR81',
 'CHCSYR82',
 'CHCSYR10',
 'CHCSYREM',
 'CHPXYR_C',
 'CHERNOY2',
 'CERVISND',
 'CERHOS',
 'CERREA1R',
 'CERREA2R',
 'CERREA3R',
 'CERREA4R',
 'CERREA5R',
 'CERREA6R',
 'CERREA7R',
 'CERREA8R',
 'CHCHYR',
 'CHCHMOYR',
 'CHCHNOY2',
 'CHCNOYR2',
 'CSRGYR',
 'RSRGNOYR',
 'CMDLONGR',
 'CSHFLU12',
 'CSHFLUNM',
 'CSHFLUM1',
 'CSHFLUY1',
 'CSHSPFL1',
 'CSHFLUM2',
 'CSHFLUY2',
 'CSHSPFL2',
 'TOTOZ_P',
 'BWTGRM_P',
 'CHGHT_TC',
 'CWGHT_TC',
 'BMI_SC',
 'AMR1R',
 'AODD1',
 'ADD2',
 'AMR2R',
 'AUTISM',
 'AODD2',
 'CCONDRR1',
 'CCONDRR2',
 'CCONDRR3',
 'CCONDRR4',
 'CCONDRR5',
 'CCONDRR6',
 'CCONDRR7',
 'CCONDRR8',
 'CCONDRR9',
 'CPOX',
 'CPOX12MO',
 'CASHMEV',
 'CASSTILL',
 'CASHYR',
 'CASERYR1',
 'HAYF1',
 'RALLG1',
 'DALLG1',
 'SALLG1',
 'DIARH1',
 'ANEMIA1',
 'EARINF1',
 'SEIZE1',
 'HAYF2',
 'RALLG2',
 'DALLG2',
 'SALLG2',
 'DIARH2',
 'ANEMIA2',
 'FHEAD',
 'EARINF2',
 'SEIZE2',
 'STUTTER',
 'CHSTATYR',
 'SCHDAYR1',
 'CCOLD2W',
 'CINTIL2W',
 'CHEARST1',
 'CVISION',
 'CBLIND',
 'IHSPEQ',
 'IHMOB',
 'IHMOBYR',
 'PROBRX',
 'LEARND',
 'MHIBOY2',
 'CMHAGM15',
 'MHIGRL2',
 'CMHAGF15',
 'CSRESPNO',
 'CSRELTVP',
 'LATEINTC',
 'RSCL2_C2',
 'RSCL2_E2',
 'RSCL3_E3',
 'RSCL5_P5',
 'RSCL5_H5',
 'RSCL6']

In [6]:
# Note: the cells below switch to the sample adult file, which must be
# loaded first (the file name is assumed to follow the samchild pattern).
samadult=pd.read_csv(RAW_DATA_DIR+'/samadult.csv')
features=[x for x in samadult.columns.values.tolist() if x.startswith('ALDURA')]

In [7]:
import numpy as np

np.sum(samadult.WKDAYR.notnull() & (samadult['WKDAYR']<900))


Out[7]:
21385

In [8]:
features=[
 'ALDURA3',
 'ALDURA4',
 'ALDURA6',
 'ALDURA7',
 'ALDURA8',
 'ALDURA11',
 'ALDURA17',
 'ALDURA20',
 'ALDURA21',
 'ALDURA22',
 'ALDURA23',
 'ALDURA24',
 'ALDURA27',
 'ALDURA28',
 'ALDURA29',
 'ALDURA33']

In [9]:
target='WKDAYR'

In [10]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

reg=LinearRegression()

data=samadult[samadult.WKDAYR.notnull() & (samadult.WKDAYR<900)]

X=data[features]
X.shape


Out[10]:
(21385, 16)

In [11]:
# Shift all durations up by 1, so that the 0 used below for missing
# values stays distinct from a reported duration of 0.
X=X+1
X.shape


Out[11]:
(21385, 16)

In [12]:
X=X.fillna(0)
X.shape


Out[12]:
(21385, 16)

In [13]:
#X=np.log1p(X)
#X.shape

In [14]:
y=np.log1p(data[target])

In [15]:
indicator_names=[x+"_I" for x in features]

In [16]:
indicators=pd.DataFrame()
for feature in features:
    indicators[feature+"_I"]=data[feature].notnull().astype(int)
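The loop above can also be written without iterating over columns: calling notnull() on the whole frame vectorizes it, and add_suffix builds the "_I" names in one step. A small sketch with stand-in data (the real code would use data[features]):

```python
import numpy as np
import pandas as pd

# Small stand-in frame; the real code uses data[features] from samadult.
data_demo = pd.DataFrame({'ALDURA3': [np.nan, 5.0, np.nan],
                          'ALDURA4': [2.0, np.nan, np.nan]})

# notnull() on the whole frame replaces the per-column loop,
# and add_suffix appends "_I" to every column name at once.
indicators_demo = data_demo.notnull().astype(int).add_suffix('_I')
print(indicators_demo)
```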

In [17]:
indicators


Out[17]:
ALDURA3_I ALDURA4_I ALDURA6_I ALDURA7_I ALDURA8_I ALDURA11_I ALDURA17_I ALDURA20_I ALDURA21_I ALDURA22_I ALDURA23_I ALDURA24_I ALDURA27_I ALDURA28_I ALDURA29_I ALDURA33_I
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
8 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
10 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0
12 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
13 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
15 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
17 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
18 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
19 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
22 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
26 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
28 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
32 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
34 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
36 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
37 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
38 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
40 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
41 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
42 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
43 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0
47 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
48 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
33619 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
33620 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
33622 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
33625 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
33626 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
33627 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
33628 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
33629 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
33630 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
33631 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
33632 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
33634 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
33635 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
33638 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
33639 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
33646 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
33647 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
33649 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
33650 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
33651 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
33652 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
33653 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
33654 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
33660 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
33661 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
33664 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
33666 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
33668 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
33670 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
33671 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

21385 rows × 16 columns


In [18]:
# join_axes was removed in pandas 1.0; reindex to X.index instead.
X=pd.concat([X, indicators], axis=1).reindex(X.index)

In [19]:
X


Out[19]:
ALDURA3 ALDURA4 ALDURA6 ALDURA7 ALDURA8 ALDURA11 ALDURA17 ALDURA20 ALDURA21 ALDURA22 ALDURA23 ALDURA24 ... ALDURA8_I ALDURA11_I ALDURA17_I ALDURA20_I ALDURA21_I ALDURA22_I ALDURA23_I ALDURA24_I ALDURA27_I ALDURA28_I ALDURA29_I ALDURA33_I
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0 0 0 0 0 0 0 0 0 0 0 0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0 0 0 0 0 0 0 0 0 0 0 0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0 0 0 0 0 0 0 0 0 0 0 0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0 0 0 0 0 0 0 0 0 0 0 0
4 21.0 3.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0 0 0 0 0 0 0 0 0 0 0 0
5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0 0 0 0 0 0 0 0 0 0 0 0
6 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0 0 0 0 0 0 0 0 0 0 0 0
7 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0 0 0 0 0 0 0 0 0 0 0 0
8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0 0 0 0 0 0 0 0 0 0 0 0
10 0.0 4.0 4.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0 0 0 0 0 0 0 0 0 0 0 0
12 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0 0 0 0 0 0 0 0 0 0 0 0
13 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0 0 0 0 0 0 0 0 0 0 0 0
15 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0 0 0 0 0 0 0 0 0 0 0 0
17 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0 0 0 0 0 0 0 0 0 0 0 0
18 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0 0 0 0 0 0 0 0 0 0 0 0
19 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0 0 0 0 0 0 0 0 0 0 0 0
22 13.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0 0 0 0 0 0 0 0 0 0 0 0
26 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0 0 0 0 0 0 0 0 0 0 0 0
28 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0 0 0 0 0 0 0 0 0 0 0 0
32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0 0 0 0 0 0 0 0 0 0 0 0
34 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0 0 0 0 0 0 0 0 0 0 0 0
36 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0 0 0 0 0 0 0 0 0 0 0 0
37 11.0 11.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0 0 0 0 0 0 0 0 0 0 0 0
38 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0 0 0 0 0 0 0 0 0 0 0 0
40 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0 0 0 0 0 0 0 0 0 0 0 0
41 0.0 4.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0 0 0 0 0 0 0 0 0 0 0 0
42 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0 0 0 0 0 0 0 0 0 0 0 0
43 0.0 17.0 0.0 0.0 0.0 0.0 0.0 17.0 0.0 0.0 0.0 0.0 ... 0 0 0 1 0 0 0 0 0 0 0 0
47 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0 0 0 0 0 0 0 0 0 0 0 0
48 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0 0 0 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
33619 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0 0 0 0 0 0 0 0 0 0 0 0
33620 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0 0 0 0 0 0 0 0 0 0 0 0
33622 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0 0 0 0 0 0 0 0 0 0 0 0
33625 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0 0 0 0 0 0 0 0 0 0 0 0
33626 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0 0 0 0 0 0 0 0 0 0 0 0
33627 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0 0 0 0 0 0 0 0 0 0 0 0
33628 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0 0 0 0 0 0 0 0 0 0 0 0
33629 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0 0 0 0 0 0 0 0 0 0 0 0
33630 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0 0 0 0 0 0 0 0 0 0 0 0
33631 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0 0 0 0 0 0 0 0 0 0 0 0
33632 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0 0 0 0 0 0 0 0 0 0 0 0
33634 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0 0 0 0 0 0 0 0 0 0 0 0
33635 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0 0 0 0 0 0 0 0 0 0 0 0
33638 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0 0 0 0 0 0 0 0 0 0 0 0
33639 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0 0 0 0 0 0 0 0 0 0 0 0
33646 2.0 0.0 0.0 0.0 0.0 0.0 4.0 0.0 0.0 0.0 0.0 0.0 ... 0 0 1 0 0 0 0 0 0 0 0 0
33647 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0 0 0 0 0 0 0 0 0 0 0 0
33649 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0 0 0 0 0 0 0 0 0 0 0 0
33650 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0 0 0 0 0 0 0 0 0 0 0 0
33651 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0 0 0 0 0 0 0 0 0 0 0 0
33652 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0 0 0 0 0 0 0 0 0 0 0 0
33653 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0 0 0 0 0 0 0 0 0 0 0 0
33654 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0 0 0 0 0 0 0 0 0 0 0 0
33660 0.0 0.0 0.0 0.0 0.0 0.0 0.0 3.0 0.0 0.0 0.0 0.0 ... 0 0 0 1 0 0 0 0 0 0 0 0
33661 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0 0 0 0 0 0 0 0 0 0 0 0
33664 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0 0 0 0 0 0 0 0 0 0 0 0
33666 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0 0 0 0 0 0 0 0 0 0 0 0
33668 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0 0 0 0 0 0 0 0 0 0 0 0
33670 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0 0 0 0 0 0 0 0 0 0 0 0
33671 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0 0 0 0 0 0 0 0 0 0 0 0

21385 rows × 32 columns


In [20]:
reg.fit(X,y)


Out[20]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [21]:
y_pred=reg.predict(X)

score=r2_score(y, y_pred)

In [22]:
import matplotlib.pyplot as plt
plt.plot(y,y_pred,marker='.', linestyle='None', alpha=0.5 )
plt.xlabel('Y Train')
plt.ylabel('Y Predict')
plt.show()



In [23]:
score


Out[23]:
0.042318959394934352
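One caveat about this number: the R² above is computed on the same rows the model was fit on, so it can flatter the model. A held-out split gives a more honest estimate. A minimal sketch with synthetic stand-in data (the notebook's real X and y would drop in here):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X and y.
rng = np.random.RandomState(0)
X_demo = rng.rand(200, 5)
y_demo = X_demo @ rng.rand(5) + 0.1 * rng.randn(200)

# Hold out 25% of the rows; score only on data the model never saw.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.25, random_state=0)
reg_demo = LinearRegression().fit(X_tr, y_tr)
test_score = r2_score(y_te, reg_demo.predict(X_te))
print(test_score)
```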

In [24]:
from sklearn.linear_model import Ridge

# normalize=True was removed in scikit-learn 1.2; on newer versions,
# standardize X first (e.g. with StandardScaler) instead.
ridge=Ridge(alpha=0.07, normalize=True)
ridge.fit(X,y)


Out[24]:
Ridge(alpha=0.07, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=True, random_state=None, solver='auto', tol=0.001)

In [25]:
y_pred=ridge.predict(X)

In [26]:
def display_plot(cv_scores, cv_scores_std):
    fig = plt.figure()
    ax = fig.add_subplot(1,1,1)
    ax.plot(alpha_space, cv_scores)

    std_error = cv_scores_std / np.sqrt(10)

    ax.fill_between(alpha_space, cv_scores + std_error, cv_scores - std_error, alpha=0.2)
    ax.set_ylabel('CV Score +/- Std Error')
    ax.set_xlabel('Alpha')
    ax.axhline(np.max(cv_scores), linestyle='--', color='.5')
    ax.set_xlim([alpha_space[0], alpha_space[-1]])
    ax.set_xscale('log')
    plt.show()

In [30]:
# Import necessary modules
import numpy as np

from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Setup the array of alphas and lists to store scores
alpha_space = np.logspace(-4, 0, 50)
ridge_scores = []
ridge_scores_std = []

# Create a ridge regressor: ridge
# (normalize=True was removed in scikit-learn 1.2; standardize X instead)
ridge = Ridge(normalize=True)

# Compute scores over range of alphas
for alpha in alpha_space:

    # Specify the alpha value to use: ridge.alpha
    ridge.alpha = alpha
    
    # Perform 10-fold CV: ridge_cv_scores
    ridge_cv_scores = cross_val_score(ridge, X, y, cv=10)
    
    # Append the mean of ridge_cv_scores to ridge_scores
    ridge_scores.append(np.mean(ridge_cv_scores))
    
    # Append the std of ridge_cv_scores to ridge_scores_std
    ridge_scores_std.append(np.std(ridge_cv_scores))

# Display the plot
display_plot(ridge_scores, ridge_scores_std)
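The manual alpha loop above is also built into scikit-learn as RidgeCV, which searches the alpha grid internally. A sketch with synthetic stand-in data, using a StandardScaler pipeline in place of the normalize flag:

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; the notebook's X and y would drop in here.
rng = np.random.RandomState(0)
X_demo = rng.rand(100, 4)
y_demo = X_demo @ np.array([1.0, 2.0, 0.0, -1.0]) + 0.1 * rng.randn(100)

# RidgeCV tries every alpha by cross-validation internally; scaling in a
# pipeline replaces the normalize=True flag removed in scikit-learn 1.2.
model = make_pipeline(StandardScaler(),
                      RidgeCV(alphas=np.logspace(-4, 0, 50)))
model.fit(X_demo, y_demo)
print(model.named_steps['ridgecv'].alpha_)
```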



In [31]:
from sklearn.decomposition import PCA
from mpl_toolkits.mplot3d import Axes3D
from sklearn.cluster import KMeans


fig = plt.figure(1, figsize=(8, 6))
plt.clf()
ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=40, azim=20)

plt.cla()
pca = PCA(n_components=3)
pca.fit(X)
X_pca = pca.transform(X)

kmean=KMeans(n_clusters=8)
kmean.fit(X_pca)
y_lab=kmean.labels_


# Reorder the labels to have colors matching the cluster results.
# (The lowercase 'spectral' colormap was removed in matplotlib 2.2.)
ax.scatter(X_pca[:, 0], X_pca[:, 1], X_pca[:, 2], label=y_lab, c=y_lab, cmap=plt.cm.Spectral)

ax.w_xaxis.set_ticklabels([])
ax.w_yaxis.set_ticklabels([])
ax.w_zaxis.set_ticklabels([])
plt.legend(bbox_to_anchor=(0, 1), loc='upper right', ncol=7)

plt.show()
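Before trusting a 3-D plot like this, it is worth checking how much of the total variance the three kept components actually retain; if the ratio is low, the picture hides most of the structure. A sketch with synthetic stand-in data:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in: 3 high-variance directions plus 7 noise columns.
rng = np.random.RandomState(0)
X_demo = rng.randn(500, 10)
X_demo[:, :3] *= 10  # inflate variance of the first three columns

pca_demo = PCA(n_components=3).fit(X_demo)
# Fraction of total variance retained by the 3 kept components.
print(pca_demo.explained_variance_ratio_.sum())
```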

In [148]:
y_lab


Out[148]:
array([1, 1, 1, ..., 1, 1, 1], dtype=int32)

In [ ]:


In [ ]: