notebook.community

Edit and run



In [ ]:



In [4]:

    
adult.dtypes









    Out[4]:





age                int64
workclass         object
fnlwgt             int64
education         object
education-num      int64
marital-status    object
occupation        object
relationship      object
race              object
sex               object
capital-gain       int64
capital-loss       int64
hours-per-week     int64
native-country    object
income            object
dtype: object



In [5]:

    
adult.income.values









    Out[5]:





array([' <=50K', ' <=50K', ' <=50K', ..., ' <=50K', ' <=50K', ' >50K'], dtype=object)



In [6]:

    
#Only numeric data for regression
X=adult[["age ","fnlwgt","education-num","capital-gain","capital-loss","hours-per-week"]].values
X









    Out[6]:





array([[    39,  77516,     13,   2174,      0,     40],
       [    50,  83311,     13,      0,      0,     13],
       [    38, 215646,      9,      0,      0,     40],
       ..., 
       [    58, 151910,      9,      0,      0,     40],
       [    22, 201490,      9,      0,      0,     20],
       [    52, 287927,      9,  15024,      0,     40]], dtype=int64)



In [7]:

    
from sklearn import datasets ## imports datasets from scikit-learn
data = datasets.load_boston() ## loads Boston dataset from datasets library



In [9]:

    
print (data.DESCR)









    



Boston House Prices dataset
===========================

Notes
------
Data Set Characteristics:  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive
    
    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
http://archive.ics.uci.edu/ml/datasets/Housing


This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.   
     
**References**

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
   - many more! (see http://archive.ics.uci.edu/ml/datasets/Housing)



In [10]:

    
import numpy as np
import pandas as pd



In [13]:

    
# define the data/predictors as the pre-set feature names  
df = pd.DataFrame(data.data, columns=data.feature_names)

# Put the target (housing value -- MEDV) in another DataFrame
target = pd.DataFrame(data.target, columns=["MEDV"])



In [14]:

    
## Without a constant

import statsmodels.api as sm

X = df["RM"]
y = target["MEDV"]



In [15]:

    
# Note the difference in argument order
model = sm.OLS(y, X).fit()
predictions = model.predict(X) # make the predictions by the model



In [16]:

    
# Print out the statistics
model.summary()









    Out[16]:





OLS Regression Results

  Dep. Variable:           MEDV          R-squared:             0.901 


  Model:                    OLS          Adj. R-squared:        0.901 


  Method:              Least Squares     F-statistic:           4615. 


  Date:              Wed, 21 Feb 2018    Prob (F-statistic):  3.74e-256


  Time:                  15:09:26        Log-Likelihood:      -1747.1 


  No. Observations:          506         AIC:                   3496. 


  Df Residuals:              505         BIC:                   3500. 


  Df Model:                    1                                      


  Covariance Type:       nonrobust                                    




        coef      std err       t       P>|t|   [0.025     0.975]  


  RM      3.6534      0.054     67.930   0.000      3.548      3.759




  Omnibus:        83.295    Durbin-Watson:         0.493


  Prob(Omnibus):   0.000    Jarque-Bera (JB):    152.507


  Skew:            0.955    Prob(JB):           7.65e-34


  Kurtosis:        4.894    Cond. No.               1.00



In [17]:

    
mtcars=pd.read_csv("https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/datasets/mtcars.csv")



In [18]:

    
mtcars









    Out[18]:







  
    
      
      Unnamed: 0
      mpg
      cyl
      disp
      hp
      drat
      wt
      qsec
      vs
      am
      gear
      carb
    
  
  
    
      0
      Mazda RX4
      21.0
      6
      160.0
      110
      3.90
      2.620
      16.46
      0
      1
      4
      4
    
    
      1
      Mazda RX4 Wag
      21.0
      6
      160.0
      110
      3.90
      2.875
      17.02
      0
      1
      4
      4
    
    
      2
      Datsun 710
      22.8
      4
      108.0
      93
      3.85
      2.320
      18.61
      1
      1
      4
      1
    
    
      3
      Hornet 4 Drive
      21.4
      6
      258.0
      110
      3.08
      3.215
      19.44
      1
      0
      3
      1
    
    
      4
      Hornet Sportabout
      18.7
      8
      360.0
      175
      3.15
      3.440
      17.02
      0
      0
      3
      2
    
    
      5
      Valiant
      18.1
      6
      225.0
      105
      2.76
      3.460
      20.22
      1
      0
      3
      1
    
    
      6
      Duster 360
      14.3
      8
      360.0
      245
      3.21
      3.570
      15.84
      0
      0
      3
      4
    
    
      7
      Merc 240D
      24.4
      4
      146.7
      62
      3.69
      3.190
      20.00
      1
      0
      4
      2
    
    
      8
      Merc 230
      22.8
      4
      140.8
      95
      3.92
      3.150
      22.90
      1
      0
      4
      2
    
    
      9
      Merc 280
      19.2
      6
      167.6
      123
      3.92
      3.440
      18.30
      1
      0
      4
      4
    
    
      10
      Merc 280C
      17.8
      6
      167.6
      123
      3.92
      3.440
      18.90
      1
      0
      4
      4
    
    
      11
      Merc 450SE
      16.4
      8
      275.8
      180
      3.07
      4.070
      17.40
      0
      0
      3
      3
    
    
      12
      Merc 450SL
      17.3
      8
      275.8
      180
      3.07
      3.730
      17.60
      0
      0
      3
      3
    
    
      13
      Merc 450SLC
      15.2
      8
      275.8
      180
      3.07
      3.780
      18.00
      0
      0
      3
      3
    
    
      14
      Cadillac Fleetwood
      10.4
      8
      472.0
      205
      2.93
      5.250
      17.98
      0
      0
      3
      4
    
    
      15
      Lincoln Continental
      10.4
      8
      460.0
      215
      3.00
      5.424
      17.82
      0
      0
      3
      4
    
    
      16
      Chrysler Imperial
      14.7
      8
      440.0
      230
      3.23
      5.345
      17.42
      0
      0
      3
      4
    
    
      17
      Fiat 128
      32.4
      4
      78.7
      66
      4.08
      2.200
      19.47
      1
      1
      4
      1
    
    
      18
      Honda Civic
      30.4
      4
      75.7
      52
      4.93
      1.615
      18.52
      1
      1
      4
      2
    
    
      19
      Toyota Corolla
      33.9
      4
      71.1
      65
      4.22
      1.835
      19.90
      1
      1
      4
      1
    
    
      20
      Toyota Corona
      21.5
      4
      120.1
      97
      3.70
      2.465
      20.01
      1
      0
      3
      1
    
    
      21
      Dodge Challenger
      15.5
      8
      318.0
      150
      2.76
      3.520
      16.87
      0
      0
      3
      2
    
    
      22
      AMC Javelin
      15.2
      8
      304.0
      150
      3.15
      3.435
      17.30
      0
      0
      3
      2
    
    
      23
      Camaro Z28
      13.3
      8
      350.0
      245
      3.73
      3.840
      15.41
      0
      0
      3
      4
    
    
      24
      Pontiac Firebird
      19.2
      8
      400.0
      175
      3.08
      3.845
      17.05
      0
      0
      3
      2
    
    
      25
      Fiat X1-9
      27.3
      4
      79.0
      66
      4.08
      1.935
      18.90
      1
      1
      4
      1
    
    
      26
      Porsche 914-2
      26.0
      4
      120.3
      91
      4.43
      2.140
      16.70
      0
      1
      5
      2
    
    
      27
      Lotus Europa
      30.4
      4
      95.1
      113
      3.77
      1.513
      16.90
      1
      1
      5
      2
    
    
      28
      Ford Pantera L
      15.8
      8
      351.0
      264
      4.22
      3.170
      14.50
      0
      1
      5
      4
    
    
      29
      Ferrari Dino
      19.7
      6
      145.0
      175
      3.62
      2.770
      15.50
      0
      1
      5
      6
    
    
      30
      Maserati Bora
      15.0
      8
      301.0
      335
      3.54
      3.570
      14.60
      0
      1
      5
      8
    
    
      31
      Volvo 142E
      21.4
      4
      121.0
      109
      4.11
      2.780
      18.60
      1
      1
      4
      2



In [21]:

    
X = mtcars[["disp","wt","qsec"]]
y = mtcars["mpg"]



In [22]:

    
# Note the difference in argument order
model = sm.OLS(y, X).fit()
predictions = model.predict(X) # make the predictions by the model



In [23]:

    
# Print out the statistics
model.summary()









    Out[23]:





OLS Regression Results

  Dep. Variable:            mpg          R-squared:             0.981


  Model:                    OLS          Adj. R-squared:        0.979


  Method:              Least Squares     F-statistic:           487.7


  Date:              Wed, 21 Feb 2018    Prob (F-statistic):  6.67e-25


  Time:                  15:46:10        Log-Likelihood:      -79.700


  No. Observations:           32         AIC:                   165.4


  Df Residuals:               29         BIC:                   169.8


  Df Model:                    3                                     


  Covariance Type:       nonrobust                                   




          coef      std err       t       P>|t|   [0.025     0.975]  


  disp      0.0152      0.011      1.377   0.179     -0.007      0.038


  wt       -5.9904      1.382     -4.336   0.000     -8.816     -3.165


  qsec      2.0021      0.131     15.265   0.000      1.734      2.270




  Omnibus:         0.506    Durbin-Watson:         1.539


  Prob(Omnibus):   0.777    Jarque-Bera (JB):      0.571


  Skew:           -0.263    Prob(JB):              0.751


  Kurtosis:        2.611    Cond. No.               669.



In [25]:

    
X



In [26]:

    
y









    Out[26]:





0     21.0
1     21.0
2     22.8
3     21.4
4     18.7
5     18.1
6     14.3
7     24.4
8     22.8
9     19.2
10    17.8
11    16.4
12    17.3
13    15.2
14    10.4
15    10.4
16    14.7
17    32.4
18    30.4
19    33.9
20    21.5
21    15.5
22    15.2
23    13.3
24    19.2
25    27.3
26    26.0
27    30.4
28    15.8
29    19.7
30    15.0
31    21.4
Name: mpg, dtype: float64



In [31]:

    
#http://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score



In [27]:

    
# Split the data into training/testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
#data_X_train = data_X[:-20]
#data_X_test = data_X[-20:]



In [28]:

    
X_train



In [29]:

    
X_test



In [ ]:



In [32]:

    
# Create linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(X_train, y_train)









    Out[32]:





LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)



In [35]:

    
# Make predictions using the testing set
y_pred = regr.predict(X_test)

# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean squared error
print("Mean squared error: %.2f"
      % mean_squared_error(y_test, y_pred))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % r2_score(y_test, y_pred))









    



Coefficients: 
 [-0.00899436 -4.1675056   0.75104005]
Mean squared error: 5.65
Variance score: 0.82



In [ ]:

Dep. Variable:	MEDV	R-squared:	0.901
Model:	OLS	Adj. R-squared:	0.901
Method:	Least Squares	F-statistic:	4615.
Date:	Wed, 21 Feb 2018	Prob (F-statistic):	3.74e-256
Time:	15:09:26	Log-Likelihood:	-1747.1
No. Observations:	506	AIC:	3496.
Df Residuals:	505	BIC:	3500.
Df Model:	1
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[0.025	0.975]
RM	3.6534	0.054	67.930	0.000	3.548	3.759

Omnibus:	83.295	Durbin-Watson:	0.493
Prob(Omnibus):	0.000	Jarque-Bera (JB):	152.507
Skew:	0.955	Prob(JB):	7.65e-34
Kurtosis:	4.894	Cond. No.	1.00

	Unnamed: 0	mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
0	Mazda RX4	21.0	6	160.0	110	3.90	2.620	16.46	0	1	4	4
1	Mazda RX4 Wag	21.0	6	160.0	110	3.90	2.875	17.02	0	1	4	4
2	Datsun 710	22.8	4	108.0	93	3.85	2.320	18.61	1	1	4	1
3	Hornet 4 Drive	21.4	6	258.0	110	3.08	3.215	19.44	1	0	3	1
4	Hornet Sportabout	18.7	8	360.0	175	3.15	3.440	17.02	0	0	3	2
5	Valiant	18.1	6	225.0	105	2.76	3.460	20.22	1	0	3	1
6	Duster 360	14.3	8	360.0	245	3.21	3.570	15.84	0	0	3	4
7	Merc 240D	24.4	4	146.7	62	3.69	3.190	20.00	1	0	4	2
8	Merc 230	22.8	4	140.8	95	3.92	3.150	22.90	1	0	4	2
9	Merc 280	19.2	6	167.6	123	3.92	3.440	18.30	1	0	4	4
10	Merc 280C	17.8	6	167.6	123	3.92	3.440	18.90	1	0	4	4
11	Merc 450SE	16.4	8	275.8	180	3.07	4.070	17.40	0	0	3	3
12	Merc 450SL	17.3	8	275.8	180	3.07	3.730	17.60	0	0	3	3
13	Merc 450SLC	15.2	8	275.8	180	3.07	3.780	18.00	0	0	3	3
14	Cadillac Fleetwood	10.4	8	472.0	205	2.93	5.250	17.98	0	0	3	4
15	Lincoln Continental	10.4	8	460.0	215	3.00	5.424	17.82	0	0	3	4
16	Chrysler Imperial	14.7	8	440.0	230	3.23	5.345	17.42	0	0	3	4
17	Fiat 128	32.4	4	78.7	66	4.08	2.200	19.47	1	1	4	1
18	Honda Civic	30.4	4	75.7	52	4.93	1.615	18.52	1	1	4	2
19	Toyota Corolla	33.9	4	71.1	65	4.22	1.835	19.90	1	1	4	1
20	Toyota Corona	21.5	4	120.1	97	3.70	2.465	20.01	1	0	3	1
21	Dodge Challenger	15.5	8	318.0	150	2.76	3.520	16.87	0	0	3	2
22	AMC Javelin	15.2	8	304.0	150	3.15	3.435	17.30	0	0	3	2
23	Camaro Z28	13.3	8	350.0	245	3.73	3.840	15.41	0	0	3	4
24	Pontiac Firebird	19.2	8	400.0	175	3.08	3.845	17.05	0	0	3	2
25	Fiat X1-9	27.3	4	79.0	66	4.08	1.935	18.90	1	1	4	1
26	Porsche 914-2	26.0	4	120.3	91	4.43	2.140	16.70	0	1	5	2
27	Lotus Europa	30.4	4	95.1	113	3.77	1.513	16.90	1	1	5	2
28	Ford Pantera L	15.8	8	351.0	264	4.22	3.170	14.50	0	1	5	4
29	Ferrari Dino	19.7	6	145.0	175	3.62	2.770	15.50	0	1	5	6
30	Maserati Bora	15.0	8	301.0	335	3.54	3.570	14.60	0	1	5	8
31	Volvo 142E	21.4	4	121.0	109	4.11	2.780	18.60	1	1	4	2

Dep. Variable:	mpg	R-squared:	0.981
Model:	OLS	Adj. R-squared:	0.979
Method:	Least Squares	F-statistic:	487.7
Date:	Wed, 21 Feb 2018	Prob (F-statistic):	6.67e-25
Time:	15:46:10	Log-Likelihood:	-79.700
No. Observations:	32	AIC:	165.4
Df Residuals:	29	BIC:	169.8
Df Model:	3
Covariance Type:	nonrobust

Omnibus:	0.506	Durbin-Watson:	1.539
Prob(Omnibus):	0.777	Jarque-Bera (JB):	0.571
Skew:	-0.263	Prob(JB):	0.751
Kurtosis:	2.611	Cond. No.	669.