In [ ]:


In [4]:
adult.dtypes


Out[4]:
age                int64
workclass         object
fnlwgt             int64
education         object
education-num      int64
marital-status    object
occupation        object
relationship      object
race              object
sex               object
capital-gain       int64
capital-loss       int64
hours-per-week     int64
native-country    object
income            object
dtype: object

In [5]:
adult.income.values


Out[5]:
array([' <=50K', ' <=50K', ' <=50K', ..., ' <=50K', ' <=50K', ' >50K'], dtype=object)

In [6]:
#Only numeric data for regression
X=adult[["age ","fnlwgt","education-num","capital-gain","capital-loss","hours-per-week"]].values
X


Out[6]:
array([[    39,  77516,     13,   2174,      0,     40],
       [    50,  83311,     13,      0,      0,     13],
       [    38, 215646,      9,      0,      0,     40],
       ..., 
       [    58, 151910,      9,      0,      0,     40],
       [    22, 201490,      9,      0,      0,     20],
       [    52, 287927,      9,  15024,      0,     40]], dtype=int64)

In [7]:
from sklearn import datasets ## imports datasets from scikit-learn
data = datasets.load_boston() ## loads Boston dataset from datasets library

In [9]:
print (data.DESCR)


Boston House Prices dataset
===========================

Notes
------
Data Set Characteristics:  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive
    
    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
http://archive.ics.uci.edu/ml/datasets/Housing


This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.   
     
**References**

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
   - many more! (see http://archive.ics.uci.edu/ml/datasets/Housing)


In [10]:
import numpy as np
import pandas as pd

In [13]:
# define the data/predictors as the pre-set feature names  
df = pd.DataFrame(data.data, columns=data.feature_names)

# Put the target (housing value -- MEDV) in another DataFrame
target = pd.DataFrame(data.target, columns=["MEDV"])

In [14]:
## Without a constant

import statsmodels.api as sm

X = df["RM"]
y = target["MEDV"]

In [15]:
# Note the difference in argument order
model = sm.OLS(y, X).fit()
predictions = model.predict(X) # make the predictions by the model

In [16]:
# Print out the statistics
model.summary()


Out[16]:
OLS Regression Results
Dep. Variable: MEDV R-squared: 0.901
Model: OLS Adj. R-squared: 0.901
Method: Least Squares F-statistic: 4615.
Date: Wed, 21 Feb 2018 Prob (F-statistic): 3.74e-256
Time: 15:09:26 Log-Likelihood: -1747.1
No. Observations: 506 AIC: 3496.
Df Residuals: 505 BIC: 3500.
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
RM 3.6534 0.054 67.930 0.000 3.548 3.759
Omnibus: 83.295 Durbin-Watson: 0.493
Prob(Omnibus): 0.000 Jarque-Bera (JB): 152.507
Skew: 0.955 Prob(JB): 7.65e-34
Kurtosis: 4.894 Cond. No. 1.00

In [17]:
mtcars=pd.read_csv("https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/datasets/mtcars.csv")

In [18]:
mtcars


Out[18]:
Unnamed: 0 mpg cyl disp hp drat wt qsec vs am gear carb
0 Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
1 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
2 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
3 Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
4 Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
5 Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
6 Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
7 Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
8 Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
9 Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
10 Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
11 Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
12 Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
13 Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
14 Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
15 Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
16 Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
17 Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
18 Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
19 Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
20 Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
21 Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
22 AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
23 Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
24 Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
25 Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
26 Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
27 Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
28 Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
29 Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
30 Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
31 Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2

In [21]:
X = mtcars[["disp","wt","qsec"]]
y = mtcars["mpg"]

In [22]:
# Note the difference in argument order
model = sm.OLS(y, X).fit()
predictions = model.predict(X) # make the predictions by the model

In [23]:
# Print out the statistics
model.summary()


Out[23]:
OLS Regression Results
Dep. Variable: mpg R-squared: 0.981
Model: OLS Adj. R-squared: 0.979
Method: Least Squares F-statistic: 487.7
Date: Wed, 21 Feb 2018 Prob (F-statistic): 6.67e-25
Time: 15:46:10 Log-Likelihood: -79.700
No. Observations: 32 AIC: 165.4
Df Residuals: 29 BIC: 169.8
Df Model: 3
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
disp 0.0152 0.011 1.377 0.179 -0.007 0.038
wt -5.9904 1.382 -4.336 0.000 -8.816 -3.165
qsec 2.0021 0.131 15.265 0.000 1.734 2.270
Omnibus: 0.506 Durbin-Watson: 1.539
Prob(Omnibus): 0.777 Jarque-Bera (JB): 0.571
Skew: -0.263 Prob(JB): 0.751
Kurtosis: 2.611 Cond. No. 669.

In [25]:
X


Out[25]:
disp wt qsec
0 160.0 2.620 16.46
1 160.0 2.875 17.02
2 108.0 2.320 18.61
3 258.0 3.215 19.44
4 360.0 3.440 17.02
5 225.0 3.460 20.22
6 360.0 3.570 15.84
7 146.7 3.190 20.00
8 140.8 3.150 22.90
9 167.6 3.440 18.30
10 167.6 3.440 18.90
11 275.8 4.070 17.40
12 275.8 3.730 17.60
13 275.8 3.780 18.00
14 472.0 5.250 17.98
15 460.0 5.424 17.82
16 440.0 5.345 17.42
17 78.7 2.200 19.47
18 75.7 1.615 18.52
19 71.1 1.835 19.90
20 120.1 2.465 20.01
21 318.0 3.520 16.87
22 304.0 3.435 17.30
23 350.0 3.840 15.41
24 400.0 3.845 17.05
25 79.0 1.935 18.90
26 120.3 2.140 16.70
27 95.1 1.513 16.90
28 351.0 3.170 14.50
29 145.0 2.770 15.50
30 301.0 3.570 14.60
31 121.0 2.780 18.60

In [26]:
y


Out[26]:
0     21.0
1     21.0
2     22.8
3     21.4
4     18.7
5     18.1
6     14.3
7     24.4
8     22.8
9     19.2
10    17.8
11    16.4
12    17.3
13    15.2
14    10.4
15    10.4
16    14.7
17    32.4
18    30.4
19    33.9
20    21.5
21    15.5
22    15.2
23    13.3
24    19.2
25    27.3
26    26.0
27    30.4
28    15.8
29    19.7
30    15.0
31    21.4
Name: mpg, dtype: float64

In [31]:
#http://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score

In [27]:
# Split the data into training/testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
#data_X_train = data_X[:-20]
#data_X_test = data_X[-20:]

In [28]:
X_train


Out[28]:
disp wt qsec
16 440.0 5.345 17.42
5 225.0 3.460 20.22
13 275.8 3.780 18.00
11 275.8 4.070 17.40
23 350.0 3.840 15.41
1 160.0 2.875 17.02
2 108.0 2.320 18.61
26 120.3 2.140 16.70
3 258.0 3.215 19.44
21 318.0 3.520 16.87
27 95.1 1.513 16.90
22 304.0 3.435 17.30
18 75.7 1.615 18.52
31 121.0 2.780 18.60
20 120.1 2.465 20.01
7 146.7 3.190 20.00
10 167.6 3.440 18.90
14 472.0 5.250 17.98
28 351.0 3.170 14.50
19 71.1 1.835 19.90
6 360.0 3.570 15.84

In [29]:
X_test


Out[29]:
disp wt qsec
29 145.0 2.770 15.50
15 460.0 5.424 17.82
24 400.0 3.845 17.05
17 78.7 2.200 19.47
8 140.8 3.150 22.90
9 167.6 3.440 18.30
30 301.0 3.570 14.60
25 79.0 1.935 18.90
12 275.8 3.730 17.60
0 160.0 2.620 16.46
4 360.0 3.440 17.02

In [ ]:


In [32]:
# Create linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(X_train, y_train)


Out[32]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [35]:
# Make predictions using the testing set
y_pred = regr.predict(X_test)

# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean squared error
print("Mean squared error: %.2f"
      % mean_squared_error(y_test, y_pred))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % r2_score(y_test, y_pred))


Coefficients: 
 [-0.00899436 -4.1675056   0.75104005]
Mean squared error: 5.65
Variance score: 0.82

In [ ]: