So far, we've been running all of our code linearly, i.e. we execute a file and let Python do its job. Here we will be running code interactively. Each cell contains some code, which we can execute by clicking on the cell and pressing Shift + Enter. Here is an example:
In [1]:
print("Hello, world!")
As you can see, the code executed and the output is now in the notebook. One thing that's special about Jupyter is that our variables are persistent, i.e. we could create the cell:
In [2]:
hello = "Hello, world!"
And then, in a new cell, execute:
In [3]:
print(hello)
You might have noticed the In [ ] to the side of the cells. The numbers inside indicate the order in which the code in the cells was executed. You should imagine this notebook as maintaining a hidden Python program, with all of our variables, functions, etc. executed in the order of the In [ ] blocks.
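For instance (a hypothetical pair of cells, not part of the analysis below): if we run the first cell once and then the second cell twice, the final output is 3, because the hidden program has executed the increment twice, regardless of where the cells sit on the page.
In [ ]:
counter = 1
In [ ]:
counter = counter + 1
counter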
Now, let's analyze some data. To start off, we will import the libraries we will need.
In [4]:
import pandas as pd
import sklearn as sk
from sklearn import datasets as ds
from sklearn.linear_model import LinearRegression
from matplotlib import pyplot as plt
In our first example, we will analyze a dataset of Boston house prices, which comes packaged with the sklearn module.
In [5]:
boston = ds.load_boston()
When you run a cell in Jupyter, it will try to print out the value on the last line of the cell. So far, we've loaded the dataset into a variable boston. Let's investigate how this dataset is actually structured.
In [6]:
boston
Out[6]:
As we can see, this dataset is actually a dict, containing a few fields. We can get a list of those fields:
In [7]:
boston.keys()
Out[7]:
The DESCR key is particularly interesting, as it contains a string which describes the dataset. Let's check it out:
In [8]:
print(boston['DESCR'])
So as we can see, the data field contains the first 13 attributes, the target field contains the median value of a house, and the feature_names field contains the names of the features. We could view these all individually.
In [9]:
boston['data']
Out[9]:
In [10]:
boston['target']
Out[10]:
In [11]:
boston['feature_names']
Out[11]:
However, this is very inconvenient. To address this issue, we will use the DataFrame class from the pandas module we imported.
In [13]:
df = pd.DataFrame(data=boston['data'], columns=boston['feature_names'])
df
Out[13]:
Much nicer. There are pandas display settings which allow us to force rendering more rows, but we will not go into these now. The important thing is that we can inspect our data. We might also wish to query our data to filter some results. Pandas supports a few ways of doing this. One of them is the query function of DataFrames.
In [14]:
df.query("CRIM >= 20") # List all towns with average crime rate >= 20 per capita
Out[14]:
The query function is quite intuitive to use: if your query reads like Python code, it will probably work.
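Conditions can also be combined inside a single query. A small sketch (the thresholds here are arbitrary, chosen just for illustration):
In [ ]:
df.query("CRIM >= 20 and RM < 6")  # high-crime towns with relatively small dwellings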
Now, maybe we want to inspect our data to see how each of the attributes relates to the median house price (the "target"). To do that, we will use the pyplot submodule of the matplotlib module. Let's first plot the crime rate against the house price.
In [20]:
plt.scatter(df['CRIM'], boston['target'])
plt.xlabel('Average crime rate per capita')
plt.ylabel('Median house price in $1000')
plt.show()
As we might expect, the house price goes down as the crime rate increases. However, at the lower end, crime rate doesn't seem to be such a good predictor. We could do the same thing and plot each attribute against the house price. Or we could do it programmatically.
In [21]:
plt.figure(figsize=(20,15))
for i, column in enumerate(df):  # iterating over a DataFrame yields its column names
    plt.subplot(4, 4, i+1)
    plt.scatter(df[column], boston['target'])
plt.show()
As we can see, some attributes seem more predictive than others. Let's focus on the RM attribute, which is the average number of rooms per dwelling.
In [22]:
plt.scatter(df['RM'], boston['target'])
plt.xlabel('Average number of rooms per dwelling')
plt.ylabel('Median house price in $1000')
plt.show()
As we might expect, this also seems like a pretty good predictor of house price. But what if we wanted to model the correlation between the number of rooms in a dwelling and its expected price in that town? One simple model is called Linear Regression, and it essentially involves drawing a line through the data. The sklearn module provides us with such a model, which we will use in our program.
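Concretely, fitting a simple linear regression means finding a slope $w$ and an intercept $b$ for the line $\hat{y} = w x + b$ that minimize the sum of squared errors $\sum_i \left(y_i - (w x_i + b)\right)^2$ over the data points $(x_i, y_i)$.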
In [23]:
reg = LinearRegression().fit(df[['RM']], boston['target'])
Now the variable reg contains our model. It is fit on our data, meaning that it is the line which best fits our data (in the least-squares sense of linear regression). The red line in the plot below shows how our model fits the data: given the number of rooms, it will always predict a house price that lies on that line.
In [25]:
plt.scatter(df['RM'], boston['target'])
plt.plot(df['RM'], reg.predict(df[['RM']]), color='r')
plt.xlabel('Average number of rooms per dwelling')
plt.ylabel('Median house price in $1000')
plt.show()
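If you want to see the fitted line's actual parameters, they are available on the model itself (coef_ and intercept_ are standard attributes of a fitted sklearn LinearRegression):
In [ ]:
print(reg.coef_)       # slope: change in median price (in $1000s) per additional room
print(reg.intercept_)  # intercept of the fitted line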
One way to evaluate how well our model fits the data is called the $R^2$ score, or coefficient of determination. Essentially, it assigns a number between $-\infty$ and $1$ which indicates how well our model fits the data. The best possible score is $1$. We can check out the $R^2$ score of our model using the score function.
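For reference, the coefficient of determination is computed as $R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$, where $\hat{y}_i$ are the model's predictions and $\bar{y}$ is the mean of the targets. A model that always predicts the mean scores $0$, which is why sufficiently bad models can score below zero.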
In [27]:
reg.score(df[['RM']], boston['target'])
Out[27]:
Not great, not terrible.
We can use the same technique with all of our attributes at once to predict the house price. This would look like:
In [28]:
reg_full = LinearRegression().fit(df, boston['target'])
Now, to get an intuition of how well this model fits the data, we will again plot the average number of rooms against the median price. This time, the actual data points are in blue, and the predictions of our model are in red.
In [31]:
plt.scatter(df['RM'], boston['target'])
plt.scatter(df['RM'], reg_full.predict(df), color='r')
plt.xlabel('Average number of rooms per dwelling')
plt.ylabel('Median house price in $1000')
plt.show()
This looks much better. In fact, we can try the score again:
In [32]:
reg_full.score(df, boston['target'])
Out[32]:
As we can see, this model is a lot better than the first one. As an exercise, you might want to try removing some of the attributes, and seeing which ones affect the score a lot, and which ones don't.
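One minimal way to start on that exercise (dropping CRIM is just an arbitrary example; df_reduced and reg_reduced are names introduced here for illustration):
In [ ]:
df_reduced = df.drop(columns=['CRIM'])  # remove one attribute as an experiment
reg_reduced = LinearRegression().fit(df_reduced, boston['target'])
reg_reduced.score(df_reduced, boston['target'])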
If we want to import our own dataset, we can do that as well. Pandas has functions like read_csv and read_excel which allow you to import your dataset from a .csv or .xlsx file and analyze it like any other DataFrame. What you would do is:
In [ ]:
my_df = pd.read_csv('my_dataset.csv')
Or
In [ ]:
my_df = pd.read_excel('my_dataset.xlsx')
The output here is omitted, but you can try googling "csv dataset" or "excel dataset" and exploring some of the datasets you find.
As an additional exercise, you might want to check out the other datasets included with sklearn and try exploring them like we did here. Try looking up methods other than Linear Regression as well.
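To get started, here is a minimal sketch along those lines. The choice of the diabetes dataset and a decision tree regressor is just an example, not the only option:
In [ ]:
from sklearn.datasets import load_diabetes
from sklearn.tree import DecisionTreeRegressor

diabetes = load_diabetes()
df_diab = pd.DataFrame(data=diabetes['data'], columns=diabetes['feature_names'])
tree = DecisionTreeRegressor(max_depth=3).fit(df_diab, diabetes['target'])
tree.score(df_diab, diabetes['target'])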
In [ ]: